Python | Titanic Data EDA using Seaborn

Last Updated : 22 May, 2024

What is EDA?
Exploratory Data Analysis (EDA) is the practice of analyzing and summarizing a dataset before modeling it. Most EDA techniques rely on graphs.

Titanic Dataset:
It is one of the most popular datasets for learning machine learning basics. It contains information about passengers aboard the RMS Titanic, which sank on its maiden voyage. The dataset can be used to predict whether a given passenger survived.

The CSV file can be downloaded from Kaggle.

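Code: Loading data using Pandas

Python3

# Import the pandas library
import pandas as pd

# Load the data. The raw string (r'...') keeps backslashes such as \t
# in a Windows-style path from being read as escape sequences.
titanic = pd.read_csv(r'...\input\train.csv')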

Seaborn:
It is a Python library for statistical data visualization. Built on top of Matplotlib, Seaborn offers a higher-level interface that is easier to use. It can be installed with the following command:
pip3 install seaborn

Code: Printing data head

Python3

# View the first five rows of the dataset
titanic.head()

Output : 

Code: Checking the NULL values

Python3

# Count the missing (NaN) values in each column
titanic.isnull().sum()

Output : 

The columns with null values are Age, Cabin and Embarked. They need to be filled with appropriate values later on.
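To judge how serious each gap is, it can help to look at the share of missing values rather than the raw counts. The following is a minimal sketch, assuming a small hypothetical frame `sample` that only mimics the Titanic column names; the helper name `missing_share` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the Titanic column names (not real data).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})


def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    share = df.isnull().mean()
    # Keep only the columns that actually have gaps.
    return share[share > 0].sort_values(ascending=False)


print(missing_share(sample))
```

On the real data the same helper could be called as `missing_share(titanic)`. A column that is mostly empty is usually a candidate for dropping, while a column with a few gaps is a candidate for filling.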

Features: The Titanic dataset has roughly the following types of features:

  • Categorical/Nominal: Variables that can be divided into multiple categories with no order or priority.
    E.g. Embarked (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Binary: A subtype of categorical features, where the variable has only two categories.
    E.g. Sex (male/female)
  • Ordinal: Similar to categorical features, but the categories have an order (i.e. they can be sorted).
    E.g. Pclass (1, 2, 3)
  • Continuous: They can take any value between the minimum and maximum values in a column.
    E.g. Age, Fare
  • Count: They represent a count of something.
    E.g. SibSp (siblings/spouses aboard), Parch (parents/children aboard)
  • Useless: They don’t contribute to the final outcome of an ML model. Here, PassengerId, Name, Cabin and Ticket might fall into this category.
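The difference between nominal and ordinal features can also be made explicit in pandas. The following is a sketch on a hypothetical toy frame (`toy` is not the real dataset) that marks Pclass as an ordered categorical and leaves Embarked and Sex unordered.

```python
import pandas as pd

# Hypothetical toy rows using the Titanic column names (not real data).
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Embarked": ["S", "C", "Q", "S"],
    "Sex": ["male", "female", "female", "male"],
})

# Ordinal: the categories 1 < 2 < 3 carry an order.
toy["Pclass"] = pd.Categorical(toy["Pclass"], categories=[1, 2, 3], ordered=True)

# Nominal and binary: categories without any order.
toy["Embarked"] = toy["Embarked"].astype("category")
toy["Sex"] = toy["Sex"].astype("category")

# Sorting respects the declared category order.
print(toy.sort_values("Pclass"))
```

Declaring the order is optional for plotting, but it documents the feature type in the data itself.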

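Code: Graphical Analysis

Python3

import seaborn as sns
import matplotlib.pyplot as plt

# Count plot: number of passengers per sex, split by survival
sns.catplot(x="Sex", hue="Survived", kind="count", data=titanic)

# Needed when running as a script; notebooks render the figure automatically
plt.show()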

Output :

From the bar heights, the survival rate is roughly 20% for men and 75% for women. A passenger's sex is therefore a strong indicator of whether they survived.
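Rates read off a chart are only estimates; they can be computed directly as the mean of the 0/1 Survived column per group. The following is a minimal sketch on a hypothetical toy frame, so the numbers it produces are not the real Titanic figures.

```python
import pandas as pd

# Hypothetical toy rows (not real data): 4 men and 4 women.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Survived is 0 or 1, so its mean within a group is the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same expression applied to `titanic` gives the exact rates behind the plot.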

Code: Pclass (Ordinal Feature) vs Survived

Python3

# Group the dataset by Pclass and Survived, then unstack the counts
group = titanic.groupby(['Pclass', 'Survived'])
pclass_survived = group.size().unstack()

# Heatmap: color-encoded 2D representation of the data
sns.heatmap(pclass_survived, annot=True, fmt="d")

Output: 

The heatmap shows whether higher-class passengers survived more often than lower-class ones. Class 1 passengers had a better chance of survival than those in classes 2 and 3, which suggests that Pclass is strongly related to a passenger’s survival.
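The heatmap displays raw counts, so classes of different sizes are hard to compare directly. One option is to normalize each row to proportions with `pd.crosstab`. The following is a sketch on hypothetical toy rows, not the real data.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the column for Survived == 1 is the survival rate of that class.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)
```

Such a table of proportions could also be passed to `sns.heatmap` with a float format such as `fmt=".2f"` instead of `fmt="d"`.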

Code: Age (Continuous Feature) vs Survived

Python3

# Violin plot: displays the distribution of data
# across all levels of a category
sns.violinplot(x="Sex", y="Age", hue="Survived",
               data=titanic, split=True)

Output : 

This graph summarizes the age range of the men, women and children who were saved. The survival rate is:

  • Good for children.
  • High for women in the age range 20-50.
  • Lower for men as age increases.

Since the Age column is important, its missing values need to be filled, either from the Name column (estimating age from the salutation: Mr, Mrs, etc.) or with a regressor.
After this step, another column, Age_Range (based on the Age column), can be created and the data can be analyzed again.
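One way to follow the salutation idea is to extract the title from Name, fill each missing age with the median age of passengers sharing that title, and then bin the result. The following is only a sketch: the names are hypothetical but follow the "Surname, Title. Given names" layout of the Name column, and the bin edges and labels for Age_Range are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers (not real data) in the "Surname, Title. Given" layout.
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Mr. Richard",
             "Poe, Mr. Peter", "Roe, Miss. Mary", "Poe, Master. Tom"],
    "Age": [20.0, 35.0, np.nan, 40.0, 18.0, 5.0],
})

# Capture the text between the comma and the first period, e.g. "Mr".
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fill each missing age with the median age of the same title group.
median_by_title = toy.groupby("Title")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(median_by_title)

# Assumed bin edges and labels; intervals are closed on the right, e.g. (12, 20].
toy["Age_Range"] = pd.cut(toy["Age"], bins=[0, 12, 20, 50, 120],
                          labels=["child", "teen", "adult", "senior"])
print(toy)
```

Median imputation per title is a simple stand-in for the regressor mentioned above; a title with no known ages at all would still be left unfilled.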

Code: Point plots for Family_Size (Count Feature) and Alone

Python3

# Add a column Family_Size (relatives aboard)
titanic['Family_Size'] = titanic['Parch'] + titanic['SibSp']

# Add a column Alone: 1 if the passenger has no relatives aboard, else 0
titanic['Alone'] = 0
titanic.loc[titanic.Family_Size == 0, 'Alone'] = 1

# sns.factorplot was renamed to sns.catplot in newer Seaborn releases;
# kind='point' reproduces the old default point plot.
sns.catplot(x='Family_Size', y='Survived', kind='point', data=titanic)

# Point plot for Alone
sns.catplot(x='Alone', y='Survived', kind='point', data=titanic)

Family_Size denotes the number of family members travelling with a passenger. It is calculated by summing the SibSp and Parch columns for that passenger. A second column, Alone, is added to compare the survival chances of a lone passenger with those of a passenger travelling with family.

Important observations:

  • If a passenger is alone, the survival rate is lower.
  • If the family size is greater than 5, the chances of survival decrease considerably.
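The point plots show the mean of Survived per group, and the same means can be tabulated directly. The following is a minimal sketch on hypothetical toy rows (not the real data) that builds both columns and compares the groups.

```python
import pandas as pd

# Hypothetical toy rows (not real data).
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 0],
    "Parch":    [0, 0, 0, 2, 1],
    "Survived": [0, 1, 0, 0, 1],
})

# Relatives aboard: siblings/spouses plus parents/children.
toy["Family_Size"] = toy["SibSp"] + toy["Parch"]

# 1 when the passenger has no relatives aboard, otherwise 0.
toy["Alone"] = (toy["Family_Size"] == 0).astype(int)

# Mean of the 0/1 Survived column per group is the survival rate.
by_alone = toy.groupby("Alone")["Survived"].mean()
by_family = toy.groupby("Family_Size")["Survived"].mean()
print(by_alone)
print(by_family)
```

Adding `.size()` per group alongside the means is a useful check, since large family sizes typically have very few passengers behind each rate.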

Code: Bar Plot for Fare (Continuous Feature)

Python3

# Divide Fare into 4 quantile-based bins
titanic['Fare_Range'] = pd.qcut(titanic['Fare'], 4)

# Bar plot: bar height shows the mean of Survived per bin
sns.barplot(x='Fare_Range', y='Survived', data=titanic)

Output : 

Fare denotes the fare paid by a passenger. As the values in this column are continuous, they need to be put into separate bins (as suggested for the Age feature) to get a clear picture. The plot indicates that passengers who paid a higher fare had a higher survival rate.
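`pd.qcut` chooses the bin edges from quantiles, so each bin receives about the same number of rows, unlike equal-width binning. The following is a small sketch on hypothetical fare values; the labels Q1 to Q4 are an illustrative choice, whereas the article's code keeps the default interval labels.

```python
import pandas as pd

# Hypothetical fares (not real data).
fares = pd.Series([5, 10, 15, 20, 25, 30, 35, 40])

# Four quantile-based bins; each one gets a quarter of the values.
fare_range = pd.qcut(fares, 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(fare_range.value_counts().sort_index())
```

With heavily skewed data such as fares, equal-width bins would put most passengers into the lowest bin, which is why quantile bins are a reasonable choice here.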

Code: Categorical Count Plots for Embarked Feature

Python3

# Count plot per port of embarkation, one panel for each Pclass
sns.catplot(x='Embarked', hue='Survived',
            kind='count', col='Pclass', data=titanic)

Some notable observations are:

  • Most passengers boarded at S, so the missing values can be filled with S.
  • Most passengers who boarded at Q were in class 3.
  • Among passengers who boarded at S, those in classes 1 and 2 fared better than those in class 3.
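Filling the missing Embarked values with the most frequent port can be sketched as follows. The toy column is hypothetical; on the real data the same two lines would be applied to `titanic['Embarked']`.

```python
import pandas as pd

# Hypothetical toy column with one missing value (not real data).
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# mode() returns the most frequent value(s); take the first one.
most_common = toy["Embarked"].mode().iloc[0]

# Replace the missing entries with that value.
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(toy["Embarked"].tolist())
```

Taking the mode from the data, rather than hard-coding "S", keeps the step valid if the sketch is reused on a different subset.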

Conclusion:

  • The columns that can be dropped are:
    • PassengerId, Name, Ticket, Cabin: They are identifiers or free-form strings that cannot be readily categorized and don’t contribute much to the outcome.
    • Age, Fare: The respective range columns are retained instead.
  • The Titanic data can be analyzed with many more plot types and column correlations than are described in this article.
  • Once the EDA is completed, the resultant dataset can be used for predictions.
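The column clean-up listed above can be sketched with `DataFrame.drop`. The frame below is a hypothetical single-row stand-in that only mirrors the column names, and the Age_Range and Fare_Range values are placeholder strings rather than real bins.

```python
import pandas as pd

# Hypothetical one-row frame with the relevant column names (not real data).
toy = pd.DataFrame({
    "PassengerId": [1], "Name": ["Doe, Mr. John"], "Ticket": ["X 123"],
    "Cabin": [None], "Age": [20.0], "Fare": [8.0],
    "Age_Range": ["teen"], "Fare_Range": ["Q2"],
    "Pclass": [3], "Survived": [0],
})

# Columns the conclusion suggests dropping.
to_drop = ["PassengerId", "Name", "Ticket", "Cabin", "Age", "Fare"]

# drop() returns a new frame; the original is left untouched.
prepared = toy.drop(columns=to_drop)
print(prepared.columns.tolist())
```

Name should only be dropped after any title-based age imputation has been done, since that step depends on it.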


samyuktashegde