Skip to content
geeksforgeeks
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Tutorials
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
  • Practice
    • Build your AI Agent
    • GfG 160
    • Problem of the Day
    • Practice Coding Problems
    • GfG SDE Sheet
  • Contests
    • Accenture Hackathon (Ending Soon!)
    • GfG Weekly [Rated Contest]
    • Job-A-Thon Hiring Challenge
    • All Contests and Events
  • Data Science
  • Data Science Projects
  • Data Analysis
  • Data Visualization
  • Machine Learning
  • ML Projects
  • Deep Learning
  • NLP
  • Computer Vision
  • Artificial Intelligence
Open In App
Next Article:
Exploratory Data Analysis in R Programming
Next article icon

Exploratory Data Analysis in Python | Set 1

Last Updated : 22 Jan, 2025
Comments
Improve
Suggest changes
Like Article
Like
Report

This article provides a comprehensive guide to performing Exploratory Data Analysis (EDA) using Python focusing on the use of NumPy and Pandas for data manipulation and analysis.

Step 1: Setting Up Environment

To perform EDA in Python we need to import several libraries that provide powerful tools for data manipulation and statistical analysis.

Python
import numpy as np  import pandas as pd  import seaborn as sns  import matplotlib.pyplot as plt      from scipy.stats import trim_mean  

Step 2: Loading and Inspecting the Dataset

In this step we load a dataset using Pandas and explore its structure. We can check the type of data and print the first and last 10 records to get a idea of the dataset .

Python
data = pd.read_csv("state.csv")   print ("Type : ", type(data), "\n\n")   print ("Head -- \n", data.head(10))   print ("\n Tail -- \n", data.tail(10))  

Output :

Type : class 'pandas.core.frame.DataFrame'


Head --
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
5 Colorado 5029196 2.8 CO
6 Connecticut 3574097 2.4 CT
7 Delaware 897934 5.8 DE
8 Florida 18801310 5.8 FL
9 Georgia 9687653 5.7 GA


Tail --
State Population Murder.Rate Abbreviation
40 South Dakota 814180 2.3 SD
41 Tennessee 6346105 5.7 TN
42 Texas 25145561 4.4 TX
43 Utah 2763885 2.3 UT
44 Vermont 625741 1.6 VT
45 Virginia 8001024 4.1 VA
46 Washington 6724540 2.5 WA
47 West Virginia 1852994 4.0 WV
48 Wisconsin 5686986 2.9 WI
49 Wyoming 563626 2.7 WY

Step 3: Adding and Modifying Columns

Derived columns are new columns created from existing ones. For example here we are converting the population into millions to make it more readable.

Python
data['PopulationInMillions'] = data['Population']/1000000  print (data.head(5)) 

Output :

        State  Population  Murder.Rate Abbreviation  PopulationInMillions
0 Alabama 4779736 5.7 AL 4.779736
1 Alaska 710231 5.6 AK 0.710231
2 Arizona 6392017 4.7 AZ 6.392017
3 Arkansas 2915918 5.6 AR 2.915918
4 California 37253956 4.4 CA 37.253956

Sometimes, you may need to rename columns when column names contain special characters or spaces which cause issues in data manipulation. To do this we use .rename() function.

Python
data.rename(columns ={'Murder.Rate': 'MurderRate'}, inplace = True)  list(data) 

Output :

['State', 'Population', 'MurderRate', 'Abbreviation']

Step 4: Describing the Data

Using describe() provides a summary of the dataset which includes count, mean, standard deviation and more for each numerical column.

Python
df.describe() 

Output :

The info() method in pandas provides a summary of the dataset includes number of rows , column names, data types of each column, and the memory usage of the entire dataframe. It helps to quickly understand the structure and size of the dataset.

Python
data.info() 

Output :


RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
State 50 non-null object
Population 50 non-null int64
Murder.Rate 50 non-null float64
Abbreviation 50 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 1.6+ KB

Step 5: Calculating Central Tendencies

Understanding the central tendencies of your data helps you summarize it effectively. In this step we will calculate different central tendency measures such as the mean, trimmed mean, weighted mean, and median for the dataset’s numerical columns.

1. Mean

The mean is the average value of a dataset. It’s calculated by summing all values and dividing by the number of values. In pandas it can be with help of mean() function.

Python
Population_mean = data.Population.mean()  print ("Population Mean : ", Population_mean)     MurderRate_mean = data.MurderRate.mean()  print ("\nMurderRate Mean : ", MurderRate_mean)  

Output:

Population Mean :  6162876.3

MurderRate Mean : 4.066

2. Trimmed Mean

Trimmed mean calculates the average by removing a certain percentage of the highest and lowest values in the dataset. This helps reduce the impact of outliers or extreme values that could skew the overall mean.

Python
population_TM = trim_mean(data.Population, 0.1)  print ("Population trimmed mean: ", population_TM)     murder_TM = trim_mean(data.MurderRate, 0.1)  print ("\nMurderRate trimmed mean: ", murder_TM)  

Output :

Population trimmed mean:  4783697.125

MurderRate trimmed mean: 3.9450000000000003

3. Weighted Mean

A weighted mean assigns different weights to different data points. Here we calculate the murder rate weighted by the population meaning larger states have more influence on the mean.

Python
murderRate_WM = np.average(data.MurderRate, weights = data.Population)  print ("Weighted MurderRate Mean: ", murderRate_WM)  

Output :

Weighted MurderRate Mean:  4.445833981123393

4. Median

The median is the middle value when the data is sorted and it is useful for understanding the central tendency especially when the data has outliers.

Python
Population_median = data.Population.median()  print ("Population median : ", Population_median)     MurderRate_median = data.MurderRate.median()  print ("\nMurderRate median : ", MurderRate_median)  

Output :

Population median :  4436369.5

MurderRate median : 4.0

In conclusion you have learn how to use Pandas to perform various EDA tasks such as loading data, inspecting data types, adding and modifying columns, and calculating key statistics like mean, median, and trimmed mean.



Next Article
Exploratory Data Analysis in R Programming
author
mohit gupta_omg :)
Improve
Article Tags :
  • AI-ML-DS
  • Machine Learning
  • AI-ML-DS With Python
  • ML-EDA
Practice Tags :
  • Machine Learning

Similar Reads

  • Exploratory Data Analysis in Python | Set 2
    In the previous article, we have discussed some basic techniques to analyze the data, now let's see the visual techniques. Let's see the basic techniques - # Loading Libraries import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from scipy.stats import trim_me
    4 min read
  • EDA | Exploratory Data Analysis in Python
    Exploratory Data Analysis (EDA) is a important step in data analysis which focuses on understanding patterns, trends and relationships through statistical tools and visualizations. Python offers various libraries like pandas, numPy, matplotlib, seaborn and plotly which enables effective exploration
    6 min read
  • What is Exploratory Data Analysis?
    Exploratory Data Analysis (EDA) is a important step in data science as it visualizing data to understand its main features, find patterns and discover how different parts of the data are connected. In this article, we will see more about Exploratory Data Analysis (EDA). Why Exploratory Data Analysis
    8 min read
  • Exploratory Data Analysis in R Programming
    Exploratory Data Analysis or EDA is a statistical approach or technique for analyzing data sets to summarize their important and main characteristics generally by using some visual aids. The EDA approach can be used to gather knowledge about the following aspects of data. Main characteristics or fea
    11 min read
  • Exploratory Data Analysis on Iris Dataset
    Exploratory Data Analysis (EDA) is a technique to analyze data using some visual Techniques. With this technique, we can get detailed information about the statistical summary of the data. We will also be able to deal with the duplicates values, outliers, and also see some trends or patterns present
    9 min read
  • Sequential Data Analysis in Python
    Sequential data, often referred to as ordered data, consists of observations arranged in a specific order. This type of data is not necessarily time-based; it can represent sequences such as text, DNA strands, or user actions. In this article, we are going to explore, sequential data analysis, it's
    8 min read
  • Exploratory Data Analysis (EDA) - Types and Tools
    Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore data, and possibly formulate hypotheses that might cause new data collection and experiments. EDA focuses more narrowly on checking assumptions required for model fitting and hypothesis testing. It also checks
    6 min read
  • Steps for Mastering Exploratory Data Analysis | EDA Steps
    Mastering exploratory data analysis (EDA) is crucial for understanding your data, identifying patterns, and generating insights that can inform further analysis or decision-making. Data is the lifeblood of cutting-edge groups, and the capability to extract insights from records has become a crucial
    15+ min read
  • 10 Python Pandas tips to make data analysis faster
    Data analysis using Python's Pandas library is a powerful process, and its efficiency can be enhanced with specific tricks and techniques. These Python tips will make our code concise, readable, and efficient. The adaptability of Pandas makes it an efficient tool for working with structured data. Wh
    15 min read
  • Data Analysis with Python
    In this article, we will discuss how to do data analysis with Python. We will discuss all sorts of data analysis i.e. analyzing numerical data with NumPy, Tabular data with Pandas, data visualization Matplotlib, and Exploratory data analysis. Data Analysis With Python Data Analysis is the technique
    15+ min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences