
Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and Seaborn

Last Updated : 26 Dec, 2024

Exploratory Data Analysis (EDA) serves as the foundation of any data science project. It is an essential step where data scientists investigate datasets to understand their structure, identify patterns, and uncover insights. Data preparation involves several steps, including cleaning, transforming, and exploring data to make it suitable for analysis.

Why is EDA important in Data Science?

To effectively work with data, it’s essential to first understand the nature and structure of data. EDA helps answer critical questions about the dataset and guides the necessary preprocessing steps before applying any algorithms. For instance:

  • What type of data do we have? Are we working with numbers, text, or dates?
  • Are there outliers? These are unusual values that are very different from the rest.
  • Is anything missing? Are some parts of the dataset empty or incomplete?

Imagine you’re working with a student performance dataset. If some rows are missing test scores, or the names of subjects are inconsistently spelled (e.g., "Math" and "Mathematics"), you’ll need to address these issues before proceeding. EDA helps to identify such problems and clean the data to ensure reliable analysis.
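The checks described above can be sketched with Pandas (introduced later in this article). This is a minimal illustration, not a prescribed method: the student names, subjects, and scores are invented for the example, and the choice of "Math" as the canonical spelling is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical student performance records (invented for illustration)
df = pd.DataFrame({
    "student": ["Asha", "Ben", "Chen", "Dina", "Eli"],
    "subject": ["Math", "Mathematics", "Science", "Math", "Science"],
    "score": [72.0, np.nan, 88.0, 65.0, np.nan],
})

# What type of data do we have?
print(df.dtypes)

# Is anything missing? Count missing values per column
missing = df.isna().sum()
print(missing)

# Are subject names spelled consistently?
print(df["subject"].value_counts())

# Map the variant spelling onto one label (assumed canonical: "Math")
df["subject"] = df["subject"].replace({"Mathematics": "Math"})
labels = sorted(df["subject"].unique())
print(labels)
```

Counting values per label is often the quickest way to spot spelling variants, because each variant shows up as its own small category.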

Next, we look at the core packages for exploratory data analysis (EDA): NumPy, Pandas, Matplotlib, and Seaborn.

1. NumPy for Numerical Operations

NumPy is used for working with numerical data in Python.

  • Handles Large Datasets Efficiently: NumPy lets you work with large, multi-dimensional arrays and matrices of numerical data. It provides functions for mathematical operations such as linear algebra and statistical analysis.
  • Facilitates Data Transformation: Helps in sorting, reshaping, and aggregating data.

Example: Let’s analyze the distribution of a dataset of student exam scores using NumPy:

Python
import numpy as np

# Dataset: Exam scores
scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])  # Note: One extreme value (200)

# Calculate basic statistics
mean_score = np.mean(scores)
median_score = np.median(scores)
std_dev_score = np.std(scores)

print(f"Mean: {mean_score}, Median: {median_score}, Standard Deviation: {std_dev_score}")

Output
Mean: 77.77777777777777, Median: 65.0, Standard Deviation: 44.541560561838764 

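This example shows how quickly NumPy computes summary statistics. Note that the single extreme value (200) pulls the mean (about 77.8) well above the median (65.0), which is a first hint of an outlier. Such anomalies can also be detected with z-scores. The resources below cover NumPy in more depth.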

  • Introduction to NumPy
  • Basics of NumPy Arrays
    • Data types and type casting
    • Accessing and Modifying Data - Indexing and slicing
  • Broadcasting - Perform operations on arrays with different shapes
  • Linear algebra operations: Solving Mathematical Problems
  • Saving and loading NumPy arrays
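Returning to the exam scores example, the z-score check mentioned above could be sketched as follows. The cutoff of 2.0 is an assumption chosen for this small sample: the commonly quoted cutoff of 3.0 would not flag the value 200 here, because the outlier itself inflates the standard deviation.

```python
import numpy as np

scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])

# z-score: how many standard deviations each value lies from the mean.
# Broadcasting applies the scalar mean and std to every element.
z_scores = (scores - scores.mean()) / scores.std()

# Assumed cutoff for this illustration (not a universal rule)
threshold = 2.0
outliers = scores[np.abs(z_scores) > threshold]

print("z-scores:", np.round(z_scores, 2))
print("Outliers:", outliers)
```

Because `np.std` defaults to the population standard deviation (`ddof=0`), the z-scores here use the same standard deviation as the statistics printed earlier.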

2. Pandas for Data Manipulation

Built on top of NumPy, Pandas excels at handling tabular data (data organized in rows and columns) through its core data structures: Series (1D) and DataFrame (2D). Pandas simplifies working with structured data; the resources below cover its main uses:

  • Easy loading and saving of datasets in formats like CSV, Excel, SQL, or JSON:
    • Read Dataset with Pandas
    • Save DataFrame as CSV file for further use
    • Reading from JSON files into Pandas DataFrame
    • Working with Excel files
  • Data Processing with Pandas
  • Slicing rows with pandas Indexing
  • Data Aggregation and Grouping
  • Working with Date and Time
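The topics listed above (loading, slicing, grouping, dates, and saving) can be seen together in one short sketch. The CSV content, column names, and city values are hypothetical, and an in-memory string stands in for a file on disk so the example runs without any external data.

```python
import io
import pandas as pd

# Hypothetical CSV content (stand-in for a real file)
csv_text = """date,city,sales
2024-01-01,Delhi,100
2024-01-02,Delhi,150
2024-01-01,Mumbai,200
2024-02-01,Mumbai,50
"""

# Load: parse the date column while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Slice: select rows with a boolean mask and columns by label
delhi = df.loc[df["city"] == "Delhi", ["date", "sales"]]

# Group and aggregate: total and mean sales per city
totals = df.groupby("city")["sales"].agg(["sum", "mean"])

# Dates: total sales per calendar month
monthly = df.groupby(df["date"].dt.month)["sales"].sum()

# Save: with no path given, to_csv returns the CSV text
out = df.to_csv(index=False)

print(delhi)
print(totals)
print(monthly)
```

Passing a file path to `read_csv` and `to_csv` instead of the in-memory text gives the usual read-from-disk and write-to-disk behaviour.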

3. Matplotlib for Data Visualization

Matplotlib is a powerful and versatile open-source plotting library for Python, designed to help users visualize data in a variety of formats.

  • Introduction to Matplotlib
  • Pyplot in Matplotlib
  • Matplotlib – Axes Class
  • Matplotlib for 3D Plotting
  • Exploratory Data Analysis with matplotlib
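As a small illustration of Matplotlib in an EDA setting, the sketch below plots the exam scores from the NumPy example as a histogram and a box plot. The bin count, figure size, and the use of the non-interactive "Agg" backend with an in-memory buffer are assumptions made so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumed so no display is needed
import matplotlib.pyplot as plt
import numpy as np

scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])

# Two panels side by side: distribution shape and outlier view
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

counts, bin_edges, _ = ax_hist.hist(scores, bins=8)
ax_hist.set_title("Histogram of exam scores")
ax_hist.set_xlabel("Score")
ax_hist.set_ylabel("Count")

ax_box.boxplot(scores)
ax_box.set_title("Box plot of exam scores")
ax_box.set_ylabel("Score")

fig.tight_layout()

# Render to an in-memory PNG; a file path works here as well
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

In both panels the value 200 sits far from the rest of the data, which matches what the mean and median suggested numerically.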

4. Seaborn for Statistical Data Visualization

Seaborn is built on top of Matplotlib and is specifically designed for statistical data visualization. It provides a high-level interface for drawing attractive and informative statistical graphics.

  • Introduction to Seaborn
  • Types Of Seaborn Plots
  • Pairplot function in seaborn
  • FacetGrid in Seaborn
  • Time Series Visualization with Seaborn: Line Plot

Complete EDA Workflow Using NumPy, Pandas, and Seaborn

Let's implement a complete workflow for performing EDA: starting with numerical analysis using NumPy and Pandas, followed by visualizations using Seaborn to support data-driven decisions.

  • Performing EDA with Numpy and Pandas - Set 1
  • After analysis: Visualizing with Seaborn - Set 2
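Before following the two-part walkthrough, the numerical half of such a workflow can be previewed with a compact sketch. The data is hypothetical (the exam scores from earlier paired with invented study hours), and the 1.5 × IQR rule is one common convention for flagging outliers, swapped in here as an alternative to z-scores.

```python
import pandas as pd

# Hypothetical data: invented study hours paired with the earlier exam scores
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "score": [45, 50, 55, 60, 65, 70, 75, 80, 200],
})

# Step 1: summary statistics for every numeric column
summary = df.describe()
print(summary)

# Step 2: flag outliers with the 1.5 * IQR rule
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)
outliers = df[mask]
clean = df[~mask]
print(outliers)

# Step 3: compare the correlation with and without the outlier
corr_all = df["hours_studied"].corr(df["score"])
corr_clean = clean["hours_studied"].corr(clean["score"])
print(f"With outlier: {corr_all:.3f}, without: {corr_clean:.3f}")
```

Comparing a statistic before and after removing flagged rows shows how much a single extreme value can distort a relationship.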

For more hands-on practice, explore the projects below:

  • Titanic Data EDA using Seaborn
  • Uber Rides Data Analysis
  • Zomato Data Analysis Using Python
  • Global Covid-19 Data Analysis and Visualizations
  • iPhone Sales Analysis
  • Google Search Analysis

Web Scraping For EDA

Web scraping is the automated process of extracting data from websites for later analysis.

  • How to Extract Weather Data from Google in Python?
  • Movies Review Scraping And Analysis
  • Product Price Scraping and Analysis
  • News Scraping and Analysis
  • Real-time Share Price scraping and analysis
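The extraction step of web scraping can be sketched without any network access. In the example below, a hard-coded HTML snippet stands in for a downloaded page; the table contents and the `TableParser` helper are hypothetical, and a real scraper would first fetch the page and would typically use a dedicated parsing library.

```python
from html.parser import HTMLParser

import pandas as pd

# Hypothetical page content; a real scraper would download this first
html_doc = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Keyboard</td><td>25.50</td></tr>
  <tr><td>Mouse</td><td>12.00</td></tr>
  <tr><td>Monitor</td><td>150.00</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Hypothetical helper: collects the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Keep text only while inside a cell
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


parser = TableParser()
parser.feed(html_doc)

# First row is the header; the rest are records
header, *records = parser.rows
prices = pd.DataFrame(records, columns=header)
prices["price"] = prices["price"].astype(float)

print(prices.describe())
```

Scraped values arrive as text, so converting columns to numeric types is usually the first step before any EDA on them.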



anurag702
Article Tags :
  • Data Science
  • Data Analysis
  • AI-ML-DS
  • AI-ML-DS With Python
