Skip to content
geeksforgeeks
  • Tutorials
    • Python
    • Java
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
    • Practice Coding Problems
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Python Tutorial
  • Interview Questions
  • Python Quiz
  • Python Glossary
  • Python Projects
  • Practice Python
  • Data Science With Python
  • Python Web Dev
  • DSA with Python
  • Python OOPs
Open In App
Next Article:
Data Processing with Pandas
Next article icon

Data Processing with Pandas

Last Updated : 07 Apr, 2025
Comments
Improve
Suggest changes
Like Article
Like
Report

Data Processing is an important part of any task that includes data-driven work. It helps us to provide meaningful insights from the data. As we know Python is a widely used programming language, and there are various libraries and tools available for data processing.

In this article, we are going to see Data Processing in Python, Loading, Printing rows and Columns, Data frame summary, Missing data values Sorting and Merging Data Frames, Applying Functions, and Visualizing Dataframes.

Table of Content

  • What is Data Processing in Python?
  • What is Pandas?
  • Loading Data in Pandas DataFrame
  • Printing rows of the Data
  • Printing the column names of the DataFrame
  • Summary of Data Frame
  • Descriptive Statistical Measures of a DataFrame
  • Missing Data Handing
  • Sorting DataFrame values
  • Merge Data Frames
  • Apply Function
  • By using the lambda operator
  • Visualizing DataFrame
  • Conclusion

What is Data Processing in Python?

Data processing in Python refers to manipulating, transforming, and analyzing data by using Python. It contains a series of operations that aim to change raw data into structured data. or meaningful insights. By converting raw data into meaningful insights it makes it suitable for analysis, visualization, or other applications.Python provides several libraries and tools that facilitate efficient data processing, making it a popular choice for working with diverse datasets.

Refer to this article - Introduction to Data Processing

What is Pandas?

Pandas is a powerful, fast, and open-source library built on NumPy. It is used for data manipulation and real-world data analysis in Python. Easy handling of missing data, Flexible reshaping and pivoting of data sets, and size mutability make pandas a great tool for performing data manipulation and handling the data efficiently.

Loading Data in Pandas DataFrame

Reading CSV file using pd.read_csv and loading data into a data frame. Import pandas as using pd for the shorthand. You can download the data from here.

Python
#Importing pandas library import pandas as pd  #Loading data into a DataFrame data_frame=pd.read_csv('Mall_Customers.csv') 

Printing rows of the Data

By default, data_frame.head() displays the first five rows and data_frame.tail() displays last five rows. If we want to get first 'n' number of rows then we use, data_frame.head(n) similar is the syntax to print the last n rows of the data frame.

Python
#displaying first five rows display(data_frame.head())  #displaying last five rows display(data_frame.tail()) 

Output:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
195 196 Female 35 120 79
196 197 Female 45 126 28
197 198 Male 32 126 74
198 199 Male 32 137 18
199 200 Male 30 137 83
[4]
0s
# Program to print all the column name of the dataframe
print(list(data_frame.columns))

Printing the column names of the DataFrame

Python
# Program to print all the column name of the dataframe print(list(data_frame.columns)) 

Output:

['CustomerID', 'Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']

Summary of Data Frame

The functions info() prints the summary of a DataFrame that includes the data type of each column, RangeIndex (number of rows), columns, non-null values, and memory usage.

Python
data_frame.info() 

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Genre 200 non-null object
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB

Descriptive Statistical Measures of a DataFrame

The describe() function outputs descriptive statistics which include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. For numeric data, the result’s index will include count, mean, std, min, and max as well as lower, 50, and upper percentiles. For object data (e.g. strings), the result’s index will include count, unique, top, and freq. 

Python
data_frame.describe() 

Output:

       CustomerID         Age  Annual Income (k$)  Spending Score (1-100)
count 200.000000 200.000000 200.000000 200.000000
mean 100.500000 38.850000 60.560000 50.200000
std 57.879185 13.969007 26.264721 25.823522
min 1.000000 18.000000 15.000000 1.000000
25% 50.750000 28.750000 41.500000 34.750000
50% 100.500000 36.000000 61.500000 50.000000
75% 150.250000 49.000000 78.000000 73.000000
max 200.000000 70.000000 137.000000 99.000000

Missing Data Handing

Find missing values in the dataset

The isnull( ) detects the missing values and returns a boolean object indicating if the values are NA. The values which are none or empty get mapped to true values and not null values get mapped to false values.

Python
data_frame.isnull( ) 

Output:

     CustomerID  Genre    Age  Annual Income (k$)  Spending Score (1-100)
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
.. ... ... ... ... ...
195 False False False False False
196 False False False False False
197 False False False False False
198 False False False False False
199 False False False False False
[200 rows x 5 columns]
[8]
0s

Find the number of missing values in the dataset

To find out the number of missing values in the dataset, use data_frame.isnull( ).sum( ). In the below example, the dataset doesn't contain any null values. Hence, each column's output is 0.

Python
data_frame.isnull().sum() 

Output:

CustomerID                0
Genre 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64

Removing missing values

The data_frame.dropna( ) function removes columns or rows which contains atleast one missing values.

data_frame = data_frame.dropna()

By default, data_frame.dropna( ) drops the rows where at least one element is missing. data_frame.dropna(axis = 1) drops the columns where at least one element is missing.

Fill in missing values

We can fill null values using data_frame.fillna( ) function.

 data_frame = data_frame.fillna(value)

But by using the above format all the null values will get filled with the same values. To fill different values in the different columns we can use.

data_frame[col] = data_frame[col].fillna(value)

Row and column manipulations

Removing rows

By using the drop(index) function we can drop the row at a particular index. If we want to replace the data_frame with the row removed then add inplace = True in the drop function.

Python
#Removing 4th indexed value from the dataframe data_frame.drop(4).head() 

Output:

 CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
5 6 Female 22 17 76
[ ]

This function can also be used to remove the columns of a data frame by adding the attribute axis =1 and providing the list of columns we would like to remove.

Renaming rows

The rename function can be used to rename the rows or columns of the data frame.

Python
data_frame.rename({0:"First",1:"Second"}) 

Output:

        CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
First 1 Male 19 15 39
Second 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
... ... ... ... ... ...
195 196 Female 35 120 79
196 197 Female 45 126 28
197 198 Male 32 126 74
198 199 Male 32 137 18
199 200 Male 30 137 83
[200 rows x 5 columns]

Adding new columns

Python
#Creates a new column with all the values equal to 1 data_frame['NewColumn'] = 1 data_frame.head() 

Output:

CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
NewColumn
0 1
1 1
2 1
3 1
4 1

Sorting DataFrame values

 Sort by column

The sort_values( ) are the values of the column whose name is passed in the by attribute in the ascending order by default we can set this attribute to false to sort the array in the descending order.

Python
data_frame.sort_values(by='Age', ascending=False).head() 

Output:

CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
70 71 Male 70 49 55
60 61 Male 70 46 56
57 58 Male 69 44 46
90 91 Female 68 59 55
67 68 Female 68 48 48
NewColumn
70 1
60 1
57 1
90 1
67 1

 Sort by multiple columns

Python
data_frame.sort_values(by=['Age','Annual Income (k$)']).head(10) 

Output:

CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
33 34 Male 18 33 92
65 66 Male 18 48 59
91 92 Male 18 59 41
114 115 Female 18 65 48
0 1 Male 19 15 39
61 62 Male 19 46 55
68 69 Male 19 48 59
111 112 Female 19 63 54
113 114 Male 19 64 46
115 116 Female 19 65 50
NewColumn
33 1
65 1
91 1
114 1
0 1
61 1
68 1
111 1
113 1
115 1

Merge Data Frames

The merge() function in pandas is used for all standard database join operations. Merge operation on data frames will join two data frames based on their common column values. Let's create a data frame.

Python
#Creating dataframe1 df1 = pd.DataFrame({         'Name':['Jeevan', 'Raavan', 'Geeta', 'Bheem'],          'Age':[25, 24, 52, 40],          'Qualification':['Msc', 'MA', 'MCA', 'Phd']}) df1 

Output:

Name  Age Qualification
0 Jeevan 25 Msc
1 Raavan 24 MA
2 Geeta 52 MCA
3 Bheem 40 Phd

Now we will create another data frame.

Python
#Creating dataframe2 df2 = pd.DataFrame({'Name':['Jeevan', 'Raavan', 'Geeta', 'Bheem'],                     'Salary':[100000, 50000, 20000, 40000]}) df2 

Output:

Name  Salary
0 Jeevan 100000
1 Raavan 50000
2 Geeta 20000
3 Bheem 40000

Now. let's merge these two data frames created earlier.

Python
#Merging two dataframes df = pd.merge(df1, df2) df 

Output:

     Name  Age Qualification  Salary
0 Jeevan 25 Msc 100000
1 Raavan 24 MA 50000
2 Geeta 52 MCA 20000
3 Bheem 40 Phd 40000

Apply Function

By defining a function beforehand

The apply( ) function is used to iterate over a data frame. It can also be used with lambda functions.

Python
# Apply function def fun(value):     if value > 70:         return "Yes"     else:         return "No"  data_frame['Customer Satisfaction'] = data_frame['Spending Score (1-100)'].apply(fun) data_frame.head(10) 

Output:

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  \
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
5 6 Female 22 17 76
6 7 Female 35 18 6
7 8 Female 23 18 94
8 9 Male 64 19 3
9 10 Female 30 19 72
NewColumn Customer Satisfaction
0 1 No
1 1 Yes
2 1 No
3 1 Yes
4 1 No
5 1 Yes
6 1 No
7 1 Yes
8 1 No
9 1 Yes

By using the lambda operator

This syntax is generally used to apply log transformations and normalize the data to bring it in the range of 0 to 1 for particular columns of the data.

Python
const = data_frame['Age'].max() data_frame['Age'] = data_frame['Age'].apply(lambda x: x/const) data_frame.head() 

Output:

   CustomerID   Genre       Age  Annual Income (k$)  Spending Score (1-100)  \
0 1 Male 0.271429 15 39
1 2 Male 0.300000 15 81
2 3 Female 0.285714 16 6
3 4 Female 0.328571 16 77
4 5 Female 0.442857 17 40
NewColumn Customer Satisfaction
0 1 No
1 1 Yes
2 1 No
3 1 Yes
4 1 No

Visualizing DataFrame

Scatter plot

The plot( ) function is used to make plots of the data frames.

Python
# Visualization data_frame.plot(x ='CustomerID', y='Spending Score (1-100)',kind = 'scatter') 

Output:

Scatter plot of the Customer Satisfaction column
Scatter plot of the Customer Satisfaction column

Histogram

The plot.hist( ) function is used to make plots of the data frames.

Python
data_frame.plot.hist() 

Output:

Histogram for the distribution of the data
Histogram for the distribution of the data

Conclusion

There are other functions as well of pandas data frame but the above mentioned are some of the common ones generally used for handling large tabular data. One can refer to the pandas documentation as well to explore more about the functions mentioned above.


Next Article
Data Processing with Pandas

G

greeshmanalla
Improve
Article Tags :
  • Python
Practice Tags :
  • python

Similar Reads

    Pandas DataFrame.to_string-Python
    Pandas is a powerful Python library for data manipulation, with DataFrame as its key two-dimensional, labeled data structure. It allows easy formatting and readable display of data. DataFrame.to_string() function in Pandas is specifically designed to render a DataFrame into a console-friendly tabula
    5 min read
    Read And Write Tabular Data using Pandas
    Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental, high-level building block for doing practical, real-world data analysis in Python.The two primary d
    3 min read
    Streamlined Data Ingestion with Pandas
    Data Ingestion is the process of, transferring data, from varied sources to an approach, where it can be analyzed, archived, or utilized by an establishment. The usual steps, involved in this process, are drawing out data, from its current place, converting the data, and, finally loading it, in a lo
    9 min read
    Python | Pandas Series.data
    Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas series is a One-dimensional ndarray with axis labels. The labels need not be un
    2 min read
    Reading rpt files with Pandas
    In most cases, we usually have a CSV file to load the data from, but there are other formats such as JSON, rpt, TSV, etc. that can be used to store data. Pandas provide us with the utility to load data from them. In this article, we'll see how we can load data from an rpt file with the use of Pandas
    2 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences