Cleaning data with dropna in PySpark

Last Updated : 19 Jul, 2021

While dealing with a large DataFrame consisting of many rows and columns, some of the values may be NULL (None), and some rows may even be entirely NULL. If we apply an operation to a DataFrame that contains many NULL values, we may not get the correct or desired output. To get correct results, we first have to clean the DataFrame, i.e. make it free of NULL values.

So in this article, we will learn how to clean a DataFrame using the dropna() function, which drops rows containing NULL values on the basis of the given parameters. Note that, unlike its pandas counterpart, PySpark's dropna() only ever drops rows, never columns.

Syntax: df.dropna(how="any", thresh=None, subset=None)

where df is the DataFrame.

Parameters:

  • how: Determines when a row should be dropped.
    • 'any' - drop the row if any of its values is NULL.
    • 'all' - drop the row only if all of its values are NULL.
  • thresh: If set, keep only the rows that have at least thresh non-NULL values and drop the rest; this overrides the how parameter.
  • subset: An optional column name, or list of column names, to consider; only NULL values in these columns are counted.
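Incidentally, the same functionality is also exposed as df.na.drop(), which accepts the same three parameters; a minimal sketch of the equivalence, assuming df is any PySpark DataFrame (this alternative spelling is an addition, not part of the original example):

Python
# dropna() and na.drop() are equivalent; both drop rows containing nulls
cleaned = df.na.drop(how="any", thresh=None, subset=None)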

To drop the null values using the dropna() method, we will first create a PySpark DataFrame and then apply the method to it. Each example below is assumed to run on this original DataFrame, not on the result of the previous example.

Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Employee_detail.com") \
        .getOrCreate()
    return spk

def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Data Scientist", "Noida"),
                  (2, None, "Software Developer", None),
                  (3, "Swati", "Data Analyst", "Hyderabad"),
                  (4, None, None, "Noida"),
                  (5, "Arpit", "Android Developer", "Banglore"),
                  (6, "Ritik", None, None),
                  (None, None, None, None)]
    schema = ["Id", "Name", "Job Profile", "City"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
    df.show()

Output:
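Given the input data above, df.show() should print a table along these lines (exact null rendering varies by Spark version; Spark 3.4+ prints NULL instead of null):

+----+--------+------------------+---------+
|  Id|    Name|       Job Profile|     City|
+----+--------+------------------+---------+
|   1|Shivansh|    Data Scientist|    Noida|
|   2|    null|Software Developer|     null|
|   3|   Swati|      Data Analyst|Hyderabad|
|   4|    null|              null|    Noida|
|   5|   Arpit| Android Developer| Banglore|
|   6|   Ritik|              null|     null|
|null|    null|              null|     null|
+----+--------+------------------+---------+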

Example 1: Cleaning data with dropna using the how="any" parameter in PySpark.

In the below code, we have passed how="any" to the dropna() function, which means that any row containing at least one NULL value is dropped from the DataFrame.

Python
# drop every row that contains
# at least one null value
df = df.dropna(how="any")
df.show()

Output:
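Only the two rows without any NULL value survive, so the expected output is roughly:

+---+--------+--------------+---------+
| Id|    Name|   Job Profile|     City|
+---+--------+--------------+---------+
|  1|Shivansh|Data Scientist|    Noida|
|  3|   Swati|  Data Analyst|Hyderabad|
+---+--------+--------------+---------+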

Example 2: Cleaning data with dropna using the how="all" parameter in PySpark.

In the below code, we have passed how="all" to the dropna() function, which means that a row is dropped only when all of its values are NULL.

Python
# drop only the rows in which
# every value is null
df = df.dropna(how="all")
df.show()

Output: 
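Starting again from the original DataFrame, only the last row (all NULLs) is dropped, so the expected output is roughly:

+---+--------+------------------+---------+
| Id|    Name|       Job Profile|     City|
+---+--------+------------------+---------+
|  1|Shivansh|    Data Scientist|    Noida|
|  2|    null|Software Developer|     null|
|  3|   Swati|      Data Analyst|Hyderabad|
|  4|    null|              null|    Noida|
|  5|   Arpit| Android Developer| Banglore|
|  6|   Ritik|              null|     null|
+---+--------+------------------+---------+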

Example 3: Cleaning data with dropna using the thresh parameter in PySpark.

In the below code, we have passed thresh=2 to the dropna() function, which means that any row having fewer than 2 non-NULL values is dropped from the DataFrame.

Python
# drop the rows that have fewer
# than 2 non-null values
df = df.dropna(thresh=2)
df.show()

Output: 
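On the original DataFrame, every row except the all-NULL one has at least 2 non-NULL values, so the expected output is the same six-row table as in Example 2.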

Example 4: Cleaning data with dropna using the subset parameter in PySpark.

In the below code, we have passed subset="City" to the dropna() function, which means that a row is dropped whenever its City value is NULL; NULL values in the other columns are ignored.

Python
# drop the rows in which the City
# column is null
df = df.dropna(subset="City")
df.show()

Output: 
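Rows 2 and 6 have a NULL City, and the all-NULL row is dropped as well, so the expected output is roughly:

+---+--------+-----------------+---------+
| Id|    Name|      Job Profile|     City|
+---+--------+-----------------+---------+
|  1|Shivansh|   Data Scientist|    Noida|
|  3|   Swati|     Data Analyst|Hyderabad|
|  4|    null|             null|    Noida|
|  5|   Arpit|Android Developer| Banglore|
+---+--------+-----------------+---------+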

Example 5: Cleaning data with dropna using the thresh and subset parameters in PySpark.

In the below code, we have passed thresh=2 together with subset=("Id", "Name", "City") to the dropna() function, which means that a row is dropped when it has fewer than 2 non-NULL values among the columns Id, Name and City; the remaining columns are not considered.

Python
# drop the rows that have fewer than 2
# non-null values among Id, Name and City
df = df.dropna(thresh=2, subset=("Id", "Name", "City"))
df.show()

Output: 
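Row 2 has only one non-NULL value among Id, Name and City, and the all-NULL row has none, so both are dropped; the expected output is roughly:

+---+--------+-----------------+---------+
| Id|    Name|      Job Profile|     City|
+---+--------+-----------------+---------+
|  1|Shivansh|   Data Scientist|    Noida|
|  3|   Swati|     Data Analyst|Hyderabad|
|  4|    null|             null|    Noida|
|  5|   Arpit|Android Developer| Banglore|
|  6|   Ritik|             null|     null|
+---+--------+-----------------+---------+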


 

