
How to drop all columns with null values in a PySpark DataFrame?

Last Updated : 01 May, 2022

The pyspark.sql.DataFrameNaFunctions class in PySpark provides many methods for dealing with NULL/None values. One of them is the drop() function, which removes rows containing NULL values in DataFrame columns; df.dropna() performs the same operation, as shown in this article. Using drop(), you can remove rows based on any column, all columns, or a single or chosen subset of columns. The function is quite useful when you need to sanitize data before processing it. When a file is read into the PySpark DataFrame API, any column with an empty value becomes NULL in the DataFrame. In RDBMS SQL you must check each column for null values yourself, whereas the PySpark drop() method is more powerful: it examines all columns for null values and drops the matching rows in one call.

PySpark drop() Syntax 

The drop() method in PySpark takes three optional arguments that can be used to eliminate NULL values based on a single column, any column, all columns, or multiple DataFrame columns. Because drop() is a transformation, it produces a new DataFrame after removing the rows/records from the current DataFrame.

drop(how='any', thresh=None, subset=None)

All three parameters are optional; a short usage sketch follows the list below.

  • how – Accepts 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column; with 'all', a row is dropped only if all of its columns are NULL. The default is 'any'.
  • thresh – An int; rows with fewer than thresh non-null values are dropped. The default is None.
  • subset – An optional list of columns to check for NULL values. The default is None, meaning all columns are checked.
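
A minimal sketch of these parameters in action (the DataFrame df and the column names 'name' and 'city' are taken from the example below, not prescribed by the API):

Python3
# Drop rows that contain a NULL in any column (the default behaviour)
df.na.drop(how='any')

# Drop rows only when every column is NULL
df.na.drop(how='all')

# Keep only rows that hold at least two non-null values
df.na.drop(thresh=2)

# Look for NULLs only in the 'name' and 'city' columns
df.na.drop(subset=['name', 'city'])

# df.dropna() takes the same arguments and behaves identically
df.dropna(how='any', subset=['name', 'city'])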

Implementation

Before we begin, let's read a CSV file into a DataFrame. PySpark assigns null to String and Integer columns that have no value in a given row.

CSV Used:

 
Python3
import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession
import findspark

findspark.init()

spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

filePath = "example1.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)

In the resulting output, the name and city columns contain null values.

Drop Columns with NULL Values

Python3
def dropNullColumns(df):
    """
    Drops every column that contains at least one null value.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()  # 1
    col_to_drop = [k for k, v in null_counts.items() if v > 0]  # 2
    df = df.drop(*col_to_drop)  # 3
    return df

In the first line (marked # 1), we use PySpark's select method, which projects a set of expressions and returns a new DataFrame. The list comprehension inside the brackets builds one expression per column that counts the rows where that column is null; collect then retrieves the result from the DataFrame, and asDict turns it into a dict mapping each column name to its null count.
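
To illustrate, here is a minimal sketch of what null_counts contains, run on a tiny in-memory DataFrame (the data is hypothetical, not the article's CSV):

Python3
tiny = spark.createDataFrame(
    [("Alice", None), (None, "Paris"), (None, None)],
    ["name", "city"])

# One count expression per column: how many rows have a null there
null_counts = tiny.select([sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(c)
                           for c in tiny.columns]).collect()[0].asDict()
print(null_counts)  # {'name': 2, 'city': 2}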

In the second line (# 2), we keep only the columns whose null count is greater than 0, i.e., every column that contains at least one null value.

Having identified the columns containing null values, the third line (# 3) passes them to the drop function, and the function finally returns the resulting DataFrame.
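
The * in df.drop(*col_to_drop) unpacks the list into separate arguments, because drop is declared as drop(*cols) and expects column names rather than a list. A quick sketch of the equivalence (column names are hypothetical):

Python3
col_to_drop = ['name', 'city']
df.drop(*col_to_drop)  # equivalent to df.drop('name', 'city')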

Example:

CSV Used:

 
Python3
import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession
import findspark

findspark.init()

spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

filePath = "/content/swimming_pool.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)
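
This listing only reads and displays the CSV; to reproduce the final step of the example, apply the helper defined earlier (a minimal sketch, assuming dropNullColumns and df from above are in scope):

Python3
df = dropNullColumns(df)
df.show(truncate=False)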
 

After using the dropNullColumns function, every column that contained a null value has been removed, and df.show() displays only the remaining columns.