How to use Is Not Null in PySpark

Last Updated: 10 Jul, 2024

In data processing, handling null values is a crucial task to ensure the accuracy and reliability of the analysis. PySpark, the Python API for Apache Spark, provides powerful methods to handle null values efficiently. In this article, we will go through how to use the isNotNull method in PySpark to filter out null values from the data.

The isNotNull Method in PySpark

The isNotNull method in PySpark is used to filter rows in a DataFrame based on whether the values in a specified column are not null. It is particularly useful when dealing with large datasets, where null values can distort the results of an analysis. The method returns a Column of Boolean values, True for non-null values and False for null values, so filtering on it keeps only the rows with valid data in that column.

Syntax:

DataFrame.filter(Column.isNotNull())
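
For a quick look at what the method actually returns, you can select the Boolean column instead of filtering on it. A minimal sketch, assuming a DataFrame df with a nullable Age column like the one built in the example below:

Python
from pyspark.sql.functions import col

# isNotNull() produces a Boolean Column: True where Age has a value,
# False where it is null
df.select(col("Age"), col("Age").isNotNull().alias("age_is_not_null")).show()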

Simple Example of the isNotNull Method in PySpark

To use the isNotNull method in PySpark, you typically apply it to a DataFrame column and then use the filter function to retain only the rows that satisfy the condition.

In this example, we create a DataFrame with some null values and then use the isNotNull method to filter out any rows where the 'Age' column contains null.

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize a Spark session
spark = SparkSession.builder.appName("isNotNullExample").getOrCreate()

# Create a DataFrame with a null value in the Age column
data = [("James", None), ("Anna", 30), ("Julia", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Filter rows where Age is not null
df_filtered = df.filter(col("Age").isNotNull())

# Show the result
df_filtered.show()

Output:

+-----+---+
| Name|Age|
+-----+---+
| Anna| 30|
|Julia| 25|
+-----+---+
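
The same condition can also be written as a SQL expression string, which filter (and its alias where) accepts directly:

Python
# Equivalent filter using a SQL expression string instead of the Column API
df.filter("Age IS NOT NULL").show()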

A Step-by-Step Example of the isNotNull Method

Step 1: Initialize Spark Session

First, you need to initialize a Spark session. This is the entry point for using Spark functionality.

Python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Example of isNotNull in PySpark") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

Next, create a sample DataFrame that contains some null values.

Python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import Row

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create sample data
data = [
    Row(id=1, name="Alice", age=30),
    Row(id=2, name=None, age=25),
    Row(id=3, name="Bob", age=None),
    Row(id=None, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()

Step 3: Use isNotNull to Filter Data

Now, use the isNotNull method to filter out rows where specific columns have null values. For example, let's filter out rows where the name column is null.

Python
from pyspark.sql.functions import col

# Filter DataFrame where 'name' is not null
filtered_df = df.filter(col("name").isNotNull())
filtered_df.show()

Step 4: Filter Multiple Columns

You can also keep only the rows where multiple columns are non-null by combining isNotNull conditions with the & operator.

Python
# Filter DataFrame where 'name' and 'age' are not null
filtered_df_multiple = df.filter(col("name").isNotNull() & col("age").isNotNull())
filtered_df_multiple.show()

Complete Code

Here is the complete code combining all the steps:

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import Row
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder \
    .appName("Example of isNotNull in PySpark") \
    .getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create sample data
data = [
    Row(id=1, name="Alice", age=30),
    Row(id=2, name=None, age=25),
    Row(id=3, name="Bob", age=None),
    Row(id=None, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)
print("Original DataFrame:")
df.show()

# Filter DataFrame where 'name' is not null
filtered_df = df.filter(col("name").isNotNull())
print("Filtered DataFrame (name is not null):")
filtered_df.show()

# Filter DataFrame where 'name' and 'age' are not null
filtered_df_multiple = df.filter(col("name").isNotNull() & col("age").isNotNull())
print("Filtered DataFrame (name and age are not null):")
filtered_df_multiple.show()

Output:

Original DataFrame:
+----+-------+----+
|  id|   name| age|
+----+-------+----+
|   1|  Alice|  30|
|   2|   NULL|  25|
|   3|    Bob|NULL|
|NULL|Charlie|  35|
+----+-------+----+

Filtered DataFrame (name is not null):
+----+-------+----+
|  id|   name| age|
+----+-------+----+
|   1|  Alice|  30|
|   3|    Bob|NULL|
|NULL|Charlie|  35|
+----+-------+----+

Filtered DataFrame (name and age are not null):
+----+-------+---+
|  id|   name|age|
+----+-------+---+
|   1|  Alice| 30|
|NULL|Charlie| 35|
+----+-------+---+

Q: Can isNotNull be used with multiple columns?

Yes, you can chain multiple isNotNull checks across different columns using logical operators such as & (and) and | (or).
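
A short sketch using the sample df from the steps above: & keeps rows where both columns are non-null, while | keeps rows where at least one of them is.

Python
from pyspark.sql.functions import col

# Both 'name' and 'age' must be non-null
df.filter(col("name").isNotNull() & col("age").isNotNull()).show()

# At least one of 'name' and 'age' must be non-null
df.filter(col("name").isNotNull() | col("age").isNotNull()).show()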

Q: What happens if I use isNotNull on a DataFrame with no null values?

If the column contains no null values, the filter keeps every row, so the result is identical to the original DataFrame.

Q: Is isNotNull the only way to check for non-null values?

No. PySpark also offers the na.drop() function (alias dropna()), which drops rows containing null values, optionally restricted to a subset of columns.
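
For example, na.drop() with the subset parameter gives the same result as chaining isNotNull filters on those columns. A sketch based on the sample df above:

Python
# Drop rows where any column is null
df.na.drop().show()

# Drop rows only when 'name' or 'age' is null, equivalent to
# filtering with isNotNull on both columns
df.na.drop(subset=["name", "age"]).show()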

