How to use Is Not Null in PySpark
Last Updated: 10 Jul, 2024
In data processing, handling null values is a crucial task to ensure the accuracy and reliability of the analysis. PySpark, the Python API for Apache Spark, provides powerful methods to handle null values efficiently. In this article, we will go through how to use the isNotNull method in PySpark to filter out null values from the data.
The isNotNull Method in PySpark
The isNotNull method in PySpark is used to filter rows in a DataFrame based on whether the values in a specified column are not null. It is particularly useful when dealing with large datasets where null values can impact the accuracy of your results. The method returns a Column of Boolean values, which are True for non-null values and False for null values. By using isNotNull, you can ensure that only rows with valid data are included in your analysis.
Syntax:
DataFrame.filter(Column.isNotNull())
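Because isNotNull returns a Boolean Column rather than a DataFrame, you can also select it directly to inspect the True/False values it produces. Here is a minimal sketch (the tiny DataFrame and the age_is_not_null alias are illustrative, not part of any fixed API):
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("isNotNullBooleanSketch").getOrCreate()

# A tiny illustrative DataFrame with one null Age
df = spark.createDataFrame([("Anna", 30), ("James", None)], ["Name", "Age"])

# Selecting the expression itself yields a Column of booleans:
# True where Age holds a value, False where it is null
df.select("Name", col("Age").isNotNull().alias("age_is_not_null")).show()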
Simple Example to Implement the isNotNull Method in PySpark
To use the isNotNull method in PySpark, you typically apply it to a DataFrame column and then use the filter function to retain only the rows that satisfy the condition.
In this example, we create a DataFrame with some null values, then use the isNotNull method to filter out any rows where the Age column contains null.
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize a Spark session
spark = SparkSession.builder.appName("isNotNullExample").getOrCreate()

# Create a DataFrame
data = [("James", None), ("Anna", 30), ("Julia", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Filter rows where Age is not null
df_filtered = df.filter(col("Age").isNotNull())

# Show the result
df_filtered.show()
Output:
+-----+---+
| Name|Age|
+-----+---+
| Anna| 30|
|Julia| 25|
+-----+---+
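As a side note, filter has the alias where, and the same predicate can be written in several equivalent ways. Assuming the df from the example above, all of the following should return the same rows:
Python
# Equivalent formulations of the same non-null filter
df.where(col("Age").isNotNull()).show()   # where() is an alias of filter()
df.filter(df.Age.isNotNull()).show()      # attribute-style column reference
df.filter("Age IS NOT NULL").show()       # SQL expression string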
Another Example to Implement isNotNull Method
Step 1: Initialize Spark Session
First, you need to initialize a Spark session. This is the entry point for using Spark functionality.
Python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Example of isNotNull in PySpark") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
Next, create a sample DataFrame that contains some null values.
Python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import Row

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create sample data
data = [
    Row(id=1, name="Alice", age=30),
    Row(id=2, name=None, age=25),
    Row(id=3, name="Bob", age=None),
    Row(id=None, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()
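The sample DataFrame deliberately contains one null in each column, which df.show() displays as:
+----+-------+----+
|  id|   name| age|
+----+-------+----+
|   1|  Alice|  30|
|   2|   NULL|  25|
|   3|    Bob|NULL|
|NULL|Charlie|  35|
+----+-------+----+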
Step 3: Use isNotNull to Filter Data
Now, use the isNotNull method to filter out rows where specific columns have null values. For example, let's filter out rows where the name column is null.
Python
from pyspark.sql.functions import col

# Filter DataFrame where 'name' is not null
filtered_df = df.filter(col("name").isNotNull())
filtered_df.show()
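If you prefer SQL, the same filter can be expressed through a temporary view. This is a small sketch reusing the df from Step 2 (the view name people is arbitrary):
Python
# The same non-null filter expressed in Spark SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE name IS NOT NULL").show()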
Step 4: Filter Multiple Columns
You can also filter out rows where multiple columns are not null by combining conditions with the & operator.
Python
# Filter DataFrame where 'name' and 'age' are not null
filtered_df_multiple = df.filter(col("name").isNotNull() & col("age").isNotNull())
filtered_df_multiple.show()
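Conversely, you can negate the condition to inspect the rows that such a filter would drop. A brief sketch, again using the df from Step 2:
Python
# Rows where 'name' OR 'age' is null -- the complement of the filter above
rows_with_nulls = df.filter(col("name").isNull() | col("age").isNull())
rows_with_nulls.show()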
Complete Code
Here is the complete code combining all the steps:
Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import Row
from pyspark.sql.functions import col

# Create a Spark session
spark = SparkSession.builder \
    .appName("Example of isNotNull in PySpark") \
    .getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create sample data
data = [
    Row(id=1, name="Alice", age=30),
    Row(id=2, name=None, age=25),
    Row(id=3, name="Bob", age=None),
    Row(id=None, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)
print("Original DataFrame:")
df.show()

# Filter DataFrame where 'name' is not null
filtered_df = df.filter(col("name").isNotNull())
print("Filtered DataFrame (name is not null):")
filtered_df.show()

# Filter DataFrame where 'name' and 'age' are not null
filtered_df_multiple = df.filter(col("name").isNotNull() & col("age").isNotNull())
print("Filtered DataFrame (name and age are not null):")
filtered_df_multiple.show()
Output:
Original DataFrame:
+----+-------+----+
|  id|   name| age|
+----+-------+----+
|   1|  Alice|  30|
|   2|   NULL|  25|
|   3|    Bob|NULL|
|NULL|Charlie|  35|
+----+-------+----+
Filtered DataFrame (name is not null):
+----+-------+----+
|  id|   name| age|
+----+-------+----+
|   1|  Alice|  30|
|   3|    Bob|NULL|
|NULL|Charlie|  35|
+----+-------+----+
Filtered DataFrame (name and age are not null):
+----+-------+---+
|  id|   name|age|
+----+-------+---+
|   1|  Alice| 30|
|NULL|Charlie| 35|
+----+-------+---+
Q: Can isNotNull be used with multiple columns?
Yes, you can chain multiple isNotNull checks across different columns using logical operators like & (and).
Q: What happens if I use isNotNull on a DataFrame with no null values?
If the column contains no null values, the filter keeps every row, so the result is a new DataFrame with the same contents as the original.
Q: Is isNotNull the only way to check for non-null values?
No. PySpark also provides the na.drop() method (and its alias dropna()), which drops rows containing null values across one or more columns.
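For example, dropping rows that have a null in either name or age with na.drop(subset=...) is equivalent to the combined isNotNull filter from Step 4. A short sketch, reusing the df from the step-by-step example:
Python
# Drop rows where 'name' or 'age' is null -- same result as
# df.filter(col("name").isNotNull() & col("age").isNotNull())
cleaned_df = df.na.drop(subset=["name", "age"])
cleaned_df.show()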