Get number of rows and columns of PySpark dataframe

Last Updated : 13 Sep, 2021

In this article, we will discuss how to get the number of rows and the number of columns of a PySpark dataframe. To find the number of rows we will use count(), and to find the number of columns we will use the columns attribute with the len() function; a quick sketch of these calls follows the list below.

  • df.count(): Returns the number of rows in the Dataframe.
  • df.distinct().count(): Returns the number of distinct rows, i.e. rows that are not duplicated in the Dataframe.
  • df.columns: An attribute that returns the list of column names present in the Dataframe.
  • len(df.columns): Counts the number of items in that list, which is the number of columns.
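A minimal sketch of all four calls, assuming a local SparkSession (the tiny dataframe here is purely illustrative):

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# tiny illustrative dataframe with one duplicated row
df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")],
                           ["id", "value"])

print(df.count())             # total rows -> 3
print(df.distinct().count())  # distinct rows -> 2
print(df.columns)             # list of column names -> ['id', 'value']
print(len(df.columns))        # number of columns -> 2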

Example 1: Getting the number of rows and number of columns of a dataframe in PySpark.

Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Products.com") \
        .getOrCreate()
    return spk

# function to create Dataframe
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Direct-Cool Single Door Refrigerator", 12499),
                  (2, "Full HD Smart LED TV", 49999),
                  (3, "8.5 kg Washing Machine", 69999),
                  (4, "T-shirt", 1999),
                  (5, "Jeans", 3999),
                  (6, "Men's Running Shoes", 1499),
                  (7, "Combo Pack Face Mask", 999)]

    schm = ["Id", "Product Name", "Price"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schm)
    df.show()

    # extracting number of rows from the Dataframe
    row = df.count()

    # extracting number of columns from the Dataframe
    col = len(df.columns)

    # printing
    print(f'Dimension of the Dataframe is: {(row, col)}')
    print(f'Number of Rows are: {row}')
    print(f'Number of Columns are: {col}')

Output:

Dimension of the Dataframe is: (7, 3)
Number of Rows are: 7
Number of Columns are: 3

Explanation:

  • To count the number of rows we use df.count(), which returns the number of rows in the Dataframe, and we store the result in the variable 'row'.
  • To count the number of columns we use df.columns. Since this attribute returns the list of column names, we pass it to the len() function, which counts the items in the list; this gives the total number of columns, which we store in the variable 'col'.

Example 2: Getting the distinct number of rows and the number of columns of the Dataframe.

Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create Dataframe
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (6, "Swaroop", "Male", 23, 65),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70),
                  (7, "Reshabh", "Male", 19, 70),
                  (8, "Dinesh", "Male", 20, 75),
                  (9, "Rohit", "Male", 21, 85),
                  (9, "Rohit", "Male", 21, 85),
                  (10, "Sanjana", "Female", 22, 87)]

    schm = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schm)
    df.show()

    # extracting number of distinct rows from the Dataframe
    row = df.distinct().count()

    # extracting total number of rows from the Dataframe
    all_rows = df.count()

    # extracting number of columns from the Dataframe
    col = len(df.columns)

    # printing
    print(f'Dimension of the Dataframe is: {(row, col)}')
    print(f'Distinct Number of Rows are: {row}')
    print(f'Total Number of Rows are: {all_rows}')
    print(f'Number of Columns are: {col}')

Output:

Dimension of the Dataframe is: (10, 5)
Distinct Number of Rows are: 10
Total Number of Rows are: 14
Number of Columns are: 5
Explanation:

  • To count the number of distinct rows we use df.distinct().count(), which returns the number of distinct rows in the Dataframe, and we store the result in the variable 'row'.
  • To count the number of columns we again use df.columns. Since this attribute returns the list of column names, we pass it to the len() function, which counts the items in the list; this gives the total number of columns, which we store in the variable 'col'.

Example 3: Getting the number of columns using the dtypes attribute.

In this example, after creating the Dataframe we count the number of rows with the count() function, and we count the number of columns with the dtypes attribute. dtypes returns a list of tuples, one per column, where each tuple holds the column's name and datatype. Because the number of tuples equals the number of columns, len(df.dtypes) is another way to get the number of columns.
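For instance, on the student dataframe built in the code below, df.dtypes would return a list like the following (a sketch; the exact type strings depend on the input data):

Python
# dtypes is an attribute (not a method); it lists (name, type) pairs
print(df.dtypes)
# e.g. [('Id', 'bigint'), ('Name', 'string'), ('Gender', 'string'),
#       ('Age', 'bigint'), ('Percentage', 'bigint')]

print(len(df.dtypes))  # one tuple per column -> 5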


 

Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create Dataframe
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70),
                  (8, "Dinesh", "Male", 20, 75),
                  (9, "Rohit", "Male", 21, 85),
                  (10, "Sanjana", "Female", 22, 87)]

    schm = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schm)
    df.show()

    # extracting number of rows from the Dataframe
    row = df.count()

    # extracting number of columns from the Dataframe
    # using the dtypes attribute
    col = len(df.dtypes)

    # printing
    print(f'Dimension of the Dataframe is: {(row, col)}')
    print(f'Number of Rows are: {row}')
    print(f'Number of Columns are: {col}')

Output:

Dimension of the Dataframe is: (10, 5)
Number of Rows are: 10
Number of Columns are: 5

Example 4: Getting the dimension of the PySpark Dataframe by converting it to a Pandas Dataframe.

In this example, after creating the Dataframe we convert the PySpark Dataframe to a Pandas Dataframe using the toPandas() function, i.e. df.toPandas(). After the conversion we use the Pandas shape attribute to get the dimension of the Dataframe. shape returns a tuple of (rows, columns), which we index to print the number of rows and the number of columns individually.
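As a minimal sketch (assuming df is any PySpark Dataframe), the tuple returned by shape can also be unpacked directly. Note that toPandas() collects the entire Dataframe to the driver, so this approach only suits data small enough to fit in driver memory:

Python
# toPandas() pulls all rows to the driver - fine for small Dataframes,
# risky for large ones
rows, cols = df.toPandas().shape
print(f'Number of Rows are: {rows}')
print(f'Number of Columns are: {cols}')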


 

Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create Dataframe
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70),
                  (8, "Dinesh", "Male", 20, 75),
                  (9, "Rohit", "Male", 21, 85),
                  (10, "Sanjana", "Female", 22, 87)]

    schm = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schm)
    df.show()

    # converting PySpark df to Pandas df
    # using the toPandas() function
    new_df = df.toPandas()

    # using the Pandas shape attribute to get
    # the dimension of the df
    dimension = new_df.shape

    # printing
    print("Dimension of the Dataframe is: ", dimension)
    print(f'Number of Rows are: {dimension[0]}')
    print(f'Number of Columns are: {dimension[1]}')

Output:

Dimension of the Dataframe is:  (10, 5)
Number of Rows are: 10
Number of Columns are: 5

