
How to Join Two DataFrame in Scala?

Last Updated : 09 Apr, 2024

Scala stands for "scalable language". It is a statically typed language, although unlike other statically typed languages such as C, C++, or Java, it does not require explicit type annotations while writing code; types are inferred and verified at compile time. Static typing allows us to build safe systems by default: smart built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.

Understanding Dataframe and Spark

A DataFrame is a data structure in Spark. Spark is used to develop distributed products, i.e. code that can run on many machines at the same time.

  1. The main purpose of such products is to process large data for business analysis.
  2. The DataFrame is a tabular structure that can store structured and semi-structured data.
  3. For unstructured data, we need to reshape it to fit into the dataframe.
  4. Dataframes are built on Spark's core API, RDDs, to provide type-safety, optimization, and other features.
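To make the RDD relationship concrete, here is a minimal sketch (the object and column names are ours, assuming a local SparkSession) that lifts an RDD of tuples into a DataFrame:

```scala
import org.apache.spark.sql.SparkSession

object RddToDf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    // An RDD of tuples: the low-level representation without column names
    val rdd = spark.sparkContext.parallelize(Seq((1, "Dhruv"), (2, "Akash")))

    // toDF() lifts the RDD into a DataFrame with named, typed columns
    val df = rdd.toDF("Id", "Name")
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```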

Building Sample DataFrames

Let us build two sample DataFrames in Scala to perform a join on.

Scala
import org.apache.spark.sql.SparkSession

object joindfs {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]").getOrCreate()

    val class_columns = Seq("Id", "Name")
    val class_data    = Seq((1, "Dhruv"), (2, "Akash"), (3, "Aayush"))
    val class_df      = spark.createDataFrame(class_data).toDF(class_columns: _*)

    val result_column = Seq("Id", "Subject", "Score")
    val result_data   = Seq(
      (1, "Maths", 98), (2, "Maths", 99), (3, "Maths", 94),
      (1, "Physics", 95), (2, "Physics", 97), (3, "Physics", 99)
    )
    val result_df = spark.createDataFrame(result_data).toDF(result_column: _*)

    class_df.show()
    result_df.show()
  }
}

Output:

[Image: class_df output]
[Image: result_df output]

Explanation:

Here we have formed two dataframes.

  1. The first one is the class dataframe which contains the information about students in a classroom.
  2. The second one is the result dataframe which contains the marks of students in Maths and Physics.
  3. We will form a combined dataframe that will contain both student and result information.

Let us see how to join these dataframes now.

Joining DataFrames

Use df.join()

We can join one dataframe with another using the join method. Let us see various examples of joins below.

Example 1: Specify join columns as a String or Sequence of Strings

If the column names in both dataframes are the same, we can simply list the names of the columns on which we want to join.

Scala
val joined_df = class_df.join(result_df, Seq("Id"))
// OR
val joined_df = class_df.join(result_df, "Id")

joined_df.show()

Output:

[Image: joined_df output]

As seen above, the dataframes were joined by specifying the name of the common column.
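The Seq form also accepts multiple column names, matching rows on all of them and keeping a single copy of each join column in the result. A hedged sketch with hypothetical two-key data (the Term column is ours, not from the article's tables):

```scala
import org.apache.spark.sql.SparkSession

object MultiColumnJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    // Hypothetical dataframes sharing TWO column names: Id and Term
    val a = Seq((1, "T1", "Dhruv"), (2, "T1", "Akash")).toDF("Id", "Term", "Name")
    val b = Seq((1, "T1", 98), (2, "T2", 99)).toDF("Id", "Term", "Score")

    // Rows match only when BOTH Id and Term are equal, so the
    // (2, "T1") row on the left finds no partner on the right
    val joined = a.join(b, Seq("Id", "Term"))
    joined.show()

    spark.stop()
  }
}
```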

Example 2: Specify join condition using expressions

If the column names are not the same, we can use a join expression to specify the match condition. The code for this type of join is as follows:

Scala
import spark.implicits._

val joined_df = class_df
  .join(result_df, class_df("Id") === result_df("Id"))
  .select(result_df("Id"), $"Name", $"Subject", $"Score")

joined_df.show()

Output:

[Image: joined_df output]

As seen above, the join is performed using the expression.
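To make the differing-names case concrete, here is a hedged sketch that renames one side's key column first (withColumnRenamed is standard Spark; the StudentId name is ours), so the expression join clearly matches two distinct columns and the select needs no disambiguation:

```scala
import org.apache.spark.sql.SparkSession

object ExprJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    val class_df  = Seq((1, "Dhruv"), (2, "Akash")).toDF("Id", "Name")
    // Same data, but the key column is named differently on this side
    val result_df = Seq((1, "Maths", 98), (2, "Maths", 99)).toDF("StudentId", "Subject", "Score")

    // With distinct key names, plain column references are unambiguous
    val joined_df = class_df
      .join(result_df, class_df("Id") === result_df("StudentId"))
      .select($"Id", $"Name", $"Subject", $"Score")
    joined_df.show()

    spark.stop()
  }
}
```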

We can also specify the join type in both of the above examples. Let us try a left join of the class dataframe with the result dataframe. We will remove the last Id from the result dataframe to verify that the left join is actually performed. The code for the left join is as follows:

Scala
val result_filtered_df = result_df.filter("Id in (1, 2)")

// Pass the join type as the third argument; using a Seq for the
// join columns works across Spark versions.
val joined_df = class_df.join(result_filtered_df, Seq("Id"), "left")
joined_df.show()

Output:

[Image: left join output]

As seen above, the left join was performed successfully; the missing result record was filled with NULL values. Similarly, we can perform a left join with the expression syntax of example 2 as well.
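The same third argument accepts other join types as well; Spark recognises values such as "inner", "left", "right", "full", "left_semi", and "left_anti". A hedged sketch on reduced copies of the article's data:

```scala
import org.apache.spark.sql.SparkSession

object JoinTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    val class_df  = Seq((1, "Dhruv"), (2, "Akash")).toDF("Id", "Name")
    val result_df = Seq((1, "Maths", 98), (3, "Maths", 94)).toDF("Id", "Subject", "Score")

    // "full" keeps unmatched rows from BOTH sides, padded with NULLs
    class_df.join(result_df, Seq("Id"), "full").show()

    // "left_anti" keeps only left rows with NO match on the right
    class_df.join(result_df, Seq("Id"), "left_anti").show()

    spark.stop()
  }
}
```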

Using SQL

We can also join the two dataframes using SQL. To do that, we first need to register the dataframes as temporary views. We then join the views and store the result in another dataframe. Let us see how to perform the join using SQL.

Scala
class_df.createOrReplaceTempView("class_df_view")
result_df.createOrReplaceTempView("result_df_view")

val joined_df = spark.sql(
  "SELECT cl.Id, Name, Subject, Score " +
  "FROM class_df_view cl INNER JOIN result_df_view rs ON cl.Id = rs.Id"
)
joined_df.show()

Output:

[Image: joined_df output]

As seen above, the views were joined to create a new dataframe. This method helps those familiar with SQL syntax and allows easy migration of SQL projects. Although the views are an extra step, they are temporary and are dropped automatically when the session ends.

Conclusion

In this article, we have seen how to join two dataframes in Scala. Broadly, this can be done either with the Scala join function or with SQL syntax. The Scala join function can be called in two ways: with strings or with expressions. The string method can be used when the column names are common to both dataframes; the join columns are then specified as a sequence of strings. When the column names differ, we can use expressions to specify the join condition. The SQL method creates temporary views from the dataframes, performs the join on them, and produces the joined dataframe from the result.


dvsingla28