Pandas Dataframe Difference
Last Updated : 16 Dec, 2024
When working with multiple DataFrames, you might want to compute the differences between them, such as identifying rows that are in one DataFrame but not in another. Pandas provides various ways to compute the difference between DataFrames, whether it's comparing rows, columns, or entire DataFrames. This is useful in data analysis, especially when you need to track changes between datasets over time or compare two similar datasets.
In this article, we will explore methods to find the difference between DataFrames using Pandas.
```python
import pandas as pd

# Create DataFrames for Dataset 1 and Dataset 2
data1 = {'Name': ['John', 'Alice', 'Bob', 'Eve'],
         'Age': [25, 30, 22, 35],
         'Gender': ['Male', 'Female', 'Male', 'Female']}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['John', 'Alice', 'Charlie', 'Eve'],
         'Age': [25, 32, 28, 35],
         'Gender': ['Male', 'Female', 'Male', 'Female']}
df2 = pd.DataFrame(data2)
```
Finding Rows in One DataFrame but Not in Another
The most common way to find the difference between DataFrames is to identify rows that are in one DataFrame but not in the other. This can be done using the merge() method with the indicator=True option or by using isin() method.
- Use merge() with indicator=True to identify differences.
```python
# Merge the DataFrames with the 'indicator' flag to track the source of each row
merged_df = pd.merge(df1, df2, how='outer', indicator=True)

# Find rows that are only in df1 but not in df2
diff_df1 = merged_df[merged_df['_merge'] == 'left_only']
print(diff_df1)

# Find rows that are only in df2 but not in df1
diff_df2 = merged_df[merged_df['_merge'] == 'right_only']
print(diff_df2)
```
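The merge() method is used with the indicator=True flag to add a new column (_merge) that shows whether a row is only in df1, only in df2, or in both. Because no on argument is given, the merge joins on every column the two DataFrames share (Name, Age and Gender), so a row counts as 'both' only when all of its values match. We then filter for rows where _merge is 'left_only' (rows unique to df1) or 'right_only' (rows unique to df2). With the sample data, Alice appears in both results because her Age differs between the two DataFrames.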
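When the goal is to tell changed rows apart from added or removed ones, the merge can be done on a key column instead of on all columns. The following is a minimal sketch that assumes Name uniquely identifies a person in both DataFrames; the suffixes '_df1' and '_df2' and the variable names are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

# Join on the assumed key column so each name yields one row in the result
by_name = pd.merge(df1, df2, on='Name', how='outer',
                   suffixes=('_df1', '_df2'), indicator=True)

# Names present in both DataFrames whose Age or Gender differ
in_both = by_name[by_name['_merge'] == 'both']
changed = in_both[(in_both['Age_df1'] != in_both['Age_df2'])
                  | (in_both['Gender_df1'] != in_both['Gender_df2'])]
print(changed[['Name', 'Age_df1', 'Age_df2']])

# Names that exist on one side only
only_df1 = by_name.loc[by_name['_merge'] == 'left_only', 'Name'].tolist()
only_df2 = by_name.loc[by_name['_merge'] == 'right_only', 'Name'].tolist()
print(only_df1, only_df2)
```

Here Alice is reported as changed, while Bob and Charlie are reported as present on one side only. If Name were not unique, the join would produce one row per matching pair, so this approach suits data with a reliable key.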
Finding the Difference in Values (Element-wise)
If you want to find the difference between corresponding elements in two DataFrames, you can subtract one DataFrame from another. This works for numerical data. Pandas aligns the two DataFrames on their index and column labels before subtracting, so values are paired by label rather than by position.
```python
# Subtract df2 from df1 (numerical columns only)
df_diff = df1.select_dtypes(include=['number']) - df2.select_dtypes(include=['number'])
print(df_diff)
```
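The select_dtypes(include=['number']) method selects only the numerical columns for subtraction. Subtracting the corresponding values in df1 and df2 produces a new DataFrame with the element-wise differences. Note that both sample DataFrames use the default index 0 to 3, so the row labelled 2 pairs Bob from df1 with Charlie from df2; the result is only meaningful when the same label refers to the same record in both DataFrames.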
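One way to make the subtraction compare like with like is to index both DataFrames by a key first. The sketch below assumes Name is a unique key; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

# Use Name as the row label so subtraction pairs the same person on both sides
ages1 = df1.set_index('Name')['Age']
ages2 = df2.set_index('Name')['Age']

# Labels found on only one side (Bob, Charlie) produce NaN
age_diff = ages1 - ages2
print(age_diff)

# Keep only the names that exist in both DataFrames
print(age_diff.dropna())
```

The NaN entries double as a signal that a name is missing from one of the DataFrames, and dropna() leaves the differences for the shared names.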
Using isin to Find Values Not Shared Between DataFrames
The isin() method is another way to compare DataFrames. It tests whether each value in a column appears in a given set of values, which lets you filter for rows in one DataFrame whose key does not appear in the other.
```python
# Find rows in df1 whose Name does not appear in df2
df_diff = df1[~df1['Name'].isin(df2['Name'])]
print(df_diff)
```
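The isin() method checks if each value in the Name column of df1 is present in the Name column of df2. The tilde (~) negates the result, meaning we filter for rows in df1 whose Name does not exist in df2. Because only Name is checked, a row whose other columns changed (such as Alice, whose Age differs) is not reported; here the result contains only Bob.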
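If whole rows should be compared rather than a single column, one option is to turn each row into a tuple-like label and apply isin() to those. This is a sketch using pd.MultiIndex.from_frame, and it assumes both DataFrames have the same columns in the same order.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

# Represent every row as one (Name, Age, Gender) entry
rows1 = pd.MultiIndex.from_frame(df1)
rows2 = pd.MultiIndex.from_frame(df2)

# Keep the rows of df1 that have no exact match in df2
missing_rows = df1[~rows1.isin(rows2)]
print(missing_rows)
```

With the sample data this keeps both Alice (Age 30 in df1, 32 in df2) and Bob, matching what the 'left_only' filter of the merge approach returns.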
Comparing DataFrame Indexes
You may also want to compare the indexes of two DataFrames to see if they are the same or different. You can use the .index attribute to compare indexes between DataFrames.
```python
# Compare indexes between df1 and df2
index_diff = df1.index.difference(df2.index)
print(index_diff)
```
The difference() method returns the index labels that are present in df1 but not in df2. This is useful when you want to check whether the row labels (indexes) are the same across DataFrames. Both sample DataFrames use the default index 0 to 3, so the result here is an empty index.
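The comparison becomes more informative once the index carries meaningful labels. The sketch below assumes Name is used as the index; symmetric_difference() is the companion method that returns labels found in exactly one of the two indexes.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

# Use Name as the row label in both DataFrames
names1 = df1.set_index('Name').index
names2 = df2.set_index('Name').index

# Labels in df1 but not in df2, and the reverse
only_in_df1 = names1.difference(names2)
only_in_df2 = names2.difference(names1)

# Labels that appear in exactly one of the two indexes
in_one_only = names1.symmetric_difference(names2)

print(list(only_in_df1), list(only_in_df2), list(in_one_only))
```

With the sample data, Bob is unique to df1 and Charlie is unique to df2.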
Summary:
Pandas provides multiple methods for finding the difference between DataFrames, each suited for specific use cases:
- merge() with the indicator=True flag is great for finding rows that differ between DataFrames.
- Subtraction is useful for comparing numerical values element-wise.
- isin() is helpful for filtering rows whose values in a chosen column do not appear in the other DataFrame.
- difference() can be used to compare DataFrame indexes.
These techniques can be combined and customized to suit a variety of data comparison tasks in your analysis workflow.