Pandas dataframe.drop_duplicates()

Last Updated : 25 Nov, 2024

The Pandas drop_duplicates() method removes duplicate rows from a DataFrame, comparing either all columns or only the ones you specify.

By default, drop_duplicates() compares entire rows, keeps the first occurrence of each, and removes every later repeat. Let's see a quick example:

Python

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"]
}
df = pd.DataFrame(data)
print(df)  # original data; rows 0 and 2 are identical

# Removing duplicates
unique_df = df.drop_duplicates()
print(unique_df)

Output:

[Figure: Pandas dataframe.drop_duplicates()]

This example shows drop_duplicates() removing the repeated row (index 2) while retaining its first occurrence (index 0).

dataframe.drop_duplicates() Syntax in Python:

Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters:

subset: A column label or list of column labels to consider when identifying duplicates. The default is None, which compares all columns. (Optional)
keep: Controls which occurrence of a duplicate is kept. It accepts three values; the default is 'first'.
If 'first', the first occurrence is kept and later ones are dropped.
If 'last', the last occurrence is kept and earlier ones are dropped.
If False, all occurrences are dropped.
inplace: Boolean, default False. If True, the DataFrame is modified in place and the method returns None.
ignore_index: Boolean, default False. If True, the result is relabeled 0, 1, ..., n - 1 (available in pandas 1.0 and later).
Return type: DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
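As a minimal sketch of how the three keep options differ, the snippet below runs drop_duplicates() once per option on a small made-up DataFrame and records which index labels survive. The sample data and the variable name kept are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample data: rows 0 and 2 are identical.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
})

# Collect the surviving index labels for each keep option.
kept = {
    keep: df.drop_duplicates(keep=keep).index.tolist()
    for keep in ("first", "last", False)
}
print(kept)
# {'first': [0, 1, 3], 'last': [1, 2, 3], False: [1, 3]}
```

Comparing the surviving index labels makes it easy to see that 'first' and 'last' differ only in which copy of the repeated row they keep, while False drops both copies.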

Python dataframe.drop_duplicates(): Examples

Duplicate rows can arise from merging datasets, incorrect data entry, or other causes. drop_duplicates() identifies duplicates based on all columns (the default) or on the columns you specify, and removes them according to the keep setting. The examples below cover the most common uses of the method:
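To illustrate how combining datasets can introduce duplicates, here is a small sketch that stacks two made-up batches with pd.concat() and then removes the overlapping row. The frames batch_a and batch_b are hypothetical and not part of the examples that follow.

```python
import pandas as pd

# Two hypothetical batches that both contain the row for Bob.
batch_a = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
batch_b = pd.DataFrame({"Name": ["Bob", "David"], "Age": [30, 40]})

# Stacking the batches yields four rows, one of them a repeat.
combined = pd.concat([batch_a, batch_b], ignore_index=True)

# drop_duplicates() keeps the first Bob row and removes the second.
deduped = combined.drop_duplicates()
print(deduped)
```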

1. Dropping Duplicates Based on Specific Columns

You can target duplicates in specific columns using the subset parameter. This helps when certain fields are more relevant for identifying duplicates.

Python

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'SF', 'Chicago']
})

# Drop duplicates based on the 'Name' column
result = df.drop_duplicates(subset=['Name'])
print(result)

Output

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

Here, duplicates are identified by the Name column alone, so the second Alice row (index 2) is dropped even though its City differs. This is helpful when specific columns uniquely identify rows.
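The subset parameter also accepts several columns at once. The sketch below, using the same sample data, contrasts de-duplicating on Name alone with de-duplicating on the pair (Name, City); the variable names by_name and by_name_city are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "SF", "Chicago"],
})

# Name alone: the second Alice row counts as a duplicate.
by_name = df.drop_duplicates(subset=["Name"])

# Name and City together: (Alice, NY) and (Alice, SF) are distinct,
# so no row is removed.
by_name_city = df.drop_duplicates(subset=["Name", "City"])

print(len(by_name), len(by_name_city))  # 3 4
```

A row is treated as a duplicate only when it matches an earlier row in every column listed in subset.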

2. Keeping the Last Occurrence

By default, drop_duplicates() retains the first occurrence of duplicates. However, you can retain the last duplicate instead using keep='last'.

Python

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Keep the last occurrence of duplicates
result = df.drop_duplicates(keep='last')
print(result)

Output

    Name  Age     City
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago

Passing keep='last' retains the last occurrence of each duplicate instead of the first: here index 0 is dropped and index 2 is kept.
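One situation where keep='last' is handy is keeping the most recent record per key. The following is a sketch under assumptions: the Updated column and the records frame are hypothetical, and the rows are sorted by that column first so that "last" means "latest".

```python
import pandas as pd

# Hypothetical records with a timestamp column.
records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "City": ["NY", "LA", "SF"],
    "Updated": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-10"]),
})

# Sort oldest to newest, then keep the last row seen for each Name.
latest = (
    records.sort_values("Updated")
    .drop_duplicates(subset="Name", keep="last")
)
print(latest)
```

Because drop_duplicates() depends on row order, the sort step is what ties "last occurrence" to "newest record" in this sketch.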

3. Dropping All Duplicates

To remove all rows with duplicates, use keep=False. This keeps only rows that are entirely unique.

Python

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Drop all duplicates
result = df.drop_duplicates(keep=False)
print(result)

Output

    Name  Age     City
1    Bob   30       LA
3  David   40  Chicago

With keep=False, all occurrences of duplicate rows are removed, leaving only rows that are entirely unique across all columns.
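Before discarding rows with keep=False, it can help to look at what will be removed. The sketch below uses the related DataFrame.duplicated() method, which returns a Boolean mask; the variable names mask, removed, and unique_only are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
})

# True for every row that has at least one identical twin.
mask = df.duplicated(keep=False)

removed = df[mask]                           # rows that would be dropped
unique_only = df.drop_duplicates(keep=False)  # rows that survive

print(removed.index.tolist())      # [0, 2]
print(unique_only.index.tolist())  # [1, 3]
```

Filtering with the inverted mask, df[~mask], gives the same rows as drop_duplicates(keep=False).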

4. Modifying the Original DataFrame Directly

To modify the original DataFrame directly without creating a new one, use inplace=True.

Python

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})

# Modify the DataFrame in place
df.drop_duplicates(inplace=True)
print(df)

Output

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

Using inplace=True modifies the original DataFrame and returns None, so there is no result to assign to a new variable. It does not necessarily reduce memory use, because pandas may still build the de-duplicated data internally before updating the original.
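A common slip with inplace=True is assigning the return value, which is None. The sketch below shows that behavior and, as an alternative, reassigns the result while renumbering the rows with ignore_index=True. It assumes pandas 1.0 or later for ignore_index; the names returned and df2 are illustrative.

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
    "City": ["NY", "LA", "NY", "Chicago"],
}

# inplace=True mutates df and returns None, so do not assign the result.
df = pd.DataFrame(data)
returned = df.drop_duplicates(inplace=True)
print(returned)            # None
print(df.index.tolist())   # [0, 1, 3]

# Alternative: reassign, and renumber the rows with ignore_index=True
# (assumes pandas 1.0 or later).
df2 = pd.DataFrame(data).drop_duplicates(ignore_index=True)
print(df2.index.tolist())  # [0, 1, 2]
```

Reassignment keeps the call chainable, which is one reason it is often preferred over inplace=True.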

Kartikaybhutani
