
Find Duplicate Rows In A Pandas Dataframe

I am trying to find duplicate rows in a pandas DataFrame.

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])

df
Out[15]:
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2

Solution 1:

Use groupby, create a new column of indexes, and then call duplicated:

df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')
df[df.duplicated(subset=['col1', 'col2'], keep='first')]

   col1  col2  index_original
2     1     2               0
4     1     2               0

Details

I group by the first two columns and then call transform + idxmin to get the first index of each group.

df.groupby(['col1', 'col2']).col1.transform('idxmin') 

0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64

duplicated gives me a boolean mask of values I want to keep:

df.duplicated(subset=['col1','col2'], keep='first')

0    False
1    False
2     True
3    False
4     True
dtype: bool

The rest is just boolean indexing.
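Putting it all together, here is a minimal, self-contained version of this solution (assuming only that pandas is installed):

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])

# First index of each (col1, col2) group, broadcast back to every row
df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')

# The mask marks every occurrence after the first
mask = df.duplicated(subset=['col1', 'col2'], keep='first')
print(df[mask])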

Solution 2:

Maybe you don't need this answer anymore, but there's another way to find duplicated rows:

df=pd.DataFrame(data=[[1,2],[3,4],[1,2],[1,4],[1,2]],columns=['col1','col2'])

Given the above DataFrame, you can use groupby without any trouble, but on larger DataFrames it will be fairly slow. Instead, you can use

DataFrame.duplicated(subset=None, keep='first')
    Return boolean Series denoting duplicate rows.

As the documentation says, it returns a boolean Series, in other words a boolean mask, so you can manipulate the DataFrame with that mask or just visualize the repeated rows:

>>> df[df.duplicated()]
   col1  col2
2     1     2
4     1     2
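Because the result is just a mask, you can also invert it with ~ to keep only the first occurrence of each row; a minimal sketch, equivalent in effect to df.drop_duplicates():

# Keep rows that are NOT flagged as duplicates
df[~df.duplicated()]

# Built-in shortcut that does the same thing
df.drop_duplicates()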

If you have a DataFrame with more columns and you want to find duplicated rows by specific columns, you can pass the function a list of columns to look at. For example, take the following DataFrame:

import pandas as pd

# List of tuples
students = [('jack', 34, 'Sydeny'),
            ('Riti', 30, 'Delhi'),
            ('Aadi', 16, 'New York'),
            ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Delhi'),
            ('Riti', 30, 'Mumbai'),
            ('Aadi', 40, 'London'),
            ('Sachin', 30, 'Delhi')
            ]

# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
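For reference, printing this DataFrame should give:

>>> df
     Name  Age      City
0    jack   34    Sydeny
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi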

If you want to find the rows duplicated across all columns and visualize them, just do:

>>> df[df.duplicated()]
   Name  Age   City
3  Riti   30  Delhi
4  Riti   30  Delhi

But if you want to look for duplicated rows taking into account only two columns, for example 'Name' and 'Age', just do:

>>> df[df.duplicated(['Name', 'Age'])]
   Name  Age    City
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai

Or just one column like 'Name':

>>> df[df.duplicated(['Name'])]
   Name  Age    City
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai
6  Aadi   40  London

The above examples return only the repeated rows, not the 'original' one, so as you can see, if there are three rows that are duplicates by a given criterion, only two are returned.
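If you want the 'original' rows included as well, you can pass keep=False, which flags every member of each duplicated group. For example, with the 'Name' and 'Age' criterion this should give:

>>> df[df.duplicated(['Name', 'Age'], keep=False)]
   Name  Age    City
1  Riti   30   Delhi
3  Riti   30   Delhi
4  Riti   30   Delhi
5  Riti   30  Mumbai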

Solution 3:

len(df[df.duplicated()])

This counts the number of duplicate rows in your dataset (excluding the first occurrence of each).
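An equivalent, slightly more idiomatic way to get the same count is to sum the boolean mask directly, since True counts as 1. For the question's DataFrame, both give 2:

>>> df.duplicated().sum()
2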
