Find Duplicate Rows In A Pandas Dataframe
Solution 1:
Use groupby, create a new column of indexes, and then call duplicated:
# First index of each (col1, col2) group, broadcast back to every row
df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')
# Keep only the rows flagged as duplicates of an earlier row
df[df.duplicated(subset=['col1', 'col2'], keep='first')]
col1 col2 index_original
2 1 2 0
4 1 2 0
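A minimal, self-contained sketch of the whole approach (it assumes the same sample frame that Solution 2 constructs below):
import pandas as pd

# The sample data used throughout this post
df = pd.DataFrame([[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])

# First index of each (col1, col2) group, broadcast back to every row
df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')

# Show the rows flagged as duplicates of an earlier row (rows 2 and 4)
print(df[df.duplicated(subset=['col1', 'col2'], keep='first')])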
Details
I groupby the first two columns and then call transform + idxmin to get the first index of each group.
df.groupby(['col1', 'col2']).col1.transform('idxmin')
0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
duplicated gives me a boolean mask of the rows I want to keep:
df.duplicated(subset=['col1','col2'], keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool
The rest is just boolean indexing.
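Spelled out, that final step looks like this (a small sketch; the name mask is just illustrative):
mask = df.duplicated(subset=['col1', 'col2'], keep='first')
duplicate_rows = df[mask]    # rows 2 and 4, each pointing back to index 0
first_rows = df[~mask]       # inverting the mask keeps each first occurrence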
Solution 2:
Maybe you don't need this answer anymore, but there's another way to find duplicated rows:
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])
Given the above DataFrame you can use groupby without drama, but on larger DataFrames it will be rather slow; instead you can use
DataFrame.duplicated(subset=None, keep='first') Return boolean Series denoting duplicate rows.
As the documentation says, it returns a boolean Series, in other words a boolean mask, so you can manipulate the DataFrame with that mask or just visualize the repeated rows:
>>> df[df.duplicated()]
col1 col2
2 1 2
4 1 2
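As a small illustration of manipulating the DataFrame with that mask rather than just visualizing it, inverting the mask keeps only the first occurrence of each row, which is equivalent to the built-in drop_duplicates:
# Inverting the mask keeps the first occurrence of each row
unique_rows = df[~df.duplicated()]
# ...which the built-in shortcut does in one call
unique_rows = df.drop_duplicates()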
If you have a DataFrame with more columns and you want to find duplicated rows by specific columns, you can pass the function a list of columns to look in, for example for the following DataFrame:
# List of tuples
students = [('jack', 34, 'Sydeny'),
('Riti', 30, 'Delhi'),
('Aadi', 16, 'New York'),
('Riti', 30, 'Delhi'),
('Riti', 30, 'Delhi'),
('Riti', 30, 'Mumbai'),
('Aadi', 40, 'London'),
('Sachin', 30, 'Delhi')
]
# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
If you want to find the duplicated rows by all columns and visualize them, just do:
>>> df[df.duplicated()]
Name Age City
3 Riti 30 Delhi
4 Riti 30 Delhi
But if you want to look for duplicated rows taking into account only two columns, for example 'Name' and 'Age', just do:
>>> df[df.duplicated(['Name', 'Age'])]
Name Age City
3 Riti 30 Delhi
4 Riti 30 Delhi
5 Riti 30 Mumbai
Or just one column like 'Name':
>>> df[df.duplicated(['Name'])]
Name Age City
3 Riti 30 Delhi
4 Riti 30 Delhi
5 Riti 30 Mumbai
6 Aadi 40 London
The above examples return only the repeated rows, not the 'original' one, so if three rows match by a given criterion, just two of them will be returned.
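If you want the 'original' rows included as well, duplicated accepts keep=False, which flags every member of a duplicated group; for example, on the same frame:
>>> df[df.duplicated(['Name', 'Age'], keep=False)]
Name Age City
1 Riti 30 Delhi
3 Riti 30 Delhi
4 Riti 30 Delhi
5 Riti 30 Mumbai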
Solution 3:
len(df[df.duplicated()])
With this method you can count the number of duplicate rows in your dataset.
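Since a boolean Series sums to its count of True values, an equivalent and slightly more direct variant avoids building the filtered frame at all:
# Summing the boolean mask counts the True values directly
df.duplicated().sum()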