drop_duplicates() : delete duplicate rows

Using DataFrame

Here is a sample DataFrame.
import pandas as pd 
my_dict={
  'id':[1,2,3,4,5,4,2],
  'name':['John','Max','Arnold','Krish','John','Krish','Max'],
  'class1':['Four','Three','Three','Four','Four','Four','Three'],
  'mark':[75,85,55,60,60,60,85],
  'gender':['female','male','male','female','female','female','male']
}
df = pd.DataFrame(data=my_dict)
print(df)
Output ( the last two rows are duplicates: row 6 is a duplicate of row 1 and row 5 is a duplicate of row 3 )
   id    name class1  mark  gender
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female
4   5    John   Four    60  female
5   4   Krish   Four    60  female
6   2     Max  Three    85    male

drop_duplicates: Deleting rows based on duplicate data


Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
keep : Optional
'first' : default, delete all duplicate rows except the first occurrence
'last' : delete all duplicate rows except the last occurrence
False : delete all duplicate rows ( this is the boolean False, not the string 'False' )
Series : deletes duplicate values.
DataFrame : deletes duplicate rows ( the comparison can be limited to some columns ).
Series.drop_duplicates()
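
The same keep options can be tried on a Series. This is a minimal sketch; the Series names is a hypothetical one built from the name column of the sample data, and it is not part of the original examples.

```python
import pandas as pd

# Hypothetical Series holding the values of the 'name' column
names = pd.Series(['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'])

first_kept = names.drop_duplicates()             # keep='first' is the default
last_kept = names.drop_duplicates(keep='last')   # keep the last occurrence
none_kept = names.drop_duplicates(keep=False)    # drop every repeated value

print(first_kept.tolist())  # ['John', 'Max', 'Arnold', 'Krish']
print(last_kept.tolist())   # ['Arnold', 'John', 'Krish', 'Max']
print(none_kept.tolist())   # ['Arnold']
```

Only Arnold appears once, so it is the only value left when keep=False.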

Delete duplicate rows after first occurrence

Remove all duplicate rows but keep the first occurrence. Since 'first' is the default value of keep, writing keep='first' is optional.
import pandas as pd
my_dict={
  'id':[1,2,3,4,5,4,2],
  'name':['John','Max','Arnold','Krish','John','Krish','Max'],
  'class1':['Four','Three','Three','Four','Four','Four','Three'],
  'mark':[75,85,55,60,60,60,85],
  'gender':['female','male','male','female','female','female','male']
}
df = pd.DataFrame(data=my_dict)
df=df.drop_duplicates(keep='first')
print(df)
Output : Note that we have assigned the output back to df, because by default inplace=False ( explained below ) and the original DataFrame is not changed by the method itself.
   id    name class1  mark  gender
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female
4   5    John   Four    60  female
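
To see which rows are going to be removed before deleting them, duplicated() can be used; it returns a boolean Series with the same keep option. This is a sketch on the sample data, and the names mask and deduped are chosen here only for illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)

# True marks a row that repeats an earlier row (keep='first')
mask = df.duplicated(keep='first')
print(mask.tolist())  # [False, False, False, False, False, True, True]

# Selecting the rows that are not marked gives the same result
# as drop_duplicates(keep='first')
deduped = df[~mask]
print(deduped.equals(df.drop_duplicates(keep='first')))  # True
```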

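Delete duplicate rows but keep last occurrence

keep='last' ( create the sample DataFrame df again first, as the previous example has already removed the duplicates from df )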
df=df.drop_duplicates(keep='last')
print(df)
Output
   id    name class1  mark  gender
0   1    John   Four    75  female
2   3  Arnold  Three    55    male
4   5    John   Four    60  female
5   4   Krish   Four    60  female
6   2     Max  Three    85    male

Delete duplicate rows in all places

keep=False ( use the original sample DataFrame df with all seven rows )
df=df.drop_duplicates(keep=False)
print(df)
Output ( every row that has a duplicate is deleted, including its first occurrence )
   id    name class1  mark  gender
0   1    John   Four    75  female
2   3  Arnold  Three    55    male
4   5    John   Four    60  female
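
The three values of keep can be compared side by side by looking at the index labels that remain. This is only a sketch on the sample data; the dictionary kept is a helper name introduced here.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)

# Index labels of the rows that remain for each value of keep
kept = {}
for option in ('first', 'last', False):
    kept[option] = df.drop_duplicates(keep=option).index.tolist()

print(kept['first'])  # [0, 1, 2, 3, 4]
print(kept['last'])   # [0, 2, 4, 5, 6]
print(kept[False])    # [0, 2, 4]
```

Here df is never reassigned, so each call starts from the full seven rows.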

inplace=True

By default inplace=False, so our main DataFrame df is not altered when we use drop_duplicates(); the method returns a new DataFrame. That is why in the above codes we assigned the output of drop_duplicates() back to df. By using inplace=True we modify our main DataFrame df directly, and the method returns None.
df.drop_duplicates(inplace=True)
print(df)
Output
   id    name class1  mark  gender
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female
4   5    John   Four    60  female
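
A common mistake is to assign the result of a call with inplace=True. The sketch below shows the difference between the two styles; df_copy and result are names used here only for illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)

# Work on a copy so the original df stays available for comparison
df_copy = df.copy()

# With inplace=True the DataFrame is changed and nothing is returned
result = df_copy.drop_duplicates(inplace=True)

print(result)        # None
print(len(df))       # 7, the original is untouched
print(len(df_copy))  # 5, the duplicates are removed from the copy
```

So writing df = df.drop_duplicates(inplace=True) would replace df with None.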

subset

Only consider certain columns for identifying duplicates. By default all of the columns are used.
import pandas as pd 
my_dict={
  'id':[1,2,3,4,5,4,2],
  'name':['John','Max','Arnold','Krish','John','Krish','Max'],
  'class1':['Four','Three','Three','Four','Four','Four','Three'],
  'mark':[75,85,55,60,60,60,85],
  'gender':['female','male','male','female','female','female','male']
}
df = pd.DataFrame(data=my_dict)
df.drop_duplicates(subset=['class1','mark','gender'],inplace=True)
print(df)
Output
   id    name class1  mark  gender
0   1    John   Four    75  female
1   2     Max  Three    85    male
2   3  Arnold  Three    55    male
3   4   Krish   Four    60  female

ignore_index

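If True, the rows of the result are labeled 0, 1, …, n - 1. The default value is False, which keeps the original index labels. This option is available from pandas version 1.0. Here df is the original sample DataFrame.
df.drop_duplicates(keep='last',inplace=True,ignore_index=True)
print(df)
Output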
   id    name class1  mark  gender
0   1    John   Four    75  female
1   3  Arnold  Three    55    male
2   5    John   Four    60  female
3   4   Krish   Four    60  female
4   2     Max  Three    85    male
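
The same relabeling can be done with reset_index(drop=True) after removing the duplicates. This is a sketch on the sample data to compare the two ways; the names a and b are used here only for illustration.

```python
import pandas as pd

my_dict = {
    'id': [1, 2, 3, 4, 5, 4, 2],
    'name': ['John', 'Max', 'Arnold', 'Krish', 'John', 'Krish', 'Max'],
    'class1': ['Four', 'Three', 'Three', 'Four', 'Four', 'Four', 'Three'],
    'mark': [75, 85, 55, 60, 60, 60, 85],
    'gender': ['female', 'male', 'male', 'female', 'female', 'female', 'male']
}
df = pd.DataFrame(data=my_dict)

# Relabel the rows while dropping duplicates
a = df.drop_duplicates(keep='last', ignore_index=True)

# Drop duplicates first, then discard the old index labels
b = df.drop_duplicates(keep='last').reset_index(drop=True)

print(a.index.tolist())  # [0, 1, 2, 3, 4]
print(a.equals(b))       # True
```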