104.2.7 Identifying and Removing Duplicate Values from a Dataset in Python

In this post we will see how to identify and remove duplicate values from a dataset. We will use the bill dataset from the Telecom Data Analysis folder.

Identifying & Removing Duplicates

In [90]:
import pandas as pd
bill_data=pd.read_csv("datasets\\Telecom Data Analysis\\Bill.csv")
bill_data.shape
Out[90]:
(9462, 7)
In [87]:
#Identify duplicate records in the data
dupes=bill_data.duplicated()
sum(dupes)
Out[87]:
10
In [88]:
#Removing Duplicates
bill_data_uniq=bill_data.drop_duplicates()
In [89]:
bill_data_uniq.shape
Out[89]:
(9452, 7)
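
To make the behaviour of these two calls easier to see, here is a minimal sketch on a small made-up table. The toy DataFrame and its column names are assumptions for illustration only and are not taken from Bill.csv.

```python
import pandas as pd

# Toy data (hypothetical, not the Bill.csv file): row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "cust_id": [101, 102, 101, 103],
    "amount": [250, 400, 250, 250],
})

# duplicated() returns a boolean Series. By default only the second and
# later copies of a row are flagged, so the first occurrence stays False.
dupes = toy.duplicated()
n_dupes = int(dupes.sum())

# drop_duplicates() returns a new DataFrame and leaves `toy` unchanged.
toy_uniq = toy.drop_duplicates()

print(dupes.tolist())                        # [False, False, True, False]
print(n_dupes, toy.shape, toy_uniq.shape)    # 1 (4, 2) (3, 2)
```

The same arithmetic holds for the bill data above: 9462 rows minus 10 flagged duplicates leaves 9452 unique rows.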

Identifying & Removing Duplicates based on Key

  • Sometimes we are not interested in duplicates at the level of the whole record.
  • We may want to treat records as duplicates whenever a key variable, such as cust_id, is repeated, even if the other columns differ.
  • In that case, instead of applying the duplicated function to the full data, we apply it to that one variable.
In [93]:
#Identify duplicates in bill data based on cust_id
dupe_id=bill_data.cust_id.duplicated()
In [95]:
#Removing duplicates based on a variable
bill_data_cust_uniq=bill_data.drop_duplicates(['cust_id'])

bill_data_cust_uniq.shape
Out[95]:
(9389, 7)
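
When de-duplicating on a key, it matters which of the repeated rows is kept. The sketch below uses the `subset` and `keep` parameters of `duplicated` and `drop_duplicates` on made-up data; the `bill_amount` column is a hypothetical name, not one taken from Bill.csv.

```python
import pandas as pd

# Toy data (hypothetical): cust_id 101 and 102 each appear twice,
# but with different bill amounts, so no full row is repeated.
toy = pd.DataFrame({
    "cust_id": [101, 102, 101, 103, 102],
    "bill_amount": [250, 400, 300, 250, 410],
})

# keep=False flags every row whose cust_id occurs more than once,
# which is handy for inspecting the repeats before dropping anything.
all_repeats = toy[toy.duplicated(subset=["cust_id"], keep=False)]

# The default keep="first" retains the first row seen for each cust_id;
# keep="last" retains the last one instead.
first_per_cust = toy.drop_duplicates(subset=["cust_id"])
last_per_cust = toy.drop_duplicates(subset=["cust_id"], keep="last")

print(len(all_repeats))                          # 4
print(first_per_cust["bill_amount"].tolist())    # [250, 400, 250]
print(last_per_cust["bill_amount"].tolist())     # [300, 250, 410]
```

Because the rows that share a key can differ in their other columns, the choice of `keep` changes which values survive, so it is worth sorting the data deliberately before dropping.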

Practice : Handling Duplicates in Python

  • DataSet: “./Telecom Data Analysis/Complaints.csv”
  • Identify overall duplicates in complaints data
  • Create a new dataset by removing overall duplicates in Complaints data
  • Identify duplicates in complaints data based on cust_id
  • Create a new dataset by removing duplicates based on cust_id in Complaints data
In [96]:
comp_data=pd.read_csv("datasets\\Telecom Data Analysis\\Complaints.csv")
comp_data.shape
Out[96]:
(6587, 8)
In [97]:
comp_data.columns.values
Out[97]:
array(['comp_id', 'month', 'incident', 'cust_id', 'sla status new',
       'incident type', 'type', 'severity'], dtype=object)
In [98]:
#Identify overall duplicates in complaints data

dupe=comp_data.duplicated()
sum(dupe) # gives total number of duplicates in data
Out[98]:
0
In [100]:
#Create a new dataset by removing overall duplicates in Complaints data
comp_data1=comp_data.drop_duplicates()
In [101]:
#Identify duplicates in complaints data based on cust_id
dupe_id=comp_data.cust_id.duplicated()
In [102]:
#Create a new dataset by removing duplicates based on cust_id in Complaints data
comp_data2=comp_data.drop_duplicates(['cust_id'])

comp_data2.shape
Out[102]:
(4856, 8)
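
The practice steps can also be wrapped in one small function. This is only a sketch: `duplicate_summary` is a hypothetical helper written for this post, not a pandas function, and the toy table stands in for Complaints.csv.

```python
import pandas as pd

def duplicate_summary(df, key):
    """Hypothetical helper: count duplicates overall and on one key column."""
    return {
        "rows": len(df),
        # Rows that repeat an earlier row in every column
        "overall_duplicates": int(df.duplicated().sum()),
        # Rows whose key value has already appeared earlier
        "key_duplicates": int(df[key].duplicated().sum()),
        # Rows left after keeping the first row per key value
        "rows_after_key_dedup": len(df.drop_duplicates(subset=[key])),
    }

# Toy data (hypothetical): every comp_id is unique, but cust_id 11 repeats.
toy_comp = pd.DataFrame({
    "comp_id": [1, 2, 3, 4],
    "cust_id": [11, 12, 11, 11],
})

summary = duplicate_summary(toy_comp, "cust_id")
print(summary)
```

As in the complaints data, a table can have no overall duplicates and still contain many repeated keys. The number of key duplicates plus the rows left after key de-duplication always equals the original row count.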
