Home / R / 103.2.6.a Handling Duplicates- An Example

103.2.6.a Handling Duplicates- An Example

The issue of duplicate values is very frequent in many transactional datasets. Here we have an example of telecom bill and complaints data. A single user may register a complaint twice a day. This repeated data needs to be found and removed from the space.

The following data on Complaints would help in understanding how this is done.

  1. DataSet: “./Telecom Data Analysis/Complaints.csv”
  2. Identify overall duplicates in complaints data
  3. Create a new dataset by removing overall duplicates in Complaints data
  4. Identify duplicates in complaints data based on cust_id
  5. Create a new dataset by removing duplicates based on cust_id in Complaints data

Solutions

  1. DataSet: “./Telecom Data Analysis/Complaints.csv”
>Complaint <- read.csv("C:\\Users\\venk\\Google Drive\\Training\\Datasets\\Telecom Data Analysis\\Complaints.csv")
>head(Complaint)
##     comp_id  month  incident    cust_id sla.status.new
 ## 1 318691791 12-Jul 7/22/2012 9243584301   ClosedIN SLA
 ## 2 318691795 12-Jul 7/22/2012 9243584297   ClosedIN SLA
 ## 3 318691807 12-Jul 7/22/2012 9243584279   ClosedIN SLA
 ## 4 318691809 12-Jul 7/22/2012 9243584277   ClosedIN SLA
 ## 5 317997895 12-Jul 7/17/2012 9282441943   ClosedIN SLA
 ## 6 318030735 12-Jul 7/17/2012 7204112312   ClosedIN SLA
 ##               incident.type       type severity
 ## 1    Customer Communication VAS-OTHERS     High
 ## 2    Customer Communication VAS-OTHERS     High
 ## 3    Customer Communication VAS-OTHERS     High
 ## 4    Customer Communication VAS-OTHERS     High
 ## 5         Barring/Unbarring VAS-OTHERS     High
 ## 6 Name/Address Modification       1515      Low
  1. Identify overall duplicates in complaints data
>Dupes_in_Complaint <- duplicated(Complaint)
>summary(Dupes_in_Complaint)
>Complaint_dupes <- Complaint[Dupes_in_Complaint,]
>Complaint_dupes

 

##    Mode   FALSE    NA's 
## logical    6587       0
## [1] comp_id        month          incident       cust_id       
## [5] sla.status.new incident.type  type           severity      
## <0 rows> (or 0-length row.names)

There are no record level duplicates

  1. Create a new dataset by removing overall duplicates in Complaints data

There are no record level duplicates, if there were any we can remove them using below code

>dim(Complaint)
>Complaint_unique <- Complaint[!Dupes_in_Complaint,]
>dim(Complaint_unique)

 

## [1] 6587    8
## [1] 6587    8

OR

>dim(Complaint)
>Complaint_unique <- unique(Complaint)
>dim(Complaint_unique)
## [1] 6587    8
## [1] 6587    8
  1. Identify duplicates in complaints data based on cust_id
>Dupes_in_Complaint_cust_id<-duplicated(Complaint$cust_id) 
>length(Dupes_in_Complaint_cust_id[Dupes_in_Complaint_cust_id==TRUE])
## [1] 1731

As we can see, there are 1731 duplicates if searched by using a key “cust_id”

  1. Create a new dataset by removing duplicates based on cust_id in Complaints data
>Complaint_unique<-Complaint[!Dupes_in_Complaint_cust_id,]
>nrow(Complaint)
>nrow(Complaint_unique)

OUTPUT:

## [1] 6587
## [1] 4856

From the above we can see that the duplicated rows are eliminated, as the number of rows in the complete data set is different from the number of rows in unique data set.

 

About admin

Check Also

103.2.3.b Sub Setting-An Example

Here we have to import the automobile dataset and then perform various operations in it. …

Leave a Reply

Your email address will not be published. Required fields are marked *