Once we have some data, we can compute descriptive statistics on it. Basic descriptive statistics give an idea of the variables and their distributions, provide an overall picture of the dataset, and help us create a report on the data. There are two types of basic descriptive statistics:
Central tendency and dispersion.
Measures of central tendency are the mean, median and mode, whereas measures of dispersion are the range, variance and standard deviation.
Central tendencies: mean, median
The mean is the arithmetic mean or the average, i.e., the sum of the values divided by the count of values. It helps us understand and evaluate the data. The mean is a good summary of a variable's typical value, but it is not recommended when there are outliers in the data. Outliers are the few data elements in a dataset that are very different from the rest of the data elements.
For example, let us consider this data:
Here 90% of the values are below 2, but when we calculate the mean, we get 2. This is because there is one value (9) that is very different from the rest of the values. This is called an outlier. In such cases, where there are outliers, we need a measure that gives a more representative middle value, and the median can be used instead. To calculate the median, sort the given data in either ascending or descending order and take the middle value (with an even number of values, take the average of the two middle values). That middle value is the median, and it is a more representative average in such cases.
For example, consider the same data:
Here the middle value is 1.4, which becomes the median.
Therefore, even if there are outliers in the data, the median still gives a representative middle value, because sorting pushes the outliers to the extreme ends.
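The comparison above can be sketched in R. The values below are illustrative assumptions, not the original example data; they were chosen only to match the description (nine of ten values below 2, one outlier equal to 9). With ten values, median() averages the 5th and 6th sorted values.

```r
# Illustrative values only (an assumption, not the article's original data):
# nine values below 2 and one outlier, 9
x <- c(1.4, 0.6, 1.5, 9, 1.0, 1.6, 0.8, 1.4, 1.2, 1.5)

# Mean by definition: sum of the values divided by their count
manual_mean <- sum(x) / length(x)

# Built-in mean; pulled upward by the single outlier
mean_x <- mean(x)

# Sorting in ascending order moves the outlier to the end
sorted_x <- sort(x)

# Median: the middle of the sorted values
# (here the average of the 5th and 6th values, because the count is even)
median_x <- median(x)

print(sorted_x)
print(mean_x)
print(median_x)
```

The mean lands near 2 even though almost every value is well below it, while the median stays at 1.4, in the middle of the bulk of the data.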
Let us see how to calculate mean and median in R. We consider the Income data.
Income<-read.csv("C:\\Amrita\\Datavedi\\Census Income Data\\Income_data.csv")
From this dataset we calculate the mean and median of the variable “capital.gain”.
mean(Income$capital.gain)
## [1] 1077.649
median(Income$capital.gain)
## [1] 0
We get a mean of 1077.649 and a median of 0. Such a large difference between the two suggests that there are outliers in the data. If there are no outliers, the mean and median usually do not differ much. So when there are outliers, the median is the better measure to report.
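A median of exactly 0 alongside a large mean is typical of a variable where most values are zero and a few are very large. The following is a minimal sketch of that pattern; the vector `gain` is a hypothetical stand-in and is not taken from Income_data.csv.

```r
# Hypothetical gains (an assumption for illustration,
# not values from Income_data.csv)
gain <- c(0, 0, 0, 0, 0, 0, 0, 0, 5000, 15000)

mean_gain <- mean(gain)        # driven by the two large values
median_gain <- median(gain)    # middle of the sorted values
share_zero <- mean(gain == 0)  # proportion of values that are exactly 0

# Gap between the two measures of central tendency
gap <- mean_gain - median_gain

print(mean_gain)
print(median_gain)
print(share_zero)
print(gap)
```

Because more than half of the values are zero, the median is 0, while the few large values lift the mean to 2000.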
Lab: Mean & Median
Now let us consider the dataset, Online Retail Sales Data.
Online_Retail<-read.csv("C:\\Amrita\\Datavedi\\Online Retail Sales Data\\Online Retail.csv")
Calculate the mean and median of the variable “UnitPrice” and let us see if there are any outliers in the data.
mean(Online_Retail$UnitPrice)
## [1] 4.611114
median(Online_Retail$UnitPrice)
## [1] 2.08
Here the mean is 4.611114 and the median is 2.08, so the mean and median are fairly close in absolute terms. However, we still cannot conclude that there are no outliers: if outliers on either side of the median balance each other, the mean and median can also be close. Now find the mean and median of the variable “Quantity” as well.
mean(Online_Retail$Quantity)
## [1] 9.55225
median(Online_Retail$Quantity)
## [1] 3
Here the mean is 9.55225 and the median is 3. Since there is some difference between the mean and the median, there may be outliers in the data, but we cannot be sure. Outliers can be detected using a box plot, which will be covered in later sessions.
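The earlier caveat, that outliers on both sides of the median can offset each other, can be illustrated with a small sketch. The values are hypothetical and are not drawn from the Online Retail data.

```r
# Hypothetical values: one extreme low value and one extreme high value
# around an otherwise tight group (not from the Online Retail data)
x <- c(-100, 1, 2, 3, 4, 5, 106)

mean_x <- mean(x)      # the two extremes cancel each other in the sum
median_x <- median(x)  # middle (4th) of the 7 sorted values

print(mean_x)
print(median_x)
```

Both measures come out as 3 even though two of the seven values are extreme, so a close mean and median do not rule out outliers.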