104.3.4 Percentiles & Quartiles in Python

In the previous post we went through dispersion measures and implemented them in Python.
This post extends the previous ones, and we continue with the data imported in the earlier sessions.
Percentiles and quartiles are very useful when we need to identify outliers in our data. They also help us understand the basic distribution of the data.

Percentiles

  • A student took an exam along with 999 others (1,000 students in total).
    • He scored 68% marks. How well or badly did he perform in the exam?
    • What is his overall rank?
    • What would his rank be if there were only 100 students overall?
  • Suppose that, with 68 marks, he stood at the 90th position: 910 students scored less than 68, and only 89 students scored more than him.
  • He stands at the 91st percentile.
  • Instead of stating 68 marks, the 91st percentile gives a good idea of his performance.
  • Percentiles make the data easy to read.
  • pth percentile: p percent of the observations lie below it and (100 – p) percent lie above it.
  • Marks are 40 but the percentile is 80. What does this mean?
  • An 80th percentile in the CAT exam means
    • 20% of the candidates are above and 80% are below
  • Percentiles help us get an idea of outliers.
  • For example, the highest income value is 400,000 but the 95th percentile is only 20,000. That means 95% of the values are less than 20,000, so the values near 400,000 are clearly outliers.
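
The arithmetic behind the exam example can be sketched in a few lines. This is only an illustration: the helper name percentile_rank is an assumption made here, not something from the earlier sessions, and it uses the simple "percent of observations below" definition given above.

```python
def percentile_rank(n_below: int, n_total: int) -> float:
    """Return the percentage of observations that fall below a value."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_below / n_total


# Figures from the exam example: 910 students below, 89 above, plus the student himself.
n_below, n_above = 910, 89
n_total = n_below + n_above + 1  # 1000 students in all

student_percentile = percentile_rank(n_below, n_total)
print(student_percentile)  # 91.0
```

The same 68 marks would mean something very different in a class where most students scored above 68, which is why the percentile is more informative than the raw score.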

Quartiles

  • Percentiles divide the whole population into 100 groups, whereas quartiles divide the population into 4 groups
  • p = 25: First Quartile or Lower Quartile (LQ)
  • p = 50: Second Quartile or Median
  • p = 75: Third Quartile or Upper Quartile (UQ)
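
As a quick check that quartiles are just the 25th, 50th and 75th percentiles, here is a minimal sketch on a toy series. The values are made up for illustration and are not from the Income data. Both functions use linear interpolation by default.

```python
import numpy as np
import pandas as pd

# Toy values chosen only for illustration; they are not from the Income data.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# quantile() takes fractions between 0 and 1.
quartiles = values.quantile([0.25, 0.50, 0.75])
print(quartiles)

# np.percentile() takes the same cut points on a 0-100 scale.
first_quartile = np.percentile(values, 25)
print(first_quartile)

# The second quartile is the median.
median = values.median()
print(median)
```

Note the difference in scale: pandas quantile() expects 0.25, while np.percentile() expects 25 for the same cut point.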

Percentiles & Quartiles in Python

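  • By default, the summary from describe() gives the three quartiles (25%, 50%, 75%) along with the minimum and maximum; quantile() returns any percentiles we ask for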
In [22]:
Income_Data['capital-gain'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[22]:
0.0        0.0
0.1        0.0
0.2        0.0
0.3        0.0
0.4        0.0
0.5        0.0
0.6        0.0
0.7        0.0
0.8        0.0
0.9        0.0
1.0    99999.0
Name: capital-gain, dtype: float64
In [23]:
Income_Data['capital-loss'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[23]:
0.0       0.0
0.1       0.0
0.2       0.0
0.3       0.0
0.4       0.0
0.5       0.0
0.6       0.0
0.7       0.0
0.8       0.0
0.9       0.0
1.0    4356.0
Name: capital-loss, dtype: float64
In [24]:
Income_Data['hours-per-week'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[24]:
0.0     1.0
0.1    24.0
0.2    35.0
0.3    40.0
0.4    40.0
0.5    40.0
0.6    40.0
0.7    40.0
0.8    48.0
0.9    55.0
1.0    99.0
Name: hours-per-week, dtype: float64

Looks like some people are working up to 99 hours per week, although 90% of them work 55 hours or less.

Practice : Percentiles & Quartiles in Python

  • Dataset: “./Bank Marketing/bank_market.csv”
  • Get the summary of the balance variable
  • Do you suspect any outliers in balance?
  • Get relevant percentiles and see their distribution.
  • Are there really some outliers present?
  • Get the summary of the age variable
  • Do you suspect any outliers in age?
  • Get relevant percentiles and see their distribution.
  • Are there really some outliers present?
In [25]:
import pandas as pd  # needed only if pandas was not imported in an earlier session
bank = pd.read_csv("datasets\\Bank Marketing\\bank_market.csv", encoding="ISO-8859-1")
bank.shape
Out[25]:
(45211, 18)
In [26]:
#Get the summary of the balance variable
#we can find the summary of the balance variable by using .describe()
summary_bala=bank["balance"].describe()
summary_bala
Out[26]:
count     45211.000000
mean       1362.272058
std        3044.765829
min       -8019.000000
25%          72.000000
50%         448.000000
75%        1428.000000
max      102127.000000
Name: balance, dtype: float64

Yes, there are outliers: the mean (1362) is very different from the median (448), and the maximum (102127) is far above the 75th percentile (1428).
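
Why does a gap between mean and median hint at outliers? A minimal sketch with made-up numbers (not taken from bank_market.csv) shows how a single extreme value drags the mean while leaving the median almost unchanged.

```python
import pandas as pd

# Made-up balances for illustration only.
typical = pd.Series([100, 200, 300, 400, 500])
with_outlier = pd.Series([100, 200, 300, 400, 500, 100000])

# Without the extreme value, mean and median agree.
print(typical.mean(), typical.median())

# With one extreme value, the mean jumps but the median barely moves.
print(with_outlier.mean(), with_outlier.median())
```

A large gap is only a hint, since a skewed distribution without extreme values can also separate the two, so it is worth confirming with percentiles.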

In [27]:
#Get relevant percentiles and see their distribution.
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[27]:
0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64
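
Deciles show a jump between the 90th percentile and the maximum, but not where in the top 10% it happens. One possible next step, sketched here on made-up values rather than the real balance column, is to request finer percentiles in the upper tail. The 95th-percentile cutoff used below is an assumption for illustration, not a fixed rule.

```python
import pandas as pd

# Made-up values for illustration only: 98 ordinary balances and 2 extreme ones.
toy_balance = pd.Series(list(range(1, 99)) + [5000, 100000])

# Deciles hide what happens between the 90th percentile and the maximum,
# so ask for finer cut points in the upper tail.
upper_tail = toy_balance.quantile([0.90, 0.95, 0.99, 1.00])
print(upper_tail)

# Hypothetical screening rule: list the values above the 95th percentile for review.
cutoff = toy_balance.quantile(0.95)
suspects = toy_balance[toy_balance > cutoff]
print(suspects.tolist())
```

The values above the cutoff are candidates to inspect, not confirmed outliers: in this toy series the list contains a few ordinary values next to the two extreme ones. On the real data the same call would be made on bank['balance'].
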
In [28]:
#Get the summary of the age variable
summary_age=bank['age'].describe()
summary_age
Out[28]:
count    45211.000000
mean        40.936210
std         10.618762
min         18.000000
25%         33.000000
50%         39.000000
75%         48.000000
max         95.000000
Name: age, dtype: float64

Looks like there are no outliers: the mean (40.9) and the median (39) are close to each other.

In [29]:
#Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[29]:
0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64
