104.3.5 Box Plots and Outlier Detection using Python

In this post we will discuss the basics of box plots and how they help us identify outliers.
We will continue with the same Python session from the series 104 blog posts, i.e. the same datasets.

Box plots and Outlier Detection

  • A box plot draws a box from the lower quartile (LQ) to the upper quartile (UQ), with the median marked inside it.
  • It portrays a five-number graphical summary of the data: Minimum, LQ, Median, UQ, Maximum.
  • It helps us get an idea of the data distribution.
  • It helps us identify outliers easily.
  • 25% of the population is below the first quartile.
  • 75% of the population is below the third quartile.
  • If the box is pushed to one side and some values lie far away from the box, that is a clear indication of outliers.
  • In this example the minimum is 5, the maximum is 120, and 75% of the values are less than 15.
  • Still, some records reach 120, which is a clear indication of outliers.
  • Sometimes the outliers are so extreme that the box appears to be a horizontal line in the box plot.
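
The five-number summary described above can be sketched with NumPy. The sample values below are made up to loosely resemble the example (minimum 5, maximum 120, most values small), and the helper name five_number_summary is hypothetical. The 1.5 × IQR fences are not stated in this post; they are a common convention and also the default whisker rule in matplotlib's boxplot, so treat this as one reasonable way to flag suspects rather than the only one.

```python
import numpy as np

# Made-up sample: mostly small values plus two very large ones.
values = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 90, 120], dtype=float)


def five_number_summary(x):
    """Return minimum, LQ, median, UQ and maximum of a 1-D array."""
    minimum, lq, median, uq, maximum = np.percentile(x, [0, 25, 50, 75, 100])
    return {
        "min": float(minimum),
        "LQ": float(lq),
        "median": float(median),
        "UQ": float(uq),
        "max": float(maximum),
    }


summary = five_number_summary(values)

# Assumed convention: points beyond 1.5 * IQR from the box are suspects.
iqr = summary["UQ"] - summary["LQ"]
lower_fence = summary["LQ"] - 1.5 * iqr
upper_fence = summary["UQ"] + 1.5 * iqr
suspects = values[(values < lower_fence) | (values > upper_fence)]

print(summary)
print("Suspected outliers:", suspects)
```

With this sample, only the two large values fall outside the fences, which matches the intuition that values far away from the box are the ones to investigate.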

Box plots and outlier detection in Python

In [30]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# bank is the DataFrame carried over from the earlier posts in this series
plt.boxplot(bank.balance)
Out[30]:
{'boxes': [<matplotlib.lines.Line2D at 0xcbcd400>],
 'caps': [<matplotlib.lines.Line2D at 0xcbdde10>,
  <matplotlib.lines.Line2D at 0xcbddf28>],
 'fliers': [<matplotlib.lines.Line2D at 0xccc4f98>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xccc4780>],
 'whiskers': [<matplotlib.lines.Line2D at 0xcbcdda0>,
  <matplotlib.lines.Line2D at 0xcbcdeb8>]}
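
The dictionary printed above holds the drawn line objects, so the points plotted as fliers can be read back from it. The sketch below assumes a small made-up array in place of bank.balance, and the names balance, parts and flier_values are illustrative only. It relies on the default vertical orientation, where the y data of each line carries the values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for bank.balance (not the real data).
balance = np.array(
    [-200, 0, 50, 150, 300, 450, 700, 1100, 1900, 3500, 60000], dtype=float
)

fig, ax = plt.subplots()
parts = ax.boxplot(balance)  # same kind of dictionary as shown in Out[30]

# Each entry is a list of Line2D objects; for a vertical box plot the
# y data holds the actual values.
flier_values = parts["fliers"][0].get_ydata()
median_value = parts["medians"][0].get_ydata()[0]

print("Points drawn as fliers:", flier_values)
print("Median line at:", median_value)
plt.close(fig)
```

Reading the fliers this way gives the exact values that matplotlib drew beyond the whiskers, which can then be inspected in the original data.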

Practice: Box plots and outlier detection

  • Dataset: “./Bank Marketing/bank_market.csv”
  • Draw a box plot for the balance variable.
  • Do you suspect any outliers in balance?
  • Get the relevant percentiles and see their distribution.
  • Draw a box plot for the age variable.
  • Do you suspect any outliers in age?
  • Get the relevant percentiles and see their distribution.
In [31]:
plt.boxplot(bank.balance)
Out[31]:
{'boxes': [<matplotlib.lines.Line2D at 0xcc78208>],
 'caps': [<matplotlib.lines.Line2D at 0xcc7fc18>,
  <matplotlib.lines.Line2D at 0xcc7fd30>],
 'fliers': [<matplotlib.lines.Line2D at 0xcc84da0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xcc84588>],
 'whiskers': [<matplotlib.lines.Line2D at 0xcc78ba8>,
  <matplotlib.lines.Line2D at 0xcc78cc0>]}

Outliers are present in the balance variable.

In [32]:
# Get the relevant percentiles and see their distribution
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[32]:
0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64
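
The deciles above show a large jump between the 90th percentile and the maximum. One way to see where that jump happens is to look at finer percentiles in the upper tail. The sketch below uses a made-up Series in place of bank['balance']; the names balance and tail and the chosen percentile levels are assumptions, not part of the original session.

```python
import pandas as pd

# Made-up stand-in for bank['balance']: 1000 ordinary values and one extreme one.
balance = pd.Series(list(range(1000)) + [100000], dtype=float, name="balance")

# Finer percentiles in the upper tail, where the deciles showed a jump.
tail = balance.quantile([0.9, 0.95, 0.99, 1.0])
print(tail)

# How far the maximum sits above the 99th percentile.
ratio = tail.iloc[-1] / tail.iloc[2]
print("max / 99th percentile:", round(ratio, 1))
```

If the 99th percentile is still close to the bulk of the data while the maximum is many times larger, the extreme values are confined to a very small share of the records.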
In [33]:
# Draw a box plot for age variable
plt.boxplot(bank.age)
Out[33]:
{'boxes': [<matplotlib.lines.Line2D at 0xcf54470>],
 'caps': [<matplotlib.lines.Line2D at 0xcf5be80>,
  <matplotlib.lines.Line2D at 0xcf5bf98>],
 'fliers': [<matplotlib.lines.Line2D at 0xcf65748>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xcf617f0>],
 'whiskers': [<matplotlib.lines.Line2D at 0xcf54e10>,
  <matplotlib.lines.Line2D at 0xcf54f28>]}

No outliers are present in the age variable.

In [34]:
# Get the relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[34]:
0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64
