
204.7.4 The Bagging Algorithm

Let’s move on to the first type of ensemble method: the Bagging algorithm.

We will cover the concept behind bagging and then implement it in Python.

The Bagging Algorithm

  • Start with the training dataset D.
  • Draw k bootstrap sample sets from dataset D.
  • For each bootstrap sample i
    • Build a classifier model Mi
  • We end up with a total of k classifiers M1, M2, ..., Mk
  • For the final output, take a vote across the k classifiers (for classification) or average their predictions (for regression); a minimal code sketch of this procedure follows below.
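The steps above map almost directly onto code. Below is a minimal illustrative sketch of the procedure for regression (the function bagging_predict and its arguments are made up for this example and assume X, y and X_new are NumPy arrays; the scikit-learn implementation used later in this post does the equivalent internally):

#Minimal sketch of the bagging procedure (illustrative only, not a library function)
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

def bagging_predict(X, y, X_new, base_model=LinearRegression(), k=25, seed=0):
    rng = np.random.RandomState(seed)
    n = len(y)
    predictions = []
    for i in range(k):
        #Draw one bootstrap sample: n records drawn with replacement from the training data
        idx = rng.randint(0, n, size=n)
        model_i = clone(base_model)          #model Mi built on bootstrap sample i
        model_i.fit(X[idx], y[idx])
        predictions.append(model_i.predict(X_new))
    #Regression: average the k predictions (for classification, take a majority vote instead)
    return np.mean(predictions, axis=0)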

 

 

Why Bagging Works

  • We select records one at a time, returning each selected record to the population so that it has a chance to be selected again (sampling with replacement).
  • If the samples are independent, the variance in the consolidated prediction is reduced. That way we can reduce the unavoidable errors made by a single model.
  • In a given bootstrap sample, some observations may be selected multiple times and some observations may not be selected at all.
  • It can be shown that a bootstrap sample contains, on average, only about 63% of the distinct observations in the overall population; the remaining 37% are left out (see the short simulation after this list).
  • So the data used by each of these models is not exactly the same. This makes the learning models closer to independent and helps the predictors make uncorrelated errors.
  • Finally, the errors from the individual models partly cancel out, giving us a better ensemble model with higher accuracy.
  • Bagging is most useful when the individual models have high variance.
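The 63%/37% split mentioned above follows from a short calculation: each record escapes a single draw with probability (1 - 1/n), so it misses all n draws with probability (1 - 1/n)^n, which approaches 1/e, roughly 0.37, for large n. A quick simulation (added here for illustration, not part of the original notebook) confirms it:

#Quick simulation of bootstrap coverage (illustrative only)
import numpy as np

rng = np.random.RandomState(42)
n = 10000                                    #population size
sample = rng.randint(0, n, size=n)           #one bootstrap sample drawn with replacement
coverage = len(np.unique(sample)) / float(n) #fraction of distinct records that were selected
print(coverage)                              #roughly 0.63: ~63% selected, ~37% left out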

And now, let’s put everything into practice.

Practice : Bagging Models

  • Import the Boston house price data.
  • Get some basic meta details of the data.
  • Take 90% of the data for training and keep the remaining 10% as hold-out data.
  • Build a single linear regression model on the training data.
  • On the hold-out data, calculate the error (mean squared error) for the regression model.
  • Build the regression model using the bagging technique. Build at least 25 models.
  • On the hold-out data, calculate the error (mean squared error) for the consolidated bagged regression model.
  • What is the improvement of the bagged model when compared with the single model?
In [1]:
#Importing Boston house price data
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
house=pd.read_csv("datasets/Housing/Boston.csv")
In [2]:
house.head(5)
Out[2]:
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
In [3]:
###columns of the dataset##
house.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB
In [4]:
###Splitting the dataset into training and test datasets
from sklearn.model_selection import train_test_split
house_train,house_test=train_test_split(house,train_size=0.9)
In [5]:
###Building a linear regression model with medv as the target variable on the training dataset###
from sklearn.linear_model import LinearRegression
features = list(house.columns[:13])   #all columns except the target medv
lr = LinearRegression()
lr.fit(house_train[features],house_train['medv'])
Out[5]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [6]:
###predicting on the test dataset
predict_test=lr.predict(house_test[features])
In [7]:
from sklearn.metrics import mean_squared_error

###error in linear regression model ###
mean_squared_error(house_test['medv'],predict_test, sample_weight=None, multioutput='uniform_average')
Out[7]:
22.567044241536465
In [8]:
#Build the regression model using the bagging technique.
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Bag=BaggingRegressor(base_estimator=LinearRegression(), n_estimators=10, max_samples=1.0,
                     max_features=1.0, bootstrap=True, bootstrap_features=False,
                     oob_score=False, warm_start=False, n_jobs=1, random_state=None, verbose=0)
features = list(house.columns[:13])
X = house_train[features]
y = house_train['medv']
Bag.fit(X,y)
bagpredict_test=Bag.predict(house_test[features])
z=house_test['medv']
In [9]:
### to estimate the accuracy of the Bagging model ###
mean_squared_error(z, bagpredict_test, sample_weight=None, multioutput='uniform_average')
Out[9]:
22.747229558702969

In this run the error of the bagged model (about 22.75) is very close to, and in fact slightly higher than, the error of the single linear regression (about 22.57). This is expected: linear regression is a stable, low-variance model, so there is little variance for bagging to remove. Bagging pays off far more when the base models themselves have high variance, as sketched below.
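To see bagging make a clearer difference, a natural follow-up (not part of the original notebook; it assumes house_train, house_test and features from the cells above are still in memory) is to repeat the comparison with a decision tree as the base estimator, since a single tree has much higher variance than a linear regression:

#Illustrative follow-up: bagging a high-variance base learner (a decision tree)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

#Single decision tree on the same split
tree = DecisionTreeRegressor(random_state=0)
tree.fit(house_train[features], house_train['medv'])
print(mean_squared_error(house_test['medv'], tree.predict(house_test[features])))

#25 bagged decision trees, as asked for in the practice task
bag_tree = BaggingRegressor(base_estimator=DecisionTreeRegressor(), n_estimators=25, random_state=0)
bag_tree.fit(house_train[features], house_train['medv'])
print(mean_squared_error(house_test['medv'], bag_tree.predict(house_test[features])))

The bagged trees will typically show a noticeably lower error than the single tree, which is exactly the variance-reduction effect bagging is designed for.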
