Let's move on to the first type of ensemble method, the bagging algorithm.
We will cover the concept behind bagging and implement it in Python.
The Bagging Algorithm
- Start with the training dataset D.
- Draw k bootstrap samples from D.
- For each bootstrap sample i, build a model Mi.
- This gives a total of k models M1, M2, ..., Mk.
- For classification, take a majority vote over the k models to get the final output; for regression, take the average of their outputs.
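The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy dataset, the one-variable least-squares base learner `fit_line`, and all function names are assumptions made for the example.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) records from data, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_line(points):
    """Least-squares fit of y = a + b*x on (x, y) pairs; returns a predictor."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # If every drawn x is identical, fall back to a flat line at the mean of y.
    slope = 0.0 if sxx == 0 else sum(
        (x - x_bar) * (y - y_bar) for x, y in points) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def bagging_fit(data, k, fit=fit_line, seed=0):
    """Build k models M1..Mk, each on its own bootstrap sample of data."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(k)]


def bagging_predict_regression(models, x):
    """Regression output: the average of the k model predictions."""
    return mean(m(x) for m in models)


def majority_vote(labels):
    """Classification output: the most common label among the k models."""
    return Counter(labels).most_common(1)[0][0]


# Toy dataset D (hypothetical): y is roughly 2x + 1 with a little noise.
noise = random.Random(42)
D = [(x, 2 * x + 1 + noise.gauss(0, 0.5)) for x in range(20)]

models = bagging_fit(D, k=25)
prediction = bagging_predict_regression(models, 10)
```

Here `prediction` should land close to 21, the value of 2x + 1 at x = 10. For a classifier, the per-model labels would be passed to `majority_vote` instead of being averaged.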
Why Bagging Works
- We select records one at a time and return each selected record to the population, giving it a chance to be selected again (sampling with replacement).
- Note that the variance of the combined prediction is reduced if the samples are independent. That way we can reduce the errors that a single model would unavoidably make.
- In a given bootstrap sample, some observations may be selected multiple times and some may not be selected at all.
- It can be shown that a bootstrap sample contains, on average, only about 63% of the distinct observations in the original data; the remaining 37% or so are left out.
- So the data used for each of these models is not exactly the same. This makes the models less dependent on one another and helps their errors to be less correlated.
- Finally, the errors of the individual models partly cancel out, which usually gives us a more accurate ensemble model.
- Bagging is especially useful when there is a lot of variance in our data.
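Two of the claims above can be checked with a small simulation. The sketch below is illustrative only: the sample size, trial counts, and function names are assumptions. It also treats the individual errors as fully independent standard-normal draws, which is an idealisation; models trained on overlapping bootstrap samples are only partly independent, so the variance reduction in practice is smaller.

```python
import random
from statistics import mean, pvariance


def unique_fraction(n, rng):
    """Fraction of n records that appear at least once in one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


rng = random.Random(0)
n = 1000

# Average share of distinct records over 200 simulated bootstrap samples.
simulated = mean(unique_fraction(n, rng) for _ in range(200))

# Closed form: each record is missed with probability (1 - 1/n)^n,
# which tends to 1/e as n grows, so about 63.2% of records are present.
expected = 1 - (1 - 1 / n) ** n


def variance_of_average(k, trials, rng):
    """Empirical variance of the average of k independent standard-normal errors."""
    return pvariance(
        [mean(rng.gauss(0, 1) for _ in range(k)) for _ in range(trials)])


single = variance_of_average(1, 2000, rng)     # close to 1
averaged = variance_of_average(25, 2000, rng)  # close to 1/25
```

Averaging 25 independent errors divides the variance by roughly 25, which is the effect bagging tries to approximate.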
And now, let's put everything into practice.
Practice: Bagging Models
- Import the Boston house price data.
- Get some basic metadata about the data.
- Use 90% of the data for training and keep the remaining 10% as holdout data.
- Build a single linear regression model on the training data.
- On the holdout data, calculate the error (squared deviation) for the regression model.
- Build the regression model using the bagging technique. Build at least 25 models.
- On the holdout data, calculate the error (squared deviation) for the consolidated bagged regression model.
- What is the improvement of the bagged model compared with the single model?
# Importing Boston house price data
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp

house = pd.read_csv("datasets/Housing/Boston.csv")

# Columns of the dataset
house.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB
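# Splitting the dataset into training and holdout (test) datasets
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed from scikit-learn.
from sklearn.model_selection import train_test_split

house_train, house_test = train_test_split(house, train_size=0.9)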
# Building a linear regression with medv as the target variable on the training dataset
from sklearn.linear_model import LinearRegression

# The 13 predictor columns (everything except medv)
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
            'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']

lr = LinearRegression()
lr.fit(house_train[features], house_train['medv'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# Predicting on the holdout dataset with the fitted model
predict_test = lr.predict(house_test[features])
# Error of the single linear regression model on the holdout data
from sklearn.metrics import mean_squared_error

mean_squared_error(house_test['medv'], predict_test)
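# Build the regression model using the bagging technique
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = list(house.columns[:13])  # every column except the target, medv
X = house_train[features]
y = house_train['medv']

# The base model is passed positionally because its keyword name differs
# between scikit-learn versions (base_estimator in older releases,
# estimator in newer ones).
# n_estimators=25 follows the exercise, which asks for at least 25 models.
Bag = BaggingRegressor(LinearRegression(), n_estimators=25, max_samples=1.0,
                       max_features=1.0, bootstrap=True,
                       bootstrap_features=False, oob_score=False,
                       warm_start=False, n_jobs=1, random_state=None,
                       verbose=0)
Bag.fit(X, y)

# Predict on the holdout data with the bagged model
bagpredict_test = Bag.predict(house_test[features])
z = house_test[['medv']]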
# Error of the bagged model on the holdout data
mean_squared_error(z, bagpredict_test)
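We can see that the error of the bagged model is lower than that of the single model. Because the train/holdout split is random, the exact numbers will differ from run to run.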
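To answer the last question of the exercise, the two holdout errors can be turned into a relative improvement. The helper below is a small sketch; the function name is an assumption, and the two MSE values are made-up placeholders rather than results from the Boston data. Substitute the two values returned by `mean_squared_error` above.

```python
def relative_improvement(single_mse, bagged_mse):
    """Percentage reduction in error of the bagged model relative to the single model."""
    if single_mse <= 0:
        raise ValueError("single_mse must be positive")
    return 100.0 * (single_mse - bagged_mse) / single_mse


# Placeholder values for illustration only, not actual results.
example_single_mse = 20.0
example_bagged_mse = 18.0

improvement = relative_improvement(example_single_mse, example_bagged_mse)
print(f"Bagging reduced the holdout error by {improvement:.1f}%")
```

A negative value would mean the bagged model did worse than the single model on that particular split.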