Let’s move on to the first type of ensemble method: bagging (short for bootstrap aggregating).

We will cover the concept behind bagging and then implement it in Python.

### The Bagging Algorithm

- Start with the training dataset D.
- Draw k bootstrap samples from D (sampling with replacement).
- For each bootstrap sample i:
  - Build a model Mi.
- This gives a total of k models: M1, M2, ..., Mk.
- For classification, take a majority vote over the outputs of the k models; for regression, take the average of their predictions.
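The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the helper names (`bootstrap_sample`, `fit_line`, `bagging_fit`, `bagging_predict`, `majority_vote`), the one-feature least-squares base model, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) records from data, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_line(points):
    """Least-squares line y = slope * x + intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # every drawn x is identical: fall back to a constant model
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bagging_fit(data, k, rng):
    """Fit one model per bootstrap sample, giving M1, ..., Mk."""
    return [fit_line(bootstrap_sample(data, rng)) for _ in range(k)]


def bagging_predict(models, x):
    """Regression output: average the k individual predictions."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)


def majority_vote(labels):
    """Classification output: the most common label among the k votes."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(0)
# Toy data: y = 2x + 1 plus a little noise (made up for illustration)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in range(20)]
models = bagging_fit(data, k=25, rng=rng)
print(len(models), bagging_predict(models, 10.0))
print(majority_vote(["yes", "no", "yes"]))
```

Any base model could take the place of `fit_line`; the bootstrap and aggregation steps stay the same.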

### Why Bagging Works

- Bootstrap sampling selects records one at a time and returns each selected record to the population, so it has a chance of being selected again (sampling with replacement).
- The variance of the combined prediction is reduced when the individual models are built on independent samples. That way we can reduce the errors that a single model cannot avoid.
- In a given bootstrap sample, some observations may be selected several times and others may not be selected at all.
- It can be shown that a bootstrap sample of the same size as the dataset contains, on average, only about 63% of the distinct observations; the remaining 37% or so are left out.
- So the data used to train each model is not exactly the same. This makes the models less dependent on one another, which helps keep their errors uncorrelated.
- As a result, the errors of the individual models partly cancel out, which typically gives an ensemble model with higher accuracy than a single model.
- Bagging is most useful when the individual model has high variance, that is, when its predictions change a lot with small changes in the training data.
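The 63% figure can be checked with a short simulation. The sketch below assumes each bootstrap sample has the same size n as the dataset; the function name `unique_fraction` and the choices n = 1000 and 200 repetitions are arbitrary. The probability that a given record appears at least once in the sample is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows.

```python
import random


def unique_fraction(n, rng):
    """Fraction of the n records that appear at least once in one bootstrap sample."""
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
    return len(drawn) / n


rng = random.Random(0)
n = 1000
fractions = [unique_fraction(n, rng) for _ in range(200)]
mean_fraction = sum(fractions) / len(fractions)

# Probability that a given record is drawn at least once in n draws
theoretical = 1 - (1 - 1 / n) ** n

print(round(mean_fraction, 3), round(theoretical, 3))
```

The records that are left out of a sample are often called out-of-bag observations.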

And now, let's put everything into practice.

### Practice: Bagging Models

- Import the Boston house price data.
- Get some basic meta details of the data.
- Use 90% of the data for training and keep the remaining 10% as holdout data.
- Build a single linear regression model on the training data.
- On the holdout data, calculate the error (squared deviation) for the regression model.
- Build the regression model using the bagging technique. Build at least 25 models.
- On the holdout data, calculate the error (squared deviation) for the consolidated bagged regression model.
- What is the improvement of the bagged model when compared with the single model?

In [1]:

```
#Importing Boston house price data
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
house=pd.read_csv("datasets/Housing/Boston.csv")
```

In [2]:

```
house.head(5)
```

In [3]:

```
# Column names, data types, and non-null counts of the dataset
house.info()
```

In [4]:

```
# Split the dataset into training (90%) and holdout (10%) sets
# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed
from sklearn.model_selection import train_test_split
house_train, house_test = train_test_split(house, train_size=0.9)
```

In [5]:

```
# Fit a linear regression on the training data, with medv as the target variable
from sklearn.linear_model import LinearRegression
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']
lr = LinearRegression()
lr.fit(house_train[features], house_train['medv'])
```

In [6]:

```
# Predict medv for the holdout data
predict_test = lr.predict(house_test[features])
```

In [7]:

```
from sklearn.metrics import mean_squared_error
# Mean squared error of the single linear regression model on the holdout data
mean_squared_error(house_test['medv'], predict_test)
```

In [8]:

```
# Build the regression model using the bagging technique
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# The base model is passed positionally because its keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases.
# n_estimators=25 builds the 25 models the exercise asks for.
Bag = BaggingRegressor(LinearRegression(), n_estimators=25, bootstrap=True)
# Same 13 predictors as the single linear regression model
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']
X = house_train[features]
y = house_train['medv']
Bag.fit(X, y)
bagpredict_test = Bag.predict(house_test[features])
z = house_test[['medv']]
```

In [9]:

```
# Mean squared error of the bagged model on the holdout data
mean_squared_error(z, bagpredict_test)
```

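**Comparing the two errors, we can see that the error of the bagged model is lower than that of the single model.** Because neither the train/holdout split nor the bootstrap sampling uses a fixed seed, the exact values change from run to run.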
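To answer the last question of the exercise, the improvement can be expressed as the percentage reduction in mean squared error. The helper `relative_improvement` below is a hypothetical name, and the two error values are placeholders chosen only to show the arithmetic; they are not results from the Boston data.

```python
def relative_improvement(mse_single, mse_bagged):
    """Percentage reduction in MSE of the bagged model relative to the single model."""
    return 100.0 * (mse_single - mse_bagged) / mse_single


# Placeholder values for illustration only; substitute the two errors computed above
mse_single = 25.0
mse_bagged = 20.0
print(relative_improvement(mse_single, mse_bagged))  # 20.0
```

A negative result would mean the bagged model did worse than the single model on that particular holdout set.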