Let’s move on to the first type of ensemble method, the Bagging Algorithm.

We will cover the concept behind Bagging and implement it using R.

### The Bagging Algorithm

- Start with the training dataset \(D\)
- Draw \(k\) bootstrap samples from \(D\)
- For each bootstrap sample \(i\), build a classifier model \(M_i\)
- In total we get \(k\) classifiers \(M_1, M_2, \dots, M_k\)
- For the final output, take a majority vote over the \(k\) classifiers; for regression, average their predictions (a minimal sketch of this loop appears after the list)
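To make the loop concrete, here is a minimal hand-rolled sketch of bagging in R. It bags `rpart` regression trees on the Boston data that the lab below also uses; the names (`k`, `models`, `bagged_pred`) are illustrative, not part of any package API.

```
# Minimal bagging sketch: k bootstrap samples, one tree per sample
library(rpart)
library(MASS)
data(Boston)

set.seed(42)
k <- 25                      # number of bootstrap samples / models
n <- nrow(Boston)
models <- vector("list", k)

for (i in 1:k) {
  # Draw n rows with replacement and fit a model on that sample
  boot_idx <- sample(n, n, replace = TRUE)
  models[[i]] <- rpart(medv ~ ., data = Boston[boot_idx, ])
}

# For regression, average the k predictions; for classification we
# would take a majority vote instead
pred_matrix <- sapply(models, predict, newdata = Boston)
bagged_pred <- rowMeans(pred_matrix)
head(bagged_pred)
```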

### Why Bagging Works

- We select records one at a time, returning each selected record to the population so that it has a chance to be selected again (sampling with replacement)
- If the samples were independent, the variance of the consolidated prediction would be reduced; averaging lets the models cancel out some of the errors an individual model inevitably makes
- In a given bootstrap sample, some observations are selected multiple times while others are not selected at all
- It can be shown that, on average, a bootstrap sample contains only about 63% of the distinct observations in the population; the remaining 37% are left out. (The probability that a given observation is never drawn in \(n\) draws with replacement is \((1 - 1/n)^n \approx e^{-1} \approx 0.37\); the simulation after this list confirms it.)
- So the data used by each model is not exactly the same. This makes the learned models closer to independent and helps keep their errors uncorrelated
- The errors of the individual models then partially cancel out, giving a better ensemble model with higher accuracy
- Bagging is especially useful when the base model has high variance, i.e., when its predictions change a lot with small changes in the training data
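The 63%/37% claim is easy to verify empirically. The short simulation below draws one bootstrap sample from a population of 10,000 and counts how many distinct observations it contains (the variable names are illustrative):

```
# Fraction of distinct observations that appear in one bootstrap sample
set.seed(123)
n <- 10000                                # population size
boot_idx <- sample(n, n, replace = TRUE)  # one bootstrap sample
length(unique(boot_idx)) / n              # close to 0.632

# Theoretical value: 1 - (1 - 1/n)^n, which tends to 1 - exp(-1)
1 - exp(-1)                               # 0.6321206
```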

### LAB: Bagging Models

- Import the Boston house price data; it is part of the MASS package
- Get some basic meta details of the data
- Use 90% of the data for training and keep the remaining 10% as holdout data
- Build a single linear regression model on the training data
- On the holdout data, calculate the error (sum of squared deviations) for the regression model
- Build a regression model using the bagging technique, with at least 25 models
- On the holdout data, calculate the error (sum of squared deviations) for the consolidated bagged regression model
- What is the improvement of the bagged model when compared with the single model?

### Solution

```
# Importing Boston house price data
library(MASS)
data(Boston)
head(Boston)
```

```
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
```

`dim(Boston)`

`## [1] 506 14`

```
## Training and holdout sample
library(caret)
```

`## Loading required package: lattice`

`## Loading required package: ggplot2`

```
set.seed(500)
# createDataPartition() keeps the distribution of medv similar in both splits
sampleseed <- createDataPartition(Boston$medv, p = 0.9, list = FALSE)
train_boston <- Boston[sampleseed, ]
test_boston <- Boston[-sampleseed, ]
### Regression Model
reg_model <- lm(medv ~ ., data = train_boston)
summary(reg_model)
```

```
##
## Call:
## lm(formula = medv ~ ., data = train_boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4763 -2.7684 -0.4912 1.9030 26.4569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.637e+01 5.534e+00 6.572 1.40e-10 ***
## crim -1.042e-01 3.513e-02 -2.965 0.003195 **
## zn 4.482e-02 1.459e-02 3.073 0.002248 **
## indus 1.986e-02 6.566e-02 0.302 0.762462
## chas 2.733e+00 8.765e-01 3.118 0.001939 **
## nox -1.844e+01 4.018e+00 -4.590 5.79e-06 ***
## rm 3.845e+00 4.670e-01 8.234 2.04e-15 ***
## age 8.782e-04 1.434e-02 0.061 0.951211
## dis -1.488e+00 2.096e-01 -7.101 4.94e-12 ***
## rad 2.770e-01 6.993e-02 3.960 8.71e-05 ***
## tax -1.062e-02 3.944e-03 -2.693 0.007348 **
## ptratio -9.799e-01 1.385e-01 -7.073 5.92e-12 ***
## black 9.620e-03 2.827e-03 3.403 0.000726 ***
## lstat -5.051e-01 5.706e-02 -8.852 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.723
## F-statistic: 92.75 on 13 and 444 DF, p-value: < 2.2e-16
```

```
### Accuracy testing on holdout data
# Column 14 is medv, the target, so drop it before predicting
pred_reg <- predict(reg_model, newdata = test_boston[, -14])
reg_err <- sum((test_boston$medv - pred_reg)^2)
reg_err
```

`## [1] 918.5927`

```
### Bagging Ensemble Model
library(ipred)
# bagging() fits a regression tree on each of nbagg bootstrap samples
# and averages their predictions
bagg_model <- bagging(medv ~ ., data = train_boston, nbagg = 30)
### Accuracy testing on holdout data
pred_bagg <- predict(bagg_model, newdata = test_boston[, -14])
bgg_err <- sum((test_boston$medv - pred_bagg)^2)
bgg_err
```

`## [1] 390.9028`

```
### Overall Improvement
reg_err
```

`## [1] 918.5927`

`bgg_err`

`## [1] 390.9028`

`(reg_err-bgg_err)/reg_err`

```
## [1] 0.5744547
```

We can see that bagging cut the holdout error from 918.6 to 390.9, a reduction of about 57% relative to the single regression model.