
# 203.2.6 Model Selection : Logistic Regression

### LAB: Logistic Regression Model Selection

1. What are the top-2 impacting variables in the Fiberbits model?
1. What are the least impacting variables in the Fiberbits model?
1. Can we drop any of these variables?
1. Can we derive any new variables to increase the accuracy of the model?
1. What is the final model? What is the best accuracy that you can expect on this data?

### Solution

1. What are the top-2 impacting variables in the Fiberbits model?

Speed_test_result and relocated (relocation status) are the two most important variables.

1. What are the least impacting variables in the Fiberbits model?

monthly_bill and income are the least impacting variables.

1. Can we drop any of these variables?

monthly_bill and income are candidates for dropping because they have the least impact compared with the other predictors. Before taking the final decision, we need to check how the accuracy and AIC change when each one is removed.
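
One common way to rank predictors in a fitted logistic regression is by the absolute Wald z value of each coefficient. The sketch below assumes that this is the notion of "impact" used in the answers above; the lab does not state it explicitly. The data frame `toy` and the helper `rank_by_z` are hypothetical and exist only for illustration; they are not the Fiberbits data.

```r
# Synthetic data for illustration only (not the Fiberbits data).
set.seed(1)
n <- 2000
x_strong <- rnorm(n)
x_weak   <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x_strong + 0.1 * x_weak))
toy <- data.frame(y, x_strong, x_weak)

fit <- glm(y ~ ., family = binomial(), data = toy)

# Hypothetical helper: order predictors by absolute Wald z value,
# largest first, ignoring the intercept.
rank_by_z <- function(model) {
  z <- coef(summary(model))[, "z value"]
  z <- z[names(z) != "(Intercept)"]
  sort(abs(z), decreasing = TRUE)
}

ranking <- rank_by_z(fit)
ranking
```

Calling `rank_by_z(Fiberbits_model_1)` on the lab's model should list the predictors from most to least impactful under this assumption.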

### AIC & Accuracy of Model1

``````threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_1,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_1$y

conf_matrix<-table(predicted_values,actual_values)
conf_matrix``````
``````##                 actual_values
## predicted_values     0     1
##                0 29492 10847
##                1 12649 47012``````
``````accuracy1<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy1``````
``## [1] 0.76504``
``AIC(Fiberbits_model_1)``
``## [1] 98377.36``
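
As a cross-check, the accuracy can be recomputed by hand from the confusion matrix printed above. The sketch below re-enters the four reported counts and also derives sensitivity and specificity, which the lab does not report; they are added here only to show what the single accuracy figure hides.

```r
# Confusion matrix for Model1 as reported above
# (rows: predicted class, columns: actual class).
conf_matrix <- matrix(c(29492, 12649, 10847, 47012), nrow = 2,
                      dimnames = list(predicted = c("0", "1"),
                                      actual    = c("0", "1")))

# Accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)   # 76504 / 100000

# Sensitivity: share of actual 1s predicted as 1.
sensitivity <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])

# Specificity: share of actual 0s predicted as 0.
specificity <- conf_matrix["0", "0"] / sum(conf_matrix[, "0"])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

At the 0.5 threshold the model identifies active customers (class 1) more reliably than inactive ones (class 0).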

### AIC & Accuracy of Model1 without monthly_bill

``Fiberbits_model_11<-glm(active_cust~.-monthly_bill,family=binomial(),data=Fiberbits)``
``## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred``
``````threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_11,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_11$y

conf_matrix<-table(predicted_values,actual_values)
accuracy11<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy11``````
``## [1] 0.76337``
``AIC(Fiberbits_model_11)``
``## [1] 98580.54``

### AIC & Accuracy of Model1 without income

``Fiberbits_model_2<-glm(active_cust~.-income,family=binomial(),data=Fiberbits)``
``## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred``
``````threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_2,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_2$y

conf_matrix<-table(predicted_values,actual_values)
accuracy2<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy2``````
``## [1] 0.76695``
``AIC(Fiberbits_model_2)``
``## [1] 99076.27``

### Deciding which Variable to Drop

| Model | All Variables | Without monthly_bill | Without income |
|---|---|---|---|
| AIC | 98377.36 | 98580.54 | 99076.27 |
| Accuracy | 0.76504 | 0.76337 | 0.76695 |

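Dropping income has not reduced the accuracy: it is 0.76695 without income versus 0.76504 with all variables. AIC (a measure of information loss) rises from 98377.36 to 99076.27, a change of less than 1%, so the lab continues with the model without income (Fiberbits_model_2).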
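
To make the comparison easier to read, the reported figures can be put in a data frame and expressed as differences from the full model. This is a small sketch that uses only the numbers reported above; the column names `delta_AIC` and `delta_accuracy` are chosen here for illustration.

```r
# Reported AIC and accuracy for the three candidate models.
comparison <- data.frame(
  model    = c("All variables", "Without monthly_bill", "Without income"),
  AIC      = c(98377.36, 98580.54, 99076.27),
  accuracy = c(0.76504, 0.76337, 0.76695)
)

# Change relative to the full model (first row).
comparison$delta_AIC      <- comparison$AIC - comparison$AIC[1]
comparison$delta_accuracy <- comparison$accuracy - comparison$accuracy[1]

comparison
```

The two criteria do not point the same way: dropping income gives the larger AIC increase but a slightly higher training accuracy, while dropping monthly_bill gives the smaller AIC increase but a slightly lower accuracy. Which one to weight more is a judgment call.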

### Output of Model2

``summary(Fiberbits_model_2)``
``````##
## Call:
## glm(formula = active_cust ~ . - income, family = binomial(),
##     data = Fiberbits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -8.4904  -0.8901   0.4175   0.7675   3.1083
##
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)
## (Intercept)                -1.309e+01  2.100e-01  -62.34   <2e-16 ***
## months_on_network           1.004e-02  4.644e-04   21.62   <2e-16 ***
## Num_complaints             -7.071e-01  2.990e-02  -23.65   <2e-16 ***
## number_plan_changes        -2.016e-01  7.571e-03  -26.63   <2e-16 ***
## relocated                  -3.133e+00  3.933e-02  -79.66   <2e-16 ***
## monthly_bill               -2.253e-03  1.566e-04  -14.39   <2e-16 ***
## technical_issues_per_month -3.970e-01  7.159e-03  -55.45   <2e-16 ***
## Speed_test_result           2.198e-01  2.334e-03   94.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 136149  on 99999  degrees of freedom
## Residual deviance:  99060  on 99992  degrees of freedom
## AIC: 99076
##
## Number of Fisher Scoring iterations: 7``````
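
The AIC printed in the summary above can be cross-checked by hand. For a logistic regression on a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus 2k, where k is the number of estimated coefficients. The sketch below checks this with the rounded figures from the summary and then on a small synthetic fit; `toy` and `fit` are illustrative names, not part of the lab.

```r
# Figures as printed (rounded) in the Model2 summary.
residual_deviance <- 99060
k <- 8                                   # intercept + 7 predictors
aic_check <- residual_deviance + 2 * k   # 99076, matching "AIC: 99076"
aic_check

# The same identity on a synthetic 0/1 response (illustration only).
set.seed(7)
toy <- data.frame(x = rnorm(500))
toy$y <- rbinom(500, 1, plogis(0.8 * toy$x))
fit <- glm(y ~ x, family = binomial(), data = toy)

all.equal(AIC(fit), deviance(fit) + 2 * length(coef(fit)))
```

This also explains why AIC can rise when a variable is dropped: the penalty term falls by 2, but the deviance may increase by much more.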
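
1. Can we derive any new variables to increase the accuracy of the model?

Model 3 keeps all the original predictors, including income, and adds two derived terms: the interaction between technical_issues_per_month and number_plan_changes, and the square of Speed_test_result.
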
``````Fiberbits_model_3<-glm(active_cust~    income
+months_on_network
+Num_complaints
+number_plan_changes
+relocated
+monthly_bill
+technical_issues_per_month
+technical_issues_per_month*number_plan_changes
+Speed_test_result+I(Speed_test_result^2),
family=binomial(),data=Fiberbits)``````
``## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred``
``summary(Fiberbits_model_3)``
``````##
## Call:
## glm(formula = active_cust ~ income + months_on_network + Num_complaints +
##     number_plan_changes + relocated + monthly_bill + technical_issues_per_month +
##     technical_issues_per_month * number_plan_changes + Speed_test_result +
##     I(Speed_test_result^2), family = binomial(), data = Fiberbits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.6112  -0.8478   0.3780   0.7401   2.9909
##
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                    -2.501e+01  3.647e-01
## income                                          1.831e-03  8.386e-05
## months_on_network                               2.905e-02  1.011e-03
## Num_complaints                                 -6.972e-01  3.030e-02
## number_plan_changes                            -4.404e-01  2.199e-02
## relocated                                      -3.253e+00  3.997e-02
## monthly_bill                                   -2.295e-03  1.588e-04
## technical_issues_per_month                     -4.670e-01  9.694e-03
## Speed_test_result                               3.910e-01  4.260e-03
## I(Speed_test_result^2)                         -9.438e-04  1.272e-05
## number_plan_changes:technical_issues_per_month  7.481e-02  6.164e-03
##                                                z value Pr(>|z|)
## (Intercept)                                     -68.56   <2e-16 ***
## income                                           21.83   <2e-16 ***
## months_on_network                                28.73   <2e-16 ***
## Num_complaints                                  -23.00   <2e-16 ***
## number_plan_changes                             -20.03   <2e-16 ***
## relocated                                       -81.39   <2e-16 ***
## monthly_bill                                    -14.46   <2e-16 ***
## technical_issues_per_month                      -48.17   <2e-16 ***
## Speed_test_result                                91.79   <2e-16 ***
## I(Speed_test_result^2)                          -74.20   <2e-16 ***
## number_plan_changes:technical_issues_per_month   12.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 136149  on 99999  degrees of freedom
## Residual deviance:  97105  on 99989  degrees of freedom
## AIC: 97127
##
## Number of Fisher Scoring iterations: 7``````
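
The two derived terms rely on R's formula syntax, which is easy to misread. In a formula, `a*b` expands to both main effects plus the interaction `a:b`, and a power such as `x^2` must be wrapped in `I()` to be treated as arithmetic. The sketch below inspects the expanded terms without fitting anything; the response `y` and predictor `x` in the second part are placeholder names.

```r
# a*b expands to a + b + a:b
f_interaction <- active_cust ~ technical_issues_per_month * number_plan_changes
interaction_terms <- attr(terms(f_interaction), "term.labels")
interaction_terms

# Without I(), x^2 is formula crossing and collapses to x alone.
plain_terms <- attr(terms(y ~ x + x^2), "term.labels")
plain_terms

# With I(), the squared term is kept as a separate predictor.
squared_terms <- attr(terms(y ~ x + I(x^2)), "term.labels")
squared_terms
```

This is why the summary above shows a single interaction row even though the main effects are also listed separately in the formula.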

### AIC & Accuracy of Model 3

``````threshold=0.5

predicted_values<-ifelse(predict(Fiberbits_model_3,type="response")>threshold,1,0)
actual_values<-Fiberbits_model_3$y

conf_matrix<-table(predicted_values,actual_values)
accuracy3<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy3``````
``## [1] 0.76061``
``AIC(Fiberbits_model_3)``
``## [1] 97127.17``

1. What is the final model? What is the best accuracy that you can expect on this data?

``AIC(Fiberbits_model_1,Fiberbits_model_2,Fiberbits_model_3)``
``````##                   df      AIC
## Fiberbits_model_1  9 98377.36
## Fiberbits_model_2  8 99076.27
## Fiberbits_model_3 11 97127.17``````
``accuracy1``
``## [1] 0.76504``
``accuracy2``
``## [1] 0.76695``
``accuracy3``
``## [1] 0.76061``

### Conclusion: Logistic Regression

1. Logistic regression is a foundation for many classification algorithms.
1. A good understanding of logistic regression and goodness-of-fit measures really helps in understanding more complex machine learning algorithms such as neural networks and SVMs.
1. One has to be careful while selecting the model, because all the goodness-of-fit measures here are calculated on the training data. We may have to do cross-validation to get an idea of the test error.
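
A minimal sketch of k-fold cross-validation for classification accuracy is given below. It runs on synthetic data so that it is self-contained; `toy` and the helper `cv_accuracy` are hypothetical names, and the choice of 5 folds and a 0.5 threshold are assumptions made for illustration.

```r
# Synthetic data for illustration only (not the Fiberbits data).
set.seed(42)
n <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1 * toy$x2))

# Hypothetical helper: mean held-out accuracy over k folds.
cv_accuracy <- function(formula, data, k = 5, threshold = 0.5) {
  # Randomly assign each row to one of k folds of near-equal size.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  fold_acc <- numeric(k)
  for (i in seq_len(k)) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- glm(formula, family = binomial(), data = train)
    prob  <- predict(fit, newdata = test, type = "response")
    pred  <- ifelse(prob > threshold, 1, 0)
    fold_acc[i] <- mean(pred == test[[response]])
  }
  mean(fold_acc)
}

cv_acc <- cv_accuracy(y ~ x1 + x2, toy)
cv_acc
```

Applied to the lab, something like `cv_accuracy(active_cust ~ . - income, Fiberbits)` would give a held-out estimate to compare against the training accuracies reported above.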