# 204.2.6 Model Selection : Logistic Regression

We left part of the discussion on goodness of fit unfinished in the earlier post. We cover it here and see whether we can improve our model based on AIC and BIC.
The other methods used for model selection will be covered in a separate series dedicated to the topic.

### How to improve the model

• By adding more independent variables?
• By deriving new variables from the available set?
• By transforming variables?
• By collecting more data?
• How do we choose the best model from a list of fitted models with different parameters?

### AIC and BIC

• AIC and BIC play a role similar to adjusted R-squared in linear regression: they reward goodness of fit but penalize model complexity
• The AIC of a single model has no real use on its own, but it really helps when we are choosing between models
• Given a collection of models for the same data, AIC estimates the quality of each model relative to the others
• If we are choosing between two models, the model with the lower AIC is preferred
• AIC is an estimate of the information lost when a given model is used to represent the process that generates the data
• AIC = -2 ln(L) + 2k
• L is the maximum value of the likelihood function for the model
• k is the number of estimated parameters (the coefficients of the independent variables, plus the intercept if the model has one)
• BIC is an alternative to AIC with a slightly different penalty: BIC = -2 ln(L) + k ln(n), where n is the number of observations. We will follow either AIC or BIC throughout our analysis
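
To make the formulas concrete, here is a minimal sketch that recomputes AIC and BIC by hand. It uses the figures reported for model m1 later in this post. It assumes that the "Current function value" printed by the optimizer is the average negative log-likelihood per observation, and that k = 8 because m1 has eight coefficients and no intercept. The function names `aic` and `bic` are only illustrative helpers, not part of any library.

```python
import math

def aic(log_likelihood, k):
    """AIC = -2 ln(L) + 2k, where k is the number of estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """BIC = -2 ln(L) + k ln(n), where n is the number of observations."""
    return -2.0 * log_likelihood + k * math.log(n)

n_obs = 100000               # No. Observations in the m1 summary
mean_neg_loglik = 0.517172   # "Current function value" printed for m1 (assumed to be -ln(L)/n)
log_lik = -mean_neg_loglik * n_obs   # about -51717.2
k_params = 8                 # eight coefficients, no intercept (assumption stated above)

m1_aic = aic(log_lik, k_params)
m1_bic = bic(log_lik, k_params, n_obs)
print(round(m1_aic, 1), round(m1_bic, 1))
```

The results land within rounding distance of the 103450.4420 and 103526.5454 shown in the summary output, which suggests the summary uses the same two formulas.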

### Practice : Logistic Regression Model Selection

• Find AIC and BIC values for the first fiber bits model (m1)
• What are the top-2 impacting variables in the fiber bits model?
• What are the least impacting variables in the fiber bits model?
• Can we drop any of these variables and build a new model (m2)?
• Can we add any new interaction and polynomial variables to increase the accuracy of the model (m3)?
• We have three models. What is the best accuracy that you can expect on this data?
In [30]:
```
# Find AIC and BIC values for the first fiber bits model (m1)
import statsmodels.api as sm

# Fiber is the fiber bits DataFrame from the earlier cells.
# No constant column is added, so the model has no intercept.
predictors_m1 = ['income', 'months_on_network', 'Num_complaints', 'number_plan_changes',
                 'relocated', 'monthly_bill', 'technical_issues_per_month', 'Speed_test_result']
m1 = sm.Logit(Fiber['active_cust'], Fiber[predictors_m1])

# Fit once and reuse the result; summary2() reports AIC and BIC
m1_fit = m1.fit()
m1_fit.summary2()
```
```
Optimization terminated successfully.
Current function value: 0.517172
Iterations 7
```
Out[30]:
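
| Item | Value | Item | Value |
|---|---|---|---|
| Model: | Logit | Pseudo R-squared: | 0.240 |
| Dependent Variable: | active_cust | AIC: | 103450.4420 |
| Date: | 2016-10-16 14:35 | BIC: | 103526.5454 |
| No. Observations: | 100000 | Log-Likelihood: | -51717. |
| Df Model: | 7 | LL-Null: | -68074. |
| Df Residuals: | 99992 | LLR p-value: | 0.0000 |
| Converged: | 1.0000 | Scale: | 1.0000 |
| No. Iterations: | 7.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| income | 0.0000 | 0.0000 | 4.0973 | 0.0000 | 0.0000 | 0.0000 |
| months_on_network | 0.0150 | 0.0005 | 31.1715 | 0.0000 | 0.0141 | 0.0159 |
| Num_complaints | -1.7669 | 0.0271 | -65.2837 | 0.0000 | -1.8199 | -1.7138 |
| number_plan_changes | -0.1784 | 0.0075 | -23.9093 | 0.0000 | -0.1930 | -0.1638 |
| relocated | -3.0826 | 0.0404 | -76.2589 | 0.0000 | -3.1618 | -3.0034 |
| monthly_bill | -0.0024 | 0.0002 | -16.0138 | 0.0000 | -0.0027 | -0.0021 |
| technical_issues_per_month | -0.4636 | 0.0072 | -64.0101 | 0.0000 | -0.4778 | -0.4494 |
| Speed_test_result | 0.1094 | 0.0015 | 75.0729 | 0.0000 | 0.1065 | 0.1122 |
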
• What are the top-2 impacting variables in fiber bits model?
• What are the least impacting variables in fiber bits model?
In [31]:
```
m1.fit().summary()
```
```
Optimization terminated successfully.
Current function value: 0.517172
Iterations 7
```
Out[31]:
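
| Item | Value | Item | Value |
|---|---|---|---|
| Dep. Variable: | active_cust | No. Observations: | 100000 |
| Model: | Logit | Df Residuals: | 99992 |
| Method: | MLE | Df Model: | 7 |
| Date: | Sun, 16 Oct 2016 | Pseudo R-squ.: | 0.2403 |
| Time: | 14:35:52 | Log-Likelihood: | -51717 |
| converged: | True | LL-Null: | -68074 |
| | | LLR p-value: | 0 |

| | coef | std err | z | P>\|z\| | 95.0% Conf. Int. lower | 95.0% Conf. Int. upper |
|---|---|---|---|---|---|---|
| income | 1.71e-05 | 4.17e-06 | 4.097 | 0.000 | 8.92e-06 | 2.53e-05 |
| months_on_network | 0.0150 | 0.000 | 31.172 | 0.000 | 0.014 | 0.016 |
| Num_complaints | -1.7669 | 0.027 | -65.284 | 0.000 | -1.820 | -1.714 |
| number_plan_changes | -0.1784 | 0.007 | -23.909 | 0.000 | -0.193 | -0.164 |
| relocated | -3.0826 | 0.040 | -76.259 | 0.000 | -3.162 | -3.003 |
| monthly_bill | -0.0024 | 0.000 | -16.014 | 0.000 | -0.003 | -0.002 |
| technical_issues_per_month | -0.4636 | 0.007 | -64.010 | 0.000 | -0.478 | -0.449 |
| Speed_test_result | 0.1094 | 0.001 | 75.073 | 0.000 | 0.107 | 0.112 |
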
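
One common way to answer the two "impacting variables" questions is to rank the variables by the absolute value of their z-statistic. The sketch below does that with the z-values copied from the m1 output, assuming the rows of the coefficient table appear in the same order as the columns passed to `sm.Logit`. Treat the ranking as a rule of thumb rather than the only possible measure of impact.

```python
# z-statistics copied from the m1 summary output (Out[30]).
# Assumption: rows follow the column order used when building m1.
z_values = {
    "income": 4.0973,
    "months_on_network": 31.1715,
    "Num_complaints": -65.2837,
    "number_plan_changes": -23.9093,
    "relocated": -76.2589,
    "monthly_bill": -16.0138,
    "technical_issues_per_month": -64.0101,
    "Speed_test_result": 75.0729,
}

# Rank variables from largest to smallest |z|
ranked = sorted(z_values, key=lambda name: abs(z_values[name]), reverse=True)

top_two = ranked[:2]       # strongest evidence of an effect
bottom_two = ranked[-2:]   # weakest evidence of an effect

print("Most impacting:", top_two)      # ['relocated', 'Speed_test_result']
print("Least impacting:", bottom_two)  # ['monthly_bill', 'income']
```

With a fitted statsmodels result, the same z-statistics should be available through the result's `tvalues` attribute, so the dictionary would not need to be typed by hand.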
• Can we drop any of these variables and build a new model(m2)
In [32]:
```
# Income and monthly bill are dropped because they are the least impacting variables
predictors_m2 = ['months_on_network', 'Num_complaints', 'number_plan_changes',
                 'relocated', 'technical_issues_per_month', 'Speed_test_result']
m2 = sm.Logit(Fiber['active_cust'], Fiber[predictors_m2])

# Fit once and reuse the result; summary2() reports AIC and BIC
m2_fit = m2.fit()
m2_fit.summary2()
```
```
Optimization terminated successfully.
Current function value: 0.518605
Iterations 7
```
Out[32]:
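
| Item | Value | Item | Value |
|---|---|---|---|
| Model: | Logit | Pseudo R-squared: | 0.238 |
| Dependent Variable: | active_cust | AIC: | 103732.9794 |
| Date: | 2016-10-16 14:35 | BIC: | 103790.0570 |
| No. Observations: | 100000 | Log-Likelihood: | -51860. |
| Df Model: | 5 | LL-Null: | -68074. |
| Df Residuals: | 99994 | LLR p-value: | 0.0000 |
| Converged: | 1.0000 | Scale: | 1.0000 |
| No. Iterations: | 7.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| months_on_network | 0.0146 | 0.0005 | 30.8698 | 0.0000 | 0.0137 | 0.0155 |
| Num_complaints | -1.7621 | 0.0270 | -65.2891 | 0.0000 | -1.8150 | -1.7092 |
| number_plan_changes | -0.1765 | 0.0074 | -23.7127 | 0.0000 | -0.1910 | -0.1619 |
| relocated | -3.0800 | 0.0404 | -76.1640 | 0.0000 | -3.1592 | -3.0007 |
| technical_issues_per_month | -0.4762 | 0.0072 | -66.1848 | 0.0000 | -0.4903 | -0.4621 |
| Speed_test_result | 0.1074 | 0.0014 | 74.5451 | 0.0000 | 0.1046 | 0.1102 |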
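
With both summaries in hand, the two models can be compared directly. The sketch below simply picks the model with the lower AIC and the lower BIC, using the values reported in the outputs above. The dictionary `reported` is only an illustrative container; with fitted statsmodels results the same numbers should be readable from the `aic` and `bic` attributes of each result.

```python
# AIC and BIC values copied from the two summary2() outputs
reported = {
    "m1": {"aic": 103450.4420, "bic": 103526.5454},
    "m2": {"aic": 103732.9794, "bic": 103790.0570},
}

# Lower is better for both criteria
best_by_aic = min(reported, key=lambda name: reported[name]["aic"])
best_by_bic = min(reported, key=lambda name: reported[name]["bic"])

# How much AIC increases when income and monthly_bill are dropped
aic_gap = reported["m2"]["aic"] - reported["m1"]["aic"]

print(best_by_aic, best_by_bic, round(aic_gap, 1))  # m1 m1 282.5
```

Both criteria prefer m1 here: dropping income and monthly_bill raises AIC by roughly 282, so the simpler model m2 is not an improvement by this measure, even though the two dropped variables are the least impacting ones.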

## Conclusion: Logistic Regression

• Logistic regression is one of the foundational classification algorithms
• A good understanding of logistic regression and goodness of fit measures will really help in understanding more complex machine learning algorithms like neural networks and SVMs
• One has to be careful while selecting the model, because all of these goodness of fit measures are calculated on the training data. We may have to do cross validation to get an idea of the test error.
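
To illustrate the last point, here is a minimal sketch of k-fold cross validation. It does not use the fiber bits data or statsmodels: the data is synthetic, and the one-variable logistic regression is fitted with a small hand-written gradient ascent so that the example runs with the standard library alone. All names (`fit_logistic`, `k_fold_accuracy` and so on) are hypothetical helpers written for this sketch.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.1, epochs=200):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += lr * sum(residuals) / n
        b1 += lr * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

def accuracy(params, xs, ys):
    """Share of observations classified correctly with a 0.5 cut-off."""
    b0, b1 = params
    hits = sum((sigmoid(b0 + b1 * x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def k_fold_accuracy(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        params = fit_logistic([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(params, [xs[i] for i in held_out],
                               [ys[i] for i in held_out]))
    return scores

# Synthetic data: one predictor, outcome drawn from a known logistic model
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(300)]
ys = [1 if rng.random() < sigmoid(3 * x) else 0 for x in xs]

scores = k_fold_accuracy(xs, ys, k=5)
mean_accuracy = sum(scores) / len(scores)
print([round(s, 3) for s in scores], round(mean_accuracy, 3))
```

Each score is computed on observations the model did not see during fitting, so the average is a more honest estimate of test accuracy than any measure calculated on the training data.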