
204.2.6 Model Selection : Logistic Regression

The previous post left out some topics on goodness of fit. We will cover them here and see whether we can improve our model based on AIC and BIC.
We will also cover the various methods used for model selection in a dedicated series of posts.

How to improve the model

  • By adding more independent variables?
  • By deriving new variables from the available set?
  • By transforming variables?
  • By collecting more data?
  • How do we choose the best model from a list of models fitted with different parameters?
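The "deriving new variables" idea above can be sketched with pandas. The column names below are illustrative stand-ins, not the actual fiber bits columns:

```python
import pandas as pd

# Hypothetical columns for illustration; the fiber bits data has its own names
df = pd.DataFrame({"monthly_bill": [40.0, 55.0, 70.0],
                   "months_on_network": [12, 24, 36]})

# Derived variables: an interaction term and a polynomial (squared) term
df["bill_x_tenure"] = df["monthly_bill"] * df["months_on_network"]
df["bill_sq"] = df["monthly_bill"] ** 2
print(df.columns.tolist())
```

Each derived column can then be passed to the model like any other independent variable.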

AIC and BIC

  • AIC and BIC values play a role similar to adjusted R-squared in linear regression
  • A stand-alone AIC value has no real use, but AIC really helps when we are choosing between models
  • Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models
  • If we are choosing between two models, the model with the lower AIC is preferred
  • AIC is an estimate of the information lost when a given model is used to represent the process that generates the data
  • AIC = -2 ln(L) + 2k
    • L is the maximum value of the likelihood function for the model
    • k is the number of parameters estimated by the model
  • BIC is an alternative to AIC with a slightly different formula: it replaces the 2k penalty with k ln(n), so it penalizes extra parameters more heavily on large datasets. We will follow either AIC or BIC throughout our analysis
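As a quick check, both formulas can be computed directly from a model's log-likelihood. The numbers below (log-likelihood ≈ -51717.221, k = 8, n = 100000) are the m1 values reported later in this post:

```python
import math

def aic(log_likelihood, k):
    # AIC = -2*ln(L) + 2*k, where k is the number of estimated parameters
    return -2 * log_likelihood + 2 * k

def bic(log_likelihood, k, n):
    # BIC = -2*ln(L) + k*ln(n), where n is the number of observations
    return -2 * log_likelihood + k * math.log(n)

# m1 values from the summary shown later in this post
ll, k, n = -51717.221, 8, 100000
print(round(aic(ll, k), 4))     # matches the reported AIC of 103450.4420
print(round(bic(ll, k, n), 4))  # matches the reported BIC of 103526.5454
```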

Practice : Logistic Regression Model Selection

  • Find the AIC and BIC values for the first fiber bits model (m1)
  • What are the top-2 impacting variables in the fiber bits model?
  • What are the least impacting variables in the fiber bits model?
  • Can we drop any of these variables and build a new model (m2)?
  • Can we add any new interaction and polynomial variables to increase the accuracy of the model? (m3)
  • We have three models; what is the best accuracy that you can expect on this data?
In [30]:
#Find AIC and BIC values for the first fiber bits model (m1)
predictors = ["income", "months_on_network", "Num_complaints",
              "number_plan_changes", "relocated", "monthly_bill",
              "technical_issues_per_month", "Speed_test_result"]
m1 = sm.Logit(Fiber['active_cust'], Fiber[predictors])
m1_result = m1.fit()   # fit once and reuse the result
m1_result.summary2()
Optimization terminated successfully.
         Current function value: 0.517172
         Iterations 7
Out[30]:
Model:              Logit             Pseudo R-squared: 0.240
Dependent Variable: active_cust       AIC:              103450.4420
Date:               2016-10-16 14:35  BIC:              103526.5454
No. Observations:   100000            Log-Likelihood:   -51717.
Df Model:           7                 LL-Null:          -68074.
Df Residuals:       99992             LLR p-value:      0.0000
Converged:          1.0000            Scale:            1.0000
No. Iterations:     7.0000
                              Coef.   Std.Err.         z    P>|z|    [0.025    0.975]
income                       0.0000     0.0000    4.0973   0.0000    0.0000    0.0000
months_on_network            0.0150     0.0005   31.1715   0.0000    0.0141    0.0159
Num_complaints              -1.7669     0.0271  -65.2837   0.0000   -1.8199   -1.7138
number_plan_changes         -0.1784     0.0075  -23.9093   0.0000   -0.1930   -0.1638
relocated                   -3.0826     0.0404  -76.2589   0.0000   -3.1618   -3.0034
monthly_bill                -0.0024     0.0002  -16.0138   0.0000   -0.0027   -0.0021
technical_issues_per_month  -0.4636     0.0072  -64.0101   0.0000   -0.4778   -0.4494
Speed_test_result            0.1094     0.0015   75.0729   0.0000    0.1065    0.1122
  • What are the top-2 impacting variables in fiber bits model?
  • What are the least impacting variables in fiber bits model?
In [31]:
m1.fit().summary()
Optimization terminated successfully.
         Current function value: 0.517172
         Iterations 7
Out[31]:
Logit Regression Results
Dep. Variable: active_cust       No. Observations: 100000
Model:         Logit             Df Residuals:     99992
Method:        MLE               Df Model:         7
Date:          Sun, 16 Oct 2016  Pseudo R-squ.:    0.2403
Time:          14:35:52          Log-Likelihood:   -51717.
converged:     True              LL-Null:          -68074.
                                 LLR p-value:      0.000
                                coef    std err         z    P>|z|   [95.0% Conf. Int.]
income                      1.71e-05   4.17e-06     4.097    0.000   8.92e-06  2.53e-05
months_on_network             0.0150      0.000    31.172    0.000      0.014     0.016
Num_complaints               -1.7669      0.027   -65.284    0.000     -1.820    -1.714
number_plan_changes          -0.1784      0.007   -23.909    0.000     -0.193    -0.164
relocated                    -3.0826      0.040   -76.259    0.000     -3.162    -3.003
monthly_bill                 -0.0024      0.000   -16.014    0.000     -0.003    -0.002
technical_issues_per_month   -0.4636      0.007   -64.010    0.000     -0.478    -0.449
Speed_test_result             0.1094      0.001    75.073    0.000      0.107     0.112
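One quick way to answer both questions is to rank the predictors by the absolute z-statistics reported in the summary above (a rough screen, not a substitute for domain judgment):

```python
# Ranking predictors by |z|, using the z-statistics from the m1 summary above
z_values = {
    "income": 4.097, "months_on_network": 31.172, "Num_complaints": -65.284,
    "number_plan_changes": -23.909, "relocated": -76.259, "monthly_bill": -16.014,
    "technical_issues_per_month": -64.010, "Speed_test_result": 75.073,
}
ranked = sorted(z_values, key=lambda name: abs(z_values[name]), reverse=True)
print(ranked[:2])   # top-2 impacting: ['relocated', 'Speed_test_result']
print(ranked[-2:])  # least impacting: ['monthly_bill', 'income']
```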
  • Can we drop any of these variables and build a new model(m2)
In [32]:
#income and monthly_bill are dropped because they are the least impacting variables
predictors_m2 = ["months_on_network", "Num_complaints", "number_plan_changes",
                 "relocated", "technical_issues_per_month", "Speed_test_result"]
m2 = sm.Logit(Fiber['active_cust'], Fiber[predictors_m2])
m2_result = m2.fit()   # fit once and reuse the result
m2_result.summary2()
Optimization terminated successfully.
         Current function value: 0.518605
         Iterations 7
Out[32]:
Model:              Logit             Pseudo R-squared: 0.238
Dependent Variable: active_cust       AIC:              103732.9794
Date:               2016-10-16 14:35  BIC:              103790.0570
No. Observations:   100000            Log-Likelihood:   -51860.
Df Model:           5                 LL-Null:          -68074.
Df Residuals:       99994             LLR p-value:      0.0000
Converged:          1.0000            Scale:            1.0000
No. Iterations:     7.0000
                              Coef.   Std.Err.         z    P>|z|    [0.025    0.975]
months_on_network            0.0146     0.0005   30.8698   0.0000    0.0137    0.0155
Num_complaints              -1.7621     0.0270  -65.2891   0.0000   -1.8150   -1.7092
number_plan_changes         -0.1765     0.0074  -23.7127   0.0000   -0.1910   -0.1619
relocated                   -3.0800     0.0404  -76.1640   0.0000   -3.1592   -3.0007
technical_issues_per_month  -0.4762     0.0072  -66.1848   0.0000   -0.4903   -0.4621
Speed_test_result            0.1074     0.0014   74.5451   0.0000    0.1046    0.1102
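With the AIC values from the two summaries above, the selection rule is simply "lower AIC wins":

```python
# AIC values copied from the m1 and m2 summaries above; lower AIC is preferred
candidates = {"m1": 103450.4420, "m2": 103732.9794}
best = min(candidates, key=candidates.get)
print(best)  # prints "m1"
```

Note that by this criterion m1 is still preferred: dropping income and monthly_bill did not lower the AIC.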

Conclusion: Logistic Regression

  • Logistic regression is the foundation of many classification algorithms
  • A good understanding of logistic regression and its goodness-of-fit measures really helps in understanding complex machine learning algorithms like neural networks and SVMs
  • One has to be careful while selecting a model: all of the goodness-of-fit measures above are calculated on training data. We may have to do cross-validation to get an idea of the test error.
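The cross-validation point in the last bullet can be sketched with plain numpy. Here a synthetic dataset and a trivial threshold classifier stand in for the Fiber data and the statsmodels Logit model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the Fiber dataframe: one predictor, a noisy binary target
X = rng.normal(size=(1000, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

def cross_val_accuracy(X, y, k=5):
    """Average held-out accuracy over k folds."""
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Trivial stand-in classifier: predict 1 when x exceeds the training mean.
        # With the real data, you would refit the Logit model on train_idx here.
        threshold = X[train_idx, 0].mean()
        preds = (X[test_idx, 0] > threshold).astype(int)
        scores.append((preds == y[test_idx]).mean())
    return float(np.mean(scores))

print(cross_val_accuracy(X, y))  # held-out accuracy estimate, not training accuracy
```

Because each fold's accuracy is measured on data the model never saw, the average is a far more honest estimate of test error than any in-sample goodness-of-fit measure.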
