
204.2.5 Multicollinearity and Individual Impact Of Variables in Logistic Regression

The previous post was about goodness of fit: we covered the confusion matrix and will cover the remaining measures in upcoming posts.

But first, let's deal with a common modeling issue:

Multicollinearity

  • The relation between X and Y is non-linear, which is why we used logistic regression.
  • Multicollinearity is an issue among the predictor variables.
  • Multicollinearity needs to be fixed in logistic regression as well.
  • Otherwise, the individual coefficients of the predictors will be affected by the interdependency.
  • The identification process is the same as in linear regression.
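The identification step rests on the VIF formula: for predictor k, VIF_k = 1 / (1 − R²_k), where R²_k comes from regressing predictor k on all the other predictors. A minimal sketch of this idea on synthetic data (not the Fiber dataset), using statsmodels' built-in `variance_inflation_factor`:

```python
# VIF sketch on synthetic data: x3 is nearly a copy of x1, so both
# should show inflated VIFs, while the independent x2 stays near 1.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                    # independent of x1
x3 = x1 + rng.normal(scale=0.1, size=500)    # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

A VIF near 1 indicates no collinearity; values above roughly 5 (or 10, by a stricter rule of thumb) flag a problem.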

Practice : Multicollinearity

  • Is there any multicollinearity in the Fiber bits model?
  • Identify and remove multicollinearity from the model.
In [27]:
import statsmodels.formula.api as sm

def vif_cal(input_data, dependent_col):
    # For each predictor, regress it on all the other predictors
    # and compute VIF = 1 / (1 - R-squared).
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
In [28]:
#Calculating VIF values using that function
vif_cal(input_data=Fiber, dependent_col="active_cust")
income  VIF =  1.02
months_on_network  VIF =  1.03
Num_complaints  VIF =  1.01
number_plan_changes  VIF =  1.59
relocated  VIF =  1.56
monthly_bill  VIF =  1.02
technical_issues_per_month  VIF =  1.06
Speed_test_result  VIF =  1.0
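All VIFs here are close to 1, so no variable needs to be removed from the Fiber bits model. Had any predictor exceeded the usual threshold of about 5, the fix would be to drop it and recompute the VIFs. A minimal sketch of that removal step on synthetic data (column names are hypothetical, not from the Fiber dataset):

```python
# Drop the predictor with the worst VIF when it exceeds the threshold.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
})
df["c"] = df["a"] + rng.normal(scale=0.05, size=200)  # collinear with "a"

vifs = pd.Series(
    [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
    index=df.columns,
)
worst = vifs.idxmax()
reduced = df.drop(columns=[worst]) if vifs.max() > 5 else df
print(worst, list(reduced.columns))
```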

Individual Impact of Variables

  • Out of these predictor variables, which are the important ones?
  • If we had to choose the top five variables, which would they be?
  • While selecting the model, we may want to drop a few less impactful variables.
  • How do we rank the predictor variables in order of importance?
  • We can simply look at the z-value of each variable; use their absolute values.
  • Or calculate the Wald chi-square, which is approximately the square of the z-score.
  • The Wald chi-square value helps in ranking the variables.

Practice : Individual Impact of Variables

  • Identify the top impacting and least impacting variables in the Fiber bits model.
  • Find the variable importance and order the variables based on their impact.
In [29]:
result1.summary()
Out[29]:
Logit Regression Results
Dep. Variable: active_cust No. Observations: 100000
Model: Logit Df Residuals: 99992
Method: MLE Df Model: 7
Date: Sun, 16 Oct 2016 Pseudo R-squ.: 0.2403
Time: 14:35:51 Log-Likelihood: -51717.
converged: True LL-Null: -68074.
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
income 1.71e-05 4.17e-06 4.097 0.000 8.92e-06 2.53e-05
months_on_network 0.0150 0.000 31.172 0.000 0.014 0.016
Num_complaints -1.7669 0.027 -65.284 0.000 -1.820 -1.714
number_plan_changes -0.1784 0.007 -23.909 0.000 -0.193 -0.164
relocated -3.0826 0.040 -76.259 0.000 -3.162 -3.003
monthly_bill -0.0024 0.000 -16.014 0.000 -0.003 -0.002
technical_issues_per_month -0.4636 0.007 -64.010 0.000 -0.478 -0.449
Speed_test_result 0.1094 0.001 75.073 0.000 0.107 0.112

Top impacting variables are – relocated & Speed_test_result

Least impacting variables are – monthly_bill & income
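This ranking can be checked with a quick sketch that squares the z-values reported in the summary table above and sorts the results:

```python
# Wald chi-square ranking from the z-values in the logit summary.
z_values = {
    "income": 4.097,
    "months_on_network": 31.172,
    "Num_complaints": -65.284,
    "number_plan_changes": -23.909,
    "relocated": -76.259,
    "monthly_bill": -16.014,
    "technical_issues_per_month": -64.010,
    "Speed_test_result": 75.073,
}
wald_chi2 = {name: round(z ** 2, 1) for name, z in z_values.items()}
ranked = sorted(wald_chi2, key=wald_chi2.get, reverse=True)
for name in ranked:
    print(name, wald_chi2[name])
```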
