This will be the last post of our Model Selection and Cross Validation series. Bootstrap Methods: Bootstrapping is a powerful tool to get an idea of the accuracy of the model and the test error. It can estimate the likely future performance of a given modeling procedure on new data not …
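The bootstrap idea sketched above can be illustrated in base R. This is a minimal sketch, not the post's own example: it assumes a simple linear model on R's built-in `mtcars` data, resamples rows with replacement, and scores each fit on the rows left out of that resample.

```r
# Bootstrap sketch: estimate holdout error of a modeling procedure by
# resampling rows with replacement and scoring on the out-of-bag rows.
set.seed(42)
data(mtcars)
B <- 200                                          # number of bootstrap resamples
errors <- numeric(B)
for (b in 1:B) {
  idx <- sample(nrow(mtcars), replace = TRUE)     # bootstrap sample of row indices
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])  # refit the procedure on the resample
  oob <- mtcars[-unique(idx), ]                   # out-of-bag rows act as holdout data
  if (nrow(oob) > 0) {
    pred <- predict(fit, newdata = oob)
    errors[b] <- mean((oob$mpg - pred)^2)         # holdout MSE for this resample
  } else {
    errors[b] <- NA
  }
}
boot_mse <- mean(errors, na.rm = TRUE)            # bootstrap estimate of test error
```

Averaging the out-of-bag errors over many resamples gives an idea of how the same modeling procedure would perform on new data.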

Read More »

## 204.4.11 K-fold Cross Validation

Ten-fold Cross-Validation: Divide the data into 10 parts (randomly). Use 9 parts as training data (90%) and the tenth part as holdout data (10%). We can repeat this process 10 times: build 10 models and find the average error on the 10 holdout samples. This gives us an idea of the testing error. K-fold Cross …
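The ten-fold procedure above can be sketched in base R. A hedged example, using the built-in `mtcars` data in place of the post's dataset:

```r
# Ten-fold cross validation sketch: assign each row a random fold label,
# then hold each fold out once while training on the other nine.
set.seed(1)
data(mtcars)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold assignment
cv_errors <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]                # ~90% of rows as training data
  test  <- mtcars[folds == i, ]                # the held-out tenth
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  cv_errors[i] <- mean((test$mpg - pred)^2)    # holdout MSE for fold i
}
cv_mse <- mean(cv_errors)                      # average error over the 10 holdouts
```

The average of the ten holdout errors, `cv_mse`, is the cross-validation estimate of the testing error.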

Read More »

## 204.4.10 Cross Validation

Cross Validation: We always build and train a model with a training dataset. With default parameters, the fitting process optimises the fit to the training data as well as possible. Introducing a wholly different sample to the pre-built model won't replicate the accuracy we expect from the model. One way to solve the problem …

Read More »

## 204.4.1 Model Selection and Cross Validation

Building a model is not that difficult. However, tuning the model and checking whether it is working as we built it to is a different game. In this series, we will be covering methods and metrics to validate the model and find an optimum model for our requirement. …

Read More »

## 203.4.7 Cross Validation

Choosing an Optimal Model: Unfortunately, there is no scientific method of choosing the optimal model complexity that gives minimum test error. Training error is not a good estimate of the test error. There is always a bias-variance tradeoff in choosing the appropriate complexity of the model. We can use cross validation methods, boot …

Read More »

## 203.4.6 Model Bias-Variance Tradeoff

Model Bias and Variance. Overfitting is low bias with high variance: low training error – 'low bias'; high testing error and an unstable model – 'high variance', where the coefficients of the model change with small changes in the data. Underfitting is high bias with low variance: high training error – 'high bias' …

Read More »

## 203.4.5 Type of Datasets, Type of Errors and Problem of Overfitting

The Problem of Overfitting: In search of the best model on the given data, we add many predictors: polynomial terms, interaction terms, variable transformations, derived variables, indicator/dummy variables, etc. Most of the time we succeed in reducing the error. What error is this? So by complicating the model we …
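The point about complicating the model can be seen in a small simulation. This is a sketch with made-up data (not from the post): as polynomial terms are added, the *training* error only ever shrinks, even though the extra terms are fitting noise.

```r
# Overfitting sketch: training MSE of nested polynomial models.
# Adding terms can never increase the error on the data the model was fit to.
set.seed(5)
x <- runif(60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)      # noisy simulated signal
train_err <- sapply(1:8, function(d) {
  fit <- lm(y ~ poly(x, d))                     # degree-d polynomial model
  mean(resid(fit)^2)                            # training MSE
})
# train_err is non-increasing in d: the more complex model always fits
# the training data at least as well, which says nothing about test error.
```

This is exactly why training error alone cannot tell us when to stop adding predictors.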

Read More »

## 203.4.4 What is a Best Model?

What is the best model? How do we build it? A model with maximum accuracy / least error. A model that uses the maximum information available in the given data. A model that has minimum squared error. A model that captures all the hidden patterns in the data. A model that produces the best …

Read More »

## 203.4.3 ROC and AUC

ROC Curve – Interpretation: How many mistakes are we making to identify all the positives? How many mistakes are we making to identify 70%, 80% and 90% of the positives? 1 - Specificity (the false positive rate) gives us an idea of the mistakes that we are making. We would like to make 0% mistakes for …
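The ROC curve described above can be computed by hand in base R. A hedged sketch on hypothetical labels and scores (the post's own data is not used here): at each threshold we record sensitivity against 1 - specificity.

```r
# ROC sketch in base R: sweep score thresholds, compute sensitivity
# (true positive rate) and 1 - specificity (false positive rate) at each.
set.seed(7)
labels <- rbinom(100, 1, 0.5)                 # hypothetical true classes
scores <- labels * 0.3 + runif(100)           # hypothetical model scores
thresholds <- sort(unique(scores), decreasing = TRUE)
sens <- fpr <- numeric(length(thresholds))
for (j in seq_along(thresholds)) {
  pred <- as.integer(scores >= thresholds[j])
  sens[j] <- sum(pred == 1 & labels == 1) / sum(labels == 1)  # TPR
  fpr[j]  <- sum(pred == 1 & labels == 0) / sum(labels == 0)  # 1 - specificity
}
# AUC via the trapezoidal rule over the (fpr, sens) points
auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)
```

Plotting `sens` against `fpr` gives the ROC curve; the area under it, `auc`, summarizes how few mistakes the model makes while identifying the positives.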

Read More »

## 203.4.2 Calculating Sensitivity and Specificity in R

Calculating Sensitivity and Specificity – Building a Logistic Regression Model:

```r
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
Fiberbits_model_1 <- glm(active_cust ~ ., family = binomial, data = Fiberbits)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(Fiberbits_model_1)
##
## Call:
## glm(formula = active_cust ~ ., family = binomial, data = Fiberbits)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max …
```
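Since the Fiberbits file itself is not available here, the sensitivity/specificity calculation that follows the model fit can be sketched on simulated labels and predicted probabilities. All names and data below are hypothetical stand-ins, not the post's actual output.

```r
# Sensitivity/specificity sketch on simulated data (stand-in for the
# fitted probabilities a logistic model like Fiberbits_model_1 would give).
set.seed(3)
actual <- rbinom(200, 1, 0.5)                       # hypothetical true classes
prob   <- actual * 0.4 + runif(200) * 0.6           # hypothetical predicted probabilities
predicted <- as.integer(prob > 0.5)                 # classify at the 0.5 threshold
conf <- table(Predicted = predicted, Actual = actual)   # confusion matrix
sensitivity <- conf["1", "1"] / sum(conf[, "1"])    # TP / (TP + FN)
specificity <- conf["0", "0"] / sum(conf[, "0"])    # TN / (TN + FP)
```

Changing the 0.5 threshold trades sensitivity against specificity, which is what the ROC curve in the next post visualizes.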

Read More »