The Problem of Under-fitting
- Simple models are better. That is often true, but not always.
- We might have given up too early. Did we really capture all the information in the data?
- Did we do enough research and feature engineering to fit the best model? Is this the best model that can be fit on this data?
- By being overcautious about variance in the parameters, we might miss some patterns in the data.
- The model needs to be complex enough to capture all the information present.
- If the training error itself is high, how can we be sure about the model's performance on unseen data?
- Most accuracy and error statistics give us a clear idea of the training error. This is one advantage of under-fitting: since it shows up as high training error, we can identify it confidently.
- Under-fitting means:
  - A model that is too simple
  - A model with scope for improvement
  - A model with a lot of bias
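To make the idea concrete, here is a minimal sketch of an under-fit model. The data is synthetic and purely an assumption for illustration (the label is 1 only when the first feature falls in a middle band), and the names `stump` and `deeper` are hypothetical. A tree with a single split cannot describe a band, so its training accuracy is already low, while a tree with one more level of depth has enough capacity.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data: class 1 only when feature 0 lies in a middle band.
# Feature 1 is pure noise. A single split cannot isolate a band.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (np.abs(X[:, 0]) < 0.5).astype(int)

# Simple hold-out split: first 1500 rows for training, the rest for validation
X_train, y_train = X[:1500], y[:1500]
X_val, y_val = X[1500:], y[1500:]

# Too simple: one split only (high bias)
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
# One more level of depth is enough to capture the band
deeper = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

stump_train_acc = stump.score(X_train, y_train)
stump_val_acc = stump.score(X_val, y_val)
deeper_train_acc = deeper.score(X_train, y_train)
deeper_val_acc = deeper.score(X_val, y_val)

print(f"stump : train={stump_train_acc:.3f}, validation={stump_val_acc:.3f}")
print(f"deeper: train={deeper_train_acc:.3f}, validation={deeper_val_acc:.3f}")
```

The signature of under-fitting in this sketch is that the stump's training and validation accuracy are both low and close to each other. The problem is visible on the training data alone.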
Practice: Model with Huge Bias
- Let's simplify the model.
- Take the high variance model and prune it.
- Make it as simple as possible.
- Find the training error and validation error.
# We can prune the tree by changing the parameters
from sklearn import tree  # not needed if already imported earlier
tree_bias = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=10, min_samples_split=30, min_samples_leaf=30, max_leaf_nodes=20)
tree_bias.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10, max_features=None, max_leaf_nodes=20, min_samples_leaf=30, min_samples_split=30, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
# Training accuracy
tree_bias.score(X_train, y_train)
# Let's prune the tree further and oversimplify the model
tree_bias1 = tree.DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=1, min_samples_split=100, min_samples_leaf=100, max_leaf_nodes=2)
tree_bias1.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1, max_features=None, max_leaf_nodes=2, min_samples_leaf=100, min_samples_split=100, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='random')
# Training accuracy of the new model
tree_bias1.score(X_train, y_train)
# Validation accuracy on test data
tree_bias1.score(X_test, y_test)
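The cells above rely on `X_train`, `X_test`, `y_train` and `y_test` defined elsewhere. For readers who want to run the comparison on its own, here is a rough self-contained sketch. The dataset from `make_classification`, the 70/30 split, the fixed `random_state` values and the helper `train_and_validation_accuracy` are all assumptions added for illustration, not part of the original exercise. The tree parameters are the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; replace with your own X and y
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def train_and_validation_accuracy(model):
    """Fit the model and return (training accuracy, validation accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Same settings as the pruned tree above
pruned = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=10,
                                min_samples_split=30, min_samples_leaf=30,
                                max_leaf_nodes=20, random_state=0)
# Same settings as the oversimplified tree above
oversimplified = DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=1,
                                        min_samples_split=100, min_samples_leaf=100,
                                        max_leaf_nodes=2, random_state=0)

results = {}
for name, model in [("pruned", pruned), ("oversimplified", oversimplified)]:
    results[name] = train_and_validation_accuracy(model)
    train_acc, val_acc = results[name]
    print(f"{name}: train={train_acc:.3f}, validation={val_acc:.3f}")
```

When reading the output, look at the training accuracy first. If it is already low for the oversimplified tree, the model has high bias, and a validation accuracy that is close to it does not mean the model is good.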
In the next post, we will discuss how to choose the optimal model using the bias-variance trade-off.