What is the best model? How do we build it?
- A model with maximum accuracy / least error
- A model that uses maximum information available in the given data
- A model that has minimum squared error
- A model that captures all the hidden patterns in the data
- A model that produces the best prediction results
- How do we build or choose the best model?
- Error on the training data is not a good measure of performance on future data
- How do we select the best model from the set of available models?
- Are there any methods or metrics for choosing the best model?
- What is training error? What is testing error? What is hold-out sample error?
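To make the last question concrete, here is a minimal sketch of how training error and hold-out (testing) error can be measured. It is not the Fiberbits model: the data is synthetic, and the classifier is a simple nearest-neighbour rule swapped in because it needs only the standard library. The helper names `error_rate` and `predict_1nn` are made up for this illustration.

```python
import random

def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x (a model that memorizes)."""
    nearest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_y[nearest]

random.seed(0)
# Synthetic data (assumption): the label is 1 when x > 0.5,
# and roughly 20% of the labels are flipped to act as noise.
X = [random.random() for _ in range(400)]
y = [int(x > 0.5) ^ int(random.random() < 0.2) for x in X]

# Hold out the last 25% of the rows; the model never sees them while "training".
split = int(0.75 * len(X))
X_train, y_train = X[:split], y[:split]
X_hold, y_hold = X[split:], y[split:]

train_pred = [predict_1nn(X_train, y_train, x) for x in X_train]
hold_pred = [predict_1nn(X_train, y_train, x) for x in X_hold]

training_error = error_rate(y_train, train_pred)  # error on data the model has seen
holdout_error = error_rate(y_hold, hold_pred)     # error on data the model has not seen
print(f"training error: {training_error:.3f}")
print(f"hold-out error: {holdout_error:.3f}")
```

Each training point is its own nearest neighbour, so the training error here is zero, while the hold-out error is not. That gap is why error on the training data alone says little about performance on future data.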
Practice: The Most Accurate Model
- Data: Fiberbits/Fiberbits.csv
- Build a decision tree to predict active_cust
- What is the accuracy of your model?
- Grow the tree as much as you can and achieve 95% accuracy.
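# Load the data and prepare X and y to train the model
import numpy as np
import pandas as pd

Fiber_df = pd.read_csv("Fiberbits/Fiberbits.csv")
# Every column except the target is used as a feature
features = list(Fiber_df.drop(columns=['active_cust']).columns)
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])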
# Let's make a model by choosing some initial parameters.
from sklearn import tree

tree_config = tree.DecisionTreeClassifier(
    criterion='gini',
    splitter='best',
    max_depth=10,
    min_samples_split=2,  # scikit-learn requires an integer value of at least 2
    min_samples_leaf=30,
    max_leaf_nodes=10,
)
# Train the model and find its accuracy on the training data
tree_config.fit(X, y)
tree_config.score(X, y)
The first decision tree we have built is giving us an accuracy of 84.97% on the training data. We will grow the tree to achieve 95% accuracy.
# Remove the limits on depth, leaf size and leaf count so the tree can grow fully
tree_config_new = tree.DecisionTreeClassifier(
    criterion='gini',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None,
)

# Train the model and find its accuracy on the training data
tree_config_new.fit(X, y)
tree_config_new.score(X, y)
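To see what removing the limits does to the tree itself, the sketch below compares a constrained tree and a fully grown tree. It is only an illustration under stated assumptions: the data comes from scikit-learn's `make_classification` rather than Fiberbits.csv, and the names `X_demo`, `y_demo`, `small_tree` and `full_tree` are made up for this example, so the numbers it prints will not match the accuracies quoted in this post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 1000 rows, 8 features,
# with about 10% of the labels assigned at random to act as noise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)

# Constrained tree: limits similar to the first model in this post.
small_tree = DecisionTreeClassifier(
    max_depth=10, min_samples_leaf=30, max_leaf_nodes=10, random_state=0
).fit(X_demo, y_demo)

# Fully grown tree: no limits, so it keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

for name, model in [("constrained", small_tree), ("fully grown", full_tree)]:
    print(
        f"{name}: training accuracy={model.score(X_demo, y_demo):.4f}, "
        f"depth={model.get_depth()}, leaves={model.get_n_leaves()}"
    )
```

The fully grown tree fits the training rows, noisy ones included, by using many more leaves than the constrained tree. Its higher training accuracy is therefore bought with a much larger model.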
It may seem that accuracy is all that matters: the higher the accuracy, the better the model. But high accuracy comes at a price too. We will get to that in upcoming posts.