
# 203.3.11 Practice : Tree Building & Model Selection

### LAB: Tree Building & Model Selection

• Import the Fiberbits data. This is internet service provider data; the goal is to predict customer attrition (the `active_cust` flag) from the other variables
• Build a decision tree model on the Fiberbits data
• Prune the tree if required
• Find the final accuracy
• Is there any customer segment that is 100% active or 100% inactive?

### Solution

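``````library(rpart)  # provides rpart(), printcp(), plotcp() and prune()
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
# Grow a deliberately deep classification tree (small cp); it is pruned back later
Fiber_bits_tree <- rpart(active_cust ~ ., method = "class", control = rpart.control(minsplit = 30, cp = 0.001), data = Fiberbits)
Fiber_bits_tree``````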
``````## n= 100000
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##    1) root 100000 42141 1 (0.42141000 0.57859000)
##      2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948)
##        4) technical_issues_per_month>=1.5 11294   526 0 (0.95342660 0.04657340) *
##        5) technical_issues_per_month< 1.5 1054   428 0 (0.59392789 0.40607211)
##         10) number_plan_changes>=4.5 495    45 0 (0.90909091 0.09090909) *
##         11) number_plan_changes< 4.5 559   176 1 (0.31484794 0.68515206)
##           22) Speed_test_result< 79.5 45     0 0 (1.00000000 0.00000000) *
##           23) Speed_test_result>=79.5 514   131 1 (0.25486381 0.74513619) *
##      3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)
##        6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)
##         12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870)
##           24) number_plan_changes< 0.5 9735  1132 0 (0.88371854 0.11628146)
##             48) Speed_test_result< 77.5 7750   541 0 (0.93019355 0.06980645) *
##             49) Speed_test_result>=77.5 1985   591 0 (0.70226700 0.29773300)
##               98) income>=2008.5 1211   133 0 (0.89017341 0.10982659)
##                196) income< 2526 1163    85 0 (0.92691316 0.07308684) *
##                197) income>=2526 48     0 1 (0.00000000 1.00000000) *
##               99) income< 2008.5 774   316 1 (0.40826873 0.59173127)
##                198) income< 1785.5 270    97 0 (0.64074074 0.35925926) *
##                199) income>=1785.5 504   143 1 (0.28373016 0.71626984) *
##           25) number_plan_changes>=0.5 12452  4659 0 (0.62584324 0.37415676)
##             50) number_plan_changes>=1.5 7867  1358 0 (0.82738020 0.17261980) *
##             51) number_plan_changes< 1.5 4585  1284 1 (0.28004362 0.71995638) *
##         13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908)
##           26) income>=1945.5 1849   619 1 (0.33477555 0.66522445)
##             52) monthly_bill>=148 167    29 0 (0.82634731 0.17365269) *
##             53) monthly_bill< 148 1682   481 1 (0.28596908 0.71403092)
##              106) income< 2362 1407   472 1 (0.33546553 0.66453447)
##                212) technical_issues_per_month>=1.5 176    25 0 (0.85795455 0.14204545) *
##                213) technical_issues_per_month< 1.5 1231   321 1 (0.26076361 0.73923639)
##                  426) income>=2180.5 126    21 0 (0.83333333 0.16666667) *
##                  427) income< 2180.5 1105   216 1 (0.19547511 0.80452489) *
##              107) income>=2362 275     9 1 (0.03272727 0.96727273) *
##           27) income< 1945.5 3481   199 1 (0.05716748 0.94283252) *
##        7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)
##         14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)
##           28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)
##             56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429)
##              112) income< 1992.5 5828   888 0 (0.84763212 0.15236788)
##                224) number_plan_changes>=1.5 3053   189 0 (0.93809368 0.06190632) *
##                225) number_plan_changes< 1.5 2775   699 0 (0.74810811 0.25189189)
##                  450) number_plan_changes< 0.5 2284   358 0 (0.84325744 0.15674256)
##                    900) technical_issues_per_month>=3.5 1511    57 0 (0.96227664 0.03772336) *
##                    901) technical_issues_per_month< 3.5 773   301 0 (0.61060802 0.38939198)
##                     1802) monthly_bill>=148 364    41 0 (0.88736264 0.11263736) *
##                     1803) monthly_bill< 148 409   149 1 (0.36430318 0.63569682) *
##                  451) number_plan_changes>=0.5 491   150 1 (0.30549898 0.69450102) *
##              113) income>=1992.5 478    67 1 (0.14016736 0.85983264) *
##             57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627)
##              114) number_plan_changes>=1.5 1586   680 1 (0.42875158 0.57124842)
##                228) Speed_test_result< 81.5 894   370 0 (0.58612975 0.41387025)
##                  456) technical_issues_per_month>=3.5 301    54 0 (0.82059801 0.17940199) *
##                  457) technical_issues_per_month< 3.5 593   277 1 (0.46711636 0.53288364)
##                    914) income< 1604.5 261    92 0 (0.64750958 0.35249042) *
##                    915) income>=1604.5 332   108 1 (0.32530120 0.67469880) *
##                229) Speed_test_result>=81.5 692   156 1 (0.22543353 0.77456647) *
##              115) number_plan_changes< 1.5 3779   672 1 (0.17782482 0.82217518) *
##           29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181)
##             58) income< 1960.5 11360  2725 1 (0.23987676 0.76012324)
##              116) Num_complaints>=4.5 292    87 0 (0.70205479 0.29794521)
##                232) technical_issues_per_month>=3.5 197     0 0 (1.00000000 0.00000000) *
##                233) technical_issues_per_month< 3.5 95     8 1 (0.08421053 0.91578947) *
##              117) Num_complaints< 4.5 11068  2520 1 (0.22768341 0.77231659)
##                234) number_plan_changes>=1.5 4003  1180 1 (0.29477892 0.70522108)
##                  468) income>=1809.5 1229   582 1 (0.47355574 0.52644426)
##                    936) Speed_test_result>=79.5 477   132 0 (0.72327044 0.27672956) *
##                    937) Speed_test_result< 79.5 752   237 1 (0.31515957 0.68484043) *
##                  469) income< 1809.5 2774   598 1 (0.21557318 0.78442682) *
##                235) number_plan_changes< 1.5 7065  1340 1 (0.18966737 0.81033263) *
##             59) income>=1960.5 2703   187 1 (0.06918239 0.93081761) *
##         15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *``````
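The lab asks whether any segment is 100% active or inactive. In the printed tree, a terminal node with a loss of 0 is such a segment: node 22 (45 customers, all class 0), node 197 (48 customers, all class 1) and node 232 (197 customers, all class 0). The sketch below shows one way to pull such leaves out of a fitted tree instead of reading them off the printout. The helper name `pure_leaves` is hypothetical, and the `kyphosis` data that ships with `rpart` is used only as a stand-in so the block runs without the Fiberbits file.

```r
library(rpart)

# Hypothetical helper: terminal nodes of a classification tree that hold one class only.
pure_leaves <- function(fit) {
  frame <- fit$frame
  is_leaf <- frame$var == "<leaf>"
  # For method = "class" with default losses, `dev` is the number of misclassified rows.
  is_pure <- frame$dev == 0
  out <- frame[is_leaf & is_pure, c("n", "dev", "yval")]
  # yval is the index of the predicted class; map it back to the class label.
  out$class <- attr(fit, "ylevels")[out$yval]
  out
}

# Stand-in data (not the Fiberbits data); loose settings so the tree grows deep.
demo_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
                   control = rpart.control(minsplit = 10, cp = 0))
pure <- pure_leaves(demo_tree)
print(pure)

# path.rpart() prints the chain of split rules that leads to each listed node.
if (nrow(pure) > 0) {
  rules <- path.rpart(demo_tree, nodes = as.numeric(rownames(pure)))
}
```

On the lab data the equivalent call would be `pure_leaves(Fiber_bits_tree)`. Pure leaves in a deep tree tend to be small, so it is worth checking their size before treating them as stable segments.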

#### Plotting the Tree

``````library(rpart.plot)  # provides prp()
prp(Fiber_bits_tree,box.col=c("Grey", "Orange")[Fiber_bits_tree$frame$yval],varlen=0,faclen=0, type=1,extra=4,under=TRUE)``````

#### Code: Choosing cp and Cross-Validation Error

``printcp(Fiber_bits_tree)``
``````##
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class",
##     control = rpart.control(minsplit = 30, cp = 0.001))
##
## Variables actually used in tree construction:
## [1] income                     monthly_bill
## [3] Num_complaints             number_plan_changes
## [5] relocated                  Speed_test_result
## [7] technical_issues_per_month
##
## Root node error: 42141/100000 = 0.42141
##
## n= 100000
##
##           CP nsplit rel error  xerror      xstd
## 1  0.2477397      0   1.00000 1.00000 0.0037054
## 2  0.1639971      1   0.75226 0.75226 0.0034917
## 3  0.0876581      2   0.58826 0.58826 0.0032402
## 4  0.0293301      3   0.50061 0.50061 0.0030616
## 5  0.0239316      6   0.41261 0.41295 0.0028450
## 6  0.0081631      8   0.36475 0.37498 0.0027372
## 7  0.0024560      9   0.35659 0.35811 0.0026862
## 8  0.0022662     11   0.35168 0.35362 0.0026723
## 9  0.0018272     13   0.34714 0.34520 0.0026457
## 10 0.0016848     15   0.34349 0.34228 0.0026364
## 11 0.0014001     18   0.33832 0.33825 0.0026234
## 12 0.0013763     24   0.32859 0.33495 0.0026127
## 13 0.0013170     26   0.32583 0.33115 0.0026003
## 14 0.0012933     28   0.32320 0.32859 0.0025918
## 15 0.0011390     33   0.31563 0.32465 0.0025787
## 16 0.0010678     34   0.31449 0.32088 0.0025661
## 17 0.0010000     35   0.31342 0.31926 0.0025606``````
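The `rel error` and `xerror` columns are expressed relative to the root node error, so they are not error rates by themselves. Multiplying by the root node error (42141/100000) turns them into absolute misclassification rates, and one minus that is the accuracy. A small worked example, using three rows copied from the table above (the data frame name `cp_rows` is just for illustration):

```r
# Values copied from the printcp() output above
root_node_error <- 42141 / 100000

cp_rows <- data.frame(
  nsplit    = c(6, 8, 35),
  rel_error = c(0.41261, 0.36475, 0.31342),
  xerror    = c(0.41295, 0.37498, 0.31926)
)

# Relative errors are scaled so that the root node has error 1;
# multiplying by the root node error gives an absolute error rate.
cp_rows$train_accuracy <- 1 - root_node_error * cp_rows$rel_error
cp_rows$cv_accuracy    <- 1 - root_node_error * cp_rows$xerror
print(round(cp_rows, 4))
```

This gives a training accuracy of roughly 0.826, 0.846 and 0.868 for the trees with 6, 8 and 35 splits, and a cross-validated accuracy of roughly 0.826, 0.842 and 0.865. The cross-validated figure is the more honest estimate of how the tree would do on new customers.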

#### Plot: Choosing cp and Cross-Validation Error

``plotcp(Fiber_bits_tree) ``
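`plotcp()` draws a horizontal line one standard error above the lowest cross-validated error. A common convention (the "1-SE rule") is to pick the smallest tree whose `xerror` falls under that line. The sketch below applies the rule to the numbers printed by `printcp()` earlier; the table is typed in by hand so the block runs on its own. With a fitted model, `Fiber_bits_tree$cptable` holds the same columns. Cross-validation folds are random, so `xerror` and `xstd` will differ slightly between runs.

```r
# cp table copied from the printcp(Fiber_bits_tree) output
cp_table <- data.frame(
  CP = c(0.2477397, 0.1639971, 0.0876581, 0.0293301, 0.0239316, 0.0081631,
         0.0024560, 0.0022662, 0.0018272, 0.0016848, 0.0014001, 0.0013763,
         0.0013170, 0.0012933, 0.0011390, 0.0010678, 0.0010000),
  nsplit = c(0, 1, 2, 3, 6, 8, 9, 11, 13, 15, 18, 24, 26, 28, 33, 34, 35),
  xerror = c(1.00000, 0.75226, 0.58826, 0.50061, 0.41295, 0.37498, 0.35811,
             0.35362, 0.34520, 0.34228, 0.33825, 0.33495, 0.33115, 0.32859,
             0.32465, 0.32088, 0.31926),
  xstd = c(0.0037054, 0.0034917, 0.0032402, 0.0030616, 0.0028450, 0.0027372,
           0.0026862, 0.0026723, 0.0026457, 0.0026364, 0.0026234, 0.0026127,
           0.0026003, 0.0025918, 0.0025787, 0.0025661, 0.0025606)
)

# Row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# 1-SE rule: first (smallest) tree whose xerror is within one xstd of the minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- which(cp_table$xerror <= threshold)[1]

print(cp_table[c(best, one_se), ])
```

On these numbers the minimum is the last row (35 splits) and the 1-SE rule selects the row with 34 splits, so cross-validation alone does not call for heavy pruning here. The much smaller trees built in the following sections trade a few points of accuracy for a tree that is easier to read, which is a judgment call rather than something the table dictates.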

#### Pruning

``````Fiber_bits_tree_1<-prune(Fiber_bits_tree, cp=0.0081631)
Fiber_bits_tree_1``````
``````## n= 100000
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 100000 42141 1 (0.42141000 0.57859000)
##    2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##    3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)
##      6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)
##       12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870)
##         24) number_plan_changes< 0.5 9735  1132 0 (0.88371854 0.11628146) *
##         25) number_plan_changes>=0.5 12452  4659 0 (0.62584324 0.37415676)
##           50) number_plan_changes>=1.5 7867  1358 0 (0.82738020 0.17261980) *
##           51) number_plan_changes< 1.5 4585  1284 1 (0.28004362 0.71995638) *
##       13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908) *
##      7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)
##       14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)
##         28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)
##           56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *``````

#### Plot after Pruning

``prp(Fiber_bits_tree_1,box.col=c("Grey", "Orange")[Fiber_bits_tree_1$frame$yval],varlen=0,faclen=0, type=1,extra=4,under=TRUE)``

#### Choosing Cp and Cross Validation Error with New Model

``printcp(Fiber_bits_tree_1) ``
``````##
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class",
##     control = rpart.control(minsplit = 30, cp = 0.001))
##
## Variables actually used in tree construction:
## [1] income                     number_plan_changes
## [3] relocated                  Speed_test_result
## [5] technical_issues_per_month
##
## Root node error: 42141/100000 = 0.42141
##
## n= 100000
##
##          CP nsplit rel error  xerror      xstd
## 1 0.2477397      0   1.00000 1.00000 0.0037054
## 2 0.1639971      1   0.75226 0.75226 0.0034917
## 3 0.0876581      2   0.58826 0.58826 0.0032402
## 4 0.0293301      3   0.50061 0.50061 0.0030616
## 5 0.0239316      6   0.41261 0.41295 0.0028450
## 6 0.0081631      8   0.36475 0.37498 0.0027372``````
``plotcp(Fiber_bits_tree_1) ``

#### Pruning Further

``````Fiber_bits_tree_2<-prune(Fiber_bits_tree, cp=0.0239316)
Fiber_bits_tree_2``````
``````## n= 100000
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 100000 42141 1 (0.42141000 0.57859000)
##    2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##    3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)
##      6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)
##       12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870) *
##       13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908) *
##      7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)
##       14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)
##         28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)
##           56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *``````

#### Tree after Pruning Further

``prp(Fiber_bits_tree_2,box.col=c("Grey", "Orange")[Fiber_bits_tree_2$frame$yval],varlen=0,faclen=0, type=1,extra=4,under=TRUE)``
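The lab also asks for the final accuracy, which the steps above do not compute directly. One way to get it is to compare predicted and observed classes, ideally on rows the tree did not see during training. The sketch below is an illustration under assumptions: `tree_accuracy` is a hypothetical helper, the 70/30 split and the seed are arbitrary, and the `kyphosis` data from `rpart` stands in for Fiberbits so that the block is runnable as-is.

```r
library(rpart)

# Hypothetical helper: share of rows whose predicted class matches the observed class.
tree_accuracy <- function(fit, data, target) {
  predicted <- predict(fit, newdata = data, type = "class")
  mean(as.character(predicted) == as.character(data[[target]]))
}

set.seed(42)  # arbitrary seed, only to make the split repeatable
train_rows <- sample(nrow(kyphosis), size = round(0.7 * nrow(kyphosis)))
train <- kyphosis[train_rows, ]
test  <- kyphosis[-train_rows, ]

fit <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")

# Confusion matrix on the held-out rows: observed classes in rows, predictions in columns
print(table(observed = test$Kyphosis,
            predicted = predict(fit, newdata = test, type = "class")))
cat("train accuracy:", tree_accuracy(fit, train, "Kyphosis"), "\n")
cat("test accuracy: ", tree_accuracy(fit, test, "Kyphosis"), "\n")
```

For the lab data, `tree_accuracy(Fiber_bits_tree_2, Fiberbits, "active_cust")` would give the training accuracy of the final pruned tree. Summing the losses of its terminal nodes in the printout (17388 out of 100000) suggests a value of about 0.826.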

## Conclusion

• Decision trees are powerful and very simple to represent and understand.
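• One needs to be careful with the size of the tree. A tree that is allowed to keep splitting will eventually fit noise in the training data, so decision trees overfit easily unless they are pruned
• They can be applied to both numeric and categorical data, and handle categorical predictors particularly naturally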
• One can use decision trees to perform a basic customer segmentation and build a different predictive model on the segments
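The segmentation idea in the last point can be sketched as follows: each terminal node of a fitted tree defines a segment, and `fit$where` records which leaf every training row landed in. This is a minimal illustration on the `kyphosis` stand-in data from `rpart`, not on Fiberbits; the object names are hypothetical.

```r
library(rpart)

seg_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# fit$where gives, for each training row, the row index in fit$frame of its leaf;
# the row names of fit$frame are the node numbers shown in the printed tree.
leaf_node <- as.numeric(rownames(seg_tree$frame))[seg_tree$where]

# One data frame per leaf: these are the segments.
segments <- split(kyphosis, leaf_node)

segment_summary <- data.frame(
  node          = names(segments),
  n             = sapply(segments, nrow),
  share_present = sapply(segments, function(d) mean(d$Kyphosis == "present"))
)
print(segment_summary)
```

Each element of `segments` can then be passed to a separate model. Small segments may not have enough rows to support a model of their own, so a shallow tree is usually the better starting point for this kind of segmentation.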