36-464/36-664: Applied Multivariate Methods
March 4th Scribe Notes: Classification Trees
Scribes: Jonathan W. Yu and Hannah Worrall
March 11, 2014

1 Definition

• Classification Trees (CT): predict a qualitative response instead of a quantitative one. The prediction for each terminal node is the most common class among the training observations that fall into that node.

• Regression Trees (RT): predict a quantitative response. The prediction for each terminal node is the mean of the training observations that fall into that node.

2 How to grow a CT

RSS cannot be used to choose splits since the response is categorical. Instead, we can use the classification error rate to decide how best to split the tree:

E = 1 - \max_k(\hat{p}_{mk})

where \hat{p}_{mk} is the proportion of training observations in the mth region that are from class k. This measure is not sensitive enough for growing the tree, so two other measures are preferred.

1. Gini Index

G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})

The Gini index takes a small value when all the \hat{p}_{mk} are close to 0 or 1 (so minimizing it reduces error in the assignments). It is a measure of node purity: a small value indicates a pure node.

2. Cross-Entropy

D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \ge 0

Again, this is a measure of node purity, and small values indicate a pure node.

Both the Gini index and the cross-entropy are more sensitive to node purity (the quality of a split) than the classification error rate. If prediction accuracy is the main concern, we look instead at the classification error rate.

3 Pruning

Any of the three approaches above can be used to check the effects of pruning; pruning generally improves the (test) error rate. Figure 2 in the Appendix shows the Carseats tree pruned to 9 terminal nodes.

4 Example: Carseats

Sales is a continuous response variable. We recode it as a binary variable High, equal to Yes (1) when Sales > 8 and No (0) when Sales ≤ 8.

library(ISLR); library(tree)                 # Carseats data and the tree() function
High <- factor(ifelse(Carseats$Sales > 8, "Yes", "No"))   # binary response
Carseats <- data.frame(Carseats, High)
# Regress High on all variables except Sales
tree.carseats <- tree(High ~ . - Sales, Carseats)
summary(tree.carseats)
# Plot the tree and label the splits (see Figure 1 in the Appendix)
plot(tree.carseats)
text(tree.carseats, pretty = 0)

Note: the "Misclassification Error Rate" reported by summary() is the training error rate, i.e. the percentage of misclassified training observations.

To properly judge the classification performance of the CT, we should look at the test error rather than the training error: fit on a training set, predict on a test set, and divide the total number of correct predictions by the total number of observations in the test data.

set.seed(2)
train <- sample(1:nrow(Carseats), 200)       # 200 training and 200 test observations
Carseats.test <- Carseats[-train, ]
High.test <- High[-train]
tree.carseats <- tree(High ~ . - Sales, Carseats, subset = train)
tree.pred <- predict(tree.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)                  # confusion matrix on the test set

5 Summary

• Regression trees and classification trees are easy to understand.
• They can be displayed graphically.
• They can handle qualitative responses.

However, trees do not necessarily have the best prediction accuracy, and they have high variance.

6 Future: Random Forests

Random forests are useful because single trees suffer from high variance. Bagging can be used for prediction, but the bagged trees are correlated; random forests decorrelate the trees (see Figure 3 in the Appendix).

A Figures

[Figure 1 image omitted: the unpruned classification tree, with splits on ShelveLoc, Price, Income, Advertising, US, Population, CompPrice, and Age, and terminal nodes labeled Yes/No.]

Figure 1: Classification tree of Carseats (the unpruned tree), partitioning the Carseats dataset. If we enter the tree object at the R prompt, the output corresponds to each branch of the tree: it displays the split criterion from top to bottom, the number of observations in each branch, the deviance, the overall prediction of the branch (Yes or No), and the fraction of observations in the branch taking each of the values Yes/No.
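The notes give only the final pruning command in the caption of Figure 2. Below is a minimal sketch of how the error-rate plots in Figure 2(a) and the pruned tree in Figure 2(b) could be reproduced; it assumes the tree.carseats fit from Section 4, and it uses cv.tree() from the tree package, which is not shown in the original notes (the seed is also an assumption).

# Cross-validated error as a function of tree size and of the
# cost-complexity parameter k (assumes tree.carseats from Section 4)
set.seed(3)                                   # assumed seed, not from the notes
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b",
     xlab = "size (terminal nodes)", ylab = "CV errors")
plot(cv.carseats$k, cv.carseats$dev, type = "b",
     xlab = "k (cost-complexity parameter)", ylab = "CV errors")

# Prune to 9 terminal nodes, as in Figure 2(b)
prune.carseats <- prune.misclass(tree.carseats, best = 9)
plot(prune.carseats)
text(prune.carseats, pretty = 0)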
[Figure 2 image omitted]
(a) Pruning the tree: error rates plotted against both size (number of terminal nodes) and k (the cost-complexity parameter). (b) Plot of the classification tree after pruning: 9 nodes.

Figure 2: To prune the tree, we can use prune.misclass(). In this case, we pruned it down to a 9-node tree using the command prune.carseats <- prune.misclass(tree.carseats, best = 9).

[Figure 3 image omitted]
(a) Random forest/bagging with 1, 4, 10, and 12 bins. (b) Random forest/bagging with 14, 17, 22, and 25 bins.

Figure 3: Examples of random forests and bagging for the Syria database after using the bootstrap to fit the data. Bagging is a general-purpose procedure for reducing the variance of a statistical learning method: we build a number of decision trees on bootstrapped training samples. In a random forest, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
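The Syria database used in Figure 3 is not included with the notes, so the following is a minimal sketch of bagging versus a random forest on the Carseats data instead, using the randomForest package (not mentioned in the notes); it assumes the Carseats data frame (with High), the train index, and the test objects from Section 4, and the seed is an assumption.

library(randomForest)
set.seed(1)                                   # assumed seed, not from the notes

# Bagging: a random forest with m = p, i.e. all 10 predictors tried at every split
bag.carseats <- randomForest(High ~ . - Sales, data = Carseats, subset = train,
                             mtry = 10, importance = TRUE)

# Random forest: only m = 3 (about sqrt(p)) randomly chosen predictors per split,
# which decorrelates the trees
rf.carseats <- randomForest(High ~ . - Sales, data = Carseats, subset = train,
                            mtry = 3, importance = TRUE)

# Compare test-set error rates
bag.pred <- predict(bag.carseats, newdata = Carseats.test)
rf.pred  <- predict(rf.carseats,  newdata = Carseats.test)
mean(bag.pred != High.test)
mean(rf.pred  != High.test)

Setting mtry equal to the number of predictors gives ordinary bagging; choosing a smaller mtry is what makes the forest's trees less correlated.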