This chapter describes tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions; because the splitting rules can be summarized in a tree, these approaches are known as decision-tree methods. Tree-based methods are simple and useful for interpretation. We also introduce bagging, random forests, and boosting, which combine many trees to yield a single consensus prediction. We consider both regression and classification problems, first discussing regression problems and then moving on to classification.
The example tree consists of internal nodes and three terminal nodes, or leaves; the prediction given by each leaf is the mean of the response for the training observations that fall there.
In keeping with the tree analogy, the regions $R_1, R_2, \ldots, R_J$ are known as terminal nodes, or leaves. Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. The points along the tree where the predictor space is split are referred to as internal nodes; in the tree diagram, the internal nodes are indicated by the text of their splitting rules. Compared with a regression model, a tree is easy to display, interpret, and explain.

To build a regression tree, we divide the predictor space - that is, the set of possible values for $X_1, X_2, \ldots, X_p$ - into $J$ distinct and non-overlapping regions $R_1, R_2, \ldots, R_J$. In theory the regions could have any shape, but we divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes that minimize the RSS, given by

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2,$$

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.

Since it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes, we take a top-down, greedy approach that is known as recursive binary splitting. It is top-down because it begins at the top of the tree and then successively splits the predictor space; it is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step. Having made the first split that minimizes the RSS, we again select the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
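To make the greedy split search concrete, here is a minimal from-scratch sketch of a single step of recursive binary splitting. The function name `best_split` and the synthetic data are assumptions made only for illustration; a full tree-grower would recurse on each resulting half.

```python
# Sketch: one step of recursive binary splitting - find the predictor j and
# cutpoint s that minimize the RSS of the two resulting half-planes.
# Purely illustrative; a real implementation would then recurse on each half.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)                 # (predictor j, cutpoint s, RSS)
    n, p = X.shape
    for j in range(p):                          # greedy: scan every predictor...
        for s in np.unique(X[:, j]):            # ...and every candidate cutpoint
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of the split: each region predicts its own mean response.
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = (X[:, 0] > 5).astype(float) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))   # should pick predictor 0 with a cutpoint near 5
```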
Once the regions $R_1, \ldots, R_J$ have been created by recursive binary splitting, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.
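In practice one would rarely grow the tree by hand. The sketch below, using scikit-learn's `DecisionTreeRegressor` on assumed synthetic data, prints a grown tree whose leaves each predict the mean response of their training observations; the parameter choices (e.g. `max_depth=3`) are illustrative, not taken from the text.

```python
# A minimal sketch of a regression tree: each leaf predicts the mean response
# of the training observations that fall into it. Data and settings are
# illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))            # two predictors
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Grow a tree with recursive binary splitting; here we simply cap the depth.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

# The printed rules show the internal nodes (splits) and the leaves,
# each leaf labelled with the mean response of its training observations.
print(export_text(tree, feature_names=["X1", "X2"]))

# Prediction for a test point = mean of the training responses in its leaf.
print(tree.predict([[2.0, 5.0]]))
```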
*Figure: Top Left: a partition of two-dimensional feature space that could not result from recursive binary splitting. Top Right: the output of recursive binary splitting on a two-dimensional example. Bottom Left: a tree corresponding to the partition in the top right panel. Bottom Right: a perspective plot of the prediction surface corresponding to that tree.*

The process described above is likely to overfit the data, leading to poor test set performance, because the resulting tree may be too complex. A smaller tree with fewer splits (that is, fewer regions $R_1, \ldots, R_J$) might lead to lower variance and better interpretation at the cost of a little bias. One alternative is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy results in smaller trees, but it is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split - that is, a split that leads to a large reduction in RSS later on.
A better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a subtree. Cost complexity pruning - also known as weakest link pruning - is used to do this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$: for each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|$$

is as small as possible. The first term is the RSS of the subtree; $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the mean of the training observations in $R_m$. The tuning parameter $\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data, and we can select its optimal value using cross-validation:

1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.
3. Use $K$-fold cross-validation to choose $\alpha$: for each fold $k$, repeat Steps 1 and 2 on all but the $k$th fold, evaluate the mean squared prediction error on the data in the left-out $k$th fold as a function of $\alpha$, and average the results over the folds.
4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$.
We could then pick the value of $\alpha$ at which the average cross-validation error is lowest, or apply the one-standard-error rule and take the simplest subtree whose error is within one standard error of that minimum. In the example, the tree size selected by the one-standard-error rule is 3.
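A possible sketch of this procedure in scikit-learn, which exposes cost complexity pruning through `ccp_alpha` and `cost_complexity_pruning_path`. The synthetic data, fold count, and minimum-error selection (the one-standard-error rule is only indicated in a comment) are assumptions for illustration, and the loop cross-validates the pruning strength directly rather than re-growing the pruning path within each fold as in the algorithm above.

```python
# Sketch: grow a large tree, obtain the sequence of subtrees via cost
# complexity pruning, and choose alpha by K-fold cross-validation.
# Synthetic data and all settings are assumptions for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Step 1: grow a large tree on the training data.
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

# Step 2: the pruning path gives the candidate alphas (one best subtree each).
alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Step 3: K-fold cross-validation of MSE as a function of alpha.
cv_mse = []
for a in alphas:
    model = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

# Step 4: refit on all the data with the chosen alpha (here: minimum CV error;
# the one-standard-error rule would instead pick the largest alpha whose error
# is within one SE of this minimum, giving a smaller tree).
best_alpha = alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha,
                                   random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```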
A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate,

$$E = 1 - \max_k \left( \hat{p}_{mk} \right),$$

where $\hat{p}_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class. However, classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable. The Gini index is defined by

$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right).$$

The Gini index is referred to as a measure of node purity - a small value indicates that a node contains predominantly observations from a single class. An alternative to the Gini index is cross-entropy, given by

$$D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.$$

It turns out that the Gini index and the cross-entropy are very similar numerically.
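As a small numerical illustration of these three impurity measures, the sketch below evaluates the classification error rate, the Gini index, and the cross-entropy for two assumed vectors of class proportions, one nearly pure and one maximally impure.

```python
# Impurity measures for a node with class proportions p_hat (assumed examples).
import numpy as np

def class_error(p):
    # E = 1 - max_k p_mk
    return 1.0 - np.max(p)

def gini(p):
    # G = sum_k p_mk (1 - p_mk); small when the node is pure
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    # D = -sum_k p_mk log p_mk (classes with p_mk = 0 contribute 0)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

p_hat = np.array([0.9, 0.1])     # nearly pure node
q_hat = np.array([0.5, 0.5])     # maximally impure node
for p in (p_hat, q_hat):
    print(class_error(p), gini(p), cross_entropy(p))
# The Gini index and cross-entropy are both small for the pure node and
# track each other closely, illustrating their numerical similarity.
```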
Which approach works better depends on the problem: when the true decision boundary is linear, a linear model such as linear regression tends to outperform a tree, whereas when the boundary is highly non-linear, a tree-based (non-linear) model can do better. Trees also have some practical advantages: decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters; they can be displayed graphically and are easily interpreted even by a non-expert; and they can handle qualitative predictors without the need to create dummy variables. On the other hand, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. Recall that, given a set of $n$ independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2 / n$: averaging a set of observations reduces variance. Since we do not usually have multiple training sets, we instead bootstrap, by taking repeated samples from the (single) training data set. We generate $B$ different bootstrapped training data sets, train our method on the $b$th bootstrapped training set in order to get $\hat{f}^{*b}(x)$, the prediction at a point $x$, and average all the predictions:

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$

This is called bagging.
To apply bagging to regression trees, we construct $B$ regression trees using $B$ bootstrapped training sets and average the resulting predictions. Bagging extends to classification trees as well: for each test observation, we record the class predicted by each of the $B$ trees, and take a majority vote, so that the overall prediction is the most commonly occurring class among the $B$ predictions.
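The following sketch implements bagging for classification trees by hand on assumed synthetic data: it draws $B$ bootstrap samples, fits an unpruned tree to each, and takes a majority vote over the $B$ predictions (for regression we would average instead). The value $B = 100$ and the data-generating call are illustrative choices; in practice `sklearn.ensemble.BaggingClassifier` provides the same idea.

```python
# Sketch: bagging classification trees by hand. Synthetic data, B = 100,
# and all other settings are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
B = 100
n = X_train.shape[0]
all_preds = np.empty((B, X_test.shape[0]), dtype=int)

for b in range(B):
    # Bootstrap sample: draw n observations with replacement from the training set.
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeClassifier(random_state=b)      # deep, unpruned tree
    tree.fit(X_train[idx], y_train[idx])
    all_preds[b] = tree.predict(X_test)

# Majority vote across the B trees (for regression we would average instead).
vote = (all_preds.mean(axis=0) >= 0.5).astype(int)     # binary classes 0/1
print("bagged test accuracy:", (vote == y_test).mean())
```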
*Figure: bagging and random forest results. The test error (black and orange) is shown as a function of $B$, the number of bootstrapped training sets used; random forests were applied with $m = \sqrt{p}$. The dashed line indicates the error of a single classification tree, and the OOB error, which in this case is considerably lower, is also shown.*

There is a straightforward way to estimate the test error of a bagged model. The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations; on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for each observation using only the trees for which it was OOB, and average those predictions (or take a majority vote). The resulting OOB error is a valid estimate of the test error and, if $B$ is large, is virtually equivalent to the LOO cross-validation error for bagging.

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees, which reduces the variance when we average the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random selection of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. Typically the number of predictors considered at each split is approximately equal to the square root of the total number of predictors, $m \approx \sqrt{p}$.

Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification; here we describe boosting for decision trees. Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model; each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially. At each step we fit a tree with $d$ splits (i.e. $d + 1$ terminal nodes) to the training data paired with the current residuals, add a shrunken version of the new tree to the fitted function,

$$\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^{b}(x),$$

and update the residuals, $r_i \leftarrow r_i - \lambda \hat{f}^{b}(x_i)$. The output is the boosted model

$$\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^{b}(x).$$

Unlike fitting a single large decision tree, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly: given the current model, we fit a decision tree to the residuals and add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter $d$ in the algorithm. By fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it does not perform well, and the shrinkage parameter $\lambda$ slows the process down even further, allowing more and differently shaped trees to attack the residuals. Boosting for classification is similar in spirit to boosting for regression, but is a bit more complex.

Boosting has three tuning parameters (a brief sketch of these knobs follows the list):

1. The number of trees $B$. Unlike bagging and random forests, boosting can overfit if $B$ is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select $B$.
2. The shrinkage parameter $\lambda$, which controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem.
3. The number of splits $d$ in each tree, which controls the complexity of the boosted ensemble. Often $d = 1$ works well, in which case each tree is a stump, consisting of a single split and resulting in an additive model. More generally, $d$ is the interaction depth, and controls the interaction order of the boosted model, since $d$ splits can involve at most $d$ variables.
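The sketch below contrasts a random forest (with $m \approx \sqrt{p}$ candidate predictors per split and an OOB error estimate) and gradient boosting with small trees and a small shrinkage parameter. The data set and all parameter values ($B$, $\lambda$, $d$) are assumptions chosen only to illustrate the knobs discussed above, not recommendations.

```python
# Sketch: random forest (decorrelated trees, m ~ sqrt(p)) and boosting
# (small trees grown sequentially with shrinkage lambda). Data and all
# parameter values are illustrative assumptions.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: B = 500 bootstrapped trees, m = sqrt(p) candidate predictors
# per split; oob_score gives the out-of-bag estimate of test performance.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("RF OOB R^2:", rf.oob_score_)
print("RF test MSE:", mean_squared_error(y_test, rf.predict(X_test)))

# Boosting: B trees of depth d fitted sequentially to residuals, each added
# with shrinkage lambda (learning_rate). Small lambda + large B learns slowly.
boost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                  max_depth=1, random_state=0)   # d = 1: stumps
boost.fit(X_train, y_train)
print("Boosting test MSE:", mean_squared_error(y_test, boost.predict(X_test)))
```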
Variable importance measures provide an overall summary of how useful each predictor is in a bagged model or random forest. In the case of bagged or random forest regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all $B$ trees; a large value indicates an important predictor. Similarly, in the case of bagged or random forest classification trees, we add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all $B$ trees.
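As a sketch, the impurity-based importances described here are what scikit-learn reports through the `feature_importances_` attribute of a fitted random forest; the synthetic data and feature names below are assumptions for illustration.

```python
# Sketch: impurity-based variable importance from a random forest.
# Synthetic data and feature names are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
names = [f"X{j+1}" for j in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            random_state=0).fit(X, y)

# feature_importances_ averages, over all trees, the total decrease in node
# impurity (Gini here; RSS for regression trees) due to splits on each predictor.
for j in np.argsort(rf.feature_importances_)[::-1]:
    print(f"{names[j]}: {rf.feature_importances_[j]:.3f}")
```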