Lecture video link: https://youtu.be/wr9gUr-eWdA
Decision trees ~ one of our first examples of a non-linear model.
Greedy, Top-Down, Recursive Partitioning
ex. Predict whether skiing is possible, given location and time of year:
$y = 1$: regions where skiing is possible
$y = 0$: otherwise
Figure: latitude vs. months of the year.
(Source: https://youtu.be/wr9gUr-eWdA 6 min. 30 sec.)
Parent region $R_p$, looking for a split $s_p(j, t) = \left(\{x \mid x_j < t,\ x \in R_p\},\ \{x \mid x_j \ge t,\ x \in R_p\}\right)$, i.e. a feature $j$ and threshold $t$ that partition $R_p$ into children $R_1$ and $R_2$.
How to choose these splits?
Define $L(R)$ := loss on region $R$.
Given $C$ classes, define $\hat{p}_c$ to be the proportion of examples in $R$ that are of class $c$.
Choose the split that maximizes the decrease in loss:
$$\max_{j,\,t}\ L(R_p) - \left(\frac{|R_1|\,L(R_1) + |R_2|\,L(R_2)}{|R_1| + |R_2|}\right).$$
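A minimal sketch of this greedy split search, assuming a generic `loss` function that maps a region's label vector to that region's loss (the names here are illustrative, not from the lecture):

```python
import numpy as np

def best_split(X, y, loss):
    """Greedy search over (feature j, threshold t) for the split that
    maximizes L(R_p) - (|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|).
    `loss` maps a region's label vector to that region's loss."""
    n, f = X.shape
    parent = loss(y)
    best_j, best_t, best_dec = None, None, 0.0
    for j in range(f):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            if not left.any() or left.all():
                continue  # skip splits that leave a child empty
            child = (left.sum() * loss(y[left])
                     + (~left).sum() * loss(y[~left])) / n
            if parent - child > best_dec:
                best_j, best_t, best_dec = j, t, parent - child
    return best_j, best_t, best_dec
```

Recursing on the two children until a stopping criterion is hit gives the full greedy, top-down procedure.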
Misclassification Loss
$$L_{\text{misclass}}(R) = 1 - \max_c \hat{p}_c.$$
$L_{\text{misclass}}$ has issues:
(Source: https://youtu.be/wr9gUr-eWdA 15 min. 38 sec.)
E.g., take a parent region with 900 positives and 100 negatives, and compare the split $\{(700+,\,100-),\ (200+,\,0-)\}$ with the split $\{(400+,\,100-),\ (500+,\,0-)\}$.
For both splits, the children's weighted misclassification loss equals the parent's: $L_{\text{misclass}}(R_p) = 0.1$ and $\frac{|R_1|\,L(R_1) + |R_2|\,L(R_2)}{|R_1| + |R_2|} = 0.1$.
So misclassification loss registers no decrease for either split, even though the second split produces a completely pure child region.
Instead, define cross-entropy loss:
$$L_{\text{cross}}(R) = -\sum_c \hat{p}_c \log_2 \hat{p}_c.$$
(Source: https://youtu.be/wr9gUr-eWdA 28 min. 16 sec.)
Gini loss: $L_{\text{Gini}}(R) = \sum_c \hat{p}_c\,(1 - \hat{p}_c)$.
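A short check of the three losses on the split example above (a sketch; the helper names are mine, and the split counts are the reconstructed figures from the lecture's example):

```python
import numpy as np

def misclass(p):       # p: vector of class proportions within a region
    return 1 - p.max()

def cross_entropy(p):
    p = p[p > 0]       # 0 * log(0) = 0 by convention
    return -(p * np.log2(p)).sum()

def gini(p):
    return (p * (1 - p)).sum()

def weighted(loss, regions):
    """Size-weighted loss of regions, each given as a vector of class counts."""
    n = sum(c.sum() for c in regions)
    return sum(c.sum() / n * loss(c / c.sum()) for c in regions)

parent  = [np.array([900, 100])]
split_a = [np.array([700, 100]), np.array([200, 0])]
split_b = [np.array([400, 100]), np.array([500, 0])]

for loss in (misclass, cross_entropy, gini):
    print(loss.__name__,
          [round(weighted(loss, s), 3) for s in (parent, split_a, split_b)])
# misclass scores all three the same (0.1): no split ever looks useful.
# cross_entropy and gini both strictly prefer split_b, which has a pure child.
```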
Regression Trees
Take the ski example again, but now predict the amount of snowfall you would expect in that area around that time of year.
(Source: https://youtu.be/wr9gUr-eWdA 35 min. 37 sec.)
Predict the mean of the region: $\hat{y}_m = \frac{\sum_{i \in R_m} y_i}{|R_m|}$, with squared loss $L_{\text{squared}}(R) = \frac{\sum_{i \in R} (y_i - \hat{y})^2}{|R|}$.
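The corresponding per-region prediction and loss, as a minimal sketch:

```python
import numpy as np

def region_prediction(y):
    # Predict the mean response of the training points that fall in the region.
    return y.mean()

def squared_loss(y):
    # L(R) = (1/|R|) * sum_i (y_i - y_hat)^2, with y_hat the region mean.
    return ((y - y.mean()) ** 2).mean()
```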
Categorical Vars
Given $q$ categories, there are $2^{q-1} - 1$ possible binary splits.
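One way to see the count is to enumerate the splits directly; a sketch (names are mine):

```python
from itertools import chain, combinations

def category_splits(cats):
    """Yield the 2**(q-1) - 1 unordered splits of q categories into two
    non-empty groups; pinning cats[0] to the left side avoids counting
    each split twice (once mirrored)."""
    first, rest = cats[0], cats[1:]
    # r stops before len(rest): taking all of `rest` would empty the right side.
    for s in chain.from_iterable(combinations(rest, r) for r in range(len(rest))):
        left = (first,) + s
        right = tuple(c for c in rest if c not in s)
        yield left, right

print(list(category_splits(("A", "B", "C"))))
# [(('A',), ('B', 'C')), (('A', 'B'), ('C',)), (('A', 'C'), ('B',))] -> 2^2 - 1 = 3
```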
Regularization of DTs
Common heuristics (see the sketch below):
- minimum leaf size
- maximum depth
- maximum number of nodes
- minimum decrease in loss
- pruning (grow the full tree, then greedily remove splits while validation error improves)
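As an illustration (scikit-learn is my choice here, not the lecture's), these heuristics map directly onto standard hyperparameters:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical settings; the point is the mapping from heuristic to knob.
tree = DecisionTreeClassifier(
    min_samples_leaf=10,         # minimum leaf size
    max_depth=5,                 # maximum depth
    max_leaf_nodes=32,           # cap on the number of leaf nodes
    min_impurity_decrease=1e-3,  # minimum decrease in loss required to split
)
```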
Runtime
$n$ = number of examples
$f$ = number of features
$d$ = depth of the tree
Test time: $O(d)$ (just walk one root-to-leaf path).
Train time
Each point is part of $O(d)$ nodes.
The cost of a point at each node is $O(f)$.
So the total cost is $O(nfd)$.
The data matrix is of size $n \times f$, so training time is roughly linear in the size of the data (times the depth $d$).
Disadvantage - no additive structure
(Source: https://youtu.be/wr9gUr-eWdA 49 min. 24 sec.)
For a decision tree, you'd have to ask a lot of axis-aligned questions to even somewhat approximate the line above.
Recap
| Pros | Cons |
| --- | --- |
| Easy to explain | High variance |
| Interpretable | Bad at additive structure |
| Handles categorical vars | Low predictive accuracy |
| Fast | |
Take $X_i$'s, random variables (RVs) that are independent and identically distributed (i.i.d.).
$\text{Var}(X_i) = \sigma^2.$
$$\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{\sigma^2}{n}.$$
Drop the independence assumption, so now the $X_i$'s are only i.d. (identically distributed).
$X_i$'s correlated by $\rho$:
$$\text{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2.$$
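A quick numerical check of this formula (my own simulation, not from the lecture), using equicorrelated Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma, trials = 10, 0.5, 2.0, 200_000

# Equicorrelated variables: a shared component plus an independent one,
# giving Var(X_i) = sigma^2 and Corr(X_i, X_j) = rho for i != j.
z = rng.normal(size=(trials, 1))   # shared across the n variables
e = rng.normal(size=(trials, n))   # independent per variable
x = sigma * (np.sqrt(rho) * z + np.sqrt(1 - rho) * e)

print(x.mean(axis=1).var())                       # empirical Var(X-bar) ~ 2.2
print(rho * sigma**2 + (1 - rho) / n * sigma**2)  # formula: 2.0 + 0.2 = 2.2
```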
Ways to ensemble
Have a true population $P$.
Training set $S \sim P$ (a sample).
Assume $P = S$ (treat the sample as the population).
Bootstrap samples $Z \sim S$ (drawn with replacement, same size as $S$).
Bagging (Bootstrap Aggregation): take bootstrap samples $Z_1, \dots, Z_M$.
Train model $G_m$ on $Z_m$, and aggregate: $G(x) = \frac{1}{M}\sum_{m=1}^{M} G_m(x)$.
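A minimal bagging sketch, assuming a stand-in `fit(X, y)` that returns a trained predictor `g(x)` (both names are mine):

```python
import numpy as np

def bag(X, y, fit, M, rng):
    """Bagging sketch: train M models, each on its own bootstrap
    sample Z_m ~ S, then average their predictions."""
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement
        models.append(fit(X[idx], y[idx]))
    # Aggregate: average the M predictions.
    return lambda x: np.mean([g(x) for g in models], axis=0)
```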
Bias-Variance Analysis
$$\text{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{M}\sigma^2 \quad (\text{with } M \text{ models in place of } n).$$
Bootstrapping is driving down $\rho$.
More models $M$ → less variance.
Bias is slightly increased: random subsampling means each model trains on less data → a slightly less complex model → increased bias.
DTs + Bagging
DTs have high variance / low bias.
Ideal fit for bagging.
cf. random forests, which are roughly DTs + bagging with one extra source of randomization:
At each split, consider only a fraction of your total features.
(e.g. For the ski example, for the first split, you only let it look at latitude, and then for the second split, you only let it look at the time of the year.)
Decrease $\rho$.
Decorrelate models.
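For illustration (my choice of library, not the lecture's), scikit-learn exposes exactly this knob:

```python
from sklearn.ensemble import RandomForestClassifier

# max_features limits how many features each split may consider,
# which is the decorrelation trick described above.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```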
Boosting
Decrease bias.
Additive: the ensemble is a weighted sum of the models.
Determine for classifier $G_m$ a weight $\alpha_m$ proportional to $\log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right)$.
AdaBoost
Each $G_m$ is trained on a reweighted training set, where examples misclassified in earlier rounds receive higher weight.
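A sketch of the AdaBoost loop for labels $y \in \{-1, +1\}$, assuming a hypothetical `fit_weighted(X, y, w)` that returns a weak classifier trained on the weighted set:

```python
import numpy as np

def adaboost(X, y, fit_weighted, M):
    """AdaBoost sketch: reweight the training set each round, and weight
    each classifier by alpha_m ~ log((1 - err_m) / err_m)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(M):
        g = fit_weighted(X, y, w)
        pred = g(X)
        err = w[pred != y].sum() / w.sum()
        alpha = 0.5 * np.log((1 - err) / err)  # weight ~ log((1 - err)/err)
        w = w * np.exp(-alpha * y * pred)      # upweight misclassified points
        w = w / w.sum()
        models.append(g)
        alphas.append(alpha)
    return lambda x: np.sign(sum(a * g(x) for a, g in zip(alphas, models)))
```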