For a decision tree, you’d have to ask a lot of questions to even somewhat approximate the line above.
Recap
| (+) | (−) |
| --- | --- |
| Easy to explain | High variance |
| Interpretable | Bad at additive structure |
| Handles categorical vars | Low predictive accuracy |
| Fast | |
Ensembling
Take $X_i$’s, which are random variables (RVs) that are independent and identically distributed (i.i.d.).
$\operatorname{Var}(X_i) = \sigma^2$.
$\operatorname{Var}(\bar{X}) = \operatorname{Var}\left(\frac{1}{n}\sum_i X_i\right) = \frac{\sigma^2}{n}$.
Drop independence assumption.
So now the $X_i$’s are just i.d. (identically distributed, not independent).
The $X_i$’s are correlated with correlation $\rho$.
$\operatorname{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$.
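Quick check on that formula (a sketch, assuming every pair of $X_i$’s has exactly the same correlation $\rho$):

```latex
% Variance of the mean of n identically distributed RVs with pairwise correlation rho
\begin{aligned}
\operatorname{Var}(\bar{X})
  &= \frac{1}{n^2}\Big( \sum_i \operatorname{Var}(X_i) + \sum_{i \neq j} \operatorname{Cov}(X_i, X_j) \Big) \\
  &= \frac{1}{n^2}\big( n\sigma^2 + n(n-1)\rho\sigma^2 \big) \\
  &= \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2 .
\end{aligned}
```

Setting $\rho = 0$ recovers the i.i.d. result $\frac{\sigma^2}{n}$, and $\rho = 1$ gives $\sigma^2$: averaging perfectly correlated copies does nothing.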
Ways to ensemble
Different algorithms
Different training sets
Bagging (Random Forests)
Boosting (AdaBoost, XGBoost)
Bagging - Bootstrap Aggregation
Have a true population P.
Training set $S \sim P$ (a sample from the population).
Assume $P = S$, i.e., treat the training set as if it were the whole population.
Bootstrap samples $Z \sim S$, drawn from $S$ with replacement.
Draw bootstrap samples **$Z_1, \dots, Z_M$**.
Train model $G_m$ on $Z_m$.
$G(x) = \frac{1}{M}\sum_{m=1}^{M} G_m(x)$.
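A minimal sketch of this procedure in Python, assuming scikit-learn’s `DecisionTreeRegressor` as the base learner $G_m$ (the function names and defaults here are just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag(X, y, M=100, seed=0):
    """Train M trees, each on a bootstrap sample Z_m drawn from S with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)                     # Z_m ~ S, with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Aggregate: G(x) = (1/M) * sum_m G_m(x)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```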
Bias-Variance Analysis
$\operatorname{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{M}\sigma^2$.
Bootstrapping is driving down $\rho$.
Larger $M$ → less variance.
Bias is slightly increased, because random subsampling means each model trains on less data → slightly less complex models → increased bias.
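As a concrete instance (numbers picked only to illustrate the shape of the formula), with $\sigma^2 = 1$ and $\rho = 0.5$:

```latex
% Variance of the bagged average as M grows (sigma^2 = 1, rho = 0.5)
\operatorname{Var}(\bar{X}) = 0.5 + \frac{0.5}{M}
  \;=\; 1.0 \text{ at } M{=}1,\quad 0.55 \text{ at } M{=}10,\quad \to 0.5 \text{ as } M \to \infty
```

The $\rho\sigma^2$ term is the floor that averaging alone can’t remove; lowering $\rho$ itself is what random forests go after next.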
DTs + Bagging
DTs have high variance / low bias.
Ideal fit for bagging.
cf. random forests, which are essentially a version of DTs + bagging.
Random Forests
At each split, consider only a fraction of your total features.
(e.g., in the ski example, the first split might only be allowed to look at latitude, and the second split only at the time of year.)
Decrease ρ.
Decorrelate models.
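A quick sketch with scikit-learn, where `max_features` is the knob for “only consider a fraction of the features at each split” (the toy dataset and parameter values are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for something like the ski example.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_features="sqrt": each split only sees a random subset of ~sqrt(d) features,
# which decorrelates the trees (drives down rho).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))    # training accuracy, just as a sanity check
```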
Boosting
Decrease bias.
Additive.
For each classifier $G_m$, determine a weight $\alpha_m$ proportional to $\log\!\left(\frac{1 - \text{err}_m}{\text{err}_m}\right)$.
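A minimal sketch of this weighting rule in an AdaBoost-style loop, with decision stumps as the weak learners $G_m$ and labels in $\{-1, +1\}$ (the stump depth, clipping, and function names are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """y must be in {-1, +1}. Returns the fitted stumps and their weights alpha_m."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # per-example weights
    stumps, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (g.predict(X) != y)
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)  # weighted err_m
        alpha = np.log((1 - err) / err)              # alpha_m ~ log((1 - err_m) / err_m)
        w *= np.exp(alpha * miss)                    # upweight the misclassified points
        stumps.append(g)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X):
    """Weighted vote: sign(sum_m alpha_m * G_m(x))."""
    return np.sign(sum(a * g.predict(X) for g, a in zip(stumps, alphas)))
```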