Lecture 10. Decision Trees and Ensemble Methods

CS229: Machine Learning


Lecture video link: https://youtu.be/wr9gUr-eWdA

Outline

  • Decision Trees
  • Ensemble Methods
  • Bagging
  • Random Forests
  • Boosting

Decision Trees

Decision trees are one of our first examples of a non-linear model.

Greedy, Top-Down, Recursive Partitioning

ex.

+: regions where skiing is possible

−: otherwise

Latitude vs. Months

(Source: https://youtu.be/wr9gUr-eWdA 6 min. 30 sec.)

For a parent region $R_p$, we look for a split $S_p$:

$S_p(j,t)=(\{x\;|\;x_j<t,\,x\in R_p\},\ \{x\;|\;x_j\ge t,\,x\in R_p\})=(R_1,R_2).$

How to choose these splits?

Define $L(R) :=$ loss on $R$.

Given $C$ classes, define $\hat p_c$ to be the proportion of examples in $R$ that are of class $c$.

$L_{\text{misclassification}} = 1 - \max\limits_c \hat p_c.$

Choose the split that maximizes the decrease in loss:

$\max\limits_{j,t}\ \underbrace{L(R_p)}_{\text{parent loss}} - \underbrace{(L(R_1)+L(R_2))}_{\text{children loss}}.$
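As an illustration, here is a minimal NumPy sketch of this greedy search over feature/threshold pairs; `best_split` and the pluggable `loss` argument are my names, not the lecture's.

```python
import numpy as np

def best_split(X, y, loss):
    """Greedy search over (feature j, threshold t) maximizing
    L(R_p) - (L(R_1) + L(R_2)) for the region (X, y)."""
    parent_loss = loss(y)
    best_j, best_t, best_decrease = None, None, -np.inf
    n, f = X.shape
    for j in range(f):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            decrease = parent_loss - (loss(left) + loss(right))
            if decrease > best_decrease:
                best_j, best_t, best_decrease = j, t, decrease
    return best_j, best_t, best_decrease
```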

Misclassification Loss

It has issues:

(Source: https://youtu.be/wr9gUr-eWdA 15 min. 38 sec.)

Both candidate splits shown give $L(R_1)+L(R_2)=100+0=100$, while $L(R_p)=100$.

So misclassification loss registers no decrease for either split and cannot tell them apart, even though one of them is clearly more informative.

Instead, define cross-entropy loss:

$L_\text{cross} := -\sum\limits_c \hat p_c \log_2 \hat p_c.$

(Source: https://youtu.be/wr9gUr-eWdA 28 min. 16 sec.)

Gini loss: $\sum\limits_c \hat p_c (1-\hat p_c).$
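A small sketch of the three region losses above, assuming `y` is a NumPy array of class labels for one region; the helper names are mine.

```python
import numpy as np

def class_proportions(y):
    """hat p_c: proportion of examples in the region belonging to each class c."""
    _, counts = np.unique(y, return_counts=True)
    return counts / len(y)

def misclassification_loss(y):
    return 1.0 - class_proportions(y).max()

def cross_entropy_loss(y):
    p = class_proportions(y)      # only classes present in y, so p > 0
    return -np.sum(p * np.log2(p))

def gini_loss(y):
    p = class_proportions(y)
    return np.sum(p * (1.0 - p))
```

Any of these can be passed as the `loss` argument of `best_split` above; cross-entropy and Gini are strictly concave in $\hat p_c$, which is what avoids the issue shown for misclassification loss.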

Regression Trees

Take the ski example again, but now you're predicting the amount of snowfall you would expect in that area around that time.

(Source: https://youtu.be/wr9gUr-eWdA 35 min. 37 sec.)

Predict $\hat y_m = \frac{1}{|R_m|}\sum\limits_{i\in R_m} y_i$, the mean of the labels in region $R_m$.

$L_{\text{squared}} = \frac{1}{|R_m|}\sum\limits_{i\in R_m}(y_i-\hat y_m)^2.$
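And the matching pieces for a regression tree, again a sketch assuming a NumPy array of labels for region $R_m$:

```python
import numpy as np

def region_prediction(y):
    """hat y_m: the mean label in region R_m."""
    return y.mean()

def squared_loss(y):
    """L_squared = (1/|R_m|) * sum_i (y_i - hat y_m)^2."""
    return np.mean((y - y.mean()) ** 2)
```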

Categorical Vars

$q$ categories, $2^q$ possible splits.
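For intuition, a tiny enumeration (purely illustrative, names mine) of the $2^q$ candidate subsets for a categorical feature, counting the two trivial ones:

```python
from itertools import combinations

def categorical_splits(categories):
    """Each binary split sends one subset of categories left and the rest right;
    enumerating all subsets gives 2^q candidates (including two trivial ones)."""
    cats = list(categories)
    for r in range(len(cats) + 1):
        for subset in combinations(cats, r):
            yield set(subset), set(cats) - set(subset)

# q = 3 categories -> 2^3 = 8 candidate subsets
print(sum(1 for _ in categorical_splits(["sunny", "rain", "snow"])))  # 8
```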

Regularization of DTs

  1. Min leaf size
  2. Max depth
  3. Max number of nodes
  4. Min decrease in loss
  5. Pruning (misclassification with val. set)
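If you happen to use scikit-learn, these regularizers map roughly onto the `DecisionTreeClassifier` knobs below (the values are arbitrary; `ccp_alpha` performs cost-complexity pruning, a close relative of the validation-set pruning in item 5):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # 1. min leaf size
    max_depth=8,                 # 2. max depth
    max_leaf_nodes=64,           # 3. max number of (leaf) nodes
    min_impurity_decrease=1e-3,  # 4. min decrease in loss
    ccp_alpha=1e-3,              # 5. pruning strength
)
```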

Runtime

$n$ = number of examples

$f$ = number of features

$d$ = depth of the tree

Test time

$O(d)$, where $d<\log_2 n$.

Train time

Each point is part of $O(d)$ nodes.

The cost of a point at each node is $O(f)$.

So the total cost is $O(nfd)$.

(For comparison, the data matrix itself has size $nf$, so training costs only a factor of $d$ more than reading the data.)

Disadvantage - no additive structure

(Source: https://youtu.be/wr9gUr-eWdA 49 min. 24 sec.)

For a decision tree, you'd have to ask a lot of questions to even somewhat approximate the line above.

Recap

| (+) | (−) |
| --- | --- |
| Easy to explain | High variance |
| Interpretable | Bad at additive structure |
| Handles categorical vars | Low predictive accuracy |
| Fast | |

Ensembling

Take $X_i$'s which are random variables (RVs) that are independent and identically distributed (i.i.d.).

$\text{Var}(X_i)=\sigma^2.$

$\text{Var}(\bar X)=\text{Var}\left(\frac{1}{n}\sum\limits_i X_i\right)=\frac{\sigma^2}{n}.$

Now drop the independence assumption, so the $X_i$'s are only identically distributed (i.d.) and are correlated with correlation $\rho$:

$\text{Var}(\bar X)=\rho\sigma^2+\frac{1-\rho}{n}\sigma^2.$
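A quick numerical sanity check of this formula (the construction with a shared Gaussian component and the constants below are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma = 25, 0.6, 2.0

# Equicorrelated X_i with Var = sigma^2 and Corr(X_i, X_j) = rho:
# a shared component plus independent noise.
shared = rng.standard_normal((100_000, 1))
noise = rng.standard_normal((100_000, n))
X = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise)

empirical = X.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) / n * sigma**2
print(empirical, theoretical)   # the two should roughly agree
```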

Ways to ensemble

  1. Different algorithms
  2. Different training sets
  3. Bagging (Random Forests)
  4. Boosting (AdaBoost, XGBoost)

Bagging - Bootstrap Aggregation

Have a true population $P$.

Training set $S\sim P$ (a sample).

Assume $P=S$.

Draw bootstrap samples $Z\sim S$ (sampled with replacement): $Z_1,\cdots,Z_M$.

Train model $G_m$ on $Z_m$, and aggregate:

$G(x)=\frac{1}{M}\sum\limits_{m=1}^M G_m(x).$
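A minimal bagging sketch, assuming NumPy arrays and (my choice, not the lecture's) scikit-learn regression trees as the base models $G_m$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, M=50, seed=0):
    """Train M trees, each on a bootstrap sample Z_m ~ S (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # bootstrap sample Z_m
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """G(x) = (1/M) * sum_m G_m(x)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```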

Bias-Variance Analysis

$\text{Var}(\bar X)=\rho\sigma^2+\frac{1-\rho}{M}\sigma^2.$

Bootstrapping drives down $\rho$.

Larger $M$ → less variance.

Bias is slightly increased: because of the random subsampling, each model trains on less data → slightly less complex models → higher bias.

DTs + Bagging

DTs have high variance / low bias.

Ideal fit for bagging.

cf. random forests are sort of a version of DTs + bagging.

Random Forests

At each split, consider only a fraction of your total features.

(e.g. For the ski example, for the first split, you only let it look at latitude, and then for the second split, you only let it look at the time of the year.)

Decrease $\rho$.

Decorrelate models.
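In scikit-learn terms (one possible realization, not something the lecture prescribes), this per-split feature subsampling is the `max_features` parameter:

```python
from sklearn.ensemble import RandomForestClassifier

# Bagged trees + per-split feature subsampling; max_features is the knob
# that decorrelates the trees (drives rho down).
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")
```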

Boosting

Decrease bias.

Additive: boosting builds an additive model.

For each classifier $G_m$, determine a weight $\alpha_m$ proportional to $\log\left(\frac{1-\text{err}_m}{\text{err}_m}\right)$.

Adaboost

$G(x)=\sum\limits_m\alpha_m G_m(x).$

Each $G_m$ is trained on a reweighted version of the training set.
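A compact AdaBoost sketch with decision stumps, assuming NumPy arrays and labels $y_i\in\{-1,+1\}$; the factor $\tfrac{1}{2}$ in $\alpha_m$ is the standard AdaBoost convention, consistent with "$\alpha_m$ proportional to $\log\frac{1-\text{err}_m}{\text{err}_m}$":

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # example weights
    models, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of classifier G_m
        w *= np.exp(-alpha * y * pred)             # up-weight the mistakes
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """sign of G(x) = sum_m alpha_m * G_m(x)."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```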
