Lecture 10. Decision Trees and Ensemble Methods

CS229: Machine Learning


Lecture video link: https://youtu.be/wr9gUr-eWdA

Outline

  • Decision Trees
  • Ensemble Methods
  • Bagging
  • Random Forests
  • Boosting

Decision Trees

Decision trees are one of our first examples of a non-linear model.

Greedy, Top-Down, Recursive Partitioning

ex.

+: regions where skiing is possible

−: otherwise

Latitude vs. Months

(Source: https://youtu.be/wr9gUr-eWdA 6 min. 30 sec.)

For a parent region $R_p$, we look for a split $S_p$:

$S_p(j,t)=(\{x\;|\;x_j<t,\,x\in R_p\},\ \{x\;|\;x_j\ge t,\,x\in R_p\})=(R_1,R_2).$

How to choose these splits?

Define $L(R) :=$ loss on $R$.

Given $C$ classes, define $\hat p_c$ to be the proportion of examples in $R$ that are of class $c$.

$L_{\text{misclassification}} = 1 - \max\limits_c \hat p_c.$

Choose the split that maximizes the decrease in loss:

$\max\limits_{j,t}\ \underbrace{L(R_p)}_{\text{parent loss}} - \underbrace{(L(R_1)+L(R_2))}_{\text{children loss}}.$
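As an illustration, here is a minimal NumPy sketch of this greedy search over feature/threshold pairs; `best_split` and the pluggable `loss` argument are my names, not the lecture's.

```python
import numpy as np

def best_split(X, y, loss):
    """Greedy search over (feature j, threshold t) maximizing
    L(R_p) - (L(R_1) + L(R_2)) for the region (X, y)."""
    parent_loss = loss(y)
    best_j, best_t, best_decrease = None, None, -np.inf
    n, f = X.shape
    for j in range(f):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            decrease = parent_loss - (loss(left) + loss(right))
            if decrease > best_decrease:
                best_j, best_t, best_decrease = j, t, decrease
    return best_j, best_t, best_decrease
```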

Misclassification Loss

It has issues:

(Source: https://youtu.be/wr9gUr-eWdA 15 min. 38 sec.)

Both candidate splits shown give $L(R_1)+L(R_2)=100+0=100$, while $L(R_p)=100$.

So misclassification loss registers no decrease for either split and cannot tell them apart, even though one of them is clearly more informative.

Instead, define cross-entropy loss:

$L_\text{cross} := -\sum\limits_c \hat p_c \log_2 \hat p_c.$

(Source: https://youtu.be/wr9gUr-eWdA 28 min. 16 sec.)

Gini loss: $\sum\limits_c \hat p_c (1-\hat p_c).$
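A small sketch of the three region losses above, assuming `y` is a NumPy array of class labels for one region; the helper names are mine.

```python
import numpy as np

def class_proportions(y):
    """hat p_c: proportion of examples in the region belonging to each class c."""
    _, counts = np.unique(y, return_counts=True)
    return counts / len(y)

def misclassification_loss(y):
    return 1.0 - class_proportions(y).max()

def cross_entropy_loss(y):
    p = class_proportions(y)      # only classes present in y, so p > 0
    return -np.sum(p * np.log2(p))

def gini_loss(y):
    p = class_proportions(y)
    return np.sum(p * (1.0 - p))
```

Any of these can be passed as the `loss` argument of `best_split` above; cross-entropy and Gini are strictly concave in $\hat p_c$, which is what avoids the issue shown for misclassification loss.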

Regression Trees

Take the ski example again, but now you're predicting the amount of snowfall you would expect in that area around that time.

(Source: https://youtu.be/wr9gUr-eWdA 35 min. 37 sec.)

Predict $\hat y_m = \frac{1}{|R_m|}\sum\limits_{i\in R_m} y_i$, the mean of the labels in region $R_m$.

$L_{\text{squared}} = \frac{1}{|R_m|}\sum\limits_{i\in R_m}(y_i-\hat y_m)^2.$
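And the matching pieces for a regression tree, again a sketch assuming a NumPy array of labels for region $R_m$:

```python
import numpy as np

def region_prediction(y):
    """hat y_m: the mean label in region R_m."""
    return y.mean()

def squared_loss(y):
    """L_squared = (1/|R_m|) * sum_i (y_i - hat y_m)^2."""
    return np.mean((y - y.mean()) ** 2)
```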

Categorical Vars

$q$ categories, $2^q$ possible splits.
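For intuition, a tiny enumeration (purely illustrative, names mine) of the $2^q$ candidate subsets for a categorical feature, counting the two trivial ones:

```python
from itertools import combinations

def categorical_splits(categories):
    """Each binary split sends one subset of categories left and the rest right;
    enumerating all subsets gives 2^q candidates (including two trivial ones)."""
    cats = list(categories)
    for r in range(len(cats) + 1):
        for subset in combinations(cats, r):
            yield set(subset), set(cats) - set(subset)

# q = 3 categories -> 2^3 = 8 candidate subsets
print(sum(1 for _ in categorical_splits(["sunny", "rain", "snow"])))  # 8
```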

Regularization of DTs

  1. Min leaf size
  2. Max depth
  3. Max number of nodes
  4. Min decrease in loss
  5. Pruning (misclassification with val. set)
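If you happen to use scikit-learn, these regularizers map roughly onto the `DecisionTreeClassifier` knobs below (the values are arbitrary; `ccp_alpha` performs cost-complexity pruning, a close relative of the validation-set pruning in item 5):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # 1. min leaf size
    max_depth=8,                 # 2. max depth
    max_leaf_nodes=64,           # 3. max number of (leaf) nodes
    min_impurity_decrease=1e-3,  # 4. min decrease in loss
    ccp_alpha=1e-3,              # 5. pruning strength
)
```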

Runtime

$n$ = number of examples

$f$ = number of features

$d$ = depth of the tree

Test time

$O(d)$, where $d<\log_2 n$.

Train time

Each point is part of $O(d)$ nodes.

The cost of a point at each node is $O(f)$.

So the total cost is $O(nfd)$.

(For comparison, the data matrix itself has size $nf$, so training costs only a factor of $d$ more than reading the data.)

Disadvantage - no additive structure

(Source: https://youtu.be/wr9gUr-eWdA 49 min. 24 sec.)

For a decision tree, you'd have to ask a lot of questions to even somewhat approximate the line above.

Recap

| (+) | (−) |
| --- | --- |
| Easy to explain | High variance |
| Interpretable | Bad at additive structure |
| Handles categorical vars | Low predictive accuracy |
| Fast | |

Ensembling

Take $X_i$'s which are random variables (RVs) that are independent and identically distributed (i.i.d.).

$\text{Var}(X_i)=\sigma^2.$

$\text{Var}(\bar X)=\text{Var}\left(\frac{1}{n}\sum\limits_i X_i\right)=\frac{\sigma^2}{n}.$

Now drop the independence assumption, so the $X_i$'s are only identically distributed (i.d.) and are correlated with correlation $\rho$:

$\text{Var}(\bar X)=\rho\sigma^2+\frac{1-\rho}{n}\sigma^2.$
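A quick numerical sanity check of this formula (the construction with a shared Gaussian component and the constants below are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma = 25, 0.6, 2.0

# Equicorrelated X_i with Var = sigma^2 and Corr(X_i, X_j) = rho:
# a shared component plus independent noise.
shared = rng.standard_normal((100_000, 1))
noise = rng.standard_normal((100_000, n))
X = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise)

empirical = X.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) / n * sigma**2
print(empirical, theoretical)   # the two should roughly agree
```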

Ways to ensemble

  1. Different algorithms
  2. Different training sets
  3. Bagging (Random Forests)
  4. Boosting (AdaBoost, XGBoost)

Bagging - Bootstrap Aggregation

Have a true population $P$.

Training set $S\sim P$ (a sample).

Assume $P=S$.

Draw bootstrap samples $Z\sim S$ (sampled with replacement): $Z_1,\cdots,Z_M$.

Train model $G_m$ on $Z_m$, and aggregate:

$G(x)=\frac{1}{M}\sum\limits_{m=1}^M G_m(x).$
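A minimal bagging sketch, assuming NumPy arrays and (my choice, not the lecture's) scikit-learn regression trees as the base models $G_m$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, M=50, seed=0):
    """Train M trees, each on a bootstrap sample Z_m ~ S (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)   # bootstrap sample Z_m
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """G(x) = (1/M) * sum_m G_m(x)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```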

Bias-Variance Analysis

$\text{Var}(\bar X)=\rho\sigma^2+\frac{1-\rho}{M}\sigma^2.$

Bootstrapping drives down $\rho$.

Larger $M$ → less variance.

Bias is slightly increased: because of the random subsampling, each model trains on less data → slightly less complex models → higher bias.

DTs + Bagging

DTs have high variance / low bias.

Ideal fit for bagging.

cf. random forests are sort of a version of DTs + bagging.

Random Forests

At each split, consider only a fraction of your total features.

(e.g. For the ski example, for the first split, you only let it look at latitude, and then for the second split, you only let it look at the time of the year.)

Decrease $\rho$.

Decorrelate models.
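In scikit-learn terms (one possible realization, not something the lecture prescribes), this per-split feature subsampling is the `max_features` parameter:

```python
from sklearn.ensemble import RandomForestClassifier

# Bagged trees + per-split feature subsampling; max_features is the knob
# that decorrelates the trees (drives rho down).
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")
```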

Boosting

Decrease bias.

Additive: boosting builds an additive model.

For each classifier $G_m$, determine a weight $\alpha_m$ proportional to $\log\left(\frac{1-\text{err}_m}{\text{err}_m}\right)$.

Adaboost

$G(x)=\sum\limits_m\alpha_m G_m(x).$

Each $G_m$ is trained on a reweighted version of the training set.
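A compact AdaBoost sketch with decision stumps, assuming NumPy arrays and labels $y_i\in\{-1,+1\}$; the factor $\tfrac{1}{2}$ in $\alpha_m$ is the standard AdaBoost convention, consistent with "$\alpha_m$ proportional to $\log\frac{1-\text{err}_m}{\text{err}_m}$":

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # example weights
    models, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of classifier G_m
        w *= np.exp(-alpha * y * pred)             # up-weight the mistakes
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """sign of G(x) = sum_m alpha_m * G_m(x)."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```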
