Lecture video link: https://youtu.be/iVOxMcumR4A
Outline
Setup / Assumptions
Bias - Variance
Approximation / Estimation
Empirical Risk Minimizer (ERM)
Uniform convergence
VC (Vapnik-Chervonenkis) dimension
Assumptions
Data distribution $D$ s.t. $(x,y)\sim D$, with the same $D$ for train and test.
Independent samples.
$S=\{(x^{(i)},y^{(i)})\mid i=1,\cdots,m\}\sim D \Longrightarrow \text{Learning algorithm} \Longrightarrow \hat h$ or $\hat\theta$.
$S$ ~ random variable
Learning algorithm ~ deterministic function
$\hat h$ or $\hat\theta$ ~ random variable
(Other names: the learning algorithm is an estimator; the distribution of $\hat\theta$ over random draws of $S$ is its sampling distribution.)
$\theta^*$: true parameter (not random)
Data view vs. Parameter view
(Source: https://youtu.be/iVOxMcumR4A 13 min. 13 sec.)
Bias vs. Variance
(Source: https://youtu.be/iVOxMcumR4A 13 min. 50 sec.)
$\text{Var}[\hat\theta]\rightarrow0$ as $m\rightarrow\infty$: "statistically efficient"
$\hat\theta\rightarrow\theta^*$ as $m\rightarrow\infty$: the algorithm is "consistent"
$\mathbb{E}[\hat\theta]=\theta^*$ for all $m$: "unbiased"
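As a quick illustration of these properties (a minimal sketch of my own, not from the lecture): take the learning algorithm to be the sample mean of Bernoulli($\theta^*$) data. Redrawing $S$ many times traces out the sampling distribution of $\hat\theta$; its mean stays at $\theta^*$ (unbiased) and its variance shrinks as $m$ grows (consistent / efficient in this toy case).

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 0.3          # assumed true parameter of D for this toy example
num_datasets = 5000       # number of independent training sets S drawn from D

for m in [10, 100, 1000]:
    # Each row is one training set S of size m; the "learning algorithm"
    # is the deterministic map S -> sample mean = theta_hat.
    S = rng.binomial(1, theta_star, size=(num_datasets, m))
    theta_hat = S.mean(axis=1)            # sampling distribution of theta_hat
    print(f"m={m:5d}  E[theta_hat]≈{theta_hat.mean():.4f}  "
          f"Var[theta_hat]≈{theta_hat.var():.5f}")
# E[theta_hat] stays ≈ theta_star for every m (unbiased);
# Var[theta_hat] -> 0 as m grows (consistent / efficient here).
```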
Risks (Errors)
(Source: https://youtu.be/iVOxMcumR4A 26 min. 45 sec.)
$g$ - Best possible hypothesis
$h^*$ - Best in class $\mathcal{H}$
$\hat h$ - Learnt from finite data
$\epsilon(h)$: Risk / Generalization Error
$=\mathbb{E}_{(x,y)\sim D}[1\{h(x)\neq y\}]$.
$\hat\epsilon_s(h)$: Empirical Risk
$=\frac{1}{m}\sum_{i=1}^m 1\{h(x^{(i)})\neq y^{(i)}\}$.
$\epsilon(g)$: Bayes Error / Irreducible Error
$\epsilon(h^*)-\epsilon(g)$: Approximation Error → refer to the image above
$\epsilon(\hat h)-\epsilon(h^*)$: Estimation Error → refer to the image above
$$\epsilon(\hat h)=\text{Estimation Error}+\text{Approximation Error}+\text{Irreducible Error}\\=(\text{Est. Var.})+(\text{Est. Bias}+\text{Approx. Error})+(\text{Irreducible Error})\\=\text{Var}+\text{Bias}+\text{Irreducible Error}$$
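A minimal numerical sketch of these quantities (my own toy distribution $D$, not from the lecture): for a fixed threshold hypothesis, $\hat\epsilon_s(h)$ is computed on a small training set, while $\epsilon(h)$ is approximated by a large fresh sample from $D$; the 10% label noise plays the role of the irreducible (Bayes) error.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_D(n):
    """Assumed toy D: x ~ Uniform(0,1), y = 1{x > 0.5} with 10% label noise."""
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1          # Bayes error ≈ 0.1 here
    return x, np.where(flip, 1 - y, y)

def h(x, t):
    """Threshold hypothesis h_t(x) = 1{x > t}."""
    return (x > t).astype(int)

t = 0.6                                        # some fixed hypothesis in H
x_tr, y_tr = sample_D(50)                      # small training sample S
x_big, y_big = sample_D(200_000)               # large sample to approximate D

emp_risk = np.mean(h(x_tr, t) != y_tr)         # epsilon_hat_s(h)
gen_risk = np.mean(h(x_big, t) != y_big)       # Monte Carlo estimate of epsilon(h)
print(f"empirical risk ≈ {emp_risk:.3f}, generalization error ≈ {gen_risk:.3f}")
```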
Fight Variance
$m\rightarrow\infty$ (more training data).
Regularization. (Low bias & high var → slightly larger bias & low var)
Fight High Bias
→ Make $\mathcal{H}$ bigger.
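A minimal sketch of the regularization point above (my own example, using ridge regression as the regularizer and assumed true weights): across many training sets, the regularized estimates vary far less, at the price of a small bias toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
w_star = np.array([2.0, -1.0])                 # assumed true parameters
lam = 5.0                                      # ridge regularization strength
ols_ws, ridge_ws = [], []

for _ in range(2000):                          # many independent training sets
    X = rng.normal(size=(20, 2))
    y = X @ w_star + rng.normal(scale=3.0, size=20)
    # Ordinary least squares vs. ridge: (X^T X + lam I)^{-1} X^T y
    ols_ws.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_ws.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

for name, ws in [("OLS", np.array(ols_ws)), ("ridge", np.array(ridge_ws))]:
    print(f"{name:5s}  mean={ws.mean(axis=0).round(2)}  var={ws.var(axis=0).round(3)}")
# Ridge estimates have noticeably lower variance but are biased toward 0.
```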
Empirical Risk Minimizer (ERM)
~ a learning algorithm.
$$\hat h_{\text{ERM}}=\arg\min_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m 1\left\{h(x^{(i)})\neq y^{(i)}\right\}.$$
$\hat\epsilon_s(h)$ vs. $\epsilon(h)$
$\epsilon(\hat h)$ vs. $\epsilon(h^*)$
Tools
Union Bound
$A_1,A_2,\cdots,A_k$ (need not be independent)
$p(A_1\cup A_2\cup\cdots\cup A_k)\le p(A_1)+p(A_2)+\cdots+p(A_k)$
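A quick Monte Carlo sanity check of the union bound on deliberately dependent events (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.uniform(0, 1, 1_000_000)
# Overlapping (dependent) events A_1 = {u < 0.3}, A_2 = {u < 0.5}, A_3 = {0.2 < u < 0.6}
A = [u < 0.3, u < 0.5, (u > 0.2) & (u < 0.6)]
p_union = np.mean(A[0] | A[1] | A[2])
p_sum = sum(a.mean() for a in A)
print(f"p(union) ≈ {p_union:.3f}  <=  sum of p(A_k) ≈ {p_sum:.3f}")
```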
Hoeffding’s inequality
Let $Z_1,Z_2,\cdots,Z_m\sim\text{Bern}(\phi)$ be i.i.d. and $\hat\phi=\frac{1}{m}\sum_{i=1}^m Z_i$.
Let $\gamma>0$ (margin). Then
$p\left[|\hat\phi-\phi|>\gamma\right]\le2\exp\left(-2\gamma^2m\right)$.
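An empirical check of Hoeffding's inequality (my own sketch, with assumed values of $\phi$, $\gamma$, $m$): the observed frequency of deviations larger than $\gamma$ stays below the $2\exp(-2\gamma^2m)$ bound.

```python
import numpy as np

rng = np.random.default_rng(5)
phi, gamma, m, trials = 0.4, 0.05, 500, 20_000
Z = rng.binomial(1, phi, size=(trials, m))     # trials x m Bernoulli(phi) samples
phi_hat = Z.mean(axis=1)                       # one phi_hat per trial
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"P[|phi_hat - phi| > gamma] ≈ {empirical:.4f}  <=  bound = {bound:.4f}")
```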
Finite hypothesis class $\mathcal H$
$|\mathcal H|=K$.
$p\left[\exists\,h\in\mathcal H:\left|\hat\epsilon_s(h)-\epsilon(h)\right|>\gamma\right]\le K\cdot2\exp(-2\gamma^2m)$
$\Rightarrow p\left[\forall h\in\mathcal H:\left|\hat\epsilon_s(h)-\epsilon(h)\right|\le\gamma\right]\ge1-K\cdot2\exp(-2\gamma^2m)$.
Let $\delta=2K\exp(-2\gamma^2m)$.
$\delta$ - Probability of Error
$\gamma$ - Margin of Error
$m$ - Sample Size
Fix $\gamma,\delta>0$.
$m\ge\frac{1}{2\gamma^2}\log\left(\frac{2K}{\delta}\right)$
(called “sample complexity”).
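A small helper (my own sketch) that just evaluates this formula; the numbers below ($K=10{,}000$ hypotheses, $\gamma=0.05$, $\delta=0.01$) are assumptions for illustration.

```python
import math

def sample_complexity(K, gamma, delta):
    """Smallest m guaranteeing |eps_hat(h) - eps(h)| <= gamma for all h in H
    with probability >= 1 - delta, for a finite class of size K."""
    return math.ceil(1 / (2 * gamma**2) * math.log(2 * K / delta))

print(sample_complexity(K=10_000, gamma=0.05, delta=0.01))   # 2902
```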
$$\epsilon(\hat h)\le\hat\epsilon(\hat h)+\gamma\le\hat\epsilon(h^*)+\gamma\le\epsilon(h^*)+2\gamma.$$
(The first and third steps use uniform convergence; the second uses the fact that $\hat h$ minimizes the empirical risk.)
$\Rightarrow$ With probability $1-\delta$, for training-set size $m$,
$$\epsilon(\hat h)\le\epsilon(h^*)+2\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}.$$
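The $\gamma$ inside the square root comes from inverting the definition of $\delta$ above (short derivation added for completeness):

$$\delta=2K\exp(-2\gamma^2m)\;\Longrightarrow\;\gamma=\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}},\qquad\text{so}\qquad\epsilon(\hat h)\le\epsilon(h^*)+2\gamma=\epsilon(h^*)+2\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}.$$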
The case of infinite $\mathcal H$
Def.
Given a set $S=\{x^{(1)},\cdots,x^{(D)}\}$ (no relation to the training set) of points $x^{(i)}\in\mathcal X$, we say that $\mathcal H$ "shatters" $S$ if $\mathcal H$ can realize any labeling on $S$, i.e., if for any set of labels $\{y^{(1)},\cdots,y^{(D)}\}$, there exists some $h\in\mathcal H$ such that $h(x^{(i)})=y^{(i)}$ for all $i=1,\cdots,D$.
$\text{VC}(\mathcal H)$: the size of the largest set that is "shattered" by $\mathcal H$.
$$\epsilon(\hat h)\le\epsilon(h^*)+O\left(\sqrt{\frac{\text{VC}(\mathcal H)}{m}\log\frac{m}{\text{VC}(\mathcal H)}+\frac{1}{m}\log\frac{1}{\delta}}\right).$$
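A brute-force shattering check (my own sketch) for the class of 1-D thresholds $h_t(x)=1\{x>t\}$: a single point can be shattered, but no set of two points can, so $\text{VC}(\mathcal H)=1$ for this class.

```python
import itertools
import numpy as np

def shatters(points, thresholds):
    """True if the threshold class {x -> 1{x > t}} realizes every labeling of `points`."""
    points = np.asarray(points, dtype=float)
    realizable = {tuple((points > t).astype(int)) for t in thresholds}
    return all(tuple(lab) in realizable
               for lab in itertools.product([0, 1], repeat=len(points)))

# Candidate thresholds: a fine grid is enough to realize every achievable labeling.
ts = np.linspace(-1, 2, 301)
print(shatters([0.5], ts))        # True  -> a set of size 1 is shattered
print(shatters([0.3, 0.7], ts))   # False -> the labeling (1, 0) is impossible
# Hence VC(H) = 1 for 1-D thresholds; richer classes (e.g. intervals) shatter more.
```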