Lecture 09. Approx./Estimation Error & ERM

CS229: Machine Learning


Lecture video link: https://youtu.be/iVOxMcumR4A

Outline

  • Setup / Assumptions
  • Bias - Variance
  • Approximation / Estimation
  • Empirical Risk Minimizer (ERM)
  • Uniform convergence
  • VC (Vapnik-Chervonenkis) dimension

Assumptions

  1. Data distribution $D$ s.t. $(x,y)\sim D$, the same for both train and test.
  2. Independent samples.

$$S=\left\{\left(x^{(i)},y^{(i)}\right)\;\middle|\;i=1,\cdots,m\right\}\sim D \Longrightarrow\text{Learning algorithm}\Longrightarrow \hat h\;\text{ or }\;\hat\theta.$$

$S$ ~ random variable

Learning algorithm ~ deterministic function

$\hat{h}$ or $\hat\theta$ ~ random variable

(Also called: the learning algorithm is an estimator, and the distribution of $\hat\theta$ over random draws of $S$ is its sampling distribution.)

$\theta^*$: true parameter (not random)

Data view vs. Parameter view

(Source: https://youtu.be/iVOxMcumR4A 13 min. 13 sec.)

Bias vs. Variance

(Source: https://youtu.be/iVOxMcumR4A 13 min. 50 sec.)

$\text{Var}[\hat\theta]\rightarrow0$ as $m\rightarrow\infty$: “statistically efficient”

$\hat\theta\rightarrow\theta^*$ as $m\rightarrow\infty$: “(algorithm being) consistent”

$\mathbb{E}[\hat\theta]=\theta^*$ for all $m$: “unbiased”
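
To make the sampling distribution of $\hat\theta$ concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture): it repeatedly draws training sets of size $m$ from a fixed distribution with true parameter $\theta^*$, uses the sample mean as the estimator, and inspects the bias and variance of the resulting $\hat\theta$ values.

```python
import numpy as np

# Illustrative assumptions: data ~ N(theta_star, 1), estimator = sample mean.
theta_star = 2.0        # true parameter (not random)
m = 50                  # training-set size
num_datasets = 10_000   # independent draws of S

rng = np.random.default_rng(0)
estimates = np.empty(num_datasets)
for k in range(num_datasets):
    S = rng.normal(loc=theta_star, scale=1.0, size=m)  # one training set S ~ D
    estimates[k] = S.mean()                            # theta_hat is a random variable

bias = estimates.mean() - theta_star   # close to 0: the sample mean is unbiased
variance = estimates.var()             # close to 1/m: shrinks as m grows
print(f"bias ≈ {bias:.4f}, variance ≈ {variance:.4f}, 1/m = {1/m:.4f}")
```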

Risks (Errors)

(Source: https://youtu.be/iVOxMcumR4A 26 min. 45 sec.)

$g$ - Best possible hypothesis

$h^*$ - Best in class $\mathcal{H}$

$\hat h$ - Learnt from finite data

$\epsilon(h)$: Risk / Generalization Error

$=\mathbb{E}_{(x,y)\sim D}\left[1\{h(x)\neq y\}\right].$

$\hat\epsilon_s(h)$: Empirical Risk

$=\frac{1}{m}\sum\limits_{i=1}^m 1\left\{h(x^{(i)})\neq y^{(i)}\right\}.$

$\epsilon(g)$: Bayes Error / Irreducible Error

$\epsilon(h^*)-\epsilon(g)$: Approximation Error → refer to the image above

$\epsilon(\hat h)-\epsilon(h^*)$: Estimation Error → refer to the image above

$$\begin{aligned}\epsilon(\hat h)&=\text{Estimation Error}+\text{Approximation Error}+\text{Irreducible Error}\\&=(\text{Est. Var.})+(\text{Est. Bias}+\text{Approx. Error})+(\text{Irreducible Error})\\&=\text{Var}+\text{Bias}+\text{Irreducible Error}\end{aligned}$$
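
As an aside (a standard identity, not spelled out in the lecture snippet above), the Var + Bias grouping mirrors the usual decomposition of squared estimation error in the parameter view:

$$\mathbb{E}_S\left[\left(\hat\theta-\theta^*\right)^2\right]=\underbrace{\mathbb{E}_S\left[\left(\hat\theta-\mathbb{E}_S[\hat\theta]\right)^2\right]}_{\text{Variance}}+\underbrace{\left(\mathbb{E}_S[\hat\theta]-\theta^*\right)^2}_{\text{Bias}^2}.$$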

Fight Variance

  1. $m\rightarrow\infty$.
  2. Regularization. (low bias & high variance → slightly more bias & much lower variance)

Fight High Bias

→ Make $\mathcal{H}$ bigger.

Empirical Risk Minimizer (ERM)

~ a learning algorithm.

$$\hat h_{\text{ERM}}=\argmin_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m 1\left\{h(x^{(i)})\neq y^{(i)}\right\}.$$
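
A minimal sketch of ERM over a finite class (my own toy example, not code from the lecture): here $\mathcal H$ is a set of $K$ one-dimensional threshold classifiers $h_t(x)=1\{x>t\}$, and ERM simply returns the one with the smallest empirical risk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (assumed): x ~ Uniform[0, 1], label = 1{x > 0.6} with ~10% label noise.
m = 200
x = rng.uniform(0.0, 1.0, size=m)
y = (x > 0.6).astype(int)
y ^= (rng.uniform(size=m) < 0.1).astype(int)  # flip ~10% of the labels

# Finite hypothesis class H: K threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(0.0, 1.0, 101)       # |H| = K = 101

def empirical_risk(t):
    """hat(eps)_s(h_t): average 0-1 loss of h_t on the training set."""
    return np.mean((x > t).astype(int) != y)

# ERM: return the hypothesis in H with the smallest empirical risk.
risks = np.array([empirical_risk(t) for t in thresholds])
t_erm = thresholds[np.argmin(risks)]
print(f"ERM picks t = {t_erm:.2f} with training error {risks.min():.3f}")
```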

Uniform Convergence

  1. $\hat\epsilon(h)$ vs. $\epsilon(h)$
  2. $\epsilon(\hat h)$ vs. $\epsilon(h^*)$

Tools

  1. Union Bound

    $A_1,A_2,\cdots,A_k$ (need not be independent)

    $p(A_1\cup A_2\cup\cdots\cup A_k)\le p(A_1)+p(A_2)+\cdots+p(A_k)$

  2. Hoeffding’s inequality

    Let $Z_1,Z_2,\cdots,Z_m\sim\text{Bern}(\phi)$ (i.i.d.) and $\hat\phi=\frac{1}{m}\sum\limits_{i=1}^m Z_i$.

    Let $\gamma>0$ (margin). Then

    $p\left[\left|\hat\phi-\phi\right|>\gamma\right]\le2\exp\left(-2\gamma^2m\right)$. (A quick simulation checking this bound follows the list.)
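
A quick empirical check of Hoeffding’s inequality (an added illustration; $\phi$, $m$, and $\gamma$ are arbitrary choices): the observed frequency of $|\hat\phi-\phi|>\gamma$ over many simulated draws should stay below $2\exp(-2\gamma^2m)$.

```python
import numpy as np

rng = np.random.default_rng(2)
phi, m, gamma = 0.3, 100, 0.1   # arbitrary illustrative choices
trials = 100_000

# Draw Z_1, ..., Z_m ~ Bern(phi) independently, many times, and form phi_hat each time.
Z = rng.uniform(size=(trials, m)) < phi
phi_hat = Z.mean(axis=1)

observed = np.mean(np.abs(phi_hat - phi) > gamma)  # empirical deviation probability
bound = 2 * np.exp(-2 * gamma**2 * m)              # Hoeffding bound
print(f"P[|phi_hat - phi| > gamma] ≈ {observed:.4f} <= {bound:.4f}")
```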

Finite Hypotheses class H\mathcal{H}

$|\mathcal H|=K$.

$p\left[\exists\, h\in\mathcal H \text{ s.t. } \left|\hat\epsilon_s(h)-\epsilon(h)\right|>\gamma\right]\le 2K\exp(-2\gamma^2m)$

$\Rightarrow p\left[\forall\, h\in\mathcal H,\ \left|\hat\epsilon_s(h)-\epsilon(h)\right|<\gamma\right]\ge1-2K\exp(-2\gamma^2m)$.

Let $\delta=2K\exp(-2\gamma^2m)$.

$\delta$ - Probability of Error

$\gamma$ - Margin of Error

$m$ - Sample Size

Fix $\gamma,\delta>0$. Then

$$m\ge\frac{1}{2\gamma^2}\log\left(\frac{2K}{\delta}\right)$$

(called “sample complexity”).
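
A tiny helper (hypothetical, my own naming) that evaluates this sample complexity for given $K$, $\gamma$, $\delta$:

```python
import math

def sample_complexity(K, gamma, delta):
    """Smallest integer m with m >= (1 / (2 * gamma**2)) * log(2 * K / delta)."""
    return math.ceil(1.0 / (2.0 * gamma**2) * math.log(2.0 * K / delta))

# e.g. K = 10,000 hypotheses, margin gamma = 0.05, failure probability delta = 0.01:
print(sample_complexity(K=10_000, gamma=0.05, delta=0.01))  # -> 2902
```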

$$\epsilon(\hat h)\le\hat\epsilon(\hat h)+\gamma\le\hat\epsilon(h^*)+\gamma\le\epsilon(h^*)+2\gamma.$$

(The first and third steps use uniform convergence; the middle step uses that $\hat h$ minimizes the empirical risk.)

$\Rightarrow$ With probability $1-\delta$, for training-set size $m$,

$$\epsilon(\hat h)\le\epsilon(h^*)+2\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}.$$

The case of infinite H\mathcal H

Def.

Given a set $S=\{x^{(1)},\cdots,x^{(D)}\}$ (no relation to the training set) of points $x^{(i)}\in\mathcal X$, we say that $\mathcal H$ “shatters” $S$ if $\mathcal H$ can realize any labeling on $S$, i.e., if for any set of labels $\{y^{(1)},\cdots,y^{(D)}\}$, there exists some $h\in\mathcal H$ so that $h\left(x^{(i)}\right)=y^{(i)}$ for all $i=1,\cdots,D$.

$\text{VC}(\mathcal H)$: the size of the largest set that is “shattered” by $\mathcal H$.
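
A brute-force way to see shattering and VC dimension in action (my own illustration, using the one-sided threshold class $h_t(x)=1\{x>t\}$ as an assumed example): enumerate all $2^{|S|}$ labelings of a point set and check whether some threshold realizes each one. A single point is shattered, two points are not, so $\text{VC}=1$ for this class.

```python
import itertools
import numpy as np

def realizable(points, labels, thresholds):
    """True if some h_t(x) = 1{x > t} produces exactly `labels` on `points`."""
    return any(all((x > t) == bool(y) for x, y in zip(points, labels))
               for t in thresholds)

def shatters(points, thresholds):
    """H shatters the set iff every one of the 2^|S| labelings is realizable."""
    return all(realizable(points, labels, thresholds)
               for labels in itertools.product([0, 1], repeat=len(points)))

thresholds = np.linspace(-1.0, 2.0, 301)   # a fine grid standing in for all t
print(shatters([0.5], thresholds))         # True:  a single point is shattered
print(shatters([0.3, 0.7], thresholds))    # False: labeling (1, 0) is impossible, so VC = 1
```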

With probability at least $1-\delta$,

$$\epsilon(\hat h)\le\epsilon(h^*)+O\left(\sqrt{\frac{\text{VC}(\mathcal H)}{m}\log\frac{m}{\text{VC}(\mathcal H)}+\frac{1}{m}\log\frac{1}{\delta}}\right).$$
