Lecture video link: https://youtu.be/iVOxMcumR4A
Outline
Setup / Assumptions
Bias - Variance
Approximation / Estimation
Empirical Risk Minimizer (ERM)
Uniform convergence
VC (Vapnik-Chervonenkis) dimension
Assumptions
Data distribution $D$ s.t. $(x,y)\sim D$, with the same $D$ for train and test.
Independent samples.
$S=\{(x^{(i)},y^{(i)})\mid i=1,\cdots,m\}\sim D \Longrightarrow \text{Learning algorithm} \Longrightarrow \hat h$ or $\hat\theta$.
$S$ ~ random variable
Learning algorithm ~ deterministic function
$\hat h$ or $\hat\theta$ ~ random variable
(Other names: the learning algorithm is an estimator; the distribution of $\hat\theta$ over random draws of $S$ is its sampling distribution.)
$\theta^*$: true parameter (not random)
Data view vs. Parameter view
(Source: https://youtu.be/iVOxMcumR4A 13 min. 13 sec.)
Bias vs. Variance
(Source: https://youtu.be/iVOxMcumR4A 13 min. 50 sec.)
$\text{Var}[\hat\theta]\rightarrow0$ as $m\rightarrow\infty$: "statistically efficient"
$\hat\theta\rightarrow\theta^*$ as $m\rightarrow\infty$: the algorithm is "consistent"
$\mathbb{E}[\hat\theta]=\theta^*$ for all $m$: "unbiased"
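As a quick illustration of these properties (a minimal sketch of my own, not from the lecture): take the learning algorithm to be the sample mean of Bernoulli($\theta^*$) data. Redrawing $S$ many times traces out the sampling distribution of $\hat\theta$; its mean stays at $\theta^*$ (unbiased) and its variance shrinks as $m$ grows (consistent / efficient in this toy case).

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 0.3          # assumed true parameter of D for this toy example
num_datasets = 5000       # number of independent training sets S drawn from D

for m in [10, 100, 1000]:
    # Each row is one training set S of size m; the "learning algorithm"
    # is the deterministic map S -> sample mean = theta_hat.
    S = rng.binomial(1, theta_star, size=(num_datasets, m))
    theta_hat = S.mean(axis=1)            # sampling distribution of theta_hat
    print(f"m={m:5d}  E[theta_hat]≈{theta_hat.mean():.4f}  "
          f"Var[theta_hat]≈{theta_hat.var():.5f}")
# E[theta_hat] stays ≈ theta_star for every m (unbiased);
# Var[theta_hat] -> 0 as m grows (consistent / efficient here).
```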
Risks (Errors)
(Source: https://youtu.be/iVOxMcumR4A 26 min. 45 sec.)
$g$ - Best possible hypothesis
$h^*$ - Best in class $\mathcal{H}$
$\hat h$ - Learnt from finite data
$\epsilon(h)$: Risk / Generalization Error
$=\mathbb{E}_{(x,y)\sim D}[1\{h(x)\neq y\}]$.
$\hat\epsilon_s(h)$: Empirical Risk
$=\frac{1}{m}\sum_{i=1}^m 1\{h(x^{(i)})\neq y^{(i)}\}$.
$\epsilon(g)$: Bayes Error / Irreducible Error
$\epsilon(h^*)-\epsilon(g)$: Approximation Error → refer to the image above
$\epsilon(\hat h)-\epsilon(h^*)$: Estimation Error → refer to the image above
$$\epsilon(\hat h)=\text{Estimation Error}+\text{Approximation Error}+\text{Irreducible Error}\\=(\text{Est. Var.})+(\text{Est. Bias}+\text{Approx. Error})+(\text{Irreducible Error})\\=\text{Var}+\text{Bias}+\text{Irreducible Error}$$
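A minimal numerical sketch of these quantities (my own toy distribution $D$, not from the lecture): for a fixed threshold hypothesis, $\hat\epsilon_s(h)$ is computed on a small training set, while $\epsilon(h)$ is approximated by a large fresh sample from $D$; the 10% label noise plays the role of the irreducible (Bayes) error.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_D(n):
    """Assumed toy D: x ~ Uniform(0,1), y = 1{x > 0.5} with 10% label noise."""
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1          # Bayes error ≈ 0.1 here
    return x, np.where(flip, 1 - y, y)

def h(x, t):
    """Threshold hypothesis h_t(x) = 1{x > t}."""
    return (x > t).astype(int)

t = 0.6                                        # some fixed hypothesis in H
x_tr, y_tr = sample_D(50)                      # small training sample S
x_big, y_big = sample_D(200_000)               # large sample to approximate D

emp_risk = np.mean(h(x_tr, t) != y_tr)         # epsilon_hat_s(h)
gen_risk = np.mean(h(x_big, t) != y_big)       # Monte Carlo estimate of epsilon(h)
print(f"empirical risk ≈ {emp_risk:.3f}, generalization error ≈ {gen_risk:.3f}")
```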
Fight Variance
$m\rightarrow\infty$ (more training data).
Regularization. (Low bias & high var → slightly larger bias & low var)
Fight High Bias
→ Make $\mathcal{H}$ bigger.
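A minimal sketch of the regularization point above (my own example, using ridge regression as the regularizer and assumed true weights): across many training sets, the regularized estimates vary far less, at the price of a small bias toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
w_star = np.array([2.0, -1.0])                 # assumed true parameters
lam = 5.0                                      # ridge regularization strength
ols_ws, ridge_ws = [], []

for _ in range(2000):                          # many independent training sets
    X = rng.normal(size=(20, 2))
    y = X @ w_star + rng.normal(scale=3.0, size=20)
    # Ordinary least squares vs. ridge: (X^T X + lam I)^{-1} X^T y
    ols_ws.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_ws.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

for name, ws in [("OLS", np.array(ols_ws)), ("ridge", np.array(ridge_ws))]:
    print(f"{name:5s}  mean={ws.mean(axis=0).round(2)}  var={ws.var(axis=0).round(3)}")
# Ridge estimates have noticeably lower variance but are biased toward 0.
```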
Empirical Risk Minimizer (ERM)
~ a learning algorithm.
$$\hat h_{\text{ERM}}=\arg\min_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m 1\left\{h(x^{(i)})\neq y^{(i)}\right\}.$$
$\hat\epsilon_s(h)$ vs. $\epsilon(h)$
$\epsilon(\hat h)$ vs. $\epsilon(h^*)$
Tools
Union Bound
$A_1,A_2,\cdots,A_k$ (need not be independent)
$p(A_1\cup A_2\cup\cdots\cup A_k)\le p(A_1)+p(A_2)+\cdots+p(A_k)$
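A quick Monte Carlo sanity check of the union bound on deliberately dependent events (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.uniform(0, 1, 1_000_000)
# Overlapping (dependent) events A_1 = {u < 0.3}, A_2 = {u < 0.5}, A_3 = {0.2 < u < 0.6}
A = [u < 0.3, u < 0.5, (u > 0.2) & (u < 0.6)]
p_union = np.mean(A[0] | A[1] | A[2])
p_sum = sum(a.mean() for a in A)
print(f"p(union) ≈ {p_union:.3f}  <=  sum of p(A_k) ≈ {p_sum:.3f}")
```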
Hoeffding’s inequality
Let $Z_1,Z_2,\cdots,Z_m\sim\text{Bern}(\phi)$ be i.i.d. and $\hat\phi=\frac{1}{m}\sum_{i=1}^m Z_i$.
Let $\gamma>0$ (margin). Then
$p\left[|\hat\phi-\phi|>\gamma\right]\le2\exp\left(-2\gamma^2m\right)$.
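An empirical check of Hoeffding's inequality (my own sketch, with assumed values of $\phi$, $\gamma$, $m$): the observed frequency of deviations larger than $\gamma$ stays below the $2\exp(-2\gamma^2m)$ bound.

```python
import numpy as np

rng = np.random.default_rng(5)
phi, gamma, m, trials = 0.4, 0.05, 500, 20_000
Z = rng.binomial(1, phi, size=(trials, m))     # trials x m Bernoulli(phi) samples
phi_hat = Z.mean(axis=1)                       # one phi_hat per trial
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"P[|phi_hat - phi| > gamma] ≈ {empirical:.4f}  <=  bound = {bound:.4f}")
```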
Finite hypothesis class $\mathcal H$
$|\mathcal H|=K$.
$p\left[\exists\,h\in\mathcal H:\left|\hat\epsilon_s(h)-\epsilon(h)\right|>\gamma\right]\le K\cdot2\exp(-2\gamma^2m)$
$\Rightarrow p\left[\forall h\in\mathcal H:\left|\hat\epsilon_s(h)-\epsilon(h)\right|\le\gamma\right]\ge1-K\cdot2\exp(-2\gamma^2m)$.
Let $\delta=2K\exp(-2\gamma^2m)$.
$\delta$ - Probability of Error
$\gamma$ - Margin of Error
$m$ - Sample Size
Fix $\gamma,\delta>0$.
$m\ge\frac{1}{2\gamma^2}\log\left(\frac{2K}{\delta}\right)$
(called “sample complexity”).
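A small helper (my own sketch) that just evaluates this formula; the numbers below ($K=10{,}000$ hypotheses, $\gamma=0.05$, $\delta=0.01$) are assumptions for illustration.

```python
import math

def sample_complexity(K, gamma, delta):
    """Smallest m guaranteeing |eps_hat(h) - eps(h)| <= gamma for all h in H
    with probability >= 1 - delta, for a finite class of size K."""
    return math.ceil(1 / (2 * gamma**2) * math.log(2 * K / delta))

print(sample_complexity(K=10_000, gamma=0.05, delta=0.01))   # 2902
```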
$$\epsilon(\hat h)\le\hat\epsilon(\hat h)+\gamma\le\hat\epsilon(h^*)+\gamma\le\epsilon(h^*)+2\gamma.$$
(The first and third steps use uniform convergence; the second uses the fact that $\hat h$ minimizes the empirical risk.)
$\Rightarrow$ With probability $1-\delta$, for training-set size $m$,
$$\epsilon(\hat h)\le\epsilon(h^*)+2\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}.$$
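The $\gamma$ inside the square root comes from inverting the definition of $\delta$ above (short derivation added for completeness):

$$\delta=2K\exp(-2\gamma^2m)\;\Longrightarrow\;\gamma=\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}},\qquad\text{so}\qquad\epsilon(\hat h)\le\epsilon(h^*)+2\gamma=\epsilon(h^*)+2\sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}.$$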
The case of infinite $\mathcal H$
Def.
Given a set $S=\{x^{(1)},\cdots,x^{(D)}\}$ (no relation to the training set) of points $x^{(i)}\in\mathcal X$, we say that $\mathcal H$ "shatters" $S$ if $\mathcal H$ can realize any labeling on $S$, i.e., if for any set of labels $\{y^{(1)},\cdots,y^{(D)}\}$, there exists some $h\in\mathcal H$ such that $h(x^{(i)})=y^{(i)}$ for all $i=1,\cdots,D$.
$\text{VC}(\mathcal H)$: the size of the largest set that is "shattered" by $\mathcal H$.
$$\epsilon(\hat h)\le\epsilon(h^*)+O\left(\sqrt{\frac{\text{VC}(\mathcal H)}{m}\log\frac{m}{\text{VC}(\mathcal H)}+\frac{1}{m}\log\frac{1}{\delta}}\right).$$
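A brute-force shattering check (my own sketch) for the class of 1-D thresholds $h_t(x)=1\{x>t\}$: a single point can be shattered, but no set of two points can, so $\text{VC}(\mathcal H)=1$ for this class.

```python
import itertools
import numpy as np

def shatters(points, thresholds):
    """True if the threshold class {x -> 1{x > t}} realizes every labeling of `points`."""
    points = np.asarray(points, dtype=float)
    realizable = {tuple((points > t).astype(int)) for t in thresholds}
    return all(tuple(lab) in realizable
               for lab in itertools.product([0, 1], repeat=len(points)))

# Candidate thresholds: a fine grid is enough to realize every achievable labeling.
ts = np.linspace(-1, 2, 301)
print(shatters([0.5], ts))        # True  -> a set of size 1 is shattered
print(shatters([0.3, 0.7], ts))   # False -> the labeling (1, 0) is impossible
# Hence VC(H) = 1 for 1-D thresholds; richer classes (e.g. intervals) shatter more.
```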