Lecture video link: https://youtu.be/8NYoQiRANpg
SVM
- Optimization problem
- Representer theorem
- Kernels
- Examples of kernels
Recap
Optimal margin classifier:
$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}\|w\|^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$
$$\gamma^{(i)}=\frac{y^{(i)}(w^Tx^{(i)}+b)}{\|w\|}\;\;(\text{geometric margin}).$$
Assume $w=\sum\limits_{i=1}^m \alpha_ix^{(i)}$.
Why?
Intuition #1. Logistic regression
Initialize $\theta=0$.
Gradient descent:
$$\theta\leftarrow\theta-\alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\;\;(\text{stochastic})$$
$$\theta\leftarrow\theta-\alpha\sum\limits_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\;\;(\text{batch})$$
Every update adds multiples of the $x^{(i)}$, so starting from $\theta=0$, $\theta$ always remains a linear combination of the training examples.
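A minimal numpy sketch of this intuition (not from the lecture; the toy data is made up): run batch gradient descent for logistic regression from $\theta=0$ and check that $\theta$ stays in the span of the $x^{(i)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))            # m = 5 examples, n = 10 features
y = rng.integers(0, 2, size=5)          # labels in {0, 1}
theta = np.zeros(10)
alpha = 0.1

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
for _ in range(100):
    grad = X.T @ (sigmoid(X @ theta) - y)   # sum_i (h(x^(i)) - y^(i)) x^(i)
    theta -= alpha * grad                   # each step adds multiples of the x^(i)

# Express theta as a combination sum_i beta_i x^(i) via least squares
beta, *_ = np.linalg.lstsq(X.T, theta, rcond=None)
print(np.allclose(X.T @ beta, theta))       # True: theta lies in span{x^(1), ..., x^(m)}
```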
Intuition #2. $w\perp\text{decision boundary}$
For the proof of the representer theorem, refer to the lecture notes.
$$w=\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}$$
$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}\|w\|^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$
Substituting $w=\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}$:
$$\begin{aligned}&\min\;\frac{1}{2}\|w\|^2\\=&\min\;\frac{1}{2}\left(\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}\right)^T\left(\sum_{j=1}^m\alpha_jy^{(j)}x^{(j)}\right)\\=&\min\;\frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy^{(i)}y^{(j)}{x^{(i)}}^Tx^{(j)}\\=&\min\;\frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy^{(i)}y^{(j)}\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&y^{(i)}\left(\sum_j\alpha_jy^{(j)}\left\langle x^{(j)},x^{(i)}\right\rangle+b\right)\ge1.\end{aligned}$$
Dual optimization problem
$$\begin{aligned}\max_\alpha\;&\sum_i\alpha_i-\frac{1}{2}\sum_i\sum_jy^{(i)}y^{(j)}\alpha_i\alpha_j\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&\alpha_i\ge0,\\&\sum_iy^{(i)}\alpha_i=0.\end{aligned}$$
$$\begin{aligned}h_{w,b}(x)&=g(w^Tx+b)\\&=g\left(\sum_i\alpha_iy^{(i)}\left\langle x^{(i)},x\right\rangle+b\right)\end{aligned}$$
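Once the $\alpha_i$ and $b$ are known, prediction therefore touches the training data only through inner products. A minimal numpy sketch (the function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def predict(x, X_train, y_train, alpha, b):
    """Sketch of h_{w,b}(x) = g(sum_i alpha_i y^(i) <x^(i), x> + b),
    with g taken to be the sign function. alpha and b are assumed to
    come from solving the dual problem."""
    s = sum(a_i * y_i * np.dot(x_i, x)
            for a_i, y_i, x_i in zip(alpha, y_train, X_train)) + b
    return np.sign(s)
```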
Kernel trick:
1) Write the algorithm in terms of $\left\langle x^{(i)},x^{(j)}\right\rangle$.
2) Let there be a mapping $x\longmapsto\phi(x)$,
ex. $\begin{bmatrix}x_1\\x_2\end{bmatrix}\longmapsto\phi(x)=\begin{bmatrix}x_1\\x_2\\x_1x_2\\x_1^2x_2\\\vdots\end{bmatrix}$.
3) Find a way to compute $K(x,z)=\phi(x)^T\phi(z)$.
4) Replace $\left\langle x,z\right\rangle$ in the algorithm with $K(x,z)$ (see the sketch after this list).
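A minimal numpy sketch of step 4 (the helper name and toy data are illustrative): the algorithm keeps using a Gram matrix, only its entries change from inner products to kernel values.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Matrix of kernel(x^(i), x^(j)) over all pairs of rows of X."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

linear = lambda x, z: x @ z          # original algorithm: <x, z>
poly2  = lambda x, z: (x @ z) ** 2   # kernelized: K(x, z) = (x^T z)^2

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(gram_matrix(X, linear))   # plain inner products
print(gram_matrix(X, poly2))    # same algorithm, kernel substituted in
```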
Andrew Ng thinks the “no free lunch theorem” is a fascinating theoretical concept, but he finds it less useful in practice because we have inductive biases that turn out to be useful.
cf. the no free lunch theorem (famous in machine learning and optimization): it basically says that, in the worst case, learning algorithms do not work. (ref. https://en.wikipedia.org/wiki/No_free_lunch_theorem)
… but the universe is not that hostile toward us 😁 learning algorithms turn out to be okay.
ex 1.
$$x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\in\mathbb{R}^n,\;\;\phi(x)=\begin{bmatrix}x_1x_1\\x_1x_2\\x_1x_3\\x_2x_1\\x_2x_2\\x_2x_3\\x_3x_1\\x_3x_2\\x_3x_3\end{bmatrix}\in\mathbb{R}^{n^2},\;\;\phi(z)=\begin{bmatrix}z_1z_1\\z_1z_2\\z_1z_3\\z_2z_1\\z_2z_2\\z_2z_3\\z_3z_1\\z_3z_2\\z_3z_3\end{bmatrix}\in\mathbb{R}^{n^2}$$
$\phi(x)$ has $n^2$ elements, so computing $\phi(x)$ or $\phi(x)^T\phi(z)$ explicitly takes $O(n^2)$ time.
$K(x,z)=\phi(x)^T\phi(z)=(x^Tz)^2$ $\rightarrow$ $O(n)$ time computation:
$$\begin{aligned}(x^Tz)^2&=\left(\sum_{i=1}^nx_iz_i\right)\left(\sum_{j=1}^nx_jz_j\right)\\&=\sum_{i=1}^n\sum_{j=1}^nx_iz_ix_jz_j\\&=\sum_{i=1}^n\sum_{j=1}^n(x_ix_j)(z_iz_j),\end{aligned}$$
where the $x_ix_j$ and $z_iz_j$ are exactly the entries of $\phi(x)$ and $\phi(z)$.
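A quick numerical check (illustrative, for $n=3$) that the $O(n)$ quantity $(x^Tz)^2$ matches the explicit $O(n^2)$ feature map:

```python
import numpy as np

rng = np.random.default_rng(1)
x, z = rng.normal(size=3), rng.normal(size=3)

phi = lambda v: np.outer(v, v).ravel()             # all products v_i v_j (n^2 entries)
print(np.isclose(phi(x) @ phi(z), (x @ z) ** 2))   # True
```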
ex 2.
$K(x,z)=(x^Tz+c)^2,\;\;c\in\mathbb{R}$
where
$$x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\in\mathbb{R}^n,\;\;\phi(x)=\begin{bmatrix}x_1x_1\\x_1x_2\\x_1x_3\\x_2x_1\\x_2x_2\\x_2x_3\\x_3x_1\\x_3x_2\\x_3x_3\\\sqrt{2c}\,x_1\\\sqrt{2c}\,x_2\\\sqrt{2c}\,x_3\\c\end{bmatrix}\in\mathbb{R}^{n^2+n+1},\;\;\phi(z)=\begin{bmatrix}z_1z_1\\z_1z_2\\z_1z_3\\z_2z_1\\z_2z_2\\z_2z_3\\z_3z_1\\z_3z_2\\z_3z_3\\\sqrt{2c}\,z_1\\\sqrt{2c}\,z_2\\\sqrt{2c}\,z_3\\c\end{bmatrix}\in\mathbb{R}^{n^2+n+1}.$$
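Expanding the square shows how each term of the kernel matches one block of $\phi(x)$ and $\phi(z)$:
$$(x^Tz+c)^2=\sum_{i=1}^n\sum_{j=1}^n(x_ix_j)(z_iz_j)+\sum_{i=1}^n(\sqrt{2c}\,x_i)(\sqrt{2c}\,z_i)+c^2.$$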
ex 3.
$K(x,z)=(x^Tz+c)^d,\;\;c\in\mathbb{R}$
… still computed in $O(n)$ time.
$\phi(x)$ contains all $\binom{n+d}{d}$ monomial features of $x$ up to degree $d$.
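For a sense of scale, a quick Python check (the numbers are just illustrative) of how large that implicit feature space gets while the kernel itself stays an $O(n)$ dot product:

```python
from math import comb

# Number of monomials of degree <= d in n variables: C(n + d, d).
n, d = 100, 5
print(comb(n + d, d))   # 96,560,646 implicit features, yet K(x, z) = (x^T z + c)^d is cheap
```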
Optimal margin classifier + kernel trick = SVM
Nice visualization of how SVM works:
SVM with a polynomial Kernel visualization created by Udi Aharoni
Video link: https://youtu.be/3liCbRZPrZA
SVM in short:
Map the data to a much higher-dimensional feature space and find a linear decision boundary, i.e. a hyperplane, in that feature space. Projecting it back to the original feature space gives a non-linear (linear, if you're lucky) decision boundary.
Q. Are the features in the higher-dimensional feature space always linearly separable?
A. So far we're pretending they are, but we'll come back and fix that assumption later in this lecture.
How to make kernels
If $x,z$ are “similar,” $K(x,z)=\phi(x)^T\phi(z)$ is “large.”
If $x,z$ are “dissimilar,” $K(x,z)=\phi(x)^T\phi(z)$ is “small.”
(Think about the properties of the inner product.)
$$K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$$
is called the Gaussian kernel (the most widely used kernel).
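A minimal numpy sketch of this kernel (the test points are made up), showing the similar → large, dissimilar → small behaviour:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

print(gaussian_kernel(np.array([0.0, 0.0]), np.array([0.1, 0.0])))  # ~0.995 (close points)
print(gaussian_kernel(np.array([0.0, 0.0]), np.array([5.0, 5.0])))  # ~1.4e-11 (far apart)
```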
Let $\{x^{(1)},\cdots,x^{(d)}\}$ be $d$ points and $K\in\mathbb{R}^{d\times d}$ be the kernel matrix, where $K_{ij}=K(x^{(i)},x^{(j)})$.
Given any vector $z$,
$$\begin{aligned}z^TKz&=\sum_i\sum_jz_iK_{ij}z_j\\&=\sum_i\sum_jz_i\,\phi(x^{(i)})^T\phi(x^{(j)})\,z_j\\&=\sum_i\sum_j\sum_kz_i\left(\phi(x^{(i)})\right)_k\left(\phi(x^{(j)})\right)_kz_j\\&=\sum_k\left(\sum_iz_i\left(\phi(x^{(i)})\right)_k\right)^2\\&\ge0.\end{aligned}$$
$\Rightarrow K\ge0$ (positive semi-definite).
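A small numerical sanity check (illustrative, using the Gaussian kernel with $\sigma=1$ on random points): the kernel matrix has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                      # Gaussian kernel matrix, sigma = 1
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # True: K is positive semi-definite
```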
Theorem (Mercer)
$K$ is a valid kernel function (i.e. $\exists\,\phi$ s.t. $K(x,z)=\phi(x)^T\phi(z)$) if and only if for any $d$ points $\{x^{(1)},\cdots,x^{(d)}\}$, the corresponding kernel matrix $K\ge0$.
Linear kernel: $K(x,z)=x^Tz,\;\phi(x)=x.$
Gaussian kernel: $K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right),\;\phi(x)\in\mathbb{R}^\infty$.
Note:
All of the learning algorithms we've seen so far can be written in terms of inner products, so you can apply the kernel trick to them.
e.g. linear regression, logistic regression, everything in the generalized linear model family, the perceptron algorithm, etc.
**$L_1$ norm soft margin SVM:**
$$\begin{aligned}\min_{w,b,\xi}\;\;&\frac{1}{2}\|w\|^2+c\sum_{i=1}^m\xi_i\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1-\xi_i,\;\;i=1,\cdots,m,\\&\xi_i\ge0.\end{aligned}$$
Why use the $L_1$ norm soft margin SVM?
Imagine adding one training example with label “x” that lies close to the training examples labeled “o.” The basic optimal margin classifier lets this single example cause the decision boundary to swing dramatically, because it optimizes the worst-case margin → one outlier can have a huge impact on the decision boundary.
The $L_1$ norm soft margin SVM, however, keeps the decision boundary close to the original one even in the presence of such an outlier.
$$\begin{aligned}\max_\alpha\;&\sum_i\alpha_i-\frac{1}{2}\sum_i\sum_jy^{(i)}y^{(j)}\alpha_i\alpha_j\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&0\le\alpha_i\le c,\;\;i=1,\cdots,m,\\&\sum_iy^{(i)}\alpha_i=0.\end{aligned}$$
Here the inner product can again be replaced with a kernel, e.g. $K(x,z)=(x^Tz)^d$ or $K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$.
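In practice this soft-margin dual is what off-the-shelf SVM solvers implement; a minimal sketch (assuming scikit-learn is available; its `C` plays the role of $c$ above, and `gamma` corresponds to $\frac{1}{2\sigma^2}$ for the Gaussian kernel):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # non-linearly separable toy labels

clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)  # soft margin + Gaussian kernel
print(clf.score(X, y))                               # training accuracy
```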
Protein sequence classifier
Suppose amino acids are labeled with the 26 letters A through Z (even though we all know there are actually only 20 amino acids).
seq ex: BAJTSTAIBAJTAU…
$\phi(x)=?$
$\phi(x)\in\mathbb{R}^{26^4}$, with one entry counting the occurrences of each possible length-4 substring of the sequence.
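A hedged Python sketch of this feature map and the corresponding kernel (the sequences are made up): $\phi(x)$ counts every length-4 substring, and $K(x,z)=\phi(x)^T\phi(z)$ can be computed without materializing the $26^4$-dimensional vectors.

```python
from collections import Counter

def substring_counts(seq, k=4):
    """phi(seq): counts of every length-k substring that occurs."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def protein_kernel(x, z, k=4):
    """K(x, z) = phi(x)^T phi(z), summing only over shared substrings."""
    cx, cz = substring_counts(x, k), substring_counts(z, k)
    return sum(cx[s] * cz[s] for s in cx if s in cz)

print(protein_kernel("BAJTSTAIBAJTAU", "AJTSBAJT"))   # counts shared 4-letter substrings
```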