Lecture 06. Support Vector Machines

cryptnomy · November 23, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/lDwow4aOrtg

Outline

  • Naive Bayes
    • Laplace smoothing
    • Event models
  • Comments on applying ML
  • SVM intro

Recap

$$x=\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ \vdots \end{bmatrix} \begin{matrix} \small\text{a} \\ \small\text{aardvark} \\ \vdots \\ \small\text{buy} \\ \vdots \\ \\ \end{matrix}$$

$$x_j=1\{\text{word}\;j\;\text{appears}\;\text{in}\;\text{e-mail}\}$$
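For concreteness, a minimal sketch of how such an indicator vector could be built (the handful of dictionary entries and their indices are illustrative, taken from the spam example later in the post):

```python
import numpy as np

VOCAB_SIZE = 10000
# Hypothetical handful of dictionary entries (1-indexed, as in the lecture).
WORD_TO_INDEX = {"a": 1, "aardvark": 2, "buy": 800, "drugs": 1600, "now": 6200}

def bernoulli_features(email):
    """x_j = 1 if dictionary word j appears in the e-mail, else 0."""
    x = np.zeros(VOCAB_SIZE + 1)  # slot 0 unused so indices match the 1-based dictionary
    for word in email.lower().split():
        j = WORD_TO_INDEX.get(word.strip("!.,?"))
        if j is not None:
            x[j] = 1.0
    return x

x = bernoulli_features("Drugs! Buy drugs now!")
print(x[800], x[1600], x[6200])  # 1.0 1.0 1.0
```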

Generative model:

$$p(x|y),\ p(y)$$

$$p(x|y)=\prod_{j=1}^{10000} p(x_j|y)$$

Parameters:

$$p(y=1)=\phi_y,\quad p(x_j=1|y=0)=\phi_{j|y=0},\quad p(x_j=1|y=1)=\phi_{j|y=1}$$

Maximum likelihood estimates:

$$\begin{aligned} \phi_y&=\frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\} \\ \phi_{j|y=1}&=\frac{\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,\ y^{(i)}=1\}}{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}}. \end{aligned}$$

At prediction time:

$$p(y=1|x)=\frac{p(x|y=1)p(y=1)}{p(x|y=1)p(y=1)+p(x|y=0)p(y=0)}.$$
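A minimal sketch of the fit-and-predict recap above (assuming a 0/1 design matrix `X` and labels `y` as NumPy arrays; the function names are illustrative, not from the lecture):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """MLE of phi_y, phi_{j|y=0}, phi_{j|y=1} (no Laplace smoothing yet)."""
    phi_y = np.mean(y == 1)                 # fraction of spam e-mails
    phi_j_y1 = X[y == 1].mean(axis=0)       # fraction of spam e-mails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)       # fraction of non-spam e-mails containing word j
    return phi_y, phi_j_y0, phi_j_y1

def predict_proba(x, phi_y, phi_j_y0, phi_j_y1):
    """p(y=1|x) via Bayes' rule with the Naive Bayes factorization p(x|y) = prod_j p(x_j|y)."""
    p_x_given_y1 = np.prod(phi_j_y1 ** x * (1 - phi_j_y1) ** (1 - x))
    p_x_given_y0 = np.prod(phi_j_y0 ** x * (1 - phi_j_y0) ** (1 - x))
    numerator = p_x_given_y1 * phi_y
    return numerator / (numerator + p_x_given_y0 * (1 - phi_y))
```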

Comment: statistically, it is a bad idea to estimate the probability of an event as 0 just because you have not yet seen it occur. → Laplace smoothing, which helps address this problem.
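A minimal sketch of the Laplace-smoothed estimate for the Bernoulli model above (Laplace smoothing adds 1 to each count in the numerator and the number of possible feature values, here 2, to the denominator; the function name is illustrative):

```python
import numpy as np

def smoothed_phi(X, y, label):
    """Laplace-smoothed estimate of p(x_j = 1 | y = label) in the Bernoulli model."""
    X_label = X[y == label]
    # +1 to the count of e-mails (with this label) containing word j,
    # +2 to the count of e-mails with this label (x_j can take 2 values).
    return (X_label.sum(axis=0) + 1.0) / (X_label.shape[0] + 2.0)
```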

ex.

Spam content: “Drugs! Buy drugs now!”

Multivariate Bernoulli event model

$$x=\begin{bmatrix}\text{a (1)} \\ \text{aardvark (2)} \\ \vdots \\ \text{buy (800)} \\ \vdots \\ \text{drugs (1600)} \\ \vdots \\ \text{now (6200)} \\ \vdots\end{bmatrix}=[0, 0, \cdots, 1, \cdots, 1, \cdots, 1, \cdots]^T\in\{0,1\}^{10000}.$$

This representation loses some information, since it does not capture the fact that the word “drugs” appears twice in the spam e-mail.

Multinomial event model

$$x=[1600, 800, 1600, 6200]^T\in\mathbb{R}^{n_i}$$

$$x_j\in\{1,\cdots,10000\}$$

$$n_i:\ \text{length of e-mail } i
$$
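A minimal sketch of building this representation (the toy lookup reuses the word indices from the example above; the real dictionary has 10,000 entries):

```python
# Hypothetical subset of the dictionary, with the indices used in the example.
WORD_TO_INDEX = {"a": 1, "aardvark": 2, "buy": 800, "drugs": 1600, "now": 6200}

def multinomial_features(email):
    """x = list of dictionary indices of the e-mail's words, so len(x) = n_i."""
    tokens = [w.strip("!.,?") for w in email.lower().split()]
    return [WORD_TO_INDEX[w] for w in tokens if w in WORD_TO_INDEX]

print(multinomial_features("Drugs! Buy drugs now!"))  # [1600, 800, 1600, 6200]
```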

Parameters:

$$\begin{aligned} \phi_y&=p(y=1) \\ \phi_{k|y=0}&=p(x_j=k|y=0) \end{aligned}$$

where the RHS of the second equation is the probability that the word in position $j$ of the e-mail is word $k$ of the dictionary, given $y=0$.

The LHS of the second equation does not depend on $j$: the model assumes that this probability is the same for every position $j$ in the e-mail.

$$\phi_{k|y=1}=p(x_j=k|y=1).$$

MLE:

$$\begin{aligned} \phi_{k|y=0} &= \frac{\sum\limits_{i=1}^m \left(1\{y^{(i)}=0\} \sum\limits_{j=1}^{n_i} 1\{x_j^{(i)}=k\}\right)}{\sum\limits_{i=1}^m 1\{y^{(i)}=0\}\cdot n_i} \\ &=p(x_j=k|y=0) \end{aligned}$$

Comment: what does the above formula mean in plain English?

→ Look at all the words in all of your non-spam e-mails $(y=0)$; what fraction of those words is the word “drugs” (word $k$)?

That fraction is your estimate of the chance that the word “drugs” appears in any given position of a non-spam e-mail.

To implement Laplace smoothing with this formula, you would add 1 to the numerator and 10,000 to the denominator.
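A minimal sketch of this smoothed estimate, assuming each e-mail is given as a list of word indices (e.g., the output of `multinomial_features` above) with labels `y`; the names are illustrative:

```python
import numpy as np

def phi_k_given_y(docs, y, label, vocab_size=10000):
    """Laplace-smoothed estimate of p(x_j = k | y = label) for every word k."""
    counts = np.zeros(vocab_size + 1)  # counts[k] = occurrences of word k; slot 0 unused
    total_positions = 0                # total number of word positions in e-mails with this label
    for doc, label_i in zip(docs, y):
        if label_i == label:
            for k in doc:
                counts[k] += 1
            total_positions += len(doc)
    # Laplace smoothing: add 1 to the numerator and 10,000 (the vocabulary size) to the denominator.
    return (counts + 1.0) / (total_positions + vocab_size)
```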

For most problems, you will find that logistic regression delivers higher accuracy than Naive Bayes. But the advantages of Naive Bayes are:

  • Computationally very efficient
  • Relatively quick to implement
  • Does not require an iterative procedure such as gradient descent
  • The code is relatively short

Advice

When you get started on a machine learning project, start by implementing something quick and dirty, rather than the most complicated possible learning algorithm.

→ You can then better understand how it’s performing.

→ Do error analysis.

→ Use that to drive your development.

Building a very complicated algorithm at the outset: not recommended.

Implementing something quickly: recommended.

What if some spammers deliberately misspell words, e.g., …

“mϕrtgag3” for “mortgage”?

For GDA and Naive Bayes …

despite their relatively lower accuracy, they are very quick to train and non-iterative.

Support Vector Machines

→ help us find potentially very “non-linear” decision boundaries.

One of the reasons SVMs are used today is that they are relatively turn-key algorithms, which means they do not have many parameters to fiddle with.

SVMs are not as effective as neural networks for many problems, but …

one great property of SVMs is that they are turn-key.

Optimal margin classifier (separable case) (in this lecture)

linearly separable

Kernels (next lecture)

ex.

$$x\longmapsto\phi(x):\ \mathbb{R}^2\longrightarrow\mathbb{R}^n$$

where $n$ can be a very large number, even infinite.

(Q. How do we define an infinite-dimensional image of $x$?)
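For finite $n$, a toy sketch of what such a feature map could look like (a degree-2 monomial map from $\mathbb{R}^2$ to $\mathbb{R}^6$; this particular choice is only an illustration, not the lecture's):

```python
import numpy as np

def phi(x):
    """Map x = (x1, x2) in R^2 to a degree-2 monomial feature vector in R^6."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
```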

Inseparable case (next lecture)

Functional margin

$$h_\theta(x)=g(\theta^T x)$$

Predict “1” if $\theta^T x \ge 0$, otherwise “0.”

If $y^{(i)}=1$, hope that $\theta^T x \gg 0$.

If $y^{(i)}=0$, hope that $\theta^T x \ll 0$.

Geometric margin

Ref: lecture-notes-cs229, Ch. 06 SVM figures…

Previously…

$$h_\theta(x)=g(\theta^T x)$$

where $x\in\mathbb{R}^{n+1}$, $x_0=1$. Rewrite the above as

$$h_{w,b}(x)=g(w^Tx+b)$$

where $w\in\mathbb{R}^n$, $b\in\mathbb{R}$.

Functional margin of the hyperplane defined by $(w,b)$ w.r.t. $(x^{(i)}, y^{(i)})$:

$$\hat \gamma^{(i)}=y^{(i)}(w^Tx^{(i)}+b).$$

If $y^{(i)}=1$, want $w^Tx^{(i)}+b\gg0$.

If $y^{(i)}=-1$, want $w^Tx^{(i)}+b\ll0$.

Want $\hat\gamma^{(i)}\gg0$.

If $\hat\gamma^{(i)}>0$, then $h(x^{(i)})=y^{(i)}$.

Functional margin w.r.t. training set

$$\hat\gamma=\min_i\hat\gamma^{(i)},\;\;i=1,\cdots,m.$$

Note: rescaling $(w,b)$ does not change the decision boundary, so we can always normalize

$$(w,b)\longrightarrow\left(\frac{w}{||w||}, \frac{b}{||w||}\right).$$
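A minimal sketch of computing these quantities (assuming labels $y^{(i)}\in\{-1,+1\}$ stored in a NumPy array `y` and examples in the rows of `X`; the names are illustrative):

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margins gamma_hat^{(i)} = y^{(i)}(w^T x^{(i)} + b) and their minimum."""
    margins = y * (X @ w + b)
    return margins, margins.min()
```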

Geometric margin

The Euclidean distance from the training point $x^{(i)}$ to the decision boundary $w^Tx+b=0$, taken with sign according to the label $y^{(i)}$.

Geometric margin of the hyperplane defined by $(w,b)$ w.r.t. $(x^{(i)}, y^{(i)})$:

$$\gamma^{(i)}=\frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||}.$$

Relation between functional margin and geometric margin:

$$\text{geometric margin}=\frac{\text{functional margin}}{||w||}.$$

Geometric margin with training set:

$$\gamma=\min_i\gamma^{(i)}.$$

cf. $\hat\gamma$: functional margin, $\gamma$: geometric margin.
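The corresponding sketch for the geometric margin of the training set differs from the functional-margin sketch above only by the $1/||w||$ factor:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """gamma = min_i y^{(i)}(w^T x^{(i)} + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```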

Optimal margin classifier:

Choose w,bw,b to maximize γ\gamma.

Mathematically,

$$\begin{aligned}\max_{\gamma,w,b}\;\;&\gamma\\\text{s.t.}\;\;&\frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||}\ge\gamma,\;\;i=1,\cdots,m.\end{aligned}$$

Or you can reformulate this problem into the equivalent problem:

$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}||w||^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$
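A hedged sketch of solving this quadratic program for linearly separable data with `cvxpy` (an external convex-optimization library; the library choice and all names here are assumptions, not part of the lecture):

```python
import cvxpy as cp
import numpy as np

def optimal_margin_classifier(X, y):
    """Solve min 1/2 ||w||^2 s.t. y^{(i)}(w^T x^{(i)} + b) >= 1; assumes X, y are separable."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # one margin constraint per example
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value
```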
