Lecture 06. Support Vector Machines

cryptnomy · November 23, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/lDwow4aOrtg

Outline

  • Naive Bayes
    • Laplace smoothing
    • Event models
  • Comments on applying ML
  • SVM intro

Recap

$$x=\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ \vdots \end{bmatrix} \begin{matrix} \small\text{a} \\ \small\text{aardvark} \\ \vdots \\ \small\text{buy} \\ \vdots \\ \\ \end{matrix}$$

$$x_j=1\{\text{word}\;j\;\text{appears}\;\text{in}\;\text{e-mail}\}$$
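For concreteness, a minimal sketch of how such an indicator vector could be built (the handful of dictionary entries and their indices are illustrative, taken from the spam example later in the post):

```python
import numpy as np

VOCAB_SIZE = 10000
# Hypothetical handful of dictionary entries (1-indexed, as in the lecture).
WORD_TO_INDEX = {"a": 1, "aardvark": 2, "buy": 800, "drugs": 1600, "now": 6200}

def bernoulli_features(email):
    """x_j = 1 if dictionary word j appears in the e-mail, else 0."""
    x = np.zeros(VOCAB_SIZE + 1)  # slot 0 unused so indices match the 1-based dictionary
    for word in email.lower().split():
        j = WORD_TO_INDEX.get(word.strip("!.,?"))
        if j is not None:
            x[j] = 1.0
    return x

x = bernoulli_features("Drugs! Buy drugs now!")
print(x[800], x[1600], x[6200])  # 1.0 1.0 1.0
```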

Generative model:

$$p(x|y),\ p(y)$$

$$p(x|y)=\prod_{j=1}^{10000} p(x_j|y)$$

Parameters:

$$p(y=1)=\phi_y,\quad p(x_j=1|y=0)=\phi_{j|y=0},\quad p(x_j=1|y=1)=\phi_{j|y=1}$$

Maximum likelihood estimates:

$$\begin{aligned} \phi_y&=\frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\} \\ \phi_{j|y=1}&=\frac{\sum\limits_{i=1}^m 1\{x_j^{(i)}=1,\ y^{(i)}=1\}}{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}}. \end{aligned}$$

At prediction time:

$$p(y=1|x)=\frac{p(x|y=1)p(y=1)}{p(x|y=1)p(y=1)+p(x|y=0)p(y=0)}.$$
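A minimal sketch of the fit-and-predict recap above (assuming a 0/1 design matrix `X` and labels `y` as NumPy arrays; the function names are illustrative, not from the lecture):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """MLE of phi_y, phi_{j|y=0}, phi_{j|y=1} (no Laplace smoothing yet)."""
    phi_y = np.mean(y == 1)                 # fraction of spam e-mails
    phi_j_y1 = X[y == 1].mean(axis=0)       # fraction of spam e-mails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)       # fraction of non-spam e-mails containing word j
    return phi_y, phi_j_y0, phi_j_y1

def predict_proba(x, phi_y, phi_j_y0, phi_j_y1):
    """p(y=1|x) via Bayes' rule with the Naive Bayes factorization p(x|y) = prod_j p(x_j|y)."""
    p_x_given_y1 = np.prod(phi_j_y1 ** x * (1 - phi_j_y1) ** (1 - x))
    p_x_given_y0 = np.prod(phi_j_y0 ** x * (1 - phi_j_y0) ** (1 - x))
    numerator = p_x_given_y1 * phi_y
    return numerator / (numerator + p_x_given_y0 * (1 - phi_y))
```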

Comment: statistically, it is a bad idea to estimate the probability of an event as 0 just because you have not yet seen it occur. → Laplace smoothing, which helps address this problem.
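A minimal sketch of the Laplace-smoothed estimate for the Bernoulli model above (Laplace smoothing adds 1 to each count in the numerator and the number of possible feature values, here 2, to the denominator; the function name is illustrative):

```python
import numpy as np

def smoothed_phi(X, y, label):
    """Laplace-smoothed estimate of p(x_j = 1 | y = label) in the Bernoulli model."""
    X_label = X[y == label]
    # +1 to the count of e-mails (with this label) containing word j,
    # +2 to the count of e-mails with this label (x_j can take 2 values).
    return (X_label.sum(axis=0) + 1.0) / (X_label.shape[0] + 2.0)
```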

ex.

Spam content: “Drugs! Buy drugs now!”

Multivariate Bernoulli event model

$$x=\begin{bmatrix}\text{a (1)} \\ \text{aardvark (2)} \\ \vdots \\ \text{buy (800)} \\ \vdots \\ \text{drugs (1600)} \\ \vdots \\ \text{now (6200)} \\ \vdots\end{bmatrix}=[0, 0, \cdots, 1, \cdots, 1, \cdots, 1, \cdots]^T\in\{0,1\}^{10000}.$$

This representation loses some information, since it does not capture the fact that the word “drugs” appears twice in the spam e-mail.

Multinomial event model

$$x=[1600, 800, 1600, 6200]^T\in\mathbb{R}^{n_i}$$

$$x_j\in\{1,\cdots,10000\}$$

$$n_i:\ \text{length of e-mail } i
$$
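A minimal sketch of building this representation (the toy lookup reuses the word indices from the example above; the real dictionary has 10,000 entries):

```python
# Hypothetical subset of the dictionary, with the indices used in the example.
WORD_TO_INDEX = {"a": 1, "aardvark": 2, "buy": 800, "drugs": 1600, "now": 6200}

def multinomial_features(email):
    """x = list of dictionary indices of the e-mail's words, so len(x) = n_i."""
    tokens = [w.strip("!.,?") for w in email.lower().split()]
    return [WORD_TO_INDEX[w] for w in tokens if w in WORD_TO_INDEX]

print(multinomial_features("Drugs! Buy drugs now!"))  # [1600, 800, 1600, 6200]
```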

Parameters:

$$\begin{aligned} \phi_y&=p(y=1) \\ \phi_{k|y=0}&=p(x_j=k|y=0) \end{aligned}$$

where the RHS of the second equation is the probability that the word in position $j$ of the e-mail is word $k$ of the dictionary, given $y=0$.

The LHS of the second equation does not depend on $j$: the model assumes that this probability is the same for every position $j$ in the e-mail.

$$\phi_{k|y=1}=p(x_j=k|y=1).$$

MLE:

$$\begin{aligned} \phi_{k|y=0} &= \frac{\sum\limits_{i=1}^m \left(1\{y^{(i)}=0\} \sum\limits_{j=1}^{n_i} 1\{x_j^{(i)}=k\}\right)}{\sum\limits_{i=1}^m 1\{y^{(i)}=0\}\cdot n_i} \\ &=p(x_j=k|y=0) \end{aligned}$$

Comment: what does the above formula mean in plain English?

→ Look at all the words in all of your non-spam e-mails $(y=0)$; what fraction of those words is the word “drugs” (word $k$)?

That fraction is your estimate of the chance that the word “drugs” appears in any given position of a non-spam e-mail.

To implement Laplace smoothing with this formula, you would add 1 to the numerator and 10,000 to the denominator.
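A minimal sketch of this smoothed estimate, assuming each e-mail is given as a list of word indices (e.g., the output of `multinomial_features` above) with labels `y`; the names are illustrative:

```python
import numpy as np

def phi_k_given_y(docs, y, label, vocab_size=10000):
    """Laplace-smoothed estimate of p(x_j = k | y = label) for every word k."""
    counts = np.zeros(vocab_size + 1)  # counts[k] = occurrences of word k; slot 0 unused
    total_positions = 0                # total number of word positions in e-mails with this label
    for doc, label_i in zip(docs, y):
        if label_i == label:
            for k in doc:
                counts[k] += 1
            total_positions += len(doc)
    # Laplace smoothing: add 1 to the numerator and 10,000 (the vocabulary size) to the denominator.
    return (counts + 1.0) / (total_positions + vocab_size)
```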

For most problems, you will find that logistic regression delivers higher accuracy than Naive Bayes. But the advantages of Naive Bayes are:

  • Computationally very efficient
  • Relatively quick to implement
  • Does not require an iterative procedure such as gradient descent
  • The code is relatively short

Advice

When you get started on a machine learning project, start by implementing something quick and dirty, rather than the most complicated possible learning algorithm.

→ You can then better understand how it’s performing.

→ Do error analysis.

→ Use that to drive your development.

Building a very complicated algorithm at the outset: not recommended.

Implementing something quickly: recommended.

What if some spammers deliberately misspell words, e.g., …

“mϕrtgag3” for “mortgage”?

For GDA and Naive Bayes …

despite their relatively lower accuracy, they are very quick to train and non-iterative.

Support Vector Machines

→ help us find potentially very “non-linear” decision boundaries.

One of the reasons SVMs are used today is that they are relatively turn-key algorithms, which means they do not have many parameters to fiddle with.

SVMs are not as effective as neural networks for many problems, but …

one great property of SVMs is that they are turn-key.

Optimal margin classifier (separable case) (in this lecture)

linearly separable

Kernels (next lecture)

ex.

$$x\longmapsto\phi(x):\ \mathbb{R}^2\longrightarrow\mathbb{R}^n$$

where $n$ can be a very large number, even infinite.

(Q. How do we define an infinite-dimensional image of $x$?)
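For finite $n$, a toy sketch of what such a feature map could look like (a degree-2 monomial map from $\mathbb{R}^2$ to $\mathbb{R}^6$; this particular choice is only an illustration, not the lecture's):

```python
import numpy as np

def phi(x):
    """Map x = (x1, x2) in R^2 to a degree-2 monomial feature vector in R^6."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
```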

Inseparable case (next lecture)

Functional margin

$$h_\theta(x)=g(\theta^T x)$$

Predict “1” if $\theta^T x \ge 0$, otherwise “0.”

If $y^{(i)}=1$, hope that $\theta^T x \gg 0$.

If $y^{(i)}=0$, hope that $\theta^T x \ll 0$.

Geometric margin

Ref: lecture-notes-cs229, Ch. 06 SVM figures…

Previously…

$$h_\theta(x)=g(\theta^T x)$$

where $x\in\mathbb{R}^{n+1}$, $x_0=1$. Rewrite the above as

$$h_{w,b}(x)=g(w^Tx+b)$$

where $w\in\mathbb{R}^n$, $b\in\mathbb{R}$.

Functional margin of the hyperplane defined by $(w,b)$ w.r.t. $(x^{(i)}, y^{(i)})$:

$$\hat \gamma^{(i)}=y^{(i)}(w^Tx^{(i)}+b).$$

If $y^{(i)}=1$, want $w^Tx^{(i)}+b\gg0$.

If $y^{(i)}=-1$, want $w^Tx^{(i)}+b\ll0$.

Want $\hat\gamma^{(i)}\gg0$.

If $\hat\gamma^{(i)}>0$, then $h(x^{(i)})=y^{(i)}$.

Functional margin w.r.t. training set

$$\hat\gamma=\min_i\hat\gamma^{(i)},\;\;i=1,\cdots,m.$$

Note: rescaling $(w,b)$ does not change the decision boundary, so we can always normalize

$$(w,b)\longrightarrow\left(\frac{w}{||w||}, \frac{b}{||w||}\right).$$
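A minimal sketch of computing these quantities (assuming labels $y^{(i)}\in\{-1,+1\}$ stored in a NumPy array `y` and examples in the rows of `X`; the names are illustrative):

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margins gamma_hat^{(i)} = y^{(i)}(w^T x^{(i)} + b) and their minimum."""
    margins = y * (X @ w + b)
    return margins, margins.min()
```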

Geometric margin

The Euclidean distance from the training point $x^{(i)}$ to the decision boundary $w^Tx+b=0$, taken with sign according to the label $y^{(i)}$.

Geometric margin of the hyperplane defined by $(w,b)$ w.r.t. $(x^{(i)}, y^{(i)})$:

$$\gamma^{(i)}=\frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||}.$$

Relation between functional margin and geometric margin:

$$\text{geometric margin}=\frac{\text{functional margin}}{||w||}.$$

Geometric margin with training set:

$$\gamma=\min_i\gamma^{(i)}.$$

cf. $\hat\gamma$: functional margin, $\gamma$: geometric margin.
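The corresponding sketch for the geometric margin of the training set differs from the functional-margin sketch above only by the $1/||w||$ factor:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """gamma = min_i y^{(i)}(w^T x^{(i)} + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```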

Optimal margin classifier:

Choose w,bw,b to maximize γ\gamma.

Mathematically,

$$\begin{aligned}\max_{\gamma,w,b}\;\;&\gamma\\\text{s.t.}\;\;&\frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||}\ge\gamma,\;\;i=1,\cdots,m.\end{aligned}$$

Or you can reformulate this problem into the equivalent problem:

$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}||w||^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$
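A hedged sketch of solving this quadratic program for linearly separable data with `cvxpy` (an external convex-optimization library; the library choice and all names here are assumptions, not part of the lecture):

```python
import cvxpy as cp
import numpy as np

def optimal_margin_classifier(X, y):
    """Solve min 1/2 ||w||^2 s.t. y^{(i)}(w^T x^{(i)} + b) >= 1; assumes X, y are separable."""
    m, n = X.shape
    w = cp.Variable(n)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # one margin constraint per example
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value
```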
