Lecture video link: https://youtu.be/nt63k3bfXS0
Outline
Gaussian Discriminant Analysis (GDA)
Generative vs. Discriminative comparison
Naive Bayes
Discriminative:
Learn $p(y|x)$.
(Or learn $h_\theta(x)\in\{0,1\}$ directly.)
Generative:
Learn $p(x|y)$.
($x$: features, $y$: class)
$p(y)$: class prior
Bayes rule:
$$p(y=1|x)=\frac{p(x|y=1)\,p(y=1)}{p(x)},\qquad p(x)=p(x|y=1)\,p(y=1)+p(x|y=0)\,p(y=0).$$
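A quick numeric check of these two formulas (all probability values below are made up for illustration):

```python
# Assumed toy values: class prior p(y=1)=0.3 and class-conditional
# densities evaluated at some fixed x.
p_y1 = 0.3           # class prior p(y=1)
p_x_given_y1 = 0.50  # p(x|y=1) at a fixed x (assumed)
p_x_given_y0 = 0.10  # p(x|y=0) at the same x (assumed)

# Total probability: p(x) = p(x|y=1)p(y=1) + p(x|y=0)p(y=0)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

# Posterior via Bayes' rule
posterior = p_x_given_y1 * p_y1 / p_x
print(posterior)  # ~0.6818: x shifts the prior 0.3 strongly toward y=1
```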
Gaussian Discriminant Analysis (GDA)
Assume $x\in\mathbb{R}^n$ (i.e., drop the $x_0=1$ convention) and that $p(x|y)$ is Gaussian.
$$z\sim\mathcal{N}(\vec\mu,\Sigma),\quad z\in\mathbb{R}^n,\ \vec\mu\in\mathbb{R}^n,\ \Sigma\in\mathbb{R}^{n\times n}.$$
Then
$$\begin{aligned} \mathbb{E}[z]&=\mu \\ \operatorname{Cov}(z)&=\mathbb{E}[(z-\mu)(z-\mu)^T]=\mathbb{E}[zz^T]-(\mathbb{E}[z])(\mathbb{E}[z])^T \end{aligned}$$
(cf. notation: $\mathbb{E}[z]=\mathbb{E}z$) and
$$p(z)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right).$$
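The density formula can be evaluated directly in NumPy (a minimal sketch; `gaussian_pdf` is an assumed helper name, not lecture code):

```python
import numpy as np

def gaussian_pdf(z, mu, Sigma):
    """Evaluate the N(mu, Sigma) density at z, with z, mu in R^n."""
    n = mu.shape[0]
    diff = z - mu
    # Normalizing constant 1 / ((2*pi)^(n/2) |Sigma|^(1/2))
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    # Quadratic form (z-mu)^T Sigma^{-1} (z-mu) via a linear solve
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Sanity check against the 1-D standard normal at its mode: 1/sqrt(2*pi)
val = gaussian_pdf(np.zeros(1), np.zeros(1), np.eye(1))
print(val)  # ~0.3989
```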
(Figures of example pdfs omitted; see https://youtu.be/nt63k3bfXS0, 12 min. ~ 16 min.)
GDA model
$$p(x|y=0)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right)$$
$$p(x|y=1)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)$$
$$p(y)=\phi^y(1-\phi)^{1-y}$$
Parameters: $\mu_0\in\mathbb{R}^n$, $\mu_1\in\mathbb{R}^n$, $\Sigma\in\mathbb{R}^{n\times n}$, $\phi\in\mathbb{R}$.
Training Set
$$\{x^{(i)}, y^{(i)}\}_{i=1}^{m}$$
Joint likelihood
$$\begin{aligned} \mathcal{L}(\phi,\mu_0,\mu_1,\Sigma) &= \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) \\ &= \prod_{i=1}^m p(x^{(i)}|y^{(i)})\,p(y^{(i)}). \end{aligned}$$
Discriminative:
Conditional likelihood
$$\mathcal{L}(\theta)=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta).$$
Maximum likelihood estimation:
$$\max_{\phi,\mu_0,\mu_1,\Sigma}\; \ell(\phi,\mu_0,\mu_1,\Sigma),$$
where $\ell=\log\mathcal{L}$ is the log-likelihood.
Then parameters are determined as
$$\begin{aligned} \phi&=\frac{1}{m}\sum_{i=1}^{m} y^{(i)}=\frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\} \\ \mu_0&=\frac{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}} \\ \mu_1&=\frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}} \\ \Sigma&=\frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T \end{aligned}$$
where indicator notation is defined as
$$1\{\text{true}\}=1 \quad\text{and}\quad 1\{\text{false}\}=0.$$
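The closed-form estimates above can be sketched in NumPy (a minimal sketch; `fit_gda` and the toy data are assumptions, not lecture code):

```python
import numpy as np

def fit_gda(X, y):
    """GDA MLE. X: m x n design matrix, y: length-m array of 0/1 labels."""
    m = X.shape[0]
    phi = np.mean(y == 1)              # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)       # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)       # mean of class-1 examples
    # Shared covariance: each example's residual is taken around the
    # mean of its own class (mu_{y^{(i)}} in the formula above).
    mu_y = np.where((y == 1)[:, None], mu1, mu0)
    diff = X - mu_y
    Sigma = diff.T @ diff / m
    return phi, mu0, mu1, Sigma

# Tiny synthetic dataset with two well-separated classes.
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)
print(phi, mu0, mu1)  # phi = 0.5, mu0 ~ [0.1, 0.0], mu1 ~ [3.1, 3.0]
```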
Prediction:
$$\begin{aligned} \arg\max_y\, p(y|x) &= \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} \\ &= \arg\max_y\, p(x|y)\,p(y). \end{aligned}$$
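A minimal sketch of this decision rule, computed in log space so the product $p(x|y)\,p(y)$ becomes a sum and small densities do not underflow (the parameter values below are made up for illustration):

```python
import numpy as np

# Assumed fitted parameters, chosen by hand for this example.
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
phi = 0.5

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for x, mu in R^n."""
    n = mu.shape[0]
    diff = x - mu
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.solve(Sigma, diff))

def predict(x):
    # arg max over y of log p(x|y) + log p(y)
    score0 = log_gaussian(x, mu0, Sigma) + np.log(1 - phi)
    score1 = log_gaussian(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)

print(predict(np.array([0.5, 0.5])), predict(np.array([2.5, 2.8])))  # 0 1
```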
GDA vs. Logistic regression
(Source: https://youtu.be/nt63k3bfXS0 39 min.)
Q. Why do we use 2 separate means and a single covariance matrix?
A. It is actually very reasonable to use two covariance matrices $\Sigma_0$ and $\Sigma_1$, and that model works okay; it is not an unreasonable algorithm to design. But it roughly doubles the number of parameters, and the resulting decision boundary is no longer linear.
In short, the shared covariance keeps the decision boundary linear.
Comparison to logistic regression
For fixed $\phi, \mu_0, \mu_1, \Sigma$, let's plot $p(y=1|x;\phi, \mu_0, \mu_1, \Sigma)$ as a function of $x$.
$$p(y=1|x; \phi, \mu_0, \mu_1, \Sigma)=\frac{p(x|y=1; \mu_0, \mu_1, \Sigma)\,p(y=1;\phi)}{p(x; \phi, \mu_0, \mu_1, \Sigma)}.$$
(Cf. $p(y=1;\phi)=\phi$: Bernoulli probability.)
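Substituting the two Gaussian densities into this ratio and simplifying (a standard derivation, sketched here rather than carried out in the lecture) shows the posterior takes exactly the logistic form:

$$p(y=1|x;\phi,\mu_0,\mu_1,\Sigma)=\frac{1}{1+\exp\!\left(-(\theta^T x+\theta_0)\right)},$$

where

$$\theta=\Sigma^{-1}(\mu_1-\mu_0),\qquad \theta_0=\frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\log\frac{\phi}{1-\phi}.$$

The quadratic terms $x^T\Sigma^{-1}x$ cancel because both classes share the same $\Sigma$, which is why the boundary is linear in $x$.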
Comparison
Comment: if you don't know whether your data is Gaussian or Poisson and you use logistic regression, you don't need to worry about it; it will work fine either way.
Q. Assumption when given small data vs. big data?
A. With a small dataset, skillfully deciding which distribution to assume can drive much greater performance than a lower-skilled choice would; with more data, the choice of assumption matters less.
Naive Bayes
ex. e-mail spam; assume a dictionary of $n=10000$ words:
a
aardvark
aardwolf
…
buy
…
zymurgy
$$x\in\{0,1\}^n$$
$$x_i=1\{\text{word } i \text{ appears in e-mail}\}$$
Want to model $p(x|y)$ and $p(y)$.
There are $2^{10000}$ possible values of $x$, so an unrestricted model of $p(x|y)$ is intractable.
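Building such a feature vector can be sketched as follows (the toy `vocab` stands in for the 10000-word dictionary; names are assumptions):

```python
# Assumed toy dictionary; a real one would have n = 10000 entries.
vocab = ["a", "aardvark", "buy", "drugs", "now"]

def featurize(email_text):
    """Map an e-mail to x in {0,1}^n with x_i = 1{word i appears}."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

x = featurize("Buy drugs now")
print(x)  # [0, 0, 1, 1, 1]
```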
Assume the $x_i$'s are conditionally independent given $y$.
$$\begin{aligned} p(x_1,\cdots,x_{10000}|y)&=p(x_1|y)\,p(x_2|y,x_1)\cdots p(x_{10000}|y,x_1,\cdots,x_{9999}) \\ &\stackrel{\text{assume}}{=} p(x_1|y)\,p(x_2|y)\cdots p(x_{10000}|y) \\ &= \prod_{i=1}^{10000} p(x_i|y) \end{aligned}$$
which is called the "conditional independence assumption," or the "Naive Bayes assumption." (The first line is the exact chain rule; the assumption enters in the second.)
Parameters:
$$\begin{aligned} \phi_{j|y=1} &= p(x_j=1|y=1) \\ \phi_{j|y=0} &= p(x_j=1|y=0) \\ \phi_y &= p(y=1). \end{aligned}$$
Joint likelihood:
$$\mathcal{L}(\phi_y, \phi_{j|y}) = \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \phi_y, \phi_{j|y}).$$
MLE:
$$\begin{aligned} \phi_y&=\frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\} \\ \phi_{j|y=1}&=\frac{\sum_{i=1}^m 1\{x_j^{(i)}=1,\, y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}. \end{aligned}$$
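These MLEs are just per-class word frequencies, which can be sketched in NumPy (a minimal sketch; `fit_naive_bayes` and the toy data are assumptions):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Naive Bayes MLE. X: m x n matrix of 0/1 features, y: 0/1 labels."""
    phi_y = np.mean(y == 1)             # fraction of spam examples
    # phi_{j|y=c}: fraction of class-c e-mails containing word j.
    phi_j_y1 = X[y == 1].mean(axis=0)
    phi_j_y0 = X[y == 0].mean(axis=0)
    return phi_y, phi_j_y0, phi_j_y1

# Toy data: 4 e-mails over a 3-word vocabulary; first two are spam.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])
phi_y, phi_j_y0, phi_j_y1 = fit_naive_bayes(X, y)
print(phi_y, phi_j_y0, phi_j_y1)
# phi_y = 0.5, phi_j_y0 = [0.0, 0.5, 0.5], phi_j_y1 = [1.0, 0.5, 1.0]
```

Note that $\phi_{j|y}$ comes out exactly 0 for a word never seen in a class (word 1 under $y=0$ above), which makes the product $\prod_i p(x_i|y)$ vanish; the raw MLE is used as-is here, matching the formulas above.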