Lecture 05. GDA & Naive Bayes

cryptnomy · November 23, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/nt63k3bfXS0

Outline

  • Gaussian Discriminant Analysis (GDA)
  • Generative vs. Discriminative comparison
  • Naive Bayes

Discriminative:

Learn $p(y|x)$ directly.

(Or learn $h_\theta(x) = \begin{cases} 0 \\ 1 \end{cases}$ directly.)

Generative:

Learn $p(x|y)$.

($x$: features, $y$: class)

$p(y)$: class prior

Bayes rule:

$$p(y=1|x)=\frac{p(x|y=1)\,p(y=1)}{p(x)}, \qquad p(x)=p(x|y=1)\,p(y=1)+p(x|y=0)\,p(y=0).$$
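As a quick numerical illustration (the values below are made up, not from the lecture): if $p(y=1)=0.5$, $p(x|y=1)=0.20$, and $p(x|y=0)=0.05$ for some fixed $x$, then

$$p(y=1|x)=\frac{0.20\cdot 0.5}{0.20\cdot 0.5+0.05\cdot 0.5}=\frac{0.100}{0.125}=0.8.$$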

Gaussian Discriminant Analysis (GDA)

Assume $x\in\mathbb{R}^n$ (i.e., drop the $x_0=1$ convention) and $p(x|y)$ is Gaussian.

$$z\sim\mathcal{N}(\mu,\Sigma), \qquad z,\mu\in\mathbb{R}^n, \quad \Sigma\in\mathbb{R}^{n\times n}.$$

Then

$$\begin{aligned} \mathbb{E}[z]&=\mu \\ \mathrm{Cov}(z)&=\mathbb{E}[(z-\mu)(z-\mu)^T] \\ &=\mathbb{E}[zz^T]-(\mathbb{E}z)(\mathbb{E}z)^T \end{aligned}$$

(cf. notation: $\mathbb{E}[z]=\mathbb{E}z$) and

$$p(z)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right).$$
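A minimal numpy sketch of this density formula (the function name and example numbers are assumptions for illustration, not from the lecture), cross-checked against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal  # used only as a cross-check

def gaussian_pdf(z, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at a point z (mu: (n,), Sigma: (n, n))."""
    n = mu.shape[0]
    diff = z - mu
    # Quadratic form (z - mu)^T Sigma^{-1} (z - mu), without forming Sigma^{-1} explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Illustrative (arbitrary) numbers.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
z = np.array([0.5, 0.5])
print(gaussian_pdf(z, mu, Sigma))                      # manual formula
print(multivariate_normal.pdf(z, mean=mu, cov=Sigma))  # should agree
```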

Below are useful figures describing some pdfs:




(Source: https://youtu.be/nt63k3bfXS0 12 min. ~ 16 min.)

GDA model

$$p(x|y=0)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right)$$

$$p(x|y=1)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)$$

$$p(y) = \phi^y(1-\phi)^{1-y}$$

Parameters: $\mu_0\in\mathbb{R}^n$, $\mu_1\in\mathbb{R}^n$, $\Sigma\in\mathbb{R}^{n\times n}$, $\phi\in\mathbb{R}$.

Training Set

$$\{x^{(i)}, y^{(i)}\}_{i=1}^m$$

Joint likelihood

$$\begin{aligned} \mathcal{L}(\phi,\mu_0,\mu_1,\Sigma) &= \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) \\ &= \prod_{i=1}^m p(x^{(i)}|y^{(i)})\,p(y^{(i)}). \end{aligned}$$

Discriminative:

Conditional likelihood

$$\mathcal{L}(\theta)=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta).$$

Maximum likelihood estimation:

$$\max_{\phi,\mu_0,\mu_1,\Sigma}\; \ell(\phi,\mu_0,\mu_1,\Sigma),$$

where $\ell = \log\mathcal{L}$ is the log-likelihood.

Then parameters are determined as

$$\begin{aligned} \phi&=\frac{1}{m}\sum_{i=1}^{m} y^{(i)}=\frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\} \\ \mu_0&=\frac{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}} \\ \mu_1&=\frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}} \\ \Sigma&=\frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T \end{aligned}$$

where indicator notation is defined as

$$1\{\text{true}\}=1 \quad\text{and}\quad 1\{\text{false}\}=0.$$
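A minimal numpy sketch of these closed-form estimates, assuming a design matrix `X` of shape $(m, n)$ and a 0/1 label vector `y` (the function name `fit_gda` is my own, not from the lecture):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with a shared covariance matrix.

    X: (m, n) array of features, y: (m,) array of 0/1 labels.
    Returns (phi, mu0, mu1, Sigma) matching the formulas above.
    """
    m = X.shape[0]
    phi = np.mean(y == 1)          # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)   # mean of the y = 0 examples
    mu1 = X[y == 1].mean(axis=0)   # mean of the y = 1 examples
    # Center each example by its own class mean, then average the outer products.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma
```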

Prediction:

$$\begin{aligned} \argmax_y p(y|x) &= \argmax_y \frac{p(x|y)\,p(y)}{p(x)} \\ &= \argmax_y p(x|y)\,p(y). \end{aligned}$$
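Prediction then just compares the two unnormalized posteriors $p(x|y)\,p(y)$. A short sketch continuing the hypothetical `fit_gda` example above (again, the names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(X, phi, mu0, mu1, Sigma):
    """Return argmax_y p(x|y) p(y) for each row of X (labels 0/1)."""
    p1 = multivariate_normal.pdf(X, mean=mu1, cov=Sigma) * phi        # p(x|y=1) p(y=1)
    p0 = multivariate_normal.pdf(X, mean=mu0, cov=Sigma) * (1 - phi)  # p(x|y=0) p(y=0)
    return (p1 > p0).astype(int)

# Usage (with parameters from the fit_gda sketch above):
# phi, mu0, mu1, Sigma = fit_gda(X_train, y_train)
# y_pred = gda_predict(X_test, phi, mu0, mu1, Sigma)
```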

GDA vs. Logistic regression

(Source: https://youtu.be/nt63k3bfXS0 39 min.)

Q. Why do we use 2 separate means and a single covariance matrix?

A. It is actually quite reasonable to use two separate covariance matrices $\Sigma_0$ and $\Sigma_1$, and that model works okay; it is not an unreasonable algorithm to design. But it roughly doubles the number of parameters, and the resulting decision boundary is no longer linear.

In short, sharing a single covariance matrix keeps the decision boundary linear.

Comparison to logistic regression

For fixed $\phi, \mu_0, \mu_1, \Sigma$, let's plot $p(y=1|x;\phi, \mu_0, \mu_1, \Sigma)$ as a function of $x$.

$$p(y=1|x; \phi, \mu_0, \mu_1, \Sigma)=\frac{p(x|y=1; \mu_0, \mu_1, \Sigma)\,p(y=1;\phi)}{p(x; \phi, \mu_0, \mu_1, \Sigma)}.$$

(Cf. $p(y=1;\phi)=\phi$: Bernoulli probability.)
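Working Bayes' rule through with the shared $\Sigma$ (a standard derivation sketched here, not transcribed from the lecture), this posterior collapses to exactly the logistic form:

$$p(y=1|x;\phi,\mu_0,\mu_1,\Sigma)
 = \frac{1}{1+\dfrac{p(x|y=0)\,(1-\phi)}{p(x|y=1)\,\phi}}
 = \frac{1}{1+\exp\!\left(-(\theta^T x+\theta_0)\right)},$$

with $\theta=\Sigma^{-1}(\mu_1-\mu_0)$ and $\theta_0=-\frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1-\mu_0^T\Sigma^{-1}\mu_0\right)+\log\frac{\phi}{1-\phi}$. So the GDA assumptions imply a logistic-shaped $p(y=1|x)$, while the converse does not hold: GDA makes strictly stronger assumptions than logistic regression.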

Comparison

Comment: if you don't know whether your data is Gaussian or Poisson, logistic regression will work fine either way, so you don't need to worry about it.

Q. Assumption when given small data vs. big data?

A. With a small dataset, skillfully deciding which distribution to assume can drive much greater performance than a less skilled choice would; with plenty of data, the choice of assumptions matters less.

Naive Bayes

ex. e-mail spam. Assume a dictionary of $n=10000$ words.

(Dictionary, in order: a, aardvark, aardwolf, …, buy, …, zymurgy.)

$$x\in\{0,1\}^n$$

$$x_i=1\{\text{word } i \text{ appears in the e-mail}\}$$

Want to model $p(x|y)$ and $p(y)$.

There are $2^{10000}$ possible values of $x$, so modeling $p(x|y)$ directly as a multinomial over all of them would require on the order of $2^{10000}-1$ parameters.

Assume the $x_i$'s are conditionally independent given $y$.

$$\begin{aligned} p(x_1,\dots,x_{10000}|y)&=p(x_1|y)\,p(x_2|y,x_1)\cdots p(x_{10000}|y,x_1,\dots,x_{9999}) \\ &\stackrel{\text{assume}}{=} p(x_1|y)\,p(x_2|y)\cdots p(x_{10000}|y) \\ &= \prod_{i=1}^{10000} p(x_i|y), \end{aligned}$$

which is called the "conditional independence assumption," or "the Naive Bayes assumption."

Parameters:

$$\begin{aligned} \phi_{j|y=1} &= p(x_j=1|y=1) \\ \phi_{j|y=0} &= p(x_j=1|y=0) \\ \phi_y &= p(y=1). \end{aligned}$$

Joint likelihood:

$$\mathcal{L}(\phi_y, \phi_{j|y}) = \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \phi_y, \phi_{j|y}).$$

MLE:

$$\begin{aligned} \phi_y&=\frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\} \\ \phi_{j|y=1}&=\frac{\sum_{i=1}^m 1\{x_j^{(i)}=1,\, y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}, \\ \phi_{j|y=0}&=\frac{\sum_{i=1}^m 1\{x_j^{(i)}=1,\, y^{(i)}=0\}}{\sum_{i=1}^m 1\{y^{(i)}=0\}}. \end{aligned}$$
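A minimal numpy sketch of this Naive Bayes estimator together with the $\argmax_y p(y)\prod_j p(x_j|y)$ prediction (function names and the `eps` guard are my own assumptions; the plain MLE above uses no smoothing):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Plain MLE for Bernoulli Naive Bayes, matching the formulas above.

    X: (m, n) binary matrix, X[i, j] = 1 if word j appears in e-mail i.
    y: (m,) array of 0/1 labels (1 = spam).
    """
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)  # p(x_j = 1 | y = 1), one entry per word
    phi_j_y0 = X[y == 0].mean(axis=0)  # p(x_j = 1 | y = 0)
    return phi_y, phi_j_y1, phi_j_y0

def predict_naive_bayes(X, phi_y, phi_j_y1, phi_j_y0):
    """argmax_y p(y) prod_j p(x_j | y), computed in log space for stability."""
    eps = 1e-12  # guard against log(0); Laplace smoothing is the usual proper fix
    log_p1 = np.log(phi_y + eps) \
        + X @ np.log(phi_j_y1 + eps) + (1 - X) @ np.log(1 - phi_j_y1 + eps)
    log_p0 = np.log(1 - phi_y + eps) \
        + X @ np.log(phi_j_y0 + eps) + (1 - X) @ np.log(1 - phi_j_y0 + eps)
    return (log_p1 > log_p0).astype(int)
```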
