Lecture video link: https://youtu.be/nt63k3bfXS0
Outline
Gaussian Discriminant Analysis (GDA)
Generative vs. Discriminative comparison
Naive Bayes
Discriminative:
Learn $p(y|x)$.
(Or learn $h_\theta(x)\in\{0,1\}$ directly.)
Generative:
Learn $p(x|y)$.
($x$: features, $y$: class)
$p(y)$: class prior
Bayes rule:
$$p(y=1|x)=\frac{p(x|y=1)\,p(y=1)}{p(x)},\qquad p(x)=p(x|y=1)\,p(y=1)+p(x|y=0)\,p(y=0).$$
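A quick numeric check of these two formulas (all probability values below are made up for illustration):

```python
# Assumed toy values: class prior p(y=1)=0.3 and class-conditional
# densities evaluated at some fixed x.
p_y1 = 0.3           # class prior p(y=1)
p_x_given_y1 = 0.50  # p(x|y=1) at a fixed x (assumed)
p_x_given_y0 = 0.10  # p(x|y=0) at the same x (assumed)

# Total probability: p(x) = p(x|y=1)p(y=1) + p(x|y=0)p(y=0)
p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * (1 - p_y1)

# Posterior via Bayes' rule
posterior = p_x_given_y1 * p_y1 / p_x
print(posterior)  # ~0.6818: x shifts the prior 0.3 strongly toward y=1
```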
Gaussian Discriminant Analysis (GDA)
Assume $x\in\mathbb{R}^n$ (i.e., drop the $x_0=1$ convention) and that $p(x|y)$ is Gaussian.
$$z\sim\mathcal{N}(\vec\mu,\Sigma),\quad z\in\mathbb{R}^n,\ \vec\mu\in\mathbb{R}^n,\ \Sigma\in\mathbb{R}^{n\times n}.$$
Then
$$\begin{aligned} \mathbb{E}[z]&=\mu \\ \operatorname{Cov}(z)&=\mathbb{E}[(z-\mu)(z-\mu)^T]=\mathbb{E}[zz^T]-(\mathbb{E}[z])(\mathbb{E}[z])^T \end{aligned}$$
(cf. notation: $\mathbb{E}[z]=\mathbb{E}z$) and
$$p(z)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right).$$
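The density formula can be evaluated directly in NumPy (a minimal sketch; `gaussian_pdf` is an assumed helper name, not lecture code):

```python
import numpy as np

def gaussian_pdf(z, mu, Sigma):
    """Evaluate the N(mu, Sigma) density at z, with z, mu in R^n."""
    n = mu.shape[0]
    diff = z - mu
    # Normalizing constant 1 / ((2*pi)^(n/2) |Sigma|^(1/2))
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    # Quadratic form (z-mu)^T Sigma^{-1} (z-mu) via a linear solve
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Sanity check against the 1-D standard normal at its mode: 1/sqrt(2*pi)
val = gaussian_pdf(np.zeros(1), np.zeros(1), np.eye(1))
print(val)  # ~0.3989
```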
(Figures of example pdfs omitted; see https://youtu.be/nt63k3bfXS0, 12 min. ~ 16 min.)
GDA model
$$p(x|y=0)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right)$$
$$p(x|y=1)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)$$
$$p(y)=\phi^y(1-\phi)^{1-y}$$
Parameters: $\mu_0\in\mathbb{R}^n$, $\mu_1\in\mathbb{R}^n$, $\Sigma\in\mathbb{R}^{n\times n}$, $\phi\in\mathbb{R}$.
Training Set
$$\{x^{(i)}, y^{(i)}\}_{i=1}^{m}$$
Joint likelihood
$$\begin{aligned} \mathcal{L}(\phi,\mu_0,\mu_1,\Sigma) &= \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) \\ &= \prod_{i=1}^m p(x^{(i)}|y^{(i)})\,p(y^{(i)}). \end{aligned}$$
Discriminative:
Conditional likelihood
$$\mathcal{L}(\theta)=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta).$$
Maximum likelihood estimation:
$$\max_{\phi,\mu_0,\mu_1,\Sigma}\; \ell(\phi,\mu_0,\mu_1,\Sigma),$$
where $\ell=\log\mathcal{L}$ is the log-likelihood.
Then parameters are determined as
$$\begin{aligned} \phi&=\frac{1}{m}\sum_{i=1}^{m} y^{(i)}=\frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)}=1\} \\ \mu_0&=\frac{\sum_{i=1}^{m} 1\{y^{(i)}=0\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=0\}} \\ \mu_1&=\frac{\sum_{i=1}^{m} 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)}=1\}} \\ \Sigma&=\frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T \end{aligned}$$
where indicator notation is defined as
$$1\{\text{true}\}=1 \quad\text{and}\quad 1\{\text{false}\}=0.$$
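The closed-form estimates above can be sketched in NumPy (a minimal sketch; `fit_gda` and the toy data are assumptions, not lecture code):

```python
import numpy as np

def fit_gda(X, y):
    """GDA MLE. X: m x n design matrix, y: length-m array of 0/1 labels."""
    m = X.shape[0]
    phi = np.mean(y == 1)              # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)       # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)       # mean of class-1 examples
    # Shared covariance: each example's residual is taken around the
    # mean of its own class (mu_{y^{(i)}} in the formula above).
    mu_y = np.where((y == 1)[:, None], mu1, mu0)
    diff = X - mu_y
    Sigma = diff.T @ diff / m
    return phi, mu0, mu1, Sigma

# Tiny synthetic dataset with two well-separated classes.
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)
print(phi, mu0, mu1)  # phi = 0.5, mu0 ~ [0.1, 0.0], mu1 ~ [3.1, 3.0]
```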
Prediction:
$$\begin{aligned} \arg\max_y\, p(y|x) &= \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} \\ &= \arg\max_y\, p(x|y)\,p(y). \end{aligned}$$
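A minimal sketch of this decision rule, computed in log space so the product $p(x|y)\,p(y)$ becomes a sum and small densities do not underflow (the parameter values below are made up for illustration):

```python
import numpy as np

# Assumed fitted parameters, chosen by hand for this example.
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
phi = 0.5

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for x, mu in R^n."""
    n = mu.shape[0]
    diff = x - mu
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.solve(Sigma, diff))

def predict(x):
    # arg max over y of log p(x|y) + log p(y)
    score0 = log_gaussian(x, mu0, Sigma) + np.log(1 - phi)
    score1 = log_gaussian(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)

print(predict(np.array([0.5, 0.5])), predict(np.array([2.5, 2.8])))  # 0 1
```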
GDA vs. Logistic regression
(Source: https://youtu.be/nt63k3bfXS0 39 min.)
Q. Why do we use 2 separate means and a single covariance matrix?
A. It is actually very reasonable to use two covariance matrices $\Sigma_0$ and $\Sigma_1$, and that model works okay; it is not an unreasonable algorithm to design. But it roughly doubles the number of parameters, and the resulting decision boundary is no longer linear.
In short, the shared covariance keeps the decision boundary linear.
Comparison to logistic regression
For fixed $\phi, \mu_0, \mu_1, \Sigma$, let's plot $p(y=1|x;\phi, \mu_0, \mu_1, \Sigma)$ as a function of $x$.
$$p(y=1|x; \phi, \mu_0, \mu_1, \Sigma)=\frac{p(x|y=1; \mu_0, \mu_1, \Sigma)\,p(y=1;\phi)}{p(x; \phi, \mu_0, \mu_1, \Sigma)}.$$
(Cf. $p(y=1;\phi)=\phi$: Bernoulli probability.)
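Substituting the two Gaussian densities into this ratio and simplifying (a standard derivation, sketched here rather than carried out in the lecture) shows the posterior takes exactly the logistic form:

$$p(y=1|x;\phi,\mu_0,\mu_1,\Sigma)=\frac{1}{1+\exp\!\left(-(\theta^T x+\theta_0)\right)},$$

where

$$\theta=\Sigma^{-1}(\mu_1-\mu_0),\qquad \theta_0=\frac{1}{2}\mu_0^T\Sigma^{-1}\mu_0-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\log\frac{\phi}{1-\phi}.$$

The quadratic terms $x^T\Sigma^{-1}x$ cancel because both classes share the same $\Sigma$, which is why the boundary is linear in $x$.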
Comparison
Comment: if you don't know whether your data is Gaussian or Poisson and you use logistic regression, you don't need to worry about it; it will work fine either way.
Q. Assumption when given small data vs. big data?
A. With a small dataset, skillfully deciding which distribution to assume can drive much greater performance than a lower-skilled choice would; with more data, the choice of assumption matters less.
Naive Bayes
ex. e-mail spam; assume a dictionary of $n=10000$ words:
a
aardvark
aardwolf
…
buy
…
zymurgy
$$x\in\{0,1\}^n$$
$$x_i=1\{\text{word } i \text{ appears in e-mail}\}$$
Want to model $p(x|y)$ and $p(y)$.
There are $2^{10000}$ possible values of $x$, so an unrestricted model of $p(x|y)$ is intractable.
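Building such a feature vector can be sketched as follows (the toy `vocab` stands in for the 10000-word dictionary; names are assumptions):

```python
# Assumed toy dictionary; a real one would have n = 10000 entries.
vocab = ["a", "aardvark", "buy", "drugs", "now"]

def featurize(email_text):
    """Map an e-mail to x in {0,1}^n with x_i = 1{word i appears}."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

x = featurize("Buy drugs now")
print(x)  # [0, 0, 1, 1, 1]
```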
Assume the $x_i$'s are conditionally independent given $y$.
$$\begin{aligned} p(x_1,\cdots,x_{10000}|y)&=p(x_1|y)\,p(x_2|y,x_1)\cdots p(x_{10000}|y,x_1,\cdots,x_{9999}) \\ &\stackrel{\text{assume}}{=} p(x_1|y)\,p(x_2|y)\cdots p(x_{10000}|y) \\ &= \prod_{i=1}^{10000} p(x_i|y) \end{aligned}$$
which is called the "conditional independence assumption," or the "Naive Bayes assumption." (The first line is the exact chain rule; the assumption enters in the second.)
Parameters:
$$\begin{aligned} \phi_{j|y=1} &= p(x_j=1|y=1) \\ \phi_{j|y=0} &= p(x_j=1|y=0) \\ \phi_y &= p(y=1). \end{aligned}$$
Joint likelihood:
$$\mathcal{L}(\phi_y, \phi_{j|y}) = \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \phi_y, \phi_{j|y}).$$
MLE:
$$\begin{aligned} \phi_y&=\frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\} \\ \phi_{j|y=1}&=\frac{\sum_{i=1}^m 1\{x_j^{(i)}=1,\, y^{(i)}=1\}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}. \end{aligned}$$
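These MLEs are just per-class word frequencies, which can be sketched in NumPy (a minimal sketch; `fit_naive_bayes` and the toy data are assumptions):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Naive Bayes MLE. X: m x n matrix of 0/1 features, y: 0/1 labels."""
    phi_y = np.mean(y == 1)             # fraction of spam examples
    # phi_{j|y=c}: fraction of class-c e-mails containing word j.
    phi_j_y1 = X[y == 1].mean(axis=0)
    phi_j_y0 = X[y == 0].mean(axis=0)
    return phi_y, phi_j_y0, phi_j_y1

# Toy data: 4 e-mails over a 3-word vocabulary; first two are spam.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])
phi_y, phi_j_y0, phi_j_y1 = fit_naive_bayes(X, y)
print(phi_y, phi_j_y0, phi_j_y1)
# phi_y = 0.5, phi_j_y0 = [0.0, 0.5, 0.5], phi_j_y1 = [1.0, 0.5, 1.0]
```

Note that $\phi_{j|y}$ comes out exactly 0 for a word never seen in a class (word 1 under $y=0$ above), which makes the product $\prod_i p(x_i|y)$ vanish; the raw MLE is used as-is here, matching the formulas above.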