Lecture 14. Expectation-Maximization Algorithms

cryptnomy · Nov 25, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/rVfZHWTwXSA

Outline

  • Unsupervised learning
  • K-means clustering
  • Mixture of Gaussians
  • EM (Expectation-Maximization) algorithm
  • Derivation of EM

What happens during K-means clustering:

(Source: https://youtu.be/rVfZHWTwXSA?t=2m40s)

K-means Clustering

Data: \{x^{(1)},\cdots,x^{(m)}\}

  1. Initialize cluster centroids \mu_1,\cdots,\mu_k\in\mathbb{R}^d randomly.
  2. Repeat until convergence:
    1. Set c^{(i)}:=\argmin\limits_j||x^{(i)}-\mu_j||

      ("color the points").

    2. For j=1,\cdots,k,

      \mu_j:=\frac{\sum\limits_{i=1}^m1\{c^{(i)}=j\}x^{(i)}}{\sum\limits_{i=1}^m1\{c^{(i)}=j\}}

      ("move the cluster centroids").

Both steps can only decrease the distortion function

J(c,\mu)=\sum_{i=1}^m||x^{(i)}-\mu_{c^{(i)}}||^2,

where c collects the cluster assignments and \mu the centroids.
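As a concrete reference, here is a minimal NumPy sketch of the loop above (the function name `kmeans`, the init-from-random-points strategy, and the convergence test are my own choices for illustration, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X has shape (m, d); returns (c, mu)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # init centroids at random points
    for _ in range(n_iters):
        # Step 2.1 -- "color the points": assign each x^(i) to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, k)
        c = dists.argmin(axis=1)
        # Step 2.2 -- "move the cluster centroids": mean of the points in each cluster.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # converged: J(c, mu) can no longer decrease
            break
        mu = new_mu
    return c, mu
```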

Density estimation

For aircraft engines:

(Source: https://youtu.be/rVfZHWTwXSA?t=17m42s)

Anomaly detection

→ Model p(x).

p(x)<\epsilon\Rightarrow\text{anomaly}.
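A minimal sketch of that recipe with a single multivariate Gaussian as the density model (the function names and the value of `eps` are illustrative assumptions, not from the lecture):

```python
import numpy as np

def fit_gaussian(X):
    """MLE for a single Gaussian density model p(x); X has shape (m, d)."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)
    return mu, Sigma

def is_anomaly(x, mu, Sigma, eps=1e-3):
    """Flag x as an anomaly when p(x) < eps (computed via the log density)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    logp = -0.5 * (diff @ np.linalg.solve(Sigma, diff) + d * np.log(2 * np.pi) + logdet)
    return logp < np.log(eps)
```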

Mixture of Gaussians model

Problem: We want to fit a model very similar to GDA, but in this density estimation problem we don't know which example actually came from which Gaussian.

→ EM algorithm comes in.

Suppose there's a latent (hidden/unobserved) random variable z, and x^{(i)},z^{(i)} are distributed

p(x^{(i)},z^{(i)})=p(x^{(i)}|z^{(i)})p(z^{(i)})

where z^{(i)}\sim\text{Multinomial}(\phi) and

x^{(i)}|z^{(i)}=j\sim\mathcal{N}(\mu_j,\Sigma_j).

If we knew the z^{(i)}'s, we could use MLE:

\begin{aligned}l(\phi,\mu,\Sigma)&=\sum_{i=1}^m\log p(x^{(i)},z^{(i)};\phi,\mu,\Sigma)\\ \phi_j&=\frac{1}{m}\sum_{i=1}^m1\{z^{(i)}=j\}\\ \mu_j&=\frac{\sum\limits_{i=1}^m1\{z^{(i)}=j\}x^{(i)}}{\sum\limits_{i=1}^m1\{z^{(i)}=j\}}\\ \Sigma_j&=\frac{\sum\limits_{i=1}^m1\{z^{(i)}=j\}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum\limits_{i=1}^m1\{z^{(i)}=j\}}.\end{aligned}
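These closed-form updates map directly to code. A small sketch, assuming the labels are given as an integer array z in {0, ..., k-1} (function name mine):

```python
import numpy as np

def mle_with_labels(X, z, k):
    """Closed-form MLE for (phi, mu, Sigma) when the z^(i)'s are observed."""
    phi = np.array([(z == j).mean() for j in range(k)])          # class priors
    mu = np.array([X[z == j].mean(axis=0) for j in range(k)])    # per-class means
    Sigma = np.array([(X[z == j] - mu[j]).T @ (X[z == j] - mu[j]) / (z == j).sum()
                      for j in range(k)])                        # per-class covariances
    return phi, mu, Sigma
```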

EM (expectation-maximization)

E-step (guess the values of the z^{(i)}'s):

Set

\begin{aligned}w^{(i)}_j&=p\left(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma\right)\\&=\frac{p(x^{(i)}|z^{(i)}=j)p(z^{(i)}=j)}{\sum\limits_{l=1}^kp(x^{(i)}|z^{(i)}=l)p(z^{(i)}=l)}\end{aligned}

where

p(x^{(i)}|z^{(i)}=j)=\frac{1}{(2\pi)^{d/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\\p(z^{(i)}=j)=\phi_j\;\;\;\;\;\left(z^{(i)}\sim\text{Multinomial}(\phi)\right)

for every i,j.

M-step:

\begin{aligned}\phi_j&\leftarrow\frac{1}{m}\sum_{i=1}^mw_j^{(i)}\\ \mu_j&\leftarrow\frac{\sum\limits_{i=1}^mw_j^{(i)}x^{(i)}}{\sum\limits_{i=1}^mw_j^{(i)}}\\ \Sigma_j&\leftarrow\frac{\sum\limits_{i=1}^mw_j^{(i)}(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum\limits_{i=1}^mw_j^{(i)}}. \end{aligned}

w_j^{(i)} measures how strongly x^{(i)} is assigned to the j^{th} Gaussian (a soft assignment, unlike the hard assignments c^{(i)} in k-means).
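Putting the two steps together, a minimal sketch of one EM iteration for the mixture of Gaussians (I use scipy's `multivariate_normal` for the Gaussian density; the function name `em_step` is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    """One EM iteration for a k-component mixture of Gaussians."""
    k = len(phi)
    # E-step: w[i, j] = p(z^(i) = j | x^(i)) by Bayes' rule.
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: the fully-observed MLE formulas with indicators replaced by weights.
    phi = w.mean(axis=0)
    mu = np.array([w[:, j] @ X / w[:, j].sum() for j in range(k)])
    Sigma = np.array([(w[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / w[:, j].sum()
                      for j in range(k)])
    return w, phi, mu, Sigma
```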

Jensen’s inequality

Let f be a convex function on \mathbb R (i.e., f''(x)\ge0).

Let X be a random variable. Then

f(\mathbb EX)\le\mathbb E[f(X)].

(Source: https://people.duke.edu/~ccc14/cspy/14_ExpectationMaximization.html)

Further, if f''(x)>0 (f is strictly convex), then

\mathbb E[f(X)]=f(\mathbb EX)\Longleftrightarrow X\;\text{is a constant}.
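A quick sanity check, with the strictly convex f(x)=x^2 and a non-constant X taking values 0 and 2 with probability 1/2 each (my example, not from the lecture):

f(\mathbb EX)=f(1)=1\;<\;\tfrac12\cdot0^2+\tfrac12\cdot2^2=2=\mathbb E[f(X)],

and the inequality is strict exactly because X is not a constant.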

Have a model for p(x,z;\theta).

Only observe x.

\begin{aligned}l(\theta)&=\sum_{i=1}^m\log p(x^{(i)};\theta)\\&=\sum_{i=1}^m\log\sum_{z^{(i)}}p(x^{(i)},z^{(i)};\theta)\end{aligned}

Want: \argmax\limits_\theta l(\theta).

(Source: https://youtu.be/rVfZHWTwXSA?t=1h3m17s)

E-step: At the j^{th} iteration, construct a lower bound on l(\theta) that is tight at the current \theta (the green curve in the figure).

M-step: Find the maximum of the green curve and update \theta to that maximizer.

The EM algorithm is only guaranteed to converge to a local optimum.

\begin{aligned} &\max_\theta\sum_i\log p(x^{(i)};\theta)\\ &=\sum_i\log\sum_{z^{(i)}}p(x^{(i)},z^{(i)};\theta)\\ &=\sum_i\log\sum_{z^{(i)}}Q_i(z^{(i)})\left[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]\\ &=\sum_i\log\mathbb E_{z^{(i)}\sim Q_i}\left[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]\\ &\ge\sum_i\mathbb E_{z^{(i)}\sim Q_i}\left[\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]\\ &=\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} \end{aligned}

where Q_i(z^{(i)}) is a probability distribution, i.e., \sum\limits_{z^{(i)}}Q_i(z^{(i)})=1. The inequality is Jensen's inequality applied to the concave function \log, which reverses its direction: \log\mathbb E[\cdot]\ge\mathbb E[\log(\cdot)].

On a given iteration of EM (with current parameter \theta), we want the inequality above to hold with equality:

\log\mathbb E_{z^{(i)}\sim Q_i}\left[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]=\mathbb E_{z^{(i)}\sim Q_i}\left[\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right].

By the equality condition of Jensen's inequality, this requires the ratio inside the expectation to be a constant (as a function of z^{(i)}):

\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}=\text{constant}.

Set Q_i(z^{(i)})\propto p(x^{(i)},z^{(i)};\theta). Since \sum\limits_{z^{(i)}}Q_i(z^{(i)})=1,

\begin{aligned} Q_i(z^{(i)})&=\frac{p(x^{(i)},z^{(i)};\theta)}{\sum\limits_{z^{(i)}}p(x^{(i)},z^{(i)};\theta)}\\ &=p(z^{(i)}|x^{(i)};\theta). \end{aligned}
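A tiny numeric check of this tightness claim, for a single example x with z ranging over three values (the joint probabilities are made-up toy numbers): choosing Q_i to be the posterior makes the lower bound equal \log p(x^{(i)};\theta) exactly.

```python
import numpy as np

p_xz = np.array([0.1, 0.3, 0.2])     # toy joint p(x, z) for one fixed x, z in {0, 1, 2}
log_px = np.log(p_xz.sum())          # log p(x) = log sum_z p(x, z)
Q = p_xz / p_xz.sum()                # Q(z) = p(z | x), the posterior
elbo = (Q * np.log(p_xz / Q)).sum()  # E_{z~Q}[log p(x, z) / Q(z)]
assert np.isclose(elbo, log_px)      # the bound is tight when Q is the posterior
```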

Summary

E-step:

Set

Q_i(z^{(i)}):=p(z^{(i)}|x^{(i)};\theta).

M-step:

\theta:=\argmax_\theta\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}.
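For completeness, a hedged sketch of the full loop, reusing `em_step` from the mixture-of-Gaussians section above; the stopping rule exploits the fact that EM never decreases l(\theta). Helper names are mine:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, phi, mu, Sigma):
    """l(theta) = sum_i log sum_j phi_j N(x^(i); mu_j, Sigma_j)."""
    dens = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                            for j in range(len(phi))])
    return np.log(dens.sum(axis=1)).sum()

def run_em(X, phi, mu, Sigma, n_iters=100, tol=1e-6):
    """Alternate E- and M-steps until l(theta) stops improving (local optimum)."""
    prev = -np.inf
    for _ in range(n_iters):
        _, phi, mu, Sigma = em_step(X, phi, mu, Sigma)  # from the sketch above
        cur = log_likelihood(X, phi, mu, Sigma)
        if cur - prev < tol:
            break
        prev = cur
    return phi, mu, Sigma
```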
