Lecture 15. EM Algorithm & Factor Analysis

cryptnomy · Nov 25, 2022

CS229: Machine Learning


Outline

  • EM convergence
  • Gaussian proportions
  • Factor analysis
  • Gaussian marginals vs. conditionals
  • EM steps

Recap

E-step:

$$Q_i(z^{(i)}):=p(z^{(i)}|x^{(i)};\theta).$$

M-step:

$$\theta:=\argmax_\theta\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}.$$

$$p(x^{(i)},z^{(i)})=p(x^{(i)}|z^{(i)})p(z^{(i)})$$

$$z^{(i)}\sim\text{Multinomial}(\phi)\qquad[p(z^{(i)}=j)=\phi_j]$$

$$x^{(i)}|z^{(i)}=j\sim\mathcal N(\mu_j,\Sigma_j)$$

E-step:

$$w_j^{(i)}=Q_i(z^{(i)}=j)=p(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma).$$

M-step:

$$\begin{aligned}&\max_{\phi,\mu,\Sigma}\;\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\phi,\mu,\Sigma)}{Q_i(z^{(i)})}\\&=\max_{\phi,\mu,\Sigma}\sum_i\sum_jw^{(i)}_j\log\frac{\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w_j^{(i)}}\\&=:\max f\end{aligned}$$
$$\begin{aligned}&\nabla_{\mu_j}f\stackrel{\text{set}}{=}0\\&\Longrightarrow\mu_j=\frac{\sum_iw_j^{(i)}x^{(i)}}{\sum_iw_j^{(i)}}.\end{aligned}$$

$w_j^{(i)}$ … the “strength” with which $x^{(i)}$ is assigned to Gaussian $j$:

$$p(z^{(i)}=j|x^{(i)};\cdots)$$
$$\begin{aligned}&\nabla_{\phi_j}f\stackrel{\text{set}}{=}0\\&\Longrightarrow\phi_j=\frac{\sum_iw_j^{(i)}}{\sum_i\sum_lw_l^{(i)}}=\frac{1}{m}\sum_iw_j^{(i)}.\end{aligned}$$
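The updates above, together with the standard update for $\Sigma_j$ (which is not derived in these notes), fit into one EM iteration. A minimal numpy/scipy sketch; the function name and array shapes are my own choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm_step(X, phi, mu, Sigma):
    """One EM iteration for a mixture of Gaussians.
    X: (m, n) data, phi: (k,), mu: (k, n), Sigma: (k, n, n)."""
    m, n = X.shape
    k = len(phi)

    # E-step: w[i, j] = Q_i(z^(i) = j) = p(z^(i) = j | x^(i); phi, mu, Sigma)
    w = np.zeros((m, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: closed-form maximizers of the lower bound f
    phi_new = w.mean(axis=0)                      # phi_j = (1/m) sum_i w_j^(i)
    mu_new = (w.T @ X) / w.sum(axis=0)[:, None]   # weighted means from the gradient above
    Sigma_new = np.zeros_like(Sigma)
    for j in range(k):
        d = X - mu_new[j]
        # standard weighted-covariance update (not derived in these notes)
        Sigma_new[j] = (w[:, j, None] * d).T @ d / w[:, j].sum()
    return phi_new, mu_new, Sigma_new
```

Iterating this step until the log-likelihood stops improving is the whole algorithm; each iteration never decreases the log-likelihood.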

Define

$$J(\theta,Q)=\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}.$$

We know

$$l(\theta)\ge J(\theta,Q)$$

for any $\theta,Q$.

E-step:

Maximize JJ w.r.t. QQ.

M-step:

Maximize JJ w.r.t. θ\theta.
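A quick numeric sanity check of the bound on a toy 1-D, two-component mixture with made-up parameters: $J(\theta,Q)$ never exceeds $l(\theta)$, and choosing $Q_i$ to be the posterior (the E-step) makes the bound tight.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D mixture of two Gaussians (hypothetical parameters and data).
phi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])
x = np.array([-1.8, 0.5, 2.9, 3.4])

# p(x_i, z_i = j; theta) for every i, j
joint = phi * norm.pdf(x[:, None], mu, sigma)          # shape (m, k)
log_lik = np.log(joint.sum(axis=1)).sum()              # l(theta)

def J(Q):
    """Lower bound J(theta, Q) = sum_i sum_j Q_ij * log(joint_ij / Q_ij)."""
    return np.sum(Q * (np.log(joint) - np.log(Q)))

Q_posterior = joint / joint.sum(axis=1, keepdims=True)  # the E-step choice
Q_uniform = np.full_like(joint, 0.5)                     # some other valid Q

print(log_lik, J(Q_posterior))   # equal: the bound is tight at the posterior
print(J(Q_uniform) <= log_lik)   # True: l(theta) >= J(theta, Q) for any Q
```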

Mixture of Gaussians

Say $n=2,\ m=100$, i.e. $m\gg n$: this is the setting where the Gaussian models above work fine. What if instead $n\approx m$ or $n\gg m$?

Model as single Gaussian:

$$X\sim\mathcal N(\mu,\Sigma)$$

MLE:

$$\begin{aligned}\mu&=\frac{1}{m}\sum_ix^{(i)}\\\Sigma&=\frac{1}{m}\sum_i(x^{(i)}-\mu)(x^{(i)}-\mu)^T\end{aligned}$$

If $m\le n$, then $\Sigma$ will be singular / non-invertible.

Option 1. Constrain $\Sigma\in\mathbb R^{n\times n}$ to be diagonal.

$$\Sigma=\begin{bmatrix}\sigma_1^2&&&\\&\sigma_2^2&&\\&&\ddots&\\&&&\sigma_n^2\end{bmatrix}$$

MLE:

$$\sigma_j^2=\frac{1}{m}\sum_i(x_j^{(i)}-\mu_j)^2.$$

Q. Problem with this assumption?

A. The model assumes that all of your features are uncorrelated. E.g., if you have temperature sensors in a room, it’s not a good assumption that the temperatures at different points of the room are completely uncorrelated.

Option 2. (Stronger assumption)

Constrain $\Sigma$ to be $\Sigma=\sigma^2I$.

$$\Sigma=\begin{bmatrix}\sigma^2&&&\\&\sigma^2&&\\&&\ddots&\\&&&\sigma^2\end{bmatrix}$$

MLE:

$$\sigma^2=\frac{1}{mn}\sum_i\sum_j(x_j^{(i)}-\mu_j)^2.$$
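A small numpy illustration of the three estimators with $m=30$ examples in $n=100$ dimensions; the data here is just standard-normal noise, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 100                       # fewer examples than dimensions
X = rng.normal(size=(m, n))

mu = X.mean(axis=0)
D = X - mu

# Unconstrained MLE: rank <= m < n, so it is singular.
Sigma_full = D.T @ D / m
print(np.linalg.matrix_rank(Sigma_full))        # at most m (< n)

# Option 1: diagonal MLE, sigma_j^2 = (1/m) sum_i (x_j^(i) - mu_j)^2
Sigma_diag = np.diag((D ** 2).mean(axis=0))

# Option 2: spherical MLE, sigma^2 = (1/(mn)) sum_{i,j} (x_j^(i) - mu_j)^2
Sigma_sph = (D ** 2).mean() * np.eye(n)

# Both constrained estimates are well-conditioned (invertible) here.
print(np.linalg.cond(Sigma_diag) < 1e12, np.linalg.cond(Sigma_sph) < 1e12)
```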

What does factor analysis do?

→ It captures some of the correlations between features without running into the non-invertible covariance matrices that the naive Gaussian model does, even with 100-dimensional data and 30 examples.

Q. ?

A. A common thing to do is to apply a Wishart prior … add a small diagonal value to the MLE: $\Sigma+\epsilon I$.

This technically takes away the non-invertible matrix problem, but it’s not the best model for a lot of datasets.
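A minimal sketch of that fix on the same kind of synthetic data, with an arbitrary choice of $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 100
X = rng.normal(size=(m, n))
D = X - X.mean(axis=0)
Sigma = D.T @ D / m                      # singular MLE (rank < n)

eps = 1e-3                               # epsilon is an arbitrary small value
Sigma_reg = Sigma + eps * np.eye(n)
np.linalg.cholesky(Sigma_reg)            # succeeds: Sigma + eps*I is positive definite
print(np.linalg.slogdet(Sigma_reg)[0])   # 1.0, i.e. det > 0, so it is invertible
```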

Q. Why use option 2, which is even worse than option 1?

A. To develop a description for factor analysis.

Andrew Ng comment:

These days large tech companies work on similar problems.

One of the really overlooked parts of the machine learning world is small-data problems. A lot of practical applications of machine learning, including class projects, are small-data problems. They feel like a blind spot, or a gap, in a lot of the work done in the AI world today.

Q. Why don’t we use the same algorithms with big data?

A. Andrew Ng thinks that in the machine learning world we are not very good at understanding scaling; we don’t actually have a good understanding of how to modify our algorithms.

(Facebook recently (2020) published a paper that handles 3.5 billion images, and the result was cool.)

Framework:

$$p(x,z)=p(x|z)p(z)$$

$z$ is hidden (latent).

E.g., for $d=3,\ m=100,\ n=30$,

$$z\sim\mathcal N(0,I),\quad z\in\mathbb R^d\;(d<n)$$

$$x=\mu+\Lambda z+\epsilon$$

where $\epsilon\sim\mathcal N(0,\Psi)$.

Parameters:

$$\mu\in\mathbb R^n,\quad\Lambda\in\mathbb R^{n\times d},\quad\Psi\in\mathbb R^{n\times n}\text{ diagonal}$$

Equivalently,

$$x|z\sim\mathcal N(\mu+\Lambda z,\Psi)$$

Diagonal $\Psi$ (the covariance of the noise $\epsilon$) means the noise in each sensor is independent of the noise in every other sensor.

Example 1.

$$z\in\mathbb R,\ x\in\mathbb R^2,\ d=1,\ n=2,\ m=7$$

and $z\sim\mathcal N(0,1)$.

(Source: https://youtu.be/tw6cmL5STuY?t=49m42s)

Say

$$\Lambda=\begin{bmatrix}2\\1\end{bmatrix},\quad\mu=\begin{bmatrix}0\\0\end{bmatrix},\quad\Lambda z+\mu\in\mathbb R^2,\quad\Psi=\begin{bmatrix}1&0\\0&2\end{bmatrix}.$$

The red crosses here are a typical sample drawn from this model.
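A minimal numpy sampler for this example (variable names are mine): it draws $m=7$ points with the $\Lambda,\mu,\Psi$ above, so the latent part $\Lambda z+\mu$ lies on a line and $\epsilon$ scatters the observed points around it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 1 parameters: d = 1, n = 2, m = 7
Lam = np.array([[2.0], [1.0]])          # Lambda, shape (n, d)
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])
m = 7

z = rng.normal(size=(m, 1))                              # z ~ N(0, I_d)
eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)  # eps ~ N(0, Psi)
x = mu + z @ Lam.T + eps                                 # x = mu + Lambda z + eps

# The points z @ Lam.T lie exactly on the line spanned by Lambda;
# the noise eps scatters the observed x around that line.
print(x)
```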

Example 2.

$$z\in\mathbb R^2,\ x\in\mathbb R^3,\ d=2,\ n=3,\ m=5$$

(Source: https://youtu.be/tw6cmL5STuY?t=52m16s; cool animation 😀)

Compute $\Lambda z+\mu$.

Factor analysis can take very high dimensional data, e.g., 100 dimensional data.

Multivariate Gaussian

$$x=\begin{bmatrix}x_1\\x_2\end{bmatrix}$$

where

$$x_1\in\mathbb R^r,\quad x_2\in\mathbb R^s,\quad x\in\mathbb R^{r+s}.$$

$$x\sim\mathcal N(\mu,\Sigma)$$

where

$$\mu=\begin{bmatrix}\mu_1\\\mu_2\end{bmatrix},\quad\Sigma=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{bmatrix}.$$

Marginal: $p(x_1)=?$

$$p(x)=p(x_1,x_2)$$

$$\int_{x_2}p(x_1,x_2)dx_2=p(x_1)$$

$$p(x_1,x_2)=\frac{1}{(2\pi)^{(r+s)/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}\begin{pmatrix}x_1-\mu_1\\x_2-\mu_2\end{pmatrix}^T\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{pmatrix}^{-1}\begin{pmatrix}x_1-\mu_1\\x_2-\mu_2\end{pmatrix}\right)$$

Carrying out the integral gives $x_1\sim\mathcal N(\mu_1,\Sigma_{11})$.

Conditional: $p(x_1|x_2)=\ ?$

$$x_1|x_2\sim\mathcal N(\mu_{1|2},\Sigma_{1|2})$$

$$\mu_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)$$

$$\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
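A quick Monte Carlo check of these conditioning formulas with made-up numbers ($r=s=1$): condition on $x_2$ landing near a fixed value and compare the empirical mean/variance of $x_1$ with $\mu_{1|2}$ and $\Sigma_{1|2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small joint Gaussian over (x1, x2) with r = 1, s = 1 (hypothetical numbers).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Formulas from above, conditioning on x2 = 0.
x2_star = 0.0
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_star - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Monte Carlo: keep samples whose x2 lands near x2_star.
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 1] - x2_star) < 0.05]
print(mu_cond, near[:, 0].mean())     # should be close (Monte Carlo error)
print(var_cond, near[:, 0].var())     # should be close
```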

Derive $p(x,z)$.

$$\begin{pmatrix}z\\x\end{pmatrix}\sim\mathcal N(\mu_{x,z},\Sigma)$$

$$z\sim\mathcal N(0,I),\quad x=\mu+\Lambda z+\epsilon$$

$$\mathbb Ez=0,\quad\mathbb Ex=\mathbb E[\mu+\Lambda z+\epsilon]=\mu$$

$$\mu_{x,z}=\begin{bmatrix}0\\\mu\end{bmatrix}\begin{matrix}\updownarrow d\text{-dim}\\\updownarrow n\text{-dim}\end{matrix}$$

$$\begin{aligned}\Sigma&=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{bmatrix}\\&=\begin{bmatrix}\mathbb E(z-\mathbb Ez)(z-\mathbb Ez)^T&\mathbb E(z-\mathbb Ez)(x-\mathbb Ex)^T\\\mathbb E(x-\mathbb Ex)(z-\mathbb Ez)^T&\mathbb E(x-\mathbb Ex)(x-\mathbb Ex)^T\end{bmatrix}\end{aligned}$$

where the first block row/column is $d$-dimensional and the second is $n$-dimensional.

E.g.,

$$\begin{aligned}\Sigma_{22}&=\mathbb E(x-\mathbb Ex)(x-\mathbb Ex)^T\\&=\mathbb E[(\Lambda z+\mu+\epsilon-\mu)(\Lambda z+\mu+\epsilon-\mu)^T]\\&=\mathbb E[\Lambda zz^T\Lambda^T+\Lambda z\epsilon^T+\epsilon z^T\Lambda^T+\epsilon\epsilon^T]\\&=\mathbb E[\Lambda zz^T\Lambda^T]+\mathbb E[\epsilon\epsilon^T]\quad(z\perp\epsilon\text{ and }\mathbb E[\epsilon]=0\text{, so the cross terms vanish})\\&=\Lambda\mathbb E[zz^T]\Lambda^T+\Psi\\&=\Lambda\Lambda^T+\Psi\quad(\because z\sim\mathcal N(0,I)\text{, so }\mathbb E[zz^T]=I).\end{aligned}$$

Likewise, we can calculate each $\Sigma_{ij}$, and we obtain

$$\Sigma=\begin{bmatrix}I&\Lambda^T\\\Lambda&\Lambda\Lambda^T+\Psi\end{bmatrix}.$$

$$\begin{bmatrix}z\\x\end{bmatrix}\sim\mathcal N\left(\begin{bmatrix}0\\\mu\end{bmatrix},\begin{bmatrix}I&\Lambda^T\\\Lambda&\Lambda\Lambda^T+\Psi\end{bmatrix}\right)$$
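A sanity check of this joint covariance by simulation, with hypothetical $\Lambda,\mu,\Psi$: sample from the generative model and compare the empirical covariance of $[z;x]$ with the block matrix above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factor-analysis parameters with d = 2, n = 3.
d, n, m = 2, 3, 200_000
Lam = rng.normal(size=(n, d))
mu = rng.normal(size=n)
Psi = np.diag(rng.uniform(0.5, 2.0, size=n))

# Sample (z, x) from the generative model.
z = rng.normal(size=(m, d))                              # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)  # eps ~ N(0, Psi)
x = mu + z @ Lam.T + eps

# Empirical covariance of the stacked vector [z; x] vs. the formula above.
emp = np.cov(np.hstack([z, x]).T)
theory = np.block([[np.eye(d),        Lam.T],
                   [Lam, Lam @ Lam.T + Psi]])
print(np.max(np.abs(emp - theory)))    # small (Monte Carlo error)
```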

E-step:

$$Q_i(z^{(i)})=p(z^{(i)}|x^{(i)};\theta)$$

$$z^{(i)}|x^{(i)}\sim\mathcal N(\mu_{z^{(i)}|x^{(i)}},\Sigma_{z^{(i)}|x^{(i)}})$$

where

$$\mu_{z^{(i)}|x^{(i)}}=\vec 0+\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)$$

$$\Sigma_{z^{(i)}|x^{(i)}}=I-\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda.$$
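These two formulas translate directly into code. A minimal numpy sketch (the function name and shapes are my own choices):

```python
import numpy as np

def fa_e_step(X, mu, Lam, Psi):
    """E-step of EM for factor analysis: for each x^(i), return the
    parameters of Q_i(z^(i)) = N(mu_{z|x}, Sigma_{z|x}) derived above.
    X: (m, n), mu: (n,), Lam: (n, d), Psi: (n, n) diagonal."""
    d = Lam.shape[1]
    S = Lam @ Lam.T + Psi                      # Lambda Lambda^T + Psi, shape (n, n)
    S_inv = np.linalg.inv(S)
    # mu_{z^(i)|x^(i)} = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x^(i) - mu)
    mu_z_given_x = (X - mu) @ S_inv @ Lam      # shape (m, d), one row per example
    # Sigma_{z|x} = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda  (same for all i)
    Sigma_z_given_x = np.eye(d) - Lam.T @ S_inv @ Lam
    return mu_z_given_x, Sigma_z_given_x
```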

M-step:

$$Q_i(z^{(i)})=\frac{1}{(2\pi)^{d/2}|\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}}\exp\left(-\frac{1}{2}(z^{(i)}-\mu_{z^{(i)}|x^{(i)}})^T\Sigma_{z^{(i)}|x^{(i)}}^{-1}(z^{(i)}-\mu_{z^{(i)}|x^{(i)}})\right)$$

$$\begin{aligned}\int_{z^{(i)}}Q_i(z^{(i)})z^{(i)}dz^{(i)}&=\mathbb E_{z^{(i)}\sim Q_i}[z^{(i)}]\\&=\mu_{z^{(i)}|x^{(i)}}\end{aligned}$$

$$\begin{aligned}\theta:&=\argmax_\theta\sum_i\int_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}dz^{(i)}\\&=\argmax_\theta\sum_i\mathbb E_{z^{(i)}\sim Q_i}\left[\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right].\end{aligned}$$

Plug the Gaussian densities into the numerator and the denominator, respectively.
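The notes stop before the closed-form maximization, so the sketch below only evaluates the M-step objective rather than maximizing it: it estimates $\sum_i\mathbb E_{z\sim Q_i}\left[\log p(x^{(i)},z;\theta)-\log Q_i(z)\right]$ by sampling $z$ from each $Q_i$ (assuming numpy/scipy; names are mine).

```python
import numpy as np
from scipy.stats import multivariate_normal

def fa_elbo(X, mu, Lam, Psi, n_samples=500, seed=0):
    """Monte Carlo estimate of the M-step objective
        sum_i E_{z ~ Q_i}[ log p(x^(i), z; theta) - log Q_i(z) ],
    with Q_i taken from the E-step above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    d = Lam.shape[1]

    # E-step quantities: per-example posterior means, shared posterior covariance.
    S_inv = np.linalg.inv(Lam @ Lam.T + Psi)
    mu_zx = (X - mu) @ S_inv @ Lam
    Sig_zx = np.eye(d) - Lam.T @ S_inv @ Lam

    total = 0.0
    for i in range(m):
        z = rng.multivariate_normal(mu_zx[i], Sig_zx, size=n_samples)  # z ~ Q_i
        log_pz = multivariate_normal.logpdf(z, np.zeros(d), np.eye(d))       # p(z)
        log_px_z = multivariate_normal.logpdf(X[i] - mu - z @ Lam.T,
                                              np.zeros(n), Psi)              # p(x|z)
        log_q = multivariate_normal.logpdf(z, mu_zx[i], Sig_zx)              # Q_i(z)
        total += np.mean(log_pz + log_px_z - log_q)
    return total
```

With $Q_i$ set to the posterior under the same $\theta$, this quantity matches the log-likelihood up to Monte Carlo error, which is another way to see that the E-step makes the bound tight.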
