Lecture 15. EM Algorithm & Factor Analysis

cryptnomy · Nov 25, 2022

CS229: Machine Learning


Outline

  • EM convergence
  • Gaussian proportions
  • Factor analysis
  • Gaussian marginals vs. conditionals
  • EM steps

Recap

E-step:

$$Q_i(z^{(i)}):=p(z^{(i)}|x^{(i)};\theta).$$

M-step:

$$\theta:=\argmax_\theta\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}.$$

$$p(x^{(i)},z^{(i)})=p(x^{(i)}|z^{(i)})p(z^{(i)})$$

$$z^{(i)}\sim\text{Multinomial}(\phi)\qquad[p(z^{(i)}=j)=\phi_j]$$

$$x^{(i)}|z^{(i)}=j\sim\mathcal N(\mu_j,\Sigma_j)$$

E-step:

$$w_j^{(i)}=Q_i(z^{(i)}=j)=p(z^{(i)}=j|x^{(i)};\phi,\mu,\Sigma).$$

M-step:

$$\begin{aligned}&\max_{\phi,\mu,\Sigma}\;\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\phi,\mu,\Sigma)}{Q_i(z^{(i)})}\\&=\max_{\phi,\mu,\Sigma}\sum_i\sum_jw^{(i)}_j\log\frac{\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w_j^{(i)}}\\&=:\max f\end{aligned}$$
$$\begin{aligned}&\nabla_{\mu_j}f\stackrel{\text{set}}{=}0\\&\Longrightarrow\mu_j=\frac{\sum_iw_j^{(i)}x^{(i)}}{\sum_iw_j^{(i)}}.\end{aligned}$$

$w_j^{(i)}$ … the “strength” with which $x^{(i)}$ is assigned to Gaussian $j$:

$$p(z^{(i)}=j|x^{(i)};\cdots)$$
$$\begin{aligned}&\nabla_{\phi_j}f\stackrel{\text{set}}{=}0\\&\Longrightarrow\phi_j=\frac{\sum_iw_j^{(i)}}{\sum_i\sum_lw_l^{(i)}}=\frac{1}{m}\sum_iw_j^{(i)}.\end{aligned}$$
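The updates above, together with the standard update for $\Sigma_j$ (which is not derived in these notes), fit into one EM iteration. A minimal numpy/scipy sketch; the function name and array shapes are my own choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm_step(X, phi, mu, Sigma):
    """One EM iteration for a mixture of Gaussians.
    X: (m, n) data, phi: (k,), mu: (k, n), Sigma: (k, n, n)."""
    m, n = X.shape
    k = len(phi)

    # E-step: w[i, j] = Q_i(z^(i) = j) = p(z^(i) = j | x^(i); phi, mu, Sigma)
    w = np.zeros((m, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: closed-form maximizers of the lower bound f
    phi_new = w.mean(axis=0)                      # phi_j = (1/m) sum_i w_j^(i)
    mu_new = (w.T @ X) / w.sum(axis=0)[:, None]   # weighted means from the gradient above
    Sigma_new = np.zeros_like(Sigma)
    for j in range(k):
        d = X - mu_new[j]
        # standard weighted-covariance update (not derived in these notes)
        Sigma_new[j] = (w[:, j, None] * d).T @ d / w[:, j].sum()
    return phi_new, mu_new, Sigma_new
```

Iterating this step until the log-likelihood stops improving is the whole algorithm; each iteration never decreases the log-likelihood.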

Define

$$J(\theta,Q)=\sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}.$$

We know

$$l(\theta)\ge J(\theta,Q)$$

for any $\theta,Q$.

E-step:

Maximize JJ w.r.t. QQ.

M-step:

Maximize JJ w.r.t. θ\theta.
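A quick numeric sanity check of the bound on a toy 1-D, two-component mixture with made-up parameters: $J(\theta,Q)$ never exceeds $l(\theta)$, and choosing $Q_i$ to be the posterior (the E-step) makes the bound tight.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D mixture of two Gaussians (hypothetical parameters and data).
phi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.5])
x = np.array([-1.8, 0.5, 2.9, 3.4])

# p(x_i, z_i = j; theta) for every i, j
joint = phi * norm.pdf(x[:, None], mu, sigma)          # shape (m, k)
log_lik = np.log(joint.sum(axis=1)).sum()              # l(theta)

def J(Q):
    """Lower bound J(theta, Q) = sum_i sum_j Q_ij * log(joint_ij / Q_ij)."""
    return np.sum(Q * (np.log(joint) - np.log(Q)))

Q_posterior = joint / joint.sum(axis=1, keepdims=True)  # the E-step choice
Q_uniform = np.full_like(joint, 0.5)                     # some other valid Q

print(log_lik, J(Q_posterior))   # equal: the bound is tight at the posterior
print(J(Q_uniform) <= log_lik)   # True: l(theta) >= J(theta, Q) for any Q
```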

Mixture of Gaussians

Say $n=2,\ m=100$, i.e. $m\gg n$: this is the setting where the Gaussian models above work fine. What if instead $n\approx m$ or $n\gg m$?

Model as single Gaussian:

$$X\sim\mathcal N(\mu,\Sigma)$$

MLE:

$$\begin{aligned}\mu&=\frac{1}{m}\sum_ix^{(i)}\\\Sigma&=\frac{1}{m}\sum_i(x^{(i)}-\mu)(x^{(i)}-\mu)^T\end{aligned}$$

If $m\le n$, then $\Sigma$ will be singular / non-invertible.

Option 1. Constrain $\Sigma\in\mathbb R^{n\times n}$ to be diagonal.

$$\Sigma=\begin{bmatrix}\sigma_1^2&&&\\&\sigma_2^2&&\\&&\ddots&\\&&&\sigma_n^2\end{bmatrix}$$

MLE:

$$\sigma_j^2=\frac{1}{m}\sum_i(x_j^{(i)}-\mu_j)^2.$$

Q. Problem with this assumption?

A. The model assumes that all of your features are uncorrelated. E.g., if you have temperature sensors in a room, it’s not a good assumption that the temperatures at different points of the room are completely uncorrelated.

Option 2. (Stronger assumption)

Constrain $\Sigma$ to be $\Sigma=\sigma^2I$.

$$\Sigma=\begin{bmatrix}\sigma^2&&&\\&\sigma^2&&\\&&\ddots&\\&&&\sigma^2\end{bmatrix}$$

MLE:

$$\sigma^2=\frac{1}{mn}\sum_i\sum_j(x_j^{(i)}-\mu_j)^2.$$
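A small numpy illustration of the three estimators with $m=30$ examples in $n=100$ dimensions; the data here is just standard-normal noise, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 100                       # fewer examples than dimensions
X = rng.normal(size=(m, n))

mu = X.mean(axis=0)
D = X - mu

# Unconstrained MLE: rank <= m < n, so it is singular.
Sigma_full = D.T @ D / m
print(np.linalg.matrix_rank(Sigma_full))        # at most m (< n)

# Option 1: diagonal MLE, sigma_j^2 = (1/m) sum_i (x_j^(i) - mu_j)^2
Sigma_diag = np.diag((D ** 2).mean(axis=0))

# Option 2: spherical MLE, sigma^2 = (1/(mn)) sum_{i,j} (x_j^(i) - mu_j)^2
Sigma_sph = (D ** 2).mean() * np.eye(n)

# Both constrained estimates are well-conditioned (invertible) here.
print(np.linalg.cond(Sigma_diag) < 1e12, np.linalg.cond(Sigma_sph) < 1e12)
```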

What does factor analysis do?

→ It captures some of the correlations between features without running into the non-invertible covariance matrices that the naive Gaussian model does, even with 100-dimensional data and 30 examples.

Q. ?

A. A common thing to do is to apply a Wishart prior … add a small diagonal value to the MLE: $\Sigma+\epsilon I$.

This technically takes away the non-invertible matrix problem, but it’s not the best model for a lot of datasets.
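A minimal sketch of that fix on the same kind of synthetic data, with an arbitrary choice of $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 100
X = rng.normal(size=(m, n))
D = X - X.mean(axis=0)
Sigma = D.T @ D / m                      # singular MLE (rank < n)

eps = 1e-3                               # epsilon is an arbitrary small value
Sigma_reg = Sigma + eps * np.eye(n)
np.linalg.cholesky(Sigma_reg)            # succeeds: Sigma + eps*I is positive definite
print(np.linalg.slogdet(Sigma_reg)[0])   # 1.0, i.e. det > 0, so it is invertible
```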

Q. Why use option 2, which is even worse than option 1?

A. To develop a description for factor analysis.

Andrew Ng comment:

These days large tech companies work on similar problems.

One of the really overlooked parts of the machine learning world is small-data problems. A lot of practical applications of machine learning, including class projects, are small-data problems. They feel like a blind spot, or a gap, in a lot of the work done in the AI world today.

Q. Why don’t we use the same algorithms with big data?

A. Andrew Ng thinks that in the machine learning world we are not very good at understanding scaling; we don’t actually have a good understanding of how to modify our algorithms.

(Facebook recently (2020) published a paper that handles 3.5 billion images, and the result was cool.)

Framework:

$$p(x,z)=p(x|z)p(z)$$

$z$ is hidden (latent).

E.g., for $d=3,\ m=100,\ n=30$,

$$z\sim\mathcal N(0,I),\quad z\in\mathbb R^d\;(d<n)$$

$$x=\mu+\Lambda z+\epsilon$$

where $\epsilon\sim\mathcal N(0,\Psi)$.

Parameters:

$$\mu\in\mathbb R^n,\quad\Lambda\in\mathbb R^{n\times d},\quad\Psi\in\mathbb R^{n\times n}\text{ diagonal}$$

Equivalently,

$$x|z\sim\mathcal N(\mu+\Lambda z,\Psi)$$

Diagonal $\Psi$ (the covariance of the noise $\epsilon$) means the noise in each sensor is independent of the noise in every other sensor.

Example 1.

$$z\in\mathbb R,\ x\in\mathbb R^2,\ d=1,\ n=2,\ m=7$$

and $z\sim\mathcal N(0,1)$.

(Source: https://youtu.be/tw6cmL5STuY?t=49m42s)

Say

$$\Lambda=\begin{bmatrix}2\\1\end{bmatrix},\quad\mu=\begin{bmatrix}0\\0\end{bmatrix},\quad\Lambda z+\mu\in\mathbb R^2,\quad\Psi=\begin{bmatrix}1&0\\0&2\end{bmatrix}.$$

The red crosses here are a typical sample drawn from this model.
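A minimal numpy sampler for this example (variable names are mine): it draws $m=7$ points with the $\Lambda,\mu,\Psi$ above, so the latent part $\Lambda z+\mu$ lies on a line and $\epsilon$ scatters the observed points around it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 1 parameters: d = 1, n = 2, m = 7
Lam = np.array([[2.0], [1.0]])          # Lambda, shape (n, d)
mu = np.zeros(2)
Psi = np.diag([1.0, 2.0])
m = 7

z = rng.normal(size=(m, 1))                              # z ~ N(0, I_d)
eps = rng.multivariate_normal(np.zeros(2), Psi, size=m)  # eps ~ N(0, Psi)
x = mu + z @ Lam.T + eps                                 # x = mu + Lambda z + eps

# The points z @ Lam.T lie exactly on the line spanned by Lambda;
# the noise eps scatters the observed x around that line.
print(x)
```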

Example 2.

$$z\in\mathbb R^2,\ x\in\mathbb R^3,\ d=2,\ n=3,\ m=5$$

(Source: https://youtu.be/tw6cmL5STuY?t=52m16s; cool animation 😀)

Compute $\Lambda z+\mu$.

Factor analysis can take very high dimensional data, e.g., 100 dimensional data.

Multivariate Gaussian

$$x=\begin{bmatrix}x_1\\x_2\end{bmatrix}$$

where

$$x_1\in\mathbb R^r,\quad x_2\in\mathbb R^s,\quad x\in\mathbb R^{r+s}.$$

$$x\sim\mathcal N(\mu,\Sigma)$$

where

$$\mu=\begin{bmatrix}\mu_1\\\mu_2\end{bmatrix},\quad\Sigma=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{bmatrix}.$$

Marginal: $p(x_1)=?$

$$p(x)=p(x_1,x_2)$$

$$\int_{x_2}p(x_1,x_2)dx_2=p(x_1)$$

$$p(x_1,x_2)=\frac{1}{(2\pi)^{(r+s)/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}\begin{pmatrix}x_1-\mu_1\\x_2-\mu_2\end{pmatrix}^T\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{pmatrix}^{-1}\begin{pmatrix}x_1-\mu_1\\x_2-\mu_2\end{pmatrix}\right)$$

Carrying out the integral gives $x_1\sim\mathcal N(\mu_1,\Sigma_{11})$.

Conditional: $p(x_1|x_2)=\ ?$

$$x_1|x_2\sim\mathcal N(\mu_{1|2},\Sigma_{1|2})$$

$$\mu_{1|2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)$$

$$\Sigma_{1|2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
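A quick Monte Carlo check of these conditioning formulas with made-up numbers ($r=s=1$): condition on $x_2$ landing near a fixed value and compare the empirical mean/variance of $x_1$ with $\mu_{1|2}$ and $\Sigma_{1|2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small joint Gaussian over (x1, x2) with r = 1, s = 1 (hypothetical numbers).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Formulas from above, conditioning on x2 = 0.
x2_star = 0.0
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_star - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Monte Carlo: keep samples whose x2 lands near x2_star.
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 1] - x2_star) < 0.05]
print(mu_cond, near[:, 0].mean())     # should be close (Monte Carlo error)
print(var_cond, near[:, 0].var())     # should be close
```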

Derive $p(x,z)$.

$$\begin{pmatrix}z\\x\end{pmatrix}\sim\mathcal N(\mu_{x,z},\Sigma)$$

$$z\sim\mathcal N(0,I),\quad x=\mu+\Lambda z+\epsilon$$

$$\mathbb Ez=0,\quad\mathbb Ex=\mathbb E[\mu+\Lambda z+\epsilon]=\mu$$

$$\mu_{x,z}=\begin{bmatrix}0\\\mu\end{bmatrix}\begin{matrix}\updownarrow d\text{-dim}\\\updownarrow n\text{-dim}\end{matrix}$$

$$\begin{aligned}\Sigma&=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{bmatrix}\\&=\begin{bmatrix}\mathbb E(z-\mathbb Ez)(z-\mathbb Ez)^T&\mathbb E(z-\mathbb Ez)(x-\mathbb Ex)^T\\\mathbb E(x-\mathbb Ex)(z-\mathbb Ez)^T&\mathbb E(x-\mathbb Ex)(x-\mathbb Ex)^T\end{bmatrix}\end{aligned}$$

where the first block row/column is $d$-dimensional and the second is $n$-dimensional.

E.g.,

$$\begin{aligned}\Sigma_{22}&=\mathbb E(x-\mathbb Ex)(x-\mathbb Ex)^T\\&=\mathbb E[(\Lambda z+\mu+\epsilon-\mu)(\Lambda z+\mu+\epsilon-\mu)^T]\\&=\mathbb E[\Lambda zz^T\Lambda^T+\Lambda z\epsilon^T+\epsilon z^T\Lambda^T+\epsilon\epsilon^T]\\&=\mathbb E[\Lambda zz^T\Lambda^T]+\mathbb E[\epsilon\epsilon^T]\quad(z\perp\epsilon\text{ and }\mathbb E[\epsilon]=0\text{, so the cross terms vanish})\\&=\Lambda\mathbb E[zz^T]\Lambda^T+\Psi\\&=\Lambda\Lambda^T+\Psi\quad(\because z\sim\mathcal N(0,I)\text{, so }\mathbb E[zz^T]=I).\end{aligned}$$

Likewise, we can calculate each $\Sigma_{ij}$, and we obtain

$$\Sigma=\begin{bmatrix}I&\Lambda^T\\\Lambda&\Lambda\Lambda^T+\Psi\end{bmatrix}.$$

$$\begin{bmatrix}z\\x\end{bmatrix}\sim\mathcal N\left(\begin{bmatrix}0\\\mu\end{bmatrix},\begin{bmatrix}I&\Lambda^T\\\Lambda&\Lambda\Lambda^T+\Psi\end{bmatrix}\right)$$
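A sanity check of this joint covariance by simulation, with hypothetical $\Lambda,\mu,\Psi$: sample from the generative model and compare the empirical covariance of $[z;x]$ with the block matrix above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factor-analysis parameters with d = 2, n = 3.
d, n, m = 2, 3, 200_000
Lam = rng.normal(size=(n, d))
mu = rng.normal(size=n)
Psi = np.diag(rng.uniform(0.5, 2.0, size=n))

# Sample (z, x) from the generative model.
z = rng.normal(size=(m, d))                              # z ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(n), Psi, size=m)  # eps ~ N(0, Psi)
x = mu + z @ Lam.T + eps

# Empirical covariance of the stacked vector [z; x] vs. the formula above.
emp = np.cov(np.hstack([z, x]).T)
theory = np.block([[np.eye(d),        Lam.T],
                   [Lam, Lam @ Lam.T + Psi]])
print(np.max(np.abs(emp - theory)))    # small (Monte Carlo error)
```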

E-step:

$$Q_i(z^{(i)})=p(z^{(i)}|x^{(i)};\theta)$$

$$z^{(i)}|x^{(i)}\sim\mathcal N(\mu_{z^{(i)}|x^{(i)}},\Sigma_{z^{(i)}|x^{(i)}})$$

where

$$\mu_{z^{(i)}|x^{(i)}}=\vec 0+\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}(x^{(i)}-\mu)$$

$$\Sigma_{z^{(i)}|x^{(i)}}=I-\Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda.$$
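These two formulas translate directly into code. A minimal numpy sketch (the function name and shapes are my own choices):

```python
import numpy as np

def fa_e_step(X, mu, Lam, Psi):
    """E-step of EM for factor analysis: for each x^(i), return the
    parameters of Q_i(z^(i)) = N(mu_{z|x}, Sigma_{z|x}) derived above.
    X: (m, n), mu: (n,), Lam: (n, d), Psi: (n, n) diagonal."""
    d = Lam.shape[1]
    S = Lam @ Lam.T + Psi                      # Lambda Lambda^T + Psi, shape (n, n)
    S_inv = np.linalg.inv(S)
    # mu_{z^(i)|x^(i)} = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x^(i) - mu)
    mu_z_given_x = (X - mu) @ S_inv @ Lam      # shape (m, d), one row per example
    # Sigma_{z|x} = I - Lambda^T (Lambda Lambda^T + Psi)^{-1} Lambda  (same for all i)
    Sigma_z_given_x = np.eye(d) - Lam.T @ S_inv @ Lam
    return mu_z_given_x, Sigma_z_given_x
```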

M-step:

$$Q_i(z^{(i)})=\frac{1}{(2\pi)^{d/2}|\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}}\exp\left(-\frac{1}{2}(z^{(i)}-\mu_{z^{(i)}|x^{(i)}})^T\Sigma_{z^{(i)}|x^{(i)}}^{-1}(z^{(i)}-\mu_{z^{(i)}|x^{(i)}})\right)$$

$$\begin{aligned}\int_{z^{(i)}}Q_i(z^{(i)})z^{(i)}dz^{(i)}&=\mathbb E_{z^{(i)}\sim Q_i}[z^{(i)}]\\&=\mu_{z^{(i)}|x^{(i)}}\end{aligned}$$

$$\begin{aligned}\theta:&=\argmax_\theta\sum_i\int_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}dz^{(i)}\\&=\argmax_\theta\sum_i\mathbb E_{z^{(i)}\sim Q_i}\left[\log\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right].\end{aligned}$$

Plug the Gaussian densities into the numerator and the denominator, respectively.
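The notes stop before the closed-form maximization, so the sketch below only evaluates the M-step objective rather than maximizing it: it estimates $\sum_i\mathbb E_{z\sim Q_i}\left[\log p(x^{(i)},z;\theta)-\log Q_i(z)\right]$ by sampling $z$ from each $Q_i$ (assuming numpy/scipy; names are mine).

```python
import numpy as np
from scipy.stats import multivariate_normal

def fa_elbo(X, mu, Lam, Psi, n_samples=500, seed=0):
    """Monte Carlo estimate of the M-step objective
        sum_i E_{z ~ Q_i}[ log p(x^(i), z; theta) - log Q_i(z) ],
    with Q_i taken from the E-step above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    d = Lam.shape[1]

    # E-step quantities: per-example posterior means, shared posterior covariance.
    S_inv = np.linalg.inv(Lam @ Lam.T + Psi)
    mu_zx = (X - mu) @ S_inv @ Lam
    Sig_zx = np.eye(d) - Lam.T @ S_inv @ Lam

    total = 0.0
    for i in range(m):
        z = rng.multivariate_normal(mu_zx[i], Sig_zx, size=n_samples)  # z ~ Q_i
        log_pz = multivariate_normal.logpdf(z, np.zeros(d), np.eye(d))       # p(z)
        log_px_z = multivariate_normal.logpdf(X[i] - mu - z @ Lam.T,
                                              np.zeros(n), Psi)              # p(x|z)
        log_q = multivariate_normal.logpdf(z, mu_zx[i], Sig_zx)              # Q_i(z)
        total += np.mean(log_pz + log_px_z - log_q)
    return total
```

With $Q_i$ set to the posterior under the same $\theta$, this quantity matches the log-likelihood up to Monte Carlo error, which is another way to see that the E-step makes the bound tight.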
