AI Tech Day 15

Lee · February 5, 2021

Generative Models Part 1

Learning a Generative Model

  • Suppose we are given images of dogs.
  • We want to learn a probability distribution $p(x)$ such that
    • Generation: if we sample $x_{new} \sim p(x)$, then $x_{new}$ should look like a dog (sampling).
    • Density estimation: $p(x)$ should be high if $x$ looks like a dog, and low otherwise (anomaly detection).
      • Models with this property are also known as explicit models.
    • Unsupervised representation learning: we should be able to learn what these images have in common, e.g., ears, tail, etc. (feature learning).
  • Then, how can we represent $p(x)$?

Basic Discrete Distributions

  • Bernoulli distribution: (biased) coin flip
    • $D = \{Heads, Tails\}$
    • Specify $P(X = Heads) = p$. Then $P(X = Tails) = 1 - p$.
    • Write: $X \sim Ber(p)$.

  • Categorical distribution: (biased) m-sided die
    • $D = \{1, \cdots, m\}$
    • Specify $P(Y = i) = p_i$ such that $\sum_{i=1}^{m} p_i = 1$.
    • Write: $Y \sim Cat(p_1, \cdots, p_m)$.
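
To make the two distributions above concrete, here is a minimal sampling sketch in Python; the seed and the probability values are arbitrary illustrative choices, not from the lecture.

```python
# A minimal sketch: sampling from a Bernoulli and a categorical distribution.
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: X ~ Ber(p), where p = P(X = Heads)
p = 0.7
x = "Heads" if rng.random() < p else "Tails"

# Categorical: Y ~ Cat(p_1, ..., p_m) for a biased m-sided die (m = 4 here)
probs = [0.1, 0.2, 0.3, 0.4]             # the p_i, summing to 1
y = rng.choice(len(probs), p=probs) + 1  # face in {1, ..., m}

print(x, y)
```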

  • Example
    • Modeling an RGB joint distribution (of a single pixel)
      • $(r, g, b) \sim p(R, G, B)$
      • Number of cases?
        $256 \times 256 \times 256$
      • How many parameters do we need to specify?
        $256 \times 256 \times 256 - 1$
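
A quick sanity check of the arithmetic above:

```python
# Counting cases and parameters for the joint distribution of one RGB pixel.
num_cases = 256 ** 3        # 16,777,216 possible (r, g, b) values
num_params = num_cases - 1  # one probability per case, minus the sum-to-1 constraint
print(num_cases, num_params)  # 16777216 16777215
```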

  • Example
    • Suppose we have $X_1, \dots, X_n$ of $n$ binary pixels (a binary image).
      • How many possible states?
        $2 \times 2 \times \cdots \times 2 = 2^n$
      • Sampling from $p(x_1, \dots, x_n)$ generates an image.
      • How many parameters do we need to specify $p(x_1, \dots, x_n)$?
        $2^n - 1$
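
The following sketch makes the counting concrete: it represents the full joint distribution as an explicit table over all $2^n$ states and samples a tiny binary "image" from it. The value of $n$ and the random table are illustrative; for real images ($n = 784$) this table would be astronomically large.

```python
# A minimal sketch: the full joint p(x_1, ..., x_n) as an explicit table.
import numpy as np

rng = np.random.default_rng(0)
n = 4                                # tiny, so the 2^n-entry table fits in memory

table = rng.random(2 ** n)           # one entry per state...
table /= table.sum()                 # ...normalized, leaving 2^n - 1 free parameters

state = rng.choice(2 ** n, p=table)  # sample a state index from the joint
pixels = [(state >> i) & 1 for i in range(n)]  # decode the index into n bits
print(pixels)
```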

Structure Through Independence

  • What if $X_1, \dots, X_n$ are all independent? Then
    $p(x_1, \dots, x_n) = p(x_1)p(x_2)\cdots p(x_n)$

  • How many possible states?
    $2^n$

  • How many parameters to specify $p(x_1, \dots, x_n)$?
    $n$

  • $2^n$ entries can be described by just $n$ numbers! But this independence assumption is too strong to model useful distributions.
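
Under full independence the model shrinks to one Bernoulli parameter per pixel, as in this sketch (the parameter values are arbitrary):

```python
# A minimal sketch: a fully independent model needs only n parameters.
import numpy as np

rng = np.random.default_rng(0)
n = 4
p = rng.random(n)                         # n parameters: p_i = P(X_i = 1)

pixels = (rng.random(n) < p).astype(int)  # sample every pixel independently
print(pixels)
```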

Conditional Independence

  • Three important rules

    • Chain rule:
      $p(x_1, \dots, x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1, x_2)\cdots p(x_n|x_1, \dots, x_{n-1})$
      • How many parameters?
        • $p(x_1)$: 1 parameter
        • $p(x_2|x_1)$: 2 parameters ($p(x_2|x_1 = 0)$ and $p(x_2|x_1 = 1)$)
        • $p(x_3|x_1, x_2)$: 4 parameters
        • Hence, $1 + 2 + 2^2 + \cdots + 2^{n-1} = 2^n - 1$, which is the same as before.

    • Bayes' rule:
      $p(x|y) = \cfrac{p(x, y)}{p(y)} = \cfrac{p(y|x)p(x)}{p(y)}$

    • Conditional independence:
      If $x \perp y \mid z$, then $p(x|y, z) = p(x|z)$.

  • Now, suppose $X_{i+1} \perp X_1, \dots, X_{i-1} \mid X_i$ (Markov assumption). Then

    $p(x_1, \dots, x_n) = p(x_1)p(x_2|x_1)p(x_3|x_2)\cdots p(x_n|x_{n-1})$

  • How many parameters?

    $2n - 1$
  • Hence, by leveraging the Markov assumption, we get an exponential reduction in the number of parameters.

  • Auto-regressive models leverage this conditional independence, as sketched below.
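
As a concrete illustration of the Markov factorization and its $2n - 1$ parameters, here is a minimal sampling sketch (the parameter values are arbitrary):

```python
# A minimal sketch: sampling from the Markov factorization
# p(x_1) p(x_2|x_1) ... p(x_n|x_{n-1}) with 1 + 2(n - 1) = 2n - 1 parameters.
import numpy as np

rng = np.random.default_rng(0)
n = 8
p1 = 0.5                        # 1 parameter: p(x_1 = 1)
trans = rng.random((n - 1, 2))  # 2(n - 1) parameters: p(x_{i+1} = 1 | x_i = 0 or 1)

x = [int(rng.random() < p1)]
for i in range(n - 1):
    x.append(int(rng.random() < trans[i, x[-1]]))
print(x)  # one sample from p(x_1, ..., x_n)
```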



Auto-regressive Model

  • Suppose we have $28 \times 28$ binary pixels.
  • Our goal is to learn $p(x) = p(x_1, \dots, x_{784})$ over $x \in \{0, 1\}^{784}$.
  • How can we parametrize $p(x)$?
    • Let's use the chain rule to factor the joint distribution:
      $p(x_{1:784}) = p(x_1)p(x_2|x_1)p(x_3|x_{1:2})\cdots$
    • This is called an autoregressive model.
    • Note that we need an ordering of all random variables.
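
To see how such a factorization can be parametrized in practice, below is a minimal toy sketch in PyTorch where each conditional $p(x_i = 1 \mid x_{<i})$ is a logistic regression on the preceding pixels. This is an illustrative construction, not the lecture's specific model; the names `log_prob` and `sample` are my own.

```python
# A toy autoregressive Bernoulli model over D binary pixels: restricting
# row i of the weight matrix to its first i entries enforces the ordering,
# so pixel i depends only on x_1, ..., x_{i-1}.
import torch

D = 784  # 28 x 28 binary pixels

weights = torch.nn.Parameter(torch.zeros(D, D))
bias = torch.nn.Parameter(torch.zeros(D))

def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_{<i}) for a batch x in {0,1}^(B x D)."""
    logits = torch.stack(
        [x[:, :i] @ weights[i, :i] + bias[i] for i in range(D)], dim=1
    )
    return torch.distributions.Bernoulli(logits=logits).log_prob(x).sum(dim=1)

@torch.no_grad()
def sample():
    """Generate one image pixel by pixel, following the chosen ordering."""
    x = torch.zeros(1, D)
    for i in range(D):
        p_i = torch.sigmoid(x[:, :i] @ weights[i, :i] + bias[i])
        x[:, i] = torch.bernoulli(p_i)
    return x.view(28, 28)

# Usage: train by maximizing log_prob on binarized images (e.g., MNIST),
# then call sample() to generate a new 28 x 28 binary image.
```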