[DL Basic] 09. Generative Models 1

Angie · February 10, 2022

Deep Generative Models

Generative Models

Suppose we are given images of dogs.
We want to learn a probability distribution $p(x)$ such that
- Generation: if we sample $x_{new} \sim p(x)$, then $x_{new}$ should look like a dog (sampling)
- Density estimation: $p(x)$ should be high if $x$ looks like a dog, and low otherwise (anomaly detection). In this respect the model behaves like a discriminative model, so a generative model also covers what a discriminative model does.
  - Models that can do this are known as explicit models: given an input, they return its probability.
  - cf. implicit models: they can only generate samples.
- Unsupervised representation learning: we should be able to learn what these images have in common, e.g., ears, tail, etc. (feature learning), although whether this belongs here is somewhat debatable.
Then, how can we represent $p(x)$ ?
1. As something that outputs a value when given an input $x$
2. As some model from which $x$ can be sampled

Basic Discrete Distributions

Bernoulli distribution: (biased) coin flip
- $D$ = {heads, tails}
- Specify $P(X = \text{Heads}) = p$ . Then $P(X = \text{Tails}) = 1 - p$ .
- Write: $X \sim \text{Ber}(p)$ .
Categorical distribution: (biased) m-sided die
- $D = \{1, ..., m\}$
- Specify $P(Y = i) = p_i$ , such that $\sum\limits_{i=1}^{m}p_i = 1$
- Write: $Y \sim \text{Cat}(p_1, ..., p_m)$
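
As a quick illustration of the two distributions above, here is a minimal sampling sketch using only the Python standard library. The helper names `sample_bernoulli` and `sample_categorical`, and the probabilities 0.7 and (0.5, 0.3, 0.2), are made up for this example.

```python
import random

def sample_bernoulli(p, rng):
    """Return 1 (heads) with probability p, otherwise 0 (tails)."""
    return 1 if rng.random() < p else 0

def sample_categorical(probs, rng):
    """Return a face in 1..m drawn with probabilities p_1..p_m."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
flips = [sample_bernoulli(0.7, rng) for _ in range(10_000)]
rolls = [sample_categorical([0.5, 0.3, 0.2], rng) for _ in range(10_000)]

# Empirical frequencies should be close to the specified parameters
heads_freq = sum(flips) / len(flips)
roll_freq = [rolls.count(i) / len(rolls) for i in (1, 2, 3)]
print(heads_freq, roll_freq)
```

Note that the Bernoulli needs one number and the categorical needs $m-1$ free numbers, since the last probability is fixed by the sum-to-one constraint.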

Example

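Modeling an RGB joint distribution (of a single pixel)
- $(r,g,b) \sim p(R,G,B)$
  - $r$, $g$, $b$ are three variables with 256 values each; the counts below make no independence assumption among them (with independent channels, $3 \times 255$ parameters would suffice)
- Number of cases?
  - 256 x 256 x 256
- How many parameters do we need to specify?
  - 256 x 256 x 256 - 1

    → Fully describing the distribution of even a single pixel takes an enormous number of parameters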
Suppose we have $X_1, ..., X_n$ , i.e. $n$ binary pixels (a binary image), for example 28 x 28 = 784
- how many possible states?
  - $2 \times2\times ... \times2 = 2^n$
  - $2^{784}$
- sampling from $p(x_1, ..., x_n)$ generates an image
- how many parameters to specify $p(x_1, ..., x_n)$ ?
  - $2^n -1$
  - $2^{784} - 1$
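
The counts for a fully general joint distribution can be checked with a few lines of Python. This is only an arithmetic sketch; the helper names `num_states` and `num_free_params` are made up here.

```python
def num_states(values_per_var, n_vars):
    """Number of joint configurations of n_vars variables."""
    return values_per_var ** n_vars

def num_free_params(values_per_var, n_vars):
    """A full joint table needs one probability per state, minus one
    because the probabilities must sum to 1."""
    return num_states(values_per_var, n_vars) - 1

rgb_params = num_free_params(256, 3)           # one RGB pixel
binary_image_params = num_free_params(2, 784)  # 28 x 28 binary image

print(rgb_params)
print(len(str(binary_image_params)), "decimal digits")
```

Python integers have arbitrary precision, so $2^{784} - 1$ is computed exactly rather than overflowing.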

In machine learning, the more parameters a model has, the harder it becomes to train.

Structure Through Independence

What if we assume that all n pixels are independent? (This is not realistic, since adjacent pixels tend to be similar.)

What if $X_1, ..., X_n$ are independent? Then
- $p(x_1, ..., x_n) = p(x_1)p(x_2)...p(x_n)$
How many possible states?
- $2^n$
How many parameters to specify $p(x_1, ..., x_n)$ ?
- $n$
- One parameter per pixel is enough, because the pixels are independent
$2^n$ entries can be described by just $n$ numbers! But this independence assumption is too strong to model useful distributions. The number of parameters drops from $2^n - 1$ to $n$.

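The independent model is too limited in what it can express, so it cannot produce images like the ones we normally see.
So we need to find a middle ground somewhere between the $2^n$ approach and the $n$ approach.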
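
To make the fully independent model concrete, here is a minimal sketch with $n = 4$ binary pixels. The parameter values in `p` are arbitrary, and the function names are invented for this example.

```python
import itertools
import math
import random

def log_prob_independent(x, p):
    """log p(x) = sum_i log p(x_i) when every pixel is an independent Bernoulli."""
    return sum(math.log(pi if xi == 1 else 1.0 - pi) for xi, pi in zip(x, p))

def sample_independent(p, rng):
    """Sample each pixel on its own, ignoring all other pixels."""
    return [1 if rng.random() < pi else 0 for pi in p]

n = 4
p = [0.9, 0.2, 0.5, 0.7]  # n parameters, one per pixel (arbitrary values)

x = sample_independent(p, random.Random(0))

# The n numbers define a valid distribution over all 2^n states
total = sum(
    math.exp(log_prob_independent(state, p))
    for state in itertools.product([0, 1], repeat=n)
)
print(x, total)
```

Because every pixel is sampled without looking at its neighbors, samples from such a model look like noise, which is why the assumption is too strong for images.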

Conditional Independence

Three important rules
Chain rule
- $p(x_1, ..., x_n) = p(x_1)p(x_2 \vert x_1)p(x_3 \vert x_1, x_2)...p(x_n\vert x_1, ..., x_{n-1})$
- Converts the joint distribution over n variables into a product of n conditional distributions
- It holds whether or not $x_1$ and $x_n$ are independent
Bayes’ rule
- $p(x \vert y) = \frac{p(x,y)}{p(y)}=\frac{p(y\vert x)p(x)}{p(y)}$
Conditional independence
- if $x\bot y \vert z$ , then $p(x\vert y,z) = p(x\vert z)$
- Assumption: $x$ and $y$ are independent given $z$ .
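
The chain rule and Bayes' rule are identities, so they can be checked numerically on any joint table. The sketch below does this for two binary variables; the numbers in `joint` are made up.

```python
# Hypothetical joint table p(x, y) over two binary variables
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def p_x(x):
    """Marginal p(x), obtained by summing out y."""
    return sum(joint[(x, y)] for y in (0, 1))

def p_y(y):
    """Marginal p(y), obtained by summing out x."""
    return sum(joint[(x, y)] for x in (0, 1))

def p_x_given_y(x, y):
    return joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):
    return joint[(x, y)] / p_x(x)

states = [(x, y) for x in (0, 1) for y in (0, 1)]

# Chain rule: p(x, y) = p(x) p(y | x)
chain_ok = all(
    abs(joint[(x, y)] - p_x(x) * p_y_given_x(y, x)) < 1e-12 for x, y in states
)

# Bayes' rule: p(x | y) = p(y | x) p(x) / p(y)
bayes_ok = all(
    abs(p_x_given_y(x, y) - p_y_given_x(y, x) * p_x(x) / p_y(y)) < 1e-12
    for x, y in states
)
print(chain_ok, bayes_ok)
```

Conditional independence, by contrast, is an assumption about the data and does not hold for an arbitrary table.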

Using the chain rule,
- $p(x_1, ...,x_n) = p(x_1)p(x_2\vert x_1)p(x_3 \vert x_1, x_2)...p(x_n\vert x_1,...,x_{n-1})$
how many parameters?
- $p(x_1)$ : 1 parameter
- $p(x_2\vert x_1)$ : 2 parameters (one for $p(x_2\vert x_1 = 0)$ and one for $p(x_2\vert x_1 = 1)$ )
- $p(x_3\vert x_1, x_2)$ : 4 parameters
- Hence, $1 +2+2^2+...+2^{n-1}=2^n-1$ , which is the same as before.
Why?
- Nothing has changed. The chain rule only rewrites the joint distribution as a product of conditional distributions; we have not made any assumption (e.g. conditional independence)
  → So it has the same number of parameters as the fully dependent model
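
The sum $1 + 2 + 2^2 + ... + 2^{n-1}$ can be checked directly. This small sketch (the function name is invented here) adds up the parameters of each chain-rule factor for binary variables.

```python
def chain_rule_param_count(n):
    """Parameters of p(x_1) p(x_2|x_1) ... p(x_n|x_1..x_{n-1}) for binary variables.
    The i-th factor needs one Bernoulli parameter per configuration of its
    i-1 parents, i.e. 2^(i-1) parameters."""
    return sum(2 ** (i - 1) for i in range(1, n + 1))

for n in (1, 2, 3, 10, 784):
    # Matches the size of the full joint table minus one
    assert chain_rule_param_count(n) == 2 ** n - 1

print(chain_rule_param_count(3))
```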

Now, suppose $X_{i+1} \bot X_1,...,X_{i-1}\vert X_i$ (Markov assumption),
- The $(i+1)$ -th pixel depends only on the $i$ -th pixel
- $p(x_1, ...,x_n) = p(x_1)p(x_2\vert x_1)p(x_3\vert x_2)...p(x_n\vert x_{n-1})$
how many parameters?
- $2n-1$
Hence, by leveraging the Markov assumption, we get an exponential reduction in the number of parameters
- Splitting the joint distribution with the chain rule alone does not change the number of parameters.
  Applying the Markov assumption after splitting introduces conditional independence, which cuts the parameters to $2n-1$ (still more than the fully independent model)
Auto-regressive models leverage this conditional independence
- Depending on which conditional independence assumptions we impose, the number of parameters changes accordingly
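
Under the Markov assumption, the model is one parameter for the first pixel plus two per transition. Below is a minimal sketch with $n = 5$ binary pixels; the probability values and all names (`p_first`, `p_trans`, `markov_prob`, `markov_sample`) are assumptions made for illustration.

```python
import itertools
import random

n = 5
p_first = 0.6  # p(x[0] = 1): 1 parameter
# p_trans[i - 1][prev] = p(x[i] = 1 | x[i - 1] = prev), for i = 1..n-1 (0-based):
# 2 parameters per transition
p_trans = [[0.2, 0.8] for _ in range(n - 1)]
num_params = 1 + 2 * (n - 1)  # = 2n - 1

def bern(x, p):
    """Probability of outcome x under a Bernoulli with p(x = 1) = p."""
    return p if x == 1 else 1.0 - p

def markov_prob(x):
    """p(x) = p(x_1) * prod_i p(x_i | x_{i-1})."""
    prob = bern(x[0], p_first)
    for i in range(1, n):
        prob *= bern(x[i], p_trans[i - 1][x[i - 1]])
    return prob

def markov_sample(rng):
    """Sample pixels in order, each conditioned only on the previous one."""
    x = [1 if rng.random() < p_first else 0]
    for i in range(1, n):
        x.append(1 if rng.random() < p_trans[i - 1][x[-1]] else 0)
    return x

total = sum(markov_prob(s) for s in itertools.product([0, 1], repeat=n))
print(num_params, markov_sample(random.Random(0)), total)
```

With these transition values a pixel tends to copy its predecessor, a simple form of the "adjacent pixels are similar" structure that the independent model cannot express.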

Auto-regressive Model

Suppose we have 28 x 28 binary pixels.
Our goal is to learn $p(x)=p(x_1, ...,x_{784})$ over $x\in \{0,1\}^{784}$ .
How can we parametrize $p(x)$ ?
- Let’s use the chain rule to factor the joint distribution.
- $p(x_{1:784})=p(x_1)p(x_2\vert x_1)p(x_3\vert x_{1:2})...$
- This is called an autoregressive model
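  - Each value depends on the values before it
    : Whether it depends on only the single previous value (as in the Markov case) or on all previous values, it is still an autoregressive model!
  - Conditioning on the previous n values: AR-n model
- Note that we need an ordering of all random variables
  - The ordering matters because it defines the dependencies!
  - What is the order of the pixels in an image? It is not obvious → performance and methodology can change depending on how the ordering is chosen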
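
One possible ordering is a raster scan (row by row, left to right). This is an assumption for illustration, not the only valid choice. The sketch below flattens a tiny image and shows which earlier pixels a given pixel may condition on, either all of them or only the last `k` (an AR-k model). The helper names are made up.

```python
def raster_flatten(image):
    """Flatten a 2D image (list of rows) into a 1D sequence, row by row."""
    return [pixel for row in image for pixel in row]

def context(x, i, k=None):
    """Pixels that x[i] is allowed to depend on (0-based index i).
    k=None: all earlier pixels; k=m: only the m most recent ones (AR-m)."""
    start = 0 if k is None else max(0, i - k)
    return x[start:i]

image = [[0, 1, 1],
         [1, 0, 0],
         [0, 0, 1]]
x = raster_flatten(image)

full_ctx = context(x, 5)         # everything before position 5
markov_ctx = context(x, 5, k=1)  # only the immediately preceding pixel
print(x, full_ctx, markov_ctx)
```

Under a different ordering (for example column by column) the same pixel would get a different context, which is why the choice of ordering can affect the model.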

NADE: Neural Autoregressive Density Estimator

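The $i$ -th pixel is made dependent on pixels 1 through $i-1$
- The distribution of the first pixel does not depend on anything,
  and the probability of the second pixel depends only on the first pixel
- A neural network takes the pixel values as input and produces a single scalar, which is passed through a sigmoid to give a number between 0 and 1
- From the neural network's point of view, the input dimension keeps changing → the weight matrix keeps growing with $i$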
The probability distribution of $i$ -th pixel is
- $p(x_i\vert x_{1:i-1})=\sigma(\alpha_ih_i+b_i)$ where $h_i = \sigma(W_{<i}x_{1:i-1}+c)$
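
The formula above can be sketched in plain Python as follows. This is a toy forward pass with random, untrained weights, not a reference implementation. It assumes that $W_{<i}$ means the first $i-1$ columns of one shared matrix `W`, and the sizes `D = 6`, `H = 4` are chosen only so that all $2^D$ states can be enumerated.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

D, H = 6, 4  # toy sizes (the text uses D = 784 pixels)
rng = random.Random(0)
W = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]      # H x D, shared
c = [0.0] * H
alpha = [[rng.gauss(0, 0.5) for _ in range(H)] for _ in range(D)]  # one row per pixel
b = [0.0] * D

def nade_conditionals(x):
    """Return [p(x_i = 1 | x_{<i})] for every pixel i (0-based)."""
    probs = []
    for i in range(D):
        # h_i only sees pixels before i: the first i columns of W
        h = [sigmoid(sum(W[j][k] * x[k] for k in range(i)) + c[j]) for j in range(H)]
        probs.append(sigmoid(sum(alpha[i][j] * h[j] for j in range(H)) + b[i]))
    return probs

def nade_log_prob(x):
    """log p(x) = sum_i log p(x_i | x_{<i})."""
    probs = nade_conditionals(x)
    return sum(math.log(p if xi == 1 else 1.0 - p) for xi, p in zip(x, probs))

# Because each conditional uses only earlier pixels, the result is a valid density
total = sum(
    math.exp(nade_log_prob(s)) for s in itertools.product([0, 1], repeat=D)
)
print(nade_log_prob([1, 0, 1, 1, 0, 0]), total)
```

The loop recomputes each hidden vector from scratch for clarity; sharing `W` across pixels is what keeps the parameter count linear in `D` rather than growing with a separate network per pixel.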

Explicit model: given an input, it can return the probability of that input

Implicit model: it can only generate samples

NADE is an explicit model that can compute the density of the given inputs
- density: probability density
how can we compute the density of the given image?
- suppose we have a binary image with 784 binary pixels, $\{x_1, x_2 ,...,x_{784}\}$ .
- Then, the joint probability is computed by
  - $p(x_1, ...,x_{784}) = p(x_1)p(x_2\vert x_1)...p(x_{784}\vert x_{1:783})$
    - where each conditional probability $p(x_i\vert x_{1:i-1})$ is computed independently
  - The joint distribution is split into conditional distributions by the chain rule.
    The model knows the distribution of the first pixel, the distribution of the second pixel given the first, and so on → each conditional is evaluated independently and plugged in
  - Multiplying them all gives a single, very small probability value
- In case of modeling continuous random variables, a mixture of Gaussians can be used
  - For binary pixels the output simply passes through a sigmoid, but for continuous outputs a Gaussian mixture is used in the last layer to produce a continuous distribution
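
Since the joint probability is a product of 784 numbers below 1, it is usually computed in the log domain. The sketch below uses made-up conditional probabilities of 0.1 each to show why: the direct product underflows in double precision, while the sum of logs stays well behaved.

```python
import math

# Hypothetical values of p(x_i | x_{1:i-1}) for one image, all set to 0.1
cond_probs = [0.1] * 784

# Direct product: 0.1^784 is far below the smallest positive double
product = 1.0
for p in cond_probs:
    product *= p

# Log domain: log p(x) = sum_i log p(x_i | x_{1:i-1})
log_prob = sum(math.log(p) for p in cond_probs)

print(product)   # underflows to 0.0
print(log_prob)  # finite negative number
```

Log-likelihoods are also what is typically reported and optimized when training such models.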

Pixel RNN

The goal is to generate the pixels of an image (generative model).
We can also use RNNs to define an auto-regressive model.
For example, for an $n \times n$ RGB image,
- $p(x) = \prod_{i=1}^{n^2}p(x_{i,R}\vert x_{<i})p(x_{i,G}\vert x_{<i},x_{i,R})p(x_{i,B}\vert x_{<i},x_{i,R},x_{i,G})$
  - $p(x_{i,R}\vert x_{<i})$ : Prob. i-th R
  - $p(x_{i,G}\vert x_{<i},x_{i,R})$ : Prob. i-th G
  - $p(x_{i,B}\vert x_{<i},x_{i,R},x_{i,G})$ : Prob. i-th B
- Generate R first, then G, then B

The earlier autoregressive model was built with fully connected layers, whereas Pixel RNN builds it with a recurrent neural network.

There are two model architectures in Pixel RNN based on the ordering of the chain
- Row LSTM: uses information from above
- Diagonal BiLSTM: uses all previous information
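
The generation order implied by the factorization (pixel by pixel, and within each pixel R, then G, then B) can be sketched as a sampling loop. The function `toy_conditional` below is a hypothetical stand-in for the learned conditional distribution; a real Pixel RNN would compute it with an LSTM, which is not implemented here.

```python
import random

def toy_conditional(context):
    """Hypothetical stand-in for the learned model: a distribution over 0..255
    given all previously generated values. It simply favors values close to
    the most recent one."""
    center = context[-1] if context else 128
    weights = [1.0 / (1 + abs(v - center)) for v in range(256)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_image(n, rng):
    """Sample an n x n RGB image one value at a time."""
    context = []  # every value generated so far, in generation order
    pixels = []
    for _ in range(n * n):          # pixels in a fixed order
        rgb = []
        for _channel in "RGB":      # R first, then G given R, then B given R and G
            probs = toy_conditional(context)
            value = rng.choices(range(256), weights=probs, k=1)[0]
            rgb.append(value)
            context.append(value)   # later values can condition on this one
        pixels.append(tuple(rgb))
    return pixels

image = sample_image(2, random.Random(0))
print(image)
```

Generation is inherently sequential: each value must be sampled before the next conditional can be evaluated.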
