Diffusion

ese2o · May 28, 2024

Diffusion Model

A diffusion model is a generative model that learns the backward process: at every timestep t the forward process adds Gaussian noise via $q\left(x_t \mid x_{t-1}\right)$ , and the model learns to undo that noise via $p_\theta\left(x_{t-1} \mid x_t\right)$ .

1) Score-Based generative model: learn to model distributions with increasing levels of Gaussian noise

2) use annealed Langevin dynamics: undo this noise

image gradually appears out of Gaussian noise

Diffusion models formalize these two steps as the forward process and the backward process, respectively.

1) gradually turning data to noise, and 2) learning the inverse of this process.

Both are applied iteratively.

Markov Chain

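: a discrete-time stochastic process with the Markov property

Markov property: "the probability of the state at t+1 depends only on the current state at t"
Discrete-time stochastic process: a random phenomenon that evolves over discrete time steps (0 s, 1 s, 2 s, ...)

P\left[s_{t+1} \mid s_t\right]=P\left[s_{t+1} \mid s_1, \ldots, s_t\right]

ex. "Tomorrow's weather can be predicted from today's weather alone." (Tomorrow's weather is a random process conditioned only on today's weather.)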
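The weather example can be sketched as a two-state Markov chain. This is only an illustration: the transition probabilities in `P` and the helper `simulate_chain` are made up here and are not part of the lecture.

```python
import numpy as np

# Hypothetical two-state weather chain: 0 = sunny, 1 = rainy.
# Row i holds P[s_{t+1} = j | s_t = i]; the numbers are invented for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate_chain(P, s0, n_steps, rng):
    """Sample a trajectory in which each state depends only on the previous one."""
    states = [s0]
    for _ in range(n_steps):
        # The next state is drawn using only the row of the current state.
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

rng = np.random.default_rng(0)
trajectory = simulate_chain(P, s0=0, n_steps=10, rng=rng)
```

Note that the sampler never looks further back than `states[-1]`, which is exactly the Markov property.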

Forward process $q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)$

defined by user
similar to the annealing used in score-based models

Steps

data sample $x_0 \sim q\left(x_0\right)$
Markov chain : gradually adds noise to the data, producing sequence of increasingly noisy samples $\mathbf{x}_1, \ldots \mathbf{x}_T$
at each step t, we sample $x_t$ from the following Markov operator

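q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\alpha_t} \mathbf{x}_{t-1},\left(1-\alpha_t\right) \mathbf{I}\right)

where $\alpha_t \in(0,1)$ and the cumulative product $\bar{\alpha}_t=\prod_{s=1}^t \alpha_s \rightarrow 0$ as t grows.
$\alpha_t$ controls how much noise is injected: the smaller $\alpha_t$ is, the more noise step t adds.

The final $\mathbf{x}_T$ then approximately follows the standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ .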
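A minimal sketch of this Markov operator, assuming scalar toy data and an illustrative schedule for $\alpha_t$ (the schedule values and the helper name `forward_step` are assumptions, not from the lecture):

```python
import numpy as np

def forward_step(x_prev, alpha_t, rng):
    """Draw x_t ~ N(sqrt(alpha_t) * x_prev, (1 - alpha_t) * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * noise

rng = np.random.default_rng(0)
T = 200
alphas = np.linspace(0.999, 0.9, T)   # assumed schedule, for illustration only

x = np.full(10_000, 5.0)              # toy "data": 10,000 scalar samples, all equal to 5
for alpha_t in alphas:
    x = forward_step(x, alpha_t, rng)
# After T steps the samples should look roughly like draws from N(0, 1).
```

Even though every sample starts at 5, the repeated shrink-and-add-noise step washes out the starting point.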

\begin{gathered} q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right):=\prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) \\ q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right):=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \end{gathered}

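Noise drawn from a Gaussian distribution ( $\mathcal{N}$ ) is added to the data.
The expression $q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)$ has the form of a likelihood: a product of per-step conditional densities.
$\beta_t$ is the diffusion rate (variance schedule). Scaling the mean by $\sqrt{1-\beta_t}$ keeps the variance from diverging as noise accumulates.
However, expanding the chain in terms of $\beta$ requires stepping through every timestep from 0 to T one by one, which costs a lot of memory and time.
- To avoid this, we use $\alpha$ .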

\begin{gathered} \alpha_t:=1-\beta_t \text { and } \bar{\alpha}_t:=\prod_{s=1}^t \alpha_s \\ q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right) \end{gathered}

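Because $\mathbf{x}_t$ is drawn from a Gaussian distribution, its value is scattered around the mean $\sqrt{\bar{\alpha}_t} \mathbf{x}_0$ with a spread set by the variance $1-\bar{\alpha}_t$ .
With $\bar{\alpha}_t$ , $\mathbf{x}_t$ can be computed from $\mathbf{x}_0$ in a single step.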
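A short sketch of this one-shot sampling, assuming a linear $\beta$ schedule (the schedule values and the helper name `q_sample` are illustrative assumptions):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear variance schedule
alphas = 1.0 - betas                  # alpha_t := 1 - beta_t
alpha_bar = np.cumprod(alphas)        # alpha_bar[t-1] = prod_{s=1}^{t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bar[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
x0 = np.ones(4)
x_t, eps = q_sample(x0, t=500, rng=rng)
```

No loop over the intermediate steps 1..t-1 is needed, which is what makes training on a randomly chosen timestep cheap.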

T steps of diffusion

q\left(\mathbf{x}^{(0 \cdots T)}\right)=q\left(\mathbf{x}^{(0)}\right) \prod_{t=1}^T q\left(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}\right)

\begin{array}{rlrl} \mathbf{x}_t & =\sqrt{\alpha_t} \mathbf{x}_{t-1}+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_t & \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ & =\sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2}+\sqrt{1-\alpha_t \alpha_{t-1}} \overline{\boldsymbol{\epsilon}}_t & & \\ & =\ldots & & \\ & =\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon} & \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \end{array}

First line: definition of the Markov operator (next $x_t$ = rescaled previous $x_{t-1}$ + Gaussian noise, with multipliers defined by the schedule)

Substitute recursively until $x_0$ is reached.

Final form: rescaled $x_0$ + Gaussian noise (scaled by the schedule variables)

q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)
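The merge of two noise terms into a single one in the expansion above relies on the fact that a sum of independent zero-mean Gaussians is again Gaussian, with the variances added. A short worked step, where $\boldsymbol{\epsilon}_{t-1}$ (the noise drawn at step t-1) is notation introduced here for illustration:

```latex
\begin{aligned}
\mathbf{x}_t & =\sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}} \mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}} \boldsymbol{\epsilon}_{t-1}\right)+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_t \\
& =\sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2}+\underbrace{\sqrt{\alpha_t\left(1-\alpha_{t-1}\right)} \boldsymbol{\epsilon}_{t-1}+\sqrt{1-\alpha_t} \boldsymbol{\epsilon}_t}_{\sim \mathcal{N}\left(\mathbf{0},\left(\alpha_t-\alpha_t \alpha_{t-1}+1-\alpha_t\right) \mathbf{I}\right)} \\
& =\sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2}+\sqrt{1-\alpha_t \alpha_{t-1}} \overline{\boldsymbol{\epsilon}}_t
\end{aligned}
```

Repeating the same argument down to $\mathbf{x}_0$ replaces $\alpha_t \alpha_{t-1}$ with the full product $\bar{\alpha}_t$ .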

annealed langevin dynamics

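This is the process of gradually adding noise to the original data space. #Forward Process

Large noise is added at first and the amount of noise is reduced in later stages, which guides the model toward the regions where the data is concentrated.

Removing this noise lets us extract the desired data. #Backward Process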

Backward process $p\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ : Diffusion Process

get the inverse of the above Markov chain
This is the process of removing Gaussian noise step by step so that a specific pattern emerges.
We do not know this process in closed form, but we can learn it from data.

Model:

The probability distribution of the diffusion model that learns the backward process

p_\theta\left(\mathbf{x}_{0: T}\right)=p\left(\mathbf{x}_T\right) \prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)

p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right)

Starting from noise $\mathbf{x}_T$ , we repeatedly sample from $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ until we obtain the image $\mathbf{x}_0$ .

What is learned here is $\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)$ .
The model is trained so that this mean and variance approximate the true ones well.

Our goal is to learn $p\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ from the forward process $q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)$ .

Therefore $p_\theta\left(x_{t-1} \mid x_t\right)$ must approximate the reverse of the forward step, $q\left(x_{t-1} \mid x_t\right)$ , and the training objective can be written as a KL divergence that measures the gap between the two distributions.

D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)\right)

log likelihood

likelihood at the data point $\mathbf{x}_0$
lower bound on the evidence $p\left(\mathbf{x}_0\right)$

\begin{aligned} & \log p_\theta\left(\mathbf{x}_0\right) \geq \log p_\theta\left(\mathbf{x}_0\right)-D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)\right) \\ & =\log p_\theta\left(\mathbf{x}_0\right)-\mathbb{E}_{\mathbf{x}_{1: T} \sim q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\left[\log \frac{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0: T}\right) / p_\theta\left(\mathbf{x}_0\right)}\right] \\ & =\log p_\theta\left(\mathbf{x}_0\right)-\mathbb{E}_q\left[\log \frac{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0: T}\right)}+\log p_\theta\left(\mathbf{x}_0\right)\right] \\ & =-\mathbb{E}_q\left[\log \frac{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0: T}\right)}\right] \\ & \end{aligned}

ELBO of the diffusion model

-\mathbb{E}_q\left[\log \frac{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}{p_\theta\left(\mathbf{x}_{0: T}\right)}\right]

Maximizing this

maximizes $p_\theta\left(\mathbf{x}_0\right)$ and minimizes $D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)\right)$ .

The learning objective is similar to that of a VAE, but differs in that only $p_\theta$ is optimized, since the forward process $q$ is fixed.

Loss Function : diffusion loss

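The goal is to maximize the log likelihood above. In practice we minimize the negative ELBO, which decomposes into the following terms: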

\mathbb{E}_q[\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t=2}^T \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)}_{L_{t-1}} \underbrace{-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}_{L_0}]

The prior loss (Regularization)

L_T=D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_T\right)\right)

compares final $\mathbf{x}_T$ and is often zero by construction

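It measures how well the model's input distribution $p$ matches $q\left(\mathbf{x}_T \mid \mathbf{x}_0\right)$ , the distribution of fully noised samples. With a suitable noise schedule it converges to 0.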
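As a rough numerical check of "often zero by construction", the KL divergence between $q\left(\mathbf{x}_T \mid \mathbf{x}_0\right)$ and a standard Gaussian prior has a closed form. The sketch below assumes a linear $\beta$ schedule and a made-up data point; the helper name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mean, var):
    """KL( N(mean, var * I) || N(0, I) ), summed over dimensions."""
    mean = np.asarray(mean, dtype=float)
    return 0.5 * np.sum(var + mean**2 - 1.0 - np.log(var))

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bar_T = np.prod(1.0 - betas)      # cumulative product at t = T

x0 = np.array([1.0, -0.5, 0.3])         # toy data point
# q(x_T | x_0) = N(sqrt(alpha_bar_T) * x0, (1 - alpha_bar_T) * I)
L_T = kl_to_standard_normal(np.sqrt(alpha_bar_T) * x0, 1.0 - alpha_bar_T)
```

Because $\bar{\alpha}_T$ is tiny under this schedule, `L_T` is close to zero and contributes no gradient, since it contains no learnable parameters.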

The reconstruction term

L_0=-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)

is the probability of the true $\mathbf{x}_0$ given the model’s “best guess” $\mathbf{x}_1$ .

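It compares the model's final latent $\mathbf{x}_1$ with the original data $\mathbf{x}_0$ and measures how well the model reconstructs the original data. The better the model is trained, the smaller this term becomes.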

diffusion loss

L_t=D_{\mathrm{KL}}\left(q\left(\mathbf{x}_t \mid \mathbf{x}_{t+1}, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_t \mid \mathbf{x}_{t+1}\right)\right) \text { for } 1 \leq t \leq T-1

measure whether the learned backward process $p_\theta\left(\mathbf{x}_t \mid \mathbf{x}_{t+1}\right)$ looks like the real backward process $q\left(\mathbf{x}_t \mid \mathbf{x}_{t+1}, \mathbf{x}_0\right)$ .

In other words, the smaller the diffusion loss, the better the learned backward process removes the noise.

Noise Parameterization : denoise from the noisy data

The diffusion loss $L_{t}$ can be simplified through noise parameterization. In the end, training reduces to denoising noisy data.

Earlier, in the forward process, we showed that $q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)$ .

From the definitions and Bayes' theorem, both $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$ are Gaussian. The objective can therefore be rewritten as a KL divergence between Gaussian distributions.

\tilde{\boldsymbol{\mu}}_t=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_t\right)

\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)

\begin{aligned} L_t & =\mathbb{E}_{\mathbf{x}_0, \epsilon}\left[C_1 \cdot\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\mu_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right] \\ & =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[C_2 \cdot\left\|\boldsymbol{\epsilon}_t-\boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right] \\ & =\mathbb{E}_{\mathbf{x}_0, \epsilon}\left[C_2 \cdot\left\|\boldsymbol{\epsilon}_t-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \epsilon_t, t\right)\right\|^2\right] \end{aligned}

( $C_1$ and $C_2$ are positive constants that do not depend on $\theta$ )
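The move from the mean-matching form to the noise-matching form can be checked numerically: since $\tilde{\boldsymbol{\mu}}_t$ and $\boldsymbol{\mu}_\theta$ share the same formula and differ only in the noise term, their squared distance is a fixed multiple of the squared noise error. The values of $\alpha_t$ and $\bar{\alpha}_t$ below are arbitrary, and `eps_pred` is a random stand-in for a network output.

```python
import numpy as np

def mean_from_eps(x_t, eps, alpha_t, alpha_bar_t):
    """1/sqrt(alpha_t) * (x_t - (1 - alpha_t)/sqrt(1 - alpha_bar_t) * eps)."""
    return (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
alpha_t, alpha_bar_t = 0.98, 0.5      # arbitrary illustrative values
x_t = rng.standard_normal(8)
eps_true = rng.standard_normal(8)     # plays the role of epsilon_t
eps_pred = rng.standard_normal(8)     # stand-in for epsilon_theta(x_t, t)

mu_tilde = mean_from_eps(x_t, eps_true, alpha_t, alpha_bar_t)
mu_theta = mean_from_eps(x_t, eps_pred, alpha_t, alpha_bar_t)

mean_err = np.sum((mu_tilde - mu_theta) ** 2)
eps_err = np.sum((eps_true - eps_pred) ** 2)
# Factor that converts the noise error into the mean error.
ratio = (1.0 - alpha_t) ** 2 / (alpha_t * (1.0 - alpha_bar_t))
```

Here `mean_err` equals `ratio * eps_err`, so the two losses differ only by a positive factor that is absorbed into $C_2$ .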

Diffusion training process

Sample a datapoint $\mathbf{x}_0$
Sample a time step t uniformly from 1,2,..., T.
Sample noise $\epsilon_t \sim \mathcal{N}(0, I)$
Generate noisy $\mathbf{x}_t=\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \epsilon_t$
Take a gradient step on $\left\|\epsilon_t-\epsilon_\theta\left(\mathbf{x}_t, t\right)\right\|^2$
repeat until convergence
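The steps above can be sketched with a deliberately tiny setup. Everything specific here is an assumption for illustration: the data is one-dimensional with $x_0 \sim \mathcal{N}(0,1)$, the schedule is short, and the "network" is a hypothetical linear model $\epsilon_\theta(x_t, t)=w_t x_t$ trained with plain SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
betas = np.linspace(0.05, 0.5, T)     # assumed schedule, chosen for a short chain
alpha_bar = np.cumprod(1.0 - betas)

# Hypothetical toy model: eps_theta(x_t, t) = w[t] * x_t, one scalar weight per step.
w = np.zeros(T)
lr, batch = 0.05, 256

for _ in range(3000):
    x0 = rng.standard_normal(batch)           # 1. sample data (toy: x0 ~ N(0, 1))
    t = rng.integers(0, T, size=batch)        # 2. sample a timestep (0-indexed here)
    eps = rng.standard_normal(batch)          # 3. sample noise
    # 4. build the noisy sample x_t in closed form
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    residual = eps - w[t] * x_t               # eps - eps_theta(x_t, t)
    grad = np.zeros(T)
    # Gradient of the batch-mean squared error with respect to each w[t].
    np.add.at(grad, t, -2.0 * residual * x_t / batch)
    w -= lr * grad                            # 5. gradient step
```

For this toy data the best linear denoiser is $w_t=\sqrt{1-\bar{\alpha}_t}$ , so the learned weights can be compared against `np.sqrt(1 - alpha_bar)`. A real model would replace the linear map with a neural network and the manual gradient with automatic differentiation.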

Ancestral Sampling : generating new data

Diffusion process:

p_\theta\left(\mathbf{x}_{0: T}\right)=p\left(\mathbf{x}_T\right) \prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)

With this model, we can generate $x_{t-1}$ from $x_t$ , one step at a time.

\mathbf{x}_{t-1} \sim p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) \text { for } t=T, T-1, \ldots, 1

sampling process for a diffusion model:

\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \cdot \mathbf{z}

for $t \in\{T, T-1, \ldots, 1\}$
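A minimal sketch of this sampling loop. Two choices are assumptions not fixed by the text: $\sigma_t^2=\beta_t$ , and no noise added on the last step (t = 1). The function `toy_eps_model` is a stand-in for a trained network; it is the exact denoiser only for the toy case $x_0 \sim \mathcal{N}(0,1)$ .

```python
import numpy as np

def ancestral_sample(eps_model, betas, shape, rng):
    """Run x_T -> x_0 with the update above (sigma_t^2 = beta_t is an assumed choice)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                # t = T, T-1, ..., 1
        a_t, ab_t = alphas[t - 1], alpha_bar[t - 1]
        eps_hat = eps_model(x, t)
        mean = (x - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
        # Assumption: the final step (t = 1) is deterministic.
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[t - 1]) * z
    return x

betas = np.linspace(1e-4, 0.02, 1000)                 # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def toy_eps_model(x_t, t):
    """Stand-in for a trained network: E[eps | x_t] when x_0 ~ N(0, 1)."""
    return np.sqrt(1.0 - alpha_bar[t - 1]) * x_t

rng = np.random.default_rng(0)
samples = ancestral_sample(toy_eps_model, betas, shape=(5000,), rng=rng)
```

With this stand-in denoiser the generated samples should again look like draws from $\mathcal{N}(0,1)$ , the toy data distribution.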

Ancestral Sampling vs. Langevin Dynamics

recall that in Langevin Dynamics we repeatedly perform the update:

\mathbf{x}_t=\mathbf{x}_{t-1}+\frac{\alpha_t}{2} s_\theta\left(\mathbf{x}, \sigma_{\ell}\right)+\sqrt{\alpha_t} \epsilon_t

recall that score model $s_\theta(\tilde{\mathbf{x}})$ and the denoiser $\epsilon(\tilde{\mathbf{x}})$ are related:

s_\theta(\tilde{\mathbf{x}}) \approx \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})=-\frac{\epsilon}{\sigma} \approx-\frac{\epsilon(\tilde{\mathbf{x}})}{\sqrt{1-\bar{\alpha}_t}}

\mathbf{x}_{\text {new }}=\mathbf{x}+\frac{\alpha_t}{2} \nabla_x \log p_\theta(\mathbf{x})+\sqrt{\alpha_t} \epsilon_t

recall sampling process for a diffusion model:

\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \cdot \mathbf{z}

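In the end, this has the same form as annealed Langevin dynamics: each step moves $\mathbf{x}$ along the estimated score and then adds Gaussian noise.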
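For comparison, the Langevin update recalled above can be sketched on a toy target whose score is known exactly. This is plain Langevin dynamics with a fixed step size, not the annealed variant, and the target $\mathcal{N}(3,1)$ is an assumption chosen for illustration.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size, n_steps, rng):
    """Repeat x <- x + (step/2) * score(x) + sqrt(step) * eps."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * eps
    return x

def score(x):
    """Score of the toy target N(3, 1): grad_x log p(x) = -(x - 3)."""
    return -(x - 3.0)

rng = np.random.default_rng(0)
x = langevin_sample(score, x_init=np.zeros(5000), step_size=0.1, n_steps=500, rng=rng)
```

The structure matches the ancestral sampling step: a deterministic move driven by the model output, followed by fresh Gaussian noise.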

Conditional diffusion model

Given y (e.g. a label or a noised image), the model is used to generate an image x.

\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})=\nabla_{\mathbf{x}} \log p(\mathbf{x})+\nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x})

Both terms on the right-hand side are already known -> the left-hand side can be sampled with Langevin dynamics.
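A toy sketch of this decomposition, assuming a Gaussian prior $p(x)=\mathcal{N}(0,1)$ and a Gaussian observation model $y \mid x \sim \mathcal{N}(x, 1)$ so that both scores are available in closed form. These distributions are made up for illustration; in practice the first term comes from the score model and the second from a classifier or measurement model.

```python
import numpy as np

y, noise_var = 2.0, 1.0     # assumed toy observation model: y | x ~ N(x, noise_var)

def prior_score(x):
    """grad_x log p(x) for the toy prior p(x) = N(0, 1)."""
    return -x

def likelihood_score(x):
    """grad_x log p(y | x) for y | x ~ N(x, noise_var)."""
    return (y - x) / noise_var

def posterior_score(x):
    """grad_x log p(x | y) = grad_x log p(x) + grad_x log p(y | x)."""
    return prior_score(x) + likelihood_score(x)

rng = np.random.default_rng(0)
x = np.zeros(5000)
step = 0.05
for _ in range(500):
    # Langevin update driven by the summed score.
    x = x + 0.5 * step * posterior_score(x) + np.sqrt(step) * rng.standard_normal(x.shape)
# The exact posterior here is N(y / (1 + noise_var), noise_var / (1 + noise_var)) = N(1, 0.5).
```

The samples concentrate around the exact posterior, up to a small bias from the finite step size.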

This can be extended to Variational Diffusion Models and Latent Diffusion Models.

Reference: Cornell Tech CS 6785 Lecture 13
