[Paper Review] Denoising Diffusion Probabilistic Models (DDPM, 2020)

김민서 · July 7, 2024

0. Abstract

1. Introduction

  • Existing generative models
    - GANs, autoregressive, flow-based, VAE

  • DPM

    • = a parameterized Markov chain

    • So far, it had been difficult to produce high-quality samples with it

  • Contributions

    • Shows that diffusion models can also generate high-quality samples
    • Shows that, under a certain parameterization, the objective is equivalent to denoising score matching

2. Background

  • Same content as DPM, but with slightly different notation
    • latent variable models of the form $p_{\theta}(\mathrm{x}_{0}) := \int p_{\theta}(\mathrm{x}_{0:T})\ d\mathrm{x}_{1:T}$
    • where $\mathrm{x}_{1},...,\mathrm{x}_{T}$ are latents of the same dimensionality as the data $\mathrm{x}_{0} \sim q(\mathrm{x}_{0})$
  • reverse process
    • joint distribution $p_{\theta}(\mathrm{x}_{0:T})$ := reverse process
    • defined as a Markov chain with learned Gaussian transitions, starting from $p(\mathrm{x}_{T})=\mathcal{N}(\mathrm{x}_{T};0,I)$

$$\begin{aligned} &p_{\theta}(\mathrm{x}_{0:T}):=p(\mathrm{x}_{T})\ \prod_{t=1}^{T}\ p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t}),\\ &\text{where}\quad p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t}):= \mathcal{N}(\mathrm{x}_{t-1};\mu_{\theta}(\mathrm{x}_{t},t), \Sigma_{\theta}(\mathrm{x}_{t}, t)) \end{aligned}$$
  • forward process
    • Unlike other latent variable models, the approximate posterior $q(\mathrm{x}_{1:T}|\mathrm{x}_{0})$ is fixed
    • $q(\mathrm{x}_{1:T}|\mathrm{x}_{0})$ := forward process (or diffusion process)
    • Markov chain that gradually adds Gaussian noise to the data according to a fixed variance schedule $\beta_{1},...,\beta_{T}$

$$\begin{aligned} &q(\mathrm{x}_{1:T}|\mathrm{x}_{0}):=\prod_{t=1}^{T}q(\mathrm{x}_{t}|\mathrm{x}_{t-1}),\\ &\text{where}\quad q(\mathrm{x}_{t}|\mathrm{x}_{t-1}):=\mathcal{N}(\mathrm{x}_{t};\sqrt{1-\beta_{t}}\,\mathrm{x}_{t-1},\beta_{t}I) \end{aligned}$$
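As a sanity check, the forward process above is easy to simulate directly. A minimal NumPy sketch (the linear schedule values are the ones the paper uses; the shapes and function names are illustrative):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One forward diffusion step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))   # stand-in for a data sample x_0
betas = np.linspace(1e-4, 0.02, 1000)  # the paper's linear schedule, T = 1000
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
# After T steps, x is approximately N(0, I) noise.
```

Running the chain to completion and checking the empirical mean and variance of `x` confirms that the data has been destroyed into an isotropic Gaussian.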
  • Loss

    • Train by minimizing the negative log likelihood of the model $p_{\theta}$
    • using the variational bound:
      $$\mathbb{E}[-\log p_{\theta}(\mathrm{x}_{0})] \leq \mathbb{E}_{q}\left[-\log\frac{p_{\theta}(\mathrm{x}_{0:T})}{q(\mathrm{x}_{1:T}|\mathrm{x}_{0})}\right] = \mathbb{E}_{q}\left[-\log p(\mathrm{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t})}{q(\mathrm{x}_{t}|\mathrm{x}_{t-1})}\right] =: L$$
    • Comparison with DPM notations
      • negative log likelihood
        • DDPM : $\mathbb{E}[-\log p_{\theta}(\mathrm{x}_{0})]$
        • DPM : $-L=-\int d\mathrm{x}^{(0)}\, q(\mathrm{x}^{(0)})\ \log p(\mathrm{x}^{(0)})$
      • variational bound
        • DDPM : $\mathbb{E}_{q}\left[-\log p(\mathrm{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t})}{q(\mathrm{x}_{t}|\mathrm{x}_{t-1})}\right]$
        • DPM : $\int d\mathrm{x}^{(0...T)}\, q(\mathrm{x}^{(0...T)})\ \log\left[p(\mathrm{x}^{(T)})\prod_{t=1}^{T}\frac{p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})}{q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})}\right]$
  • one-step forward diffusion

    • sampling $\mathrm{x}_{t}$ at an arbitrary timestep $t$ can be done in closed form
    • Using the notation $\alpha_{t}:=1-\beta_{t}$ and $\bar{\alpha}_{t}:=\prod_{s=1}^{t}\alpha_{s}$, then
      $$q(\mathrm{x}_{t}|\mathrm{x}_{0})=\mathcal{N}(\mathrm{x}_{t};\sqrt{\bar{\alpha}_{t}}\,\mathrm{x}_{0},\ (1-\bar{\alpha}_{t})I)$$
    • Proof
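The closed form means training never has to iterate the chain: $\mathrm{x}_{t}$ at any $t$ comes from a single Gaussian draw. A small NumPy sketch (schedule values from the paper; array names are my own):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # fixed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is a 0-based index here)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(10_000)                 # toy "dataset" of constant pixels
xt = q_sample(x0, t=500, rng=rng)
# Empirically, xt has mean ≈ sqrt(alpha_bar[500]) and std ≈ sqrt(1 - alpha_bar[500]).
```

Comparing the empirical moments of `xt` against the closed-form coefficients is a quick way to check the schedule bookkeeping.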
  • Rewrite Loss using KLD

    • Same content as Appendix B of DPM
    $$=\mathbb{E}_{q}\left[D_{KL}(q(\mathrm{x}_{T}|\mathrm{x}_{0})\ \|\ p(\mathrm{x}_{T}))\ +\ \sum_{t>1}D_{KL}(q(\mathrm{x}_{t-1}|\mathrm{x}_{t},\mathrm{x}_{0})\ \|\ p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t}))\ -\ \log p_{\theta}(\mathrm{x}_{0}|\mathrm{x}_{1})\right]$$
    • divided into three parts
      • $L_{T}=D_{KL}(q(\mathrm{x}_{T}|\mathrm{x}_{0})\ \|\ p(\mathrm{x}_{T}))$
      • $L_{t-1}=D_{KL}(q(\mathrm{x}_{t-1}|\mathrm{x}_{t},\mathrm{x}_{0})\ \|\ p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t}))$ for each $t>1$
      • $L_{0}=-\log p_{\theta}(\mathrm{x}_{0}|\mathrm{x}_{1})$
  • directly compare forward & backward process

    • Compare the backward process $p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t})$ with the forward process posterior (ground truth) $q(\mathrm{x}_{t-1}|\mathrm{x}_{t})$
    • $q(\mathrm{x}_{t-1}|\mathrm{x}_{t})$ is hard to compute on its own, but becomes tractable once additionally conditioned on $\mathrm{x}_{0}$
    $$\begin{aligned} &q(\mathrm{x}_{t-1}|\mathrm{x}_{t},\mathrm{x}_{0})=\mathcal{N}(\mathrm{x}_{t-1};\tilde{\mu}_{t}(\mathrm{x}_{t},\mathrm{x}_{0}),\tilde{\beta}_{t}I),\\ &\text{where}\quad \tilde{\mu}_{t}(\mathrm{x}_{t},\mathrm{x}_{0}):=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}\mathrm{x}_{0}+\frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}\mathrm{x}_{t}\quad\text{and}\quad\tilde{\beta}_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t} \end{aligned}$$
    • Proof

    • Now every KL comparison inside the loss is between Gaussians only

    • can be calculated in a Rao-Blackwellized fashion with closed form expressions, instead of high variance Monte Carlo estimates
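Concretely, the KL divergence between two (diagonal) Gaussians has a closed form, which is why no Monte Carlo estimate is needed. A generic helper, not from the paper:

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), summed over dimensions."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

d = 4
zero, one = np.zeros(d), np.ones(d)
print(kl_diag_gaussians(zero, one, zero, one))  # identical Gaussians -> 0.0
print(kl_diag_gaussians(zero, one, one, one))   # 0.5 per dim from the mean shift -> 2.0
```

Every KL term in the rewritten loss is an instance of this formula with the means and variances given by the forward posterior and the model.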

3. Diffusion models and denoising autoencoders

3.1. Forward process and $L_{T}$

  • $L_{T}=D_{KL}(q(\mathrm{x}_{T}|\mathrm{x}_{0})\ \|\ p(\mathrm{x}_{T}))$
  • $\beta_{t}$ could be made learnable, but DDPM fixes it to a predetermined schedule
  • Then the posterior $q(\mathrm{x}_{T}|\mathrm{x}_{0})$ has no learnable parameters, so $L_{T}$ is a constant
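A quick numerical check that $L_{T}$ really is negligible under the fixed schedule: $\bar{\alpha}_{T}$ is tiny, so $q(\mathrm{x}_{T}|\mathrm{x}_{0})$ is essentially $\mathcal{N}(0,I)=p(\mathrm{x}_{T})$ (schedule values are from the paper; the rest is a sketch):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)       # the paper's linear schedule
alpha_bar_T = float(np.prod(1.0 - betas))   # \bar{alpha}_T

# q(x_T | x_0) = N(sqrt(alpha_bar_T) x_0, (1 - alpha_bar_T) I):
# the mean coefficient is ~0.006 and the variance is ~1, i.e. almost exactly p(x_T).
print(alpha_bar_T)                          # on the order of 4e-5
```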

3.2. Reverse process and $L_{1:T-1}$

  • $L_{t-1}=D_{KL}(q(\mathrm{x}_{t-1}|\mathrm{x}_{t},\mathrm{x}_{0})\ \|\ p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t}))$ for each $t>1$

  • Given $p_{\theta}(\mathrm{x}_{t-1}|\mathrm{x}_{t})=\mathcal{N}(\mathrm{x}_{t-1};\mu_{\theta}(\mathrm{x}_{t},t),\Sigma_{\theta}(\mathrm{x}_{t},t))$ for $1<t\leq T$, how should we design $\mu_{\theta}$ and $\Sigma_{\theta}$?

  • $\Sigma_{\theta}(\mathrm{x}_{t},t)$

    • Set $\Sigma_{\theta}(\mathrm{x}_{t},t)=\sigma_{t}^{2}I$, where $\sigma_{t}$ is a fixed timestep-dependent constant (not trained)
    • $\sigma_{t}^{2}$ could be set to $\tilde{\beta}_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$ (the variance of $q(\mathrm{x}_{t-1}|\mathrm{x}_{t},\mathrm{x}_{0})$), but experimentally this made little difference compared to simply using $\beta_{t}$
    • So set $\Sigma_{\theta}(\mathrm{x}_{t},t)=\beta_{t}I$
  • $\mu_{\theta}(\mathrm{x}_{t},t)$

    • Rewriting $L_{t-1}$ in terms of $\mu$:
      $$L_{t-1}=\mathbb{E}_{q}\left[\frac{1}{2\sigma_{t}^{2}}\|\tilde{\mu}_{t}(\mathrm{x}_{t},\mathrm{x}_{0})-\mu_{\theta}(\mathrm{x}_{t},t)\|^{2}\right]+C$$
    • Here $\tilde{\mu}_{t}$ is the forward process posterior mean, so we can simply train the model $\mu_{\theta}$ to predict it
  • introducing $\epsilon$

    • $\mathrm{x}_{t}(\mathrm{x}_{0},\epsilon)=\sqrt{\bar{\alpha}_{t}}\mathrm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon\quad\text{for}\quad\epsilon\sim\mathcal{N}(0,I)$
      • using one-step diffusion
      • reparameterizing the $L_{t-1}$ expression with this simplifies it further
      • proved by substituting it into the derivation of $\tilde{\mu}_{t}$

    $$\begin{aligned}L_{t-1}&=\mathbb{E}_{\mathrm{x}_{0},\epsilon}\left[\frac{1}{2\sigma_{t}^{2}}\left\|\tilde{\mu}_{t}\left(\mathrm{x}_{t}(\mathrm{x}_{0},\epsilon),\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathrm{x}_{t}(\mathrm{x}_{0},\epsilon)-\sqrt{1-\bar{\alpha}_{t}}\epsilon\right)\right)-\mu_{\theta}(\mathrm{x}_{t}(\mathrm{x}_{0},\epsilon),t)\right\|^{2}\right]\\ &=\mathbb{E}_{\mathrm{x}_{0},\epsilon}\left[\frac{1}{2\sigma_{t}^{2}}\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(\mathrm{x}_{t}(\mathrm{x}_{0},\epsilon)-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon\right)-\mu_{\theta}(\mathrm{x}_{t}(\mathrm{x}_{0},\epsilon),t)\right\|^{2}\right]\end{aligned}$$

    • Now $\mu_{\theta}$ must predict $\frac{1}{\sqrt{\alpha_{t}}}\left(\mathrm{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon\right)$ given $t$, which suggests the parameterization
      $$\mu_{\theta}(\mathrm{x}_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathrm{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(\mathrm{x}_{t},t)\right)$$
      • where $\epsilon_{\theta}$ is a model that predicts the noise in $\mathrm{x}_{t}$
    • Replacing $\mu$ with $\epsilon$ simplifies $L_{t-1}$ once more:
      $$L_{t-1}=\mathbb{E}_{\mathrm{x}_{0},\epsilon}\left[\frac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t}(1-\bar{\alpha}_{t})}\left\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}\mathrm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,t)\right\|^{2}\right]$$
  • Now $L_{t-1}$ takes the same form as denoising score matching

    • that is, $L_{t-1}$ becomes equal to a variational bound for a Langevin-like reverse process
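Putting this section together, sampling (Algorithm 2 in the paper) repeatedly applies $\mathrm{x}_{t-1}=\mu_{\theta}(\mathrm{x}_{t},t)+\sigma_{t}z$ with $\mu_{\theta}$ computed from $\epsilon_{\theta}$ as above. A sketch with a dummy, untrained noise predictor standing in for the network:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_step(x_t, t, eps_model, rng):
    """One step of p_theta(x_{t-1} | x_t) with sigma_t^2 = beta_t (t is a 0-based index)."""
    eps = eps_model(x_t, t)
    mu = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mu                       # no noise is added at the final step
    z = rng.standard_normal(x_t.shape)
    return mu + np.sqrt(betas[t]) * z

# dummy noise predictor standing in for a trained network
eps_model = lambda x, t: np.zeros_like(x)
rng = np.random.default_rng(0)
x = rng.standard_normal((8,))           # start from x_T ~ N(0, I)
for t in reversed(range(T)):
    x = reverse_step(x, t, eps_model, rng)
```

With a trained $\epsilon_{\theta}$ in place of the dummy, this loop is exactly ancestral sampling from the learned reverse chain.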

3.3. Data scaling, reverse process decoder, and $L_{0}$

  • Images are scaled from $\{0,1,...,255\}$ to $[-1,1]$ before entering the network
  • The output of the final reverse step must be converted back into a $\{0,1,...,255\}$ image, so the last $p_{\theta}$ is exceptionally defined as a discrete decoder
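Concretely, the paper defines this decoder as a discretized Gaussian: the last Gaussian $\mathcal{N}(\mathrm{x}_{0};\mu_{\theta}(\mathrm{x}_{1},1),\sigma_{1}^{2}I)$ is integrated over each pixel's quantization bin (with data dimension $D$ and pixel index $i$):

$$p_{\theta}(\mathrm{x}_{0}|\mathrm{x}_{1})=\prod_{i=1}^{D}\int_{\delta_{-}(x_{0}^{i})}^{\delta_{+}(x_{0}^{i})}\mathcal{N}(x;\mu_{\theta}^{i}(\mathrm{x}_{1},1),\sigma_{1}^{2})\,dx$$

$$\delta_{+}(x)=\begin{cases}\infty & \text{if } x=1\\ x+\frac{1}{255} & \text{if } x<1\end{cases}\qquad\delta_{-}(x)=\begin{cases}-\infty & \text{if } x=-1\\ x-\frac{1}{255} & \text{if } x>-1\end{cases}$$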

3.4. Simplified training objective

$$L_{\mathrm{simple}}(\theta):=\mathbb{E}_{t,\mathrm{x}_{0},\epsilon}\left[\left\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}\mathrm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,t)\right\|^{2}\right]$$
  • DDPM's final loss term
  • $t$ is uniform between 1 and $T$
    • for $t=1$, this corresponds to the $L_{0}$ term of Section 3.3 (ignoring $\sigma_{1}^{2}$ and edge effects)
    • for $t>1$, it is the $L_{t-1}$ term of Section 3.2 with the weight dropped
  • Why drop the weight?
    • the weight $\frac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t}(1-\bar{\alpha}_{t})}$ grows as $t$ decreases
    • so training would concentrate on the steps that remove the very small amounts of noise near $\mathrm{x}_{0}$
    • dropping the weight is better because it lets every timestep carry a comparable share of the loss
    • NCSN likewise manipulated weights to balance the loss across timesteps
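Algorithm 1 of the paper then reduces to: sample $t$, sample $\epsilon$, form $\mathrm{x}_{t}$ with the one-step formula, and regress $\epsilon_{\theta}$ onto $\epsilon$. A minimal sketch of one $L_{\mathrm{simple}}$ evaluation (the lambda `eps_theta` is a stand-in for the trained U-Net, and a mean over dimensions replaces the squared norm):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def l_simple(eps_theta, x0, rng):
    """One Monte Carlo evaluation of L_simple for a batch x0."""
    t = rng.integers(0, T)                      # t ~ Uniform over timesteps (0-based index)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(xt, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 64))              # toy batch
loss = l_simple(lambda x, t: np.zeros_like(x), x0, rng)  # dummy predictor
# A predictor that always outputs zero noise scores near E[eps^2] = 1.
```

Note that the weight from Section 3.2 is simply absent: every sampled $t$ contributes with the same scale.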

4. Experiments
