[Paper Analysis] Denoising Diffusion Implicit Models (DDIM)

김종해 · May 12, 2023

<Paper>
[arXiv] Denoising Diffusion Implicit Models

<References>
[Reference] Christopher M Bishop, Pattern Recognition and Machine Learning, 2006
[tistory] DDIM : Denoising Diffusion Implicit Models
[page] [논문리뷰] DDIM : Denoising Diffusion Implicit Model

DDIM is the follow-up paper to DDPMs. It proposes a new sampling scheme to address the main weakness of DDPMs, namely slow sampling. This post assumes familiarity with DDPMs, so reading about DDPMs first is recommended.

1. Introduction

Among earlier image generation models, VAEs can generate diverse images but at low quality, while GANs generate high-quality images but with low diversity. The diffusion model of DDPMs, in contrast, generates images that are both high-quality and diverse, but its sampling is slow. By construction, a diffusion model has to apply the denoising process $T(=1000)$ times to pure Gaussian noise before it produces an image.

DDIM uses a non-Markovian diffusion process to make sampling 10 times or more faster than in DDPMs. It also improves consistency: according to the paper, sampling $\mathbf{x}_T$ from nearby positions yields similar images.

Before we begin, note that $\alpha_t$ in DDIM differs from $\alpha_t$ in DDPMs. DDPMs use $\alpha_t = 1-\beta_t$, whereas DDIM uses $\alpha_t = \prod_{i=1}^t(1-\beta_i)$. In other words, what DDPMs write as $\bar{\alpha}_t$ is written as $\alpha_t$ in DDIM. The paper does this partly for cleaner notation, but this post uses the DDPMs convention for $\alpha_t$.
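To make the two conventions concrete, here is a minimal sketch that computes the cumulative product $\bar{\alpha}_t$ from a $\beta$ schedule. The helper names `linear_betas` and `alpha_bars` are hypothetical, and the linear schedule with endpoints $10^{-4}$ and $0.02$ is an assumption taken from the usual DDPMs setup, not something this post specifies.

```python
def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; the endpoints are assumed defaults, not from this post."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]


def alpha_bars(betas):
    """Return [abar_1, ..., abar_T] with abar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # multiply in alpha_t = 1 - beta_t (DDPMs notation)
        out.append(running)
    return out


betas = linear_betas(1000)
abar = alpha_bars(betas)  # abar[t - 1] is alpha_bar_t, i.e. DDIM's alpha_t
```

Under this schedule `abar` decreases monotonically toward zero, which is what lets $\mathbf{x}_T$ be treated as nearly pure Gaussian noise.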

2. Overview

What makes sampling in DDPMs slow? DDIM attributes it to the Markov chain. Rather than running the denoising process all $T(=1000)$ times to sample an image, DDIM argues for skipping several steps at a time to speed up sampling. To this end it proposes a non-Markovian forward process and a non-Markovian reverse process, and lets the reverse process move along a subsequence instead of going through all $T(=1000)$ steps.

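Changing the process normally changes the loss function, but one of the key points of DDIM is that the location of the optimum stays the same even though the loss function changes. That is, the parameter $\theta$ is optimal at the same point for both DDPMs and DDIM, so no retraining is needed. In practice, training is therefore usually done the DDPMs way and sampling the DDIM way.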

3. Non-Markovian process

First, recall the loss function of DDPMs.

L_{simple}(\theta) = \mathrm{E}_{{t}, \mathbf{x}_0, \epsilon} \bigg[ (\epsilon - \epsilon_{\theta}(\mathbf{x}_t, t))^2 \bigg]

The simple loss function is obtained from the original sum of KL divergences by dropping the $t$-dependent coefficient $(={\beta_t^2 \over 2\sigma_t^2\alpha_t (1-\bar{\alpha}_t)})$ and replacing the sum with an average. It can therefore be recast in the following weighted form.

L_\gamma(\epsilon_\theta) \coloneqq \sum_{t=1}^T \gamma_t \mathrm{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0),\epsilon_t \sim N(0, I)} \bigg[ ||\epsilon_\theta^{(t)}(\mathbf{x}_t, t) - \epsilon_t||_2^2 \bigg]

DDPMs originally had $\gamma_t ={\beta_t^2 \over 2\sigma_t^2\alpha_t (1-\bar{\alpha}_t)}$, but set $\gamma_t=1$ to obtain a simpler loss function. DDIM keeps $\gamma_t$ in order to treat the more general form.

The key observation in DDIM is that the loss function is determined only by the marginal distribution $q(\mathbf{x}_t~|~\mathbf{x}_0)$ and is not affected by the joint distribution $q(\mathbf{x}_{1:T}~|~\mathbf{x}_0)$. [1] Reading this as "any joint distribution will do as long as the marginals are preserved", DDIM considers other processes that share the same marginals. Among them, in contrast to the Markov process of DDPMs, it defines a non-Markovian process.

3-1. Definition

As mentioned above, we design a new non-Markovian process $q_\sigma$, which must satisfy $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) = N(\mathbf{x}_t~;~\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)$. With this in mind, define $q_\sigma$ as follows.

q_\sigma(\mathbf{x}_{1:T}~|~\mathbf{x}_0) \coloneqq q_\sigma(\mathbf{x}_1~|~\mathbf{x}_0) \prod_{t=2}^T q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1}, \mathbf{x}_0)

Reading this carefully: given $\mathbf{x}_0$, we first obtain $\mathbf{x}_1$ $(=q_\sigma(\mathbf{x}_1~|~\mathbf{x}_0))$, and then use $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1}, \mathbf{x}_0)$ to obtain $\mathbf{x}_2$, $\mathbf{x}_3$, $\cdots$, $\mathbf{x}_T$ in sequence. In the accompanying figure (DDPMs on the left, DDIM on the right), DDPMs refer only to the immediately preceding image, whereas DDIM refers to both the preceding image and the initial image.

Moreover, by Bayes' theorem,

\begin{aligned} q_\sigma(\mathbf{x}_{1:T}~|~\mathbf{x}_0) &= q_\sigma(\mathbf{x}_1~|~\mathbf{x}_0) \prod_{t=2}^T q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1}, \mathbf{x}_0) \\ &= q_\sigma(\mathbf{x}_1~|~\mathbf{x}_0) \prod_{t=2}^T {q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)~q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) \over q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_0)} \\ &= q_\sigma(\mathbf{x}_T~|~\mathbf{x}_0) \prod_{t=2}^T q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0) \end{aligned}

Since every forward process has pure Gaussian noise as its target, we can define

q_\sigma(\mathbf{x}_T~|~\mathbf{x}_0) \coloneqq N(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)

Furthermore, for all $t>1$, define

q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0) = N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}}, \sigma_t^2I)

Then, from these two definitions, for all $t$,

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) = N(\mathbf{x}_t~;~\sqrt{\bar{\alpha}_t}\mathbf{x}_0,~~(1-\bar{\alpha}_t)I)

holds. The paper leaves the proof to Lemma 1 in Appendix B. [2]

Once this property is established, the non-Markovian forward process can be obtained. By Bayes' theorem,

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0) = {q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0) q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) \over q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_0)}

so $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0)$ can be regarded as the forward process. Moreover, since $\mathbf{x}_t$ depends on both $\mathbf{x}_{t-1}$ and $\mathbf{x}_0$, it is a non-Markovian process. [3]

The form of $q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0)$ closely resembles $q(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0)$ in DDPMs. Substituting the DDPMs value of $\bar{\Sigma}$, that is $\sigma_t^2={\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$, gives

\begin{aligned} q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0) &= N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}},~~\sigma_t^2I) \\ &= N\bigg({\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}\mathbf{x}_t + {\sqrt{\bar{\alpha}_{t-1}}\beta_t \over 1-\bar{\alpha}_t}\mathbf{x}_0,~~{\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}I\bigg) \\ &= q(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0) \end{aligned}

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0) = N(\sqrt{\alpha_t}\mathbf{x}_{t-1},~~\beta_tI) = q(\mathbf{x}_t~|~\mathbf{x}_{t-1}) \kern{85pt}

so the expressions coincide with those of DDPMs. In particular, the forward process then naturally has the Markov property. In other words, $q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0)$ is a general form that contains DDPMs as a special case.
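The reduction to DDPMs can be checked numerically. The sketch below writes the mean of each posterior as `c_xt * x_t + c_x0 * x_0` and compares the coefficients. The function names and the values $\bar{\alpha}_{t-1}=0.60$, $\alpha_t=0.95$ are hypothetical, chosen only for illustration.

```python
import math


def ddim_posterior_coeffs(abar_t, abar_prev, sigma2):
    """Coefficients of x_t and x_0 in the mean of q_sigma(x_{t-1} | x_t, x_0)."""
    k = math.sqrt(1.0 - abar_prev - sigma2) / math.sqrt(1.0 - abar_t)
    c_xt = k
    c_x0 = math.sqrt(abar_prev) - k * math.sqrt(abar_t)
    return c_xt, c_x0


def ddpm_posterior_coeffs(alpha_t, abar_t, abar_prev):
    """Coefficients of x_t and x_0 in the mean of the DDPMs posterior."""
    beta_t = 1.0 - alpha_t
    c_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    c_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    return c_xt, c_x0


# Hypothetical numbers for a single step t.
abar_prev, alpha_t = 0.60, 0.95
abar_t = abar_prev * alpha_t
beta_t = 1.0 - alpha_t
sigma2 = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)  # DDPMs posterior variance

ddim = ddim_posterior_coeffs(abar_t, abar_prev, sigma2)
ddpm = ddpm_posterior_coeffs(alpha_t, abar_t, abar_prev)
```

With this choice of $\sigma_t^2$ the two coefficient pairs agree up to floating-point error; any other admissible $\sigma_t^2$ gives a different, non-DDPMs member of the family.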

The actual sampling procedure is called the generative process and is written with $p_\theta$. In DDIM the generative process should resemble $q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)$, but this distribution is conditioned on $\mathbf{x}_0$, the very quantity we are trying to obtain. Since $\mathbf{x}_0$ is unknown in practice, we predict $\mathbf{x}_0$ from $\mathbf{x}_t$ and use the prediction in its place. If the noise $\epsilon_\theta^{(t)}$ contained in $\mathbf{x}_t$ is known (that is, can be predicted), then

\mathbf{x}_0 \approx {\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta^{(t)} \over \sqrt{\bar{\alpha}_t}} \coloneqq f_\theta^{(t)}(\mathbf{x}_t)

gives a prediction of $\mathbf{x}_0$, and

p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t) = \begin{cases} N(f_\theta^{(1)}(\mathbf{x}_1),~~\sigma_1^2I) &\text{if } t=1\\ q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,~~f_\theta^{(t)}(\mathbf{x}_t)) &\text{otherwise} \\ \end{cases}

defines the generative process. Predicting $\epsilon_\theta^{(t)}$ from $\mathbf{x}_t$ is exactly what the DDPMs network does. From this, $p_\theta(\mathbf{x}_{0:T})$ can be defined as follows.

p_\theta(\mathbf{x}_{0:T}) = p_\theta(\mathbf{x}_T)\prod_{t=1}^T p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t)

Comparison

Comparing the processes with DDPMs, the forward process changes from $q(\mathbf{x}_t~|~\mathbf{x}_{t-1})$ to $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0)$, and the reverse process from $q(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)$ to $q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)$. The generative process likewise changes from

p_\theta(\mathbf{x}_{t-1}~|~\mathbf{x}_t) = \begin{cases} {1 \over \sqrt{\alpha_t}}\bigg(\mathbf{x}_t - {\beta_t \over \sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t,t) \bigg) &\text{if } t=1\\ N\bigg({1 \over \sqrt{\alpha_t}}\bigg(\mathbf{x}_t - {\beta_t \over \sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t,t) \bigg),~~\beta_tI\bigg) &\text{otherwise} \\ \end{cases}

to

p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t) = \begin{cases} N(f_\theta^{(1)}(\mathbf{x}_1),~~\sigma_1^2I) &\text{if } t=1\\ q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,~~f_\theta^{(t)}(\mathbf{x}_t)) &\text{otherwise} \\ \end{cases}

Both papers generally build on $q(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)$, but they use a different process at $t=1$, the step that produces the final image.

3-2. Loss Function

The loss function $(=J_\sigma(\epsilon_\theta))$ has to be rebuilt for the process defined in DDIM. Since the reasoning is the same as for the DDPMs loss function, its form is the same as well. That is,

\begin{aligned} J_\sigma(\epsilon_\theta) &\coloneqq \mathrm{E}_{\mathbf{x}_{0:T}\sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ \log q_\sigma(\mathbf{x}_{1:T}~|~\mathbf{x}_0)-\log p_\theta(\mathbf{x}_{0:T}) \bigg] \\ &\equiv \mathrm{E}_{\mathbf{x}_{0:T}\sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ \sum_{t=2}^T D_{KL}(q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)~||~p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t)) - \log p_\theta^{(1)}(\mathbf{x}_0~|~\mathbf{x}_1) \bigg] \end{aligned}

and, of the two terms inside the expectation, each term of the KL divergence sum satisfies

\begin{aligned} \mathrm{E}&_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ D_{KL}(q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)~||~p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t)) \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ D_{KL}(q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,\mathbf{x}_0)~||~q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,f_\theta^{(t)}(\mathbf{x}_t))) \bigg] \\ &\equiv \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\mathbf{x}_0-f_\theta^{(t)}(\mathbf{x}_t)||_2^2 \over 2\sigma_t^2} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\epsilon_t - \epsilon_\theta^{(t)}(\mathbf{x}_t)||_2^2 \over 2d\sigma_t^2\bar{\alpha}_t} \bigg] \end{aligned}

Here $\equiv$ denotes equality up to terms that do not affect training. It is used so that the 'constants unrelated to training' that appear when computing the KL divergence between two multivariate Gaussian distributions can be ignored.

The remaining term is

\begin{aligned} \mathrm{E}&_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ -\log p_\theta^{(1)}(\mathbf{x}_0~|~\mathbf{x}_1) \bigg] \kern{135pt} \\ &\equiv \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\mathbf{x}_0-f_\theta^{(1)}(\mathbf{x}_1)||_2^2 \over 2\sigma_1^2} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\epsilon_1 - \epsilon_\theta^{(1)}(\mathbf{x}_1)||_2^2 \over 2d\sigma_1^2\bar{\alpha}_1} \bigg] \end{aligned}

Hence $J_\sigma(\epsilon_\theta)$ can be rewritten as

J_\sigma(\epsilon_\theta) \equiv \sum_{t=1}^T {1 \over 2d\sigma_t^2\bar{\alpha}_t} \mathrm{E} \bigg[ ||\epsilon_t-\epsilon_\theta^{(t)}(\mathbf{x}_t)||_2^2\bigg]

which equals $L_\gamma$ from the Non-Markovian process section with $\gamma_t={1 \over 2d\sigma_t^2\bar{\alpha}_t}$. It follows that for every $\sigma > 0$ there exist $\gamma_t\in\reals$ and $C\in\reals$ such that $J_\sigma(\epsilon_\theta) = L_\gamma + C$ holds for any $\epsilon_\theta$. [4]

This conclusion is what completes DDIM. Because the process was rebuilt from the DDPMs expressions into a generalized form with $\sigma_t$, the loss function should in principle differ, but by the result above it can be replaced with some $L_\gamma+C$.

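Moreover, $\gamma$ does not affect where the optimum lies (the paper argues this under the assumption that the parameters of $\epsilon_\theta^{(t)}$ are not shared across different $t$). This means the optimum is unchanged whichever $L_\gamma$ is used, so the loss can be replaced with the familiar $L_1$ used in DDPMs. If an $\epsilon$-prediction network has been trained with DDPMs, it can be used in DDIM without any additional training.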

3-3. Sampling

Assume the $\epsilon$-prediction network of DDPMs has already been trained. Then $\epsilon_\theta^{(t)}(\mathbf{x}_t)$, the noise contained in $\mathbf{x}_t$, can be predicted, and so for $t \ge 2$, from

p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t) = q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t,~~f_\theta^{(t)}(\mathbf{x}_t))

we obtain

\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \bigg({\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta^{(t)}(\mathbf{x}_t) \over \sqrt{\bar{\alpha}_t}} \bigg) +\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(\mathbf{x}_t)+\sigma_t\epsilon_t

where~~\epsilon_t \sim N(0, I)

which is how $\mathbf{x}_{t-1}$ is sampled. As shown earlier, when $\sigma_t^2={\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$ this coincides with the sampling procedure of DDPMs.

However, sampling still proceeds one step at a time, which by itself makes little difference from DDPMs. Our goal is to skip several steps at once and thereby speed up sampling.
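A single step of the update above can be sketched as follows. Images are represented as flat lists of floats to keep the example dependency-free, and `predict_x0` and `ddim_step` are hypothetical names. The noise prediction `eps_pred` is assumed to come from some already trained network, which is not modeled here.

```python
import math
import random


def predict_x0(x_t, eps_pred, abar_t):
    """f_theta^{(t)}(x_t): the estimate of x_0 implied by the predicted noise."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(x_t, eps_pred)]


def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma_t, rng=random):
    """Draw x_{t-1} from q_sigma(x_{t-1} | x_t, f_theta(x_t)) for t >= 2."""
    x0_hat = predict_x0(x_t, eps_pred, abar_t)
    direction = math.sqrt(1.0 - abar_prev - sigma_t ** 2)  # weight of the predicted noise
    return [math.sqrt(abar_prev) * x0 + direction * e + sigma_t * rng.gauss(0.0, 1.0)
            for x0, e in zip(x0_hat, eps_pred)]
```

Setting `sigma_t = 0` makes the step deterministic, while the DDPMs value of $\sigma_t$ recovers the stochastic DDPMs update.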

4. Accelerated Sampling

$p_\theta^{(t)}(\mathbf{x}_{t-1}~|~\mathbf{x}_t)$ still samples one step at a time and therefore does not improve speed. To accelerate sampling, we proceed in the order $\lang \tau_S,~\tau_{S-1},~\cdots,~\tau_2,~\tau_1\rang$ rather than $\lang T,~T-1,~\cdots,~2,~1\rang$. The subsequence $\tau = \lang\tau_1,~\tau_2,\cdots,~\tau_S\rang$ of the sequence $\lang1,~2,~\cdots,~T\rang$ satisfies

$\tau_i \in \{1,~2,~\cdots,~T\}~~~~for~~every~~ i \in\{1,~2,~\cdots,~S\}$
$S \le T$
$\tau_i < \tau_j$ whenever $i<j$
$\tau_S = T$

and is otherwise arbitrary.
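One simple way to build such a subsequence is to space the indices evenly. This is only a sketch: `make_tau` is a hypothetical helper, and even spacing is an assumption made here for illustration, since any increasing subsequence ending at $T$ is allowed.

```python
def make_tau(T, S):
    """Increasing subsequence of 1..T with length S and last element T."""
    if not 1 <= S <= T:
        raise ValueError("need 1 <= S <= T")
    # ceil(T * i / S) via integer arithmetic; rounding up guarantees tau_S == T.
    return [-(-T * i // S) for i in range(1, S + 1)]


tau = make_tau(1000, 10)  # [100, 200, ..., 1000]
```

Because $T/S \ge 1$, consecutive entries differ by at least one, so the result is strictly increasing as required.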

4-1. Definition

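We now define a process $q_{\sigma,\tau}$ that passes only through the $\mathbf{x}_{\tau_i}$. The flow is similar to that for $q_\sigma$, and since the $\mathbf{x}_t$ with $t \notin \tau$ are of little interest, they are defined in a comparatively simple way.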

q_{\sigma,\tau}(\mathbf{x}_{1:T}~|~\mathbf{x}_0) = q_{\sigma,\tau}(\mathbf{x}_{\tau_1}~|~\mathbf{x}_0)\prod_{i=2}^S q_{\sigma,\tau}(\mathbf{x}_{\tau_i}~|~\mathbf{x}_{\tau_{i-1}},\mathbf{x}_0) \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t~|~\mathbf{x}_0)

For $t \in \tau$, the process refers to the preceding image in the subsequence $(\mathbf{x}_{\tau_{i-1}} \to \mathbf{x}_{\tau_i})$ and to the initial image $(\mathbf{x}_0)$.
For $t \notin \tau$, it refers only to the initial image $(\mathbf{x}_0)$.

Moreover, by Bayes' theorem,

\begin{aligned} q_{\sigma,\tau}(\mathbf{x}_{1:T}~|~\mathbf{x}_0) &= q_{\sigma,\tau}(\mathbf{x}_{\tau_1}~|~\mathbf{x}_0)\prod_{i=2}^S q_{\sigma,\tau}(\mathbf{x}_{\tau_i}~|~\mathbf{x}_{\tau_{i-1}},\mathbf{x}_0) \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t~|~\mathbf{x}_0) \\ &= q_{\sigma,\tau}(\mathbf{x}_{\tau_1}~|~\mathbf{x}_0)\prod_{i=2}^S {q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i},\mathbf{x}_0)~q_{\sigma,\tau}(\mathbf{x}_{\tau_i}~|~\mathbf{x}_0) \over q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_0)} \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t~|~\mathbf{x}_0) \\ &= q_{\sigma,\tau}(\mathbf{x}_{\tau_S}~|~\mathbf{x}_0)\prod_{i=2}^S q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i},\mathbf{x}_0) \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t~|~\mathbf{x}_0) \\ \end{aligned}

The $\mathbf{x}_t$ with $t \notin \tau$ are not used during sampling anyway, so they are specified simply as follows. The last $(=T)$ image is specified in the same way as for $q_\sigma$.

q_{\sigma,\tau}(\mathbf{x}_t~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0,~~(1-\bar{\alpha}_t)I)~~~~for~~t=T,~~every~~t\notin\tau

For all $\tau_i \in \tau$, define

q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i},\mathbf{x}_0) = N\bigg(\sqrt{\bar{\alpha}_{\tau_{i-1}}}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_{\tau_{i-1}}-\sigma_{\tau_i}^2} \cdot {\mathbf{x}_{\tau_i}-\sqrt{\bar{\alpha}_{\tau_i}}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_{\tau_i}}},~~\sigma_{\tau_i}^2I \bigg)

Again by Lemma 1 in Appendix B,

q_{\sigma,\tau}(\mathbf{x}_{\tau_i}~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_{\tau_i}}\mathbf{x}_0,~~(1-\bar{\alpha}_{\tau_i})I)~~~~for~~every~~\tau_i \in \tau

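holds. [5] The authors describe this definition as follows: the $\mathbf{x}_{\tau_i}$ together with $\mathbf{x}_0$ form a 'chain' (linked in a single line, like train cars), while the remaining variables together with $\mathbf{x}_0$ form a 'star graph' (each one connected only to $\mathbf{x}_0$).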

Actual sampling likewise passes only through the subsequence $\tau$. The accelerated generative process $p_\theta$ is defined as follows.

p_\theta(\mathbf{x}_{0:T}) \coloneqq p_\theta(\mathbf{x}_T) \prod_{i=1}^S p_\theta^{(\tau_i)} (\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i}) \prod_{t \notin \tau} p_\theta^{(t)}(\mathbf{x}_0~|~\mathbf{x}_t)

Since it is reasonable for $p_\theta^{(\tau_i)} (\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i})$, the part actually used in sampling, to resemble the non-Markovian diffusion process, it is defined as

p_\theta^{(\tau_i)} (\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i}) = q_{\sigma, \tau} (\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i}, f_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i}))

and the remaining steps are defined as follows.

p_\theta^{(t)}(\mathbf{x}_0~|~\mathbf{x}_t) = N(f_\theta^{(t)}(\mathbf{x}_t), \sigma_t^2I)

4-2. Loss Function

As with the earlier loss function, we need to show that for all $\sigma, \tau$ there exists $\gamma_t$ such that $J_{\sigma, \tau} (\epsilon_\theta) = L_\gamma + C$. If it exists, the loss function induced by a generative process designed to pass only through an arbitrary subsequence can be replaced with some $L_\gamma$. Since the optimum of $L_\gamma$ coincides with that of $L_1$ used in DDPMs, we can then sample along any desired subsequence without separately training an $\epsilon$-prediction network (that is, by reusing the DDPMs network).

The relevant material is in Appendix C.1 of the paper, but the proof is omitted there and the equations are not developed step by step, so the proof is omitted here as well.

4-3. Sampling

The upshot is that even though the loss function changes, its optimum is located where that of the DDPMs $L_1$ is. Sampling can therefore use

p_\theta(\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i}) = \begin{cases} N(f_\theta^{(\tau_1)}(\mathbf{x}_{\tau_1}), \sigma_{\tau_1}^2I) &\text{if } i=1 \\ q_{\sigma, \tau}(\mathbf{x}_{\tau_{i-1}}~|~\mathbf{x}_{\tau_i}, f_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i})) &\text{otherwise} \end{cases}

which gives the update

\mathbf{x}_{\tau_{i-1}} = \begin{cases} {\mathbf{x}_{\tau_1} - \sqrt{1-\bar{\alpha}_{\tau_1}}\epsilon_\theta^{(\tau_1)}(\mathbf{x}_{\tau_1}) \over \sqrt{\bar{\alpha}_{\tau_1}}} +\sigma_{\tau_1}\epsilon_{\tau_1} &\text{if } i=1 \\ \\ \sqrt{\bar{\alpha}_{\tau_{i-1}}} \cdot {\mathbf{x}_{\tau_i} - \sqrt{1-\bar{\alpha}_{\tau_i}}\epsilon_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i}) \over \sqrt{\bar{\alpha}_{\tau_i}}} + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}-\sigma_{\tau_i}^2} \cdot \epsilon_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i}) + \sigma_{\tau_i}\epsilon_{\tau_i} &\text{otherwise} \end{cases}

where~~\epsilon_{\tau_i} \sim N(0, I)

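This is how sampling proceeds. The formula resembles the one before acceleration, but the major difference is that the subsequence $\tau$ can be chosen freely, which drastically reduces the number of sampling steps.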
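Putting the pieces together, the accelerated sampler is a loop over $\tau$ in reverse. The following is a minimal sketch under stated assumptions: `ddim_sample` and `eps_model(x, t)` are hypothetical names, images are flat lists of floats, and `abar` is indexed so that `abar[t]` is $\bar{\alpha}_t$ with `abar[0] = 1` standing for $\mathbf{x}_0$.

```python
import math
import random


def ddim_sample(eps_model, x_T, abar, tau, sigmas, rng=random):
    """Run the generative process along tau = [tau_1, ..., tau_S].

    sigmas[i - 1] is sigma_{tau_i}; eps_model(x, t) is a hypothetical noise predictor.
    """
    x = list(x_T)
    for i in range(len(tau), 0, -1):  # i = S, S-1, ..., 1
        t = tau[i - 1]
        sigma = sigmas[i - 1]
        eps = eps_model(x, t)
        # Predicted x_0, i.e. f_theta^{(tau_i)}(x_{tau_i}).
        x0_hat = [(xj - math.sqrt(1.0 - abar[t]) * ej) / math.sqrt(abar[t])
                  for xj, ej in zip(x, eps)]
        if i == 1:
            # Final step: sample x_0 around the prediction.
            x = [x0 + sigma * rng.gauss(0.0, 1.0) for x0 in x0_hat]
        else:
            t_prev = tau[i - 2]
            direction = math.sqrt(1.0 - abar[t_prev] - sigma ** 2)
            x = [math.sqrt(abar[t_prev]) * x0 + direction * ej
                 + sigma * rng.gauss(0.0, 1.0)
                 for x0, ej in zip(x0_hat, eps)]
    return x
```

The network is called once per element of $\tau$, so the cost scales with $S$ rather than $T$.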

5. Experiment

$\sigma$ and $\tau$ are hyperparameters that the experimenter is free to vary. DDIM ran experiments comparing various $\sigma$ and $\tau$ to find the setting with the best performance.

$S$ is the length of the subsequence $\tau$, and to compare performance with DDPMs, DDIM defines $\eta$ as follows.

\sigma_{\tau_i} = \eta \cdot \sqrt{ {1-\bar{\alpha}_{\tau_{i-1}} \over 1-\bar{\alpha}_{\tau_i}} \bigg(1-{\bar{\alpha}_{\tau_i} \over \bar{\alpha}_{\tau_{i-1}}}\bigg) }

When $\eta = 1$, $\sigma_{\tau_i}$ equals that of DDPMs, and the experiments decrease it down to $\eta = 0$. The case $\eta = 1$, $S=1000$ is exactly the DDPMs experiment.
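The role of $\eta$ can be sketched in a few lines. This assumes $\sigma_{\tau_i}$ is $\eta$ times the square root of the DDPMs posterior variance for the jump from $\tau_i$ to $\tau_{i-1}$, as in the DDIM paper. `sigma_eta` is a hypothetical name.

```python
import math


def sigma_eta(abar_t, abar_prev, eta):
    """sigma for one step from tau_i (abar_t) down to tau_{i-1} (abar_prev)."""
    ddpm_var = (1.0 - abar_prev) / (1.0 - abar_t) * (1.0 - abar_t / abar_prev)
    return eta * math.sqrt(ddpm_var)
```

For consecutive steps, $1-\bar{\alpha}_t/\bar{\alpha}_{t-1}=\beta_t$, so `sigma_eta(..., 1.0) ** 2` reduces to $\beta_t(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)$, and `eta = 0.0` gives the deterministic sampler.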

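The evaluation metric is FID, where a lower value means better performance. Overall, performance improves as $S$ grows and as $\eta$ shrinks. Based on this, DDIM adopts $\eta=0$, that is $\sigma_{\tau_i}=0$. Since sampling takes a long time with $S=1000$, it accepts some loss in performance in exchange for a speedup of 10 times or more $(S \le100)$. [6]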

6. Endnotes

[1] The DDPMs network is trained to extract the noise contained in $\mathbf{x}_t$. The method is to build a problem $(= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon_t=\mathbf{x}_t)$ from an answer prepared in advance $(=\epsilon_t)$, and then to compare how well the network output $\epsilon_\theta^{(t)}(\mathbf{x}_t)$ recovers that answer from the problem. The only distribution used to build $\mathbf{x}_t$ is

q(\mathbf{x}_t~|~\mathbf{x}_0) = N(\mathbf{x}_t~;~\sqrt{\bar{\alpha}_t}\mathbf{x}_0 , (1-\bar{\alpha}_t)I)

which is the marginal distribution. Hence the idea that, as long as the marginals agree, the joint distribution can be anything.
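The "answer first, then problem" construction can be sketched as below. `make_training_pair` and `noise_mse` are hypothetical helpers, and a real training loop would feed `x_t` to a network to obtain `eps_pred`, which is not modeled here.

```python
import math
import random


def make_training_pair(x0, abar_t, rng=random):
    """Draw the answer eps ~ N(0, I) and build the problem x_t from the marginal."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


def noise_mse(eps, eps_pred):
    """Mean squared error between the true noise and a prediction of it."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)
```

Only $\bar{\alpha}_t$ enters the construction, which is the sense in which training depends on the marginal alone.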

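[2] Let us prove Lemma 1 of Appendix B. The paper cites Christopher M Bishop's Pattern Recognition and Machine Learning for this. The result used from the book is the marginal of a linear Gaussian model, equation (2.115):

p(\mathbf{x}) = N(\mathbf{x}~|~\mu, \Lambda^{-1}), \quad p(\mathbf{y}~|~\mathbf{x}) = N(\mathbf{y}~|~A\mathbf{x}+b, L^{-1}) \quad \Rightarrow \quad p(\mathbf{y}) = N(\mathbf{y}~|~A\mu+b, L^{-1}+A\Lambda^{-1}A^T)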

The idea behind the proof of Lemma 1 is similar to mathematical induction. Given the two definitions

q_\sigma(\mathbf{x}_T~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)

q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0) = N \bigg(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}}, \sigma_t^2I \bigg)

we prove that, for all $t \le T$,

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)\\ \Downarrow \\ q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I)

Here $a \Rightarrow b$ means that $b$ is true whenever $a$ is true.

Once this is proved, since we already know that the case $t=T$, namely $q_\sigma(\mathbf{x}_T~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)$, holds, the statement follows successively for $t=T-1,~\cdots,~2,~1$. So assume that

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)

is true. Treating $\mathbf{x}_0$ as a constant, this can be written as follows.

q_\sigma(\mathbf{x}_t) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)

Likewise, we can write

q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t) = N \bigg(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}}, \sigma_t^2I \bigg)

and so, by equation (2.115),

\begin{aligned} q_\sigma(\mathbf{x}_{t-1}) &= q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_0) \\ &=N \bigg( \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, \bigg( \sigma_t^2 + {1-\bar{\alpha}_{t-1}-\sigma_t^2 \over 1-\bar{\alpha}_t} \cdot (1-\bar{\alpha}_t) \bigg)I \bigg) \\ &= N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I) \end{aligned}

Therefore, by induction, for all $t \le T$,

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)

holds.
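The induction can also be replayed numerically in the scalar case. The sketch below propagates the mean and variance of the marginal from $t=T$ down to $t=1$ using the linear-Gaussian marginal rule, and reports the largest deviation from $N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, 1-\bar{\alpha}_{t-1})$. The function names and the indexing convention (`abar[0] = 1`, `sigma2[t]` for $t \ge 2$) are assumptions made for illustration.

```python
import math


def propagate_marginal(mean_t, var_t, abar_t, abar_prev, sigma2, x0):
    """Marginal of x_{t-1} given the Gaussian marginal of x_t (scalar case)."""
    k = math.sqrt(1.0 - abar_prev - sigma2) / math.sqrt(1.0 - abar_t)
    mean_prev = math.sqrt(abar_prev) * x0 + k * (mean_t - math.sqrt(abar_t) * x0)
    var_prev = sigma2 + k * k * var_t  # L^{-1} + A Lambda^{-1} A^T
    return mean_prev, var_prev


def max_lemma1_error(abar, sigma2, x0):
    """Largest gap between the propagated marginals and the claimed ones."""
    T = len(abar) - 1
    mean, var = math.sqrt(abar[T]) * x0, 1.0 - abar[T]  # base case t = T
    worst = 0.0
    for t in range(T, 1, -1):  # t = T, ..., 2
        mean, var = propagate_marginal(mean, var, abar[t], abar[t - 1], sigma2[t], x0)
        worst = max(worst,
                    abs(mean - math.sqrt(abar[t - 1]) * x0),
                    abs(var - (1.0 - abar[t - 1])))
    return worst
```

Any choice with $0 \le \sigma_t^2 \le 1-\bar{\alpha}_{t-1}$ should return an error at the level of floating-point noise, which is the content of the lemma.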

[3] Since $q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0)$ and $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0)$ are known, $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0)$ can be computed explicitly.

q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0) = {q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_t, \mathbf{x}_0)~q_\sigma(\mathbf{x}_t~|~\mathbf{x}_0) \over q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_0)} = N(\mathbf{x}_t~;~\tilde{\mu}, \tilde{\Sigma}) \\ \space\\ where

\tilde{\mu} = {\sqrt{1-\bar{\alpha}_t}\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \over 1-\bar{\alpha}_{t-1}}\mathbf{x}_{t-1} + \bigg(\sqrt{\bar{\alpha}_t} - {\sqrt{1-\bar{\alpha}_t}\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\sqrt{\bar{\alpha}_{t-1}} \over 1-\bar{\alpha}_{t-1}}\bigg) \mathbf{x}_0, \\ \space\\ \tilde{\Sigma} = {\sigma_t^2(1-\bar{\alpha}_t) \over 1-\bar{\alpha}_{t-1}}I \kern{275pt}

which is a multivariate Gaussian distribution.
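As a quick sanity check of $\tilde{\mu}$ and $\tilde{\Sigma}$, the sketch below evaluates their coefficients and confirms that the DDPMs choice of $\sigma_t^2$ gives the Markov forward step $N(\sqrt{\alpha_t}\mathbf{x}_{t-1}, \beta_t I)$. `forward_params` and the numeric values are hypothetical.

```python
import math


def forward_params(alpha_t, abar_prev, sigma2):
    """Return (c_prev, c_x0, var) with mu = c_prev * x_{t-1} + c_x0 * x_0."""
    abar_t = abar_prev * alpha_t
    root = math.sqrt(1.0 - abar_t) * math.sqrt(1.0 - abar_prev - sigma2)
    c_prev = root / (1.0 - abar_prev)
    c_x0 = math.sqrt(abar_t) - c_prev * math.sqrt(abar_prev)
    var = sigma2 * (1.0 - abar_t) / (1.0 - abar_prev)
    return c_prev, c_x0, var


alpha_t, abar_prev = 0.95, 0.60  # hypothetical values
beta_t = 1.0 - alpha_t
ddpm_sigma2 = beta_t * (1.0 - abar_prev) / (1.0 - abar_prev * alpha_t)
c_prev, c_x0, var = forward_params(alpha_t, abar_prev, ddpm_sigma2)
```

Here `c_x0` vanishes and `var` equals $\beta_t$. For other values of $\sigma_t^2$ the coefficient of $\mathbf{x}_0$ is generally nonzero, which is where the dependence on $\mathbf{x}_0$ enters.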

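In fact, $q_\sigma(\mathbf{x}_{t-1}~|~\mathbf{x}_0)$ relates $\mathbf{x}_0$ to $\mathbf{x}_{t-1}$, so $\mathbf{x}_0$ can be rewritten in terms of $\mathbf{x}_{t-1}$, in which case $q_\sigma(\mathbf{x}_t~|~\mathbf{x}_{t-1},\mathbf{x}_0)$ becomes a Markov process. For the forward process, however, whether it is Markov is not very important.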

[4] In the paper, this is given as Theorem 1 in Appendix B.

[5] The procedure is the same as in [2], with $\tau_i$ in place of $t$ and $\tau_{i-1}$ in place of $t-1$.

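[6] When $\eta=0$ we have $\sigma_{\tau_i}=0$, which means that no randomness is injected in either the diffusion process or the denoising process. If anything, this looks like an insight that would reduce the diversity of a generative model.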

Reading meaning into this after the fact: the denoising process of DDPMs can be seen as denoising with $\mu_\theta$ and then adding noise back with $\Sigma_\theta$. Adding noise while denoising is counterintuitive, and by setting $\sigma_{\tau_i}=0$ DDIM can be interpreted as removing this inefficiency.
