Denoising Diffusion Implicit Models (DDIM)

김민솔 · March 26, 2024


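DDPM (Denoising Diffusion Probabilistic Models) achieved strong image-generation results without adversarial training, unlike GAN-based models. However, sampling has to simulate a Markov chain one step at a time over many steps, so it is very slow. DDIM (Denoising Diffusion Implicit Models) therefore considers non-Markovian processes that keep the same training objective but allow faster sampling. This post takes a brief look at how that works.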

This post assumes a basic understanding of DDPM.

Back to DDPM!

Loss function

$$L_{simple}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta(\sqrt{\overline{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \overline{\alpha}_t}\epsilon, t)\big\|^2\Big]$$
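As a rough illustration, one Monte Carlo estimate of this loss could be computed as in the sketch below. The linear beta schedule, the batch shape, and the `zero_model` stand-in for $\epsilon_\theta$ are assumptions made only for this example, not part of any particular implementation.

```python
import numpy as np

def l_simple(eps_model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple for a batch x0 of shape (B, D)."""
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=x0.shape[0])      # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)               # eps ~ N(0, I)
    a = alpha_bar[t - 1][:, None]                     # alpha_bar_t, broadcast over D
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # noised input
    sq_err = np.sum((eps - eps_model(x_t, t)) ** 2, axis=1)
    return np.mean(sq_err)

# Hypothetical setup: linear beta schedule and a dummy model that predicts zeros.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = l_simple(zero_model, x0, alpha_bar, rng)
```

A trained network would replace `zero_model`; the point here is only how $t$, $\mathbf{x}_0$, and $\epsilon$ enter the expectation.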

The key idea of DDIM is to design a sampler that works with the same objective function as DDPM. (Because the objective is the same, DDIM sampling can be applied directly to a pretrained DDPM model!)

So, to design the DDIM sampler, we need to find a family of forward processes that satisfies three properties of the DDPM forward process: ① the diffusion kernel, ② the posterior, and ③ the posterior mean.

The formulas for these three conditions are introduced below, followed by the non-Markovian process.

Diffusion Kernel

$$q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\overline{\alpha}_t}\mathbf{x}_0, (1-\overline{\alpha}_t)\mathbf{I})$$

$\overline{\alpha}_t = \prod^t_{s=1} (1 - \beta_s)$
$\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\epsilon$
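To make the kernel concrete, here is a minimal sketch that draws $\mathbf{x}_t$ directly from $\mathbf{x}_0$ and checks the mean and variance empirically. The linear beta schedule and the helper names `make_alpha_bar` and `q_sample` are assumptions for illustration.

```python
import numpy as np

def make_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), stored at index t - 1."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot via the reparameterization."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule
alpha_bar = make_alpha_bar(betas)
rng = np.random.default_rng(0)
x0 = np.full(200_000, 2.0)              # many copies of the same scalar "image"
x_t, _ = q_sample(x0, 500, alpha_bar, rng)

empirical_mean, empirical_var = x_t.mean(), x_t.var()
expected_mean = np.sqrt(alpha_bar[499]) * 2.0
expected_var = 1.0 - alpha_bar[499]
```

The empirical statistics should land close to $\sqrt{\overline{\alpha}_t}\mathbf{x}_0$ and $1-\overline{\alpha}_t$, without simulating any of the intermediate steps.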

Forward process (posterior)

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0), \tilde\sigma^2_t\mathbf{I})$$

$\tilde\mu_t(\mathbf{x}_t, \mathbf{x}_0) = a\mathbf{x}_t + b\epsilon = a\mathbf{x}_t + b\frac {\mathbf{x}_t - \sqrt{\overline{\alpha}_{t}}\mathbf{x}_0} {\sqrt{1 - \overline{\alpha}_{t}}}$: the posterior mean, where $a$ and $b$ are scalar coefficients determined by the noise schedule

Reverse process

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \tilde\sigma^2_t\mathbf{I})$$

$\mu_\theta(\mathbf{x}_t, t) = a\mathbf{x}_t + b\epsilon_\theta(\mathbf{x}_t, t) = a\mathbf{x}_t + b\frac {\mathbf{x}_t - \sqrt{\overline{\alpha}_{t}} \hat{\mathbf{x}}_0} {\sqrt{1 - \overline{\alpha}_{t}}}$

Variational Inference for Non-Markovian Forward process

Forward process (posterior)

$$q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\Big(\mathbf{x}_{t-1}; \sqrt{\overline{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \overline{\alpha}_{t-1} - \sigma^2_t} \cdot \frac{\mathbf{x}_t - \sqrt{\overline{\alpha}_t}\mathbf{x}_0}{\sqrt{1 - \overline{\alpha}_t}}, \sigma^2_t\mathbf{I}\Big)$$

Reverse process

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\Big(\mathbf{x}_{t-1}; \sqrt{\overline{\alpha}_{t-1}}\hat{\mathbf{x}}_0 + \sqrt{1 - \overline{\alpha}_{t-1} - \sigma^2_t} \cdot \frac{\mathbf{x}_t - \sqrt{\overline{\alpha}_t}\hat{\mathbf{x}}_0}{\sqrt{1 - \overline{\alpha}_t}}, \sigma^2_t\mathbf{I}\Big)$$

These are the non-Markovian process that satisfies the three conditions on the forward process, and the reverse process that goes with it. Here $\sigma_t$ is a free parameter: each choice of $\sigma_t$ gives a different member of the family. Next, let's use them to work out the sampling procedure.
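One way to convince yourself that this posterior keeps the DDPM marginals is a quick numerical check: draw $\mathbf{x}_t$ from the diffusion kernel, then draw $\mathbf{x}_{t-1}$ from the non-Markovian posterior, and compare against $\mathcal{N}(\sqrt{\overline{\alpha}_{t-1}}\mathbf{x}_0, (1-\overline{\alpha}_{t-1})\mathbf{I})$. The sketch below does that; the $\overline{\alpha}$ values, the noise levels, and the function name `q_sigma_sample` are made up for the example.

```python
import numpy as np

def q_sigma_sample(x_t, x0, abar_t, abar_prev, sigma, rng):
    """Draw x_{t-1} from the non-Markovian posterior given x_t and x_0."""
    eps_dir = (x_t - np.sqrt(abar_t) * x0) / np.sqrt(1.0 - abar_t)
    mean = np.sqrt(abar_prev) * x0 + np.sqrt(1.0 - abar_prev - sigma ** 2) * eps_dir
    return mean + sigma * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
abar_t, abar_prev = 0.5, 0.7            # hypothetical alpha_bar_t, alpha_bar_{t-1}
x0 = np.full(200_000, 1.5)
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * rng.standard_normal(x0.shape)

stats = {}
for sigma in (0.0, 0.2, 0.4):           # each sigma must satisfy sigma^2 <= 1 - abar_prev
    x_prev = q_sigma_sample(x_t, x0, abar_t, abar_prev, sigma, rng)
    stats[sigma] = (x_prev.mean(), x_prev.var())

target_mean = np.sqrt(abar_prev) * 1.5
target_var = 1.0 - abar_prev
```

For every value of `sigma` the statistics of `x_prev` should match the same target, which is why the same trained model can be reused across the whole family.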

Sampling from Generalized Generative Processes

With the process defined above, the denoised observation, i.e. the prediction of $\mathbf{x}_0$ from $\mathbf{x}_t$, can be defined as follows.

$$f^{(t)}_\theta(\mathbf{x}_t) := \frac{\mathbf{x}_t - \sqrt{1 - \overline{\alpha}_t} \cdot \epsilon^{(t)}_\theta(\mathbf{x}_t)}{\sqrt{\overline{\alpha}_t}}$$

This follows from rearranging the reparameterization trick, $\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\epsilon$, for $\mathbf{x}_0$ and replacing $\epsilon$ with the model's prediction!
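A tiny sketch of this rearrangement, using scalars so the arithmetic is easy to follow. The function name `predict_x0` and the numbers are illustrative assumptions; the true noise stands in for a perfect model prediction.

```python
def predict_x0(x_t, eps_pred, abar_t):
    """Denoised observation: invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    return (x_t - (1.0 - abar_t) ** 0.5 * eps_pred) / abar_t ** 0.5

abar_t = 0.36                      # hypothetical alpha_bar_t
x0, eps = 0.8, -1.25               # ground-truth sample and noise
x_t = abar_t ** 0.5 * x0 + (1.0 - abar_t) ** 0.5 * eps
x0_hat = predict_x0(x_t, eps, abar_t)   # recovers x0 when eps_pred equals the true noise
```

The same function works unchanged on NumPy arrays, since it only uses arithmetic operators.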

With this, the generative process can be written as follows.

$$p^{(t)}_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \begin{cases} \mathcal{N}(f^{(1)}_\theta(\mathbf{x}_{1}), \sigma^2_1\mathbf{I}) &\text{if } t=1 \\ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, f^{(t)}_\theta(\mathbf{x}_t)) &\text{otherwise} \end{cases}$$

From the $p_\theta(\mathbf{x}_{1:T})$ given by this generative process, a sample $\mathbf{x}_{t-1}$ can be generated from a sample $\mathbf{x}_t$ as follows.

$$\mathbf{x}_{t-1} = \sqrt{\overline{\alpha}_{t-1}}\Big( \frac{\mathbf{x}_t - \sqrt{1 - \overline{\alpha}_t} \cdot \epsilon^{(t)}_\theta(\mathbf{x}_t)}{\sqrt{\overline{\alpha}_t}} \Big) + \sqrt{1 - \overline{\alpha}_{t-1} - \sigma^2_t}\cdot \epsilon^{(t)}_\theta(\mathbf{x}_t) + \sigma_t\epsilon_t$$

predicted $\mathbf{x}_{0}$: $\frac{\mathbf{x}_t - \sqrt{1 - \overline{\alpha}_t} \cdot \epsilon^{(t)}_\theta(\mathbf{x}_t)}{\sqrt{\overline{\alpha}_t}}$
direction pointing to $\mathbf{x}_{t}$: $\sqrt{1 - \overline{\alpha}_{t-1} - \sigma^2_t}\cdot \epsilon^{(t)}_\theta(\mathbf{x}_t)$
random noise: $\sigma_t\epsilon_t$, with $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$
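The three terms map almost one-to-one onto code. The following is a minimal sketch of a single update, assuming the noise prediction is already available; `ddim_step` and its argument names are hypothetical, and the random draw is passed in explicitly so the example stays deterministic.

```python
def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma_t=0.0, noise=0.0):
    """One generalized sampling step x_t -> x_{t-1}.

    noise is a draw from N(0, I); it is ignored when sigma_t == 0.
    """
    x0_pred = (x_t - (1.0 - abar_t) ** 0.5 * eps_pred) / abar_t ** 0.5   # predicted x_0
    direction = (1.0 - abar_prev - sigma_t ** 2) ** 0.5 * eps_pred       # direction pointing to x_t
    return abar_prev ** 0.5 * x0_pred + direction + sigma_t * noise      # plus random noise

# Hypothetical numbers: alpha_bar_t = 0.36, alpha_bar_{t-1} = 0.64.
x0, eps = 0.8, -1.25
x_t = 0.36 ** 0.5 * x0 + (1.0 - 0.36) ** 0.5 * eps

x_prev_det = ddim_step(x_t, eps, 0.36, 0.64)                             # sigma_t = 0
x_prev_sto = ddim_step(x_t, eps, 0.36, 0.64, sigma_t=0.3, noise=1.0)     # sigma_t > 0
```

With the true noise as the prediction and `sigma_t = 0`, the step lands exactly on $\sqrt{\overline{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_{t-1}}\epsilon$, i.e. the same noise at a lower noise level.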

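If $\sigma_t = 0$ for all $t$, the forward process becomes deterministic given $\mathbf{x}_{t-1}$ and $\mathbf{x}_0$ (except at $t=1$). In the generative process the random-noise term vanishes, so the sampling trajectory $\mathbf{x}_T \rightarrow \mathbf{x}_0$ is fixed once $\mathbf{x}_T$ is drawn. The result is an implicit probabilistic model, a model in which samples are generated from latent variables by a fixed procedure, and this deterministic generative process is the DDIM sampler.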

ODE interpretation

$$\frac{\mathbf{x}_{t-\Delta t}}{\sqrt{\overline{\alpha}_{t-\Delta t}}} = \frac{\mathbf{x}_{t}}{\sqrt{\overline{\alpha}_{t}}} + \Big(\sqrt{\frac{1 - \overline{\alpha}_{t-\Delta t}}{\overline{\alpha}_{t-\Delta t}}} - \sqrt{\frac{1 - \overline{\alpha}_{t}}{\overline{\alpha}_{t}}} \Big) \epsilon^{(t)}_\theta(\mathbf{x}_t)$$

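This rewrites the sampling equation of the generative process for the deterministic case ($\sigma_t = 0$), i.e. the DDIM sampling update, in a form that can be read as an Euler step of an ODE.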

$$d\overline{\mathbf{x}}(t) = \epsilon^{(t)}_\theta \Big(\frac{\overline{\mathbf{x}}(t)}{\sqrt{\sigma^2 + 1}} \Big) d\sigma(t)$$

reparameterization: $\sigma = \sqrt{\frac{1 - \overline{\alpha}}{\overline{\alpha}}}$, $\overline{\mathbf{x}} = \frac{\mathbf{x}}{\sqrt{\overline{\alpha}}}$ (this $\sigma$ is a new variable, not the sampling noise level $\sigma_t$)

In this way, DDIM can also be viewed as an ODE. In addition, with the optimal model, this ODE is equivalent to the probability flow ODE of the “variance exploding” SDE.
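To see that the ODE form is the same update in different coordinates, the sketch below takes one deterministic DDIM step and one Euler step in the reparameterized variables $(\overline{\mathbf{x}}, \sigma)$ and compares them. The function names and numbers are assumptions chosen for the check.

```python
def ddim_step_deterministic(x_t, eps_pred, abar_t, abar_prev):
    """DDIM update with sigma_t = 0."""
    x0_pred = (x_t - (1.0 - abar_t) ** 0.5 * eps_pred) / abar_t ** 0.5
    return abar_prev ** 0.5 * x0_pred + (1.0 - abar_prev) ** 0.5 * eps_pred

def euler_step(x_t, eps_pred, abar_t, abar_prev):
    """Same update as an Euler step on x_bar = x / sqrt(abar), sigma = sqrt((1 - abar) / abar)."""
    x_bar = x_t / abar_t ** 0.5
    sigma_t = ((1.0 - abar_t) / abar_t) ** 0.5
    sigma_prev = ((1.0 - abar_prev) / abar_prev) ** 0.5
    x_bar_prev = x_bar + (sigma_prev - sigma_t) * eps_pred   # d x_bar = eps * d sigma
    return x_bar_prev * abar_prev ** 0.5                     # map back to x

# Hypothetical inputs.
a = ddim_step_deterministic(0.4, -0.9, 0.5, 0.6)
b = euler_step(0.4, -0.9, 0.5, 0.6)
```

Both functions return the same value up to floating-point error, which is the sense in which the DDIM sampler is an Euler discretization of the ODE.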

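Looking at the ODEs in the figure, the DDIM ODE has relatively lower curvature and follows mostly linear trajectories. This gives the DDIM ODE the following advantages.

It incurs smaller truncation errors (the approximation error of the numerical solver).
Sampling becomes faster.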

Experiments

With a small number of iterations, DDIM not only produces better sample quality than DDPM but also speeds up sampling by roughly 10× to 50×.

$$\sigma_{\tau_i}(\eta) = \eta\sqrt{(1 - \overline{\alpha}_{\tau_{i-1}}) / (1 - \overline{\alpha}_{\tau_{i}})}\sqrt{1 - \overline{\alpha}_{\tau_{i}}/ \overline{\alpha}_{\tau_{i-1}}}$$

$\hat \sigma = \sqrt{1 - \overline{\alpha}_{\tau_{i}}/ \overline{\alpha}_{\tau_{i-1}}}$

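The $\eta = 1.0$ and $\hat \sigma$ settings both correspond to DDPM ($\hat \sigma$ being the variant with the larger standard deviation), and the smaller $\eta$ gets, the more deterministic sampling becomes. ($\eta = 0$ → DDIM)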
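As a small sanity check on these formulas, the sketch below evaluates $\sigma(\eta)$ and $\hat\sigma$ for one pair of noise levels and compares $\eta = 1$ with the DDPM posterior variance $\tilde\beta = \frac{1-\overline{\alpha}_{\tau_{i-1}}}{1-\overline{\alpha}_{\tau_i}}\beta$, where $\beta = 1 - \overline{\alpha}_{\tau_i}/\overline{\alpha}_{\tau_{i-1}}$. The function names and the $\overline{\alpha}$ values are assumptions.

```python
def sigma_eta(abar_t, abar_prev, eta):
    """Noise level sigma(eta) interpolating between DDIM (eta=0) and DDPM (eta=1)."""
    return eta * ((1.0 - abar_prev) / (1.0 - abar_t)) ** 0.5 * (1.0 - abar_t / abar_prev) ** 0.5

def sigma_hat(abar_t, abar_prev):
    """Larger-variance DDPM variant."""
    return (1.0 - abar_t / abar_prev) ** 0.5

abar_t, abar_prev = 0.36, 0.64          # hypothetical alpha_bar values
beta = 1.0 - abar_t / abar_prev         # 0.4375
beta_tilde = (1.0 - abar_prev) / (1.0 - abar_t) * beta

s0 = sigma_eta(abar_t, abar_prev, 0.0)  # DDIM: no noise
s1 = sigma_eta(abar_t, abar_prev, 1.0)  # DDPM posterior standard deviation
sh = sigma_hat(abar_t, abar_prev)       # sqrt(beta), never smaller than s1
```

Intermediate values such as `eta=0.5` give samplers between the two extremes.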

On both CIFAR10 and CelebA, with $\eta = 0$, DDIM achieves a lower (better) FID score than DDPM when the number of steps is between 10 and 100. It is also worth noting that even at 1000 steps the FID values of DDIM and DDPM do not differ much.

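This graph compares the time needed to generate 50K images on an Nvidia 2080 Ti GPU for different numbers of sampling steps. Sampling time grows roughly linearly with the number of steps, so using fewer steps makes DDIM sampling about 10× to 50× faster than the 1000-step baseline.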
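The speed-up comes from running the sampler on a sub-sequence $\tau$ of the original timesteps, so the number of network calls drops from $T$ to $S$. Below is a rough sketch with a linear schedule and a toy "oracle" noise predictor (optimal for a dataset whose only point is `c`); all names, the schedule, and the even spacing of `tau` are assumptions made for the example.

```python
def linear_alpha_bar(T, beta_1=1e-4, beta_T=0.02):
    """Cumulative products of (1 - beta_s) for an assumed linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for s in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * s / (T - 1))
        alpha_bar.append(prod)
    return alpha_bar

def make_tau(T, S):
    """Evenly spaced sub-sequence of {1, ..., T} with S <= T entries, ending at T."""
    return [round((i + 1) * T / S) for i in range(S)]

def ddim_sample(eps_model, x_T, alpha_bar, tau):
    """Deterministic (sigma = 0) sampling along tau; returns the sample and the call count."""
    x, calls = x_T, 0
    for i in range(len(tau) - 1, -1, -1):
        abar_t = alpha_bar[tau[i] - 1]
        abar_prev = alpha_bar[tau[i - 1] - 1] if i > 0 else 1.0   # alpha_bar_0 = 1
        eps = eps_model(x, tau[i])
        calls += 1
        x0_pred = (x - (1.0 - abar_t) ** 0.5 * eps) / abar_t ** 0.5
        x = abar_prev ** 0.5 * x0_pred + (1.0 - abar_prev) ** 0.5 * eps
    return x, calls

T, S, c = 1000, 50, 0.7
alpha_bar = linear_alpha_bar(T)

def oracle_eps(x_t, t):
    """Toy stand-in for a trained network: exact noise when every data point equals c."""
    a = alpha_bar[t - 1]
    return (x_t - a ** 0.5 * c) / (1.0 - a) ** 0.5

tau = make_tau(T, S)
x0_sample, calls = ddim_sample(oracle_eps, 1.3, alpha_bar, tau)
speedup = T / calls        # network evaluations saved relative to T steps
```

Since the cost is dominated by network evaluations, `S = 100` down to `S = 20` corresponds to the 10× to 50× range when `T = 1000`.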

Progressive distillation

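Progressive distillation is a method that distills the DDIM sampler. A “student” model is trained so that a single step of the student matches two adjacent sampling steps of the “teacher” model. After each distillation stage the “student” becomes the new “teacher”, and the procedure repeats.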

Algorithm

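Let's compare the algorithm of standard diffusion training with that of progressive distillation. Standard diffusion training samples data, a time, and noise, diffuses the data, and then optimizes the loss. Progressive distillation ① samples $t$ from a grid discretized into the “student” model's number of sampling steps, ② applies the diffusion step, and ③ trains the student toward a target $\hat x$ computed from two DDIM steps of the “teacher”. When a stage ends, the “student” is used as the new “teacher” and the number of sampling steps is halved.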
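The target in step ③ can be sketched as follows. This assumes an x-predicting teacher, a cosine schedule with $\alpha_t = \cos(\pi t/2)$ and $\sigma_t = \sin(\pi t/2)$ (so $\alpha_t$ plays the role of $\sqrt{\overline{\alpha}_t}$ in this post's notation), and continuous time $t \in (0, 1]$. The function names and the `tanh` teacher are placeholders, so treat it as an illustration of the idea rather than a reference implementation.

```python
import math

def alpha_sigma(t):
    """Assumed cosine schedule: signal and noise scales at time t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def ddim_update(z_t, x_pred, t, s):
    """Deterministic DDIM step from time t to an earlier time s, given an x-prediction."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    return a_s * x_pred + (s_s / s_t) * (z_t - a_t * x_pred)

def distillation_target(teacher, z_t, t, N):
    """x-target for which ONE student step t -> t - 1/N equals TWO teacher half-steps."""
    t_mid, t_next = t - 0.5 / N, t - 1.0 / N
    z_mid = ddim_update(z_t, teacher(z_t, t), t, t_mid)
    z_next = ddim_update(z_mid, teacher(z_mid, t_mid), t_mid, t_next)
    a_t, s_t = alpha_sigma(t)
    a_n, s_n = alpha_sigma(t_next)
    ratio = s_n / s_t
    return (z_next - ratio * z_t) / (a_n - ratio * a_t), z_next

teacher = lambda z, t: math.tanh(z)     # toy stand-in for a trained teacher network
N, t, z_t = 4, 0.75, 0.3                # student uses N steps; t lies on the grid i / N
x_target, z_two_steps = distillation_target(teacher, z_t, t, N)
next_N = N // 2                         # steps are halved for the next stage
```

Training then regresses the student's x-prediction at `(z_t, t)` onto `x_target`.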

FID

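Two broad approaches are used to compare images when evaluating what a generative model produces: pixel distance and feature distance. Pixel distance measures distance directly on the pixel values of two images. It is very simple, but for that reason not very reliable. Feature distance passes the images through an Inception model, computes the mean and covariance of the extracted features, and measures the distance between the two resulting distributions. FID (Fréchet Inception Distance) is of this kind: it compares how similar the real data distribution and the generated (fake) data distribution are, with lower values meaning more similar, and it is still a widely used image evaluation metric today.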
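Given feature means and covariances, the distance itself is the Fréchet distance between two Gaussians, $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2})$. The sketch below computes it with NumPy only; the random arrays stand in for Inception features, and the function names are assumptions.

```python
import numpy as np

def feature_stats(feats):
    """Mean and covariance of a (num_images, feature_dim) array of features."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) from the eigenvalues of the product, which are
    # real and non-negative for positive semi-definite inputs (up to round-off).
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Stand-in "features": in practice these would come from an Inception network.
rng = np.random.default_rng(0)
real = rng.standard_normal((5000, 8))
fake = rng.standard_normal((5000, 8)) + 0.5      # shifted distribution

mu_r, cov_r = feature_stats(real)
mu_f, cov_f = feature_stats(fake)
d_same = frechet_distance(mu_r, cov_r, mu_r, cov_r)
d_diff = frechet_distance(mu_r, cov_r, mu_f, cov_f)
d_shift = frechet_distance(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), np.eye(2))
```

Identical statistics give a distance of about zero, and a pure mean shift with identity covariances gives the squared length of the shift.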