DDIM (Denoising Diffusion Implicit Models)

김상윤 · June 25, 2024


Abstract

DDPMs achieve great performance in image generation, yet they require many steps of a Markov chain to produce a sample, which makes sampling very slow compared to GANs. DDIM provides a more efficient way to sample that accelerates generation while keeping the training procedure the same. The idea is to generalize the DDPM forward process to a non-Markovian diffusion process that admits the same training objective. This non-Markovian process can correspond to a deterministic generative process, giving rise to implicit models that produce high-quality samples much faster. DDIM also enables semantically meaningful image interpolation directly in the latent space and reconstructs observations with very low error.

  • Deterministic process: in DDIM, the generative process can be made deterministic, meaning that given a specific initial noise, the sequence of steps taken to generate a sample is fixed and involves no randomness.
  • Good reconstruction: DDIM can map an observation to a latent code and reconstruct it with very low error, which is possible because of the deterministic and flexible nature of its reverse process.

Introduction

Existing Generative Models

| GAN | DDPM & NCSN |
| --- | --- |
| Better sample quality than VAE and Flow models | Great sample quality without adversarial training |
| Very hard to optimize because training is very unstable | Use many denoising autoencoding models trained to denoise samples corrupted by various levels of Gaussian noise |
| Mode collapse issue (limited modes) | Require many iterations to produce a high-quality sample |

DDIM

To bridge the efficiency gap between GANs and DDPMs, the authors present DDIM. DDIM is an implicit model like a GAN, yet closely related to DDPM in that they share the same training objective. By generalizing the forward (diffusion) process to a non-Markovian process, it is possible to design a training objective that turns out to be exactly the same as the DDPM objective. The advantage of the non-Markovian process is that a large variety of generative models sharing the same trained neural network become available, simply by choosing a different pair of (non-Markovian diffusion process, reverse generative Markov chain). In particular, using a non-Markovian diffusion process makes it feasible to use a much shorter reverse generative Markov chain.

The benefits of DDIM over DDPM are as follows:

  • Sample quality remains superior to DDPM under 10×–100× accelerated sampling.
  • The consistency property holds better: starting from the same latent variable, samples generated with chains of different lengths share similar high-level features.
  • This consistency allows semantically meaningful image interpolation by manipulating the initial latent variables.

Background

The parameters $\theta$ are learned to fit the data distribution $q(x_0)$ by maximizing a variational lower bound:

$$L_\gamma(\epsilon_\theta):=\sum_{t=1}^{T}\gamma_t\,\mathbb{E}_{x_0\sim q(x_0),\,\epsilon_t\sim\mathcal{N}(0,I)}\Big[\big\Vert\epsilon_\theta^{(t)}\big(\sqrt{\alpha_t}\,x_0+\sqrt{1-\alpha_t}\,\epsilon_t\big)-\epsilon_t\big\Vert_2^2\Big]$$
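As a minimal sketch (not the authors' code) of how this objective is typically implemented, assuming a hypothetical noise-prediction network `eps_model(x_t, t)` and a precomputed 1-D tensor `alpha_bar` holding the $\alpha_t$ values, with all $\gamma_t = 1$ as in the simplified DDPM loss:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, alpha_bar):
    """Simplified L_gamma with gamma_t = 1: MSE between the true and predicted noise."""
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # sample a timestep per example
    a = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))      # broadcast alpha_t to x0's shape
    eps = torch.randn_like(x0)                              # epsilon_t ~ N(0, I)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps              # sample x_t from q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t), eps)               # || eps_theta(x_t, t) - eps ||^2
```

Note that $x_t$ is formed using only the marginal $q(x_t|x_0)$, which is exactly the property exploited in the next section.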

Variational Inference for Non-Markovian Forward Processes

Our observation is that the DDPM objective in the form of $L_\gamma$ depends only on the marginal distribution $q(x_t|x_0)$, but not directly on the joint distribution $q(x_{1:T}|x_0)$.
Explanation

  1. Dependency on the marginal distribution:
  • The DDPM objective focuses on how well the model can denoise each noisy data point $x_t$ constructed from the original data $x_0$.
  • This means that at each time step $t$, the objective is concerned with $q(x_t|x_0)$, which summarizes the effect of the noise added up to that step.
  2. Not directly on the joint distribution:
  • The joint distribution $q(x_{1:T}|x_0)$ captures the entire sequence of noisy data points, but the DDPM objective does not require modeling the correlations between all these points directly.
  • By focusing on the marginals instead, the model can simplify the training objective and avoid the complexity of dealing with the full joint distribution.

By depending only on the marginals, the DDPM objective simplifies the learning process. The model needs to learn how to handle the noise at each step without explicitly considering the dependencies between all steps.

Non-Markovian Forward Processes


The mean function is chosen in order to satisfy the condition $q_\sigma(x_t|x_0)=\mathcal{N}(\sqrt{\alpha_t}\,x_0,(1-\alpha_t)I)$ for all $t$, so that it defines a joint inference distribution that matches the desired marginals. The forward process here is no longer Markovian, since each $x_t$ can depend on both $x_{t-1}$ and $x_0$. The magnitude of $\sigma$ controls how stochastic the forward process is; when $\sigma \to 0$ there is no stochasticity, and we reach an extreme case where, as long as we observe $x_0$ and $x_t$ for some $t$, $x_{t-1}$ becomes known and fixed.
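For reference, the family of inference distributions being described is defined in the DDIM paper (Eq. 7) as:

$$q_\sigma(x_{t-1}|x_t,x_0)=\mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\,x_0+\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t-\sqrt{\alpha_t}\,x_0}{\sqrt{1-\alpha_t}},\;\sigma_t^2 I\right)$$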

It can be proven that the Markovian forward process $q$ and the non-Markovian forward process $q_\sigma$ have the same marginals $q(x_t|x_0)$.

Generative Processes and unified variational inference objective
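The objective referred to below is defined in the DDIM paper (Eq. 11) as:

$$J_\sigma(\epsilon_\theta):=\mathbb{E}_{x_{0:T}\sim q_\sigma(x_{0:T})}\big[\log q_\sigma(x_{1:T}|x_0)-\log p_\theta(x_{0:T})\big]$$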


Above is the variational inference objective used for DDIM; the only difference from DDPM is the non-Markovian forward process $q_\sigma$. The loss can be derived in the same way as in DDPM, by decomposing the objective into a sum of KL divergence terms.

As in DDPM, the first (prior) term contains no trainable parameters and is not of interest; organizing the remaining terms with respect to the reverse process $p_\theta$, each term becomes a KL divergence between $q_\sigma(x_{t-1}|x_t,x_0)$ and $p_\theta^{(t)}(x_{t-1}|x_t)$.

The term $x_0(x_t,\epsilon_\theta)$ is obtained from the relation $x_0=\big(x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)\big)/\sqrt{\bar{\alpha}_t}$; the generative process $p_\theta^{(t)}(x_{t-1}|x_t)$ is then defined by substituting this estimate of $x_0$ into $q_\sigma(x_{t-1}|x_t,x_0)$.
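A one-function sketch of this relation (hypothetical names; `alpha_bar_t` is $\bar{\alpha}_t$ as a scalar tensor):

```python
import torch

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate x_0 from x_t and the predicted noise eps_theta^(t)(x_t)."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```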

Sampling from generalized generative processes
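The sampling equation discussed below is Eq. (12) of the DDIM paper (in its notation, $\alpha_t$ denotes the cumulative product, i.e. DDPM's $\bar{\alpha}_t$):

$$x_{t-1}=\sqrt{\alpha_{t-1}}\underbrace{\left(\frac{x_t-\sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right)}_{\text{predicted }x_0}+\underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(x_t)}_{\text{direction pointing to }x_t}+\underbrace{\sigma_t\epsilon_t}_{\text{random noise}}$$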


Looking at the equation, the first term depicts predicting $x_0$ by removing the predicted noise from $x_t$ and scaling by $\sqrt{\frac{\alpha_{t-1}}{\alpha_t}}$, while the second term adds predicted noise back in the direction pointing to $x_t$. In short, each step forms an estimate of the original $x_0$ and then adds predicted noise to that estimate, so it is not necessary to visit every timestep; we can instead use only a subset of the process. For the time sequence $[1,2,\dots,T]$ and a sub-sequence of timesteps $[\tau_1,\tau_2,\dots,\tau_S]$, we can define the non-Markovian forward process $q(x_{\tau_i}|x_0)=\mathcal{N}(\sqrt{\alpha_{\tau_i}}\,x_0,(1-\alpha_{\tau_i})I)$. Since $x_{\tau_i}$ no longer depends on the previous state, there is no dependency on a contiguous sequence of timesteps.
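A minimal sketch of this accelerated sampler, assuming a trained noise-prediction network `eps_model(x, t)` and a 1-D tensor `alpha_bar` of cumulative $\alpha$ values (hypothetical names). With `eta = 0` the noise term vanishes ($\sigma_{\tau_i}=0$) and sampling is fully deterministic; `eta = 1` recovers DDPM-like stochasticity via the paper's $\sigma$ parameterization:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, alpha_bar, shape, taus, eta=0.0, device="cpu"):
    """Sample along a sub-sequence `taus` (ascending, e.g. 50 of T=1000 steps)."""
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    taus = list(taus)
    for i in reversed(range(1, len(taus))):
        t, t_prev = taus[i], taus[i - 1]
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)                               # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # "predicted x_0" term
        sigma = eta * ((1 - a_prev) / (1 - a_t)).sqrt() * (1 - a_t / a_prev).sqrt()
        dir_xt = (1 - a_prev - sigma ** 2).sqrt() * eps           # "direction pointing to x_t"
        noise = sigma * torch.randn_like(x) if eta > 0 else 0.0   # absent when eta = 0
        x = a_prev.sqrt() * x0_pred + dir_xt + noise
    # last transition: return the x_0 prediction at the first kept timestep
    t_batch = torch.full((shape[0],), taus[0], device=device, dtype=torch.long)
    eps = eps_model(x, t_batch)
    return (x - (1 - alpha_bar[taus[0]]).sqrt() * eps) / alpha_bar[taus[0]].sqrt()
```

For example, `ddim_sample(eps_model, alpha_bar, (16, 3, 32, 32), taus=range(0, 1000, 20))` would correspond to 50-step sampling of a model trained with $T=1000$.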

Reference

Derivation details: https://junia3.github.io/blog/ddim
DDIM paper: https://arxiv.org/abs/2010.02502
