Denoising Diffusion Probabilistic Models (DDPM)

김상윤 · May 22, 2024

Paper Reading

Abstract

Diffusion probabilistic models enable high-quality image synthesis. The best results are obtained by training on a weighted variational bound. This bound is designed according to a connection between (1) diffusion probabilistic models and (2) denoising score matching with Langevin dynamics. The proposed model also admits a progressive lossy decompression scheme, which can be interpreted as a generalization of autoregressive decoding. The implementation is available on GitHub.

Prerequisite

What is Diffusion?

The term diffusion comes from thermodynamics, where particles or molecules spontaneously move from regions of high concentration to regions of low concentration. If we liken an image to a liquid, its pixels act like atoms. We replicate the physical diffusion process with noise: adding noise to an image plays the role of diffusion. By gradually adding noise over successive steps we destroy the information in the image. This is called the Forward Diffusion Process. The second part of the task is to restore the image by gradually denoising it. This is called the Backward Diffusion Process.

The forward process is fundamentally a transformation that maps a complex distribution to a predefined prior distribution.

x_0 \sim p_{complex} \implies \Tau(x_0) \sim p_{prior}

[Figure: the data points on a complex distribution are mapped to points on a prior distribution.]
[Figure: a high-level conceptual overview of the entire image space.]

In DDPM the prior distribution is defined to be Gaussian and the transformation mechanism is assumed to be a Markov chain, meaning that the current state depends only on the immediately preceding state and the transition probability between states. Let's start with the notation.

X : random variable for an image, distributed according to the complex probability distribution P(x)
- so X=x for some image x from the set of possible images.
N : number of pixels in an image (H×W pixels).
- Then X is a collection of N random variables, i.e. X={ $v_1, v_2 \space ... \space v_N$ } and $X\in \mathbb{R}^{N}$ for an N-pixel image.

Markov Chains and what it means to have them in DDPMs

Let the image state at timestep k be represented as $X_k$. By the Markov property, the state at the current timestep is conditioned only on the state at the previous timestep, so the forward path can be written as follows.

P(X_i = x_i|X_1 = x_1, X_2 = x_2,...,X_{i-1} = x_{i-1}) = P(X_i = x_i|X_{i-1} = x_{i-1})

In the context of the diffusion process, T is the number of timesteps needed to convert the input image into pure Gaussian noise. If the backward process is exactly the reverse of the forward process, it is represented as follows.

P(X_{i-1} = x_{i-1}|X_{i} = x_{i})

Deep dive into DDPM

Forward Step

A diffusion model can be understood as a latent variable model that takes an input image and maps it to the latent space using a fixed forward diffusion process q. The process q is a Markov chain whose goal is to add noise progressively, yielding the approximate posterior $q(x_{1:T}|x_0)$, where each $x_t$ is a latent variable with the same dimensionality as the input $x_0$.

The whole noising process is the joint distribution of a Markov chain that gradually adds Gaussian noise: $q(x_{1:T}|x_0)=\prod_{t=1}^{T}q(x_t|x_{t-1})$ with $q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_tI)$. The interesting part of the forward process is that the variance $\beta_{t}I$ is a hyperparameter, not a trainable parameter. The mean and variance take this particular form because scaling the mean by $\sqrt{1-\beta_t}$ keeps the total variance bounded, so that after $T$ steps $q(x_T|x_0) \approx \mathcal{N}(0,I)$ when each $\beta_t$ is small and $T$ is large.
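The forward chain can be simulated directly to see the convergence to Gaussian noise. The sketch below is only an illustration under stated assumptions: it uses a linear schedule from 1e-4 to 0.02 over T=1000 steps, a toy "image" stored as a plain list of 1000 identical pixel values, and hypothetical helper names (`linear_beta_schedule`, `forward_step`, `forward_chain`).

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (fixed hyperparameters, not trained)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def forward_step(x_prev, beta, rng):
    """One draw from q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta * I)."""
    scale = math.sqrt(1.0 - beta)
    sigma = math.sqrt(beta)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]

def forward_chain(x0, betas, rng):
    """Apply the Markov chain x_0 -> x_1 -> ... -> x_T and return x_T."""
    x = list(x0)
    for beta in betas:
        x = forward_step(x, beta, rng)
    return x

rng = random.Random(0)
betas = linear_beta_schedule(T=1000)
x0 = [3.0] * 1000  # toy "image": 1000 pixels, all equal to 3.0 (an assumption)
xT = forward_chain(x0, betas, rng)

# After T steps the pixels should look like samples from N(0, 1),
# regardless of the starting value 3.0.
mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
print(f"mean={mean:.3f}, var={var:.3f}")
```

Because each step shrinks the signal by $\sqrt{1-\beta_t}$ while adding variance $\beta_t$, the sample mean should end up close to 0 and the sample variance close to 1.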

Backward Step

In the backward diffusion process, the model tries to learn the reverse denoising process that recovers the original form of the input.

p_{\theta}(x_{0:T}):=p(x_T)\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_t) \quad \text{where} \quad p_{\theta}(x_{t-1}|x_t):=\mathcal{N}(x_{t-1};\mu_\theta(x_t,t),\varSigma_\theta(x_t,t))

As mentioned earlier, each $\beta_t$ is very small. This implies that the true reverse transition $q(x_{t-1}|x_t)$ is also approximately Gaussian, so its learned estimate $p_\theta(x_{t-1}|x_t)$ can be chosen to be Gaussian as well, with a parameterized mean and variance:

p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1};\mu_{\theta}(x_t, t),\varSigma_{\theta}(x_t,t))

starting from pure Gaussian noise: $p(x_T)=\mathcal{N}(x_T;0,I)$ . By the chain rule, the whole denoising process $p_{\theta}(x_{0:T})$ factorizes as $p(x_T,x_{T-1},...,x_0)=p(x_T)\,p(x_{T-1}|x_T)\,p(x_{T-2}|x_{T-1},x_T)\cdots p(x_0|x_1,x_2,...,x_T)$, which by the Markov assumption reduces to $p(x_T)\,p(x_{T-1}|x_T)\cdots p(x_0|x_1)$. Finally this chain can be written compactly in product form ( $\prod$ ).

Evidence lower Bound

The objective of the generative model is to model the probability distribution of the data, $p_{complex}$ or $q(x_0)$ . This distribution over the original data space is intractable, which leads us to learn an approximate distribution $p_\theta(x_0)$ instead:

\begin{aligned} p_\theta(x_0)&=\int p_\theta(x_{0:T})dx_{1:T} \\ &=\int p_\theta(x_{0:T})dx_{1:T}\frac{q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)} \\ &=\int q(x_{1:T}|x_0)dx_{1:T}\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack \end{aligned}

To maximize the likelihood $p_\theta(x_0)$ , we minimize its negative log likelihood (NLL). Since $-\log$ is convex, Jensen's inequality lets us move it inside the expectation, which bounds the NLL from above as follows.

\begin{aligned} -\log p_\theta(x_0) = -\log \mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack &\le \mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-\log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack\\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-\log p(x_T) - \sum_{t\ge1} \log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\bigg\rbrack \end{aligned}

Minimizing the right-hand side minimizes an upper bound on the negative log likelihood (equivalently, it maximizes the evidence lower bound). Thus the training objective is:

\mathcal{L}=\mathbb{E}_q\bigg\lbrack-log\space p(x_T) \space - \space \sum_{t\ge1} log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\bigg\rbrack
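The bound can be checked numerically on a toy case where the exact likelihood is known. The sketch below is an illustration only and assumes a single diffusion step (T=1), scalar data, a prior $p(x_1)=\mathcal{N}(0,1)$, a linear-Gaussian reverse step $p_\theta(x_0|x_1)=\mathcal{N}(a x_1, s^2)$ and a forward step $q(x_1|x_0)=\mathcal{N}(\sqrt{1-\beta}x_0,\beta)$. The values of `a`, `s2`, `beta` and `x0` are hypothetical and not taken from the paper.

```python
import math
import random

def log_normal(x, mean, var):
    """Log density of a scalar Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical toy parameters (assumptions, chosen only for illustration).
a, s2, beta = 0.5, 1.0, 0.5
x0 = 1.0

# Exact NLL: integrating x1 out of p(x1) * p_theta(x0 | x1) gives N(0, a^2 + s2).
exact_nll = -log_normal(x0, 0.0, a * a + s2)

# Monte Carlo estimate of E_q[ -log( p_theta(x0, x1) / q(x1 | x0) ) ].
rng = random.Random(0)
n = 20000
total = 0.0
for _ in range(n):
    x1 = math.sqrt(1.0 - beta) * x0 + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    log_joint = log_normal(x1, 0.0, 1.0) + log_normal(x0, a * x1, s2)
    log_q = log_normal(x1, math.sqrt(1.0 - beta) * x0, beta)
    total += -(log_joint - log_q)
bound = total / n

print(f"exact NLL = {exact_nll:.4f}, variational bound = {bound:.4f}")
```

The estimated bound should sit above the exact NLL. The gap between them is the KL divergence between $q(x_1|x_0)$ and the true model posterior $p_\theta(x_1|x_0)$, which is why the bound tightens as the reverse model matches the forward process.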

During generation we need the reverse transition from $x_t$ to $x_{t-1}$, but the true conditional distribution $q(x_{t-1}|x_t)$ is unknown. By Bayes' rule,

q(x_{t-1}|x_t) = \frac{q(x_t|x_{t-1})q(x_{t-1})}{q(x_t)} \quad \text{and} \quad q(x_t) = \int q(x_t|x_{t-1})q(x_{t-1})dx_{t-1}

The marginal $q(x_t)$ is intractable because the marginal at each timestep, and therefore the integral over $q(x_t|x_{t-1})q(x_{t-1})$, depends on the entire data distribution over all possible images. Instead we train a neural network to learn a distribution $p_\theta(x_{t-1}|x_t)$ that approximates $q(x_{t-1}|x_t)$ . The distance between p and q is measured with the KL divergence.

-\log p_\theta(x_0) \le\mathcal{L}=-\mathbb{E}_{q(x_{1:T}|x_0)}\lbrack \log p_\theta(x_0|x_{1:T})\rbrack + D_{KL}(q(x_{1:T}|x_0)\,\|\,p_\theta(x_{1:T}))

KL divergence can be written as

D_{KL}(q(x_{1:T}|x_0)\,\|\,p_\theta(x_{1:T})) = \int q(x_{1:T}|x_0)\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T})}dx_{1:T}

which can also be formulated as an expectation under q

\mathbb{E}_{x\sim q}\bigg\lbrack \log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T})}\bigg\rbrack = \mathbb{E}_{x\sim q}\bigg\lbrack -\log\frac{p_\theta(x_{1:T})}{q(x_{1:T}|x_0)}\bigg\rbrack

\begin{aligned} -\log p_\theta(x_0) &\le\mathcal{L}=-~\mathbb{E}_{q(x_{1:T}|x_0)}\lbrack \log p_\theta(x_0|x_{1:T})\rbrack + \mathbb{E}_{x\sim q}\bigg\lbrack \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T})}\bigg\rbrack \\ &= -~\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack \log p_\theta(x_0|x_{1:T})~+~\log\frac{p_\theta(x_{1:T})}{q(x_{1:T}|x_0)}\bigg\rbrack\\ &= -~\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack \log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack \end{aligned}

Computing the loss

\begin{aligned} -~\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack \log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack &= \mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-\log p(x_T)+\log\frac {q(x_{T}|x_{T-1})}{p_\theta(x_{T-1}|x_T)}+...+\log\frac{q(x_{1}|x_0)}{p_\theta(x_0|x_1)}\bigg\rbrack\\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-\log p(x_T)-\sum_{t\ge1}\log\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t}|x_{t-1})}\bigg\rbrack \end{aligned}

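After some simplification, the bound can be rewritten as a sum of KL divergence terms, labeled $L_T$, $L_{t-1}$ and $L_0$:

\mathcal{L}=\mathbb{E}_q\bigg\lbrack \underbrace{D_{KL}(q(x_T|x_0)\,\|\,p(x_T))}_{L_T}+\sum_{t>1}\underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t))}_{L_{t-1}}\underbrace{-\log p_\theta(x_0|x_1)}_{L_0}\bigg\rbrack

The following terms were ignored: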

$L_0$ – The authors got better results without this.
$L_T$ – This is the “KL divergence” between the distribution of the final latent in the forward process and the first latent in the reverse process. However, there are no neural network parameters involved here, so we can’t do anything about it except define a good variance scheduler and use large timesteps such that they both represent an Isotropic Gaussian distribution.

$L_{t-1}$ is the only loss term left. It is a KL divergence between the “posterior” of the forward process (conditioned on $x_t$ and the initial sample $x_0$) and the parameterized reverse diffusion process. Both are Gaussian distributions.

q(x_{t-1}|x_{t}, x_0) = \mathcal{N}(x_{t-1};\mu_q(x_t, x_0),\Sigma_q(t))~~\text{where}~~\Sigma_q(t)=\frac{\beta_{t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}I,~\mu_q(x_t, x_0)=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})x_t+\beta_t\sqrt{\bar{\alpha}_{t-1}}x_0}{1-\bar{\alpha}_t}

Here $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$.
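The posterior formula can be sanity-checked against Bayes' rule. With $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s\le t}\alpha_s$, the posterior is proportional to $q(x_t|x_{t-1})\,q(x_{t-1}|x_0)$, a product of two Gaussians in $x_{t-1}$ whose precisions add. The sketch below compares both routes for a scalar pixel. The three-step schedule and the function names are hypothetical, chosen only for illustration.

```python
import math

def alpha_bars_from(betas):
    """Cumulative products of alpha_s = 1 - beta_s; index i holds timestep i + 1."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def posterior_closed_form(x_t, x_0, t, betas, abar):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form expression (t >= 2)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_t, ab_prev = abar[t - 1], abar[t - 2]
    mean = (math.sqrt(alpha_t) * (1.0 - ab_prev) * x_t
            + beta_t * math.sqrt(ab_prev) * x_0) / (1.0 - ab_t)
    var = beta_t * (1.0 - ab_prev) / (1.0 - ab_t)
    return mean, var

def posterior_by_bayes(x_t, x_0, t, betas, abar):
    """Same quantity obtained by multiplying q(x_t | x_{t-1}) and q(x_{t-1} | x_0)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_prev = abar[t - 2]
    precision = alpha_t / beta_t + 1.0 / (1.0 - ab_prev)  # precisions add
    var = 1.0 / precision
    mean = var * (math.sqrt(alpha_t) * x_t / beta_t
                  + math.sqrt(ab_prev) * x_0 / (1.0 - ab_prev))
    return mean, var

betas = [0.1, 0.2, 0.3]          # hypothetical toy schedule
abar = alpha_bars_from(betas)
mean_a, var_a = posterior_closed_form(x_t=0.4, x_0=0.8, t=3, betas=betas, abar=abar)
mean_b, var_b = posterior_by_bayes(x_t=0.4, x_0=0.8, t=3, betas=betas, abar=abar)
print(mean_a, mean_b, var_a, var_b)
```

Both routes should agree up to floating-point error, and the posterior variance is always smaller than $\beta_t$ because knowing $x_0$ removes uncertainty.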

To simplify the computation, DDPM does not learn the variance of the reverse process $p_\theta$ but fixes it to a time-dependent constant $\sigma_t^2I$ (for example $\sigma_t^2=\beta_t$ ). With the variances fixed, all that remains is to match the means.
Because the variance is held fixed, minimizing the KL divergence reduces to minimizing the squared distance between the means ($\mu$) of the two Gaussian distributions q and p. Writing $\tilde{\mu}_t(x_t,x_0)$ for the posterior mean $\mu_q(x_t,x_0)$, the loss function at each timestep is

L_{t-1}=E_q\bigg\lbrack \frac1{2\sigma_t^2}\Vert \tilde{\mu}_t(x_t,x_0)-\mu_\theta(x_t,t)\Vert^2\bigg\rbrack+C
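The reduction from a KL divergence to a scaled squared distance can be verified numerically. The sketch below evaluates the general KL divergence between two diagonal Gaussians and compares it with $\frac{1}{2\sigma^2}\Vert\mu_q-\mu_p\Vert^2$ when both share the same variance, in which case the constant $C$ is zero. The mean vectors and the variance are hypothetical values, and the function names are not from the paper.

```python
import math

def kl_diag_gaussians(mean_q, var_q, mean_p, var_p):
    """KL( N(mean_q, diag(var_q)) || N(mean_p, diag(var_p)) ), summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mean_q, var_q, mean_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total

def scaled_squared_distance(mean_q, mean_p, sigma2):
    """The shortcut 1 / (2 * sigma^2) * ||mean_q - mean_p||^2."""
    return sum((mq - mp) ** 2 for mq, mp in zip(mean_q, mean_p)) / (2.0 * sigma2)

# Hypothetical 3-pixel example: posterior mean vs. predicted mean.
mu_tilde = [0.3, -1.2, 0.7]
mu_theta = [0.1, -1.0, 1.5]
sigma2 = 0.02

kl = kl_diag_gaussians(mu_tilde, [sigma2] * 3, mu_theta, [sigma2] * 3)
shortcut = scaled_squared_distance(mu_tilde, mu_theta, sigma2)
print(kl, shortcut)
```

If the two variances differ but are both fixed, the extra log and ratio terms do not depend on $\theta$, which is what the constant $C$ absorbs.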

There are three possible approaches:

1. Directly predict $x_0$ and use it in the posterior function to find $\tilde{\mu}$
2. Predict the entire $\tilde{\mu}$ term
3. Predict the noise at each timestep, writing the $x_0$ in $\tilde{\mu}$ in terms of $x_t$

We will be applying approach 3.

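Using the forward process, $x_t$ is obtained by adding noise $\epsilon$ to the base image $x_0$. The t noising steps collapse into the closed form $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$, $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$. Solving this for $x_0$, substituting into the posterior mean and dropping the timestep-dependent weight gives the simplified loss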

\mathcal{L}_{simple}= \Vert\epsilon-\epsilon_\theta(x_t,t)\Vert^2

In the end, training the model reduces to training a noise estimator $\epsilon_\theta(x_t,t)$ .
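To see why predicting the noise is enough, note that the closed form $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$ (with $\alpha_t=1-\beta_t$, $\bar{\alpha}_t=\prod_{s\le t}\alpha_s$) lets us replace $x_0$ in the posterior mean, which gives $\frac{1}{\sqrt{\alpha_t}}\big(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon\big)$. The sketch below checks this identity for one scalar pixel. The schedule, timestep and values are hypothetical.

```python
import math

betas = [0.1, 0.2, 0.3]                 # hypothetical toy schedule
alphas = [1.0 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

t = 3                                    # 1-based timestep
x_0, eps = 0.8, -0.5                     # one scalar "pixel" and one noise draw
beta_t, alpha_t = betas[t - 1], alphas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

# Closed-form forward sample x_t given x_0 and eps.
x_t = math.sqrt(ab_t) * x_0 + math.sqrt(1.0 - ab_t) * eps

# Posterior mean written with x_0 (needs the clean image).
mean_from_x0 = (math.sqrt(alpha_t) * (1.0 - ab_prev) * x_t
                + beta_t * math.sqrt(ab_prev) * x_0) / (1.0 - ab_t)

# Same mean written with eps (needs only x_t and the noise).
mean_from_eps = (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)

print(mean_from_x0, mean_from_eps)
```

Since the two expressions coincide, a network that predicts $\epsilon$ from $(x_t, t)$ determines the reverse-process mean without ever seeing $x_0$ at sampling time.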

Training and Inference

For training, sample a random timestep and train the network to predict the noise that was added at that timestep. At inference, run the model through all T iterations, starting from a noisy image $x_T$ sampled from the normal distribution. For every timestep t > 1, noise sampled from the normal distribution is added to the denoised estimate computed from $x_t$ to form the sample $x_{t-1}$. This additive noise is needed to approximate the distribution of $x_{t-1}$ and gives the DDPM model some diversity in its outputs.
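The two procedures can be sketched end to end in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a linear schedule with a toy T=50, the choice $\sigma_t^2=\beta_t$, and images stored as plain lists of floats. In place of a trained network it uses a hypothetical "oracle" noise predictor that is only available because the toy dataset consists of a single known image `x0_star`.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for a linear schedule; index i holds timestep i + 1."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def make_oracle_eps_model(x0_star, alpha_bars):
    """Hypothetical stand-in for a trained eps_theta: exact when all data equals x0_star."""
    def eps_model(x_t, t):
        ab = alpha_bars[t - 1]
        return [(v - math.sqrt(ab) * c) / math.sqrt(1.0 - ab)
                for v, c in zip(x_t, x0_star)]
    return eps_model

def training_loss(x0, eps_model, alpha_bars, rng):
    """One training draw: random timestep, closed-form x_t, squared error on the noise."""
    t = rng.randint(1, len(alpha_bars))            # uniform over 1..T (inclusive)
    ab = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * e for v, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

def sample(eps_model, n_pixels, betas, alpha_bars, rng):
    """Inference: start from x_T ~ N(0, I) and denoise for t = T, ..., 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]
    for t in range(len(betas), 0, -1):
        beta_t = betas[t - 1]
        ab_t = alpha_bars[t - 1]
        eps_hat = eps_model(x, t)
        coef = beta_t / math.sqrt(1.0 - ab_t)
        mean = [(v - coef * e) / math.sqrt(1.0 - beta_t) for v, e in zip(x, eps_hat)]
        if t > 1:
            sigma = math.sqrt(beta_t)              # assumes sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean                               # no noise is added at the last step
    return x

rng = random.Random(0)
betas, alpha_bars = make_schedule(T=50)
x0_star = [0.5, -0.25, 1.0, 0.0]                   # toy single-image dataset
oracle = make_oracle_eps_model(x0_star, alpha_bars)
loss = training_loss(x0_star, oracle, alpha_bars, rng)
generated = sample(oracle, len(x0_star), betas, alpha_bars, rng)
print(loss, generated)
```

With the oracle predictor the loss is zero up to floating-point error and the sampler returns `x0_star`, which confirms that the update rule inverts the forward process when the noise estimate is perfect. A real model would replace `eps_model` with a neural network trained by gradient descent on this loss over many images and timesteps.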

Acknowledgement

This blog post is based on the paper "Denoising Diffusion Probabilistic Models" by Jonathan Ho, Ajay Jain and Pieter Abbeel, presented at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). arXiv paper link
