Denoising Diffusion Probabilistic Models (DDPM)

κΉ€μƒμœ€Β·2024λ…„ 5μ›” 22일

Paper Reading

λͺ©λ‘ 보기
1/2

Abstract

By using Diffusion Probabilistic models, high quality image synthesis is made possible. The best performing method is training on weighted variantional bound. The variantional bound comes from connecting 1. [Diffusion Probabilistic model] and 2. [denoising score matching through Langevin dynamics]. The proposed model is a lossy decompression in a progressive manner and it can be interpreted as a generalization of autoregressive decoding. The implementation is available at github.

Prerequisite

What is Diffusion?

Diffusion came from the term used in Thermodynamics, where particle or molecules spontaneously move from high concentration to low concentration. Likening an image to a liquid, its pixels act like an atom. We repricate the physical diffusion process with noise. Adding noise to the image is like the diffusion process. By gradually adding noise at successive individual steps we are destorying the information. And we call this Forward Diffusion Process. The second part of the task is to restore the image by gradually denoising it. This denoising is called the Backward Diffusion Process.

Forward Process is fundamentally a process where a transform function transforms a complex distribution to a predefined prior distribution.

x0∼pcomplexβ€…β€ŠβŸΉβ€…β€ŠT(x0)∼ppriorx_0 \sim p_{complex} \implies \Tau(x_0) \sim p_{prior}
The data points on a complex distribution are mapped to a point on a prior distribution.A high-level conceptual overview of the entire image space.

In DDPM the prior distribution is defined to follow a Gaussian distribution and the transformation mechanism is assumed to be a Markov Chain Process, meaning that the current state only depends on the one step previous state and the transition probability between states. Lets start with notations first.

  • X : Random variable for an image distributed according to probability distribution complex P(x)
    • so X=x for some image from the possible set of Images in X.
  • n : Number of pixels in a image(HxW of pixels).
    • Then X is a set of N random variables i.e, X={v1,v2Β ...Β vnv_1, v_2 \space ... \space v_n} and X∈RNX\in R^{N} for N pixel image.

Markov Chains and what it means to have them in DDPMs

Given that a image state at timestep K is represented as Xk, the forward path can be written as follows. Thus by the property of Markov process, the output of the current time step is conditioned is only on t-1 timestep.

P(Xi=xi∣X1=x1,X2=x2,...,Xiβˆ’1=xiβˆ’1)=P(Xi=xi∣Xiβˆ’1=xiβˆ’1)P(X_i = x_i|X_1 = x_1, X_2 = x_2,...,X_{i-1} = x_{i-1}) = P(X_i = x_i|X_{i-1} = x_{i-1})

In the context of diffusion process, the number of timesteps T is the number of steps needed to convert input image into a pure gaussian noise. If the Backward process is exactly the opposite of the forward process, Then it would be represented as follows.

P(Xiβˆ’1=xiβˆ’1∣Xi=xi)P(X_{i-1} = x_{i-1}|X_{i} = x_{i})

Deep dive into DDPM

Forward Step

Diffusion model can be understood as a lantent variable model which get an input image and maps it to the latent space using fixed forward diffusion process q. q process is a Markov chain and the goal of the forward q process is to add noise progressively to get an approximate posterior q(x1:T|x0) where each xi is regarded as latent variables having the same dimensionality as input x0.


The total noise adding process is a joint distribution of Markov chain gradually adding Gaussian noise. The interesting part of forward process is that the variance Ξ²tI\beta_{t}I is a hyperparameter not a trainable parameter. The reason why the mean and the variance of the Gaussian distribution has the particular form is due to ∏t=1TN(1βˆ’Ξ²t,Ξ²tI)=p(xT)β‰ˆN(0,1)\prod_{t=1}^{T} N(\sqrt{1-\beta_{t}},\beta_tI) = p(x_T) \approx N(0,1) when Ξ²t\beta_t is small enough (Meaning that Ξ²\beta is very small).

Backward Step

The backward diffusion process, the model tries to learn the reverse denoising process of recovering the original form of the input.

pΞΈ(x0:T):=p(xT)∏t=1TpΞΈ(xtβˆ’1∣xt)Β whenΒ pΞΈ(xtβˆ’1∣xt):=N(ΞΌΞΈ(xt,t),Σθ(xt,t))p_{\theta}(x_{0:T}):=p(x_T)\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_t)\space when \space p_{\theta}(x_{t-1}|x_t):=\mathcal{N}(\mu_\theta(x_t,t),\varSigma_\theta(x_t,t))

As mentioned earlier, Ξ²\beta is a very small value. This also implies that reverse process q(xtβˆ’1∣xt)q(x_{t-1}|x_t) can be estimated as a Gaussian distribution and p(xtβˆ’1∣xt)p(x_{t-1}|x_t),the estimation of the real distribution q, can be chosen to be Gaussian as well as parameterize the mean and variance as

pΞΈ(xtβˆ’1∣xt):=N(xtβˆ’1;ΞΌΞΈ(xt,t),Σθ(xt,t))p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1};\mu_{\theta}(x_t, t),\varSigma_{\theta}(x_t,t))

starting from a pure Gaussian noise: p(xT)=N(xT;0,I)p(x_T)=\mathcal{N}(x_T;0,\Iota). The whole denoising process of pΞΈ(x0:T)p_{\theta}(x_{0:T}) is given as p(xT,xTβˆ’1,...,x0)=p(xT)βˆ—p(xTβˆ’1∣p(xT))βˆ—p(xTβˆ’2∣p(xT)p(xTβˆ’1))...βˆ—p(x0∣x1,x2,...xT)=p(xT)βˆ—p(xTβˆ’1∣p(xT))...βˆ—p(x0∣x1)p(x_T,x_{T-1},...,x_0)=p(x_T)*p(x_{T-1}|p(x_T))*p(x_{T-2}|p(x_T)p(x_{T-1}))...*p(x_0|x_1,x_2,...x_T)=p(x_T)*p(x_{T-1}|p(x_T))...*p(x_0|x_1) by the Markov Assumption. Finally this chain of equations can be simplified to a product form(∏\prod).

Evidence lower Bound

The objective of the generative model is to model the probability distribution of the data p_complexp\_complex or q(x0)q(x_0). The original data space is intractable leading us to instead learn approximate distribution

pθ(x0)=∫pθ(x0:T)dx1:T=∫pθ(x0:T)dx1:Tq(x1:T∣x0)q(x1:T∣x0)=∫q(x1:T∣x0)dx1:Tpθ(x0:T)q(x1:T∣x0)=Eq(x1:T∣x0)[pθ(x0:T)q(x1:T∣x0)]\begin{aligned} p_\theta(x_0)&=\int p_\theta(x_{0:T})dx_{1:T} \\ &=\int p_\theta(x_{0:T})dx_{1:T}\frac{q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)} \\ &=\int q(x_{1:T}|x_0)dx_{1:T}\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack \end{aligned}

Now to maximize the likelihood of p(x0)p(x_0), we optimize the negative log likelihood of it. By Jensen's inequality the NLL of p is described as below.

βˆ’logΒ E[Β pΞΈ(x0)]=E[βˆ’logΒ pΞΈ(x0)]≀Eq(x1:T∣x0)[βˆ’logΒ pΞΈ(x0:T)q(x1:T∣x0)]=Eq(x1:T∣x0)[βˆ’logΒ p(xT)Β βˆ’Β βˆ‘tβ‰₯1logpΞΈ(xtβˆ’1∣xt)q(xt∣xtβˆ’1)]\begin{aligned} -log \space \mathbb{E}\lbrack \space p_\theta(x_0) \rbrack = \mathbb{E}\lbrack-log \space p_\theta(x_0) \rbrack &\le \mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-log \space\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\bigg\rbrack\\ &=\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-log\space p(x_T) \space - \space \sum_{t\ge1} log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\bigg\rbrack \end{aligned}

Minimizing the righthand side of the equation will minimize the evidence upper bound. Thus the objective of the the training:

L=Eq[βˆ’logΒ p(xT)Β βˆ’Β βˆ‘tβ‰₯1logpΞΈ(xtβˆ’1∣xt)q(xt∣xtβˆ’1)]\mathcal{L}=\mathbb{E}_q\bigg\lbrack-log\space p(x_T) \space - \space \sum_{t\ge1} log\frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\bigg\rbrack

We do not know the conditional distribution using the Transfer q(x)q(x) from xtx_t to xtβˆ’1x_{t-1} during process for generation. By Bayes rule,

q(xtβˆ’1∣xt)=q(xt∣xtβˆ’1)q(xtβˆ’1)q(xt)Β andΒ q(xt)=∫q(xt∣xtβˆ’1)q(xtβˆ’1)dxq(x_{t-1}|x_t) = \frac{q(x_t|x_{t-1})q(x_{t-1})}{q(x_t)} \space and \space q(x_t) = \int q(x_t|x_{t-1})q(x_{t-1})dx

This q(xt) is intractable because the distribution of 1) each time-step and 2) q(xt/xtβˆ’1)q(xt/x_{t-1}) depends on the entire data distribution space of all possible images. Instead we make a Neural Network learn a distribution given pΞΈ(xtβˆ’1∣xt)p_\theta(x_{t-1}|x_t) approximating to q(xtβˆ’1∣xt)q(x_{t-1}|x_t). With KL divergence, calculating the distance between P and Q is done.

βˆ’logpΞΈ(x0)≀L=βˆ’Eq(x1:T∣x0)[logpΞΈ(x0∣x1:T)]+DKL(q(x1:T∣x0)∣∣pΞΈ(x1:T))-logp_\theta(x_0) \le\mathcal{L}=-\mathbb{E}_{q(x_{1:T}|x_0)}\lbrack logp_\theta(x_0|x_{1:T})\rbrack + D_{KL}(q(x_{1:T}|x_0)||p_\theta(x_{1:T}))

KL divergence can be written as

DKL(q(x1:T∣x0)∣∣pθ(x1:T))=∫q(x1:T∣x0)log(q(x1:T∣x0)/pθ(x1:T))dxD_{KL}(q(x_{1:T}|x_0)||p_\theta(x_{1:T})) = \int q(x_{1:T}|x_0)log(q(x_{1:T}|x_0)/p_\theta(x_{1:T}))dx

which can also be formulated as an expectation

Ex∼q[logq(x1:T∣x0)pΞΈ(x1:T)]Β orΒ Ex∼p[βˆ’logpΞΈ(x1:T)q(x1:T∣x0)]\mathbb{E}_{x\sim q}\bigg\lbrack log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T})}\bigg\rbrack ~or~ \mathbb{E}_{x\sim p}\bigg\lbrack -log\frac{p_\theta(x_{1:T})}{q(x_{1:T}|x_0)}\bigg\rbrack
βˆ’log⁑pΞΈ(x0)≀L=βˆ’Β Eq(x1:T∣x0)[log⁑pΞΈ(x0∣x1:T)]+Ex∼q[log⁑q(x1:T∣x0)pΞΈ(x1:T)]β‰€βˆ’Β Eq(x1:T∣x0)[log⁑pΞΈ(x0∣x1:T)Β +Β log⁑pΞΈ(x1:T)q(x1:T∣x0)]β‰€βˆ’Β Eq(x1:T∣x0)[logpΞΈ(x1:T)q(x1:T∣x0)]\begin{aligned} -\log p_\theta(x_0) &\le\mathcal{L}=-~\mathbb{E}_{q(x_{1:T}|x_0)}\lbrack \log p_\theta(x_0|x_{1:T})\rbrack + \mathbb{E}_{x\sim q}\bigg\lbrack \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T})}\bigg\rbrack \\ &\le -~\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack \log p_\theta(x_0|x_{1:T})~+~\log\frac{p_\theta(x_{1:T})}{q(x_{1:T}|x_0)}\bigg\rbrack\\ &\le -~\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack log\frac{p_\theta(x_{1:T})}{q(x_{1:T}|x_0)}\bigg\rbrack \end{aligned}

Computing the loss

βˆ’Β Eq(x1:T∣x0)[logpΞΈ(x1:T)q(x1:T∣x0)]=Eq(x1:T∣x0)[βˆ’logΒ p(xT)+logq(xT∣xTβˆ’1)pΞΈ(xTβˆ’1∣xT)+...+logq(x1∣x0)pΞΈ(x0∣x1)]=Eq(x1:T∣x0)[βˆ’logΒ p(xT)βˆ’βˆ‘tβ‰₯1logpΞΈ(xtβˆ’1∣xt)q(xt∣xtβˆ’1)]-~\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack log\frac{p_\theta(x_{1:T})}{q(x_{1:T}|x_0)}\bigg\rbrack = \mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-log~p(x_T)+log\frac {q(x_{T}|x_{T-1})}{p_\theta(x_{T-1}|x_T)}+...+log\frac{q(x_{1}|x_0)}{p_\theta(x_0|x_1)}\bigg\rbrack\\ =\mathbb{E}_{q(x_{1:T}|x_0)}\bigg\lbrack-log~p(x_T)-\sum_{t\ge1}log\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t}|x_{t-1})}\bigg\rbrack\\

After some simplification the final term can be as below.

The terms below were ignored

L0L_0 – The authors got better results without this.
LTL_T – This is the β€œKL divergence” between the distribution of the final latent in the forward process and the first latent in the reverse process. However, there are no neural network parameters involved here, so we can’t do anything about it except define a good variance scheduler and use large timesteps such that they both represent an Isotropic Gaussian distribution.

Ltβˆ’1L_{t-1} is the only loss term left which is a KL divergence between the β€œposterior” of the forward process (conditioned on xt and the initial sample x0), and the parameterized reverse diffusion process. Both terms are gaussian distributions as well.

q(xtβˆ’1∣xt,x0)=N(xtβˆ’1;ΞΌq(xt,x0),Οƒq(t)2I)Β Β whereΒ Β Β Ξ£q(t)=Ξ²t(1βˆ’Ξ±Λ‰tβˆ’1)1βˆ’Ξ±tΛ‰I,Β ΞΌq(xt,x0)Β =Β Ξ±t(1βˆ’Ξ±Λ‰tβˆ’1)xt+Ξ²tΞ±Λ‰tβˆ’1x01βˆ’Ξ±Λ‰tq(x_{tβˆ’1}|x_{t}, x_0) = \mathcal{N}(x_{tβˆ’1};\mu_{q(x_t, x_0)},\sigma_{q(t)}^2I)~~where~~~Ξ£q(t)=\frac{\beta_{t}(1 βˆ’\bar{Ξ±}_{tβˆ’1})}{1 βˆ’ \bar{Ξ±_t}}I,~ΞΌ_q(x_t, x_0)~=~\frac{\sqrtΞ±_t(1βˆ’\bar{Ξ±}_{tβˆ’1})x_t+\beta_t\sqrt{\bar{Ξ±}_{tβˆ’1}}x_0}{1βˆ’ \bar{Ξ±}_t}

To simplify the computing process, DDPM chooses the same variance for both P and Q distribution as a constant Ξ²t\beta_t . All we need to do is keep the mean same and the distribution will be same.
As we have kept the variance constant, minimizing KL divergence is as simple as minimizing the difference (or distance) between means (𝞡) of two gaussian distributions q and p and Loss function at each timestep can be written as

Ltβˆ’1=Eq[12Οƒt2βˆ₯ΞΌ~t(xt,x0)βˆ’ΞΌΞΈ(xt,t)βˆ₯2]+CL_{t-1}=E_q\bigg\lbrack \frac1{2\sigma_t^2}\Vert \tilde{\mu}_t(x_t,x_0)-\mu_\theta(x_t,t)\Vert^2\bigg\rbrack+C

The possible approach we can take is

  • Directly predict x0x_0 and find ΞΌΛ‰\bar{\mu},use it in the posterior function
  • Predict the entire ΞΌΛ‰\bar{\mu} term
  • Predict the noise at each timestep. Write the x0x_0 in ΞΌΛ‰\bar{\mu} in terms of xtx_t

We will be applying appoach 3.



We have xβ‚œ by adding noise Ο΅ t times to the base image using the forward process using

Lsimple=βˆ₯Ο΅βˆ’Ο΅ΞΈ(xt,t)βˆ₯2\mathcal{L}_{simple}= \Vert\epsilon-\epsilon_\theta(x_t,t)\Vert^2

Eventually the model needs to train a noise estimator ϡθ(xt,t)\epsilon_\theta(x_t,t).

Training and Inference


For training, take a random timestep and train the network to predict noise level at this timestep. At inference, go through entire T iterations of model. Starting at a noisy image xTx_T from the normal distribution. For timestep t > 1, an additive sampled noise from the normal distribution is added to denoised xtx_t sample to form a xtβˆ’1x_{t-1} sample. The additive noise is added to approximate the distribution of x_{t-1} and to create some diversity to the DDPM model.

Acknowledge

The blog post is based on the Paper "Denoising Diffusion Probabilistic Models" by Jonathan Ho, Ajay Jain, Pieter Abbeel at the 34th Conference of Neural Information Processing Systems 2020. arxiv paper link

Reference used in this blog post

profile
Interested in Speech Processing

0개의 λŒ“κΈ€