Diffusion Probabilistic Models make high-quality image synthesis possible. The best-performing variant is trained on a weighted variational bound, which arises from connecting (1) diffusion probabilistic models and (2) denoising score matching with Langevin dynamics. The proposed model admits a progressive lossy decompression scheme and can be interpreted as a generalization of autoregressive decoding. The implementation is available on GitHub.
Prerequisites
What is Diffusion?
Diffusion is a term borrowed from thermodynamics, where particles or molecules spontaneously move from regions of high concentration to regions of low concentration. Likening an image to a liquid, its pixels act like atoms, and we replicate the physical diffusion process with noise: adding noise to the image plays the role of diffusion. By gradually adding noise over successive steps we destroy the information in the image, and we call this the Forward Diffusion Process. The second part of the task is to restore the image by gradually denoising it; this denoising is called the Backward Diffusion Process.
The Forward Process is fundamentally a process in which a transform function maps a complex distribution to a predefined prior distribution.
$x_0 \sim p_{\text{complex}} \implies T(x_0) \sim p_{\text{prior}}$
Each data point from the complex distribution is mapped to a point in the prior distribution.
A high-level conceptual overview of the entire image space.
In DDPM the prior distribution is defined to be a Gaussian, and the transformation mechanism is assumed to be a Markov chain, meaning that the current state depends only on the previous state and the transition probability between states. Let's start with the notation first.
X : random variable for an image distributed according to the complex probability distribution P(x),
so X = x for some image from the set of possible images.
n : number of pixels in an image (H×W pixels).
Then X is a set of n random variables, i.e., $X = \{v_1, v_2, \dots, v_n\}$ and $X \in \mathbb{R}^n$ for an n-pixel image.
Markov Chains and what it means to have them in DDPMs
Given that the image state at timestep $k$ is represented as $X_k$, the forward transition can be written as $P(X_k = x_k \mid X_{k-1} = x_{k-1})$. By the Markov property, the state at the current timestep is conditioned only on the state at timestep $k-1$.
In the context of the diffusion process, the number of timesteps $T$ is the number of steps needed to convert the input image into pure Gaussian noise. If the backward process is exactly the opposite of the forward process, it is represented as follows:
$P(X_{k-1} = x_{k-1} \mid X_k = x_k)$
Deep dive into DDPM
Forward Step
A diffusion model can be understood as a latent variable model that takes an input image and maps it to a latent space using a fixed forward diffusion process $q$. The $q$ process is a Markov chain, and its goal is to add noise progressively, yielding the approximate posterior $q(x_{1:T} \mid x_0)$, where each $x_t$ is regarded as a latent variable with the same dimensionality as the input $x_0$.
The full noising process is the joint distribution of a Markov chain that gradually adds Gaussian noise, $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ with $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$. An interesting part of the forward process is that the variances $\beta_t$ are hyperparameters, not trainable parameters. The mean and variance of the Gaussian take this particular form so that, when each $\beta_t$ is small enough and $T$ is large, the chain drives $x_T$ toward a standard Gaussian, i.e. $p(x_T) \approx \mathcal{N}(0, I)$.
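As a minimal sketch of this forward chain (assuming PyTorch and a typical linear $\beta$ schedule; the exact values here are illustrative, not prescriptive):

```python
import torch

# A typical linear variance schedule; the beta_t values are small hyperparameters, not learned.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

def forward_process(x0):
    """Run all T noising steps; the result is close to pure N(0, I) noise."""
    x = x0
    for t in range(T):
        x = forward_step(x, t)
    return x
```

Running `forward_process` on an image scaled to roughly $[-1, 1]$ illustrates how the information is gradually destroyed step by step.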
Backward Step
In the backward diffusion process, the model tries to learn the reverse denoising process, recovering the original form of the input.
As mentioned earlier, each $\beta_t$ is a very small value. This implies that the reverse conditional $q(x_{t-1} \mid x_t)$ is approximately Gaussian, so $p_\theta(x_{t-1} \mid x_t)$, the estimate of the real distribution $q$, can be chosen to be Gaussian as well, with a parameterized mean and variance, $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$,
starting from pure Gaussian noise: $p(x_T) = \mathcal{N}(x_T; 0, I)$. The whole denoising process $p_\theta(x_{0:T})$ is given as $p(x_T, x_{T-1}, \dots, x_0) = p(x_T)\,p(x_{T-1} \mid x_T)\,p(x_{T-2} \mid x_{T-1}, x_T) \cdots p(x_0 \mid x_1, \dots, x_T) = p(x_T)\,p(x_{T-1} \mid x_T) \cdots p(x_0 \mid x_1)$ by the Markov assumption. Finally, this chain of equations simplifies to the product form $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$.
Evidence lower Bound
The objective of the generative model is to model the probability distribution of the data, $p_{\text{complex}}$ or $q(x_0)$. The original data distribution is intractable, leading us instead to learn an approximate distribution.
The reverse conditional $q(x_{t-1} \mid x_t)$ is intractable because, at each timestep, it depends on the entire data distribution over all possible images. Instead, we make a neural network learn a distribution $p_\theta(x_{t-1} \mid x_t)$ that approximates $q(x_{t-1} \mid x_t)$. The distance between $p$ and $q$ is measured with the KL divergence.
After some simplification, the final objective can be written as below.
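For reference, the simplified variational bound from the DDPM paper decomposes into the following named terms (reproduced here in the paper's notation):

$$
L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]
$$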
The following terms were ignored:
$L_0$ → The authors got better results without this term.
$L_T$ → This is the KL divergence between the distribution of the final latent in the forward process and the first latent in the reverse process. However, no neural-network parameters are involved here, so we can't do anything about it except define a good variance schedule and use a large enough number of timesteps so that both represent an isotropic Gaussian distribution.
$L_{t-1}$ is the only loss term left: a KL divergence between the posterior of the forward process (conditioned on $x_t$ and the initial sample $x_0$) and the parameterized reverse diffusion process. Both terms are Gaussian distributions as well.
To simplify the computation, DDPM chooses the same constant variance $\sigma_t^2 = \beta_t$ for both the $p$ and $q$ distributions. With the variances fixed and equal, matching the means makes the two distributions match.
Since we have kept the variance constant, minimizing the KL divergence reduces to minimizing the difference (or distance) between the means ($\mu$) of the two Gaussian distributions $q$ and $p$, and the loss at each timestep can be written as follows.
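With the constant variance $\sigma_t^2$ above, this per-timestep term (as given in the DDPM paper, up to an additive constant) reads:

$$
L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C
$$

There are then three ways to parameterize what the network predicts: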
1. Directly predict $x_0$, compute $\tilde{\mu}$ from it, and use it in the posterior function.
2. Predict the entire $\tilde{\mu}$ term.
3. Predict the noise at each timestep, writing the $x_0$ inside $\tilde{\mu}$ in terms of $x_t$ and the noise.
We will be applying approach 3.
We obtain $x_t$ directly from the base image $x_0$ (equivalent to applying the forward noising process $t$ times with noise $\epsilon$) using the closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
$L_{\text{simple}} = \|\epsilon - \epsilon_\theta(x_t, t)\|^2$
Eventually the model needs to train a noise estimator $\epsilon_\theta(x_t, t)$.
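A minimal sketch of how this $L_{\text{simple}}$ objective can be computed (PyTorch, with an illustrative `eps_model(x_t, t)` noise-prediction network and the same linear schedule as before; names and values are assumptions, not the paper's exact code):

```python
import torch

# Same illustrative schedule as in the forward-process sketch.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def training_loss(eps_model, x0):
    """One step of L_simple: noise x0 in closed form at a random t,
    then regress the predicted noise onto the true noise."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)          # random timestep per sample
    eps = torch.randn_like(x0)                                    # the true noise epsilon
    a_bar = alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1)         # broadcast over image dims
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # closed-form forward sample
    return torch.mean((eps - eps_model(x_t, t)) ** 2)             # L_simple
```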
Training and Inference
For training, we pick a random timestep and train the network to predict the noise that was added at that timestep. At inference, we go through all $T$ iterations of the model, starting from a noisy image $x_T$ drawn from the standard normal distribution. For timesteps $t > 1$, freshly sampled Gaussian noise is added to the denoised estimate of $x_t$ to form the $x_{t-1}$ sample; this additive noise approximates the distribution of $x_{t-1}$ and introduces some diversity into the DDPM's samples.
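A sampling-loop sketch under the same assumptions (the illustrative schedule and `eps_model` from the training sketch), mirroring the ancestral-sampling procedure described above:

```python
import torch

# Same illustrative schedule as in the training sketch.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    """Ancestral sampling: start from pure noise x_T and denoise for T steps."""
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = eps_model(x, t_batch)                          # predicted noise at step t
        alpha_t, a_bar_t = alphas[t], alpha_bars[t]
        # Mean of p_theta(x_{t-1} | x_t) expressed through the predicted noise.
        mean = (x - (1 - alpha_t) / torch.sqrt(1 - a_bar_t) * eps_pred) / torch.sqrt(alpha_t)
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # fresh noise except at the last step
        else:
            x = mean
    return x
```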
Acknowledgement
This blog post is based on the paper "Denoising Diffusion Probabilistic Models" by Jonathan Ho, Ajay Jain, and Pieter Abbeel, presented at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). arxiv paper link