Generative Modeling by Estimating Gradients of the Data Distribution (2)

Seow · July 7, 2024

Previous post : https://velog.io/@yhyj1001/Generative-Modeling-by-Estimating-Gradients-of-the-Data-Distribution-1-A-Connection-Between-Score-Matching-Denoising-Autoencoders
github: https://github.com/yhy258/ScoreMatching_pytorch/tree/main

Motivations & Intuitions ⭐️

Motivations 📚

  • The score $\nabla_x \log p_{data}(x)$ is a gradient taken in the ambient space of $x$, so it is undefined when $x$ is confined to a low-dimensional manifold (manifold hypothesis).
  • When the support of the data distribution is not the whole space, the score matching objective no longer yields a consistent score estimator, because the objective is an expectation taken under the data distribution.
  • Because data are scarce in low-density regions, score matching may not have enough evidence to estimate the score accurately there.
  • When two modes of the data distribution are separated by low-density regions, Langevin dynamics cannot reflect the relative weights of the two modes in reasonable time and may not converge to the true data distribution. This is because the score function does not depend on the relative weights of the modes.

Intuitions 🧐

  • The support of Gaussian noise distributions is the whole space.
  • Large Gaussian noise can fill low density regions.
  • Perturbing the data with a sequence of noise levels naturally suggests an annealed version of the iterative sampling method, Langevin dynamics.

Contributions ⭐️

  • They perturb the data using various levels of noise.
  • They propose an annealing schedule for Langevin dynamics in order to accurately reflect the relative weights of the modes.

Methods

Noise Conditional Score Networks

First of all, the authors define a decreasing positive geometric sequence $\{\sigma_1, \sigma_2, \cdots, \sigma_L\}$. The perturbed distributions $q_{\sigma_i}(x) = \int p_{data}(t)\,\mathcal{N}(x \mid t, \sigma_i^2 I)\,dt$ then form a sequence of noise-perturbed distributions that converges to the true data distribution as the noise level decreases.
They employ a single U-Net-style network (a RefineNet-based design in the paper) to jointly estimate the scores of all perturbed data distributions, $s_\theta(x, \sigma) \approx \nabla_x \log q_\sigma(x)$.
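As a rough sketch of these definitions (not the code from the linked repository), the geometric noise levels and samples from $q_{\sigma_i}$ can be produced as follows; `sigma_begin`, `sigma_end`, and `L` are illustrative hyperparameters:

```python
import math
import torch

def geometric_sigmas(sigma_begin=1.0, sigma_end=0.01, L=10):
    # Decreasing positive geometric sequence {sigma_1, ..., sigma_L}.
    return torch.exp(torch.linspace(math.log(sigma_begin), math.log(sigma_end), L))

def perturb(x, sigma):
    # Sample x_tilde ~ q_sigma(x_tilde | x) = N(x_tilde; x, sigma^2 I),
    # i.e., the data point x corrupted by Gaussian noise of scale sigma.
    return x + sigma * torch.randn_like(x)
```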

Learning NCSNs via score matching

To perform score matching, they use denoising score matching; sliced score matching could also be used to train NCSNs.
As discussed in the previous post, the denoising score matching objective is:

$$l(\theta; \sigma) = \frac{1}{2}\mathbb{E}_{q_\sigma(x, \tilde x)}\!\left[\left\|s_\theta(\tilde x, \sigma) - \frac{\partial \log q_\sigma(\tilde x \mid x)}{\partial \tilde x}\right\|^2\right] = \frac{1}{2}\mathbb{E}_{p_{data}(x)}\,\mathbb{E}_{\tilde x \sim \mathcal{N}(x, \sigma^2 I)}\!\left[\left\|s_\theta(\tilde x, \sigma) + \frac{\tilde x - x}{\sigma^2}\right\|^2\right]$$

This equation is available because $q_\sigma(\tilde x \mid x)$ is tractable.
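A minimal sketch of this per-level objective, assuming a hypothetical `score_net(x_tilde, sigma)` with the NCSN-style signature (this is not the repository code):

```python
import torch

def dsm_loss(score_net, x, sigma):
    # Denoising score matching for a single noise level sigma.
    noise = torch.randn_like(x)
    x_tilde = x + sigma * noise                # x_tilde ~ N(x, sigma^2 I)
    target = -(x_tilde - x) / sigma ** 2       # grad_{x_tilde} log q_sigma(x_tilde | x)
    pred = score_net(x_tilde, sigma)           # s_theta(x_tilde, sigma)
    # l(theta; sigma) = 1/2 * E || s_theta(x_tilde, sigma) - target ||_2^2
    return 0.5 * ((pred - target) ** 2).flatten(1).sum(dim=1).mean()
```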
Then, they combine this objective over all $\sigma_i$:

$$\mathcal{L}(\theta; \{\sigma_i\}_{i=1}^{L}) = \frac{1}{L}\sum_{i=1}^{L}\lambda(\sigma_i)\, l(\theta; \sigma_i)$$

where $\lambda(\sigma_i)$ is positive.
Because they observe that, when the score networks are trained to optimality, $\|s_\theta(x, \sigma)\|_2 \approx \frac{1}{\sigma}$, they set the coefficients $\lambda(\sigma_i) = \sigma_i^2$ so that the values of $\lambda(\sigma_i)\, l(\theta; \sigma_i)$ are roughly of the same order of magnitude.

If $\lambda(\sigma_i) = \sigma_i^2$, then $\lambda(\sigma)\, l(\theta; \sigma) = \frac{1}{2}\mathbb{E}\!\left[\left\|\sigma s_\theta(\tilde x, \sigma) + \frac{\tilde x - x}{\sigma}\right\|_2^2\right]$. In this expression, $\|\sigma s_\theta(x, \sigma)\|_2 \approx 1$ and $\frac{\tilde x - x}{\sigma} \sim \mathcal{N}(0, I)$, so the magnitude of each weighted term does not depend on $\sigma_i$.
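With $\lambda(\sigma_i) = \sigma_i^2$, the combined objective can be sketched as follows, reusing the `dsm_loss` above. In practice the noise level is often sampled at random per example instead of looping over all $L$ levels, which gives an unbiased estimate of the same average:

```python
def ncsn_loss(score_net, x, sigmas):
    # L(theta; {sigma_i}) = (1/L) * sum_i lambda(sigma_i) * l(theta; sigma_i),
    # with lambda(sigma_i) = sigma_i**2 so every term has roughly the same scale.
    total = 0.0
    for sigma in sigmas:
        total = total + (sigma ** 2) * dsm_loss(score_net, x, sigma)
    return total / len(sigmas)
```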

NCSN inference via annealed Langevin dynamics

  • While the samples are still in low-density regions (i.e., at large noise levels), they employ larger step sizes, and they anneal the step size down as the noise level decreases.
  • They can also choose $\alpha_i \propto \sigma_i^2$ in order to fix the magnitude of the "signal-to-noise" ratio in Langevin dynamics.
    The Langevin update (an Euler discretization) is $\tilde x_t \leftarrow \tilde x_{t-1} + \frac{\alpha_i}{2} s_\theta(\tilde x_{t-1}, \sigma_i) + \sqrt{\alpha_i}\, z_t$. According to this definition, the SNR is $\frac{\alpha_i s_\theta(x, \sigma_i)}{2\sqrt{\alpha_i}\, z_t}$.
    If $\alpha_i \propto \sigma_i^2$, the squared SNR magnitude becomes roughly $\frac{1}{4}\mathbb{E}\!\left[\|\sigma_i s_\theta(x, \sigma_i)\|_2^2\right] \approx \frac{1}{4}$, which is independent of $\sigma_i$; a sketch of the full annealed sampler is given below.
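A minimal sketch of the annealed Langevin sampler under these choices, assuming a trained `score_net`; `T`, `eps`, and the uniform initialization are illustrative choices rather than the repository's exact settings:

```python
import torch

@torch.no_grad()
def annealed_langevin_dynamics(score_net, sigmas, shape, T=100, eps=2e-5):
    x = torch.rand(shape)                          # initial sample from a simple prior
    for sigma in sigmas:                           # sigma_1 > sigma_2 > ... > sigma_L
        alpha = eps * (sigma / sigmas[-1]) ** 2    # alpha_i proportional to sigma_i^2
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + torch.sqrt(alpha) * z
    return x                                       # approximate sample from the data distribution
```

Each noise level runs $T$ Langevin steps with step size $\alpha_i \propto \sigma_i^2$, so the SNR stays roughly constant across levels, as argued above.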
