Denoising Diffusion Probabilistic Models (2)

놔놔·2024년 8월 8일

DDPM

목록 보기

2/2

Diffusion models and denoising autoencoders

Forward process and $L_T$

논문에서 forward process의 $\beta_t$ 가 reparameterization을 통해 학습될 수 있다는 사실을 무시하고, 이를 상수로 고정한다. 따라서 논문의 구현에서 근사 posterior 분포 $q$ 에는 학습 가능한 파라미터가 없으며 이로 인해 $L_t$ 는 훈련중 상수로 간주되어 무시할수 있다.

Reverse process and $L_{1:T-1}$

p_\theta(x_{t-1}|x_t):=\mathcal{N}(x_{t-1};\mu_\theta(x_t,t), \textstyle\sum_{\theta}(x_t,t))\,\,\,\,,for\,\,\,\, 1<t\le T

첫번째로 위 정의에서 분산 $\textstyle\sum_{\theta}(x_t,t)$ 을 다음과 같이 훈련되지 않은 시간 의존적인 상수로 설정한다.

\textstyle\sum_{\theta}(x_t,t)={\sigma_t}^2\mathbf{I}

${\sigma_t}^2$ 에 대해 실험을 진행한 결과, 다음 두가지의 결과가 크게 다르지 않았다.

1) ${\sigma_t}^2=\beta_t$

forward process에서 추가된 noise와 동일한 분산을 사용해 노이즈를 제거
데이터가 정규 분포를 따를때 좋음 $x_0\sim\mathcal{N}(0,\mathbf{I})$
reverse process에서 상대적으로 높은 엔트로피를 가진다. 분산이 큰 경우 불확실성이 커지기 때문

2) ${\sigma_t}^2=\tilde\beta_t=\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$

특정한 점으로 고정된 데이터를 복원하는데 좋음
데이터가 하나의 점으로 수렴할 때 사용하면 좋음
상대적으로 낮은 엔트로피를 가지게 됨 (데이터의 정보가 집중된 상태)

각각의 분산 설정 방법은 데이터의 분포와 모델의 엔트로피 측면에서 중요한 의미를 가진다.

두번째로, 평균 $\mu_\theta(x_t,t)$ 를 나타내기 위해 다음 분석을 바탕으로 특정 파라미터화를 제안한다. 파라미터화는 모델의 동작을 정의하기 위해 매개변수를 설정하는 과정이다.
주어진 식 $p_\theta(x_{t-1}|x_t):=\mathcal{N}(x_{t-1};\mu_\theta(x_t,t), {\sigma_t}^2\mathbf{I})$ 에서 다음과 같이 쓸 수 있다.

L_{t-1}=\Bbb{E}_q\bigg\lbrack\frac{1}{2{\sigma_t}^2}{||\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)||}^2\bigg\rbrack\,+\,C\,\,\,\,(8)

$C$ 는 $\theta$ 에 의존하지 않는 상수다. 따라서 $\mu_\theta$ 를 파라미터화 하는 가장 간단한 방법은 forward process의 posterior 평균 $\tilde\mu_t$ 를 예측하는 모델을 사용하는 것이다.

추가적으로 식 (4)를 재매개변수화(reparameterizing) 하여 $x_t(x_0,\epsilon)=\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon\,\,\,\,for\,\,\epsilon\sim\mathcal{N}(0,\mathbf{I})$ 로 나타내고 forward process의 posterior 공식 (7)을 적용할 수 있다.

L_{t-1}-C=\Bbb{E}_{x_0,\epsilon}\bigg\lbrack\frac{1}{2{\sigma_t}^2}{\bigg|\bigg|\tilde\mu_t\bigg(x_t(x_0,\epsilon),\frac{1}{\sqrt{\bar\alpha_t}}(x_t(x_0,\epsilon)-\sqrt{1-\bar\alpha_t}\epsilon})\bigg)-\mu_\theta(x_t(x_0,\epsilon),t)\bigg|\bigg|^2\bigg\rbrack\,\,\,\,(9)

=\Bbb{E}_{x_0,\epsilon}\bigg\lbrack\frac{1}{2{\sigma_t}^2}{\bigg|\bigg|\frac{1}{\sqrt{\alpha_t}}\bigg(x_t(x_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon}\bigg)-\mu_\theta(x_t(x_0,\epsilon),t)\bigg|\bigg|^2\bigg\rbrack\,\,\,\,\,(10)

수식 (10)은 $\mu_\theta$ 가 $x_t$ 가 주어졌을때 $\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon)$ 을 예측해야 하는걸 보여준다. $x_t$ 가 모델의 입력으로 사용될 수 있기 때문에, $\mu_\theta(x_t,t)$ 를 다음과 같이 파라미터화 할 수 있다.

\mu_\theta(x_t,t)=\tilde\mu_t\bigg(x_t,\frac{1}{\sqrt{\bar\alpha_t}}(x_t-\sqrt{1-\bar\alpha_t}\epsilon_\theta(x_t))\bigg)=\frac{1}{\sqrt{\alpha_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\bigg)\,\,\,\:(11)

위 식의 $\epsilon_\theta$ 는 $x_t$ 로부터 $\epsilon$ 을 예측하는 함수다. $x_t-1$ 을 $p_\theta(x_{t-1}|x_t$ 에서 샘플링 하려면, $x_{t-1}=\frac{1}{\sqrt{\alpha_t}}\bigg(x_t-\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t,t)\bigg)+\sigma_tz,\,\,where \,\,z\sim\mathcal{N}(0,I)$ 로 계산하면 된다. 전체 샘플링 절차는 알고리즘 2에서 보이는것 처럼 $\epsilon_\theta$ 가 학습된 데이터 밀도 함수의 기울기 역할을 하는 langevin dynamics와 유사하다. 더불어, 이 파라미터화를 사용하면 식(10)이 간단해진다.

\Bbb{E}_{x_0,\epsilon}\bigg\lbrack\frac{{\beta_t}^2}{2{\sigma_t}^2\alpha_t(1-\bar\alpha_t)}{||\epsilon-\epsilon_\theta(\sqrt{\bar\alpha_t}x_0+\sqrt{1-\bar\alpha_t}\epsilon,t)||}^2\bigg\rbrack\,\,\,\,(12)

요약하자면, reverse process의 평균 함수인 $\mu_\theta$ 를 훈련시켜 $\tilde\mu_t$ 를 예측하거나, 파라미터화를 수정해 $\epsilon$ 을 예측할 수 있다. ( $x_0$ 를 예측할 수 도 있지만 실험 초기에는 이방법이 샘플 품질이 떨어지는것으로 나타남)논문에서는 $\epsilon$ 을 예측하는 파라미터화가 langevin dynamics와 유사하며, 동시에 확산 모델의 변분 경계를 간소화해 score matching과 유사한 목표로 만든다는걸 보여줬다.
그럼에도...이것은 $p_\theta(x_{t-1}|x_t)$ 의 또다른 파라미터화일 뿐이므로, experiments 부분에서 $\epsilon$ , $\tilde\mu_t$ 를 예측하는 방법을 비교하는 실험을 통해 그 효과를 검증했다.

$\epsilon$ , $\tilde\mu_t$ 를 예측하는 방법은 다음 포스트에서 알아보도록 하자...

Data scaling, reverse process decoder, and $L_0$

이미지 데이터는 ${0,1,...,255}$ 범위의 정수형데이터를 $[-1,1]$ 로 스케일링 한 데이터 라고 가정한다. 이렇게 한 이유는 표준 정규 분포 $p(x_T)$ 에서 시작하여 신경망의 reverse process가 일관된 스케일 $[-1,1]$ 의 입력으로 동작하도록 하기 위함이다. 이산 로그 likelihoods를 얻기 위해 reverse process의 마지막 항을 $\mathcal{N}(x_0;\mu_\theta(x_1,1),{\sigma_1}^2\mathbf{I})$ 에서 유도된 독립적인 이산 디코더로 설정한다. 요약하자면 확산모델이 마지막 단계에서 가우시안 분포를 기반으로 각 픽셀의 값을 예측하고, 이를 독립적인 이산 디코더를 통해 확률적으로 샘플링한 뒤 그에 따른 로그 likelihoods를 계산하여 모델의 성능을 측정한다는 것이다.

p_\theta(x_0|x_1)=\prod_{i=1}^D\int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)}\mathcal{N}(x;\mu_\theta^i (x_1,1),\sigma_1^2)\,dx\,\,\,\,(13)

\delta_+(x)=\begin{cases} \infin &\text{if } x=1 \\ x+\frac{1}{255} &\text{if } x<1 \end{cases}\,\,\,\, \delta_-(x)=\begin{cases} - \infin &\text{if } x=-1 \\ x-\frac{1}{255} &\text{if } x>-1 \end{cases}

$\prod$ 연산의 $D$ 는 데이터의 차원수를 의미하고, $i$ 위첨자는 하나의 좌표를 추출하는걸 나타낸다. VAE 디코더와 autoregressive 디코더에서 사용되는 이산화된 연속 분포와 유사하게, 여기서 논문의 선택은 변분경계가 이산 데이터의 무손실 코드 길이를 보장하도록 한다. 이를 위해 데이터에 노이즈를 추가하거나, 스케일링 연산의 Jacobian을 로그 likelihoods에 통합할 필요가 없다. 샘플링 마지막 단계에서 $\mu_\theta(x_1,1)$ 을 무잡음 상태로 나타낸다.

Simplified training objective

이후에는 변분 경계의 단순화된 목적함수를 통해 네트워크가 더 높은 품질의 샘플을 생성하는 법을 알아보겠다. reverse process와 디코더가 정의된 후 변분 경계가 파라미터 $\theta$ 에 대해 명확하게 미분 가능하며, 이를 훈련에 사용할 수 있다. 하지만 논문에서는 샘플 품질을 향상시키고 구현을 간소화 하기 위해 변분 경계의 변형된 버전을 사용한다. 이 변형된 경계는 $L_simple(\theta)$ 로 정의되고, 이는 네트워크가 다른 수준의 노이즈를 가진 데이터를 복원하는데 집중할 수 있도록 해준다.

L_{simple}(\theta):=\Bbb{E}_{t,x_0,\epsilon}\bigg\lbrack\big|\big|\epsilon-\epsilon_\theta(\sqrt{\bar\alpha}_tx_0+\sqrt{1-\bar\alpha_t}\epsilon,t)\big|\big|{}^2\bigg\rbrack\,\,\,\,(14)

가중치 조정은 작은 $t$ 에 해당하는 손실 항을 줄이는 방식으로 가중치가 조정된다. 이는 네트워크가 작은 양의 노이즈를 가진 데이터를 복원하는 대신, 더 어려운 복원 작업에 집중하도록 유도한다. 이런 가중치 재조정이 샘플 품질을 향상시킨다는 것이 실험을 통해 입증된다.

놔놔

5238D8K7

이전 포스트

Denoising Diffusion Probabilistic Models (2)

DDPM

Diffusion models and denoising autoencoders

Forward process and $L_T$

Reverse process and $L_{1:T-1}$

Data scaling, reverse process decoder, and $L_0$

Simplified training objective

Denoising Diffusion Probabilistic Models (1)

0개의 댓글

Denoising Diffusion Probabilistic Models (2)

DDPM

Diffusion models and denoising autoencoders

Forward process and LTL_TLT​

Reverse process and L1:T−1L_{1:T-1}L1:T−1​

Data scaling, reverse process decoder, and L0L_0L0​

Simplified training objective

Denoising Diffusion Probabilistic Models (1)

0개의 댓글

Forward process and $L_T$

Reverse process and $L_{1:T-1}$

Data scaling, reverse process decoder, and $L_0$