Paper Review: Controllable Disentangled Style Transfer via Diffusion Models (StyleDiffusion)

SoyE · November 2, 2023

Abstract

Content and style (C-S) disentanglement is one of the most important challenges in style transfer.
Existing style transfer methods leave content and style entangled. This paper proposes a new C-S disentangled framework built on diffusion models, which achieves genuine disentanglement and makes the relationship between content and style more interpretable and controllable.

Introduction

Because style is modeled as the complement of content, C-S can be completely separated.
-> only the content needs to be explicitly extracted -> controllability & interpretability are achieved

However, to further facilitate C-S disentanglement, the content information extracted from the content image and from the style image should share the same domain.
-> A diffusion-based style removal module is introduced to remove the style information of both the content image and the style image and extract domain-aligned content information.

A diffusion-based style transfer module is introduced to learn the disentangled style information of the style image and transfer it to the content image.
-> This process is guided by a simple yet efficient CLIP-based style disentanglement loss.
-> The CLIP-based style disentanglement loss performs the transfer mapping so that the stylized result and the style image itself are aligned.

Method

The method consists of two main steps:

  1. disentangle the content and style of images
  2. transfer the style of Is to the content of Ic

The key idea is to explicitly extract the content information and then implicitly learn the complementary style information.

Style Removal Module

The style removal module aims at removing the style information of the content and style images, explicitly extracting the domain-aligned content information.

Since color is an integral part of style, the style removal module first removes the colors of the input images.

Then, we leverage a pre-trained diffusion model to remove the style details such as brushstrokes and textures of I's, extracting the content Ics.
-> The insight is that the pre-trained diffusion model can help eliminate the domain-specific characteristics of input images and align them to the pre-trained domain. We assume that images with different styles belong to different domains, but _their contents should share the same domain._ Therefore, we can pre-train the diffusion model on a surrogate domain, e.g., the photograph domain, and then use this domain to construct the contents of images.

After pre-training, the diffusion model can convert the input images from diverse domains to the latents x via the forward process and then inverse them to the photograph domain via the reverse process.
-> Although the style image and the content image come from different domains, the pre-trained diffusion model extracts only domain-aligned content information -> style transfer becomes easier.

DDIM is adopted for the diffusion process.
The forward and reverse diffusion processes make it easy to control the intensity of style removal by adjusting the number of return steps Tremov -> as Tremov increases, more style characteristics are removed while the main content structures are retained.
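Below is a minimal PyTorch sketch of this round trip, not the authors' implementation: `eps_model` (a pre-trained photograph-domain noise predictor), `abar` (a tensor of cumulative alphas with at least T_remov + 1 entries), and the simple luminance-based color removal are all assumptions made for illustration.

```python
import torch

def ddim_step(x, eps, a_from, a_to):
    # One deterministic DDIM update between two cumulative-alpha levels.
    x0_pred = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    return a_to.sqrt() * x0_pred + (1 - a_to).sqrt() * eps

def ddim_forward(image, eps_model, abar, T):
    # DDIM inversion: image -> noisy latent x_T under the pre-trained model.
    x = image
    for t in range(T):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, abar[t], abar[t + 1])
    return x

def ddim_reverse(x, eps_model, abar, T):
    # DDIM denoising: latent x_T -> image in the pre-trained (photograph) domain.
    for t in reversed(range(T)):
        eps = eps_model(x, t + 1)
        x = ddim_step(x, eps, abar[t + 1], abar[t])
    return x

@torch.no_grad()
def style_removal(image, eps_model, abar, T_remov):
    # Color removal first (a simple luminance average stands in for the paper's
    # exact operation), then a forward/reverse DDIM round trip strips style
    # details; a larger T_remov removes more style, a smaller one keeps more.
    gray = image.mean(dim=1, keepdim=True).expand_as(image)
    x_T = ddim_forward(gray, eps_model, abar, T_remov)
    return ddim_reverse(x_T, eps_model, abar, T_remov)
```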

Style Transfer Module

The diffusion-based style transfer module can better learn the disentangled style information and achieve higher-quality and more flexible stylizations.

Icc is converted into the latent x with the pre-trained diffusion model -> guided by a CLIP-based style disentanglement loss coordinated with a style reconstruction prior, the reverse process of the diffusion model is fine-tuned (ϵθ → ϵθˆ) to generate the stylized result Ics.

DDIM is used during training, while DDPM can be used at inference. Why? The stochastic DDPM forward process can also be used directly to help obtain diverse results.
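A rough sketch of what this fine-tuning loop could look like, reusing the DDIM helpers from the style removal sketch above; `style_disentanglement_loss` is sketched later in the loss section, `I_sc` denotes the extracted content of the style image, and the step count, iteration count, and learning rate are placeholder assumptions, not the paper's settings.

```python
import copy
import torch

def finetune_style_transfer(I_cc, I_sc, I_s, eps_model, abar,
                            T_trans=30, iters=100, lr=1e-6):
    # eps_theta -> eps_theta_hat: fine-tune a copy of the pre-trained predictor.
    eps_hat = copy.deepcopy(eps_model)
    opt = torch.optim.Adam(eps_hat.parameters(), lr=lr)

    # Convert the domain-aligned content I_cc into the latent x_T (kept fixed).
    with torch.no_grad():
        x_T = ddim_forward(I_cc, eps_model, abar, T_trans)

    for _ in range(iters):
        # The deterministic DDIM reverse pass stays differentiable, so the
        # loss can update the copied noise predictor.
        I_cs = ddim_reverse(x_T, eps_hat, abar, T_trans)
        loss = style_disentanglement_loss(I_cs, I_cc, I_s, I_sc)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return eps_hat
```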

Loss Functions and Fine-tuning

CLIP-based Style Disentanglement Loss

-> A straightforward way to obtain the disentangled style information is a direct subtraction in pixel space.

However, simple pixel differences do not contain meaningful semantic information and thus cannot achieve plausible results.

-> To address this problem, we can formulate the disentanglement in a latent semantic space.

As we define that images with different styles belong to different domains, the projector E should be able to distinguish the domains of Is and Ics.
Fortunately, the recent vision-language model CLIP encapsulates knowledgeable semantic information of not only the photograph domain but also the artistic domain, so it can serve as the projector E.

This “style distance” thus can be interpreted as the disentangled style information.
-> Ds == style information
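For concreteness, here is a small sketch of such a style distance using OpenAI's CLIP image encoder as the projector E; treating CLIP's image encoder as E and feeding already CLIP-preprocessed tensors are assumptions made for illustration.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _preprocess = clip.load("ViT-B/32", device=device)

def clip_embed(image):
    # Project a (CLIP-preprocessed) image batch into the CLIP embedding space.
    feat = clip_model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

def style_distance(image, content_of_image):
    # Ds: the disentangled style = embedding of an image minus the embedding
    # of its style-removed content.
    return clip_embed(image) - clip_embed(content_of_image)
```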

how to properly transfer it?

A possible solution is to directly optimize an L1 loss between the two style distances.
However, minimizing the L1 loss alone cannot guarantee a plausible stylized result Ics.
To address this problem, we can further constrain the disentangled directions as follows.

This direction loss aligns the transfer direction of the stylized result with that of the style image itself. -> the style domains are matched -> an accurate one-to-one mapping becomes possible.

Finally, the style disentanglement loss combines the L1 loss and the direction loss above.

Since the style information is induced by the difference between the content and its stylized result, the model learns a deeper understanding of the relationship between C and S.
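Putting the pieces together, a hedged sketch of this combined loss on top of the `style_distance` helper above; the weights `lam_l1` and `lam_dir` and the exact form of each term are assumptions, and `I_cc` / `I_sc` denote the extracted contents of the content and style images.

```python
import torch.nn.functional as F

def style_disentanglement_loss(I_cs, I_cc, I_s, I_sc, lam_l1=1.0, lam_dir=1.0):
    # D_cs: style the content image gained; D_s: style of the style image itself.
    D_cs = style_distance(I_cs, I_cc)
    D_s = style_distance(I_s, I_sc)
    # L1 term: match the two style distances directly.
    l1 = F.l1_loss(D_cs, D_s)
    # Direction term: align the transfer directions (1 - cosine similarity).
    direction = 1.0 - F.cosine_similarity(D_cs, D_s, dim=-1).mean()
    return lam_l1 * l1 + lam_dir * direction
```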

Style Reconstruction Prior

To fully use the prior information provided by the style image and further elevate the stylization effects, we integrate a style reconstruction prior into the fine-tuning of the style transfer module.
Therefore, we can define a style reconstruction loss between Iss and the style image Is,

where Iss is the stylized result given Ics as content.
We optimize it separately before optimizing the style disentanglement loss LSD.
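As a sketch of this prior, with a plain pixel L2 standing in for whatever reconstruction measure the paper actually uses:

```python
import torch.nn.functional as F

def style_reconstruction_loss(I_ss, I_s):
    # I_ss: stylized result when the style image's own extracted content is the
    # input; the prior asks it to reconstruct the style image I_s itself.
    return F.mse_loss(I_ss, I_s)
```

In the fine-tuning sketch above, this loss would be minimized on its own for an initial set of iterations before switching to the style disentanglement loss LSD.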
