[Review] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Hwan Heo (허환) · November 26, 2021

DeepLearning_Paper_Review

Adjusting a GAN's latent code to produce a desired image manipulation is still known to be difficult. This paper uses CLIP, a feature extractor trained on paired image and text embeddings, to turn a text-driven latent vector into a steerable style code for the GAN. In doing so, it demonstrates that CLIP-based, text-driven manipulation works effectively in GANs.
ICCV2021

1. CLIP

CLIP (Contrastive Language-Image Pre-training) is an image-language multi-domain feature encoder released by OpenAI. To learn the features shared by images and text in a single embedding space, the model is trained with the structure shown in the following figure.

An image encoder (ViT) and a text encoder (Transformer) are trained jointly with contrastive learning (InfoNCE loss).
This makes matching images and text share the same semantic direction in CLIP space (the embedding space).
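
To make the contrastive objective concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. It is an illustration rather than CLIP's actual training code: the function names, the toy 2-d embeddings, and the fixed `temperature=0.07` are assumptions made for this example.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i should match text i and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits: rows are images, columns are texts.
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    logits_t = [list(col) for col in zip(*logits)]  # rows are texts
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch (assumed values): correctly paired vs. deliberately shuffled texts.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_contrastive_loss(image_batch, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_contrastive_loss(image_batch, [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)
```

The loss is small when each image is closest to its own caption and large when the pairs are mismatched, which is what pulls matching image and text embeddings toward the same direction.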

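The CLIP authors also built a dataset of 400 million image-text pairs in order to train high-quality encoders.
As a result, CLIP far outperforms earlier SOTA image-language models on zero-shot classification, and, as with other contrastive learning methods, its features transfer to downstream tasks by training only a simple linear layer.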
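
Zero-shot classification with a shared embedding space reduces to a nearest-neighbor lookup: embed one text prompt per class, then pick the class whose text embedding is most similar to the image embedding. The sketch below shows that step only. The 3-d vectors are hypothetical stand-ins for real encoder outputs, and `zero_shot_classify` is a name chosen for this example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the prompt whose text embedding is most similar to the image embedding."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))

# Hypothetical embeddings standing in for text-encoder outputs.
class_text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.2, 0.8, 0.1]  # hypothetical image-encoder output

prediction = zero_shot_classify(image_emb, class_text_embs)
print(prediction)
```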

The StyleCLIP authors use this pair of CLIP encoders to propose ways of finding meaningful latent vectors that a GAN can use.

2. Text-Driven Manipulation

Manipulation latent codes are generated in two main ways, 1) Optimization and 2) a Learned Mapper, which are non-parametric and parametric methods, respectively.

2.1. Optimization

The authors assume that the CLIP model can extract the semantic information shared by an image-text pair in CLIP space, and propose the following optimization objective.

$$
\begin{aligned} \operatorname*{arg\,min}_{w}\ &D_{\text{CLIP}}(G(w), t) + \lambda_{L_2} \| w - w_s \|_2 + \lambda_{\text{ID}}\mathcal L_{\text{ID}}(w) \\ \text{where }&\mathcal L_{\text{ID}}(w) = 1 - \text{sim}(R(G(w_s)), R(G(w))) \end{aligned}
$$

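Here $w_s$ is the initial latent code (that of the source image), $G$ is the generator, and $t$ is the text prompt.
Unpacking the objective term by term, its solution is the latent code $w$ that does all of the following:

First term: minimizes the CLIP-space distance between the target text $t$ and the image generated from $w$.
Second term: minimizes the difference between the original latent code $w_s$ and the edited latent code $w$.
Third term: minimizes the identity difference between the two generated images, where $R$ is a face recognition model (ArcFace).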
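
The three terms can be sketched in code as follows. This is only an illustration of how the objective is assembled: `G`, `clip_image`, and `R` are passed in as plain functions, the toy lambdas below are stand-ins for StyleGAN, the CLIP image encoder, and ArcFace, $D_{\text{CLIP}}$ is taken to be cosine distance, and the weights `lambda_l2` and `lambda_id` are arbitrary illustrative values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def styleclip_objective(w, w_s, text_emb, G, clip_image, R,
                        lambda_l2=0.1, lambda_id=0.1):
    image, source_image = G(w), G(w_s)
    # First term: CLIP-space distance between the generated image and the text.
    d_clip = 1.0 - cosine(clip_image(image), text_emb)
    # Second term: L2 distance between the edited and the source latent code.
    l2 = norm([a - b for a, b in zip(w, w_s)])
    # Third term: identity loss between source and edited images.
    l_id = 1.0 - cosine(R(source_image), R(image))
    return d_clip + lambda_l2 * l2 + lambda_id * l_id

# Toy stand-ins (assumptions): the "image" is the latent itself, and the
# "encoders" simply read different coordinates of it.
G = lambda w: w
clip_image = lambda img: img[:2]
R = lambda img: img[1:]

w_s = [1.0, 0.0, 1.0]
text_emb = [0.0, 1.0]
source_value = styleclip_objective(w_s, w_s, text_emb, G, clip_image, R)
edited_value = styleclip_objective([0.5, 1.0, 1.0], w_s, text_emb, G, clip_image, R)
print(source_value, edited_value)
```

In this toy setup the edited code scores lower than the source code because its CLIP term drops by more than the L2 and identity penalties add.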

With iterative updates, this objective yields the desired latent code for any image-text pair.
However, solving for a new latent code every time is inefficient, so the authors propose a Mapper Network that uses the same objective as its training loss.
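
The iterative update is ordinary gradient descent on the latent code. Below is a minimal sketch on a simplified toy objective (CLIP term plus L2 term, with the identity term omitted for brevity). The generator and encoder are replaced by the identity map, the gradient is estimated with finite differences instead of backpropagation, and the step size, step count, and `lambda_l2` are illustrative assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def objective(w, w_s, text_emb, lambda_l2):
    # Toy stand-in: the "generated image embedding" is w itself.
    d_clip = 1.0 - cosine(w, text_emb)
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_s)))
    return d_clip + lambda_l2 * l2

def numerical_grad(f, w, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = []
    for k in range(len(w)):
        hi = w[:k] + [w[k] + eps] + w[k + 1:]
        lo = w[:k] + [w[k] - eps] + w[k + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def optimize_latent(w_s, text_emb, lambda_l2=0.05, lr=0.1, steps=200):
    w = list(w_s)  # start from the source latent code
    f = lambda v: objective(v, w_s, text_emb, lambda_l2)
    for _ in range(steps):
        g = numerical_grad(f, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w_s = [1.0, 0.2]
text_emb = [0.0, 1.0]
w_edit = optimize_latent(w_s, text_emb)
before = objective(w_s, w_s, text_emb, 0.05)
after = objective(w_edit, w_s, text_emb, 0.05)
print(before, after)
```

Every new image-text pair requires running this loop again from scratch, which is the inefficiency that motivates a learned mapper.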

2.2. Mapper Network

The mapper network has the following structure.

$$
M_t (w) = (M^c_t (w_c ),\ M^m_t (w_m ),\ M^f_t (w_f))
$$

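Here each $M$ is a separate fully connected (FC) network, one each for the coarse, medium, and fine parts of the latent code.
The latent code is split into three because GANs are known to have a semantic hierarchy: different layers contribute to different aspects of generation (layout, object, color scheme, and so on).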
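
The three-way split can be sketched as below. The layer boundaries (the first 4 layers as coarse, the next 4 as medium, the rest as fine, for an 18-layer latent code) are an assumption for this example, and the simple scaling functions are stand-ins for the learned FC networks.

```python
def split_latent(w_plus, coarse_end=4, medium_end=8):
    """Split per-layer latent codes into (coarse, medium, fine) groups."""
    return w_plus[:coarse_end], w_plus[coarse_end:medium_end], w_plus[medium_end:]

def apply_mapper(w_plus, m_coarse, m_medium, m_fine):
    """M_t(w) = (M^c(w_c), M^m(w_m), M^f(w_f)), concatenated back in layer order."""
    w_c, w_m, w_f = split_latent(w_plus)
    return ([m_coarse(v) for v in w_c]
            + [m_medium(v) for v in w_m]
            + [m_fine(v) for v in w_f])

def scale(s):
    # Stand-in for a learned FC network: just scales its input.
    return lambda v: [s * x for x in v]

# Toy latent code: 18 layers, 2 dimensions each (real codes are far larger).
w_plus = [[float(i), 1.0] for i in range(18)]

# Only the medium mapper is active, so only layers 4..7 are edited.
delta = apply_mapper(w_plus, scale(0.0), scale(0.1), scale(0.0))
edited = [[a + b for a, b in zip(w, d)] for w, d in zip(w_plus, delta)]
print(edited[3], edited[4], edited[10])
```

Keeping the groups separate makes it possible to train or disable one level of the hierarchy without touching the others.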

The mapper network is trained with the following loss, which resembles the optimization objective and is evaluated on the latent code the mapper produces.

$$
\begin{aligned} &\mathcal L_{\text{CLIP}}(w) = D_{\text{CLIP}}(G(w + M_t(w)), t) \\ &\mathcal L(w) = \mathcal L_{\text{CLIP}}(w) + \lambda_{L_2} \|M_t(w) \|_2 + \lambda_{\text{ID}} \mathcal L_{\text{ID}} (w) \end{aligned}
$$

The main idea is almost identical to the optimization objective. The training loss can be read as aiming to:

minimize the CLIP-space embedding difference between the text and the image generated from the mapper's latent code $w + M_t(w)$;
minimize the magnitude of $M_t(w)$, i.e. the difference between the original and the manipulated latent code;
keep the generated result similar to the one produced by the initial latent code.

3. Global Direction

Another key contribution of the paper is a discussion that focuses on the differences among latent space, image space, text space, and CLIP space, and on how to prevent different text prompts from producing similar manipulation steps.

Consider the following two manifolds of image and text attributes in CLIP space.

$\mathcal I:$ manifold of Image Embeddings in CLIP
$\mathcal T :$ manifold of Text Embeddings in CLIP

The authors argue that no one-to-one mapping exists between these two manifolds, because a given text can correspond to many different images.

The problem this creates for StyleCLIP is that the target text and the original text share many visual attributes. For example, in the pair

'Car' ↔ 'Sports Car'

the two semantic embeddings are much closer to each other than either is to 'bus' or 'truck', so it is hard to apply only the manipulation that is specific to 'sports car'.

3.1. Prompt Engineering

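To address this, the authors use a technique called 'prompt engineering' to find an appropriate manipulation direction for 'sports car'.

As the figure above shows, 'car' and 'sports car' are each represented by an average embedding (averaged over several sentences that contain the phrase), and the normalized difference vector between the two average embeddings is taken as the target direction $\Delta t$.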
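
The computation of $\Delta t$ can be sketched as follows. The two sentence templates and the keyword-counting `toy_encode` function are hypothetical stand-ins chosen so the example runs without a model. A real pipeline would call the CLIP text encoder in place of `toy_encode`.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_embedding(encode, templates, phrase):
    """Average the normalized embeddings of the phrase across all templates."""
    embs = [normalize(encode(t.format(phrase))) for t in templates]
    dim = len(embs[0])
    return [sum(e[k] for e in embs) / len(embs) for k in range(dim)]

def text_direction(encode, templates, neutral, target):
    """Normalized difference between the averaged target and neutral embeddings."""
    e_neutral = mean_embedding(encode, templates, neutral)
    e_target = mean_embedding(encode, templates, target)
    return normalize([t - n for t, n in zip(e_target, e_neutral)])

# Hypothetical toy encoder: one dimension per keyword.
KEYWORDS = ["car", "sports", "photo", "drawing"]

def toy_encode(text):
    words = text.lower().split()
    # The +0.1 offset keeps every embedding away from the zero vector.
    return [float(words.count(k)) + 0.1 for k in KEYWORDS]

templates = ["a photo of a {}", "a drawing of a {}"]
delta_t = text_direction(toy_encode, templates, "car", "sports car")
print([round(x, 3) for x in delta_t])
```

In this toy setup the largest component of `delta_t` is the "sports" dimension: what the two prompts have in common largely cancels, leaving mostly what distinguishes the target from the neutral text.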

3.2. Channel-wise relevance

Because the prompt engineering above is defined on the text manifold $\mathcal T$, we still need to find an appropriate step $\Delta s$ whose effect on the image manifold $\mathcal I$ matches the target direction $\Delta t$.

To separate the visual attributes that the two texts share from those unique to the manipulated image, the channel-wise relevance $R_c$ is defined as follows.

$$
R_c (\Delta i) = \mathbb E_{s \in S} \{ \Delta i_c \cdot \Delta i \}
$$

Here $\Delta i$ is defined as the difference embedding vector in CLIP space. $\left ( \textit{i.e. }D_{\text{CLIP}}(\Delta t) \sim \Delta i\right )$

$\Delta i_c$ is likewise defined as the difference embedding vector for channel $c$.

$$
\begin{aligned} &\Delta i_c = D_{\text{CLIP}} (G(s) ) - D_{\text{CLIP}} (G(s \pm \alpha \Delta s_c) ) \\ &\text{where }\Delta s_c : \text{zero vector except channel } c \end{aligned}
$$

In $\Delta s_c$, the value at channel $c$ is set to that channel's standard deviation.
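
A minimal sketch of the per-channel measurement follows. It reads the $\pm$ in the formula as a pair of images, one generated from $s + \alpha \Delta s_c$ and one from $s - \alpha \Delta s_c$, and takes $\Delta i_c$ as the difference of their CLIP embeddings. That reading is an assumption, as are the toy `G` and `clip_image` functions and all numeric values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def channel_direction(s, c, channel_std, alpha, G, clip_image):
    """CLIP-space change caused by perturbing only style channel c."""
    delta_s_c = [0.0] * len(s)      # zero vector ...
    delta_s_c[c] = channel_std[c]   # ... except channel c, set to its std
    s_plus = [a + alpha * d for a, d in zip(s, delta_s_c)]
    s_minus = [a - alpha * d for a, d in zip(s, delta_s_c)]
    e_plus, e_minus = clip_image(G(s_plus)), clip_image(G(s_minus))
    return [p - m for p, m in zip(e_plus, e_minus)]

def channel_relevance(samples, c, channel_std, alpha, G, clip_image, delta_i):
    """R_c: mean projection of the channel's CLIP-space change onto delta_i."""
    projections = [
        dot(channel_direction(s, c, channel_std, alpha, G, clip_image), delta_i)
        for s in samples
    ]
    return sum(projections) / len(projections)

# Toy stand-ins (assumptions): the image is the style code itself, and the
# "CLIP embedding" mixes channels 0 and 1 into one axis and channel 2 into another.
G = lambda s: s
clip_image = lambda img: [img[0] + img[1], img[2]]

channel_std = [1.0, 2.0, 0.5]
samples = [[0.0, 0.0, 0.0], [1.0, -1.0, 2.0]]
delta_i = [1.0, 0.0]  # target direction lies along the first CLIP axis

r0 = channel_relevance(samples, 0, channel_std, 5.0, G, clip_image, delta_i)
r2 = channel_relevance(samples, 2, channel_std, 5.0, G, clip_image, delta_i)
print(r0, r2)
```

Channel 0 moves the embedding along the target axis and gets a large relevance, while channel 2 moves it along the orthogonal axis and gets zero.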

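In other words, the channel-wise relevance $R_c$ measures what happens when only a single channel $c$ (a particular visual attribute) is manipulated:

if the channel changes an attribute that the original and target texts have in common, and so is unrelated to the target direction, the dot product is close to 0;
if the channel changes an attribute specific to the target direction, the magnitude of the dot product is relatively large.

It is therefore a similarity score with respect to the desired text manipulation direction. The authors exclude a channel from the manipulation latent code when its relevance falls below a threshold $\beta$, so the final latent manipulation code is defined as follows.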

$$
\Delta s = \begin{cases} \Delta i _c \cdot \Delta i &\text{if } |\Delta i _c \cdot \Delta i| \ge \beta \\ 0 & \text{otherwise} \end{cases}
$$

The figure above shows how StyleCLIP's output changes as the threshold value varies.
A larger threshold removes more visual attributes, so the visual effect becomes smaller. Moving down the rows, the change in visual effect becomes larger.
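
The thresholding rule for $\Delta s$ and the effect of $\beta$ can be sketched as follows. The relevance values are made up for illustration, and `manipulation_direction` is a name chosen for this example.

```python
def manipulation_direction(relevances, beta):
    """Keep a channel's relevance only if its magnitude reaches the threshold."""
    return [r if abs(r) >= beta else 0.0 for r in relevances]

# Hypothetical per-channel relevance values (one entry per style channel).
relevances = [0.30, -0.02, 0.11, -0.25, 0.04]

active_counts = {}
for beta in (0.05, 0.10, 0.20):
    delta_s = manipulation_direction(relevances, beta)
    active_counts[beta] = sum(1 for x in delta_s if x != 0.0)
    print(beta, delta_s, active_counts[beta])
```

Raising `beta` zeroes out more channels, so fewer attributes change at once and the edit becomes smaller but more targeted.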
