StyleGAN Inversion

이승화 · May 22, 2022

To manipulate a real image with StyleGAN, the real image $I$ must first be embedded as a vector in StyleGAN's latent space.

Two questions have to be settled for inversion:
(1) which latent space to embed into, and
(2) which embedding method to use.

StyleGAN's latent spaces include $Z$ space, $W$ space, $W_+$ space, $S$ space, $P$ space, and $P_N$ space.
Embedding methods are optimization-based, encoder-based, or a hybrid that combines the two.

The papers below can be classified and explained along these two axes. The main thing I took away from reading them is that balancing reconstruction and editability is what matters.

1. Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?

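This was the first paper to attempt StyleGAN inversion.
The authors embed into the $W_+$ space of StyleGAN1 using an optimization-based method.

Their justification is that embedding into $W_+$ space gives better results than embedding into $W$ space.
In the paper's comparison figure, (b, c, d) and (e, f, g) differ in how $w$ is initialized for the optimization.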

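The loss function combines a VGG-16 perceptual loss with a pixel-wise MSE loss.

Because the VGG model used for the perceptual loss takes 256-pixel inputs, smaller than the 1024-pixel images, the paper applies a resizing trick and reports better results with it. Resizing discards detail, however, so the authors note that a better perceptual loss still needs to be found.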
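A minimal sketch of the optimization-based embedding loop described above, under loudly stated assumptions: a frozen linear map `G` stands in for the generator and a frozen linear map `F` stands in for the VGG-16 feature extractor. Neither is the paper's model, and the sizes, learning rate, and `lam` weight are illustrative. The point is only the shape of the procedure: the latent code is the sole trainable variable, and the loss is a pixel-wise MSE plus a feature-space term.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM, FEAT_DIM = 8, 32, 16

# Toy stand-ins (assumptions): a frozen linear "generator" and a frozen
# linear "feature extractor" playing the role of the VGG-16 features.
G = rng.normal(size=(IMAGE_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
F = rng.normal(size=(FEAT_DIM, IMAGE_DIM)) / np.sqrt(IMAGE_DIM)

def loss_and_grad(w, target, lam=1.0):
    """Pixel MSE + lam * feature MSE, with the analytic gradient w.r.t. w."""
    residual = G @ w - target          # pixel-space error
    feat_residual = F @ residual       # feature-space error
    loss = np.mean(residual ** 2) + lam * np.mean(feat_residual ** 2)
    grad_image = (2.0 * residual / IMAGE_DIM
                  + lam * 2.0 * (F.T @ feat_residual) / FEAT_DIM)
    return float(loss), G.T @ grad_image

def invert(target, steps=3000, lr=0.3):
    """Gradient descent on the latent only; the generator stays frozen."""
    w = np.zeros(LATENT_DIM)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(w, target)
        history.append(loss)
        w = w - lr * grad
    return w, history

w_true = rng.normal(size=LATENT_DIM)
target = G @ w_true                    # an image the toy generator can produce
w_hat, history = invert(target)
```

In a real setting both stand-ins would be deep networks and the gradient would come from automatic differentiation rather than a closed form.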

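The StyleGAN used was trained on FFHQ. The optimization ran for 5000 iterations; the loss for faces converged relatively early, whereas the loss for other domains converged more slowly.
With a StyleGAN trained on the matching domain, fewer iterations would probably suffice.
Embedding one image reportedly took 7 minutes on a TITAN V100.

Another thing this suggests is that the latent space is shared across multiple domains, probably because human faces and animal faces, although different, belong to the common category of faces.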

comment

The paper seems to consider reconstruction only; it says little about the manipulation quality of the embedded images.
Seven minutes per image does not look practical.
Some way of recovering high-frequency detail seems necessary.

2. In-Domain GAN Inversion for Real Image Editing

motivation

The starting point is that images embedded with existing inversion methods focus only on reconstruction and therefore edit poorly. The authors argue that the cause is that inversion does not map the image into the semantic domain of the original latent space.

Here, the semantics refer to the emergent knowledge that GAN has
learned from the observed data.

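The authors first apply an encoder that maps into the $W$ space of StyleGAN1 and then optimize.

The training scheme of the domain-guided encoder is as follows.
The authors claim that, because the latent code produced from a real image is passed through the generator, the encoder can learn the semantic domain. (To me this looks no different from existing approaches.)
In addition, an adversarial loss against the discriminator is added when training the encoder.
The encoder architecture resembles a typical progressive ResNet.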

This is in the same vein as the izi encoder training used in f-AnoGAN.

After the encoder is trained, its output is further optimized with the loss defined in the paper.
According to the authors' GitHub, 100 iterations are enough (8 seconds on a P40).

comment

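Doesn't the authors' argument about mapping into the semantic domain apply to other methods as well?
The paper does not state how long the encoder takes to train.
The encoder-only results are clearly worse than Image2StyleGAN's optimization method.
Optimization-based methods usually give better results at the cost of time, yet here high-frequency information is missing. That is all the more puzzling given that the loss terms do not differ. The encoder architecture alone seems insufficient for mapping into the latent space.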

3. Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation (pSp)

As the title suggests, the goal is to edit real images with an encoder.
Unlike IDInvert, no separate optimization stage is needed; a single encoder yields the latent code.
Moreover, it does not stop at inversion: the same encoder also handles translation tasks.

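The pSp encoder has a Feature Pyramid structure, which appears to be inspired by StyleGAN's architecture. The layers trained here are the convolutional map2style layers.

In addition, the mean $w$ vector of the pretrained generator is added to the encoder's output vector before the image is generated.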
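A small sketch of the "add the mean $w$" step, assuming an 18 x 512 $W_+$ layout (the usual one for a 1024 x 1024 StyleGAN). The function name `to_w_plus` and the random placeholder for the mean latent are hypothetical, not taken from the pSp code. It shows that the encoder effectively predicts an offset from the average latent, so an encoder that outputs zeros already lands on the average code.

```python
import numpy as np

N_STYLES, LATENT_DIM = 18, 512   # assumed W+ layout for a 1024x1024 generator

def to_w_plus(encoder_output, w_avg):
    """Treat the encoder output as an offset from the generator's mean w."""
    return encoder_output + w_avg[None, :]   # broadcast w_avg over all styles

rng = np.random.default_rng(0)
w_avg = rng.normal(size=LATENT_DIM)                  # placeholder for the real mean w
encoder_output = np.zeros((N_STYLES, LATENT_DIM))    # an encoder predicting "no offset"
codes = to_w_plus(encoder_output, w_avg)
```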

A distinctive feature of the loss is the use of $L_{ID}$ for faces.

For translation tasks too, the paper shows that style mixing and modified loss terms keep additional training steps to a minimum.

As limitations, reconstruction quality is low for data that StyleGAN was not trained on, and high-frequency information is not fully captured.

comment

There are no experimental results for inversion followed by manipulation, probably because the aim was to build task-specific encoders.

4. Designing an Encoder for StyleGAN Image Manipulation

The abstract describes a trade-off between inversion and editability and proposes an encoder that takes it into account. According to the paper, the region into which existing encoders map real images is a sparsely populated part of the latent space StyleGAN was trained on, so manipulating latent vectors there gives poor results.

We identify and analyze the existence of a distortion-editability tradeoff and a distortion-perception tradeoff within the StyleGAN latent space.

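The paper then evaluates an inversion by (1) distortion and (2) perceptual quality, alongside editability. Distortion is the similarity between the real image and its reconstruction, and perceptual quality is how realistic the reconstruction looks.

The authors show that existing methods, focusing on (1) distortion, map into $W_+$ space, which is very poor in terms of perceptual quality. As an alternative they propose mapping into a space close to $W$ space.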

The expressiveness of the W latent space has been shown to be limited.
The space W+ has more degrees of freedom, and is thus significantly more expressive than W. Although this extension is expressive enough to represent real images, as we shall show, inverting images away from the original W space reaches regions of the latent space that are less editable and in which the perceptual quality is lower.

Our key insight is that editability and perceptual quality are best achieved
by inverting an image close to W.
“Close” will be characterized by two key properties which are, Firstly, low variance between the different style vectors, Secondly, each style vector should lie within the distribution W.

In the paper's encoder, the distortion-perception trade-off is controlled through the "Proximity of latent code to W space".

as we approach W, the distortion worsens while the editability and
perceptual quality improve.

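What exactly the paper means by the "True distribution of W" is ambiguous. Judging from my reading and from the graph in the paper, it seems to mean the $w$ vectors used when StyleGAN was trained.

This is the evidence that the proposed mapping into $W_*$ space mitigates the respective weaknesses of $W$ space and $W_+$ space. Distortion looks better than with $W$, but perceptual quality looks considerably worse.

The two arrows in the paper's figure illustrate what the authors mean by close to $W$ space. The red arrow means that the $w_i$ are similar to one another. The blue arrow means that the multiple latent vectors for a single image are similar.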

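It is implemented as follows.

(1) Rather than learning to produce distinct $w_i$, the encoder learns a single $w$ and a set of offsets $\Delta_i$. The $\Delta_i$ start at 0 and are learned progressively, one after another, rather than all at once. The authors say this design has an effect similar to StyleGAN's coarse-to-fine structure.
(2) To keep the resulting $w_i$ close to $W$, an additional discriminator is used: one that distinguishes $w$ codes produced by the mapping network from latent codes produced through (1). However, the real samples from $W$ are not the $w$ of real images, which seems at odds with the stated purpose.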
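A rough sketch of the "single $w$ plus offsets" idea in (1), with hypothetical helper names (`compose_codes`, `delta_penalty`) and toy dimensions. The progressive schedule is simplified here to a count `n_active` of enabled offsets, and the penalty is assumed to be a sum of $L_2$ norms; the actual e4e training code may differ in both respects.

```python
import numpy as np

def compose_codes(w, deltas, n_active):
    """Row i is w + delta_i if offset i is enabled (i < n_active), else just w."""
    enabled = (np.arange(deltas.shape[0]) < n_active)[:, None]
    return w[None, :] + enabled * deltas

def delta_penalty(deltas, n_active):
    """Sum of L2 norms of the enabled offsets; small values keep all codes near one w."""
    return float(np.linalg.norm(deltas[:n_active], axis=1).sum())

rng = np.random.default_rng(0)
w = rng.normal(size=4)              # the single shared code
deltas = rng.normal(size=(6, 4))    # one offset per style input

early = compose_codes(w, deltas, n_active=0)   # start of training: pure W
late = compose_codes(w, deltas, n_active=6)    # all offsets enabled: W+-like
```

With no offsets enabled every style input receives the same $w$, so the variance across style vectors is zero; enabling offsets one at a time trades that closeness to $W$ for expressiveness.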

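The trade-off control the authors propose is very simple: interpolate between the vector obtained from the pSp encoder and the vector obtained from e4e. In the paper's image, the second picture is the pSp inversion and the sixth is the e4e inversion. Judging by distortion alone pSp is better, but the edited results have better perceptual quality with e4e. Accordingly, edits of the interpolated latents fall between the two along the distortion-perception trade-off.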
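The interpolation can be sketched in a few lines. Plain linear interpolation is an assumption here (the post does not say which scheme is used), and `w_psp` and `w_e4e` are random placeholders for codes that would come from the two encoders.

```python
import numpy as np

def interpolate_latents(w_psp, w_e4e, alpha):
    """alpha = 0 returns the pSp code, alpha = 1 the e4e code."""
    return (1.0 - alpha) * w_psp + alpha * w_e4e

rng = np.random.default_rng(0)
w_psp = rng.normal(size=(18, 512))   # placeholder for a pSp inversion
w_e4e = rng.normal(size=(18, 512))   # placeholder for an e4e inversion

# Five codes sweeping from low distortion (pSp) toward higher editability (e4e).
sweep = [interpolate_latents(w_psp, w_e4e, a) for a in np.linspace(0.0, 1.0, 5)]
```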

5. Pivotal Tuning for Latent-based Editing of Real Images

Where e4e addresses the distortion-editability trade-off by controlling the space it maps into, PTI addresses it by tuning the generator. In particular, it enables faithful inversion of out-of-bound images that StyleGAN was not trained on, such as faces with heavy makeup.

Before PTI, an out-of-bound image is mapped to a region with low editability. After tuning, distortion is low and editability goes up as well.

The authors' justification for why the generator can be tuned on a real image is given in the two quotes below.
(Generator tuning without harming editability and general synthesizing performance)
To elaborate on the second point: as I understood it, latent vectors B and C in the paper's figure lie close to each other, so the generator can be tuned without adverse effects. (Did the paper actually demonstrate this?)

Due to StyleGAN’s disentangled nature, slight and local changes to its produced appearance can be applied without damaging its powerful editing capabilities.

Since $w_p$ is close enough, training the generator to produce the input image from the pivot can be achieved through augmenting appearance-related weights only, without affecting the well-behaved structure of StyleGAN’s latent space.

The inversion procedure is as follows.

(1) Map into $W$ space with the inversion method used in StyleGAN2 (optimization).
(2) Using the $w_p$ obtained in (1) as the pivot, tune the generator so that it reconstructs the out-of-bound image with low distortion. The paper reports 350 iterations per batch.

The two steps together reportedly take 3 minutes per image on an RTX 2080.
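The two steps can be sketched on a toy problem. Everything here is an assumption for illustration: the generator is a single linear map `G`, least squares stands in for the iterative latent optimization of step (1), and step (2) uses plain gradient descent on an MSE loss with a made-up learning rate. The only thing borrowed from the text is the structure, namely that the pivot $w_p$ is found first with the generator frozen, and then the generator is tuned with $w_p$ frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM = 8, 32
G = rng.normal(size=(IMAGE_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)  # toy generator

# A target the toy generator cannot reproduce exactly ("out-of-bound").
target = G @ rng.normal(size=LATENT_DIM) + 0.5 * rng.normal(size=IMAGE_DIM)

def distortion(generator, w):
    """Mean squared error between the generated and the target image."""
    return float(np.mean((generator @ w - target) ** 2))

# Step 1: find the pivot w_p with the generator frozen.
# (Least squares stands in for the iterative latent optimization.)
w_p, *_ = np.linalg.lstsq(G, target, rcond=None)

# Step 2: freeze w_p and tune the generator weights by gradient descent.
G_tuned = G.copy()
for _ in range(350):
    residual = G_tuned @ w_p - target
    grad = 2.0 / IMAGE_DIM * np.outer(residual, w_p)   # d(MSE)/dG
    G_tuned -= 0.5 * grad

before, after = distortion(G, w_p), distortion(G_tuned, w_p)
```

In this toy the best latent alone leaves a nonzero residual (`before`), and tuning the generator around the pivot removes it (`after`).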

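What I found interesting while reading is that the generator can be tuned on multiple images at once. With multiple identities, however, a "ripple effect" appears as a side effect.

Without a dedicated regularization term, artifacts appear. The authors resolve this by interpolating with random $w$ codes so that the influence of PTI stays local. (I would like to understand the mechanism better.)

PTI's reconstruction of out-of-bound images does look remarkable in practice. It even reproduced paintings, which StyleGAN was never trained on, seemingly perfectly.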

comment

Out-of-bound images get mapped to a region of the latent space with low editability, and this method solves that problem through pivotal tuning.

6. Improved StyleGAN Embedding: Where are the Good Latents?

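This paper revisits Image2StyleGAN, the first paper introduced above, with the distortion-editability trade-off in mind.
Its main idea is to add a regularizer that moves the embedding of an out-of-bound image away from the low-density region of $W$ space, as illustrated in e4e, toward a higher-density region.

e4e is an encoder-based approach that adds two loss terms steering the output close to $W$ space, PTI tunes the generator, and this paper proposes $L_2$ regularization in $P_N$ space as its solution.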

The $L_2$ norm in $P_N$ space is a Mahalanobis distance of latent codes, so that $L_2$ regularization in this space will bias embeddings towards more densely sampled regions of the GAN latent space

what is $P_N$ space?

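$P_N$ space (a multivariate normal distribution) is derived in order to exploit the properties of the multivariate normal distribution. As stated at the outset, the goal is to move the $w$ codes into a high-probability region. The Mahalanobis distance, one measure of distance between a point and the center of a distribution, is therefore used as the regularizer.

The individual (per-dimension) distributions of $W$ space are right-skewed. To make them symmetric, the authors invert the slope of the final Leaky ReLU of the mapping network; the resulting space is called $P$ space. Strictly speaking the Mahalanobis distance does not require independence, but PCA whitening seems to be applied to make the loss term easier to compute.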
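A small sketch of what "inverting the slope of the final Leaky ReLU" could look like, assuming the mapping network's last activation uses a negative slope of 0.2, so that the inverse is a Leaky ReLU with slope 1 / 0.2 = 5. The function names are hypothetical and the values are toy numbers; only the slope relationship is the point.

```python
def leaky_relu(x, negative_slope):
    """Scalar Leaky ReLU: identity for x >= 0, scaled by negative_slope otherwise."""
    return x if x >= 0 else negative_slope * x

def w_to_p(w, negative_slope=0.2):
    """Undo the mapping network's final Leaky ReLU, element by element."""
    return [leaky_relu(v, 1.0 / negative_slope) for v in w]

pre_activation = [-3.0, -0.5, 0.0, 2.0]             # symmetric around zero
w = [leaky_relu(v, 0.2) for v in pre_activation]    # negatives squashed: skewed
p = w_to_p(w)                                       # symmetry restored
```

Because the forward activation compresses only the negative half, its output is skewed toward positive values; stretching the negative half back by the reciprocal slope recovers the pre-activation values.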

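PCA whitening makes the variables (1) uncorrelated and (2) unit-variance.
When the covariance matrix decomposes as $\Sigma = U\Lambda U^T$, the components of $x_{rot} = U^T(x - m)$ are mutually uncorrelated.
The rescaled components $v_i = \frac{x_{rot,i}}{\sqrt{\lambda_i}}$ are then uncorrelated with unit variance.

The space obtained by this transformation is $P_N$ space. The $L_2$ norm of a vector $v$ in this space equals the Mahalanobis distance of the corresponding point under $N(m, \Sigma)$, so $\|v\|_2^2 = d^2_m$.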
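The whitening step and its link to the Mahalanobis distance can be checked numerically. This is a sketch on synthetic 3-dimensional Gaussian data, not on real StyleGAN latents; `fit_whitening` and `whiten` are hypothetical helper names, and the mixing matrix is arbitrary.

```python
import numpy as np

def fit_whitening(samples):
    """Estimate the mean m and the eigendecomposition Sigma = U diag(lam) U^T."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return mean, cov, eigvals, eigvecs

def whiten(x, mean, eigvals, eigvecs):
    """Rotate onto the principal axes, then rescale each axis to unit variance."""
    return (x - mean) @ eigvecs / np.sqrt(eigvals)

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.5]])
samples = rng.normal(size=(5000, 3)) @ mixing.T + np.array([1.0, -2.0, 0.5])

mean, cov, eigvals, eigvecs = fit_whitening(samples)
whitened = whiten(samples, mean, eigvals, eigvecs)

# For one point: squared L2 norm after whitening vs. squared Mahalanobis distance.
x = samples[0]
squared_norm = float(np.sum(whiten(x, mean, eigvals, eigvecs) ** 2))
squared_mahalanobis = float((x - mean) @ np.linalg.inv(cov) @ (x - mean))
```

The whitened samples have an identity covariance, and the two squared distances agree, which is why an $L_2$ penalty in the whitened space acts as a Mahalanobis penalty in the original one.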

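Using $d^2_m$ as the regularizer, as the paper's figure shows, the code is adjusted relatively little along high-variance directions and is pulled much closer to the center along low-variance directions. This follows directly from the definition of the Mahalanobis distance. In StyleGAN's latent space, the high-variance directions can be thought of as those with a lot of variation, i.e. the ones that matter for reconstruction. The point is that modifying the $w$ vector without taking this into account gives poor results.

The paper's graph shows that with $d^2_m$ as the regularizer the reconstruction loss increases only marginally, while the distance to the center of the distribution shrinks by a comparatively large amount.

Just as $W$ space was extended to $W_+$ space, $P_N$ space can be extended to $P_{N+}$ space.

Note, however, that $P_{N+}$ is used only when computing the regularization term for the $w_+$ codes; the latent vector actually used for inversion is $w_+$.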

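7. ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement

This paper proposes a way to improve encoder-based inversion methods such as pSp and e4e.
Instead of predicting the latent vector in one shot, it predicts a residual of $w$ over $N$ iterations. In a sense it can be seen as a more systematic version of the approach used in e4e.
(Although how the residuals are learned gradually in e4e would need to be checked in the code.)

In other words, because encoder-based methods reconstruct less accurately than optimization-based ones, the paper adds an iterative mechanism to the encoder to close the gap.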
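The iterative residual idea can be illustrated with a deliberately simple toy. The generator and encoder below are hypothetical stand-ins: the "generator" just doubles its input, and the "encoder" sees the target image together with the current reconstruction and is built to recover only half of the remaining latent error per step. Starting from zeros plays the role of starting from an average latent. None of this reflects the real architecture; it only shows the update $w_{t+1} = w_t + \Delta_t$ and how an imperfect encoder can self-correct over several steps.

```python
def toy_generator(w):
    """Stand-in generator: maps a latent to an 'image' by doubling each value."""
    return [2.0 * v for v in w]

def toy_encoder(target_image, current_image):
    """Predicts a latent residual; recovers only half of the remaining error."""
    return [0.25 * (t - c) for t, c in zip(target_image, current_image)]

def restyle_invert(target_image, w_start, n_steps=5):
    """Repeatedly add the predicted residual to the current latent estimate."""
    w = list(w_start)
    trajectory = [list(w)]
    for _ in range(n_steps):
        delta = toy_encoder(target_image, toy_generator(w))
        w = [v + d for v, d in zip(w, delta)]
        trajectory.append(list(w))
    return trajectory

w_true = [1.0, -2.0]
trajectory = restyle_invert(toy_generator(w_true), w_start=[0.0, 0.0])
errors = [max(abs(a - b) for a, b in zip(w, w_true)) for w in trajectory]
```

A single forward pass of this encoder would leave half of the error in place, whereas five passes shrink it by a factor of 32.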

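The encoders used in pSp and e4e have an FPN structure. The authors argue that with iterative training this elaborate FPN encoder is redundant, so they use a simplified encoder in which every layer receives the same input, as shown in the paper's figure.

The experimental results clearly support the paper's intent. e4e reconstructs noticeably worse than optimization-based methods, but e4e with ReStyle makes up for this.

In terms of the quality-time trade-off as well, comparing pSp and e4e with their ReStyle counterparts shows that the lightweight encoder keeps them competitive.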


Looking at the change from iteration to iteration, the image is indeed recovered in coarse-to-fine order.
As the authors say, this is evidence that the encoder learns by itself to refine more effectively at each step.

This relaxed constraint allows the encoder to iteratively narrow down its inversion to the desired target latent code in a self-correcting manner.

comment

This kind of iterative residual learning looks like it could be applied effectively in other fields too.

8. HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing

HyperStyle uses a hypernetwork to address the long per-image tuning time of approaches that resolve the reconstruction-editability trade-off through generator tuning. A hypernetwork is another neural network that predicts the weights of a given neural network. Compared with PTI, the hypernetwork that predicts the generator's weight offsets can be thought of as the encoder-based counterpart to PTI's optimization-based approach.
(PTI needs 350 iterations, whereas HyperStyle needs no more than 10.)

Unlike PTI, HyperStyle starts from $w_{init} \in W$ given by e4e($x$).
The hypernetwork takes as input the original image $x$ and the image $G(w_{init})$ produced by the generator to be tuned.
As in ReStyle, it predicts offsets to the generator weights and repeats this $N$ (5 to 10) times.
At each iteration the generated image comes from the generator with the updated weights.

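Unlike the paper's figure, the actual training does not learn offsets for the coarse-level weights. The reasoning is that the images inverted by e4e already capture coarse detail well enough.

StyleGAN has about 30M parameters, so predicting an offset for every one of them would be a large overhead in itself.
The paper therefore shares a single offset per channel. With this, the parameter count is not much larger than that of encoder-based methods.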

A channel here means each input channel belonging to each output filter. In other words, every kernel size x kernel size slice holds its own parameters, and these are all covered by a single offset.
The output of each Refinement Block in the hypernetwork therefore has dimension 1 x 1 x $C_{in}$ x $C_{out}$, and it is replicated to the kernel size when tuning the generator.
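A sketch of how one offset per (output filter, input channel) pair can be replicated over a kernel. Two things here are assumptions rather than statements from the post: the weights use the PyTorch-style layout `(C_out, C_in, k, k)` instead of the 1 x 1 x $C_{in}$ x $C_{out}$ ordering written above, and the offset is applied multiplicatively as `weights * (1 + offsets)`. The tensor sizes are toy values and `apply_channel_offsets` is a hypothetical name.

```python
import numpy as np

def apply_channel_offsets(weights, offsets):
    """weights: (C_out, C_in, k, k); offsets: (C_out, C_in, 1, 1).

    Broadcasting replicates each offset across its k x k kernel slice.
    """
    return weights * (1.0 + offsets)

rng = np.random.default_rng(0)
C_OUT, C_IN, K = 4, 3, 3
weights = rng.normal(size=(C_OUT, C_IN, K, K))         # one conv layer's weights
offsets = 0.1 * rng.normal(size=(C_OUT, C_IN, 1, 1))   # what the hypernetwork predicts

tuned = apply_channel_offsets(weights, offsets)
values_predicted = offsets.size     # C_out * C_in
values_tuned = weights.size         # C_out * C_in * k * k
```

For a 3 x 3 kernel this cuts the number of values the hypernetwork has to predict for the layer by a factor of nine.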

Additionally, each Refinement Block ends with two FC layers, and the authors share the weights of these two FC layers across all Refinement Blocks to make the model lighter still. The justification for this is in the original hypernetwork paper.