[Review] Learning Statistical Texture for Semantic Segmentation

redgreen · May 9, 2022

A paper presented at CVPR 2021.
github

1. Introduction

Existing semantic segmentation models have focused on contextual information in high-level features.
Because relying only on high-level layers can miss important details such as edges, techniques such as skip connections are used.

The paper argues that low-level texture features carry not only local structure but also global statistical knowledge.

To analyze the distribution of low-level information, the authors propose STLNet (Statistical Texture Learning Network).

They argue that image texture is not just a local structural property such as boundary, smoothness, or coarseness, but also a global statistical property.

They describe the approach as a kind of spectral domain analysis that extracts a histogram of intensity from low-level information.

2. Model

The model consists of three main components.

2.1 QCO (Quantization and Counting Operator)

QCO performs two operations, quantize and count, and comes in two variants: 1-d QCO and 2-d QCO.
1) quantize the input feature into multiple levels
2) count the number of features belonging to each level

2.1.1 1-d QCO

Quantization

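1) Apply global average pooling to the input map, then compute the cosine similarity $S$ between the pooled result and the input map at each position.
2) Split the similarity range into N levels (N is chosen freely), then obtain the quantization encoding vector $E_i$ for each position.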

The resulting $E_i$ indicates the quantization level of $S_i$, the similarity value at position $i$.

According to the paper, using a smoother encoding than argmax or one-hot encoding avoids the vanishing gradient problem.
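
The two quantization steps can be sketched in NumPy as follows. This is only a rough illustration, not the authors' code: the function name `quantize_1d`, the evenly spaced level centers between the minimum and maximum similarity, and the triangular soft assignment are all assumptions standing in for the exact formula in the paper.

```python
import numpy as np

def quantize_1d(feat, num_levels=8, eps=1e-8):
    """Soft-quantize a (C, H, W) feature map; all names here are hypothetical."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                     # (C, HW)
    g = x.mean(axis=1, keepdims=True)              # global average pooling, (C, 1)
    # Cosine similarity between g and the feature at every position.
    s = (g * x).sum(axis=0) / (np.linalg.norm(g) * np.linalg.norm(x, axis=0) + eps)
    # Assumption: N evenly spaced level centers covering [min(S), max(S)].
    width = max((s.max() - s.min()) / num_levels, eps)
    levels = s.min() + width * (np.arange(num_levels) + 0.5)
    # Assumption: a triangular kernel replaces a hard argmax / one-hot choice.
    e = np.clip(1.0 - np.abs(levels[None, :] - s[:, None]) / width, 0.0, 1.0)
    return s, levels, e                            # (HW,), (N,), (HW, N)

rng = np.random.default_rng(0)
s, levels, e = quantize_1d(rng.normal(size=(16, 4, 4)), num_levels=8)
```

Each position spreads fractional weight over its nearest levels, so the encoding keeps a nonzero gradient with respect to the similarity values, unlike a hard argmax.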

Counting

The level vector $L$ is concatenated with the channel-wise mean of $E$ to obtain the counting map $C$.
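
A tiny worked example of the counting step, assuming $E$ is stored with shape (positions, levels) so that the mean is taken over positions. The toy values and the variable names are made up for illustration.

```python
import numpy as np

# Toy inputs (hypothetical): 2 levels, 4 positions; E has shape (HW, N).
levels = np.array([0.25, 0.75])
e = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0],
              [0.0, 1.0]])

counts = e.mean(axis=0)                      # mean over positions, shape (N,)
c_map = np.stack([levels, counts], axis=1)   # each row: (level value, soft count)
```

Here `c_map` pairs every level with the fraction of positions assigned to it, which is what makes it behave like a histogram.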

Average Feature Encoding

The $C$ obtained above is concatenated with $g$, the result of the earlier global average pooling, to produce the output.
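
The concatenation can be sketched as below, assuming $g$ is simply repeated for every level before being joined to $C$. Any learned layers that the actual model may apply around this step are omitted, and the stand-in values are hypothetical.

```python
import numpy as np

num_levels, channels = 4, 3
c_map = np.arange(num_levels * 2, dtype=float).reshape(num_levels, 2)  # stand-in for C
g = np.array([0.1, 0.2, 0.3])                                           # stand-in for g

g_tiled = np.tile(g, (num_levels, 1))            # repeat g for every level, (N, channels)
out = np.concatenate([c_map, g_tiled], axis=1)   # (N, 2 + channels)
```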

2.1.2 2-d QCO

2-d QCO is similar to 1-d QCO, but it attends to spatial information through the relationship between adjacent pixels.
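
One way to picture the "relationship between adjacent pixels" is a co-occurrence count over pairs of quantization levels. The sketch below assumes horizontally adjacent pairs and an outer product of their encodings; `cooccurrence_2d` is a hypothetical helper, not a function from the released code.

```python
import numpy as np

def cooccurrence_2d(e):
    """e: (H, W, N) quantization encoding; returns an (N, N) soft co-occurrence count."""
    left, right = e[:, :-1], e[:, 1:]            # horizontally adjacent pixel pairs
    # Outer product of the two encodings, summed over all pairs.
    return np.einsum("hwn,hwm->nm", left, right)

# Toy 1x3 image whose pixels fall into levels 0, 1, 1 (one-hot for readability).
e = np.eye(2)[np.array([[0, 1, 1]])]             # (1, 3, 2)
cooc = cooccurrence_2d(e)
```

In the toy image the pairs are (0, 1) and (1, 1), so only the second column of `cooc` is nonzero.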

2.2 TEM (Texture Enhance Module)

enhance texture details

The values obtained from 1-d QCO are split into q, k, and v for learning.
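
For readers unfamiliar with the q, k, v terminology, here is generic scaled dot-product attention over a set of per-level features. Whether TEM uses exactly this form is not stated in these notes, so treat it as background rather than the module itself; the projection weights and sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the max for stability
    p = np.exp(z)
    return p / p.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_levels, dim = 8, 16
feat = rng.normal(size=(num_levels, dim))        # stand-in for the 1-d QCO output
# Hypothetical projection weights; in a real model these would be learned.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

q, k, v = feat @ w_q, feat @ w_k, feat @ w_v
attn = softmax(q @ k.T / np.sqrt(dim))           # (N, N), each row sums to 1
enhanced = attn @ v                              # (N, dim)
```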

2.3 PTFEM (Pyramid Texture Feature Extraction Module)

exploit texture-related information

To use multi-scale features, 2-d QCO is applied at several sizes; the paper uses scales of [1, 2, 3, 6] to downscale the image.
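
One reading of the [1, 2, 3, 6] scales is pyramid-pooling style: the feature map is reduced to 1x1, 2x2, 3x3, and 6x6 grids of regions. The sketch below shows only that reduction under this assumption; the 2-d QCO applied at each scale is left out, and `adaptive_avg_pool` is a hypothetical helper.

```python
import numpy as np

def adaptive_avg_pool(feat, out_size):
    """Average a (C, H, W) map over an out_size x out_size grid of regions."""
    c, h, w = feat.shape
    out = np.empty((c, out_size, out_size))
    for i in range(out_size):
        # Region bounds: floor for the start, ceil for the end.
        h0, h1 = (i * h) // out_size, ((i + 1) * h + out_size - 1) // out_size
        for j in range(out_size):
            w0, w1 = (j * w) // out_size, ((j + 1) * w + out_size - 1) // out_size
            out[:, i, j] = feat[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

feat = np.arange(2 * 6 * 6, dtype=float).reshape(2, 6, 6)
pyramid = {s: adaptive_avg_pool(feat, s) for s in [1, 2, 3, 6]}
```

Scale 1 collapses the map to a single global average, while scale 6 on this 6x6 input leaves it unchanged.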

2.4 Loss

An auxiliary layer is used to make training smoother.

Cross entropy is applied to the auxiliary output, and OHEM (online hard example mining) to the main output (this seems to serve a purpose similar to focal loss?).
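
A minimal sketch of how such a combined loss could look, assuming a top-k variant of OHEM that averages only the hardest pixels. The `keep_ratio` and `aux_weight` values are hypothetical and not taken from the paper.

```python
import numpy as np

def pixel_ce(logits, labels):
    """Per-pixel cross entropy. logits: (P, K), labels: (P,) integer class ids."""
    z = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels]

def ohem_ce(logits, labels, keep_ratio=0.25):
    """Average the loss over only the hardest pixels (top-k variant of OHEM)."""
    losses = pixel_ce(logits, labels)
    k = max(1, int(keep_ratio * len(losses)))
    return np.sort(losses)[-k:].mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=100)
main_logits = rng.normal(size=(100, 5))
aux_logits = rng.normal(size=(100, 5))

aux_weight = 0.4  # hypothetical weighting, not taken from the paper
total = ohem_ce(main_logits, labels) + aux_weight * pixel_ce(aux_logits, labels).mean()
```

On the focal loss comparison: OHEM drops easy pixels from the average entirely, whereas focal loss keeps every pixel and down-weights the easy ones, so both push training toward hard examples.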

3. Result

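Comparing results before and after applying the TEM module, the authors report that clearer texture details are obtained.

On datasets such as PASCAL, ADE20K, and Cityscapes, the model reportedly outperforms previous SOTA models (with a ResNet-101 backbone).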

FLOPS

When I ran the code published on GitHub without a backbone, it was still considerably heavier than other models. This seems worth checking further.

Compared with DeepLabV3, it captures slightly more detailed regions.