HRNet

한량 · October 25, 2021

[P-stage] Semantic Segmentation


1. Background

CNNs for image classification follow a LeNet-style design that progressively reduces a high-resolution input to a low resolution.
--> Classification does not need every feature, and lowering the resolution reduces computational complexity and enlarges the receptive field. Extracting only the important features also helps prevent overfitting.

Semantic segmentation, however, has to classify every pixel, so it benefits from keeping a high resolution.

DeconvNet, SegNet: store which pixel each max-pooling output came from and reuse those positions during unpooling, which preserves spatial (positional) information.
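The index-based unpooling idea can be sketched in plain Python. This is only a toy illustration on a single 2D feature map, not the actual DeconvNet or SegNet code; the function names `max_pool_with_indices` and `max_unpool` are hypothetical, and the sketch assumes the height and width are divisible by the window size.

```python
def max_pool_with_indices(x, k=2):
    """k x k max pooling with stride k over a 2D list.

    Also returns the (row, col) position each maximum was taken from.
    Assumes the height and width of x are divisible by k.
    """
    h, w = len(x), len(x[0])
    pooled, indices = [], []
    for i in range(0, h, k):
        pooled_row, index_row = [], []
        for j in range(0, w, k):
            # Collect every value in the current window with its position.
            window = [(x[a][b], (a, b))
                      for a in range(i, i + k)
                      for b in range(j, j + k)]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append(pos)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Write each pooled value back to its recorded position; other cells stay 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (a, b) in zip(pooled_row, index_row):
            out[a][b] = value
    return out


feature = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_with_indices(feature)
restored = max_unpool(pooled, indices, 4, 4)
```

After unpooling, each maximum sits at its original location instead of a fixed corner of the window, which is the positional information these models try to keep.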

U-Net: preserves spatial information with transposed convolutions and skip connections.

DilatedNet, DeepLab: replace part of the downsampling (and the matching upsampling) with dilated convolution, so the feature map is reduced only to a medium resolution while the receptive field still grows.
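The trade-off can be checked with a little arithmetic. The sketch below uses the standard convolution output-size and receptive-field formulas; the helper names and the two-layer configurations are assumptions chosen for illustration, not layers taken from DilatedNet or DeepLab.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += dilation * (kernel - 1) * jump
        jump *= stride
    return rf


# Two 3x3 layers. Option A downsamples with stride 2 in the first layer,
# option B keeps stride 1 and uses dilation 2 in the second layer instead.
strided = [(3, 2, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2)]

rf_strided = receptive_field(strided)
rf_dilated = receptive_field(dilated)

# Output size for a 64-pixel-wide input, with padding chosen to match each layer.
size_strided = conv_output_size(
    conv_output_size(64, 3, stride=2, padding=1), 3, padding=1)
size_dilated = conv_output_size(
    conv_output_size(64, 3, padding=1), 3, padding=2, dilation=2)
```

Both stacks see the same receptive field, but only the dilated one keeps the full 64-pixel width, which is why dilation lets a network stop shrinking the feature map without giving up context.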

DeepLabV3+: replaces max pooling with depthwise separable convolution to preserve detail information,
and adds a skip connection from the backbone to the decoder.

All of these earlier models are built on classification networks and share two drawbacks: 1) high time complexity and 2) low position sensitivity.

HRNet (High-Resolution Network) was introduced to address this by maintaining a high resolution throughout the network.

2. Architecture

Keeps the resolution at 1/4 of the input image.
Earlier models work at roughly 1/32 or 1/16 of the input, so this is comparatively high resolution.

Produces features with a variety of receptive fields.
Low-resolution features: a wide receptive field gives them relatively rich semantic information.
High-resolution features: much of the positional information survives, so they carry fine detail.

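Sums the high- and low-resolution features at every stage so that diverse features are taken into account.
High -> low: a convolution with stride = 2 halves the size and doubles the channels. Using a strided convolution instead of pooling minimizes information loss.
Low -> high: a 1x1 conv followed by bilinear upsampling halves the channels and doubles the size.
The 1x1 conv is applied before upsampling to keep the computational cost down.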
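Why the order matters can be seen by counting multiply-accumulates for the 1x1 conv. The sketch below is a back-of-the-envelope estimate, not HRNet code: the helper names are hypothetical, the 512x1024 input and base width of 48 channels are assumed values, and the cost of the bilinear interpolation itself is ignored.

```python
def branch_shapes(height, width, base_channels, num_branches=4):
    """(channels, H, W) per branch.

    Branch 0 runs at 1/4 of the input; each lower branch halves the
    spatial size and doubles the channels.
    """
    return [
        (base_channels * 2**i, height // (4 * 2**i), width // (4 * 2**i))
        for i in range(num_branches)
    ]


def conv1x1_macs(c_in, c_out, h, w):
    """Multiply-accumulates of a 1x1 conv applied to an h x w feature map."""
    return c_in * c_out * h * w


def low_to_high_cost(src, dst):
    """Cost of the 1x1 conv when run before vs. after upsampling src to dst."""
    c_in, h_in, w_in = src
    c_out, h_out, w_out = dst
    conv_then_upsample = conv1x1_macs(c_in, c_out, h_in, w_in)
    upsample_then_conv = conv1x1_macs(c_in, c_out, h_out, w_out)
    return conv_then_upsample, upsample_then_conv


# Assumed example: 512 x 1024 input, 48 channels on the highest-resolution branch.
shapes = branch_shapes(512, 1024, base_channels=48)
cheap, costly = low_to_high_cost(shapes[1], shapes[0])
```

For a 2x upsampling step the conv runs on a map with 4x fewer pixels when it comes first, so `costly` is four times `cheap`; for larger upsampling factors the gap grows with the square of the factor.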

Produces an output suited to the task.
V1: uses only the high-resolution feature, for pose estimation and keypoint detection.
V2: bilinearly upsamples the low-resolution features and combines all features, for semantic segmentation.
V2p: downsamples the HRNetV2 output to several levels and serves as the backbone of detectors such as Faster R-CNN, for object detection.
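The three variants differ only in how the branch outputs are turned into a final representation, which the shape bookkeeping below tries to make concrete. It is a sketch under assumptions: the branch shapes are example values, V2 is modelled as channel-wise concatenation after upsampling, and V2p keeps the channel count unchanged at every level for simplicity. The function names are hypothetical.

```python
# Assumed (channels, H, W) of the four branch outputs, highest resolution first.
shapes = [(48, 128, 256), (96, 64, 128), (192, 32, 64), (384, 16, 32)]


def hrnet_v1_head(shapes):
    """V1: keep only the highest-resolution branch."""
    return shapes[0]


def hrnet_v2_head(shapes):
    """V2: upsample every branch to the highest resolution and concatenate."""
    _, h, w = shapes[0]
    total_channels = sum(c for c, _, _ in shapes)
    return (total_channels, h, w)


def hrnet_v2p_head(shapes, levels=4):
    """V2p: downsample the V2 output into a multi-level feature pyramid."""
    c, h, w = hrnet_v2_head(shapes)
    return [(c, h // 2**i, w // 2**i) for i in range(levels)]


v1 = hrnet_v1_head(shapes)
v2 = hrnet_v2_head(shapes)
v2p = hrnet_v2p_head(shapes)
```

V1 and V2 both output a single map at 1/4 of the input, but V2 carries the semantics of every branch, while V2p trades that single map for a pyramid that detection heads can consume level by level.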