[Paper Review] Mask R-CNN

이윤석 · September 30, 2021

Mask R-CNN

Introduction

An extension of Faster R-CNN
- idea 1 : pixel-to-pixel alignment
- idea 2 : constructing the mask branch properly is critical for good results
  - mask branch : a branch added to predict a segmentation mask on each Region of Interest (RoI)
    - why : the existing branches only perform classification and box regression, so mask prediction needs a branch of its own
    - how : added in parallel with the existing branches. The mask branch is a small FCN that predicts a segmentation mask in a pixel-to-pixel manner
    - RoIAlign : replaces RoIPool so that exact spatial locations are faithfully preserved

Mask R-CNN

Summary
- Mask R-CNN = Faster R-CNN + mask branch
  - Faster R-CNN = Fast R-CNN + RPN (region proposal network)
    - existing structure : classification branch + localization (bounding box regression) branch on bounding boxes
  - added mask branch : reduces the loss of spatial information
    - mask : encodes the spatial layout of an object
Detail
- Contribution
  - mask branch
    - straightforward structure...?
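  - FPN (feature pyramid network) : finds features across different image scales -> scale-invariant
  - RoI align : used in place of RoI pooling
    - RoI pooling : problematic for predicting pixel-accurate masks
      - extracts a small feature map from each RoI, quantizing the RoI coordinates and bin boundaries along the way. The values in each bin are usually aggregated by max pooling.
      - problem : rounding and max pooling discard fine spatial detail
    - RoI align : removes the harsh quantization of RoI pool
      - unlike RoI pooling, it does no rounding: coordinates stay floating-point, and bilinear interpolation computes the feature value at each sampling point, so the RoI is aligned exactly with the feature map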
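
To make the difference between the two operators concrete, here is a minimal pure-Python sketch on a single-channel feature map. The function names, the 2x2 output size, and the 2x2 sampling points per bin are illustrative assumptions, not the paper's or any library's exact implementation. RoIs are assumed to lie inside the feature map.

```python
import math

def bilinear(feature, y, x):
    """Sample a 2D feature map (list of rows) at a real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, out_size=2, samples=2):
    """roi = (y1, x1, y2, x2) in feature-map coordinates, kept as floats."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside the bin.
                    yy = y1 + (i + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feature, yy, xx))
            row.append(sum(vals) / len(vals))  # average the samples
        out.append(row)
    return out

def roi_pool(feature, roi, out_size=2):
    """Quantized variant: round the RoI, then max-pool integer bins."""
    h_map, w_map = len(feature), len(feature[0])
    # First quantization: snap the RoI to integer cells.
    y1, x1, y2, x2 = (int(round(v)) for v in roi)
    y1, x1 = max(y1, 0), max(x1, 0)
    y2, x2 = min(max(y2, y1 + 1), h_map), min(max(x2, x1 + 1), w_map)
    h, w = y2 - y1, x2 - x1
    out = []
    for i in range(out_size):
        # Second quantization: integer bin boundaries.
        ys = y1 + (i * h) // out_size
        ye = y1 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * w) // out_size
            xe = x1 + math.ceil((j + 1) * w / out_size)
            row.append(max(feature[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        out.append(row)
    return out

# A feature map whose value equals its x coordinate makes shifts visible.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
roi = (0.0, 0.6, 2.0, 2.6)
aligned = roi_align(ramp, roi)
pooled = roi_pool(ramp, roi)
```

On this ramp, the aligned output keeps the 0.6 sub-pixel offset of the RoI in its values, while the pooled output snaps to whole cells, which is the kind of misalignment that hurts pixel-accurate masks.
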
architecture
- backbone
  - ResNet-50 or ResNet-101 : the backbones used with Faster R-CNN
  - FPN (Feature Pyramid Network) : extracts features at multiple scales. Improves both speed and accuracy.
    - a fully convolutional mask prediction branch is added to the existing model
    - Head (hatched part of the figure) : Classification, Regression (Bounding box Recognition)
    - Mask branch (lower part of the figure)
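
As a rough illustration of how an FPN combines scales, the sketch below shows only the top-down merge step: each coarser map is upsampled 2x (nearest neighbour) and added to the next finer map. It works on single-channel maps stored as nested lists and deliberately omits the 1x1 lateral and 3x3 smoothing convolutions of a real FPN. The function names `upsample2x` and `top_down` are made up for this example.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def top_down(c_maps):
    """Merge backbone maps ordered coarse to fine.

    Each map is assumed to be exactly 2x the size of the previous one.
    Returns the merged maps in the same order.
    """
    p_maps = [c_maps[0]]
    for c in c_maps[1:]:
        up = upsample2x(p_maps[-1])
        merged = [[a + b for a, b in zip(row_up, row_c)]
                  for row_up, row_c in zip(up, c)]
        p_maps.append(merged)
    return p_maps

# Toy stand-ins for three backbone stages (coarse to fine).
c5 = [[1.0]]
c4 = [[10.0] * 2 for _ in range(2)]
c3 = [[100.0] * 4 for _ in range(4)]
p5, p4, p3 = top_down([c5, c4, c3])
```

The point of the merge is that the finest output still carries information that originated in the coarsest, most semantic map.
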

Experiment and Result

Outperformed most state-of-the-art methods of the time.
Ablation Experiment
- Architecture : deeper networks perform better
- Multinomial vs. Independent Masks : decoupled per-class sigmoid masks perform better than a multinomial softmax
- Class-Specific vs. Class-Agnostic
  - performance is similar (Specific: 30.3 mask AP, slightly higher than Agnostic: 29.7 mask AP)
- RoIAlign : outperforms RoIPool and RoIWarp
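
The "Multinomial vs. Independent Masks" comparison can be sketched as two loss functions. This is a simplified illustration on flattened masks, assuming `logits[k][i]` is the logit of pixel `i` for class `k`. The function names and data layout are assumptions made for this example, not the paper's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def independent_mask_loss(logits, target, cls):
    """Per-pixel sigmoid + binary cross-entropy.

    Only the mask channel of the ground-truth class `cls` is used,
    so the other classes do not compete for the pixel.
    target[i] is 1 inside the object and 0 outside.
    """
    total = 0.0
    for z, t in zip(logits[cls], target):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(target)

def multinomial_mask_loss(logits, labels):
    """Per-pixel softmax + cross-entropy.

    labels[i] is the class index of pixel i, so all classes compete
    at every pixel.
    """
    total = 0.0
    for i, lab in enumerate(labels):
        column = [row[i] for row in logits]
        m = max(column)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in column))
        total += log_z - column[lab]
    return total / len(labels)

logits_a = [[0.0, 2.0], [5.0, -5.0]]
logits_b = [[0.0, 2.0], [-5.0, 5.0]]  # only the class-1 channel differs

ind_a = independent_mask_loss(logits_a, target=[0, 1], cls=0)
ind_b = independent_mask_loss(logits_b, target=[0, 1], cls=0)
multi_a = multinomial_mask_loss(logits_a, labels=[0, 0])
multi_b = multinomial_mask_loss(logits_b, labels=[0, 0])
```

Changing the class-1 logits leaves the independent loss for class 0 untouched but changes the multinomial loss, which is the decoupling of mask and class prediction that the ablation refers to.
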

Process

Resize the image so that its size falls within 800~1024 (using bilinear interpolation).
Pad the image to 1024 x 1024 so that it matches the input size of the backbone network.
ResNet-101 produces a feature map at each stage (C1, C2, C3, C4, C5).
FPN builds the P2, P3, P4, P5, P6 feature maps from the feature maps generated above.
An RPN is applied to each of the final feature maps to produce classification and bounding box regression outputs.
The bounding box regression values are applied to the anchors and projected back onto the original image, producing refined anchor boxes (proposals).
Non-max suppression then removes, among overlapping boxes, every box except the one with the highest score.
RoI align brings the remaining boxes, which all differ in size, to a common fixed size.
The resulting RoI features are passed through the mask branch, alongside the classification and bbox regression branches from Fast R-CNN.
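
Three of the steps above (resize and pad, applying regression values to anchors, and non-max suppression) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the shorter side is scaled toward 800 unless the longer side would exceed 1024, boxes are `(y1, x1, y2, x2)`, the deltas use the common center/log-size parameterization, and the function names and default IoU threshold are choices made for this example.

```python
import math

def resize_and_pad_plan(h, w, min_dim=800, max_dim=1024):
    """Return (scale, resized (h, w), padding (top, bottom, left, right))."""
    scale = min_dim / min(h, w)
    if round(max(h, w) * scale) > max_dim:
        scale = max_dim / max(h, w)  # keep the longer side within max_dim
    new_h, new_w = round(h * scale), round(w * scale)
    top = (max_dim - new_h) // 2
    left = (max_dim - new_w) // 2
    pad = (top, max_dim - new_h - top, left, max_dim - new_w - left)
    return scale, (new_h, new_w), pad

def decode_box(anchor, deltas):
    """Apply (dy, dx, dh, dw) regression values to an anchor box."""
    y1, x1, y2, x2 = anchor
    dy, dx, dh, dw = deltas
    h, w = y2 - y1, x2 - x1
    cy = y1 + 0.5 * h + dy * h  # shift the center by a fraction of the size
    cx = x1 + 0.5 * w + dx * w
    h, w = h * math.exp(dh), w * math.exp(dw)  # rescale height and width
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)

def iou(a, b):
    """Intersection over union of two boxes."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep a box only if it does not overlap a kept,
    higher-scoring box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

scale, resized, pad = resize_and_pad_plan(480, 640)
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7], iou_threshold=0.5)
```

Note that NMS works per cluster of overlapping boxes: a distant box with a lower score survives, because it does not overlap anything that was kept.
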

Reference

Be Smart with 성실한 호기심
