[2019 arXiv] Learning Spatial Fusion for Single-Shot Object Detection

Hyungseop Lee · February 26, 2025

[Paper Review] Feature Fusion(Alignment) Networks

(Problem statement)

However, the inconsistency across different feature scales is a primary limitation for the single-shot detectors based on feature pyramid.
- Question: what does inconsistency across different feature scales mean?
  When an object is treated as positive on the feature map of one level, the same region is regarded as background on the other levels.
  "Therefore, if an image contains both small and large objects, the conflict among features at different levels tends to occupy the major part of the feature pyramid."

(Proposal)

To achieve scale invariance, recent SOTA detectors construct feature pyramids or multi-level feature towers.
For example, the Single Shot Detector (SSD) also reuses multi-scale feature maps to predict objects of various sizes.
However, shallow-layer feature maps carry insufficient semantic information, so accuracy on small instances suffers.
FPN was introduced to address this, but considerable room for improvement remains.

(Problem statement)

When an object is assigned as positive in the feature maps of a certain level,
the corresponding areas in the feature maps of the other levels are regarded as background.
So when an image contains both small and large objects,
the conflict among features at different levels tends to occupy the major part of the feature pyramid.
This inconsistency interferes with gradient computation during training and downgrades the effectiveness of feature pyramids.
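
To make the conflict concrete, here is a toy sketch of size-based level assignment. The size thresholds in `SIZE_RANGES` and the helper names `assign_level` and `labels_per_level` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Toy illustration: each object is positive on exactly one pyramid level
# and treated as background on all the others.

# Hypothetical size ranges (in pixels) for three pyramid levels.
SIZE_RANGES = [(0, 64), (64, 128), (128, float("inf"))]


def assign_level(box_size, size_ranges):
    """Return the index of the level whose size range contains box_size."""
    for level, (low, high) in enumerate(size_ranges):
        if low <= box_size < high:
            return level
    # Fall back to the coarsest level if no range matches.
    return len(size_ranges) - 1


def labels_per_level(box_size, size_ranges):
    """Label the object's region on every level of the pyramid."""
    target = assign_level(box_size, size_ranges)
    return [
        "positive" if level == target else "background"
        for level in range(len(size_ranges))
    ]


small_object = labels_per_level(32, SIZE_RANGES)
large_object = labels_per_level(200, SIZE_RANGES)
print(small_object)  # ['positive', 'background', 'background']
print(large_object)  # ['background', 'background', 'positive']
```

With one small and one large object in the same image, every level receives both a positive signal for one object and a background signal for the other, which is the conflict described above.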

(Proposal)

The paper introduces FPN, PANet, DLA, Libra R-CNN, NAS-FPN, ...
Despite these works, feature pyramid based methods still
suffer from the inconsistency across different scales, which limits the further performance gain.
The proposed ASFF can mitigate this problem by learning the connections between different feature maps,
although this idea is not actually new in the computer vision domain.
- [1] ...
  However, it does not reduce spatial contradiction in object detection.
- [28] ...
  It optimizes the information flow within feature maps of the same level, but does not resolve the inconsistency that arises within feature pyramids.
- ACNet [37] ...
  
  In contrast to them, ASFF adaptively learns the importance degrees of the features from different levels
  at each location in order to avoid spatial contradiction.

Our key idea is to adaptively learn the spatial weight of fusion for feature maps at each scale.
The pipeline consists of two steps:
1. identically rescaling
2. adaptively fusing
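
The two steps can be sketched in plain Python as follows. This is a minimal single-channel illustration, not the authors' implementation. It assumes nearest-neighbour upsampling for the rescaling step and softmax-normalized weights across levels, so the weights sum to 1 at each location. The names `upsample_nearest`, `softmax`, and `fuse` are hypothetical, and the weight logits are supplied by hand here, whereas in a real detector they would be produced by learned layers.

```python
import math


def upsample_nearest(fmap, factor):
    """Step 1 (rescaling): nearest-neighbour upsampling of a 2D map."""
    height, width = len(fmap), len(fmap[0])
    return [
        [fmap[i // factor][j // factor] for j in range(width * factor)]
        for i in range(height * factor)
    ]


def softmax(logits):
    """Numerically stable softmax over a short list of numbers."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, weight_logits):
    """Step 2 (fusing): per-location weighted sum of same-sized maps.

    features      : list of 2D maps, one per level, all the same size
    weight_logits : list of 2D maps of unnormalized weights, same layout
    """
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # One weight per level at this location, summing to 1.
            weights = softmax([logit[i][j] for logit in weight_logits])
            fused[i][j] = sum(
                w * feat[i][j] for w, feat in zip(weights, features)
            )
    return fused


# Three levels with different resolutions, rescaled to a common 4x4 grid.
level1 = [[1.0] * 4 for _ in range(4)]
level2 = upsample_nearest([[2.0, 2.0], [2.0, 2.0]], 2)
level3 = upsample_nearest([[3.0]], 4)

# Equal logits give equal weights, so the result is the plain average.
zeros = [[0.0] * 4 for _ in range(4)]
fused = fuse([level1, level2, level3], [zeros, zeros, zeros])
```

Because the weights are computed independently at every location, one level can dominate where its features are informative and be suppressed elsewhere, which is how the fusion avoids the spatial contradiction discussed earlier.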