Role of CNN-Based backbone in DETR

temp · September 27, 2021

AI DETR Object Detection Resnet

XAI / Object Detection

This post will be updated continuously.

1. What is the role of the backbone in DETR?

Technical details

The authors train with the following settings.

DETR

- transformer: AdamW with learning rate $10^{-4}$
- backbone: AdamW with learning rate $10^{-5}$
- weight decay: $10^{-4}$
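
The two learning rates above map naturally onto optimizer parameter groups. Below is a minimal PyTorch sketch, assuming the backbone parameters live under an attribute named `backbone`. The `TinyDetr` class and its layer sizes are hypothetical stand-ins for illustration, not the authors' code.

```python
import torch
from torch import nn


class TinyDetr(nn.Module):
    """Hypothetical stand-in: a conv 'backbone' plus one transformer layer."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=8, nhead=2, dim_feedforward=16
        )


model = TinyDetr()

# Split trainable parameters by name prefix (the prefix is an assumption).
backbone_params = [
    p for n, p in model.named_parameters()
    if n.startswith("backbone.") and p.requires_grad
]
other_params = [
    p for n, p in model.named_parameters()
    if not n.startswith("backbone.") and p.requires_grad
]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params},                 # falls back to lr=1e-4
        {"params": backbone_params, "lr": 1e-5},  # backbone gets a smaller lr
    ],
    lr=1e-4,
    weight_decay=1e-4,  # shared by both groups
)
```

The group-level `lr` overrides the default only for the backbone, so the pretrained weights move ten times more slowly than the randomly initialized transformer.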

All transformer weights are initialized with Xavier init, and the backbone is an ImageNet-pretrained ResNet used with its batchnorm layers frozen.
The authors report results for two backbones, ResNet-50 and ResNet-101, and name the corresponding models DETR and DETR-R101.
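
The initialization and freezing described above can be sketched roughly as follows. This is only an illustration: the helper names `xavier_init_` and `freeze_batchnorm_` and the toy modules are hypothetical, and the choice of the uniform Xavier variant is an assumption, since the text only says "Xavier init".

```python
import torch
from torch import nn


def xavier_init_(module: nn.Module) -> None:
    """Xavier-initialize every parameter with more than one dimension."""
    for p in module.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)


def freeze_batchnorm_(module: nn.Module) -> None:
    """Stop batchnorm layers from updating statistics or affine parameters."""
    for m in module.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()  # use running statistics, do not update them
            for p in m.parameters():
                p.requires_grad_(False)


# Toy modules standing in for the real backbone and transformer.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.BatchNorm2d(8), nn.ReLU()
)
transformer = nn.TransformerEncoderLayer(d_model=8, nhead=2, dim_feedforward=16)

xavier_init_(transformer)
freeze_batchnorm_(backbone)
```

Note that calling `.train()` on the parent module puts the batchnorm layers back into training mode, so with this approach the freeze has to be reapplied afterwards. A dedicated frozen batchnorm layer avoids that problem.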

As in Li, Y. et al., Fully convolutional instance-aware semantic segmentation. In: CVPR (2017), the authors increase the feature resolution by adding dilation to the last stage of the backbone (a stage can be thought of as a group of Bottleneck blocks) and removing the stride from the first convolution of that stage.

"Fully convolutional instance-aware semantic segmentation"

In the original ResNet, the effective feature stride (the
decrease in feature map resolution) at the top of the network is 32. This is too coarse for instance-aware semantic
segmentation. To reduce the feature stride and maintain the
field of view, the “hole algorithm” [3, 29] (Algorithme à
trous [30]) is applied. The stride in the first block of conv5
convolutional layers is decreased from 2 to 1. The effective
feature stride is thus reduced to 16. To maintain the field
of view, the “hole algorithm” is applied on all the convolutional layers of conv5 by setting the dilation as 2.

This paper is originally about semantic segmentation. For instance-aware semantic segmentation, however, ResNet's original feature stride of 32 is too coarse, so the stride of the first convolution layer in the last stage is reduced from 2 to 1 (that is, the stride is removed), and the hole algorithm is applied to every convolution layer in the last stage so that the field of view is effectively maintained.
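
The arithmetic behind this can be checked with a few lines of plain Python. This is a sketch under the usual ResNet layout (stride 2 in conv1, the max-pool, and conv3 through conv5); the function names are made up for illustration.

```python
from math import prod

# Stride contributed by each part of a standard ResNet.
RESNET_STRIDES = {
    "conv1": 2, "maxpool": 2, "conv2": 1, "conv3": 2, "conv4": 2, "conv5": 2,
}


def feature_stride(strides, dilate_last_stage=False):
    """Overall downsampling factor; dilating conv5 removes its stride."""
    s = dict(strides)
    if dilate_last_stage:
        s["conv5"] = 1
    return prod(s.values())


def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel."""
    return dilation * (kernel - 1) + 1


def conv_out_size(n, kernel=3, stride=1, dilation=1):
    """Output length of a convolution with 'same'-style padding."""
    padding = dilation * (kernel - 1) // 2
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


print(feature_stride(RESNET_STRIDES))                          # 32
print(feature_stride(RESNET_STRIDES, dilate_last_stage=True))  # 16
print(effective_kernel(3, 2))                                  # 5
print(conv_out_size(50, stride=2))                             # 25
print(conv_out_size(50, stride=1, dilation=2))                 # 50
```

A 3x3 kernel with dilation 2 spans 5 positions, so it covers the same input extent as a 3x3 kernel applied after a stride-2 downsampling, but without halving the feature map.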

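In other words, the feature resolution is increased in order to pick out instances more accurately while preserving the receptive field.
Removing the last convolution stage of ResNet would work against this, because it yields more localized features. So even for the purpose of studying the relationship between the backbone and the transformer, changing the feature map level does not seem like a good idea.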
