[RT-DETR] DETRs Beat YOLOs on Real-time Object Detection

Hyungseop Lee · May 11, 2024

[Paper Review] Object Detection

Paper Info

Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu
Baidu Inc. "DETRs Beat YOLOs on Real-time Object Detection." CVPR 2024.
https://zhao-yian.github.io/RTDETR/

Abstract

The YOLO series has become the most popular framework for real-time object detection
thanks to its reasonable trade-off between speed and accuracy.
However, we observe that the speed and accuracy of YOLOs are negatively affected by NMS.
Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative that eliminates NMS.
However, their high computational cost limits their practicality and prevents them from fully exploiting the advantage of excluding NMS.
To address these problems, this paper proposes
the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector.
RT-DETR is built in two steps:
first we focus on improving speed while maintaining accuracy,
and then on improving accuracy while maintaining speed.
Specifically,
1. To improve speed,
  we design an efficient hybrid encoder that processes multi-scale features quickly
  by decoupling intra-scale interaction and cross-scale fusion.
2. To improve accuracy,
  we propose uncertainty-minimal query selection, which provides high-quality initial queries to the decoder.
In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers,
so it can adapt to various scenarios without retraining.

RT-DETR-R50 achieves 53.1% AP on COCO at 108 FPS on a T4 GPU.

1. Introduction

CNN-based detectors such as YOLO typically require NMS for post-processing.
This not only slows down inference but also introduces hyper-parameters that cause instability in both speed and accuracy.

Moreover, since different scenarios place different emphasis on recall and accuracy,
the NMS thresholds must be chosen carefully,
which hinders the development of real-time detectors.
Recently, end-to-end Transformer-based detectors (DETRs) have attracted
wide attention in academia thanks to their streamlined architecture and the elimination of hand-crafted components.

However, their high computational cost keeps them from meeting real-time detection requirements,
so the NMS-free architecture does not show an advantage in inference speed.
To make DETRs real-time,
we rethink DETR and analyze its key components in detail
to reduce unnecessary computational redundancy and further improve accuracy.

For the former (reducing unnecessary computational redundancy),
introducing multi-scale features helps accelerate training convergence [45],
but it significantly increases the length of the sequence fed into the encoder.
The encoder therefore has to be redesigned to achieve a real-time DETR.

For the latter (improving accuracy),
previous works [42, 44, 45] show that hard-to-optimize object queries hinder the performance of DETRs,
and propose query selection schemes that replace the vanilla learnable embeddings with encoder features.
However, we observe that current query selection uses classification scores directly for selection,
ignoring the fact that the detector has to model the category and location of objects simultaneously, both of which determine the quality of the features.
As a result, encoder features with low localization confidence are selected as initial queries,
which leads to a high degree of uncertainty and hurts the performance of DETRs.
We therefore view query initialization as an opportunity to further improve performance.

In this paper we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector.

To process multi-scale features quickly,
we design an efficient hybrid encoder that replaces the vanilla Transformer encoder
and improves inference speed
by decoupling intra-scale interaction and cross-scale fusion.

To avoid selecting encoder features with low localization confidence as object queries,
we propose uncertainty-minimal query selection.
It explicitly optimizes the uncertainty
to provide high-quality initial queries to the decoder, which raises accuracy.

Furthermore, thanks to the multi-layer decoder architecture of DETR,
RT-DETR supports flexible speed tuning to accommodate various real-time scenarios without retraining.

efficient hybrid encoder ➡️ faster inference

uncertainty-minimal query selection ➡️ higher accuracy

2.1. Real-time Object Detectors

YOLOv1 is the first CNN-based one-stage object detector to achieve true real-time object detection.
YOLO detectors can be classified into two categories.
1. anchor-based
2. anchor-free
YOLO detectors achieve a reasonable trade-off between speed and accuracy and are used in a wide range of practical scenarios.
However, these advanced real-time detectors produce many overlapping boxes
and therefore require NMS post-processing, which slows them down.

2.2. End-to-end Object Detectors

Carion et al. [4] first introduced the Transformer-based end-to-end detector called DETR,
which attracted attention for its distinctive design.
In particular, DETR eliminates the hand-crafted anchor and NMS components.
Instead, it uses bipartite matching and directly predicts a one-to-one set of objects.
Despite these advantages, DETR suffers from several problems,
and many DETR variants have been proposed to address them.
- slow training convergence
- high computational cost
- hard-to-optimize queries

Accelerating convergence

Deformable-DETR accelerates training convergence
with multi-scale features
by making the attention mechanism more efficient.
DAB-DETR and DN-DETR further improve performance
by introducing an iterative refinement scheme and denoising training.
Group-DETR introduces group-wise one-to-many assignment.

Reducing computational cost

Efficient DETR and Sparse DETR reduce the computational cost
by reducing the number of encoder and decoder layers or the number of updated queries.
Lite DETR makes the encoder more efficient
by reducing the update frequency of low-level features.

Optimizing query initialization

Conditional DETR and Anchor DETR reduce the optimization difficulty of the queries.
Zhu et al. [45] propose query selection for two-stage DETR,
and DINO [44] proposes mixed query selection to initialize queries better.
Current DETRs are still computationally intensive and are not designed for real-time use.

Our RT-DETR pursues both computational cost reduction and better query initialization,
with the goal of outperforming SOTA real-time detectors.

3. End-to-end Speed of Detectors

3.1. Analysis of NMS

NMS is a post-processing algorithm widely used in object detection
to eliminate overlapping output boxes.
NMS requires two thresholds.
1. confidence threshold
2. IoU threshold
Specifically,
boxes whose scores are at or below the confidence threshold are filtered out directly,
and whenever the IoU of any two boxes exceeds the IoU threshold, the box with the lower score is discarded.
(In short:
first, every box is filtered by comparing its confidence score, i.e. how certain the box is, against the confidence threshold.
Then the IoU threshold is applied to the remaining boxes to filter them once more.
The boxes that survive both steps are the final NMS output.)

This process is performed iteratively until the boxes of every category have been processed.
Thus, the execution time of NMS depends mainly on the number of boxes and the two thresholds.
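To make the dependence on the two thresholds concrete, here is a minimal sketch of greedy single-class NMS in plain Python. It is my own illustration, not the TensorRT kernel the paper measures; the function names, the toy boxes, and the IoU-evaluation counter are all assumptions made for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thr, iou_thr):
    """Greedy single-class NMS. Returns (kept indices, number of IoU evaluations)."""
    # Step 1: drop boxes whose score is at or below the confidence threshold.
    order = sorted(
        (i for i, s in enumerate(scores) if s > conf_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept, n_iou = [], 0
    # Step 2: visit boxes by descending score; discard a box that overlaps
    # an already kept (higher-scoring) box by more than the IoU threshold.
    for i in order:
        suppressed = False
        for j in kept:
            n_iou += 1
            if iou(boxes[i], boxes[j]) > iou_thr:
                suppressed = True
                break
        if not suppressed:
            kept.append(i)
    return kept, n_iou


# Toy data (hypothetical): three overlapping boxes, one isolated box,
# and one box with a negligible score.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (3, 0, 13, 10), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.6, 0.0005]

for iou_thr in (0.5, 0.7, 0.9):
    kept, n_iou = nms(boxes, scores, conf_thr=0.001, iou_thr=iou_thr)
    print(iou_thr, kept, n_iou)
```

On this toy input, raising the IoU threshold keeps more boxes, and each kept box has to be compared against every later candidate, so the number of IoU evaluations grows as well. Real detectors run this per category and on thousands of boxes.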
➡️ To verify this observation, we analyzed YOLOv5 (anchor-based) and YOLOv8 (anchor-free).
- anchor-based :
  Predicts categories and coordinates from preset anchors.
  (Because the anchors are designed in advance with aspect ratios such as 1:1, 1:2, and 2:1, when an object falls far outside that range the anchors are useless or even interfere with training.
  The dataset has to be examined beforehand to define new anchors, which is cumbersome and requires a hyperparameter search to find suitable ones.)
- anchor-free :
  1. keypoint-based methods, which predict an object's location from key points
  2. center-based methods, which predict the object's center and, for positive locations, the distances to the object boundaries
We first counted the number of boxes remaining after filtering the same input
with different confidence thresholds.
We sampled confidence thresholds from 0.001 to 0.25,
counted the remaining boxes for the two detectors, and plotted them as a bar graph.
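The counting experiment described above is simple enough to sketch. The scores and threshold values below are made up for illustration and are not the paper's data; `count_remaining` is a hypothetical helper name.

```python
def count_remaining(scores, thresholds):
    """For each confidence threshold, count the boxes whose score exceeds it."""
    return {t: sum(s > t for s in scores) for t in thresholds}


# Hypothetical confidence scores of the raw predictions for one image.
scores = [0.92, 0.40, 0.26, 0.12, 0.03, 0.004, 0.0008]
thresholds = [0.001, 0.01, 0.05, 0.25]

counts = count_remaining(scores, thresholds)
for t in thresholds:
    print(f"conf_thr={t}: {counts[t]} boxes remain")
```

The count can only shrink as the threshold grows, which is why the number of boxes entering the IoU stage, and with it the NMS time, is so sensitive to this hyper-parameter.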

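Figure 2 intuitively shows that NMS is sensitive to its hyper-parameters.
As the confidence threshold increases, more prediction boxes are filtered out,
and fewer remaining boxes need IoU computation, so the execution time of NMS decreases.
In addition, we used YOLOv8 to evaluate accuracy on COCO val2017
and measured the execution time of the NMS operation under different hyper-parameters.
The NMS operation we adopted follows the TensorRT efficientNMSPlugin,
which involves multiple kernels, including EfficientNMSFilter, RadixSort, and EfficientNMS.
Here we report only the execution time of the EfficientNMS kernel.
We tested speed on a T4 GPU with TensorRT FP16, keeping the input and preprocessing consistent.
The hyper-parameters and corresponding results are in Table 1.
As the results show, the execution time of the EfficientNMS kernel increases as the confidence threshold decreases or the IoU threshold increases.
(I understand that a higher Conf thr. leaves fewer boxes after filtering, so NMS takes less time,
but why does NMS time increase when IoU thr. is set higher?
With a higher IoU thr., fewer boxes count as overlapping, so shouldn't NMS take less time?
A likely explanation: a higher IoU thr. suppresses fewer boxes in each round, so more boxes survive and have to be compared against the remaining candidates.)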

In the appendix we also visualize YOLOv8 predictions under different NMS thresholds;
the results show that an inappropriate confidence threshold makes the detector
produce many false positives (detecting something that is not an object) or false negatives (missing something that is an object).
Since YOLO detectors usually report model speed and exclude NMS time,
an end-to-end speed benchmark needs to be established.

3.2. End-to-end Speed Benchmark

To compare the end-to-end speed of different real-time detectors fairly,
we established an end-to-end speed benchmark.
Because the execution time of NMS depends on the input,
a benchmark dataset has to be chosen and the execution time averaged over many images.
We chose COCO val2017 as the benchmark dataset and appended
the TensorRT NMS post-processing plugin to the YOLO detectors mentioned above.
Specifically,
we test the average inference time of each detector under the NMS thresholds that correspond to its reported accuracy,
excluding I/O and MemoryCopy operations.
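The measurement protocol can be sketched as a small timing harness. This is only an illustration of "average over the dataset, count post-processing, exclude I/O"; `average_latency`, `toy_model`, and `toy_postprocess` are hypothetical stand-ins, and real GPU timing would additionally need device synchronization around the timed region.

```python
import time


def average_latency(inputs, model_fn, postprocess_fn, warmup=2):
    """Average seconds per input for model + post-processing.

    Inputs are assumed to be already loaded in memory, so I/O is not timed.
    """
    for x in inputs[:warmup]:          # warm-up runs are not measured
        postprocess_fn(model_fn(x))
    total = 0.0
    for x in inputs:
        start = time.perf_counter()
        postprocess_fn(model_fn(x))
        total += time.perf_counter() - start
    return total / len(inputs)


def toy_model(n):
    """Stand-in for a detector: returns n pseudo confidence scores."""
    return [(i * 0.37) % 1.0 for i in range(n)]


def toy_postprocess(scores):
    """Stand-in for NMS: its cost depends on how many scores it receives."""
    return sorted(s for s in scores if s > 0.25)


inputs = [200, 2000, 20000]            # "images" producing different box counts
model_only = average_latency(inputs, toy_model, lambda out: out)
end_to_end = average_latency(inputs, toy_model, toy_postprocess)
print(f"model only: {model_only:.6f}s, end-to-end: {end_to_end:.6f}s")
```

Reporting only `model_only` is what hides the post-processing cost; the benchmark in this section reports the end-to-end number instead.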
We use this benchmark to test the end-to-end speed of the anchor-based detectors YOLOv5 [11] and YOLOv7 [38],
and the anchor-free detectors PP-YOLOE [40], YOLOv6 [16], and YOLOv8 [12],
on a T4 GPU with TensorRT FP16.
From Table 2, we conclude that anchor-free detectors outperform anchor-based detectors
of equivalent accuracy in end-to-end speed.
The reason is that anchor-based detectors feed more prediction boxes into NMS than anchor-free detectors,
so they need more NMS time.
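A quick back-of-the-envelope count shows where the extra boxes come from. The 640×640 input, the strides 8/16/32, and 3 anchors per location are assumptions reflecting common YOLOv5-style and YOLOv8-style defaults, not numbers taken from this section.

```python
def num_predictions(image_size, strides, anchors_per_cell):
    """Raw prediction boxes produced before any filtering, for a square input."""
    return anchors_per_cell * sum((image_size // s) ** 2 for s in strides)


strides = (8, 16, 32)
anchor_based = num_predictions(640, strides, anchors_per_cell=3)
anchor_free = num_predictions(640, strides, anchors_per_cell=1)
print(anchor_based, anchor_free)  # 25200 8400
```

Under these assumptions the anchor-based head emits three times as many candidate boxes, all of which have to pass through confidence filtering and NMS.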

(Summary of my understanding)
Earlier speed reports for YOLO variants did not account for NMS time,
but under the new benchmark that does account for it, our RT-DETR is both faster and more accurate.

Also, anchor-free detectors produce fewer prediction boxes than anchor-based detectors
and therefore spend less time in NMS, so anchor-free detectors are generally faster.
Existing DETR variants, however, failed to turn the absence of NMS into a speed advantage.
(Covered later.)
So we use the efficient hybrid encoder to relieve the computational bottleneck in the encoder of DETR variants
and reach real-time speed.

4. The Real-time DETR

4.1. Model Overview

RT-DETR consists of
a backbone,
an efficient hybrid encoder,
and a Transformer decoder with auxiliary prediction heads.

An overview of RT-DETR is shown in Figure 4.

We feed the features from the last three stages of the backbone,
{ $S_3, S_4, S_5$ }, into the encoder.

The efficient hybrid encoder transforms the multi-scale features ({ $S_3, S_4, S_5$ })
into a sequence of image features
through intra-scale feature interaction and cross-scale feature fusion. (Sec. 4.2)

Next, uncertainty-minimal query selection is applied
to select a fixed number of encoder features as the initial object queries for the decoder. (Sec. 4.3)

Finally, the decoder with auxiliary prediction heads
iteratively optimizes the object queries to generate categories and boxes.

4.2. Efficient Hybrid Encoder

Computational bottleneck analysis

Introducing multi-scale features accelerates training convergence and improves performance [45].

However, even though deformable attention reduces the computational cost,
the sharply increased sequence length still makes the encoder the computational bottleneck.
According to Lin et al. [19], the encoder accounts for 49% of the GFLOPs but contributes only 11% of the AP in Deformable-DETR.
To overcome this bottleneck, we first analyzed the computational redundancy present in the multi-scale Transformer encoder.
➡️ Intuitively,
since high-level features, which contain rich semantic information about objects, are extracted from low-level features,
performing feature interaction on the concatenated multi-scale features is redundant.
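To get a feel for how much the sequence length grows, here is a small worked calculation. The 640×640 input and the strides 8/16/32 for $S_3, S_4, S_5$ are my assumptions (typical for a ResNet backbone), and the quadratic comparison applies to vanilla self-attention only; deformable attention avoids the quadratic term, so this overstates its cost.

```python
def tokens_per_level(image_size, strides):
    """Number of tokens (H * W positions) per feature level for a square input."""
    return {s: (image_size // s) ** 2 for s in strides}


levels = tokens_per_level(640, (8, 16, 32))   # S3, S4, S5
all_tokens = sum(levels.values())             # concatenated multi-scale sequence
s5_tokens = levels[32]                        # S5 alone

# Vanilla self-attention computes one score per token pair, so cost grows with N^2.
pair_ratio = (all_tokens ** 2) / (s5_tokens ** 2)
print(levels)        # {8: 6400, 16: 1600, 32: 400}
print(all_tokens)    # 8400
print(pair_ratio)    # 441.0
```

Under these assumptions the concatenated sequence is 21 times longer than $S_5$ alone, and $S_3$ by itself contributes about three quarters of the tokens.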

To verify that performing intra-scale and cross-scale feature interaction simultaneously is inefficient,
we designed a set of variants with different types of encoders. (Figure 3.)
- A : DINO-Deformable-R50 with its multi-scale Transformer encoder removed
- A ➡️ B : Variant B inserts a single-scale Transformer encoder into A,
  using one layer of Transformer block.
  The multi-scale features share this encoder for intra-scale feature interaction and are then concatenated as the output.
- B ➡️ C : Variant C builds on B and introduces cross-scale feature fusion:
  the concatenated features are fed into a multi-scale Transformer encoder
  to perform intra-scale and cross-scale feature interaction simultaneously.
- C ➡️ D : Variant D decouples the encoder,
  using a single-scale Transformer encoder for intra-scale interaction
  and a PANet-style structure for cross-scale fusion.
- D ➡️ E : Variant E builds on D
  and adopts the efficient hybrid encoder we designed,
  which strengthens the intra-scale interaction and cross-scale fusion of D.

Hybrid design

Based on the analysis above,
we rethink the structure of the encoder and propose an efficient hybrid encoder.
It consists of two modules.
1. Attention-based Intra-scale Feature Interaction (AIFI)
2. CNN-based Cross-scale Feature Fusion (CCFF)
- Specifically,
  AIFI further reduces the computational cost relative to variant $D$
  by performing intra-scale interaction only on $S_5$
  with a single-scale Transformer encoder.
  ➡️ (Hypothesis 1) Applying self-attention to high-level features,
  which carry richer semantic concepts,
  captures the connections between conceptual entities
  and helps subsequent modules localize and recognize objects.
  ➡️ (Hypothesis 2) Intra-scale interaction on low-level features, in contrast, is unnecessary:
  they lack semantic concepts and risk duplication and confusion with the high-level feature interaction.
  - To verify these hypotheses,
    we performed intra-scale interaction only on $S_5$ in variant $D$ (row $D_{S_5}$ of Table 3).
  - Compared to $D$, $D_{S_5}$
    significantly reduces latency (35% faster) while also improving accuracy (0.4% AP higher).
- CCFF is optimized on the basis of the cross-scale fusion module,
  with several fusion blocks made of convolutional layers inserted into the fusion path.
  The role of the fusion block is to fuse two adjacent-scale features into a new feature;
  its structure is shown in Figure 5.
  The fusion block uses two $1 \times 1$ convolutions to adjust the number of channels,
  and $N$ RepBlocks composed of RepConv [8] for feature fusion.
  - What is a RepBlock?
    A re-parameterization block.
    During training, the input is forwarded through a $3 \times 3$ branch and a $1 \times 1$ branch;
    at inference, the $1 \times 1$ branch and the residual connection are folded into the $3 \times 3$ convolution, so only a single branch remains.
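The re-parameterization idea in the RepBlock note can be checked numerically. The sketch below is a deliberately simplified, single-channel version with no batch normalization or activation (real RepVGG-style blocks also fold BN statistics), and `conv2d_same` and `merge_branches` are names I made up for this illustration.

```python
def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding; k is n x n, n odd."""
    h, w, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = acc
    return out


def merge_branches(k3, k1, identity=True):
    """Fold a 1x1 kernel (and optionally an identity branch) into a 3x3 kernel."""
    merged = [row[:] for row in k3]
    # A 1x1 conv and an identity both act only on the centre tap of a 3x3 kernel.
    merged[1][1] += k1[0][0] + (1.0 if identity else 0.0)
    return merged


x = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.1, 0.0, -0.2], [0.3, 0.5, 0.0], [0.0, -0.1, 0.2]]
k1 = [[0.7]]

# Training-time form: three parallel branches summed.
y3, y1 = conv2d_same(x, k3), conv2d_same(x, k1)
train_out = [[y3[i][j] + y1[i][j] + x[i][j] for j in range(3)] for i in range(3)]

# Inference-time form: one 3x3 convolution with the merged kernel.
merged = merge_branches(k3, k1)
infer_out = conv2d_same(x, merged)

max_diff = max(abs(train_out[i][j] - infer_out[i][j]) for i in range(3) for j in range(3))
print(max_diff < 1e-9)  # True
```

Because convolution is linear, the three branches collapse into one kernel with identical output, which is why the extra branches cost nothing at inference time.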

We formulate the calculation of the hybrid encoder as follows.
(Efficient Hybrid Encoder summary)
Apply Attention-based Intra-scale Feature Interaction (AIFI) to $S_5$ only ➡️ $F_5$
Apply CNN-based Cross-scale Feature Fusion (CCFF) to ( $F_5$ , $S_3$ , $S_4$ )
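The two steps above can be written compactly. This is my transcription of the formulation as I recall it from the paper (Eq. (1)), so the exact notation should be checked against the original; $O$ denotes the encoder output, and Flatten/Reshape convert between a feature map and a token sequence.

$$
\begin{aligned}
Q = K = V &= \text{Flatten}(S_5) \\
F_5 &= \text{Reshape}(\text{AIFI}(Q, K, V)) \\
O &= \text{CCFF}(\{S_3, S_4, F_5\})
\end{aligned}
$$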
Efficient Hybrid Encoder, viewed alongside the code

4.3. Uncertainty-minimal Query Selection

To reduce the difficulty of optimizing object queries in DETR, several follow-up works have been proposed.
They all use the confidence score to select the top $K$ features from the encoder to initialize the object queries.
- (Source: https://github.com/fundamentalvision/Deformable-DETR/blob/main/models/deformable_transformer.py)
  Deformable DETR appears to take the top 300 proposals to initialize its object queries.
  The confidence score represents the likelihood that a feature contains a foreground object,
  yet the detector has to model both the category and the location of objects simultaneously, which the score alone does not capture.
  ➡️ According to our analysis,
  selecting queries from the top $K$ features by confidence score leads to considerable uncertainty,
  which results in a sub-optimal initialization for the decoder and degrades detector performance.
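The vanilla selection scheme described above can be sketched in a few lines. Taking the maximum class score as the confidence of a feature is an assumption for this illustration, and `select_top_k` is a hypothetical helper, not a function from any of the cited codebases.

```python
def select_top_k(class_scores, k):
    """Return indices of the k encoder features with the highest confidence.

    class_scores[i] holds the per-class scores predicted for encoder feature i;
    the confidence of a feature is taken to be its maximum class score.
    """
    confidence = [max(row) for row in class_scores]
    order = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return order[:k]


# Four encoder features, two classes (toy numbers).
class_scores = [[0.1, 0.8], [0.3, 0.2], [0.9, 0.05], [0.4, 0.6]]
top2 = select_top_k(class_scores, 2)
print(top2)  # [2, 0]
```

Note that nothing in this selection looks at how well each feature localizes its object, which is exactly the gap the section goes on to address.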
To address this problem,
we propose the uncertainty-minimal query selection scheme,
which explicitly constructs and optimizes the epistemic uncertainty
to model the joint latent variable of the encoder features.

Specifically,
the feature uncertainty $U$ is defined as the discrepancy between the predicted distributions of localization $P$ and classification $C$ in Eq. (2).
(Why is it defined this way...? How does this expression come to represent feature uncertainty?)
To minimize the uncertainty of the queries,
we add the uncertainty to the loss function in Eq. (3) for gradient-based optimization.
How exactly is $L_{cls}(U(\hat{X}), \hat{c}, c)$ computed?
(I should look at this more closely in the code later.)
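For reference, the two equations mentioned above, written out as I recall them from the paper; treat the notation as my transcription and check it against the original. Here $\hat{X}$ is the encoder feature, $\hat{Y} = \{\hat{c}, \hat{b}\}$ the predicted category and box, and $Y = \{c, b\}$ the ground truth.

$$
U(\hat{X}) = \lVert P(\hat{X}) - C(\hat{X}) \rVert, \quad \hat{X} \in \mathbb{R}^{D} \tag{2}
$$

$$
L(\hat{X}, \hat{Y}, Y) = L_{box}(\hat{b}, b) + L_{cls}(U(\hat{X}), \hat{c}, c) \tag{3}
$$

Read this way, the uncertainty enters only through the classification term, so the classification score is pushed to agree with how well the feature localizes its object.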

Effectiveness analysis

To analyze the effectiveness of uncertainty-minimal query selection,
we visualized the classification scores and IoU scores of the selected features on COCO val2017. (Figure 6.)
It is a scatterplot of the features whose classification score is greater than 0.5.
The purple dots are the features selected by the model trained with uncertainty-minimal query selection.
The green dots are the features selected by the model trained with vanilla query selection.
The closer a dot is to the top right of the figure, the higher the quality of the corresponding feature,
i.e. the predicted category and box describe the actual object better.

The notable point in the scatterplot is that the purple dots are concentrated in the top right,
whereas the green dots are concentrated in the bottom right.
This shows that uncertainty-minimal query selection produces higher-quality encoder features.

4.4. Scaled RT-DETR

Real-time detectors usually provide models at different scales to accommodate different scenarios,
and RT-DETR supports flexible scaling as well.
Specifically,
the width of the hybrid encoder is controlled by the embedding dimension and the number of channels,
and its depth by the number of Transformer layers and RepBlocks. (RepBlock?)
The width and depth of the decoder are controlled by the number of object queries and decoder layers.
Furthermore, the speed of RT-DETR can be tuned flexibly by adjusting the number of decoder layers.
We observe that removing a few decoder layers at the end has a minimal effect on accuracy
but greatly improves inference speed.
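The speed-tuning idea can be illustrated with a toy decoder that simply stops after the first few layers. Everything here is a hypothetical stand-in: each "layer" is a function that moves the query halfway toward a target value, which mimics iterative refinement with diminishing returns. My understanding is that truncation works without retraining because each decoder layer has its own auxiliary prediction head (see Sec. 4.1), so the output of any intermediate layer can be decoded directly.

```python
def run_decoder(queries, layers, num_layers=None):
    """Apply only the first num_layers decoder layers (all of them if None)."""
    active = layers if num_layers is None else layers[:num_layers]
    outputs = []
    for layer in active:
        queries = layer(queries)
        outputs.append(queries)      # each entry could feed its own prediction head
    return queries, outputs


def make_toy_layer(target=1.0):
    """Toy refinement step: move every query value halfway toward the target."""
    return lambda q: [v + 0.5 * (target - v) for v in q]


layers = [make_toy_layer() for _ in range(6)]

full, _ = run_decoder([0.0], layers)                 # all 6 layers
fast, _ = run_decoder([0.0], layers, num_layers=4)   # last 2 layers dropped
print(full, fast)  # [0.984375] [0.9375]
```

In this toy setting the last two layers change the result only slightly while accounting for a third of the decoder work, which is the kind of trade-off the paragraph above describes.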

RT-DETR built on ResNet50 and ResNet101 is compared with the L and X models of the YOLO detectors (presumably the strongest YOLO models).
Lighter RT-DETR models can be designed by applying smaller (ResNet18/34) or scalable (CSPResNet) backbones.

5. Experiments

5.1. Comparison with SOTA

Open questions and things I don't understand yet

First, "interaction" can probably be thought of as an operation for exchanging information.
For example, encoder-decoder attention (= cross attention) or fusion.

single-scale Transformer encoder?
- Literally a Transformer encoder applied to a single scale
multi-scale Transformer encoder?
- Before the encoder, features are extracted by the ResNet backbone and fed into the encoder;
  this is the Transformer encoder used when the features from each ResNet stage (multi-scale), as in FPN,
  are fed into the encoder together.
  For example, Deformable
intra-scale feature interaction?
- A feature at a single scale performing self-attention with itself
cross-scale feature fusion?
- Features at several scales being fused (their information mixed) to produce new features.
  In this paper it refers to a function such as CCFF.

AIFI (Attention-based Intra-scale Feature Interaction)?
- Uses a single-scale Transformer encoder on $S_5$ alone, decoupled from the other scales, to produce $F_5$..?
  (Afterwards, CCFF fuses $S_3, S_4, F_5$)
