DAB-DETR
Dynamic Anchor Boxes are better queries for DETR
Published at ICLR 2022, Tsinghua University.
Presenter: Dongkeon Park
Introduction
- What’s DETR (DEtection TRansformer)
- Transformer on Object Detection
- Contribution
- Proposes a new "direct set prediction" formulation of object detection
- This effectively removes heuristic steps such as NMS
- Encoder-decoder based Transformer
- Sequence prediction approach
- Role of Transformer self-attention
- Interaction across the sequence -> relations between features
- Removes duplicate predictions (eliminates NMS)
- Achieves performance comparable to Faster R-CNN
- Simple to re-implement
- Can be built from only a standard ResNet and a Transformer (see the sketch below)
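To make the "ResNet + Transformer only" point concrete, here is a minimal sketch of a DETR-style detector in PyTorch. This is illustrative rather than the official implementation: the class and head names are made up, positional encodings and the matching loss are omitted, and torchvision's ResNet-50 is assumed as the backbone.

```python
# Minimal DETR-style detector sketch (illustrative, not the official implementation).
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    def __init__(self, num_classes, d_model=256, num_queries=100):
        super().__init__()
        backbone = resnet50()
        # keep everything up to the last conv stage; drop avgpool and fc
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Embedding(num_queries, d_model)  # learnable object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h), normalized

    def forward(self, images):                                  # images: (B, 3, H, W)
        feat = self.input_proj(self.backbone(images))           # (B, d_model, h, w)
        src = feat.flatten(2).permute(2, 0, 1)                  # (h*w, B, d_model)
        tgt = self.query_embed.weight.unsqueeze(1).expand(-1, images.size(0), -1)
        hs = self.transformer(src, tgt)                         # (num_queries, B, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```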
DETR: End-to-End Object Detection with Transformers, ECCV 2020
- Detail
- Directly addresses the set prediction problem via bipartite matching (sketched after this list)
- During training, bipartite matching discourages duplicate predictions of the same instance
- Attention map visualization
- Last Encoder layer attention map visualization
- The Transformer encoder separates the instances within an image
- The decoder then makes object extraction and localization easier
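As a quick illustration of how bipartite matching assigns each ground-truth object to exactly one prediction during training, below is a hedged sketch using SciPy's Hungarian solver. The cost shown is a simplified mix of class probability and L1 box distance; DETR's actual matching cost additionally uses a generalized IoU term.

```python
# Sketch of bipartite matching for set prediction (simplified cost; illustrative only).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (Q, num_classes + 1), pred_boxes: (Q, 4) as normalized (cx, cy, w, h),
       gt_labels: (G,), gt_boxes: (G, 4). Returns matched (query_idx, gt_idx) arrays."""
    prob = pred_logits.softmax(-1)                      # (Q, C + 1)
    cost_class = -prob[:, gt_labels]                    # (Q, G): higher class prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G): L1 distance between boxes
    cost = cost_bbox + cost_class                       # combined matching cost
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, g_idx  # every ground truth gets one query; unmatched queries predict "no object"
```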
Deformable DETR: Deformable Transformers for End-to-End Object Detection, ICLR, 2021
Efficient DETR: Improving End-to-End Object Detector with Dense Prior, ArXiv, 2021
- Problem of DETR
- Follow-up works for both faster training convergence and better performance
- Deformable DETR
- Uses 2D reference points as queries and computes cross-attention with respect to each reference point
- Efficient DETR
- Uses dense prediction to select the top-K object queries
- Conditional DETR
- Learns a "conditional query" from the content features so that it matches the image features better
- These works only leverage 2D positions as anchor points, without considering object scales
- Propose new query formulation
- Use anchor boxes, i.e., 4D box coordinates (x, y, w, h), as queries and update them layer by layer
- Since the query carries scale information (w, h) along with the spatial point, object scale can be taken into account
Key Insight of DETR
- Query
- Content part (decoder self-attention output)
- Positional part (learnable queries in DETR)
- Key for cross attention
- Content part (encoded image feature)
- Positional part (positional embedding)
- Interpretation
- Cross-attention pools features from the feature map based on a query-to-feature similarity measure (sketched below)
- The decoder embedding produced by cross-attention is reused as the query in the next layer
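A rough sketch of this decomposition with illustrative tensor shapes (single-head attention for brevity; real implementations use nn.MultiheadAttention): the cross-attention query is the sum of a content part and a positional part, the key is the encoded image feature plus its positional embedding, and attention then pools image features by query-to-key similarity.

```python
# Illustrative composition of DETR's cross-attention queries and keys (single-head).
import torch

Q, N, B, D = 100, 1000, 2, 256            # num queries, image tokens, batch size, model dim
decoder_emb   = torch.randn(Q, B, D)      # content part of the query (decoder self-attn output)
learned_query = torch.randn(Q, B, D)      # positional part of the query (learnable in DETR)
image_feat    = torch.randn(N, B, D)      # content part of the key (encoder output)
pos_embed     = torch.randn(N, B, D)      # positional part of the key (sinusoidal PE)

q = decoder_emb + learned_query           # query = content part + positional part
k = image_feat + pos_embed                # key   = content part + positional part
attn = torch.softmax(torch.einsum('qbd,nbd->qbn', q, k) / D ** 0.5, dim=-1)
pooled = torch.einsum('qbn,nbd->qbd', attn, image_feat)  # features pooled by similarity
```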
3 WHY A POSITIONAL PRIOR COULD SPEEDUP TRAINING?
- Prior work (Sun et al., "Rethinking Transformer-based Set Prediction for Object Detection")
- The cross-attention module is mainly responsible for the slow convergence
- They simply removed the decoder for faster training
- Encoder self-attention vs Decoder cross-attention
- Key difference: the inputs that form the queries
- Decoder embeddings
- are projected to the same space as the image features
- Root cause of slow convergence: the learnable queries
- Second reason: how positional information is used
- Visualize a few positional attention maps
- between the learned queries and the positional embeddings of the image features
- Undesirable properties
- multiple modes and nearly uniform attention weights
- two or more concentration centers
- making it hard to locate objects when multiple objects exist in an image
- attention focuses on areas that are either too large or too small
- such queries cannot inject useful positional information
- Remedy: explicit positional priors
- to constrain each query to a local region (see the inspection sketch below)
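Below is a hedged sketch of the kind of inspection behind this observation: compute dot products between a stand-in learnable query and simplified 2D sinusoidal positional embeddings of the feature map, then look at the resulting map. `sine_pos_embed` is a simplified stand-in, not DETR's exact encoding.

```python
# Inspect the "positional" part of attention between a query and image positional embeddings.
import torch

def sine_pos_embed(h, w, d=256, temp=10000.0):
    """Simplified 2D sine/cosine positional embedding (stand-in for DETR's PE)."""
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing='ij')
    dim_t = temp ** (torch.arange(d // 4).float() / (d // 4))
    def enc(v):                                   # (h, w) -> (h, w, d // 2)
        v = v[..., None] / dim_t
        return torch.cat([v.sin(), v.cos()], dim=-1)
    return torch.cat([enc(ys), enc(xs)], dim=-1)  # (h, w, d)

h, w, d = 32, 32, 256
pe = sine_pos_embed(h, w, d).reshape(h * w, d)    # positional embeddings of all image tokens
learned_query = torch.randn(d)                    # stand-in for one learnable positional query
attn = torch.softmax(pe @ learned_query / d ** 0.5, dim=0).reshape(h, w)
# Plotting `attn` typically reveals multiple modes or near-uniform weights,
# i.e. the undesirable patterns listed above.
```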
Cross-attention
- DETR+DAB
- Note that the only difference between DETR and DETR+DAB is the formulation of queries
- can achieve both faster training convergence and higher detection accuracy
Cross-attention
- vs. Conditional DETR
- Conditional DETR also learns explicit positional embeddings as positional queries for training,
- yielding attention maps similar to Gaussian kernels
- but it ignores the scale information of objects
Introduction
Key Insight of DETR
- Center position (x, y)
- allows us to use an anchor box to pool features around the center
- Anchor box size (w, h)
- modulates the cross-attention map, adapting it to the anchor box size
- Positional constraint
- focuses each query on a local region corresponding to the target object
- also helps speed up training convergence
DAB-DETR
- Novel query formulation using dynamic anchor boxes (DAB) for DETR
- Offer a deeper understanding of the role of queries in DETR
- Directly uses box coordinates as queries and dynamically updates them layer by layer
- 1. Improves the similarity between queries and features
- 2. Addresses the slow training convergence of DETR
- 3. Allows the positional attention map to be modulated directly with the box width and height
4 DAB-DETR
Transformer decoder part
- Dual queries fed into the decoder
- Dual query
- Positional queries (anchor boxes)
- content queries (decoder embeddings)
- so the decoder probes objects that correspond to the anchors and have patterns similar to the content queries
- The dual queries are updated layer by layer
- to gradually get closer to the target ground-truth objects (see the refinement sketch below)
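A sketch of this layer-by-layer refinement under assumed names: `positional_query` stands for the anchor-to-positional-query mapping (sketched in the next subsection), the decoder-layer call signature is hypothetical, and the update follows the common "predict a relative offset in inverse-sigmoid space, then squash back to [0, 1]" scheme.

```python
# Sketch of layer-by-layer anchor refinement with dual queries (assumed interfaces).
import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def refine_anchors(anchors, decoder_layers, box_heads, content_q, memory):
    """anchors: (Q, B, 4) in [0, 1] as (cx, cy, w, h), refined after every decoder layer."""
    for layer, box_head in zip(decoder_layers, box_heads):
        pos_q = positional_query(anchors)            # hypothetical helper: PE(anchor) -> MLP
        content_q = layer(content_q, memory, pos_q)  # decoder layer consuming the dual queries
        delta = box_head(content_q)                  # relative box offset, shape (Q, B, 4)
        # update in inverse-sigmoid space; detaching between layers is a common choice
        anchors = (inverse_sigmoid(anchors) + delta).sigmoid().detach()
    return anchors, content_q
```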
Learning Anchor Boxes Directly
- Positional Encoding (PE) of the anchor box coordinates
- Anchor -> PE -> MLP yields the positional query; the decoder embedding serves as the content query (sketched below)
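A minimal sketch of that mapping, assuming each of x, y, w, h gets its own sinusoidal encoding which is concatenated and projected back to the model dimension by a small MLP; the helper and class names are illustrative.

```python
# Sketch: 4D anchor box -> positional query (PE of each coordinate, then an MLP).
import math
import torch
from torch import nn

def sine_embed(x, num_feats=128, temp=10000.0):
    """x: (...,) coordinates in [0, 1] -> (..., num_feats) sinusoidal embedding (assumed form)."""
    dim_t = temp ** (torch.arange(num_feats // 2).float() / (num_feats // 2))
    pos = x[..., None] * 2 * math.pi / dim_t
    return torch.cat([pos.sin(), pos.cos()], dim=-1)

class AnchorPositionalQuery(nn.Module):
    """Turns a 4D anchor (x, y, w, h) into a D-dim positional query."""
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, anchors):                     # anchors: (Q, B, 4)
        x, y, w, h = anchors.unbind(-1)             # each (Q, B)
        pe = torch.cat([sine_embed(c) for c in (x, y, w, h)], dim=-1)  # (Q, B, 2 * d_model)
        return self.mlp(pe)                         # positional query: (Q, B, d_model)
```

The resulting positional query is then paired with the decoder embedding (the content query) in both self-attention and cross-attention.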
Learning Anchor Boxes Directly
- Self-attention for query update
- all three of the queries, keys, and values share the same content items, while the positional queries are added to the queries and keys (see the sketch below)
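A short sketch of that update using nn.MultiheadAttention, with the anchor-derived positional query added to the queries and keys only (illustrative shapes):

```python
# Sketch of decoder self-attention with dual queries.
import torch
from torch import nn

d_model, Q, B = 256, 100, 2
self_attn = nn.MultiheadAttention(d_model, num_heads=8)

content_q = torch.randn(Q, B, d_model)   # decoder embeddings (content queries)
pos_q     = torch.randn(Q, B, d_model)   # positional queries derived from the anchor boxes

q = k = content_q + pos_q                # positional information enters only Q and K
v = content_q                            # values carry content only
updated_content, _ = self_attn(q, k, v)  # queries interact, e.g. to suppress duplicates
```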
Learning Anchor Boxes Directly
- Cross-attention for feature probing with width/height-modulated positional attention (sketched below)
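A hedged sketch of that modulation, reusing the assumed `sine_embed` helper from above: the x- and y-axis positional similarities are scaled by the ratio of a content-predicted reference size to the anchor's own width and height, so the Gaussian-like attention pattern is stretched or shrunk to match the anchor box.

```python
# Sketch of width/height-modulated positional attention for cross-attention.
import math
import torch

def sine_embed(x, num_feats=128, temp=10000.0):   # same assumed helper as in the earlier sketch
    dim_t = temp ** (torch.arange(num_feats // 2).float() / (num_feats // 2))
    pos = x[..., None] * 2 * math.pi / dim_t
    return torch.cat([pos.sin(), pos.cos()], dim=-1)

def modulated_positional_attention(anchors, wh_ref, key_pe_x, key_pe_y, d_model=256):
    """anchors: (Q, 4) = (x, y, w, h); wh_ref: (Q, 2) reference size predicted from the
       content query; key_pe_x / key_pe_y: (N, d_model // 2) sinusoidal PEs of the key
       x / y coordinates. Returns (Q, N) positional attention logits."""
    x, y, w, h = anchors.unbind(-1)                              # each (Q,)
    q_pe_x, q_pe_y = sine_embed(x), sine_embed(y)                # (Q, d_model // 2)
    sim_x = (q_pe_x @ key_pe_x.t()) * (wh_ref[:, 0] / w)[:, None]
    sim_y = (q_pe_y @ key_pe_y.t()) * (wh_ref[:, 1] / h)[:, None]
    return (sim_x + sim_y) / d_model ** 0.5
```

These positional logits effectively combine with the content-to-content similarity before the softmax, which is how the anchor's width and height shape the final cross-attention map.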