[WIP] DETR: End-to-End Object Detection with Transformers

Estelle Yoon · March 18, 2025



Date: 2020
Venue: ECCV

1 Introduction

Problem

Modern detectors take an indirect approach, defining surrogate regression and classification problems on a large set of proposals, anchors, or window centers

Their performance is significantly influenced by hand-designed post-processing steps that collapse near-duplicate predictions

To overcome this, a direct set prediction approach is used, following an end-to-end philosophy that has led to significant advances in other complex structured prediction tasks but had not yet reached object detection

Proposal

Streamlines the training pipeline by viewing object detection as a direct set prediction problem

Adopts an encoder-decoder transformer architecture

DETR predicts all objects at once and is trained end to end with a set loss function that performs bipartite matching between predicted and ground-truth objects

DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors or non-maximum suppression

DETR doesn’t require any customized layers and can be reproduced easily in any framework that provides a standard CNN and transformer

2 Related work

2.1 Set Prediction

A general approach is to use auto-regressive sequence models such as RNNs

The loss function needs to be invariant to a permutation of the predictions, since a set has no intrinsic ordering

The usual solution is to design a loss based on the Hungarian algorithm, which enforces permutation invariance and guarantees that each target element has a unique match

DETR follows this bipartite-matching approach, but uses a transformer with parallel decoding instead of an auto-regressive model
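As a toy illustration, the Hungarian algorithm is available in SciPy as `linear_sum_assignment`; the cost values below are made up purely for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: cost[i, j] = cost of matching prediction i to target j.
cost = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
])

# The Hungarian algorithm returns a unique one-to-one assignment
# that minimizes the total matching cost.
pred_idx, tgt_idx = linear_sum_assignment(cost)
print(pred_idx, tgt_idx)  # [0 1 2] [1 0 2]: prediction 0 -> target 1, etc.
```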

2.2 Transformer and Parallel Decoding

Transformers introduced self-attention layers, which scan through each element of a sequence and update it by aggregating information from the whole sequence

Their main advantages are global computation and perfect memory, which makes them better suited to long sequences than RNNs
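A minimal sketch of a single self-attention layer in PyTorch, with shapes chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

# One self-attention layer: every position attends to every other position,
# so each updated element aggregates information from the whole sequence.
seq_len, d_model, n_heads = 10, 256, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

x = torch.randn(seq_len, 1, d_model)  # (sequence, batch, features)
out, weights = attn(x, x, x)          # query = key = value = x
print(out.shape, weights.shape)       # [10, 1, 256], [1, 10, 10]
```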

DETR combines transformers with parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations needed for set prediction

2.3 Object detection

DETR removes these hand-crafted processes and directly predicts the set of detections, with boxes predicted in absolute coordinates with respect to the input image rather than relative to an anchor

Set based loss

Several earlier object detectors used a bipartite matching loss

However, in these early models the relations between predictions were modeled only with convolutional or fully connected layers

Hand-designed NMS post-processing was still needed to make them perform well

This means that even with a set-based loss, these detectors still relied on manual post-processing; see the sketch below
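For reference, this is what the hand-designed NMS step looks like, here via `torchvision.ops.nms` with made-up boxes and scores:

```python
import torch
from torchvision.ops import nms

# Hand-designed NMS post-processing used by classical detectors (not by DETR).
# Boxes are (x1, y1, x2, y2); values are illustrative only.
boxes = torch.tensor([
    [10., 10., 50., 50.],
    [12., 12., 52., 52.],      # heavy overlap with the first box
    [100., 100., 150., 150.],
])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the lower-scored near-duplicate is suppressed
```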

Recurrent detectors

Recurrent detectors use bipartite matching losses with an encoder-decoder architecture, based on CNN activations and an RNN, to directly produce a set of bounding boxes

However, they were only evaluated on small datasets and not against modern baselines

3 The DETR model

Two ingredients are essential:

  1. a set prediction loss that forces unique matching between predicted and ground truth boxes

  2. an architecture that predicts a set of objects and models their relation

3.1 Object detection set prediction loss

DETR infers a fixed-size set of N predictions, where N is significantly larger than the typical number of objects in an image

The loss first produces an optimal bipartite matching between predicted and ground-truth objects, then optimizes the object-specific (class and box) losses over the matched pairs

The optimal assignment is computed efficiently with the Hungarian algorithm
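A simplified sketch of this matching step, assuming a cost built only from class probability and an L1 box distance (the paper's full matching cost also includes a generalized IoU term):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    """Match N predictions to ground-truth objects.

    Simplified matching cost: -p(class) + L1(box); DETR additionally
    uses a generalized IoU term on the boxes.
    """
    prob = pred_logits.softmax(-1)                       # (N, num_classes + 1)
    cost_class = -prob[:, tgt_labels]                    # (N, num_targets)
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # (N, num_targets)
    cost = cost_class + cost_bbox
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, tgt_idx  # unmatched predictions fall to "no object"

# Toy example: N = 5 predictions, 2 ground-truth objects.
pred_logits = torch.randn(5, 92)   # 91 classes + "no object"
pred_boxes = torch.rand(5, 4)      # (cx, cy, w, h), normalized
tgt_labels = torch.tensor([3, 17])
tgt_boxes = torch.rand(2, 4)
print(hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes))
```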

3.2 DETR architecture

Backbone

A conventional CNN is used as the backbone; it generates a lower-resolution activation map from the input image
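A minimal sketch with a ResNet-50 backbone, which (as in the paper's typical setup) maps a 3 × H × W image to a 2048 × H/32 × W/32 activation map:

```python
import torch
import torchvision

# Conventional CNN backbone: a ResNet-50 with the classification head removed.
resnet = torchvision.models.resnet50()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

x = torch.randn(1, 3, 800, 1066)
features = backbone(x)
print(features.shape)  # torch.Size([1, 2048, 25, 34])
```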

Transformer encoder

First, a 1×1 convolution reduces the channel dimension of the high-level activation map to a smaller dimension d

Each encoder layer has the standard architecture, consisting of a multi-head self-attention module and a feed-forward network (FFN); fixed positional encodings are added since the transformer is permutation-invariant
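A minimal sketch of the projection and encoder, assuming d = 256; note that the actual model adds the positional encodings inside every attention layer, while here they are simply added once to the input:

```python
import torch
import torch.nn as nn

d_model = 256
# 1x1 convolution reduces the backbone channels (e.g. 2048) to d_model.
input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

features = torch.randn(1, 2048, 25, 34)                 # backbone activation map
src = input_proj(features).flatten(2).permute(2, 0, 1)  # (HW, batch, d_model)
pos = torch.randn(src.shape[0], 1, d_model)             # stand-in for fixed positional encodings
memory = encoder(src + pos)
print(memory.shape)                                     # torch.Size([850, 1, 256])
```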

Transformer decoder

The decoder transforms N embeddings of size d (learned "object queries") using multi-headed self-attention and encoder-decoder attention, decoding all N objects in parallel at each decoder layer
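A minimal sketch with N = 100 learned object queries decoded in parallel:

```python
import torch
import torch.nn as nn

N, d_model = 100, 256
# N learned embeddings ("object queries") are decoded in parallel,
# attending to each other and to the encoder output.
query_embed = nn.Embedding(N, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(850, 1, d_model)      # encoder output
queries = query_embed.weight.unsqueeze(1)  # (N, batch, d_model)
hs = decoder(queries, memory)              # all N objects decoded at once
print(hs.shape)                            # torch.Size([100, 1, 256])
```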

Prediction feed forward network

The final prediction is computed by a 3-layer perceptron with ReLU activations and hidden dimension d (for the box coordinates), and a linear projection layer (for the class labels)
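A sketch of the two prediction heads, assuming d = 256 and 91 COCO classes plus a "no object" class:

```python
import torch
import torch.nn as nn

d_model, num_classes, N = 256, 91, 100
# Class head: a single linear projection (num_classes + 1 for "no object").
class_head = nn.Linear(d_model, num_classes + 1)
# Box head: 3-layer perceptron with ReLU, predicting normalized (cx, cy, w, h).
bbox_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4),
)

hs = torch.randn(N, 1, d_model)  # decoder output embeddings
logits = class_head(hs)          # (N, 1, num_classes + 1)
boxes = bbox_head(hs).sigmoid()  # (N, 1, 4), normalized to [0, 1]
print(logits.shape, boxes.shape)
```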

Auxiliary decoding losses

Auxiliary losses, added after each decoder layer with shared prediction FFNs, help the model output the correct number of objects of each class during training
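A sketch of how the auxiliary outputs can be gathered: the same shared head is applied to the output of every decoder layer, and each layer's predictions receive the full Hungarian loss (the decoder outputs below are random stand-ins):

```python
import torch
import torch.nn as nn

num_layers, N, d_model, num_classes = 6, 100, 256, 91
# Shared prediction head, applied after every decoder layer.
class_head = nn.Linear(d_model, num_classes + 1)

# Stand-in for per-layer decoder outputs: (layers, N, batch, d_model).
hs_all = torch.randn(num_layers, N, 1, d_model)
aux_logits = [class_head(hs) for hs in hs_all]  # one prediction set per layer
print(len(aux_logits), aux_logits[0].shape)     # 6 torch.Size([100, 1, 92])
```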
