Date: 2020
Venue: ECCV (conference)
Modern detectors detect objects indirectly, by defining surrogate regression and classification problems on a large set of proposals, anchors, or window centers
Their performance is significantly influenced by post-processing steps
To overcome this, a direct set prediction approach (an end-to-end philosophy) is used; it has led to significant advances in complex structured prediction tasks, but not yet in object detection
Streamlines the training pipeline by viewing object detection as a direct set prediction problem
Adopts an encoder-decoder transformer
DETR predicts all objects at once, and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects
DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors and non-maximum suppression
DETR doesn't require any customized layers and can be reproduced easily in any framework that provides standard CNN and transformer implementations
A general approach to set prediction is to use auto-regressive sequence models such as RNNs
The loss function needs to be invariant to a permutation of the predictions
The usual solution is to design a loss based on the Hungarian algorithm, which enforces permutation invariance and guarantees a unique match between predictions and ground truth
DETR uses a transformer with parallel (non-autoregressive) decoding together with this bipartite matching loss
Transformers introduced self-attention layers, which scan through each element of a sequence and update it by aggregating information from the whole sequence
Their main advantages are global computation and perfect memory, which make them well suited to long sequences
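A minimal self-attention sketch (assuming PyTorch; the sequence length and dimensions are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

seq = torch.randn(5, 1, 8)  # (sequence length, batch, embedding dim)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2)

# Each element attends to every element of the sequence,
# so every output aggregates information from the whole input.
out, weights = attn(seq, seq, seq)
print(out.shape)      # torch.Size([5, 1, 8])
print(weights.shape)  # torch.Size([1, 5, 5]): attention over all pairs
```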
DETR combines transformers and parallel decoding for their favorable trade-off between computational cost and the ability to perform global computations
DETR removes this hand-crafted process and directly predicts the set of detections, with absolute box coordinates given with respect to the input image rather than an anchor
Several earlier object detectors used a bipartite matching loss
But in these models the relations between predictions were modeled with convolutional or fully connected layers only
To improve performance, hand-designed NMS post-processing was still needed
This means that without a set-based global loss, manual post-processing remains necessary
Recurrent detectors used bipartite matching losses with an encoder-decoder architecture, based on CNN activations and an RNN, to directly produce a set of bounding boxes
However, they were only evaluated on small datasets, not against modern baselines
Two ingredients are essential:
a set prediction loss that forces unique matching between predicted and ground-truth boxes
an architecture that predicts a set of objects and models their relations
DETR infers a fixed-size set of N predictions, with N set significantly larger than the typical number of objects in an image
The loss first produces an optimal bipartite matching between predicted and ground-truth objects, then optimizes the object-specific (class and box) losses
The optimal assignment is computed efficiently with the Hungarian algorithm
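A minimal sketch of the matching step, assuming SciPy's linear_sum_assignment; the cost here is only a pairwise L1 box distance, whereas DETR's actual matching cost also includes class probabilities:

```python
import torch
from scipy.optimize import linear_sum_assignment

pred_boxes = torch.rand(6, 4)  # N = 6 predictions (cx, cy, w, h)
gt_boxes = torch.rand(3, 4)    # 3 ground-truth boxes

# Cost matrix: rows = predictions, columns = ground-truth objects.
cost = torch.cdist(pred_boxes, gt_boxes, p=1)

# The Hungarian algorithm finds the assignment with minimal total cost;
# each ground-truth box is matched to exactly one prediction.
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx, gt_idx)))  # unique prediction <-> gt pairs
```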
A conventional CNN backbone generates a lower-resolution activation map from the input image
First, a 1x1 convolution reduces the channel dimension of the high-level activation map from C to a smaller dimension d
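A minimal sketch of this step, assuming torchvision's ResNet-50 as the backbone and d=256 (the paper's typical setting); the input size is illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50().children())[:-2])  # drop pool/fc
proj = nn.Conv2d(2048, 256, kernel_size=1)  # C=2048 -> d=256

x = torch.randn(1, 3, 800, 800)
feat = backbone(x)   # (1, 2048, 25, 25): lower-resolution activation map
feat = proj(feat)    # (1, 256, 25, 25)
seq = feat.flatten(2).permute(2, 0, 1)  # (HW, batch, d): encoder input sequence
print(seq.shape)     # torch.Size([625, 1, 256])
```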
Each encoder layer has the standard architecture, consisting of a multi-head self-attention module and an FFN; since the transformer is permutation-invariant, fixed positional encodings are added to the input of each attention layer
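A minimal encoder sketch using PyTorch's stock nn.TransformerEncoder (note: the stock layer adds positional encodings only once at the input, while DETR adds them at every attention layer); nhead=8 and 6 layers match the paper's reported setting:

```python
import torch
import torch.nn as nn

d_model = 256
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                   dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)

seq = torch.randn(625, 1, d_model)  # flattened backbone features
pos = torch.randn(625, 1, d_model)  # positional encodings (random stand-in)
memory = encoder(seq + pos)         # (HW, batch, d)
print(memory.shape)                 # torch.Size([625, 1, 256])
```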
The transformer decoder transforms N learned embeddings of size d (the object queries) using multi-headed self-attention and encoder-decoder attention; unlike the original transformer, DETR decodes all N objects in parallel at each decoder layer
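A minimal decoder sketch, assuming PyTorch's nn.TransformerDecoder with N=100 learned object queries (the paper's setting); no autoregressive mask is applied, so all slots are decoded in parallel:

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8,
                                   dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)

query_embed = nn.Embedding(num_queries, d_model)  # learned object queries
memory = torch.randn(625, 1, d_model)             # encoder output

tgt = query_embed.weight.unsqueeze(1)  # (N, batch, d)
hs = decoder(tgt, memory)              # (N, batch, d): one embedding per slot
print(hs.shape)                        # torch.Size([100, 1, 256])
```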
The final prediction is computed by a 3-layer perceptron with ReLU activations and hidden dimension d (regressing normalized box coordinates) plus a linear projection layer (predicting class labels via softmax)
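A minimal sketch of the prediction heads; num_classes=91 (COCO) plus an extra "no object" class, as in the paper:

```python
import torch
import torch.nn as nn

d_model, num_classes = 256, 91
class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
box_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4),  # (cx, cy, w, h)
)

hs = torch.randn(100, 1, d_model)  # decoder output embeddings
logits = class_head(hs)            # (100, 1, num_classes + 1)
boxes = box_head(hs).sigmoid()     # (100, 1, 4): absolute, image-relative
print(logits.shape, boxes.shape)
```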
Auxiliary losses, with prediction FFNs and the Hungarian loss applied after each decoder layer, help the model output the correct number of objects of each class during training
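A minimal sketch of how intermediate decoder outputs could be collected for these auxiliary losses, assuming shared heads across layers (this looping structure is an assumption, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

d_model, num_layers = 256, 6
layers = nn.ModuleList(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
    for _ in range(num_layers)
)
class_head = nn.Linear(d_model, 92)  # shared prediction head (assumption)

tgt = torch.zeros(100, 1, d_model)   # object queries (stand-in)
memory = torch.randn(625, 1, d_model)

aux_logits = []
for layer in layers:
    tgt = layer(tgt, memory)
    aux_logits.append(class_head(tgt))  # supervise every layer's output
print(len(aux_logits))  # 6 sets of predictions, one per decoder layer
```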