DETR Paper Review - End-to-End Object Detection with Transformers

davidlyoo·2025년 2월 3일
0

Computer Vision

목록 보기
8/8
post-thumbnail

Summary

DEtection TRansformer (DETR) introduces a transformer-based approach to object detection, redefining it as a direct set prediction problem. Unlike many other modern detectors which rely on predefined anchor boxes and non-maximum suppression (NMS), DETR utilizes bipartite matching and transformers to directly predict objects in an end-to-end fashion.

Key Innovations:

    1. Set-based global loss: Ensures unique predictions via bipartite matching.
    1. Transformer encoder-decoder architecture: Captures global context via self-attention and processes object queries in parallel, eliminating the need for sequential region-based detection pipelines.

Introduction

Object detection has traditionally relied on hand-designed components such as predefined anchor boxes, IoU-based matching, and NMS. These methods have been widely successful but come with several challenges:

Limitations of Traditional Object Detectors

  1. Predefined Anchor Boxes:

    • Require careful tuning for each dataset.
    • Increase computational complexity and introduce hyperparameter sensitivity.
  2. Complex Post-Processing Steps:

    • These challenges call for a model that eliminates heuristic-based matching and post-processing.
    • IoU-based Matching: IoU-based matching assigns multiple bounding boxes to the same object based on heuristics, leading to inconsistencies during training.
    • NMS: Required to remove redundant predictions but introduces heuristic decision rules.

DETR: A Set Prediction Approach

DETR redefines object detection as a set prediction problem, eliminating the need for anchors and post-processing. Instead of relying on region proposals and IoU-based assignment, DETR directly predicts a fixed-size set of objects in a single forward pass.


DETR

Key Innovations of DETR

By treating object detection as a direct set prediction problem, DETR simplifies the detection pipeline while ensuring unique, permutation-invariant predictions.

  1. Anchor-Free and Post-Processing-Free:

    • No predefined anchor boxes.
    • No need for NMS, as predictions are uniquely assigned.
  2. Permutation Invariance & Set Prediction:

    • Non-autoregressive decoding: DETR predicts all objects at once without relying on sequence order.
    • Unlike traditional detectors, DETR treats predictions as an unordered set, ensuring order-invariant training.
    • However, since there is no predefined order, we need a way to associate predicted objects with ground-truth objects.
  3. Bipartite Matching via the Hungarian Algorithm:

    • DETR uses bipartite matching (Hungarian algorithm) to assign predictions to ground-truth objects.
    • Since the number of detected objects can vary, DETR selects the best-matching predictions out of a fixed number of queries.
    • One-to-one matching ensures that each object is assigned to a single prediction, eliminating the need for NMS.

DETR Architecture

The DETR model

  • A set prediction loss (bipartite matching loss)
  • Architecture that predicts a set of objects and models their relation

Object detection set prediction loss

The loss computation in DETR is divided into two main stages:
1. Matching: Assigning predicted objects to ground-truth objects using bipartite matching.
2. Hungarian Loss Computation: Calculating the classification and bounding box losses based on the matched pairs.


1. Matching Stage
DETR generates a fixed number (N) of object queries, which are then matched to ground-truth objects using bipartite matching (Hungarian algorithm).

  • Number of object queries (N) ≥ number of ground-truth objects
    → If there are fewer ground-truth objects, "no object" class is assigned to the extra predictions.

Matching Cost Computation

The Hungarian algorithm calculates the matching cost to determine the optimal assignment between predictions and ground-truth objects. The matching cost considers:

  • Classification score: The probability of predicting the correct class.
  • Bounding box similarity: Based on IoU between predicted and ground-truth boxes.

This matching process serves the same role as heuristic-based assignment rules in traditional object detectors (e.g., anchor-based models), but ensures one-to-one matching without duplicates.


2. Hungarian Loss Computation

Once the matching is determined, DETR computes the Hungarian Loss, which consists of two components:

(1) Classification Loss

  • Uses Negative Log-Likelihood (NLL) Loss to optimize class predictions.
  • Ensures that each predicted object is classified correctly.
  • Includes an additional "no object" class to handle unmatched predictions.

(2) Bounding Box Loss
Since DETR directly predicts bounding boxes without using anchors, it requires a stable loss function that balances different object sizes.
To achieve this, DETR combines two loss functions:

  • L1 Loss: Minimizes the direct difference between predicted and ground-truth box coordinates.
  • Generalized IoU (GIoU) Loss: Addresses scale differences and ensures stability.

These two losses are linearly combined and normalized by the number of ground-truth objects in the batch.


Why Use Generalized IoU Loss?

  • L1 Loss alone is scale-dependent, meaning smaller objects may have lower absolute errors but higher relative errors.
  • GIoU Loss introduces scale invariance, helping the model generalize better across different object sizes.
  • Combining L1 and GIoU Loss ensures both precise localization and scale robustness in bounding box regression.

DETR contains three main components

    1. CNN backbone: using ResNet50 as a backbone network to extract a compact feature representation
    1. Encoder-decoder transformer
    1. Feed Forward Network (FFN): makes the final detection prediction

Backbone Network

  • Starting from the initial image: getting image feature map
  • Backbone generates a lower-resolution activation map (e.g., C=2048, H = H0/32, W = W0/32)

Encoder-decoder Transformer

  • Encoder

    • 1x1 convolution reduces the channel dimension of the high-level activation map new feature map z
    • Each encoder layer has a standard architecture and consists of a Multi-Head Self Attention, FFN + fixed positional encoding that are added to the input of each attention layer
  • Decoder

    • The decoder follows the standard transformer architecture:
      • Multi-Head Self-Attention (MHA)
      • Encoder-Decoder Cross-Attention Mechanism

Unlike standard transformers, DETR’s decoder processes all N objects in parallel at each decoder layer.

  • Object Queries
    • Learnable embeddings of size N, representing potential object detections.
    • These embeddings interact with the encoded feature map to extract object-specific information.

The object queries are added to the input of each attention layer, transforming them into output embeddings, which are then decoded independently into:

  • Bounding box coordinates
  • Class labels

Feed Forward Network (FFN)

  • A three-layer perceptron (MLP) with ReLU activation and hidden dimension d.
  • A linear projection layer is used for final predictions.
  • Predicts normalized bounding box coordinates (center, height, width) relative to the input image.
  • A separate linear layer predicts class labels using a softmax function.
  • A fixed-size set of N bounding boxes is produced, where N is typically much larger than the actual number of objects in the image.
    • An additional special class label ("no object") is used to represent background regions.

Auxiliary decoding losses

  • Auxiliary losses are applied during training to help the model output the correct number of objects per class.
  • Hungarian loss and prediction FFNs are added after each decoder layer.
  • All prediction FFNs share their parameters across decoder layers.
  • A shared layer-normalization (LayerNorm) is used to normalize inputs to the prediction FFNs, ensuring consistent learning across decoder stages.

Performance

Comparison with Faster R-CNN

Despite its simplicity, DETR achieves competitive performance with Faster R-CNN, demonstrating that transformers can replace region-based object detection frameworks. However, notable differences exist:

  • DETR excels at detecting large objects, benefiting from the non-local computations of the transformer.
  • DETR struggles with small objects, as it lacks an explicit feature pyramid like FPN in Faster R-CNN.
  • Training Efficiency: DETR requires longer training schedules but eliminates heuristic-based hyperparameter tuning.

Conclusion

DETR introduces a transformer-based object detection framework, marking a significant departure from traditional region-based architectures. This work highlights the importance of borrowing techniques from different research areas, such as leveraging self-attention mechanisms for object detection. While DETR shows strong performance on large objects, improving its ability to detect small objects remains a future research direction—similar to how FPN improved Faster R-CNN.


References

Carion N, et al. (2020). End-to-end object detection with transformers. European Conference on Computer Vision (ECCV). https://doi.org/10.1007/978-3-030-58452-8_13

0개의 댓글