DETR

ODD · October 8, 2024

Backbone: ResNet-50

Extracts features from the input image
image -> lower-resolution feature map (fewer spatial positions, more channels)
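
To make the size reduction concrete, here is a minimal sketch of the shape arithmetic. It assumes the standard ResNet-50 setup (total stride 32, 2048 output channels); the function name and defaults are illustrative and not taken from any library.

```python
import math


def backbone_output_shape(height, width, stride=32, channels=2048):
    """Return (C, H, W) of the final feature map for an input image.

    Assumes a standard ResNet-50: five stride-2 stages (total stride 32)
    and 2048 channels in the last stage.
    """
    return channels, math.ceil(height / stride), math.ceil(width / stride)


# An 800 x 1216 image shrinks to a 25 x 38 grid of 2048-dim features.
print(backbone_output_shape(800, 1216))  # (2048, 25, 38)
```

Each of the 25 x 38 grid cells then becomes one token in the sequence the Transformer sees.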

Transformer

Models relationships between different parts of the image
(spatial relationships)

Multi-Head Attention
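
As a rough illustration of how attention relates every image location to every other one, here is a single-head scaled dot-product attention sketch in plain Python. The function names are hypothetical, and a real model would use a tensor library; multi-head attention runs several of these in parallel on different learned projections and concatenates the results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (one head)."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical the weights are uniform, so the output is simply the mean of the values.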

Positional Encoding

Encodes where each feature sits in the image, which attention alone cannot see because it treats its input as an unordered set
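
A minimal sketch of a fixed sinusoidal encoding for a 2D grid position, assuming half of the channels encode the row and half the column. The function names and the `temperature` default are illustrative assumptions, not the exact code of any implementation.

```python
import math


def sine_encoding(pos, dim, temperature=10000.0):
    """Encode a scalar position as `dim` sin/cos values (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = temperature ** (i / dim)
        enc.append(math.sin(pos / freq))
        enc.append(math.cos(pos / freq))
    return enc


def position_encoding_2d(row, col, dim):
    """Concatenate a row encoding and a column encoding, dim/2 values each."""
    half = dim // 2
    return sine_encoding(row, half) + sine_encoding(col, half)
```

The resulting vector is added to the feature at that grid cell, so two cells with identical content but different locations no longer look the same to the attention layers.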

Bounding Box Prediction

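Uses a small feed-forward head on each decoder output: in the paper, a 3-layer MLP predicts the box and a linear layer predicts the class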
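
The sketch below assumes the head outputs boxes as normalized (center x, center y, width, height) relative to the image, as described in the DETR paper, and shows how such a box maps back to pixel corners. The function name is hypothetical.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


# A centered box covering half of each side of a 200 x 100 image.
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), img_w=200, img_h=100))
# (50.0, 25.0, 150.0, 75.0)
```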

Object detection set prediction loss

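  • DETR infers a fixed-size set of N predictions per image
  • N is set well above the typical number of objects in an image; the ground-truth set is padded with "no object" (∅) entries up to size N
  • The Hungarian algorithm finds the one-to-one match between predictions and ground truth with the lowest total matching cost
  • The matching cost combines a class-prediction term and a box-similarity term for each prediction/ground-truth pair
  • Because an L1 box loss on its own scales with box size and is therefore biased toward large boxes, it is combined with a generalized IoU loss, which is scale-invariant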
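
The matching step above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's code: a brute-force search over assignments stands in for the Hungarian algorithm (it finds the same minimum but only scales to tiny N; in practice `scipy.optimize.linear_sum_assignment` is the usual choice), boxes are normalized (x0, y0, x1, y1) with positive area, and the weights `w_l1` and `w_giou` as well as all function names are illustrative.

```python
from itertools import permutations


def l1_distance(a, b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def giou(a, b):
    """Generalized IoU of two (x0, y0, x1, y1) boxes, in [-1, 1]."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def match_cost(pred, target, w_l1=5.0, w_giou=2.0):
    """Cost of pairing one prediction with one ground-truth object."""
    probs, box = pred
    label, gt_box = target
    class_cost = -probs[label]  # higher probability -> lower cost
    box_cost = w_l1 * l1_distance(box, gt_box) + w_giou * (1.0 - giou(box, gt_box))
    return class_cost + box_cost


def match_predictions(preds, targets):
    """Return, for each target, the index of its matched prediction.

    Unmatched predictions are implicitly paired with "no object".
    """
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], t) for p, t in zip(perm, targets))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)


preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.9, 0.9)),
    ([0.8, 0.2], (0.1, 0.1, 0.4, 0.4)),
    ([0.5, 0.5], (0.0, 0.0, 1.0, 1.0)),
]
targets = [(0, (0.1, 0.1, 0.4, 0.4)), (1, (0.5, 0.5, 0.9, 0.9))]
print(match_predictions(preds, targets))  # [1, 0]
```

Here N = 3 and there are two objects, so the third prediction is left for the "no object" class. Note that `giou` depends only on ratios of areas, so scaling both boxes by the same factor leaves it unchanged, while the L1 term grows with box size.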

Evaluation

  • Dataset: COCO 2017
  • Trained on 16 V100 GPUs for 3 days
