Backbone: ResNet-50
Extracts features from the input image
image -> lower-resolution feature map (about 1/32 of the input height and width for ResNet-50)
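The size reduction is easy to check with a little arithmetic. The sketch below assumes the usual ResNet-50 total stride of 32, 2048 output channels, and padded convolutions (so sizes round up). `backbone_output_shape` is a hypothetical helper written for these notes, not a library function.

```python
import math

def backbone_output_shape(height, width, stride=32, channels=2048):
    """Return (C, H, W) of the final feature map for an input image.

    Assumes a total stride of 32 and padded convolutions, so each
    spatial size is rounded up.
    """
    return channels, math.ceil(height / stride), math.ceil(width / stride)

c, h, w = backbone_output_shape(800, 1200)
print(c, h, w)  # 2048 25 38
print(h * w)    # 950 positions, flattened into a sequence for the next stage
```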
Transformer: models spatial relationships between different parts of the image
Multi-Head Attention
Positional encoding
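A minimal sketch of how these two pieces fit together, in plain Python. It shows a single attention head (multi-head attention runs several of these in parallel on separate projections and concatenates the results) and a 1D sinusoidal encoding, whereas DETR uses a 2D variant over the feature map. The learned projections are omitted, and all function names and feature values here are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sine_position(pos, dim, temperature=10000.0):
    """1D sinusoidal encoding: sin on even channels, cos on odd channels."""
    enc = []
    for i in range(dim):
        angle = pos / temperature ** ((i - i % 2) / dim)
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def attention(queries, keys, values):
    """Scaled dot-product attention for one head; returns (outputs, weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Three feature-map positions with 4 channels each (made-up numbers).
features = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5], [1.0, 1.0, 0.0, 0.0]]
with_pos = [[f + p for f, p in zip(feat, sine_position(i, 4))]
            for i, feat in enumerate(features)]
# Position is added to queries and keys; values stay as plain features.
outputs, weights = attention(with_pos, with_pos, features)
print([round(sum(row), 6) for row in weights])  # each row of weights sums to 1
```

Without the positional term, attention would treat the flattened feature map as an unordered set, so two identical patches at different locations would be indistinguishable.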
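Where are the objects located in the image?
Bounding Box Prediction
Boxes (normalized center x, center y, width, height) are regressed by a small feed-forward network; class labels come from a linear layer followed by softmax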
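DETR predicts each box as normalized center coordinates plus width and height, so a conversion step is needed before drawing or evaluating boxes in pixels. A small sketch of that conversion follows. `cxcywh_to_xyxy` is a hypothetical helper, and the image size is an arbitrary example.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

# A box centered in a 640x480 image, covering 20% of the width and 40% of the height.
x0, y0, x1, y1 = cxcywh_to_xyxy((0.5, 0.5, 0.2, 0.4), img_w=640, img_h=480)
print(round(x0), round(y0), round(x1), round(y1))  # 256 144 384 336
```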
Object detection set prediction loss
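- DETR infers a fixed-size set of N predictions
- N is set well above the typical number of objects in an image; the ground-truth set is padded with "no object" up to size N
- The Hungarian algorithm finds the one-to-one match between predictions and ground truth with the lowest total matching cost
- The matching cost combines the class prediction and the box similarity for each prediction/ground-truth pair
- Because an L1 bounding box loss is biased by box size (larger boxes produce larger errors), a generalized IoU loss, which is scale-invariant, is used alongside it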
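The matching step can be sketched in plain Python. This is a simplified illustration, not the DETR implementation: it searches all assignments by brute force instead of running the Hungarian algorithm (both find the same minimum-cost match, but brute force only works for tiny inputs; `scipy.optimize.linear_sum_assignment` is the usual solver in practice). Boxes are in (x0, y0, x1, y1) form for simplicity, and the weights `w_l1=5.0` and `w_giou=2.0` are assumed values.

```python
from itertools import permutations

def giou(a, b):
    """Generalized IoU of two (x0, y0, x1, y1) boxes; ranges over (-1, 1]."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both a and b.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def match_cost(probs, box, gt_class, gt_box, w_l1=5.0, w_giou=2.0):
    """Cost of pairing one prediction with one ground-truth object."""
    class_cost = -probs[gt_class]  # high probability for the right class lowers the cost
    l1_cost = sum(abs(p - g) for p, g in zip(box, gt_box))
    return class_cost + w_l1 * l1_cost - w_giou * giou(box, gt_box)

def best_match(preds, gts):
    """Return a tuple m where m[j] is the prediction index matched to gts[j]."""
    best_total, best_perm = None, None
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(match_cost(preds[i][0], preds[i][1], c, b)
                    for i, (c, b) in zip(perm, gts))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_perm

# N = 3 predictions as (class probabilities, box); 2 ground-truth objects as (class, box).
preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.9, 0.9)),
    ([0.8, 0.2], (0.1, 0.1, 0.4, 0.4)),
    ([0.5, 0.5], (0.0, 0.0, 0.05, 0.05)),
]
gts = [(0, (0.1, 0.1, 0.4, 0.4)), (1, (0.5, 0.5, 0.9, 0.9))]
print(best_match(preds, gts))  # (1, 0): prediction 2 is left over as "no object"
```

Predictions that end up unmatched are the ones trained toward the "no object" class, which is why padding the ground truth up to N makes the matching one-to-one.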
Evaluation
- Dataset: COCO 2017
- Trained on 16 V100 GPUs for 3 days