[CV] DETR: End-to-End Object Detection with Transformers review

๊ฐ•๋™์—ฐยท2022๋…„ 3์›” 6์ผ
[Paper review]


🎈 This review was written with reference to the DETR paper and an existing review of it.

👩‍💻 Today I will review DETR, an end-to-end object detection model built on a transformer. Ever since the transformer appeared, many fields have been replacing their existing methods with it. DETR is a paper you can read comfortably with only a basic knowledge of the transformer architecture and of object detection.

Keywords

🎈 Removing NMS and anchor generation
🎈 Transformer encoder-decoder architecture
🎈 Bipartite matching

Introduction

✔ The goal of object detection is to predict a set of bounding boxes and a category label for each of them. Modern detectors address this prediction indirectly, through large sets of proposals, anchors, and the like. Their performance is also heavily influenced by the postprocessing steps that remove duplicate predictions and by the heuristics that define the anchors.

✔ DETR simplifies the training pipeline and presents a method that detects objects directly — in other words, it streamlines away the postprocessing mentioned above. To do so it adopts the transformer, whose self-attention mechanism, the authors argue, is particularly well suited to removing duplicate predictions (e.g. what NMS does).

✔ DETR predicts all objects at once and is trained end-to-end with a loss function that performs bipartite matching between predicted and ground-truth objects. Unlike other existing detectors, DETR requires no customized layers, so it can easily be reproduced in any framework.

✔ In short, DETR combines a transformer with parallel decoding and bipartite matching. The bipartite matching loss assigns each prediction uniquely to a GT object and is invariant to the permutation of the predicted objects, which is what allows the predictions to be emitted in parallel.

✔ DETR performs well on large objects but comparatively poorly on small ones. One can conjecture that this is because it operates on a single CNN feature map, whose limited resolution hurts small objects.

✔ DETR requires a very long training schedule, and it benefits from auxiliary decoding losses in the transformer.

Related work

Set Prediction

✔ First, consider what a set means: a set is unordered and contains no duplicates. One of the difficulties of existing detectors is precisely avoiding duplicate predictions, which they handle with methods such as NMS. DETR's direct set prediction, however, needs no postprocessing like NMS; this is guaranteed by a bipartite matching loss based on the Hungarian algorithm.

✅ Bipartite matching

  • There is a good illustration of this in Dongbin Na's DETR review on YouTube, shown in the figure above: bipartite matching simply means matching predictions and GT values one-to-one, with no duplicates.
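To make the idea concrete, here is a toy sketch of finding such a one-to-one matching. The cost values are made up; at this tiny size we can brute-force the optimum, whereas DETR obtains the same assignment with the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`) in polynomial time.

```python
from itertools import permutations

# Toy 3x3 cost matrix: cost[i][j] is the matching cost between
# prediction i and ground-truth box j (the numbers are made up).
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.1],
]

# Brute-force the one-to-one assignment with minimal total cost.
best = min(permutations(range(3)),
           key=lambda p: sum(cost[i][p[i]] for i in range(3)))
print(best)  # (1, 0, 2): pred 0 -> GT 1, pred 1 -> GT 0, pred 2 -> GT 2
```

Each prediction ends up matched to exactly one GT box, so no duplicate suppression is needed afterwards.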

The DETR model

โœ… Direct set predictions in detection

  • a set prediction loss that forces unique matching between predicted and GT box
  • an architecture that predicts a set of objects and models their relation

Object detection set prediction loss

✔ DETR's decoder produces a fixed-size set of N predictions, where N must be set larger than the number of objects that can appear in one image. These fixed N predictions are explained in more detail in the decoder section below.

✔ The goal is to find the one-to-one assignment between GT and predicted values whose total matching cost is smallest.

✔ $L_{match}$ is a pair-wise matching cost between a ground truth $y_i$ and the prediction with index $\sigma(i)$: each $c_i$ is the target class label, and $b_i$ defines the GT box's center coordinates and its height and width relative to the image size. Finding a matching this way plays the same role as the heuristic rules that match proposals or anchors in existing detectors.
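Since the equation images did not survive, here is the matching cost and the optimal assignment as defined in the paper (the indicator drops both terms when the GT is the no-object class $\varnothing$):

```latex
L_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})
  = -\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\,L_{\mathrm{box}}\!\left(b_i, \hat{b}_{\sigma(i)}\right)
\qquad
\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min}\; \sum_{i=1}^{N} L_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})
```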

✔ The next step is to compute the Hungarian loss function.

✔ The Hungarian loss sums, over all matched pairs, a class log-probability term and a box loss. The important point is that, to account for class imbalance, the log-probability term is down-weighted by a factor of 10 whenever the matched target is "no object".
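A toy sketch of that class term: the 3-class setup (2 real classes plus "no object" at index 2) and the probabilities are made up for illustration; only the 10× down-weighting of the no-object class mirrors the paper.

```python
import math

def weighted_nll(probs, target, weights):
    # -w_c * log p(c) for the matched target class c
    return -weights[target] * math.log(probs[target])

w = [1.0, 1.0, 0.1]            # "no object" (index 2) weighted by 1/10
p_real = [0.7, 0.2, 0.1]       # a query matched to real class 0
p_noobj = [0.05, 0.05, 0.9]    # a query matched to "no object"

print(round(weighted_nll(p_real, 0, w), 3))   # 0.357
print(round(weighted_nll(p_noobj, 2, w), 3))  # 0.011
```

Without the 0.1 weight, the many no-object queries would dominate the loss.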

Bounding box loss

✔ Within the Hungarian loss, $L_{box}$ does not use the offsets employed by existing detectors; instead it is computed as a combination of an L1 loss and the GIoU (Generalized IoU) loss, where $\lambda_{iou}$ and $\lambda_{L1}$ are hyperparameters. Both losses are normalized by the number of objects inside the batch.
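A minimal pure-Python sketch of GIoU for axis-aligned boxes (the `(x1, y1, x2, y2)` corner format is an assumption for this example; in DETR the box term uses the loss $1 - \mathrm{GIoU}$):

```python
def giou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corners.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box enclosing both; penalizes empty space between them.
    c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c - union) / c

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # negative for disjoint boxes
```

Unlike plain IoU, GIoU still gives a useful gradient when the boxes do not overlap, which matters since DETR predicts boxes directly rather than as offsets from anchors.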

DETR architecture

✔ The DETR architecture is very simple. It has three main components: a CNN backbone, an encoder-decoder transformer, and a simple feed-forward network (FFN). The figure above shows the detailed architecture, which is almost identical to the original transformer.

Backbone

✔ The backbone's final feature map has C = 2048 channels and spatial size H = H0/32, W = W0/32; ResNet-50 is used by default to produce these values.
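For a concrete feel of those sizes, with an input of 800 × 1066 (a size assumed here for illustration):

```python
# ResNet-50 downsamples spatially by a factor of 32, and its last
# stage outputs 2048 channels.
H0, W0 = 800, 1066  # assumed input size for illustration
C, H, W = 2048, H0 // 32, W0 // 32
print((C, H, W))  # (2048, 25, 33)
```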

Transformer encoder

✔ The transformer encoder first maps the channels down to a smaller dimension d, then collapses the spatial dimensions to turn the features into a $d \times HW$ input sequence. Also, since the transformer architecture is permutation-invariant, fixed positional encodings are added to the input of each layer to compensate.
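A numpy sketch of that reshaping step — the random matrix merely stands in for the learned 1×1 convolution that reduces the channels, and the sizes follow the ResNet-50 example:

```python
import numpy as np

C, d, H, W = 2048, 256, 25, 33   # assumed sizes for illustration
feat = np.random.rand(C, H, W)   # backbone feature map

proj = np.random.rand(d, C) * 0.01        # stand-in for the 1x1 conv weights
tokens = proj @ feat.reshape(C, H * W)    # each spatial position -> one token
print(tokens.shape)  # (256, 825)
```

The H·W = 825 columns are the sequence the encoder attends over, after positional encodings are added.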

Transformer decoder

✔ The decoder follows the standard transformer, using multi-head self-attention and encoder-decoder attention to transform N embeddings of size d. The difference from the original architecture is that DETR decodes the N objects in parallel. As mentioned earlier, N must be set to at least the number of objects that can appear in one image: for example, if a COCO image can contain up to 63 objects, N must be at least 63.

✔ Because the decoder is also permutation-invariant, the N input embeddings must be distinct in order to produce distinct results. These input embeddings are learned positional encodings called object queries, and, analogously to the encoder, they are added at each attention layer.
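A tiny numpy experiment illustrating why the queries must differ: in a minimal (single-head, unmasked) self-attention layer, two identical input rows always yield identical outputs, so identical queries could never describe two different objects.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Minimal single-head self-attention over the rows of X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Queries 0 and 1 are identical; query 2 differs.
queries = np.stack([np.ones(d), np.ones(d), rng.normal(size=d)])
out = self_attention(queries, Wq, Wk, Wv)
print(np.allclose(out[0], out[1]))  # True: identical inputs, identical outputs
```

This is why DETR learns N distinct query embeddings rather than using a shared one.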

✔ The N object queries are transformed into output embeddings by the decoder, and these are then decoded independently into box coordinates and class labels by an FFN.

Prediction feed-forward networks(FFNs)

✔ The final prediction head consists of a 3-layer perceptron with ReLU activations and hidden dimension d, plus a linear projection layer. The FFN regresses the normalized center coordinates and the width and height of the box, while the class label is predicted with a softmax; altogether the model predicts the fixed-size set of N bounding boxes.
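A numpy sketch of the two heads. All sizes and the random weights are placeholders (K = 92 follows the COCO setup of 91 classes plus "no object"); only the structure — a 3-layer ReLU MLP for the box and a single linear layer with softmax for the class — mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 256, 92          # hidden dim; classes incl. "no object" (assumed sizes)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = rng.normal(size=d)  # one decoder output embedding

# Box head: 3-layer MLP with ReLU, sigmoid for normalized (cx, cy, w, h).
W1 = rng.normal(size=(d, d)) * 0.01
W2 = rng.normal(size=(d, d)) * 0.01
W3 = rng.normal(size=(4, d)) * 0.01
box = 1.0 / (1.0 + np.exp(-(W3 @ relu(W2 @ relu(W1 @ h)))))

# Class head: single linear projection followed by softmax.
Wc = rng.normal(size=(K, d)) * 0.01
cls = softmax(Wc @ h)

print(box.shape, cls.shape)  # (4,) (92,)
```

The same two heads are applied independently to each of the N output embeddings.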

Experiments

📌 Baseline for comparison: Faster R-CNN
📌 Dataset: COCO minival
📌 Optimizer: AdamW
📌 Backbone: ResNet-50 and ResNet-101, both ImageNet pre-trained; the ResNet-101 model is called DETR-R101
📌 Additional: removing the stride from the conv5 stage and adding dilation to increase the feature resolution (DETR-DC5)
📌 Scale augmentation, random crop augmentation, dropout 0.1

✔ The table compares Faster R-CNN and DETR. DETR-DC5-R101 achieves the highest AP; it is not the best in every column, but it gives good results in most of them.

✔ The image above shows the encoder self-attention maps, which confirm that the encoder already separates the individual objects well.

โœ” ์ด๋Š” ๊ฐ๊ฐ์˜ decoder์—์„œ์˜ prediction slot์„ visualizationํ•œ ๊ทธ๋ž˜ํ”„์ž…๋‹ˆ๋‹ค. ๊ฐ๊ฐ์˜ slot๋“ค์€ ํŠน์ • ๋ฒ”์œ„์— ๋Œ€ํ•ด ๊ตฌ์ฒดํ™” ํ•˜๋Š”๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ… ์ด ์™ธ์—๋„ ๋‹ค์–‘ํ•œ ์‹คํ—˜๊ฒฐ๊ณผ๋ฅผ ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•ด์ฃผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ablation์„ ํ†ตํ•ด ์ œ์‹œ๋œ ๋ฐฉ๋ฒ•๋ก ๋“ค์— ๋Œ€ํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ์™€ decoder, encoder layer์˜ ๋ฐ˜๋ณต ํšŸ์ˆ˜์— ๋”ฐ๋ฅธ ์‹คํ˜ ๊ฒฐ๊ณผ ๋“ฑ์„ ์ œ์‹œํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ถ๊ธˆํ•˜์‹œ๋‹ค๋ฉด ํ•œ ๋ฒˆ์ฏค ์ฝ์–ด๋ณด์‹œ๋Š” ๊ฑธ ์ถ”์ฒœ๋“œ๋ฆฝ๋‹ˆ๋‹ค. ๋˜ํ•œ DETR for panoptic sementation์„ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ ์ด๋Š” ์•„๋ž˜์˜ ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฅธ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

