[Paper Review] RetinaNet: Focal Loss for Dense Object Detection

cha-suyeon · December 14, 2021

📑 Paper title: Focal Loss for Dense Object Detection
📑 Paper download: PDF



Preview

This time I'll review the RetinaNet paper. RetinaNet is an object detection algorithm and a one-stage detector.

You may also want to look at my earlier post on Object Detection.

An object detection model finds the regions of objects in an image, splits the candidates into positive/negative samples according to an IoU (Intersection over Union) threshold, and then trains on them.

📌 Reference: IoU concept
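As a rough illustration of the IoU criterion mentioned above, here is a minimal sketch in Python. The function name `iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for this example; the paper does not prescribe them.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width/height are clamped at 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For two 2×2 boxes shifted by one unit in each direction, the intersection is 1 and the union is 7, so the IoU is 1/7, well below a 0.5 threshold.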

This is where the class imbalance problem arises: within an image, the gap between the number of positive samples (objects) and negative samples (background) is large. Because most of the image is background, most samples are easy negatives, and since easy negatives are overwhelmingly numerous, they dominate training and degrade the model's performance.

Two-stage detectors tackled this problem from two angles. First, they filter out background samples through region proposals, using methods such as selective search, EdgeBoxes, DeepMask, and RPN. Second, they apply sampling heuristics that keep the number of positive/negative samples at a reasonable ratio; hard negative mining and OHEM belong to this category.

One-stage detectors, on the other hand, were poorly suited to handling the class imbalance described above. They drop the region proposal step and perform dense sampling over the whole image, so far more candidate regions are produced and the class imbalance is more severe.

As a result, one-stage detectors were faster than two-stage detectors but less accurate.

So this paper treats class imbalance as the main problem, proposes focal loss as the solution, and even demonstrates better performance than two-stage detectors.


Abstract

์œ„์˜ ๊ทธ๋ž˜ํ”„์—์„œ CECE ๊ฐ€ Cross Entropy ์‹์ด๊ณ  ์•„๋ž˜์˜ FLFL์ด ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•˜๋Š” Focal Loss์ž…๋‹ˆ๋‹ค.

๋‘ ์‹์˜ ์ฐจ์ด๋Š” (1โˆ’p)ฮณ(1-p)ฮณ์ด๋ฉฐ ฮณฮณ๋Š” ๋’ค์˜ Focal Loss์—์„œ ์„ค๋ช…์ด ๋ ๋“ฏํ•ฉ๋‹ˆ๋‹ค.

๋†’์€ ์ •ํ™•๋„๋ฅผ ๋ณด์—ฌ์คฌ๋˜ detector์€ R-CNN ๊ณ„์—ด์˜ tow-stage์˜€๊ณ , one-stage detector๋Š” ๋Œ€๋น„๋˜๊ฒŒ๋„ dence sampling์„ ํ†ตํ•ด ๋น ๋ฅด๊ณ  ๊ฐ„๋‹จํ•˜๊ฒŒ object์˜ locations๋ฅผ ์ฐพ์•„๋ƒˆ์ง€๋งŒ ์„ฑ๋Šฅ์€ two-stage๋ฅผ ๋”ฐ๋ผ๊ฐ€์ง€ ๋ชปํ–ˆ์Šต๋‹ˆ๋‹ค.

๋…ผ๋ฌธ ์ €์ž๋“ค์ด ๋ฐœ๊ฒฌํ•œ ๊ฒƒ์€ foreground์™€ background ์‚ฌ์ด์˜ class imbalance๋ฌธ์ œ์˜€๊ณ , ์ด๊ฒƒ์ด dence detector๊ฐ€ ํ•™์Šตํ•˜๋Š”๋ฐ ๊ฐ€์žฅ ํฐ ์›์ธ์ด์—ˆ์Šต๋‹ˆ๋‹ค.


Main Ideas

To address class imbalance, the authors propose reshaping the standard cross entropy loss so that it down-weights well-classified examples (easy samples).

Giving easy samples a smaller weight keeps them from getting in the way of training.

This method is called Focal Loss. It prevents the vast number of easy negatives from overwhelming the detector during training and focuses on the sparse set of hard examples (objects).

To evaluate the new Focal Loss, the authors designed a simple dense detector, which is RetinaNet.

It was both faster and more accurate than previous detectors, achieving SOTA (at the time).

In this graph, ms is the inference time and AP is the evaluation metric. RetinaNet-101 achieves better accuracy than RetinaNet-50 at a comparable speed.

Comparing the one-stage/two-stage detectors A through G, you can see that RetinaNet is superior in both accuracy and speed.



Focal Loss

Focal Loss is designed for the one-stage setting, to handle the extreme imbalance between foreground and background during training.

It starts from the Cross Entropy ($CE$) loss for binary classification.

Here $y \in \{\pm 1\}$ is the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$.

For notational convenience, we define $p_t$.

$p_t$ is $p$ when $y = 1$ and $1 - p$ otherwise. With this, $CE(p, y)$ can be rewritten as $CE(p_t) = -\log(p_t)$.
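Written out in full, the two definitions described above look like this (a reconstruction from the text, using the same symbols as the rest of this post):

$$
CE(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
\qquad
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
$$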


(1) Balanced Cross Entropy

Here, regardless of $y$, an example with $p_t > 0.5$ is classified with high confidence, so its loss is small. The problem is that background and other easily classified examples, which readily exceed 0.5, are so numerous that their small losses add up.

Summed together, they can overwhelm the valuable contribution of the rare class to the loss.

As an idea for restoring balance to $CE$, the paper first introduces Balanced Cross Entropy.

Balanced Cross Entropy adds a weighting factor $\alpha$.

The idea is to make the cases $y = 1$ and $y = -1$ contribute to the loss differently.

  • when $y = -1$, the loss is weighted by $(1-\alpha)$
  • when $y = 1$, the loss is weighted by $\alpha$
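As a formula, the weighting described in the two bullets can be sketched as follows. The symbol $\alpha_t$ is shorthand introduced here by analogy with $p_t$; it is an assumption of this sketch and does not appear in the text above.

$$
\alpha_t =
\begin{cases}
\alpha & \text{if } y = 1 \\
1 - \alpha & \text{otherwise}
\end{cases}
\qquad
CE(p_t) = -\alpha_t \log(p_t)
$$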

The weighting factor adjusts how much positive and negative samples contribute to the loss.

But it still does not adjust the contribution of easy vs. hard samples, which is the real goal.

That is what the scaling factor solves, and adding the scaling factor gives the focal loss function!


(2) Focal Loss Definition

Easily classified negatives make up most of the loss and dominate the gradient.

The $\alpha$ balance affects the weighting of positives vs. negatives, but it does not distinguish between easy and hard examples.

So $CE$ is multiplied by the factor $(1-p_t)^\gamma$ to modulate the loss further.

$\gamma$ is a tunable parameter.
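Putting the description above into a single expression, multiplying $CE(p_t) = -\log(p_t)$ by the factor $(1-p_t)^\gamma$ gives:

$$
FL(p_t) = -(1 - p_t)^\gamma \log(p_t)
$$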

Explaining the Focal Loss formula

(1) Relationship between $p_t$ and the modulating factor

When an example is misclassified and $p_t$ is small, the factor is close to 1 and the loss is unaffected.

As $p_t$ approaches 1, the factor goes to 0 and the loss of well-classified examples is down-weighted.

(2) Role of the focusing parameter $\gamma$

The parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted.

When $\gamma = 0$, $FL$ is identical to $CE$, and as $\gamma$ increases, the effect of the scaling factor grows.

The modulating factor reduces the contribution of easy examples and extends the range in which an example receives a low loss.

For example, with $\gamma = 2$, an example with $p_t = 0.9$ has a loss 100 times smaller than with CE, and one with $p_t = 0.968$ has a loss about 1000 times smaller.
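The arithmetic behind those two numbers can be checked with a few lines of Python. This is only a sketch of the modulating factor $(1-p_t)^\gamma$; the helper name `modulating_factor` is made up for this example.

```python
def modulating_factor(p_t, gamma=2.0):
    """The factor (1 - p_t)^gamma that focal loss multiplies onto CE."""
    return (1.0 - p_t) ** gamma

for p_t in (0.5, 0.9, 0.968):
    factor = modulating_factor(p_t)
    # 1 / factor is how many times smaller the loss is compared with plain CE.
    print(f"p_t={p_t}: factor={factor:.6f}, loss reduced about {1 / factor:.0f}x")
```

With $\gamma = 2$, $(1 - 0.9)^2 = 0.01$ gives the 100x reduction, and $(1 - 0.968)^2 \approx 0.001024$ gives a reduction of a little under 1000x.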

This in turn raises the importance of correcting misclassified examples.

The paper reports that $\gamma = 2$ works best.

Finally, the $\alpha$ balance from earlier is added to $FL$.

The resulting α-balanced form is the one the paper adopts, since it gives slightly better accuracy than the non-α-balanced form.

The paper also notes that the loss layer implementation combines the sigmoid operation for computing $p$ with the loss computation, which gives greater numerical stability.
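To make the last two points concrete, here is a minimal pure-Python sketch of an α-balanced focal loss that works directly on the raw logit, so the sigmoid and the loss are computed together. It is not the authors' implementation. The function names are hypothetical, and the default `alpha=0.25` is an assumption taken from the setting the paper reports alongside $\gamma = 2$.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid_focal_loss(logit, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one example with label y in {+1, -1}.

    p = sigmoid(logit) is never formed explicitly; log(p_t) and (1 - p_t)
    are derived from softplus so large logits do not overflow.
    """
    # z is the logit in favour of the true class, so p_t = sigmoid(z).
    z = logit if y == 1 else -logit
    log_p_t = -softplus(-z)                 # log(sigmoid(z))
    one_minus_p_t = math.exp(-softplus(z))  # 1 - sigmoid(z) = sigmoid(-z)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (one_minus_p_t ** gamma) * log_p_t

# An easy negative (confidently background) contributes far less
# than a hard negative (confidently wrong).
print(sigmoid_focal_loss(-4.0, y=-1), sigmoid_focal_loss(4.0, y=-1))
```

Setting `gamma=0` and `alpha=0.5` reduces this to half of the ordinary cross entropy, which is a quick sanity check on the formula.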



RetinaNet

RetinaNet consists of one backbone network and two task-specific subnetworks.

The backbone computes a convolutional feature map over the entire input image. It is an off-the-shelf convolutional network.

A feedforward ResNet is used. As the figure shows, an FPN backbone is built on top of ResNet to generate a multi-scale convolutional feature pyramid.

Anchor boxes are generated here.

The first subnet classifies the object for each anchor box on the backbone's output.

The second subnet performs regression from the anchor boxes to the GT (Ground Truth) boxes.


(1) FPN(Feature Pyramid Network Backbone)

RetinaNet uses an FPN (Feature Pyramid Network) as its backbone.

From a single input image, FPN generates a multi-scale feature pyramid.

Each level of the pyramid is used to detect objects at a different scale.

This improves detection of objects across a wide range of scales, from small to large.

Using ResNet alone gave low AP, so FPN is applied on top of ResNet.

The pyramid uses levels P3 to P7, and every pyramid level has 256 channels.
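To get a feel for what P3 to P7 means in practice, the sketch below lists the feature map shape at each level. It assumes that level $P_l$ has a stride of $2^l$ relative to the input and that sizes are rounded up; the 640×640 input size and the `pyramid_shapes` helper are illustrative choices, not values from this post.

```python
import math

def pyramid_shapes(height, width, levels=range(3, 8), channels=256):
    """Map each pyramid level name to its (channels, height, width) shape."""
    shapes = {}
    for level in levels:
        stride = 2 ** level  # assumed: P_l is 2^l times smaller than the input
        shapes[f"P{level}"] = (
            channels,
            math.ceil(height / stride),
            math.ceil(width / stride),
        )
    return shapes

for name, shape in pyramid_shapes(640, 640).items():
    print(name, shape)
```

The fine levels (P3) have many locations for small objects, while the coarse levels (P7) have only a handful of locations for large ones.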


(2) Anchors

  • three aspect ratios {1:2, 1:1, 2:1}
  • IoU threshold of 0.5
    • anchors with IoU in [0, 0.4) are treated as background
    • anchors with IoU in [0.4, 0.5) are ignored during training
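The assignment rule in the list above can be sketched as a small function. The name `assign_anchor` and the string labels are hypothetical; `max_iou` is assumed to be the anchor's highest IoU with any ground-truth box.

```python
def assign_anchor(max_iou, fg_threshold=0.5, bg_threshold=0.4):
    """Return 'object', 'background', or 'ignore' for a single anchor."""
    if max_iou >= fg_threshold:
        return "object"        # assigned to a ground-truth box
    if max_iou < bg_threshold:
        return "background"    # IoU in [0, 0.4)
    return "ignore"            # IoU in [0.4, 0.5): excluded from the loss

for value in (0.1, 0.45, 0.7):
    print(value, assign_anchor(value))
```

The gap between the two thresholds keeps ambiguous anchors from being pushed toward either label.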

(3) Classification Subnet

  • Predicts the probability that an object is present in each anchor box.
  • The subnet is a small FCN (Fully Convolutional Network) attached to each FPN level.
  • The subnet's parameters are shared across all pyramid levels.
  • $3 \times 3$ conv layers, ReLU activations, sigmoid activations

Compared with RPN, the object classification subnet is deeper, uses only $3 \times 3$ conv layers, and does not share parameters with the box regression subnet.


(4) Box Regression Subnet

  • Like the classification subnet, a small FCN is attached to each FPN level.
  • For each anchor box, it regresses 4 offsets (center x, center y, width, height) toward the nearby GT box.
  • Uses a class-agnostic bounding box regressor
    • It uses fewer parameters and still performs well.
    • The regressor regresses the anchor boxes without using class information.
  • The classification subnet and box regression subnet have the same structure but use separate parameters. In other words, no parameters are shared between them.
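One way to see the difference between the two heads is to count the outputs each produces per spatial location. The sketch below assumes 80 classes (COCO) and 9 anchors per location; both numbers and the helper name are assumptions for illustration and are not stated in this post.

```python
def head_output_channels(num_classes, num_anchors):
    """Output channels per spatial location for the two subnets."""
    return {
        # One sigmoid score per class for every anchor.
        "classification": num_classes * num_anchors,
        # Four class-agnostic offsets for every anchor.
        "box_regression": 4 * num_anchors,
    }

# Assumed values: 80 classes and 9 anchors per location.
print(head_output_channels(num_classes=80, num_anchors=9))
```

Because the regressor is class-agnostic, its output size does not grow with the number of classes, which is where the parameter saving comes from.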


Inference and Training

To speed up inference, RetinaNet only uses the 1,000 highest-scoring box predictions from each FPN level.

As with other detectors, the final detections are obtained by applying NMS (non-maximum suppression) with a threshold of 0.5.
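The two inference steps above (top-k per level, then NMS at 0.5) can be sketched in plain Python. This is a simplified illustration: it treats all boxes as one class, represents detections as `(score, (x1, y1, x2, y2))` pairs, and uses hypothetical function names, none of which come from the paper.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def top_k(detections, k=1000):
    """Keep the k highest-scoring (score, box) pairs from one pyramid level."""
    return sorted(detections, key=lambda d: d[0], reverse=True)[:k]

def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep a box only if it does not overlap a better kept box."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(box_iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

def postprocess(per_level_detections, k=1000, iou_threshold=0.5):
    """Merge the top-k detections of every level, then apply NMS once."""
    merged = [d for level in per_level_detections for d in top_k(level, k)]
    return nms(merged, iou_threshold)

level_a = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11))]
level_b = [(0.7, (50, 50, 60, 60))]
print(postprocess([level_a, level_b]))
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box is kept.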

After training RetinaNet on the COCO dataset, AP was measured with different loss functions.

  • $CE$ loss: 30.2%
  • Balanced $CE$: 31.1%
  • $FL$: 34%

In addition, compared against OHEM configured as in SSD, with a 1:3 positive/negative ratio and an NMS threshold of 0.5, focal loss achieved an AP 3.2 points higher.

This shows that focal loss handles the class imbalance problem more effectively than the existing approaches.



Comparison

Compared with existing two-stage and one-stage models, RetinaNet's performance is clearly superior.



Conclusion

To solve class imbalance, the main obstacle for one-stage detectors, the paper proposes Focal Loss and demonstrates its effectiveness.

The paper's approach is remarkably simple yet effective.

The authors designed a fully convolutional one-stage detector to prove the effect, and achieved SOTA!

If you'd like to see what Facebook built, head over to the GitHub repository.

Reference

📌 Paper review reference 1: RetinaNet paper (Focal Loss for Dense Object Detection) review
📌 Paper review reference 2: Focal Loss for Dense Object Detection review
📌 Paper: Feature Pyramid Networks for Object Detection
