Paper Review: You Only Look Once: Unified, Real-Time Object Detection

์ด์†Œ์€ยท2022๋…„ 3์›” 22์ผ

0. Background

Source: 📺 동빈나 YouTube channel
(It explains the R-CNN family really, really well!)

R-CNN has the following drawbacks:
1. Selective search runs on the CPU, which takes a long time.
2. The SVM and regressor modules are separate from the CNN, so the architecture as a whole cannot be trained end-to-end.
3. Every RoI has to be passed through the CNN, which requires a large number of CNN computations.

Fast R-CNN was introduced to address these drawbacks.
It made end-to-end training possible,
but region proposal still runs on the CPU, so it is still slow.

๋”ฐ๋ผ์„œ Faster R-CNN์—์„œ๋Š”
RPN(Region Proposal Network)๋ฅผ ์ œ์•ˆํ•˜์—ฌ feature map์„ ๋ณด๊ณ  ์–ด๋А ๊ณณ์— ๋ฌผ์ฒด๊ฐ€ ์žˆ์„ ๋ฒ•ํ•œ์ง€ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋„๋ก
๋งŒ๋“ค์—ˆ๋‹ค.




1. Key point

Three key points of YOLO

  1. Fast: 2-stage detector --> 1-stage detector
    It makes real-time object detection possible (45 frames per second).
    (30 frames per second or more is considered fast.)
  2. Fewer false positives on the background
  3. Good generalization: a CNN extracts the object features




2. YOLO

I split the YOLO pipeline into three steps.

Step 1. Divide the input image into an S x S grid.
Step 2. For each grid cell, predict B bounding boxes, each with coordinates (x, y, w, h) and a confidence score.

(Computing the bounding box score)

The paper sets S to 7, giving a 7 x 7 grid. Each grid cell then predicts the coordinates (x, y, w, h) and a confidence score for each of its bounding boxes. The confidence score is the product of whether an object is present (1/0) and the IoU between the predicted box and the ground truth. As the figure shows, if the grid cell contains no object the score is 0; if it does, 1 is multiplied by the IoU, so the score equals the IoU.
(The paper describes B bounding boxes for training and 2 bounding boxes at test time, but B = 2 appears to be used during training as well.)
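The confidence score described above can be sketched in a few lines. This is only an illustration: the function names are made up for this post, and the boxes are assumed to be given as corner coordinates (x_min, y_min, x_max, y_max) rather than YOLO's (x, y, w, h) center format, to keep the IoU arithmetic short.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at 0 so boxes that do not overlap give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, truth_box):
    """Object presence (1/0) times IoU: 0 for an empty cell, the IoU otherwise."""
    if not has_object:
        return 0.0
    return iou(pred_box, truth_box)
```

For example, two 2x2 boxes offset by 1 in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.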


Step 3. Compute the class probabilities.

The conditional class probability is multiplied by the box confidence score to get the final class confidence score.
(The paper says the conditional class probability and the predicted box confidence are multiplied at test time, and I was not sure whether the same happens during training. Judging from the loss function, the two are penalized as separate terms during training, so the product seems to be a test-time step.)
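Written out, the product in Step 3 is the relation the paper gives for the class-specific confidence score of a box. The notation below follows the paper as best I can reproduce it, so treat it as a reference sketch rather than a quotation:

```latex
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
  = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
```

The left-hand factors are the conditional class probability of the grid cell and the box confidence score; the result says both how likely the class is to appear in the box and how well the box fits the object.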




3. Network

The network follows a GoogLeNet-inspired design: the first 20 convolutional layers are pretrained on ImageNet data, and then 4 convolutional layers and 2 fully connected layers are added with randomly initialized weights and trained for detection. Instead of GoogLeNet's inception modules, it uses 1x1 reduction layers followed by 3x3 conv layers.

The pretrained network was originally used for image classification, but because it can extract spatial information from the input image, it can reportedly be used for object detection as well.

After passing through the trained network, the output tensor has shape 7x7x30.

<< Settings >>
Formula: S x S x (B * 5 + C)
- S (: grid size, giving S x S grid cells) = 7
- B (: number of bounding boxes per grid cell) = 2
- C (: number of classes) = 20

Looking at the S x S (7x7) grid cells, each grid cell has B (B=2) bounding boxes, and each bounding box yields its coordinates (x, y, w, h) and a confidence score, 5 values in total. The paper uses the PASCAL VOC dataset, so the number of classes C is 20. Plugging these into the formula gives 7 x 7 x (2 * 5 + 20) = 7x7x30.
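The shape arithmetic can be checked with a tiny helper. The function name is hypothetical; the defaults are the S, B and C values listed above.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Each of the S x S cells predicts B boxes * 5 values plus C class probabilities."""
    return (S, S, B * 5 + C)


print(yolo_output_shape())  # (7, 7, 30)
```

Changing any one of the three settings changes only the corresponding factor, which makes it easy to see what a different grid size or dataset would do to the output.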




4. Unified detection

(bounding box)
The paper takes a 448x448 image and produces a 7x7x30 output tensor, so each of the 7x7 grid cells holds a 30-dimensional vector. Dimensions 1 - 5 hold the first of the two bounding boxes (x, y, w, h, confidence score), and dimensions 6 - 10 hold the second. Dimensions 11 - 30 hold the probabilities of the 20 classes for the object in that grid cell; they are shared by both bounding boxes rather than tied to the first one (20 because the PASCAL VOC dataset has 20 classes).
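As a rough sketch of how one cell's 30 values break down, the hypothetical helper below slices a vector using the ordering described in this post (boxes first, then class probabilities). Actual implementations may order the values differently, so the layout is an assumption.

```python
def split_cell(cell, B=2, C=20):
    """Split one cell's (B*5 + C)-vector into B boxes and C class probabilities."""
    assert len(cell) == B * 5 + C
    # Each box is (x, y, w, h, confidence).
    boxes = [tuple(cell[b * 5:(b + 1) * 5]) for b in range(B)]
    # The remaining C values are the class probabilities for the whole cell.
    class_probs = list(cell[B * 5:])
    return boxes, class_probs
```

Calling `split_cell(list(range(30)))` returns two 5-tuples for the boxes and a list of 20 class values.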

(class confidence score)
Each bounding box's confidence score is multiplied by the class probabilities to get the class-specific confidence scores for that box. Since the paper uses 2 bounding boxes per cell of the 7x7 grid, this gives information on 7x7x2=98 bounding boxes.
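A minimal sketch of this step in plain Python, assuming the prediction is a nested list of shape 7x7x30 with each cell laid out as two (x, y, w, h, confidence) boxes followed by 20 class probabilities. Both the function name and the layout are assumptions for illustration.

```python
def class_scores(pred, B=2, C=20):
    """Return one (row, col, box_index, scores) entry per predicted box.

    scores is a list of C values: box confidence * class probability.
    """
    out = []
    for r, row in enumerate(pred):
        for c, cell in enumerate(row):
            probs = cell[B * 5:B * 5 + C]  # class probabilities of the cell
            for b in range(B):
                conf = cell[b * 5 + 4]     # confidence of box b
                out.append((r, c, b, [conf * p for p in probs]))
    return out
```

With S = 7 and B = 2 the returned list has 98 entries, each carrying 20 class-specific scores.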

(NMS (Non-max suppression))
NMS is run on the 98 boxes from the step above, which yields the final predictions for the detected objects.

💡Non-maximum suppression (NMS)
: Keep the bounding box with the highest confidence score and suppress the rest.
  (Duplicate boxes whose IoU with the kept box is at or above a threshold are removed.)
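A greedy NMS can be sketched as follows. The names, the corner box format (x_min, y_min, x_max, y_max) and the 0.5 threshold are assumptions made for this example, not values taken from the paper; in practice NMS is usually run separately for each class.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For instance, with boxes `[(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]` and scores `[0.9, 0.8, 0.7]`, the second box overlaps the first with IoU 81/119 and is suppressed, so the kept indices are `[0, 2]`.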




5. Loss Function

The full loss is shown in the figure above. Based on what I learned from searching around, the loss function can be split into three parts.
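For reference, here is the loss written out in LaTeX as I understand it from the paper (worth checking against the original before relying on it). The first two lines are the localization loss, the next two the confidence loss, and the last one the classification loss; the paper sets \lambda_{\text{coord}} = 5 and \lambda_{\text{noobj}} = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

Here \mathbb{1}_{ij}^{\text{obj}} is 1 when the j-th box in cell i is responsible for an object, \mathbb{1}_{ij}^{\text{noobj}} is its complement, and \mathbb{1}_{i}^{\text{obj}} is 1 when an object appears in cell i.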


  • Localization loss

    The top two lines are the loss on the bounding box coordinates (x, y) and size (w, h).
In the figure:
- Green circle: 1 when the j-th bounding box in the i-th grid cell is "responsible for" (assigned to) predicting the object, 0 otherwise.
- Blue circle: most grid cells contain no object, and pushing their confidence scores toward 0 can overpower the gradient from the cells that do contain objects. To prevent this, the coordinate loss is given a larger weight (set to 5 in the paper).

A small error in a large bounding box matters less than the same error in a small bounding box, so the square root of w and h is taken to reflect this.
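The effect of the square root can be seen with a quick numeric check. The widths below are made-up values (relative to the image size) chosen so that the large and the small box have the same absolute error of 0.05.

```python
import math


def size_error(w_true, w_pred, use_sqrt):
    """Squared error on a box width, optionally taken on the square roots."""
    if use_sqrt:
        return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    return (w_true - w_pred) ** 2


# Same absolute error on a large box and on a small box.
large_plain = size_error(0.80, 0.85, use_sqrt=False)
small_plain = size_error(0.10, 0.15, use_sqrt=False)
large_sqrt = size_error(0.80, 0.85, use_sqrt=True)
small_sqrt = size_error(0.10, 0.15, use_sqrt=True)

print(large_plain, small_plain)  # essentially equal without the square root
print(large_sqrt, small_sqrt)    # the small box is penalized more with it
```

Without the square root both boxes get the same penalty; with it, the error on the small box comes out several times larger than the error on the large box.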


  • Confidence loss

    The middle two lines are the loss on the bounding box confidence score. C is the target confidence score (object presence, 1 or 0, multiplied by the IoU, as defined earlier), and C^ is the predicted confidence score of the bounding box.
In the figure:
- Green circle: grid cells with no object are multiplied by a weight of 0.5 so that they contribute less to the loss.

  • Classification loss

    p is the ground-truth class probability and p^ is the predicted class probability.
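Putting the three parts together, the loss for a single grid cell can be sketched as below. This is a simplified illustration, not the paper's training code: the function name and arguments are hypothetical, the responsible box is passed in by index instead of being chosen by IoU, and predicted w and h are assumed to be non-negative so the square root is defined.

```python
import math

LAMBDA_COORD = 5.0  # weight on the coordinate loss (value from the paper)
LAMBDA_NOOBJ = 0.5  # weight on the confidence loss of boxes with no object


def cell_loss(pred_boxes, pred_probs, truth_box=None, truth_probs=None,
              responsible=None, target_conf=1.0):
    """Sum-squared loss for one grid cell.

    pred_boxes:  list of (x, y, w, h, conf) predictions for the cell.
    truth_box:   (x, y, w, h) of the object in the cell, or None if empty.
    responsible: index of the predicted box assigned to the object.
    target_conf: target confidence for the responsible box (e.g. its IoU).
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if truth_box is not None and j == responsible:
            tx, ty, tw, th = truth_box
            # Localization loss: center coordinates, then square-rooted size.
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            # Confidence loss for the responsible box.
            loss += (conf - target_conf) ** 2
        else:
            # Confidence loss for a box with no object: target is 0, down-weighted.
            loss += LAMBDA_NOOBJ * conf ** 2
    if truth_box is not None:
        # Classification loss, only for cells that contain an object.
        loss += sum((p - t) ** 2 for p, t in zip(pred_probs, truth_probs))
    return loss
```

For an empty cell only the down-weighted confidence term remains; for a cell with an object, a perfect prediction from the responsible box (and zero confidence from the other box) gives a loss of 0.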




6. Experiment

  • Comparison to Other Real-Time Systems

  • Error Analysis

  • Generalizability: Person Detection in Artwork




7. Result

  1. YOLO has a simple structure.
  2. It is trained on the full image.
  3. It is much faster than previous models.
  4. It achieves fairly decent detection performance.




8. Limitation of YOLO

  1. Each grid cell can predict only one class.
  2. Accurate prediction is not possible when objects overlap.
    (e.g. small objects that appear in groups, such as a flock of birds)
  3. Bounding box shapes are learned only from the training data,
    so boxes with new or unusual aspect ratios are not predicted accurately.
  4. Errors in small and large bounding boxes are treated the same way,
    even though an error in a small bounding box affects IoU much more (localization is somewhat inaccurate).
