[CV] You Only Look Once: Unified, Real-Time Object Detection (YOLO v1) review

강동연 · January 15, 2022

[Paper review]

🎈 This review was written with reference to the YOLO v1 paper and other reviews of it.

Key Words

🎈 Extremely Fast
🎈 one-stage model
🎈 Grid cell
🎈 DarkNet
🎈 Responsible

Introduction

✔ Earlier object detection models such as R-CNN and DPM run detection as a multi-stage pipeline (region proposals followed by classification in R-CNN, sliding-window classifiers in DPM), and the bottlenecks between stages make them slow. YOLO, proposed in this paper, is a one-stage detector that performs localization and classification with a single network. As a result it is extremely fast: the base YOLO network runs at 45 FPS, and the Fast YOLO network runs at more than 150 FPS.

✔ Unlike sliding-window and region-proposal methods, YOLO reasons about the whole image when making predictions. YOLO also generalizes well when applied to new domains and new images.

Unified Detection

✔ YOLO uses features from the entire image to predict bounding boxes with confidence scores and class probabilities simultaneously.

✔ The input image is divided into an S x S grid. You can think of it as splitting the image into S evenly spaced rows and columns, like a chessboard. I think the grid is one of the most important ideas in YOLO. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.
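As a minimal sketch of the "responsible" rule, the cell index can be found by scaling the object's center into grid units and truncating. The function name `responsible_cell` and the pixel-coordinate inputs are assumptions made for illustration; they do not come from the paper.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the point (cx, cy).

    cx, cy are the object's center in pixels; img_w, img_h are the image size.
    (Hypothetical helper for illustration.)
    """
    # Scale the center into grid units, then truncate to get the cell index.
    # min(...) keeps a center lying exactly on the right/bottom edge in the last cell.
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col


# An object centered at (224, 100) in a 448x448 image.
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

Only this one cell is trained to detect the object, even if the object's box spans many cells.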

✔ Each grid cell predicts B bounding boxes and a confidence score for each box. The confidence score reflects whether the box contains an object and how accurate the predicted box is. It is defined as confidence = Pr(Object) * IoU_{pred}^{truth}. If no object exists in the grid cell, Pr(Object) is 0, so the confidence should be 0.
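The confidence definition can be sketched in a few lines of Python. The corner-format boxes `(x_min, y_min, x_max, y_max)` and the helper names `iou` and `confidence_target` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, true_box):
    """confidence = Pr(Object) * IoU, with Pr(Object) taken as 1 or 0."""
    return (1.0 if has_object else 0.0) * iou(pred_box, true_box)


print(confidence_target(True, (0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7, about 0.143
print(confidence_target(False, (0, 0, 2, 2), (1, 1, 3, 3)))  # 0.0
```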

✔ Each bounding box consists of 5 predictions (x, y, w, h + confidence). (x, y) is the center of the box relative to the bounds of the grid cell. (w, h), the width and height, are relative to the whole image.
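To make the two coordinate conventions concrete, here is a small sketch that turns a pixel-space box into the (x, y, w, h) values described above. The function name `encode_box` and the pixel-space center/size input format are assumptions for this example.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a pixel-space box (center cx, cy and size w, h) as (x, y, w, h).

    x, y: offset of the center inside its grid cell, in [0, 1).
    w, h: width and height as a fraction of the whole image.
    (Hypothetical helper for illustration.)
    """
    # Center expressed in grid units, e.g. 3.5 means halfway through column 3.
    gx = cx * S / img_w
    gy = cy * S / img_h
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Subtracting the cell index leaves the offset within the cell.
    return (gx - col, gy - row, w / img_w, h / img_h)


# A 112x224 box centered at (224, 100) in a 448x448 image.
print(encode_box(224, 100, 112, 224, 448, 448))  # (0.5, 0.5625, 0.25, 0.5)
```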

✔ Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object): the probability of class i given that the cell contains an object. One thing to note is that YOLO predicts only one set of class probabilities per grid cell, regardless of the number of boxes B.

✔ YOLO is evaluated on PASCAL VOC with S=7 and B=2. PASCAL VOC has 20 classes, so the output is a 7x7x(2*5+20) = 7x7x30 tensor.
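The arithmetic behind the output size can be checked directly. The sketch below also splits one cell's 30 values into boxes and class probabilities. The ordering (all boxes first, then classes) is an assumption for illustration, as is the helper name `split_cell_vector`.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, PASCAL VOC classes


def split_cell_vector(vec, B=2, C=20):
    """Split one cell's output into B boxes of 5 values and C class probabilities.

    Assumes the layout [box_1 (5), ..., box_B (5), classes (C)]; this ordering
    is a hypothetical choice for the example.
    """
    if len(vec) != B * 5 + C:
        raise ValueError("unexpected vector length")
    boxes = [tuple(vec[k * 5:(k + 1) * 5]) for k in range(B)]
    class_probs = list(vec[B * 5:])
    return boxes, class_probs


depth = B * 5 + C        # values per grid cell
total = S * S * depth    # values in the whole output tensor
print(depth, total)      # 30 1470
```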

Network Design

✔ As mentioned above, detection is evaluated on the PASCAL VOC dataset.

✔ DarkNet consists of 24 Conv layers followed by 2 FC layers. Instead of the inception modules used by GoogLeNet, it simply places 1x1 reduction layers before 3x3 Conv layers.
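A rough weight count shows why a 1x1 reduction layer helps. The channel sizes below (512 in, 256 after reduction, 512 out) are illustrative assumptions, and biases are ignored.

```python
def conv_params(in_ch, out_ch, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k


# A 3x3 convolution applied directly to 512 input channels.
direct = conv_params(512, 512, 3)

# A 1x1 reduction to 256 channels, followed by the 3x3 convolution.
reduced = conv_params(512, 256, 1) + conv_params(256, 512, 3)

print(direct, reduced)  # 2359296 1310720
```

Under these assumed sizes, the reduced block needs a little over half the weights of the direct 3x3 convolution.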

Training

✔ The network is pretrained on the ImageNet 1000-class dataset. Only the first 20 Conv layers are used for pretraining.

✔ For detection, 4 Conv layers and 2 FC layers are added on top of the pretrained model. The added layers can be seen in the network architecture figure.

✔ Detection often requires fine-grained visual information, so the input resolution is increased from 224x224 to 448x448. The final layer predicts both class probabilities and bounding box coordinates, which is why the output size is 7x7x30 (30 = 2*5+20).

✔ The final layer uses a linear activation, and all other layers use a leaky rectified linear activation.
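For reference, the leaky rectified linear activation can be written as a one-line function. The slope of 0.1 for negative inputs is the value given in the YOLO v1 paper; this review does not state it, so treat the default below as taken from there.

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU: pass positive inputs through, scale the rest by `slope`."""
    return x if x > 0 else slope * x


print(leaky_relu(2.0))   # 2.0
print(leaky_relu(-3.0))  # about -0.3
```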

Loss function

✔ YOLO uses the sum-squared error (SSE), which is commonly used in regression. The loss function can be broadly divided into three parts.
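For reference, the five-line loss from the YOLO v1 paper can be written as follows, using this review's notation (hats mark predicted values). The line numbers mentioned in the breakdown refer to the five lines of this expression.

```latex
\begin{aligned}
L = {} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj}
         \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
     & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj}
         \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
     & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj} (C_i - \hat{C}_i)^2 \\
     & + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{noobj} (C_i - \hat{C}_i)^2 \\
     & + \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```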

✔ The first two lines are the localization loss.

  • \lambda_{coord}: Most grid cells do not contain an object, so their confidence scores are pushed toward 0. This can overpower the gradient from cells that do contain objects and make the model unstable. \lambda_{coord} is therefore used to give more weight to the box coordinate loss of cells that contain an object. (\lambda_{coord} = 5)
  • S^2: the number of grid cells (S = 7)
  • B: the number of bounding boxes per grid cell (B = 2)
  • 1_{i,j}^{obj}: 1 if the j-th bounding box of the i-th grid cell is assigned to predict the object, 0 otherwise. Each grid cell predicts B bounding boxes, but only the one with the highest IoU with the ground truth is used.
  • x_i, y_i, w_i, h_i: the x, y, w, h values of the ground-truth box.
  • \hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i: the x, y coordinates, width, and height of the predicted bounding box.

✔ Lines 3 and 4 are the confidence loss.

  • \lambda_{noobj}: the weight for grid cells that do not contain an object. (\lambda_{noobj} = 0.5)
  • 1_{i,j}^{noobj}: 1 if the j-th bounding box of the i-th grid cell is not assigned to predict an object, 0 otherwise.
  • C_i: 1 if an object is present, 0 otherwise
  • \hat{C}_i: the confidence score of the predicted bounding box

✔ The last line is the classification loss.

  • p_i(c): the true class probability
  • \hat{p}_i(c): the predicted class probability
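The three parts described above (localization, confidence, classification) can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: the per-cell dictionary layout and the name `yolo_v1_loss` are hypothetical, the "responsible" flag is assumed to be precomputed, and predicted widths and heights are assumed to be non-negative so the square roots are defined.

```python
import math

LAMBDA_COORD = 5.0   # weight for box coordinate errors
LAMBDA_NOOBJ = 0.5   # weight for confidence errors of boxes without an object


def yolo_v1_loss(cells):
    """Sum-squared-error loss over a list of grid cells.

    Each cell is a dict (hypothetical layout):
      "has_object": True if an object's center falls in the cell
      "pred_cls", "true_cls": lists of C class probabilities
      "boxes": list of B dicts with
          "pred": (x, y, w, h, conf) predicted values
          "true": (x, y, w, h, conf) target values
          "responsible": True if this box is assigned to the object
    """
    loc = conf_obj = conf_noobj = cls = 0.0
    for cell in cells:
        for box in cell["boxes"]:
            px, py, pw, ph, pc = box["pred"]
            tx, ty, tw, th, tc = box["true"]
            if box["responsible"]:
                # Lines 1-2: center error, and size error on square roots.
                loc += (tx - px) ** 2 + (ty - py) ** 2
                loc += (math.sqrt(tw) - math.sqrt(pw)) ** 2
                loc += (math.sqrt(th) - math.sqrt(ph)) ** 2
                # Line 3: confidence error for the responsible box.
                conf_obj += (tc - pc) ** 2
            else:
                # Line 4: confidence error for boxes not assigned to an object.
                conf_noobj += (tc - pc) ** 2
        if cell["has_object"]:
            # Line 5: class probability error, once per cell.
            cls += sum((t - p) ** 2
                       for t, p in zip(cell["true_cls"], cell["pred_cls"]))
    return LAMBDA_COORD * loc + conf_obj + LAMBDA_NOOBJ * conf_noobj + cls
```

With \lambda_{noobj} = 0.5, a confidence error on an unassigned box counts half as much as the same error on the responsible box, which is the rebalancing the bullets above describe.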

Inference

✔ Predicting detections for a test image requires only a single network evaluation.

✔ The paper tests on PASCAL VOC, where the network predicts 98 bounding boxes per image along with class probabilities.
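The figure of 98 follows from S x S x B = 7 x 7 x 2. At test time the paper multiplies each box's confidence by the cell's conditional class probabilities to get class-specific scores; a minimal sketch follows, with the helper name `class_scores` chosen for illustration.

```python
S, B = 7, 2
num_boxes = S * S * B  # boxes predicted per image
print(num_boxes)       # 98


def class_scores(box_conf, class_probs):
    """Class-specific scores for one box: Pr(Class_i | Object) * confidence."""
    return [box_conf * p for p in class_probs]


# A box with confidence 0.5 in a cell whose class probabilities are [0.2, 0.8].
print(class_scores(0.5, [0.2, 0.8]))  # [0.1, 0.4]
```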

✔ NMS (non-maximal suppression) is applied to the 98 predictions to remove duplicate detections of the same object, leaving a single result per object. According to the paper, NMS adds 2-3% mAP.
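A minimal greedy NMS sketch is shown below. The IoU threshold of 0.5, the corner-format boxes, and the `(score, box)` tuple layout are assumptions for this example. In practice NMS is usually run separately for each class.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) tuples; returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring detection still in play
        kept.append(best)
        # Drop every detection that overlaps the kept one too strongly.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


dets = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily (IoU = 81/119)
    (0.7, (50, 50, 60, 60)),  # a separate object
]
print(nms(dets))  # keeps the 0.9 and 0.7 detections
```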

Limitations of YOLO

✔ Each grid cell in YOLO predicts only 2 boxes and can have only one class. This spatial constraint makes it hard to detect small objects that appear in groups, such as flocks of birds.

✔ The loss function also treats errors in small and large bounding boxes the same way. For a small bounding box, a slight shift can have a large effect on IoU. This is the biggest cause of inaccurate localization.
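A quick worked example shows why the same shift hurts a small box more. For a square box of side s shifted horizontally by d (with 0 <= d < s), the overlap is (s - d) * s and the union is s * (s + d), so IoU = (s - d) / (s + d). The helper name `shifted_iou` and the box sizes are illustrative assumptions.

```python
def shifted_iou(side, shift):
    """IoU between a square box and a copy of it shifted horizontally by `shift`.

    Assumes shift >= 0.
    """
    if shift >= side:
        return 0.0  # the two boxes no longer overlap
    inter = (side - shift) * side
    union = 2 * side * side - inter
    return inter / union


small = shifted_iou(10, 2)    # 10x10 box shifted by 2 pixels
large = shifted_iou(100, 2)   # 100x100 box shifted by 2 pixels
print(round(small, 3), round(large, 3))  # 0.667 0.961
```

The same 2-pixel error costs the small box about a third of its IoU but barely affects the large one.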

