[CV] YOLOv3: An Incremental Improvement review

강동연 · February 3, 2022
[Paper review]

🎈 This review was written with reference to the YOLOv3 paper and an existing review of it.

Key Words

🎈 Multilabel classification (no softmax)
🎈 Darknet-53 (skip connections and upsampling)
🎈 More bounding boxes

Introduction

✔ "nothing like super interesting, just a bunch of small changes that make it better." In other words, YOLOv3 is a modest improvement over the earlier YOLO versions rather than a major change.

✔ The paper lays out its discussion of YOLOv3 as follows.

1. We'll tell you what the deal is with YOLOv3.
2. Tell you about some things we tried that didn't work.
3. We'll contemplate what this all means.

The Deal

✔ YOLOv3 combines a new network with good ideas borrowed from other work.

Bounding Box Prediction

✔ Bounding box prediction is carried over unchanged from YOLO9000 (v2).

✔ YOLOv3 predicts 4 coordinates per box, $t_x, t_y, t_w, t_h$ (again the same as YOLO9000), and uses a sum of squared error loss for them.

✔ An objectness score is computed for each bounding box using logistic regression. The target confidence is 1 for the bounding box prior that overlaps a GT object more than any other prior.

✔ YOLOv3 assigns only one bounding box prior to each GT object.
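The review does not spell out how the four raw outputs become a box, so here is a minimal sketch of the decoding step as defined in the YOLO9000 paper, which YOLOv3 reuses: $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$. The function names and the example cell and prior values are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw outputs (t_x, t_y, t_w, t_h) into a box in grid-cell units.

    cell  = (c_x, c_y): top-left corner of the grid cell making the prediction
    prior = (p_w, p_h): width and height of the anchor box (bounding box prior)
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x   # the center cannot leave the predicting cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # the prior is rescaled and stays positive
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Hypothetical example: all-zero outputs give the cell center and the unchanged prior.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.625, 2.8125)))
# -> (6.5, 6.5, 3.625, 2.8125)
```

The sigmoid on $t_x, t_y$ is what keeps each predicted center inside its own grid cell.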

Class Prediction

✔ Unlike earlier YOLO versions, YOLOv3 uses multilabel classification. Instead of a softmax, it uses independent logistic classifiers, one per class, and trains the class predictions with binary cross-entropy loss.

✔ The stated reason for dropping softmax is that a hierarchical dataset has overlapping labels (the paper's example is Woman and Person in the Open Images Dataset), so the model must be able to predict more than one correct class for the same box.
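The difference between the two choices is easy to see on a toy example. The sketch below is not the authors' code; the class names and logits are hypothetical. It compares independent sigmoids with a softmax on a box whose true labels overlap, and computes the binary cross-entropy over the independent outputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bce_loss(logits, targets):
    """Sum of per-class binary cross-entropy terms over independent sigmoids."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss


# Hypothetical logits for the classes ["person", "woman", "car"].
logits = [4.0, 3.5, -5.0]
targets = [1, 1, 0]  # overlapping labels: "person" and "woman" are both correct

independent = [sigmoid(z) for z in logits]  # each class is scored on its own
competing = softmax(logits)                 # classes share one unit of probability

print([round(p, 3) for p in independent])
print([round(p, 3) for p in competing])
print(round(bce_loss(logits, targets), 4))
```

With sigmoids both "person" and "woman" can be close to 1 at once, while softmax forces them to split the probability mass between them.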

Predictions Across Scales

✔ YOLOv3 predicts boxes at 3 different scales, with 3 anchor boxes per scale (9 anchor boxes in total).

✔ As in YOLO9000, the anchor boxes are chosen by k-means clustering. On the COCO dataset the 9 anchors are (10 x 13), (16 x 30), (33 x 23), (30 x 61), (62 x 45), (59 x 119), (116 x 90), (156 x 198), (373 x 326), and each scale uses 3 of them. As a result, the prediction on COCO at each scale is an N x N x [3 * (4 + 1 + 80)] tensor, where 4 is the bbox offsets, 1 is the objectness prediction, and 80 is the class predictions.

✔ YOLOv3 predicts far more bounding boxes than the earlier versions: over 100 times as many as YOLOv1 and roughly 12 times as many as YOLOv2.

📌 YOLOv1: 98 boxes (7x7 grid cells, 2 boxes per cell @448x448)
📌 YOLOv2: 845 boxes (13x13 grid cells, 5 anchor boxes)
📌 YOLOv3: 10,647 boxes (@416x416)
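The box counts above follow directly from the grid sizes. The sketch below reproduces them, assuming a 416x416 input and output strides of 32, 16, and 8; the function name is made up for illustration.

```python
def yolo_v3_outputs(input_size=416, strides=(32, 16, 8),
                    anchors_per_scale=3, num_classes=80):
    """Return the output tensor shape per scale and the total number of boxes."""
    depth = anchors_per_scale * (4 + 1 + num_classes)  # offsets + objectness + classes
    shapes = []
    total_boxes = 0
    for stride in strides:
        n = input_size // stride            # N x N grid at this scale
        shapes.append((n, n, depth))
        total_boxes += n * n * anchors_per_scale
    return shapes, total_boxes


shapes, total = yolo_v3_outputs()
print(shapes)  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
print(total)   # 10647

# Earlier versions, for comparison.
print(7 * 7 * 2)    # YOLOv1: 98
print(13 * 13 * 5)  # YOLOv2: 845
```

Most of the 10,647 boxes come from the finest 52x52 grid (8,112 of them), which is also the scale aimed at small objects.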

Feature Extractor

✔ In place of Darknet-19, YOLOv3 uses a new network, Darknet-53. It adds residual shortcut connections and is larger than the previous version's network. It is also more efficient than ResNet-101 or ResNet-152.

✔ The table above shows ImageNet results. Darknet-53's accuracy is very close to that of ResNet-152, and notably it reaches by far the highest BFLOP/s (billions of floating point operations per second), which means it uses the GPU more efficiently. It does run at a lower FPS than Darknet-19, which I think is simply a consequence of the larger network.
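To make the "53" in Darknet-53 concrete, the sketch below counts its layers. The review does not reproduce the layer table, so the stage layout (1, 2, 8, 8, 4 residual blocks, each stage preceded by a stride-2 convolution) is taken from the Darknet-53 table in the YOLOv3 paper. The helper names are made up, and `residual` is only a toy version of a shortcut connection on plain lists.

```python
# Residual blocks per stage, as listed in the Darknet-53 table of the YOLOv3 paper.
RESIDUAL_BLOCKS_PER_STAGE = (1, 2, 8, 8, 4)


def count_darknet53_layers(blocks=RESIDUAL_BLOCKS_PER_STAGE):
    """Return (number of conv layers, conv layers + final connected layer)."""
    convs = 1              # initial 3x3 convolution
    for n in blocks:
        convs += 1         # stride-2 3x3 convolution that halves the resolution
        convs += 2 * n     # each residual block = 1x1 conv followed by 3x3 conv
    connected = 1          # classification layer used for ImageNet
    return convs, convs + connected


def residual(x, f):
    """Toy shortcut connection: add the block's input to the block's output."""
    return [xi + yi for xi, yi in zip(x, f(x))]


print(count_darknet53_layers())                # (52, 53)
print(2 ** len(RESIDUAL_BLOCKS_PER_STAGE))     # total downsampling factor: 32
print(residual([1.0, 2.0], lambda v: [0.5 * a for a in v]))  # [1.5, 3.0]
```

The five stride-2 convolutions give the overall downsampling factor of 32, which matches the coarsest detection scale.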

YOLOv3 Architecture

✔ YOLOv3 takes a multi-scale approach similar to FPN. It produces outputs at 3 scales, and like FPN it builds them with upsampling and concatenation. The 3 outputs are downsampled by factors of 32, 16, and 8 relative to the input.
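The upsample-and-concatenate step can be sketched with shapes alone. This is a simplified illustration, not the real network: the channel counts are toy values, the convolutions between steps are omitted, and nearest-neighbor upsampling is an assumption. Feature maps are stored as nested lists indexed `[channel][row][col]`.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a [channel][row][col] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two feature maps along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0]), "spatial sizes differ"
    return a + b


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def zeros(c, h, w):
    return [[[0.0] * w for _ in range(h)] for _ in range(c)]


# Toy channel counts (hypothetical); spatial sizes are those of a 416x416 input.
coarse = zeros(4, 13, 13)  # stride-32 features
mid = zeros(6, 26, 26)     # stride-16 features from earlier in the backbone
fine = zeros(8, 52, 52)    # stride-8 features from even earlier

merged_16 = concat_channels(upsample2x(coarse), mid)
merged_8 = concat_channels(upsample2x(merged_16), fine)

print(shape(coarse), shape(merged_16), shape(merged_8))
# (4, 13, 13) (10, 26, 26) (18, 52, 52)
```

Each finer scale therefore sees both its own high-resolution features and the upsampled, more semantic features from the coarser scale.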

Training

✔ Training uses full images, with no hard negative mining or similar techniques. YOLOv3 uses multi-scale training, lots of data augmentation, batch normalization, and so on.

How We Do

✔ In the results, YOLOv3 is close to RetinaNet on $AP_{50}$ but relatively weaker on the other metrics. It still does better than the earlier detector, and compared with YOLOv2 its $AP_S$ in particular is relatively strong.

✔ The COCO metric raises the IoU threshold from 0.5 to 0.95 in steps of 0.05, computes AP at each threshold, and averages the results into a single score. (The author is reportedly unhappy with the COCO metric.) On $AP_{50}$ (IoU = 0.5), the metric used in PASCAL VOC, YOLOv3 performs well.
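A small sketch shows why the two metrics can disagree. It is not the official COCO evaluation code: it only computes IoU for one hypothetical prediction and lists which of the ten COCO thresholds that prediction would clear. The boxes and helper names are made up for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds from 0.50 to 0.95 in steps of 0.05.
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def passed_thresholds(pred, gt):
    """Thresholds at which this prediction would count as a match."""
    value = iou(pred, gt)
    return [t for t in COCO_THRESHOLDS if value >= t]


def mean_over_thresholds(ap_per_threshold):
    """COCO-style score: the plain average of the per-threshold AP values."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


gt = (0, 0, 100, 100)
pred = (10, 10, 110, 110)  # hypothetical box, shifted by 10 px in x and y

print(round(iou(pred, gt), 3))      # 0.681
print(passed_thresholds(pred, gt))  # [0.5, 0.55, 0.6, 0.65]
```

The box counts as a hit for $AP_{50}$ but as a miss at six of the ten COCO thresholds, so a detector whose boxes are roughly right but not tightly aligned is penalized much more by the averaged metric.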

Things We Tried That Didn't Work

Anchor box $x, y$ offset predictions

✔ The authors tried the usual anchor box mechanism, predicting the $x, y$ offset with a linear activation, but it made the model less stable.

Linear $x, y$ predictions instead of logistic

✔ They tried predicting the $x, y$ offset directly with a linear activation instead of the logistic activation, but it did not perform well.

Focal Loss

✔ Focal loss did not help either. YOLOv3 may already be robust to the problem focal loss is meant to solve, because it has a separate objectness prediction and conditional class predictions.
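For context, focal loss itself is not defined in this review. The sketch below follows the formulation in the RetinaNet paper, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, with that paper's default values $\gamma = 2$ and $\alpha = 0.25$. It shows what focal loss does to easy examples and is not part of YOLOv3.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled down for well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background example (p = 0.01, y = 0) versus a hard one (p = 0.9, y = 0).
for p in (0.01, 0.9):
    print(p, cross_entropy(p, 0), focal_loss(p, 0))
```

The easy negative contributes almost nothing under focal loss, which is how it keeps the huge number of background boxes from dominating training. YOLOv3's objectness score plays a related role by separating "is there an object" from "which class is it".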

Dual IOU thresholds and truth assignment

✔ Faster R-CNN uses two IoU thresholds during training: a prediction is positive above .7, ignored between .3 and .7, and negative below .3. The authors tried a similar strategy, but the results were not good.
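The two assignment rules can be put side by side in a few lines. This is a sketch under stated assumptions: the function names are made up, and how the exact boundary values .3 and .7 are treated is a guess.

```python
def dual_threshold_label(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Faster R-CNN style rule based on a box's best IoU with any GT object."""
    if max_iou >= pos_thresh:
        return "positive"
    if max_iou < neg_thresh:
        return "negative"
    return "ignore"


def best_prior_index(ious_with_gt):
    """YOLOv3 style rule: one GT object gets exactly one prior, the best-overlapping one."""
    return max(range(len(ious_with_gt)), key=lambda i: ious_with_gt[i])


# Hypothetical IoUs between three priors and a single GT object.
ious = [0.25, 0.62, 0.48]

print([dual_threshold_label(v) for v in ious])  # ['negative', 'ignore', 'ignore']
print(best_prior_index(ious))                   # 1
```

In this example the dual-threshold rule yields no positive at all, while the best-prior rule always produces exactly one positive per GT object.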

What This All Means

✔ YOLOv3 is a fast and accurate detector. It does not do well on the COCO metric (AP averaged over .5 to .95 IoU), but it does well on the old detection metric (.5 IoU).

✔ People have a hard time telling an IoU of 0.3 apart from an IoU of 0.5. So the authors question whether evaluating as strictly as .5 to .95 IoU is really meaningful.

Rebuttal

👨‍🏫 The author adds a further section called Rebuttal that responds to comments on the paper.

✔ First, in answer to the comment that the graphs in the original paper do not start at zero, the graph above is used to show that YOLOv3 is both accurate and fast.

✔ Next, in answer to the comment that the criticism of the COCO metric lacks supporting evidence, the author responds with the material shown above.

✔ The author argues that the COCO metric places more weight on bounding boxes than on classification. The figure above is exaggerated, but both of its results come out well when mAP is computed. The author also argues that the current metric diverges from what matters in practice and that a new metric is needed.

