[CV] YOLO9000: Better, Faster, Stronger

๊ฐ•๋™์—ฐยท2022๋…„ 1์›” 22์ผ
0

[Paper review]

๋ชฉ๋ก ๋ณด๊ธฐ
6/17

๐ŸŽˆ ๋ณธ ๋ฆฌ๋ทฐ๋Š” YOLO9000 ๋ฐ ๋ฆฌ๋ทฐ๋ฅผ ์ฐธ๊ณ ํ•ด ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

Key words

๐ŸŽˆ Joint train algorithm
๐ŸŽˆ Anchor Boxes
๐ŸŽˆ Kmeans
๐ŸŽˆ Darknet-19
๐ŸŽˆ Dataset combination with WordTree

Introduction

โœ” Detection Framework๋Š” ์ ์  ๋นจ๋ผ์ง€๋ฉฐ, ์ •ํ™•ํ•ด ์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋Œ€๋ถ€๋ถ„์˜ detection ๋ฐฉ๋ฒ•๋ก ๋“ค์€ ์ž‘์€ ๋ฐ์ดํ„ฐ ์…‹์œผ๋กœ ์ง„ํ•ด ์ œํ•œ์ ์ž…๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด classification datasets์€ ์ƒ๋Œ€์ ์œผ๋กœ ๋งŽ์€ ๋ฐ์ดํ„ฐ์™€ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” Detection์„ ์œ„ํ•œ ํฐ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป๊ธฐ ์–ด๋ ค์›€์œผ๋กœ , ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ์…‹์˜ ๊ฒฐํ•ฉ์œผ๋กœ classification๊ณผ object detection์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค. YOLO9000์€ ์‹ค์‹œ๊ฐ„ object detector์ด๋ฉด์„œ 9000๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์ž…๋‹ˆ๋‹ค.

Better

โœ” ๊ธฐ์กด์˜ YOLO๋Š” Fast R-CNN๊ณผ ์ƒ๋Œ€์ ์œผ๋กœ localization error์™€ low recall ๊ฐ’์ด ๋ณด์—ฌ์ง‘๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ์šฐ๋ฆฌ๋Š” ์œ„์™€ ๊ฐ™์€ ์ทจ์•ฝ์ ์„ ๋ณด์™„ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ YOLOv2๋Š” ์ •ํ™•ํ•˜๊ณ  ๋น ๋ฅธ ๋„คํŠธ์›Œํฌ๋ฅผ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— simplify network์™€ make the representation easier to learn ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

Batch Normalization

โœ” Batch Normalization์€ ์ˆ˜๋ ดํ•˜๋Š”๋ฐ ์ค‘์š”ํ•œ ์—ญํ™œ์„ ํ•˜๋ฉฐ ๋™์‹œ์— ๋‹ค๋ฅธ ์ •๊ทœํ™” ๋ฐฉ๋ฒ•๋ก ๋“ค์„ ๋Œ€์‹ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” Batch Normalization์ด๋ž€ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ์ด์šฉํ•ด ์ •๊ทœ์™€ ํ•ฉ๋‹ˆ๋‹ค.๋˜ํ•œ ์‹ ๊ฒฝ๋ง ๋ ˆ์ด์–ด์˜ ์ค‘๊ฐ„ ์ค‘๊ฐ„์— ์œ„์น˜ํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ฉฐ, ๊ฐ๋งˆ(Scale), ๋ฒ ํƒ€(Shifht)๋ฅผ ํ†ตํ•ด ๋น„์„ ํ˜•์„ฑ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” Batch Normalization์„ ํ†ตํ•ด 2%์˜ mAP๋ฅผ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

High Resolution Classifier

โœ” ๋ชจ๋“  SOTA detection ๋ฐฉ๋ฒ•๋“ค์€ pre-train๋œ ImageNet์„ ์‚ฌ์šฉํ•˜๋ฉฐ, 256x256๋ณด๋‹ค ์ž‘์€ input image๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. YOLO์—์„œ๋Š” 224x224๋กœ train์„ ํ•˜๊ณ , detection์‹œ 448 ์‚ฌ์ด์ฆˆ๋กœ detection์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ๋„คํŠธ์›Œํฌ๋Š” ์ƒˆ ์ž…๋ ฅ ํ•ด์ƒ๋„๋กœ ์กฐ์ •ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.

โœ” ๊ทธ๋ž˜์„œ YOLOv2์—์„œ๋Š” 448x448 image๋ฅผ 10 epochs๋กœ fine-tuning์„ ํ–ˆ๊ณ , ๊ทธ ๊ฒฐ๊ณผ 4%์˜ mAP ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Convolutional With Anchor Boxes

โœ” ๊ธฐ์กด YOLO์—์„œ๋Š” bounding box์˜ ์ขŒํ‘œ๋ฅผ fully connected layer๋ฅผ ์‚ฌ์šฉํ•ด ์ง์ ‘์ ์œผ๋กœ ์ฐพ์•˜์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ YOLOv2์—์„œ๋Š” fully connected layer์„ ์ œ๊ฑฐํ•˜๊ณ , anchor boxes๋ฅผ ์‚ฌ์šฉํ•ด bounding box๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

โœ” ๋จผ์ € conv layer์˜ output์ด ๋ณด๋‹ค ๋†’์€ resolution๋ฅผ ๊ฐ€์ง€๋„๋ก pooling layer์„ ์ œ๊ฑฐํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ final feature map ์‚ฌ์ด์ฆˆ๋ฅผ ํ™€์ˆ˜๋กœ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด 448x448 input images ์‚ฌ์ด์ฆˆ ๋Œ€์‹  416์˜ ์‚ฌ์ด์ฆˆ๋ฅผ input image๋กœ ๋„ฃ์Šต๋‹ˆ๋‹ค. ํฐ ๊ฐ์ฒด์˜ ๊ฒฝ์šฐ image์˜ ์ค‘์‹ฌ์— ์œ„์น˜ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์€๋ฐ, ์ด๋•Œ feature map ํ™€์ˆ˜์ด๋ฉด feature map ๋‚ด์— ํ•˜๋‚˜์˜ ์ค‘์‹ฌ cell์ด ์กด์žฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 416x416์˜ ํฌ๊ธฐ์—์„œ 32๋ฐฐ์˜ downsampling์„ ํ†ตํ•ด ์ตœ์ข…์ ์œผ๋กœ 13x13 feature map๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” Anchor boxes๋ฅผ ์‚ฌ์šฉํ•ด ์•ฝ๊ฐ„ ๋‚ฎ์€ ์ •ํ™•๋„๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ๋Š”๋ฐ, Anchor boxes๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์„ ๋•Œ์—๋Š” 69.5mAP์™€ 81% recall์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ๊ณ , Anchor boxes๋ฅผ ์‚ฌ์šฉํ•  ์‹œ 69.2mAP์™€ 88% recall์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ •ํ™•๋„๋Š” ์กฐ๊ธˆ ๋‚ด๋ ค๊ฐ”์ง€๋งŒ, 7% ๋†’์€ recall์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ YOLO๋Š” image๋‹น 98 boxes๋งŒ์„ ์˜ˆ์ธกํ–ˆ์ง€๋งŒ, anchor boxes๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ˆ˜์ฒœ๊ฐœ์˜ boxes์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋น„๋ก ๋‚ฎ์€ mAP๋ฅผ ๊ธฐ๋กํ–ˆ์ง€๋งŒ recall์˜ ํ–ฅ์ƒ์€ ๋ชจ๋ธ์ด ๊ฐœ์„ ํ•  ์—ฌ์ง€๊ฐ€ ๋” ๋งŽ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

Dimension Clusters

โœ” Anchor boxes๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ 2๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋จผ์ € box dimensions are hand picked ๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋„คํŠธ์›Œํฌ๊ฐ€ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋” ๋‚˜์€ prior๋ฅผ ์„ ํƒํ•œ๋‹ค๋งˆ๋…€ ๋„คํŠธ์›Œํฌ๊ฐ€ ์ข‹์€ predict good detection์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์„๊ฒ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ k-means ๊ตฐ์ง‘ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ด bounding box๋ฅผ setํ•ฉ๋‹ˆ๋‹ค.

โœ” ๊ธฐ์กด์˜ k-means์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋” ํฐ ๋ฐ•์Šค๋“ค์€ ์ž‘์€ ๋ฐ•์Šค๋“ค์— ๋น„ํ•ด ํฐ ์—๋Ÿฌ๋ฅผ ๋ฐœ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

โœ” ์œ„์˜ ๊ทธ๋ž˜ํ”„์—์„œ K=5 ์ผ๋•Œ ๋ชจ๋ธ ๋ณต์žก์„ฑ๊ณผ ๋†’์€ recall์— ๋Œ€ํ•œ ์ข‹์€ trade-off ๊ด€๊ณ„๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

โœ” ์œ„์˜ ํ‘œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, k=5์ธ k-means๋ฅผ ์‚ฌ์šฉํ•œ ๋ฐฉ๋ฒ•์ด 9๊ฐœ์˜ Anchor Boxes๋ฅผ ์‚ฌ์šฉํ•œ ๊ฒƒ๋ณด๋‹ค ๋†’์€ Avg IOU๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Direct loactaion prediction

โœ” YOLO์—์„œ anchor boxes๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ๋ฐœ์ƒํ•˜๋Š” 2๋ฒˆ์งธ ์ด์Šˆ๋Š” model instability์ž…๋‹ˆ๋‹ค. ๋Œ€๋ถ€๋ถ„์˜ instability (x,y) ์ขŒํ‘œ๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ ๋ฐœ์ƒํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. RPN์—์„œ ์˜ˆ์ธกํ•˜๋Š” tx, ty ๊ทธ๋ฆฌ๊ณ  (x,y) ์ขŒํ‘œ๋Š” ์•„๋ž˜์˜ ์‹๊ณผ ๊ฐ™์ด ์˜ˆ์ธก์ด ๋ฉ๋‹ˆ๋‹ค.

โœ” ์˜ˆ๋ฅผ ๋“ค๋ฉด tx = 1์ด๋ฉด ์˜ค๋ฅธ์ชฝ์œผ๋กœ box๊ฐ€ ์ด๋™ํ•˜๊ณ , tx = -1์ด๋ฉด ์™ผ์ชฝ์œผ๋กœ ๋ฐ•์Šค๊ฐ€ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค. ์œ„์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์€ ์ œ์•ฝ์ด ์—†๊ธฐ์— anchor box๋Š” ์ด๋ฏธ์ง€ ๋‚ด์˜ ์–ด๋–ค ์ง€์ ์—๋„ ์œ„์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด ์ตœ์ ์˜ ๊ฐ’์„ ์ฐพ๋Š”๋ฐ ์˜ค๋ž˜๊ฑธ๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” ์œ„์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ๋ฐฉ์•ˆ์œผ๋กœ YOLO์˜ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ grid cell์— ์ƒ๋Œ€์ ์ธ ์œ„์น˜ ์ขŒํ‘œ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค. CxC_x, CyC_y์€ grid cell์˜ ์ขŒ์ƒ๋‹จ์˜ ๊ธธ์ด์ž…๋‹ˆ๋‹ค(์œ„์˜ ์‚ฌ์ง„๊ณผ ๊ฐ™์ด). bounding box regression์„ ํ†ตํ•ด ์–ป์€ txt_x, tyt_y ๊ฐ’์— logisic regression ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์—ฌ 0~1 ์‚ฌ์˜ ๊ฐ’์„ ๊ฐ€์ง€๋„๋ก ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

โœ” dimension cluster์™€ anchor box ์ขŒํ‘œ๋ฅผ ์ง์ ‘ ์˜ˆ์ธกํ•จ์œผ๋กœ์„œ 5%์˜ recall ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Fine-Grained Features

โœ” YOLO v2๋Š” ์ตœ์ข…์ ์œผ๋กœ 13x13 feature map๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ํฐ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•˜๋Š”๋ฐ ์ˆ˜์›”ํ•œ ๋ฐ˜๋ฉด ์ž‘์€ ๋ฌผ์ฒด๋ฅผ ์˜ˆ์ธกํ•˜๋Š”๋ฐ ์–ด๋ ค์›€์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์•ž์— ์žˆ๋Š” 26x26 resolution layer์— passthoriugh layer๋ฅผ ํ†ตํ•ด ๊ฐ€์ง€๊ณ  ์˜ต๋‹ˆ๋‹ค. ์œ„์˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์€ 26x26x512 feature map๋ฅผ 4๊ฐœ๋กœ ๋‚˜๋ˆ  concatenate ํ•ด์ค๋‹ˆ๋‹ค. ์ดํ›„ ์ตœ์ข…์ ์œผ๋กœ ์›๋ž˜์˜ 13x13 feature map๊ณผ๋„ concatenate ํ•ฉ๋‹ˆ๋‹ค. ์œ„์™ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ fine-grain feature์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์—ˆ๊ณ , ๊ฒฐ๊ณผ์ ์œผ๋กœ 1%์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Multi-Scale Training

โœ” YOLOv2๋Š” ๊ธฐ์กด์˜ ๋ชจ๋ธ๊ณผ ๋‹ค๋ฅด๊ฒŒ ์˜ค์ง conv layer์™€ pooling layer๋งŒ์„ ์‚ฌ์šฉํ•˜๊ธฐ์—, ๋‹ค์–‘ํ•œ ์‚ฌ์ด์ฆˆ์˜ input image๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” YOVL v2๊ฐ€ ๋‹ค์–‘ํ•œ images๋“ค์— robustํ•˜๊ธฐ ์›ํ•ฉ๋‹ˆ๋‹ค.

โœ” ๊ทธ๋ž˜์„œ image ์‚ฌ์ด์ฆˆ๋ฅผ ๊ณ ์ •ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, 10 batches ๋งˆ๋‹ค ๋žœ๋คํ•˜๊ฒŒ ์ƒˆ๋กœ์šด image ์‚ฌ์ด์ฆˆ๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. YOLOv2๋Š” 32๋ฐฐ์˜ downsampling์„ ์ง„ํ–‰ํ•˜๊ธฐ์— {320,352,..608} 32์˜ ๋ฐฐ์ˆ˜ ํฌ๊ธฐ ๋งŒํผ์—์„œ ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ์œ„์˜ ํ‘œ์™€ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Faster

โœ” YOLOv2์—์„œ๋Š” ๋น ๋ฅธ ์†๋„์™€ ์ •ํ™•๋„๋ฅผ ์œ„ํ•ด์„œ Googlenet ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ปค์Šคํ…€ํ™” ํ–ˆ์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ VGG-16๋ณด๋‹ค ๋น ๋ฅด๊ณ  ๋œ ๋ณต์žกํ•œ ๋ชจ๋ธ Darknet-19๋ฅผ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

Darknet-19

โœ” Darknet-19์€ 9๊ฐœ์˜ Conv layer์™€ 5๊ฐœ์˜ maxpooling layer์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ์œ„์˜ ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฅด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์—๋Š” GAP๋ฅผ FCN ๋Œ€์‹  ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. Darknet-19์€ ImageNet์—์„œ 72.9% top-1 accuracy, 91.2% top5 accuracy๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Training for classification

โœ” Standard ImageNet 1000 class classification dataset์„ SGD๋ฅผ ์‚ฌ์šฉํ•ด 160 epochs๋งŒํผ ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. Training์—๋Š” random crops, rotations, and hue, saturation and exposure shifts๋ฅผ ์‚ฌ์šฉํ•ด data augementation์„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ 448x448 ์ด๋ฏธ์ง€ ์‚ฌ์ด์ฆˆ๋ฅผ 10 epochs๋งŒํผ fine-tuning์„ ์ง„ํ–‰ํ–ˆ์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ top-1 ์ •ํ™•๋„๋Š” 76.5%, top-5 ์ •ํ™•๋„๋Š” 93.3%์˜ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

Training for detection

โœ” Detection์—์„œ๋Š” Darknet-19์˜ ๋งˆ์ง€๋ง‰ Conv layer๋ฅผ ์ œ๊ฑฐํ•˜๊ณ , 3x3x1024 conv layer๋กœ ๋Œ€์ฒดํ•˜๊ณ , 1x1 conv layer๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ VOC์—์„œ 1x1 conv layer์˜ channel ์ˆ˜๋Š” 125๋กœ, ์ด๋Š” ๊ฐ cell ๋งˆ๋‹ค 5๊ฐœ์˜ bounding-box์™€ 5๊ฐœ์˜ ์ขŒํ‘œ ๊ทธ๋ฆฌ๊ณ  20๊ฐœ์˜ class๋กœ ์ด 125 channel ์ˆ˜๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  passthrough layer ๋”ํ•จ์œผ๋กœ์จ fine grain features๋“ค์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Stronger

โœ” ๋…ผ๋ฌธ์—์„œ๋Š” detection๊ณผ classification datasets์œผ๋กœ ๊ฒฐํ•ฉ๋œ datasets์˜ ํ•™์Šต์„ ํ†ตํ•ด ๋” ๋งŽ์€ class๋ฅผ ์˜ˆ์ธกํ•˜๋Š” YOLO9000์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. Classification image์—์„œ๋Š” YOLO9000๋Š” ๊ตฌ์กฐ์˜ ๋ถ„๋ฅ˜ ํŠน์ • ๋ถ€๋ถ„์—์„œ์˜ loss๋งŒ ์—ญ์ „ํŒŒํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ Detection datasets์—์„œ๋Š” ์ผ๋ฐ˜์ ์ธ ๋ผ๋ฒจ๋ฟ์ด ์—†์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด "dog"์™€ "boat"์™€ ๊ฐ™์€ ๋ผ๋ฒจ๋ฟ์ด ์—†๋‹ค๋ฉด classification datasets์—์„œ๋Š” "Norfolk terrier"...๋“ฑ๋“ฑ๊ณผ ๊ฐ™์ด ๋” ๊ตฌ์ฒด์ ์ธ ๋ผ๋ฒจ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

โœ” ์œ„์™€ ๊ฐ™์€ ๋ฌธ์ œํ•ด๊ฒฐ์„ ์œ„ํ•ด YOLO9000์—์„œ๋Š” ๊ฐ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒฐํ•ฉํ•œ ์ƒํ˜ธ ๋ฐฐํƒ€์ ์ด์ง€ ์•Š์€ multi-label model์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Hierarchical classification

โœ” ImageNet์€ WordNet ์ด๋ผ๋Š” ์–ธ์–ด ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค. WordNet์€ tree ํ˜•ํƒœ๊ฐ€ ์•„๋‹Œ directed graph ํ˜•์‹์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ์Šต๋‹ˆ๋‹ค. ์™œ๋ƒํ•˜๋ฉด ์–ธ์–ด๋Š” ๋ณต์žกํ•˜๊ธฐ์— ๋‹จ์ˆœํžˆ ํŠธ๋ฆฌ๋กœ ํ‘œํ˜„ํ•˜๊ธฐ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด "dog"๋Š” "canine"์—๋„ ์†ํ•˜๊ณ , "domestic animal"์—๋„ ์†ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. YOLO9000๋Š” ์ด๋ฅผ ๋‹จ์ˆœํ™”ํ•ด์„œ WordNet๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„์ธต์ ์ธ tree๋ฅผ ๊ตฌ์กฐํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค.

โœ” WordTree๋ฅผ ์‚ฌ์šฉํ•ด classification์„ ์ˆ˜ํ–‰ํ•  ๋•Œ๋Š” ์œ„์™€ ๊ฐ™์ด "terrier"๋ผ๊ณ  ์˜ˆ๋ฅผ ๋“ค๋ฉด ๊ฐ๊ฐ์˜ "terrier"์— ๋Œ€ํ•œ ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

โœ” ๊ทธ๋ฆฌ๊ณ  "Norfolk terrier"์— ๋Œ€ํ•ด ๊ณ„์‚ฐํ•˜๋ฉด ์œ„์™€ ๊ฐ™์ด ๋ชจ๋“  ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ๋ฅผ ๊ณฑํ•ฉ๋‹ˆ๋‹ค.

โœ” ImageNet ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•ด WordTree๋ฅผ ๊ตฌ์„ฑํ•  ๊ฒฝ์šฐ, ์ตœ์ƒ์œ„ ๋…ธ๋“œ๋ถ€ํ„ฐ ์ตœํ•˜์œ„ ๋…ธ๋“œ๊นŒ์ง€ ์ด ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ํ•ฉ์น˜๋ฉด 1369๊ฐœ์˜ ๋ฒ”์ฃผ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. 369๊ฐœ์˜ ์นดํ…Œ๊ณ ๋ฆฌ๊ฐ€ ๋Š˜์–ด๋‚ฌ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  71.9%์˜ top-1 accuracy์™€ 90.4%์˜ top-5 accuracy๋ผ๋Š” ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Join classification and detection

โœ” COCO datasets๊ณผ ImageNet์„ ๊ฒฐํ•ฉํ•œ 9418๊ฐœ์˜ ๋ฒ”์ฃผ๋ฅผ ๊ฐ€์ง€๋Š” WordTree๋ฅผ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ImageNet์ด COCO datasets๋ณด๋‹ค ํ›จ์”ฌ ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฏ€๋กœ, COCO dataset๋ฅผ oversamplingํ•ด 4:1 ๋น„์œจ๋กœ ๋ฐ์ดํ„ฐ์…‹์„ ๊ตฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

โœ” ์œ„์˜ datasets์„ ๊ฐ€์ง€๊ณ  YOLO9000์„ trainํ•ฉ๋‹ˆ๋‹ค. YOLO v2์˜ ์•„ํ‚คํ…์ณ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ 5๊ฐœ๊ฐ€ ์•„๋‹Œ 3๊ฐœ์˜ anchor boxes๋ฅผ ์‚ฌ์šฉํ•ด ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. detection dataset์˜ image๋Š” ์ผ๋ฐ˜์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ detection loss๋ฅผ backward passํ•˜๊ณ , classification loss์˜ ๊ฒฝ์šฐ์—๋Š” ํŠน์ • ๋ฒ”์ฃผ์—์„œ ์ƒ์œ„ ๋ฒ”์ฃผ์— ๋Œ€ํ•ด์„œ๋งŒ loss๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. classification datasets image์˜ ๊ฒฝ์šฐ์—๋Š” classification loss์— ๋Œ€ํ•ด์„œ๋งŒ backward pass๋ฅผ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ์ด๋•Œ GT box์™€์˜ IoU 0.3 ์ด์ƒ์€ ๊ฒฝ์šฐ๋งŒ ์—ญ์ „ํŒŒ๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

Conclusion

โœ” ๋…ผ๋ฌธ์—์„œ๋Š” YOLO v2์™€ YOLO 9000, ์‹ค์‹œ๊ฐ„ detection ๊ตฌ์กฐ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. YOLO v2๋Š” SOTA์ด๋ฉฐ ๋‹ค๋ฅธ detection ๋„คํŠธ์›Œํฌ๋ณด๋‹ค ๋น ๋ฅธ ์†๋„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ๋‹ค์–‘ํ•œ detection datasets์— ๋Œ€ํ•ด์„œ๋„ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. YOLO 9000์€ 9000๊ฐœ์˜ ๊ฐ์ฒด ์นดํ…Œ๊ณ ๋ฆฌ์— ๋Œ€ํ•ด detection๊ณผ classification์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ณ„์ธต์  classifciation์„ ํ†ตํ•œ dataset์˜ ๊ฒฐํ•ฉ์€ classification๊ณผ segmentation domain์— ์œ ์šฉํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.


Reference

profile
Maybe I will be an AI Engineer?

0๊ฐœ์˜ ๋Œ“๊ธ€