[2021.09.28] 2 Stage Detectors

Seryoung · October 2, 2021

Boostcamp AI Tech Level2 P-stage Object Detection

💡 From R-CNN through SPPNet and Fast R-CNN to Faster R-CNN, the basis of the latest 2 Stage Detectors

Background

Input image -- compute --> Localization -- compute --> Classification
1. Locate the objects
2. Classify each object

R-CNN

  1. Input image
  2. Extract region proposals (candidate regions likely to contain an object)
    • Sliding window
    • Selective Search
  3. Compute CNN features
  4. Classify Regions

Pipeline

  1. Receive the input image
  2. Selective search -> extract 2000 RoIs
  3. RoI -- Warping --> same size
    • The input size of the CNN's last FC layer is fixed
  4. RoI -- CNN --> extract features
    • Extract a 4096-dim feature vector for each region (2000x4096)
    • Pretrained AlexNet architecture
      • Add an FC layer at the end
      • Fine-tune as needed
  5. Features extracted by the CNN -- SVM --> classification
    • Input: 2000x4096 features
    • Output: Class (C+1, including background) + Confidence scores
  6. Features from the CNN -- Regression --> bounding box prediction
    • Fine-tunes the candidate locations produced by selective search
    • Center coordinates (x,y), w, h
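The warping step in the pipeline can be illustrated with a minimal pure-Python sketch. It crops a region and resizes it to a fixed grid by nearest-neighbor sampling; `warp_region` and the toy 8x8 image are hypothetical names made up for this illustration, and real implementations use proper image interpolation.

```python
def warp_region(image, box, out_h, out_w):
    """Crop box = (x1, y1, x2, y2) from a 2D image and resize it
    to out_h x out_w by nearest-neighbor sampling (aspect ratio is not kept)."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    return [
        [image[y1 + (i * h) // out_h][x1 + (j * w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy 8x8 "image" whose pixel value encodes its position (row * 8 + col).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
wide = warp_region(image, (2, 1, 6, 3), 4, 4)  # region is 4 wide, 2 tall
tall = warp_region(image, (0, 0, 2, 8), 4, 4)  # region is 2 wide, 8 tall
```

Both regions come out as 4x4 grids even though their shapes differ, which is what lets a fixed-size FC layer consume them. The distortion this causes is the information loss listed under the shortcomings.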

Training

AlexNet

  • Domain specific finetuning
  • Dataset
    • IoU > 0.5: (+)
    • IoU < 0.5: (-)
    • (+) samples 32 / (-) samples 96
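The (+)/(-) split above depends on IoU (intersection over union). A minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)`; the helper names are hypothetical, and treating IoU exactly equal to 0.5 as negative is an assumption because the notes do not specify it.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, pos_thresh=0.5):
    """Return 1 (+) if the best IoU with any ground-truth box exceeds
    pos_thresh, otherwise 0 (-). The tie at exactly pos_thresh is an assumption."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    return 1 if best > pos_thresh else 0
```

A mini-batch would then be filled with 32 proposals labeled 1 and 96 labeled 0, following the ratio in the notes.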

Linear SVM

  • Dataset
    • Ground truth: (+)
    • IoU < 0.3: (-)
    • (+) samples 32 / (-) samples 96
  • Hard negative mining
    • Hard negative: False Positive
    • Samples that are hard to identify as background -> forcibly mined as negative samples for the next batch
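Hard negative mining can be sketched as picking the background samples that the current classifier still scores as objects. This is only an illustration: `mine_hard_negatives` is a hypothetical helper, and the decision threshold of 0.0 (the usual sign boundary of an SVM decision value) is an assumption.

```python
def mine_hard_negatives(background_scores, threshold=0.0):
    """Return indices of background samples that the classifier wrongly
    scores above the threshold (false positives)."""
    return [i for i, score in enumerate(background_scores) if score > threshold]


# Decision values the classifier gave to four known-background regions.
scores = [-1.2, 0.4, -0.1, 2.0]
hard = mine_hard_negatives(scores)  # regions 1 and 3 are false positives
```

The returned indices are the samples that would be added to the negative set of the next batch.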

Bbox regressor

  • Dataset
    • IoU > 0.6: (+)
    • Learns how far to shift the center and how much to scale the width & height
  • Loss function: MSE Loss
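The regression targets (center shift and width/height scaling) can be written down concretely. The sketch below assumes the commonly used parameterization, with center offsets normalized by the proposal size and log-scale ratios for width and height; the function names are hypothetical.

```python
import math


def to_cxcywh(box):
    """Convert (x1, y1, x2, y2) to (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h


def regression_targets(proposal, gt):
    """Targets (tx, ty, tw, th): how far to move the proposal center and
    how much to scale its width and height to reach the ground truth."""
    px, py, pw, ph = to_cxcywh(proposal)
    gx, gy, gw, gh = to_cxcywh(gt)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def mse_loss(pred, target):
    """Mean squared error between predicted and target offsets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
```

For a proposal `(0, 0, 10, 10)` and ground truth `(5, 0, 25, 10)`, the target says "move the center right by one proposal width and double the width", that is, `tx = 1.0` and `tw = log 2`.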

Shortcomings

  1. Each of the 2000 regions passes through the CNN -> heavy computation, slow
  2. Forced warping -> information loss -> possible performance drop
  3. CNN, SVM classifier, and bounding box regressor are trained separately
  4. Not End-to-End (Selective search)

SPPNet

R-CNN limitations

  • ConvNet input image size is fixed -> crop/warp
  • Every RoI (Region of Interest) passes through the CNN

Pipeline

  • Image --Conv layers--> Spatial pyramid pooling --FC layers--> Output
    • The 2000 regions are extracted on the feature map produced by a single conv pass
    • Instead of warping, spatial pyramid pooling converts each region to a fixed size

Spatial Pyramid Pooling

  • Divide the region into equal-sized bins and extract one feature from each bin
    => the same number of features (the number of bins), regardless of the input size
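A minimal single-channel sketch of spatial pyramid pooling in pure Python. The pyramid levels `(4, 2, 1)` are an illustrative assumption, not taken from the notes, and the function names are hypothetical.

```python
import math


def pool_grid(fmap, n):
    """Max-pool a 2D feature map (list of rows) into an n x n grid, flattened."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(n):
        r0, r1 = math.floor(i * h / n), math.ceil((i + 1) * h / n)
        for j in range(n):
            c0, c1 = math.floor(j * w / n), math.ceil((j + 1) * w / n)
            out.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return out


def spatial_pyramid_pool(fmap, levels=(4, 2, 1)):
    """Concatenate the pooled grids of every pyramid level into one vector."""
    feats = []
    for n in levels:
        feats.extend(pool_grid(fmap, n))
    return feats


small = [[r * 5 + c for c in range(5)] for r in range(6)]     # 6 x 5 map
large = [[r * 13 + c for c in range(13)] for r in range(9)]   # 9 x 13 map
same_length = len(spatial_pyramid_pool(small)) == len(spatial_pyramid_pool(large))
```

With these levels every input yields 16 + 4 + 1 = 21 values per channel, so the FC layer always sees the same length no matter how large the region is.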

Shortcomings

1. Each of the 2000 RoIs passes through the CNN - resolved: the image passes through the CNN first, then the RoIs are extracted
2. Forced warping - resolved: spatial pyramid pooling extracts fixed-size features
3. CNN, SVM classifier, and bounding box regression are trained separately
4. Not End-to-End

Fast R-CNN

Pipeline

  1. Feed the image into the CNN to extract features (the CNN is used once)
    • VGG16
  2. RoI projection --> compute the RoIs on the feature map
  3. RoI pooling --> fixed-size features
    • Spatial pyramid pooling
    • Pyramid level: 1
    • Target grid size: 7x7
  4. Fully connected layer
  5. Softmax Classifier & Bounding Box Regressor
    • Number of classes: C+1 (including background)
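RoI projection and RoI pooling (steps 2 and 3) can be sketched in pure Python for a single channel. The stride of 16 matches the total downsampling of VGG16's last conv feature map; the floor/ceil bin boundaries are a simplification chosen for this sketch, and the function names are hypothetical.

```python
import math


def project_roi(roi, stride=16):
    """Map an image-space RoI (x1, y1, x2, y2) onto feature-map cells.
    The returned x2, y2 are exclusive and the RoI covers at least one cell."""
    x1, y1, x2, y2 = roi
    fx1, fy1 = math.floor(x1 / stride), math.floor(y1 / stride)
    fx2 = max(fx1 + 1, math.ceil(x2 / stride))
    fy2 = max(fy1 + 1, math.ceil(y2 / stride))
    return fx1, fy1, fx2, fy2


def roi_pool(fmap, froi, out_size=7):
    """Max-pool the feature-map RoI into an out_size x out_size grid
    (spatial pyramid pooling with a single level)."""
    x1, y1, x2, y2 = froi
    h, w = y2 - y1, x2 - x1
    out = []
    for i in range(out_size):
        r0 = y1 + math.floor(i * h / out_size)
        r1 = y1 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            c0 = x1 + math.floor(j * w / out_size)
            c1 = x1 + math.ceil((j + 1) * w / out_size)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out
```

The RoI is assumed to lie inside the feature map. Whatever its size, the output is always 7x7, which replaces the warping step of R-CNN.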

Training

  • Uses a multi-task loss
    • Classification loss + bounding box regression
  • Loss function
    • Classification: Cross entropy
    • BB regressor: Smooth L1 (less sensitive to outliers than L2)
  • Dataset composition
    • IoU > 0.5: (+)
    • 0.1 < IoU < 0.5: (-)
    • (+) 25% / (-) 75%
  • Hierarchical sampling
    • R-CNN stores and uses all the RoIs present in the images
      • One batch contains RoIs from many different images
    • Fast R-CNN: one batch contains RoIs from only a few images (2 in the paper)
      • Computation and memory can be shared within a batch
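The multi-task loss can be sketched as cross entropy plus Smooth L1, with the box term applied only to non-background RoIs. The weight `lam=1.0` and the convention that class index 0 means background are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1, so large errors grow
    linearly instead of quadratically."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return -math.log(probs[label])


def multi_task_loss(probs, label, pred_box, target_box, lam=1.0):
    """Classification loss + lam * box regression loss.
    Assumption: label 0 is background, which has no box to regress."""
    cls_loss = cross_entropy(probs, label)
    if label == 0:
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss
```

For an error of 3.0, Smooth L1 gives 2.5 while a squared loss of the form 0.5 * x^2 would give 4.5, which is why a few badly localized RoIs dominate the gradient less.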

Shortcomings

1. Each of the 2000 RoIs passes through the CNN - resolved
2. Forced warping - resolved
3. CNN, SVM classifier, and bounding box regression are trained separately - resolved
4. Not End-to-End
- Selective search -- CPU --> not learnable

Faster R-CNN

Pipeline

  1. Image -- CNN --> feature maps (the CNN is used once)
  2. RPN --> compute the RoIs
    • Replaces the previous selective search
    • Anchor box
      • Define N anchor boxes per cell -> can handle objects of various sizes
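Anchor generation can be sketched as follows. The defaults (stride 16, scales 128/256/512, aspect ratios 1:2, 1:1, 2:1, giving N = 9) are the values commonly cited for Faster R-CNN rather than something stated in these notes, and `make_anchors` is a hypothetical helper.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) in image coordinates,
    len(scales) * len(ratios) per feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this cell in image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r = height / width; the area stays s * s.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

A 2x3 feature map therefore produces 2 * 3 * 9 = 54 anchors. Some extend beyond the image border, which real implementations clip or ignore.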

Region Proposal Network (RPN)

  • Input: feature map from the CNN (H,W,C)
  • 3x3 conv --> produces the intermediate layer
  • 1x1 conv --> performs binary classification
    • 2 (object or not) x 9 (# of anchors) channel
  • 1x1 conv --> performs bbox regression
    • 4 (bbox) x 9 (# of anchors) channel
    • 4: center coordinates x, y, width, height
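The channel counts of the two 1x1 conv heads, and how the 4 regression values refine an anchor, can be sketched in pure Python. The decoding rule (center shift scaled by anchor size, exponential scaling of width and height) is the commonly used parameterization and an assumption here; the function names are hypothetical.

```python
import math


def rpn_output_shapes(feat_h, feat_w, num_anchors=9):
    """Shapes of the two 1x1-conv outputs for an (H, W, C) feature map."""
    return {
        "objectness": (feat_h, feat_w, 2 * num_anchors),
        "bbox_deltas": (feat_h, feat_w, 4 * num_anchors),
    }


def decode(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an anchor (x1, y1, x2, y2) and
    return the refined box in the same corner format."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    acx, acy = anchor[0] + aw / 2.0, anchor[1] + ah / 2.0
    dx, dy, dw, dh = deltas
    cx, cy = acx + dx * aw, acy + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

For a 38x50 feature map the heads output 18 and 36 channels per cell, one (object, not object) pair and one 4-value offset for each of the 9 anchors.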

NMS

  • Removes similar RPN proposals
  • Sort the proposals by class score
  • For each bbox, compute the IoU with the other bboxes
  • Proposals with IoU >= 0.7 are treated as duplicates and removed
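The NMS steps above can be sketched as a greedy loop in pure Python. The 0.7 threshold comes from the notes; the function names and the `(x1, y1, x2, y2)` box format are assumptions of this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if its IoU with every already-kept box is below iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # box 1 overlaps box 0 with IoU of about 0.82
```

The function returns the indices of the surviving proposals in descending score order.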

Training

  • Region Proposal Network (RPN)
    • In the RPN stage, anchor boxes are split into (+)/(-) samples to train the classifier & regressor
    • Dataset
      • IoU > 0.7 or highest IoU with a GT: (+)
      • IoU < 0.3: (-)
      • The rest: not used as training data
    • Loss function: objectness classification loss + bbox regression loss
  • After region proposal, samples are split into (+)/(-) to train Fast RCNN
  • Dataset
    • IoU > 0.5: (+) -> 32 samples
    • IoU < 0.5: (-) -> 96 samples
    • The mini-batch consists of 128 samples
  • Loss function: same as Fast RCNN
  • Training RPN & Fast RCNN
    1. Load the Imagenet pretrained backbone + train RPN
    2. Load the Imagenet pretrained backbone + RPN from (1) + train Fast RCNN
    3. Load & freeze the backbone fine-tuned in (2) + train RPN
    4. Load & freeze the backbone fine-tuned in (2) + RPN from (3) + train Fast RCNN
  • The training procedure is complex => use Approximate Joint Training
    • Sum all the losses and run a single backward pass
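The RPN anchor labeling rule and the idea behind approximate joint training can be sketched as follows. The label encoding (1 for (+), 0 for (-), -1 for ignored) and the unweighted sum of the four losses are assumptions of this sketch, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor: 1 = (+), 0 = (-), -1 = not used for training."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        labels.append(1 if best > pos_thresh else 0 if best < neg_thresh else -1)
    # The anchor with the highest IoU for each GT box is also positive.
    for j in range(len(gt_boxes)):
        best_i = max(range(len(anchors)), key=lambda i: ious[i][j])
        if ious[best_i][j] > 0:
            labels[best_i] = 1
    return labels


def joint_loss(rpn_cls, rpn_reg, det_cls, det_reg):
    """Approximate joint training: add the RPN and Fast RCNN losses so that
    one backward pass updates the whole network."""
    return rpn_cls + rpn_reg + det_cls + det_reg
```

In a real framework the four arguments of `joint_loss` would be tensors and the backward pass would be called once on their sum, instead of running the 4-step alternating schedule.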

Summary

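|                | R-CNN | Fast R-CNN  | Faster R-CNN |
|----------------|-------|-------------|--------------|
| Classification | SVMs  | Linear      | Linear       |
| Resize         | Warp  | RoI Pooling | RoI Pooling  |
| End-to-End     | X     | X           | O            |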
