(논문 리뷰)Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5) (2014)

Jonghyeon Sim·2024년 5월 22일

딥러닝 논문 리뷰

목록 보기

5/6

논문 : https://arxiv.org/pdf/1311.2524

Abstract

지난 수년 동안 canonical PASCAL VOC dataset로 측정된 객체 탐지 성능은 정체되었음.
가장 성능이 좋은 방법은 전형적으로 다수의 낮은 레벨의 이미지 특성들과 높은 레벨의 context와 함께 전형적으로 결합하는 복잡한 앙상블 시스템임.
이 논문에서는, mean average precision(mAP)가 향상된 단순하고 확장 가능한 탐지 알고리즘을 제안함.
접근 방법
(1) CNNs을 localize(현지화)와 segment objects을 위해 bottom-up region proposals에 적용함.
(2) 라벨된 training data가 부족할 때, 도메인에 특화된 미세조정(fine-tuning)에 의해 따라오는 보조 작업을 위해 사전-학습을 지도는 상당한 성능 향상을 가져옴.
CNNs와 지원 제안을 결합하므로, 이 방법을 R-CNN이라 부름 (Regions with CNN features).
OverFeat(CNN과 비슷한 구조 기반으로 sliding-window detector를 제안한 모들)과 비교함.
R-CNN은 200-class ILSVRC2013 detection에서 큰 차이로 OverFeat의 성능이 좋음을 찾음.

Figures and Tables

Introduction

CNN 분류 결과(ImageNet을 활용)가 PASCAL VOC Challenge의 객체 탐지(object detection)결과에 얼마나 일반화할 수 있는지에 대한 의문임.
이미지 분류와 객체 탐지 사이의 차이를 연결하여 문제를 대답함.
1. 깊은 신경망으로 객체를 지역화함.
1. 주석화된 작은 양의 탐지 테이터를 high-capacity model로 훈련시킴.
두 번째 도전과제는 큰 CNN을 학습 시킬 충분히 이용할 라벨된 데이터의 부족임. 이 문제의 편리한 해결은 unsupervised pre-training을 사용하는 것임.
큰 보조 데이터(ILSVRC)의 사전 학습을 작은 데이터셋 (PASCAL)에 fine-tuning을 함.
fine-tuning 이후, VOC 2010에서 mAP 54%로 HOG 기반 deformable part model(DPM)의 33% mAP를 비교하면 성능이 향상됨

Conclusions

PASCAL VOC 2012의 이전 결과 보다 상대적으로 30%나 향상시킴.
Insights:
1. high-capacity convolutional neural networks를 지역화와 segment objects를 위해 bottom-up region proposals에 적용함.
2. 라벨된 학습 데이터가 부족할 때 학습시킨 CNNs를 활용하는 것임 (다양한 부족한 학습 데이터를 가진 컴퓨터 비전 문제에 효과적일 것으로 추정함).
컴퓨터 비전의 고전적인 도구와 딥러닝(bottom-up region proposals과 CNN)을 결합하여 사용하여 결과를 얻음.

2. Object detection with R-CNN

3개의 모듈로 구성됨
1. category-independent region proposals.
(These proposals define the set of candidate detections available to our detector.)
2. a large CNN (각 region에서 a fixed-length feature vector를 추출)
3. 고전적 특별한 linear SVMs의 집합(set).

2.1. Module design

Region proposals.
Feature extraction.
We extract a 4096-dimensional fea- ture vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five con- volutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.
Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size.
Prior to warping, we dilate the tight bounding box so that at the warped size there are ex- actly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions.

2.3. Training
Supervised pre-training.
We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 clas- sification) using image-level annotations only (bounding- box labels are not available for this data).

Domain-specific fine-tuning.
we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals.
For VOC, N = 20 and for ILSVRC2013, N = 200. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.
We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization.
In each SGD iteration, we uni- formly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.

Object category classifiers.
3.1. Visualizing learned features
The pool5 feature map is 6 × 6 × 256 = 9216- dimensional.
Ignoring boundary effects, each pool5 unit has a receptive field of 195×195 pixels in the original 227×227 pixel input.
3.2. Ablation studies
3.3. Network architectures
The network has a homogeneous structure consisting of 13 layers of 3 × 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers.
We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
To use O-Net in R-CNN, we downloaded the pub- licly available pre-trained network weights for the VGG ILSVRC 16 layers model from the Caffe Model Zoo.1 We then fine-tuned the network using the same pro- tocol as we used for T-Net. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory.
The results in Table 3 show that R- CNN with O-Net substantially outperforms R-CNN with T- Net, increasing mAP from 58.5% to 66.0%.
However there is a considerable drawback in terms of compute time, with the forward pass of O-Net taking roughly 7 times longer than T-Net.
4.2. Region proposals
We followed the same region proposal approach that was used for detection on PASCAL.
4.3. Training data
Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training.
5. Semantic segmentation
CNN features for segmentation.
all of which begin by warping the rectangular window around the re- gion to 227 × 227.

Jonghyeon Sim

이전 포스트

(논문 리뷰, 코드) Going deeper with convolutions (2014)

다음 포스트