Paper: https://arxiv.org/pdf/1311.2524


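The question is how well CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge.
The paper answers it by bridging the gap between image classification and object detection.
1. Localizing objects with a deep network.
The second challenge is the scarcity of labeled data for training a large CNN. The conventional solution to this problem is unsupervised pre-training.
Instead, the paper pre-trains with supervision on a large auxiliary dataset (ILSVRC) and then fine-tunes on a small dataset (PASCAL).
After fine-tuning, the system reaches 54% mAP on VOC 2010, compared with 33% mAP for the HOG-based deformable part model (DPM).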
The system consists of three modules:
1. category-independent region proposals.
(These proposals define the set of candidate detections available to our detector.)
2. a large CNN (extracts a fixed-length feature vector from each region)
3. a set of class-specific linear SVMs.
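The three modules above can be chained as in the minimal sketch below. The names `detect`, `propose`, `extract`, and `svms` are hypothetical placeholders, not the authors' code: the proposal method and the CNN are passed in as plain callables, and each SVM is assumed to be a (weights, bias) pair. Later test-time steps such as non-maximum suppression are left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def detect(
    image,
    propose: Callable[[object], List[Box]],
    extract: Callable[[object, Box], Sequence[float]],
    svms: Dict[str, Tuple[Sequence[float], float]],
) -> List[Tuple[Box, str, float]]:
    """Score every proposed region with every class-specific linear SVM."""
    detections = []
    for box in propose(image):            # module 1: region proposals
        feat = extract(image, box)        # module 2: fixed-length CNN feature
        for cls, (w, b) in svms.items():  # module 3: one linear SVM per class
            score = sum(wi * fi for wi, fi in zip(w, feat)) + b
            detections.append((box, cls, score))
    return detections
```

Passing the modules as arguments keeps the sketch independent of any particular proposal algorithm or network.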
Region proposals.
Feature extraction.
We extract a 4096-dimensional feature vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.
Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size.
Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions.
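The dilation amount follows from the warp scale: if the warped image is 227 pixels wide and p pixels on each side are context, the original box must map onto the remaining 227 − 2p pixels. Below is a minimal sketch of that arithmetic. The function name `dilate_box` and the (x1, y1, x2, y2) coordinate convention are assumptions, and boxes that end up extending past the image border are not handled.

```python
def dilate_box(x1, y1, x2, y2, p=16, out_size=227):
    """Expand a tight box so that, after warping to out_size x out_size,
    p pixels on each side are context from outside the original box."""
    w, h = x2 - x1, y2 - y1
    inner = out_size - 2 * p  # warped pixels covered by the original box
    pad_x = p * w / inner     # context width in original-image pixels
    pad_y = p * h / inner     # context height in original-image pixels
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)
```

Because the warp scales width and height independently, the padding in original-image pixels generally differs between the two axes.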

2.3. Training
Supervised pre-training.
We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).
Domain-specific fine-tuning.
We continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals.
For VOC, N = 20 and for ILSVRC2013, N = 200, where N is the number of object classes. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.
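The labeling rule can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: boxes are (x1, y1, x2, y2) tuples with continuous coordinates, `iou` and `label_proposal` are hypothetical names, and a proposal that overlaps several ground-truth boxes is assigned to the one with the highest IoU.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, ground_truth, threshold=0.5):
    """Return the class of the best-overlapping ground-truth box if its IoU
    reaches the threshold, otherwise "background"."""
    best_cls, best_iou = "background", 0.0
    for gt_box, gt_cls in ground_truth:
        overlap = iou(proposal, gt_box)
        if overlap >= threshold and overlap > best_iou:
            best_cls, best_iou = gt_cls, overlap
    return best_cls
```

For example, `label_proposal((0, 0, 10, 10), [((0, 0, 10, 20), "cat")])` has an IoU of exactly 0.5 and is therefore labeled `"cat"`.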
We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization.
In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.
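A minimal sketch of this sampling scheme is shown below. `sample_minibatch` is a hypothetical helper, and drawing without replacement is an assumption, so each pool must hold at least as many windows as are requested from it. With the defaults, positives make up 32/128 = 25% of every batch.

```python
import random


def sample_minibatch(positives, backgrounds, n_pos=32, n_bg=96, rng=random):
    """Draw n_pos positive and n_bg background windows uniformly at random
    (without replacement) and return them as one shuffled mini-batch."""
    batch = rng.sample(positives, n_pos) + rng.sample(backgrounds, n_bg)
    rng.shuffle(batch)
    return batch
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.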
Object category classifiers.
3.1. Visualizing learned features
The pool5 feature map is 6 × 6 × 256 = 9216-dimensional.
Ignoring boundary effects, each pool5 unit has a receptive field of 195×195 pixels in the original 227×227 pixel input.
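Both numbers can be checked by tracing kernel sizes and strides through the network. The sketch below assumes the standard layer hyperparameters of the Krizhevsky et al. architecture (they are not listed in these notes) and ignores padding when accumulating the receptive field, which corresponds to ignoring boundary effects.

```python
# (name, kernel, stride, padding) for the layers up to pool5.
# Assumed values: the standard Krizhevsky et al. configuration.
LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace(input_size=227, layers=LAYERS):
    """Return (output spatial size, receptive field) after the last layer."""
    size, rf, jump = input_size, 1, 1
    for _name, k, s, p in layers:
        size = (size + 2 * p - k) // s + 1  # output width of this layer
        rf += (k - 1) * jump                # growth of the receptive field
        jump *= s                           # input pixels per output step
    return size, rf


size, rf = trace()
print(size, size * size * 256, rf)  # 6 9216 195
```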
3.2. Ablation studies
3.3. Network architectures
The network has a homogeneous structure consisting of 13 layers of 3 × 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers.
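The layer counts can be written out as a small configuration list. The channel widths and fully-connected sizes below are assumed from the standard 16-layer VGG configuration and do not appear in these notes; `O_NET_CONV` and `O_NET_FC` are hypothetical names.

```python
# An integer is a 3 x 3 convolution with that many output channels;
# "M" is a max pooling layer. Widths are assumed (standard VGG-16).
O_NET_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
O_NET_FC = [4096, 4096, 1000]

n_conv = sum(1 for v in O_NET_CONV if v != "M")
n_pool = O_NET_CONV.count("M")
n_weight_layers = n_conv + len(O_NET_FC)
print(n_conv, n_pool, n_weight_layers)  # 13 5 16
```

The 13 convolutional layers plus the 3 fully-connected layers give the 16 weight layers that the model name refers to.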
We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
To use O-Net in R-CNN, we downloaded the publicly available pre-trained network weights for the VGG ILSVRC 16 layers model from the Caffe Model Zoo. We then fine-tuned the network using the same protocol as we used for T-Net. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory.
The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
However, there is a considerable drawback in terms of compute time, with the forward pass of O-Net taking roughly 7 times longer than T-Net.
4.2. Region proposals
We followed the same region proposal approach that was used for detection on PASCAL.
4.3. Training data
Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training.
5. Semantic segmentation
CNN features for segmentation.
all of which begin by warping the rectangular window around the region to 227 × 227.