VGGNet Paper Review

VGGNet: Reading the Paper Line by Line, Word by Word

Karen Simonyan, Andrew Zisserman.
Very Deep Convolutional Networks for Large-Scale Image Recognition.
ICLR 2015. arXiv:1409.1556

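Preface

VGGNet is usually summarised in one line: "a network that stacks 3×3 convs deeply." But this paper cannot be closed out with that one line. Why 3×3? Why use 1×1 in only one configuration? Why drop LRN? Why was pre-initialisation needed? Why does scale jittering help? What happens when FC layers are turned into convs? Why are dense and multi-crop evaluation complementary? The paper has an answer, backed by experiments, to every one of these questions.

This post follows the paper from start to finish, line by line, and asks "why is this sentence here?" The goal is an understanding solid enough that no question in a talk or an interview leaves you stuck.

TL;DR

Holding everything else fixed and varying only depth, the paper runs a controlled study showing that stacking 3×3 convs to 16–19 layers gives state-of-the-art single-network results on ILSVRC-2014. The key points are (1) an argument, by parameter count and by experiment, that a stack of small filters beats one large filter, (2) multi-scale augmentation through scale jittering, (3) a test-time technique that converts FC layers to convs so the network can be applied densely to inputs of any size, and (4) learned features that transfer well to other datasets and tasks.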

0. Abstract: one paragraph, five sentences

The abstract is a single paragraph, but every key point is in it. Let's take it one sentence at a time.

(1) "In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting."

키워드: investigate, effect, depth, accuracy. 이 문장은 논문 전체의 research question을 단도직입적으로 박는다. "depth가 accuracy에 미치는 effect" — 즉 다른 모든 것은 잡아두고 depth만 움직였을 때 무슨 일이 벌어지는가. 이게 통제 실험(controlled study)임을 첫 줄에서 선언한다.

(2) "Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers."

키워드: thorough evaluation, very small (3×3), pushing the depth to 16–19.

"thorough evaluation" — A부터 E까지 6개 구성을 만들어 비교하겠다는 예고.
"very small (3×3)" — 작은 필터가 핵심 설계 선택임을 명시.
"16–19 weight layers" — 단, weight를 가진 layer만 카운트. pooling과 ReLU는 빼고 conv + FC만 센다. VGG-16은 conv 13 + FC 3, VGG-19는 conv 16 + FC 3.

(3) "These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively."

localisation 1등, classification 2등. 보통은 GoogLeNet에게 진 2등으로만 기억되지만, localisation에서는 우승한 사실을 abstract에 박아둔다. Appendix A에서 자세히 다뤄지는데, 이건 단순 자랑이 아니라 "VGG의 깊은 representation은 분류만 잘하는 게 아니다"라는 메시지의 복선이다.

(4) "We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results."

generalise well. 이게 곧 transfer learning의 baseline을 정립한 문장. VOC, Caltech 등에서도 SOTA — Appendix B의 예고편이다.

(5) "We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision."

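Model release: the two best-performing trained networks were made public. The impact was enormous. For years afterwards, detection and segmentation papers such as Fast R-CNN, Faster R-CNN, FCN and SSD started from a VGG-16 backbone.

In this single paragraph, six things are already declared: depth as the variable, small 3×3 filters, 16–19 layers, results on two tracks, generalisation, and the model release. This is what a good abstract looks like.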

1. Introduction — 흐름과 동기를 어떻게 설계했나

Introduction은 다섯 단락으로 구성되어 있는데, 각 단락이 정확히 하나의 일을 한다.

단락 1: ConvNet이 왜 가능해졌나

"Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition…"

ConvNet 성공의 3대 원인을 명시한다:
1. 대규모 public 이미지 저장소: ImageNet
2. 고성능 컴퓨팅 시스템: GPU와 large-scale 분산 클러스터
3. 벤치마크: ILSVRC

이게 단순한 nice-to-have 배경이 아니라, "본 논문이 가능한 이유"의 전제다. ImageNet과 GPU가 없으면 19층을 학습시킬 수도, 검증할 수도 없다.

단락 2: AlexNet의 등장과 그 이후의 개선 시도

"In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)…"

ILSVRC가 사실상 deep ConvNet의 testbed였음을 정리한 후, AlexNet(Krizhevsky 2012) 이후 개선 시도들을 세 갈래로 정리한다:

(a) 1st conv layer의 receptive window를 줄이고 stride를 작게: ZFNet(Zeiler & Fergus 2013), OverFeat(Sermanet 2014). AlexNet은 첫 layer가 11×11 stride 4였는데 이게 너무 거칠다는 문제.
(b) 학습/평가 시 multi-scale 사용: Sermanet 2014, Howard 2014. dense하게 여러 scale에서 평가.
(c) 본 논문의 갈래 — depth를 늘리는 것: 위 두 갈래는 건드리지 않고, 다른 모든 파라미터를 고정한 채 weight layer 수만 늘림. 이게 가능한 이유는 모든 conv를 3×3으로 통일하기 때문.

이 단락이 사실상 방법론의 절반을 미리 설명한다. "왜 3×3인가?"라는 질문의 답이 여기 숨어있다 — 모든 conv를 동일한 크기(3×3)로 두면 깊이만 변수로 둘 수 있어서.

단락 3: 본 논문의 contribution과 의의 예고

"As a result, we come up with significantly more accurate ConvNet architectures…"

ILSVRC 1, 2등 성과 + 다른 데이터셋 일반화 + 두 모델 공개. Abstract의 거울처럼 반복.

단락 4 & 5: 논문 구조 안내

"The rest of the paper is organised as follows…"

Sect 2 → 구조, Sect 3 → 학습/평가 세팅, Sect 4 → ILSVRC 결과, Sect 5 → 결론. Appendix A는 localisation, Appendix B는 generalisation에 대한 별도 분석이라는 것도 짚어준다. Appendix지만 분량이 매우 크고, 사실상 별개의 작은 논문 두 편이 붙어있는 셈이다.

2. ConvNet Configurations — 설계의 모든 결정 분해

논문의 핵심 챕터. 2.1 Architecture, 2.2 Configurations, 2.3 Discussion으로 나뉜다.

2.1 Architecture — 한 줄씩

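This section defines the skeleton shared by all VGG configurations, one design decision at a time: input, pre-processing, filter size, stride and padding, pooling, FC layers and activation, and normalisation.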

입력

"During training, the input to our ConvNets is a fixed-size 224×224 RGB image."

왜 224? AlexNet 이후 ILSVRC 사실상의 표준. 256으로 isotropic rescale 후 224×224 random crop이 관례.

전처리

"The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel."

픽셀 단위 평균이 아니라 전역 mean RGB 벡터 (3차원)다. 색 채널마다 하나씩, 총 3개 숫자만 빼준다. 이미지마다 다른 평균을 빼는 게 아니라 train set 전체의 평균을 일관되게 뺀다.

Conv filter 크기 — 핵심 선언

"The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center)."

왜 3×3이 "the smallest size to capture left/right, up/down, center"인가? 1×1은 중심 한 점만 보고 공간 정보가 없다. 2×2는 대칭 중심이 없고 좌/우, 상/하 개념이 모호하다. 3×3이 되어야 비로소 중심 픽셀과 그 8-이웃이 모두 표현 가능해진다. 이게 "공간 정보를 잡는 최소 단위"라는 의미.

"In one of the configurations we also utilise 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity)."

1×1은 공간 정보 X, 채널 방향 선형 변환 + ReLU. 이게 Configuration C에서만 등장한다. 자세한 건 2.3에서.

Stride, padding

"The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers."

stride=1, padding=1. 3×3 conv를 거쳐도 feature map 크기가 변하지 않도록 한다. 이건 매우 중요한 결정인데, 깊은 net에서 해상도를 유지해야 정보 손실이 적기 때문이다. 해상도를 줄이는 일은 오로지 max-pooling만 담당.

수식으로 확인: 입력 크기 $H$ , 필터 $f=3$ , padding $p=1$ , stride $s=1$ 일 때 출력 크기는
$\left\lfloor \frac{H + 2p - f}{s} \right\rfloor + 1 = H - 3 + 2 + 1 = H$
입력과 출력이 같다.

Pooling

"Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2."

5번의 max-pooling, 2×2 window, stride 2. 매 pool마다 해상도가 절반.

처음 224에서 시작해 5번의 절반 → $224 \to 112 \to 56 \to 28 \to 14 \to 7$ . 그래서 마지막 conv 출력이 7×7이 되고, 이게 나중에 FC→Conv 변환에서 7×7 conv로 매핑되는 이유다.
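The output-size formula and the pooling chain above can be checked with a few lines of Python. This is a minimal sketch; the helper name `conv_out` is made up for illustration and is not from the paper.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# 3x3 conv with stride 1 and padding 1 keeps the resolution unchanged.
same = conv_out(224, kernel=3, stride=1, padding=1)

# Five 2x2 max-pools with stride 2 halve the side each time.
sizes = [224]
for _ in range(5):
    sizes.append(conv_out(sizes[-1], kernel=2, stride=2))

print(same)   # 224
print(sizes)  # [224, 112, 56, 28, 14, 7]
```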

"not all conv. layers are followed by max-pooling" — 이게 중요한 디테일. 모든 conv 뒤에 pooling을 두면 layer를 많이 못 쌓는다. VGG는 pool 사이에 conv를 2~4개씩 연속해서 쌓는 "block" 구조를 만든다.

FC와 활성화

"A stack of convolutional layers is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks."

FC는 4096-4096-1000으로 모든 구성에서 동일. 이건 의도된 통제 — depth만 conv 쪽에서 바뀌고 FC는 일정.

"All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity."

ReLU. AlexNet 이후 표준.

LRN을 (거의) 안 쓴다

"We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time."

LRN 무용론. AlexNet에서는 의무적으로 썼던 LRN을 빼버린다. A-LRN config가 만들어진 유일한 이유는 "LRN이 효과 없음"을 실험으로 보여주기 위함. 4.1에서 A(29.6%) vs A-LRN(29.7%)로 거의 동일함이 확인되고, 이후 모든 구성에서 LRN을 안 쓴다.

2.2 Configurations — Table 1 정독

여섯 가지 구성 A, A-LRN, B, C, D, E. Table 1을 그대로 읽으면:

Config	Weight layers	Conv 변화
A	11	baseline
A-LRN	11	A + LRN
B	13	Block 1, 2에 conv를 하나씩 추가
C	16	Block 3, 4, 5에 1×1 conv 추가
D	16	C의 1×1을 3×3으로 교체 → VGG-16
E	19	D의 Block 3, 4, 5에 3×3 conv 한 개씩 더 → VGG-19

채널 수 규칙: 첫 layer 64로 시작 → max-pool 통과할 때마다 2배 → 최대 512에서 정지.

수식: 채널 수 $C_k = \min(64 \cdot 2^k, 512)$ . 5번의 pool을 거치는 동안 $64 \to 128 \to 256 \to 512 \to 512$ (마지막은 capping).

Table 2: 파라미터 수

Config	Parameters
A, A-LRN	133M
B	133M
C	134M
D	138M
E	144M

11층에서 19층으로 늘려도 파라미터는 8%만 증가. 이게 가능한 이유는 (a) 모든 conv가 3×3으로 작고, (b) 파라미터의 대부분이 FC에 몰려있어서 ( $\approx$ 123M) conv를 더 쌓아도 비중이 작기 때문이다.
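The numbers in Table 2 can be reproduced by counting weights directly. The sketch below transcribes the per-block conv layers of Table 1 as `(kernel_size, out_channels)` pairs and includes biases; the names `CONFIGS`, `FC_SIZES` and `count` are illustrative, not from the paper.

```python
# Conv layers per block as (kernel_size, out_channels), transcribed from Table 1.
CONFIGS = {
    "A": [[(3, 64)], [(3, 128)], [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "B": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "C": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2 + [(1, 256)],
          [(3, 512)] * 2 + [(1, 512)], [(3, 512)] * 2 + [(1, 512)]],
    "D": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 3, [(3, 512)] * 3, [(3, 512)] * 3],
    "E": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 4, [(3, 512)] * 4, [(3, 512)] * 4],
}
FC_SIZES = [4096, 4096, 1000]

def count(config, in_channels=3, final_map=7):
    """Return (weight_layers, conv_params, fc_params), biases included."""
    layers, conv_params, c_in = 0, 0, in_channels
    for block in config:
        for k, c_out in block:
            conv_params += k * k * c_in * c_out + c_out
            c_in = c_out
            layers += 1
    fc_params, n_in = 0, final_map * final_map * c_in  # 7 x 7 x 512 inputs
    for n_out in FC_SIZES:
        fc_params += n_in * n_out + n_out
        n_in = n_out
        layers += 1
    return layers, conv_params, fc_params

for name, cfg in CONFIGS.items():
    layers, conv, fc = count(cfg)
    print(name, layers, f"{(conv + fc) / 1e6:.0f}M total, {fc / 1e6:.1f}M in FC")
```

The FC part is identical across configurations (about 123.6M), so the whole difference between A and E comes from roughly 11M extra conv weights.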

Sermanet의 OverFeat 1-net이 이미 144M이었음 — VGG는 그 정도 규모로 훨씬 더 깊은 모델을 만든 것.

2.3 Discussion — 논문의 진짜 심장

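If you do not understand this section, you have not understood VGG. It argues, with both a parameter count and intuition, why a deep stack of 3×3 filters is better than a single large filter; the experiments in Sect. 4 then back the argument up.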

Receptive Field 분석

"Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net…"

"It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field."

Effective Receptive Field 일반화 공식: $n$ 개의 3×3 conv를 stride 1로 쌓으면 effective RF는
$(2n+1) \times (2n+1)$

$n=1$ : 3×3
$n=2$ : 5×5
$n=3$ : 7×7
$n=4$ : 9×9

증명 직관: stride=1이면 한 layer를 통과할 때마다 양옆으로 1픽셀씩 RF가 확장. 시작 3×3, 매 layer마다 +2.
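The $(2n+1)$ rule is a special case of the usual receptive-field recursion: each layer adds `(kernel - 1)` times the current sampling step. A minimal sketch follows; `effective_rf` is an illustrative helper, not something defined in the paper.

```python
def effective_rf(layers):
    """Effective receptive field of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

for n in range(1, 5):
    print(n, effective_rf([(3, 1)] * n))  # 3, 5, 7, 9, i.e. 2n + 1
```

With stride 1 everywhere `jump` stays 1, so every 3×3 layer adds exactly 2, which is the intuition stated above.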

그래서 무엇이 좋은가 — 세 가지 이점

논문 본문 그대로:

"So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3×3 convolution stack has C channels, the stack is parametrised by $3(3^2C^2) = 27C^2$ weights; at the same time, a single 7×7 conv. layer would require $7^2C^2 = 49C^2$ parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7×7 conv. filters, forcing them to have a decomposition through the 3×3 filters (with non-linearity injected in between)."

세 가지로 정리:

① 더 많은 비선형성

7×7 한 번 = ReLU 1회
3×3 세 번 = ReLU 3회

ReLU가 많을수록 표현 가능한 함수 공간이 풍부해진다. 한 번의 비선형 변환보다 세 번의 합성 비선형 변환이 더 복잡한 결정 경계를 그릴 수 있다.

② 더 적은 파라미터 (수학적 증명)

가정: 입력 채널 수 = 출력 채널 수 = $C$ , 같은 RF (7×7)를 만들도록 구성.

단일 7×7 conv: $7 \times 7 \times C \times C = 49C^2$
세 개 3×3 conv stack: $3 \times (3 \times 3 \times C \times C) = 3 \times 9C^2 = 27C^2$

비율: $\frac{49 - 27}{27} = \frac{22}{27} \approx 0.815$ .

→ 단일 7×7은 3-stack 3×3 대비 81% 더 많은 파라미터를 사용.
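The same arithmetic in code, as a quick sanity check. The channel count `C = 256` is an arbitrary choice for illustration; the ratio does not depend on it.

```python
def stack_weights(n_layers, kernel, channels):
    """Weights (no biases) in n_layers of kernel x kernel conv with C in/out channels."""
    return n_layers * kernel * kernel * channels * channels

C = 256  # any channel count gives the same ratio
three_3x3 = stack_weights(3, 3, C)  # 27 C^2
one_7x7 = stack_weights(1, 7, C)    # 49 C^2
extra = one_7x7 / three_3x3 - 1     # 22/27, the "81% more" in the paper
print(f"{extra:.3f}")
```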

③ 암묵적 regularization

"7×7 conv 필터에 '3개의 3×3 합성으로 분해 가능해야 한다'는 제약을 부과하는 것과 같다." 자유도가 제한된 가설 공간에서 최적화를 수행하므로 overfitting에 강해진다. 명시적 weight decay가 아니라 구조 자체가 regularizer 역할을 한다.

셋이 동시에 작동한다: 파라미터는 줄어들고(②), 비선형성은 늘어나고(①), 제약은 추가된다(③). 이게 VGG의 본질.

1×1 Convolution의 의미

"The incorporation of 1×1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1×1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the 'Network in Network' architecture of Lin et al. (2014)."

1×1 conv는:

공간적으로는 한 픽셀 (RF가 1)
채널 방향의 선형 변환 (linear projection)
뒤따라오는 ReLU 덕에 비선형성 +1
VGG-C에서는 입출력 채널 수가 같아서 차원 축소 효과 X (GoogLeNet에서는 채널 축소용으로 사용)

C 구성을 통해 이걸 검증하려는 것 — "비선형성만 추가하면 좋을까, 아니면 공간 정보(3×3)까지 잡아야 할까?" 4.1 결과로 답: 공간 정보가 더 중요(D > C).

3. Classification Framework — 학습과 평가를 어떻게 했나

3.1 Training — 하이퍼파라미터 한 줄씩

"The training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum."

Objective: multinomial logistic regression (= softmax cross-entropy).

수식: 클래스가 $K$ 개, 예측 확률 $p_k$ , 정답이 $y$ 일 때
$\mathcal{L} = -\log p_y = -\log \frac{e^{z_y}}{\sum_{k=1}^{K} e^{z_k}}$
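For reference, the loss above can be written in a numerically stable way with the log-sum-exp trick. This is a generic sketch of softmax cross-entropy for a single sample, not code from the authors' Caffe implementation.

```python
import math

def softmax_cross_entropy(logits, target):
    """-log p_target for one sample, computed via log-sum-exp for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# With 1000 equally likely classes the loss is log(1000), about 6.9.
print(softmax_cross_entropy([0.0] * 1000, target=3))
```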

Optimizer: SGD with momentum. Adam이 없던 시절(2012-2014)의 표준.

하이퍼파라미터 표

항목	값	의미
Batch size	256	Larger than AlexNet's 128
Momentum	0.9
Weight decay (L2)	$5 \times 10^{-4}$	overfitting 방지
Dropout	0.5 (앞 두 FC)	FC에만 적용, conv는 X
Initial LR	$10^{-2}$	AlexNet과 동일
LR schedule	val accuracy 정체 시 ×0.1, 총 3번 감소	step decay
Total iterations	370K (74 epochs)
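The "370K iterations = 74 epochs" figure follows from the batch size and the training set size. The sketch below assumes the exact ILSVRC-2012 training set size of 1,281,167 images (the text only says 1.3M); `lr_after` is an illustrative helper for the divide-by-10 schedule.

```python
TRAIN_IMAGES = 1_281_167  # assumption: exact ILSVRC-2012 train size ("1.3M" in the text)
BATCH_SIZE = 256
ITERATIONS = 370_000

epochs = ITERATIONS * BATCH_SIZE / TRAIN_IMAGES
print(round(epochs))  # 74

def lr_after(num_drops, base_lr=1e-2, factor=0.1):
    """Learning rate after validation accuracy has plateaued num_drops times."""
    return base_lr * factor ** num_drops

print([lr_after(n) for n in range(4)])  # 1e-2 down to about 1e-5 after three drops
```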

흥미로운 관찰

"In spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers."

AlexNet(8층, 90 epochs)보다 VGG(19층, 74 epochs)가 더 적은 epoch에 수렴. 깊이가 늘었는데 학습이 빠르다는 게 직관과 반대다. 이유:

(a) 암묵적 regularization이 효율적 학습 유도
(b) pre-initialization (다음 항목)

Pre-initialization — 깊은 net 학습 노하우

"The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets."

이게 ResNet 이전 시대의 가장 큰 골칫거리. 깊은 net에 randomly init하면 gradient가 vanish/explode해서 학습이 안 됨.

해결책 (3단계):

"To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly)."

Step 1: 얕은 모델 A(11층)를 $N(0, 10^{-2})$ random init으로 학습 (bias=0).
Step 2: 깊은 모델 B-E 학습 시 처음 4개 conv layer + 마지막 3개 FC layer를 A의 가중치로 초기화. 중간 layer는 random init.
Step 3: pre-init된 layer의 LR을 줄이지 않음 — 학습 중 변경 허용.

"After the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010)."

논문 제출 후 발견: Xavier/Glorot init만으로도 학습 가능했음. 이게 ResNet(2015)에서는 Kaiming init + BN + residual로 더 잘 해결되고, pre-init이라는 기법 자체는 사라진다. 시대의 흔적인 셈.

Data augmentation — Training image size

이 부분이 살짝 복잡한데, 두 가지 크기 개념을 구분해야 한다:

$S$ = isotropically rescaled training image의 smallest side (training scale)
crop은 항상 $224 \times 224$ , 따라서 $S \geq 224$ 여야 함

"To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration)."

224×224 crop을 image당 매 iteration마다 random 위치에서 뽑는다.

"To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012)."

추가 augmentation: random horizontal flip, random RGB color shift (AlexNet의 PCA color jittering).

$S$ 값을 정하는 두 가지 접근:

Approach 1: Single-scale training

"The first approach is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics)."

$S$ 를 고정. 두 가지 값으로 평가:

$S = 256$ : prior art 표준
$S = 384$ : 더 큰 scale. $S=256$ 으로 학습한 가중치로 초기화하고 더 작은 LR $10^{-3}$ 로 fine-tune.

Approach 2: Multi-scale training (★ Scale Jittering)

"The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] (we used Smin = 256 and Smax = 512)."

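For each training image, $S$ is sampled at random from $[256, 512]$. This is scale jittering.

In formula form (assuming a uniform distribution, which the paper does not state explicitly):
$S \sim \text{Uniform}(256, 512)$, sampled independently per image per iteration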

"Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales."

해석: 같은 객체라도 이미지 안에서 차지하는 크기가 다양함. multi-scale로 학습하면 어떤 크기든 인식 가능.

학습 속도를 위해 $S=384$ 로 사전학습된 single-scale 모델의 모든 layer를 fine-tune.
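A sketch of the sampling step in scale jittering: pick $S$, rescale so the smallest side equals $S$, then choose a random 224×224 crop box. It only computes geometry (no image library), and it assumes uniform sampling over integers; `sample_crop_box` and its return layout are illustrative, not from the paper.

```python
import random

def sample_crop_box(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Pick a training scale S, rescale isotropically, choose a random crop box.

    Returns (S, (new_w, new_h), (left, top, right, bottom)) in rescaled coordinates.
    """
    s = rng.randint(s_min, s_max)       # assumption: uniform over integers
    scale = s / min(width, height)      # the smallest side becomes S
    new_w = max(s, round(width * scale))
    new_h = max(s, round(height * scale))
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return s, (new_w, new_h), (left, top, left + crop, top + crop)

rng = random.Random(0)
print(sample_crop_box(500, 375, rng=rng))
```

Setting `s_min == s_max` recovers single-scale training, where only the crop position varies.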

3.2 Testing — Dense Evaluation의 마법

이 절이 VGG의 또 다른 핵심 기여다. FC layer를 conv로 바꿔서 임의 크기 입력에 한 번에 적용한다.

기본 절차

"At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S."

테스트 이미지를 $Q$ 로 isotropic rescale ( $Q$ 는 $S$ 와 다를 수 있음).

FC → Conv 변환 (OverFeat의 아이디어)

"Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7×7 conv. layer, the last two FC layers to 1×1 conv. layers)."

Train 시	→	Test 시
FC-4096	→	7×7 conv-4096
FC-4096	→	1×1 conv-4096
FC-1000	→	1×1 conv-1000

왜 7×7? 마지막 conv block을 통과한 직후 feature map이 7×7×512 (224 입력 기준). 이 전체를 4096차원 벡터로 mapping하는 FC는 결국 7×7×512×4096 weight = 하나의 7×7 conv-4096과 수학적으로 동등.

왜 1×1? 두 번째와 세 번째 FC는 1D 벡터 위에서 작동 → spatial dim 1×1, 채널만 변환 → 1×1 conv.
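The FC-to-conv equivalence can be verified on toy sizes in pure Python. The sketch below reuses the FC weight rows, unchanged, as 7×7 filters; the dimensions (`C = 2`, `OUT = 3`) are shrunk for illustration, whereas VGG uses 512 input channels and 4096 outputs.

```python
import random

def fc(x_flat, weights, bias):
    """Fully connected layer: weights[o] is a flat list over (c, i, j) inputs."""
    return [sum(w * x for w, x in zip(weights[o], x_flat)) + bias[o]
            for o in range(len(weights))]

def conv_valid(x, weights, bias, k):
    """k x k convolution without padding on x[c][i][j], reusing the FC weights."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = []
    for o in range(len(weights)):
        plane = []
        for top in range(H - k + 1):
            row = []
            for left in range(W - k + 1):
                patch = [x[c][top + i][left + j]
                         for c in range(C) for i in range(k) for j in range(k)]
                row.append(sum(w * p for w, p in zip(weights[o], patch)) + bias[o])
            plane.append(row)
        out.append(plane)
    return out

rng = random.Random(0)
C, K, OUT = 2, 7, 3  # toy sizes; VGG uses C=512, K=7, OUT=4096
weights = [[rng.uniform(-1, 1) for _ in range(C * K * K)] for _ in range(OUT)]
bias = [rng.uniform(-1, 1) for _ in range(OUT)]

# On a 7x7 map the conv has a single output position, identical to the FC output.
x = [[[rng.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(C)]
x_flat = [x[c][i][j] for c in range(C) for i in range(K) for j in range(K)]
fc_result = fc(x_flat, weights, bias)
conv_result = conv_valid(x, weights, bias, K)

# On a larger 9x9 map the same weights produce a 3x3 map per output channel.
big = [[[rng.uniform(-1, 1) for _ in range(9)] for _ in range(9)] for _ in range(C)]
score_map = conv_valid(big, weights, bias, K)
```

The second half is the point of the conversion: nothing is retrained, yet a larger input now yields a spatial map of outputs instead of a shape error.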

그래서 무슨 일이 일어나는가

"The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled)."

전체 fully-conv net이 만들어진다. 입력이 임의 크기든 OK. 출력은 class score map — spatial dim × 1000 channels.

5단계 파이프라인:

Rescale: 이미지를 $Q$ 로 isotropic rescale
FC → Conv 변환: fully-convolutional net으로 변환
Dense apply: 전체 이미지에 sliding 없이 한 번에 conv 적용
Sum-pool: spatial dim 평균 → 1000-D 벡터
Flip avg: 원본 + 좌우반전 score 평균
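How large the class score map gets for a given test image follows from the five poolings and the 7×7 first "FC" conv. A minimal sketch, with illustrative helper names and assuming floor-mode pooling:

```python
def score_map_size(height, width, num_pools=5, first_fc_kernel=7):
    """Spatial size of the class score map for a dense (fully-convolutional) pass."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2  # 2x2 max-pool, stride 2
    return height - first_fc_kernel + 1, width - first_fc_kernel + 1

def spatial_average(score_map):
    """Average a [class][row][col] score map into one score per class."""
    return [sum(map(sum, plane)) / (len(plane) * len(plane[0])) for plane in score_map]

print(score_map_size(224, 224))  # (1, 1): same as the training-time network
print(score_map_size(384, 512))  # (6, 10): 60 positions scored in one pass
```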

장점

"We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image."

"Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop."

AlexNet은 10 crops × forward pass = 10번의 net 실행이 필요했음. Dense는 단 한 번의 forward로 끝남.

Multi-crop도 병행

"At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured."

GoogLeNet과 비교 위해 multi-crop도 수행. 둘은 상보적(complementary):

Dense: 효율적, RF가 자연스럽게 이웃 픽셀 받음 (이미지 내부에서 padding 필요 없음)
Multi-crop: sampling이 fine-grained, crop마다 zero-padding → boundary 처리가 dense와 다름

둘 다 사용하면 boundary 효과가 보완되어 정확도 ↑.

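For reference, VGG evaluates 50 crops per scale (a 5×5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which the paper describes as comparable to the 144 crops over 4 scales used by GoogLeNet (Szegedy et al., 2014).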

3.3 Implementation Details

"Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above)."

C++ Caffe 기반, multi-GPU 지원으로 수정
4× NVIDIA Titan Black (6GB): 당시 컨슈머급 최고 GPU
Data parallelism — batch를 4등분, 각 GPU가 gradient 계산, synchronous로 평균
3.75× speedup (4 GPU 대비 이론치 4배에서 약간 손실)
단일 net 학습: 2-3 주
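Synchronous data parallelism gives exactly the same gradient as a single-GPU pass over the full batch, because the mean of equally sized shard means is the overall mean. The toy check below uses a one-parameter linear model with squared error; the model and helper names are purely illustrative.

```python
def grad_mse_linear(w, batch):
    """Gradient of mean squared error for y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, num_gpus=4):
    """Split the batch evenly, compute per-shard gradients, then average them."""
    shard = len(batch) // num_gpus
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_gpus)]
    return sum(grad_mse_linear(w, s) for s in shards) / num_gpus

batch = [(float(i % 7), float(i % 5)) for i in range(256)]
full = grad_mse_linear(0.3, batch)
split = data_parallel_grad(0.3, batch)
efficiency = 3.75 / 4  # reported speedup relative to the ideal 4x
```

The reported 3.75× speedup corresponds to a parallel efficiency of about 94%.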

지금 기준으로는 비효율적이지만 2014년에는 cutting-edge. 학습 시간이 길었기에 신중한 실험 설계와 pre-initialization 같은 트릭이 절실했다.

4. Classification Experiments — 결과와 그 해석

데이터셋

ILSVRC-2012:

1000 class
Train: 1.3M, Val: 50K, Test: 100K (라벨 비공개)

평가지표: top-1 error (multi-class 분류 정확도), top-5 error (top-5 예측 안에 정답 있는지). ILSVRC 공식 지표는 top-5.
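Both metrics are the same computation with a different `k`. A minimal sketch, where `scores` is a list of per-class score lists and `labels` the true class indices (names chosen for illustration):

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    wrong = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        wrong += label not in top_k
    return wrong / len(labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [2, 0]
print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=2))  # 0.0
```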

4.1 Single Scale Evaluation — Table 3

테스트 시 single scale $Q$ 적용. fixed $S$ 의 경우 $Q=S$ , jittered $S$ 의 경우 $Q=0.5(S_{\min}+S_{\max})=384$ .

핵심 결과 (top-5 best per config):

Config	Best validation top-5 error (%)
A (11)	10.4
A-LRN (11)	10.5
B (13)	9.9
C (16)	8.8
D (16)	8.1
E (19)	8.0

다섯 가지 관찰

① LRN은 효과 없음

A (29.6% top-1) vs A-LRN (29.7%). 사실상 동일. 결론: 이후 B-E에서 LRN 미사용. memory와 compute만 늘릴 뿐.

② 깊이 ↑ → 에러 ↓

11(A) → 13(B) → 16(C, D) → 19(E)로 단조 감소. Depth가 핵심임을 정량적으로 입증.

③ C < D의 의미

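Both have 16 layers, but D (all 3×3) beats C (three 1×1 layers) in every training setting of Table 3, for example 8.1% vs 8.8% top-5 and 25.6% vs 27.3% top-1 with scale jittering. Depth and the number of non-linearities are the same, so the gap says that convs capturing spatial context (3×3) matter more than pure channel mixing (1×1).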

④ Saturation at 19 layers

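D (16) and E (19) are nearly tied (top-5: 8.1 vs 8.0). The paper reads this as the error saturating at 19 layers on this dataset, while noting that even deeper models might be beneficial for larger datasets. ResNet (2015) later took up exactly this question: how to make much deeper networks trainable.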

⑤ Scale jittering 효과

학습 시 $S \in [256, 512]$ (multi-scale) vs $S = 256$ 고정:

E의 top-1: 27.3% → 25.5% (+1.8%p 향상)

Test가 single-scale이었음에도 학습 augmentation의 효과가 크다.

추가 실험: 5×5 vs 깊은 3×3 stack

"We also compared the net B with a shallow net with five 5×5 conv. layers, which was derived from B by replacing each pair of 3×3 conv. layers with a single 5×5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a centre crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters."

같은 RF, 다른 깊이: B의 3×3 pair를 5×5 하나로 교체 → top-1 에러가 7% 더 높음.

→ 논문의 가설 "깊고 작은 필터 > 얕고 큰 필터" 재확인.

4.2 Multi-Scale Evaluation — Table 4

이제 test에서도 multi-scale 적용. 여러 $Q$ 값에서 softmax 평균.

Fixed $S$ : $Q = \{S-32, S, S+32\}$
Jittered $S \in [S_{\min}, S_{\max}]$ : $Q = \{S_{\min}, 0.5(S_{\min}+S_{\max}), S_{\max}\}$
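The rule for choosing test scales from the training scale is simple enough to state as a function. A sketch, with the illustrative name `eval_scales`; a fixed $S$ is passed as an int and a jittered range as a `(s_min, s_max)` tuple.

```python
def eval_scales(train_scale, multi_scale=True):
    """Test scales Q for a fixed S (int) or a jittered S given as (s_min, s_max)."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        mid = (s_min + s_max) // 2
        return [s_min, mid, s_max] if multi_scale else [mid]
    s = train_scale
    return [s - 32, s, s + 32] if multi_scale else [s]

print(eval_scales(256))         # [224, 256, 288]
print(eval_scales((256, 512)))  # [256, 384, 512]
```

With `multi_scale=False` it returns the single-scale setting of Sect. 4.1: $Q = S$ for fixed $S$, and $Q = 384$ for $S \in [256, 512]$.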

Train	Test Q	Top-1	Top-5
D / 256	224,256,288	26.6	8.6
D / [256,512]	256,384,512	24.8	7.5
E / [256,512]	256,384,512	24.8	7.5

Multi-scale testing improves on single-scale testing by about 0.5%p top-5 (0.7%p top-1): E, the best single-scale model at 8.0% top-5 (25.5% top-1), drops to 7.5% (24.8%).

학습 시 $S$ 를 jitter한 모델이 테스트 시 multi-scale에 더 잘 반응.

4.3 Multi-Crop Evaluation — Table 5

Dense vs Multi-crop vs 둘 다 사용 (Net E, $S \in [256, 512]$ ):

Eval	Top-1	Top-5
Dense	24.8	7.5
Multi-crop	24.6	7.4
Both	24.4	7.1

Multi-crop이 dense보다 약간 좋고, 둘을 합치면 가장 좋음. Boundary 처리가 서로 보완되기 때문.

4.4 ConvNet Fusion — Table 6

여러 모델의 soft-max 확률 평균 (ensemble).

ILSVRC 제출: 7 net 앙상블 → top-5 7.3%
Post-submission: D + E 2 net (multi-crop & dense) → 6.8%
단일 모델 최고: E (multi-crop & dense) → 7.0%

흥미로운 사실: VGG 2-net (6.8%)는 GoogLeNet 7-net (6.7%)에 0.1%p 차이로 따라붙음. 모델 수는 적은데 성능은 거의 비등.

4.5 SOTA 비교 — Table 7

Method	# nets	Top-5 test error
AlexNet (2012)	5	16.4%
OverFeat (2013)	7	13.6%
Clarifai (2013)	multi	11.7%
MSRA (He'14)	11	8.1%
GoogLeNet (2014)	7	6.7%
VGG (2014 submission)	7	7.3%
VGG (post-submit)	2	6.8%

ILSVRC-2014 1위는 GoogLeNet. VGG는 2위. 하지만:

단일 모델 기준 VGG > GoogLeNet (7.0% vs 7.9%)
VGG는 구조가 훨씬 단순함 — Inception module, auxiliary classifier 등 GoogLeNet의 복잡성 없이 단순 stacked conv만으로 거의 동일한 결과

이게 VGG가 살아남은 이유다. 단순하면 디버깅하기 쉽고, 변형하기 쉽고, 다른 태스크에 적용하기 쉽다.

5. Conclusion — 다시 짚는 요점

"In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth."

세 가지 메시지:
1. Depth가 분류 정확도에 도움된다 — controlled study로 증명.
2. 전통적 ConvNet 구조(LeCun 1989, AlexNet 2012)로도 Inception 같은 새로운 토폴로지 없이 SOTA 달성 가능.
3. 단순함의 힘 — 모든 conv를 3×3으로 통일하는 것만으로 충분.

"In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations."

Appendix A, B의 결과로 "깊은 representation"의 중요성을 한 번 더 강조.

Appendix A — Localization

Task

"Object localisation is a special case of object detection, where the goal is to predict a single object bounding box for each of the top-5 classes, irrespective of the actual number of objects of the class."

Detection이 아니라 단일 bounding box 예측 — 클래스마다 하나의 bbox만 내놓는 단순화된 task.

Localization ConvNet

Backbone: VGG-D (16층, 분류에서 가장 좋았던 모델)
변경점: 마지막 layer를 class score → bounding box 좌표 (center $x$ , $y$ , width $w$ , height $h$ )

Loss

"We use the Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth."

분류는 cross-entropy였지만 localization은 회귀 문제이므로 Euclidean loss (= L2 loss):
$\mathcal{L} = \sum_{i \in \{x, y, w, h\}} (\hat{b}_i - b_i)^2$

두 가지 회귀 방식

"We consider two variants: Single-Class Regression (SCR) where the bounding box prediction is shared across all classes (the last layer is 4-D), and Per-Class Regression (PCR), where the last layer is class-specific (the last layer is 4000-D, since there are 1000 classes)."

SCR (Single-Class Regression): 모든 클래스 공유, last layer 4-D
PCR (Per-Class Regression): 클래스별 4-D × 1000 클래스 = 4000-D
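A small sketch of the loss and of how the two output layouts differ. The contiguous class-major layout used for PCR (`4 * class_index` offset) is an assumption made for illustration; the paper only states the output dimensionality.

```python
def euclidean_loss(pred_box, true_box):
    """Squared L2 distance between (x, y, w, h) boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))

def select_box(output, class_index=None):
    """SCR: output has 4 values. PCR: output has 4 * num_classes values."""
    if len(output) == 4:
        return list(output)
    # Assumed layout: the 4 values of each class are stored contiguously.
    return list(output[4 * class_index: 4 * class_index + 4])

pcr_output = [0.0] * 4000
pcr_output[8:12] = [0.5, 0.5, 0.2, 0.2]  # box predicted for class 2
box = select_box(pcr_output, class_index=2)
loss = euclidean_loss(box, (0.5, 0.4, 0.2, 0.4))
```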

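PCR does better than SCR. In the simplified validation protocol of Table 8 (ground-truth class, central crop, first two FC layers fine-tuned), the localisation error is 34.3% for PCR versus 36.4% for SCR.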

Testing

"We use two testing protocols. The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class (to factor out the classification errors). The bounding box is obtained by applying the network only to the central crop of the image."

Protocol 1 (val 비교용): 정답 class의 bbox만 평가, center crop
Protocol 2 (test set용): OverFeat과 유사한 greedy merging + classification probability 기반 score merging

Results (Table 10)

ILSVRC 2014 localisation test error:

Method	Test error (%)
AlexNet (2012)	34.2
OverFeat (2013, 우승)	29.9
GoogLeNet (2014)	26.7
VGG (2014, 우승)	25.3

ILSVRC 2014 localisation 1위.

The paper makes an interesting comparison here. It points out that its results are considerably better than those of the ILSVRC-2013 winner OverFeat, even though VGG used fewer scales and did not use OverFeat's resolution enhancement technique.

In other words, a simpler localisation method with a more powerful representation wins: better classification model = better representation = better localisation. This is strong evidence that the representational power of VGG is general.

Appendix B — Generalization to Other Datasets

Feature Extraction 방식

"To use our models in other settings, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use the activations of the penultimate layer (i.e. the activations of the second-to-last fully-connected layer, which we call 'features') as the image features."

마지막 FC-1000 제거
FC-4096 활성값을 이미지 descriptor로 사용
여러 scale $Q$ 에서 dense 적용 → 각 scale별 feature map → global average pool → L2 normalize
선형 SVM을 target dataset에 학습 (ConvNet 가중치는 고정, no fine-tuning)
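The pooling and normalisation steps of this pipeline can be sketched without any framework. Averaging the normalised descriptors across scales is one possible aggregation and is an assumption of this sketch; the function names are illustrative.

```python
import math

def global_average_pool(feature_map):
    """feature_map[c][i][j] -> one value per channel."""
    return [sum(map(sum, plane)) / (len(plane) * len(plane[0])) for plane in feature_map]

def l2_normalise(vec, eps=1e-12):
    """Scale vec to unit L2 norm (eps guards against an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def descriptor(feature_maps_per_scale):
    """Pool each scale's map, L2-normalise, then average across scales (assumed)."""
    pooled = [l2_normalise(global_average_pool(fm)) for fm in feature_maps_per_scale]
    return [sum(vals) / len(pooled) for vals in zip(*pooled)]

# Two channels, two scales with different spatial sizes (toy values).
small = [[[3.0]], [[4.0]]]
large = [[[3.0, 3.0], [3.0, 3.0]], [[4.0, 4.0], [4.0, 4.0]]]
feat = descriptor([small, large])
```

The resulting fixed-length vector is what a linear SVM would be trained on, with the ConvNet weights left frozen.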

이게 deep features as a feature extractor의 전형. R-CNN(Girshick 2014) 등이 이미 했던 접근.

결과 (Table 11)

Dataset	Prev SOTA	VGG D+E
VOC-2007	He'14: 82.4 mAP	89.7 mAP (Δ +7.3)
VOC-2012	Chatfield'14: 83.2 mAP	89.3 mAP (Δ +6.1)
Caltech-101	He'14: 93.4 Acc	92.7 Acc (Δ -0.7)
Caltech-256	Chatfield'14: 77.6 Acc	86.2 Acc (Δ +8.6)

Only Caltech-101 comes out slightly lower. The paper notes that on Caltech-101 its representations are competitive with those of He et al. (2014), whose approach nonetheless performs significantly worse than VGG on VOC-2007.

He et al. (2014) is SPP-Net. VGG is far ahead of SPP-Net on PASCAL VOC yet only on par on Caltech-101. One possible reading, which is this review's interpretation rather than a claim made in the paper, is that Caltech-101 is statistically further from ImageNet (iconic, clean images vs natural scenes).

기타 응용 (Sect. B.3)

논문 공개 후 VGG가 backbone으로 사용된 다양한 태스크:

VOC-2012 Action Classification: 84.0% mAP (SOTA)
Object Detection: R-CNN, Faster R-CNN, SSD의 backbone
Semantic Segmentation: FCN (Long et al., 2015)
Image Captioning: Show and Tell, NeuralTalk 등
Texture Recognition

→ 모든 곳에서 baseline 향상. Universal feature extractor로 자리잡음.

핵심 인사이트 정리 — 다섯 가지

1. Controlled study의 미학

"depth만 변수로 둔다"는 단순한 설계 원칙이 논문의 모든 결과에 결정적 신뢰성을 부여한다. 다른 변수가 동시에 바뀌면 인과 추론이 불가능하다. 3×3으로 conv를 통일한 이유는 깊이 변동을 자유롭게 하기 위함이라는 메타적 동기를 놓치면 안 된다.

2. 작은 필터 + 깊은 net의 수학적 우위

Parameter: $27C^2 < 49C^2$ (3-stack 3×3 vs 7×7)
Non-linearity: 3 ReLUs vs 1
Implicit regularization
이 세 가지가 동시에 작동.

3. Scale Jittering — 명시적 multi-scale augmentation

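With $S$ sampled from $[256, 512]$, the network sees crops at a different scale every time. With single-scale testing this lowers E's error by 1.8%p top-1 (27.3% → 25.5%) and 1.0%p top-5 (9.0% → 8.0%) relative to fixed $S = 256$. It has since become a standard augmentation technique.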

4. FC → Conv 변환 — Dense Evaluation

임의 크기 입력에 ConvNet을 한 번에 적용. AlexNet 식 10-crop이 사라지고, fully-convolutional 패러다임이 일반화. 이 아이디어는 곧 FCN(2015)에서 segmentation의 표준이 된다.

5. Universal representation

VGG-16/19의 FC-4096 features는 PASCAL VOC, Caltech, action recognition, detection, segmentation 모두에서 baseline을 끌어올림. 이게 transfer learning의 baseline 시대를 연다.

자주 받을 만한 질문 — Q&A 모음

Q1. 왜 모든 conv를 3×3으로 통일했나?

A. 두 가지 이유. ① 방법론적: depth를 단일 변수로 두기 위해. 필터 크기가 layer마다 달라지면 깊이의 효과를 분리해서 측정할 수 없다. ② 수학적: 같은 RF를 만드는 3×3 스택은 큰 필터 하나보다 파라미터가 적고(27 vs 49 C²), 비선형성이 많으며, 암묵적 regularization을 갖는다.

Q2. 1×1 conv가 16층 D보다 안 좋은 이유는?

A. C와 D는 같은 16층, 같은 비선형성 횟수다. 차이는 일부 layer에서 C는 1×1, D는 3×3이라는 점. 1×1은 RF가 1픽셀이라 공간 정보를 못 잡는다. D는 그 자리에 3×3을 두어 이웃 픽셀과의 spatial context를 학습할 수 있다. 결국 ConvNet의 본질은 spatial pattern recognition이고, 그걸 더 잘하는 게 우위.

Q3. 왜 19층에서 성능이 saturation됐나?

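A. The paper itself only observes that the error saturates at 19 layers and that even deeper models might help on larger datasets. Two common explanations: ① the ImageNet scale (1.3M images) is not enough to benefit from a deeper model; ② optimisation difficulty: without techniques such as batch normalisation and residual connections, much deeper plain networks were hard to train. The second point is the motivation for ResNet (2015), which used skip connections to train up to 152 layers.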

Q4. Pre-initialization 없이도 학습 가능했다는데, 그럼 왜 했나?

A. 논문 작성 당시(2014)에는 Glorot/Xavier init이 깊은 ConvNet에 잘 작동한다는 게 확립되지 않았음. 안전을 위해 Net A를 random init으로 학습한 후, deeper net의 일부 layer를 그 가중치로 초기화. 논문 제출 후에야 Glorot init만으로도 학습됨을 확인. ResNet 이후 He init + BN으로 완전히 해결.

Q5. Multi-scale training과 multi-scale testing은 어떻게 다른가?

A. 둘은 독립적으로 적용 가능.

Training side: $S \sim \text{Uniform}(256, 512)$ 로 학습 이미지의 scale을 jitter. 모델이 다양한 크기의 객체를 학습.
Testing side: 여러 $Q$ 값에서 dense 평가 후 softmax 평균. 추론 시 ensemble 효과.
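Using both stacks up. For E, scale jittering at training time lowers the single-scale top-5 error from 9.0% ($S = 256$) to 8.0%, and multi-scale testing lowers it further to 7.5%.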

Q6. Dense evaluation과 multi-crop evaluation의 차이는?

A. 둘 다 여러 위치의 prediction을 합치는 추론 기법이지만:

Dense: 전체 이미지에 fully-convolutional net을 한 번 적용. 효율적. RF가 자연스럽게 이웃 픽셀을 포함.
Multi-crop: 여러 crop을 잘라 각각 net 적용. crop마다 zero-padding이 일어남.

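The difference is boundary handling. In a crop, the feature maps at the crop border are padded with zeros; in dense evaluation the same region is padded by neighbouring image content, so more context is captured. Because the two see different information, averaging them works best (24.4% top-1).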

Q7. VGG vs GoogLeNet, 누가 더 우수한가?

A. ILSVRC-2014 공식 순위는 GoogLeNet 우승. 하지만:

Single net 기준: VGG (7.0%) > GoogLeNet (7.9%) — 약 1%p VGG 우위.
Ensemble: GoogLeNet 7-net (6.7%) vs VGG 2-net (6.8%) — 거의 동등.
구조: VGG는 단순(3×3 stacked), GoogLeNet은 복잡(Inception module + auxiliary classifier + sparse connectivity).
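Efficiency: GoogLeNet is far lighter. Its paper reports about 12× fewer parameters than AlexNet, which puts it well over an order of magnitude below VGG's 138–144M.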
이식성: VGG가 압도적. detection/segmentation backbone으로 수년간 표준.

Q8. VGG가 살아남은 이유는?

A. "단순함의 힘". 모든 conv가 3×3, 모든 pool이 2×2/stride2, FC는 4096-4096-1000. 새 태스크에 적용할 때 변형이 쉽고, 디버깅이 쉽고, 가르치기 쉽고, 이해하기 쉽다. GoogLeNet은 더 좋지만 더 복잡하다. 이후 R-CNN, FCN, SSD 등이 모두 VGG를 baseline으로 시작.

Q9. Caltech-101에서만 성능이 SOTA보다 낮은 이유는?

A. Caltech-101은 iconic, clean, centered 이미지 (객체가 화면 중앙에 잘 보임). PASCAL VOC는 natural scene, multi-object, cluttered. ImageNet도 후자에 가깝다. VGG는 ImageNet에서 학습됐기에 통계가 비슷한 PASCAL에서는 큰 이득. Caltech-101은 통계가 멀어서 이득이 작음. 또한 Caltech-101 자체가 작은 데이터셋이라 SOTA가 이미 거의 포화 상태.

Q10. Scale jittering이 왜 효과가 있는가? 단순히 데이터를 늘리는 효과인가?

A. 두 가지 효과의 결합.

① Data augmentation: 같은 이미지를 여러 scale로 봐서 학습 셋이 사실상 늘어남.
② Multi-scale invariance: 모델이 scale에 invariant한 feature를 학습. 같은 객체가 다른 크기로 보여도 인식 가능.

특히 ②가 핵심. ImageNet은 객체 크기 변동이 큰 자연 이미지라 multi-scale로 학습한 모델이 더 robust해진다.

Q11. 모든 layer를 ReLU로 두는 게 최선인가? PReLU/ELU 같은 건?

A. 2014년 당시 ReLU가 표준. 이후 PReLU(He'15), Leaky ReLU, ELU, GELU 등이 등장하면서 일부 개선 가능. 하지만 ImageNet 같은 대규모 dataset에서는 활성화 함수의 영향이 상대적으로 작고, VGG는 ReLU로 충분히 잘 학습됨.

Q12. Dropout을 왜 FC에만 적용했나? Conv에는 안 적용해도 되나?

A. Conv layer는 파라미터가 적고 공간적 weight sharing이 있어 overfitting에 강함. FC는 파라미터의 압도적 비중(VGG-D 기준 conv 15M + FC 123M)을 차지하므로 dropout 적용이 효과적. 후속 연구에서 SpatialDropout(2014)이 conv에 적용되기도 하지만 효과는 미미.

Q13. 224×224는 어디서 온 숫자인가?

A. It is the AlexNet convention: rescale the image so that the shorter side is 256, then take 224×224 random crops. In addition, $224 = 2^5 \times 7$, so five stride-2 poolings leave exactly 7×7. A convenient number for designing deep networks.

Q14. weight decay $5 \times 10^{-4}$ 는 어떻게 정했나?

A. AlexNet과 동일. 이후 hyperparameter sweep 없이 그대로 사용. 사실상 ImageNet ConvNet의 표준 기본값이 됐다.

Q15. 왜 학습 시 epoch이 AlexNet보다 적은데도 수렴이 잘 되나?

A. 두 가지 요인. ① 암묵적 regularization: 깊이 + 작은 필터 → 더 나은 inductive bias. ② Pre-initialization: 깊은 net을 처음부터 학습하지 않고 얕은 net의 가중치를 활용.

Q16. VGG의 한계는?

A. 세 가지.

파라미터 수: 138M (D), 144M (E). FC layer에 집중 (~123M). 메모리/저장 부담 큼.
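Compute cost: a deep, wide network, so inference is slow too. VGG-16 needs roughly 15 billion multiply-adds per 224×224 forward pass, more than an order of magnitude above AlexNet.
Depth ceiling: the gains saturate at 19 layers, and plain stacked networks proved hard to push much deeper → ResNet (2015) addressed this.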

Q17. 본 논문이 후속 연구에 미친 영향은?

A. 거의 모든 곳.

분류 SOTA: ResNet, DenseNet의 비교 baseline.
Detection: R-CNN(2014), Fast R-CNN(2015), Faster R-CNN(2015), SSD(2016) 모두 VGG-16 backbone.
Segmentation: FCN(2015), DeepLab(2014), SegNet(2015) 모두 VGG 기반.
Style Transfer: Gatys(2015)의 neural style transfer가 VGG의 feature map을 사용.
Perceptual Loss: VGG feature 거리가 image similarity 측정의 표준.

Q18. 왜 1×1 conv가 GoogLeNet/ResNet에서는 효과적인데 VGG-C에서는 약했나?

A. 목적이 다름.

VGG-C: 입출력 채널 수가 같음. 단순히 선형 변환 + ReLU 추가 효과만.
GoogLeNet (Inception): 1×1으로 채널 수를 축소한 뒤 비싼 3×3/5×5 적용. 계산량 절감 + 표현력 향상.
ResNet (bottleneck block): 1×1로 축소 → 3×3 → 1×1로 확장. 계산 효율 + 표현력.

VGG-C는 1×1의 잠재력을 못 살린 셈. 1×1 자체가 안 좋은 게 아니라 사용법이 중요.

Q19. 학습 데이터로 random crop만 쓰면 충분한가? Multi-scale을 안 하면?

A. Single-scale( $S=256$ fixed) 학습 시 E의 top-1은 27.3%. Multi-scale( $S \in [256, 512]$ ) 학습 시 25.5%. 차이 1.8%p가 작아 보이지만 ImageNet 규모에서는 거대한 차이. 또한 multi-scale 학습 모델이 test 시 multi-scale 평가에 더 잘 반응 (downstream에서 누적 효과).

Q20. 224×224 crop이지만 train과 test에서 다른 scale을 쓸 수 있다? 어떻게?

A. Train: $224 \times 224$ random crop이 고정. 하지만 crop의 출처가 되는 rescaled image의 $S$ 를 jitter할 수 있음.
Test: FC를 conv로 바꾸면 임의 크기 입력 처리 가능. 따라서 $Q$ 를 train의 $S$ 와 무관하게 자유롭게 선택. dense evaluation으로 전체 이미지 처리.

이게 VGG의 영리한 점: train의 입력 크기 제약을 test 시 풀어버린다.

마무리

VGGNet 논문이 위대한 이유는 새로운 구조를 제안해서가 아니다. 기존 ConvNet 패러다임(LeCun, AlexNet)을 받아들이고, 단 하나의 변수(depth)만을 통제 실험으로 검증한 후, 그 결과를 다른 도메인에서도 재현 가능함을 보여줬기 때문이다. 단순함을 끝까지 밀고 간 결과 "딥러닝의 depth가 representation의 풍부함을 결정한다"는 명제를 ImageNet 규모에서 확립한 것.

오늘날 모든 backbone 비교에서 VGG-16이 등장하고, 모든 perceptual loss가 VGG feature를 쓰며, 모든 학부 CNN 수업이 VGG로 시작한다. 이 글이 그 무게를 한 번이라도 더 가깝게 느끼는 데 도움이 되길.

References: Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv:1409.1556.

Bewonoverby