[Paper Review] VGG Net

이영락 · July 24, 2024


[Paper Review] VGGNet (2014) explained

[Paper Review] Review of Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG Net)

1 | Introduction

Background

Convolutional Networks (ConvNets) have achieved significant success in large-scale image and video recognition, thanks to:
- the availability of large public image repositories such as ImageNet
- high-performance computing systems such as GPUs.
Three ways to improve the original architecture (Krizhevsky et al., 2012) for better accuracy:
- a smaller receptive window and a smaller stride in the first convolutional layer
- training and testing the networks densely over the whole image and over multiple scales
- designing its depth (this paper)
How to design the depth?
1. Fix the other parameters of the architecture.
2. Steadily increase the depth of the network by adding more convolutional layers.
Result: more accurate ConvNet architectures that are also applicable to other image recognition datasets.

Objective

This study investigates the effect of ConvNet depth on accuracy in large-scale image recognition.
The primary contribution is a thorough evaluation of networks of increasing depth using architectures with small 3×3 convolution filters.
Contribution: The proposed networks showed substantial performance improvements over previous configurations and took first place in localisation and second place in classification in the ImageNet Challenge 2014. The models also generalize well to other datasets.

2 | ConvNet Configurations

2.1. Architecture

Input: a fixed-size 224×224 RGB image.
Preprocessing: subtracting the mean RGB value computed on the training set from each pixel.
Convolutional Layers: Stacks of convolutional layers using very small 3×3 filters, with a fixed stride of 1 pixel and 1-pixel padding so that the spatial resolution is preserved.
Pooling Layers: Max-pooling layers with 2×2 pixel windows and a stride of 2.
Fully Connected Layers (FC): Three FC layers
- the first two with 4096 channels each,
- the third performing 1000-way classification using the softmax function (because the dataset has 1,000 classes).
Activation Function: ReLU non-linearity
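To see why the 3×3 convolutions with padding leave the resolution untouched while each pooling layer halves it, the spatial size can be traced by hand. The sketch below is only an illustration; the helper names are made up, and the count of five pooling stages is taken from the paper rather than from the list above.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Output side length of a convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output side length of a max-pooling layer without padding."""
    return (size - window) // stride + 1


def trace_spatial_size(input_size=224, num_pools=5):
    """Side length after each conv + pool stage (five stages assumed)."""
    sizes = [input_size]
    size = input_size
    for _ in range(num_pools):
        size = conv_out(size)  # 3x3, stride 1, pad 1: size unchanged
        size = pool_out(size)  # 2x2, stride 2: size halved
        sizes.append(size)
    return sizes


print(trace_spatial_size())  # [224, 112, 56, 28, 14, 7]
```

With 512 channels in the last stage, the final 7×7 map flattens to 7 × 7 × 512 = 25,088 inputs for the first FC layer.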

2.2. Configurations

The configurations differ only in depth!

Network Depth: Five configurations (A to E) ranging from 11 to 19 weight layers.
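As a rough check of the "11 to 19 weight layers" range, the five configurations can be written down as layer lists and counted. The lists below follow Table 1 of the paper as I read it; `block`, `CONFIGS` and the other names are illustrative, and the parameter count assumes a 224×224 RGB input and includes biases.

```python
# A conv layer is (kernel_size, out_channels); "M" marks a max-pooling layer.
def block(channels, n3x3, n1x1=0):
    """One stage: n3x3 3x3 convs, then n1x1 1x1 convs, then max-pooling."""
    return [(3, channels)] * n3x3 + [(1, channels)] * n1x1 + ["M"]


CONFIGS = {
    "A": block(64, 1) + block(128, 1) + block(256, 2) + block(512, 2) + block(512, 2),
    "B": block(64, 2) + block(128, 2) + block(256, 2) + block(512, 2) + block(512, 2),
    "C": block(64, 2) + block(128, 2) + block(256, 2, 1) + block(512, 2, 1) + block(512, 2, 1),
    "D": block(64, 2) + block(128, 2) + block(256, 3) + block(512, 3) + block(512, 3),
    "E": block(64, 2) + block(128, 2) + block(256, 4) + block(512, 4) + block(512, 4),
}
FC_SIZES = [4096, 4096, 1000]


def weight_layers(cfg):
    """Number of layers with learnable weights (conv + FC)."""
    return sum(1 for layer in cfg if layer != "M") + len(FC_SIZES)


def parameter_count(cfg, in_channels=3, final_side=7):
    """Total weights and biases, assuming a 7x7 map before the first FC layer."""
    total, c_in = 0, in_channels
    for layer in cfg:
        if layer == "M":
            continue
        k, c_out = layer
        total += k * k * c_in * c_out + c_out
        c_in = c_out
    n_in = c_in * final_side * final_side
    for n_out in FC_SIZES:
        total += n_in * n_out + n_out
        n_in = n_out
    return total


for name, cfg in CONFIGS.items():
    print(name, weight_layers(cfg), round(parameter_count(cfg) / 1e6), "M params")
```

Under these assumptions the depths come out as 11, 13, 16, 16 and 19 weight layers, and most of the parameters sit in the FC layers rather than in the added convolutions.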

2.3. Discussion

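Differences between previous top-performing entries and our ConvNet (VGGNet)

📢 Use very small 3×3 receptive fields throughout the network (previous top-performing entries used **large receptive fields** in the first conv layer, e.g. 11×11 with stride 4 or 7×7 with stride 2).

A stack of three 3×3 conv layers has the same effective receptive field as a single 7×7 layer, but:
- it contains 3 non-linear rectification layers instead of 1, which makes the decision function more discriminative
- it decreases the number of parameters: $3(3^2C^2) = 27C^2$ weights versus $7^2C^2 = 49C^2$ for a single 7×7 layer with $C$ channels, i.e. the 7×7 layer needs 81% more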
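The receptive-field and parameter arithmetic behind this comparison can be checked in a few lines. This is a minimal sketch with illustrative function names; it assumes stride-1 convolutions with the same number of input and output channels and ignores biases.

```python
def stacked_receptive_field(kernels):
    """Effective receptive field of stride-1 convolutions applied in sequence."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf


def conv_weights(kernel, channels):
    """Weights of one conv layer with C input and C output channels."""
    return kernel * kernel * channels * channels


C = 256  # illustrative channel count; the ratio does not depend on it
stack = 3 * conv_weights(3, C)  # 27 * C^2
single = conv_weights(7, C)     # 49 * C^2

print(stacked_receptive_field([3, 3, 3]))  # 7
print(round((single / stack - 1) * 100))   # 81
```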

ChatGPT summary

Advantages of Small Filters: Using small 3×3 filters introduces more non-linearity and reduces the number of parameters compared to larger filters (e.g., 7×7).
1×1 Filters: Configuration C uses 1×1 filters to increase non-linearity without changing the receptive field.
Comparison with Previous Work: Highlights the advantages of deeper networks with small filters over previous architectures, showing improved performance.

3 | Classification Framework

3.1. Training

Mini-batch gradient descent with momentum (batch size: 256, momentum: 0.9)
Weight decay (L2 penalty multiplier: $5 \times 10^{-4}$), dropout (ratio: 0.5, applied to the first two FC layers)
Learning rate: initially $10^{-2}$, then decreased by a factor of 10, 3 times in total
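To make the roles of these hyperparameters concrete, here is a sketch of one momentum update on a single scalar weight. The exact update form is an assumption: it follows the convention of Krizhevsky et al. (2012), where weight decay is added to the gradient, and `sgd_momentum_step` is a made-up name.

```python
def sgd_momentum_step(w, v, grad, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One update for a single weight.

    v <- momentum * v - lr * (grad + weight_decay * w)
    w <- w + v
    """
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v


w, v = 1.0, 0.0
for _ in range(3):
    grad = 2.0 * w  # gradient of the toy loss w^2
    w, v = sgd_momentum_step(w, v, grad)
print(w < 1.0)  # True: the weight moves toward the minimum at 0
```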

ChatGPT summary

Training Procedure: Follows the method of Krizhevsky et al. (2012), using mini-batch gradient descent with momentum. The batch size is set to 256, and momentum to 0.9.
Regularization: Uses weight decay (L2 penalty) and dropout regularization to prevent overfitting.
Learning Rate: Starts with an initial learning rate of 0.01, which is decreased by a factor of 10 when the validation accuracy stops improving.
Weight Initialization: Starts by training a shallow network (configuration A) from random initialization, then uses its weights to initialize the first four convolutional layers and the last three FC layers of the deeper networks.
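The learning-rate rule described above (divide by 10 when validation accuracy stops improving) can be sketched as a small plateau detector. The `patience` parameter is a hypothetical knob that the summary does not specify, and `max_drops=3` reflects the three decreases noted earlier.

```python
def plateau_schedule(val_accuracies, lr=1e-2, factor=0.1, patience=2, max_drops=3):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` once validation accuracy has failed to
    improve for `patience` consecutive epochs, at most `max_drops` times.
    """
    best, stale, drops, history = float("-inf"), 0, 0, []
    for acc in val_accuracies:
        history.append(lr)
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience and drops < max_drops:
                lr, stale, drops = lr * factor, 0, drops + 1
    return history


accs = [0.50, 0.60, 0.60, 0.60, 0.65, 0.65, 0.65]  # made-up validation curve
print(plateau_schedule(accs))
```

In this toy run the rate stays at 0.01 for the first four epochs and drops to roughly 0.001 afterwards.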

3.2. Testing

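Testing procedure: The test image is rescaled so that its smallest side equals a preset size Q, and the network is applied densely over the whole image. This produces a class score map, which is spatially averaged to obtain a fixed-size vector of class scores. Performance is also evaluated with test images at several scales.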
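How large the class score map gets depends on the test image size. The sketch below assumes the dense-evaluation setup described in the paper, where the first FC layer is applied as a 7×7 convolution without padding and the other two as 1×1 convolutions; `score_map_side` is an illustrative name, and square images are assumed for simplicity.

```python
def score_map_side(q, num_pools=5, fc_kernel=7):
    """Side length of the class score map for a q x q test image."""
    side = q
    for _ in range(num_pools):
        side //= 2  # each 2x2 / stride-2 pooling halves the resolution
    return side - fc_kernel + 1  # 7x7 "FC as conv" layer without padding


for q in (224, 256, 384, 512):
    print(q, score_map_side(q))
# 224 1
# 256 2
# 384 6
# 512 10
```

Only the 224×224 input gives a single score per class; larger inputs give a grid of scores per class that still has to be averaged.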

🤔 **Class Score Map?**

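A class score map assigns to each part of an image (a pixel or a small region) the probability that this region belongs to a particular class. Such maps are mainly used in object detection and segmentation.

How it works

Input image: The given input image is fed into the ConvNet.
Convolution: Image features are extracted as the input passes through several convolutional layers.
Output: The final output of the network is a per-class score map (e.g. the probability that each pixel belongs to a particular class).

Examples

Object Detection: If the input image contains a cat and a dog, the class score map gives, for each pixel, the probability of "cat" and the probability of "dog". Each value expresses the confidence for a specific class.
Segmentation: Predicts which object each pixel belongs to, so that the image is partitioned into different objects.

Step-by-step process

Image preprocessing: The input image is preprocessed (e.g. resizing, mean subtraction).
Convolutional layers: Image features are extracted through several convolutional layers.
Class prediction: The last layer of the network computes a score for each class, producing the class score map.
Post-processing: Depending on the task, the score map is used to make the final class prediction or to generate bounding boxes.

Use cases

Semantic Segmentation: Computes class scores for each pixel and predicts which class each pixel belongs to.
Object Detection: Computes class scores for several regions of the image and predicts the location and class of objects.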
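The same score map can be read in two ways: averaged over space to get one score vector for the whole image (what VGG does for classification), or reduced per location to get a label map (the segmentation-style reading). A toy sketch, with made-up scores and illustrative function names:

```python
def spatial_average(score_map):
    """score_map[c][y][x] -> one averaged score per class."""
    return [
        sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
        for plane in score_map
    ]


def per_location_argmax(score_map):
    """score_map[c][y][x] -> label_map[y][x] with the best class per location."""
    num_classes = len(score_map)
    height, width = len(score_map[0]), len(score_map[0][0])
    return [
        [max(range(num_classes), key=lambda c: score_map[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy 2-class, 2x2 score map (class 0 = "cat", class 1 = "dog").
toy = [
    [[0.9, 0.8], [0.3, 0.6]],  # cat scores
    [[0.1, 0.2], [0.7, 0.4]],  # dog scores
]
print(spatial_average(toy))      # image-level scores, roughly [0.65, 0.35]
print(per_location_argmax(toy))  # [[0, 0], [1, 0]]
```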

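Multi-scale testing: The network is evaluated on input images at several scales, which gives better performance than single-scale testing. Separately, multi-crop evaluation is complementary to dense evaluation because the two use different convolution boundary conditions.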

3.3. Implementation Details

Caffe toolbox: Training and evaluation are implemented on top of the Caffe toolbox and run in parallel on multiple GPUs.
Multi-GPU training exploits data parallelism: each batch is split across several GPUs (multi-GPU training = data parallelism).
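Data parallelism works because the gradient of a batch is the average of the gradients of its equal-sized shards; the paper notes that the per-GPU gradients are averaged synchronously, so the result matches single-GPU training. The sketch below shows this on a toy one-parameter regression, with plain Python lists standing in for GPUs. The model, data and function names are all made up for illustration.

```python
def mse_gradient(w, batch):
    """d/dw of mean((w * x - y)^2) over the (x, y) pairs in `batch`."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_gpus=4):
    """Split the batch into equal shards, compute per-shard gradients, average them."""
    shard = len(batch) // num_gpus
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_gpus)]
    return sum(mse_gradient(w, s) for s in shards) / num_gpus


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # toy data
full = mse_gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch)
print(abs(full - parallel) < 1e-9)  # True
```

The equality relies on the shards having the same size; with uneven shards the average would need to be weighted.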

🤔 **Caffe Toolbox?**

Actually, what is more

4 | Classification Experiments

4.1. Single Scale Evaluation

Single Scale Testing: Evaluates the performance of each network configuration using a fixed-size input image.
Results: Error decreases as depth increases. The deepest configuration, E, achieves 25.5% top-1 error and 8.0% top-5 error on the validation set when trained with scale jittering.
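For reference, top-1 and top-5 error are the fraction of images whose true label is not among the 1 or 5 highest-scoring classes. A minimal sketch with a made-up 3-class example (so k=2 plays the role of top-5):

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


scores = [
    [0.1, 0.6, 0.3],  # true class 1: top-1 hit
    [0.5, 0.2, 0.3],  # true class 2: top-1 miss, top-2 hit
    [0.7, 0.2, 0.1],  # true class 2: miss for both
    [0.2, 0.3, 0.5],  # true class 2: top-1 hit
]
labels = [1, 2, 2, 2]
print(top_k_error(scores, labels, k=1))  # 0.5
print(top_k_error(scores, labels, k=2))  # 0.25
```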

4.2. Multi-Scale Evaluation

Multi-Scale Testing: Uses images of different sizes for testing, showing better performance than single-scale testing.
Results: Configuration E achieves 24.8% top-1 error and 7.5% top-5 error in multi-scale testing.
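Which test sizes are used depends on how the model was trained. According to the paper (this detail is not in the summary above), a model trained at a fixed scale S is tested at Q = {S − 32, S, S + 32}, while a model trained with scale jittering over [S_min, S_max] is tested at Q = {S_min, (S_min + S_max)/2, S_max}. A small sketch, with an illustrative function name:

```python
def multi_scale_test_sizes(train_scale):
    """Test sizes Q for a fixed training scale S (int) or a jitter range (s_min, s_max)."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return [s_min, (s_min + s_max) // 2, s_max]
    return [train_scale - 32, train_scale, train_scale + 32]


print(multi_scale_test_sizes(256))         # [224, 256, 288]
print(multi_scale_test_sizes(384))         # [352, 384, 416]
print(multi_scale_test_sizes((256, 512)))  # [256, 384, 512]
```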

4.3. Multi-Crop Evaluation

Multi-Crop Evaluation: Uses multiple parts of an image to evaluate the network and averages the outputs to get the final result.
Results: Multi-crop evaluation performs slightly better than dense evaluation, and the two are complementary: combining them outperforms either one alone.
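The paper reports 50 crops per scale (a 5×5 regular grid with 2 flips), i.e. 150 crops over 3 scales. The sketch below enumerates such a grid; the evenly spaced offsets and the name `grid_crops` are assumptions, since the exact crop placement is not spelled out.

```python
def grid_crops(image_side, crop=224, grid=5):
    """Top-left corners and flip flags for grid x grid crops, each also mirrored."""
    step = (image_side - crop) / (grid - 1)
    offsets = [round(i * step) for i in range(grid)]
    return [(y, x, flip) for y in offsets for x in offsets for flip in (False, True)]


crops = grid_crops(384)
print(len(crops))           # 50
print(crops[0], crops[-1])  # (0, 0, False) (160, 160, True)
```

The network's outputs on all crops would then be averaged to give the prediction for the image.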

4.4. Network Fusion

Network Fusion: Combines the outputs of multiple networks by averaging their softmax class posteriors, which improves performance. For example, combining configurations D and E achieves 23.7% top-1 error and 6.8% top-5 error.
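Fusion in the paper is done by averaging the softmax class posteriors of the individual models. A minimal sketch of that averaging, using made-up logits from two hypothetical models on a 3-class problem:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(logits_per_model):
    """Average the softmax posteriors of several models for one image."""
    posteriors = [softmax(logits) for logits in logits_per_model]
    n = len(posteriors)
    return [sum(p[c] for p in posteriors) / n for c in range(len(posteriors[0]))]


model_d = [2.0, 1.0, 0.1]  # made-up logits; alone, this model prefers class 0
model_e = [0.5, 2.5, 0.2]  # made-up logits; alone, this model prefers class 1
fused = fuse([model_d, model_e])
print(max(range(3), key=lambda c: fused[c]))  # 1
```

Here the second model is more confident, so its preferred class wins after averaging.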

4.5. Comparison with the State of the Art

Comparison: The proposed networks clearly outperform the previous generation of models. For example, the two-model ensemble (6.8% top-5 test error) is competitive with GoogLeNet (6.7% error) and significantly outperforms Clarifai (11.7% error), the ILSVRC-2013 winner.

5 | Conclusion

Importance of Depth: Demonstrates that increasing the depth of ConvNets significantly improves image classification accuracy.
Generalization Performance: Shows that the proposed models perform well across various datasets, achieving state-of-the-art results on PASCAL VOC and Caltech-256 datasets.
