CV Paper Study, Week 2

daniayo · March 18, 2025


Going deeper with convolutions


Abstract

Inception
Responsible for setting the new state of the art for classification and detection

  • Main hallmark : Improved utilization of the computing resources inside the network
    • Achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant
  • Optimize quality : architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing

    Hebbian Principle
    👉 " Cells that fire together, wire together "
    The principle that connections (synapses) between neurons strengthen when the neurons activate together

1. Introduction

The biggest gains in object-detection come from the synergy of deep architectures and classical computer vision.
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithm gains importance.

The models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time
→ Could be put to real-world use

We will focus on an efficient deep neural network architecture for computer vision, codenamed Inception
The word "deep" is used with two different meanings

  • a new level of organization in the form of the "Inception module"
  • in the more direct sense of increased network depth
  • Existing studies (AlexNet, VGGNet) show that deeper networks improve performance, but they come with a significant increase in computational cost
  • While deeper and more powerful models are needed, computational efficiency and structural optimization are also essential

In our setting, 1 x 1 convolutions have a dual purpose :

  • Dimension reduction
    • To remove computational bottlenecks
    • To limit the size of our networks
  • Increasing depth & width
    • Without significant performance penalty
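
A minimal sketch (assuming PyTorch) of what the reduction buys: a 1x1 convolution shrinks the channel count before an expensive 5x5 convolution. The 256 → 32 → 64 filter counts are illustrative, not taken from the paper's tables.

```python
import torch.nn as nn

# 5x5 convolution applied directly to a 256-channel feature map
direct = nn.Conv2d(256, 64, kernel_size=5, padding=2)

# same output channels, but with a 1x1 reduction first
reduced = nn.Sequential(
    nn.Conv2d(256, 32, kernel_size=1),            # 1x1 reduction: 256 -> 32 channels
    nn.Conv2d(32, 64, kernel_size=5, padding=2),  # 5x5 on the reduced volume
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(direct))   # 409,664  (5*5*256*64 + 64)
print(n_params(reduced))  #  59,488  (256*32 + 32 + 5*5*32*64 + 64)
```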

We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages

  • Multi-box prediction for higher object bounding box recall
  • Ensemble approaches for better categorization of bounding box proposals

3. Motivation and High Level Considerations

To improve the performance of a deep neural network,

Increase the Size

  • Increasing the depth : the number of levels of the network
  • Increasing the width : the number of units at each level
  • 👍 Easy and safe way of training higher-quality models

  • 👎 Larger number of parameters

    • Makes the enlarged network more prone to overfitting
  • 👎 Dramatically increased use of computational resources

    • Any uniform increase in the number of their filters results in a quadratic increase of computation
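
A toy calculation of why the increase is quadratic: a conv layer's multiply-adds scale with (input channels) × (output channels), so uniformly doubling the width of chained layers roughly quadruples the cost. The numbers below are illustrative.

```python
# Multiply-adds of a single conv layer: output H x W, kernel k x k,
# c_in input channels, c_out output channels.
def conv_madds(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

base    = conv_madds(28, 28, 3, 128, 128)   # a 3x3 layer, 128 -> 128 channels
doubled = conv_madds(28, 28, 3, 256, 256)   # same layer with both widths doubled
print(doubled / base)                       # 4.0 -> 2x width costs ~4x compute
```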

To solve both issues, ultimately move from fully connected to sparsely connected architectures, even inside the convolutions

" Neurons that fire together, wire together " suggests that the underlying idea is applicable even under less strict conditions, in practice

On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures

The uniformity of the structure, a large number of filters, and a greater batch size allow for utilizing efficient dense computation

The vast literature on the sparse matrix computations suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication

After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection

🙋❓: Although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction.

4. Architectural Details

The main idea of the INCEPTION architecture

☝️ Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components

  • All we need is to find the optimal local construction and to repeat it spatially
    • Translation invariance
      : the property that the output does not change when the input data is shifted
  • Filter Bank
    : a collection of multiple filters (kernels), used mainly to extract frequencies, patterns, and features
    → multiple filters detect a variety of patterns at once

As "Inception modules" are stacked on top of each other, their output correlation statistics are bound to vary
→ 👎 Even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters


✌️ Judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise

  • This is based on the success of embeddings
    • Even low dimensional embeddings might contain a lot of information about a relatively large image patch
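
Putting ☝️ and ✌️ together, a minimal PyTorch sketch of one Inception module with dimension reduction; the channel counts mirror the paper's inception (3a) stage, but the code itself is just an illustration, not the reference implementation.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        # branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        # branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        # branch 4: 3x3 max-pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # concatenate the four parallel branches along the channel axis
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)  # inception (3a)-like counts
print(m(x).shape)  # torch.Size([1, 256, 28, 28])
```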

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.
This is not strictly necessary, simply reflecting some infrastructure inefficiencies in our current implementation.

Main beneficial aspects of ...

  • Architecture
    • it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity
  • Design
    • it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously

The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties.

Another way to utilize the inception architecture

  • Create slightly inferior, but computationally cheaper versions of it
    All the included knobs and levers allow for a controlled balancing of computational resources; however, this requires careful manual design

    Knob
    → adjusts a continuous value!
    e.g. Learning Rate, Regularization Strength, Dropout Rate
    Lever
    → adjusts a discrete choice!
    e.g. choice of activation function, choice of optimizer

5. GoogLeNet

They used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally.

All the convolutions, including those inside the Inception modules, use rectified linear activation.

The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices, including even those with limited computational resources. The exact layer count depends on the machine learning infrastructure system used.
A move from fully connected layers to average pooling improved the top-1 accuracy; however, the use of dropout remained essential even after removing the fully connected layers.
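
A small sketch of that head, assuming PyTorch; the 1024 channels and 40% dropout rate follow the paper's layer table, everything else is illustrative.

```python
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # global average pooling over the 7x7x1024 feature map
    nn.Flatten(),             # -> 1024-dimensional vector
    nn.Dropout(p=0.4),        # dropout stays essential even without FC layers
    nn.Linear(1024, 1000),    # single linear classifier over the 1000 ILSVRC classes
)
```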

Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern.
Strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative.

Propagate gradient
: the process by which the error (loss) is passed backward through the network, updating each layer's weights
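
The paper's remedy, also noted in the summary at the end, is auxiliary classifiers attached to intermediate Inception stages: their losses are added during training and the branches are discarded at inference. A hedged PyTorch sketch following the shapes the paper gives for the 4a stage:

```python
import torch.nn as nn

aux_classifier = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),  # 14x14 feature map -> 4x4
    nn.Conv2d(512, 128, kernel_size=1),     # 512 channels at stage 4a, reduced to 128
    nn.ReLU(inplace=True),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 1024),           # fully connected layer with 1024 units
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.7),                      # 70% dropout, as in the paper
    nn.Linear(1024, 1000),                  # 1000-way classifier
)
# During training this branch's loss is weighted by 0.3 and added to the main loss.
```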

6. Training Methodology

Networks were trained using the DistBelief distributed machine learning system, using a modest amount of model and data parallelism. They used a CPU-based implementation only, the main limitation being memory usage.
Their training used asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule.
Polyak averaging was used to create the final model used at inference time.
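
A toy sketch of Polyak averaging, assuming a simple running mean over the weight iterates (the paper does not spell out its exact variant); the averaged copy, not the live weights, becomes the inference model.

```python
import numpy as np

w     = np.zeros(3)   # "live" weights being trained
w_avg = np.zeros(3)   # Polyak-averaged weights used at inference time

for step in range(1, 1001):
    w = w + 0.01 * np.random.randn(3)   # stand-in for one SGD update
    w_avg += (w - w_avg) / step         # running mean over all iterates so far
```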

👍 Sampling of variously sized patches of the image, whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3

👍 Photometric distortions by Andrew Howard were useful to combat overfitting to some extent
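
This crop sampling survives today as torchvision's RandomResizedCrop, whose default arguments match these numbers exactly:

```python
from torchvision import transforms

train_crop = transforms.RandomResizedCrop(
    size=224,            # GoogLeNet's 224x224 input resolution
    scale=(0.08, 1.0),   # patch area: 8% to 100% of the image
    ratio=(3/4, 4/3),    # aspect ratio chosen randomly in [3/4, 4/3]
)
```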

Parallelism
n. parallel processing
: splitting a computation across multiple devices (GPUs, TPUs, ...) and running the parts concurrently

Stochastic Gradient Descent (SGD)
: updating the weights using only a randomly sampled subset of the training data, rather than the entire training set

Asynchronous SGD
: multiple workers train independently, and each applies its gradient to update the weights as soon as its computation finishes
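
A toy illustration of asynchronous SGD with 0.9 momentum (plain Python threads stand in for DistBelief workers; everything here is illustrative): each worker reads the shared weights, computes a gradient from that possibly stale copy, and applies its update immediately, without waiting for the others.

```python
import threading
import numpy as np

target   = np.array([1.0, -2.0, 3.0])  # minimise f(w) = ||w - target||^2
w        = np.zeros(3)                 # shared parameters
velocity = np.zeros(3)                 # shared momentum buffer
lr, mu   = 0.01, 0.9                   # learning rate, momentum 0.9 as in the paper

def worker(steps):
    global w
    for _ in range(steps):
        grad = 2 * (w - target)                  # gradient from a (stale) read of w
        velocity[:] = mu * velocity - lr * grad
        w = w + velocity                         # applied at once, no synchronization

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(w)   # close to target despite the racy updates
```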

7. ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the ImageNet hierarchy.

Participated in the challenge with no external data used for training.

Training Techniques
1. Independently trained 7 versions of the same GoogLeNet model, and performed ensemble prediction with them
2. During testing, adopted a more aggressive cropping approach
3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction
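
A minimal sketch of steps 1-3 at test time, assuming PyTorch; `models` and `crops` are placeholders for the 7 trained GoogLeNets and the aggressively sampled crops.

```python
import torch

def ensemble_predict(models, crops):
    probs = []
    for model in models:          # 7 independently trained GoogLeNets
        for crop in crops:        # each crop: a (1, 3, 224, 224) tensor
            probs.append(torch.softmax(model(crop), dim=1))
    # average the softmax probabilities over all crops and all classifiers
    return torch.stack(probs).mean(dim=0)
```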

When they used one model, they chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.

8. ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images, among 200 possible classes.

They report the official scores and the common strategies for each team
: the use of external data, ensembles of models, or contextual models

9. Conclusions

Approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.

Main advantages

  • Significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks.
  • The detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture.

10. Acknowledgements

Thank you!


Summary

Core ideas of the Inception module

  • Apply filters of several sizes (1x1, 3x3, 5x5) in parallel to learn information at multiple scales simultaneously
  • Use 1x1 convolutions for dimension reduction to control the amount of computation efficiently
  • This design increases the network's depth and width at the same time while minimizing computational cost

Key features of GoogLeNet

  • A much deeper and wider structure than previous models (AlexNet, VGGNet), yet with reduced computation
  • Removes the fully connected layers and uses global average pooling, cutting the parameter count and preventing overfitting
  • Adds auxiliary classifiers to intermediate layers to address the vanishing gradient problem
  • Uses the ReLU activation function in every convolution

Experiments and results

  • Won first place in the ILSVRC 2014 image classification challenge with a top-5 error rate of 6.67%
  • Also performed well on the detection task, recording competitive results even without bounding box regression

Conclusions and significance

  • Showed that approximating a sparse network structure with dense computations is a practical and effective approach
  • Greatly improved deep learning performance in computer vision through a strategy that manages computation efficiently while increasing depth and width