
Inception
Responsible for setting the new state of the art for classification and detection
- Main hallmark : Improved utilization of the computing resources inside the network
- Achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant
- Optimize quality : architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing
Hebbian Principle
👉 " Cells that fire together, wire together "
: the principle that the connections (synapses) between neurons that activate together are strengthened
The biggest gains in object-detection come from the synergy of deep architectures and classical computer vision.
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithm gains importance.
The models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time
→ Could be put to real world use
We will focus on an efficient deep neural network architecture for computer vision, codenamed Inception
The word "deep" is used in two different meanings : first, a new level of organization in the form of the "Inception module", and second, the more direct sense of increased network depth
In our setting, 1 x 1 convolutions have a dual purpose : most critically, they serve as dimension reduction modules that remove computational bottlenecks, which in turn allows increasing both the depth and the width of the network without a significant performance penalty
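A minimal PyTorch sketch of the dimension-reduction role (the channel counts here are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn as nn

# A 1x1 convolution shrinks 256 input channels to 64 before an
# expensive 5x5 convolution, sharply cutting the multiply-add cost.
reduce = nn.Conv2d(256, 64, kernel_size=1)           # dimension reduction
conv5 = nn.Conv2d(64, 32, kernel_size=5, padding=2)  # now much cheaper

x = torch.randn(1, 256, 28, 28)
y = conv5(reduce(x))
print(y.shape)  # torch.Size([1, 32, 28, 28])
```

Without the reduction, the 5x5 convolution would read all 256 channels directly, costing roughly three times as many multiply-adds.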
We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages
To improve the performance of a deep neural network,
Increase the Size
- Increasing the depth : the number of levels of the network
- Increasing the width : the number of units at each level
👍 An easy and safe way of training higher quality models
👎 A larger number of parameters, which makes the enlarged network more prone to overfitting
👎 Dramatically increased use of computational resources (see the calculation after this list)
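To make the computational point concrete, a small Python calculation (layer sizes are illustrative only) shows that doubling the width of two chained convolutional layers roughly quadruples the multiply-adds between them:

```python
def conv_mult_adds(c_in, c_out, k, h, w):
    """Multiply-adds of a k x k convolution over an h x w feature map."""
    return c_in * c_out * k * k * h * w

base = conv_mult_adds(128, 128, 3, 28, 28)  # 128 -> 128 channels
wide = conv_mult_adds(256, 256, 3, 28, 28)  # width doubled on both ends
print(wide / base)  # 4.0 -- cost grows quadratically with width
```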
The fundamental way to solve both issues would be to ultimately move from fully connected to sparsely connected architectures, even inside the convolutions
" Neurons that fire together, wire together " suggests that the underlying idea is applicable even under less strict conditions, in practice
On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures
The uniformity of the structure, a large number of filters, and a greater batch size allow for utilizing efficient dense computation
The vast literature on the sparse matrix computations suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication
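That clustering idea can be seen in SciPy's block sparse row (BSR) format, which stores a sparse matrix as dense sub-blocks so that multiplication runs on dense pieces; the matrix below is a toy example:

```python
import numpy as np
from scipy.sparse import bsr_matrix

# A sparse matrix whose nonzeros cluster into dense 2x2 blocks
dense = np.zeros((4, 4))
dense[0:2, 0:2] = 1.0
dense[2:4, 2:4] = 2.0

# BSR stores only the dense sub-blocks; multiplication then works
# block-by-block on dense data instead of element-by-element
blocked = bsr_matrix(dense, blocksize=(2, 2))
print(blocked.blocksize, blocked.nnz)  # (2, 2) 8
print(blocked @ np.ones(4))           # [2. 2. 4. 4.]
```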
After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection
🙋❓: Although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction.
☝️ Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components
- Translation invariance
: the property that the output does not change even when the input is shifted
- Filter Bank
: a collection of several filters (or kernels), used mainly to extract frequencies, patterns, and features
→ multiple filters detect a variety of patterns at once
As "Inception modules" are stacked on top of each other, their output correlation statistics are bound to vary
→ 👎 Even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters
✌️ Judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.
For memory efficiency during training, Inception modules were used only at the higher layers, keeping the lower layers in traditional convolutional fashion; this is not strictly necessary, simply reflecting some infrastructural inefficiencies in the current implementation.
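A sketch of the dimension-reduced Inception module described above, assuming PyTorch; the branch widths passed in at the bottom are illustrative placeholders:

```python
import torch
import torch.nn as nn

def conv_relu(c_in, c_out, **kw):
    # every convolution is followed by a rectified linear activation
    return nn.Sequential(nn.Conv2d(c_in, c_out, **kw), nn.ReLU(inplace=True))

class InceptionModule(nn.Module):
    """Four parallel branches concatenated along the channel dimension."""
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.branch1 = conv_relu(c_in, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            conv_relu(c_in, c3r, kernel_size=1),              # reduction
            conv_relu(c3r, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            conv_relu(c_in, c5r, kernel_size=1),              # reduction
            conv_relu(c5r, c5, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            conv_relu(c_in, pool_proj, kernel_size=1))        # projection

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)  # placeholder widths
print(m(x).shape)  # torch.Size([1, 256, 28, 28])
```

The 1x1 reductions in front of the 3x3 and 5x5 branches are what keep the concatenated output from blowing up in cost as modules are stacked.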
Main beneficial aspects of the architecture and its design
- Architecture
- it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity
- Design
- it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties.
Another way to utilize the Inception architecture
- Create slightly inferior, but computationally cheaper versions of it
All the included knobs and levers allow for a controlled balancing of computational resources, however this requires careful manual design
Knob
→ a continuously adjustable value
ex ) learning rate, regularization strength, dropout rate
Lever
→ a discrete choice
ex ) choice of activation function, choice of optimizer
They used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally.
All the convolutions, including those inside the Inception modules, use rectified linear activation.
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices, including even those with limited computational resources. The total number of layers (independent building blocks) used to construct the network is about 100, though the exact count depends on the machine learning infrastructure system used.
A move from fully connected layers to average pooling improved the top-1 accuracy; however, the use of dropout remained essential even after removing the fully connected layers.
Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern.
→ The strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative, so auxiliary classifiers were attached to these intermediate layers to encourage discrimination in the lower stages and increase the gradient signal that gets propagated back (see the sketch below)
Propagate gradient
: the process in which the error (loss) is passed backward through the network and the weights of each layer are updated
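A sketch of one auxiliary classifier head, assuming PyTorch; the layer sizes follow the paper's description (5x5 average pooling with stride 3, a 128-filter 1x1 convolution, a 1024-unit fully connected layer, and 70% dropout):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary head attached to an intermediate Inception output.
    During training its loss is added to the total loss with weight 0.3;
    at inference time the head is discarded."""
    def __init__(self, c_in, num_classes=1000):
        super().__init__()
        self.head = nn.Sequential(
            nn.AvgPool2d(kernel_size=5, stride=3),   # 14x14 -> 4x4
            nn.Conv2d(c_in, 128, kernel_size=1),     # dimension reduction
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.7),                       # 70% dropout
            nn.Linear(1024, num_classes))

    def forward(self, x):
        return self.head(x)

x = torch.randn(1, 512, 14, 14)      # e.g. an intermediate 14x14 feature map
print(AuxClassifier(512)(x).shape)   # torch.Size([1, 1000])
```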
Networks were trained using the DistBelief distributed machine learning system, using a modest amount of model and data parallelism. They used a CPU-based implementation only, the main limitation being memory usage.
Their training used asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule, decreasing the learning rate by 4% every 8 epochs (a single-machine stand-in is sketched after the glossary below).
Polyak averaging was used to create the final model used at inference time.
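Polyak averaging keeps a running average of the weights visited during training and evaluates that average instead of the final iterate. A minimal sketch of the running-average form:

```python
import copy
import torch

@torch.no_grad()
def update_average_model(avg_model, model, num_updates):
    """Running (Polyak) average: avg <- avg + (p - avg) / (n + 1)."""
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg += (p - p_avg) / (num_updates + 1)

# usage sketch: avg_model = copy.deepcopy(model); after each optimizer
# step call update_average_model(avg_model, model, step), and evaluate
# avg_model at inference time.
```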
👍 Sampling of variously sized patches of the image, whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3
👍 Photometric distortions by Andrew Howard were useful to combat overfitting to some extent (both augmentations are sketched below)
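Both augmentations map directly onto standard torchvision transforms; a sketch, with hypothetical jitter strengths for the photometric part:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    # patch area uniform in [8%, 100%], aspect ratio in [3/4, 4/3]
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    # photometric distortions in the spirit of Howard's augmentation;
    # the strengths below are assumed values, not from the paper
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```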
Parallelism
: splitting a computation across multiple devices (GPU, TPU, ...) and running the parts simultaneously
Stochastic Gradient Descent (SGD)
: updates the weights using only a randomly sampled subset of the training data rather than the entire training set
Asynchronous SGD
: multiple workers train independently, and each worker's gradients are applied to the shared weights as soon as its computation finishes
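A rough single-machine stand-in for the training setup above, assuming PyTorch; DistBelief's asynchronous multi-worker behavior is omitted, and the initial learning rate is a made-up value:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the actual network
# SGD with 0.9 momentum (single synchronous worker here)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# fixed schedule: decrease the learning rate by 4% every 8 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)

for epoch in range(24):
    # ... one epoch of training would run here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```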
The ILSVRC 2014 classification challenge involves the task of classifying an image into one of 1000 leaf-node categories in the ImageNet hierarchy.
Participated in the challenge with no external data used for training.
Training Techniques
1. Independently trained 7 versions of the same GoogLeNet model, and performed ensemble prediction with them
2. During testing, adopted a more aggressive cropping approach: 4 scales × 3 square slices × 6 crops (4 corners, center, and the resized square itself) × 2 mirrored versions = 144 crops per image
3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction (see the sketch after this list)
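A NumPy sketch of the averaging in step 3, with 7 models from step 1 and 144 crops from step 2 (the probabilities here are random placeholders):

```python
import numpy as np

n_models, n_crops, n_classes = 7, 4 * 3 * 6 * 2, 1000  # n_crops == 144

# softmax outputs for one image: one row per (model, crop) pair
probs = np.random.rand(n_models, n_crops, n_classes)
probs /= probs.sum(axis=-1, keepdims=True)

# average over crops and over the 7 ensemble members, then pick the top class
final = probs.mean(axis=(0, 1))
print(final.shape, final.argmax())
```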
When they used one model, they chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.
The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes
They report the official scores and common strategies for each team
: The use of external data, ensemble models or contextual models
Approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.
Main advantage
- Significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks
- The detection work was competitive despite neither utilizing context nor performing bounding box regression, which provides further evidence of the strength of the Inception architecture
Thank you!
Summary
Core idea of the Inception module
Key features of GoogLeNet
Experiments and performance results
Conclusions and significance