paper
reference
동빈나 (YouTube) https://youtu.be/58fuWVu5DVU
code https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
blog https://eehoeskrap.tistory.com/430
We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
The authors propose BN to address internal covariate shift, but the later paper 'How Does Batch Normalization Help Optimization?' argues that BN does not actually reduce internal covariate shift.
> phenomenon: the distribution of each layer's inputs changes during training, as the parameters of the previous layers change.
Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer.
As a result, BN achieves the same accuracy with fewer training steps and beats the original model by a significant margin.
If we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
To this end, we propose a new mechanism, which we call Batch Normalization. It makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
Whitening transformation
https://dive-into-ds.tistory.com/19
whitening: zero means, unit variances, decorrelated.
We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm.
However, the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated.
The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place.
Within this framework, whitening the layer inputs is expensive. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set.
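To make the cost concrete, here is a minimal NumPy sketch of ZCA whitening (the function name and details are my own illustration, not code from the paper): it needs the full covariance matrix and an eigendecomposition, which is exactly the expense BN avoids by normalizing each dimension independently.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA whitening: zero mean, unit variance, decorrelated features.

    X: (N, D) data matrix. Computing the D x D covariance and its
    eigendecomposition over the whole dataset is what makes full
    whitening expensive inside a training loop.
    """
    X_centered = X - X.mean(axis=0)               # zero mean per dimension
    cov = X_centered.T @ X_centered / X.shape[0]  # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigendecomposition (symmetric)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_centered @ W                         # decorrelated, unit variance
```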
$\hat{x} = \dfrac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$

where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, $\hat{x}$ is the normalized input, and $\gamma$, $\beta$ are learnable parameters.
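A minimal NumPy sketch of the transform above in training mode (the function and variable names are my own; the formula follows the paper): normalize each dimension with the mini-batch statistics, then scale and shift with the learnable $\gamma$, $\beta$.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BN transform for a mini-batch x of shape (N, D), training mode."""
    mu = x.mean(axis=0)                     # per-dimension mini-batch mean
    var = x.var(axis=0)                     # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: zero mean, unit variance
    y = gamma * x_hat + beta                # scale and shift (learnable parameters)
    cache = (x_hat, gamma, mu, var, eps, x) # kept for the backward pass
    return y, cache
```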
At inference time, the mini-batch statistics are replaced by moving averages of the mean and variance accumulated during training.
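A sketch of the inference path under that assumption (`running_mean`, `running_var`, and the momentum update are my own naming, not from the paper):

```python
def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-mode BN: use statistics accumulated during training,
    so the output no longer depends on the other examples in the batch."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# During training the running statistics are typically updated as an
# exponential moving average of the mini-batch statistics, e.g.:
#   running_mean = momentum * running_mean + (1 - momentum) * mu
#   running_var  = momentum * running_var  + (1 - momentum) * var
```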
In short: a layer that keeps each layer's activations spread out appropriately and evenly.
This started out as understanding and implementing the Batch Normalization paper,
but to verify its effect I also had to implement the fully connected layers myself.
Building the network from scratch gave me a proper understanding of the network structure and of backpropagation.
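Since the note above is about implementing backpropagation from scratch, and the kratzert post in the references walks through the gradient flow of this layer, here is a compact backward-pass sketch matching `batchnorm_forward` above (my own naming; the gradients follow the standard BN derivation).

```python
def batchnorm_backward(dout, cache):
    """Backward pass for batchnorm_forward above, compact form."""
    x_hat, gamma, mu, var, eps, x = cache
    N = dout.shape[0]

    dgamma = np.sum(dout * x_hat, axis=0)   # gradient of the scale
    dbeta = np.sum(dout, axis=0)            # gradient of the shift

    dx_hat = dout * gamma
    std_inv = 1.0 / np.sqrt(var + eps)
    # Gradient w.r.t. the input, folding in the dependence of the
    # mini-batch mean and variance on every example in the batch.
    dx = std_inv / N * (N * dx_hat
                        - np.sum(dx_hat, axis=0)
                        - x_hat * np.sum(dx_hat * x_hat, axis=0))
    return dx, dgamma, dbeta
```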