BatchNorm reduces internal covariate shift, which leads to more stable gradients and faster convergence during training, i.e., shorter training times and fewer training epochs.
→ makes the landscape of the corresponding optimization problem significantly smoother:
ensures that gradients are more predictive → the gradient direction remains fairly accurate when taking a larger step in the direction of a computed gradient
allows for the use of a larger range of learning rates and faster network convergence
improved Lipschitzness (β-smoothness) of both the loss & the gradients
↔ in contrast to a non-convex landscape with flat regions and sharp minima
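A minimal NumPy sketch of the transform behind these points, for a mini-batch of shape (N, D); the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over a mini-batch x of shape (N, D).

    Each feature is normalized with batch statistics, then rescaled with a
    learnable scale (gamma) and shift (beta).
    """
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learnable scale and shift

# toy usage: 32 samples, 64 features
x = np.random.randn(32, 64)
gamma, beta = np.ones(64), np.zeros(64)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # ~0 mean, ~1 std per feature
```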
BatchNorm acts as a form of regularization, reducing the need for other regularization techniques like dropout. It can mitigate overfitting to some extent.
BatchNorm allows for the use of higher learning rates without the risk of causing divergence or instability in the training process.
BatchNorm can make neural networks more robust to changes in initialization and architecture choices, i.e., less sensitive to hyperparameter choices.
Batch Norm in RNNs can be problematic because separate statistics must be computed and stored for each time step in a sequence, and sequence lengths can differ at test time. Layer Norm, on the other hand, depends only on the summed inputs to a layer at the current time step and has only one set of gain and bias parameters shared over all time steps.
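A minimal NumPy sketch of why this works, assuming an input of shape (batch, time, hidden); the helper name `layer_norm` is illustrative:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Layer norm: statistics are taken over the feature dimension of each
    time step, so nothing depends on the batch or on the sequence length."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + bias

# x: (batch, time, hidden). Batch norm would need separate statistics per
# time step across the batch, which breaks for unseen sequence lengths.
x = np.random.randn(4, 10, 32)
gain, bias = np.ones(32), np.zeros(32)  # one set of gain/bias shared over all time steps
y = layer_norm(x, gain, bias)
```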
According to Xiong et al.:
Why is the learning-rate warm-up stage essential?
The originally designed Post-LN Transformer places the layer normalization between the residual blocks
→ the expected gradients of the parameters near the output layer are large at initialization
→ using a large learning rate on those gradients makes training unstable
Why does the location of layer normalization matter?
If the layer normalization is put inside the residual blocks (Pre-LN Transformer), the gradients are well-behaved at initialization
→ the warm-up stage can be removed
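A schematic PyTorch sketch of the two placements, with a generic sublayer standing in for self-attention or the FFN; this is a simplification for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm sits between the residual blocks (original design)."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer          # e.g. self-attention or FFN
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))   # residual add first, then LN

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is inside the residual branch, so gradients are
    well-behaved at initialization and warm-up can be dropped."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))    # LN first, then residual add

# toy usage with a linear sublayer
block = PreLNBlock(64, nn.Linear(64, 64))
out = block(torch.randn(2, 10, 64))
```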
Instance Norm started from real-time image generation / style transfer.
Instance Norm serves as a preconditioner (for graph aggregation) for GNNs.
Preconditioning is weaker with Batch Norm due to heavy batch noise on graph datasets: batch-level statistics have larger variance on graph datasets.
The shift operation in Instance Norm (subtracting the mean statistics from node hidden representations) degrades the expressiveness of GNNs on highly regular graphs: removing mean statistics that carry structural information can hurt performance.
$F^{(k)}$: function that applies to each node separately
$Q$: matrix representing the neighbor aggregation
$W^{(k)}$: weight/parameter matrix in layer k
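Putting the legend together, the layer update and the instance-norm-style shift these notes refer to can be sketched as below; the ordering of factors and the normalization form are my reading of the GraphNorm setup, not a quote from the paper:

```latex
% GNN layer: aggregate neighbors with Q, transform with W^{(k)},
% then apply the node-wise function F^{(k)}
H^{(k)} = F^{(k)}\!\left( W^{(k)} H^{(k-1)} Q \right)

% Instance-norm-style normalization applied per graph: \mu_j and \sigma_j are
% computed over the nodes of one graph for feature dimension j; subtracting
% \mu_j is the "shift operation" that can discard structural information.
\mathrm{Norm}\big(\hat{h}_{i,j}\big)
  = \gamma_j \cdot \frac{\hat{h}_{i,j} - \mu_j}{\sigma_j} + \beta_j
```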
References
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Layer Normalization
Instance Normalization: The Missing Ingredient for Fast Stylization
GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training
On Layer Normalization in the Transformer Architecture