This post is about batch normalization, and specifically about the questions that stayed unresolved for me while reading the paper, rather than the concept itself. Those two things overlap, of course, but honestly the paper was a little hard for me to follow. Let's break down the concept of batch normalization and go through the parts that confused me.
BN's interaction with gradient descent goes beyond simple normalization due to the following mechanisms:
During training, BN calculates 𝜇 (mean) and 𝜎² (variance) from the current mini-batch. These statistics depend on the inputs in the mini-batch, which in turn depend on the learnable parameters Θ (e.g., the weights and biases of preceding layers).
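To make the step above concrete, here is a minimal NumPy sketch of the training-time normalization (the batch shape and ε value are my own choices, not from the paper's experiments):

```python
import numpy as np

# Toy mini-batch: 4 examples, 3 features (hypothetical shapes).
np.random.seed(0)
x = np.random.randn(4, 3)

# Per-feature statistics computed from the current mini-batch only.
mu = x.mean(axis=0)   # mean over the batch dimension
var = x.var(axis=0)   # (biased) variance over the batch dimension

eps = 1e-5            # small constant for numerical stability
x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations

print(x_hat.mean(axis=0))  # approximately 0 per feature
print(x_hat.var(axis=0))   # approximately 1 per feature
```

The key point is that `mu` and `var` are recomputed from each mini-batch, so anything that changes `x` (i.e., the parameters of earlier layers) also changes the statistics.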
When computing the loss gradient, 𝜇 and 𝜎 are treated as functions of Θ, so their partial derivatives (∂𝜇/∂Θ and ∂𝜎/∂Θ) are incorporated into the backpropagation process. This ensures that parameter updates account for the effect of normalization on the training dynamics.
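This is the part that confused me most, so here is a sketch of the full BN backward pass in NumPy, with the μ and σ² terms included, checked against a finite-difference gradient. The random upstream gradient `r` and the shapes are my own illustrative choices:

```python
import numpy as np

np.random.seed(1)
N, D = 8, 3
x = np.random.randn(N, D)
eps = 1e-5

def bn_forward(x):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat, mu, var

# Scalar loss with a random (hypothetical) upstream gradient r.
r = np.random.randn(N, D)
loss = lambda x: (bn_forward(x)[0] * r).sum()

# Analytic gradient that treats mu and var as functions of x,
# i.e. the chain-rule terms through mu and var are included.
x_hat, mu, var = bn_forward(x)
dxhat = r
inv_std = 1.0 / np.sqrt(var + eps)
dvar = (dxhat * (x - mu) * -0.5 * inv_std**3).sum(axis=0)
dmu = (dxhat * -inv_std).sum(axis=0) + dvar * (-2.0 * (x - mu)).mean(axis=0)
dx = dxhat * inv_std + dvar * 2.0 * (x - mu) / N + dmu / N

# Finite-difference check: perturb each input and re-run the forward pass.
num = np.zeros_like(x)
h = 1e-5
for i in range(N):
    for j in range(D):
        xp = x.copy(); xp[i, j] += h
        xm = x.copy(); xm[i, j] -= h
        num[i, j] = (loss(xp) - loss(xm)) / (2 * h)

print(np.max(np.abs(dx - num)))  # small: the analytic gradient matches
```

If you dropped the `dvar` and `dmu` terms (treating the statistics as constants), the check would fail: the gradient really does flow through the batch statistics.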
The trainable parameters 𝛾 and 𝛽 are explicitly optimized during gradient descent.
Their inclusion allows the network to "learn" the appropriate scale and shift for normalized values, ensuring flexibility in representation.
Without these parameters, the network would be constrained to zero-mean, unit-variance outputs at every normalized layer, which could limit what each layer can represent. In the extreme, setting 𝛾 to the batch standard deviation and 𝛽 to the batch mean lets BN recover the identity transform, so normalization never has to reduce the layer's expressive power.
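The point about 𝛾 and 𝛽 preserving expressiveness can be demonstrated in a few lines. Here I initialize them to the values that undo the normalization (a sketch, with arbitrary batch shape):

```python
import numpy as np

np.random.seed(2)
x = np.random.randn(6, 4)
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Learnable scale/shift, set here to the values that
# recover the original activations: gamma = sqrt(var + eps), beta = mu.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: BN can represent the identity
```

In practice 𝛾 and 𝛽 are initialized to 1 and 0 and then learned by gradient descent; the demo just shows the identity is inside the family of transforms BN can express.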
BN indirectly regularizes the gradients by keeping activations within a stable range. This reduces the risk of vanishing or exploding gradients, especially in deep networks.
By stabilizing activations, BN helps gradient updates remain meaningful across layers, improving convergence.
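To see the stabilization effect numerically, here is a toy experiment of my own (not from the paper): a deep stack of linear layers with a deliberately bad initialization scale, run with and without per-feature normalization.

```python
import numpy as np

np.random.seed(3)
x = np.random.randn(64, 100)

def deep_pass(x, normalize, seed=3):
    """Push x through 20 hypothetical linear layers (no nonlinearity)."""
    rng = np.random.default_rng(seed)
    h = x
    for _ in range(20):
        w = rng.standard_normal((100, 100)) * 0.5  # bad init: gain > 1
        h = h @ w
        if normalize:
            # Per-feature normalization over the batch, as BN does.
            h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)
    return h

plain = deep_pass(x, normalize=False)   # activations explode layer by layer
normed = deep_pass(x, normalize=True)   # activations stay near unit scale

print(np.abs(plain).max())   # astronomically large
print(np.abs(normed).std())  # close to 1
```

Without normalization the activation magnitude (and hence the gradient magnitude on the backward pass) grows geometrically with depth; with it, every layer sees inputs on roughly the same scale, which is the stabilization the paragraph above describes.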