This post is about batch normalization, and specifically about the questions that stayed unresolved for me while reading the paper, rather than the concept itself. Those two things overlap, of course, but honestly the paper was a little hard for me to follow. Let's break down the concept of batch normalization and go through the parts that confused me.
BN's interaction with gradient descent goes beyond simple normalization due to the following mechanisms:
During training, BN calculates 𝜇 (mean) and 𝜎² (variance) from the current mini-batch. These statistics depend on the inputs in the mini-batch, which in turn depend on the learnable parameters Θ (e.g., the weights and biases of preceding layers).
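To make the step above concrete, here is a minimal NumPy sketch of the training-time normalization (the batch shape and ε value are my own choices, not from the paper's experiments):

```python
import numpy as np

# Toy mini-batch: 4 examples, 3 features (hypothetical shapes).
np.random.seed(0)
x = np.random.randn(4, 3)

# Per-feature statistics computed from the current mini-batch only.
mu = x.mean(axis=0)   # mean over the batch dimension
var = x.var(axis=0)   # (biased) variance over the batch dimension

eps = 1e-5            # small constant for numerical stability
x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations

print(x_hat.mean(axis=0))  # approximately 0 per feature
print(x_hat.var(axis=0))   # approximately 1 per feature
```

The key point is that `mu` and `var` are recomputed from each mini-batch, so anything that changes `x` (i.e., the parameters of earlier layers) also changes the statistics.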
When computing the loss gradient, 𝜇 and 𝜎 are treated as functions of Θ, so their partial derivatives (∂𝜇/∂Θ and ∂𝜎/∂Θ) are incorporated into the backpropagation process. This ensures that parameter updates account for the effect of normalization on the training dynamics.
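This is the part that confused me most, so here is a sketch of the full BN backward pass in NumPy, with the μ and σ² terms included, checked against a finite-difference gradient. The random upstream gradient `r` and the shapes are my own illustrative choices:

```python
import numpy as np

np.random.seed(1)
N, D = 8, 3
x = np.random.randn(N, D)
eps = 1e-5

def bn_forward(x):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat, mu, var

# Scalar loss with a random (hypothetical) upstream gradient r.
r = np.random.randn(N, D)
loss = lambda x: (bn_forward(x)[0] * r).sum()

# Analytic gradient that treats mu and var as functions of x,
# i.e. the chain-rule terms through mu and var are included.
x_hat, mu, var = bn_forward(x)
dxhat = r
inv_std = 1.0 / np.sqrt(var + eps)
dvar = (dxhat * (x - mu) * -0.5 * inv_std**3).sum(axis=0)
dmu = (dxhat * -inv_std).sum(axis=0) + dvar * (-2.0 * (x - mu)).mean(axis=0)
dx = dxhat * inv_std + dvar * 2.0 * (x - mu) / N + dmu / N

# Finite-difference check: perturb each input and re-run the forward pass.
num = np.zeros_like(x)
h = 1e-5
for i in range(N):
    for j in range(D):
        xp = x.copy(); xp[i, j] += h
        xm = x.copy(); xm[i, j] -= h
        num[i, j] = (loss(xp) - loss(xm)) / (2 * h)

print(np.max(np.abs(dx - num)))  # small: the analytic gradient matches
```

If you dropped the `dvar` and `dmu` terms (treating the statistics as constants), the check would fail: the gradient really does flow through the batch statistics.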
The trainable parameters 𝛾 and 𝛽 are explicitly optimized during gradient descent.
Their inclusion allows the network to "learn" the appropriate scale and shift for normalized values, ensuring flexibility in representation.
Without these parameters, the network would be constrained to zero-mean, unit-variance outputs at every normalized layer, which could limit what each layer can represent. In the extreme, setting 𝛾 to the batch standard deviation and 𝛽 to the batch mean lets BN recover the identity transform, so normalization never has to reduce the layer's expressive power.
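The point about 𝛾 and 𝛽 preserving expressiveness can be demonstrated in a few lines. Here I initialize them to the values that undo the normalization (a sketch, with arbitrary batch shape):

```python
import numpy as np

np.random.seed(2)
x = np.random.randn(6, 4)
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Learnable scale/shift, set here to the values that
# recover the original activations: gamma = sqrt(var + eps), beta = mu.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: BN can represent the identity
```

In practice 𝛾 and 𝛽 are initialized to 1 and 0 and then learned by gradient descent; the demo just shows the identity is inside the family of transforms BN can express.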
BN indirectly regularizes the gradients by keeping activations within a stable range. This reduces the risk of vanishing or exploding gradients, especially in deep networks.
By stabilizing activations, BN helps gradient updates remain meaningful across layers, improving convergence.
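To see the stabilization effect numerically, here is a toy experiment of my own (not from the paper): a deep stack of linear layers with a deliberately bad initialization scale, run with and without per-feature normalization.

```python
import numpy as np

np.random.seed(3)
x = np.random.randn(64, 100)

def deep_pass(x, normalize, seed=3):
    """Push x through 20 hypothetical linear layers (no nonlinearity)."""
    rng = np.random.default_rng(seed)
    h = x
    for _ in range(20):
        w = rng.standard_normal((100, 100)) * 0.5  # bad init: gain > 1
        h = h @ w
        if normalize:
            # Per-feature normalization over the batch, as BN does.
            h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)
    return h

plain = deep_pass(x, normalize=False)   # activations explode layer by layer
normed = deep_pass(x, normalize=True)   # activations stay near unit scale

print(np.abs(plain).max())   # astronomically large
print(np.abs(normed).std())  # close to 1
```

Without normalization the activation magnitude (and hence the gradient magnitude on the backward pass) grows geometrically with depth; with it, every layer sees inputs on roughly the same scale, which is the stabilization the paragraph above describes.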