[Init] Kaiming Initialization

안암동컴맹 · April 21, 2024

Kaiming Initialization

Introduction

Kaiming Initialization, also known as He Initialization, is a technique used to initialize the weights of deep neural networks, particularly those with ReLU activation functions. It is named after Kaiming He, who proposed this method to address the issues of vanishing and exploding gradients in deep networks. Proper weight initialization is crucial as it significantly impacts the convergence and performance of the model.

Background and Theory

Problem with Poor Initialization

In neural networks, especially deep architectures, the choice of weight initialization can significantly affect the training dynamics. Poor initialization can lead to problems such as the following (a small numerical sketch appears after the list):

  • Vanishing gradients: gradients become too small, so the weights effectively stop updating during backpropagation.
  • Exploding gradients: gradients become too large, leading to an unstable training process.
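The sketch below (pure NumPy; the width of 512, depth of 20, and naive standard deviation of 0.01 are assumptions chosen only for illustration) passes a random batch through a stack of ReLU layers twice: once with a naively small Gaussian initialization and once with the Kaiming scale derived later in this post. The activation scale collapses toward zero in the first case and stays roughly constant in the second.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 20                      # assumed sizes, for illustration only
x = rng.standard_normal((1000, width))      # random input batch

h = x
for _ in range(depth):
    W = rng.normal(0.0, 0.01, size=(width, width))                # naive, too-small init
    h = np.maximum(0.0, h @ W)                                     # ReLU
print("activation std, naive init  :", h.std())                    # collapses toward 0

h = x
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))  # Kaiming scale
    h = np.maximum(0.0, h @ W)
print("activation std, Kaiming init:", h.std())                     # stays on the order of 1
```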

Theoretical Foundations

Kaiming Initialization addresses these issues by scaling the weights so that the variance of a layer's outputs stays equal to the variance of its inputs. This balance helps maintain a stable gradient flow across layers, which is critical in deep networks.

Mathematical Formulation

For a layer with $n_{in}$ incoming connections (fan-in), Kaiming Initialization samples the weights $W$ as:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$

where $\mathcal{N}\left(0, \frac{2}{n_{in}}\right)$ denotes a normal distribution with mean $0$ and variance $\frac{2}{n_{in}}$. This choice comes from the observation that the ReLU function, which outputs zero for any negative input and acts as the identity for positive inputs, effectively halves the variance of the outputs compared to the variance of the inputs; the factor of $2$ compensates for this reduction.
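As a sanity check on the "halves the variance" claim, the following snippet (pure NumPy; the sample size is arbitrary) compares the second moment of zero-mean Gaussian pre-activations before and after applying ReLU.

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.standard_normal(1_000_000)   # zero-mean, unit-variance pre-activations
h = np.maximum(0.0, z)               # ReLU

print(np.mean(z ** 2))               # ~1.0
print(np.mean(h ** 2))               # ~0.5, i.e. half the input's second moment
```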

Procedural Steps

  1. Identify fan-in ($n_{in}$): Determine the number of incoming connections to a neuron in the layer for which the weights need to be initialized.
  2. Calculate variance: Set the variance of the weights to $\frac{2}{n_{in}}$.
  3. Initialize weights: Draw random values for the weights from a zero-mean normal distribution with the calculated variance.
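These three steps translate directly into a few lines of code. The following is a minimal NumPy sketch; the function name kaiming_normal and the layer sizes are illustrative, not taken from the original paper or any specific framework.

```python
import numpy as np

def kaiming_normal(fan_in: int, fan_out: int, rng=None) -> np.ndarray:
    """Illustrative Kaiming (He) normal initialization for a dense weight matrix.

    Step 1: fan_in is the number of incoming connections per neuron.
    Step 2: the weight variance is set to 2 / fan_in.
    Step 3: weights are drawn from N(0, 2 / fan_in).
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / fan_in)                    # square root of the target variance
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = kaiming_normal(fan_in=784, fan_out=256)
print(W.std())   # close to sqrt(2 / 784) ≈ 0.0505
```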

Applications

Kaiming Initialization is widely used in networks with ReLU activations and their variants. It has proven effective in the following settings (a framework usage sketch follows the list):

  • Convolutional Neural Networks (CNNs)
  • Fully Connected Networks
  • Deep Learning models that use variants of ReLU like Leaky ReLU and Parametric ReLU
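In practice, deep learning frameworks ship this initializer. Below is a usage sketch with PyTorch's torch.nn.init.kaiming_normal_; the small convolutional model is an assumed example, not an architecture from this post.

```python
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    # Kaiming normal init for conv and fully connected layers feeding ReLUs.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),          # assumes 32x32 inputs for illustration
)
model.apply(init_weights)                  # recursively visits every submodule
```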

Strengths and Limitations

Strengths

  • Improves convergence rate: By maintaining the variance across layers, it helps in faster and more stable convergence.
  • Reduces the problem of vanishing/exploding gradients: Especially critical in deep networks.

Limitations

  • Specific to ReLU activations: The initialization may not be optimal for other activation functions like sigmoid or tanh.
  • Empirical nature: The effectiveness can vary depending on the specific architecture and settings of the neural network.

Advanced Topics

Extension to Different Activation Functions

Kaiming Initialization can be adjusted to better suit other activation functions: the scaling factor (gain) is chosen according to how much the activation in question is expected to reduce the variance, as sketched below.
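As a concrete illustration, for Leaky ReLU with negative slope $a$ the He-style gain becomes $\sqrt{2 / (1 + a^2)}$ instead of $\sqrt{2}$. The sketch below computes it by hand and compares it with PyTorch's torch.nn.init.calculate_gain; the slope and fan-in values are assumptions for illustration.

```python
import math
import torch.nn as nn

a = 0.01                                          # assumed Leaky ReLU negative slope
manual_gain = math.sqrt(2.0 / (1.0 + a ** 2))
torch_gain = nn.init.calculate_gain("leaky_relu", a)
print(manual_gain, torch_gain)                    # both ≈ 1.414

# The resulting weight standard deviation for a layer with fan_in inputs:
fan_in = 256
print(manual_gain / math.sqrt(fan_in))
```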

References

  1. He, Kaiming, et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.