Xavier Initialization, also known as Glorot Initialization, is a strategy for weight initialization in neural networks. It was introduced by Xavier Glorot and Yoshua Bengio to tackle the problem of vanishing and exploding gradients in deep neural networks, particularly those with sigmoid and tanh activation functions. This technique is aimed at maintaining a balanced variance of activations and gradients throughout the network, thus facilitating stable and efficient training.
Initial weight settings in neural networks have a profound impact on the training dynamics and the ultimate performance of the model. Improper initialization can lead to:

- Vanishing gradients, where gradient magnitudes shrink layer by layer until the early layers effectively stop learning.
- Exploding gradients, where gradient magnitudes grow uncontrollably and destabilize training.
- Slow or stalled convergence, since poorly scaled inputs push sigmoid or tanh units into their saturated regions, where gradients are near zero.
Xavier Initialization addresses these issues by aiming to keep the variance of the inputs and outputs of each network layer approximately equal. This approach is rooted in the need to maintain the flow of gradients in a controlled manner across the depth of the network, avoiding the extremes of vanishing and exploding values.
For a layer with $n_{\text{in}}$ incoming connections (fan-in) and $n_{\text{out}}$ outgoing connections (fan-out), Xavier Initialization proposes drawing the weights from a distribution with variance

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$
This variance is derived under the assumption that the activations are linear (or operate in their approximately linear regime). For a uniform distribution, the weights are initialized as

$$W \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$$
For a normal distribution, the initialization would be

$$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$
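As a concrete illustration, both sampling schemes can be written in a few lines of NumPy. This is a minimal sketch: the function names (`xavier_uniform`, `xavier_normal`) and the seeded generator are choices made here, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Uniform limit sqrt(6 / (fan_in + fan_out)) gives Var(W) = 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    # Standard deviation sqrt(2 / (fan_in + fan_out)) gives the same variance.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)
print(W.var())  # close to 2 / (256 + 128) ~= 0.0052
```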
This approach ensures that the variance remains balanced across layers, a crucial factor when using activation functions like sigmoid or tanh, which are sensitive to input scale.
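A quick numerical check (a toy demonstration, not taken from the original paper) makes this concrete: stacking tanh layers initialized with the Xavier variance keeps the activation scale at a stable, non-vanishing level from layer to layer, whereas a badly scaled initialization makes it collapse or blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                          # width of every layer in this toy network
h = rng.normal(size=(1000, n))   # batch of 1000 random inputs with unit variance

for layer in range(10):
    # Xavier-scaled weights: Var(W) = 2 / (n + n) = 1 / n.
    W = rng.normal(0.0, np.sqrt(2.0 / (n + n)), size=(n, n))
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: activation std = {h.std():.3f}")
    # Replacing the Xavier std with, e.g., 0.01 makes the printed std shrink toward zero.
```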
Xavier Initialization is particularly beneficial for networks using activation functions like sigmoid and tanh, due to their sensitivity to the scale of the input distribution. It is widely applied in:

- Deep feedforward (fully connected) networks, the setting studied in the original paper.
- Convolutional networks, where fan-in and fan-out are computed from the kernel size and channel counts.
- Autoencoders and recurrent architectures that rely on tanh or sigmoid units.

A framework-level usage sketch follows the list.
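Most deep learning frameworks ship this initializer directly. For example, PyTorch exposes it as `torch.nn.init.xavier_uniform_` and `torch.nn.init.xavier_normal_`; the layer sizes below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# In-place Xavier (Glorot) initialization of the weight matrix.
nn.init.xavier_uniform_(layer.weight)    # uniform variant
# nn.init.xavier_normal_(layer.weight)   # normal variant
nn.init.zeros_(layer.bias)               # biases are commonly set to zero

print(layer.weight.var())  # roughly 2 / (256 + 128)
```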
Because the derivation assumes linear activations, adjustments to the Xavier method have been proposed for non-linear units. For instance, rescaling the variance with an activation-specific gain better matches how functions like tanh and sigmoid compress their inputs, and can improve training behaviour in practice.
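One common form of such an adjustment is an activation-dependent gain applied on top of the Xavier variance. PyTorch provides `torch.nn.init.calculate_gain` for this; the sketch below shows the idea, with the layer sizes again chosen arbitrarily.

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# calculate_gain('tanh') returns 5/3, a heuristic correction that rescales the
# Xavier limit to compensate for tanh shrinking the variance of its inputs.
gain = nn.init.calculate_gain('tanh')
nn.init.xavier_uniform_(layer.weight, gain=gain)
```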
- Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249-256. 2010.