[Optimizer] Adam Optimization

안암동컴맹 · April 11, 2024


Introduction

Adam (Adaptive Moment Estimation) is a popular optimization algorithm widely used for training deep neural networks. Introduced by Kingma and Ba in their 2014 paper, Adam combines the advantages of two other extensions of stochastic gradient descent, RMSProp and the Adaptive Gradient Algorithm (AdaGrad), and handles sparse gradients on noisy problems well. Adam is particularly effective because it pairs per-parameter adaptive learning rates (as in AdaGrad and RMSProp) with momentum, i.e., it takes past gradients into account to smooth the updates.

Background and Theory

Adam is distinguished by its use of squared gradients to scale the learning rate and its incorporation of an exponentially decaying average of past gradients, similar to momentum. It calculates individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

The update rule is motivated by the desire to take well-scaled gradient steps, adjusting each parameter update using computationally cheap running estimates of the first and second moments of its gradients. The bias corrections help Adam make effective updates even in the initial time steps of optimization, when the moment estimates are still highly inaccurate because they are initialized at zero.
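For a concrete sense of the bias correction: with the common choice $\beta_1 = 0.9$ and $m_0 = 0$, the first step gives $m_1 = 0.1\, g_1$, which underestimates the gradient by a factor of ten; dividing by $1 - \beta_1^1 = 0.1$ yields $\hat{m}_1 = g_1$, removing the initialization bias exactly at $t = 1$.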

Mathematical Formulation

Adam's parameters are updated according to the following equations; a minimal code sketch of the full update follows the list.

  1. Gradient Calculation:

    $$g_t = \nabla_\theta L(\theta_t)$$

    Where $g_t$ is the gradient of the loss function $L$ with respect to the parameters at step $t$.

  2. Update Biased First Moment Estimate:

    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

    Here, $m_t$ represents the exponential moving average of the gradients, and $\beta_1$ is the decay rate for this moving average.

  3. Update Biased Second Raw Moment Estimate:

    $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

    Where $v_t$ is the exponential moving average of the squared gradients, and $\beta_2$ is the decay rate for the squared gradients.

  4. Compute Bias-corrected Moment Estimates:

    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

    These bias corrections help adjust for the initialization bias towards zero.

  5. Update Parameters:

    $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

    Where $\eta$ is the step size (learning rate), and $\epsilon$ is a small number added to prevent division by zero.
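Putting the five equations together, a single Adam update can be written as a small, self-contained NumPy sketch. This is an illustrative implementation rather than a reference one: the function name adam_step and its argument names are chosen here for readability, and the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ follow the values suggested in the original paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based timestep, and `m`, `v` start at zero."""
    m = beta1 * m + (1 - beta1) * grad            # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v
```

Note that $m$ and $v$ must persist across iterations, which is why they are passed in and returned alongside the parameters.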

Procedural Steps

  1. Initialization: Initialize the parameters $\theta$, the first moment vector $m = 0$, the second moment vector $v = 0$, and the timestep $t = 0$.
  2. Compute Gradient: Calculate the gradient $g_t$ of the loss function with respect to the parameters.
  3. Update Moments: Update the first moment $m$ and second moment $v$ estimates, then compute their bias-corrected versions $\hat{m}$ and $\hat{v}$.
  4. Parameter Update: Adjust the parameters using the bias-corrected first and second moment estimates.
  5. Iteration: Repeat steps 2-4 until convergence or a specified number of epochs is completed (a usage sketch follows below).
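As a usage sketch, the loop below drives the adam_step function from the previous snippet on a toy quadratic objective; the objective and its gradient are hypothetical and serve only to show how steps 2-4 are repeated.

```python
import numpy as np

# Hypothetical objective: f(theta) = ||theta - 3||^2, with gradient 2 * (theta - 3).
theta = np.zeros(5)
m, v = np.zeros_like(theta), np.zeros_like(theta)

for t in range(1, 1001):                          # repeat steps 2-4
    grad = 2.0 * (theta - 3.0)                    # step 2: compute gradient
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)   # steps 3-4

print(theta)   # should end up close to the minimizer [3, 3, 3, 3, 3]
```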

Applications

Adam's effectiveness and straightforward implementation have made it a popular choice in a wide range of machine learning tasks, including but not limited to training deep convolutional neural networks, recurrent neural networks, and large-scale unsupervised learning models.
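In practice Adam is usually taken from a framework rather than written by hand. The snippet below is a minimal sketch assuming PyTorch, using torch.optim.Adam with its standard hyperparameters; the model, data, and loss function are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder batch

for epoch in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                # compute g_t by backpropagation
    optimizer.step()               # apply the Adam update
```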

Strengths and Limitations

Strengths

  • Efficiency: Performs well on large datasets and in high-dimensional parameter spaces.
  • Robustness: Less sensitive to hyperparameter settings, particularly the learning rate.

Limitations

  • Memory Usage: Requires additional memory (relative to plain SGD) to store the first and second moment vectors for every parameter.
  • Bias Towards Initial Steps: Early in training, estimates of $m$ and $v$ can be significantly off due to their initialization at zero.

Advanced Topics

  • AdamW: A variant that decouples weight decay from the gradient-based update, often leading to better generalization.
  • AMSGrad: A modification that keeps the running maximum of the second moment estimate, ensuring a non-increasing effective step size and aiming to improve the theoretical convergence properties of Adam.
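Assuming PyTorch again, both variants are available off the shelf: torch.optim.AdamW implements decoupled weight decay, and torch.optim.Adam exposes AMSGrad through its amsgrad flag. The snippet below is a minimal sketch; the model and hyperparameter values are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model

# AdamW: weight decay is applied directly to the weights,
# decoupled from the gradient-based Adam update.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AMSGrad: keeps the running maximum of v_t so the effective step size never grows.
opt_amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```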

References

  1. Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).