[Optimizer] NAdam Optimization

안암동컴맹 · April 14, 2024

Nesterov-accelerated Adam (NAdam)

Introduction

NAdam, or Nesterov-accelerated Adaptive Moment Estimation, is an optimization algorithm that combines the techniques of Adam and Nesterov momentum. It was developed to enhance the convergence properties of the Adam optimizer by integrating the predictive update step characteristic of Nesterov accelerated gradient (NAG), thus potentially leading to faster and more stable convergence in training deep learning models.

Background and Theory

Adam Optimization

Adam is an optimizer that computes adaptive learning rates for each parameter by estimating the first moment (the mean) and the second moment (the uncentered variance) of the gradients. Its update rule is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where:

  • $\theta_t$ is the parameter vector at time step $t$,
  • $\eta$ is the learning rate,
  • $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates of the first and second moments of the gradients,
  • $\epsilon$ is a small constant added for numerical stability.
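
To make the notation concrete, below is a minimal NumPy sketch of a single Adam update step; the function and variable names (`adam_step`, `m`, `v`, etc.) are illustrative rather than taken from any particular library:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the updated parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```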

Nesterov Momentum

Nesterov momentum is a variant of the momentum update that provides a "look-ahead" capability, making it more responsive to changes in the gradient. The Nesterov update is generally formulated as:

$$\theta_{t+1} = \theta_t + \mu_t \Delta\theta_{t-1} - \eta\,\nabla f(\theta_t + \mu_t \Delta\theta_{t-1})$$

where $\mu_t$ is the momentum coefficient and $\Delta\theta_{t-1}$ is the previous parameter update.
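
For comparison with the Adam sketch above, here is a minimal sketch of one classical Nesterov momentum step under this formulation; `grad_fn` stands in for an assumed callable that returns $\nabla f$ at a given point:

```python
def nesterov_step(theta, velocity, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov momentum update: the gradient is evaluated at the look-ahead point."""
    lookahead = theta + mu * velocity              # theta_t + mu * delta_theta_{t-1}
    velocity = mu * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity              # new parameters and the update to reuse next step
```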

NAdam Algorithm

NAdam incorporates the benefits of both Adam's adaptive learning rates and Nesterov momentum's anticipatory updates. Its update rule modifies Adam's by replacing the bias-corrected first moment with a Nesterov-style combination of the momentum term and the current gradient:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right)$$

where $g_t$ is the gradient at time step $t$, and $\beta_1$ is the exponential decay rate for the first-moment estimates.

Procedural Steps

  1. Initialize Parameters: Define the initial parameters $\theta_0$, learning rate $\eta$, decay rates $\beta_1$ (which also plays the role of the momentum coefficient) and $\beta_2$, and $\epsilon$; set $m_0 = 0$ and $v_0 = 0$.
  2. Compute Gradients: At each iteration $t$, compute the gradient $g_t$ of the loss function with respect to the parameters $\theta_t$.
  3. Update Biased Moments:
    • First moment (mean) estimate: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
    • Second moment (uncentered variance) estimate: $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
  4. Correct Bias in Moments:
    • Corrected first moment: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$
    • Corrected second moment: $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$
  5. Nesterov Update:
    • Use Nesterov's look-ahead to combine the corrected momentum with the current gradient (a full-step sketch in code follows this list):
      $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right)$$
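
Putting these steps together, the following is a minimal NumPy sketch of one full NAdam update implementing the equations above; the names `nadam_step`, `m`, and `v` are illustrative:

```python
import numpy as np

def nadam_step(theta, m, v, grad, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One NAdam step: Adam's moment estimates with a Nesterov-style look-ahead on the gradient."""
    # Steps 3-4: biased moment estimates and bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 5: combine the corrected momentum with the bias-corrected current gradient
    m_bar = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

In use, `m` and `v` start as zero arrays with the same shape as `theta`, and `t` starts at 1 so the bias-correction denominators stay nonzero.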

Applications

NAdam is employed in scenarios where the benefits of both Adam and Nesterov momentum are desired, such as in training complex neural networks where faster convergence may reduce training time and improve model performance.
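
In practice the optimizer usually comes from a framework rather than a hand-written loop; as an illustration, assuming a recent PyTorch installation, `torch.optim.NAdam` can be dropped into a standard training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                           # toy model for illustration
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3, betas=(0.9, 0.999))
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)     # dummy batch
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```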

Strengths and Limitations

Strengths:

  • Combines the advantages of adaptive learning rates with the predictive updates of Nesterov momentum.
  • Can lead to faster convergence compared to using Adam alone.

Limitations:

  • More sensitive to hyperparameter settings, particularly the learning rate and decay rates.
  • The benefits over Adam or Nesterov alone are not guaranteed and can depend heavily on the specific problem and data characteristics.

Advanced Topics

Explorations into further modifications of NAdam might include adaptive adjustments of the hyperparameters during training, or combining it with other regularization methods to enhance model generalization.

References

  1. Dozat, Timothy. "Incorporating Nesterov Momentum into Adam." ICLR Workshop, 2016.