[Optimizer] NAdam Optimization

안암동컴맹 · April 14, 2024

Nesterov-accelerated Adam (NAdam)

Introduction

NAdam, or Nesterov-accelerated Adaptive Moment Estimation, is an optimization algorithm that combines the techniques of Adam and Nesterov momentum. It was developed to enhance the convergence properties of the Adam optimizer by integrating the predictive update step characteristic of Nesterov accelerated gradient (NAG), thus potentially leading to faster and more stable convergence in training deep learning models.

Background and Theory

Adam Optimization

Adam is an optimizer that computes adaptive learning rates for each parameter by estimating the first moment (the mean) and the second moment (the uncentered variance) of the gradients. Its update rule is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where:

  • $\theta_t$ is the parameter vector at time step $t$,
  • $\eta$ is the learning rate,
  • $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates of the first and second moments of the gradients,
  • $\epsilon$ is a small constant added for numerical stability.
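
To make the notation concrete, below is a minimal NumPy sketch of a single Adam update step; the function and variable names (`adam_step`, `m`, `v`, etc.) are illustrative rather than taken from any particular library:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the updated parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```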

Nesterov Momentum

Nesterov momentum is a variant of the momentum update that provides a "look-ahead" capability, making it more responsive to changes in the gradient. The Nesterov update is generally formulated as:

$$\theta_{t+1} = \theta_t + \mu_t \Delta\theta_{t-1} - \eta\,\nabla f(\theta_t + \mu_t \Delta\theta_{t-1})$$

where $\mu_t$ is the momentum coefficient and $\Delta\theta_{t-1}$ is the previous parameter update.
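
For comparison with the Adam sketch above, here is a minimal sketch of one classical Nesterov momentum step under this formulation; `grad_fn` stands in for an assumed callable that returns $\nabla f$ at a given point:

```python
def nesterov_step(theta, velocity, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov momentum update: the gradient is evaluated at the look-ahead point."""
    lookahead = theta + mu * velocity              # theta_t + mu * delta_theta_{t-1}
    velocity = mu * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity              # new parameters and the update to reuse next step
```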

NAdam Algorithm

NAdam incorporates the benefits of both Adam's adaptive learning rates and Nesterov momentum's anticipatory updates. Its update rule modifies Adam's by replacing the bias-corrected first moment with a Nesterov-style combination of the momentum term and the current gradient:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right)$$

where $g_t$ is the gradient at time step $t$, and $\beta_1$ is the exponential decay rate for the first-moment estimates.

Procedural Steps

  1. Initialize Parameters: Define the initial parameters $\theta_0$, learning rate $\eta$, decay rates $\beta_1$ (which also plays the role of the momentum coefficient) and $\beta_2$, and $\epsilon$; set $m_0 = 0$ and $v_0 = 0$.
  2. Compute Gradients: At each iteration $t$, compute the gradient $g_t$ of the loss function with respect to the parameters $\theta_t$.
  3. Update Biased Moments:
    • First moment (mean) estimate: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
    • Second moment (uncentered variance) estimate: $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
  4. Correct Bias in Moments:
    • Corrected first moment: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$
    • Corrected second moment: $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$
  5. Nesterov Update:
    • Use Nesterov's look-ahead to combine the corrected momentum with the current gradient (a full-step sketch in code follows this list):
      $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right)$$
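
Putting these steps together, the following is a minimal NumPy sketch of one full NAdam update implementing the equations above; the names `nadam_step`, `m`, and `v` are illustrative:

```python
import numpy as np

def nadam_step(theta, m, v, grad, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One NAdam step: Adam's moment estimates with a Nesterov-style look-ahead on the gradient."""
    # Steps 3-4: biased moment estimates and bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step 5: combine the corrected momentum with the bias-corrected current gradient
    m_bar = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    theta = theta - lr * m_bar / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

In use, `m` and `v` start as zero arrays with the same shape as `theta`, and `t` starts at 1 so the bias-correction denominators stay nonzero.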

Applications

NAdam is employed in scenarios where the benefits of both Adam and Nesterov momentum are desired, such as in training complex neural networks where faster convergence may reduce training time and improve model performance.
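
In practice the optimizer usually comes from a framework rather than a hand-written loop; as an illustration, assuming a recent PyTorch installation, `torch.optim.NAdam` can be dropped into a standard training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                           # toy model for illustration
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3, betas=(0.9, 0.999))
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)     # dummy batch
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```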

Strengths and Limitations

Strengths:

  • Combines the advantages of adaptive learning rates with the predictive updates of Nesterov momentum.
  • Can lead to faster convergence compared to using Adam alone.

Limitations:

  • More sensitive to hyperparameter settings, particularly the learning rate and decay rates.
  • The benefits over Adam or Nesterov alone are not guaranteed and can depend heavily on the specific problem and data characteristics.

Advanced Topics

Explorations into further modifications of NAdam might include adaptive adjustments of the hyperparameters during training, or combining it with other regularization methods to enhance model generalization.

References

  1. Dozat, Timothy. "Incorporating Nesterov Momentum into Adam." ICLR Workshop, 2016.