[Optimizer] Adam Optimization

안암동컴맹 · April 11, 2024


Introduction

Adam (Adaptive Moment Estimation) is a popular optimization algorithm widely used for training deep neural networks. Introduced by Kingma and Ba in their 2014 paper, Adam combines the advantages of two other extensions of stochastic gradient descent, RMSProp and the Adaptive Gradient Algorithm (AdaGrad), and handles sparse gradients on noisy problems well. Adam is particularly effective because it pairs per-parameter adaptive learning rates (as in AdaGrad and RMSProp) with momentum, i.e., it takes past gradients into account to smooth the updates.

Background and Theory

Adam is distinguished by its use of squared gradients to scale the learning rate and its incorporation of an exponentially decaying average of past gradients, similar to momentum. It calculates individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

The update rule is motivated by the desire to take well-scaled gradient steps, adjusting each parameter update using computationally cheap running estimates of the first and second moments of its gradients. The bias corrections help Adam make effective updates even in the initial time steps of optimization, when the moment estimates are still highly inaccurate because they are initialized at zero.
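For a concrete sense of the bias correction: with the common choice $\beta_1 = 0.9$ and $m_0 = 0$, the first step gives $m_1 = 0.1\, g_1$, which underestimates the gradient by a factor of ten; dividing by $1 - \beta_1^1 = 0.1$ yields $\hat{m}_1 = g_1$, removing the initialization bias exactly at $t = 1$.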

Mathematical Formulation

Adam's parameters are updated according to the following equations; a minimal code sketch of the full update follows the list.

  1. Gradient Calculation:

    $$g_t = \nabla_\theta L(\theta_t)$$

    Where $g_t$ is the gradient of the loss function $L$ with respect to the parameters at step $t$.

  2. Update Biased First Moment Estimate:

    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

    Here, $m_t$ represents the exponential moving average of the gradients, and $\beta_1$ is the decay rate for this moving average.

  3. Update Biased Second Raw Moment Estimate:

    $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

    Where $v_t$ is the exponential moving average of the squared gradients, and $\beta_2$ is the decay rate for the squared gradients.

  4. Compute Bias-corrected Moment Estimates:

    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

    These bias corrections help adjust for the initialization bias towards zero.

  5. Update Parameters:

    $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

    Where $\eta$ is the step size (learning rate), and $\epsilon$ is a small number added to prevent division by zero.
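Putting the five equations together, a single Adam update can be written as a small, self-contained NumPy sketch. This is an illustrative implementation rather than a reference one: the function name adam_step and its argument names are chosen here for readability, and the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ follow the values suggested in the original paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based timestep, and `m`, `v` start at zero."""
    m = beta1 * m + (1 - beta1) * grad            # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return theta, m, v
```

Note that $m$ and $v$ must persist across iterations, which is why they are passed in and returned alongside the parameters.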

Procedural Steps

  1. Initialization: Initialize the parameters $\theta$, the first moment vector $m = 0$, the second moment vector $v = 0$, and the timestep $t = 0$.
  2. Compute Gradient: Calculate the gradient $g_t$ of the loss function with respect to the parameters.
  3. Update Moments: Update the first moment $m$ and second moment $v$ estimates, then compute their bias-corrected versions $\hat{m}$ and $\hat{v}$.
  4. Parameter Update: Adjust the parameters using the bias-corrected first and second moment estimates.
  5. Iteration: Repeat steps 2-4 until convergence or a specified number of epochs is completed (a usage sketch follows below).
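As a usage sketch, the loop below drives the adam_step function from the previous snippet on a toy quadratic objective; the objective and its gradient are hypothetical and serve only to show how steps 2-4 are repeated.

```python
import numpy as np

# Hypothetical objective: f(theta) = ||theta - 3||^2, with gradient 2 * (theta - 3).
theta = np.zeros(5)
m, v = np.zeros_like(theta), np.zeros_like(theta)

for t in range(1, 1001):                          # repeat steps 2-4
    grad = 2.0 * (theta - 3.0)                    # step 2: compute gradient
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)   # steps 3-4

print(theta)   # should end up close to the minimizer [3, 3, 3, 3, 3]
```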

Applications

Adam's effectiveness and straightforward implementation have made it a popular choice in a wide range of machine learning tasks, including but not limited to training deep convolutional neural networks, recurrent neural networks, and large-scale unsupervised learning models.
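In practice Adam is usually taken from a framework rather than written by hand. The snippet below is a minimal sketch assuming PyTorch, using torch.optim.Adam with its standard hyperparameters; the model, data, and loss function are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder batch

for epoch in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                # compute g_t by backpropagation
    optimizer.step()               # apply the Adam update
```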

Strengths and Limitations

Strengths

  • Efficiency: Performs well on large datasets and in high-dimensional parameter spaces.
  • Robustness: Less sensitive to hyperparameter settings, particularly the learning rate.

Limitations

  • Memory Usage: Requires additional memory (relative to plain SGD) to store the first and second moment vectors for every parameter.
  • Bias Towards Initial Steps: Early in training, estimates of $m$ and $v$ can be significantly off due to their initialization at zero.

Advanced Topics

  • AdamW: A variant that decouples weight decay from the gradient-based update, often leading to better generalization.
  • AMSGrad: A modification that keeps the running maximum of the second moment estimate, ensuring a non-increasing effective step size and aiming to improve the theoretical convergence properties of Adam.
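Assuming PyTorch again, both variants are available off the shelf: torch.optim.AdamW implements decoupled weight decay, and torch.optim.Adam exposes AMSGrad through its amsgrad flag. The snippet below is a minimal sketch; the model and hyperparameter values are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model

# AdamW: weight decay is applied directly to the weights,
# decoupled from the gradient-based Adam update.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AMSGrad: keeps the running maximum of v_t so the effective step size never grows.
opt_amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```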

References

  1. Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).