[Optimizer] AdaMax Optimization

안암동컴맹 · April 12, 2024

AdaMax Optimizer

Introduction

AdaMax is a variant of the Adam optimization algorithm, which is itself an extension of stochastic gradient descent that incorporates momentum and adaptive learning rates. Introduced by Kingma and Ba in the same paper as Adam, AdaMax is often described as a generalization of Adam based on the infinity norm. It replaces the usual $L_2$-norm-based scaling used in Adam, potentially offering more stable and consistent updates under certain conditions.

Background and Theory

While Adam scales its learning rates using an exponentially decayed average of squared gradients (an $L_2$-style quantity), AdaMax uses the $L_\infty$ norm (maximum norm), which may be less sensitive to outliers and large gradients. This can make optimization more robust and stable in scenarios where the gradient distribution does not conform to expected patterns, for example when occasional mini-batches produce unusually large gradients. Like Adam, AdaMax retains the per-parameter adaptive scaling popularized by AdaGrad while avoiding AdaGrad's vanishing learning rates, since older gradient information decays rather than accumulating without bound.
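
Concretely, the $\max$ recursion used by AdaMax arises as the $p \to \infty$ limit of an $L_p$ generalization of Adam's second-moment estimate, following the derivation in Kingma and Ba's paper. Generalizing Adam's accumulator to the $L_p$ norm,

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p$$

and letting $p \to \infty$ yields the simple recursion used in the update rules below:

$$u_t = \lim_{p \to \infty} (v_t)^{1/p} = \max(\beta_2 u_{t-1}, |g_t|)$$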

Mathematical Formulation

The AdaMax update rules can be defined through the following steps, which combine momentum with scaling based on the infinity norm (a minimal code sketch follows the list):

  1. Gradient Calculation:

    $$g_t = \nabla_\theta L(\theta_t)$$

    Where $g_t$ is the gradient of the loss function $L$ with respect to the parameters $\theta$ at time step $t$.

  2. Update Biased First Moment Estimate:

    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

    This is the same as in Adam: $m_t$ is the exponential moving average of the gradients (the first moment estimate), and $\beta_1$ is its decay factor.

  3. Update the Exponentially Weighted Infinity Norm:

    $$u_t = \max(\beta_2 u_{t-1}, |g_t|)$$

    Here, $u_t$ tracks an exponentially weighted infinity norm of the gradients, $\beta_2$ is the decay factor applied to this norm, and $|g_t|$ is the element-wise absolute value of the gradients. Because of the $\max$ operation, $u_t$ is not biased toward zero, so unlike Adam's second moment it needs no bias correction.

  4. Correct Biased First Moment:

    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

    This correction compensates for initializing $m_0$ at zero, which biases $m_t$ toward zero during the early steps.

  5. Parameter Update:

    $$\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \hat{m}_t$$

    Here, $\eta$ is the step size (learning rate), and each coordinate of the update is scaled by the inverse of the infinity norm $u_t$, which keeps the magnitude of every parameter update on the order of $\eta$.
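
As a reference, the five steps above can be written in a few lines of NumPy. This is a minimal sketch rather than a production implementation: the class name `AdamaxSketch` is arbitrary, the default hyperparameters $\eta = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ follow the values suggested by Kingma and Ba, and the small `eps` added to $u_t$ is a common implementation convention for numerical stability rather than part of the formulation above.

```python
import numpy as np

class AdamaxSketch:
    """Minimal AdaMax sketch following the five update steps above."""

    def __init__(self, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None  # first moment vector m (step 2)
        self.u = None  # exponentially weighted infinity norm u (step 3)
        self.t = 0     # time step, used for bias correction (step 4)

    def step(self, theta, grad):
        """Return updated parameters given current parameters and gradient g_t."""
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.u = np.zeros_like(theta)
        self.t += 1
        # Step 2: biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        # Step 3: infinity-norm recursion (element-wise max; no bias correction needed)
        self.u = np.maximum(self.beta2 * self.u, np.abs(grad))
        # Step 4: bias-correct the first moment
        m_hat = self.m / (1 - self.beta1 ** self.t)
        # Step 5: scale the update by the inverse of the infinity norm
        return theta - self.lr * m_hat / (self.u + self.eps)

# Example: one update on the quadratic loss L(theta) = ||theta||^2 / 2, whose gradient is theta.
opt = AdamaxSketch()
theta = np.array([1.0, -2.0])
theta = opt.step(theta, grad=theta)
```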

Procedural Steps

  1. Initialization: Initialize the parameters $\theta$, and set the first moment vector $m$ and the infinity norm $u$ to zero.
  2. Compute Gradient: Calculate the gradient $g_t$ at each time step.
  3. Update Moment Vectors: Update the first moment vector $m$ and the infinity norm $u$.
  4. Bias Correction: Apply bias correction to the first moment.
  5. Update Parameters: Adjust the parameters $\theta$ based on the corrected moment and the scaled learning rate.
  6. Iteration: Repeat steps 2-5 until convergence or a fixed number of epochs is reached (see the usage sketch below).
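
In practice these steps are handled by library implementations. The snippet below is a brief usage sketch with PyTorch's built-in `torch.optim.Adamax`; the linear model and the synthetic data are placeholders chosen only to make the loop runnable, not part of the original post.

```python
import torch
import torch.nn as nn

# Placeholder model and synthetic data, chosen only to make the loop runnable.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adamax(model.parameters(), lr=0.002, betas=(0.9, 0.999))

x = torch.randn(64, 10)  # synthetic inputs
y = torch.randn(64, 1)   # synthetic targets

for epoch in range(100):      # step 6: iterate for a fixed number of epochs
    optimizer.zero_grad()     # clear gradients from the previous step
    loss = criterion(model(x), y)
    loss.backward()           # step 2: compute gradients g_t
    optimizer.step()          # steps 3-5: moment update, bias correction, parameter update
```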

Applications

AdaMax is useful in deep learning applications where the gradients may have large or unpredictable spikes, as its normalization factor based on the $L_\infty$ norm might confer more stability than the $L_2$-style scaling used in standard Adam.
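
As a rough illustration of this point (the accumulator values and the spike magnitude below are made up for demonstration, and Adam's bias correction is ignored), the following snippet compares how Adam's $L_2$-based denominator and AdaMax's $L_\infty$-based denominator react to a single anomalous gradient:

```python
import math

beta2 = 0.999
v = 1.0        # Adam-style second moment built up from gradients of magnitude ~1
u = 1.0        # AdaMax-style infinity norm built up from the same history
spike = 100.0  # a single anomalous gradient

v = beta2 * v + (1 - beta2) * spike ** 2   # Adam: exponential average of squared gradients
u = max(beta2 * u, abs(spike))             # AdaMax: decayed element-wise maximum

print("Adam denominator  :", math.sqrt(v))  # ~3.32: the spike is only partially absorbed
print("AdaMax denominator:", u)             # 100.0: the denominator jumps to the spike itself
```

With these numbers, Adam's denominator absorbs only a fraction of the spike, so the resulting step can grow to several times $\eta$, whereas AdaMax's denominator jumps to the spike magnitude immediately and the step stays bounded near $\eta$.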

Strengths and Limitations

Strengths

  • Robustness to Large Gradients: Less sensitive to anomalies in gradient values.
  • Simplified Hyperparameter Tuning: Similar to Adam, it requires less tuning of the learning rate compared to basic SGD.

Limitations

  • Performance Variability: May not consistently outperform Adam in all scenarios and can be more dataset or problem-specific.
  • Complexity: Slightly more complex than standard SGD due to additional moments and normalization calculations.

Advanced Topics

  • Comparison with Other Optimizers: Understanding how AdaMax differs from Adam, RMSProp, and other adaptive learning rate methods can provide deeper insights into its best use cases.

References

  1. Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).