[Optimizer] Adaptive Delta (AdaDelta)

AdaDelta Optimizer

Introduction

AdaDelta is an optimization algorithm designed to address the rapidly diminishing learning rates encountered in AdaGrad. Introduced by Zeiler in his 2012 paper, AdaDelta modifies AdaGrad's accumulation of all past squared gradients by limiting the accumulation to a fixed-size window of past gradients. This results in an adaptive learning rate approach that overcomes the key challenges of AdaGrad—namely, the aggressive, monotonically decreasing learning rate—without the need for a manually set global learning rate.

Background and Theory

AdaDelta extends the AdaGrad approach of adapting the learning rate to each parameter by considering the decaying average of past squared gradients. Unlike AdaGrad, which continues accumulating squared gradients throughout training, potentially leading to very small learning rates, AdaDelta uses a sliding window of gradient updates (exponential moving average) to keep this accumulation under control. This makes it robust to a wide range of initial configurations and reduces the need to set a default learning rate.
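
As a rough illustration (not from the paper), the short NumPy snippet below contrasts AdaGrad's ever-growing sum of squared gradients with AdaDelta's bounded exponential moving average; the variable names and the decay value 0.95 are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# AdaGrad: squared gradients are summed forever, so the accumulator only
# grows and the effective step size shrinks monotonically.
adagrad_accum = np.zeros(3)

# AdaDelta: an exponential moving average with decay rho keeps the
# accumulator bounded, so the effective step size does not vanish.
adadelta_accum = np.zeros(3)
rho = 0.95

for _ in range(1000):
    g = rng.standard_normal(3)                                  # a stream of gradients
    adagrad_accum += g ** 2                                     # unbounded growth
    adadelta_accum = rho * adadelta_accum + (1 - rho) * g ** 2  # bounded moving average
```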

Mathematical Formulation

AdaDelta is characterized by the following key equations, which describe its approach to updating parameters without the need for an explicit learning rate:

  1. Gradient Calculation:

    $g_t = \nabla_\theta L(\theta_t)$

    Where $g_t$ is the gradient of the loss function $L$ with respect to the parameters $\theta$ at time step $t$.

  2. Accumulate Exponential Moving Averages of Squared Gradients:

    $E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2$

    Here, $E[g^2]_t$ is the decaying average of past squared gradients, and $\rho$ is the decay constant.

  3. Compute Update Amounts:

    $\Delta \theta_t = - \frac{\sqrt{E[\Delta \theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$

    $\Delta \theta_t$ is the amount by which the parameters are adjusted, where $E[\Delta \theta^2]_{t-1}$ is the exponentially decaying average of past squared parameter updates and $\epsilon$ is a small constant added for numerical stability.

  4. Accumulate Exponential Moving Averages of Squared Parameter Updates:

    $E[\Delta \theta^2]_t = \rho E[\Delta \theta^2]_{t-1} + (1 - \rho) \Delta \theta_t^2$

  5. Parameter Update:

    $\theta_{t+1} = \theta_t + \Delta \theta_t$

    This update uses the square roots of the decaying averages to scale the gradient and adjust the parameters; a minimal code sketch of these equations follows the list.
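
The five equations above can be collected into a minimal NumPy sketch. The function name adadelta_step and its state layout are illustrative, and the defaults rho=0.95 and eps=1e-6 are commonly used values rather than anything mandated by the algorithm:

```python
import numpy as np

def adadelta_step(theta, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update. Eg2 and Edx2 hold the running averages
    E[g^2] and E[Δθ^2]; all arrays share the shape of theta."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                    # eq. (2)
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # eq. (3)
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2                 # eq. (4)
    theta = theta + delta                                      # eq. (5)
    return theta, Eg2, Edx2
```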

Procedural Steps

  1. Initialization: Initialize the parameters $\theta$, the decaying average of past squared gradients $E[g^2]$, and the decaying average of past squared updates $E[\Delta \theta^2]$ to zero.
  2. Gradient Computation: Calculate the gradient $g_t$.
  3. Update Squared Gradient Moving Average: Compute $E[g^2]_t$.
  4. Calculate Parameter Update: Compute $\Delta \theta_t$ using the ratio of the square roots of the accumulated averages.
  5. Update Squared Update Moving Average: Update $E[\Delta \theta^2]_t$.
  6. Apply Update: Adjust the parameters $\theta_t$.
  7. Repeat: Iterate steps 2-6 until convergence (a toy end-to-end loop is sketched below).
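
Putting the procedural steps together, here is a toy end-to-end loop on a made-up quadratic objective; the target values and the iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Toy objective: L(theta) = 0.5 * ||theta - target||^2, so grad = theta - target.
target = np.array([3.0, -2.0])
theta = np.zeros(2)                  # step 1: initialize parameters and state
Eg2 = np.zeros_like(theta)           # E[g^2]
Edx2 = np.zeros_like(theta)          # E[Δθ^2]
rho, eps = 0.95, 1e-6

for t in range(2000):                                          # step 7: iterate
    grad = theta - target                                      # step 2
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                    # step 3
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # step 4
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2                 # step 5
    theta += delta                                             # step 6

print(theta)  # ends up near [3.0, -2.0] with no learning rate specified;
              # the zero-initialized state makes AdaDelta's early steps small
```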

Applications

AdaDelta is effectively used in training deep neural networks, particularly in scenarios where the choice of a correct learning rate is difficult or where gradients vary significantly in magnitude.

Strengths and Limitations

Strengths

  • Elimination of Learning Rate: Does not require the manual tuning of a learning rate, which simplifies configuration and improves robustness.
  • Robust to Vanishing Learning Rate: Overcomes the problem of diminishing learning rates encountered in AdaGrad.

Limitations

  • Complexity: More complex implementation compared to simpler methods like SGD or AdaGrad.
  • Parameter Sensitivity: The performance of AdaDelta can be sensitive to the choice of the decay rate $\rho$ and the initialization of parameters.

Advanced Topics

  • Comparison with RMSProp: Both AdaDelta and RMSProp keep a decaying average of squared gradients, but RMSProp still requires a global learning rate, whereas AdaDelta replaces it with the RMS of past parameter updates, adapting the update magnitude directly from historical update magnitudes (a sketch of the standard RMSProp update follows for contrast).
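
For contrast, a sketch of the standard RMSProp update is shown below (this is the commonly cited form, not something defined in Zeiler's paper); note the explicit learning rate lr that AdaDelta does away with:

```python
import numpy as np

def rmsprop_step(theta, grad, Eg2, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update: a decaying average of squared gradients scales
    a fixed global learning rate lr, rather than the RMS of past parameter
    updates as in AdaDelta."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    theta = theta - lr * grad / np.sqrt(Eg2 + eps)
    return theta, Eg2
```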

References

  1. Zeiler, Matthew D. "ADADELTA: an adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).