[Optimizer] Adaptive Delta (AdaDelta)

AdaDelta Optimizer

Introduction

AdaDelta is an optimization algorithm designed to address the rapidly diminishing learning rates encountered in AdaGrad. Introduced by Zeiler in his 2012 paper, AdaDelta modifies AdaGrad's accumulation of all past squared gradients by limiting the accumulation to a fixed-size window of past gradients. This results in an adaptive learning rate approach that overcomes the key challenges of AdaGrad—namely, the aggressive, monotonically decreasing learning rate—without the need for a manually set global learning rate.

Background and Theory

AdaDelta extends the AdaGrad approach of adapting the learning rate to each parameter by considering the decaying average of past squared gradients. Unlike AdaGrad, which continues accumulating squared gradients throughout training, potentially leading to very small learning rates, AdaDelta uses a sliding window of gradient updates (exponential moving average) to keep this accumulation under control. This makes it robust to a wide range of initial configurations and reduces the need to set a default learning rate.
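
As a rough illustration (not from the paper), the short NumPy snippet below contrasts AdaGrad's ever-growing sum of squared gradients with AdaDelta's bounded exponential moving average; the variable names and the decay value 0.95 are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# AdaGrad: squared gradients are summed forever, so the accumulator only
# grows and the effective step size shrinks monotonically.
adagrad_accum = np.zeros(3)

# AdaDelta: an exponential moving average with decay rho keeps the
# accumulator bounded, so the effective step size does not vanish.
adadelta_accum = np.zeros(3)
rho = 0.95

for _ in range(1000):
    g = rng.standard_normal(3)                                  # a stream of gradients
    adagrad_accum += g ** 2                                     # unbounded growth
    adadelta_accum = rho * adadelta_accum + (1 - rho) * g ** 2  # bounded moving average
```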

Mathematical Formulation

AdaDelta is characterized by the following key equations, which describe its approach to updating parameters without the need for an explicit learning rate:

  1. Gradient Calculation:

    $g_t = \nabla_\theta L(\theta_t)$

    Where $g_t$ is the gradient of the loss function $L$ with respect to the parameters $\theta$ at time step $t$.

  2. Accumulate Exponential Moving Averages of Squared Gradients:

    $E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2$

    Here, $E[g^2]_t$ is the decaying average of past squared gradients, and $\rho$ is the decay constant.

  3. Compute Update Amounts:

    $\Delta \theta_t = - \frac{\sqrt{E[\Delta \theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$

    $\Delta \theta_t$ is the amount by which the parameters are adjusted, where $E[\Delta \theta^2]_{t-1}$ is the exponentially decaying average of past squared parameter updates and $\epsilon$ is a small constant added for numerical stability.

  4. Accumulate Exponential Moving Averages of Squared Parameter Updates:

    $E[\Delta \theta^2]_t = \rho E[\Delta \theta^2]_{t-1} + (1 - \rho) \Delta \theta_t^2$

  5. Parameter Update:

    $\theta_{t+1} = \theta_t + \Delta \theta_t$

    This update uses the square roots of the decaying averages to scale the gradient and adjust the parameters; a minimal code sketch of these equations follows the list.
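
The five equations above can be collected into a minimal NumPy sketch. The function name adadelta_step and its state layout are illustrative, and the defaults rho=0.95 and eps=1e-6 are commonly used values rather than anything mandated by the algorithm:

```python
import numpy as np

def adadelta_step(theta, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update. Eg2 and Edx2 hold the running averages
    E[g^2] and E[Δθ^2]; all arrays share the shape of theta."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                    # eq. (2)
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # eq. (3)
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2                 # eq. (4)
    theta = theta + delta                                      # eq. (5)
    return theta, Eg2, Edx2
```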

Procedural Steps

  1. Initialization: Initialize the parameters $\theta$, the decaying average of past squared gradients $E[g^2]$, and the decaying average of past squared updates $E[\Delta \theta^2]$ to zero.
  2. Gradient Computation: Calculate the gradient $g_t$.
  3. Update Squared Gradient Moving Average: Compute $E[g^2]_t$.
  4. Calculate Parameter Update: Compute $\Delta \theta_t$ using the ratio of the square roots of the accumulated averages.
  5. Update Squared Update Moving Average: Update $E[\Delta \theta^2]_t$.
  6. Apply Update: Adjust the parameters $\theta_t$.
  7. Repeat: Iterate steps 2-6 until convergence (a toy end-to-end loop is sketched below).
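
Putting the procedural steps together, here is a toy end-to-end loop on a made-up quadratic objective; the target values and the iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Toy objective: L(theta) = 0.5 * ||theta - target||^2, so grad = theta - target.
target = np.array([3.0, -2.0])
theta = np.zeros(2)                  # step 1: initialize parameters and state
Eg2 = np.zeros_like(theta)           # E[g^2]
Edx2 = np.zeros_like(theta)          # E[Δθ^2]
rho, eps = 0.95, 1e-6

for t in range(2000):                                          # step 7: iterate
    grad = theta - target                                      # step 2
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                    # step 3
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # step 4
    Edx2 = rho * Edx2 + (1 - rho) * delta ** 2                 # step 5
    theta += delta                                             # step 6

print(theta)  # ends up near [3.0, -2.0] with no learning rate specified;
              # the zero-initialized state makes AdaDelta's early steps small
```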

Applications

AdaDelta is effectively used in training deep neural networks, particularly in scenarios where the choice of a correct learning rate is difficult or where gradients vary significantly in magnitude.

Strengths and Limitations

Strengths

  • Elimination of Learning Rate: Does not require the manual tuning of a learning rate, which simplifies configuration and improves robustness.
  • Robust to Vanishing Learning Rate: Overcomes the problem of diminishing learning rates encountered in AdaGrad.

Limitations

  • Complexity: More complex implementation compared to simpler methods like SGD or AdaGrad.
  • Parameter Sensitivity: The performance of AdaDelta can be sensitive to the choice of the decay rate $\rho$ and the initialization of parameters.

Advanced Topics

  • Comparison with RMSProp: Both AdaDelta and RMSProp keep a decaying average of squared gradients, but RMSProp still requires a global learning rate, whereas AdaDelta replaces it with the RMS of past parameter updates, adapting the update magnitude directly from historical update magnitudes (a sketch of the standard RMSProp update follows for contrast).
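
For contrast, a sketch of the standard RMSProp update is shown below (this is the commonly cited form, not something defined in Zeiler's paper); note the explicit learning rate lr that AdaDelta does away with:

```python
import numpy as np

def rmsprop_step(theta, grad, Eg2, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update: a decaying average of squared gradients scales
    a fixed global learning rate lr, rather than the RMS of past parameter
    updates as in AdaDelta."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    theta = theta - lr * grad / np.sqrt(Eg2 + eps)
    return theta, Eg2
```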

References

  1. Zeiler, Matthew D. "ADADELTA: an adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).