AdaDelta is an optimization algorithm designed to address the rapidly diminishing learning rates encountered in AdaGrad. Introduced by Zeiler in his 2012 paper, AdaDelta modifies AdaGrad's accumulation of all past squared gradients by restricting the accumulation to a window of recent gradients. The result is an adaptive learning rate method that overcomes AdaGrad's main weakness, its aggressive, monotonically decreasing learning rate, without requiring a manually set global learning rate.
AdaDelta extends the AdaGrad approach of adapting the learning rate to each parameter by considering the decaying average of past squared gradients. Unlike AdaGrad, which continues accumulating squared gradients throughout training, potentially leading to very small learning rates, AdaDelta uses a sliding window of gradient updates (exponential moving average) to keep this accumulation under control. This makes it robust to a wide range of initial configurations and reduces the need to set a default learning rate.
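The difference between the two accumulation schemes fits in a couple of lines. The snippet below is an illustrative NumPy sketch, with the decay constant and gradient values chosen arbitrarily:

```python
import numpy as np

rho = 0.95                            # decay constant (illustrative value)
grad = np.array([0.1, -0.2, 0.3])     # hypothetical gradient at one step

# AdaGrad: the accumulator only grows, so the effective step size keeps shrinking.
adagrad_acc = np.zeros_like(grad)
adagrad_acc += grad ** 2

# AdaDelta: an exponential moving average forgets old gradients,
# so the accumulator tracks recent gradient magnitudes instead of the full history.
adadelta_acc = np.zeros_like(grad)
adadelta_acc = rho * adadelta_acc + (1 - rho) * grad ** 2
```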
AdaDelta is characterized by the following key equations, which describe its approach to updating parameters without the need for an explicit learning rate:
Gradient Calculation:
$$g_t = \nabla_{\theta} J(\theta_t)$$
where $g_t$ is the gradient of the loss function $J$ with respect to the parameters $\theta$ at time step $t$.
Accumulate Exponential Moving Averages of Squared Gradients:
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)\, g_t^2$$
Here, $E[g^2]_t$ is the decaying average of past squared gradients, and $\rho$ is the decay constant.
Compute Update Amounts:
$$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
$\Delta\theta_t$ is the amount by which the parameters are adjusted, where $E[\Delta\theta^2]_{t-1}$ is the exponentially decaying average of past squared parameter updates and $\epsilon$ is a small constant added for numerical stability.
Accumulate Exponential Moving Averages of Squared Parameter Updates:
$$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho)\, \Delta\theta_t^2$$
Parameter Update:
$$\theta_{t+1} = \theta_t + \Delta\theta_t$$
This update uses the square roots of the decaying averages to scale the gradient and adjust the parameters, so no explicit global learning rate appears in the rule.
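Putting the five steps together, the following is a minimal NumPy sketch of the update rule; the decay constant `rho`, the stability term `eps`, and the quadratic toy objective are illustrative choices rather than values prescribed by the method:

```python
import numpy as np

def adadelta_step(theta, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    """One AdaDelta update following the equations above."""
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2
    # Delta theta_t = -(RMS[Delta theta]_{t-1} / RMS[g]_t) * g_t
    update = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    # E[Delta theta^2]_t = rho * E[Delta theta^2]_{t-1} + (1 - rho) * Delta theta_t^2
    acc_update = rho * acc_update + (1 - rho) * update ** 2
    # theta_{t+1} = theta_t + Delta theta_t
    theta = theta + update
    return theta, acc_grad, acc_update

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
acc_grad = np.zeros_like(theta)
acc_update = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * theta
    theta, acc_grad, acc_update = adadelta_step(theta, grad, acc_grad, acc_update)
print(theta)  # parameters move toward zero without any hand-set learning rate
```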
AdaDelta is used effectively in training deep neural networks, particularly in scenarios where choosing a good learning rate is difficult or where gradient magnitudes vary significantly across parameters or over time.
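Most deep learning frameworks ship an AdaDelta optimizer. The snippet below is a minimal PyTorch sketch using `torch.optim.Adadelta`, with the model, data, and hyperparameters chosen purely for illustration:

```python
import torch
import torch.nn as nn

# Placeholder model and random data, just to show the optimizer in use.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```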
- Zeiler, Matthew D. "ADADELTA: an adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).