[Optimizer] Adaptive Gradient (AdaGrad)

Adaptive Gradient (AdaGrad)

Introduction

AdaGrad (Adaptive Gradient Algorithm) is a gradient-based optimization algorithm that adjusts the learning rate to the parameters, performing smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features, and larger updates (higher learning rates) for parameters associated with infrequent features. This adaptive learning rate method, introduced by Duchi et al. in 2011, is particularly useful for dealing with sparse data, as seen in large-scale machine learning problems like natural language processing and image recognition.

Background and Theory

AdaGrad's key feature is its ability to adapt the learning rate for each parameter individually, scaling it inversely with the square root of the sum of all past squared gradients for that parameter. This per-parameter scaling improves convergence over standard SGD, particularly on problems with sparse gradients and widely varying scales across input features.

Mathematical Formulation

The update rules for AdaGrad accumulate the squared gradients of each parameter, which are then used to scale that parameter's learning rate at each iteration. This can be described mathematically as follows:

  1. Gradient Calculation:

    $$g_t = \nabla_\theta L(\theta_t)$$

    where $g_t$ is the gradient of the loss function $L$ with respect to the parameters $\theta$ at time step $t$.

  2. Accumulate Squared Gradients:

    $$G_t = G_{t-1} + g_t \otimes g_t$$

    Here, $G_t$ is a diagonal matrix whose diagonal element $(i, i)$ is the sum of the squares of the gradients w.r.t. parameter $\theta_i$ up to time step $t$, and $\otimes$ denotes the element-wise product.

  3. Parameter Update:

    $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \otimes g_t$$

    where $\eta$ is the initial learning rate and $\epsilon$ is a small smoothing term to prevent division by zero.
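
The update rule above fits in a few lines of NumPy. The sketch below is a minimal illustration rather than a production implementation: the diagonal of $G_t$ is stored as a vector, and the toy quadratic objective, learning rate, and step count are assumptions made only for this example.

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale the update per parameter."""
    G = G + grad ** 2                                # G_t = G_{t-1} + g_t ⊗ g_t (diagonal kept as a vector)
    theta = theta - lr / np.sqrt(G + eps) * grad     # θ_{t+1} = θ_t − η / √(G_t + ε) ⊗ g_t
    return theta, G

# Assumed toy usage: minimize f(θ) = ‖θ‖², whose gradient is 2θ.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
G = np.zeros_like(theta)
for t in range(1000):
    grad = 2.0 * theta
    theta, G = adagrad_update(theta, grad, G, lr=0.1)
print(theta)  # each coordinate should end up close to the minimizer at 0
```

Note that $\epsilon$ is placed inside the square root here to match the formula above; some libraries place it outside the root instead, which behaves almost identically.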

Procedural Steps

  1. Initialization: Start with initial parameters $\theta$, and initialize the gradient accumulation matrix $G$ to zero.
  2. Compute Gradient: At each step, compute the gradient $g_t$ of the loss function with respect to $\theta$.
  3. Accumulate Squared Gradients: Update the gradient accumulation matrix $G$ by adding the element-wise square of the gradient.
  4. Update Parameters: Modify $\theta$ using the adjusted learning rate, which depends on the historical squared gradients.
  5. Repeat: Iterate steps 2-4 until convergence or for a specified number of iterations.
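
In practice these steps are rarely coded by hand, since deep learning frameworks ship AdaGrad as a built-in optimizer. The example below uses PyTorch's `torch.optim.Adagrad` on an assumed toy linear-regression problem; the data, learning rate, and step count are illustrative choices, not recommendations.

```python
import torch

# Assumed toy data: y = 3x + 1 plus a little noise.
torch.manual_seed(0)
x = torch.randn(256, 1)
y = 3.0 * x + 1.0 + 0.1 * torch.randn(256, 1)

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.5)  # step 1: state initialized internally

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()       # step 2: compute the gradient
    optimizer.step()      # steps 3-4: accumulate squared gradients and update parameters

print(model.weight.item(), model.bias.item())  # expected to approach 3 and 1
```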

Applications

AdaGrad is particularly suited for applications where data is sparse and features have very different frequencies of occurrence, such as:

  • Natural language processing (NLP), especially in tasks involving large vocabularies like language modeling and machine translation.
  • Recommendation systems, where user-item interactions are typically sparse.
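
The sparse-feature behaviour described above can be illustrated numerically. In the assumed scenario below, one parameter receives a gradient at every step (a frequent feature) and another only at every tenth step (a rare feature); AdaGrad ends up keeping a larger effective learning rate for the rarely updated parameter.

```python
import numpy as np

G = np.zeros(2)               # accumulated squared gradients for two parameters
lr, eps = 0.1, 1e-8
for t in range(1000):
    grad = np.array([1.0, 0.0])   # parameter 0: frequent feature, gradient every step
    if t % 10 == 0:
        grad[1] = 1.0             # parameter 1: rare feature, gradient every 10th step
    G += grad ** 2

print(lr / np.sqrt(G + eps))  # ≈ [0.0032, 0.0100]: the rare feature keeps a ~3x larger step
```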

Strengths and Limitations

Strengths

  • Robustness in Sparse Settings: AdaGrad performs well on sparse data by adapting each parameter's learning rate to how frequently that parameter receives gradient updates.
  • Less Manual Tuning of the Learning Rate: The algorithm adapts per-parameter learning rates to the data, reducing the need for detailed manual tuning of the global learning rate.

Limitations

  • Aggressive Learning-Rate Decay: The continual accumulation of squared gradients drives the effective learning rate toward zero, which can cause training to stall if run for many epochs (a short calculation after this list makes the decay rate explicit).
  • Memory Overhead: AdaGrad stores an accumulated squared gradient for every parameter, so the optimizer state is as large as the model itself, which adds noticeable memory overhead for very large models.
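
To quantify the first limitation: if a parameter's gradient magnitude stays roughly constant at $g$, then after $t$ steps $G_t \approx t\,g^2$, so the effective step size is about $\eta\,|g| / \sqrt{t\,g^2} = \eta / \sqrt{t}$, which shrinks toward zero whether or not the optimizer is still far from a minimum.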

Advanced Topics

  • Variants and Extensions: Modifications such as RMSProp and AdaDelta address AdaGrad's aggressive learning-rate decay by replacing the unbounded sum of squared gradients with an exponentially decaying average, which keeps the effective learning rate from collapsing to zero; a minimal sketch of this change follows below.
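
As a rough sketch of that change, in the same NumPy style as the earlier example and with an assumed decay rate $\beta = 0.9$, an RMSProp-style step swaps the running sum for an exponential moving average:

```python
import numpy as np

def rmsprop_update(theta, grad, v, lr=0.001, beta=0.9, eps=1e-8):
    """RMSProp-style step: a decaying average of squared gradients replaces AdaGrad's running sum."""
    v = beta * v + (1.0 - beta) * grad ** 2        # bounded accumulator instead of G += grad**2
    theta = theta - lr / (np.sqrt(v) + eps) * grad
    return theta, v
```

Because $v$ forgets old gradients, the effective learning rate no longer decays monotonically to zero.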

References

  1. Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.