AdaGrad (Adaptive Gradient Algorithm) is a gradient-based optimization algorithm that adapts the learning rate to each parameter, performing smaller updates (i.e., lower learning rates) for parameters associated with frequently occurring features, and larger updates (higher learning rates) for parameters associated with infrequent features. This adaptive learning-rate method, introduced by Duchi et al. in 2011, is particularly useful for sparse data, as encountered in large-scale machine learning problems such as natural language processing and image recognition.
Background and Theory
AdaGrad's key feature is that it modifies the learning rate of each parameter individually, scaling it inversely with the square root of the sum of all past squared gradients for that parameter. This per-parameter scaling compensates for differences in how frequently, and at what scale, individual features occur, and it often improves convergence over standard SGD, particularly on problems with sparse gradients and input features of varying scales.
Mathematical Formulation
The AdaGrad update accumulates the squared gradient of each parameter and uses this running sum to scale that parameter's learning rate at every iteration. This can be described mathematically as follows:
Gradient Calculation:
$$g_t = \nabla_{\theta} L(\theta_t)$$
Where $g_t$ is the gradient of the loss function $L$ with respect to the parameters $\theta$ at time step $t$.
Accumulate Squared Gradients:
$$G_t = G_{t-1} + g_t \odot g_t$$
Here, $G_t$ can be viewed as a diagonal matrix (or, equivalently, a vector) whose $i$-th diagonal entry is the sum of the squares of the gradients with respect to parameter $\theta_i$ up to time step $t$, and $\odot$ denotes the element-wise product.
Parameter Update:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
Where $\eta$ is the initial (global) learning rate, $\epsilon$ is a small smoothing term to prevent division by zero, and the square root and division are applied element-wise.
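As a concrete illustration, the following is a minimal NumPy sketch of these update rules; the function name, default learning rate, and epsilon value are illustrative choices, not prescribed by the original paper.

```python
import numpy as np

def adagrad_update(theta, g, G, eta=0.1, eps=1e-8):
    """One AdaGrad step for a parameter vector theta given its gradient g."""
    G = G + g * g                               # G_t = G_{t-1} + g_t (element-wise square)
    theta = theta - eta / np.sqrt(G + eps) * g  # theta_{t+1} = theta_t - eta/sqrt(G_t + eps) * g_t
    return theta, G

# Example call with a toy gradient (values are arbitrary).
theta, G = np.zeros(3), np.zeros(3)
theta, G = adagrad_update(theta, np.array([0.5, -1.0, 0.0]), G)
```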
Procedural Steps
Initialization: Start with initial parameters θ and initialize the gradient accumulator G to zero.
Compute Gradient: At each step, compute the gradient gt of the loss function with respect to θ.
Accumulate Squared Gradients: Update the gradient accumulation matrix G by adding the square of the gradient.
Update Parameters: Modify θ using the adjusted learning rate, which depends on the historical squared gradients.
Repeat: Iterate steps 2-4 until convergence or for a specified number of iterations.
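Putting these steps together, the sketch below runs AdaGrad on a toy least-squares problem; the data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy design matrix
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=100)

theta = np.zeros(5)                           # step 1: initialize parameters ...
G = np.zeros(5)                               # ... and the gradient accumulator
eta, eps = 0.5, 1e-8

for t in range(500):                          # step 5: repeat for a fixed budget
    g = 2.0 * X.T @ (X @ theta - y) / len(y)  # step 2: gradient of the MSE loss
    G += g * g                                # step 3: accumulate squared gradients
    theta -= eta / np.sqrt(G + eps) * g       # step 4: adjusted parameter update

print(theta)                                  # should end up close to w_true
```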
Applications
AdaGrad is particularly well suited to applications where the data is sparse and features occur with very different frequencies (a short sketch of such a setting follows this list), such as:
Natural language processing (NLP), especially in tasks involving large vocabularies like language modeling and machine translation.
Recommendation systems, where user-item interactions are typically sparse.
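Both settings involve large, sparsely activated parameter tables. As a hedged sketch of how this looks in practice, the example below uses PyTorch's built-in torch.optim.Adagrad with a sparse embedding layer, so only the embedding rows that appear in a batch are updated; the vocabulary size, model, and data are made up for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 16
embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)  # emits sparse gradients
linear = nn.Linear(embed_dim, 1)

params = list(embedding.parameters()) + list(linear.parameters())
optimizer = torch.optim.Adagrad(params, lr=0.1)

# One illustrative training step on a fake batch of token-id sequences.
tokens = torch.randint(0, vocab_size, (8, 20))   # 8 sequences of 20 token ids
labels = torch.randint(0, 2, (8, 1)).float()     # fake binary labels

optimizer.zero_grad()
logits = linear(embedding(tokens).mean(dim=1))   # average embeddings per sequence
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()                                  # embedding gradient is a sparse tensor
optimizer.step()                                 # only the touched rows are updated
```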
Strengths and Limitations
Strengths
Robustness in Sparse Settings: AdaGrad performs well on sparse data because each parameter's learning rate adapts to how often that parameter is updated; rarely updated parameters keep relatively large step sizes.
Reduced Need for Manual Learning-Rate Tuning: Because per-parameter learning rates are adapted from the data, a single default value of the global learning rate often works well, reducing the need for careful tuning.
Limitations
Monotonically Decaying Learning Rate: The continual accumulation of squared gradients drives an increasingly aggressive decrease in the effective learning rate, which can become vanishingly small and cause training to stall if run for too many epochs (see the short numerical illustration after this list).
Memory Intensive: Storing an accumulated squared gradient for every parameter adds memory overhead proportional to the model size, which can be a burden for very large models.
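The learning-rate decay issue is easy to see numerically: with a roughly constant gradient magnitude, the accumulator grows linearly with the step count t, so the effective step size η/√(G_t + ε) shrinks like 1/√t. The numbers in the sketch below are purely illustrative.

```python
import numpy as np

eta, eps, g = 0.1, 1e-8, 1.0         # constant gradient magnitude, for illustration
G = 0.0
for t in range(1, 10_001):
    G += g * g                       # accumulator grows linearly with t
    if t in (1, 10, 100, 1_000, 10_000):
        print(f"t={t:6d}  effective step size = {eta / np.sqrt(G + eps):.5f}")
# Falls from 0.1 at t=1 to about 0.001 at t=10000 (roughly eta / sqrt(t)).
```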
Advanced Topics
Variants and Extensions: Modifications such as AdaDelta and RMSProp address AdaGrad's main limitation by replacing the ever-growing sum of squared gradients with an exponentially decaying moving average, which keeps the effective learning rate from shrinking toward zero.
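For comparison, the core change in these variants amounts to a one-line swap in the accumulation rule: an exponentially decaying average of squared gradients replaces AdaGrad's unbounded sum. The decay rate of 0.9 below is a conventional but illustrative choice.

```python
import numpy as np

g, decay, eta, steps = 1.0, 0.9, 0.1, 1000   # constant toy gradient for illustration

G_adagrad = 0.0
G_rmsprop = 0.0
for _ in range(steps):
    G_adagrad = G_adagrad + g * g                        # AdaGrad: unbounded sum
    G_rmsprop = decay * G_rmsprop + (1 - decay) * g * g  # RMSProp-style: bounded moving average

print("AdaGrad step size :", eta / np.sqrt(G_adagrad))   # ~0.003 and still shrinking
print("RMSProp step size :", eta / np.sqrt(G_rmsprop))   # ~0.1, roughly stable
```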
References
Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.