Definition: A method to find the minimum value of a given function (usually a Cost function or Loss function) using its gradient.
When a function f(x) is given, the goal is to find its minimum value -> Gradient Descent calculates the gradient Ξf(x) and moves in the opposite direction of the gradient.
Formula
ΞΈ=ΞΈβΞ·βΞΈβLβ
In the formula, Ξ· is called the learning rate, which determines how much the parameters should be updated in one iteration, i.e., how much to adjust the parameter value.
This formula represents one iteration of the update, and in practice, this step is repeated multiple times.
Example
If the loss function is a parabolic second-order function like the figure below, Gradient Descent applies differentiation from the initial ΞΈ, and then sequentially updates ΞΈ in the direction where the gradient of the function decreases.
Finally, it returns the value of ΞΈ at the point where the gradient no longer decreases, which is considered the point where the loss function is minimized.
π₯¨ Stochastic Gradient Descent
Definition: Unlike standard Gradient Descent, which calculates the gradient over the entire dataset and updates the weights at once, SGD calculates the gradient for a single data point at a time and updates the weights immediately.
SGD Update Formula
W=WβΞ·βL(W;x(i),y(i))
ΞΈ: The parameter to be learned.
Ξ·: Learning rate, which determines how much the parameter moves during one update.
ΞΞΈβJ(ΞΈ;x(i),y(i)): The gradient of the cost function for the i-th data point (x(i),y(i)).
Operation Process
Initialization: Randomly initialize the model's weight W.
Random Sample Selection: Randomly select one sample (x(i),y(i)) from the dataset.
Gradient Calculation: Calculate the gradient of the loss function βL(W;x(i),y(i)) for that sample.
Weight Update: Use the gradient to update the weight W.
Repeat: Repeat this process until all samples in the dataset are processed, and the training is performed over multiple epochs.
Disadvantages
Oscillation and Instability: Since the gradient is calculated for random samples, the weight updates can be inconsistent and oscillate, which may slow down convergence.
Difficulty in Ensuring Convergence: Unlike standard Gradient Descent, the convergence of the loss function is not always guaranteed and can be unstable, so proper adjustment of the learning rate is crucial.
π₯¨ Mini-Batch Gradient Descent
Definition: A method that divides the entire dataset into small batches and calculates the gradient for each mini-batch to update the weights.
Each mini-batch is composed of randomly selected samples from the entire dataset.
The gradient of the loss function is calculated using this mini-batch, and the weights are updated.
Mini-Batch Gradient Descent Update Formula
W=WβΞ·β£Bβ£1βiβBβββL(W;x(i),y(i))
W: Weight vector of the model.
Ξ·: Learning rate.
B: A set of data samples in the mini-batch.
β£Bβ£: Size of the mini-batch.
βL(W;x(i),y(i)): Gradient of the loss function for the sample (x(i),y(i)).
Operation Process
Initialization: Randomly initialize the model's weights W.
Dataset Division: Divide the entire dataset into small mini-batches, each mini-batch consisting of β£Bβ£ data points.
Mini-Batch Selection: Select one mini-batch B and calculate the gradient.
Gradient Calculation: Calculate the gradient of the loss function for all data points in the mini-batch and compute the average of the gradients.
Weight Update: Update the weights using the average gradient from the mini-batch (using the formula above).
Repeat: Repeat this process until all mini-batches in the dataset have been processed, and training is performed over multiple epochs.
Disadvantages
Noise Issues: If the mini-batch size is too small, the gradient may oscillate like in SGD, causing slower or unstable convergence.
Impact of Mini-Batch Size: If the mini-batch size is too large, it may use too much memory and slow down learning like Batch Gradient Descent. If too small, it may cause noisy and unstable learning.
Computational Complexity: If hyperparameters are not set correctly, the learning process can be unstable or inefficient.
π₯¨ Momentum
Definition: Instead of simply moving in the direction of the gradient, Momentum adds acceleration by partially reflecting the gradient information from the previous step.
It increases speed in flat regions and reduces oscillation in regions where the gradient changes sharply.
Momentum Update Formula: The previous step's update is treated as an acceleration.
Velocity calculation:
vtβ=Ξ³vtβ1β+Ξ·βL(W)
Weight update:
W=Wβvtβ
vtβ: Velocity, representing the direction of weight update.
Ξ³: Momentum coefficient, determining how much of the previous gradient is reflected.
Ξ·: Learning rate.
Features of Momentum
When Gradient Descent reaches flat regions of the loss function, the gradient becomes small, and weight updates slow down. Momentum maintains speed in such cases, enabling faster convergence by using the previous gradient.
Proper setting of the momentum coefficient Ξ³ and learning rate Ξ· is crucial.
π₯¨ RMSprop
Definition: RMSprop calculates the moving average of the squared gradients and adjusts the learning rate based on the magnitude of the gradients.
It makes the size of each weight update inversely proportional to the size of the gradient.
Helps maintain a stable learning rate when the gradient magnitude varies significantly during training, particularly effective in mitigating the vanishing gradient problem.
RMSprop Update Formula: The exponential moving average (EMA) of the squared gradient gtβ is calculated, and this is used to adjust the learning rate.
EMA of squared gradients calculation:
E[g2]tβ=Ξ²E[g2]tβ1β+(1βΞ²)gt2β
E[g2]tβ: Exponential moving average of the squared gradient at time step t
gtβ: The current gradient
Ξ²: The decay rate of the exponential moving average, typically set to 0.9
Weight update:
Wt+1β=WtββE[g2]tβ+Ο΅βΞ·βgtβ
Ξ·: Learning rate
Ο΅: A small constant for numerical stability, usually set to 10β8 to prevent division by zero
gtβ: The current gradient
As shown in the formula, the larger the gradient, the smaller the learning rate becomes, and the smaller the gradient, the larger the learning rate. RMSprop dynamically adjusts the learning rate to perform appropriate weight updates.
Disadvantages:
Complex Hyperparameter Tuning: Proper tuning of Ξ² and Ξ· is crucial.
Difficulty in Moving Toward Global Minimum: If it converges too quickly at the start of training, there is a risk of getting stuck in a local minimum.
π₯¨ Adam
Definition: Adam is a combination of Momentum and RMSprop.
Adam uses separate learning rates for each weight and adaptively adjusts them while applying momentum to account for the direction of the gradients.
Adam updates the weights using two EMA estimates:
First Moment: The mean of the gradients, capturing the direction of the gradients through momentum.
Second Moment: The squared gradients, adjusting the learning rate based on gradient magnitude, similar to RMSprop.
Adam Update Process:
First Moment Estimate:
The EMA of the gradient gtβ (first moment) is calculated as:
mtβ=Ξ²1βmtβ1β+(1βΞ²1β)gtβ
mtβ: Exponential moving average of the first moment (mean) of the gradient gtβ
Ξ²1β: Decay rate for the first moment, typically set to 0.9
gtβ: The current gradient
Second Moment Estimate:
The EMA of the squared gradient (second moment) vtβ is calculated as:
vtβ=Ξ²2βvtβ1β+(1βΞ²2β)gt2β
vtβ: Exponential moving average of the second moment (variance) of the squared gradient
Ξ²2β: Decay rate for the second moment, typically set to 0.999
gt2β: The square of the current gradient
Bias-Correction: At the early stages of Adam, the estimates for mtβ and vtβ are biased toward 0, so bias correction is applied:
Bias correction for the first moment:
m^tβ=1βΞ²1tβmtββ
Bias correction for the second moment:
v^tβ=1βΞ²2tβvtββ
Weight Update: The weights are updated using the bias-corrected moment estimates:
Wt+1β=Wtββv^tββ+ϡηβm^tβ
Wtβ: The current weight
Ξ·: Learning rate
Ο΅: A small constant for numerical stability, typically set to 10β8 to prevent division by zero
m^tβ: Bias-corrected first moment (mean of the gradients)
v^tβ: Bias-corrected second moment (variance of the gradients)