Stochastic Gradient Descent (SGD) is a fundamental optimization technique used widely in machine learning and deep learning to minimize the loss function associated with a model. The "stochastic" part of SGD refers to the fact that the gradient (the vector of partial derivatives of the loss function with respect to the model's parameters) is computed on a randomly selected subset of the data rather than on the full dataset. This approach contrasts with traditional gradient descent, which uses the entire dataset to compute the gradient at every step. SGD's popularity stems from its efficiency and effectiveness in handling large datasets and complex models.
Gradient Descent (GD) forms the backbone of many optimization algorithms. It operates on the premise that moving in the direction opposite to the gradient of a function at a given point leads toward a local minimum of that function. Mathematically, the update rule for GD when minimizing a loss function $L(\theta)$, where $\theta$ denotes the parameters of the model, is given by:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$

Here, $\eta$ represents the learning rate, a hyperparameter that controls the size of the steps taken towards the minimum, and $\nabla_\theta L(\theta)$ denotes the gradient of the loss function with respect to the parameters $\theta$.
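To make the update rule concrete, here is a minimal sketch of full-batch gradient descent applied to least-squares linear regression with NumPy; the synthetic data, learning rate, and iteration count are illustrative assumptions, not tuned values from any reference implementation.

```python
# Full-batch gradient descent on least-squares linear regression (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features (synthetic)
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)                           # parameter initialization
eta = 0.1                                     # learning rate (illustrative choice)

for _ in range(200):
    # Gradient of the mean squared error L(theta) = (1/n) * ||X theta - y||^2
    grad = 2.0 / len(X) * X.T @ (X @ theta - y)
    theta -= eta * grad                       # update: theta <- theta - eta * grad

print(theta)                                  # should approach true_theta
```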
SGD modifies this approach by estimating the gradient from a subset of the data, typically a single example (mini-batch size of 1) or a small batch of examples. This estimation significantly reduces the computational cost of each update, making it viable to train on large datasets. The update rule for SGD can be similarly expressed as:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L_B(\theta)$$

where $L_B(\theta)$ is the loss computed on a randomly selected subset (mini-batch) $B$ of the data.
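A minimal sketch of a single SGD update under the same least-squares setup as above; the `sgd_step` helper, batch size, and learning rate are illustrative assumptions rather than a definitive implementation.

```python
# One SGD update: sample a mini-batch, estimate the gradient, take a step.
import numpy as np

def sgd_step(theta, X, y, rng, eta=0.01, batch_size=8):
    """Return theta after one mini-batch SGD update on least-squares regression."""
    idx = rng.choice(len(X), size=batch_size, replace=False)   # random mini-batch B
    Xb, yb = X[idx], y[idx]
    # Gradient of the mini-batch mean squared error L_B(theta)
    grad = 2.0 / batch_size * Xb.T @ (Xb @ theta - yb)
    return theta - eta * grad                                  # theta <- theta - eta * grad_B

# Usage, reusing X, y, theta, and rng from the previous sketch:
# theta = sgd_step(theta, X, y, rng)
```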
The steps to implement SGD are as follows:
1. Initialization: Initialize the parameters $\theta$ of the model randomly or with a predefined strategy.
2. Mini-batch Selection: Randomly select a mini-batch of data points from the training set.
3. Gradient Computation: Compute the gradient of the loss function with respect to the parameters $\theta$, but only for the selected mini-batch, i.e., $\nabla_\theta L_B(\theta)$. This gradient is an estimate of the true gradient computed on the entire dataset.
4. Parameter Update: Update the model's parameters using the estimated gradient and the learning rate according to the formula $\theta \leftarrow \theta - \eta \, \nabla_\theta L_B(\theta)$.
5. Repeat: Repeat steps 2-4 for a predefined number of iterations (or epochs) or until the loss converges to an acceptable value; a code sketch of the full loop follows this list.
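Putting the steps together, the following sketch implements the loop above for least-squares regression; the epoch count, batch size, and learning rate are illustrative and untuned.

```python
# Mini-batch SGD training loop for least-squares regression (steps 1-5 above).
import numpy as np

def sgd(X, y, eta=0.05, batch_size=16, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                                   # step 1: initialization
    for _ in range(n_epochs):                             # step 5: repeat until the budget is spent
        order = rng.permutation(n)                        # shuffle so mini-batches are random
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]         # step 2: mini-batch selection
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)   # step 3: gradient estimate
            theta -= eta * grad                           # step 4: parameter update
    return theta

# Example usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(sgd(X, y))   # should approach [2.0, -1.0, 0.5]
```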
The essence of SGD lies in the gradient computation step, where the mini-batch gradient $\nabla_\theta L_B(\theta)$ is an estimate of the true gradient $\nabla_\theta L(\theta)$. For a loss function $L(\theta)$ defined over the entire dataset and a mini-batch $B$, the estimated gradient is:

$$\nabla_\theta L_B(\theta) = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta \ell(\theta; x_i, y_i)$$

where $\ell(\theta; x_i, y_i)$ is the loss associated with a single data point $(x_i, y_i)$, and $|B|$ is the size of the mini-batch.
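As a quick numerical check of this estimator, under the same synthetic least-squares assumptions used earlier, averaging the mini-batch gradient over many random mini-batches should approach the full-batch gradient at the same parameter values.

```python
# Compare the full-batch gradient with mini-batch estimates (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
theta = rng.normal(size=3)               # arbitrary point at which to compare gradients

def grad_mse(theta, Xs, ys):
    # (1/m) * sum over the m given points of the per-example squared-error gradient
    return 2.0 / len(ys) * Xs.T @ (Xs @ theta - ys)

full_grad = grad_mse(theta, X, y)
batch_grads = []
for _ in range(2000):
    idx = rng.choice(len(y), size=32, replace=False)     # mini-batch with |B| = 32
    batch_grads.append(grad_mse(theta, X[idx], y[idx]))

print(full_grad)
print(np.mean(batch_grads, axis=0))      # close to full_grad on average
```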
SGD is applicable across a wide range of machine learning and deep learning tasks, and its efficiency on large datasets has made it a default choice in modern practice.