💡 Backward propagation
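An algorithm, based on the chain rule, for computing the gradient of the loss with respect to every weight and bias in a neural network.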
🥨 Forward Propagation
Data is propagated from the input layer to the output layer, with calculations occurring at each neuron.

- Transmission of input data: The input data $X$ is passed to the input layer of the neural network, typically in the following form:
  $$X = [x_1, x_2, x_3, \ldots, x_n]^T$$
  - Here, $n$ is the number of input features.
- Linear Transformation: Each layer receives the output of the previous layer and applies a linear transformation to it: the weight matrix is multiplied by the input and the bias is added.
  - For the first hidden layer, the input is $A^{[0]} = X$.
  $$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$
  - $Z^{[l]}$: Linear transformation result (pre-activation value) of the $l$-th layer.
  - $W^{[l]}$: Weight matrix of the $l$-th layer.
  - $A^{[l-1]}$: Activation (output) of the previous layer.
  - $b^{[l]}$: Bias vector of the $l$-th layer.
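The linear transformation can be sketched in a few lines. This is a minimal, dependency-free illustration for a single example. The function name `linear_forward` and the numbers are made up for the sketch, and a real implementation would more likely use a matrix library.

```python
def linear_forward(W, A_prev, b):
    """Compute Z = W @ A_prev + b for a single example (pure Python).

    W: list of rows, shape (n_out, n_in); A_prev: list of n_in values;
    b: list of n_out values.
    """
    return [
        sum(w_ij * a_j for w_ij, a_j in zip(row, A_prev)) + b_i
        for row, b_i in zip(W, b)
    ]


# Hypothetical layer with 3 inputs and 2 neurons.
W1 = [[1.0, 0.0, -1.0],
      [0.5, 0.5, 0.5]]
b1 = [0.0, 1.0]
X = [1.0, 2.0, 3.0]  # A[0] = X

Z1 = linear_forward(W1, X, b1)
print(Z1)  # [-2.0, 4.0]
```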
- Application of Activation Function: An activation function $\sigma$ is applied to the result of the linear transformation, adding non-linearity and generating the output to be passed to the next layer.
  $$A^{[l]} = \sigma(Z^{[l]})$$
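As a rough sketch, the activation step applies a scalar function to every entry of the pre-activation vector. Sigmoid and ReLU are used here only as two common choices for $\sigma$. The helper name `activate` and the sample values are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, one common choice for the activation sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit, another common choice."""
    return max(0.0, z)


def activate(Z, fn):
    """Apply an activation function element-wise: A = fn(Z)."""
    return [fn(z) for z in Z]


Z1 = [-2.0, 0.0, 4.0]  # hypothetical pre-activation values
A1_sigmoid = activate(Z1, sigmoid)
A1_relu = activate(Z1, relu)
print(A1_relu)  # [0.0, 0.0, 4.0]
```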
- Prediction Calculation at Output Layer: In the output layer, an activation function suited to the problem type generates the final predicted values.
  - Regression problem: A linear (identity) output is used, i.e. no non-linear activation is applied.
  - Binary classification problem: The sigmoid function converts the output to a probability.
  - Multiclass classification problem: The softmax function converts the outputs to a probability distribution over the classes.
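The three output-layer choices might look like this in code. This is a sketch only: the function `output_activation` and its string labels are invented for the example. Subtracting the maximum inside `softmax` is a standard trick to avoid overflow and does not change the result.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(Z):
    """Softmax over a list of scores; subtracting the max avoids overflow."""
    m = max(Z)
    exps = [math.exp(z - m) for z in Z]
    total = sum(exps)
    return [e / total for e in exps]


def output_activation(Z, problem):
    """Pick the output activation from the problem type (labels are illustrative)."""
    if problem == "regression":
        return list(Z)                    # identity: raw values are the prediction
    if problem == "binary":
        return [sigmoid(z) for z in Z]    # probability of the positive class
    if problem == "multiclass":
        return softmax(Z)                 # probabilities that sum to 1
    raise ValueError(f"unknown problem type: {problem}")


probs = output_activation([2.0, 1.0, 0.1], "multiclass")
print(probs, sum(probs))
```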
- Loss Function Calculation: The predicted value $\hat{Y}$ is compared with the actual value $Y$ using a loss function chosen for the problem type.
  - For regression problems: MSE (Mean Squared Error).
  - For binary classification problems: Binary Cross-Entropy Loss.
  - For multiclass classification problems: Multiclass (Categorical) Cross-Entropy Loss.
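A minimal sketch of the three losses, written with plain lists. The function names and the `eps` clipping constant are choices made for this example. Clipping is a common safeguard against `log(0)`, not something these notes prescribe.

```python
import math


def mse(y_hat, y):
    """Mean squared error over a list of predictions."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy; y_hat are probabilities, y are 0/1 labels."""
    total = 0.0
    for p, t in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y)


def categorical_cross_entropy(y_hat, y_onehot, eps=1e-12):
    """Multiclass cross-entropy for one example with a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(y_hat, y_onehot))


print(mse([2.0, 4.0], [1.0, 2.0]))  # 2.5
```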
🥨 Chain Rule
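For reference, the chain rule for a composition of two functions can be stated as follows. The symbols $f$, $g$, $u$ and the worked example are generic placeholders, not quantities defined elsewhere in these notes.

$$y = f(u),\quad u = g(x) \quad\Rightarrow\quad \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

For example, with $y = (3x + 1)^2$ and $u = 3x + 1$: $\frac{dy}{du} = 2u$ and $\frac{du}{dx} = 3$, so $\frac{dy}{dx} = 6(3x + 1)$.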
🥨 Backward propagation
- Definition
  - The process of calculating the gradient of the loss with respect to each weight by propagating the error from the output layer back through the hidden layers toward the input layer.
  - The chain rule is used to compute the gradients at each layer.
- Applying Chain Rule in a Single Layer
  - In a multilayer perceptron, the linear transformation in the $l$-th layer is defined as:
    $$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$
    - $A^{[l-1]}$: Output (activation) of the previous layer
    - $W^{[l]}$: Weight matrix of the $l$-th layer
    - $b^{[l]}$: Bias vector of the $l$-th layer
  - The activation value $A^{[l]}$ of the $l$-th layer is calculated using the activation function $\sigma$:
    $$A^{[l]} = \sigma(Z^{[l]})$$
  - Loss function $L$: A measure of the difference between the predicted value $\hat{Y}$ produced by the output layer and the actual value $Y$.
  - Backpropagation involves calculating the gradient $\frac{\partial L}{\partial W^{[l]}}$ of the loss $L$ with respect to each weight matrix $W^{[l]}$.
- Role of Chain Rule in Backpropagation
  - The gradient of the loss function with respect to the weight $W^{[l]}$ in the $l$-th layer, using the chain rule:
    $$\frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}$$
    - $\frac{\partial L}{\partial A^{[l]}}$: The gradient of the loss function with respect to the activation value $A^{[l]}$ in the $l$-th layer (which is related to the error $\delta^{[l+1]}$ passed from the next layer).
    - $\frac{\partial A^{[l]}}{\partial Z^{[l]}}$: The derivative of the activation function.
    - $\frac{\partial Z^{[l]}}{\partial W^{[l]}}$: The derivative of the linear transformation with respect to the weight, which is the activation value $A^{[l-1]}$ from the previous layer.
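The three factors are often combined into one compact expression. As a sketch, assume a single training example with column-vector activations, and write $\delta^{[l]}$ for the layer error $\frac{\partial L}{\partial Z^{[l]}}$ (the symbol these notes use for the bias gradient and for error propagation). The weight gradient is then an outer product:

$$\delta^{[l]} = \frac{\partial L}{\partial A^{[l]}} \odot \sigma'(Z^{[l]}), \qquad \frac{\partial L}{\partial W^{[l]}} = \delta^{[l]} \, A^{[l-1]T}$$

Here $\odot$ denotes element-wise multiplication. The result has the same shape as $W^{[l]}$: one row per neuron in layer $l$, one column per activation coming from layer $l-1$.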
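  - The gradient of the loss function with respect to the bias $b^{[l]}$ in the $l$-th layer, using the chain rule:
    $$\frac{\partial L}{\partial b^{[l]}} = \frac{\partial L}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial b^{[l]}}$$
    - Since $\frac{\partial Z^{[l]}}{\partial b^{[l]}} = 1$, the gradient with respect to the bias is the same as the error $\delta^{[l]} = \frac{\partial L}{\partial Z^{[l]}}$ in the $l$-th layer:
    $$\frac{\partial L}{\partial b^{[l]}} = \delta^{[l]}$$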
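One way to convince yourself that the chain-rule gradients are right is to compare them with finite differences. The sketch below does this for a single sigmoid neuron with a half squared-error loss. The setup, the names, and the step size `h` are all assumptions chosen to keep the example small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Half squared error of a single sigmoid neuron (an assumed toy setup)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 0.5 * (sigmoid(z) - y) ** 2


def analytic_grads(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and dL/db = dL/dz * 1."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    a = sigmoid(z)
    dL_da = a - y                     # derivative of 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)             # derivative of the sigmoid
    delta = dL_da * da_dz             # delta = dL/dz
    dW = [delta * x_j for x_j in x]   # dz/dw_j = x_j
    db = delta                        # dz/db = 1
    return dW, db


def numeric_grad(f, value, h=1e-6):
    """Central finite difference of a scalar function f at value."""
    return (f(value + h) - f(value - h)) / (2.0 * h)


w, b, x, y = [0.3, -0.2], 0.1, [1.0, 2.0], 1.0
dW, db = analytic_grads(w, b, x, y)
db_numeric = numeric_grad(lambda v: loss(w, v, x, y), b)
dw0_numeric = numeric_grad(lambda v: loss([v, w[1]], b, x, y), w[0])
print(db, db_numeric)
print(dW[0], dw0_numeric)
```

The analytic and numeric values should agree to several decimal places. A large gap usually points to a mistake in one of the chain-rule factors.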
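- Error Propagation in Backpropagation
  - Error Calculation at Output Layer: The derivative of the loss function at the output layer gives the output error, which is then used to compute the gradients of the weights and biases in the output layer.
    - The error at the output layer (here the superscript $[L]$ denotes the last layer, not the loss) is defined as:
      $$\delta^{[L]} = \frac{\partial L}{\partial Z^{[L]}} = \hat{Y} - Y$$
    - The simple form $\hat{Y} - Y$ holds for the usual pairings of output activation and loss: sigmoid with binary cross-entropy, softmax with multiclass cross-entropy, and a linear output with squared error (up to a constant factor).
    - $\hat{Y}$: The predicted value from the model.
    - $Y$: The actual value.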
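As a worked check, assume a single sigmoid output unit trained with binary cross-entropy. Lowercase $\hat{y}$, $y$ and $z$ stand for the scalar prediction, label and pre-activation of that unit. Differentiating with respect to the pre-activation gives:

$$
\begin{aligned}
L &= -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right], \qquad \hat{y} = \sigma(z) \\
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= \hat{y}\,(1 - \hat{y}) \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y
\end{aligned}
$$

The factor $\hat{y}(1 - \hat{y})$ cancels, which is why the error with respect to the pre-activation reduces to the plain difference between prediction and target.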
  - Error Propagation to Hidden Layers: To propagate the error to the hidden layers, use the chain rule to compute the error at each layer. The error at the $l$-th layer is calculated by multiplying the error from the $(l+1)$-th layer by the transposed weight matrix of that layer, then multiplying element-wise ($\odot$) by the derivative of the activation function:
    $$\delta^{[l]} = \left( W^{[l+1]T} \delta^{[l+1]} \right) \odot \sigma'(Z^{[l]})$$
    - $\sigma'(Z^{[l]})$: The derivative of the activation function at the $l$-th layer.
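The hidden-layer error can be sketched as a transposed matrix-vector product followed by an element-wise multiplication. The sigmoid activation, the function name `hidden_delta`, and the numbers below are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def hidden_delta(W_next, delta_next, Z):
    """delta[l] = (W[l+1]^T delta[l+1]) * sigma'(Z[l]), element-wise.

    W_next has shape (n_next, n_l); delta_next has n_next entries;
    Z holds the n_l pre-activations of layer l.
    """
    back = [
        sum(W_next[k][j] * delta_next[k] for k in range(len(delta_next)))
        for j in range(len(Z))
    ]  # W_next transposed, times delta_next
    return [b_j * sigmoid_prime(z_j) for b_j, z_j in zip(back, Z)]


# Hypothetical values: layer l has 2 units, layer l+1 has 1 unit.
W2 = [[2.0, -4.0]]
delta2 = [0.5]
Z1 = [0.0, 0.0]
delta1 = hidden_delta(W2, delta2, Z1)
print(delta1)  # [0.25, -0.5]
```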
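  - Updating Weights and Biases Across All Layers: After calculating the error and gradients for each layer, the weights and biases are updated using gradient descent, where $\eta$ is the learning rate.
    - Weight update:
      $$W^{[l]} = W^{[l]} - \eta \frac{\partial L}{\partial W^{[l]}}$$
    - Bias update:
      $$b^{[l]} = b^{[l]} - \eta \frac{\partial L}{\partial b^{[l]}}$$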
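Putting the pieces together, here is a minimal sketch of repeated training steps on a single example. It assumes a tiny 2-2-1 network with sigmoid activations and binary cross-entropy loss. The initial weights, the learning rate, and all function names are made up for the illustration, and a real implementation would typically vectorize this with a matrix library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(params, x):
    """Forward pass: returns hidden activations a1 and output activations a2."""
    W1, b1, W2, b2 = params
    a1 = [sigmoid(z + b) for z, b in zip(matvec(W1, x), b1)]
    a2 = [sigmoid(z + b) for z, b in zip(matvec(W2, a1), b2)]
    return a1, a2


def bce(y_hat, y):
    """Binary cross-entropy for one prediction."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def train_step(params, x, y, eta):
    """One forward pass, one backward pass, one gradient descent update."""
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    delta2 = [a2[0] - y]  # output error (sigmoid + cross-entropy)
    delta1 = [W2[0][j] * delta2[0] * a1[j] * (1.0 - a1[j])
              for j in range(len(a1))]  # hidden error, uses the old W2
    # Gradients: dL/dW = delta * (previous activation)^T, dL/db = delta.
    W2 = [[W2[0][j] - eta * delta2[0] * a1[j] for j in range(len(a1))]]
    b2 = [b2[0] - eta * delta2[0]]
    W1 = [[W1[j][i] - eta * delta1[j] * x[i] for i in range(len(x))]
          for j in range(len(delta1))]
    b1 = [b1[j] - eta * delta1[j] for j in range(len(delta1))]
    return W1, b1, W2, b2


params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0], [[0.5, -0.5]], [0.0])
x, y, eta = [1.0, 0.5], 1.0, 0.5
loss_before = bce(forward(params, x)[1][0], y)
for _ in range(20):
    params = train_step(params, x, y, eta)
loss_after = bce(forward(params, x)[1][0], y)
print(loss_before, loss_after)
```

The loss after the updates should be lower than before. Note that the hidden error is computed from the old output weights before any parameter is changed, so every gradient refers to the same forward pass.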