Backward propagation

been_29 · October 17, 2024

💡 Backward propagation

An algorithm that uses the chain rule to compute the gradient of the loss with respect to every weight and bias in a neural network.


🥨 Forward Propagation

Data is propagated from the input layer to the output layer, with calculations occurring at each neuron.

  1. Transmission of input data: The input data $\mathbf{X}$ is passed to the input layer of the neural network, typically in the following form:

    $$\mathbf{X} = [x_1, x_2, x_3, \dots, x_n]^T$$
    • Here, $n$ is the number of input features.
  2. Linear Transformation: The neurons in each layer receive the output of the previous layer and apply a linear transformation to it: the weight matrix is multiplied by the input, and the bias is added:

    • For the first hidden layer, the input is $\mathbf{A}^{[0]} = \mathbf{X}$.
    $$\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}$$
    • $\mathbf{Z}^{[l]}$: Linear transformation result (pre-activation value) of the $l$-th layer.
    • $\mathbf{W}^{[l]}$: Weight matrix of the $l$-th layer.
    • $\mathbf{A}^{[l-1]}$: Activation (output) of the previous layer.
    • $\mathbf{b}^{[l]}$: Bias vector of the $l$-th layer.
  3. Application of Activation Function: An activation function $\sigma$ is applied to the result of the linear transformation, adding non-linearity and generating the output to be passed to the next layer.

    $$\mathbf{A}^{[l]} = \sigma(\mathbf{Z}^{[l]})$$
  4. Prediction Calculation at Output Layer: In the output layer, a suitable activation function is chosen based on the problem type to generate the final predicted values.

    • Regression problem: A linear output is used, i.e. the identity function with no non-linear activation.
    • Binary classification problem: Sigmoid function is used to convert the output to probability values.
    • Multiclass classification problem: Softmax function is used.
  5. Loss Function Calculation: The predicted value $\hat{\mathbf{Y}}$ is compared with the actual value $\mathbf{Y}$ using a loss function chosen for the problem type:

    • For regression problems: MSE (Mean Squared Error).
    • For binary classification problems: Binary Cross-Entropy Loss.
    • For multiclass classification problems: Multiclass Cross-Entropy Loss.
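
The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the 2-3-1 layer sizes, the hand-picked weights, the helper names (`linear`, `forward`, `binary_cross_entropy`), and the use of sigmoid in every layer are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function applied to a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, a_prev, b):
    """Z = W A_prev + b for one example; W is stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, params):
    """Pass x through every layer (sigmoid is assumed for all layers)."""
    a = x  # A^[0] = X
    for W, b in params:
        z = linear(W, a, b)              # step 2: linear transformation
        a = [sigmoid(z_i) for z_i in z]  # step 3: activation
    return a

def binary_cross_entropy(y_hat, y):
    """Step 5: loss for a single binary label."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical 2-3-1 network with hand-picked weights.
params = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),  # layer 1 (3x2)
    ([[0.3, -0.6, 0.9]], [0.05]),                                 # layer 2 (1x3)
]
x = [1.0, 2.0]                      # step 1: input with n = 2 features
y_hat = forward(x, params)[0]       # step 4: sigmoid output, a probability
loss = binary_cross_entropy(y_hat, 1.0)
print(y_hat, loss)
```

Because the output layer is a sigmoid, `y_hat` always lies strictly between 0 and 1, which is what makes the logarithms in the loss well defined.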






🥨 Chain Rule

  • Definition: A method to calculate the derivative of a composite function.

    • Assume $y$ is a function of the variable $u$, and $u$ is a function of the variable $x$:

      $$y = f(u), \quad u = g(x)$$
    • Here, $y$ is a composite function of $x$: $y = f(g(x))$. According to the chain rule, the derivative of $y$ with respect to $x$ is expressed as:

      $$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
    • In other words, the derivative of the composite function is the product of the derivative of $y$ with respect to the intermediate variable $u$ and the derivative of $u$ with respect to $x$.
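
The single-variable rule can be checked numerically. The sketch below assumes an arbitrary example pair, $f(u) = \sin(u)$ and $g(x) = x^2$ (neither comes from the post), and compares the chain-rule result with a central finite difference.

```python
import math

def g(x):
    return x ** 2          # u = g(x)

def f(u):
    return math.sin(u)     # y = f(u)

def dy_dx_chain(x):
    """dy/dx = dy/du * du/dx, with dy/du = cos(u) and du/dx = 2x."""
    u = g(x)
    return math.cos(u) * (2 * x)

def dy_dx_numeric(x, h=1e-6):
    """Central finite difference of the composite f(g(x))."""
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 1.5
print(dy_dx_chain(x), dy_dx_numeric(x))  # the two values should agree closely
```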

  • Chain Rule with Multiple Variables

    • Assume $y = f(u, v)$ is a function of two variables $u$ and $v$, and $u$ and $v$ are functions of $x$:

      $$y = f(u(x), v(x))$$
    • In this case, the chain rule states that the derivative of $y$ with respect to $x$ is:

      $$\frac{dy}{dx} = \frac{\partial y}{\partial u} \cdot \frac{du}{dx} + \frac{\partial y}{\partial v} \cdot \frac{dv}{dx}$$
    • This means that the total derivative is found by summing the products of the partial derivatives of $y$ with respect to each variable and the derivative of each variable with respect to $x$.
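
The multivariable form can be verified the same way. As an assumed example (not taken from the post), take $y = u \cdot v$ with $u = x^2$ and $v = \sin(x)$, so that $\partial y / \partial u = v$ and $\partial y / \partial v = u$.

```python
import math

def y_of_x(x):
    """y = f(u(x), v(x)) with f(u, v) = u * v, u = x^2, v = sin(x)."""
    return (x ** 2) * math.sin(x)

def dy_dx_chain(x):
    """Sum of (partial derivative) * (derivative of the inner function)."""
    u, v = x ** 2, math.sin(x)
    dy_du, dy_dv = v, u                  # partials of f(u, v) = u * v
    du_dx, dv_dx = 2 * x, math.cos(x)    # derivatives of u(x) and v(x)
    return dy_du * du_dx + dy_dv * dv_dx

def dy_dx_numeric(x, h=1e-6):
    """Central finite difference of y with respect to x."""
    return (y_of_x(x + h) - y_of_x(x - h)) / (2 * h)

x = 0.8
print(dy_dx_chain(x), dy_dx_numeric(x))  # the two values should agree closely
```

The same summation over paths is what backpropagation performs when one activation feeds several neurons in the next layer.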






🥨 Backward propagation

  • Definition

    • The process of calculating the gradient of the loss with respect to each weight and bias by propagating the error from the output layer back through the hidden layers toward the input layer.
    • The chain rule is used to compute the gradients at each layer.
  • Applying Chain Rule in a Single Layer

    • In a multilayer perceptron, the linear transformation in the $l$-th layer is defined as:

      $$\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}$$
      • $\mathbf{A}^{[l-1]}$: Output (activation) of the previous layer
      • $\mathbf{W}^{[l]}$: Weight matrix of the $l$-th layer
      • $\mathbf{b}^{[l]}$: Bias vector of the $l$-th layer
    • The activation value $\mathbf{A}^{[l]}$ of the $l$-th layer is calculated using the activation function $\sigma$:

      $$\mathbf{A}^{[l]} = \sigma(\mathbf{Z}^{[l]})$$
      • Loss function $L$: A measure of the difference between the predicted value $\hat{\mathbf{Y}}$ from the output layer and the actual value $\mathbf{Y}$.
    • Backpropagation involves calculating the gradient $\frac{\partial L}{\partial w^{[l]}}$ of the loss $L$ with respect to each weight $w^{[l]}$.

  • Role of Chain Rule in Backpropagation

    • Applying the chain rule, the gradient of the loss function with respect to the weight $\mathbf{W}^{[l]}$ in the $l$-th layer is:

      $$\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \frac{\partial L}{\partial \mathbf{A}^{[l]}} \cdot \frac{\partial \mathbf{A}^{[l]}}{\partial \mathbf{Z}^{[l]}} \cdot \frac{\partial \mathbf{Z}^{[l]}}{\partial \mathbf{W}^{[l]}}$$
      • $\frac{\partial L}{\partial \mathbf{A}^{[l]}}$: The gradient of the loss function with respect to the activation value $\mathbf{A}^{[l]}$ in the $l$-th layer (computed from the error $\delta^{[l+1]}$ passed back from the next layer).
      • $\frac{\partial \mathbf{A}^{[l]}}{\partial \mathbf{Z}^{[l]}}$: The derivative of the activation function, $\sigma'(\mathbf{Z}^{[l]})$.
      • $\frac{\partial \mathbf{Z}^{[l]}}{\partial \mathbf{W}^{[l]}}$: The derivative of the linear transformation with respect to the weight, which is the activation value $\mathbf{A}^{[l-1]}$ from the previous layer.
      • The first two factors combine into $\frac{\partial L}{\partial \mathbf{Z}^{[l]}}$, so in matrix form the gradient is $\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \frac{\partial L}{\partial \mathbf{Z}^{[l]}} \, \mathbf{A}^{[l-1]T}$.
    • Applying the chain rule, the gradient of the loss function with respect to the bias $\mathbf{b}^{[l]}$ in the $l$-th layer is:

      $$\frac{\partial L}{\partial \mathbf{b}^{[l]}} = \frac{\partial L}{\partial \mathbf{Z}^{[l]}} \cdot \frac{\partial \mathbf{Z}^{[l]}}{\partial \mathbf{b}^{[l]}}$$
      • Since $\frac{\partial \mathbf{Z}^{[l]}}{\partial \mathbf{b}^{[l]}} = 1$, the gradient with respect to the bias is the same as the error $\delta^{[l]} = \frac{\partial L}{\partial \mathbf{Z}^{[l]}}$ in the $l$-th layer:
        $$\frac{\partial L}{\partial \mathbf{b}^{[l]}} = \delta^{[l]}$$
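
To make the weight and bias formulas concrete, the sketch below applies them to a single sigmoid neuron and compares the result with finite differences. The squared-error loss $\frac{1}{2}(a - y)^2$, the input values, and the function names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, a_prev, y):
    """Forward pass of one sigmoid neuron followed by 0.5 * (a - y)^2."""
    z = sum(w_j * a_j for w_j, a_j in zip(w, a_prev)) + b
    return 0.5 * (sigmoid(z) - y) ** 2

def gradients(w, b, a_prev, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw and dL/db = dL/dz * 1."""
    z = sum(w_j * a_j for w_j, a_j in zip(w, a_prev)) + b
    a = sigmoid(z)
    dL_da = a - y                 # derivative of 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)         # derivative of the sigmoid
    delta = dL_da * da_dz         # dL/dz, the error of this neuron
    dL_dw = [delta * a_j for a_j in a_prev]  # dz/dw_j = a_prev_j
    dL_db = delta                            # dz/db = 1
    return dL_dw, dL_db

def numeric_gradients(w, b, a_prev, y, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    dL_dw = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + h] + w[j + 1:]
        w_minus = w[:j] + [w[j] - h] + w[j + 1:]
        diff = loss_fn(w_plus, b, a_prev, y) - loss_fn(w_minus, b, a_prev, y)
        dL_dw.append(diff / (2 * h))
    dL_db = (loss_fn(w, b + h, a_prev, y) - loss_fn(w, b - h, a_prev, y)) / (2 * h)
    return dL_dw, dL_db

w, b = [0.2, -0.4, 0.1], 0.3      # hypothetical parameters
a_prev, y = [1.0, 0.5, -1.5], 1.0  # hypothetical input and target
print(gradients(w, b, a_prev, y))
print(numeric_gradients(w, b, a_prev, y))
```

Note that the bias gradient is exactly the intermediate value `delta`, matching the statement that the gradient with respect to the bias equals the error of the layer.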
  • Error Propagation in Backpropagation

    1. Error Calculation at Output Layer: The derivative of the loss function is calculated first, and this value is used to compute the gradients of the weights and biases in the output layer.
      • With the error of a layer defined as $\delta^{[l]} = \frac{\partial L}{\partial \mathbf{Z}^{[l]}}$, the error at the output layer (the superscript $[L]$ is the index of the last layer, not the loss) is:
        $$\delta^{[L]} = \frac{\partial L}{\partial \mathbf{Z}^{[L]}} = \hat{\mathbf{Y}} - \mathbf{Y}$$
      • $\hat{\mathbf{Y}}$: The predicted value from the model.
      • $\mathbf{Y}$: The actual value.
      • This simple form holds for the usual pairings of output activation and loss: identity with squared error (scaled by $\frac{1}{2}$), sigmoid with binary cross-entropy, and softmax with multiclass cross-entropy.
    2. Error Propagation to Hidden Layers: To propagate the error to the hidden layers, use the chain rule to compute the error at each layer. The error at the $l$-th layer is calculated by multiplying the error from the $(l+1)$-th layer by the transposed weight matrix, then multiplying element-wise by the derivative of the activation function:
        $$\delta^{[l]} = \left( \mathbf{W}^{[l+1]T} \delta^{[l+1]} \right) \odot \sigma'(\mathbf{Z}^{[l]})$$
      • $\sigma'(\mathbf{Z}^{[l]})$: The derivative of the activation function at the $l$-th layer.
      • $\odot$: Element-wise (Hadamard) product.
    3. Updating Weights and Biases Across All Layers: After calculating the error and gradients for each layer, the weights and biases are updated using gradient descent.
      • Weight update:
        $$\mathbf{W}^{[l]} = \mathbf{W}^{[l]} - \eta \frac{\partial L}{\partial \mathbf{W}^{[l]}}$$
      • Bias update:
        $$\mathbf{b}^{[l]} = \mathbf{b}^{[l]} - \eta \frac{\partial L}{\partial \mathbf{b}^{[l]}}$$
      • $\eta$: Learning rate.
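
Putting the three steps together, the following is a minimal sketch of backpropagation with gradient descent for one training example. It assumes a hypothetical 2-2-1 network, sigmoid activations in every layer, and binary cross-entropy loss, so that the output error is the predicted value minus the actual value and $\sigma'(z) = a(1 - a)$. The function names, starting weights, learning rate, and step count are illustrative choices only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_hat, y):
    """Binary cross-entropy for a single label."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def forward(x, params):
    """Return the activations of every layer, starting with A^[0] = x."""
    activations = [x]
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

def backward(activations, y, params):
    """Return (dL/dW, dL/db) for each layer, for one example."""
    # Step 1: output error (sigmoid output with binary cross-entropy).
    delta = [a - t for a, t in zip(activations[-1], y)]
    grads = [None] * len(params)
    for l in reversed(range(len(params))):
        a_prev = activations[l]
        dW = [[d * a for a in a_prev] for d in delta]  # delta times A_prev^T
        grads[l] = (dW, list(delta))                   # dL/db = delta
        if l > 0:
            W = params[l][0]
            # Step 2: (W^T delta) times sigma'(Z), with sigma' = a * (1 - a).
            delta = [sum(W[k][j] * delta[k] for k in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads

def sgd_step(params, grads, lr):
    """Step 3: W = W - lr * dL/dW and b = b - lr * dL/db (returns new lists)."""
    return [([[w - lr * g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(W, dW)],
             [b_i - lr * g for b_i, g in zip(b, db)])
            for (W, b), (dW, db) in zip(params, grads)]

# Hypothetical starting weights and a single training example.
params0 = [([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]),
           ([[0.3, -0.6]], [0.0])]
x, y = [1.0, 0.5], [1.0]

params = params0
initial_loss = bce(forward(x, params)[-1][0], y[0])
for _ in range(50):
    acts = forward(x, params)
    params = sgd_step(params, backward(acts, y, params), lr=0.5)
final_loss = bce(forward(x, params)[-1][0], y[0])
print(initial_loss, final_loss)  # the loss should be lower after training
```

Storing every layer's activations during the forward pass is what lets the backward pass reuse them, both for the weight gradients and for the sigmoid derivative.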