Backpropagation

이세현 · December 1, 2024

Computing Gradients

$$s=f(x;W_1,W_2)=W_2\max(0,W_1x)$$
$$L=\frac{1}{N}\sum_{i=1}^{N}L_i+\lambda R(W_1)+\lambda R(W_2)$$

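Once $\frac{\partial L}{\partial W_1}$ and $\frac{\partial L}{\partial W_2}$ are computed, $W_1$ and $W_2$ can be trained.
A derivative of the loss with respect to the weights that is worked out by hand has to be re-derived every time the activation function changes.
- This makes the code hard to maintain.
- Computational graphs modularize the computation so that each gradient stays simple to compute.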

Example

$$f(x,y,z)=(x+y)z, \quad (x=-2,\ y=5,\ z=-4)$$

Forward Pass: $q=x+y, f=qz$
Backward Pass: Compute derivatives
- We need $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$.
- $\frac{\partial f}{\partial f}=1$
- $\frac{\partial f}{\partial z}=q=3$
- $\frac{\partial f}{\partial q}=z=-4$
- $\frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial y}=\frac{\partial f}{\partial q}=-4$
- $\frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial x}=\frac{\partial f}{\partial q}=-4$
- Modularization
  - Downstream Gradient: $\frac{\partial f}{\partial x}$ (what we want: how much this variable affects the final result)
  - Local Gradient: $\frac{\partial q}{\partial x}$ (what can be computed directly at this node)
  - Upstream Gradient: $\frac{\partial f}{\partial q}$ (already computed in the previous step of the backward pass)
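The backward pass above can be traced in a few lines of plain Python. This is a minimal sketch; the `grad_*` variable names are illustrative choices, not part of the original example.

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                  # q = 3
f = q * z                  # f = -12

# Backward pass, from the output back to the inputs
grad_f = 1.0               # df/df
grad_z = q * grad_f        # df/dz = q
grad_q = z * grad_f        # df/dq = z
grad_x = 1.0 * grad_q      # local dq/dx = 1, times upstream df/dq
grad_y = 1.0 * grad_q      # local dq/dy = 1, times upstream df/dq

print(grad_x, grad_y, grad_z)  # -4.0 -4.0 3.0
```

Each line of the backward pass multiplies one local gradient by one upstream gradient, which is the modular pattern described above.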

Activation Function의 Gradient

Sigmoid: $f'(x)=\big(1-\sigma(x)\big)\sigma(x)$
ReLU: $f'(x)=\begin{cases} 0 & \text{for } x<0 \\ 1 & \text{for } x\geq 0 \end{cases}$
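As a quick sanity check, the sigmoid formula can be compared against a finite-difference estimate. The sketch below uses only the standard library; the helper names (`sigmoid_grad`, `relu_grad`, `numeric_grad`) are hypothetical, and the ReLU value at exactly $x=0$ simply follows the piecewise definition above, which is a convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # (1 - sigma(x)) * sigma(x)
    s = sigmoid(x)
    return (1.0 - s) * s

def relu_grad(x):
    # Piecewise definition from the text; the value at x == 0 is a convention.
    return 0.0 if x < 0 else 1.0

def numeric_grad(f, x, h=1e-6):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.5, 3.0):
    assert abs(sigmoid_grad(x) - numeric_grad(sigmoid, x)) < 1e-6
```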

Gradient Flow

Add gate: Gradient distributor
Mul gate: swap multiplier
Max gate: Gradient router
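- For a nonlinear function the local gradient depends on the input values, so it cannot be fixed in advance and must be recomputed on every pass.
- Requirement for modularization: each node must be differentiable, that is, it must have a local gradient.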
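The three gate behaviors can be written as small backward functions. This is a sketch with hypothetical function names; each takes the upstream gradient and returns the gradients for its two inputs.

```python
def add_backward(upstream):
    # Add gate: both inputs receive the upstream gradient unchanged.
    return upstream, upstream

def mul_backward(x, y, upstream):
    # Mul gate: each input's gradient is the *other* input times upstream.
    return y * upstream, x * upstream

def max_backward(x, y, upstream):
    # Max gate: the gradient is routed to the larger input; the other gets 0.
    # Ties go to x here, which is an arbitrary choice.
    return (upstream, 0.0) if x >= y else (0.0, upstream)

print(add_backward(2.0))             # (2.0, 2.0)
print(mul_backward(3.0, -4.0, 2.0))  # (-8.0, 6.0)
print(max_backward(1.0, 5.0, 2.0))   # (0.0, 2.0)
```

Note that the mul and max gates need the forward inputs, while the add gate does not.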

Gradient code

s0 = w0 * x0
s1 = w1 * x1
s2 = s0 + s1
s3 = s2 + w2
L = sigmoid(s3)

To start the downstream computation, the gradient of the final output with respect to itself is set to 1.

grad_L = 1.0 # upstream
grad_s3 = grad_L * (1-L) * L
grad_w2 = grad_s3 # add gate
grad_s2 = grad_s3 # add gate
...
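One way the full forward and backward pass could look as a runnable function is sketched below. The continuation past `grad_s2` is an assumption that follows the add-gate and mul-gate rules; the function name `forward_backward` and the input values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_backward(w0, x0, w1, x1, w2):
    # Forward pass
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2 + w2
    L = sigmoid(s3)

    # Backward pass, in reverse order of the forward pass
    grad_L = 1.0
    grad_s3 = grad_L * (1 - L) * L   # sigmoid gate
    grad_w2 = grad_s3                # add gate
    grad_s2 = grad_s3                # add gate
    grad_s0 = grad_s2                # add gate
    grad_s1 = grad_s2                # add gate
    grad_w0 = grad_s0 * x0           # mul gate
    grad_x0 = grad_s0 * w0           # mul gate
    grad_w1 = grad_s1 * x1           # mul gate
    grad_x1 = grad_s1 * w1           # mul gate
    grads = {"w0": grad_w0, "x0": grad_x0,
             "w1": grad_w1, "x1": grad_x1, "w2": grad_w2}
    return L, grads

# Example inputs (arbitrary illustrative values)
L, grads = forward_backward(w0=2.0, x0=-1.0, w1=-3.0, x1=-2.0, w2=-3.0)
print(L, grads)
```

With these inputs `s3` equals 1, so `L` is `sigmoid(1.0)` and every gradient is a multiple of `(1 - L) * L`.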

Backward code

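import torch

class Multiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)  # keep the inputs for the backward pass
        return x * y

    @staticmethod
    def backward(ctx, grad_z):
        # grad_z: upstream gradient dL/dz
        x, y = ctx.saved_tensors
        grad_x = y * grad_z  # dz/dx * dL/dz
        grad_y = x * grad_z  # dz/dy * dL/dz
        return grad_x, grad_y  # downstream gradients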

Vector Backpropagation

Vector Derivatives

Regular derivative
- $x\in\mathbb{R}, y\in\mathbb{R}$
- How much $y$ changes as $x$ changes: $\frac{\partial y}{\partial x}\in\mathbb{R}$
Gradient (Derivative)
- $x\in\mathbb{R}^N, y\in\mathbb{R}$
- How much $y$ changes as $x$ changes: $\frac{\partial y}{\partial x}\in\mathbb{R}^N$
Jacobian (Derivative)
- $x\in\mathbb{R}^N, y\in\mathbb{R}^M$
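- How much each component of $y$ changes as each component of $x$ changes: $\frac{\partial y}{\partial x}\in\mathbb{R}^{N\times M}$ in the layout used in these notes, which is the transpose of the conventional $M\times N$ Jacobian (one row per output, one column per input) $J=\begin{bmatrix} \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_M}{\partial x_1} & \cdots & \frac{\partial F_M}{\partial x_N} \end{bmatrix}$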
- Example: $F(x,y)=\begin{bmatrix} x^2y \\ 5x+\sin(y) \end{bmatrix}$
  - The input and the output are both 2-dimensional, so the Jacobian is $[2\times2]$.
  - $J_F=\begin{bmatrix} 2xy & x^2 \\ 5 & \cos(y) \end{bmatrix}$
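The analytic Jacobian of this example can be checked numerically. The sketch below uses central differences and only the standard library; the function names and the evaluation point $(1, 2)$ are illustrative assumptions.

```python
import math

def F(x, y):
    return [x * x * y, 5 * x + math.sin(y)]

def jacobian_analytic(x, y):
    # Rows index outputs, columns index inputs (conventional layout).
    return [[2 * x * y, x * x],
            [5.0, math.cos(y)]]

def jacobian_numeric(x, y, h=1e-6):
    # Central differences, perturbing one input at a time.
    fx = [(a - b) / (2 * h) for a, b in zip(F(x + h, y), F(x - h, y))]
    fy = [(a - b) / (2 * h) for a, b in zip(F(x, y + h), F(x, y - h))]
    return [[fx[0], fy[0]],
            [fx[1], fy[1]]]

J = jacobian_analytic(1.0, 2.0)
J_num = jacobian_numeric(1.0, 2.0)
for row, row_num in zip(J, J_num):
    for a, b in zip(row, row_num):
        assert abs(a - b) < 1e-6
```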

Backprop with Vectors

$z=f(x,y)$
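- The dimensions of $x, y, z$ are $D_x, D_y, D_z$, and the loss $L$ is a scalar.
  - The loss must be a scalar so that it measures the total loss as a single number.
  - If the loss were $N$-dimensional, picking the optimum of each dimension separately would not give the best model, because the parameters cannot be assumed independent across dimensions.
- Upstream Gradient: $\frac{\partial L}{\partial z}$, $D_z$-dimensional
- Local Gradients: $\frac{\partial z}{\partial x}$ of shape $D_x\times D_z$, and $\frac{\partial z}{\partial y}$ of shape $D_y\times D_z$
- Downstream Gradients: $\frac{\partial L}{\partial x}$, with $[D_x\times D_z]\times[D_z\times1]=D_x$ dimensions, and $\frac{\partial L}{\partial y}$, with $[D_y\times D_z]\times[D_z\times1]=D_y$ dimensions
  - A downstream gradient always has the same shape as the variable it is taken with respect to.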
Example: $f(x)=\max(0,x)$
- Input: $\begin{bmatrix} 1 \\ -2 \\ 3 \\ -1 \end{bmatrix}$ , Output: $\begin{bmatrix} 1 \\ 0 \\ 3 \\ 0 \end{bmatrix}$
- Upstream gradient (given): $\begin{bmatrix} 4 \\ -1 \\ 5 \\ 9 \end{bmatrix}$
- Local gradient: $\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$
- Downstream gradient: $\begin{bmatrix} 4 \\ 0 \\ 5 \\ 0 \end{bmatrix}$
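The same numbers can be reproduced in plain Python, once with the explicit local-gradient matrix and once with a mask that avoids building it. This is a sketch; the variable names are illustrative.

```python
x = [1.0, -2.0, 3.0, -1.0]
upstream = [4.0, -1.0, 5.0, 9.0]
n = len(x)

# Explicit local gradient: diagonal matrix with 1 where the input is positive.
local = [[1.0 if (i == j and x[i] > 0) else 0.0 for j in range(n)]
         for i in range(n)]

# Downstream = local gradient times upstream gradient.
downstream = [sum(local[i][j] * upstream[j] for j in range(n))
              for i in range(n)]

# Equivalent shortcut: pass the upstream value through where x > 0, else 0.
downstream_masked = [g if xi > 0 else 0.0 for xi, g in zip(x, upstream)]

print(downstream)  # [4.0, 0.0, 5.0, 0.0]
```

Because the local gradient is diagonal and mostly zero, the masked version does the same work without storing an $n\times n$ matrix.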

Backpropagation with Matrices

$z=f(x,y)$
- The shapes of $x, y, z$ are $[D_x\times M_x], [D_y\times M_y], [D_z\times M_z]$, and the loss $L$ is a scalar.
- The upstream gradient has the same shape as $z$, and each downstream gradient has the same shape as $x$ or $y$.
- Local gradients
  - $\frac{\partial z}{\partial x}, [(D_x\times M_x)\times(D_z\times M_z)]$
  - $\frac{\partial z}{\partial y}, [(D_y\times M_y)\times(D_z\times M_z)]$
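To make the shapes concrete, the sketch below assumes that $f$ is a matrix multiplication $z=xy$, which the text above does not specify. For that case the downstream gradients can be written as $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial z}y^T$ and $\frac{\partial L}{\partial y}=x^T\frac{\partial L}{\partial z}$, so the large local gradient is never formed. The helper functions and the values are illustrative.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                    # 2 x 3
y = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, -1.0]]                        # 3 x 2
z = matmul(x, y)                         # 2 x 2

grad_z = [[1.0, 2.0],
          [3.0, 4.0]]                    # upstream dL/dz, same shape as z

grad_x = matmul(grad_z, transpose(y))    # 2 x 3, same shape as x
grad_y = matmul(transpose(x), grad_z)    # 3 x 2, same shape as y

# Number of entries the explicit local gradient dz/dx would have:
jacobian_entries = (len(x) * len(x[0])) * (len(z) * len(z[0]))  # 6 * 4
print(grad_x, grad_y, jacobian_entries)
```

Even in this tiny case the explicit $\frac{\partial z}{\partial x}$ has 24 entries, while the implicit form only needs matrices the size of $x$, $y$, and $z$.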