Forward Propagation

Consider a multilayer perceptron (MLP) for binary classification.
Its forward propagation can be enumerated as follows.

  • Forward propagation path
  1. input $\mathbf{x}$ // size (2,1)
  2. linear combination in hidden layer $\mathbf{z}^{[2]}$ // size (2,1)

    $\mathbf{z}^{[2]} = \mathbf{W}^{[2]} \mathbf{x} + \mathbf{b}^{[2]}$, ($\mathbf{W}^{[2]}$ is a $2 \times 2$ matrix, $\mathbf{b}^{[2]}$ has size (2,1))
  3. non-linearity function $\mathbf{a}^{[2]}$ // size (2,1)

    $\mathbf{a}^{[2]} = \tanh(\mathbf{z}^{[2]})$
  4. linear combination in output layer $\mathbf{z}^{[3]}$ // size (1,1)

    $\mathbf{z}^{[3]} = \mathbf{W}^{[3]} \mathbf{a}^{[2]} + \mathbf{b}^{[3]}$, ($\mathbf{W}^{[3]}$ is a $1 \times 2$ matrix, $\mathbf{b}^{[3]}$ has size (1,1))
  5. sigmoid function for binary classification $\mathbf{a}^{[3]}$ // size (1,1)

    $\hat{y} = \mathbf{a}^{[3]} = \sigma(\mathbf{z}^{[3]})$
  6. cross-entropy loss $l$

    $l(\mathbf{x}_i, y_i; \mathbf{W}^{[2]}, \mathbf{b}^{[2]}, \mathbf{W}^{[3]}, \mathbf{b}^{[3]}) = -\left(y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right)$
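The forward path can be sketched in NumPy. This is only a minimal sketch under stated assumptions, not the author's code: it assumes the output layer takes $\mathbf{a}^{[2]}$ as its input, that the biases listed among the loss parameters are added in each linear step, and that the weights are random. The names `sigmoid`, `forward`, and `cross_entropy` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W2, b2, W3, b3):
    """Forward pass of the 2-2-1 MLP; returns every intermediate value."""
    z2 = W2 @ x + b2        # hidden linear combination, size (2, 1)
    a2 = np.tanh(z2)        # hidden non-linearity, size (2, 1)
    z3 = W3 @ a2 + b3       # output linear combination, size (1, 1)
    y_hat = sigmoid(z3)     # predicted probability, size (1, 1)
    return z2, a2, z3, y_hat

def cross_entropy(y, y_hat):
    """Binary cross-entropy for one example, returned as a Python float."""
    return (-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))).item()

# Hypothetical input and randomly initialised parameters.
rng = np.random.default_rng(0)
x = np.array([[0.5], [-1.0]])
W2, b2 = rng.normal(size=(2, 2)), np.zeros((2, 1))
W3, b3 = rng.normal(size=(1, 2)), np.zeros((1, 1))

z2, a2, z3, y_hat = forward(x, W2, b2, W3, b3)
loss = cross_entropy(1.0, y_hat)
```

Returning the intermediate values matters because the backward pass reuses them.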

Backward propagation rule

Single node

  • node: the place where a linear combination or a non-linearity is applied

Downstream gradient = Local gradient * Upstream gradient
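As a small illustration of this rule, consider a single tanh node. The sketch below assumes a hypothetical upstream gradient of 2.0 (as if the loss were $l = 2a$) and compares the result with a finite-difference estimate; the function name `tanh_node_backward` is made up for this example.

```python
import math

def tanh_node_backward(z, upstream):
    """Backward pass of the node a = tanh(z)."""
    local = 1.0 - math.tanh(z) ** 2   # local gradient da/dz
    return local * upstream           # downstream gradient dl/dz

z = 0.3
upstream = 2.0                        # hypothetical dl/da
downstream = tanh_node_backward(z, upstream)

# Finite-difference check with l(a) = 2 * a, so that dl/da = 2.
eps = 1e-6
numeric = (2.0 * math.tanh(z + eps) - 2.0 * math.tanh(z - eps)) / (2 * eps)
```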

Multiple nodes

  • a node with multiple inputs has one local gradient per input, so it sends one downstream gradient to each input
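For instance, a node that multiplies two inputs has two local gradients. The node $z = u \cdot v$ and the numbers below are chosen only for illustration.

```python
def multiply_node_backward(u, v, upstream):
    """Backward pass of the node z = u * v.

    Local gradients: dz/du = v and dz/dv = u, one per input.
    """
    return v * upstream, u * upstream

# Hypothetical inputs and upstream gradient dl/dz = 0.5.
du, dv = multiply_node_backward(3.0, -2.0, upstream=0.5)
```

Each input receives its own downstream gradient, computed from the same upstream gradient.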

computing - output layer

  • In the output layer we want the derivatives below

    $\frac{\partial l}{\partial \mathbf{a}^{[3]}}$

    $\frac{\partial l}{\partial \mathbf{z}^{[3]}} = \frac{\partial l}{\partial \mathbf{a}^{[3]}} \frac{\partial \mathbf{a}^{[3]}}{\partial \mathbf{z}^{[3]}}$
  • If there are $n_L$ (output) units in layer $L$, then $\frac{\partial l}{\partial \mathbf{a}^L}$ and $\frac{\partial l}{\partial \mathbf{z}^L}$ are vectors with $n_L$ elements, and $\frac{\partial \mathbf{a}^L}{\partial \mathbf{z}^L}$ is an $n_L \times n_L$ Jacobian matrix whose $(i, j)$ entry is $\frac{\partial a_i}{\partial z_j}$.

If the non-linearity $f_L$ is applied element-wise (e.g., sigmoid), then this matrix is diagonal.

  • This is because $a_i$ depends only on $z_i$, so every off-diagonal derivative is zero and only the matching diagonal elements remain (e.g., $a_1 = \sigma(z_1)$, $a_2 = \sigma(z_2)$, ..., $a_n = \sigma(z_n)$)
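The diagonal structure can be checked numerically. The sketch below uses made-up values for a layer with $n_L = 3$ sigmoid units and shows that multiplying by the diagonal Jacobian equals an element-wise product. For the single-output network with cross-entropy loss, the same product simplifies to $\hat{y} - y$, which the last lines verify for one hypothetical pair $(y, \hat{y})$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical output layer with n_L = 3 units.
z = np.array([0.2, -1.0, 3.0])
a = sigmoid(z)
upstream = np.array([0.5, -2.0, 1.0])     # hypothetical dl/da, as a row vector

jacobian = np.diag(a * (1.0 - a))         # da/dz: diagonal because sigmoid is element-wise
dl_dz_matrix = upstream @ jacobian        # full matrix product
dl_dz_elementwise = upstream * a * (1.0 - a)

# Binary case: sigmoid output with cross-entropy loss.
y, y_hat = 1.0, 0.7
dl_da = -(y / y_hat - (1 - y) / (1 - y_hat))
dl_dz3 = dl_da * y_hat * (1 - y_hat)      # equals y_hat - y
```

In practice the element-wise form is preferred, since it avoids building the $n_L \times n_L$ matrix.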

computing - hidden layer

  • the inputs into layer $l+1$ (the superscript $l$ is the layer index; the $l$ in the numerator of each derivative is the loss)

    $\mathbf{z}^{l+1} = \mathbf{W}^{l+1}\mathbf{a}^{l} + \mathbf{b}^{l+1}$
  • for this linear combination,
  • upstream derivatives: $\frac{\partial l}{\partial \mathbf{z}^{l+1}}$
  • local derivatives: $\frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^{l}}$
  • downstream derivatives: $\frac{\partial l}{\partial \mathbf{z}^{l}}$

    $\frac{\partial l}{\partial \mathbf{z}^{l}} = \frac{\partial l}{\partial \mathbf{z}^{l+1}} \frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{z}^{l}}$

    $= \frac{\partial l}{\partial \mathbf{z}^{l+1}} \frac{\partial \mathbf{z}^{l+1}}{\partial \mathbf{a}^{l}} \frac{\partial \mathbf{a}^{l}}{\partial \mathbf{z}^{l}}$

    $= \frac{\partial l}{\partial \mathbf{z}^{l+1}} \cdot \mathbf{W}^{l+1} \frac{\partial \mathbf{a}^{l}}{\partial \mathbf{z}^{l}}$
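The last line of this chain can be verified with a finite-difference check. The sketch below is an illustration under assumptions that are not in the original: layer sizes 3 and 2, a tanh activation, and a toy loss $l = \mathbf{c}^T \mathbf{z}^{l+1}$ so that the upstream gradient is simply $\mathbf{c}^T$. Gradients with respect to $\mathbf{z}$ are treated as row vectors, which is what makes the product order above work.

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_next = 3, 2                          # hypothetical layer sizes
W_next = rng.normal(size=(n_next, n_l))     # W^{l+1}
b_next = rng.normal(size=(n_next, 1))       # b^{l+1}
z_l = rng.normal(size=(n_l, 1))             # z^l
c = rng.normal(size=(n_next, 1))            # toy loss l = c^T z^{l+1}

def loss_from_z(z):
    a = np.tanh(z)                          # a^l = f(z^l)
    z_next = W_next @ a + b_next            # z^{l+1}
    return (c.T @ z_next).item()

upstream = c.T                                        # dl/dz^{l+1}, size (1, n_next)
da_dz = np.diag((1.0 - np.tanh(z_l) ** 2).ravel())    # diagonal Jacobian, (n_l, n_l)
downstream = upstream @ W_next @ da_dz                # dl/dz^l, size (1, n_l)

# Equivalent column-vector form commonly used in code.
downstream_col = (W_next.T @ upstream.T) * (1.0 - np.tanh(z_l) ** 2)

# Central finite differences, one coordinate of z^l at a time.
eps = 1e-6
numeric = np.zeros((1, n_l))
for j in range(n_l):
    step = np.zeros((n_l, 1))
    step[j, 0] = eps
    numeric[0, j] = (loss_from_z(z_l + step) - loss_from_z(z_l - step)) / (2 * eps)
```

The column-vector form replaces the diagonal Jacobian with an element-wise product, so it needs a transposed weight matrix instead.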

computing - each parameter

  • $\mathbf{W}^l$

  • upstream derivatives: $\frac{\partial l}{\partial \mathbf{z}^{l}}$

  • local derivatives: $\frac{\partial \mathbf{z}^{l}}{\partial \mathbf{W}^{l}}$

  • downstream derivatives:

    $\frac{\partial l}{\partial \mathbf{W}^{l}} = \frac{\partial l}{\partial \mathbf{z}^{l}} \frac{\partial \mathbf{z}^{l}}{\partial \mathbf{W}^{l}}$

    $\frac{\partial l}{\partial \mathbf{W}^{l}} = \left(\mathbf{a}^{l-1} \cdot \frac{\partial l}{\partial \mathbf{z}^{l}}\right)^T$ (with $\frac{\partial l}{\partial \mathbf{z}^{l}}$ as a row vector, the product is $n_{l-1} \times n_l$; the transpose gives it the shape of $\mathbf{W}^l$)
  • $\mathbf{b}^l$

  • upstream derivatives: $\frac{\partial l}{\partial \mathbf{z}^{l}}$

  • local derivatives: $\frac{\partial \mathbf{z}^{l}}{\partial \mathbf{b}^{l}} = I$ (the identity matrix)

  • downstream derivatives:

    $\frac{\partial l}{\partial \mathbf{b}^{l}} = \frac{\partial l}{\partial \mathbf{z}^{l}}$
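Both parameter gradients can be checked against finite differences. The sketch below assumes hypothetical sizes ($n_{l-1} = 3$, $n_l = 2$) and a toy loss $l = \mathbf{c}^T \mathbf{z}^l$, so that $\frac{\partial l}{\partial \mathbf{z}^l} = \mathbf{c}^T$ is known exactly. The weight gradient is written with a transpose so that it has the same shape as $\mathbf{W}^l$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_prev, n_l = 3, 2                       # hypothetical layer sizes
W = rng.normal(size=(n_l, n_prev))       # W^l
b = rng.normal(size=(n_l, 1))            # b^l
a_prev = rng.normal(size=(n_prev, 1))    # a^{l-1}
c = rng.normal(size=(n_l, 1))            # toy loss l = c^T z^l

def loss(W_, b_):
    return (c.T @ (W_ @ a_prev + b_)).item()

dl_dz = c.T                              # upstream gradient, row vector (1, n_l)
dl_dW = (a_prev @ dl_dz).T               # (n_l, n_prev), same shape as W
dl_db = dl_dz.T                          # (n_l, 1), same shape as b

# Central finite differences for every entry of W and b.
eps = 1e-6
num_dW = np.zeros_like(W)
for i in range(n_l):
    for j in range(n_prev):
        step = np.zeros_like(W)
        step[i, j] = eps
        num_dW[i, j] = (loss(W + step, b) - loss(W - step, b)) / (2 * eps)

num_db = np.zeros_like(b)
for i in range(n_l):
    step = np.zeros_like(b)
    step[i, 0] = eps
    num_db[i, 0] = (loss(W, b + step) - loss(W, b - step)) / (2 * eps)
```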

Backpropagation Fast calculation Tip

  • Assume a fully connected (FC) network that looks like this

    [Figure: drawing of the network]

  • Each layer consists of $\mathbf{z}$ and $\mathbf{a}$, linked by the activation function $f$

    $\mathbf{a}_l = f(\mathbf{z}_l)$
  • Between layer $l-1$ and layer $l$ there are a weight $\mathbf{W}_l$ and a bias $\mathbf{b}_l$. From the output of layer $l-1$, the equation is as follows.

    $\mathbf{z}_l = \mathbf{W}_l\mathbf{a}_{l-1} + \mathbf{b}_l$
  • Assume that all the gradients of the loss $l$ with respect to $\mathbf{z}$ have already been calculated, such as

    $\frac{\partial l}{\partial \mathbf{z}_2}, \frac{\partial l}{\partial \mathbf{z}_3}, \dotsc, \frac{\partial l}{\partial \mathbf{z}_l}, \dotsc, \frac{\partial l}{\partial \mathbf{z}_L}$
  • In this situation, to get the gradient of the weight in a specific layer $l$, just multiply the layer's input $\mathbf{a}_{l-1}$ by the gradient at its output $\mathbf{z}_l$

    $\frac{\partial l}{\partial \mathbf{W}_l} = \left(\mathbf{a}_{l-1} \cdot \frac{\partial l}{\partial \mathbf{z}_l}\right)^T$
  • The bias is even simpler: its gradient is just the gradient with respect to $\mathbf{z}_l$.

    $\frac{\partial l}{\partial \mathbf{b}_l} = \frac{\partial l}{\partial \mathbf{z}_l}$
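Putting the tip to work on the 2-2-1 network from the forward propagation section gives a complete backward pass in a few lines. The sketch below is one possible arrangement under stated assumptions, not the author's implementation: it assumes tanh in the hidden layer, a sigmoid output with cross-entropy loss (so the output gradient is $\hat{y} - y$), random parameters, and column-vector gradients, in which case $(\mathbf{a}_{l-1} \cdot \frac{\partial l}{\partial \mathbf{z}_l})^T$ becomes `dz @ a_prev.T`. The function names and the parameter dictionary `p` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, p):
    a2 = np.tanh(p["W2"] @ x + p["b2"])       # hidden activation, (2, 1)
    a3 = sigmoid(p["W3"] @ a2 + p["b3"])      # predicted probability, (1, 1)
    return a2, a3

def loss(x, y, p):
    _, a3 = forward(x, p)
    return (-(y * np.log(a3) + (1 - y) * np.log(1 - a3))).item()

def backward(x, y, p):
    a2, a3 = forward(x, p)
    dz3 = a3 - y                                # dl/dz_3 for sigmoid + cross-entropy
    dz2 = (p["W3"].T @ dz3) * (1.0 - a2 ** 2)   # dl/dz_2 through tanh
    # Tip: weight gradient = dz times the layer input; bias gradient = dz.
    return {"W3": dz3 @ a2.T, "b3": dz3, "W2": dz2 @ x.T, "b2": dz2}

# Hypothetical parameters and a single training example.
rng = np.random.default_rng(3)
p = {"W2": rng.normal(size=(2, 2)), "b2": rng.normal(size=(2, 1)),
     "W3": rng.normal(size=(1, 2)), "b3": rng.normal(size=(1, 1))}
x, y = np.array([[0.5], [-1.0]]), 1.0
grads = backward(x, y, p)

# Finite-difference check on one weight, W2[0, 1].
eps = 1e-6
p_plus = {k: v.copy() for k, v in p.items()}
p_minus = {k: v.copy() for k, v in p.items()}
p_plus["W2"][0, 1] += eps
p_minus["W2"][0, 1] -= eps
numeric = (loss(x, y, p_plus) - loss(x, y, p_minus)) / (2 * eps)
```

Once every `dz` is available, each parameter gradient costs a single outer product, which is the point of the tip.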
