Lecture 12. Backprop & Improving Neural Networks

cryptnomy · November 24, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/zUazLXZZA2U

Outline

  • LogReg with a NN mindset
  • Neural Networks ~ Backpropagation
  • Improving your NNs

Stack some neurons inside a layer.

→ Stack layers on top of each other

→ The more layers we stack, the more parameters we have

→ With more parameters, the NN is able to capture the complexity of our data

(∵ it becomes more flexible).

Ex. In training,

Forward propagate through the network

→ Get the output

→ Compute the cost function which compares this output to the ground truth

→ Backpropagate the error to tell our parameters how they should move in order to detect cats more accurately.
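
As a concrete illustration of that loop, here is a minimal NumPy forward pass and cost computation for a tiny 3-layer network whose layer sizes match the shape annotations in the derivation below. The input size, random toy data, and all variable names are assumptions made for this sketch; the backward pass is sketched after the derivation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Layer sizes chosen to match the shapes in the derivation below:
    # a^{[1]} has 3 units, a^{[2]} has 2 units, a^{[3]} = y_hat is the scalar output.
    rng = np.random.default_rng(0)
    n_x, m = 2, 5                                       # input size and batch size (assumed)
    X = rng.standard_normal((n_x, m))                   # toy inputs
    Y = rng.integers(0, 2, size=(1, m)).astype(float)   # toy binary labels

    W1, b1 = rng.standard_normal((3, n_x)), np.zeros((3, 1))
    W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
    W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

    # Forward propagation (these activations are cached for backprop)
    A1 = sigmoid(W1 @ X + b1)                           # (3, m)
    A2 = sigmoid(W2 @ A1 + b2)                          # (2, m)
    A3 = sigmoid(W3 @ A2 + b3)                          # (1, m), i.e. y_hat

    # Cost: average cross-entropy loss over the batch
    J = -np.mean(Y * np.log(A3) + (1 - Y) * np.log(1 - A3))
    print(J)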

2nd part, NNs,

Derive the backpropagation with the chain rule

→ Talk about how to improve our neural networks

(∵ In practice, a neural network doesn’t work just because you designed it; there are a lot of hacks and tricks you need to know in order to make a neural network work).

Backpropagation

Define the cost function:

J(\hat y,y)=\frac{1}{m}\sum_{i=1}^m\mathcal L^{(i)}

with \mathcal L^{(i)}=-[y^{(i)}\log\hat y^{(i)}+(1-y^{(i)})\log(1-\hat y^{(i)})].

Q. Why use a batch instead of a single example?

A. Vectorization. Use a GPU → parallelize the computation.
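
For example, reusing the names from the sketch above (which were assumptions), the layer-1 pre-activation for the whole batch can be computed with one matrix product instead of a Python loop over examples:

    # Loop over examples: m separate matrix-vector products.
    Z1_loop = np.stack([W1 @ X[:, i] + b1.ravel() for i in range(m)], axis=1)

    # Vectorized: a single matrix-matrix product over the batch,
    # which BLAS on a CPU or a GPU can parallelize.
    Z1_vec = W1 @ X + b1

    assert np.allclose(Z1_loop, Z1_vec)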

Update:

w^{[l]}=w^{[l]}-\alpha\frac{\partial J}{\partial w^{[l]}}.

E.g.,

\begin{aligned}
\frac{\partial\mathcal L}{\partial w^{[3]}}
&=-\left[y^{(i)}\frac{\partial}{\partial w^{[3]}}\left(\log\sigma(w^{[3]}a^{[2]}+b^{[3]})\right)+(1-y^{(i)})\frac{\partial}{\partial w^{[3]}}\left(\log(1-\sigma(w^{[3]}a^{[2]}+b^{[3]}))\right)\right]\\
&=-\left[y^{(i)}\frac{1}{a^{[3]}}a^{[3]}(1-a^{[3]})a^{[2]^T}+(1-y^{(i)})\frac{1}{1-a^{[3]}}(-1)a^{[3]}(1-a^{[3]})a^{[2]^T}\right]\\
&\left(\text{Note:}\;\;\frac{\partial}{\partial \underset{\mathclap{\substack{\uparrow\\(1\times2)}}}{w^{[3]}}}\underbrace{\left(w^{[3]}a^{[2]}+b^{[3]}\right)}_{(1\times1)}=\underset{\mathclap{\substack{\uparrow\\(2\times1)^T}}}{a^{[2]^T}}\right)\\
&=-\left[y^{(i)}(1-a^{[3]})a^{[2]^T}-(1-y^{(i)})a^{[3]}a^{[2]^T}\right]\\
&=-\left[y^{(i)}a^{[2]^T}-a^{[3]}a^{[2]^T}\right]\\
&=-\left(y^{(i)}-a^{[3]}\right)\underbrace{a^{[2]^T}}_{\frac{\partial z^{[3]}}{\partial w^{[3]}}}.
\end{aligned}

Hence,

\frac{\partial J}{\partial w^{[3]}}=-\frac{1}{m}\sum_{i=1}^m(y^{(i)}-a^{[3]})a^{[2]^T}.

Likewise,

\begin{aligned}
\underbrace{\frac{\partial \mathcal L}{\partial w^{[2]}}}_{(2,3)}
&=\underbrace{\frac{\partial\mathcal L}{\partial a^{[3]}}\frac{\partial a^{[3]}}{\partial z^{[3]}}}_{\substack{\frac{\partial \mathcal L}{\partial z^{[3]}}=\frac{\partial \mathcal L}{\partial w^{[3]}}/\frac{\partial z^{[3]}}{\partial w^{[3]}}\\=-(y^{(i)}-a^{[3]})}}\overbrace{\frac{\partial z^{[3]}}{\partial a^{[2]}}}^{w^{[3]^T}}\underbrace{\frac{\partial a^{[2]}}{\partial z^{[2]}}}_{a^{[2]}(1-a^{[2]})}\overbrace{\frac{\partial z^{[2]}}{\partial w^{[2]}}}^{a^{[1]^T}}\\
&=\underbrace{-(y^{(i)}-a^{[3]})}_{(1\times1)}\overbrace{w^{[3]^T}}^{(2\times1)}\ast \underbrace{a^{[2]}(1-a^{[2]})}_{(2\times1)}\overbrace{a^{[1]^T}}^{(1\times3)}\\
&=w^{[3]^T}\ast a^{[2]}(1-a^{[2]})(a^{[3]}-y^{(i)})a^{[1]^T}.
\end{aligned}

TA comment: Read the lecture notes for the rigorous version of these derivations.

Q. Why is caching very important?

A. To avoid recomputing. We already have results such as a^{[2]}, a^{[1]^T}, etc. from the forward propagation.
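
Translating the derived expressions into code, here is a sketch of the backward pass for the toy network from the forward-pass sketch above, reusing its cached A1, A2, A3 and the same assumed variable names. Note how every quantity needed here comes straight from the cache built during forward propagation.

    # dL/dz^{[3]} = a^{[3]} - y, so dJ/dW3 = (1/m) * sum_i (a^{[3]} - y^{(i)}) a^{[2]T}
    dZ3 = A3 - Y                                    # (1, m)
    dW3 = (dZ3 @ A2.T) / m                          # (1, 2)
    db3 = dZ3.mean(axis=1, keepdims=True)

    # dL/dz^{[2]} = (W3^T dZ3) * a^{[2]}(1 - a^{[2]}),  dJ/dW2 = (1/m) dZ2 a^{[1]T}
    dZ2 = (W3.T @ dZ3) * A2 * (1 - A2)              # (2, m)
    dW2 = (dZ2 @ A1.T) / m                          # (2, 3)
    db2 = dZ2.mean(axis=1, keepdims=True)

    # Same pattern one layer further back
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)              # (3, m)
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.mean(axis=1, keepdims=True)

    # Gradient descent update: w^{[l]} = w^{[l]} - alpha * dJ/dw^{[l]}
    alpha = 0.1                                     # learning rate (assumed value)
    for W, dW in ((W1, dW1), (W2, dW2), (W3, dW3)):
        W -= alpha * dW                             # in-place update of the parameter array
    for b, db in ((b1, db1), (b2, db2), (b3, db3)):
        b -= alpha * db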

Improving your NNs

  1. Activation functions

    • \sigma(z)=\frac{1}{1+\exp(-z)} (sigmoid). (+): Use this for classification because it returns a probability. (-): For high z or low z, the gradient is very close to 0 → super hard to update the parameters in the network due to vanishing gradients.
    • \text{ReLU}(z)=\max(0,z): no vanishing-gradient problem for high z.
    • \tanh(z)=\frac{\exp(z)-\exp(-z)}{\exp(z)+\exp(-z)}: behaves similarly to sigmoid.

    Q. Why do we need activation functions?

    A. To add nonlinearity. If you don’t use a nonlinear function at each neuron, there is no point in deploying neurons at each layer and stacking layers: a composition of linear functions is itself linear, so the whole network boils down to one single (linear) neuron.

    TA comment:

    There are a lot of experimental results in deep learning, but we don’t fully understand why certain activations work better than others.
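
    A small sketch of these three activations’ derivatives, just to make the vanishing-gradient remark concrete (the derivative formulas are standard results):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1 - s)               # ~0 for large |z| -> vanishing gradient

    def d_relu(z):
        return (z > 0).astype(float)     # stays 1 for any positive z

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2     # saturates for large |z|, like sigmoid

    z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(d_sigmoid(z))                  # tiny at both ends
    print(d_relu(z))                     # 0 for z <= 0, 1 for z > 0
    print(d_tanh(z))                     # tiny at both ends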

  2. Initialization methods

    Normalizing your input:

    x → x − μ and then x → x / σ

    (Figure from the lecture comparing the unnormalized and normalized input distributions. Source: https://youtu.be/zUazLXZZA2U, 52 min. 55 sec.)
    Q. What are the differences between the two distributions?

    A. For the unnormalized case (left), gradient descent follows the locally steepest slope and can zigzag toward the minimum; for the normalized case (right), it may need fewer iterations.

    TA comment: You should use the \mu, \sigma that were computed on the training set.
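
    A minimal sketch of this normalization with toy data (array names and shapes are assumptions); μ and σ are fit on the training set only and then reused:

    import numpy as np

    rng = np.random.default_rng(1)
    X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))   # toy training inputs
    X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 2))     # toy test inputs

    mu = X_train.mean(axis=0)             # computed on the training set only
    sigma = X_train.std(axis=0)

    X_train_norm = (X_train - mu) / sigma
    X_test_norm = (X_test - mu) / sigma   # reuse the training-set mu and sigma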
    Vanishing / exploding gradient

    → One way to mitigate this (not a perfect fix) is to initialize your weights properly (into the right range of values).

    Initialization

    import numpy as np

    # NumPy implementation for a ReLU layer l; n_l and n_prev stand for n^{[l]} and n^{[l-1]}.
    # Why the 2 inside np.sqrt()? -> Better performance, found empirically in practice.
    w_l = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)
    • Xavier initialization (for tanh):
      w^{[l]}\sim\sqrt{\frac{1}{n^{[l-1]}}}
    • He initialization:
      w^{[l]}\sim\sqrt{\frac{2}{n^{[l]}+n^{[l-1]}}}
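
    A sketch comparing these scalings in NumPy (layer sizes are made up for illustration; the labels follow the list above):

    import numpy as np

    rng = np.random.default_rng(0)
    n_prev, n_l = 300, 200                 # n^{[l-1]} and n^{[l]} (illustrative sizes)

    w_relu = rng.standard_normal((n_l, n_prev)) * np.sqrt(2.0 / n_prev)           # scaling used for ReLU above
    w_tanh = rng.standard_normal((n_l, n_prev)) * np.sqrt(1.0 / n_prev)           # Xavier, for tanh
    w_other = rng.standard_normal((n_l, n_prev)) * np.sqrt(2.0 / (n_l + n_prev))  # the last variant listed above

    # The standard deviation of each weight matrix matches its intended scale.
    print(w_relu.std(), w_tanh.std(), w_other.std())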
  3. Optimization

    • Gradient descent
      • Cool ∵ vectorization
    • Stochastic gradient descent
      • Updates are very quick

    → Trade-offs between the two: Stochasticity and vectorization.

    • Mini-batch gradient descent
    For iteration t = 1, 2, ...
    	Select a batch (x^{(t)}, y^{(t)}) of 1000 examples
    	Forward prop:
    		J = \frac{1}{1000}\sum_{i=1}^{1000}\mathcal{L}^{(i)}
    	Backward prop
    	Update w^{[l]}, b^{[l]}
    • Momentum algorithm + GD (gradient descent): look at the past updates to find the right direction to go.
      \begin{aligned}v&=\beta v+(1-\beta)\frac{\partial\mathcal L}{\partial w}\\w&=w-\alpha v\end{aligned}
      One additional variable with a big impact on optimization (see the runnable sketch at the end of this section).

    There are many more optimization algorithms. In CS230, we cover RMSProp and Adam, which are probably the most widely used in deep learning.

    → Why? Adam brings momentum (combined with RMSProp-style adaptive learning rates) into deep learning optimization.
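
    A runnable sketch combining mini-batch selection with the momentum update above, on a toy logistic-regression problem (all data, sizes, and hyperparameter values here are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 10_000, 5
    X = rng.standard_normal((m, n))                                     # toy inputs
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = (X @ true_w + 0.1 * rng.standard_normal(m) > 0).astype(float)   # toy labels

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(n)
    v = np.zeros(n)                                       # momentum "velocity"
    alpha, beta, batch_size = 0.1, 0.9, 1000              # assumed hyperparameters

    for t in range(200):
        idx = rng.choice(m, size=batch_size, replace=False)   # select batch (x^{(t)}, y^{(t)})
        Xb, yb = X[idx], y[idx]
        y_hat = sigmoid(Xb @ w)                               # forward prop
        grad = Xb.T @ (y_hat - yb) / batch_size               # backward prop (cross-entropy gradient)
        v = beta * v + (1 - beta) * grad                      # momentum: blend in past updates
        w = w - alpha * v                                     # update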
