Lecture 12. Backprop & Improving Neural Networks

cryptnomy · November 24, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/zUazLXZZA2U

Outline

  • LogReg with a NN mindset
  • Neural Networks ~ Backpropagation
  • Improving your NNs

Stack some neurons inside a layer.

→ Stack layers on top of each other

→ The more layers we stack, the more parameters we have

→ With more parameters, the NN is able to capture the complexity of our data

(∵ it becomes more flexible).

Ex. In training,

Forward propagate through the network

→ Get the output

→ Compute the cost function which compares this output to the ground truth

→ Backpropagate the error to tell our parameters how they should move in order to detect cats more accurately.
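
As a concrete illustration of that loop, here is a minimal NumPy forward pass and cost computation for a tiny 3-layer network whose layer sizes match the shape annotations in the derivation below. The input size, random toy data, and all variable names are assumptions made for this sketch; the backward pass is sketched after the derivation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Layer sizes chosen to match the shapes in the derivation below:
    # a^{[1]} has 3 units, a^{[2]} has 2 units, a^{[3]} = y_hat is the scalar output.
    rng = np.random.default_rng(0)
    n_x, m = 2, 5                                       # input size and batch size (assumed)
    X = rng.standard_normal((n_x, m))                   # toy inputs
    Y = rng.integers(0, 2, size=(1, m)).astype(float)   # toy binary labels

    W1, b1 = rng.standard_normal((3, n_x)), np.zeros((3, 1))
    W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))
    W3, b3 = rng.standard_normal((1, 2)), np.zeros((1, 1))

    # Forward propagation (these activations are cached for backprop)
    A1 = sigmoid(W1 @ X + b1)                           # (3, m)
    A2 = sigmoid(W2 @ A1 + b2)                          # (2, m)
    A3 = sigmoid(W3 @ A2 + b3)                          # (1, m), i.e. y_hat

    # Cost: average cross-entropy loss over the batch
    J = -np.mean(Y * np.log(A3) + (1 - Y) * np.log(1 - A3))
    print(J)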

2nd part, NNs,

Derive the backpropagation with the chain rule

→ Talk about how to improve our neural networks

(∵ In practice, a neural network doesn’t work just because you designed it; there are a lot of hacks and tricks you need to know in order to make a neural network work).

Backpropagation

Define the cost function:

J(\hat y,y)=\frac{1}{m}\sum_{i=1}^m\mathcal L^{(i)}

with \mathcal L^{(i)}=-[y^{(i)}\log\hat y^{(i)}+(1-y^{(i)})\log(1-\hat y^{(i)})].

Q. Why use a batch instead of a single example?

A. Vectorization. Use a GPU → parallelize the computation.
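
For example, reusing the names from the sketch above (which were assumptions), the layer-1 pre-activation for the whole batch can be computed with one matrix product instead of a Python loop over examples:

    # Loop over examples: m separate matrix-vector products.
    Z1_loop = np.stack([W1 @ X[:, i] + b1.ravel() for i in range(m)], axis=1)

    # Vectorized: a single matrix-matrix product over the batch,
    # which BLAS on a CPU or a GPU can parallelize.
    Z1_vec = W1 @ X + b1

    assert np.allclose(Z1_loop, Z1_vec)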

Update:

w^{[l]}=w^{[l]}-\alpha\frac{\partial J}{\partial w^{[l]}}.

E.g.,

\begin{aligned}
\frac{\partial\mathcal L}{\partial w^{[3]}}
&=-\left[y^{(i)}\frac{\partial}{\partial w^{[3]}}\left(\log\sigma(w^{[3]}a^{[2]}+b^{[3]})\right)+(1-y^{(i)})\frac{\partial}{\partial w^{[3]}}\left(\log(1-\sigma(w^{[3]}a^{[2]}+b^{[3]}))\right)\right]\\
&=-\left[y^{(i)}\frac{1}{a^{[3]}}a^{[3]}(1-a^{[3]})a^{[2]^T}+(1-y^{(i)})\frac{1}{1-a^{[3]}}(-1)a^{[3]}(1-a^{[3]})a^{[2]^T}\right]\\
&\left(\text{Note:}\;\;\frac{\partial}{\partial \underset{\mathclap{\substack{\uparrow\\(1\times2)}}}{w^{[3]}}}\underbrace{\left(w^{[3]}a^{[2]}+b^{[3]}\right)}_{(1\times1)}=\underset{\mathclap{\substack{\uparrow\\(2\times1)^T}}}{a^{[2]^T}}\right)\\
&=-\left[y^{(i)}(1-a^{[3]})a^{[2]^T}-(1-y^{(i)})a^{[3]}a^{[2]^T}\right]\\
&=-\left[y^{(i)}a^{[2]^T}-a^{[3]}a^{[2]^T}\right]\\
&=-\left(y^{(i)}-a^{[3]}\right)\underbrace{a^{[2]^T}}_{\frac{\partial z^{[3]}}{\partial w^{[3]}}}.
\end{aligned}

Hence,

\frac{\partial J}{\partial w^{[3]}}=-\frac{1}{m}\sum_{i=1}^m(y^{(i)}-a^{[3]})a^{[2]^T}.

Likewise,

\begin{aligned}
\underbrace{\frac{\partial \mathcal L}{\partial w^{[2]}}}_{(2,3)}
&=\underbrace{\frac{\partial\mathcal L}{\partial a^{[3]}}\frac{\partial a^{[3]}}{\partial z^{[3]}}}_{\substack{\frac{\partial \mathcal L}{\partial z^{[3]}}=\frac{\partial \mathcal L}{\partial w^{[3]}}/\frac{\partial z^{[3]}}{\partial w^{[3]}}\\=-(y^{(i)}-a^{[3]})}}\overbrace{\frac{\partial z^{[3]}}{\partial a^{[2]}}}^{w^{[3]^T}}\underbrace{\frac{\partial a^{[2]}}{\partial z^{[2]}}}_{a^{[2]}(1-a^{[2]})}\overbrace{\frac{\partial z^{[2]}}{\partial w^{[2]}}}^{a^{[1]^T}}\\
&=\underbrace{-(y^{(i)}-a^{[3]})}_{(1\times1)}\overbrace{w^{[3]^T}}^{(2\times1)}\ast \underbrace{a^{[2]}(1-a^{[2]})}_{(2\times1)}\overbrace{a^{[1]^T}}^{(1\times3)}\\
&=w^{[3]^T}\ast a^{[2]}(1-a^{[2]})(a^{[3]}-y^{(i)})a^{[1]^T}.
\end{aligned}

TA comment: Read the lecture notes for the rigorous version of these derivations.

Q. Why is caching very important?

A. To avoid recomputing. We already have results such as a^{[2]}, a^{[1]^T}, etc. from the forward propagation.
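
Translating the derived expressions into code, here is a sketch of the backward pass for the toy network from the forward-pass sketch above, reusing its cached A1, A2, A3 and the same assumed variable names. Note how every quantity needed here comes straight from the cache built during forward propagation.

    # dL/dz^{[3]} = a^{[3]} - y, so dJ/dW3 = (1/m) * sum_i (a^{[3]} - y^{(i)}) a^{[2]T}
    dZ3 = A3 - Y                                    # (1, m)
    dW3 = (dZ3 @ A2.T) / m                          # (1, 2)
    db3 = dZ3.mean(axis=1, keepdims=True)

    # dL/dz^{[2]} = (W3^T dZ3) * a^{[2]}(1 - a^{[2]}),  dJ/dW2 = (1/m) dZ2 a^{[1]T}
    dZ2 = (W3.T @ dZ3) * A2 * (1 - A2)              # (2, m)
    dW2 = (dZ2 @ A1.T) / m                          # (2, 3)
    db2 = dZ2.mean(axis=1, keepdims=True)

    # Same pattern one layer further back
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)              # (3, m)
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.mean(axis=1, keepdims=True)

    # Gradient descent update: w^{[l]} = w^{[l]} - alpha * dJ/dw^{[l]}
    alpha = 0.1                                     # learning rate (assumed value)
    for W, dW in ((W1, dW1), (W2, dW2), (W3, dW3)):
        W -= alpha * dW                             # in-place update of the parameter array
    for b, db in ((b1, db1), (b2, db2), (b3, db3)):
        b -= alpha * db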

Improving your NNs

  1. Activation functions

    • \sigma(z)=\frac{1}{1+\exp(-z)} (sigmoid). (+): Use this for classification because it returns a probability. (-): For high z or low z, the gradient is very close to 0 → super hard to update the parameters in the network due to vanishing gradients.
    • \text{ReLU}(z)=\max(0,z): no vanishing-gradient problem for high z.
    • \tanh(z)=\frac{\exp(z)-\exp(-z)}{\exp(z)+\exp(-z)}: behaves similarly to sigmoid.

    Q. Why do we need activation functions?

    A. To add nonlinearity. If you don’t use a nonlinear function at each neuron, there is no point in deploying neurons at each layer and stacking layers: a composition of linear functions is itself linear, so the whole network boils down to one single (linear) neuron.

    TA comment:

    There are a lot of experimental results in deep learning, but we don’t fully understand why certain activations work better than others.
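
    A small sketch of these three activations’ derivatives, just to make the vanishing-gradient remark concrete (the derivative formulas are standard results):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1 - s)               # ~0 for large |z| -> vanishing gradient

    def d_relu(z):
        return (z > 0).astype(float)     # stays 1 for any positive z

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2     # saturates for large |z|, like sigmoid

    z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(d_sigmoid(z))                  # tiny at both ends
    print(d_relu(z))                     # 0 for z <= 0, 1 for z > 0
    print(d_tanh(z))                     # tiny at both ends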

  2. Initialization methods

    Normalizing your input:

    x → x − μ and then x → x / σ

    (Figure from the lecture comparing the unnormalized and normalized input distributions. Source: https://youtu.be/zUazLXZZA2U, 52 min. 55 sec.)
    Q. What are the differences between the two distributions?

    A. For the unnormalized case (left), gradient descent follows the locally steepest slope and can zigzag toward the minimum; for the normalized case (right), it may need fewer iterations.

    TA comment: You should use the \mu, \sigma that were computed on the training set.
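
    A minimal sketch of this normalization with toy data (array names and shapes are assumptions); μ and σ are fit on the training set only and then reused:

    import numpy as np

    rng = np.random.default_rng(1)
    X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))   # toy training inputs
    X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 2))     # toy test inputs

    mu = X_train.mean(axis=0)             # computed on the training set only
    sigma = X_train.std(axis=0)

    X_train_norm = (X_train - mu) / sigma
    X_test_norm = (X_test - mu) / sigma   # reuse the training-set mu and sigma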
    Vanishing / exploding gradient

    → One way to mitigate this (not a perfect fix) is to initialize your weights properly (into the right range of values).

    Initialization

    import numpy as np

    # NumPy implementation for a ReLU layer l; n_l and n_prev stand for n^{[l]} and n^{[l-1]}.
    # Why the 2 inside np.sqrt()? -> Better performance, found empirically in practice.
    w_l = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)
    • Xavier initialization (for tanh):
      w^{[l]}\sim\sqrt{\frac{1}{n^{[l-1]}}}
    • He initialization:
      w^{[l]}\sim\sqrt{\frac{2}{n^{[l]}+n^{[l-1]}}}
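
    A sketch comparing these scalings in NumPy (layer sizes are made up for illustration; the labels follow the list above):

    import numpy as np

    rng = np.random.default_rng(0)
    n_prev, n_l = 300, 200                 # n^{[l-1]} and n^{[l]} (illustrative sizes)

    w_relu = rng.standard_normal((n_l, n_prev)) * np.sqrt(2.0 / n_prev)           # scaling used for ReLU above
    w_tanh = rng.standard_normal((n_l, n_prev)) * np.sqrt(1.0 / n_prev)           # Xavier, for tanh
    w_other = rng.standard_normal((n_l, n_prev)) * np.sqrt(2.0 / (n_l + n_prev))  # the last variant listed above

    # The standard deviation of each weight matrix matches its intended scale.
    print(w_relu.std(), w_tanh.std(), w_other.std())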
  3. Optimization

    • Gradient descent
      • Cool ∵ vectorization
    • Stochastic gradient descent
      • Updates are very quick

    → Trade-offs between the two: Stochasticity and vectorization.

    • Mini-batch gradient descent
    For iteration t = 1, 2, ...
    	Select a batch (x^{(t)}, y^{(t)}) of 1000 examples
    	Forward prop:
    		J = \frac{1}{1000}\sum_{i=1}^{1000}\mathcal{L}^{(i)}
    	Backward prop
    	Update w^{[l]}, b^{[l]}
    • Momentum algorithm + GD (gradient descent): look at the past updates to find the right direction to go.
      \begin{aligned}v&=\beta v+(1-\beta)\frac{\partial\mathcal L}{\partial w}\\w&=w-\alpha v\end{aligned}
      One additional variable with a big impact on optimization (see the runnable sketch at the end of this section).

    There are many more optimization algorithms. In CS230, we cover RMSProp and Adam, which are probably the most widely used in deep learning.

    → Why? Adam brings momentum (combined with RMSProp-style adaptive learning rates) into deep learning optimization.
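
    A runnable sketch combining mini-batch selection with the momentum update above, on a toy logistic-regression problem (all data, sizes, and hyperparameter values here are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 10_000, 5
    X = rng.standard_normal((m, n))                                     # toy inputs
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = (X @ true_w + 0.1 * rng.standard_normal(m) > 0).astype(float)   # toy labels

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(n)
    v = np.zeros(n)                                       # momentum "velocity"
    alpha, beta, batch_size = 0.1, 0.9, 1000              # assumed hyperparameters

    for t in range(200):
        idx = rng.choice(m, size=batch_size, replace=False)   # select batch (x^{(t)}, y^{(t)})
        Xb, yb = X[idx], y[idx]
        y_hat = sigmoid(Xb @ w)                               # forward prop
        grad = Xb.T @ (y_hat - yb) / batch_size               # backward prop (cross-entropy gradient)
        v = beta * v + (1 - beta) * grad                      # momentum: blend in past updates
        w = w - alpha * v                                     # update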
