→ The more layers we stack, the more parameters we have
→ The more parameters, the better the NN can capture the complexity of our data
(∵ it becomes more flexible).
Ex. In training,
Forward propagate through the network
→ Get the output
→ Compute the cost function, which compares this output to the ground truth
→ Backpropagate the error to tell our parameters how they should move in order to detect cats more accurately.
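The steps above can be sketched for the simplest possible "cat detector" (a single sigmoid neuron, i.e. logistic regression). This is a minimal sketch with made-up toy data and illustrative variable names, not the lecture's exact network:

```python
import numpy as np

# Toy data (illustrative, not from lecture): 3 features, 10 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))
y = (X[0:1, :] > 0).astype(float)      # hypothetical ground-truth labels
w = np.zeros((3, 1)); b = 0.0
alpha = 0.1                            # learning rate

for _ in range(100):
    z = w.T @ X + b                    # forward propagate
    y_hat = 1 / (1 + np.exp(-z))       # get the output
    # cost: compare output to ground truth (cross-entropy)
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dz = y_hat - y                     # backpropagate the error
    dw = X @ dz.T / X.shape[1]
    db = dz.mean()
    w -= alpha * dw; b -= alpha * db   # tell parameters how to move
```

The cost starts at log 2 ≈ 0.693 (random guessing) and decreases as the parameters move against the gradient.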
2nd part, NNs,
Derive backpropagation with the chain rule
→ Talk about how to improve our neural networks
(∵ In practice, a neural network doesn't work just because you designed it; there are a lot of hacks and tricks you need to know to make a neural network work).
Backpropagation
Define the cost function:
J(\hat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}^{(i)}
with \mathcal{L}^{(i)} = -[y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})].
Q. Why use a batch instead of a single example?
A. Vectorization. Use GPU → Parallelize the computation.
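An illustrative timing comparison (my example, on CPU rather than GPU, but the same principle): computing a dot product element by element in a Python loop versus in one vectorized call.

```python
import time

import numpy as np

x = np.random.rand(1_000_000)
w = np.random.rand(1_000_000)

# Element-by-element loop
t0 = time.perf_counter()
s_loop = 0.0
for i in range(len(x)):
    s_loop += w[i] * x[i]
loop_time = time.perf_counter() - t0

# Vectorized: one call, computed in parallel under the hood
t0 = time.perf_counter()
s_vec = float(w @ x)
vec_time = time.perf_counter() - t0
```

The vectorized version is typically orders of magnitude faster; on a GPU the gap is even larger.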
TA comment: Read the lecture note with the rigorous parts.
Q. Why is the cache very important?
A. To avoid recomputing. We already have intermediate results such as a^{[2]}, a^{[1]T}, etc. from the forward propagation.
Improving your NNs
Activation functions
σ(z) = 1 / (1 + exp(−z)) (sigmoid)
(+): Use this for classification because it returns a probability.
(−): For high or low z, your gradient is very close to 0 → super hard to update the parameters in the network due to vanishing gradients.
ReLU(z) = max(0, z): no vanishing-gradient problem for high z.
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)): similar shape to sigmoid, but outputs in (−1, 1).
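The three activations written out in NumPy (straightforward transcriptions of the formulas above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def tanh(z):  # equivalent to np.tanh
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# sigmoid saturates at the extremes (gradient ~ 0 there);
# ReLU keeps a constant slope of 1 for all z > 0.
z = np.array([-10.0, 0.0, 10.0])
```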
Q. Why do we need activation functions?
A. To add nonlinearity. If you don't use a nonlinear function at each neuron, there is no point in deploying neurons at each layer and stacking layers, since the whole network collapses to a single linear neuron.
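A quick numerical check of that collapse (my example): two stacked linear layers with no activation are exactly one linear map, because W₂(W₁x) = (W₂W₁)x.

```python
import numpy as np

# Random weights for two linear layers (sizes are made up).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1: 3 -> 4
W2 = rng.standard_normal((2, 4))   # layer 2: 4 -> 2
x = rng.standard_normal((3, 1))

two_layers = W2 @ (W1 @ x)         # "deep" network without activations
one_layer = (W2 @ W1) @ x          # single equivalent linear layer
# two_layers == one_layer, so the extra layer added nothing.
```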
TA comment:
There’re a lot of experimental results in deep learning but we don’t fully understand why certain activations work better than others.
Normalizing your inputs
Q. Why normalize the input data?
A. In the unnormalized case (left contour plot), your gradient descent algorithm moves toward approximately the steepest slope and zig-zags; in the normalized case (right), it may need fewer iterations.
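A standardization sketch (illustrative data and names): compute μ and σ, then shift and scale every feature to mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical raw data: 100 train / 20 test examples, 2 features.
rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 2))

# mu, sigma come from the TRAINING set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # reuse the same mu, sigma on test data
```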
TA comment: You should use the μ, σ that were computed on the training set.
Vanishing / exploding gradients
→ One way (not perfect) to avoid this is to initialize your weights properly (into the right range of values).
Initialization
# NumPy implementation
# Why use 2 in np.sqrt()? -> Better performance, found empirically.
w_l = np.random.randn(shape) * np.sqrt(2 / n_prev)  # n_prev = n^[l-1]
Xavier Initialization
w^{[l]} \sim \sqrt{\frac{1}{n^{[l-1]}}} for tanh.
He Initialization
w^{[l]} \sim \sqrt{\frac{2}{n^{[l]} + n^{[l-1]}}}.
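Both scalings above can be sketched in NumPy; the layer sizes here are made up, and the variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 500, 300   # n^[l-1] and n^[l] (hypothetical layer widths)

# Xavier-style scaling (for tanh): variance 1 / n^[l-1]
w_xavier = rng.standard_normal((n_curr, n_prev)) * np.sqrt(1 / n_prev)

# Scaling with variance 2 / (n^[l] + n^[l-1]), as in the note above
w_scaled = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2 / (n_curr + n_prev))
```

Each keeps the variance of activations in a sensible range as signals pass through many layers, which is what counters vanishing/exploding gradients.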
Optimization
Gradient descent
Cool ∵ vectorization
Stochastic gradient descent
Updates are very quick
→ Trade-offs between the two: Stochasticity and vectorization.
Mini-batch gradient descent
For iteration t = 1, ...
    Select batch (x^{(t)}, y^{(t)})
    Forward prop: J = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}^{(i)}
    Backward prop
    Update w^{[l]}, b^{[l]}
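The loop above, sketched for the single-neuron case with batch size 1000 (toy data and illustrative names; a real run would also shuffle and repeat for multiple epochs):

```python
import numpy as np

# Hypothetical dataset: 2 features, 4000 examples -> 4 mini-batches of 1000.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 4000))
y = (X[0:1] + X[1:2] > 0).astype(float)
w = np.zeros((2, 1)); b = 0.0
alpha, batch_size = 0.5, 1000

for t in range(0, X.shape[1], batch_size):
    xb = X[:, t:t + batch_size]                    # select batch x^(t)
    yb = y[:, t:t + batch_size]                    # select batch y^(t)
    y_hat = 1 / (1 + np.exp(-(w.T @ xb + b)))      # forward prop
    J = -np.mean(yb * np.log(y_hat) + (1 - yb) * np.log(1 - y_hat))
    dz = y_hat - yb                                # backward prop
    w -= alpha * (xb @ dz.T) / batch_size          # update w
    b -= alpha * dz.mean()                         # update b
```

Each update uses only 1000 examples, so it is both reasonably fast (stochasticity) and still large enough to vectorize well, which is exactly the trade-off noted above.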
Momentum algorithm + GD (gradient descent): look at the past updates to find the right direction to go.
v = \beta v + (1 - \beta) \frac{\partial \mathcal{L}}{\partial w}
w = w - \alpha v
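The two update rules in code, on a toy 1-D loss L(w) = w² (my example, chosen so the gradient is simply 2w):

```python
w, v = 5.0, 0.0        # start far from the minimum at w = 0
alpha, beta = 0.1, 0.9  # learning rate and momentum coefficient

for _ in range(100):
    grad = 2 * w                       # dL/dw for L(w) = w^2
    v = beta * v + (1 - beta) * grad   # moving average of past gradients
    w = w - alpha * v                  # step along the averaged direction
# w has converged close to the minimum at 0.
```

The running average v smooths out oscillations in the gradient, which is why this one extra variable has such a big impact.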
One additional variable with a big impact on optimization.
There are many more optimization algorithms. In CS230, we cover RMSProp and Adam, which are most likely the ones that are used most in deep learning.
→ Why? Adam brings momentum-style updates into deep learning optimization.