Squashes numbers into the range [0, 1]: very positive inputs give outputs near 1, very negative inputs give outputs near 0.
Historically popular since they can be interpreted as a saturating "firing rate" of a neuron.
Saturated neurons can "kill off" gradients
In the backpropagation process, we recursively use the chain rule to compute the gradient with respect to every variable in the computational graph.
So at each node, remember we have an upstream gradient coming backwards, which gets multiplied by the node's local gradient.
However, when using sigmoid as the activation function / non-linearity, the flat parts of the sigmoid (where x is very negative or very positive) give a local gradient that is basically zero.
This kills the gradient flow: a near-zero gradient gets passed downstream.
You will not have much signal coming back in the backpropagation.
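A minimal numpy sketch (my own, not from the lecture) of this saturation effect: the local gradient sigma(x) * (1 - sigma(x)) collapses to roughly zero for large |x|, so whatever upstream gradient arrives gets multiplied by almost nothing.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # [~0.000045, 0.119, 0.5, 0.881, ~0.99995]
print(sigmoid_grad(x))  # [~0.000045, 0.105, 0.25, 0.105, ~0.000045]
# At |x| = 10 the local gradient is ~4.5e-5, so any upstream gradient
# gets multiplied by almost zero -- the gradient flow is "killed".
```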
Sigmoid outputs are not zero-centered
This is a problem because if the inputs to a neuron are always positive, the gradients on its weights are either all positive or all negative (their sign is set entirely by the upstream gradient), which leads to inefficient zig-zagging gradient updates.
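A small sketch (not from the notes) of why this happens: for a neuron f = w·x + b, the gradient on each weight is dL/dw_i = (dL/df) · x_i, so if every x_i > 0 all weight gradients share the sign of the upstream gradient.

```python
import numpy as np

# Hypothetical single neuron f = w.x + b whose inputs are all positive
# (e.g., x came out of a previous sigmoid layer, so every x_i is in (0, 1)).
x = np.array([0.2, 0.7, 0.9])   # all positive inputs
upstream = -1.3                 # dL/df, some scalar coming from upstream

dw = upstream * x               # dL/dw_i = upstream * x_i
print(dw)                       # [-0.26, -0.91, -1.17] -- all the same sign
# Every component of dw shares the sign of the upstream gradient, so the
# weight vector can only move in the all-positive or all-negative direction,
# producing zig-zag update paths.
```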
exp() is a bit computationally expensive.
**Leaky ReLU**: instead of being flat in the negative regime, a slight negative slope is given.
Therefore, does NOT saturate
Is still computationally efficient
Converges much faster than sigmoid/tanh in practice
no dying problem
-> f(x) = max(0.01x, x), i.e., it takes the max of two linear functions
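A quick numpy sketch (my own) of the Leaky ReLU and its local gradient, assuming the common slope alpha = 0.01: the gradient is 1 in the positive regime and alpha in the negative regime, so it never collapses to zero.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: max(alpha * x, x) -- small slope alpha in the negative regime."""
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Local gradient: 1 for x > 0, alpha otherwise -- never exactly zero."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(leaky_relu(x))       # [-0.1, -0.01, 0.5, 10.0]
print(leaky_relu_grad(x))  # [0.01, 0.01, 1.0, 1.0]
# Unlike sigmoid, the gradient never vanishes, so units do not "die".
```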
With Original Data
1. Zero-mean (zero-centered data)
Does this solve the sigmoid problem?
No, it is not sufficient: it only helps at the first layer, since later layers receive the previous layer's (non-zero-centered) activations rather than the zero-centered input.
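A minimal numpy sketch of zero-centering, using a made-up data matrix X: subtracting the per-feature mean centers the input, but only the first layer ever sees this centered data.

```python
import numpy as np

# Hypothetical data matrix: N examples x D features, all positive and off-center
X = np.random.rand(100, 3) * 5.0 + 2.0

X_zero_centered = X - X.mean(axis=0)   # subtract the per-feature mean
print(X.mean(axis=0))                  # roughly [4.5, 4.5, 4.5]
print(X_zero_centered.mean(axis=0))    # roughly [0, 0, 0]
# Only the *input* to the first layer is now zero-centered; the activations
# of later layers (e.g., sigmoid outputs) are still not zero-centered.
```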
Basically, you have to START the weights at some value and then update them from there.
What happens when W=0 init is used?
- Every neuron is going to compute the same thing on the input: they all give the same output, get the same gradient and the same updates, so all the neurons keep doing the same thing.
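A toy sketch (sizes and variable names made up) of this symmetry problem with an all-zero initialization: every hidden unit computes the same output and receives the same gradient, so all the weights get the exact same update.

```python
import numpy as np

# Toy one-hidden-layer network with W = 0 init.
N, D, H = 4, 3, 5                       # batch size, input dim, hidden dim
x = np.random.randn(N, D)
W1, W2 = np.zeros((D, H)), np.zeros((H, 1))

h = np.maximum(0, x @ W1)               # every hidden unit outputs the same thing
scores = h @ W2

# Backward pass, pretending the upstream gradient on the scores is all ones:
dscores = np.ones((N, 1))
dW2 = h.T @ dscores                     # identical row for every hidden unit
dh = (dscores @ W2.T) * (h > 0)         # identical gradient for every hidden unit
dW1 = x.T @ dh

print(np.unique(dW1))                   # [0.] -- every weight gets the exact same update
# Since all neurons compute the same output and receive the same gradient,
# they all get the same update and never become different from one another.
```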
Deeper networks