Express functions in terms of computational graphs
Neural Networks, which are linear layers stacked on top of each other (with non-linearities in between)
CNNs, which are a type of NN that uses convolution layers to preserve spatial structure
Learn the weights and parameters through optimization with gradient descent (GD); a sketch of the forward pass follows below
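A minimal NumPy sketch of these ideas (the shapes, sizes, and names are illustrative, not from the notes): a 2-layer net is just matmul -> non-linearity -> matmul, a tiny computational graph whose weights would then be learned by GD on some loss.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def two_layer_forward(x, W1, b1, W2, b2):
    h = relu(x @ W1 + b1)    # linear layer, activation applied right after
    return h @ W2 + b2       # second linear layer produces the scores

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 100))                    # batch of 32 inputs
W1, b1 = 0.01 * rng.standard_normal((100, 50)), np.zeros(50)
W2, b2 = 0.01 * rng.standard_normal((50, 10)), np.zeros(10)
scores = two_layer_forward(x, W1, b1, W2, b2)         # shape (32, 10)
```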

Where Activation Functions are applied in our system (right after the matrix multiplication)

Sigmoid squashes the numbers into [0, 1] -> a probability interpretation is possible 😃
Saturation kills the gradients
Sigmoid outputs are NOT zero-centered
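A quick numeric check of these two points (a sketch, not from the notes): the local gradient σ(x)(1 − σ(x)) dies in the saturated regime, and the outputs are always positive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # local gradient of the sigmoid

for x in [0.0, 2.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  grad={sigmoid_grad(x):.6f}")
# x=  0.0  sigmoid=0.5000  grad=0.250000
# x=  2.0  sigmoid=0.8808  grad=0.104994   <- already shrinking
# x= 10.0  sigmoid=1.0000  grad=0.000045   <- saturated: gradient ~ killed
# Outputs always lie in (0, 1), never negative -> not zero-centered.
```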




Why the need for zero-mean?
So that Activation Functions do NOT fall into their saturation regions.

How this zero-centering is applied to CNN inputs
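A sketch of the usual preprocessing (the array shapes are illustrative): subtract either a mean image or a single per-channel mean from every input.

```python
import numpy as np

X = np.random.rand(128, 32, 32, 3) * 255.0   # hypothetical batch (N, H, W, C)

# Option (a): subtract the mean image computed over the training set (AlexNet-style)
mean_image = X.mean(axis=0)                  # shape (32, 32, 3)
X_a = X - mean_image

# Option (b): subtract one scalar mean per color channel (VGG-style)
channel_mean = X.mean(axis=(0, 1, 2))        # shape (3,)
X_b = X - channel_mean
```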


Can check that the std of the activations falls to 0 layer by layer (with a small random init).
Why is this a problem?
All the activations, and with them the gradients, shrink to zero, so nothing gets learned.
Scale the weights up instead and the neurons become saturated!
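A sketch of this experiment (the layer sizes are mine): push random data through 10 tanh layers and watch the activation statistics for a small vs. a large random init.

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in [0.01, 1.0]:                     # small vs. large random init
    x = rng.standard_normal((1000, 500))
    for _ in range(10):
        W = scale * rng.standard_normal((500, 500))
        x = np.tanh(x @ W)
    saturated = np.mean(np.abs(x) > 0.99)
    print(f"scale={scale}: final std={x.std():.4f}, saturated={saturated:.2f}")
# scale=0.01 -> std collapses toward 0: activations and gradients vanish
# scale=1.0  -> nearly all units pinned at ±1: neurons saturated
```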

Xavier Initialization as an alternative
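Same experiment with Xavier scaling (a sketch; sizes as above): divide by sqrt(fan_in) so the variance of the pre-activations is preserved from layer to layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 500))
for layer in range(10):
    fan_in = x.shape[1]
    W = rng.standard_normal((fan_in, 500)) / np.sqrt(fan_in)   # Xavier
    x = np.tanh(x @ W)
    print(f"layer {layer}: activation std = {x.std():.4f}")
# The std now settles at a reasonable, roughly constant scale:
# no collapse to 0, no saturation at ±1.
```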


Normalize each feature across the batch.


The learnable scale and shift (γ, β) are then used to squash/shift the range as wanted
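A minimal training-time sketch of this (eps and shapes are illustrative; the running averages used at test time are left out):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # unit-gaussian per feature
    return gamma * x_hat + beta              # learnable squash/shift of the range

x = np.random.randn(64, 100) * 5.0 + 3.0     # batch of 64, 100 features
gamma, beta = np.ones(100), np.zeros(100)    # learned by the network in practice
out = batchnorm_forward(x, gamma, beta)
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ~0 and ~1 per feature
```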




Both Weight Initialization and Batch Normalization do this,
but why is it needed in the first place?

Why does the argument revolve around Var(z)?
If Var(z) is kept constant across layers, as in Xavier Initialization, the activations neither collapse to zero nor saturate, so the gradients keep flowing at a stable scale.
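A sketch of the standard variance bookkeeping (assuming zero-mean, independent weights and inputs), which is why Var(z) is the quantity the whole argument tracks:

$$\operatorname{Var}(z) = \operatorname{Var}\!\Big(\sum_{i=1}^{n} w_i x_i\Big) = n \,\operatorname{Var}(w)\,\operatorname{Var}(x), \qquad n = \text{fan\_in}$$

Choosing Var(w) = 1/n (the Xavier scaling) gives Var(z) = Var(x), which is exactly the constancy used above.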