Advantages of good initialization
import numpy as np

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers, including the input layer
    for l in range(1, L):
        # Every weight and bias starts at exactly zero.
        parameters['W' + str(l)] = np.zeros(shape=(layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))
    return parameters
The main weakness of zero initialization can be shown with a short calculation:
With all weights and biases at zero, every pre-activation is $z = Wx + b = 0$. For ReLU as the activation function:

$a = \mathrm{ReLU}(z) = \max(0, 0) = 0$

At the classification layer, where the activation function is the sigmoid:

$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-0}} = 0.5$

From this result, the model predicts a probability of 0.5 for the positive class on every example.
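This can be checked numerically with a minimal sketch (the two-layer network shape, the input, and the sigmoid helper below are illustrative assumptions, not part of the original):

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

params = initialize_parameters_zeros([3, 2, 1])        # 3 inputs, 2 hidden, 1 output
X = np.random.randn(3, 5)                              # 5 arbitrary examples
A1 = np.maximum(0, params['W1'] @ X + params['b1'])    # ReLU(0) = 0
y_pred = sigmoid(params['W2'] @ A1 + params['b2'])     # sigmoid(0) = 0.5
print(y_pred)                                          # [[0.5 0.5 0.5 0.5 0.5]]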
The loss function (binary cross-entropy) is:

$\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big)$

For $y = 1$, $\hat{y} = 0.5$ (a positive example) it becomes:

$\mathcal{L} = -\log(0.5) \approx 0.693$

For $y = 0$, $\hat{y} = 0.5$ (a negative example) it becomes:

$\mathcal{L} = -\log(1 - 0.5) \approx 0.693$

No matter what the true label $y$ is, the loss is identical, so the gradients carry no information that could separate the classes. No wonder the network is doing so badly.
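A quick numerical check of both cases (the binary_cross_entropy helper is a hypothetical name, not from the original):

import numpy as np

def binary_cross_entropy(y, y_pred):
    # Standard binary cross-entropy for a single example.
    return -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(binary_cross_entropy(1, 0.5))   # 0.6931... = log(2)
print(binary_cross_entropy(0, 0.5))   # 0.6931... = log(2)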
What we should remember: weights must be initialized randomly to break symmetry; the random initialization below does exactly that.
def initialize_parameters_random(layers_dims):
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        # Standard normal values scaled by 10: deliberately far too large,
        # to demonstrate what poor initialization does.
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))
    return parameters
In this case, we have to consider how to initialize well: poorly initialized weights can take extremely large or small values, which leads to vanishing or exploding gradients.
Mathematically, the predicted value $\hat{y}$ can be very close to 0 or 1 in some cases, so the loss can approach infinity: $-\log(\hat{y}) \to \infty$ as $\hat{y} \to 0$, and likewise $-\log(1 - \hat{y}) \to \infty$ as $\hat{y} \to 1$.
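To see the blow-up concretely, here is a sketch (the layer sizes, seed, and input are arbitrary assumptions) that inspects the pre-activations produced by the * 10 scaling:

np.random.seed(0)                       # assumed seed, for reproducibility
params = initialize_parameters_random([100, 50, 1])
X = np.random.randn(100, 10)            # 10 arbitrary examples
Z1 = params['W1'] @ X + params['b1']
print(Z1.std())                         # on the order of 100; sigmoid(z) for
                                        # |z| this large saturates at 0 or 1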
What we should remember: initializing weights with very large random values does not work well; smaller random values should do better.
Before we go to He initialization, we need to consider the difference between np.random.rand() and np.random.randn():

- np.random.rand(): samples from a uniform distribution on [0, 1)
- np.random.randn(): samples from the standard normal distribution (mean 0, variance 1)

Sampling from a normal distribution centered at 0 helps us draw random values that are mostly neither extremely large nor extremely small.
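A quick comparison of the two samplers (a minimal sketch; the sample size is arbitrary):

import numpy as np

u = np.random.rand(100000)      # uniform on [0, 1)
n = np.random.randn(100000)     # standard normal: mean 0, std 1
print(u.min(), u.max())         # always within [0, 1)
print(n.mean(), n.std())        # close to 0 and 1; most samples fall within ±3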
He initialization uses sqrt(2. / layers_dims[l-1]) as the scaling factor (cf. Xavier initialization, which uses sqrt(1. / layers_dims[l-1])).
def initialize_parameters_he(layers_dims):
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        # Scale by sqrt(2 / n_prev), which keeps the variance of the
        # activations roughly constant across ReLU layers.
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))
    return parameters
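A usage sketch (the layer sizes and seed are arbitrary assumptions) confirming the He scaling: the weights of layer l should have standard deviation close to sqrt(2 / layers_dims[l-1]):

import numpy as np

np.random.seed(1)                       # assumed seed, for reproducibility
params = initialize_parameters_he([784, 256, 10])
print(params['W1'].shape)               # (256, 784)
print(params['W1'].std())               # close to sqrt(2 / 784)
print(np.sqrt(2. / 784))                # 0.0505...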