# Initialization

jaegoo1199 · June 30, 2021

## Deep Learning


A good initialization can:

• Speed up the convergence of gradient descent
• Increase the odds of gradient descent converging to a lower training (and generalization) error

## Zero Initialization

```python
import numpy as np

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)  # number of layers, including the input layer

    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros(shape=(layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters
```

The main weakness of zero initialization can be shown with a simple calculation. With all weights and biases zero, every pre-activation is $z = 0$, so for ReLU as the activation function:

$a = \mathrm{ReLU}(z) = \max(0, z) = 0$

At the classification layer, where the activation function is the sigmoid:

$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{2} = y_{pred}$

From this result, the model outputs a probability of 0.5 for every example in the classification process. The loss function is:

$L(y_{pred}, y) = -y\ln(y_{pred}) - (1-y)\ln(1-y_{pred})$

For $y=1$ (true) with $y_{pred}=0.5$, it becomes:

$L(0.5, 1) = -\ln(0.5) \approx 0.6931$

For $y=0$ (false) with $y_{pred}=0.5$, it becomes:

$L(0.5, 0) = -\ln(0.5) \approx 0.6931$

No matter what the label $y$ is, the value of the loss function is the same, so the gradients carry no information about the data. No wonder it's doing so badly.
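The 0.6931 value is easy to verify numerically; here is a minimal sketch of the loss formula above (the function name is just for illustration):

```python
import numpy as np

def cross_entropy(y_pred, y):
    """Binary cross-entropy loss L(y_pred, y) from the formula above."""
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

# With zero initialization every prediction is 0.5, so both labels
# yield the same loss, -ln(0.5):
print(cross_entropy(0.5, 1))  # 0.6931...
print(cross_entropy(0.5, 0))  # 0.6931...
```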

What we should remember:

• The weights $W^{[l]}$ should be initialized randomly to break symmetry.
• Symmetry is broken as long as $W^{[l]}$ is initialized randomly, so the biases $b^{[l]}$ don't have to be initialized randomly; zeros are fine.

## Random Initialization

This initialization method breaks the symmetry of the weights.

```python
def initialize_parameters_random(layers_dims):
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        # Scaling by 10 makes the weights deliberately large.
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters
```

In this case, we have to consider how to initialize $W^{[l]}$ well: a poorly initialized $W^{[l]}$ can take extremely large or small values, which leads to vanishing or exploding gradients.
Mathematically, the activation $a$ can get very close to 0 or 1 in some cases, so the value of the loss function can be infinite: $-\ln(0) = \infty$.
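We can see the saturation effect directly. The sketch below feeds pre-activations with the *10 scale into a sigmoid (the sample size and thresholds are illustrative assumptions):

```python
import numpy as np

np.random.seed(0)

# Hypothetical pre-activations: weights scaled by 10, as in the code
# above, give z values with a standard deviation around 10.
z = np.random.randn(1000) * 10
a = 1 / (1 + np.exp(-z))  # sigmoid

# Most activations land very close to 0 or 1, where the sigmoid's
# gradient a * (1 - a) is nearly zero -- the vanishing-gradient regime.
saturated = np.mean((a < 0.1) | (a > 0.9))
print(saturated)
```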

What we should remember:

• Initializing the weights to very large values doesn't work well.
• We need to initialize the weights to suitably small values.

## He Initialization

Before we get to He initialization, we need to consider the difference between np.random.rand() and np.random.randn():

• np.random.rand(): samples from the uniform distribution on [0, 1)
• np.random.randn(): samples from the standard normal distribution
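A quick comparison of the two (sample size chosen arbitrarily):

```python
import numpy as np

np.random.seed(0)

u = np.random.rand(10000)   # uniform samples on [0, 1)
n = np.random.randn(10000)  # standard normal samples

print(u.min(), u.max())   # always bounded inside [0, 1)
print(n.mean(), n.std())  # roughly 0 and 1
```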

Sampling from a normal distribution helps us draw random values that are neither extremely large nor extremely small.
He initialization uses sqrt(2. / layers_dims[l-1]) as the scaling factor; cf. Xavier initialization, which uses sqrt(1. / layers_dims[l-1]).

```python
def initialize_parameters_he(layers_dims):
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters
```
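As a sanity check, the standard deviation of He-initialized weights should match the scaling factor sqrt(2 / n_prev). A small sketch with hypothetical layer sizes:

```python
import numpy as np

np.random.seed(1)

# Hypothetical layer sizes: 512 units feeding into 256 units.
n_prev, n_curr = 512, 256
W = np.random.randn(n_curr, n_prev) * np.sqrt(2. / n_prev)

# The empirical standard deviation matches the He scaling factor.
print(np.sqrt(2. / n_prev))  # 0.0625
print(W.std())               # close to 0.0625
```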
