Initialization

재구몬 · June 30, 2021


Advantages of good initialization

Speed up the convergence of gradient descent
Increase the odds of gradient descent converging to a lower training (and generalization) error

Zero Initialization

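import numpy as np

def initialize_parameters_zeros(layers_dims):
    # layers_dims[l] is the number of units in layer l (index 0 is the input layer).
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        # W has shape (units in layer l, units in layer l-1); b is a column vector.
        parameters['W' + str(l)] = np.zeros(shape=(layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters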

The main weakness of zero initialization can be seen with a short calculation. With every weight and bias set to zero, each pre-activation is $z = Wx + b = 0$.
For ReLU as the activation function:

$$a = \mathrm{ReLU}(z) = \max(0, z) = 0$$

At the classification layer, where the activation function is the sigmoid function:

$$\sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{2} = y_{pred}$$

So the network predicts a probability of 0.5 for the positive class for every input.
The loss function is:

$$L(y_{pred}, y) = -y\ln(y_{pred}) - (1-y)\ln(1-y_{pred})$$

For $y=1$ (true) and $y_{pred}=0.5$ it becomes:

$$L(0.5, 1) = -\ln(0.5) \approx 0.6931$$

For $y=0$ (false) and $y_{pred}=0.5$ it becomes:

$$L(0.5, 0) = -\ln(0.5) \approx 0.6931$$

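No matter what the true label $y$ is, the prediction is 0.5 and the loss is the same, $\ln 2 \approx 0.6931$. The network does no better than guessing, and because every unit in a layer starts with identical weights, gradient descent updates them identically and can never make them differ. No wonder it's doing so badly.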
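The calculation above can be checked numerically. This is a minimal sketch, assuming a small network with 2 inputs, 3 hidden ReLU units and 1 sigmoid output; the layer sizes, the inputs `X` and the labels `Y` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical layer sizes: 2 input features, 3 hidden units, 1 output unit.
layers_dims = [2, 3, 1]
W1 = np.zeros((layers_dims[1], layers_dims[0]))
b1 = np.zeros((layers_dims[1], 1))
W2 = np.zeros((layers_dims[2], layers_dims[1]))
b2 = np.zeros((layers_dims[2], 1))

# Four arbitrary examples stored column-wise, with labels 1, 0, 1, 0.
X = np.array([[0.5, -1.2, 3.0, 0.1],
              [2.0, 0.7, -0.4, -2.5]])
Y = np.array([[1, 0, 1, 0]])

A1 = relu(W1 @ X + b1)           # hidden activations: all zeros
y_pred = sigmoid(W2 @ A1 + b2)   # predictions: 0.5 for every example

# Per-example cross-entropy loss, same formula as in the text.
losses = -Y * np.log(y_pred) - (1 - Y) * np.log(1 - y_pred)

print(y_pred)   # every entry is 0.5
print(losses)   # every entry is ln(2), about 0.6931
```

Whatever values are placed in `X`, the prediction and the loss do not change, because the inputs are multiplied by zero before they reach the output.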

What to remember:

The weights $W^{[l]}$ should be initialized randomly to break symmetry.
Symmetry is broken as long as $W^{[l]}$ is initialized randomly, so the biases don't have to be initialized randomly.
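To see what "symmetry" means in terms of gradients, here is a rough sketch of one backward pass through a two-layer ReLU and sigmoid network. The helper `grads_two_layer`, the data and the constant value 0.5 are all assumptions made for this illustration, and the biases are held at zero to keep it short.

```python
import numpy as np

def grads_two_layer(W1, W2, X, Y):
    """Weight gradients for ReLU hidden layer + sigmoid output with cross-entropy loss."""
    m = X.shape[1]
    Z1 = W1 @ X
    A1 = np.maximum(0.0, Z1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))
    dZ2 = A2 - Y                     # gradient of the loss w.r.t. the output pre-activation
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)    # ReLU passes gradient only where Z1 > 0
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

# Arbitrary positive inputs (columns are examples) and labels.
X = np.array([[0.5, 1.2, 3.0, 0.1],
              [2.0, 0.7, 0.4, 2.5]])
Y = np.array([[1, 0, 1, 0]])

# Zero weights: both weight gradients are exactly zero, so the weights never move.
dW1_zero, dW2_zero = grads_two_layer(np.zeros((3, 2)), np.zeros((1, 3)), X, Y)

# Constant weights: every hidden unit gets the same gradient row, so they stay identical.
dW1_const, dW2_const = grads_two_layer(np.full((3, 2), 0.5), np.full((1, 3), 0.5), X, Y)

# Random weights: each hidden unit starts from a different point.
rng = np.random.default_rng(0)       # arbitrary seed
dW1_rand, dW2_rand = grads_two_layer(rng.standard_normal((3, 2)),
                                     rng.standard_normal((1, 3)), X, Y)

print(dW1_zero)
print(dW1_const)
print(dW1_rand)
```

In the zero and constant cases the rows of `dW1` are equal to each other, which is the symmetry that random weights are meant to break.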

Random Initialization

This initialization method breaks the symmetry between the weights.

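def initialize_parameters_random(layers_dims):
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        # Standard normal samples scaled by 10: deliberately large weights.
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        # Biases can stay at zero; random weights already break the symmetry.
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters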

Here we have to think about how to initialize $W^{[l]}$ well. Poorly scaled weights that are very large or very small lead to exploding or vanishing gradients.
With large weights, the output $a$ of the last sigmoid can be very close to 0 or 1, so when the prediction is wrong the loss can become infinite, since $-\ln(0) = \infty$.
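The saturation effect is easy to reproduce in floating point. The sketch below assumes a true label of 0 and three made-up pre-activation values `z`; a value around 40 is the order of magnitude that weights scaled by 10 can plausibly produce, but the exact numbers are only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_pred, y):
    # Same loss as in the text; NumPy's divide-by-zero warning is silenced for log(0).
    with np.errstate(divide="ignore"):
        return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

y = 0.0  # assumed true label: the positive prediction below is therefore wrong
for z in (0.4, 4.0, 40.0):
    p = sigmoid(z)
    print(z, p, cross_entropy(p, y))

# In float64, sigmoid(40.0) rounds to exactly 1.0, so the loss is -log(0) = inf.
loss_small = cross_entropy(sigmoid(0.4), y)
loss_large = cross_entropy(sigmoid(40.0), y)
```

A confident but wrong prediction is what makes the loss blow up; smaller weights keep `z` in the range where the sigmoid still has a usable slope.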

What to remember:

Initializing weights to very large values doesn't work well.
We need to initialize weights to suitably small values.

He Initialization

Before we go to He initialization, note the difference between np.random.rand() and np.random.randn():

np.random.rand(): samples from a uniform distribution over [0, 1)
np.random.randn(): samples from the standard normal distribution (mean 0, variance 1)
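NumPy exposes these two samplers as `np.random.rand` and `np.random.randn`. A quick way to see the difference is to draw many samples from each and look at their summary statistics; the seed and sample count below are arbitrary choices.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

u = np.random.rand(100_000)    # uniform on [0, 1)
n = np.random.randn(100_000)   # standard normal: mean 0, standard deviation 1

print("uniform: min", u.min(), "max", u.max(), "mean", u.mean())
print("normal:  mean", n.mean(), "std", n.std())
print("fraction of normal samples below zero:", np.mean(n < 0))
```

The uniform samples are all non-negative, while the normal samples are centered on zero with both signs about equally likely, which is the property the weight initializers here rely on.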

Sampling from the standard normal distribution gives values centered on zero, with most of them small in magnitude and both signs equally likely.
He initialization uses sqrt(2. / layers_dims[l-1]) as the scaling factor, where layers_dims[l-1] is the number of inputs to layer l. For comparison, Xavier initialization uses sqrt(1. / layers_dims[l-1]).

def initialize_parameters_he(layers_dims):
    parameters = {}
    L = len(layers_dims)

    for l in range(1, L):
        # Scale standard normal samples by sqrt(2 / fan_in), fan_in = layers_dims[l-1].
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters
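One way to see why the sqrt(2 / fan_in) factor helps is to push random data through a stack of ReLU layers and track the mean squared activation at each layer. This is only a sketch: the layer widths, the sample count, the seed and the helper `mean_square_per_layer` are assumptions for illustration, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
layers_dims = [256, 256, 256, 256, 256, 256]    # hypothetical layer sizes
X = rng.standard_normal((layers_dims[0], 500))  # 500 random input examples

def mean_square_per_layer(scale_fn):
    """Forward X through ReLU layers, returning the mean of A**2 after each layer."""
    A = X
    history = []
    for l in range(1, len(layers_dims)):
        fan_in = layers_dims[l - 1]
        W = rng.standard_normal((layers_dims[l], fan_in)) * scale_fn(fan_in)
        A = np.maximum(0.0, W @ A)
        history.append(float(np.mean(A ** 2)))
    return history

he = mean_square_per_layer(lambda fan_in: np.sqrt(2.0 / fan_in))  # He scaling
large = mean_square_per_layer(lambda fan_in: 10.0)                # the "* 10" scaling

print("He:   ", he)
print("* 10: ", large)
```

With the He factor the activation scale stays of the same order from layer to layer, while the fixed factor of 10 makes it grow by several orders of magnitude per layer.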

Reference

deeplearning.ai
