Initialization

Advantages of good initialization

  • Speed up the convergence of gradient descent
  • Increase the odds of gradient descent converging to a lower training (and generalization) error

Zero Initialization

import numpy as np

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        # Every weight and bias starts at exactly zero.
        parameters['W' + str(l)] = np.zeros(shape=(layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters
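
As a quick sanity check (a minimal sketch; the layer sizes are made up for illustration), every weight matrix comes back filled with zeros, so all neurons in a layer compute exactly the same output and receive exactly the same gradient:

# Example network: 3 inputs, 2 hidden units, 1 output unit (sizes chosen arbitrarily).
parameters = initialize_parameters_zeros([3, 2, 1])
print(parameters['W1'])   # [[0. 0. 0.]
                          #  [0. 0. 0.]] -- both hidden units are identical
print(parameters['b1'])   # [[0.]
                          #  [0.]]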

The main weakness of zero initialization can be shown with a simple calculation. Because all weights and biases are zero, every pre-activation $z$ is zero, so for ReLU as the hidden activation function:

$a = \mathrm{ReLU}(z) = \max(0, z) = 0$

At the classification layer, where the activation function is the sigmoid and $z$ is again 0:

$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{2} = y_{pred}$

From this result, the network predicts a probability of 0.5 of being the positive class for every example.
The loss function is:

$L(a, y) = -y\ln(y_{pred}) - (1-y)\ln(1-y_{pred})$

For $y = 1$ and $y_{pred} = 0.5$ it becomes:

$L(0.5, 1) = -\ln(0.5) = 0.6931$

For $y = 0$ and $y_{pred} = 0.5$ it becomes:

$L(0.5, 0) = -\ln(0.5) = 0.6931$

No matter what the true label $y$ is, the loss takes the same value, so the gradient carries no information about the labels. No wonder the network does so badly.
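
A quick numeric check of the two cases above (plain NumPy, just re-evaluating the loss formula):

import numpy as np

y_pred = 0.5                    # the only output a zero-initialized network can produce
for y in (0, 1):
    loss = -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)
    print(y, loss)              # prints 0.6931... for both labels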

What we should remember:

  • The weights $W^{[l]}$ should be initialized randomly to break symmetry.
  • Symmetry is broken as long as $W^{[l]}$ is initialized randomly, so the biases don't have to be initialized randomly and can stay at zero.

Random Initialization

This initialization method breaks the symmetry of the weights by drawing them at random. In the code below the random values are scaled by 10, i.e. deliberately large:

def initialize_parameters_random(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        # Weights drawn from a standard normal, then scaled by 10 (very large on purpose).
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters

In this case, we have to consider how to initialize $W^{[l]}$ well. A poorly initialized $W^{[l]}$ can take extremely large or small values, which leads to exploding or vanishing gradients.
Mathematically, the activation $a$ can end up very close to 0 or 1, so the value of the loss function can become infinite, since $\ln(0) = -\infty$.
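
As an illustration (a small sketch, not part of the original code): with weights scaled by 10, the pre-activation $z$ easily reaches a magnitude of a few tens, the sigmoid output rounds to exactly 0 or 1 in float64, and the cross-entropy loss overflows to infinity:

import numpy as np

z = 37.0                        # a typical pre-activation magnitude when weights are ~10x too large
a = 1 / (1 + np.exp(-z))        # sigmoid saturates: a rounds to exactly 1.0 in float64
print(a)                        # 1.0
print(-np.log(1 - a))           # inf -- the loss for a true label y = 0 blows up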

What we should remember:

  • Initializing weights to very large values doesn't work well.
  • We need to initialize the weights to appropriately small values.

He Initialization

Before we move on to He initialization, we need to consider the difference between np.random.rand() and np.random.randn() (compared in the short sketch after this list):

  • np.random.rand(): samples from a uniform distribution on [0, 1)
  • np.random.randn(): samples from a standard normal distribution (mean 0, standard deviation 1)
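
A quick comparison of the two samplers (the printed statistics are approximate):

import numpy as np

np.random.seed(1)                       # arbitrary seed for reproducibility
u = np.random.rand(10000)               # uniform on [0, 1): mean ~0.5, never negative
g = np.random.randn(10000)              # standard normal: mean ~0, std ~1, centered at zero
print(u.mean(), u.std())                # roughly 0.5 and 0.29
print(g.mean(), g.std())                # roughly 0.0 and 1.0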

Sampling from a normal distribution centered at zero helps keep the random values from being either extremely large or extremely small.
He initialization uses sqrt(2. / layers_dims[l-1]) as the scaling factor (cf. Xavier initialization, which uses sqrt(1. / layers_dims[l-1])).

def initialize_parameters_he(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        # Scale the standard-normal weights by sqrt(2 / n_prev): He initialization.
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros(shape=(layers_dims[l], 1))

    return parameters
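
A quick check (the layer sizes below are just an example) that the He-initialized weights come out with roughly the intended standard deviation $\sqrt{2 / n^{[l-1]}}$ while the biases stay at zero:

import numpy as np

np.random.seed(3)                                        # arbitrary seed for reproducibility
parameters = initialize_parameters_he([784, 256, 10])    # example layer sizes
print(parameters['W1'].std())    # close to np.sqrt(2 / 784) ~ 0.0505
print(parameters['W2'].std())    # close to np.sqrt(2 / 256) ~ 0.0884
print(parameters['b1'].sum())    # 0.0 -- biases are zero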
