a single hidden layer.
Input Layer = Layer 0
Hidden Layer = Layer 1
Output Layer = Layer 2
The circle in logistic regression really represents two steps of computation: z = wᵀx + b, then a = σ(z).
A neural network just does this a lot more times.
Let's start by focusing on just one of the nodes in the hidden layer.
(notation a[l]_i — superscript [l]: layer, subscript i: node in that layer)
Let's look at the first node in the hidden layer: z[1]_1 = w[1]_1ᵀ x + b[1]_1, a[1]_1 = σ(z[1]_1).
Let's look at the second node in the hidden layer: z[1]_2 = w[1]_2ᵀ x + b[1]_2, a[1]_2 = σ(z[1]_2).
So, what we're going to do is take these four equations and vectorize them.
For a single training example: z[1] = W[1]x + b[1], a[1] = σ(z[1]), z[2] = W[2]a[1] + b[2], a[2] = σ(z[2]).
Then we'll see how to vectorize across multiple training examples.
Notation: superscript [l] = layer, superscript (i) = training example.
If we have an unvectorized implementation and want to compute the predictions for all our training examples, we need a for-loop over i = 1, ..., m:
z[1](i) = W[1]x(i) + b[1], a[1](i) = σ(z[1](i)), z[2](i) = W[2]a[1](i) + b[2], a[2](i) = σ(z[2](i)).
We'll see how to vectorize this.
(Recall that we defined the matrix X to be our training examples stacked up in columns, so X has shape (n_x, m).)
Stacking the z[1](i), a[1](i), and so on into columns the same way gives the vectorized version:
Z[1] = W[1]X + b[1], A[1] = σ(Z[1]), Z[2] = W[2]A[1] + b[2], A[2] = σ(Z[2]).
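As a concrete reference, here is a minimal NumPy sketch of these vectorized forward-prop equations (variable names like W1, b1 are my own; shapes follow the notes: X is (n_x, m)):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_prop(X, W1, b1, W2, b2):
    """Vectorized forward propagation over all m examples at once.

    X  : (n_x, m)  training examples stacked in columns
    W1 : (n1, n_x), b1 : (n1, 1)
    W2 : (1, n1),   b2 : (1, 1)
    """
    Z1 = W1 @ X + b1          # (n1, m)
    A1 = sigmoid(Z1)          # (n1, m)
    Z2 = W2 @ A1 + b2         # (1, m)
    A2 = sigmoid(Z2)          # (1, m) -- predictions for all m examples
    return Z1, A1, Z2, A2
```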
So far, we've just been using the sigmoid function,
but sometimes other choices can work much better.
In the forward propagation steps for the neural network,
we had these two steps where we use the sigmoid function: a[1] = σ(z[1]) and a[2] = σ(z[2]).
So that sigmoid is called an activation function.
So in the more general case,
we can have a different function a = g(z),
where g could be a nonlinear function
that may not be the sigmoid function.
One function that almost always works better than the sigmoid is the tanh function,
or the hyperbolic tangent function: a = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)), which goes between −1 and +1.
For the hidden layer, tanh usually works better than sigmoid because the activations are closer to having a zero mean, which makes learning for the next layer a bit easier.
The one exception is for the output layer: because y is either 0 or 1, it makes sense for ŷ to be between 0 and 1, so the sigmoid is still used in the output layer.
So what we see in this example is that
we might have a tanh activation function for the hidden layer and a sigmoid for the output layer.
So the activation functions can be different for different layers.
And sometimes to denote that the activation functions are different for different layers,
we might use square-bracket superscripts as well, writing g[1] and g[2] to indicate that
g[1] may be different from g[2].
One of the downsides
of both the sigmoid function and tanh function is that
if z is either very large or very small,
then the gradient (that is, the derivative, the slope) of the function becomes very small.
So if z is very large or very small,
the slope of the function ends up being close to zero, and this can slow down gradient descent.
So one other choice that is very popular in machine learning is
what's called the rectified linear unit (ReLU).
The ReLU function looks like a hinge at zero, and the formula is a = max(0, z).
So the derivative is 1 as long as z is positive, and
the derivative (the slope) is 0 when z is negative.
There is also a variant called the Leaky ReLU, a = max(0.01z, z), which has a small positive slope when z is negative instead of being exactly flat.
So here are some rules of thumb for choosing activation functions.
sigmoid function & tanh function:
I would say never use the sigmoid function
except for the output layer
if you're doing binary classification, or maybe almost never use it.
I almost never use it because the tanh function
is pretty much strictly superior.
ReLU:
The most commonly used activation function is ReLU.
So if you're not sure what else to use, use this one.
And maybe feel free also to try the Leaky ReLU.
(Why is that constant 0.01? You can also make it another parameter of the learning algorithm.
You can just see how well it works, and stick with it if it gives you a good result.)
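For reference, here is a small sketch of these activation functions and their derivatives in NumPy (the 0.01 slope in leaky_relu is the constant discussed above, treated as a tunable choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))        # output in (0, 1)

def tanh(z):
    return np.tanh(z)                  # output in (-1, 1), closer to zero mean

def relu(z):
    return np.maximum(0, z)            # a = max(0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

# Derivatives used during backpropagation
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                 # close to 0 when |z| is large (saturation)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2         # close to 0 when |z| is large (saturation)

def relu_grad(z):
    return (z > 0).astype(float)       # 1 for z > 0, 0 for z < 0
```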
One of the things we'll see in deep learning is that
you often have a lot of different choices in how you build your neural network.
Ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights...
And it turns out that it is sometimes difficult to get good guidelines
for exactly what will work best for your problem.
If you're not sure which one of these activation functions works best, try them all.
And evaluate on a holdout validation set, or a development set.
And see which one works better, and then go with that.
Why does a neural network need a non-linear activation function?
So, here's the forward prop equations for the neural network.
Why don't we just get rid of the function g and set a equal to z?
➡️ we can say that g(z) = z.
Sometimes this is called the linear activation function.
(a better name for it would be the identity activation function)
➡️ if you were to use linear activation functions or we can also call them identity activation functions,
then the neural network is just outputting a linear function of the input.
➡️ And we'll talk about deep networks later, neural networks with many, many hidden layers.
And it turns out that if you use a linear activation function or alternatively,
if you don't have an activation function, then no matter how many layers your neural network has,
all it's doing is just computing a linear activation function.
So you might as well not have any hidden layers.
➡️ But the take home is that a linear hidden layer is more or less useless
because the composition of two linear functions is itself a linear function.
So if you don't use a non-linear function there,
the network can't compute anything more expressive no matter how deep it gets.
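A quick numeric check of that claim (made-up shapes and random weights): stacking two linear layers gives exactly the same outputs as a single linear layer with W' = W2 @ W1 and b' = W2 @ b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))            # 3 features, 5 examples
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the identity activation g(z) = z
two_layer = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ X + b_prime

print(np.allclose(two_layer, one_layer))   # True: composing linear maps is still linear
```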
There is just one place where you might use a linear activation function,
and that's if you are doing machine learning on a regression problem.
So if y is a real number.
So for example, if you're trying to predict housing prices.
So y is not 0 or 1, but is a real number, anywhere from $0 up to however much houses cost.
Then it might be okay to have a linear activation function in the output layer,
so that your output ŷ is also a real number ranging over those values.
But then the hidden units should not use linear activation functions.
They could use ReLU or tanh or Leaky ReLU or maybe something else.
So the one place you might use a linear activation function is usually in the output layer.
But other than that, using a linear activation function in the hidden layers,
except for some very special circumstances relating to compression that we're going to talk about, is extremely rare.
And, of course, if we're actually predicting housing prices,
because housing prices are all non-negative, perhaps even then you can use a ReLU activation function so that your outputs are all greater than or equal to 0.
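A minimal sketch of that regression setup (assumed names and shapes): ReLU in the hidden layer, and a linear or ReLU output depending on whether the target can be negative.

```python
import numpy as np

def forward_regression(X, W1, b1, W2, b2, nonnegative_target=False):
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)        # ReLU in the hidden layer
    Z2 = W2 @ A1 + b2
    if nonnegative_target:
        return np.maximum(0, Z2)  # e.g. housing prices: outputs >= 0
    return Z2                     # linear/identity output: any real number
```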
Our neural network with a single hidden layer, for now, will have parameters W[1], b[1], W[2], b[2].
(You have n[0] = n_x input features, n[1] hidden units, and n[2] = 1 output unit, so W[1] is (n[1], n[0]), b[1] is (n[1], 1), W[2] is (n[2], n[1]), b[2] is (n[2], 1).)
We also have a cost function for the neural network: J(W[1], b[1], W[2], b[2]) = (1/m) Σᵢ L(ŷ(i), y(i)).
For now, we're just going to assume that we're doing binary classification, so L is the same cross-entropy loss as in logistic regression.
To train the parameters of our algorithm, we need to perform gradient descent.
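A hedged sketch of one gradient-descent iteration for this single-hidden-layer network (tanh hidden layer, sigmoid output, cross-entropy loss; function and variable names and the learning rate are my own):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_step(X, Y, W1, b1, W2, b2, lr=0.01):
    m = X.shape[1]

    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                      # y_hat for all m examples

    # Cost: binary cross-entropy, as in logistic regression
    cost = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

    # Backward propagation
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return cost, W1, b1, W2, b2
```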
When training a neural network, it is important to initialize the parameters randomly rather than to all zeros.
it's important to initialize the weights randomly.
For logistic regression, it was okay to initialize the weights to zero.
But for a neural network, if you initialize the weight parameters to all zeros and
then apply gradient descent, it won't work.
Let's see why.
So it's possible to construct a proof by induction that
if you initialize all the values of W[1] and W[2] to 0,
then because both hidden units start off computing the same function,
and both hidden units have the same influence on the output unit,
then after one iteration, that same statement is still true,
the two hidden units are still symmetric.
And therefore, after two iterations, three iterations and so on,
no matter how long you train your neural network,
both hidden units are still computing exactly the same function.
And so in this case, there's really no point to having more than one hidden unit.
The solution to this is to initialize your parameters randomly.
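A minimal sketch of the usual random initialization (the 0.01 scale keeps z small so tanh/sigmoid don't start out saturated; the biases can safely be zeros):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break symmetry
    b1 = np.zeros((n_h, 1))                  # zeros are fine for the biases
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```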
h5py:
HDF5 is a binary data format,
and h5py is a package that provides a Pythonic interface to HDF5.
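For example (the file name and dataset keys below are just illustrative), reading an HDF5 dataset with h5py might look like this:

```python
import h5py
import numpy as np

# "train_data.h5" and the dataset keys are hypothetical names for illustration.
with h5py.File("train_data.h5", "r") as f:
    print(list(f.keys()))                    # list the datasets stored in the file
    train_x = np.array(f["train_set_x"][:])  # load a dataset into a NumPy array
    train_y = np.array(f["train_set_y"][:])
```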
Why do we standardize the data during preprocessing, before running logistic regression?
Why should the hidden layers use only nonlinear activation functions?
If we use a linear activation function:
For example, take a simple scalar case and suppose the activation function is g(z) = 2z.
Then after passing through 1 hidden layer, the output is 2x.
After 2 hidden layers, it is 4x.
After 3 hidden layers, it is 8x.
But you can produce 8x without building 3 hidden layers, with a single hidden layer that uses the activation function g(z) = 8z.
So there is no reason to build 3 hidden layers.
Also, if we use a linear activation function, every unit is linear, so the prediction value is just a linear function of the input, which adds no extra expressiveness.
So we should use nonlinear activation functions, so that the network performs nonlinear computations on the input features x, extracts richer representations of x, and produces better prediction values.
Why is random initialization needed rather than zero initialization? If all the weights are initialized to 0,
every unit in the hidden layer will perform exactly the same computation
(because every unit has identical weights, so it computes identical z and a values).
In the neural network above, there is 1 hidden layer with 4 units.
When the weight updates are computed during backward propagation,
every unit is updated by the same amount, so all the weights keep holding the same values.
Then the 4 units again all perform the same computation,
so the computation of the 4 units does exactly the same job as the computation of a single unit.
So if the weights are initialized to zero, learning does happen, but we can't expect the benefit of a neural network with multiple units.
By initializing the weights randomly, each of the 4 units performs a different computation, and we can expect the neural network to work as intended.
Why is ReLU a better activation function for the hidden layer than sigmoid or tanh?
During backward propagation we compute dZ[1].
To compute dZ[1], we need g[1]'(Z[1]), because dZ[1] = W[2]ᵀ dZ[2] * g[1]'(Z[1]).
If we used sigmoid or tanh for g[1],
we would have to compute the derivative of sigmoid or tanh.
But computing the derivative of sigmoid or tanh comes with a drawback that is fatal to learning.
When the z value fed into sigmoid or tanh is very small or very large, the derivative converges to 0.
The derivative of sigmoid or tanh converging to 0 means that
dZ[1] converges to 0,
and because dW[1] and db[1] are both computed by multiplying with dZ[1],
dW[1] and db[1] will also converge to 0.
If that happens, then when performing gradient descent, W[1] and b[1] are either not updated at all or updated by only very small amounts.
That means learning does not happen.
So ReLU, whose derivative does not shrink toward 0 for positive z (it is exactly 1 whenever z > 0), has an advantage in this respect.
Also, the derivative of ReLU is very simple to compute.
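A quick check of this point: for large |z| the sigmoid and tanh derivatives are essentially 0, while the ReLU derivative stays exactly 1 for any positive z (derivative formulas as above).

```python
import numpy as np

z = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])

sigmoid = 1 / (1 + np.exp(-z))
print(sigmoid * (1 - sigmoid))    # ~0 at z = -10 and z = 10  (saturation)
print(1 - np.tanh(z) ** 2)        # ~0 at z = -10 and z = 10  (saturation)
print((z > 0).astype(float))      # ReLU derivative: exactly 1 for every z > 0
```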