Computer vision
is one of the areas that has been advancing rapidly thanks to deep learning.
Here are some examples of computer vision problems we'll study in this course.
You've already seen image classification, sometimes also called image recognition,
where you might take as input, say, an image and try to figure out whether it is a cat.
Another example of a computer vision problem is object detection.
So if you're building a self-driving car,
maybe you don't just need to figure out that there are other cars in this image;
instead, you need to figure out the position of the other cars in this picture,
so that your car can avoid them.
In object detection, we usually have to not only figure out that there are other objects, say cars, in the picture,
but also draw boxes around them.
And in this example, there can be multiple cars in the same picture,
or at least every one of them within a certain distance of your car.
Here's another example: Neural Style Transfer.
Let's say you have a picture, and you want this picture repainted in a different style.
So in Neural Style Transfer,
you have a content image, and you have a style image.
The image on the right is actually a Picasso.
And you can have a neural network put them together to repaint the content image in the style of the style image.
So algorithms like these are enabling new types of artwork to be created.
One challenge of computer vision is that the inputs can get really big, and with that many input features
it's difficult to get enough data to prevent a neural network from overfitting.
Also, the computational and memory requirements to train a neural network with that many parameters are just infeasible.
But for computer vision applications, you don't want to be stuck using only tiny little images.
You want to use large images.
To do that, you need a better way to implement the convolution operation,
which is one of the fundamental building blocks of convolutional neural networks.
(Training on large input images is impractical with a plain fully connected neural network.)
The convolution operation
is one of the fundamental building blocks of a convolutional neural network. Let's see how the convolution operation works.
Here is a grayscale image.
Because this is a grayscale image, it is just a matrix.
In order to detect edges, let's say vertical edges, in this image,
what you can do is construct a 3x3 matrix,
and, in the terminology of convolutional neural networks,
this is going to be called a filter.
(Sometimes research papers will call it a kernel
instead of a filter,
but I am going to use the filter
terminology in these videos.)
And what you are going to do is take the image and convolve it with the filter.
(In math, the asterisk (*) is the convolution symbol,
but in Python, * is used as the multiplication or element-wise multiplication operator.)
(In this course, the asterisk (*) denotes convolution.)
The output of this convolution operation will be another matrix, which you can think of as another image.
If you implement this in a programming language, you would use a different function, rather than an asterisk, to denote convolution.
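To make this concrete, here is a minimal NumPy sketch of the operation (my own illustration, not course code; conv2d is just an ad-hoc name), along with the 3x3 vertical edge detection filter used in this example:

```python
import numpy as np

def conv2d(image, filt):
    """'Convolution' as used in these videos (no filter flipping):
    slide the filter over the image and sum the element-wise products."""
    h, w = image.shape
    f = filt.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(h - f + 1):
        for j in range(w - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * filt)
    return out

# 3x3 vertical edge detection filter: bright (1s) on the left, zeros in the middle, dark (-1s) on the right.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
```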
Why is this doing vertical edge detection?
Let's look at another example.
To illustrate this, we are going to use a simplified image.
So here is a simple image.
If you plot this as a picture, it might look like this,
where the left half, the 10s, gives you brighter pixel intensity values,
and the right half, the 0s, gives you darker pixel intensity values.
(10 and 0 wouldn't really be those exact colors, but think of it as an extreme case for the sake of the example.)
In this image, there is clearly a very strong vertical edge right down the middle of the image
as it transitions from white to black.
So when you convolve this with the vertical edge detection filter, the filter can be visualized as follows:
brighter pixels (1s) on the left,
then zeros in the middle,
and then darker pixels (-1s) on the right.
In case the dimensions here seem a little bit wrong, in that the detected edge seems really thick,
that's only because we are working with very small images in this example.
If you are using, say, a much larger image rather than this tiny one,
you'll find that this does a pretty good job of detecting the vertical edges in your image.
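As a quick check, here is a hedged sketch applying that vertical edge filter to a small image whose left half is 10s and right half is 0s (the exact image size is just for illustration; scipy's correlate2d is used so the snippet stands alone):

```python
import numpy as np
from scipy.signal import correlate2d

# Left half bright (10), right half dark (0): a strong vertical edge down the middle.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

# "Convolution" as used in these videos is correlate2d (no filter flipping).
output = correlate2d(image, vertical_edge, mode="valid")
print(output)
# The large values in the middle columns mark the detected vertical edge;
# the edge looks thick only because the image is tiny.
```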
The difference between positive and negative edges: the sign of the output tells you whether the edge is a light-to-dark or a dark-to-light transition.
Let's see some more examples of edge detection.
So here's one example. If you convolve this with a horizontal edge detector, you end up with this.
So in summary, different filters allow you to find vertical and horizontal edges.
It turns out that the vertical edge detection filter we've used is just one possible choice.
And historically, in the computer vision literature,
there was a fair amount of debate about what is the best set of numbers to use.
This one is called a Sobel filter.
And the advantage of this is that it puts a little bit more weight on the central row,
which maybe makes it a little bit more robust.
But computer vision researchers will use other sets of numbers as well.
And this one is called a Scharr filter.
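For reference, the commonly cited 3x3 vertical-edge versions of these two filters, written as NumPy arrays:

```python
import numpy as np

# Vertical edge detectors; note the extra weight on the central row.
sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]])

scharr_vertical = np.array([[ 3, 0,  -3],
                            [10, 0, -10],
                            [ 3, 0,  -3]])
```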
With the rise of deep learning,
one of the things we learned is that
when you really want to detect edges in some complicated image, maybe you don't need to have computer vision researchers handpick these nine numbers.
Maybe you can just learn them, and treat the nine numbers of this matrix as parameters,
which you can then learn using backpropagation.
And what you'll see in later videos is that by just treating these nine numbers as parameters,
backprop can choose to learn something that's even better at capturing the statistics of your data than any of these hand-coded filters (Sobel, Scharr).
And rather than just vertical and horizontal edges,
maybe it can learn to detect edges that are at whatever orientation it chooses.
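As a framework-level illustration (a sketch; the layer settings here are my own assumptions, not anything specified in the course), in Keras the numbers in each filter are exactly the trainable weights of a Conv2D layer:

```python
import tensorflow as tf

# A conv layer with 8 filters of size 3x3; its kernel weights are learned by backprop
# rather than hand-designed like the Sobel or Scharr filters.
layer = tf.keras.layers.Conv2D(filters=8, kernel_size=3)
layer.build(input_shape=(None, 64, 64, 3))
print(layer.kernel.shape)  # (3, 3, 3, 8): 3x3 filters over 3 input channels, 8 filters
```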
Padding
Valid convolutions: no padding.
Same convolutions: pad so that the output size is the same as the input size.
Strided convolutions
are another piece of the basic building blocks used in convolutional neural networks.
Let me show you an example. Instead of stepping over one position at a time, we're going to step over by two.
If you have an n×n image convolved with an f×f filter,
and you use padding p and stride s,
then in this example,
you end up with an output that is ((n + 2p − f)/s + 1) × ((n + 2p − f)/s + 1).
Now, one last detail: what if this fraction is not an integer?
In that case, we round it down.
If any part of the blue box hangs outside the image, then you just do not do that computation.
So the right thing to do, to compute the output dimension, is to round down
in case this fraction is not an integer.
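Putting that together, here is a small helper (my own sketch, not course code) that computes the output dimension with the round-down rule:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Height/width of the output of convolving an n x n input with an f x f filter,
    using padding p and stride s; a non-integer result is rounded down."""
    return floor((n + 2 * p - f) / s) + 1

# Illustrative usage (numbers made up): a 7x7 input, 3x3 filter, no padding, stride 2 -> 3x3 output.
print(conv_output_size(7, 3, p=0, s=2))  # 3
```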
Cross-correlation vs. convolution
: what we call convolution in deep learning skips the filter-flipping (mirroring) step from the mathematical definition of convolution, so technically it is cross-correlation, but by convention we still call it convolution.
Now, convolutions over volumes. So here's the RGB image.
To simplify the drawing of the filter,
instead of drawing it as a stack of three matrices, we just draw it as a three-dimensional cube.
So to compute the output of this convolution operation,
what you do is take the filter and first place it in the upper left-most position.
Notice that this filter, a 3×3×3 cube, has 27 parameters.
So what you do is take each of these numbers and multiply them with the corresponding numbers from the red, green, and blue channels of the image,
then add up all those numbers, and this gives you the first number in the output. Then, to compute the next output, you take this cube and slide it over by one,
and again do the multiplications and add up the numbers,
and that gives you the next output.
So what does this allow you to do?
Here's an example.
If you want to detect edges in only the red channel of the image,
you can make the filter's first (red) channel a vertical edge detector,
have its green channel be all zeros,
and have its blue channel be all zeros.
If you stack these together to form your filter,
then this would be a filter that detects vertical edges, but only in the red channel.
And with different choices of these parameters,
you can get different feature detectors out of this filter.
By convention, in computer vision,
when you have an input with a certain height, a certain width, and a certain number of channels,
your filter can have a potentially different height and width, but it must have the same number of channels.
And once again, notice that convolving a 3D volume
with a 3D filter gives a 2D output.
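Here is a minimal NumPy sketch of convolving over a volume (shapes and names are illustrative assumptions, not course code):

```python
import numpy as np

def conv_volume(image, filt):
    """Convolve an (H, W, C) volume with an (f, f, C) filter, giving a 2D output."""
    h, w, c = image.shape
    f = filt.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(h - f + 1):
        for j in range(w - f + 1):
            # Element-wise multiply all f*f*C numbers with the image patch and sum them.
            out[i, j] = np.sum(image[i:i+f, j:j+f, :] * filt)
    return out

# A filter that detects vertical edges only in the red channel:
red_edge = np.zeros((3, 3, 3))
red_edge[:, :, 0] = [[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]]   # green and blue channels stay all zeros
```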
A ConvNet is typically built out of three types of layer.
Convolutional layer (Conv): that's what we've been using in the previous network.
Pooling layer (Pool): we'll talk about this in the next videos.
Fully connected layer (FC): we'll talk about this in the next videos.
Pooling layers and fully connected layers are a bit simpler than convolutional layers to define.
ConvNets often use pooling layers to reduce the size of the representation and speed up the computation. Let's go through an example of pooling, and then we'll talk about why you might want to do this.
To compute each of the numbers on the right, we took the max over a small region.
So it is as if you apply a filter of size f, because you're taking f×f regions,
and you're using a stride s.
So f and s are actually the hyperparameters of max pooling. Here's the intuition behind what max pooling is doing.
If you think of this region as some set of features,
the activations in some layer of the neural network,
then a large number means that a particular feature has maybe been detected.
So the upper left-hand quadrant has this particular feature;
it may be a vertical edge, or maybe an eye or a whisker if you're detecting a cat.
Clearly, that feature exists in the upper left-hand quadrant.
Whereas this feature doesn't really exist in the upper right-hand quadrant.
So what the max operation does is ensure that if a feature is detected anywhere
in one of these quadrants, it remains preserved in the output of max pooling.
So what the max operation is really saying is,
if this feature is detected anywhere in this filter region, then keep a high number.
But if this feature is not detected, so maybe it doesn't exist in the upper right-hand quadrant,
then the max of all those numbers is still itself quite small.
So maybe that's the intuition behind max pooling.
I don't know if anyone fully knows whether that is the real underlying reason that max pooling works well in ConvNets. One interesting property of max pooling is that it has a set of hyperparameters
but it has no parameters to learn.
Once you fix f and s, it's just a fixed computation, and gradient descent doesn't change anything.
Let's go through an example with some different hyperparameters.
Max pooling on a 2D input works as just described. If you have a 3D input, the output will have the same number of channels: the way you compute max pooling is you perform the computation
we just described on each of the channels independently.
More generally, if the input has n_c channels, the output will also have n_c channels.
The max pooling computation is done independently on each of these channels.
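A minimal NumPy sketch of 2D max pooling, with f and s as the hyperparameters discussed above (my own illustration):

```python
import numpy as np

def max_pool_2d(x, f=2, s=2):
    """Max pooling over an (H, W) input with pooling size f and stride s.
    There are no learnable parameters, only the hyperparameters f and s."""
    h, w = x.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(x[i*s:i*s+f, j*s:j*s+f])
    return out
```

For a 3D input you would simply apply this function to each channel independently.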
Average pooling
is another type of pooling, although it isn't used very often. Now, suppose you're trying to do handwritten digit recognition.
Let's build a neural network to do this.
It's actually quite similar
to one of the classic neural networks called LeNet-5,
which was created by Yann LeCun many years ago.
What I'll show here isn't exactly LeNet-5, but many of the parameter choices were inspired by it.
When people report the number of layers in a neural network,
they usually count the layers that have weights, that is, that have parameters.
And because a pooling layer has no weights, no parameters, only a few hyperparameters,
I'm going to use the convention that CONV1 and POOL1 are counted together.
I'm going to treat that as Layer 1,
although if you read articles online or research papers,
you'll sometimes see the conv layer and the pooling layer described as two separate layers.
But when I count layers, I'm just going to count layers that have weights.
So we treat the CONV layer and the POOL layer together as Layer 1.
I pointed this out earlier, but notice how the activation dimensions change from layer to layer.
So as you go deeper usually the height and width will decrease,
whereas the number of channels will increase.
The number of channels has been growing from layer to layer, and then your fully connected layers are at the end.
And another pretty common pattern you see in neural networks is
to have one or more conv layers followed by a pooling layer,
and then one or more conv layers followed by a pooling layer,
...
and then at the end some fully connected layers,
followed by maybe a softmax.
(CONV - POOL - ... - CONV - POOL - FC - ... - FC - Softmax)
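As a rough Keras sketch of this CONV-POOL-...-FC-Softmax pattern (the filter counts and layer sizes below are illustrative choices in the spirit of LeNet-5, not the exact course network):

```python
import tensorflow as tf

# CONV - POOL - CONV - POOL - FC - FC - Softmax (layer sizes are illustrative)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(84, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.summary()  # shows activation shapes and parameter counts per layer
```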
So for this neural network, let's go through some more details:
the activation shape, the activation size, and the number of parameters in this network.
First, notice that the max pooling layers don't have any parameters.
Second, notice that the conv layers tend to have relatively few parameters,
and a lot of parameters tend to be in the fully connected layers of the neural network.
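To see why that is, the usual counting rule helps: a conv layer with n_filters filters of size f × f over n_c_prev input channels has

(f × f × n_c_prev + 1) × n_filters parameters (the +1 is the bias per filter),

which does not depend on the input's height and width at all, whereas a fully connected layer needs one weight for every input-output pair.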
You'll also notice that the activation size tends to go down gradually as you go deeper in the neural network.
If the activation size drops too quickly, that's usually not great for performance either. A lot of computer vision research has gone into figuring out how to put together
these basic building blocks to build effective neural networks.
I think one of the best ways for you to gain intuition about how to put these things together is
to see a number of concrete examples of how others have done it.
There are two main reasons convolutional layers get away with relatively few parameters: parameter sharing
and sparsity of connections.
Parameter sharing
: parameter sharing is motivated by the observation that a feature detector, such as a vertical edge detector, that's useful in one part of the image is probably useful in another part of the image.
If you've figured out say a filter for detecting vertical edges,
you can then apply the same filter over here
and then the next position over here,
and the next position over, and so on.
Each of these feature detectors, each of these outputs can use the same parameters in lots of different positions in your input image
in order to detect say a vertical edge or some other feature.
And I think this is true for low-level features like edges,
as well as the higher level features like maybe detecting the eye that indicates a face or a cat or something there.
But sharing the same parameters to compute all of these outputs
is one of the ways the number of parameters is reduced.
And it also just seems intuitive that a feature detector, like a vertical edge detector,
that is computed for the upper left-hand corner of the image
probably has a good chance of being useful for the lower right-hand corner of the image as well.
So you don't need to learn separate feature detectors for the upper left and the lower right-hand corners of the image.
And maybe you do have a dataset where the upper left-hand and lower right-hand corners of the image have somewhat different distributions,
so they may look a little bit different, but they might be similar enough
that sharing feature detectors all across the image works just fine.
(By having a filter, i.e. a feature detector, detect the corresponding feature anywhere in the input image,
the feature can be detected using the same parameters everywhere.
This is how the number of parameters is reduced.)
Sparsity of connections
: if you look at one of the output values, it is computed via convolution,
and so it depends only on a small grid of input cells.
It is as if this output unit is connected to only a handful of the input features (the values under the filter),
and in particular, the rest of the pixel values have no effect at all on this output.
Through this mechanism,
a neural network has a lot fewer parameters, which allows it to be trained with smaller training sets and makes it less prone to overfitting.
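As a back-of-the-envelope illustration of how much these two mechanisms save (the layer shapes below are made up for the example):

```python
# Suppose a layer maps a 32 x 32 x 3 input volume to a 28 x 28 x 6 output volume.
f, n_c_prev, n_filters = 5, 3, 6

# Convolutional layer: each of the 6 filters has 5*5*3 weights plus a bias,
# and those weights are shared across every position of the image.
conv_params = (f * f * n_c_prev + 1) * n_filters

# Fully connected layer between the same flattened volumes:
# one weight per input-output pair, plus one bias per output unit.
fc_params = (32 * 32 * 3) * (28 * 28 * 6) + (28 * 28 * 6)

print(conv_params)  # 456
print(fc_params)    # 14,455,392 -- roughly 14.5 million
```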
And so sometimes you also hear about
convolutional neural networks being very good at capturing translation invariance.
And that's the observation that a picture of a cat shifted a couple of pixels to the right is still pretty clearly a cat.
And convolutional structure helps the neural network encode the fact that an image shifted a few pixels
should result in pretty similar features and should probably be assigned the same output label.
And the fact that you are applying the same filter at all positions of the image,
both in the early layers and in the later layers, helps the neural network automatically learn to be more robust, that is, to better capture the desirable property of translation invariance.
(Because the same filter is applied at every position of the input image, the features are still detected automatically even if the cat in the image has shifted to the right.
This is called translation invariance.)
A logit is the value of the last layer; it is not a probability.
from_logits=True ➡️ the values have not yet been normalized into probabilities == compute the cross entropy on values that are not yet probabilities.
from_logits=False ➡️ the values have already been normalized into probabilities between 0 and 1 == compute the cross entropy on values that are not logits.
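For instance, in TensorFlow/Keras (a sketch; assumes a classification model whose last layer may or may not apply softmax):

```python
import tensorflow as tf

# If the last Dense layer has NO softmax, its outputs are logits:
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# If the last layer already applies softmax, its outputs are probabilities:
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
```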
Sparsity of connections is not as good a fit for models of language and speech,
because with language data, every part of the input matters.
So while you can use a CNN for a language model, you usually use a Transformer instead, a model whose attention can look across the entire text.
A Transformer considers the correlations between the word at the very start of the text and the word at the very end.
CNNs, which make the most of sparsity of connections, are therefore a good fit for image processing.
➡️ The structure of the model changes completely depending on the characteristics of the input data.
CNNs were developed to build models specialized for image processing.