Lecture 11. Introduction to Neural Networks

cryptnomy · November 24, 2022

CS229: Machine Learning

Lecture video link: https://youtu.be/MfIjxPh6Pys

Outline

  • Logistic Regression
  • Neural Networks

Deep Learning

Why deep learning has taken off recently:

  • computational power
  • data availability
  • algorithms

Logistic Regression

Goal: Find cats in an image: $\begin{cases}1\rightarrow\text{presence of a cat}\\0\rightarrow\text{absence of a cat}\end{cases}$

(Source: https://youtu.be/MfIjxPh6Pys 7 min. 15 sec.)

$$\hat y=\sigma(\theta^Tx)=\sigma(wx+b),$$

where $\hat y$ has shape $(1,1)$, $w$ has shape $(1,12288)$, and $x$ has shape $(64\times64\times3,1)=(12288,1)$.

$x$ … the flattened input.
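
A minimal NumPy sketch of these shapes (the image here is a random placeholder and the zero-initialized parameters are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder 64x64 RGB image standing in for a real photo.
image = np.random.rand(64, 64, 3)

# Flatten into a column vector x of shape (64*64*3, 1) = (12288, 1).
x = image.reshape(-1, 1)

# Parameters: w has shape (1, 12288), b is a scalar.
w = np.zeros((1, x.shape[0]))
b = 0.0

# Forward pass: y_hat has shape (1, 1), interpreted as P(cat).
y_hat = sigmoid(w @ x + b)
print(y_hat.shape)  # (1, 1)
```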

  1. Initialize $w, b$ (weights and bias).

  2. Find the optimal $w, b$.

    $$\mathcal{L}=-\left[y\log\hat y+(1-y)\log(1-\hat y)\right]$$

    $$\begin{cases}w\leftarrow w-\alpha\frac{\partial\mathcal{L}}{\partial w}\\b\leftarrow b-\alpha\frac{\partial\mathcal{L}}{\partial b}\end{cases}$$

  3. Use $\hat y=\sigma(wx+b)$ to predict.
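
A minimal sketch of these three steps in NumPy, assuming a toy random dataset `X` (one example per column) and labels `y`; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: n features, m examples stored as columns, binary labels.
n, m = 20, 100
X = np.random.randn(n, m)
y = (np.random.rand(1, m) > 0.5).astype(float)

# 1. Initialize w, b.
w = np.zeros((1, n))
b = 0.0
alpha = 0.1

# 2. Gradient descent on the logistic loss L.
for _ in range(1000):
    y_hat = sigmoid(w @ X + b)        # shape (1, m)
    dz = y_hat - y                    # dL/dz for sigmoid + logistic loss
    dw = (dz @ X.T) / m               # average gradient, shape (1, n)
    db = dz.sum() / m
    w -= alpha * dw
    b -= alpha * db

# 3. Predict with y_hat = sigmoid(w x + b).
predictions = (sigmoid(w @ X + b) > 0.5).astype(float)
```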

To remember

  1. neuron = linear + activation.
  2. model = architecture + parameters.

Goal 2.0: Find cat / lion / iguana in images.

(Source: https://youtu.be/MfIjxPh6Pys 17 min. 56 sec.)

Notation

$a^{[\cdot]}$ … layer number (superscript in square brackets)

$a_{(\cdot)}$ … a neuron (subscript)

Q. What dataset do you need when you train this logistic regression?

A. Images and labels in column-vector form, e.g. $\begin{bmatrix}\text{cat}\\\text{lion}\\\text{iguana}\end{bmatrix}\rightarrow\begin{bmatrix}1\\1\\0\end{bmatrix}$.

Q. Is this network robust if different animals are present in the same picture?

A. Yes. The three neurons don’t communicate with each other, so we can train them completely independently of one another.

You don’t need to tell them everything. If you have enough data, they’re going to figure it out.

MyQ. Can images overlap each other?

A. Probably yes.

Goal 3.0: Add the constraint that there is exactly one animal in each image.

(Source: https://youtu.be/MfIjxPh6Pys 27 min. 44 sec.)

(called “softmax multi-class regression.”)

The loss function:

$$\mathcal{L}_{3N}=-\sum_{k=1}^3\left[y_k\log\hat y_k+(1-y_k)\log(1-\hat y_k)\right].$$

Note. The softmax regression needs a different loss function and a different derivative.

Cross-entropy loss:

$$\mathcal{L}_{CE}=-\sum_{k=1}^3y_k\log\hat y_k.$$
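
A small sketch contrasting the two losses, with illustrative scores `z` and a one-hot label `y` (it’s a cat):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the column sums to 1.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

z = np.array([[2.0], [1.0], [-1.0]])   # scores for cat / lion / iguana
y = np.array([[1.0], [0.0], [0.0]])    # one-hot label: cat

# Softmax + cross-entropy: only the true class contributes to the loss.
y_hat = softmax(z)
loss_ce = float(-np.sum(y * np.log(y_hat)))

# Contrast: L_3N treats each class as an independent sigmoid unit and
# sums a separate logistic loss per class.
s = 1.0 / (1.0 + np.exp(-z))
loss_3n = float(-np.sum(y * np.log(s) + (1 - y) * np.log(1 - s)))

print(loss_ce, loss_3n)
```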

Neural Networks

Goal: image → cat (1) vs. no cat (0)

(Source: https://youtu.be/MfIjxPh6Pys 42 min. 37 sec.)

Q. How many parameters does this network have?

A. $(3N+3)+(2\times3+2)+(2\times1+1)$.
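
For a concrete number, assuming the flattened $64\times64\times3$ input from earlier so that $N=12288$: the first layer has 3 neurons with $N$ weights and one bias each, the second has 2 neurons with 3 inputs each, and the output neuron takes 2 inputs, giving

$$(3\cdot12288+3)+(2\cdot3+2)+(2\cdot1+1)=36867+8+3=36878.$$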

Definition

Layer … a cluster of neurons that are not connected to each other.

Hidden layer (2nd layer in the picture above)

Q. Why the word “hidden”?

A. Its values are not given in the training data: we observe the inputs and the outputs, but not what the layer in between computes.

Interpretation of layers

  • Neurons in the 1st layer … to understand low-level features of the image, such as edges.
  • Neurons in the 2nd layer … to use the edges from the previous layer to figure out more structurally complex parts such as ears or a mouth.
  • Neuron in the 3rd layer … to identify whether the image contains a cat.

House price prediction

  • number of bedrooms
  • size
  • zip code
  • wealth

(Source: https://youtu.be/MfIjxPh6Pys 48 min. 46 sec.)

Rather than explicitly representing relations between features, we construct the first layer as a fully-connected layer and let the network learn those relations.

(Source: https://youtu.be/MfIjxPh6Pys 50 min. 12 sec.)

cf. neural network ~ black box model ~ end-to-end learning

Propagation equations

$$\begin{aligned}z^{[1]}&=w^{[1]}x+b^{[1]}\\a^{[1]}&=\sigma\left(z^{[1]}\right)\\z^{[2]}&=w^{[2]}a^{[1]}+b^{[2]}\\a^{[2]}&=\sigma\left(z^{[2]}\right)\\z^{[3]}&=w^{[3]}a^{[2]}+b^{[3]}\\a^{[3]}&=\sigma\left(z^{[3]}\right)\end{aligned}$$
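
A sketch of these equations for a 3–2–1 architecture like the one in the picture, with randomly initialized parameters (the input size `n` and the scale 0.01 are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 12288                      # flattened input size (64*64*3)
rng = np.random.default_rng(0)

# Layer sizes: n -> 3 -> 2 -> 1, small random weights, zero biases.
w1, b1 = 0.01 * rng.standard_normal((3, n)), np.zeros((3, 1))
w2, b2 = 0.01 * rng.standard_normal((2, 3)), np.zeros((2, 1))
w3, b3 = 0.01 * rng.standard_normal((1, 2)), np.zeros((1, 1))

x = rng.random((n, 1))         # one flattened example

# Forward propagation, mirroring the equations above.
z1 = w1 @ x + b1;  a1 = sigmoid(z1)    # (3, 1)
z2 = w2 @ a1 + b2; a2 = sigmoid(z2)    # (2, 1)
z3 = w3 @ a2 + b3; a3 = sigmoid(z3)    # (1, 1), P(cat)
```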

Q. What happens for an input batch of $m$ examples?

$$X=\begin{pmatrix}\vert&\vert&&\vert\\x^{(1)}&x^{(2)}&\cdots&x^{(m)}\\\vert&\vert&&\vert\end{pmatrix}$$

→ Parallelize the equations across the batch.

$z^{[1]}=w^{[1]}X+b^{[1]}$, where $z^{[1]}$ has shape $(3,m)$ but $b^{[1]}$ has shape $(3,1)$.

→ Problem: Size mismatch

→ Solution? Broadcasting: duplicate $b^{[1]}$ column-wise $m$ times.

cf. The NumPy library supports broadcasting automatically.
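
A quick illustration of the broadcasting fix (shapes only; the numbers are arbitrary):

```python
import numpy as np

m = 5
w1 = np.random.randn(3, 4)          # layer-1 weights: 3 neurons, 4 inputs
b1 = np.random.randn(3, 1)          # bias: shape (3, 1)
X = np.random.randn(4, m)           # batch of m examples as columns

# (3, 4) @ (4, m) -> (3, m); adding (3, 1) broadcasts b1 across the
# m columns, i.e. NumPy implicitly duplicates it column-wise m times.
Z1 = w1 @ X + b1
print(Z1.shape)                     # (3, m)
```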

Q. How is this network different from principal component analysis?

A. This is a supervised learning algorithm used to predict housing prices, whereas principal component analysis is unsupervised and doesn’t predict anything.

Q. Day-night classification vs. cat classification. Which one is harder?

A. Cat. Because there are many breeds of cats while there are not many breeds of nights. 😝

cf. What is challenging in day/night classification? Figuring it out for indoor pictures. Imagine there’s a tiny window somewhere in the picture; the model should still be able to tell whether it is day or night.

→ The more data you need in order to figure out the output, the deeper the network should be.

Optimizing $w^{[1]},w^{[2]},w^{[3]},b^{[1]},b^{[2]},b^{[3]}$.

Define loss/cost function.

$$J(\hat y,y)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}^{(i)}\quad\text{with}\quad\mathcal{L}^{(i)}=-\left[y^{(i)}\log\hat y^{(i)}+\left(1-y^{(i)}\right)\log\left(1-\hat y^{(i)}\right)\right].$$
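
As a sketch, this cost can be computed over a whole batch in one vectorized line (here `y_hat` and `y` are illustrative row vectors of shape `(1, m)`):

```python
import numpy as np

y     = np.array([[1.0, 0.0, 1.0, 1.0]])   # labels, shape (1, m)
y_hat = np.array([[0.9, 0.2, 0.7, 0.6]])   # predictions, shape (1, m)

# J = (1/m) * sum of the per-example logistic losses.
m = y.shape[1]
J = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
print(J)
```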

Backward Propagation

$$\forall\,l=1,2,3:\quad\begin{cases}w^{[l]}\leftarrow w^{[l]}-\alpha\frac{\partial J}{\partial w^{[l]}}\\b^{[l]}\leftarrow b^{[l]}-\alpha\frac{\partial J}{\partial b^{[l]}}\end{cases}$$

e.g.,

$$\begin{aligned}\frac{\partial J}{\partial w^{[3]}}&=\frac{\partial J}{\partial a^{[3]}}\frac{\partial a^{[3]}}{\partial z^{[3]}}\frac{\partial z^{[3]}}{\partial w^{[3]}}\\\frac{\partial J}{\partial w^{[2]}}&=\frac{\partial J}{\partial z^{[3]}}\frac{\partial z^{[3]}}{\partial a^{[2]}}\frac{\partial a^{[2]}}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial w^{[2]}}\\\frac{\partial J}{\partial w^{[1]}}&=\frac{\partial J}{\partial z^{[2]}}\frac{\partial z^{[2]}}{\partial a^{[1]}}\frac{\partial a^{[1]}}{\partial z^{[1]}}\frac{\partial z^{[1]}}{\partial w^{[1]}}.\end{aligned}$$
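
For the sigmoid output and the logistic loss used above, the first chain can be carried out explicitly (a standard derivation, written here for a single example):

$$\frac{\partial\mathcal{L}}{\partial a^{[3]}}=-\frac{y}{a^{[3]}}+\frac{1-y}{1-a^{[3]}},\qquad\frac{\partial a^{[3]}}{\partial z^{[3]}}=a^{[3]}\left(1-a^{[3]}\right),\qquad\frac{\partial z^{[3]}}{\partial w^{[3]}}=a^{[2]T},$$

so the product telescopes to

$$\frac{\partial\mathcal{L}}{\partial w^{[3]}}=\left(a^{[3]}-y\right)a^{[2]T},$$

and averaging over the $m$ examples gives $\frac{\partial J}{\partial w^{[3]}}=\frac{1}{m}\sum_{i=1}^m\left(a^{[3](i)}-y^{(i)}\right)a^{[2](i)T}$.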
