https://www.youtube.com/watch?v=CS4cs9xVecg&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0
These are my notes from taking Week 1 ~ Week 2 of Andrew Ng's Machine Learning Course 01.
Tiny neural network:

By stacking many neurons, you can get a bigger neural network:

X is the input layer, y is the output layer, and the neurons between X and y form the hidden layer.

→ selecting X and y is important


rise of neural networks → computers can understand unstructured data much better!

m: number of training examples (size of training data)
⇒ Scale of data (m) drives deep learning progress!

using the sigmoid function → at both ends of the curve the gradient is close to 0, so learning becomes very slow (the parameters change only slowly)

using ReLU → gradient descent became much faster than with the sigmoid
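A tiny numpy sketch (my own illustration, not course code) of why the ReLU gradient does not vanish:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 10.0
sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
relu_grad = 1.0 if z > 0 else 0.0             # ReLU'(z) for z != 0
print(sigmoid_grad)                           # ~4.5e-05: almost zero, so learning is slow
print(relu_grad)                              # 1.0: the gradient stays useful for z > 0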
forward propagation and backward propagation

pixel intensity values → to convert them into an input feature vector x, we first have to define the feature vector x (unroll all pixel values into one long vector)
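For example, a sketch with a hypothetical 64 x 64 RGB image:

import numpy as np

image = np.random.rand(64, 64, 3)   # pixel intensity values of a 64 x 64 RGB image (made up)
x = image.reshape(64 * 64 * 3, 1)   # unroll into a single feature vector x, shape (12288, 1)
n_x = x.shape[0]                    # n_x = 12288 input features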

define a matrix X and stack m training examples as columns
define the matrix Y the same way, with the m labels stacked into one (1, m) row
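A quick sketch of the resulting shapes (sizes here are hypothetical):

import numpy as np

n_x, m = 12288, 100                        # number of features, number of training examples
X = np.random.rand(n_x, m)                 # column X[:, i] is the i-th training example x^(i)
Y = np.random.randint(0, 2, size=(1, m))   # labels stacked into a (1, m) row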

Given the input feature vector x, we want y^ to be P(y=1|x), i.e. the chance that the given x is a cat picture.
given x and the parameters w and b, how do we generate the output y^?
compute the loss by comparing the model output y^ with the ground truth y
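A minimal sketch of this forward pass plus the cross-entropy loss for one example (sizes and values are made up):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x = 4
x = np.random.rand(n_x, 1)    # input feature vector
w = np.zeros((n_x, 1))        # parameters
b = 0.0
y = 1                         # ground truth label

y_hat = sigmoid(np.dot(w.T, x) + b)   # y^ = sigmoid(w^T x + b), an estimate of P(y=1|x)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # L(y^, y)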

terminology:

cost function J(w, b) as a function of the parameters w, b
⇒ the goal is to find the w, b that minimize J(w, b).
a single minimum → convex function
many local minima → nonconvex function
gradient descent: starting from the initialized value, step downhill until you reach the minimum!

: actual update of w and b: repeat { w := w − α·(dJ/dw), b := b − α·(dJ/db) } with learning rate α
the d used for the derivative is sometimes written with the partial-derivative symbol ∂ instead
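A toy sketch of the update rule on a made-up convex cost J(w, b) = (w - 3)^2 + (b + 1)^2 (not the logistic regression cost), just to show the descent:

w, b = 0.0, 0.0            # initialized values
alpha = 0.1                # learning rate
for _ in range(100):
    dw = 2 * (w - 3)       # dJ/dw
    db = 2 * (b + 1)       # dJ/db
    w = w - alpha * dw     # w := w - alpha * dJ/dw
    b = b - alpha * db     # b := b - alpha * dJ/db
print(w, b)                # approaches the minimum at (3, -1)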

derivative = slope = height/width
at a = 2, the slope of f(a) = 3a is 3
at a = 5, the slope is also 3

at a = 2, the slope of f(a) = a^2 is 4
at a = 5, the slope is 10
derivative of a^2 is 2a (calculus)
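A quick numerical check of these slopes, nudging a by 0.001:

def f(a):
    return a ** 2

eps = 0.001
for a in (2.0, 5.0):
    slope = (f(a + eps) - f(a)) / eps   # height / width for a tiny nudge
    print(a, slope)                     # ~4.001 at a = 2, ~10.001 at a = 5 (analytic answer: 2a)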



if you change a, the value of v changes, which in turn changes J.
to compute dJ/da, we first compute dJ/dv and dv/da ⇒ backward propagation

dJ/du is dJ/dv * dv/du
using chain rule, dJ/db = dJ/du * du/db
dJ/dc = dJ/du * du/dc
⇒ to compute all the derivatives, the most efficient way is to compute from right to left (backward propagation)
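A sketch of this graph, assuming the lecture's example J = 3(a + b*c) with u = b*c, v = a + u and a = 5, b = 3, c = 2:

# forward pass (left to right)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# backward pass (right to left), chain rule at every step
dJ_dv = 3.0
dJ_da = dJ_dv * 1.0   # dv/da = 1
dJ_du = dJ_dv * 1.0   # dv/du = 1
dJ_db = dJ_du * c     # du/db = c, so dJ/db = 6
dJ_dc = dJ_du * b     # du/dc = b, so dJ/dc = 9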

→ draw computational graph

→ modify w and b to reduce L(a, y) by going backward to compute derivatives
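A sketch of that backward step for a single example with two features (made-up numbers; the key result from the lecture is dz = a - y):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x1, x2 = 1.0, 2.0               # feature values (made up)
w1, w2, b = 0.01, -0.02, 0.0    # current parameters
y = 1                           # ground truth

# forward: compute the loss L(a, y)
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward: derivatives of the loss
dz = a - y          # dL/dz for the cross-entropy loss with a sigmoid output
dw1 = x1 * dz       # dL/dw1
dw2 = x2 * dz       # dL/dw2
db = dz             # dL/db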

cost function J(w, b) is the average of the individual losses: J(w, b) = (1/m) Σ_i L(a^(i), y^(i))
so the derivative of the cost function J(w, b) is also the average of the derivatives of the individual losses.

we use dw_1, dw_2, db as accumulators.
after the loop, we can update the parameters w_1, w_2, b.
this requires explicit for loops (one over the m examples, another over the features), as in the sketch below
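A sketch of that accumulator loop for one gradient-descent step (two features, made-up data):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = 4
X = np.random.rand(2, m)                  # one training example per column (made up)
Y = np.array([0, 1, 0, 1])
w1, w2, b, alpha = 0.0, 0.0, 0.0, 0.01

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0      # accumulators
for i in range(m):                        # explicit loop over the m examples
    z = w1 * X[0, i] + w2 * X[1, i] + b
    a = sigmoid(z)
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]
    dw1 += X[0, i] * dz                   # with more features this needs a second for loop
    dw2 += X[1, i] * dz
    db += dz
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m   # averages over the training set

# after the loop: update the parameters once
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db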
Vectorization can get rid of for loops → efficient!

Non-vectorized implementations → for loops → very slow
vectorized → faster
Whenever possible, avoid explicit for-loops:
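For example, a dot product of two length-1,000,000 vectors, roughly reproducing the lecture's timing demo (exact numbers depend on the machine):

import time
import numpy as np

n = 1000000
a = np.random.rand(n)
b = np.random.rand(n)

tic = time.time()
c = np.dot(a, b)                  # vectorized
toc = time.time()
print("vectorized:", 1000 * (toc - tic), "ms")

tic = time.time()
c = 0.0
for i in range(n):                # explicit for loop
    c += a[i] * b[i]
toc = time.time()
print("for loop:", 1000 * (toc - tic), "ms")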




define X: the m examples stacked as columns, shape (n_x, m)
and define Z = w^T X + b, shape (1, m)
and define A = sigmoid(Z), shape (1, m)

based on the definition of dZ (dZ = A − Y), we can compute it with one line of code!

→ implementing logistic regression
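Putting it together, a sketch of one vectorized iteration of gradient descent for logistic regression (sizes are hypothetical):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m, alpha = 3, 5, 0.01                  # features, examples, learning rate (made up)
X = np.random.rand(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))
w = np.zeros((n_x, 1))
b = 0.0

# one iteration, no explicit loop over the m examples
Z = np.dot(w.T, X) + b        # (1, m)
A = sigmoid(Z)                # (1, m)
dZ = A - Y                    # (1, m): the one-line dZ
dw = np.dot(X, dZ.T) / m      # (n_x, 1)
db = np.sum(dZ) / m
w = w - alpha * dw
b = b - alpha * db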

import numpy as np
A = np.array([[56.0, 0.0, 4.4, 68.0],    # example matrix: calories from carbs,
              [1.2, 104.0, 52.0, 8.0],   # protein, and fat in four foods
              [1.8, 135.0, 99.0, 0.9]])
cal = A.sum(axis = 0)                     # column sums, shape (4,)
percentage = 100*A/(cal.reshape(1,4))     # (3, 4) divided by (1, 4)
# you don't have to call the reshape command.
# why? broadcasting: dividing by cal with shape (4,) gives the same result.


eliminate rank 1 arrays!!
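For example, the kind of shape bug the course warns about, and the fix:

import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,): avoid this
print(a.shape, a.T.shape)     # (5,) (5,): transposing does nothing, a common source of bugs

a = np.random.randn(5, 1)     # explicit column vector, shape (5, 1)
assert a.shape == (5, 1)      # cheap sanity check on shapes
print(np.dot(a, a.T).shape)   # (5, 5) outer product, as expected

b = np.random.randn(3)
b = b.reshape(3, 1)           # if a rank 1 array shows up anyway, reshape it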



I learned a lot, thank you.