[ML] 3. Neural Networks

실버버드 · October 19, 2024


Week 4-1. Neural Networks

  • Neuron unit

  • activation functions

(a) step/threshold function, (b) sigmoid function $\frac{1}{1 + e^{-x}}$
Changing the bias weight $W_{0,i}$ moves the threshold location.

Network structures: Feed-forward networks

1. Single-layer Perceptrons

uses a step or sigmoid function as the activation function
output units all operate separately; no shared weights
adjusting the weights changes the cliff's location, orientation, and steepness

  • Perceptron learning: Linearly-Separable Functions

can represent AND, OR, NOT, majority, etc., but not XOR (in 2-dimensional space; a minimal sketch below illustrates this)
$\sum_j W_j x_j > 0 \rightarrow 1$
$\sum_j W_j x_j \leq 0 \rightarrow 0$
where $x_j = 0$ or $1$

3-dimensional space: the minority function returns 1 if there are fewer 1s than 0s, and 0 otherwise.
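
As a quick illustration of linear separability, here is a minimal NumPy sketch (not from the lecture; the learning rate, epoch count, and function names are illustrative) that trains a perceptron with the mistake-driven update rule: it fits AND perfectly but cannot fit XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Perceptron learning rule: w <- w + lr * (y - prediction) * x."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend bias input x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi > 0 else 0      # step activation
            w += lr * (yi - pred) * xi         # update only on mistakes
    return w

def accuracy(w, X, y):
    X = np.hstack([np.ones((len(X), 1)), X])
    return np.mean((X @ w > 0).astype(int) == y)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and, y_xor = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 0])

print(accuracy(train_perceptron(X, y_and), X, y_and))  # 1.0: AND is linearly separable
print(accuracy(train_perceptron(X, y_xor), X, y_xor))  # < 1.0: XOR is not
```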

2. Multi-layer Perceptrons

the number of layers is counted excluding the input layer

combining threshold functions produces a ridge; combining ridges produces a bump

  • Learning with NNs

Recurrent networks
directed cycles with delays, internal state

Week 4-2. Neural Networks: Activation Functions, Building Networks

Activation functions

must be nonlinear and differentiable; preferably zero-centered and bounded
1. Sigmoid
   drawbacks: vanishing gradient, not zero-centered, computationally expensive
2. Tanh
   zero-centered & squashing, but still suffers from vanishing gradients
3. ReLU (Rectified Linear Unit)
   not zero-centered; the gradient vanishes (is zero) when x < 0
4. Leaky ReLU
   small nonzero slope for x < 0 (see the NumPy sketch below)
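
A small NumPy sketch of these activation functions; the Leaky ReLU slope `alpha=0.01` is an illustrative default, not a value from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # output in (0, 1), not zero-centered

def tanh(x):
    return np.tanh(x)                         # output in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                 # zero gradient for x < 0

def leaky_relu(x, alpha=0.01):                # alpha is an illustrative default slope
    return np.where(x > 0, x, alpha * x)      # small nonzero gradient for x < 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))
```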

Building a Neural Net

1. # of hidden layers (depth)
Decision boundary according to the number of layers:
1 hidden layer: boundary of a convex region
2 hidden layers: combinations of convex regions

2. # of units per hidden layer (width)
with hidden-layer size D and input size M:
D = M (same size as the input)
D < M: encoded (compressed) representation; feature extraction, classification
D > M: more diverse features, but can overfit

3. Types of activation function (nonlinearity)

4. Form of objective function
Regression: same objective as Linear Regression; quadratic loss (mean squared error)
Classification: same objective as Logistic Regression; cross-entropy (negative log-likelihood) with a softmax output layer (a small PyTorch sketch of these choices follows below)
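
A minimal PyTorch sketch, under assumed sizes (`M`, `D`, `num_classes` are made up), of how the four design choices above show up in code. This is one possible setup, not the lecture's architecture.

```python
import torch
import torch.nn as nn

# Design choices made explicit: depth (number of hidden layers), width
# (units per hidden layer), nonlinearity, and the objective function.
M, D, num_classes = 20, 64, 3           # illustrative input size, width, classes

model = nn.Sequential(                  # depth = 2 hidden layers
    nn.Linear(M, D), nn.ReLU(),         # hidden layer 1, ReLU nonlinearity
    nn.Linear(D, D), nn.ReLU(),         # hidden layer 2
    nn.Linear(D, num_classes),          # output layer (raw scores / logits)
)

# Objective: cross-entropy for classification (softmax is built in);
# for regression one would instead end with a single output and nn.MSELoss().
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, M)                   # a dummy mini-batch
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
print(loss.item())
```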

Lecture Week 5-1. Neural Networks: Back-propagation

Recall gradient descent algorithm

Forward- & Back-propagation

  • Computational graph
    nodes correspond to operations or variables

  • Functions as boxes

$(a + \epsilon) \rightarrow c + \text{gradient}\cdot\epsilon$: perturbing an input by a small $\epsilon$ changes the output $c$ by roughly the gradient times $\epsilon$.
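
A tiny `torch.autograd` sketch of the "functions as boxes" idea, using an illustrative box $c = a \cdot b$: perturbing the input $a$ by a small $\epsilon$ changes the output by roughly gradient $\times\ \epsilon$.

```python
import torch

# Treat c = f(a, b) = a * b as a "box": autograd records the graph and
# gives dc/da; a small perturbation of a changes c by roughly gradient * eps.
a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0)
c = a * b
c.backward()                                # dc/da = b = 4

eps = 1e-3
c_perturbed = (a.detach() + eps) * b
print(a.grad.item())                        # 4.0
print((c_perturbed - c.detach()).item())    # ~ gradient * eps = 0.004
```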

Linear classification with hinge loss (linear problem)

  • find the loss function and its gradient (problem setting)

$$Loss(x,y,w) = \max\{1 - w\cdot\phi(x)\,y,\ 0\}$$
$$\nabla_w Loss(x,y,w) = \begin{cases} -\phi(x)\,y & \text{if margin} < 1 \\ 0 & \text{if margin} \geq 1 \end{cases}$$
where margin $= w\cdot\phi(x)\,y$.
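
A short NumPy sketch of this hinge loss and its gradient; the weight and feature values are made-up numbers.

```python
import numpy as np

def hinge_loss_and_grad(w, phi_x, y):
    """Loss(x, y, w) = max{1 - w.phi(x).y, 0}; gradient is -phi(x).y if margin < 1."""
    margin = w @ phi_x * y
    if margin < 1:
        return 1 - margin, -phi_x * y
    return 0.0, np.zeros_like(w)

w = np.array([0.5, -1.0])                  # illustrative weights and feature vector
phi_x, y = np.array([1.0, 2.0]), 1
loss, grad = hinge_loss_and_grad(w, phi_x, y)
print(loss, grad)                          # margin = -1.5 < 1, so loss = 2.5, grad = [-1, -2]
```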

Two-layer neural networks

$$h(x) = \sigma(V\phi(x)),\qquad f_{V,w}(x) = \text{sign}(w\cdot h(x))$$
$$Loss(x,y,V,w) = (w\cdot\sigma(V\phi(x)) - y)^2 = (\hat{y} - y)^2 = (residual)^2$$
$$\nabla_w Loss(x,y,V,w) = 2(residual)\,h$$
$$\nabla_V Loss(x,y,V,w) = 2(residual)\, w \circ h \circ (1-h)\, \phi(x)^T \quad (\circ:\ \text{elementwise product})$$
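
A NumPy sketch that evaluates these two-layer loss and gradient formulas; the layer sizes and random values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_loss_and_grads(V, w, phi_x, y):
    """Squared loss of a two-layer net, with the gradients from the formulas above."""
    h = sigmoid(V @ phi_x)                     # hidden activations h = sigma(V phi(x))
    residual = w @ h - y                       # yhat - y
    loss = residual ** 2
    grad_w = 2 * residual * h
    grad_V = 2 * residual * np.outer(w * h * (1 - h), phi_x)  # elementwise w∘h∘(1-h), times phi(x)^T
    return loss, grad_w, grad_V

rng = np.random.default_rng(0)
V, w = rng.normal(size=(3, 4)), rng.normal(size=3)   # illustrative sizes
phi_x, y = rng.normal(size=4), 1.0
loss, gw, gV = two_layer_loss_and_grads(V, w, phi_x, y)
print(loss, gw.shape, gV.shape)                      # scalar, (3,), (3, 4)
```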

Backpropagation

1. Backpropagation for a logistic regression model, i.e., a 1-layer neural network

In the accompanying computational-graph figure, orange marks the forward-pass values and purple the backward pass:
1 -> 1 * 2(residual) = 6 -> 6 * 1 = 6 -> 6 * [1, 2] = [6, 12]
i.e., each backward step multiplies the upstream (incoming) gradient by the local gradient.

  • Derive the formulas for the gradient descent algorithm

Cost function: $J = -y\log a - (1 - y)\log(1 - a)$, with $z = w_1 x_1 + w_2 x_2 + b$
$\hat{y} = h_\theta(x) = \sigma(z) = a$, so $J = L(a, y)$

$\displaystyle \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w_1}$

$\displaystyle \frac{\partial L}{\partial a} = \frac{-y}{a} + \frac{1-y}{1-a},\quad \frac{\partial a}{\partial z} = a(1 - a),\quad \frac{\partial z}{\partial w_1} = x_1$

$$w_1 = w_1 - \alpha\frac{\partial L}{\partial w_1} = w_1 - \alpha (a-y)x_1$$
$$w_2 = w_2 - \alpha\frac{\partial L}{\partial w_2} = w_2 - \alpha (a-y)x_2$$
$$b = b - \alpha\frac{\partial L}{\partial b} = b - \alpha(a-y)$$
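
A minimal sketch of one gradient-descent step that applies these updates; the inputs and learning rate `alpha` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One gradient-descent step for logistic regression, using the derived
# updates dL/dw_j = (a - y) x_j and dL/db = (a - y).
def gd_step(w1, w2, b, x1, x2, y, alpha=0.1):      # alpha is an illustrative learning rate
    a = sigmoid(w1 * x1 + w2 * x2 + b)             # forward pass: a = sigma(z)
    w1 -= alpha * (a - y) * x1
    w2 -= alpha * (a - y) * x2
    b  -= alpha * (a - y)
    return w1, w2, b

print(gd_step(0.5, -0.5, 0.0, 1.0, 2.0, 1))
```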

2. Backpropagation for a 2-layer Neural Network

Lecture Week 7-1. Neural Network: Practice building Neural Networks w/MNIST

MNIST number classification

MNIST: Modified National Institute of Standards and Technology database
0-9 handwritten digit recognition
train:test = 6:1 (60,000 training images, 10,000 test images)
Figure: the multilayer perceptron architecture for MNIST used in today's practice

Data Normalization/ Feature Scaling: Data Preprocessing

Advantages: classification loss becomes less sensitive to small changes in weights; easier to optimize
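
One common way to do this preprocessing for the MNIST practice is with torchvision transforms, sketched below; the mean/std pair (0.1307, 0.3081) is the commonly quoted MNIST statistic, assumed here rather than taken from the lecture.

```python
from torchvision import datasets, transforms

# Scale pixel values to [0, 1] with ToTensor(), then standardize with the
# commonly quoted MNIST mean/std (0.1307, 0.3081); values assumed, not from the lecture.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
x, _ = train_set[0]
print(x.mean().item(), x.std().item())   # roughly zero mean, unit-ish std
```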

Mini-batch SGD

after creating mini-batches of a fixed size, in one epoch:
1. Pick a mini-batch (e.g., 100 examples)
2. Feed it to the neural network
3. Calculate the mean gradient over the mini-batch
4. Update the weights using the mean gradient
5. Repeat steps 1-4 for the remaining mini-batches

: a balance between the robustness of SGD and the efficiency of batch gradient descent (BGD)
smaller batch size -> more updates, more time, lower speed
a very large batch size can yield worse performance
A sketch of one epoch following these steps appears below.
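
A sketch of one epoch of mini-batch SGD following these steps, with an illustrative 784-128-10 MLP, batch size 100, and learning rate 0.01 (the model and hyperparameters are assumptions, not the lecture's).

```python
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Normalized MNIST training set (same preprocessing as in the sketch above).
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# One epoch of mini-batch SGD with batch size 100, following steps 1-5 above.
loader = DataLoader(train_set, batch_size=100, shuffle=True)   # 1. pick mini-batches
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # 2-3. forward pass; loss is the batch mean
    loss.backward()                           #      backprop yields the mean gradient
    optimizer.step()                          # 4. update weights using the mean gradient
```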

Cross-entropy loss for classification

  • Binary cross-entropy loss
    $\displaystyle H_{y'}(y) := -\sum_i\big(y'_i \log(y_i) + (1 - y'_i)\log(1 - y_i)\big)$
    where $y_i$ is the predicted probability for class $i$ and $y'_i$ is the true probability for class $i$

  • Cross-entropy loss for multi-class classification
    $\displaystyle H_{y'}(y) := -\sum_i y'_i \log(y_i)$
    If the target is 1, the loss is 0 when the prediction is 1 and tends to infinity as the prediction approaches 0

  • Cross-entropy in PyTorch
    nn.CrossEntropyLoss combines nn.LogSoftmax() and nn.NLLLoss() (verified in the short sketch below)
    recall softmax: $\displaystyle \sigma(z)_j = \frac{\exp(z_j)}{\sum^K_{k=1}\exp(z_k)}$ for $j = 1, \dots, K$
    recall cross-entropy: $D(\hat{Y}, Y) = -Y\log\hat{Y}$, where $\hat{Y} = S(Z)$ is the softmax of the logits $Z$
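
A short check, on an illustrative batch of random logits, that nn.CrossEntropyLoss gives the same value as nn.LogSoftmax followed by nn.NLLLoss.

```python
import torch
from torch import nn

logits = torch.randn(4, 10)               # raw scores for a batch of 4, 10 classes
targets = torch.randint(0, 10, (4,))

# nn.CrossEntropyLoss applies LogSoftmax + NLLLoss internally,
# so the two computations below give the same value.
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
print(torch.allclose(ce, nll))            # True
```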


Training, Validation, Test Set

  • Train data
  • Validation data: use the training error and validation error to determine when to stop training and prevent overfitting (early stopping; a sketch follows below)
  • Test data: measures the generalization ability of a machine learning algorithm; never seen during the training process
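
A self-contained early-stopping sketch on synthetic data (the model size, patience, and learning rate are illustrative assumptions): training continues while the validation loss improves and stops, restoring the best weights, once it no longer does.

```python
import copy
import torch
from torch import nn, optim

# Synthetic train/validation split; the data and model are illustrative only.
torch.manual_seed(0)
x_train, y_train = torch.randn(200, 5), torch.randint(0, 2, (200,))
x_val, y_val = torch.randn(50, 5), torch.randint(0, 2, (50,))

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
criterion, optimizer = nn.CrossEntropyLoss(), optim.SGD(model.parameters(), lr=0.1)

best_val, best_state, patience, bad = float("inf"), None, 3, 0
for epoch in range(100):
    optimizer.zero_grad()
    criterion(model(x_train), y_train).backward()   # training step on the training data
    optimizer.step()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()   # monitor validation error
    if val_loss < best_val - 1e-4:
        best_val, best_state, bad = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad += 1
        if bad >= patience:            # validation error stopped improving: stop early
            break
model.load_state_dict(best_state)      # keep the weights with the lowest validation loss
```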
