2023 KISS Summer School Note

김당찬 · July 10, 2023

Introduction to DL

Traditional linear models > suffer at specific tasks (e.g., MNIST)
- Why : they are designed for low-dimensional data
- At higher dimensions (e.g., image, video data) : the curse of dimensionality occurs
- How about principal components?
: The first PC usually captures the overall mean (it cannot capture small-scale features)
How to overcome > use a basis function $f:\mathbb{R}^{p}\to \mathbb{R}^{d}$ (kernel function)
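The basis-function idea can be sketched in a few lines of NumPy. This is a hypothetical illustration (the quadratic feature map `phi` and the toy data are my own, not from the lecture): expand the inputs with a fixed basis, then fit an ordinary linear model on the expanded features.

```python
import numpy as np

def phi(x):
    """Hypothetical basis expansion f: R^p -> R^d (coordinates plus pairwise products)."""
    outer = np.outer(x, x)
    return np.concatenate([x, outer[np.triu_indices(len(x))]])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                               # n = 100 inputs in R^5
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=100)     # nonlinear target

Phi = np.stack([phi(x) for x in X])                         # design matrix in the expanded space
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # ordinary least squares on the basis
print("train MSE:", np.mean((Phi @ beta - y) ** 2))
```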

Neural Network

1-layer Neural Network

$$\begin{aligned} H^{(1)} &= g^{(1)}(W^{(1)}x + w_{0}^{(1)})\\ \Lambda &= g^{(\lambda)}(W^{(\lambda)}H^{(1)}+w_{0}^{(\lambda)})\\ Y &\sim p(y;\lambda) \end{aligned}$$

The output activation $g^{(\lambda)}$ is determined by the distribution of $Y$.

A neural network is equivalent to a GLM with a data-driven basis function.

PCA is also data-driven, but it is (1) linear and (2) does not consider $Y$.

  • $g$ is nonlinear (e.g., ReLU : piecewise linear); a minimal forward-pass sketch follows below
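As a concrete illustration of the equations above, here is a minimal NumPy sketch of a one-hidden-layer forward pass. The dimensions, the ReLU choice for $g^{(1)}$, and the identity output activation (for a Gaussian $Y$) are illustrative assumptions, not lecture code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
p, d = 10, 32                                  # input and hidden dimensions (illustrative)
W1, w01 = rng.normal(size=(d, p)), np.zeros(d)
Wl, w0l = rng.normal(size=(1, d)), np.zeros(1)

x = rng.normal(size=p)
H1 = relu(W1 @ x + w01)                        # H^(1) = g^(1)(W^(1) x + w0^(1))
Lam = Wl @ H1 + w0l                            # Lambda: identity g^(lambda) for Gaussian Y
# Y ~ N(Lam, sigma^2): Lam is the GLM mean built on the data-driven basis H^(1)
```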

Deep Neural Network : Multilayer Neural Network

  • Universal Approximation Theorem
  • Model fitting : MLE
    Let the vector $W$ contain all weight matrices, and let $(x_{1},y_{1}),\ldots,(x_{n}, y_{n})$ be the training data.

Then the cost function for $Y\sim N(\Lambda,\sigma^{2})$ is

$$C(W)=-l(W)\propto \frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\Lambda(x_{i};W)\right)^{2}$$

If $Y\sim \text{Multi}(\Lambda)$, the negative log-likelihood becomes

$$C(W) \propto -\sum_{i}\sum_{k}y_{ik}\log p_{k}(x_{i};W)$$
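Both cost functions are easy to write down numerically. A small NumPy sketch with made-up toy arrays (not lecture data):

```python
import numpy as np

# Gaussian response: cost is (proportional to) the mean squared error
y = np.array([1.2, -0.3, 0.7])
lam = np.array([1.0, 0.0, 0.5])                        # network outputs Lambda(x_i; W)
mse_cost = np.mean((y - lam) ** 2)

# Multinomial response: cost is the negative log-likelihood (cross-entropy)
y_onehot = np.array([[1, 0, 0], [0, 1, 0]])            # y_ik
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])       # p_k(x_i; W), rows sum to 1
ce_cost = -np.sum(y_onehot * np.log(p))
print(mse_cost, ce_cost)
```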

How to optimize?

  • For neural nets : Newton-Raphson is infeasible (the Hessian is too costly to compute)
  • Backpropagation : the gradient is easy to compute and easy to code

Gradient Descent

  • Adam, AdaGrad, RMSprop (a minimal optimizer-step sketch follows this list)
  • Neural nets depend on gradients, i.e., the activation function should have well-behaved derivatives
  • Gradient saturation problem
    - With sigmoid or hyperbolic-tangent activation functions
    : the product of partial derivatives converges to 0 (vanishing gradients)
    - Solution
    : Rectified Linear Unit (ReLU)
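A minimal PyTorch sketch of one gradient-descent step with a ReLU network. The architecture, the Adam choice, and the toy data are illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)    # could also be SGD, Adagrad, RMSprop

x, y = torch.randn(16, 10), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)             # C(W) for Gaussian Y
opt.zero_grad()
loss.backward()                                        # backpropagation computes the gradient
opt.step()                                             # one gradient-descent update of W
```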

What kind of Hidden Layer then?

Image, Video : CNN
Text : LSTM, Transformer
Density estimation : Normalizing Flow

Lecture 2

Overfitting Problem

Neural nets are highly flexible, but they easily suffer from overfitting.
Solutions :
- Shrinkage penalty
- Dropout
- Batch training
- Early Stopping

Penalization

L1, L2 regularization (Lasso, Ridge) : add a regularization term to the negative log-likelihood
e.g., an L2 term with a categorical response

$$C(W) \propto -\sum_{i}\sum_{k}y_{ik}\log p_{k}(x_{i};W)+\lambda\Vert W\Vert^{2}$$

: works as a shrinkage estimator
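A hedged PyTorch sketch of adding the L2 penalty to the cross-entropy cost; the network, data, and the value of λ (`lam`) are arbitrary illustrations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
lam = 1e-3                                             # penalty strength (illustrative)

nll = nn.functional.cross_entropy(model(x), y)         # negative log-likelihood term
l2 = sum((w ** 2).sum() for w in model.parameters())   # ||W||^2 over all weights
cost = nll + lam * l2
# optimizers can apply the same kind of shrinkage via their weight_decay argument
```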

Dropout

  • Spike-and-slab interpretation
  • Apply a Bernoulli random variable (mask) to every layer; a minimal sketch follows this list
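A minimal sketch of dropout as a Bernoulli mask; the keep probability and the activations are made-up values (in practice `torch.nn.Dropout` does this during training):

```python
import torch

torch.manual_seed(0)
h = torch.randn(4, 32)                       # hidden activations H^(1)
keep = 0.8                                   # 1 - dropout rate (illustrative)
mask = torch.bernoulli(torch.full_like(h, keep))
h_drop = h * mask / keep                     # inverted dropout: rescale so E[h_drop] = h
# equivalent built-in: torch.nn.Dropout(p=0.2) applied in training mode
```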

Batch Training

Split the whole training data into $B$ batches, and take a gradient-descent step on every batch.

  • epoch : One cycle over all batches
  • Batch : Sampling idea >> for statisticians?

Stochastic Gradient Descent(SGD)

Due to dropout and batch training, the likelihood and the cost function change randomly at every training step.
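A minimal PyTorch sketch of mini-batch SGD with dropout; the data, batch size, and architecture are illustrative:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=100, shuffle=True)   # B = 10 batches

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for epoch in range(5):                       # one epoch = one cycle over all batches
    for xb, yb in loader:                    # gradient step on each random mini-batch
        loss = nn.functional.mse_loss(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
```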

Data Split

  • Training data : Used to calculate gradient
  • Validation data : not used to calculate the gradient; instead used to evaluate the cost function and determine whether overfitting occurs

Early Stopping

  • If the validation cost doesn't improve for a specified number of epochs (the patience), stop training; a minimal sketch follows below
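A minimal early-stopping loop. `train_one_epoch` and `validation_cost` are placeholders standing in for the real training step and validation evaluation (not lecture code):

```python
import math

def train_one_epoch():                        # placeholder: one pass of mini-batch SGD
    pass

def validation_cost(epoch):                   # placeholder: cost C(W) on validation data
    return 1.0 / (epoch + 1) + 0.01 * max(0, epoch - 20)

patience, best, wait = 5, math.inf, 0
for epoch in range(100):
    train_one_epoch()
    cost = validation_cost(epoch)
    if cost < best:
        best, wait = cost, 0                  # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:                  # no improvement for `patience` epochs
            print(f"early stopping at epoch {epoch}")
            break
```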

Algorithm Summary

  • Initialize : generate $W_{0}$ from a probability distribution (prior)

Validation Method for Binary Y

  • Classification rule mapping to 0 or 1
    - From the model : we obtain $P(Y=1|X)$
    - How to turn this into a prediction of 0 or 1? : set a threshold
    : a larger threshold lowers both TPR and FPR

    > We should not fix the threshold at one constant; instead, observe how the ROC curve changes as the threshold varies
  • The ROC curve is therefore important : AUC-ROC (a sketch follows this list)
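A minimal scikit-learn sketch of sweeping the threshold to get the ROC curve and AUC; the labels and predicted probabilities are simulated for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                               # binary labels
p_hat = np.clip(0.3 * y_true + 0.7 * rng.uniform(size=200), 0, 1)   # model's P(Y=1|X)

fpr, tpr, thresholds = roc_curve(y_true, p_hat)    # TPR/FPR as the threshold varies
auc = roc_auc_score(y_true, p_hat)
print(f"AUC-ROC = {auc:.3f}")
```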

For Continuous Y

  • MSE alone cannot describe the full quality of the fit
  • Draw a predicted-vs-observed scatter plot on the test data

Lecture 3. CNN and Transformers

Convolutional Neural Network

  • Used when the input data $X$ is spatial
  • Feedforward DNN : Input > Feature Extraction Layer > Fully connected(Dense) Layer > Output

Convolution

: Taking local weighted averages to create a summary image

Preserves spatial structure with far fewer parameters.

Pooling

: Downsampling for (approximate) translation invariance
A traditional fixed kernel (e.g., Gaussian) only smooths the image, so it may not be useful for tasks such as image classification.

Channel and Filter

Input channels : e.g., an RGB image gives 3 input channels for the layer
The filter's kernel values are also estimated during the training process

  • N output channels produce N output images from the convolutional layer

Stride, Filter(kernel size), Padding

  • Stride : Step size for each slide
  • Filter or kernel size : width and height of kernel
  • Padding : additional rows and columns to adjust the size of the resulting image (a minimal sketch of these options follows below)
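A minimal PyTorch sketch of a convolution plus pooling layer showing channels, kernel size, stride, and padding; all sizes are illustrative:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2)           # downsampling by a factor of 2

x = torch.randn(1, 3, 32, 32)                # one RGB image (3 channels), 32 x 32
h = pool(conv(x))                            # convolution, then pooling
print(h.shape)                               # torch.Size([1, 8, 16, 16]): 8 output images
# conv.weight holds the 8 x 3 x 3 x 3 kernel values, estimated during training
```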

Transformer

  • Setting
    $\mathbf{x} = [x_{1},\ldots,x_{T}]$ : $T$ is the number of words in the text $\mathbf{x}$

  • Embedding : transform each word into a vector (a small sketch follows this list)

  • In deep learning : the embedding vectors are treated as parameters of the model and learned with SGD like everything else

  • Text data analysis is just a special case of time series analysis.
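A minimal PyTorch sketch of a trainable word embedding; the vocabulary size, embedding dimension, and token indices are made up:

```python
import torch
import torch.nn as nn

vocab_size, d = 10000, 64                    # illustrative vocabulary and embedding sizes
embed = nn.Embedding(vocab_size, d)          # the embedding matrix is a trainable parameter

tokens = torch.tensor([[12, 5, 873, 9]])     # one text with T = 4 word indices (made up)
x = embed(tokens)                            # shape (1, 4, 64): x_1, ..., x_T as d-dim vectors
```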

Self-Attention Layer

  • $x_{t}$ is the $d$-dimensional vector of the $t$-th word

  • Linear transformation of $x_t$ (the value) : $V_{t}=W_{v}x_{t}+w_{v}$

  • Self-attention:

    $$sa(x_{t}) = \sum_{u=1}^{T} a(x_{u},x_{t})V_{u}$$

    where $a(x_{u},x_{t})>0$ and $\sum_{u}a(x_{u},x_{t})=1$ ; $a(x_{u},x_{t})$ is the attention that the $t$-th word gives to $x_{u}$.

  • Calculating $a(x_{u},x_{t})$

    query $Q_{t} = W_{q}x_{t}+ w_{q}$
    key $K_{t}=W_{k}x_{t}+w_{k}$
    Then, we calculate as follows (a minimal sketch follows below):

    $$a(x_{u},x_{t}) = \mathrm{softmax}_{u}(K_{u}^{T}Q_{t})$$
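A minimal sketch of self-attention in PyTorch, following the formulas above. The dimensions are illustrative, the bias terms are omitted for brevity, and the common $1/\sqrt{d}$ scaling is left out since the formula above does not include it:

```python
import torch

torch.manual_seed(0)
T, d = 4, 8                                  # illustrative: 4 words, 8-dim embeddings
x = torch.randn(T, d)                        # x_1, ..., x_T

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))    # bias terms omitted for brevity
Q, K, V = x @ Wq.T, x @ Wk.T, x @ Wv.T                 # Q_t, K_t, V_t for every t

scores = K @ Q.T                             # entry (u, t) is K_u^T Q_t
A = torch.softmax(scores, dim=0)             # softmax over u: each column sums to 1
sa = A.T @ V                                 # row t is sum_u a(x_u, x_t) V_u
```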

Position Encoding

: The positional information of the words is not contained in $x_{t}$ itself.

  • Absolute position embedding (an embedding matrix indexed by word order)
  • Relative position embedding

Multi-head Self Attention

Define $H$ separate self-attention heads, $h=1,\cdots,H$
: this makes it possible to extract various kinds of relationships within the text

Transformer Layer

  • Residual multi-head attention > Layer Normalization > Residual Dense Layers > Layer Normalization (a minimal sketch follows below)
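A minimal PyTorch sketch of that layer ordering using `nn.MultiheadAttention`; the embedding dimension, number of heads, and hidden width of the dense block are illustrative:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Minimal sketch of the layer described above (all sizes are illustrative)."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.dense = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                      # x: (batch, T, d)
        a, _ = self.mha(x, x, x)               # multi-head self-attention
        x = self.norm1(x + a)                  # residual connection + layer normalization
        x = self.norm2(x + self.dense(x))      # residual dense layers + layer normalization
        return x

out = TransformerLayer()(torch.randn(2, 10, 64))   # two texts of T = 10 tokens each
```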