Introduction to DL
Traditional Linear Model > suffers at specific tasks (e.g. MNIST)
- Why : designed for low-dimensional data
- In higher dimensions (e.g. image, video data) : the curse of dimensionality occurs
- How about principal components?
: The first PC usually captures only the overall mean (it cannot capture small-scale features)
How to overcome > use a basis function $f:\mathbb{R}^p \to \mathbb{R}^d$ (kernel function)
Neural Network
1-layer Neural Network
$H^{(1)} = g^{(1)}(W^{(1)}x + w_0^{(1)}), \quad \Lambda = g^{(\Lambda)}(W^{(\Lambda)}H^{(1)} + w_0^{(\Lambda)}), \quad Y \sim p(y;\Lambda)$
The output activation $g^{(\Lambda)}$ is determined by the distribution of $Y$ (e.g. identity for Gaussian, softmax for multinomial)
A neural network is equivalent to a GLM with a data-driven basis function
PCA is also data-driven, but it is (1) linear and (2) does not consider $Y$.
- $g$ is nonlinear (e.g. ReLU : piecewise linear)
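A minimal numpy sketch of the 1-layer network above (sizes, initial values, and the choice of ReLU are illustrative, not prescribed by the lecture); it shows the hidden layer acting as a data-driven basis expansion feeding a GLM-style output layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: p input features, d hidden units, n observations.
p, d, n = 5, 16, 100
X = rng.normal(size=(n, p))

# First-layer parameters: W1 (d x p), w0 (d,) -- the data-driven basis.
W1 = 0.1 * rng.normal(size=(d, p))
w0 = np.zeros(d)

# Output-layer parameters; identity output activation for a Gaussian Y.
W2 = 0.1 * rng.normal(size=(1, d))
w2_0 = np.zeros(1)

def relu(z):
    return np.maximum(z, 0.0)

# H1 = g1(W1 x + w0): nonlinear, data-driven basis expansion of x.
H1 = relu(X @ W1.T + w0)        # shape (n, d)

# Lambda = g_out(W2 H1 + w2_0): the GLM-like output layer.
Lam = H1 @ W2.T + w2_0          # shape (n, 1)
```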
Deep Neural Network : Multilayer Neural Network
- Universal Approximation Theorem : a sufficiently wide network can approximate a broad class of functions
- Model Fitting : MLE
Let $W$ be the vector containing all weight matrices, and $(x_1,y_1),\dots,(x_n,y_n)$ the training data.
Then, for $Y \sim N(\Lambda, \sigma^2)$, the cost function is
$C(W) = -l(W) \propto \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \Lambda(x_i; W)\right)^2$
If $Y \sim \mathrm{Multi}(\Lambda)$, the negative log-likelihood becomes
$C(W) \propto -\sum_{i}\sum_{k} y_{ik}\log p_k(x_i; W)$
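A small numpy sketch of the two cost functions above (the helper names and toy numbers are mine, for illustration only):

```python
import numpy as np

def gaussian_cost(y, lam):
    """Negative log-likelihood (up to constants) for Y ~ N(Lambda, sigma^2):
    proportional to the mean squared error (1/n) sum_i (y_i - Lambda(x_i; W))^2."""
    return np.mean((y - lam) ** 2)

def multinomial_cost(Y_onehot, P):
    """Negative log-likelihood for a multinomial response:
    -sum_i sum_k y_ik * log p_k(x_i; W)."""
    eps = 1e-12                              # guard against log(0)
    return -np.sum(Y_onehot * np.log(P + eps))

# Toy usage with made-up predictions:
y = np.array([1.0, 2.0, 0.5])
lam = np.array([0.8, 2.2, 0.4])
print(gaussian_cost(y, lam))

Y_onehot = np.array([[1, 0], [0, 1]])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
print(multinomial_cost(Y_onehot, P))
```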
How to optimize?
- For neural nets : Newton-Raphson is infeasible (the Hessian is impractical to compute)
- Backpropagation : makes the gradient easy to compute and easy to code
Gradient Descent
- Adam, AdaGrad, RMSprop
- Neural net training depends on gradients, i.e. the activation function should have well-behaved derivatives
- Gradient saturation problem
- With sigmoid or hyperbolic tangent activations
: the product of partial derivatives converges to 0
- Solution
: Rectified Linear Unit (ReLU)
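A quick numerical illustration of gradient saturation, assuming a 20-layer path with standard-normal pre-activations (the depth and values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                     # bounded above by 0.25

rng = np.random.default_rng(1)
z = rng.normal(size=20)                      # pre-activations along a 20-layer path

# Backpropagation multiplies one activation derivative per layer.
# With sigmoid each factor is at most 0.25, so the product collapses toward 0.
print("product of sigmoid derivatives:", np.prod(sigmoid_grad(z)))

# ReLU's derivative is exactly 1 wherever the unit is active (z > 0),
# so the product along an active path does not shrink.
print("product of ReLU derivatives (active path):", np.prod(np.ones_like(z)))
```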
What kind of Hidden Layer then?
Image, Video : CNN
Text : LSTM, Transformer
Density estimation : Normalizing Flow
Lecture 2
Overfitting Problem
Neural networks are highly flexible, but they easily suffer from overfitting.
Solutions :
- Shrinkage penalty
- Dropout
- Batch training
- Early Stopping
Penalization
L1, L2 regularization (Lasso, Ridge) : add a regularization term to the negative log-likelihood
ex) an L1 term with a categorical response
$-l(W) \propto -\sum_{i}\sum_{k} y_{ik}\log p_k(x_i; W) + \lambda\lVert W\rVert_1$
: works as a shrinkage estimator
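A minimal sketch of adding a shrinkage term to the negative log-likelihood (the function name, λ value, and toy numbers are hypothetical):

```python
import numpy as np

def penalized_nll(nll, W, lam=1e-3, penalty="l2"):
    """Add a shrinkage term to an already-computed negative log-likelihood.
    `nll` is the unpenalized -l(W); `W` is a flat vector of all weights."""
    if penalty == "l1":
        return nll + lam * np.sum(np.abs(W))   # Lasso-type penalty
    return nll + lam * np.sum(W ** 2)          # Ridge-type penalty

# Toy usage with made-up numbers:
W = np.array([0.5, -1.2, 0.0, 2.0])
print(penalized_nll(nll=3.7, W=W, lam=0.01, penalty="l1"))
```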
Dropout
- Spike-and-slab interpretation
- multiply the units of every layer by Bernoulli random variables (randomly zero out units)
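A minimal PyTorch sketch of dropout (layer sizes and p = 0.5 are illustrative); `nn.Dropout` multiplies units by a Bernoulli mask during training and is the identity at evaluation time:

```python
import torch
import torch.nn as nn

# Hidden layer followed by dropout. In training mode each hidden unit is
# multiplied by an independent Bernoulli(1 - p) mask (and rescaled by 1/(1 - p));
# in eval mode dropout does nothing.
layer = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
)

x = torch.randn(8, 20)
layer.train()
h_train = layer(x)      # random units zeroed out
layer.eval()
h_eval = layer(x)       # no units dropped
```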
Batch Training
Split the whole training data into B batches, and perform a gradient-descent step on each batch
- Epoch : one cycle over all batches
- Batch : a sampling idea >> familiar to statisticians?
Stochastic Gradient Descent(SGD)
Because of dropout and batch training, the likelihood and cost function change randomly at every training step
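A minimal mini-batch SGD sketch on a toy linear model (the data, learning rate, and batch size are made up); it shows one gradient step per batch and one epoch per full pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-model data just to show the batching pattern.
n, p = 1000, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

W = np.zeros(p)
lr, batch_size, n_epochs = 0.1, 100, 20

for epoch in range(n_epochs):               # one epoch = one pass over all batches
    idx = rng.permutation(n)                # reshuffle -> stochastic cost each epoch
    for start in range(0, n, batch_size):
        b = idx[start:start + batch_size]
        resid = X[b] @ W - y[b]
        grad = 2 * X[b].T @ resid / len(b)  # gradient of the batch MSE
        W -= lr * grad                      # one gradient-descent step per batch
```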
Data Split
- Training data : used to compute the gradient
- Validation data : not used to compute the gradient; instead used to evaluate the cost function and check whether overfitting occurs
Early Stopping
- If the validation cost does not improve for a given number of epochs (patience), stop training
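A minimal early-stopping sketch with a patience counter; `train_epoch` and `valid_cost` are hypothetical helpers standing in for one training pass and the validation cost:

```python
# Training stops when the validation cost has not improved for `patience` epochs.
def fit_with_early_stopping(train_epoch, valid_cost, max_epochs=200, patience=10):
    best_cost, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        state = train_epoch()            # one pass over the training batches
        cost = valid_cost(state)         # cost on the validation split only
        if cost < best_cost:
            best_cost, best_state, wait = cost, state, 0
        else:
            wait += 1
            if wait >= patience:         # no improvement for `patience` epochs
                break
    return best_state, best_cost
```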
Algorithm Summary
- Initialize : generate $W_0$ from a probability distribution (prior)
Validation Method for Binary Y
- Classification rule to 0 or 1
- From the model : we obtain $P(Y=1\mid X)$
- How to turn this into a prediction of 0 or 1? : set a threshold
: a larger threshold lowers both TPR and FPR
> Rather than fixing the threshold, observe how the ROC curve changes as the threshold varies
- The ROC curve is important : AUC-ROC
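A numpy sketch of sweeping the threshold to trace the ROC curve and approximate the AUC (the simulated labels and predicted probabilities are toy data):

```python
import numpy as np

def roc_points(y_true, p_hat, thresholds):
    """TPR and FPR of the rule 1{P(Y=1|X) >= t} for each threshold t."""
    tpr, fpr = [], []
    pos, neg = (y_true == 1), (y_true == 0)
    for t in thresholds:
        y_pred = (p_hat >= t)
        tpr.append(np.mean(y_pred[pos]))   # TP / (TP + FN)
        fpr.append(np.mean(y_pred[neg]))   # FP / (FP + TN)
    return np.array(fpr), np.array(tpr)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
p_hat = np.clip(0.4 * y + 0.6 * rng.random(500), 0, 1)   # toy predicted P(Y=1|X)

fpr, tpr = roc_points(y, p_hat, thresholds=np.linspace(0, 1, 101))

# AUC via the trapezoidal rule (fpr decreases in the threshold, so reverse it).
fpr_r, tpr_r = fpr[::-1], tpr[::-1]
auc = np.sum(np.diff(fpr_r) * (tpr_r[1:] + tpr_r[:-1]) / 2)
print(f"AUC ~= {auc:.3f}")
```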
For Continuous Y
- MSE alone cannot describe the full result
- Draw a scatter plot (e.g. predicted vs. observed) on the test data
Convolutional Neural Network
- Used when the input data X is spatial
- Feedforward DNN : Input > feature extraction layers > fully connected (dense) layers > Output
Convolution
: taking local weighted averages to create a summary image
Preserves spatial structure with far fewer parameters
Pooling
: downsampling, which gives (approximate) translational invariance
Traditional kernels (e.g. Gaussian) do smoothing, i.e. they may not be useful for image classification, etc.
Channel and Filter
Input channels : e.g. an RGB image has 3 channels for the input layer
The filter's kernel values are also estimated during training
- N output channels produce N output images (feature maps) per convolutional layer
Stride, Filter(kernel size), Padding
- Stride : Step size for each slide
- Filter or kernel size : width and height of kernel
- Padding : Additional rows and columns to adjust the resulting images
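A small PyTorch sketch tying together channels, kernel size, stride, padding, and pooling (the layer hyperparameters are illustrative); the kernel values of `nn.Conv2d` are learned parameters:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # batch of one RGB (3-channel) 32x32 image

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2)       # downsampling

h = conv(x)                              # output size: (32 - 3 + 2*1)/1 + 1 = 32
print(h.shape)                           # torch.Size([1, 16, 32, 32]) -> 16 feature maps
h = pool(h)
print(h.shape)                           # torch.Size([1, 16, 16, 16])
```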
Setting
$x = [x_1, \ldots, x_T]$ : $T$ is the number of words in the text $x$
- Embedding : transform each word into a vector
- In deep learning : the embedding vectors are obtained as parameters of the model, trained with just SGD
- Text data analysis is just a special case of time series analysis.
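A minimal PyTorch sketch of a learned word embedding (vocabulary size, dimension, and token ids are made up); the embedding matrix is an ordinary parameter updated by SGD:

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 64               # hypothetical vocabulary size and dimension

# One learnable d-dimensional vector per word; trained with the rest of the model.
embedding = nn.Embedding(vocab_size, d)

x = torch.tensor([[12, 7, 431, 55]])     # token ids for a 4-word text (T = 4)
emb = embedding(x)                       # shape (1, 4, 64): x_1, ..., x_T as vectors
```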
Self-Attention Layer
- $x_t$ is the $d$-dimensional vector of the $t$-th word
- Value : a linear transformation of $x_t$ : $V_t = W_v x_t + w_v$
- Self-attention :
$\mathrm{sa}(x_t) = \sum_{u=1}^{T} a(x_u, x_t)\, V_u$
where $a(x_u, x_t) > 0$ and $\sum_u a(x_u, x_t) = 1$; $a(x_u, x_t)$ is the attention that the $t$-th word gives to $x_u$.
- Calculating $a(x_u, x_t)$ :
Query : $Q_t = W_q x_t + w_q$
Key : $K_t = W_k x_t + w_k$
Then we compute
$a(x_u, x_t) = \mathrm{softmax}_u\!\left(K_u^\top Q_t\right)$
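A numpy sketch of a single self-attention head following the formulas above (random parameter values, bias terms omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                               # T words, d-dimensional embeddings
X = rng.normal(size=(T, d))               # rows are x_1, ..., x_T

# Illustrative parameter values; in practice W_q, W_k, W_v are learned.
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))

Q = X @ Wq.T                              # queries  Q_t
K = X @ Wk.T                              # keys     K_t
V = X @ Wv.T                              # values   V_t

scores = Q @ K.T                          # scores[t, u] = K_u^T Q_t
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)      # softmax over u: a(x_u, x_t) > 0, sums to 1

sa = A @ V                                # sa(x_t) = sum_u a(x_u, x_t) V_u
```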
Position Encoding
: positional information about the words is not contained in $x_t$ (self-attention by itself ignores word order)
- Absolute position embedding (an embedding matrix indexed by the position of each word)
- Relative position embedding
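A minimal PyTorch sketch of a learned absolute position embedding added to the word embedding (all sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d, max_len = 10_000, 64, 512      # hypothetical sizes

word_emb = nn.Embedding(vocab_size, d)        # content of each word
pos_emb = nn.Embedding(max_len, d)            # absolute position embedding (learned)

tokens = torch.tensor([[12, 7, 431, 55]])                # (batch, T)
positions = torch.arange(tokens.size(1)).unsqueeze(0)    # [[0, 1, 2, 3]]

# Adding the position embedding injects word order, which self-attention
# alone would ignore.
x = word_emb(tokens) + pos_emb(positions)     # shape (1, T, d)
```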
Multi-head Self Attention
Define $H$ separate self-attention heads, $h = 1, \dots, H$
: able to extract various kinds of relationships in the text
- Residual multi-head attention > Layer Normalization > Residual Dense Layers > Layer Normalization
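A PyTorch sketch of one such block, using `nn.MultiheadAttention` for the $H$ heads (the dimensions and the 4x feed-forward width are conventional choices, not from the lecture):

```python
import torch
import torch.nn as nn

d, H, T = 64, 8, 10                            # embedding dim, heads, sequence length

attn = nn.MultiheadAttention(embed_dim=d, num_heads=H, batch_first=True)
norm1 = nn.LayerNorm(d)
norm2 = nn.LayerNorm(d)
ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

x = torch.randn(1, T, d)                       # (batch, T, d) word representations

# Residual multi-head attention > layer norm > residual dense layers > layer norm
h, _ = attn(x, x, x)                           # H self-attention heads in parallel
x = norm1(x + h)
x = norm2(x + ff(x))
```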