Deep Learning
Single Layer Neural Network
$$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X) = \beta_0 + \sum_{k=1}^{K} \beta_k\, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\Big)$$


- $A_k = h_k(X) = g\big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\big)$ are called the activations in the hidden layer
- $g(z)$ is called the activation function. Popular choices are the sigmoid and the rectified linear (ReLU)
- Activation functions in hidden layers are typically nonlinear, otherwise the model collapses to a linear model
- So the activations are like derived features - nonlinear transformations of linear combinations of the features
- The model is fit by minimizing $\sum_{i=1}^{n} (y_i - f(x_i))^2$ (e.g. for regression)
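As a minimal illustration of the model above, here is a NumPy sketch of the forward pass and the squared-error loss; the ReLU activation and all names and shapes are illustrative assumptions, not taken from the source.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def single_layer_forward(X, W, w0, beta, beta0):
    """Forward pass f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj X_j).

    X    : (n, p) feature matrix
    W    : (K, p) hidden-layer weights
    w0   : (K,)  hidden-layer biases
    beta : (K,)  output-layer weights, beta0: output bias
    """
    A = relu(X @ W.T + w0)      # (n, K) hidden activations
    return beta0 + A @ beta     # (n,) predictions

def squared_error(y, y_hat):
    """Objective minimized when fitting the model for regression."""
    return np.sum((y - y_hat) ** 2)
```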
Example : MNIST Digits

- Handwritten digits : 28×28 grayscale images, 60K training and 10K test images
- Features are the 784 pixel grayscale values ∈ [0, 255]
- Labels are the digit classes 0-9
- Goal : build a classifier to predict the image class
- We build a two-layer network with 256 units at the first layer, 128 units at the second layer, and 10 units at the output layer
- Along with intercepts (called biases) there are 235,146 parameters (referred to as weights)
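The parameter count can be checked directly: each layer contributes (inputs × units) weights plus one bias per unit. A quick arithmetic check in Python:

```python
# Parameter count for the 784-256-128-10 network described above.
layers = [(784, 256), (256, 128), (128, 10)]
n_params = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(n_params)  # 235146
```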

- Let $Z_m = \beta_{m0} + \sum_{l=1}^{K_2} \beta_{ml} A_l^{(2)}$, $m = 0, 1, \ldots, 9$, be 10 linear combinations of activations at the second layer
- The output activation function encodes the softmax function
$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{l=0}^{9} e^{Z_l}}$$
- We fit the model by minimizing the negative multinomial log-likelihood (or cross-entropy)
$$-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log\big(f_m(x_i)\big)$$
- $y_{im}$ is 1 if the true class for observation $i$ is $m$, else 0 - i.e. one-hot encoded
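A minimal NumPy sketch of the softmax output and the cross-entropy objective above; the max-shift and the small constant inside the log are standard numerical-stability choices, not from the source.

```python
import numpy as np

def softmax(Z):
    """Softmax over the last axis: f_m = exp(Z_m) / sum_l exp(Z_l)."""
    Z = Z - Z.max(axis=-1, keepdims=True)   # shift for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=-1, keepdims=True)

def cross_entropy(Y_onehot, F):
    """Negative multinomial log-likelihood: -sum_i sum_m y_im log f_m(x_i)."""
    return -np.sum(Y_onehot * np.log(F + 1e-12))
```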

- Early success for neural networks in the 1990s
- With so many parameters, regularization is essential
- Some details of regularization and fitting will come later
- A very heavily studied problem - the best reported error rates are < 0.5%
- The human error rate is reported to be around 0.2%
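For concreteness, a minimal Keras-style definition of the 256-128-10 network described above could look as follows. This assumes TensorFlow/Keras is available; the ReLU activations and compile settings are illustrative choices, and regularization is omitted.

```python
from tensorflow.keras import layers, models

# Sketch of the two-hidden-layer MNIST network (256 and 128 hidden units,
# 10-unit softmax output); activations and optimizer are illustrative.
model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()   # reports 235,146 trainable parameters
```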
Convolutional Neural Network - CNN

- Major success story for classifying images
- Samples are drawn from the CIFAR100 database
- 32×32 color natural images, with 100 classes
- 50K training images, 10K test images
- Each image is a three-dimensional array or feature map : a 32×32×3 array of 8-bit numbers
- The last dimension represents the three color channels for red, green and blue

- The CNN builds up an image in a hierarchical fashion
- Edges and shapes are recognized and pieced together to form more complex shapes, eventually assembling the target image
- This hierarchical construction is achieved using convolution and pooling layers
Convolution Filter

- The filter is itself an image, and represents a small shape, edge, etc.
- We slide it around the input image, scoring for matches
- The scoring is done via dot-products between the filter and each subimage
- If the subimage of the input image is similar to the filter, the score is high, otherwise low
- The filters are learned during training

- The idea of convolution with a filter is to find common patterns that occur in different parts of the image
- For example, one filter might highlight vertical stripes and another horizontal stripes
- The result of the convolution is a new feature map
- Since images have three color channels, the filter does as well : one filter per channel, and the dot-products are summed
- The weights in the filters are learned by the network
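A NumPy sketch of the sliding dot-product described above, for a single channel; the example filter is an illustrative assumption (in practice the filter weights are learned).

```python
import numpy as np

def convolve2d(image, filt):
    """Slide the filter over the image; each output entry is the dot-product
    between the filter and the corresponding subimage (valid positions only)."""
    H, W = image.shape
    h, w = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

# Illustrative 3x3 filter that scores highly on vertical edges.
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])
```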
Pooling

- Each non-overlapping 2×2 block is replaced by its maximum
- This sharpens the feature identification
- Allows for locational invariance
- Reduces the dimension by a factor of 4 - i.e. a factor of 2 in each spatial dimension
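A minimal NumPy sketch of 2×2 max pooling as described above; it assumes the spatial dimensions are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block by its maximum."""
    H, W = feature_map.shape
    blocks = feature_map.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))   # output has shape (H/2, W/2)
```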
Architecture of a CNN

- Many convolve + pool layers
- Filters are typically small, e.g. 3×3 in each channel
- Each filter creates a new channel in the convolution layer
- As pooling reduces the size, the number of filters/channels is typically increased
- The number of layers can be very large. E.g. resnet50, trained on the 1000-class imagenet image database, has 50 layers
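To illustrate the convolve + pool pattern, here is a small Keras-style sketch for 32×32×3 inputs and 100 classes. It assumes TensorFlow/Keras is available; the layer sizes and optimizer are illustrative, not the actual architecture used for CIFAR100 in the source.

```python
from tensorflow.keras import layers, models

# Illustrative small CNN: convolve + pool blocks, with the channel count
# growing as pooling shrinks the spatial dimensions.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```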
Document Classification
Featurization : Bag-of-Words
- Documents have different lengths, and consist of sequences of words
- How do we create features X to characterize a document?
- From a dictionary, identify the 10K most frequently occurring words
- Create a binary vector of length p=10K for each document, and score a 1 in every position that the corresponding word occurred
- With n documents, we now have an n×p sparse feature matrix X
- We compare a lasso logistic regression model to a two-hidden-layer neural network below
- Bag-of-words features are unigrams. We can instead use bigrams (occurrences of adjacent word pairs), and in general m-grams
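A minimal sketch of this featurization, assuming scikit-learn is available; the toy corpus is illustrative. Setting `ngram_range=(1, 2)` would add bigram features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary bag-of-words over the 10K most frequent words (illustrative sketch).
docs = ["this movie was great", "this movie was terrible"]   # toy corpus
vectorizer = CountVectorizer(max_features=10_000, binary=True)
X = vectorizer.fit_transform(docs)   # n x p sparse feature matrix
```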
Lasso vs Neural Network

- The simpler lasso logistic regression model works as well as the neural network in this case
Recurrent Neural Networks
- Often data arise as sequences :
- Documents are sequences of words, and their relative positions have meaning
- Time-series such as weather data or financial indices
- Recorded speech or music
- Handwriting, such as doctor's notes
- RNNs build models that take into account this sequential nature of the data, and build a memory of the past
- The feature for each observation is a sequence of vectors $X = \{X_1, \ldots, X_L\}$
- The target $Y$ is often of the usual kind - e.g. a single variable such as Sentiment, or a one-hot vector for multiclass
- However, $Y$ can also be a sequence, such as the same document in a different language

- The hidden layer is a sequence of vectors $A_l$, receiving as input $X_l$ as well as $A_{l-1}$
- $A_l$ produces an output $O_l$
- The same weights $\mathbf{W}$, $\mathbf{U}$ and $\mathbf{B}$ are used at each step in the sequence - hence the term recurrent
- The $A_l$ sequence represents an evolving model for the response that is updated as each element $X_l$ is processed
- Suppose $X_l = (X_{l1}, X_{l2}, \ldots, X_{lp})$ has $p$ components, and $A_l = (A_{l1}, A_{l2}, \ldots, A_{lK})$ has $K$ components
- Then the computation at the $k$th component of the hidden unit $A_l$ is
$$A_{lk} = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{lj} + \sum_{s=1}^{K} u_{ks} A_{l-1,s}\Big), \qquad O_l = \beta_0 + \sum_{k=1}^{K} \beta_k A_{lk}$$
- Often we are concerned only with the prediction $O_L$ at the last unit (computed in the sketch after this list)
- For squared error loss, and $n$ sequence/response pairs, we would minimize
$$\sum_{i=1}^{n} (y_i - o_{iL})^2 = \sum_{i=1}^{n} \Big(y_i - \Big(\beta_0 + \sum_{k=1}^{K} \beta_k\, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} x_{iLj} + \sum_{s=1}^{K} u_{ks} a_{i,L-1,s}\Big)\Big)\Big)^2$$
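A NumPy sketch of the recurrent forward pass above, returning the final output $O_L$; the tanh activation, the zero initialization of $A_0$, and the function name are illustrative assumptions.

```python
import numpy as np

def rnn_forward(X_seq, W, U, w0, beta, beta0, g=np.tanh):
    """Forward pass of the simple RNN described above (illustrative sketch).

    X_seq: (L, p) sequence of input vectors X_1, ..., X_L
    W    : (K, p) input-to-hidden weights (shared across steps)
    U    : (K, K) hidden-to-hidden weights (shared across steps)
    w0   : (K,)  hidden biases; beta, beta0: output weights and bias
    """
    K = W.shape[0]
    A = np.zeros(K)                   # A_0 initialized to zero
    for X_l in X_seq:                 # the same weights are reused at every step
        A = g(w0 + W @ X_l + U @ A)   # A_lk = g(w_k0 + sum_j w_kj X_lj + sum_s u_ks A_{l-1,s})
    return beta0 + beta @ A           # O_L = beta_0 + sum_k beta_k A_Lk
```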
RNN and IMDB Reviews
- The document feature is a sequence of words $\{W_l\}_1^L$
- We typically truncate/pad the documents to the same number L of words (we use L = 500)
- Each word $W_l$ is represented as a one-hot encoded binary vector $X_l$ of length 10K, with all zeros and a single one in the position for that word in the dictionary
- This results in an extremely sparse feature representation, and would not work well
- Instead we use a lower-dimensional pretrained word embedding matrix $\mathbf{E}$ ($m \times 10K$)
- This reduces the binary feature vector of length 10K to a real feature vector of dimension $m < 10K$ (e.g. m in the low hundreds)
Word Embedding

- Embeddings are pretrained on very large corpora of documents, using methods similar to principal components. word2vec and GloVe are popular
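A NumPy sketch of how an embedding matrix reduces a one-hot word vector to a dense $m$-dimensional vector; the dimensions and the random matrix are illustrative (in practice $\mathbf{E}$ would be pretrained, e.g. with word2vec or GloVe).

```python
import numpy as np

vocab_size, m = 10_000, 128            # dimensions are illustrative
E = np.random.randn(m, vocab_size)     # stands in for a pretrained embedding matrix

# Multiplying E by a one-hot vector just selects a column, so in practice
# the one-hot vector is never formed explicitly.
word_index = 42                        # position of the word in the dictionary
dense_vector = E[:, word_index]        # equivalent to E @ one_hot(word_index)
```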
- After a lot of work, the results are a disappointing 76% accuracy
- We then fit a more exotic RNN than the one described above - an LSTM, with long and short term memory (a sketch follows this list)
- Here $A_l$ receives input from $A_{l-1}$ (short term memory) as well as from a version that reaches further back in time (long term memory)
- Now we get 87% accuracy, slightly less than the 88% achieved by glmnet
- These data have been used as a benchmark for new RNN architectures
- The best reported result found at the time of writing (2020) was around 95%
- We point to a leaderboard
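For orientation, here is a minimal Keras-style sketch of an LSTM sentiment classifier for this setup (10K-word dictionary, documents padded/truncated to L = 500). It assumes TensorFlow/Keras; the embedding and hidden sizes are illustrative, not the configuration used to obtain the accuracies above.

```python
from tensorflow.keras import layers, models

# Illustrative LSTM classifier for padded sequences of word indices.
model = models.Sequential([
    layers.Input(shape=(500,), dtype="int32"),
    layers.Embedding(input_dim=10_000, output_dim=32),  # learned or pretrained embedding
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),              # positive/negative sentiment
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```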
Time Series Forecasting

New York Stock Exchange Data
- The data consist of three daily time series covering 6,051 trading days :
- Log trading volume : the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale
- Dow Jones return : the difference between the log of the Dow Jones Industrial Index on consecutive trading days
- Log volatility : based on the absolute values of daily price movements
- Goal : predict Log trading volume tomorrow, given its observed values up to today, as well as those of Dow Jones return and Log volatility
Autocorrelation

- The autocorrelation at lag $l$ is the correlation of all pairs $(v_t, v_{t-l})$ that are $l$ trading days apart
- These sizable correlations give us confidence that past values will be helpful in predicting the future
- This is a curious prediction problem : the response $v_t$ is also a feature $v_{t-l}$
RNN Forecaster
- We only have one series of data
- How do we set up for an RNN?
- We extract many short mini-series of input sequences $X = \{X_1, \ldots, X_L\}$ with a predefined length L known as the lag (see the sketch after this list) :
$$X_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix}, \quad X_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix}, \; \cdots, \; X_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix}, \quad \text{and } Y = v_t$$
- Since T = 6,051, with L = 5 we can create 6,046 such (X, Y) pairs
- We use the first 4,281 as training data, and the following 1,770 as test data
- We fit an RNN with 12 hidden units per lag step (i.e. per $A_l$)
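A NumPy sketch of how the lagged (X, Y) pairs above can be built from the three raw series; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def make_lagged_pairs(v, r, z, L=5):
    """Build (X, Y) pairs of short mini-series as described above (sketch).

    v, r, z : 1-D arrays of length T (log volume, DJ return, log volatility)
    Returns X of shape (T - L, L, 3) and Y of shape (T - L,), where
    Y[i] = v[t] and X[i] holds the L preceding days of all three series.
    """
    series = np.stack([v, r, z], axis=1)                     # (T, 3)
    T = len(v)
    X = np.stack([series[t - L:t] for t in range(L, T)])     # (T-L, L, 3)
    Y = v[L:]                                                # (T-L,)
    return X, Y
```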

- Predictions and the truth were compared over the test period
- $R^2 = 0.42$ for the RNN
- $R^2 = 0.18$ for the straw man - using yesterday's value of Log trading volume to predict that of today
Autoregression Forecaster
- The RNN forecaster is similar in structure to a traditional autoregression procedure :
$$\mathbf{y} = \begin{bmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{bmatrix} \qquad \mathbf{M} = \begin{bmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_L & \cdots & v_2 \\ 1 & v_{L+2} & v_{L+1} & \cdots & v_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{bmatrix}$$
- Fit an OLS regression of $\mathbf{y}$ on $\mathbf{M}$, giving
$$\hat{v}_t = \hat{\beta}_0 + \hat{\beta}_1 v_{t-1} + \hat{\beta}_2 v_{t-2} + \cdots + \hat{\beta}_L v_{t-L}$$
- Known as an order-L autoregression model, or AR(L) - see the sketch after this list
- For the NYSE data we can include lagged versions of DJ_return and log_volatility in the matrix $\mathbf{M}$, resulting in 3L + 1 columns
- $R^2 = 0.41$ for the AR(5) model (16 parameters)
- $R^2 = 0.42$ for the RNN model (205 parameters)
- $R^2 = 0.42$ for the AR(5) model fit by a neural network
- $R^2 = 0.46$ for all models if we include the day_of_week of the day being predicted
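A NumPy sketch of fitting an AR(L) model by ordinary least squares, building the lagged matrix $\mathbf{M}$ as above; the function name is illustrative.

```python
import numpy as np

def fit_ar(v, L=5):
    """Fit an order-L autoregression AR(L) by OLS (illustrative sketch).

    v : 1-D array of length T (e.g. log trading volume)
    Returns (beta_0, beta_1, ..., beta_L).
    """
    T = len(v)
    y = v[L:]                                                 # responses v_{L+1}, ..., v_T
    M = np.column_stack([np.ones(T - L)] +
                        [v[L - k:T - k] for k in range(1, L + 1)])  # lag-k predictors
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta
```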
Non Convex Functions and Gradient Descent

- Start with a guess $\theta^0$ for all the parameters in $\theta$, and set $t = 0$
- Iterate until the objective $R(\theta)$ fails to decrease :
  1. Find a vector $\delta$ that reflects a small change in $\theta$, such that $\theta^{t+1} = \theta^t + \delta$ reduces the objective
  2. Set $t \leftarrow t + 1$
- In this simple example we reached the global minimum
- If we had started a little to the left of $\theta^0$ we would have gone in the other direction, and ended up in a local minimum
- Although $\theta$ is multi-dimensional, we have depicted the process as one-dimensional
- It is much harder to identify whether one is in a local minimum in high dimensions
- How do we find a direction $\delta$ that points downhill?
- We compute the gradient vector
$$\nabla R(\theta^t) = \frac{\partial R(\theta)}{\partial \theta}\bigg|_{\theta = \theta^t}$$
- i.e. the vector of partial derivatives at the current guess $\theta^t$
- The gradient points uphill, so our update is $\delta = -\rho \nabla R(\theta^t)$, or
$$\theta^{t+1} \leftarrow \theta^t - \rho \nabla R(\theta^t)$$
  where $\rho$ is the learning rate (typically small, e.g. $\rho = 0.001$). A minimal sketch follows below.
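A NumPy sketch of the update rule above; the stopping criterion (negligible step size rather than tracking the objective) and the function names are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_R, theta0, rho=0.001, max_iter=10_000, tol=1e-8):
    """Plain gradient descent: theta^{t+1} <- theta^t - rho * grad R(theta^t).

    grad_R : function returning the gradient vector at theta.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = rho * grad_R(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # stop once updates become negligible
            break
    return theta
```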
Gradients and Backpropagation
$$R(\theta) = \sum_{i=1}^{n} R_i(\theta)$$
is a sum, so the gradient is a sum of gradients
$$R_i(\theta) = \tfrac{1}{2}\big(y_i - f_\theta(x_i)\big)^2 = \tfrac{1}{2}\Big(y_i - \beta_0 - \sum_{k=1}^{K} \beta_k\, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}\Big)\Big)^2$$
- For ease of notation, let $z_{ik} = w_{k0} + \sum_{j=1}^{p} w_{kj} x_{ij}$
- Backpropagation uses the chain rule for differentiation :
$$\frac{\partial R_i(\theta)}{\partial \beta_k} = \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial \beta_k} = -\big(y_i - f_\theta(x_i)\big) \cdot g(z_{ik})$$
$$\frac{\partial R_i(\theta)}{\partial w_{kj}} = \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}} = -\big(y_i - f_\theta(x_i)\big) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij}$$
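A NumPy sketch that evaluates the chain-rule expressions above for one training example; the function name and the generic activation `g` / derivative `g_prime` are illustrative assumptions.

```python
import numpy as np

def backprop_single_example(x, y, W, w0, beta, beta0, g, g_prime):
    """Gradients of R_i(theta) = 0.5 * (y - f(x))^2 for the single-layer net.

    g and g_prime are the activation and its derivative
    (e.g. np.tanh and lambda z: 1 - np.tanh(z) ** 2).
    """
    z = w0 + W @ x                     # z_ik = w_k0 + sum_j w_kj x_ij
    a = g(z)                           # hidden activations g(z_ik)
    f = beta0 + beta @ a               # prediction f_theta(x_i)
    resid = -(y - f)                   # dR_i / df

    d_beta0 = resid
    d_beta = resid * a                 # dR_i/dbeta_k = -(y - f) * g(z_ik)
    d_w0 = resid * beta * g_prime(z)   # -(y - f) * beta_k * g'(z_ik)
    d_W = np.outer(d_w0, x)            # -(y - f) * beta_k * g'(z_ik) * x_ij
    return d_W, d_w0, d_beta, d_beta0
```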
Tricks of the Trade
Slow learning
Gradient descent is slow, and a small learning rate ρ slows it even further. With early stopping, this is a form of regularization
Stochastic gradient descent
Rather than compute the gradient using all the data, use a small minibatch drawn at random at each step (see the sketch below)
- An epoch is a count of iterations and amounts to the number of minibatch updates such that n samples in total have been processed; i.e. with minibatches of size 128, 60K/128 ≈ 469 updates for MNIST
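A minimal NumPy sketch of minibatch stochastic gradient descent with the epoch bookkeeping above; the shuffling scheme and the `grad_Ri` interface are illustrative assumptions.

```python
import numpy as np

def sgd(grad_Ri, theta0, X, y, rho=0.001, batch_size=128, n_epochs=10, seed=0):
    """Minibatch SGD (illustrative sketch).

    grad_Ri(theta, X_batch, y_batch) returns the average gradient over the
    minibatch. One epoch = ceil(n / batch_size) minibatch updates, so that
    roughly n samples are processed per epoch.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(n)                     # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]      # small random minibatch
            theta = theta - rho * grad_Ri(theta, X[idx], y[idx])
    return theta
```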
Regularization
Ridge and lasso regularization can be used to shrink the weights at each layer. Two other popular forms of regularization are dropout and augmentation
Dropout Learning

- At each SGD update, randomly remove units with probability ϕ, and scale up the weights of those retained by 1/(1−ϕ) to compensate (see the sketch after this list)
- In simple scenarios like linear regression, a version of this process can be shown to be equivalent to ridge regularization
- As in ridge, the other units stand in for those temporarily removed, and their weights are drawn closer together
- Similar to randomly omitting variables when growing trees in random forests
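A NumPy sketch of the training-time dropout step described above; the function name and random-number handling are illustrative.

```python
import numpy as np

def dropout(activations, phi, rng=None):
    """Remove each unit with probability phi and rescale the survivors by 1/(1 - phi)."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(activations.shape) >= phi   # True with probability 1 - phi
    return activations * keep / (1.0 - phi)
```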
Ridge and Data Augmentation

- Make many copies of each $(x_i, y_i)$ and add a small amount of Gaussian noise to the $x_i$ - a little cloud around each observation - but leave the copies of $y_i$ alone (see the sketch below)
- This makes the fit robust to small perturbations in $x_i$, and is equivalent to ridge regularization in an OLS setting
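A NumPy sketch of this noise augmentation; the number of copies and the noise scale are illustrative assumptions.

```python
import numpy as np

def augment_with_noise(X, y, n_copies=5, sigma=0.1, rng=None):
    """Make n_copies of each (x_i, y_i), adding a small Gaussian cloud around
    each x_i while leaving the copies of y_i unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    X_aug = np.concatenate([X + sigma * rng.standard_normal(X.shape)
                            for _ in range(n_copies)])
    y_aug = np.concatenate([y] * n_copies)
    return X_aug, y_aug
```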
Double Descent
- With neural networks, it seems better to have too many hidden units than too few
- Likewise, more hidden layers seem better than few
- Running stochastic gradient descent until zero training error often gives good out-of-sample error
- Increasing the number of units or layers and again training until zero error sometimes gives even better out-of-sample error

- When d ≤ 20, the model is OLS, and we see the usual bias-variance trade-off (here d is the number of fitted coefficients and there are 20 training observations)
- When d > 20, we revert to the minimum-norm solution
- As d increases above 20, $\sum_{j=1}^{d} \hat{\beta}_j^2$ decreases, since it is easier to achieve zero error, and hence the solutions are less wiggly

- To achieve a zero-residual solution with d = 20 is a real stretch
- It is easier for larger d
- In a wide linear model (p > n) fit by least squares, SGD with a small step size leads to a minimum-norm zero-residual solution (see the sketch after this list)
- Stochastic gradient flow - i.e. the entire path of SGD solutions - is somewhat similar to the ridge path
- By analogy, deep and wide neural networks fit by SGD down to zero training error often give good solutions that generalize well
- In particular, cases with a high signal-to-noise ratio - e.g. image recognition - are less prone to overfitting; the zero-error solution is mostly signal
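A NumPy sketch of the minimum-norm zero-residual solution in a wide linear model (p > n), computed directly with the pseudoinverse; the dimensions and random data are illustrative. SGD with a small step size, started at zero, converges to this same solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                          # more parameters than observations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution
print(np.allclose(X @ beta_min_norm, y))   # zero training residuals
print(np.sum(beta_min_norm ** 2))          # smallest possible sum of squared coefficients
```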
All contents written based on the GIST Machine Learning & Deep Learning lesson (Instructor : Prof. Sun-dong Kim)