[ML&DL] 11. Deep Learning

KBC · December 15, 2024

Deep Learning

Single Layer Neural Network

f(X)=\beta_0+\sum^K_{k=1}\beta_kh_k(X)\\[0.2cm] =\beta_0+\sum^K_{k=1}\beta_k g(w_{k0}+\sum^p_{j=1}w_{kj}X_j)


  • A_k=h_k(X)=g(w_{k0}+\sum^p_{j=1} w_{kj}X_j) are called the activations in the hidden layer
  • g(z) is called the activation function. Popular choices are the sigmoid and rectified linear (ReLU), shown in the figure
  • Activation functions in hidden layers are typically nonlinear, otherwise the model collapses to a linear model
  • So the activations are like derived features - nonlinear transformations of linear combinations of the features
  • The model is fit by minimizing \sum^n_{i=1}(y_i-f(x_i))^2 (e.g. for regression); a forward-pass sketch follows this list
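
The single-layer formula above maps directly onto a simple forward pass. Below is a minimal NumPy sketch, assuming ReLU for g; the toy shapes and random weights are illustrative placeholders rather than fitted parameters.

```python
# Minimal forward pass for the single-layer network f(X) above (ReLU assumed for g).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, W, w0, beta, beta0):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj X_j).

    X     : (n, p) inputs
    W     : (K, p) hidden-layer weights w_kj
    w0    : (K,)   hidden-layer biases  w_k0
    beta  : (K,)   output weights
    beta0 : scalar output bias
    """
    A = relu(X @ W.T + w0)      # activations A_k = h_k(X), shape (n, K)
    return beta0 + A @ beta     # shape (n,)

# Toy example: n = 5 observations, p = 4 features, K = 10 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W, w0 = rng.normal(size=(10, 4)), rng.normal(size=10)
beta, beta0 = rng.normal(size=10), 0.5
print(forward(X, W, w0, beta, beta0))
```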

Example : MNIST Digits

  • Handwritten digits : 28\times 28 grayscale images; 60K train, 10K test images
  • Features are the 784 pixel grayscale values \in (0,255)
  • Labels are the digit class 0-9

  • Goal : build a classifier to predict the image class
  • We build a two-layer network with 256 units at first layer, 128 units at second layer, and 10 units at output layer
  • Along with intercepts (called biases) there are 235,146 parameters (referred to as weights)

  • Let Z_m=\beta_{m0}+\sum^{K_2}_{l=1}\beta_{ml}A_l^{(2)},\;m=0,1,\dots,9 be 10 linear combinations of activations at second layer
  • Output activation function encodes the softmax function
    f_m(X)=\Pr(Y=m|X)=\frac{e^{Z_m}}{\sum^9_{l=0}e^{Z_l}}
  • We fit the model by minimizing the negative multinomial log-likelihood (or cross-entropy)
    -\sum^n_{i=1}\sum^9_{m=0}y_{im}\log(f_m(x_i))
  • y_{im} is 1 if the true class for observation i is m, else 0 - i.e. one-hot encoded (a code sketch of this network follows below)
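
A sketch of the two-hidden-layer MNIST network described above, assuming tf.keras; the ReLU hidden activations and the rmsprop optimizer are illustrative assumptions, while the layer sizes, softmax output and cross-entropy loss follow the text.

```python
# Two-hidden-layer classifier for 28x28 MNIST digits (tf.keras assumed).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                       # 28x28 images, flattened
    tf.keras.layers.Dense(256, activation="relu"),      # first hidden layer
    tf.keras.layers.Dense(128, activation="relu"),      # second hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),    # class probabilities f_m(X)
])

# Negative multinomial log-likelihood with one-hot labels y_im
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])

model.summary()   # total parameters: 235,146, matching the count above
```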

  • Early success for neural networks in the 1990s
  • With so many parameters, regularization is essential
  • Some details of regularization and fitting will come later
  • Very overworked problem - best reported rates are < 0.5\%
  • Human error rate is reported to be around 0.2\%

Convolutional Neural Network - CNN

  • Major success story for classifying images
  • Shown are samples from CIFAR100 database
  • 32\times 32 color natural images, with 100 classes
  • 50K training images, 10K test images
  • Each image is a three-dimensional feature map : a 32\times32\times 3 array of 8-bit numbers
  • The last dimension represents the three color channels for red, green and blue

  • The CNN builds up an image in a hierarchical fashion
  • Edges and shapes are recognized and pieced together to form more complex shapes, eventually assembling the target image
  • This hierarchical construction is achieved using convolution and pooling layers

Convolution Filter

  • The filter is itself an image, and represents a small shape, edge etc.
  • We slide it around the input image, scoring for matches
  • The scoring is done via dot-products, illustrated above
  • If the subimage of the input image is similar to the filter, the score is high, otherwise low
  • The filters are learned during training
  • The idea of convolution with a filter is to find common patterns that occur in different parts of the image
  • The two filters shown here highlight vertical and horizontal stripes
  • The result of the convolution is a new feature map
  • Since images have three color channels, the filter does as well : one filter per channel, and dot-products are summed
  • The weights in the filters are learned by the network; a sliding dot-product sketch follows this list
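
The sketch below illustrates convolution as sliding dot-products in plain NumPy (no padding, stride 1); the 3x3 vertical-stripe filter and the toy input image are illustrative assumptions.

```python
# Convolve an image with a filter by sliding dot-products.
import numpy as np

def convolve2d(image, filt):
    """Return the feature map of sliding dot-products of filt over image."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)  # dot-product score
    return out

vertical_stripe = np.array([[-1.0, 2.0, -1.0],
                            [-1.0, 2.0, -1.0],
                            [-1.0, 2.0, -1.0]])

image = np.zeros((8, 8))
image[:, 4] = 1.0                          # a vertical stripe in the input
print(convolve2d(image, vertical_stripe))  # high scores where the stripe matches the filter
```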

Pooling

  • Each non-overlapping 2\times 2 block is replaced by its maximum
  • This sharpens the feature identification
  • Allows for locational invariance
  • Reduces the dimension by a factor of 4 - i.e. a factor of 2 in each dimension (see the sketch below)
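
A small NumPy sketch of 2x2 max pooling, assuming the feature map has even height and width; the toy feature map is a placeholder.

```python
# 2x2 max pooling over non-overlapping blocks.
import numpy as np

def max_pool_2x2(fmap):
    H, W = fmap.shape
    # Group into non-overlapping 2x2 blocks and take the max of each block
    blocks = fmap.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output: dimension reduced by a factor of 4
```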

Architecture of a CNN

  • Many convolve + pool layers
  • Filters are typically small, e.g. each channel 3\times 3
  • Each filter creates a new channel in convolution layer
  • As pooling reduces size, the number of filters/channels is typically increased
  • Number of layers can be very large. E.g. resnet50 trained on the imagenet 1000-class image database has 50 layers (an architecture sketch follows this list)
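
A sketch of the convolve + pool pattern described above for 32x32x3 CIFAR100-sized inputs, assuming tf.keras; the exact filter counts are illustrative, but they follow the stated pattern of increasing channels as pooling shrinks the spatial dimensions.

```python
# Convolve + pool architecture sketch (tf.keras assumed).
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                       # CIFAR100-sized input
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                             # 32x32 -> 16x16
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                             # 16x16 -> 8x8
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                             # 8x8 -> 4x4
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(100, activation="softmax"),                 # 100 CIFAR100 classes
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```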

Document Classification

Featurization : Bag-of-Words

  • Documents have different lengths, and consist of sequences of words
  • How do we create features XX to characterize a document?

  • From a dictionary, identify the 10K most frequently occurring words
  • Create a binary vector of length p=10K for each document, and score a 1 in every position that the corresponding word occurred
  • With n documents, we now have an n\times p sparse feature matrix X
  • We compare a lasso logistic regression model to a two-hidden-layer neural network on the next slide
  • Bag-of-words are unigrams
  • We can instead use bigrams (occurrences of adjacent word pairs), and in general m-grams (see the featurization sketch below)
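
A sketch of the bag-of-words featurization, assuming scikit-learn's CountVectorizer; the toy documents are placeholders for a real corpus.

```python
# Binary bag-of-words featurization with unigram and bigram variants.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was great", "this movie was terrible", "great acting"]

# Binary unigram features: 1 if the word occurs in the document, else 0
unigrams = CountVectorizer(max_features=10000, binary=True)
X = unigrams.fit_transform(docs)          # sparse n x p matrix
print(X.shape)
print(X.toarray())

# Bigrams (adjacent word pairs) can be added via ngram_range
bigrams = CountVectorizer(max_features=10000, binary=True, ngram_range=(1, 2))
X2 = bigrams.fit_transform(docs)
print(X2.shape)
```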

Lasso vs Neural Network

  • Simpler lasso logistic regression model works as well as neural network in this case

Recurrent Neural Networks

  • Often data arise as sequences :
    • Documents are sequences of words, and their relative positions have meaning
    • Time-series such as weather data or financial indices
    • Recorded speech or music
    • Handwriting, such as doctor's notes
  • RNNs build models that take into account this sequential nature of the data, and build a memory of the past
    • The feature for each observation is a sequence of vectors X=\{X_1,\dots,X_L\}
    • The target Y is often of the usual kind - e.g. a single variable such as Sentiment, or a one-hot vector for multiclass
    • However, Y can also be a sequence, such as the same document in a different language

  • The hidden layer is a sequence of vectors A_l, receiving as input X_l as well as A_{l-1}
  • A_l produces an output O_l
  • The same weights W, U and B are used at each step in the sequence - hence the term recurrent
  • The A_l sequence represents an evolving model for the response that is updated as each element X_l is processed

  • Suppose X_l=(X_{l1},X_{l2},\dots,X_{lp}) has p components, and A_l=(A_{l1},A_{l2},\dots,A_{lK}) has K components
  • Then the computation at the kth component of hidden unit A_l is
    A_{lk}=g\left(w_{k0}+\sum^p_{j=1}w_{kj}X_{lj}+\sum^K_{s=1}u_{ks}A_{l-1,s}\right)\\[0.3cm] O_l=\beta_0+\sum^K_{k=1}\beta_kA_{lk}
  • Often we are concerned only with the prediction O_L at the last unit; a forward-pass sketch in code follows this list
  • For squared error loss, and n sequence / response pairs, we would minimize
    \sum_{i=1}^n (y_i - o_{iL})^2 = \sum_{i=1}^n \left( y_i - \left( \beta_0 + \sum_{k=1}^K \beta_k g \left( w_{k0} + \sum_{j=1}^p w_{kj} x_{iLj} + \sum_{s=1}^K u_{ks} a_{i, L-1, s} \right) \right) \right)^2
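
A NumPy sketch of the recurrence above, assuming tanh for g and random placeholder weights; the same W, U and beta are reused at every step l, which is what makes the network recurrent.

```python
# RNN forward pass: A_l = g(w0 + W X_l + U A_{l-1}), O_l = beta0 + beta . A_l
import numpy as np

def rnn_forward(X, W, w0, U, beta, beta0):
    """X: (L, p) input sequence; returns the output O_L at the last step."""
    L, p = X.shape
    K = W.shape[0]
    A = np.zeros(K)                            # A_0 = 0
    for l in range(L):
        A = np.tanh(w0 + W @ X[l] + U @ A)     # A_l from X_l and A_{l-1}
        O = beta0 + beta @ A                   # O_l (only O_L is kept here)
    return O

rng = np.random.default_rng(1)
L, p, K = 5, 3, 12
X = rng.normal(size=(L, p))
W, w0 = rng.normal(size=(K, p)) * 0.1, np.zeros(K)
U = rng.normal(size=(K, K)) * 0.1
beta, beta0 = rng.normal(size=K), 0.0
print(rnn_forward(X, W, w0, U, beta, beta0))
```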

RNN and IMDB Reviews

  • The document feature is a sequence of words \{W_l\}^L_1
  • We typically truncate/pad the documents to the same number L of words (we use L=500)
  • Each word W_l is represented as a one-hot encoded binary vector X_l of length 10K, with all zeros and a single one in the position for that word in the dictionary
  • This results in an extremely sparse feature representation, and would not work well
  • Instead we use a lower-dimensional pretrained word embedding matrix E (m\times 10K)
  • This reduces the binary feature vector of length 10K to a real feature vector of dimension m<10K (e.g. m in the low hundreds)

Word Embedding

  • Embeddings are pretrained on very large corpora of documents, using methods similar to principal components
  • word2vec and GloVe are popular

  • After a lot of work, the results are a disappointing 76\% accuracy
  • We then fit a more exotic RNN than the one displayed - an LSTM with long and short term memory (a model sketch follows this list)
  • Here A_l receives input from A_{l-1} (short term memory) as well as from a version that reaches further back in time (long term memory)
  • Now we get 87\% accuracy, slightly less than the 88\% achieved by glmnet
  • These data have been used as a benchmark for new RNN architectures
  • The best reported result found at the time of writing (2020) was around 95\%
  • We point to a leaderboard
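
A sketch of an embedding + LSTM sentiment classifier for the IMDB reviews, assuming tf.keras; the dictionary size (10K), sequence length (500) and embedding dimension m = 128 follow the setup above, but note that this sketch learns the embedding layer from scratch rather than loading a pretrained word2vec/GloVe matrix, and the optimizer and unit counts are illustrative.

```python
# Embedding + LSTM classifier for padded sequences of word indices (tf.keras assumed).
import tensorflow as tf
from tensorflow.keras import layers

max_words, seq_len, m = 10000, 500, 128

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,), dtype="int32"),   # integer word indices
    layers.Embedding(max_words, m),                    # 10K one-hot -> m-dim vectors
    layers.LSTM(32),                                   # long and short term memory
    layers.Dense(1, activation="sigmoid"),             # positive / negative sentiment
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```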

Time Series Forecasting

New York Stock Exchange Data

  • Shown in the previous slide are three daily time series covering 6,051 trading days :
    • Log trading volume : This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale
    • Dow Jones return : This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days
    • Log volatility : This is based on the absolute values of daily price movements
  • Goal : predict Log trading volume tomorrow, given its observed values up to today, as well as those of Dow Jones return and Log volatility

Autocorrelation

  • The autocorrelation at lag l is the correlation of all pairs (v_t,v_{t-l}) that are l trading days apart
  • These sizable correlations give us confidence that past values will be helpful in predicting the future
  • This is a curious prediction problem : the response v_t is also a feature v_{t-l}

RNN Forecaster

  • We only have one series of data
  • How do we set up for an RNN?
  • We extract many short mini-series of input sequences X=\{X_1,\dots,X_L\} with a predefined length L known as the lag
    X_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix}, \, X_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix}, \, \cdots, \, X_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix}, \, \text{and } Y = v_t
  • Since T=6,051, with L=5 we can create 6,046 such (X,Y) pairs (see the sketch after this list)
  • We use the first 4,281 as training data, and the following 1,770 as test data
  • We fit an RNN with 12 hidden units per lag step (i.e. per A_l)
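
A sketch of extracting the lagged (X, Y) pairs described above, assuming the three series are held in NumPy arrays v (log volume), r (DJ return) and z (log volatility) of length T; with L = 5 this yields T - L = 6,046 pairs. The random toy series stand in for the real NYSE data.

```python
# Build (X, Y) pairs: X collects the L previous days of (v, r, z), Y is today's v.
import numpy as np

def make_lagged_pairs(v, r, z, L=5):
    series = np.column_stack([v, r, z])                     # shape (T, 3)
    T = len(v)
    X = np.stack([series[t - L:t] for t in range(L, T)])    # (T-L, L, 3)
    Y = v[L:]                                               # (T-L,)
    return X, Y

# Toy data standing in for the 6,051 NYSE trading days
rng = np.random.default_rng(2)
v, r, z = rng.normal(size=(3, 6051))
X, Y = make_lagged_pairs(v, r, z, L=5)
print(X.shape, Y.shape)    # (6046, 5, 3) (6046,)
```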

  • The figure shows predictions and truth for the test period
  • R^2=0.42 for the RNN
  • R^2=0.18 for the straw man - use yesterday's value of Log trading volume to predict that of today

Autoregression Forecaster

  • The RNN forecaster is similar in structure to a traditional autoregression procedure
    \mathbf{y} = \begin{bmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_{T} \end{bmatrix} \quad \mathbf{M} = \begin{bmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_L & \cdots & v_2 \\ 1 & v_{L+2} & v_{L+1} & \cdots & v_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{bmatrix}
  • Fit an OLS regression of \mathbf{y} on \mathbf{M}, giving
    \hat{v}_t = \hat{\beta}_0 + \hat{\beta}_1 v_{t-1} + \hat{\beta}_2 v_{t-2} + \cdots + \hat{\beta}_L v_{t-L}.
  • Known as an order-L autoregression model, or AR(L)
  • For the NYSE data we can include lagged versions of DJ_return and log_volatility in the matrix M, resulting in 3L+1 columns (an OLS sketch follows this list)
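
A sketch of the order-L autoregression fit by ordinary least squares, assuming NumPy only; v is the log trading volume series, and the random data is a placeholder. Lagged DJ_return and log_volatility columns could be appended to M in the same way to get the 3L+1-column version.

```python
# AR(L): regress v_t on an intercept and its L previous values.
import numpy as np

def fit_ar(v, L=5):
    T = len(v)
    # Rows of M: [1, v_{t-1}, v_{t-2}, ..., v_{t-L}] for t = L+1, ..., T
    M = np.column_stack([np.ones(T - L)] +
                        [v[L - k:T - k] for k in range(1, L + 1)])
    y = v[L:]
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta                        # [beta_0, beta_1, ..., beta_L]

rng = np.random.default_rng(3)
v = rng.normal(size=6051)              # placeholder for the NYSE series
print(fit_ar(v, L=5))                  # 6 coefficients for AR(5)
```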

  • R^2=0.41 for the AR(5) model (16 parameters)
  • R^2=0.42 for the RNN model (205 parameters)
  • R^2=0.42 for the AR(5) model fit by a neural network
  • R^2=0.46 for all models if we include the day_of_week of the day being predicted

Non Convex Functions and Gradient Descent

  1. Start with a guess \theta^0 for all the parameters in \theta, and set t=0
  2. Iterate until the objective R(\theta) fails to decrease :
    2.1. Find a vector \delta that reflects a small change in \theta, such that \theta^{t+1}=\theta^t+\delta reduces the objective
    2.2. Set t \leftarrow t+1

  • In this simple example we reached the global minimum
  • If we had started a little to the left of \theta^0 we would have gone in the other direction, and ended up in a local minimum
  • Although \theta is multi-dimensional, we have depicted the process as one-dimensional
  • It is much harder to identify whether one is in a local minimum in high dimensions
  • How do we find a direction \delta that points downhill?
  • We compute the gradient vector
    \nabla R(\theta^t) = \left. \frac{\partial R(\theta)}{\partial \theta} \right|_{\theta = \theta^t}
  • i.e. the vector of partial derivatives at the current guess \theta^t
  • The gradient points uphill, so our update is \delta =-\rho\nabla R(\theta^t), or
    \theta^{t+1} \leftarrow \theta^t - \rho \nabla R(\theta^t)
    where \rho is the learning rate (typically small, e.g. \rho=0.001); a minimal descent loop is sketched below
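
A minimal gradient descent sketch for a generic objective R(theta), assuming a closed-form gradient is available; the quadratic example objective, learning rate and stopping tolerance are illustrative assumptions.

```python
# Gradient descent: theta^{t+1} <- theta^t - rho * grad R(theta^t)
import numpy as np

def gradient_descent(grad_R, theta0, rho=0.001, max_iter=10000, tol=1e-8):
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        step = rho * grad_R(theta)          # delta = -rho * grad R(theta^t)
        theta = theta - step
        if np.linalg.norm(step) < tol:      # stop when updates become negligible
            break
    return theta

# Example: R(theta) = ||theta - c||^2 has gradient 2 (theta - c)
c = np.array([1.0, -2.0])
print(gradient_descent(lambda th: 2 * (th - c), theta0=[0.0, 0.0], rho=0.05))
```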

Gradients and Backpropagation

R(\theta)=\sum^n_{i=1}R_i(\theta)

is a sum, so the gradient is a sum of gradients

R_i(\theta) = \frac{1}{2} \left( y_i - f_{\theta}(x_i) \right)^2 = \frac{1}{2} \left( y_i - \beta_0 - \sum_{k=1}^K \beta_k g \left( w_{k0} + \sum_{j=1}^p w_{kj} x_{ij} \right) \right)^2
  • For ease of notation, let z_{ik}=w_{k0}+\sum^p_{j=1}w_{kj}x_{ij}
  • Backpropagation uses the chain rule for differentiation; a per-observation gradient sketch in code follows below
    \frac{\partial R_i(\theta)}{\partial \beta_k} = \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial \beta_k} = - (y_i - f_\theta(x_i)) \cdot g(z_{ik}).\\[0.3cm] \frac{\partial R_i(\theta)}{\partial w_{kj}} = \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}} = - (y_i - f_\theta(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij}.
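
A NumPy sketch of the two gradient formulas above for a single observation (x_i, y_i), assuming a sigmoid activation g so that g'(z) = g(z)(1 - g(z)); the toy shapes and random weights are placeholders.

```python
# Per-observation backpropagation gradients for the single-layer network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_observation_gradients(x, y, W, w0, beta, beta0):
    z = w0 + W @ x                      # z_ik, shape (K,)
    a = sigmoid(z)                      # g(z_ik)
    f = beta0 + beta @ a                # f_theta(x_i)
    resid = y - f
    grad_beta = -resid * a                               # dR_i / d beta_k
    grad_W = -resid * np.outer(beta * a * (1 - a), x)    # dR_i / d w_kj
    return grad_beta, grad_W

rng = np.random.default_rng(4)
p, K = 3, 4
x, y = rng.normal(size=p), 1.0
W, w0 = rng.normal(size=(K, p)), np.zeros(K)
beta, beta0 = rng.normal(size=K), 0.0
print(per_observation_gradients(x, y, W, w0, beta, beta0))
```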

Tricks of the Trade

  • Slow learning
    Gradient descent is slow, and a small learning rate \rho slows it even further. With early stopping, this is a form of regularization
  • Stochastic gradient descent
    Rather than compute the gradient using all the data, use a small minibatch drawn at random at each step
  • An epoch is a count of iterations and amounts to the number of minibatch updates such that n samples in total have been processed; i.e. 60K/128 \approx 469 for MNIST (an SGD sketch follows this list)
  • Regularization
    Ridge and lasso regularization can be used to shrink the weights at each layer. Two other popular forms of regularization are dropout and augmentation
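
A sketch of minibatch stochastic gradient descent as described above; the batch size 128 mirrors the MNIST epoch calculation, while the least-squares example gradient, learning rate and epoch count are illustrative assumptions.

```python
# Minibatch SGD: at each step, take a gradient step on a random minibatch.
import numpy as np

def sgd(grad_fn, theta0, X, y, rho=0.001, batch_size=128, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(y)
    for epoch in range(epochs):                  # one epoch = n / batch_size updates
        order = rng.permutation(n)               # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            theta = theta - rho * grad_fn(theta, X[idx], y[idx])
    return theta

# Example: least-squares gradient on a minibatch, grad = -2 X^T (y - X theta) / m
grad_ls = lambda th, Xb, yb: -2 * Xb.T @ (yb - Xb @ th) / len(yb)
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -0.5, 2.0])
y = X @ true_theta + 0.1 * rng.normal(size=1000)
print(sgd(grad_ls, np.zeros(3), X, y, rho=0.05, epochs=20))   # close to true_theta
```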

Dropout Learning

  • At each SGD update, randomly remove units with probability \phi, and scale up the weights of those retained by 1/(1-\phi) to compensate (see the sketch after this list)
  • In simple scenarios like linear regression, a version of this process can be shown to be equivalent to ridge regularization
  • As in ridge, the other units stand in for those temporarily removed, and their weights are drawn closer together
  • Similar to randomly omitting variables when growing trees in random forests
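
A small NumPy sketch of dropout applied to a layer's activations during training, matching the description above: drop each unit with probability phi and scale the survivors by 1/(1-\phi); the drop rate and toy activations are illustrative.

```python
# Dropout on a layer's activations at training time.
import numpy as np

def dropout(activations, phi, rng):
    keep = rng.random(activations.shape) >= phi      # True for retained units
    return activations * keep / (1.0 - phi)          # scale up to compensate

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 8))           # activations for a minibatch of 4
print(dropout(A, phi=0.5, rng=rng))   # roughly half the entries are zeroed
```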

Ridge and Data Augmentation

  • Make many copies of each (x_i,y_i) and add a small amount of Gaussian noise to the x_i - a little cloud around each observation - but leave the copies of y_i alone
  • This makes the fit robust to small perturbations in x_i, and is equivalent to ridge regularization in an OLS setting (a sketch follows below)
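
A sketch of the noise-based augmentation described above: each (x_i, y_i) is copied several times with a small Gaussian perturbation added to x_i only; the number of copies and the noise scale are illustrative assumptions.

```python
# Data augmentation: a little Gaussian cloud around each x_i, labels left alone.
import numpy as np

def augment(X, y, n_copies=10, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_rep = np.repeat(X, n_copies, axis=0)
    y_rep = np.repeat(y, n_copies)                         # copies of y_i unchanged
    X_aug = X_rep + sigma * rng.normal(size=X_rep.shape)   # perturb the features only
    return X_aug, y_rep

X = np.array([[0.0, 1.0], [2.0, 3.0]])
y = np.array([0, 1])
X_aug, y_aug = augment(X, y)
print(X_aug.shape, y_aug.shape)     # (20, 2) (20,)
```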

Double Descent

  • With neural networks, it seems better to have too many hidden units than too few
  • Likewise, more hidden layers are better than few
  • Running stochastic gradient descent until zero training error often gives good out-of-sample error
  • Increasing the number of units or layers and again training till zero error sometimes gives even better out-of-sample error
  • When d\leq 20 (here d is the degrees of freedom of the fitted model and there are n=20 training observations), the model is OLS, and we see the usual bias-variance trade-off
  • When d>20, we revert to the minimum-norm solution
  • As d increases above 20, \sum^d_{j=1} \hat \beta_j^2 decreases since it is easier to achieve zero error, and hence less wiggly solutions

  • To achieve a zero-residual solution with d=20 is a real stretch
  • Easier for larger d

  • In a wide linear model (p>n) fit by least squares, SGD with a small step size leads to a minimum-norm zero-residual solution (see the sketch after this list)
  • Stochastic gradient flow - i.e. the entire path of SGD solutions - is somewhat similar to the ridge path
  • By analogy, deep and wide neural networks fit by SGD down to zero training error often give good solutions that generalize well
  • In particular, cases with high signal-to-noise ratio - e.g. image recognition - are less prone to overfitting; the zero-error solution is mostly signal
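
A sketch of the minimum-norm zero-residual solution in a wide linear model, assuming NumPy's pseudoinverse as a stand-in for the solution that small-step-size SGD converges to; the toy dimensions and random data are illustrative.

```python
# Wide linear model (p > n): many zero-residual fits exist; pinv picks the
# smallest-norm one, which is the solution referred to above.
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_min_norm = np.linalg.pinv(X) @ y      # smallest-norm beta with X beta = y
print(np.allclose(X @ beta_min_norm, y))   # True: zero training residuals
print(np.sum(beta_min_norm ** 2))          # the norm being implicitly minimized
```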

All contents written based on the GIST Machine Learning & Deep Learning lecture (Instructor : Prof. Sun-dong Kim)
