Optimization - Regularization

Ko Hyejung · December 11, 2021

NAVER AI TECH - precourse

We learn the key terms related to optimization and a variety of gradient descent techniques.

Key terms: generalization, overfitting, cross-validation, and more. We cover what each of these terms means.

Gradient descent variants: beyond plain SGD (stochastic gradient descent), we look at techniques that help optimization (training) work better.

Gradient Descent

First-order iterative optimization algorithm for finding a local minimum of a differentiable function

gradient descent down a 2D loss surface (two learnable parameters)
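
A minimal sketch of the update rule W_{t+1} = W_t - η ∇_W L (the toy quadratic loss, learning rate, and iteration count below are illustrative, not from the lecture):

# Plain gradient descent on a toy 1-D loss L(w) = (w - 3)^2.
def loss_grad(w):
    return 2.0 * (w - 3.0)   # dL/dw

w = 0.0      # initial parameter
eta = 0.1    # learning rate
for _ in range(100):
    w = w - eta * loss_grad(w)   # first-order step toward a local minimum
print(w)  # converges to 3.0, the minimizer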

Important Concepts in Optimization

  • Generalization
  • Under-fitting vs. over-fitting
  • Cross validation
  • Bias-variance tradeoff
  • Bootstrapping
  • Bagging and boosting

Generalization

How well the learned model will behave on unseen data

Underfitting vs Overfitting

An underfit model is too simple to capture the structure of the training data; an overfit model fits the training data (including its noise) so closely that it fails to generalize to unseen data.

Cross-validation

Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set
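
A minimal k-fold sketch (scikit-learn is used only to generate the splits; the toy data and the least-squares model are illustrative):

import numpy as np
from sklearn.model_selection import KFold

# Toy regression data: y depends linearly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
mse_per_fold = []
for train_idx, val_idx in kf.split(X):
    # Fit a least-squares linear model on the training folds only.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    val_pred = X[val_idx] @ w
    mse_per_fold.append(np.mean((val_pred - y[val_idx]) ** 2))

print(np.mean(mse_per_fold))  # cross-validated estimate of the test error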

Bias and Variance

Bias and Variance Tradeoff
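
For squared loss, the expected error of a learned predictor decomposes into bias squared, variance, and irreducible noise (the standard decomposition, stated here for reference):

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}

Making the model more flexible tends to lower the bias but raise the variance, and vice versa, so the two cannot be minimized simultaneously for a fixed amount of data.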

Bootstrapping

Bootstrapping is any test or metric that uses random sampling with replacement
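
A minimal sketch with numpy (the toy sample and the chosen statistic, the mean, are illustrative):

import numpy as np

# Bootstrap: resample the data with replacement and recompute the statistic.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # toy sample

boot_means = []
for _ in range(1000):
    resample = rng.choice(data, size=len(data), replace=True)  # with replacement
    boot_means.append(resample.mean())

# The spread of the bootstrapped statistic estimates its uncertainty.
print(np.std(boot_means))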

Bagging vs Boosting

  • Bagging (bootstrap aggregating)
    Multiple models are trained, each on a bootstrapped sample of the data.
    ex) Base classifiers are fitted on random subsets, and their individual predictions are aggregated (voting or averaging); see the sketch after this list.
  • Boosting
    It focuses on the specific training samples that are hard to classify.
    A strong model is built by combining weak learners in sequence, where each learner learns from the mistakes of the previous one.
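
A minimal bagging sketch, assuming least-squares base models on toy data (all names, data, and the number of models are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.3 * rng.normal(size=200)

n_models = 10
weights = []
for _ in range(n_models):
    # Each base model is trained on a bootstrap resample (with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    weights.append(w)

# Aggregate: average the individual predictions for a new input.
x_new = rng.normal(size=(1, 5))
preds = [x_new @ w for w in weights]
print(np.mean(preds))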

Practical Gradient Descent Methods

Gradient Descent Methods

  • Stochastic gradient descent
    The term stochastic refers to the fact that each batch of data is drawn at random (stochastic is a scientific synonym of random)
    Update with the gradient computed from a single sample
  • Mini-batch gradient descent
    Update with the gradient computed from a subset of the data (see the sketch after this list)
  • Batch gradient descent
    Update with the gradient computed from the whole data
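
A minimal mini-batch sketch on a toy linear-regression loss (the data and hyperparameters are illustrative); setting batch_size to 1 recovers single-sample SGD, and setting it to len(X) recovers batch gradient descent:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    perm = rng.permutation(len(X))              # draw batches at random each epoch
    for start in range(0, len(X), batch_size):
        b = perm[start:start + batch_size]
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient on the subset
        w -= lr * grad
print(w)  # close to [1, 2, -1]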

  • Stochastic gradient descent

  • Momentum

  • Nesterov accelerated gradient

  • Adagrad

  • Adadelta

  • RMSprop

  • Adam

Batch-size Matters

"It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize."

"We ... present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contrast, small-batch methods consistently converge to flat minimizers... this is due to the inherent noise in the gradient estimation."

Gradient Descent

Momentum

A hyperparameter β (the momentum coefficient) is introduced.
The parameters are updated with a gradient term that accumulates this momentum.
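
A minimal sketch of the update a_{t+1} = β a_t + g_t, W_{t+1} = W_t - η a_{t+1} (the toy loss and hyperparameters are illustrative):

def grad(w):
    return 2.0 * (w - 3.0)   # toy quadratic loss L(w) = (w - 3)^2

w, a = 0.0, 0.0
beta, lr = 0.9, 0.05
for _ in range(100):
    a = beta * a + grad(w)   # accumulate past gradients
    w = w - lr * a           # update with the momentum-augmented gradient
print(w)  # approaches 3.0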

Nesterov Accelerated Gradient
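
NAG evaluates the gradient at the look-ahead point W_t - η β a_t instead of at W_t, which lets it correct course before the momentum term overshoots. A minimal sketch of one common formulation (same toy loss and hyperparameters as above):

def grad(w):
    return 2.0 * (w - 3.0)

w, a = 0.0, 0.0
beta, lr = 0.9, 0.05
for _ in range(100):
    lookahead_grad = grad(w - lr * beta * a)   # gradient at the anticipated position
    a = beta * a + lookahead_grad
    w = w - lr * a
print(w)  # approaches 3.0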

Adagrad

Adagrad adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters

G_t: sum of squared gradients

What will happen if the training occurs for a long period? G_t keeps growing, so the effective learning rate shrinks toward zero and the parameters eventually stop being updated.
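
A minimal sketch of the Adagrad update (toy loss; the learning rate and epsilon are illustrative):

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)

w, G = 0.0, 0.0
lr, eps = 0.5, 1e-8
for _ in range(500):
    g = grad(w)
    G += g * g                          # G_t: running sum of squared gradients
    w -= lr / (np.sqrt(G) + eps) * g    # per-parameter adaptive step
print(w)  # approaches 3.0, but each parameter's step keeps shrinking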

Adadelta

Adadelta extends Adagrad to avoid its monotonically decreasing learning rate by restricting the accumulation window

There is no learning rate in Adadelta
-> Because there are not many elements you can tune, it is not used much in practice.
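
A minimal sketch of the Adadelta recurrences (toy loss; rho and epsilon are illustrative). Note that no learning rate appears: the RMS of past updates plays that role.

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
Eg2, Edx2 = 0.0, 0.0          # running averages of g^2 and of the update^2
rho, eps = 0.95, 1e-6
for _ in range(2000):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # no explicit learning rate
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx
    w += dx
print(w)  # moves toward 3.0; the step scale is built from the RMS of past updates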

RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture
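
A minimal sketch of the RMSprop update, which replaces Adagrad's ever-growing sum with an exponential moving average (toy loss and hyperparameters are illustrative):

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)

w, Eg2 = 0.0, 0.0
lr, gamma, eps = 0.01, 0.9, 1e-8
for _ in range(1000):
    g = grad(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g   # moving average of squared gradients
    w -= lr / (np.sqrt(Eg2) + eps) * g
print(w)  # close to 3.0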

Adam

Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients

opt = tf.keras.optimizers.Adam(learning_rate=0.1)

Adam effectively combines momentum with an adaptive learning rate approach
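
A minimal sketch of the Adam update showing the two moment estimates and their bias correction (toy loss; the hyperparameters follow the commonly used defaults except for the learning rate):

import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)

w, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)  # approaches 3.0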

Regularization

  • Early stopping
  • Parameter norm penalty
  • Data augmentation
  • Noise robustness
  • Label smoothing
  • Dropout
  • Batch normalization

Early Stopping

Interrupting training when the validation loss is no longer improving (and of course, saving the best model obtained during training)

keras.callbacks.EarlyStopping

or

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

We need additional validation data to do early stopping
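
A minimal end-to-end sketch on synthetic data (the model, data, and sizes are purely illustrative); validation_split carves out the extra validation data, and callback is the EarlyStopping callback defined above:

import numpy as np
import tensorflow as tf

x_train = np.random.randn(1000, 10).astype('float32')
y_train = (x_train.sum(axis=1) > 0).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Training stops once val_loss stops improving for patience=3 epochs.
model.fit(x_train, y_train, epochs=100, validation_split=0.2, callbacks=[callback])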

Parameter Norm Penalty

Penalizing the parameter norm adds smoothness to the function space: smaller weights mean the learned function changes less abruptly.
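
In Keras this is typically expressed as a per-layer regularizer; a minimal L2-penalty sketch (the coefficient is illustrative):

import tensorflow as tf

# Adds 0.001 * ||W||^2 for this layer's kernel to the training loss,
# which discourages large weights and favors smoother functions.
layer = tf.keras.layers.Dense(
    16,
    activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.001),
)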

Data Augmentation

A powerful technique for mitigating overfitting in computer vision.
More data is always welcome.

import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential([
  layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
  layers.experimental.preprocessing.RandomRotation(0.2),
])

However, in most cases the training data is given (fixed) in advance.
In such cases, data augmentation lets us effectively get more out of it.

Noise Robustness

Add random noise to the inputs or to the weights
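
For input noise, Keras provides a GaussianNoise layer that is only active during training; a minimal sketch (the stddev and shapes are illustrative; weight noise would need a custom training step and is not shown here):

import tensorflow as tf

noisy_model = tf.keras.Sequential([
    tf.keras.layers.GaussianNoise(0.1, input_shape=(10,)),   # noise added only at training time
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),
])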

Label Smoothing

Mixup constructs augmented training examples by mixing both the inputs and the labels of two randomly selected training examples.

CutMix constructs augmented training examples by cutting and pasting patches between the inputs and mixing the labels as soft labels of two randomly selected training examples.
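
A minimal mixup sketch with numpy (the toy inputs, one-hot labels, and alpha are illustrative). Plain label smoothing itself is also available directly in Keras losses, e.g. tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1).

import numpy as np

# Blend two examples and their one-hot labels with a weight lam ~ Beta(alpha, alpha).
rng = np.random.default_rng(0)
x1, y1 = rng.normal(size=(32, 32, 3)), np.array([1.0, 0.0])   # toy image + one-hot label
x2, y2 = rng.normal(size=(32, 32, 3)), np.array([0.0, 1.0])

alpha = 0.2
lam = rng.beta(alpha, alpha)
x_mix = lam * x1 + (1 - lam) * x2   # mix the inputs
y_mix = lam * y1 + (1 - lam) * y2   # mix (soften) the labels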

Dropout

In each forward pass, randomly set some neurons to zero

from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

Batch Normalization

Batch normalization computes the empirical mean and variance independently for each dimension (feature/channel) and normalizes the activations with them

A BatchNormalization layer is typically used after a convolutional or densely connected layer

conv_model.add(layers.Conv2D(32, 3, activation='relu')) #After a Conv layer 
conv_model.add(layers.BatchNormalization())
dense_model.add(layers.Dense(32, activation='relu')) #After a Dense layer 
dense_model.add(layers.BatchNormalization())

There are different variants of normalization (e.g., layer normalization, instance normalization, group normalization)
