We cover key terminology related to optimization and various gradient descent techniques.
Key terminology: Generalization, Overfitting, Cross-validation, and so on; we go over what each of these terms means.
Various gradient descent techniques: beyond plain SGD (stochastic gradient descent), we cover techniques that help optimization (training) work better.
Gradient descent: a first-order iterative optimization algorithm for finding a local minimum of a differentiable function
Generalization: how well the learned model will behave on unseen data
Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set
Bootstrapping is any test or metric that uses random sampling with replacement
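As a rough illustration, here is a minimal NumPy sketch of both ideas (the toy array X, the fold count k, and the index names are arbitrary choices for this example, not part of the lecture code):

import numpy as np

X = np.arange(100)                       # toy dataset of 100 samples
k = 5                                    # number of cross-validation folds

# k-fold cross-validation: every sample serves as validation data exactly once
indices = np.random.permutation(len(X))
folds = np.array_split(indices, k)
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # ... train on X[train_idx], validate on X[val_idx]

# bootstrapping: resample the dataset with replacement
boot_idx = np.random.choice(len(X), size=len(X), replace=True)
bootstrap_sample = X[boot_idx]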
Batch gradient descent
Update with the gradient computed from the whole data
Stochastic gradient descent
Momentum
Nesterov accelerated gradient
Adagrad
Adadelta
RMSprop
Adam
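The methods listed above all refine the same basic update rule; as a reference point, here is a minimal NumPy sketch of a single plain gradient descent step (W, g, and lr are placeholder names; g stands for the gradient computed from the whole data for batch GD, or from a single sample/mini-batch for SGD):

import numpy as np

W = np.zeros(10)            # parameters
lr = 0.01                   # learning rate (step size)
g = np.random.randn(10)     # gradient of the loss w.r.t. W (stand-in value here)

W = W - lr * g              # step against the gradient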
"It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize."
"We ... present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contrast, small-batch methods consistently converge to flat minimizers... this is due to the inherent noise in the gradient estimation."
Beta is introduced as a hyperparameter (the momentum coefficient)
Update with the momentum-augmented gradient
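A minimal NumPy sketch of this update under placeholder names (v accumulates the moving direction; beta is the momentum hyperparameter, typically around 0.9):

import numpy as np

W = np.zeros(10)              # parameters
g = np.random.randn(10)       # current gradient (stand-in value)
v = np.zeros_like(W)          # accumulated momentum
beta, lr = 0.9, 0.01

v = beta * v + g              # mix the previous direction with the current gradient
W = W - lr * v                # update with the momentum-augmented gradient

In tf.keras the same idea is available as tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9); setting nesterov=True switches to the Nesterov accelerated gradient variant.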
Adagrad adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters
G_t: the sum of squared gradients up to step t
What will happen if training runs for a long period?
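If training runs for a long time, G_t only keeps growing, so the effective step size shrinks toward zero and learning eventually stalls. A minimal NumPy sketch of the Adagrad update (eps is a small constant for numerical stability); tf.keras.optimizers.Adagrad provides the same behaviour:

import numpy as np

W = np.zeros(10)
g = np.random.randn(10)       # current gradient (stand-in value)
G = np.zeros_like(W)          # running sum of squared gradients (G_t)
lr, eps = 0.01, 1e-8

G = G + g * g                             # accumulate squared gradients
W = W - lr * g / (np.sqrt(G) + eps)       # larger steps for rarely-updated parameters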
Adadelta extends Adagrad to counteract its monotonically decreasing learning rate by restricting the accumulation window of past gradients
There is no learning rate in Adadelta
-> Not widely used in practice, since there are few elements one can tune
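A minimal NumPy sketch of the Adadelta update (rho is the decay rate of the exponential moving averages); the RMS of previous updates takes the place of a learning rate, which is why none appears. In tf.keras it is available as tf.keras.optimizers.Adadelta:

import numpy as np

W = np.zeros(10)
g = np.random.randn(10)            # current gradient (stand-in value)
Eg2 = np.zeros_like(W)             # EMA of squared gradients (the restricted window)
Edx2 = np.zeros_like(W)            # EMA of squared parameter updates
rho, eps = 0.95, 1e-6

Eg2 = rho * Eg2 + (1 - rho) * g * g
dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g    # note: no learning rate
Edx2 = rho * Edx2 + (1 - rho) * dx * dx
W = W + dx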
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture
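A minimal NumPy sketch of the RMSprop update, which keeps Adagrad's per-parameter scaling but replaces the ever-growing sum with an exponential moving average; tf.keras.optimizers.RMSprop implements it:

import numpy as np

W = np.zeros(10)
g = np.random.randn(10)        # current gradient (stand-in value)
Eg2 = np.zeros_like(W)         # EMA of squared gradients
lr, rho, eps = 0.001, 0.9, 1e-8

Eg2 = rho * Eg2 + (1 - rho) * g * g
W = W - lr * g / (np.sqrt(Eg2) + eps)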
Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
Adam effectively combines momentum with adaptive learning rate approach
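A minimal NumPy sketch of a single Adam step, combining a momentum-like first moment (m) with an RMSprop-like second moment (v) and applying bias correction; the hyperparameter values are the common defaults:

import numpy as np

W = np.zeros(10)
g = np.random.randn(10)             # current gradient (stand-in value)
m = np.zeros_like(W)                # EMA of gradients (momentum part)
v = np.zeros_like(W)                # EMA of squared gradients (adaptive part)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
t = 1                               # step counter

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g * g
m_hat = m / (1 - beta1 ** t)        # bias correction for the zero initialization
v_hat = v / (1 - beta2 ** t)
W = W - lr * m_hat / (np.sqrt(v_hat) + eps)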
Interrupting training when the validation loss is no longer improving (and of course, saving the best model obtained during training)
keras.callbacks.EarlyStopping
or
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
We need additional (held-out) validation data to do early stopping
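A sketch of how this might be wired together (the toy data and model below are made up for illustration); monitoring 'val_loss' on the held-out split and setting restore_best_weights=True keeps the best model obtained during training:

import numpy as np
import tensorflow as tf

x_train = np.random.randn(1000, 20).astype('float32')      # toy data
y_train = np.random.randint(0, 2, size=(1000, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

callback = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',             # watch the validation loss, not the training loss
    patience=3,                     # stop after 3 epochs without improvement
    restore_best_weights=True)      # roll back to the best weights seen so far

model.fit(x_train, y_train,
          epochs=100,
          validation_split=0.2,     # hold out 20% of the data for validation
          callbacks=[callback])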
Parameter norm penalty (weight decay) adds smoothness to the function space by keeping the weights small
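A minimal sketch of an L2 parameter norm penalty in tf.keras (the layer size and penalty weight 0.001 are arbitrary for this example):

import tensorflow as tf

# the L2 penalty on this layer's weights is added to the training loss
layer = tf.keras.layers.Dense(
    16, activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.001))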
Data augmentation is a powerful technique for mitigating overfitting in computer vision
More data are always welcome
However, in most cases, training data are given in advance
In such cases, we need data augmentation
import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),   # random horizontal/vertical flips
    layers.RandomRotation(0.2),                     # random rotation of up to ±0.2 of a full turn
])
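One way to use the data_augmentation pipeline above is as a preprocessing stage inside a model, so the random transforms are applied on the fly during training (a sketch; the input shape and Conv2D layer are placeholders):

model = tf.keras.Sequential([
    tf.keras.Input(shape=(180, 180, 3)),
    data_augmentation,                                 # random flip/rotation, active only in training
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    # ... rest of the model
])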
Add random noise to inputs or weights
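One simple way to inject input noise in tf.keras is a GaussianNoise layer, which is only active during training (a sketch; the noise level 0.1 and layer sizes are arbitrary):

import tensorflow as tf

noisy_model = tf.keras.Sequential([
    tf.keras.layers.GaussianNoise(0.1, input_shape=(10000,)),   # add Gaussian noise to the inputs
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])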
Mix-up constructs augmented training examples by mixing both the inputs and the outputs (labels) of two randomly selected training examples
CutMix constructs augmented training examples by cutting and pasting patches between the inputs and mixing the outputs as soft labels of two randomly selected training examples
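A minimal NumPy sketch of Mix-up for a single pair of examples (x1, x2 are inputs, y1, y2 their one-hot labels, and alpha controls the Beta distribution; all names and values here are placeholders):

import numpy as np

x1, x2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)   # two randomly selected inputs
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])             # their one-hot labels
alpha = 0.2

lam = np.random.beta(alpha, alpha)        # mixing coefficient
x_mix = lam * x1 + (1 - lam) * x2         # mix the inputs
y_mix = lam * y1 + (1 - lam) * y2         # mix the outputs (soft label)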
In each forward pass, randomly set some neurons to zero
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))    # randomly zero out 50% of this layer's outputs during training
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
Batch normalization computes the empirical mean and variance independently for each feature dimension over the mini-batch and normalizes the activations with them
A BatchNormalization layer is typically used after a convolutional or densely connected layer
from tensorflow.keras import layers, models

conv_model = models.Sequential()
conv_model.add(layers.Conv2D(32, 3, activation='relu'))   # After a Conv layer
conv_model.add(layers.BatchNormalization())

dense_model = models.Sequential()
dense_model.add(layers.Dense(32, activation='relu'))      # After a Dense layer
dense_model.add(layers.BatchNormalization())
There are different variants of normalization (e.g., layer, instance, and group normalization)
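For example, layer normalization is available in tf.keras in the same way (a sketch; instance and group normalization follow the same pattern but may require newer Keras versions or add-on packages):

from tensorflow.keras import layers, models

ln_model = models.Sequential()
ln_model.add(layers.Dense(32, activation='relu'))
ln_model.add(layers.LayerNormalization())   # normalizes across the feature dimension instead of the batch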