[Week 2] Tue, Aug 10

William JO · Aug 15, 2021

✏️Optimization

What is Optimization?

  • As the loss function (L) value decreases, we expect to reach some optimum point; training updates the parameters we are looking for using the partial derivatives of L with respect to them
  • Gradient Descent : a first-order (first-derivative) iterative optimization algorithm for finding a local minimum of a differentiable function



Generalization

  • How well the learned model will behave on unseen data
  • In general, the training error keeps decreasing with each iteration during training.
    -> This does not guarantee that we reach the optimum we actually want
    -> At some point the test error starts to increase
  • generalization gap (the difference between training and test error) = generalization performance
    -> Good generalization performance does not guarantee that the model itself is good
    -> ex) both train and test error are large, but the gap between them is small



Underfitting vs. Overfitting

  • Underfitting : network too simple or trained too little; fails even on the training data
  • Overfitting : good predictions on the training data, bad predictions on the test data



Cross-validation(K-fold validation)

  • is a model validation technique for assessing how the model will generalize to an independent (test) data set
  • test data → never used for training (that would be cheating!!)
  • validation data : used to check for over-/underfitting
    -> after finding the best set of hyperparameters, fix them and use all the data for the final training run (see the K-fold sketch below)
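
A minimal K-fold sketch, assuming scikit-learn and a toy Ridge model (the dataset, model, and hyperparameter values here are illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

X, y = np.random.randn(100, 5), np.random.randn(100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0)                            # candidate hyperparameter
    model.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold

print(np.mean(scores))  # compare hyperparameter sets by mean validation score,
                        # then retrain once on all the data with the best set
```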



Bias and Variance

  • Bias : how far the model's average prediction is from the true target
  • Variance : how much the predictions vary across different samplings of the training data

Bias and Variance Tradeoff

  • Minimizing the cost can be decomposed into three different parts: $bias^2$, variance, and noise
  • If the training data inherently contains noise, it is hard to reduce both bias and variance at the same time while minimizing the cost
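
For a squared-error loss with noisy targets $t = f(x) + \epsilon$, the standard decomposition behind this bullet (a sketch of the usual textbook form, not spelled out in the original note) is:

$\mathbb{E}\big[(t - \hat{f})^2\big] = \underbrace{\big(f - \mathbb{E}[\hat{f}]\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f} - \mathbb{E}[\hat{f}])^2\big]}_{\text{variance}} + \underbrace{\mathbb{E}[\epsilon^2]}_{\text{noise}}$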



Bootstrapping (Ensemble)

  • is any test or metric that uses random sampling with replacement
  • subsample the training data (with replacement) to build several sets, and use them to train several models or compute several metrics

  1. Bagging(Bootstrapping aggregating)
    • Multiple models are trained with bootstrapping.
    • ex) Base classifiers are fitted on random subsets, and their individual predictions are aggregated
      (voting or averaging)
    • usually gives better performance than vanilla (single-model) training ↑ (see the bagging sketch below)
  2. Boosting
    • focuses on the specific training samples that are hard to classify
    • combines weak learners in sequence, where each learner learns from the mistakes of the previous weak learner.
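
A minimal bagging sketch, assuming scikit-learn (the dataset and base classifier are illustrative, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each base classifier is fitted on a bootstrap sample (drawn with replacement);
# their predictions are aggregated by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))
```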



✏️Practical Gradient Descent Methods

Gradient Descent Methods

Stochastic GD : update with the gradient computed from a single sample (batch size 1)
Mini-batch GD : update with the gradient computed from a subset of the data (batch size B)
Batch GD : update with the gradient computed from the whole dataset (full batch)
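
A minimal mini-batch training loop, assuming PyTorch (the model and data are toy placeholders). Setting batch_size=1 gives stochastic GD, and batch_size=len(dataset) gives full-batch GD:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(256, 10), torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # mini-batch GD

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for xb, yb in loader:                         # one update per mini-batch
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```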



Batch-size Matters

  • large-batch methods tend to converge to sharp minimizers, while small-batch methods consistently converge to flat minimizers; this is due to the inherent noise in the gradient estimation.
  • flat minimizers give better generalization performance.
    -> This is why mini-batch GD (which tends to find flat minimizers) is preferred



Gradient Descent

$W_{t+1} = W_t - \eta g_t$

  • $\eta$ : learning rate, $g_t$ : gradient
  • Problem : it is hard to choose an appropriate step size (learning rate)
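
A minimal vanilla gradient-descent sketch in NumPy on a toy least-squares problem (the problem and values are illustrative, not from the original post):

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
w = np.zeros(3)
eta = 0.01                                # learning rate (step size)

for _ in range(200):
    g = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error w.r.t. w
    w = w - eta * g                       # W_{t+1} = W_t - eta * g_t
```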



Momentum

$a_{t+1} = \beta a_t + g_t$ → update with the gradient that has momentum folded in
$W_{t+1} = W_t - \eta a_{t+1}$

  • $\beta$ : momentum, $a_{t+1}$ : accumulation
  • intuition : once the gradient flows in the ➡ direction, keep carrying that ➡ information even if the next gradient points ⬅ → reuse information from the previous gradients.
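
A momentum sketch in NumPy, reusing the same toy least-squares problem as above (values are illustrative):

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
w, a = np.zeros(3), np.zeros(3)
eta, beta = 0.01, 0.9                     # learning rate, momentum

for _ in range(200):
    g = 2 * X.T @ (X @ w - y) / len(y)
    a = beta * a + g                      # a_{t+1} = beta * a_t + g_t
    w = w - eta * a                       # W_{t+1} = W_t - eta * a_{t+1}
```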



Nesterov Accelerated Gradient(NAG)

$a_{t+1} = \beta a_t + \nabla L(W_t - \eta \beta a_t)$ → compute the gradient at the point reached by first moving along $a_t$, and use it to build $a_{t+1}$
$W_{t+1} = W_t - \eta a_{t+1}$

  • $\nabla L(W_t - \eta \beta a_t)$ : lookahead gradient

  • NAG converges faster (than plain momentum).
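
A NAG sketch in NumPy on the same toy problem (illustrative values; `grad` is just a helper for the loss gradient):

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, a = np.zeros(3), np.zeros(3)
eta, beta = 0.01, 0.9

for _ in range(200):
    a = beta * a + grad(w - eta * beta * a)   # lookahead gradient
    w = w - eta * a
```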



Adagrad

  • adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters

$W_{t+1} = W_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$

  • $\epsilon$ : for numerical stability (avoids division by zero)
  • $G_t$ : sum of squared gradients (accumulates how much each parameter has changed up to step t)
  • parameters that have changed little get larger updates ⬆, parameters that have changed a lot get smaller updates ⬇
  • Problem : $G_t$ is monotonically increasing, so as it keeps growing the effective update shrinks toward zero; training tends to stall in the later stages
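
An Adagrad sketch in NumPy on the same toy problem (illustrative values):

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, G = np.zeros(3), np.zeros(3)
eta, eps = 0.1, 1e-8

for _ in range(200):
    g = grad(w)
    G += g ** 2                           # per-parameter sum of squared gradients
    w -= eta / np.sqrt(G + eps) * g       # rarely-updated parameters get larger steps
```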



Adadelta

  • extends Adagrad to reduce its monotonically decreasing learning rate by restricting the accumulation window → tries to keep Adagrad's $G_t$ from blowing up

$G_t = \gamma G_{t-1} + (1 - \gamma) g_t^2$
$W_{t+1} = W_t - \frac{\sqrt{H_{t-1} + \epsilon}}{\sqrt{G_t + \epsilon}} g_t$
$H_t = \gamma H_{t-1} + (1-\gamma)(\Delta W_t)^2$

  • $G_t$ : Exponential Moving Average (EMA) of squared gradients
  • $H_t$ : EMA of squared parameter updates ($\Delta W_t$)
  • $\frac{\sqrt{H_{t-1} + \epsilon}}{\sqrt{G_t + \epsilon}}$ : the term that replaces the learning rate and prevents the monotonic decay caused by Adagrad's ever-growing denominator $G_t$
  • conceptually, it looks at the squared-gradient changes over a window of recent time steps
    -> Problem : with many parameters, keeping an explicit window of past gradients becomes impractical, which is why the EMA is used instead
  • no learning rate
  • hard to tune, so it is rarely used
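
An Adadelta sketch in NumPy on the same toy problem (illustrative values); note that no learning rate appears:

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w = np.zeros(3)
G, H = np.zeros(3), np.zeros(3)           # EMA of g^2, EMA of (delta W)^2
gamma, eps = 0.9, 1e-6

for _ in range(200):
    g = grad(w)
    G = gamma * G + (1 - gamma) * g ** 2
    dw = -np.sqrt(H + eps) / np.sqrt(G + eps) * g
    H = gamma * H + (1 - gamma) * dw ** 2
    w += dw
```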



RMSprop

  • an unpublished adaptive learning-rate method proposed by Geoff Hinton in his lecture

$G_t = \gamma G_{t-1} + (1 - \gamma) g_t^2$
$W_{t+1} = W_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$

  • $G_t$ : EMA of squared gradients
  • $\eta$ : step size
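
An RMSprop sketch in NumPy on the same toy problem (illustrative values):

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w, G = np.zeros(3), np.zeros(3)
eta, gamma, eps = 0.01, 0.9, 1e-8

for _ in range(200):
    g = grad(w)
    G = gamma * G + (1 - gamma) * g ** 2  # EMA of squared gradients
    w -= eta / np.sqrt(G + eps) * g
```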



Adam(Adaptive Moment Estimation)

  • leverages both past gradients and squared gradients
  • hyperparameters : $\beta_1$, $\beta_2$, $\eta$, $\epsilon$

$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
$W_{t+1} = W_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} m_t$

  • $m_t$ : momentum (EMA of gradients)
  • $v_t$ : EMA of squared gradients
  • $\eta$ : step size
  • $\beta_1, \beta_2$ : how much of the previous step's momentum and squared-gradient EMA to keep
  • Adam effectively combines momentum with the adaptive learning-rate approach (RMSprop)
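
An Adam sketch in NumPy on the same toy problem (illustrative values), including the bias-correction factor from the update rule above:

```python
import numpy as np

X, y = np.random.randn(100, 3), np.random.randn(100)
grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)

w = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2      # EMA of squared gradients
    correction = np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)  # bias correction
    w -= eta / np.sqrt(v + eps) * correction * m
```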





✏️Regularization

  • a 'tool' for generalization
  • imposes constraints that work against training → deliberately hindering the fit to the training data is the point
    -> the hope is that this also makes the model work well on test (real) data predictions (not guaranteed)
    -> because the validation dataset distribution ≠ test dataset distribution



Early Stopping

  • we need additional validation data to do early stopping
    -> validation data : a dataset never used for training
  • stop training at the point where the validation cost (error) starts to increase
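
A minimal early-stopping sketch in PyTorch (the toy data, model, and patience value are illustrative, not from the original post):

```python
import torch

X_tr, y_tr = torch.randn(200, 10), torch.randn(200, 1)
X_val, y_val = torch.randn(50, 10), torch.randn(50, 1)    # held-out validation data
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    loss = torch.nn.functional.mse_loss(model(X_tr), y_tr)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        val_loss = torch.nn.functional.mse_loss(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0    # validation error still improving
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # validation error keeps rising -> stop
            break
```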



Parameter Norm Penalty (Weight Decay)

  • adds smoothness to the function space → the assumption is that a smoother function generalizes better

  • constrains the NN parameters so they do not grow too large → weight decay

  • the common forms are L2 regularization and L1 regularization

  • Reference
    https://deepapple.tistory.com/6
    https://huidea.tistory.com/154
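
An L2-penalty (weight decay) sketch in PyTorch (illustrative model and penalty strength). The explicit-penalty form and the optimizer's weight_decay argument express the same idea, differing only in how the constant in front of $\|W\|^2$ is defined:

```python
import torch

model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
lam = 1e-4                                    # penalty strength

# (1) explicit penalty: loss = data loss + lambda * ||W||^2
l2 = sum((p ** 2).sum() for p in model.parameters())
loss = torch.nn.functional.mse_loss(model(x), y) + lam * l2
loss.backward()

# (2) or let the optimizer apply weight decay at every update
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
```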



Data Augmentation

  • More data is always welcome



Noise Robustness

  • add random noise to the inputs or the weights → data augmentation only perturbs the input data
  • why this works so well is still an open question



Label Smoothing

  • ≈ Data Augmentation
  • constructs augmented training examples by mixing both the input and the output of two randomly selected training examples (the idea behind Mixup/CutMix)
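
A minimal Mixup-style sketch in PyTorch showing one common way to mix two examples (illustrative; not necessarily the exact recipe from the lecture):

```python
import torch

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2             # mix the inputs
    y = lam * y1 + (1 - lam) * y2             # mix the (one-hot) labels
    return x, y

x1, x2 = torch.randn(3, 32, 32), torch.randn(3, 32, 32)   # two images
y1, y2 = torch.tensor([1., 0.]), torch.tensor([0., 1.])   # their one-hot labels
x_mix, y_mix = mixup(x1, y1, x2, y2)
```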



Dropout

  • during NN training (forward/backward propagation), set each layer's neurons to zero with probability equal to the dropout ratio
  • each neuron can then learn more robust features; gives a voting (ensemble-like) effect (not formally proven)
  • Reference
    https://algopoolja.tistory.com/50
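
A Dropout sketch in PyTorch (illustrative layer sizes); dropout is active in train() mode and disabled in eval() mode:

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),      # zero out 50% of the activations during training
    torch.nn.Linear(32, 1),
)

x = torch.randn(4, 10)
net.train(); out_train = net(x)   # dropout applied
net.eval();  out_eval = net(x)    # dropout disabled at test time
```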



Batch Normalization

  • compute the empirical mean and variance independently for each dimension (layer) and normalize
    -> normalizes each layer's statistics

$\mu_B = \frac{1}{m} \sum^m_{i=1} x_i$
$\sigma^2_B = \frac{1}{m} \sum^m_{i=1} \left(x_i - \mu_B\right)^2$
$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}$
→ subtract the mean and divide by the standard deviation

  • different variants of normalization

    - Batch norm : normalizes each feature over the whole mini-batch
    - Layer norm : normalizes over the features within each layer, per example
    - Instance norm : normalizes each individual example (e.g., per image)
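
A BatchNorm sketch in PyTorch compared against the manual computation of the same statistics (illustrative shapes):

```python
import torch

x = torch.randn(8, 16)                        # batch of 8 examples, 16 features

bn = torch.nn.BatchNorm1d(16, affine=False)   # no learnable gamma/beta, for comparison
out = bn(x)                                   # module is in training mode by default

mu = x.mean(dim=0)                            # empirical mean per feature
var = x.var(dim=0, unbiased=False)            # empirical variance per feature
x_hat = (x - mu) / torch.sqrt(var + bn.eps)   # subtract mean, divide by std

print(torch.allclose(out, x_hat, atol=1e-5))  # True (up to numerical tolerance)
```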




