[Hands-on Machine Learning] Gradient Descent

김강태·2021년 1월 9일

Gradient Descent

Gradient descent (경사 하강법)은 여러 문제에서 해법을 찾을 수 있는 매우 일반적인 최적화 알고리즘이다,.

Gradient descent의 기본 아이디어는 loss function을 최소화하기 위하여 반복해서 파라미터를 조정해가는 것으로 파라미터인 vector $\theta$ 에 대해 loss function의 현재 gradient를 계산하여 gradient가 0이 될때까지 진행하는 것이다.

* 자세한 개념적 설명은 EECS 498-007 / 598-005 강의 포스팅의 lec4 부분에서 보실수 있습니다.

Batch-Gradient Descent

gradient descent algorithm을 구현하려면 $\theta_j$ (모든 $\theta들$ )에 대해 loss function이 얼마나 바뀌는지 계산해야 한다. loss function의 partial derivative(편도 함수)를 계산하고 gradient vector를 도출 해야한다. 쉽게말해 multivariable을 input으로 갖고, scalar value를 output으로 갖는 loss function을 편미분하여 계산하는 것 이다. 특정 파라미터 $\theta_j$ 에 대한 loss function(MSE)의 편도함수는 다음과 같다

\frac{\partial}{\partial\theta_j}MSE(\theta) = \frac{2}{m}\sum_{i-1}^m(\theta^T\cdot x^{(i)}-y^{(i)})x_j^{(i)}

하지만 각각의 $\theta$ 에 대하여 편미분을 계산하는 대신 모든 $\theta$ 에 대해 한꺼번에 계산하여 gradient vector를 한번에 계산하는 공식은 아래와 같다.

\nabla_\theta MSE(\theta) = \begin{pmatrix} \frac{\partial}{\partial\theta_0}MSE(\theta)\\ \frac{\partial}{\partial\theta_1}MSE(\theta)\\ \vdots\\ \frac{\partial}{\partial\theta_n}MSE(\theta) \end{pmatrix} = \frac{2}{m}X^T\cdot(X\cdot\theta-y)

각 $\theta$ 에 대해 편미분한 값을 element로 갖은 gradient vector를 행렬식 표현을 통해 한번에 계산한 것이다.

이 공식은 매 경사하강법 step에서 전체 training set X에 대해 계산하기 때문에 이 알고리즘을 Batch Gradient Descent라고 한다. 이러한 이유로 Batch Gradient Descent는 training set이 매우 큰 경우에는 아주 느리자만 x의 feature수 에는 민감하지 않다.

이제 gradient vector를 구했으면 $\theta$ 의 gradient가 0이 되는 방향 (loss가 최소화되는 방향)으로 $\theta$ 를 수정해야 한다. 이는 다음과같이 나타낸다.

\theta^{next~spet} = \theta-\eta\nabla_\theta MSE(\theta)

이때 학습률 $\eta$ (eta)를 사용하여 수정할(이동할) step의 크기를 결정한다.

* 간단한 코드를 통해 위 알고리즘을 살펴보자

import numpy as np

# 균일분포를 따르는 0-2 범위의 난수 생성
X = 2 * np.random.rand(100,1) 
# 가우시안 노이즈를 위한 정규분포를 따르는 0-1 범위의 난수 생성
y = 4 + 3 * X + np.random.randn(100,1)

eta = 0.1 # learning rate
n_iterations = 1000 # 반복할 step 수
m = 100 # data 개수

theta = np.random.randn(2,1) # theta random initialization

for iteration in range(n_iterations):
  gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) # gradient vector 계산
  theta = theta - eta * gradients # theta 수정(이동)

print(theta)

array([[4.17747757],
[2.80936927]])

모든 step을 통해 최적화된 theta를 살펴보니 이전 포스팅에서 살펴본 정규방정식을 통해 계산된 $\theta$ 와 같은 것을 볼 수 있다.

학습률 $\eta$ 를 3가지 값으로 나누어서 결과를 살펴보자.

위 그래프는 각각 eta(learning rate)별 batch gradient descent의 처음 10개의 step을 통한 예측 model을 파란색 그래프로 나타낸 것이다. 이때 빨간색 점선 그래프는 시작점을 나타낸다.

위의 그래프중 왼쪽의 그래프는 학습률이 낮다고 볼 수 있다. 결국엔 최적점에 도달하겠지만 가운데 그래프에비해 시간이 오래걸릴 것 이다. 오른쪽 그래프는 학습률이 너무 커 model이 이리저리 날뛰면서 최적점에서 멀어지는 것을 볼 수 있다.

gradient descent algorithm에서 적절한 learning rate를 설정하는게 중요하다는 것을 보여준다.

Stochastic Gradient Descent

앞에서 설명하였듯이 batch gradient descent는 매 step에서 전체 set을 사용해 gradient vector를 계산하기 때문에 training set 이 매우 커지면 속도가 느려진다. batch gradient descent와는 다르게 stochastic gradient descent 에서는 매 step에서 딱 한개의 sample을 무작위로 선택하여 그 sample에 대한 gradient를 계산한다.

하지만 무작위로 sample을 선택하기 때문에 batch gradient descent보다 불안정하다. 이러한 특성은 batch gradient descent의 local minimum 문제를 해결해주지만 global minimum의 최솟값에 다다르지 못하고 주변을 맴돌게 된다.

이러한 딜레마를 해결하기위해 학습률을 점진적으로 감소시키는 방법이있다. 이때 학습률을 결정하는 함수를 learning schedule이라 한다.

* learning schedule을 사용한 stochastic gradient descent를 간단한 코드를 통해 살펴보자.

n_epochs = 50
t0, t1 = 5, 50 # learning schedule function parameter

#learning schedule function
def learning_schedule(t):
  return t0 / (t + t1)

theta = np.random.randn(2,1) # theta random initilization

for epoch in range(n_epochs) : 
  for i in range(m): # training set 크기인 m 만큼 반복
    random_index = np.random.randint(m)
    xi = X_b[random_index:random_index+1] # training data에서 무작위 sample을 선택
    yi = y[random_index:random_index+1]
    gradients = 2 * xi.T.dot(xi.dot(theta) - yi) #gradient vector 계산
    eta = learning_schedule(epoch * m + i) # learning schedule function을 통해 eta 계산
    theta = theta - eta * gradients
    
print(theta)

array([[4.3305164 ],
[2.69299599]])

이전의 batch gradient decent가 1000번 반복한 반면에 stochastic gradient descent 에서는 50번만 반복하고도 근사치에 도달 한 것을 볼 수 있다.

아래는 sklearn 에서 SGDclassifier 클래스를 통해 학습한 코드이다.

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())

print(sgd_reg.intercept_, sgd_reg.coef_)

Mini-batch gradient descent

마지막으로 살펴볼 mini-batch gradient descent는 전체 set에서 gradient를 계산 하거나 (BGD 방법), 하나의 무작위 sample을 기반으로 gradient를 계산 (SGD) 하는 것이 아닌 mini-batch 라고 부르는 임의의 작은 sample set에대해 gradient를 계산하는 방법을 사용한다.

mini-batch gradient는 SGD보다 덜 불규칙하게 움직여 보다 더 최적값에 가까이 도달하지만 SGD에 비해 local minimum문제 발생확률이 높다.

위 그래프는 parameter space에 표시한 gradient descent algoritm별 경로를 나타낸 것이다. 보이는 것 처럼 batch gradient descent가 실제 최적점에서 멈춘 반면에 SGD와 mini-batch는 주변을 맴돌고 있다는 것을 알 수있다. 그래서 적절한 learning rate와 learning schedule function을 사용하는것이 중요하다.

김강태

개썅마이웨이로 shine my way

이전 포스트

[Hands-on Machine Learning] Regression using the Normal Equation + learning curve

다음 포스트