02. Linear Regression

AT · June 23, 2023

ML


Linear regression is the problem of predicting a real-valued output for a given input.

1. Univariate Linear Regression

1.1 The Hypothesis Function

h(x) = \theta_0 + \theta_1 x

A hypothesis is a function that describes the relationship between the input (feature) and the output (target).
The hypothesis $\,h$ can take any functional form, but a linear function is commonly used.

Linear regression that uses a single feature is called "Univariate Linear Regression".
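As a minimal sketch of the hypothesis above, the function below evaluates $h(x)$ for given parameters. The name `hypothesis` and the sample parameter values are illustrative assumptions, not taken from the original post.

```python
def hypothesis(x, theta0, theta1):
    """Univariate linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters: intercept 1.0, slope 2.0
print(hypothesis(3.0, theta0=1.0, theta1=2.0))  # 7.0
```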

1.2 Cost function

J(\theta_0, \theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^{m}{(\hat{y}^{(i)}-y^{(i)})^2} = \frac{1}{2m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})^2}

\text{error} = h(x^{(i)})-y^{(i)}

$m :\,$ number of training examples
$x^{(i)} :\,$ feature of the $i^{th}$ training example
$y^{(i)} :\,$ label (target variable) of the $i^{th}$ training example
$\theta_0 :$ intercept (bias)
$\theta_1 :$ slope

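The cost function is used to measure how accurate the hypothesis function is.

The parameters $\,\theta_0, \theta_1$ are estimated from the training dataset, and they are estimated in the direction that minimizes the error.

Because an individual error can be positive or negative, the usual approach is to sum the squared errors and find the parameters that minimize that sum.
In other words, the cost function is the mean squared error (MSE).

One difference is that a mean would divide by the number of data points $m$, whereas here the sum is divided by $2m$.
Dividing the cost by any positive constant does not change the parameters that minimize it, so the extra factor of 2 is commonly included to simplify later calculations: it cancels the 2 produced when the square is differentiated.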
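The cost formula can be sketched directly in code. This is only an illustration: the function name `cost` and the toy data (generated from $y = 1 + 2x$) are assumptions made for the example.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)


# Hypothetical toy data generated by y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # 0.0 (perfect fit)
print(cost(0.0, 2.0, xs, ys))  # 0.5 (every error is -1)
```

With the true parameters the cost is zero; shifting the intercept by 1 gives an error of -1 on each of the 4 examples, so the cost is $4 / (2 \cdot 4) = 0.5$.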

[Figure: two panels, one for the case $\theta_0 = 0$ and one for the case $\theta_0 \ne 0$]

2. Parameter Learning - Gradient Descent

2.1 Gradient Descent Algorithm

Gradient descent is one of the methods that can be used to minimize the cost function. It is also a general method used for many other optimization problems, not only for cost functions.

\underset{\theta_0, \theta_1}{\arg\min}\; J(\theta_0, \theta_1)

\theta_0 := \theta_0 - \alpha{\partial \over \partial\theta_0}J(\theta_0, \theta_1)

\theta_1 := \theta_1 - \alpha{\partial \over \partial\theta_1}J(\theta_0, \theta_1)

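{\partial \over \partial\theta_0}J(\theta_0, \theta_1) = {\partial \over \partial\theta_0}\left[\frac{1}{2m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})^2}\right] = \frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}

{\partial \over \partial\theta_1}J(\theta_0, \theta_1) = {\partial \over \partial\theta_1}\left[\frac{1}{2m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})^2}\right] = \frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}\cdot x^{(i)}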
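The intermediate step is the chain rule. As a short worked derivation for $\theta_1$, using $h(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$ so that ${\partial \over \partial\theta_1}h(x^{(i)}) = x^{(i)}$:

{\partial \over \partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^{m}{2\,(h(x^{(i)})-y^{(i)})}\cdot{\partial \over \partial\theta_1}h(x^{(i)}) = \frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}\cdot x^{(i)}

For $\theta_0$ the inner derivative is ${\partial \over \partial\theta_0}h(x^{(i)}) = 1$, so the trailing factor $x^{(i)}$ drops out. The 2 from differentiating the square cancels against the $\frac{1}{2m}$ prefactor, leaving $\frac{1}{m}$.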

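Note that the parameters must be updated simultaneously.
If $\theta_0$ is updated first, the hypothesis changes, and computing the update for $\theta_1$ from that changed hypothesis mixes old and new parameter values, so the step no longer follows the gradient at the current point.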

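If the partial derivative ${\partial \over \partial\theta_j}J(\theta_0, \theta_1)$ is positive, $\theta_j$ moves to the left (decreases); if it is negative, $\theta_j$ moves to the right (increases). Either way the cost is reduced.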
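The update rule, including the simultaneous update, can be sketched as follows. The function name, the default `alpha` and `num_iters`, the zero initialization, and the toy data are all assumptions chosen for this illustration.

```python
def gradient_descent(xs, ys, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        # Compute both gradients from the CURRENT parameters first ...
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # ... then update both parameters simultaneously.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical toy data generated by y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))  # approximately 1.0 2.0
```

Both gradients are computed before either parameter is changed, and the tuple assignment applies the two updates together.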

2.2 Learning Rate $\alpha > 0$

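If the learning rate is too small, convergence takes a long time. If it is too large, the updates can overshoot the minimum, so the algorithm may fail to converge or even diverge. Choosing an appropriate learning rate is therefore important.

In most cases, the magnitude of the partial derivative shrinks as the parameters approach the optimum, so the updates become smaller on their own and $\alpha$ does not have to be adjusted manually during training.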
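The effect of the learning rate can be seen by recording the cost after each step for several values of $\alpha$. This is a sketch under assumptions: the helper names, the three `alpha` values, the 50 iterations, and the toy data are picked only to make the three regimes visible on this particular dataset.

```python
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def cost_history(xs, ys, alpha, num_iters=50):
    """Run gradient descent from (0, 0) and record the cost after each step."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(num_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return history


# Hypothetical toy data generated by y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

for alpha in (0.001, 0.1, 0.6):
    history = cost_history(xs, ys, alpha)
    print(f"alpha={alpha}: first cost={history[0]:.4g}, last cost={history[-1]:.4g}")
```

On this data, `alpha=0.001` lowers the cost only slowly, `alpha=0.1` brings it close to zero, and `alpha=0.6` makes the cost grow from step to step.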

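2.3 Global Minimum vs. Local Minimum

There may be several local minima to which gradient descent can converge.
This is one of the weaknesses of gradient descent. A common remedy is to initialize the parameters randomly several times, run the algorithm from each initialization, and keep the parameters that give the lowest cost. (The squared-error cost of linear regression used here is convex, so every local minimum is also a global minimum; the issue arises with non-convex cost functions.)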
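The restart strategy can be sketched on a one-dimensional non-convex function. Everything here is a hypothetical illustration: the function `f`, the step size, the number of restarts, and the sampling range are assumptions, and `f` is not a linear-regression cost.

```python
import random


def f(theta):
    """Hypothetical non-convex cost with one local and one global minimum."""
    return theta**4 - 3 * theta**2 + theta


def f_prime(theta):
    return 4 * theta**3 - 6 * theta + 1


def descend(theta, alpha=0.01, num_iters=2000):
    """Plain gradient descent on f from a given starting point."""
    for _ in range(num_iters):
        theta -= alpha * f_prime(theta)
    return theta


def best_of_restarts(starts):
    """Run gradient descent from every start and keep the lowest-cost result."""
    results = [descend(s) for s in starts]
    return min(results, key=f)


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best = best_of_restarts(starts)
print(f"best theta = {best:.4f}, cost = {f(best):.4f}")
```

A start at 1.5 ends in the local minimum near $\theta \approx 1.13$, while a start at -2.0 reaches the global minimum near $\theta \approx -1.30$; comparing the final costs selects the latter.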

3. Multivariate Linear Regression

3.1 Cost Function

h(x^{(i)}) = \theta_0 + \theta_1x_1^{(i)} + \theta_2x_2^{(i)} + \theta_3x_3^{(i)} + \cdot\cdot\cdot + \theta_nx_n^{(i)}

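X = \left[\begin{array}{ccccc} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{array}\right] \quad\quad\quad \Theta = \left[\begin{array}{c} \theta_0 \\ \theta_1 \\ \vdots\\ \theta_n \end{array}\right] \quad\quad\quad Y = \left[\begin{array}{c} y^{(1)} \\ y^{(2)} \\ \vdots\\ y^{(m)} \end{array}\right]

Here $x_{i,j}$ denotes $x_j^{(i)}$, and the leading column of ones corresponds to $x_0^{(i)} = 1$, so that $\theta_0$ acts as the intercept in $X\Theta$.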

h(X) = X\Theta

J(\theta) = \frac{1}{2m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})^2}

\text{Vectorized} \quad \Rightarrow \quad J(\Theta) = {1 \over 2m}(X\Theta-Y)^T(X\Theta-Y)

$n \;\,:$ number of features
$x_j^{(i)} :$ the value of feature $j$ in the $i^{th}$ training example
$x^{(i)} :$ the $n$ -dimensional feature vector of the $i^{th}$ training example
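To check that the vectorized form agrees with the summation form, both can be computed side by side. This sketch assumes NumPy; the function names and the small data set are made up for the example, and a column of ones is prepended to the features so that $\theta_0$ is the intercept.

```python
import numpy as np


def cost_loop(X, y, theta):
    """J(theta) computed example by example."""
    m = len(y)
    total = 0.0
    for i in range(m):
        total += (X[i] @ theta - y[i]) ** 2
    return total / (2 * m)


def cost_vectorized(X, y, theta):
    """J(Theta) = 1/(2m) * (X Theta - Y)^T (X Theta - Y)."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * m)


# Hypothetical data: m = 3 examples, n = 2 features
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X = np.column_stack([np.ones(len(features)), features])  # prepend x_0 = 1
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.1, 0.2])

print(cost_loop(X, y, theta), cost_vectorized(X, y, theta))
```

The residuals here are $0$, $-0.4$ and $-0.8$, so both functions return $0.8 / 6 \approx 0.1333$.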

3.2 Gradient Descent for Multiple Variables

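Gradient descent has essentially the same form as in the single-variable case; the only difference is that the update is repeated for every parameter $\theta_j$, $j = 0, \dots, n$ (taking $x_0^{(i)} = 1$).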

\theta_j := \theta_j - \alpha{\partial \over \partial\theta_j}J(\theta_0, \cdot \cdot \cdot , \theta_n) = \theta_j - \alpha\frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}\cdot x_j^{(i)}

Written out for each parameter:

\theta_0 := \theta_0 - \alpha\frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}\cdot x_0^{(i)} \\ \theta_1 := \theta_1 - \alpha\frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}\cdot x_1^{(i)} \\ \vdots \\ \theta_n := \theta_n - \alpha\frac{1}{m} \displaystyle\sum_{i=1}^{m}{(h(x^{(i)})-y^{(i)})}\cdot x_n^{(i)}

\nabla J(\Theta) = \left[\begin{array}{rrr} {\partial J(\vec{\theta}) \over \partial\theta_0} \\ \\ {\partial J(\vec{\theta}) \over \partial\theta_1} \\\\ \vdots\\ \\ {\partial J(\vec{\theta}) \over \partial\theta_n} \end{array}\right] \quad\quad \nabla J(\Theta) = \frac{1}{m}X^T(X\Theta - Y)

\Theta := \Theta - \alpha\nabla J(\Theta)
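The vectorized update can be sketched in a few lines. This assumes NumPy; the function name, the default `alpha` and `num_iters`, the zero initialization, and the toy data (generated from $y = 1 + 2x_1 - 3x_2$) are illustrative choices.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent: Theta := Theta - alpha * (1/m) X^T (X Theta - Y)."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient  # all parameters updated at once
    return theta


# Hypothetical data generated by y = 1 + 2*x1 - 3*x2
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.column_stack([np.ones(len(features)), features])  # prepend x_0 = 1
y = 1.0 + 2.0 * features[:, 0] - 3.0 * features[:, 1]

theta = gradient_descent(X, y)
print(np.round(theta, 3))  # approximately [1, 2, -3]
```

Because the whole vector is replaced in one assignment, the simultaneous-update requirement is satisfied automatically.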

4. Gradient Descent in Practice

4.1 Feature Scaling

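Feature scaling is a data preprocessing step that brings features with different scales or units into a common range.

Improved model performance: Without feature scaling, a feature with a large range can influence the model more than the other features.
Faster convergence: Some optimization algorithms (e.g. gradient descent) can converge faster when the features lie in similar ranges.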

4.1.1 Normalization (Min-Max scaling)

x^{\prime} = \frac{x - x_{min}}{x_{max} - x_{min}}

$x_{max}$ is the maximum value of the feature
$x_{min}$ is the minimum value of the feature
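A minimal sketch of min-max scaling for a single feature follows. The function name and the sample values are assumptions; the guard for a constant feature is an added safeguard, since the formula divides by zero in that case.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values
sizes = [50.0, 100.0, 150.0, 250.0]
print(min_max_scale(sizes))  # [0.0, 0.25, 0.5, 1.0]
```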

4.1.2 Standardization

x^{\prime} = \frac{{x - \mu}}{{\sigma}}

$\mu$ is the mean of the feature values
$\sigma$ is the standard deviation of the feature values.
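Standardization can be sketched with the standard library. Note the assumption made here: $\sigma$ is taken as the population standard deviation (`statistics.pstdev`); the text does not say whether the population or the sample version is intended.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit variance: x' = (x - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values with mean 5 and population standard deviation 2
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(values))  # approximately [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```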

5. Polynomial Regression

The hypothesis function does not have to be linear. For example, a quadratic, cubic, or square-root form can be used.

For instance, starting from $h(x) = \theta_0 + \theta_1 x_1$, a feature derived from $x_1$ can be added to obtain a quadratic form.

h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \quad\quad x_2 = x_1^2

Note that feature scaling becomes even more important when features are created this way.
For example, if the original range of $x_1$ is 1 to 100, the range of $x_2$ becomes 1 to 10000.
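The range blow-up is easy to reproduce. In this sketch the sample values of $x_1$ are made up, and min-max scaling is used as one possible way to bring both features back to a common range.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


x1 = [1.0, 10.0, 50.0, 100.0]  # hypothetical original feature
x2 = [v**2 for v in x1]        # derived feature x2 = x1^2

print(min(x1), max(x1))  # 1.0 100.0
print(min(x2), max(x2))  # 1.0 10000.0

x1_scaled = min_max_scale(x1)
x2_scaled = min_max_scale(x2)
print(max(x1_scaled), max(x2_scaled))  # 1.0 1.0
```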

6. Computing Parameters Analytically

6.1 Normal Equation

The normal equation finds the analytical solution for the model parameters $\theta$ directly. In contrast to gradient descent, which needs many iterations, it obtains the optimal solution in a single step.

\text{Normal Equation}: \quad \hat{\Theta} = (X^TX)^{-1}X^TY
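A minimal sketch of the normal equation with NumPy follows. Instead of forming $(X^TX)^{-1}$ explicitly, it solves the equivalent linear system $(X^TX)\Theta = X^TY$ with `np.linalg.solve`, which is a design choice of this example and assumes $X^TX$ is invertible. The function name and the toy data (generated from $y = 1 + 2x_1 - 3x_2$) are assumptions.

```python
import numpy as np


def normal_equation(X, y):
    """Solve (X^T X) Theta = X^T Y for Theta."""
    # Solving the system avoids computing the inverse explicitly;
    # it assumes X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical data generated by y = 1 + 2*x1 - 3*x2
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.column_stack([np.ones(len(features)), features])  # prepend x_0 = 1
y = 1.0 + 2.0 * features[:, 0] - 3.0 * features[:, 1]

theta = normal_equation(X, y)
print(np.round(theta, 6))  # approximately [1, 2, -3]
```

No learning rate or iteration count is needed here, which is the contrast with gradient descent described above.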
