Linear Models for Regression: Optimizing Linear Regression Models

November · December 23, 2024

Machine Learning

Least (Mean) Squares

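Use N training observations (x_1, y_1), ..., (x_N, y_N) to estimate the parameters.

Make the difference between y and f(x) small: minimize the sum of squared residuals.

Cost function: J(β) = Σ_i (y_i − f(x_i))^2, summed over i = 1..N. The goal is to make this value as small as possible.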
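As a minimal sketch of the cost function above, assuming a linear hypothesis f(x) = Xβ with a bias column already included in X (the toy data and the function names are made up for illustration):

```python
import numpy as np

def predict(X, beta):
    """Linear hypothesis f(x) = X @ beta (X already contains a bias column)."""
    return X @ beta

def cost(X, y, beta):
    """Sum of squared residuals J(beta) = sum_i (y_i - f(x_i))^2."""
    residuals = y - predict(X, beta)
    return float(residuals @ residuals)

# Toy data (hypothetical): y = 1 + 2x exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column is the bias term
y = np.array([1.0, 3.0, 5.0])

perfect = cost(X, y, np.array([1.0, 2.0]))   # every residual is 0
shifted = cost(X, y, np.array([0.0, 2.0]))   # every residual is 1
```

A better β gives a smaller J(β); here the exact parameters give a cost of 0, and dropping the intercept raises it.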

Gradient descent algorithm

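Initialize β

Update β repeatedly

β ← β − αJ'(β) (subtract α times the derivative of the cost function)

α: Learning rate

α sets how big the step is on each iteration: a larger α means larger updates, but if α is too large the steps can overshoot the minimum.

Keep updating β over the m training samples.

Derivative = Slope

The derivative is the slope of the cost function, so its sign determines the search direction.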
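The update rule can be tried on a one-dimensional example. This is only a sketch: the cost J(β) = (β − 3)^2 and both learning rates are hypothetical values chosen to show the effect of α.

```python
def gradient_descent_1d(grad, beta0, alpha, n_iters):
    """Repeat beta <- beta - alpha * J'(beta) and return the final beta."""
    beta = beta0
    for _ in range(n_iters):
        beta = beta - alpha * grad(beta)
    return beta

# Hypothetical cost J(beta) = (beta - 3)^2, whose derivative is 2 * (beta - 3).
def grad(beta):
    return 2.0 * (beta - 3.0)

small_step = gradient_descent_1d(grad, beta0=0.0, alpha=0.1, n_iters=100)
large_step = gradient_descent_1d(grad, beta0=0.0, alpha=1.1, n_iters=100)
```

With α = 0.1 the iterates move steadily toward the minimum at β = 3. With α = 1.1 each step overshoots by more than the previous error, so the iterates move further away from the minimum on every update.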

Batch Gradient Descent

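Go through all the samples, then update the parameters.

Pro: Fewer updates. The gradients of all data samples are gathered into a single descent step, so each pass over the data produces exactly one parameter update.

Con: Resource intensive. Every update needs the whole dataset, which motivates Stochastic Gradient Descent (it does not look at every sample before updating).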
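A minimal sketch of batch gradient descent for linear regression, assuming NumPy and synthetic data. The gradient is divided by m here (mean instead of sum), which only rescales the learning rate; the function name and default values are illustrative.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_epochs=500):
    """Minimize the mean squared residual with one update per pass over all samples."""
    m, n_features = X.shape
    beta = np.zeros(n_features)            # initialize beta
    for _ in range(n_epochs):
        residuals = X @ beta - y           # uses every sample
        gradient = (2.0 / m) * (X.T @ residuals)
        beta = beta - alpha * gradient     # exactly one update per epoch
    return beta

# Synthetic data (hypothetical): y = 1 + 2x plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

beta_hat = batch_gradient_descent(X, y)
```

Each epoch touches all 200 samples but changes β only once, which is the trade-off described above.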

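Gradient descent on non-linear models

Gradient descent is an iterative algorithm for finding a local minimum of a function. When the cost is not convex, the minimum it reaches depends on the starting point and need not be the global one.

Gradient descent on linear models

Only one global optimum

The cost is a convex (bowl-shaped) function, a shape that is easy to optimize.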

Stochastic gradient descent

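Look at a single sample in the training set on each iteration and update the parameters.

Pro: Faster convergence

Con: Harder to reach the global optimum, because each single-sample gradient is a noisy estimate of the full gradient

For large datasets, stochastic gradient descent is recommended.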
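A sketch of stochastic gradient descent on synthetic data, assuming NumPy. The learning rate, epoch count, and data are illustrative choices, not values from the post.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=50, seed=0):
    """Update beta after every single sample instead of after a full pass."""
    rng = np.random.default_rng(seed)
    m, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                     # visit samples in random order
            residual = X[i] @ beta - y[i]                # error on one sample only
            beta = beta - alpha * 2.0 * residual * X[i]  # one update per sample
    return beta

# Synthetic data (hypothetical): y = 1 + 2x plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

beta_sgd = stochastic_gradient_descent(X, y)
```

With a constant learning rate the result hovers near the optimum rather than settling exactly on it, which is the "harder to reach the global optimum" drawback. Decaying α over time is a common remedy.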

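Is there another way to find good parameters?

For linear models, yes! Because there is only one global optimum, we can look for the parameter β that makes the derivative zero.

Normal equations

A linear model always has a global optimum (no separate local optima).

Since we know there is only one answer, we simply solve for the β at which the derivative is zero: X^T X β = X^T y.

This finds the parameters without any iteration.

To use the normal equations, the model must be linear, and the matrix in the resulting expression β = (X^T X)^(-1) X^T y, namely X^T X, must be invertible.

What is an invertible matrix?

A matrix for which an inverse exists.

Determinant: a square matrix is invertible if and only if its determinant is nonzero.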
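A minimal sketch of the normal equations, assuming NumPy and tiny made-up data. For floating-point matrices, a rank check is used here as a more reliable invertibility test than comparing the determinant with exactly 0.

```python
import numpy as np

def normal_equations(X, y):
    """Solve X^T X beta = X^T y directly, with no iterations."""
    gram = X.T @ X
    # Full rank <=> invertible <=> nonzero determinant (for a square matrix).
    if np.linalg.matrix_rank(gram) < gram.shape[0]:
        raise np.linalg.LinAlgError("X^T X is singular: no unique solution")
    return np.linalg.solve(gram, X.T @ y)

# Toy data (hypothetical): y = 1 + 2x exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
beta = normal_equations(X, y)

# Two identical columns make X^T X singular, so its determinant is 0.
X_bad = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
det_bad = np.linalg.det(X_bad.T @ X_bad)
```

`np.linalg.solve` is used instead of forming the inverse explicitly, which is the usual numerically safer choice.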

Locally weighted linear regression

Are features always important for better learning? NO

We want to use a linear regression model on data that behaves non-linearly.

Fit the linear regression model locally, around a specific target point x.

Give each of the N samples its own weight w.

If a sample's weight is small, its error becomes negligible in the fit,

and if the weight is large, the fit works hard to make that sample's error small.

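Two factors affect this weight

Neighbors

(neighbors close to the query point x): the distance from the x currently being targeted

τ (bandwidth parameter)

Determines how many nearest neighbors we consider

target x : query point

If a neighbor is close to the target point, (neighbor − x)^2 is small → the exponential in the weight formula gets close to 1

If a neighbor is far from the target point, (neighbor − x)^2 is large → the exponential gets close to 0

What happens if τ is small?

The weight falls off faster as we move away from x.

As τ grows (the bandwidth widens), relatively more neighbors are considered; the smaller τ is, the narrower the bell, so fewer neighbors are considered.

Neighbors close to x get larger weights. The overall fit looks non-linear, but each local area is linear.

Height of the bell: how close a neighbor is to the target

Width of the bell: how many neighbors are considered

Because the number of parameters is not fixed, the method is non-parametric.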
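The weighting scheme can be sketched as follows. The weight formula w_i = exp(−(x_i − x)^2 / (2τ^2)) is the commonly used Gaussian form and is an assumption here, since the post does not spell it out; the data and function names are also illustrative.

```python
import numpy as np

def gaussian_weights(x_train, x_query, tau):
    """w_i = exp(-(x_i - x)^2 / (2 tau^2)): near 1 for close neighbors, near 0 for far ones."""
    return np.exp(-((x_train - x_query) ** 2) / (2.0 * tau ** 2))

def lwr_predict(x_train, y_train, x_query, tau):
    """Fit a weighted straight line around x_query and evaluate it there."""
    w = gaussian_weights(x_train, x_query, tau)
    X = np.column_stack([np.ones_like(x_train), x_train])
    XtW = X.T * w                                   # same as X.T @ diag(w)
    beta = np.linalg.solve(XtW @ X, XtW @ y_train)  # a new beta for every query point
    return beta[0] + beta[1] * x_query

# Noise-free, clearly non-linear toy data (hypothetical).
x_train = np.linspace(0.0, 2.0 * np.pi, 100)
y_train = np.sin(x_train)
x_query = np.pi / 2.0                               # sin(x_query) = 1

narrow = lwr_predict(x_train, y_train, x_query, tau=0.2)    # only close neighbors matter
wide = lwr_predict(x_train, y_train, x_query, tau=100.0)    # behaves like one global line
```

With a small τ the prediction follows the local shape of the curve. With a very large τ all weights are close to 1 and the result is close to ordinary linear regression. Because β is recomputed from the training set for every query point, the training data has to be kept around at prediction time.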

Parametric vs Non-parametric algorithm

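Parametric model: the number of parameters is fixed, so it does not change with the size of the input data. Training is relatively fast, the data is assumed to follow a particular distribution, and the results are interpretable.
e.g. Logistic regression, neural networks: if the function can be defined before training, the model is parametric
Non-parametric algorithm: the number of parameters can grow or shrink depending on the training data.
No assumptions about the data, training is relatively slow, and the results are harder to interpret.
e.g. Decision tree, random forest, k-NN: the function is not defined before training

Locally weighted linear regression is a non-parametric algorithm: the training set must be kept for prediction, so the effective number of parameters grows as the training set grows.

Linear regression is a parametric algorithm, because the number of parameters is fixed.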

Likelihood function

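Goal: find the distribution that best explains the samples

That is, find a distribution that covers all the data well.

The higher the probability (likelihood) that the data points came from the distribution I defined, the better.

Assumptions

The data follow a normal distribution.
IID: samples are independent and identically distributed (each sample is independent and drawn from one and the same distribution).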

MLE : Maximum likelihood estimation

Goal: find the β that maximizes the following likelihood.

For n samples: L(β) = Π_i p(x_i | β), where the x_i are IID.

Normal distribution == Gaussian distribution

What matters for a normal (Gaussian) distribution is the mean and the variance of the samples.

Suppose we have a Gaussian-noise linear regression model, but we do not know what distribution the given samples follow. How do we define a likelihood here?

X follows an arbitrary distribution; the hypothesis is f(x) = Xβ and the observed y is f(x) + random noise.

The random noise follows a Gaussian distribution.

The random noise is independent across observations.

⇒ Likelihood function L(β) = p(y|x; β)

To estimate a good β, L(β) must be maximized. This is maximum likelihood estimation (MLE).

So maximizing the likelihood function ends up being the same as minimizing the squared-error cost function.
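The equivalence can be checked numerically. Under the added assumptions of zero-mean noise with a known standard deviation σ, the log-likelihood is log L(β) = −(n/2) log(2πσ^2) − (1/(2σ^2)) Σ_i (y_i − x_i·β)^2, so the only part that depends on β is the sum of squared residuals. The sketch below uses synthetic data and hypothetical function names.

```python
import numpy as np

def neg_log_likelihood(X, y, beta, sigma):
    """-log L(beta) for y_i = x_i . beta + eps_i, eps_i ~ N(0, sigma^2), IID."""
    residuals = y - X @ beta
    n = y.size
    return 0.5 * n * np.log(2.0 * np.pi * sigma ** 2) + (residuals @ residuals) / (2.0 * sigma ** 2)

def sum_squared_residuals(X, y, beta):
    residuals = y - X @ beta
    return residuals @ residuals

# Synthetic data (hypothetical): y = 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1.0, 1.0, size=50)])
sigma = 0.5
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # least-squares solution
beta_other = beta_ols + np.array([0.3, -0.2])      # any other candidate

gap_nll = neg_log_likelihood(X, y, beta_other, sigma) - neg_log_likelihood(X, y, beta_ols, sigma)
gap_sse = sum_squared_residuals(X, y, beta_other) - sum_squared_residuals(X, y, beta_ols)
```

The two gaps differ only by the constant factor 1/(2σ^2), so the β with the smallest squared error is also the β with the largest likelihood.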

Bias and Variance

The two main sources of generalization error

As model complexity increases (more parameters), the likelihood of overfitting goes up.

Bias-variance tradeoff

Bias: the difference between the average prediction and the true value

Variance: the variability of the prediction for a given data point across models trained on different training sets

Overfitting (flexible model): low bias, high variance. Predictions are close to the truth on average, but they vary a lot from one another.

Underfitting (rigid model): high bias, low variance. Predictions are far from the truth, but they are consistent.

Ideal model: low bias, low variance

Bad model: high bias, high variance

Minimize total error

total error = bias^2 + variance + noise

As model complexity grows, variance goes up and bias goes down.
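A small simulation can make the tradeoff concrete. This is a sketch under stated assumptions: the true function sin(2πx), the noise level, the fixed input grid, and the polynomial degrees are all hypothetical choices, and only the noise is redrawn between training sets.

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_points=20, noise=0.3, seed=0):
    """Estimate average bias^2 and variance of a polynomial fit over many noisy training sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)          # fixed inputs, also used for evaluation
    truth = np.sin(2.0 * np.pi * x)
    preds = np.empty((n_trials, n_points))
    for t in range(n_trials):
        y = truth + rng.normal(scale=noise, size=n_points)   # new noise each trial
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x)
    avg_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((avg_pred - truth) ** 2))   # average prediction vs truth
    variance = float(np.mean(preds.var(axis=0)))        # spread of predictions per point
    return bias_sq, variance

rigid_bias, rigid_var = bias_variance(degree=1)         # straight line
flexible_bias, flexible_var = bias_variance(degree=5)   # degree-5 polynomial
```

In this setup the straight line has the larger bias^2 and the degree-5 polynomial has the larger variance, matching the direction of the tradeoff described above.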

Fix for underfitting: add parameters

Fixes for overfitting

Reduce the model's complexity
Regularization ⇒ Ridge and Lasso regression

Ridge regression

L2 regularization : using L2 norm

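Add a penalty on β^2 to shrink the parameters: J(β) = Σ_i (y_i − f(x_i))^2 + λ Σ_j β_j^2

The penalty grows with the square of each parameter ⇒ larger parameters are shrunk more.

λ: the penalty coefficient that controls how much the parameters shrink

The larger λ is, the more the parameters shrink.

As λ approaches 0, ridge becomes similar to plain linear regression (because the penalty term vanishes).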
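For the L2 penalty, setting the gradient to zero gives a closed form, β = (X^T X + λI)^(-1) X^T y. The sketch below assumes NumPy, synthetic data, and no intercept column, so every coefficient is penalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution beta = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (hypothetical coefficients 3, -2, 0.5).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_0 = ridge_fit(X, y, lam=0.0)       # no penalty: ordinary least squares
beta_1 = ridge_fit(X, y, lam=1.0)
beta_100 = ridge_fit(X, y, lam=100.0)   # strong penalty: heavily shrunk coefficients

norms = [float(np.linalg.norm(b)) for b in (beta_0, beta_1, beta_100)]
```

The norm of β decreases as λ increases, and λ = 0 reproduces the unregularized solution.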

Lasso regression

L1 regularization : using L1 norm

Add a penalty on |β| to shrink the parameters: J(β) = Σ_i (y_i − f(x_i))^2 + λ Σ_j |β_j|

Feature selection: some features are simply ignored, because their β becomes exactly 0.
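The L1 penalty has no closed form, but the feature-selection effect is easy to see with an iterative solver. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is one of several possible solvers and not necessarily the one the post has in mind; the data, λ, and function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each entry toward 0; entries with |z| <= threshold become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def lasso_ista(X, y, lam, n_iters=1000):
    """Minimize sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j| by proximal gradient descent."""
    # Step size 1/L, where L = 2 * (largest eigenvalue of X^T X).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = 2.0 * (X.T @ (X @ beta - y))      # gradient of the squared-error part
        beta = soft_threshold(beta - step * gradient, step * lam)
    return beta

# Synthetic data (hypothetical): only features 0 and 2 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

beta_lasso = lasso_ista(X, y, lam=20.0)
```

The coefficients of the irrelevant features end up at exactly 0 while the relevant ones are only slightly shrunk, which is the feature-selection behavior described above.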

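Elliptical contours → Cost function

Lasso's diamond & Ridge's circle → Constraint regions

We look for β at the point where the two meet.

What if the constraint region gets bigger?

The larger the constraint region, the more likely it is to reach β̂. Regularization exists to avoid exactly that, so if the region grows too large, regularization loses its effect.

What if the region hits β̂ (beta hat)?

β̂: the minimizer of the cost function. The closer the solution gets to this minimum, the more likely the model is to overfit. To prevent this, the constraint region is kept from reaching it.

If the constraint region reaches β̂, ridge regularization no longer restricts the coefficients, so we get the same result as plain, unregularized linear regression. In other words, if the constraint does not limit the size of β, the effect of ridge regularization disappears and the model can overfit.

For a higher-dimensional feature space, which one seems to be better?
Lasso!! Because it performs feature selection.
The diamond has corners on the axes, so the contours can touch it at a single point on an axis, where some coefficients are exactly 0. The ridge region is a circle with no corners, so the contact point almost never lands exactly on an axis.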

How can we find suitable regression model parameters that yield the minimum error?
Use gradient descent
If we know that linear model can be used for given dataset, what method can we use to find optimal parameters, that avoids multiple iterations?
Use normal equations
