[ML] 1. Linear Regression

실버버드 · October 19, 2024

Machine Learning series, post 1/8

Lecture Weeks 1 and 2. Linear Regression with One Variable and Multiple Variables

1. Supervised learning

: infers a function from labeled training data

1) Supervised Classification
Example: Spam Detection
Decide which emails are spam and which are important (not spam)
Example: Credit scoring
Differentiating between low-risk and high-risk customers from their income and savings (e.g., if income and savings exceed a threshold a: low-risk, else high-risk)
Example: Image classification, type of cancer, cancer Y/N

vs. regression: predictions are made by classifying inputs into different categories; the output is a discrete, categorical variable
Models: Logistic Regression (binary dependent variable), Decision Tree, KNN, Support Vector Machines, Naive Bayes, Convolutional Neural Network

2) Supervised Regression
: Predicting a numeric value
Example: price of a used car ($y = wx + w_0$), stock market, weather prediction

vs. classification: attempts to predict a real-valued (continuous) number for an input based on past data
Models: Linear Regression, Random Forest, Multilayer Perceptron, AdaBoost, Gradient Boosting, Convolutional Neural Network

2. Unsupervised learning

: draws inferences from input data without labeled responses

1. Linear Regression with one variable

1) Numerical approach

goal: estimate $w$ from training data of $\langle x_i, y_i \rangle$ pairs
optimization: minimize the squared error (least squares: the squared distance between the measurements and the prediction line)

$\frac{\partial}{\partial w}\sum_i (y_i - wx_i)^2 = 0$

$w = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2}$
estimated $w$ under noise with STD = 1, 2, ...

Bias term $w_0$: $y = w_0 + w_1x + \epsilon$

$w_0 = \dfrac{\sum_i (y_i - w_1 x_i)}{n}, \quad w_1 = \dfrac{\sum_i x_i(y_i - w_0)}{\sum_i x_i^2}$
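As a quick numerical check of the closed-form solutions above, here is a minimal NumPy sketch; the toy data, the variable names, and the use of np.linalg.lstsq for the bias-term case are my own choices, not from the lecture.

import numpy as np

# toy data: y = 1.2 + 0.4x plus Gaussian noise (values chosen only for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.2 + 0.4 * x + rng.normal(scale=1.0, size=x.shape)

# slope-only model y = w*x:  w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)

# model with bias term y = w0 + w1*x: solve the least-squares problem
# on a design matrix whose first column of ones carries the intercept
X = np.column_stack([np.ones_like(x), x])
w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"slope-only w = {w:.3f}")
print(f"with bias: w0 = {w0:.3f}, w1 = {w1:.3f}")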

2) A cost function in Machine Learning

Gradient Descent algorithm

start with some $\theta_0, \theta_1$ and keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum (simultaneous update)
repeat until convergence

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$ (for j = 0 and j = 1)
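The "simultaneous update" detail means both parameters are computed from the same old values before either is overwritten. A minimal sketch, assuming the squared-error cost $J = \frac{1}{2m}\sum_i(\theta_0 + \theta_1 x_i - y_i)^2$; the function name and toy data are mine.

import numpy as np

def gradient_step(theta0, theta1, x, y, alpha):
    # one simultaneous gradient-descent update for J = (1/(2m)) * sum((theta0 + theta1*x - y)^2)
    m = len(x)
    err = theta0 + theta1 * x - y        # prediction error for every example
    grad0 = np.sum(err) / m              # dJ/dtheta0
    grad1 = np.sum(err * x) / m          # dJ/dtheta1
    # both new values are computed from the *old* theta0, theta1 and returned together
    return theta0 - alpha * grad0, theta1 - alpha * grad1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.2 + 0.4 * x                        # noiseless toy data with intercept 1.2 and slope 0.4
theta0, theta1 = 0.0, 0.0
for _ in range(5000):
    theta0, theta1 = gradient_step(theta0, theta1, x, y, alpha=0.05)
print(theta0, theta1)                    # approaches (1.2, 0.4)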

1) Batch Gradient Descent (BGD)
: in 1 iteration, compute the gradient over the entire training set and update
+ Computationally efficient, stable convergence
− May converge to a local minimum; the entire training dataset must be available for each epoch


np.dot(X.T, y_hat - y)  # one batch gradient step loops over all samples (e.g., 3 million), which is inefficient; this motivates SGD
2) Stochastic Gradient Descent (SGD)
: in 1 iteration, compute the gradient of one randomly chosen example and update
+Usually faster than BGD owing to sequential data processing
−Frequency of updates induces noise in gradients which affects convergence. Path to the minima is noisier.
−Frequent updates are computationally expensive

+) Mini-Batch Gradient Descent (MGD)
: in 1 iteration, compute the average gradient over the training data in one mini-batch and update
uses n data points (instead of the 1 sample in SGD) at each iteration; a sketch contrasting the three variants follows below
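To contrast the three variants, here is a rough NumPy sketch of how each one forms the gradient for a single update; the mean-squared-error scaling, the function names, and the batch size of 32 are my own choices, not the lecture's.

import numpy as np

def grad(w, X, y):
    # gradient of the mean squared error (1/m) * sum((Xw - y)^2) with respect to w
    m = len(y)
    return 2.0 / m * X.T @ (X @ w - y)

def bgd_step(w, X, y, lr):
    # batch: one gradient over the entire training set
    return w - lr * grad(w, X, y)

def sgd_step(w, X, y, lr, rng):
    # stochastic: gradient of a single randomly chosen example
    i = rng.integers(len(y))
    return w - lr * grad(w, X[i:i+1], y[i:i+1])

def minibatch_step(w, X, y, lr, rng, batch_size=32):
    # mini-batch: average gradient over batch_size randomly chosen examples
    idx = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
    return w - lr * grad(w, X[idx], y[idx])

# tiny usage example on noiseless data with true weights [1.2, 0.4]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), np.linspace(0, 1, 100)])
y = X @ np.array([1.2, 0.4])
w = np.zeros(2)
for _ in range(2000):
    w = minibatch_step(w, X, y, lr=0.1, rng=rng)
print(w)                                 # moves toward [1.2, 0.4]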

Intuition 1: which way to go

$\alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) > 0$: positive slope, so $\theta_j$ decreases (move left)
$\alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) < 0$: negative slope, so $\theta_j$ increases (move right)

Intuition 2: learning rate
learning rate $\alpha$ $\approx$ step size
When the learning rate is too low, it takes a lot of steps to converge
When the learning rate is too high, Gradient Descent can overshoot and fail to reach the minimum (it may even diverge); a tiny numerical illustration follows below
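A small numerical illustration of the step-size tradeoff (my own toy example, not from the lecture): on the one-dimensional cost $J(\theta) = \theta^2$, the update $\theta := \theta - \alpha \cdot 2\theta$ multiplies $\theta$ by $(1 - 2\alpha)$ each step, so a small $\alpha$ converges slowly and an overly large $\alpha$ diverges.

def run(alpha, steps=20, theta=1.0):
    # gradient descent on J(theta) = theta^2, whose gradient is 2*theta
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.01))   # too low: still far from 0 after 20 steps (about 0.67)
print(run(alpha=0.4))    # reasonable: essentially 0
print(run(alpha=1.1))    # too high: |theta| grows each step, so it diverges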

Gradient descent for linear regression

  • recap quiz (MATLAB/Octave)
%% cost function (assumes data vectors x and y are already in the workspace)
[theta,theta1] = meshgrid(-1000:1:1000);
for i = 1:2001
	for j = 1:2001
		s = 0;
		for k = 1:length(x)
			s = s + (theta(i,j) + theta1(i,j) * x(k) - y(k))^2;
		end
		Z(i,j) = s/(2*length(x));
	end
end
figure(1);
mesh(theta,theta1,Z);
[r, c] = min(Z);   % column-wise minima of the cost surface (not used further)
title('Graph of Cost Function vs Parameters');
xlabel('\theta_0');
ylabel('\theta_1');
zlabel('Cost Function')

%% gradient descent for linear regression
alpha = .01;   % learning rate
psi = 0;       % theta0
psi1 = 0;      % theta1
for i = 1:5000
	sum1 = 0;
	sum2 = 0;
	for j = 1:length(x)
		sum1 = sum1 + (psi + psi1*x(j) - y(j));        % partial derivative term for theta0
		sum2 = sum2 + (psi + psi1*x(j) - y(j))*x(j);   % partial derivative term for theta1
	end
	psi = psi - alpha*sum1/length(x);
	psi1 = psi1 - alpha*sum2/length(x);
end

fprintf('The theta0 value is = %f', psi);
fprintf('\n\nThe theta1 value is = %f\n\n', psi1);
for i = 1:length(x)
	xnew(i) = x(i);
	ynew(i) = psi + psi1*x(i);
end
figure(2); scatter(x,y,'x','r'); hold on;
title('Graph of DATA vs PREDICTED FIT');
plot(x,ynew,'g'); hold off;
legend('DATA','PREDICTED FIT');
xlabel('x'); ylabel('F(x)')
%% theta0 = 1.2, theta1 = 0.4

2. Linear Regression with Multiple Variables

: Learning/ optimizing multivariate least squares using gradient descent
$y = w_0 + w_1x_1 + ... + w_kx_k + \epsilon$ (input variables $x_1, \dots, x_k$)

Non-Linear basis function

: add an intercept term by introducing a new variable z = 1 with weight $w_0$ (basis functions reshape the data's range or form as desired)

general notation:
$y = \sum_{j=0}^{n} w_j\phi_j(x)$
$\phi_j(x) = x_j$ for multivariate regression, or a non-linear basis function
$\phi_0(x) = 1$ for the intercept term
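In code, the basis functions simply become the columns of a design matrix, with $\phi_0(x) = 1$ as the intercept column. A minimal sketch for a polynomial basis (the function name and degree are illustrative only):

import numpy as np

def polynomial_design_matrix(x, degree):
    # column j holds phi_j(x) = x^j for j = 0..degree; the j = 0 column of ones is the intercept
    return np.column_stack([x ** j for j in range(degree + 1)])

x = np.array([1.0, 2.0, 3.0])
print(polynomial_design_matrix(x, degree=2))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]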

Gradient Descent for Multivariate Linear Regression

Goal: minimize the loss function using its gradient $g$

  • Learning algorithm (a rough NumPy sketch follows after the loss definition below):

initialize weights $w = 0$
for t = 1, ... until convergence:
Predict for each example $x^i$ using $w$: $\hat{y}^i = \sum_{j=0}^{k} w_j\phi_j(x^i)$
Compute the gradient of the loss: $g_j = \frac{\partial}{\partial w_j}J(w) = -2\sum_i (y^i - \hat{y}^i)\,\phi_j(x^i)$
Update: $w = w - \lambda g$ ($\lambda$: learning rate)
+)
$J_{X,y}(w) = \sum_i (y^i - \hat{y}^i)^2 = \sum_i \Big(y^i - \sum_j w_j\phi_j(x^i)\Big)^2$ (i: sum over the n examples, j: sum over the k+1 basis functions)
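A rough NumPy translation of the learning algorithm above, assuming the design matrix Phi already stores $\phi_j(x^i)$ in column j; a fixed iteration count stands in for the convergence check, and the function name is my own.

import numpy as np

def fit_multivariate(Phi, y, lr=0.01, iters=1000):
    # gradient descent for least squares on a design matrix Phi (rows: examples, columns: basis functions)
    w = np.zeros(Phi.shape[1])            # initialize weights w = 0
    for _ in range(iters):                # stand-in for "until convergence"
        y_hat = Phi @ w                   # predictions for every example
        g = -2 * Phi.T @ (y - y_hat)      # gradient of sum((y - y_hat)^2) with respect to w
        w = w - lr * g                    # update with learning rate lr (lambda in the notes)
    return w

# example: recover a quadratic from noiseless data using phi_j(x) = x^j
x = np.linspace(-1, 1, 50)
Phi = np.column_stack([x ** j for j in range(3)])
y = 1 - 2 * x + 0.5 * x ** 2
print(fit_multivariate(Phi, y))           # approaches [1, -2, 0.5]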

Regression and Overfitting/ Underfitting

  • Bias-Variance Tradeoff

Overfitting: high variance
too many hypotheses, expressive class
error on the training data is very low, but error on new instances is high

Underfitting: high bias
not having good hypotheses, less expressive class

  • How to overcome the overfitting issue?
    Train with more data
    Remove features (regularization): add the sum of the weights to the loss function to push some weights towards zero
    Early stopping (validation)
    Ensembling (bagging, boosting)

Regularization

: add information to prevent overfitting

  • L1 regularization (lasso regression)
    : add the sum of the absolute values of the weights to the loss function so that some weights become exactly zero.
    sum of absolute weights; less sensitive to outliers; penalizes small values relatively more; performs feature selection by pushing weights towards 0 (sparsity)
  • L2 regularization (ridge regression)
    : add the sum of the squared weights to the loss function so that some weights become close to zero.
    squared weights; penalizes large values more (a gradient sketch follows below)
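To make the difference concrete, here is a hedged sketch of how each penalty changes the gradient of the least-squares loss; the penalty strength lam and the function name are my own, and the L1 case uses a subgradient since |w| is not differentiable at 0.

import numpy as np

def regularized_gradient(w, Phi, y, lam, kind="l2"):
    # gradient of sum((y - Phi w)^2) plus an L1 or L2 penalty on the weights
    g = -2 * Phi.T @ (y - Phi @ w)         # data term, same as plain least squares
    if kind == "l2":                       # ridge: lam * sum(w_j^2)  -> adds 2 * lam * w
        return g + 2 * lam * w
    if kind == "l1":                       # lasso: lam * sum(|w_j|)  -> adds lam * sign(w)
        return g + lam * np.sign(w)
    raise ValueError("kind must be 'l1' or 'l2'")

# note: in practice the intercept weight (phi_0) is usually left unpenalized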

$\theta_0, \theta_1$ are the feature weights; the closer we get to the center of the elliptical contours, the better the data term is optimized (minimize the data term).
The point closest to the center of the ellipses that still touches the regularization region (constraint region) minimizes the combined objective; this can even eliminate a feature entirely ($\theta_0 = 0$).

Cross-validation

: generate multiple mini train-test splits and use these splits to tune your model (for i = 1 to k: the i-th part is the validation set, the remaining k-1 parts form the training set, with a separate held-out test set)

  • general procedure for estimating the true error of a predictor:
    use a training and validation set to find the right predictor
    use a test set to report the prediction error of the algorithm
    this is repeated several times and the results are averaged

Leave-one-out cross-validation
(the validation set is a single instance)
1. for each order of polynomial d:

  • repeat m times:
    1) leave out the i-th instance from the training set and put it in a validation set
    2) find the best parameter vector $w_{d,i}$ using the other instances
    3) measure the error on the i-th instance, $J_{d,i}$ (an unbiased estimate of the true prediction error)
  • Compute the average of the estimated errors: $J_d = \frac{1}{m}\sum_{i=1}^{m}J_{d,i}$
  2. Choose the d with the lowest average estimated error: $d^* = \underset{d}{\arg\min}\, J_d$

+) With the order of the polynomial d as the parameter: for each d, form the validation set m times, compute the error $J_{d,i}$ m times, and average them to obtain $J_d$, the average estimated error; then pick the d whose $J_d$ is smallest.

As d increases, the training error decreases,
but the validation error decreases and then starts increasing again;
the d right before the increase is optimal

  • summary:
    the computational cost scales with the number of instances, so it is prohibitive if finding the best predictor is expensive

+) k-fold cross-validation: split the data set into k parts, then proceed as above
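Here is a rough NumPy sketch of the procedure above: choosing the polynomial degree d by k-fold cross-validation (setting k equal to the number of examples gives leave-one-out). The toy data, the 1..8 search range, and the use of np.polyfit/np.polyval are my own choices.

import numpy as np

rng = np.random.default_rng(0)

def cv_error(x, y, degree, k):
    # average validation MSE of a degree-`degree` polynomial fit over k folds
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), val_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)   # fit on the k-1 training folds
        pred = np.polyval(coeffs, x[val_idx])                     # predict on the held-out fold
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return np.mean(errors)

# toy quadratic data; degrees 1..8 is an arbitrary search range
x = np.linspace(-1, 1, 40)
y = 1 + 2 * x - 3 * x ** 2 + rng.normal(scale=0.1, size=x.shape)
best_d = min(range(1, 9), key=lambda d: cv_error(x, y, d, k=5))
print("chosen degree d* =", best_d)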
