5. Linear Model (2)

Eunji · April 14, 2026

Data Mining


1. Regression

1.1 Definition

  • a statistical method for studying the relationship between $\mathbf{x}$ and $y$
    • $\mathbf{x}$: covariate / predictor variable / independent variable / feature
    • $y$: response / dependent variable
  • training data $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$
    • noise is added to the target: $y_n = f(\mathbf{x}_n) + \epsilon$
    • $y \sim P(y \mid \mathbf{x})$ instead of $y = f(\mathbf{x})$
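The noisy-target idea above can be sketched with synthetic data. The target `f`, the Gaussian noise model, and the helper name `sample_noisy_targets` are all assumptions chosen for illustration; the notes do not specify a noise distribution.

```python
import numpy as np

def sample_noisy_targets(f, x, noise_std, rng):
    """Return y_n = f(x_n) + eps_n, with eps_n drawn from an assumed Gaussian."""
    eps = rng.normal(loc=0.0, scale=noise_std, size=x.shape[0])
    return f(x) + eps

def f(x):
    # Hypothetical target function, used only for this illustration.
    return 2.0 * x + 1.0

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)   # inputs x_1, ..., x_N
y = sample_noisy_targets(f, x, noise_std=0.3, rng=rng)

# The same input no longer maps to exactly f(x): y scatters around it.
residual = y - f(x)
```

Because of the added noise, repeating the sampling with the same `x` gives different `y` values, which is what $y \sim P(y \mid \mathbf{x})$ expresses.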

Goal

  • find a model $g(\mathbf{x})$ that approximates $y_n$

1.2 Etymology

  • re(back) + gression(going)
    • going back from data to formula
    • regression towards the mean
      • tall and short men tend to have sons with heights closer to the mean

Simple Linear Regression

1.3 Linear Regression

  • popular linear model for predicting a quantitative response
    • applies to real-valued target functions
    • long history in statistics, social and behavioral sciences
    • regression means $y$ is real-valued
      • inherited from statistics

Simple vs. Multiple

  • $d = 1$: simple linear regression
    • one predictor
  • $d \ge 2$: multiple linear regression
    • multiple predictors

Least Squares

  • ordinary least squares
    • minimizes the sum of squared residuals
    • leads to a closed-form expression
  • generalized least squares, iteratively reweighted least squares
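As a minimal sketch of the closed-form OLS solution mentioned above: with a data matrix `X` (one row per example, first column all ones) the least-squares weights satisfy the normal equations $X^T X \mathbf{w} = X^T \mathbf{y}$. The function name `fit_ols` and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares weights for y ≈ X @ w (X already contains a column of ones)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Toy data generated exactly by y = 1 + 2*x1 - 3*x2 (hypothetical coefficients).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.column_stack([np.ones(50), features])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w

w = fit_ols(X, y)

# The same weights from the normal equations, valid when X^T X is invertible.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
```

`np.linalg.lstsq` is used for the main fit because it also handles rank-deficient `X`; the explicit normal-equation solve is shown only to connect with the closed-form expression.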

Maximum Likelihood

  • Ridge / Lasso regression
  • least absolute deviation regression

Other Techniques

  • Bayesian linear regression, principal component regression

1.4 Linear Regression in 1D and 2D

  • in OLS, the sum of squared errors is minimized
    • figure: solution hypothesis (in blue) of the linear regression algorithm in 1D and 2D
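For the 1D case, the minimizer of the sum of squared errors has a simple form: the slope is the sample covariance of $x$ and $y$ divided by the sample variance of $x$, and the line passes through the point of means. The sketch below assumes a hypothetical helper `fit_line_1d` and a tiny hand-made data set.

```python
import numpy as np

def fit_line_1d(x, y):
    """Return (intercept, slope) minimizing sum((intercept + slope*x - y)^2)."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean   # fitted line passes through the means
    return intercept, slope

# Hand-made points lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
intercept, slope = fit_line_1d(x, y)
```

With noise-free points the fit recovers the generating line; with noisy points it returns the line with the smallest sum of squared vertical distances.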

2. Credit Approval Revisited

2.1 Components of the Learning Problem

The credit approval example is used to abstract the key elements of a learning problem. The main components are as follows.

  • $f$: unknown target function
  • $\mathbf{X}$: input space
  • $\mathbf{Y}$: output space
  • $N$: the number of input-output examples
    • i.e., training examples
  • $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$: data set where $y_n = f(\mathbf{x}_n)$

2.2 Regression Perspective

Approaching the credit approval problem as regression rather than binary classification gives it the following characteristics.

Quantitative Prediction

  • consider credit approval as a regression problem
    • instead of making a binary decision
    • e.g., set a credit limit for each customer

Real-valued Target

  • the bank uses historical records to build a data set $D$
    • $D: (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$
    • $\mathbf{x}_n$: customer information
    • $y_n$: credit limit
      • set by human experts; real number
      • e.g., $123,000 for Alice and $57,000 for Bob

Automation

  • the bank wants to automate this task (as it did with credit approval)
    • use learning to find a hypothesis $g$

2.3 Target Distribution

In a realistic data-generating process it is hard to assume a perfect function $f$, so a probabilistic view is introduced.

Inconsistency of Experts

  • each expert may not be perfectly consistent
  • $\rightarrow$ our target will not be a deterministic function $y = f(\mathbf{x})$

Probabilistic Target

  • each regression label $y_n$ is drawn from some distribution $P(y \mid \mathbf{x})$
  • instead of being computed by a deterministic function $f(\mathbf{x})$

Learning Goal

  • we have an unknown distribution $P(\mathbf{x}, y)$ generating each $(\mathbf{x}_n, y_n)$
  • find a hypothesis $g$ that replicates how human experts determine credit limits
    • minimize the error between $g(\mathbf{x})$ and $y$ with respect to that distribution $P$

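The essence of the learning problem does not change when moving from classification to regression. The basic framework of machine learning stays the same: find a hypothesis that approximates an unknown target from data.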


3. Linear Regression Algorithm

3.1 Error Measurement

  • based on minimizing the squared error between $h(\mathbf{x})$ and $y$
    • the expected value is taken with respect to the joint probability distribution $P(\mathbf{x}, y)$

$$E_{out}(h) = \mathbb{E}\left[(h(\mathbf{x}) - y)^2\right]$$

Goal

  • find a hypothesis $h$ that achieves a small $E_{out}(h)$

Issue

  • $E_{out}$ cannot be computed since $P(\mathbf{x}, y)$ is unknown
  • similar to what we did in classification, we fall back on a quantity that can be computed from the data
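To make the issue concrete, the sketch below approximates $E_{out}$ by sampling, which is only possible because the distribution here is synthetic and fully known. The sampler `sample_xy`, its parameters, and the hypothesis `h` are assumptions for illustration; in a real problem no such sampler is available.

```python
import numpy as np

def sample_xy(n, rng):
    # Assumed synthetic P(x, y): x ~ Uniform(-1, 1), y = 2x + Gaussian noise (std 0.5).
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return x, y

def estimate_e_out(h, sampler, n_samples, rng):
    """Monte Carlo approximation of E[(h(x) - y)^2] under the sampler's distribution."""
    x, y = sampler(n_samples, rng)
    return np.mean((h(x) - y) ** 2)

def h(x):
    # Hypothesis that matches the noise-free part of the synthetic target.
    return 2.0 * x

rng = np.random.default_rng(0)
e_out_hat = estimate_e_out(h, sample_xy, 100_000, rng)
```

For this `h` only the noise remains, so the estimate should be close to the noise variance $0.5^2 = 0.25$. Even the best hypothesis cannot push $E_{out}$ below that level when the target is noisy.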

3.2 In-sample Error $E_{in}$

  • resort to in-sample error instead:

Sum of Squared Residuals

$$E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(\mathbf{x}_n) - y_n)^2$$

OLS is the technique that minimizes the sum of squared residuals. $E_{out}$ cannot be computed because $P(\mathbf{x}, y)$ is unknown, but $E_{in}$ can be computed from the data we have, so it is used as a proxy for $E_{out}$.
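The in-sample error is just the mean of the squared residuals over the training set, so it can be computed directly. The helper name `in_sample_error`, the weights, and the three data points below are hypothetical, chosen so the result is easy to check by hand.

```python
import numpy as np

def in_sample_error(h, X, y):
    """E_in(h) = (1/N) * sum_n (h(x_n) - y_n)^2."""
    return np.mean((h(X) - y) ** 2)

# Three examples; the first column is the constant coordinate x_0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])

w = np.array([1.0, 1.0])          # hypothetical weights: h(x) = 1 + x

def h(X):
    return X @ w

e_in = in_sample_error(h, X, y)   # residuals are 0, 0, -1, so E_in = 1/3
```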

Signal Representation

$$h(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T\mathbf{x}$$

  • in linear regression, $h$ takes the form of
    • a linear combination of the components of $\mathbf{x}$
  • $\mathbf{w}^T\mathbf{x}$: signal

The index starts at 0 because a bias coordinate that always takes the value 1 ($x_0 = 1$) is included.
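A minimal sketch of the bias coordinate and the signal: a column of ones is prepended to the raw features so that $w_0$ acts as the intercept, and the signal for every example is one matrix-vector product. The helper names `add_bias_coordinate` and `signal` and the numbers are assumptions for illustration.

```python
import numpy as np

def add_bias_coordinate(X):
    """Prepend x_0 = 1 to every example (rows of X are examples)."""
    return np.column_stack([np.ones(X.shape[0]), X])

def signal(w, X_aug):
    """Compute w^T x for every row x of the augmented data matrix."""
    return X_aug @ w

X = np.array([[2.0, 3.0],
              [0.0, 1.0]])            # N = 2 examples, d = 2 features
w = np.array([0.5, 1.0, -1.0])        # w_0 (bias), w_1, w_2

X_aug = add_bias_coordinate(X)        # shape (2, 3): d + 1 coordinates
s = signal(w, X_aug)                  # 0.5 + 2 - 3 and 0.5 + 0 - 1
```

Folding the bias into the weight vector this way keeps the hypothesis a single inner product, which is why the sum in $h(\mathbf{x})$ runs from $i = 0$.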
