5. Linear Model (3)

Eunji · April 20, 2026

Data Mining


1. Matrix Representation: Data and Error

1.1 Data Matrix and Target Vector

Data Matrix

$X \in \R^{N \times (d+1)}$
rows: inputs $\mathbf{x}_n$ as row vectors

Add the entry 1 (the bias coordinate) to each individual data vector $\mathbf{x}_n$. Then, to build the data matrix $X$, turn each vector into a row vector $\mathbf{x}_n^T$ and stack the rows in order.

Target Vector

$\mathbf{y} \in \R^N$
components: target values $y_n$

The $n$-th component $y_n$ of the target vector $\mathbf{y}$ is the actual outcome that corresponds one-to-one to the $n$-th row $\mathbf{x}_n^T$ of the data matrix $X$.

E.g., d = 1, N = 4
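As a minimal sketch of this case, the snippet below builds $X$ and $\mathbf{y}$ with NumPy. The numeric values and the name `x_raw` are illustrative assumptions, not data from the lecture.

```python
import numpy as np

# Hypothetical raw inputs (d = 1) and targets for N = 4 points.
x_raw = np.array([0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Add the bias coordinate 1 to every input, then stack the inputs as rows.
X = np.column_stack([np.ones_like(x_raw), x_raw])

print(X.shape)  # (4, 2), i.e. N x (d + 1)
print(y.shape)  # (4,)
```

Row $n$ of `X` is $\mathbf{x}_n^T$ and `y[n]` is its matching target.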

1.2 Matrix Form of $E_{in}(\mathbf{w})$

In-sample Error

A function of $\mathbf{w}$ and data $X$ , $\mathbf{y}$

$\| \cdot \|$ : Euclidean norm of a vector
Scalar $\mathbf{y}^T \mathbf{X} \mathbf{w} = (\mathbf{w}^T \mathbf{X}^T \mathbf{y})^T = \mathbf{w}^T \mathbf{X}^T \mathbf{y}$

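Writing the in-sample error $E_{in}$ in matrix form lets us define the total error with a single matrix operation, instead of computing it over tens of thousands of data points one at a time.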

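From summation form to matrix form
- Prediction vector $X\mathbf{w}$ , residual vector $X\mathbf{w} - \mathbf{y}$
Expanding the matrix expression
- The squared 2-norm of a vector is $\|\mathbf{a}\|^2 = \mathbf{a}^T\mathbf{a}$
Simplifying with the scalar property
- $\mathbf{y}^T X \mathbf{w}$ is a scalar, so it equals its transpose $\mathbf{w}^T X^T \mathbf{y}$ and the two cross terms merge into one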
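Written out, the three steps above can be sketched as the following derivation. It assumes the usual squared-error definition of $E_{in}$ (the mean of the squared residuals) and ends in the expanded form used in the minimization section.

$$
\begin{aligned}
E_{in}(\mathbf{w}) &= \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n - y_n\right)^2 \\
&= \frac{1}{N}\|X\mathbf{w} - \mathbf{y}\|^2 \\
&= \frac{1}{N}(X\mathbf{w} - \mathbf{y})^T(X\mathbf{w} - \mathbf{y}) \\
&= \frac{1}{N}\left(\mathbf{w}^T X^T X \mathbf{w} - \mathbf{w}^T X^T \mathbf{y} - \mathbf{y}^T X \mathbf{w} + \mathbf{y}^T\mathbf{y}\right) \\
&= \frac{1}{N}\left(\mathbf{w}^T X^T X \mathbf{w} - 2\mathbf{w}^T X^T \mathbf{y} + \mathbf{y}^T\mathbf{y}\right)
\end{aligned}
$$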

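In summary, $E_{in}(\mathbf{w})$ is a quadratic function of $\mathbf{w}$ that is differentiable and convex. This guarantees that any point where the gradient is zero is a global minimum.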

E.g., d = 1, N = 4
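A small numerical check for this case, assuming made-up data and an arbitrary weight vector (none of the numbers come from the lecture): the summation form, the norm form, and the expanded quadratic form of $E_{in}$ should agree.

```python
import numpy as np

def e_in_sum(w, X, y):
    """Summation form: (1/N) * sum_n (w^T x_n - y_n)^2."""
    N = len(y)
    return sum((X[n] @ w - y[n]) ** 2 for n in range(N)) / N

def e_in_matrix(w, X, y):
    """Matrix form: (1/N) * ||Xw - y||^2."""
    r = X @ w - y
    return (r @ r) / len(y)

def e_in_expanded(w, X, y):
    """Expanded form: (1/N) * (w^T X^T X w - 2 w^T X^T y + y^T y)."""
    return (w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y) / len(y)

# Hypothetical data with d = 1, N = 4 (first column is the bias coordinate).
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])
w = np.array([0.0, 2.0])  # arbitrary weights, not the optimum

print(e_in_sum(w, X, y), e_in_matrix(w, X, y), e_in_expanded(w, X, y))
```

All three values match up to floating-point rounding, which is the point of the matrix rewrite: one vectorized expression replaces the loop over data points.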

2. Getting the Solution $\mathbf{w}_{lin}$

2.1 Minimization of $E_{in}(\mathbf{w})$

$\mathbf{w}_{lin}$

The solution to linear regression
Derived by minimizing $E_{in}(\mathbf{w})$ over all possible $\mathbf{w} \in \R^{d+1}$

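Compute the difference between the predictions $X\mathbf{w}$ and the actual targets $\mathbf{y}$ for all data points $\rightarrow$ the residuals
Average the squared residuals to obtain the overall error $E_{in}$
Find the weights $\mathbf{w}_{lin}$ at which $E_{in}$ is lowest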

$E_{in}(\mathbf{w})$ is continuous, differentiable, and convex

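Convexity: the line segment connecting any two points on the graph of the function never drops below the graph, so it stays inside the region on or above the function.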
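Stated formally (the standard textbook definition, added here as a supplement to the geometric picture), a function $f$ is convex if for all $\mathbf{u}, \mathbf{v}$ and all $\theta \in [0, 1]$:

$$
f(\theta\mathbf{u} + (1-\theta)\mathbf{v}) \le \theta f(\mathbf{u}) + (1-\theta) f(\mathbf{v})
$$

For the squared error this holds because the matrix of the quadratic term, $X^TX$, is positive semidefinite:

$$
\mathbf{v}^T X^T X \mathbf{v} = \|X\mathbf{v}\|^2 \ge 0 \quad \text{for every } \mathbf{v} \in \R^{d+1}
$$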

$E_{in}(\mathbf{w}) = \frac{1}{N} (\mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2\mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{y}^T \mathbf{y})$

We can use standard matrix calculus to find $\mathbf{w}$ that minimizes $E_{in}(\mathbf{w})$ by requiring
$\nabla E_{in}(\mathbf{w})=0$

General optimization techniques
- E.g., gradient descent
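As a rough sketch of the gradient descent alternative, the loop below repeatedly steps against the gradient $\frac{2}{N}X^T(X\mathbf{w} - \mathbf{y})$ of the quadratic form above. The function name, the learning rate `lr`, the step count, and the data are assumptions picked for this toy example, not values from the lecture.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=5000):
    """Minimize E_in(w) = (1/N) * ||Xw - y||^2 by plain gradient descent."""
    N, d_plus_1 = X.shape
    w = np.zeros(d_plus_1)
    for _ in range(steps):
        grad = (2.0 / N) * (X.T @ (X @ w - y))  # gradient of E_in at w
        w -= lr * grad                          # step downhill
    return w

# Hypothetical data with d = 1, N = 4 (first column is the bias coordinate).
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

w_gd = gradient_descent(X, y)
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares reference
print(w_gd, w_ref)
```

Because $E_{in}$ is convex, the iterates approach the same minimizer a least-squares solver returns, provided the learning rate is small enough for the data at hand.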

Gradient Identities

$\nabla_{\mathbf{w}} (\mathbf{w}^T \mathbf{A} \mathbf{w}) = (\mathbf{A} + \mathbf{A}^T) \mathbf{w}$
$\nabla_{\mathbf{w}} (\mathbf{w}^T \mathbf{b}) = \mathbf{b}$

Scalar $w$

$E_{in}(w) = aw^2 - 2bw + c$
$\displaystyle \frac{\partial}{\partial w} E_{in}(w) = 2aw - 2b$

Vector $\mathbf{w}$

$E_{in}(\mathbf{w}) = \mathbf{w}^T A \mathbf{w} - 2\mathbf{w}^T \mathbf{b} + c$
$\nabla E_{in}(\mathbf{w}) = (A + A^T)\mathbf{w} - 2\mathbf{b}$
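The vector identity can be sanity-checked numerically. The sketch below compares the analytic gradient $(A + A^T)\mathbf{w} - 2\mathbf{b}$ with a central-difference approximation. The helper `numerical_gradient` and the random $A$, $\mathbf{b}$, $c$ are hypothetical, introduced only for this check.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))  # deliberately not symmetric
b = rng.normal(size=3)
c = 0.7
w = rng.normal(size=3)

def f(v):
    """Quadratic form v^T A v - 2 v^T b + c."""
    return v @ A @ v - 2 * v @ b + c

analytic = (A + A.T) @ w - 2 * b
numeric = numerical_gradient(f, w)
print(np.allclose(analytic, numeric, atol=1e-5))
```

Using a non-symmetric `A` shows why the identity needs $A + A^T$ rather than $2A$; the two coincide only when $A$ is symmetric, as $X^TX$ is.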

2.2 The Solution

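From $E_{in}(\mathbf{w})$, the gradient identities (with $X^TX$ symmetric) give $\nabla E_{in}(\mathbf{w}) = \frac{2}{N}(X^TX\mathbf{w} - X^T\mathbf{y})$
- Both $\mathbf{w}$ and $\nabla E_{in}(\mathbf{w})$ are column vectors

Finally, to make $\nabla E_{in}(\mathbf{w})$ equal to $\mathbf{0}$, solve for the $\mathbf{w}$ that satisfies the linear equations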

$X^TX\mathbf{w} = X^T\mathbf{y}$
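A short sketch of solving these linear equations directly, assuming made-up data for which $X^TX$ is invertible. It then confirms that the gradient $\frac{2}{N}(X^TX\mathbf{w} - X^T\mathbf{y})$ vanishes at the solution.

```python
import numpy as np

# Hypothetical data with d = 1, N = 4 (first column is the bias coordinate).
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Solve X^T X w = X^T y (valid here because X^T X is invertible).
w = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of E_in should be (numerically) zero at the solution.
grad = (2.0 / len(y)) * (X.T @ X @ w - X.T @ y)
print(w, np.allclose(grad, 0.0))
```

`np.linalg.solve` is used instead of forming an explicit inverse, which is the usual choice when only the solution of the system is needed.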

3. The Linear Regression Algorithm

3.1 Two Scenarios for the Solution

1. Invertible

If $X^TX$ is invertible, $\mathbf{w} = X^{\dagger}\mathbf{y}$
- $X^{\dagger} = (X^TX)^{-1}X^T$ is the pseudo-inverse of $X$
- The resulting $\mathbf{w}$ is the unique optimal solution $\mathbf{w}_{lin}$

2. Not Invertible

A pseudo-inverse can still be defined, but the solution is no longer unique
There are multiple $\mathbf{w}$ that minimize $E_{in}$

In Practice

$X^TX$ is invertible in most cases
- Since $N$ is often much bigger than $d+1$
- There will likely be $d+1$ linearly independent vectors $\mathbf{x}_n$
When $X^TX$ is (almost) singular (not invertible)
- Use a well-implemented $X^{\dagger}$ routine instead of $(X^TX)^{-1}X^T$
- This is for numerical stability
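To illustrate the singular case, the sketch below uses hypothetical data in which one input column is duplicated, so $X^TX$ cannot be inverted. NumPy's `np.linalg.pinv` (SVD-based) and `np.linalg.lstsq` still return a minimizer; treating them as the "well-implemented routine" is an assumption of this example, not something the lecture specifies.

```python
import numpy as np

# Hypothetical rank-deficient data: the third column duplicates the second,
# so X^T X is singular and (X^T X)^{-1} does not exist.
X = np.array([[1.0, 0.5, 0.5],
              [1.0, 1.0, 1.0],
              [1.0, 1.5, 1.5],
              [1.0, 2.0, 2.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])

w_pinv = np.linalg.pinv(X) @ y                   # SVD-based pseudo-inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solver

# Both routines return the minimum-norm minimizer of E_in.
print(np.allclose(w_pinv, w_lstsq))
# The result still satisfies X^T X w = X^T y, even without an inverse.
print(np.allclose(X.T @ X @ w_pinv, X.T @ y))
```

Among the many minimizers in this case, both routines pick the one with the smallest norm, which gives a well-defined answer without ever inverting $X^TX$.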

3.2 Algorithm

1. Construct matrix $X$ and vector $\mathbf{y}$ as follows

Use the data set $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
Each $\mathbf{x}_n$ includes the bias coordinate $x_0 = 1$

2. Compute pseudo-inverse $X^{\dagger}$ of $X$

3. Return $\mathbf{w}_{lin} = X^{\dagger}\mathbf{y}$
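The three steps can be sketched in NumPy as follows. The function name `linear_regression`, its argument names, and the noise-free test data are assumptions made for illustration.

```python
import numpy as np

def linear_regression(inputs, targets):
    """Return w_lin = X^+ y for inputs of shape (N, d) and targets of shape (N,)."""
    inputs = np.asarray(inputs, dtype=float)
    if inputs.ndim == 1:  # treat a flat array as the d = 1 case
        inputs = inputs[:, None]
    y = np.asarray(targets, dtype=float)
    # Step 1: build X by adding the bias coordinate x_0 = 1 as the first column.
    X = np.hstack([np.ones((inputs.shape[0], 1)), inputs])
    # Step 2: compute the pseudo-inverse of X.
    X_pinv = np.linalg.pinv(X)
    # Step 3: return w_lin = X^+ y.
    return X_pinv @ y

# Hypothetical noise-free data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
w_lin = linear_regression(x, 1.0 + 2.0 * x)
print(w_lin)  # approximately [1. 2.]
```

Since the targets lie exactly on a line, the recovered weights match the intercept and slope used to generate them, up to rounding.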
