Mathmatics for ML

김카누·2024년 7월 11일

Aimers-course

Aimers

목록 보기

3/8

학습 목표: ai기술을 이해하기 위해 바탕이 되는 수학적 지식을 습득한다.

1. Matrix Decomposition(행렬 분해)

1.1 Determinant and trace

행렬이 다음과 같이 존재할 때,

A = \begin{pmatrix}a&b\\c&d\\\end{pmatrix}

$determinant$

det(A) = $\begin{vmatrix}a&b\\c&d\\ \end{vmatrix} = (ad-bc)$ 로 계산된다.

특징

$\det(AB) = \det(A) \times\det(B)$
$\det(A) =\ \det(A^T)$
$\det(A^{-1}) = \ 1/\det(A)$

$Trace$

정의: 정사각 행렬에서 대각선 성분들의 합을 의미한다.

tr(A) = \sum^n_{i=1}a_{ii}

특징

$tr(A+B) = tr(A) + tr(B)$
$tr(\alpha A)$ = \alpha tr(A)$
$tr(I_n) = n$

1.2 Eigenvalues and Eigenvector

특정 벡터 x와 실수값 $\lambda$ 에 대하여 다음과 같이 나타낼 수 있을 때,

Ax = \lambda x \tag{1}

$x$ 는 A의 eigenvector이며, $\lambda$ 는 A의 eigenvalue라 한다.

eigen 구하는 방법!

$det(A-\lambda I_n)=0$ 를 계산해 eigenvalue $\lambda$ 를 구한 뒤, 식(1)에 대입하여 eigenvector를 얻을 수 있다.

determinant와 trace와의 관계
$det(A) = \prod_{i=1}^n \lambda_i\\$ $tr(A) = \sum_{i=1}^n \lambda_i$

1.3 Decomposition

1. Cholesky Decomposition

행렬 A가 symmetric하고 positive definite한 경우, $A=LL^T$ 로 나타낼 수 있다.
( $L$ 은 lower-triangular matrix with positive diagonals)

대각선 위 성분값이 0인 행렬, $L$ 을 A행렬의 Cholesky factor라 부른다.

$A=LL^T$ 로 나타내면 determinant 계산이 쉬워진다.

det(A) = det(L)\times det(L^T)=det(L)^2\\

\textstyle det(A) = \prod_i L^2_{ii}\quad(\;det(L) = \prod_i L_{ii}\;)

2. Eigen Value Decomposition (EVD)

Diagonal matrix
대각선 성분을 제외한 나머지 성분값이 0인 행렬
$D=\begin{pmatrix}d_1 & \cdots & 0\\ \vdots & & \vdots\\ 0 & \cdots & d_n\end{pmatrix}$
$\,$

특징
$D^k = \begin{pmatrix}d^k_1 & \cdots & 0\\ \vdots & & \vdots\\ 0 & \cdots & d^k_n\end{pmatrix},\quad D^{-1} = \begin{pmatrix}1/d_1 & \cdots & 0\\ \vdots & & \vdots\\ 0 & \cdots & 1/d_n\end{pmatrix}, \quad det(D) = d_1d_2\cdots d_n$

위와 같은 특징으로 계산상의 편의성이 있고

행렬 A에 대해 $D=P^{-1}AP$ 로 변환할 수 있는 경우, A는 diagonalizable하다. (이때, P는 invertible)
또한 orthogonal P인 경우( $PP^\top=I$ ), $D=P^{-1}AP=P^\top AP$ 이며 A는 orthogonally diagonalizable하다.

특징

$A^k=PD^kP^{-1}$
$det(A) = det(P) det(D) det(P^{-1}) = det(D) = \prod_i d_{ii}$

A가 symmetric $\Leftrightarrow$ A는 orthogonally diagonalizable

Spectral Theorem ( $A =PDP^\top$ )

A가 symmetric한 경우,
(a) eigenvalues 모두 실수값
(b) eigenvectors 는 서로 perpendicular
(c) orthogonal eigenbasis

따라서 P는 eigenvectors로 두면, D는 eigenvalues가 되고 $AP=PD$ 로 나타낼 수 있다.

Singular Value Decomposition (SVD)

A가 square, non-symmetric한 경우
A가 non-square non-symmetric한 경우

$A\in\R^{m\times n},\; S=A^\top A\in\R^{n\times n}$ S:symmetric & positive semidefinite

$A=U\sum V^\top\quad (\;U\in\R^{m\times m},\, V\in\R^{n\times n},\;\sum :m\times n\;\mathrm{matrix,} \sum_{ii}=\sigma_i>0)$

2. Convex Optimization

머신러닝이나 딥러닝에서 학습시 최적화된 파라미터를 찾는 과정은 optimization problem으로 생각할 수 있으므로 최적화 이론에 대해 알아둘 필요가 있다.

특히,최적화하려는 문제가 convex optimization이라면 최소화하려는 목적함수를 항상 global하게 찾을 수 있다.

2.1 Gradient Algorithms

goal $minimize \;f(x) \quad f(x) :\R^n \rarr \R, \quad f\in C^1$
Gradient-type algorithms $\bold x_{k+1}=\bold x_k + \gamma_k\bold d_k, \quad k=0,1,2,\dots\\$

$\nabla f(x)\cdot\bold d<0$ 인 d를 선택한 경우, $x_\alpha=x+\alpha d \;(\alpha>0) \rarr f(x_\alpha)<f(x)$ 만족
$\gamma$ is step size

d를 gradient의 반대 방향으로 설정하면 간단하게 만족

Steepest gradient descent $\bold d_k=-\nabla f(\bold x_k)^\top.$

학습량에 따라

Batch gradient descent ( the entire n )

Mini-batch gradient descent ( k < n data )

Stochastic gradient descent ( k < n data with unbiased gradient estimation)

업데이트 방식에 따라

Momentum

NAG

Adagrad

RMSprop

Adam

2.2 Stochastic Gradient Descent (SGD)

머신러닝에서 학습시 실제값과 예측값 사이 차이를 Loss라고 하며 이를 줄이는 것이 목적이다.
다음은 Optimization 문제에서 constraint가 없는 경우 파라미터의 업데이트 과정이다.

Loss function $L(\theta)=\sum^n_{i=1} L_n(\theta)$
Gradient update
$\theta_{k+1}=\theta_k-\gamma_k\nabla L(\theta_k)^\top = \theta_k-\gamma_k\sum^N_{n=1}\nabla L_n(\theta_k)^\top$
Batch gradient descent : $\sum^N_{n=1}\nabla L_n(\theta_k)^\top$ 모든 데이터를 한 번에 업데이트
Mini-batch gradient descent : $\sum^N_{n\in K}\nabla L_n(\theta_k)^\top$ 특정 subset으로 나누어 업데이트 ( K < N )
Stochastic gradient descent : subset K에서 무작위로 선택하여 업데이트
$\sum^N_{n=1}\nabla L_n(\theta_k)^\top = \it E\left[\;\sum_{n\in K}\nabla L_n(\theta_k)^\top \right ]$
step size setting
- too small $\rarr$ slow update
- too big $\rarr$ overshoot, zig-zag, fail to converge
  with "Momentum"
  $x_{k+1}=x_k-\gamma_i\nabla f(x_k)^\top +\alpha\Delta x_k \quad \alpha \in [0,1]$
  $\Delta x_k = x_k-x_{k-1}$
  - memory term : $\alpha\Delta x_k,\quad \alpha$ 는 얼마나 과거를 유지할 것인지 조절하는 파라미터

만약 constraint가 있는 경우에는 어떻게 gradient를 구할 수 있을까?

constrained optimization problem 형태를 살펴보면

\begin{aligned} minimize \;&f(x)\\ subject\;to\; &g_i(x) \leq0,\quad i=1,2,\dots, m\\ &h_j(x) =0,\quad i=1,2,\dots, p \end{aligned}

이다.
inequality와 equality constraint를 만족하는 x에 대하여 f(x)를 최소화하는 문제이다.

이를 푸는 방법으로 Lagrange multipliers가 있다.
아이디어는 목적함수의 f(x)에 constraint를 weighted sum한 새로운 함수 $D(\lambda, \nu)$ 를 설정하여 계산하는 것이다.

Lagrangian: $\mathcal{L}(x,\lambda,\nu)=f(x)+\sum\lambda_ig_i(x) + \sum\nu_jh_j(x)$
Lagrange multipliers: $\lambda=(\lambda_i :i=1,\dots, m) \succeq0, \nu=(\nu_1,\dots,\nu_p)$
Lagrange dual function: $\displaystyle D(\lambda, \nu) = \inf_x \mathcal{L}(x,\lambda,\nu)$

라그랑지안을 다시 살펴보면 L과 f(x)에서 뒤에 추가된 항은 0보다 작거나 같다.

dual function 장점은 $D(\lambda, \nu)=d^*\leq p^*$ ( $p^*$ 는 f(x)의 최적의 해 )를 만족한다는 것이다. ( Best lower bound )

2.3 KKT Condition

g_i(x^* )\leq0, h_i(x^* )=0, \lambda^*_i\succeq0\\ \lambda^*_ig_i(x^* ) =0\\ \nabla f(x)+\sum\lambda^*_i\nabla g_i(x^*) + \sum\nu^*_j\nabla h_j(x^*)=0

마지막 줄의 조건을 만족한다면 KKT condition을 만족한다하며 이 경우, \

primal-dual optimal에서 동일한 최적 값을 얻는 필요 충분 조건이 된다. ( $p^*=d^*$ )

3. Principal Component Analysis (PCA)

3.1 Motibation

고차원으로 구성된 데이터는 분석하기 어려운 경우가 많다.
또한 어떤 성분은 분석에 필요없을 정도로 불필요한 경우도 있다. (Redundant-비슷한 값)
따라서 선형 결합으로 분산이 큰 방향의 축들만 사용하여 차원을 줄이고 불필요한 성분 축을 제거하는 방법이 필요하다.

3.2 PCA Algorithm

Centering: feature의 평균을 빼서 데이터의 평균이 원점에 오도록 조정
Standardization: 각 특성의 표준편차로 나누어 서로 다른 특성의 스케일을 조정
(1 & 2는 각 피쳐별로 normalization한것과 같다.)
Eigenvalue/vector: 데이터 공분산(Covariance) 행렬의 큰 M개의 eigenvalue와 eigenvector를 계산 (M은 전체 N 차원에서 축소시킬 차원의 수)
Projection: eigenvectors로 사영
1과 2과정을 역으로 풀어준다. (원 데이터 크기(scale)로 맞춘다.)

Data Covariance Matrix

N: number of samples,
D: number of measurements (features)
iid dataset $X = \{x_1,x_2,\dots,x_N\}$ , mean=0

X=(x_1,\dots,x_N) = \begin{pmatrix}x_{1,1}&x_{1,2}&\dots&x_{1,N}\\ x_{2,1}&x_{2,2}&\dots&x_{2,N}\\ & &\vdots&\\ x_{D,1}&x_{D,2}&\dots&x_{D,N} \end{pmatrix} \in \R^{D \times N}

covariance matrix $S= \frac{1}{N}XX^\top=\frac{1}{N}\sum^N_{n=1}x_nx_n^\top\in\R^{D\times D}$

Low Dimensional Representation

$z_n = B^\top x_n\in\R^M$ , projection matrix $B=(b_1,b_2,\dots,b_M)\in\R^{D\times M}$

$\tilde x_n = B z_n$ (B are orthonormal $BB^\top=I, B^\top=B^{-1}$ )

데이터 피쳐의 D차원은 projection matrix를 거쳐 M차원으로 축소된다. (M<D)
z는 차원 축소된 project 데이터
$\tilde x$ 는 z에 대해서 reconstruction 데이터

3.3 PCA Process & proof

우리의 목표는 데이터가 고차원(D) 피쳐를 가질때, 저차원(M) 피쳐 subspace로 매핑하는 것이다. 기본이 되는 아이디어는 3.1과 같이 분산이 큰 축을 사용하는 것이다.

3.2에서 공분산 행렬의 eigenvalue가 큰 M개의 eigenvector들을 사용하여 차원 축소를 진행하는데 그 이유에 대해서 살펴보자.

의 $z_1$ variance를 구하면

V_1 = var[z_1]=\frac{1}{N}\sum^N_{n=1}z^2_{1n}\;,\quad z_{1n}=b^\top_1x_n\\ V_1 = \frac{1}{N}\sum^N_{n=1}(b^\top_1x_n)^2=\frac{1}{N}\sum^N_{n=1}b^\top_1x_nx_n^\top b_1=b^\top_1\left(\frac{1}{N}\sum^N_{n=1}x_nx_n^\top\right) b_1=b^\top_1Sb_1

공분산 행렬 S가 튀어나오고 이제 variance를 maximization하는 B를 구하면 된다.

optimazation problem을 풀어보자.

Maximize\quad b_1^\top Sb_1\\ subject\;to\quad ||b_1||^2=1

Lagrange multiplier를 사용하면

Sb_1=\lambda_1b_1,\quad b^\top_1b_1=1 \rarr \lambda_1:eigenvalue,\; b_1: eigenvector\; of\; S\\ V_1=b_1^\top Sb_1=\lambda_1b^\top_1b_1=\lambda_1

variance가 큰 축을 순서대로 M개 선택하는 것은 공분산 행렬의 eigenvalue가 큰 eigenvector 선택하는 것임을 알 수 있다.

정리

빅데이터를 다룰 때, 행렬을 통한 계산으로 학습하거나 문제를 풀게 되는데 기본이 되는 행렬의 특징을 살펴보았다.
머신러닝 및 딥러닝에서 잘 학습된 파라미터를 찾을 때, Loss function의 gradient를 사용한다. 이는 목적함수를 최소화하는 Optimization problem으로 생각할 수 있으며 Convex한 방법으로 기술할 수 있다면 최적의 해를 찾을 수 있다. ( Optimization problem은 non-convex한 경우에도 Lagrange dual function으로 풀 수 있다.)
PCA는 데이터의 특성 차원이 큰 경우 선형 변환을 통해 핵심이 되는 축만 선택하여 특성 차원을 줄이는 방법이다. 이때, 공분산 행렬의 eigenvalue가 큰 eigenvector가 핵심 축이 된다.

김카누

데이터를 해석하기 위한 방법을 고민하는 분석 블로그입니다.

Mathmatics for ML

Aimers

목차

1. Matrix Decomposition(행렬 분해)

1.1 Determinant and trace

$determinant$

$Trace$

1.2 Eigenvalues and Eigenvector

1.3 Decomposition

1. Cholesky Decomposition

2. Eigen Value Decomposition (EVD)

Spectral Theorem ( $A =PDP^\top$ )

Singular Value Decomposition (SVD)

2. Convex Optimization

2.1 Gradient Algorithms

2.2 Stochastic Gradient Descent (SGD)

2.3 KKT Condition

3. Principal Component Analysis (PCA)

3.1 Motibation

3.2 PCA Algorithm

Data Covariance Matrix

Low Dimensional Representation

3.3 PCA Process & proof

정리

Ai 윤리

Supervised Learning - Linear Regression

0개의 댓글

Mathmatics for ML

Aimers

목차

1. Matrix Decomposition(행렬 분해)

1.1 Determinant and trace

determinantdeterminantdeterminant

TraceTraceTrace

1.2 Eigenvalues and Eigenvector

1.3 Decomposition

1. Cholesky Decomposition

2. Eigen Value Decomposition (EVD)

Spectral Theorem (A=PDP⊤A =PDP^\topA=PDP⊤)

Singular Value Decomposition (SVD)

2. Convex Optimization

2.1 Gradient Algorithms

2.2 Stochastic Gradient Descent (SGD)

2.3 KKT Condition

3. Principal Component Analysis (PCA)

3.1 Motibation

3.2 PCA Algorithm

Data Covariance Matrix

Low Dimensional Representation

3.3 PCA Process & proof

정리

Ai 윤리

Supervised Learning - Linear Regression

0개의 댓글

$determinant$

$Trace$

Spectral Theorem ( $A =PDP^\top$ )