Principal Components Analysis (PCA)

김짝뚜 · August 20, 2023

Multivariate Analysis


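PCA aims to transform a set of correlated variables $\bold x^T =(x_1, \ldots,x_q)$ into a set of uncorrelated variables $\bold y^T =(y_1, \ldots,y_q)$, each of which is a linear combination of the original variables.
The new variables are ordered by decreasing importance, that is, by how much of the variation in the original data each one accounts for.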

It is generally used to produce a lower-dimensional summary that accounts for a substantial proportion of the variation in the original variables $x_1,\ldots,x_q$.

Principal components are mainly used to construct an informative graphical representation of the data.

Principal components are also useful in regression analysis in the following cases:

- when there are too many explanatory variables relative to the number of observations
- when the explanatory variables are very highly correlated
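
To make the second case concrete, here is a minimal sketch of regressing a response on the first principal component of nearly collinear explanatory variables (often called principal components regression). The simulated data, the seed, and the choice `m = 1` are assumptions made purely for illustration and do not come from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, illustration only

# Hypothetical data: three explanatory variables that are nearly collinear.
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n)])
y = X.sum(axis=1) + 0.1 * rng.normal(size=n)

# Principal components of the centred explanatory variables.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
A = eigvecs[:, order]                  # column j is the coefficient vector a_j

# Keep only the first m components as regressors (m = 1 is an assumption).
m = 1
scores = Xc @ A[:, :m]

# Ordinary least squares of y on an intercept and the retained scores.
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print("coefficients:", coef)
print("R^2:", r_squared)
```

Because the retained scores are uncorrelated with one another, the regression avoids the unstable coefficient estimates that highly correlated explanatory variables tend to produce.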

Finding the sample principal components

The first principal component of the observations is the linear combination $y_1 = a_{11}x_1 +a_{12}x_2 + \cdots +a_{1q}x_q$ whose sample variance is the largest among all such linear combinations.
Because the variance of $y_1$ can be increased without limit simply by scaling up the coefficients $\bold a_1^T =(a_{11}, a_{12}, \ldots, a_{1q})$, a restriction must be placed on them. The constraint used is $\bold a_1^T \bold a_1 = 1$.
The sample variance of $y_1$ is $\bold a_1^T \bold S\bold a_1$, where $\bold S$ is the $q \times q$ sample covariance matrix of the $x$ variables. The method of Lagrange multipliers is used to maximize this variance subject to the constraint.
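
The Lagrange multiplier step can be sketched as follows. Here $\lambda$ denotes the multiplier and $L$ the Lagrangian; this is the standard argument written out for convenience, not a derivation taken from the original post.

$$
L(\bold a_1, \lambda) = \bold a_1^T \bold S \bold a_1 - \lambda\,(\bold a_1^T \bold a_1 - 1)
$$

$$
\frac{\partial L}{\partial \bold a_1} = 2\,\bold S \bold a_1 - 2\lambda\, \bold a_1 = \bold 0
\;\Longrightarrow\;
\bold S \bold a_1 = \lambda\, \bold a_1
$$

So $\bold a_1$ must be an eigenvector of $\bold S$ with eigenvalue $\lambda$. Multiplying on the left by $\bold a_1^T$ and using $\bold a_1^T \bold a_1 = 1$ gives $\bold a_1^T \bold S \bold a_1 = \lambda$, so the variance is maximized by choosing the eigenvector that belongs to the largest eigenvalue.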

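The second principal component $y_2$ is defined as the linear combination $y_2 = a_{21}x_1 +a_{22}x_2 + \cdots +a_{2q}x_q$, that is, $y_2 = \bold a_2^T \bold x$, that has the greatest variance subject to the two conditions $\bold a_2^T \bold a_2 = 1$ and $\bold a_2^T \bold a_1 = 0$. The second condition ensures that $y_1$ and $y_2$ are uncorrelated.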

The $j$th principal component is the linear combination $y_j = \bold a_j^T \bold x$ that has the greatest sample variance subject to $\bold a_j^T \bold a_j = 1$ and $\bold a_j^T \bold a_i = 0 \; (i<j)$.

The coefficient vector $\bold a_j$ of the $j$th principal component is the eigenvector of $\bold S$ associated with its $j$th largest eigenvalue.
If the $q$ eigenvalues of $\bold S$, ordered from largest to smallest, are $\lambda_1,\lambda_2,\ldots,\lambda_q$, then the variance of the $i$th principal component is $\lambda_i$.
The total variance of the $q$ principal components equals the total variance of the original variables, that is, $\sum_{i=1}^q \lambda_i = s_1^2 + s_2^2 + \cdots +s_q^2$, where $s_i^2$ is the sample variance of $x_i$. More concisely, $\sum_{i=1}^q \lambda_i = \mathrm{trace}(\bold S)$.
Consequently, the $j$th principal component accounts for a proportion $P_j$ of the total variance of the original data:
$P_j = \frac{\lambda_j }{\mathrm{trace}(\bold S)}$
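
The relationships above can be checked numerically. The following is a minimal NumPy sketch; the simulated data, the seed, and the `mixing` matrix are hypothetical and serve only to produce correlated variables.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only

# Hypothetical data: n = 200 observations of q = 3 correlated variables.
n, q = 200, 3
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.5, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.normal(size=(n, q)) @ mixing.T

# Sample covariance matrix S (q x q); rowvar=False treats columns as variables.
S = np.cov(X, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= ... >= lambda_q.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
A = eigvecs[:, order]           # column j holds the coefficient vector a_j

# Principal component scores y_j = a_j^T x, computed on centred data.
Y = (X - X.mean(axis=0)) @ A

# Sample variance of each score column (ddof=1 matches np.cov).
score_var = Y.var(axis=0, ddof=1)

# Proportion of total variance accounted for by each component, P_j.
P = eigvals / np.trace(S)

print("eigenvalues     :", eigvals)
print("score variances :", score_var)
print("sum of eigenvalues vs trace(S):", eigvals.sum(), np.trace(S))
print("P_j             :", P)
```

The score variances should agree with the eigenvalues, the eigenvalues should sum to $\mathrm{trace}(\bold S)$, and the sample covariance matrix of `Y` should be diagonal up to floating-point error, which is the uncorrelatedness that PCA is after.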

For $m<q$, the first $m$ principal components account for a proportion $P^{(m)}$ of the total variation in the original data:
$P^{(m)} = \frac{\sum_{j=1}^m\lambda_j}{\mathrm{trace}(\bold S)}$
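
As a small worked example of $P^{(m)}$, the sketch below computes the cumulative proportions for a made-up set of eigenvalues and picks the smallest $m$ that reaches a chosen threshold. Both the eigenvalues and the 80% threshold are assumptions for illustration; the text above does not prescribe a rule for choosing $m$.

```python
import numpy as np

# Hypothetical eigenvalues of a 4 x 4 sample covariance matrix,
# already sorted from largest to smallest.
eigvals = np.array([4.0, 2.5, 1.0, 0.5])

total = eigvals.sum()               # equals trace(S)
P = eigvals / total                 # P_j for each component
P_cum = np.cumsum(eigvals) / total  # P^(m) for m = 1, ..., q

# Smallest m whose cumulative proportion reaches the threshold.
threshold = 0.8                     # arbitrary cut-off, an assumption
m = int(np.searchsorted(P_cum, threshold) + 1)

print("P_j   :", P)      # [0.5    0.3125 0.125  0.0625]
print("P^(m) :", P_cum)  # [0.5    0.8125 0.9375 1.    ]
print("m     :", m)      # 2
```

With these numbers the first two components already account for 81.25% of the total variance, so `m` comes out as 2.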
