PCA(Principal Component Analysis) (feat. sklearn)

이향기 · March 17, 2022

Prerequisite

Singular value decomposition

The singular value decomposition (SVD) of $X\in\mathbb{R}^{n\times p}$ is
$X = U D V^{\top}$
- $U\in\mathbb{R}^{n\times n}$ : orthogonal matrix whose columns are the left singular vectors of $X$
- $D\in\mathbb{R}^{n\times p}$ : rectangular diagonal matrix whose diagonal elements are the singular values $\sqrt{\lambda}$, where $\lambda$ ranges over the eigenvalues of $X^{\top}X$
- $V\in\mathbb{R}^{p\times p}$ : orthogonal matrix whose columns are the right singular vectors of $X$
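
As a quick numerical check of the decomposition above, here is a minimal sketch using `np.linalg.svd` on a small random matrix. The sizes `n`, `p` and the random data are illustrative assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4                      # illustrative sizes (assumption)
X = rng.standard_normal((n, p))

# full_matrices=True returns U (n x n) and V^T (p x p);
# s holds the min(n, p) singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Embed the singular values in an n x p rectangular diagonal matrix D.
k = min(n, p)
D = np.zeros((n, p))
D[:k, :k] = np.diag(s)

# Eigenvalues of X^T X, sorted in descending order to match s.
lam = np.linalg.eigvalsh(X.T @ X)[::-1]

print(np.allclose(U @ D @ Vt, X))   # X is recovered from U D V^T
print(np.allclose(s**2, lam))       # singular values are sqrt(eigenvalues of X^T X)
```

Note that `np.linalg.svd` returns $V^{\top}$ rather than $V$, so the right singular vectors are the rows of `Vt`.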

sklearn

breast cancer data

import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

Principal Component Analysis

Problem setting

Consider a dataset $\chi=\{x_1,\dots,x_N\}, x_n\in\R^{D}$ with
- mean $\mathbb{0}$
- data covariance matrix $S=\dfrac{1}{N}\sum\limits_{n=1}^{N}x_n x_n^{\top}$

Assume that there exists a low-dimensional compressed representation (code)
$z_n = B^{\top}x_n \in \R^{M}$
of $x_n$, where
- the projection matrix is $B:=[b_1,b_2,\dots,b_M]\in\R^{D\times M}$, whose columns $b_m$ are assumed to be orthonormal
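
The setting above can be sketched in NumPy as follows. The data are synthetic, the sizes `D`, `N`, `M` are illustrative, and choosing the columns of `B` as the top-`M` eigenvectors of `S` is an assumption made here for the example (it matches the later description of $b_m$ as eigenvectors).

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 5, 200, 2              # illustrative sizes (assumption)

# Columns are observations: X has shape (D, N).
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
X = rng.standard_normal((D, N)) * scales
X = X - X.mean(axis=1, keepdims=True)    # enforce zero mean

# S = (1/N) * sum_n x_n x_n^T, written as a single matrix product.
S = (X @ X.T) / N

# eigh is for symmetric matrices and returns eigenvalues in ascending order,
# so reorder to descending and keep the top-M eigenvectors as B.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
B = eigvecs[:, order[:M]]                # shape (D, M), orthonormal columns

# Codes z_n = B^T x_n for every observation at once.
Z = B.T @ X                              # shape (M, N)
print(S.shape, B.shape, Z.shape)
```

With this choice of `B`, the covariance of the codes, `Z @ Z.T / N`, is diagonal with the top-`M` eigenvalues of `S` on its diagonal.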

(NOTE) Why is X laid out as (feature, obs), not (obs, feature)?

PCA is a linear mapping that, through orthogonal projection, makes the dimension of the image (the column space of the mapping) smaller than the dimension of the domain. That is,

$y = Ax$

acts on one observation $x$ stored as a column vector. So it is more convenient to set $X\in\R^{P \times N}$, with one observation per column, than $X\in\R^{N \times P}$ as in a regression problem.
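
To make the layout argument concrete, here is a small sketch comparing the two conventions. The matrix `A` is a hypothetical linear map and the sizes are arbitrary; neither comes from the original post.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N, M = 4, 10, 2                       # illustrative sizes (assumption)
A = rng.standard_normal((M, P))          # hypothetical linear map R^P -> R^M

X_cols = rng.standard_normal((P, N))     # (feature, obs): one observation per column
X_rows = X_cols.T                        # (obs, feature): one observation per row

# (feature, obs): y_n = A x_n for all n is literally the product A X.
Y_cols = A @ X_cols                      # shape (M, N)

# (obs, feature): the same numbers, but the map appears transposed on the right.
Y_rows = X_rows @ A.T                    # shape (N, M)

print(np.allclose(Y_cols[:, 3], A @ X_cols[:, 3]))
print(np.allclose(Y_rows, Y_cols.T))
```

Both layouts give the same result. The column layout simply keeps the code in the same form as the formula $y = Ax$.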

Projection to the original real space

$\tilde{x}_n = BB^{\top}x_n \in U \subseteq \R^{D}$

Geometrically, this means that we
- orthogonally project $x_n$ onto each eigenvector $b_m$, which gives $b_m b_m^{\top} x_n,\quad m=1,\dots,M$
- add up all the projected vectors, that is, $b_1 b_1^{\top} x_n+\dots+b_M b_M^{\top} x_n=\tilde{x}_n$
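
The two geometric steps can be checked numerically. This is a minimal sketch under the assumption that the columns of `B` are orthonormal; here they come from a QR factorization of a random matrix as a stand-in for actual eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 5, 3                                        # illustrative sizes (assumption)

# Orthonormal columns via reduced QR; a stand-in for the eigenvectors b_1..b_M.
B, _ = np.linalg.qr(rng.standard_normal((D, M)))   # shape (D, M)
x = rng.standard_normal(D)

# Step 1: project x onto each b_m separately (rank-one projections).
pieces = [np.outer(B[:, m], B[:, m]) @ x for m in range(M)]

# Step 2: add up the projected vectors.
x_tilde_sum = np.sum(pieces, axis=0)

# The same result in one shot with the projection matrix B B^T.
x_tilde = B @ B.T @ x

# The residual is orthogonal to every b_m, as expected for an orthogonal projection.
residual = x - x_tilde
print(np.allclose(x_tilde_sum, x_tilde))
print(np.allclose(B.T @ residual, 0.0))
```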

To wrap up,

| Linear mapping | dim(domain) | dim(codomain) | dim(image) |
|---|---|---|---|
| $B^{\top}$ | $D$ | $M$ | $M$ |
| $BB^{\top}$ | $D$ | $D$ | $M$ |
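
The dimensions in the summary table can be read off from matrix shapes and ranks. The sketch below again assumes orthonormal columns for `B` (built with QR on random data) and uses an explicit rank tolerance, which is a choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M = 6, 2                                        # illustrative sizes (assumption)
B, _ = np.linalg.qr(rng.standard_normal((D, M)))   # orthonormal columns

encode = B.T          # maps R^D -> R^M
project = B @ B.T     # maps R^D -> R^D

# For a matrix of shape (rows, cols): dim(domain) = cols, dim(codomain) = rows,
# and dim(image) = rank.
rank_encode = np.linalg.matrix_rank(encode, tol=1e-10)
rank_project = np.linalg.matrix_rank(project, tol=1e-10)

print("B^T  :", encode.shape[1], encode.shape[0], rank_encode)
print("BB^T :", project.shape[1], project.shape[0], rank_project)
```

`project` is a $D\times D$ matrix, but its rank is only $M$: the output lives in the original space while occupying an $M$-dimensional subspace of it.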

| Item | `PCA` in `sklearn` | `svd` in `numpy` |
|---|---|---|
| Eigenvectors | `pca.components_` | `VT` in `U, S, VT = np.linalg.svd(X_scaled)` |
| Principal components | `pca.fit_transform(X_scaled)` | `(X_scaled).dot(pca.components_.T)` |
| Projection onto the principal components | `pca.inverse_transform(pca_transform)` | `(X_scaled).dot(pca.components_.T).dot(pca.components_)` |
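
The correspondences in the table can be verified on the breast cancer data loaded earlier. This sketch assumes `X_scaled` is the standardized feature matrix (via `StandardScaler`), which the post does not show explicitly. `n_comp = 5` and `svd_solver="full"` are choices made for the example. Eigenvectors are compared up to sign, since the sign of a singular vector is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data                      # shape (569, 30)
X_scaled = StandardScaler().fit_transform(X)       # assumed definition of X_scaled

n_comp = 5                                         # illustrative choice
pca = PCA(n_components=n_comp, svd_solver="full")
scores = pca.fit_transform(X_scaled)               # principal components (scores)

# Row 1: eigenvectors. Rows of VT match pca.components_ up to sign.
U, S, VT = np.linalg.svd(X_scaled, full_matrices=False)
same_axes = np.allclose(np.abs(pca.components_), np.abs(VT[:n_comp]))

# Row 2: principal components. Valid because X_scaled is already centered.
scores_np = X_scaled.dot(pca.components_.T)

# Row 3: projection back onto the original feature space.
recon_sklearn = pca.inverse_transform(scores)
recon_np = X_scaled.dot(pca.components_.T).dot(pca.components_)

print(same_axes)
print(np.allclose(scores, scores_np))
print(np.allclose(recon_sklearn, recon_np))
```

The NumPy expressions in rows 2 and 3 skip the mean subtraction that `PCA` performs internally, so they agree with `sklearn` only when the input has (numerically) zero column means, as it does after standardization.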

PCA projection recovery process

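from sklearn.decomposition import PCA

# R is assumed to be a data matrix defined elsewhere; PCA treats the rows of R.T as observations
n_comp = 330
pca = PCA(n_components=n_comp)
pca_fit_transform = pca.fit_transform(R.T)  # codes in the n_comp-dimensional space
pca_inverse_transform = pca.inverse_transform(pca_fit_transform)  # mapped back to the original space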

Additional eigenvalues are sampled as $\tilde{e}\sim N (\mu_{\mathsf{W}}, \Sigma_{\mathsf{W}})$, where $\hat{\mu}_{\mathsf{W}}=\frac{1}{S}\sum_{s=1}^{S}e_s$ and $\hat{\Sigma}_{\mathsf{W}}=\mathrm{Cov}(\boldsymbol{\mathsf{W}})$

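import numpy as np

# Row-wise mean and covariance of COMPONENTS (np.cov treats each row as a variable)
mu_hat_for_EV = np.mean(COMPONENTS, axis=1)
Sigma_hat_for_EV = np.cov(COMPONENTS)

# Draw S_new new samples from the fitted Gaussian
S_new = 500
W_prime = np.random.multivariate_normal(mu_hat_for_EV, Sigma_hat_for_EV, S_new)

generated = np.matmul(pca_inverse_transform, W_prime.T)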
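
The sampling step can be tried in isolation. In the sketch below, `COMPONENTS` is replaced by a hypothetical synthetic array with `K` variables in rows and `S` samples in columns, since the post does not show how it is built. It also uses `np.random.default_rng` instead of the legacy `np.random.multivariate_normal`, which is a swap made for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for COMPONENTS: K variables (rows), S samples (columns).
K, S = 4, 1000
COMPONENTS = rng.standard_normal((K, S)) + np.arange(K)[:, None]

# Estimates of the Gaussian parameters.
mu_hat = COMPONENTS.mean(axis=1)        # shape (K,), mean over the S samples
Sigma_hat = np.cov(COMPONENTS)          # shape (K, K), rows treated as variables

# Draw S_new new vectors from N(mu_hat, Sigma_hat).
S_new = 500
W_prime = rng.multivariate_normal(mu_hat, Sigma_hat, size=S_new)

print(mu_hat.shape, Sigma_hat.shape, W_prime.shape)
```

Each row of `W_prime` is one sampled vector, so `W_prime.T` has shape `(K, S_new)`. For the final matrix product in the post to be defined, the number of columns of `pca_inverse_transform` has to equal `K`.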
