PCA(Principal Component Analysis) (feat. sklearn)

hyangki0119 · March 17, 2022

Prerequisite

Singular value decomposition

• The singular value decomposition (SVD) of $X\in\mathbb{R}^{n\times p}$ is
$X = U D V^{\top}$
• $U\in\mathbb{R}^{n\times n}$ : orthogonal (unitary) matrix whose columns are the left singular vectors of $X$
• $D\in\mathbb{R}^{n\times p}$ : rectangular diagonal matrix whose diagonal elements are the singular values $\sqrt{\lambda}$ (square roots of the eigenvalues of $X^{\top}X$)
• $V\in\mathbb{R}^{p\times p}$ : orthogonal (unitary) matrix whose columns are the right singular vectors of $X$
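The factorization can be checked numerically. A minimal sketch using a small random matrix (not data from this post):

```python
import numpy as np

# Hypothetical small matrix X (n=4 rows, p=3 columns)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

# Full SVD: U is 4x4, VT is 3x3, s holds the singular values
U, s, VT = np.linalg.svd(X, full_matrices=True)

# Rebuild the rectangular diagonal D (4x3) and verify X = U D V^T
D = np.zeros((4, 3))
D[:3, :3] = np.diag(s)
assert np.allclose(X, U @ D @ VT)
```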

SVD and Eigendecomposition
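The connection between the two: the squared singular values of $X$ are the eigenvalues of $X^{\top}X$, and the right singular vectors of $X$ are its eigenvectors. A sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))

U, s, VT = np.linalg.svd(X)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues in ascending order

# Squared singular values of X == eigenvalues of X^T X
assert np.allclose(np.sort(s**2), eigvals)

# Right singular vectors of X are eigenvectors of X^T X (up to sign)
for i, lam in enumerate(eigvals):
    v = eigvecs[:, i]
    assert np.allclose(X.T @ X @ v, lam * v)
```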

sklearn

• breast cancer data
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
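The comparison table further down works on a standardized matrix X_scaled. A sketch of producing it, assuming sklearn's StandardScaler (zero mean, unit variance per feature):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import numpy as np

cancer = load_breast_cancer()

# Standardize each of the 30 features across the 569 observations
X_scaled = StandardScaler().fit_transform(cancer.data)

assert X_scaled.shape == (569, 30)
assert np.allclose(X_scaled.mean(axis=0), 0, atol=1e-9)
```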

Principal Component Analysis

Problem setting

• Denote the dataset $\mathcal{X}=\{x_1,\dots,x_N\},\ x_n\in\mathbb{R}^{D}$, with
• mean $\mathbf{0}$
• data covariance matrix $S=\dfrac{1}{N}\sum\limits_{n=1}^{N}x_n x_n^{\top}$
• Assume that there exists a low-dimensional compressed representation (code)
$z_n = B^{\top}x_n \in \mathbb{R}^{M}$
of $x_n$, where
• the projection matrix $B:=[b_1,b_2,\dots,b_M]\in\mathbb{R}^{D\times M}$ has orthonormal columns
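In PCA the columns of $B$ are the $M$ eigenvectors of $S$ with the largest eigenvalues. A small numpy sketch of computing the codes $z_n = B^{\top}x_n$ (random data, dimensions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
D_dim, M, N = 5, 2, 100

# Centered data (mean 0, as in the problem setting); columns are x_n
X = rng.standard_normal((D_dim, N))
X = X - X.mean(axis=1, keepdims=True)

# Data covariance S = (1/N) sum_n x_n x_n^T
S = (X @ X.T) / N

# B: the M eigenvectors of S with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(S)   # ascending order
B = eigvecs[:, ::-1][:, :M]            # shape (D, M)

# Codes z_n = B^T x_n for all n at once
Z = B.T @ X
assert Z.shape == (M, N)
```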

(NOTE) Why is X in (feature, obs) shape, not (obs, feature)?

• Finally, PCA is a linear mapping that, via orthogonal projection, reduces the dimension of the image (the column space / codomain spanned by the columns of the design matrix $X$) below the dimension of the domain. That is,
$y=Ax$
• So it is more convenient to set $X\in\mathbb{R}^{P \times N}$ rather than $X\in\mathbb{R}^{N \times P}$ (as in a regression problem)

Projection to the original real space

$\tilde{x}_n = BB^{\top}x_n \in U\ (\subseteq \mathbb{R}^{D})$
• Geometrically, this means that we
• orthogonally project $x$ onto each eigenvector $b_m$, which yields
$b_m b_m^{\top} x,\quad m=1,\dots,M$
• then add up all the projected vectors, that is, $b_1 b_1^{\top} x+\dots+b_M b_M^{\top} x=\tilde{x}$
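That the sum of the $M$ rank-1 projections equals applying $BB^{\top}$ in one step can be checked numerically; a sketch using a random orthonormal $B$ obtained from a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(3)
D_dim, M = 4, 2
x = rng.standard_normal(D_dim)

# Orthonormal columns b_1, ..., b_M (Q factor of a random matrix)
B, _ = np.linalg.qr(rng.standard_normal((D_dim, M)))

# Sum of the M rank-1 projections b_m b_m^T x ...
x_tilde = sum(np.outer(B[:, m], B[:, m]) @ x for m in range(M))

# ... equals the single projection B B^T x
assert np.allclose(x_tilde, B @ B.T @ x)
```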

To wrap up,

| Linear mapping | dim(domain) | dim(real space) | dim(image) |
|---|---|---|---|
| $B^{\top}$ | $D$ | $M$ | $M$ |
| $BB^{\top}$ | $D$ | $D$ | $M$ |

| Item | PCA in sklearn | SVD in numpy |
|---|---|---|
| Eigenvectors | pca.components_ | VT in U, S, VT = np.linalg.svd(X_scaled) |
| Principal components | pca.fit_transform(X_scaled) | (X_scaled).dot(pca.components_.T) |
| Projection onto the principal components | pca.inverse_transform(pca_transform) | (X_scaled).dot(pca.components_.T).dot(pca.components_) |
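These correspondences can be verified on the (standardized) breast cancer data. Eigenvectors match only up to sign, since the sign of each singular vector is arbitrary:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA()
Z = pca.fit_transform(X_scaled)
U, S_vals, VT = np.linalg.svd(X_scaled, full_matrices=False)

# Eigenvectors: rows of pca.components_ match rows of VT up to sign
assert np.allclose(np.abs(pca.components_), np.abs(VT), atol=1e-6)

# Principal components: fit_transform == X_scaled @ components_.T
# (X_scaled is already centered, so PCA's internal centering is a no-op)
assert np.allclose(Z, X_scaled @ pca.components_.T)

# Projection back: inverse_transform == Z @ components_ (+ mean, here ~0)
assert np.allclose(pca.inverse_transform(Z), Z @ pca.components_)
```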

PCA projection recovery process

from sklearn.decomposition import PCA

n_comp = 330
pca = PCA(n_components=n_comp)
pca_fit_transform = pca.fit_transform(R.T)
pca_inverse_transform = pca.inverse_transform(pca_fit_transform)

$\tilde{e}\sim N (\hat{\mu}_{\mathsf{W}}, \hat{\Sigma}_{\mathsf{W}})$
where $\hat{\mu}_{\mathsf{W}}=\frac{1}{S}\sum_{s=1}^{S}e_s$ and $\hat{\Sigma}_{\mathsf{W}}=\mathrm{Cov}(\boldsymbol{\mathsf{W}})$

mu_hat_for_EV = list(map(lambda x: np.mean(x), COMPONENTS))
Sigma_hat_for_EV = np.cov(COMPONENTS)

S_new = 500
W_prime = np.random.multivariate_normal(mu_hat_for_EV, Sigma_hat_for_EV, S_new)
generated = np.matmul(pca_inverse_transform, W_prime.T)
