PCA(Principal Component Analysis) (feat. sklearn)

hyangki0119 · March 17, 2022


Singular value decomposition

  • The singular value decomposition (SVD) of $X\in\mathbb{R}^{n\times p}$ is
    $X = U D V^{\top}$
    • $U\in\mathbb{R}^{n\times n}$ : orthogonal (unitary) matrix whose columns are the left singular vectors of $X$
    • $D\in\mathbb{R}^{n\times p}$ : rectangular diagonal matrix whose diagonal elements are the singular values $\sqrt{\lambda}$ (square roots of the eigenvalues of $X^{\top}X$)
    • $V\in\mathbb{R}^{p\times p}$ : orthogonal (unitary) matrix whose columns are the right singular vectors of $X$
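As a quick sanity check (my addition, not from the original post), a small numpy example verifying the factorization above on a random matrix:

```python
import numpy as np

# Verify X = U D V^T for a random 5x3 matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

U, s, VT = np.linalg.svd(X)   # U: (5,5), s: singular values, VT: (3,3)
D = np.zeros((5, 3))          # rectangular diagonal matrix
D[:3, :3] = np.diag(s)

# U and V are orthogonal, and U D V^T reconstructs X
assert np.allclose(U @ U.T, np.eye(5))
assert np.allclose(VT @ VT.T, np.eye(3))
assert np.allclose(U @ D @ VT, X)
```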

SVD and Eigendecomposition


  • breast cancer data
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
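The SVD–eigendecomposition link can be checked directly on the breast cancer data: the eigenvalues of $X^{\top}X$ are the squared singular values of $X$, and the leading eigenvector matches the first right singular vector up to sign. A sketch (standardizing with `StandardScaler` is my assumption, not stated in the post):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

# SVD of the data matrix
U, s, VT = np.linalg.svd(X_scaled, full_matrices=False)
# Eigendecomposition of X^T X (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(X_scaled.T @ X_scaled)

# eigenvalues of X^T X == squared singular values of X
assert np.allclose(eigvals[::-1], s**2)
# leading eigenvector == first right singular vector, up to sign
assert np.allclose(np.abs(eigvecs[:, -1]), np.abs(VT[0]), atol=1e-6)
```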

Principal Component Analysis

Problem setting

  • Denote the dataset $\mathcal{X}=\{x_1,\dots,x_N\},\ x_n\in\mathbb{R}^{D}$ with
    • mean $\mathbf{0}$
    • the data covariance matrix $S=\dfrac{1}{N}\sum\limits_{n=1}^{N}x_n x_n^{\top}$
  • Assume that there exists a low-dimensional compressed representation (code)
    $z_n = B^{\top}x_n \in \mathbb{R}^{M}$
    of $x_n$, where
    • the projection matrix $B:=[b_1,b_2,\dots,b_M]\in\mathbb{R}^{D\times M}$
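The setup above can be sketched on synthetic data (all names and sizes here are illustrative): center the data, form the covariance $S$, and take $B$ as the top-$M$ eigenvectors of $S$ to obtain the codes $z_n = B^{\top}x_n$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 200, 5, 2
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)            # enforce the mean-0 assumption

S = (X.T @ X) / N                 # data covariance matrix, (D, D)
eigvals, eigvecs = np.linalg.eigh(S)
B = eigvecs[:, ::-1][:, :M]       # projection matrix: top-M eigenvectors, (D, M)

Z = X @ B                         # codes z_n = B^T x_n, shape (N, M)
assert Z.shape == (N, M)
```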

(NOTE) Why is X in (feature, obs), not (obs, feature)?

  • PCA is a linear mapping which, through orthogonal projection, reduces the dimension of the image (the column space, or codomain, spanned by the columns of the design matrix $X$) below the dimension of the domain.
  • So it is more convenient to set $X\in\mathbb{R}^{P \times N}$ rather than $X\in\mathbb{R}^{N \times P}$ (as in a regression problem), since then the observations are the columns of $X$.

Projection to the original real space

$\tilde{x} = BB^{\top}x_n \in U\ (\subseteq \mathbb{R}^{D})$
  • Geometrically, this means:
    • Orthogonally project $x$ onto each eigenvector $b_m$. This results in
      $b_m b_m^{\top} x,\quad m=1,\dots,M$
    • Add up all projected vectors, that is, $b_1 b_1^{\top} x+\dots+b_M b_M^{\top} x=\tilde{x}$
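The two geometric steps can be checked numerically; here `B` is an illustrative matrix with orthonormal columns (via QR), not one computed from data:

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 4, 2
B, _ = np.linalg.qr(rng.normal(size=(D, M)))   # orthonormal columns b_1, ..., b_M
x = rng.normal(size=D)

x_tilde = B @ (B.T @ x)                        # BB^T x in one shot
# sum of rank-1 projections b_m b_m^T x
rank1_sum = sum(B[:, m] * (B[:, m] @ x) for m in range(M))
assert np.allclose(x_tilde, rank1_sum)
```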

To wrap up, for the PCA linear mapping,

$\dim(\text{image}) = M \le \dim(\text{real space}) = \dim(\text{domain}) = D$

| Item | `PCA` in sklearn | `svd` in numpy |
| --- | --- | --- |
| Eigenvectors | `pca.components_` | `VT` in `U, S, VT = np.linalg.svd(X_scaled)` |
| Principal components | `pca.fit_transform(X_scaled)` | `(X_scaled).dot(pca.components_.T)` |
| Projection onto the principal components | `pca.inverse_transform(pca_transform)` | `(X_scaled).dot(pca.components_.T).dot(pca.components_)` |
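The equivalences in the table can be verified directly on the breast cancer data (a sketch; `n_components=2` and the standardization step are my choices, not from the post):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA(n_components=2)
pca_transform = pca.fit_transform(X_scaled)

# Principal components: project the centered data onto the eigenvectors
assert np.allclose(pca_transform, X_scaled.dot(pca.components_.T))
# Projection back onto the principal subspace in the original space
assert np.allclose(pca.inverse_transform(pca_transform),
                   X_scaled.dot(pca.components_.T).dot(pca.components_))
```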

PCA projection recovery process

from sklearn.decomposition import PCA

n_comp = 330
pca = PCA(n_components=n_comp)
# R is the (feature, obs) data matrix; R.T puts observations in rows for sklearn
pca_fit_transform = pca.fit_transform(R.T)
pca_inverse_transform = pca.inverse_transform(pca_fit_transform)
  • Additional eigenvalues
    $\tilde{e}\sim N (\mu_{\mathsf{W}}, \Sigma_{\mathsf{W}})$
    where $\hat{\mu}_{\mathsf{W}}=\frac{1}{S}\sum_{s=1}^{S}e_s$ and $\hat{\Sigma}_{\mathsf{W}}=\mathrm{Cov}(\boldsymbol{\mathsf{W}})$
import numpy as np

# per-component means and covariance of the codes W (COMPONENTS defined elsewhere)
mu_hat_for_EV = list(map(lambda x: np.mean(x), COMPONENTS))
Sigma_hat_for_EV = np.cov(COMPONENTS)

S_new = 500
W_prime = np.random.multivariate_normal(mu_hat_for_EV, Sigma_hat_for_EV, S_new)
generated = np.matmul(pca_inverse_transform, W_prime.T)

