import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
Principal Component Analysis
Problem setting
Denote dataset χ={x1,…,xN},xn∈RD with
mean 0
the data covariance matrix S=N1n=1∑Nxnxn⊤
Assume that there exists a low-dimensional compressed representation (code)
zn=B⊤xn∈RM
of xn, where
The projection matrix B:=[b1,b2,…,bM]∈RD×M
(NOTE) Why X is in (feature, obs), not (obs, feature)?
Finally, PCA is a linear mapping which reduces the dimension of images (or column space, codomain, spanned by the columns of design matrix X) smaller than the dimension of domain through orthogonal projection. That is,
y=Ax
So, it is more convinient that set X∈RP×N than X∈RN×P (like regression problem)
Projection to the original real space
x~=BB⊤xn∈U(⊆RD)
Geometrically, this mean that
Do orthognal projection x onto the eigenvector bm. This results in
bmbm⊤x,m=1,…,M
Add up all projected vectors, that us, b1b1⊤x+⋯+bMbM⊤x=x~