PCA (Principal Component Analysis) is one of the most well-known dimensionality reduction techniques. Simply put, it lets us represent, say, 100-dimensional data with three-dimensional data. Before we get to the complicated part, let's first find out why we should reduce dimensionality.

Why do we reduce dimensionality?

Curse of Dimensionality

The curse of dimensionality is one of the main reasons why we reduce dimensionality. As the catchy name suggests, it means that increasing the dimensionality is not always desirable.

As shown above, data that are densely distributed in two-dimensional space become relatively sparse in three-dimensional space. In other words, as the dimensionality increases, the amount of data needed to describe the space grows rapidly. Therefore, if the dimensionality of the data is higher than necessary, it becomes more difficult to train a model (too little data is given for the dimensionality). This is why we use dimensionality-reduced inputs when the dimensionality of the original input is too high.
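To get a rough feel for this, here is a small illustrative simulation (a sketch I am adding for illustration, not part of the original figures): with a fixed number of uniform samples, the fraction that lands inside a fixed region shrinks rapidly as the dimension grows.

```python
# Illustrative sketch: with a fixed number of uniform samples in [0, 1]^d,
# the fraction falling in the central half of every axis is roughly 0.5**d,
# so the same amount of data covers the space ever more sparsely as d grows.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

for d in (2, 3, 10, 50):
    points = rng.uniform(0.0, 1.0, size=(n_samples, d))
    # A point is "inside" if every coordinate lies in the central half of its axis.
    inside = np.all((points > 0.25) & (points < 0.75), axis=1).mean()
    print(f"d={d:3d}: fraction in the central half-cube = {inside:.4f}")
```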

Data Visualization

Dimensionality reduction is also used in data visualization. Suppose we just finished a clustering task on 100-dimensional data. In order to "see" whether the clustering was performed well, the 100-dimensional data should be reduced to two-dimensional or three-dimensional data while maintaining the characteristics of the data as much as possible. This is when dimensionality reduction techniques such as PCA are used. The plots above are the clustering results of music data that I created a few months ago. The first plot was visualized using PCA and the second was created using t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE is also a commonly used dimensionality reduction technique; you can get some information about it in this post.
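For reference, a minimal sketch of how such plots can be produced with scikit-learn is shown below. The arrays `features` and `cluster_labels` are hypothetical placeholders standing in for the music data and clustering results, not the actual dataset used for the plots above.

```python
# Minimal sketch: visualize high-dimensional clustering results with PCA and t-SNE.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 100))          # placeholder 100-dimensional data
cluster_labels = rng.integers(0, 4, size=300)   # placeholder cluster assignments

pca_2d = PCA(n_components=2).fit_transform(features)
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(features)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(pca_2d[:, 0], pca_2d[:, 1], c=cluster_labels, s=10)
ax1.set_title("PCA")
ax2.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=cluster_labels, s=10)
ax2.set_title("t-SNE")
plt.show()
```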

How PCA Works

Basic Idea

The fundamental idea of PCA is maintaining the characteristics of the data. Let's say that a set of two-dimensional data is given as follows. Suppose we reduce the data to one-dimensional data. What does it mean to reduce dimensionality while maintaining the characteristics of the data?

The plots above show two different cases of dimensionality reduction (blue dots to red dots). Projection is used in both cases, yet the directions of the projection are different. In the left case, the projected data (red dots) are densely concentrated in the center; even distant data points are projected close to other points. The situation on the right is different: the projected data are widely distributed along the black line (the variance is large). In other words, the characteristics of the data are better preserved in the right case.
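Here is a small numerical illustration of this point, assuming NumPy and two arbitrarily chosen unit directions (not the actual projections from the plots above):

```python
# Project the same 2-D data onto two different unit directions and compare variances.
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data, roughly stretched along the 45-degree direction.
data = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 2.5], [2.5, 3.0]], size=500)

u_good = np.array([1.0, 1.0]) / np.sqrt(2)   # roughly along the spread of the data
u_bad = np.array([1.0, -1.0]) / np.sqrt(2)   # roughly perpendicular to the spread

print("variance along u_good:", np.var(data @ u_good, ddof=1))  # large
print("variance along u_bad: ", np.var(data @ u_bad, ddof=1))   # small
```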

Mathematical Interpretation

So, the key is maintaining the characteristics of the data. As shown above, this can be done by keeping the variance of the projected data as large as possible. Now that the basic idea is in place, let's get into the math.

$$X=\begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}\quad (x_i\in \mathbb{R}^d)$$

Let's say we have $n$ $d$-dimensional data points. If we call the first vector onto which we project the data (the black line in the example above) $u$, the problem can be summarized as follows.

$$u_1=\underset{u}{\operatorname{arg\,max}}\,Var(u^TX)$$

To solve this problem, we first need to centralize $X$.

$$\begin{aligned} \tilde{X}&=X-\bar{X}\\&=\begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}-\begin{bmatrix} \bar{x^1} & \bar{x^1} & \dots & \bar{x^1} \\ \vdots & \vdots & & \vdots \\ \bar{x^d} & \bar{x^d} & \dots & \bar{x^d} \end{bmatrix}\quad \left(\bar{x^i}=\frac{1}{n}\sum_{j=1}^{n}{x_{ji}}\right) \end{aligned}$$

Centralizing simply means moving the whole data set to the origin. Centralizing only moves the location of the data and does not change its overall shape. Since we only care about the direction of $u_1$, we may use $\tilde{X}$ instead of $X$. Therefore the optimization problem can be rewritten as below.

$$u_1=\underset{u}{\operatorname{arg\,max}}\,Var(u^T\tilde{X})$$
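As a quick aside, a minimal sketch of the centralizing step in NumPy might look like this (the data matrix here is a random placeholder, laid out as $d\times n$ with one data point per column, matching the notation above):

```python
# Centralize X: subtract the mean of each row (each feature) from the data matrix.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.normal(loc=3.0, scale=2.0, size=(d, n))   # placeholder d x n data

X_bar = X.mean(axis=1, keepdims=True)             # d x 1 vector of feature means
X_tilde = X - X_bar                               # centralized data

print(np.allclose(X_tilde.mean(axis=1), 0.0))     # True: each row now averages to zero
```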

Let's look at $u^T\tilde{X}$ for a moment. $u^T\tilde{X}$ can be written as follows ($\tilde{x_i}$ is the $i$-th column vector of $\tilde{X}$).

$$\begin{aligned} u^T\tilde{X}&=\begin{bmatrix} u_1 & u_2 & \dots & u_d \end{bmatrix}\begin{bmatrix} \tilde{x_1} & \tilde{x_2} & \dots & \tilde{x_n} \end{bmatrix}\quad (u_i\in \mathbb{R}) \\&=\begin{bmatrix} u\cdot\tilde{x_1} & u\cdot\tilde{x_2} & \dots & u\cdot\tilde{x_n} \end{bmatrix} \end{aligned}$$

The expectation of $u^T\tilde{X}$ is thus

$$\begin{aligned} E(u^T\tilde{X})&=\frac{1}{n}(u\cdot\tilde{x_1}+u\cdot\tilde{x_2}+\dots+u\cdot\tilde{x_n}) \\&=\frac{1}{n}\left\{u_1(\tilde{x_{11}}+\dots+\tilde{x_{n1}})+\dots+u_d(\tilde{x_{1d}}+\dots+\tilde{x_{nd}})\right\} \end{aligned}$$

Since $\tilde{X}$ is a centralized matrix, the value of $\tilde{x_{1i}}+\dots+\tilde{x_{ni}}$ is zero. Therefore $E(u^T\tilde{X})$ equals zero. Now let's get back to the original problem.

$$\begin{aligned} u_1&=\underset{u}{\operatorname{arg\,max}}\,Var(u^T\tilde{X}) \\&=\underset{u}{\operatorname{arg\,max}}\,\frac{1}{n-1}(u^T\tilde{X}\tilde{X}^Tu)\quad (\because E(u^T\tilde{X})=0) \\&=\underset{u}{\operatorname{arg\,max}}\,u^T\frac{1}{n-1}(\tilde{X}\tilde{X}^T)u \\&=\underset{u}{\operatorname{arg\,max}}\,u^TVar(\tilde{X})u \\&=\underset{u}{\operatorname{arg\,max}}\,u^TSu\quad (S=Var(\tilde{X})) \end{aligned}$$
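A quick numerical sanity check of this identity, assuming NumPy and a random placeholder data matrix (variable names continue the centralizing sketch above):

```python
# For a unit vector u, the sample variance of u^T X~ equals u^T S u,
# where S = (1/(n-1)) X~ X~^T is the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.normal(size=(d, n))
X_tilde = X - X.mean(axis=1, keepdims=True)

S = (X_tilde @ X_tilde.T) / (n - 1)               # covariance matrix of the data
u = rng.normal(size=d)
u /= np.linalg.norm(u)                            # arbitrary unit direction

projected = u @ X_tilde                           # u^T X~, a length-n array of scalars
print(np.isclose(np.var(projected, ddof=1), u @ S @ u))   # True
```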

Since $u^TSu$ is a quadratic form, it can be made arbitrarily large unless we put a constraint on $u$. But as mentioned above, we only care about the direction of $u_1$, so its length can be set to 1. Using the Lagrange multiplier method,

$$\mathcal{L}(u,\lambda)=u^TSu-\lambda(u^Tu-1)$$
$$\frac{\partial \mathcal{L}}{\partial u}=2Su-2\lambda u=0$$
$$\therefore Su=\lambda u\quad (1)$$

Let's apply the result to the optimization problem.

$$\begin{aligned} u_1&=\underset{u}{\operatorname{arg\,max}}\,u^TSu \\&=\underset{u}{\operatorname{arg\,max}}\,u^T\lambda u \\&=\underset{u}{\operatorname{arg\,max}}\,\lambda u^Tu \\&=\underset{u}{\operatorname{arg\,max}}\,\lambda\quad (\because u^Tu=1) \end{aligned}$$

Voilà! The $u_1$ we need is the vector that maximizes $\lambda$. Since $\lambda$ is an eigenvalue of $S$ ($=Var(\tilde{X})$) according to (1), $u_1$ is the eigenvector of the covariance matrix of $\tilde{X}$ corresponding to the largest eigenvalue. Such a $u_1$ is called the first principal component. The second principal component is the vector that is orthogonal to the first principal component and maximizes the variance of the projected data (you can get the rest of the principal components in the same way).
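Putting the derivation together, a minimal PCA sketch based on the eigendecomposition of the covariance matrix might look like this (random placeholder data, NumPy assumed):

```python
# The principal components are the eigenvectors of the covariance matrix S,
# ordered by decreasing eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(d, n))
X_tilde = X - X.mean(axis=1, keepdims=True)

S = (X_tilde @ X_tilde.T) / (n - 1)

eigenvalues, eigenvectors = np.linalg.eigh(S)     # eigh: S is symmetric
order = np.argsort(eigenvalues)[::-1]             # sort eigenvalues in descending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

u1 = eigenvectors[:, 0]                           # first principal component
reduced = eigenvectors[:, :2].T @ X_tilde         # data projected onto the top 2 components
print(u1.shape, reduced.shape)                    # (5,) (2, 200)
```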

PCA is also deeply related to SVD (Singular Value Decomposition).

$$X=U\Sigma V^T$$

$U$ consists of the eigenvectors of $XX^T$ and $V$ consists of the eigenvectors of $X^TX$. What happens if we centralize $X$? (The $U$, $\Sigma$, $V$ below are different from the $U$, $\Sigma$, $V$ above, but for convenience the same symbols are used.)

$$\tilde{X}=U\Sigma V^T$$

The column vectors of $U$ will be the eigenvectors of $\tilde{X}\tilde{X}^T$. Yes, the column vectors of $U$ are actually the eigenvectors of the covariance matrix of $\tilde{X}$. If the singular values in $\Sigma$ are listed in descending order, the first column vector of $U$ will be the first principal component of $X$. Simply put, we can get the principal components of $X$ by performing SVD on the centralized $X$.
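A quick numerical check of this connection, again with random placeholder data and NumPy:

```python
# The left singular vectors of the centralized data match the eigenvectors of its
# covariance matrix (up to sign), so U's first column is the first principal component.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(d, n))
X_tilde = X - X.mean(axis=1, keepdims=True)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
S = (X_tilde @ X_tilde.T) / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Left singular vectors of the centralized data (singular values come pre-sorted).
U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)

print(np.allclose(np.abs(U[:, 0]), np.abs(eigenvectors[:, 0])))   # True, up to sign
print(np.allclose(sigma**2 / (n - 1), eigenvalues))               # True: lambda_i = sigma_i^2 / (n-1)
```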


So far, we have looked at the most basic form of PCA. There are some advanced PCA techniques that are suitable for particular cases - one example is dual PCA, which is used when $d$ is much greater than $n$ (when the dimensionality is much higher than the number of data points). This lecture video will be quite helpful if you want to learn about additional PCA techniques.
