Linear Algebra 6-5*. PCA

WooSeongkyun · January 14, 2023

Linear Algebra for AI

  • Notes written while reading Goodfellow's Deep Learning book

Principal Components Analysis

  • Condition
    - suppose we have a collection of $m$ points $\{ \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(m)} \}$ in $\mathbb{R}^n$
    - suppose we would like to apply lossy compression to these points
    - Lossy compression means storing the points in a way that requires less memory but may lose some precision
    - we would like to lose as little precision as possible

  • Statement
    - we can construct an encoding function $f$ that maps each point $\boldsymbol{x}^{(i)} \in \mathbb{R}^{n}$ to a code vector $\boldsymbol{c}^{(i)} \in \mathbb{R}^{l}$, where $l<n$
    - $f(\boldsymbol{x})=\boldsymbol{c}$
    - we can construct a decoding function $g$ such that $\boldsymbol{x} \approx g(f(\boldsymbol{x}))$
    - to keep the decoder simple, choose a matrix $\boldsymbol{D} \in \mathbb{R}^{n \times l}$
    - let the columns of $\boldsymbol{D}$ be orthogonal to each other
    - since the scale of $\boldsymbol{D}$ is otherwise unconstrained, we constrain all of the columns of $\boldsymbol{D}$ to have unit norm
    - let $g(\boldsymbol{c})=\boldsymbol{D}\boldsymbol{c}$
    - we can find the optimal code $\boldsymbol{c}^{*}$ by minimizing the distance between the input and its reconstruction: $\boldsymbol{c}^{*}=\arg\min_{\boldsymbol{c}} \| \boldsymbol{x}-g(\boldsymbol{c}) \|_2$
    - we can switch to the squared $L^2$ norm instead of the $L^2$ norm itself, because both are minimized by the same value of $\boldsymbol{c}$
    - $\boldsymbol{c}^{*}=\arg\min_{\boldsymbol{c}} \| \boldsymbol{x}-g(\boldsymbol{c}) \|_2^2$
    - $(\boldsymbol{x}-g(\boldsymbol{c}))^T(\boldsymbol{x}-g(\boldsymbol{c}))$
    - $=\boldsymbol{x}^{T}\boldsymbol{x}-\boldsymbol{x}^{T}g(\boldsymbol{c})-g(\boldsymbol{c})^{T}\boldsymbol{x}+g(\boldsymbol{c})^{T}g(\boldsymbol{c})$
    - $=\boldsymbol{x}^{T}\boldsymbol{x}-2\boldsymbol{x}^{T}g(\boldsymbol{c})+g(\boldsymbol{c})^{T}g(\boldsymbol{c})$ (because $\boldsymbol{x}^{T}g(\boldsymbol{c})$ is a scalar, equal to its own transpose)
    - we can ignore the first term, since it does not depend on $\boldsymbol{c}$
    - $\boldsymbol{c}^{*}=\arg\min_{\boldsymbol{c}} -2\boldsymbol{x}^{T}g(\boldsymbol{c})+g(\boldsymbol{c})^{T}g(\boldsymbol{c})$
    - $=\arg\min_{\boldsymbol{c}} -2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{D}^{T}\boldsymbol{D}\boldsymbol{c}$
    - $=\arg\min_{\boldsymbol{c}} -2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{I}_{l}\boldsymbol{c}$
    - (by the orthogonality and unit norm constraints on $\boldsymbol{D}$)
    - $=\arg\min_{\boldsymbol{c}} -2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{c}$
    - $\nabla_{\boldsymbol{c}}(-2\boldsymbol{x}^{T}\boldsymbol{D}\boldsymbol{c}+\boldsymbol{c}^{T}\boldsymbol{c})=0$
    - $-2\boldsymbol{D}^{T}\boldsymbol{x}+2\boldsymbol{c}=0$
    - $\boldsymbol{c}=\boldsymbol{D}^{T}\boldsymbol{x}$

    - conclusion
      - $f(\boldsymbol{x})=\boldsymbol{D}^{T}\boldsymbol{x}$
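As a quick numerical illustration (my own sketch, not from the book), the NumPy snippet below builds an arbitrary $\boldsymbol{D}$ with orthonormal columns via QR decomposition and checks that the optimal code $\boldsymbol{c}=\boldsymbol{D}^{T}\boldsymbol{x}$ reconstructs $\boldsymbol{x}$ at least as well as any other code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 2  # ambient dimension n, code dimension l < n

# An arbitrary matrix with orthonormal columns (via QR), so D^T D = I_l.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
assert np.allclose(D.T @ D, np.eye(l))

x = rng.standard_normal(n)
c = D.T @ x    # encoder f(x) = D^T x
x_hat = D @ c  # decoder g(c) = D c

# c is the optimal code: any other code reconstructs x no better.
for _ in range(100):
    c_other = c + rng.standard_normal(l)
    assert np.linalg.norm(x - D @ c_other) >= np.linalg.norm(x - x_hat)
```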
  • Statement 2
    - $r(\boldsymbol{x})=g(f(\boldsymbol{x}))=\boldsymbol{D}\boldsymbol{D}^{T}\boldsymbol{x}$
    - We will find the $\boldsymbol{D}$ minimizing the $L^{2}$ distance between inputs and reconstructions
    - $\boldsymbol{D}^{*}=\arg\min_{\boldsymbol{D}} \sqrt{\sum_{i,j}\left(x_{j}^{(i)}-r(\boldsymbol{x}^{(i)})_{j}\right)^{2}}$ subject to $\boldsymbol{D}^{T}\boldsymbol{D}=\boldsymbol{I}_{l}$
    - To derive the algorithm for finding $\boldsymbol{D}^{*}$, we will start by considering the case where $l=1$
    - In this case $\boldsymbol{D}$ is just a single vector $\boldsymbol{d}$
    - $\boldsymbol{d}^{*}=\arg\min_{\boldsymbol{d}} \sum_{i} \| \boldsymbol{x}^{(i)}-\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{x}^{(i)} \|_2^2$ subject to $\|\boldsymbol{d}\|_{2}=1$
    - exploiting the fact that a scalar is its own transpose,
    - $\boldsymbol{d}^{*}=\arg\min_{\boldsymbol{d}} \sum_{i} \| \boldsymbol{x}^{(i)}-\boldsymbol{d}^{T}\boldsymbol{x}^{(i)}\boldsymbol{d} \|_2^2$ subject to $\|\boldsymbol{d}\|_{2}=1$
    - let $\boldsymbol{X} \in \mathbb{R}^{m \times n}$ be the matrix defined by stacking all of the vectors describing the points, such that $\boldsymbol{X}_{i,:}=\boldsymbol{x}^{(i)T}$
    - we can now rewrite the problem as
    - $\boldsymbol{d}^{*}=\arg\min_{\boldsymbol{d}} \| \boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^T \|^{2}_{F}$
    - here the Frobenius norm is $\|\boldsymbol{X}\|_{F}=\sqrt{\sum_{i}\sum_{j}|x_{ij}|^{2}}$, so that $\|\boldsymbol{X}\|^{2}_{F}=\mathrm{Tr}(\boldsymbol{X}\boldsymbol{X}^{T})$
    - $\arg\min_{\boldsymbol{d}} \| \boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^T \|^{2}_{F}$
    - $=\arg\min_{\boldsymbol{d}} \mathrm{Tr}\left((\boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^T)^T(\boldsymbol{X}-\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^T)\right)$
    - $=\arg\min_{\boldsymbol{d}} \mathrm{Tr}\left(\boldsymbol{X}^{T}\boldsymbol{X}-\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^T-\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}+\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}\right)$
    - $=\arg\min_{\boldsymbol{d}} -\mathrm{Tr}(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})-\mathrm{Tr}(\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X})+\mathrm{Tr}(\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})$
    - (the term not involving $\boldsymbol{d}$ is ignored, since it does not affect the arg min)
    - using the cyclic property of the trace,
    - $=\arg\min_{\boldsymbol{d}} -2\mathrm{Tr}(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})+\mathrm{Tr}(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T}\boldsymbol{d}\boldsymbol{d}^{T})$
    - subject to $\boldsymbol{d}^{T}\boldsymbol{d}=1$
    - $=\arg\min_{\boldsymbol{d}} -\mathrm{Tr}(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d}\boldsymbol{d}^{T})$ (substituting $\boldsymbol{d}^{T}\boldsymbol{d}=1$ into the second term)
    - subject to $\boldsymbol{d}^{T}\boldsymbol{d}=1$
    - $=\arg\max_{\boldsymbol{d}} \mathrm{Tr}(\boldsymbol{d}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{d})$
    - subject to $\boldsymbol{d}^{T}\boldsymbol{d}=1$
    - the optimal $\boldsymbol{d}$ is given by the eigenvector of $\boldsymbol{X}^{T}\boldsymbol{X}$ corresponding to the largest eigenvalue
    - in the general case, the matrix $\boldsymbol{D}$ is given by the $l$ eigenvectors corresponding to the $l$ largest eigenvalues. This may be shown using proof by induction; a numerical check appears below
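The whole result can be checked numerically. Below is a minimal sketch (my own, under the assumption of synthetic low-rank data; `recon_error` is an illustrative helper, not a library function): the top-$l$ eigenvectors of $\boldsymbol{X}^{T}\boldsymbol{X}$ are taken as $\boldsymbol{D}$, the Frobenius norm identity above is verified, and the reconstruction error is compared against a random orthonormal basis of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 200, 5, 2

# Synthetic data with low-rank structure, so lossy compression loses little.
X = rng.standard_normal((m, 3)) @ rng.standard_normal((3, n))

# Frobenius norm identity used above: ||X||_F^2 = Tr(X X^T).
assert np.isclose(np.sum(X ** 2), np.trace(X @ X.T))

# D = the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh: eigenvalues in ascending order
D = eigvecs[:, -l:]                         # last l columns = top-l eigenvectors

def recon_error(X, D):
    """Squared reconstruction error ||X - X D D^T||_F^2 for an orthonormal D."""
    R = X - X @ D @ D.T
    return np.sum(R ** 2)

# The eigenvector basis should beat a random orthonormal basis of the same size.
D_rand, _ = np.linalg.qr(rng.standard_normal((n, l)))
print(recon_error(X, D))       # minimal: the sum of the n - l smallest eigenvalues
print(recon_error(X, D_rand))  # larger in general
```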
