Week 2 - Gaussian Process

ToBigs1617 Time-Series · April 14, 2022


Gaussian Basics

Gaussian random variable $X\sim N(\mu, \Sigma)$

$P(x;\mu, \Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$ - probability density function
$\int_{x}P(x;\mu, \Sigma)dx=1$
$P(X_A)=\int_{X_B}P(X_A,X_B;\mu, \Sigma)dX_B$ - marginalization
$P(X_B)=\int_{X_A}P(X_A,X_B;\mu, \Sigma)dX_A$ - marginalization
$P(X_A|X_B)=\frac{P(X_A,X_B;\mu, \Sigma)}{\int_{X_A}P(X_A,X_B;\mu, \Sigma)dX_A}$ - conditioning
$X_A|X_B=x_B\sim N(\mu_A+\Sigma_{AB}\Sigma_{BB}^{-1}(x_B-\mu_B),\Sigma_{AA}-\Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})$
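
The conditioning rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `condition_gaussian`, the index-list arguments, and the toy 2-D numbers are all assumptions made for the example.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Return mean and covariance of X_A | X_B = x_b for a joint Gaussian."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    # Solve linear systems with S_bb instead of forming its inverse explicitly.
    cond_mean = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
    cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
    return cond_mean, cond_cov

# Toy example (made-up numbers): two unit-variance variables with covariance 0.8.
mu = [0.0, 0.0]
Sigma = [[1.0, 0.8], [0.8, 1.0]]
mean, cov = condition_gaussian(mu, Sigma, idx_a=[0], idx_b=[1], x_b=[1.0])
print(mean, cov)  # conditional mean is about 0.8, conditional variance about 0.36
```

Observing $X_B=1$ pulls the mean of $X_A$ toward 0.8 and shrinks its variance from 1 to roughly 0.36, which is exactly the behaviour GP regression relies on later.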

Multivariate Gaussian
$X=\begin{bmatrix} X_A \\ X_B \\ \end{bmatrix} , \mu=\begin{bmatrix} \mu_A \\ \mu_B \\ \end{bmatrix}, \Sigma=\begin{bmatrix} \Sigma_{AA} & \Sigma_{AB}\\ \Sigma_{BA} & \Sigma_{BB}\\ \end{bmatrix}, \Sigma_{ij}=E[(X_i-\mu_i)(X_j-\mu_j)]$

[Figure: multivariate Gaussian]

[Figure: matrix]

The covariance matrix encodes how strongly each pair of variables is correlated.

Some important properties

$\Sigma$ is positive semi-definite
$\Sigma_{ii}=Var(Y_i)$, so $\Sigma_{ii}\geq 0$
if $Y_i, Y_j$ are independent (e.g. $x_i$ is very different from $x_j$), then $\Sigma_{ij}=\Sigma_{ji}=0$
if $x_i \approx x_j$, then $\Sigma_{ij}=\Sigma_{ji}>0$

Example - Predict House price

[Figure: covariance matrix]

GP Regression

Gaussian regression - ordinary linear regression

Two different ways to learn the parameter $w$

$D=\{(x_1, y_1), ...., (x_n, y_n)\}$

1) MLE: $P(D;w)=\prod_{i=1}^{n} P(y_i|x_i;w)$ - the probability of the data given the parameters

2) MAP: $P(w|D)\propto P(D|w)P(w)$ - given our data, what is the most likely set of parameters?

Once Gaussian, always Gaussian
Assume Gaussian noise $\epsilon \sim N(0,\sigma^2)$, so $P(y_i|x_i;w)=N(w^Tx_i, \sigma^2)$

$P(w)=N(0, \Sigma_p)$
$\therefore$ $P(D;w)$ and $P(w|D)$ are both Gaussian
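
Under these Gaussian assumptions the two estimators have familiar closed forms: MLE is ordinary least squares, and MAP with an isotropic prior is ridge regression. The sketch below assumes $\Sigma_p=\tau^2 I$, a known noise level, and synthetic data. It stores samples as rows of `X`, which is the usual NumPy layout, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
sigma = 0.1   # noise standard deviation (assumed known)
tau = 1.0     # prior standard deviation, i.e. Sigma_p = tau^2 * I (assumption)

w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))                    # rows are samples in this sketch
y = X @ w_true + sigma * rng.normal(size=n)

# MLE under Gaussian noise reduces to ordinary least squares.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior N(0, tau^2 I) reduces to ridge regression with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_mle, w_map)
```

Both results are single point estimates of $w$. The prior only shrinks the MAP estimate slightly toward zero, and neither estimate carries any uncertainty about $w$ into the prediction.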

The usual machine learning process:
$D\rightarrow w \rightarrow y=w^Tx$ (learn the parameter $w$ from the data, then feed test inputs into the fitted model to make predictions.)
Both MLE and MAP yield a predictive model for one particular parameter value $w$.

The Bayesian view: instead of modeling the probability of $w$ (and then predicting with that single $w$), why not model the prediction at the test point directly from the start?

$P(y|x,D)$ : given a test input $x$, what is the distribution of its label $y$?
$P(y|x,D)=\int_w P(y,w|D,x)dw=\int_w P(y|w,D,x)P(w|D)dw$

Marginalize out the model: average the predictions over all possible parameters $w$, weighted by $P(w|D)$.

Given $w$, the label $y$ no longer depends on $D$, so $P(y|w,D,x)=P(y|w,x)$ and $P(y|x,D)=\int_w P(y|w,x)P(w|D)dw$

$P(y|w,x)$ : Gaussian

$P(w|D)$ : Gaussian

$P(y|x,D) \rightarrow Gaussian$ (Once gaussian always gaussian)
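
To make "averaging over all $w$" concrete, the integral can be approximated by Monte Carlo and compared with the Gaussian answer. This is only a sketch under stated assumptions: a linear model with prior $N(0, I)$, synthetic data, training inputs stored as columns of `X`, and the standard Bayesian linear regression posterior used for $P(w|D)$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, sigma = 2, 40, 0.3
X = rng.normal(size=(d, n))                      # columns are training inputs
w_true = np.array([1.0, -0.5])
y = X.T @ w_true + sigma * rng.normal(size=n)

# Gaussian posterior over w for the prior N(0, I).
A = X @ X.T / sigma**2 + np.eye(d)
post_cov = np.linalg.inv(A)
post_mean = post_cov @ X @ y / sigma**2

# Monte Carlo version of the integral: sample w ~ P(w|D), then y ~ P(y|w, x).
x_star = np.array([0.5, 2.0])
n_samples = 20000
w_samples = rng.multivariate_normal(post_mean, post_cov, size=n_samples)
y_samples = w_samples @ x_star + sigma * rng.normal(size=n_samples)
mc_mean, mc_var = y_samples.mean(), y_samples.var()

# Closed-form Gaussian predictive for comparison (variance includes observation noise).
exact_mean = x_star @ post_mean
exact_var = x_star @ post_cov @ x_star + sigma**2

print(mc_mean, exact_mean)
print(mc_var, exact_var)
```

The sampled mean and variance should agree with the closed-form values up to Monte Carlo error, which illustrates that the marginalized prediction is itself a Gaussian.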

In general, an integral of this form cannot be solved in closed form. However, under the Gaussian likelihood and Gaussian prior assumptions, $P(y_*|x_*,D)$ is still Gaussian, so we can solve the problem, because we know the form of a Gaussian well!

$P(y_*|x_*,D) \sim N(\mu_{y_*|D},\Sigma_{y_*|D})$

$Y_*|(Y_1=y_1,...,Y_n=y_n,x_1,...,x_n,x_*)\sim N(K_*^T(K+\sigma^2I)^{-1}y,K_{**}-K_*^T(K+\sigma^2I)^{-1}K_*)$

Example - Gaussian Process

Gaussian process example: where there is data, the model should be confident; where there is no data, it should fall back on prior knowledge.

Gaussian Process Math

Assuming that $w\sim N(0, \Sigma_p)$, the posterior is $P(w|D)\sim N(\sigma^{-2}A^{-1}Xy, A^{-1})$ where $A=\frac{1}{\sigma^2}XX^T+\Sigma_p^{-1}$ (the columns of $X$ are the training inputs)

Prediction $y_*$ for a test sample $x_*$:
$P(y_*|x_*,D) \sim N(\mu_{y_*|D},\Sigma_{y_*|D})$
where $\mu_{y_*|D}=\sigma^{-2}x_*^TA^{-1}Xy$ , $\Sigma_{y_*|D}=x_*^TA^{-1}x_*$ (the variance of the noise-free value $w^Tx_*$; add $\sigma^2$ for a noisy observation)

By the matrix inversion lemma,
$(I+UV)^{-1}U=U(I+VU)^{-1}$
$(I+UV)^{-1}=I-U(I+VU)^{-1}V$
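
A quick numerical check shows what these identities buy us. The sketch below first verifies both identities on small random matrices, then confirms that the weight-space prediction (written with $A^{-1}$) matches the kernel-form prediction when the kernel is the linear one, $K=X^T\Sigma_pX$, $K_*=X^T\Sigma_px_*$, $K_{**}=x_*^T\Sigma_px_*$. The data, the dimensions, and the choice of $\Sigma_p$ are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# 1) Check the two identities (U, V scaled down so that I + UV is well conditioned).
U = 0.1 * rng.normal(size=(4, 3))
V = 0.1 * rng.normal(size=(3, 4))
I4, I3 = np.eye(4), np.eye(3)
push_through_ok = np.allclose(np.linalg.inv(I4 + U @ V) @ U,
                              U @ np.linalg.inv(I3 + V @ U))
woodbury_ok = np.allclose(np.linalg.inv(I4 + U @ V),
                          I4 - U @ np.linalg.inv(I3 + V @ U) @ V)

# 2) Weight-space prediction: needs the d x d matrix A.
d, n, sigma = 3, 8, 0.5
Sigma_p = np.diag([1.0, 2.0, 0.5])        # arbitrary prior covariance (assumption)
X = rng.normal(size=(d, n))               # columns are training inputs
y = rng.normal(size=n)
x_star = rng.normal(size=d)

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)
mean_w = x_star @ np.linalg.solve(A, X @ y) / sigma**2
var_w = x_star @ np.linalg.solve(A, x_star)

# 3) Kernel-form prediction: needs only inner products weighted by Sigma_p.
K = X.T @ Sigma_p @ X
K_star = X.T @ Sigma_p @ x_star
K_ss = x_star @ Sigma_p @ x_star
G = K + sigma**2 * np.eye(n)
mean_k = K_star @ np.linalg.solve(G, y)
var_k = K_ss - K_star @ np.linalg.solve(G, K_star)

print(push_through_ok, woodbury_ok)
print(np.isclose(mean_w, mean_k), np.isclose(var_w, var_k))
```

Because the kernel form touches the inputs only through inner products, the linear kernel can be swapped for any other valid kernel, which is the step the next section takes.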

[Figure: GP math]

Kernel

When a linear model $f(x)$ cannot solve the problem in the original input space $X$, mapping $X$ into a higher-dimensional feature space $\phi(X)$ where a linear model does work can solve it.

However, defining $\phi(X)$ can itself be difficult, or computing with $\phi(X)$ can be very expensive. Kernels are applied to address this.

ex) polynomial kernel
$\phi(x)=\begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1x_2 \\ \vdots \\ x_{d-1}x_d \\ \vdots \\ x_1x_2\cdots x_d \end{pmatrix}$
$\rightarrow \phi(x)^T\phi(z)=1+x_1z_1+x_2z_2+\dots+x_1\cdots x_dz_1\cdots z_d=\prod_{k=1}^{d}(1+x_kz_k)$

This reduces the computational cost from $O(2^d)$ to $O(d)$. $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)=K_{ij}$
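
The saving is easy to verify for a small $d$. The sketch below builds the explicit $2^d$-dimensional feature vector (one product per subset of coordinates) and compares its inner product with the closed-form product. The function names `phi` and `k` and the example vectors are illustrative choices.

```python
import itertools
import numpy as np

def phi(x):
    """Explicit feature map: one product of coordinates per subset (2^d features)."""
    d = len(x)
    feats = []
    for r in range(d + 1):
        for subset in itertools.combinations(range(d), r):
            feats.append(np.prod([x[i] for i in subset]))  # empty product is 1
    return np.array(feats)

def k(x, z):
    """The same inner product computed in O(d) through the closed form."""
    return np.prod(1.0 + np.asarray(x) * np.asarray(z))

x = np.array([0.5, -1.0, 2.0, 0.3])
z = np.array([1.5, 0.2, -0.7, 1.0])

explicit = phi(x) @ phi(z)   # sums 2^4 = 16 terms
implicit = k(x, z)           # multiplies 4 factors
print(len(phi(x)), explicit, implicit)
```

For $d=4$ the explicit map already has 16 features, while the kernel evaluation needs only 4 multiplications, and the two values agree up to floating-point error.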

GPs mostly use the RBF kernel
$\Sigma_{ij}=K(x_i,x_j)=\gamma e^{\frac{-||x_i-x_j||^2}{\sigma^2}}$

$K(x_i,x_j)\rightarrow 0$ as $||x_i-x_j|| \rightarrow \infty$
$K(x_i,x_j)=\gamma$ (1 when $\gamma=1$) when $x_i=x_j$

We can decompose $\Sigma$ as $\begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix}$

The kernel matrices $K, K_*, K_{**}$ are functions of $x_1,x_2,\dots,x_n,x_*$
A kernel is a well-defined covariance function
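
As a minimal sketch, the three blocks can be built with one RBF helper. The function name `rbf_kernel`, the parameter name `length_scale` (playing the role of $\sigma$ in the kernel formula, not the noise level), and the toy inputs are assumptions for illustration. Points are stored as rows here.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0, length_scale=1.0):
    """K[i, j] = gamma * exp(-||X1[i] - X2[j]||^2 / length_scale^2); rows are points."""
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return gamma * np.exp(-sq_dists / length_scale**2)

X_train = np.array([[0.0], [1.0], [3.0]])   # n = 3 training inputs
X_test = np.array([[0.5], [10.0]])          # m = 2 test inputs

K = rbf_kernel(X_train, X_train)       # (n, n): train vs. train
K_star = rbf_kernel(X_train, X_test)   # (n, m): train vs. test
K_ss = rbf_kernel(X_test, X_test)      # (m, m): test vs. test

print(K.round(3))
print(K_star.round(3))
print(np.linalg.eigvalsh(K).min())     # non-negative for a valid covariance
```

The test point at 10.0 is far from every training input, so its column of `K_star` is essentially zero, while the point at 0.5 has sizeable covariance with its neighbours at 0.0 and 1.0.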

Therefore, a Gaussian process can be described by a mean function and a covariance kernel.

$y\sim GP(m(x), k(x,x'))$

[Figure: GP sample]

Once you've constructed these matrices of similarities ( $K, K_*, K_{**}$ ), the process of prediction is just computing the posterior distribution of $f_*$ .

$P(y_*|x_*,D) \sim N(\mu_{y_*|D},\Sigma_{y_*|D})$
Since we have a Gaussian distribution for $y_*$ at $x_*$, we can draw samples from it.
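
Drawing from any multivariate Gaussian $N(\mu,\Sigma)$ can be sketched with a Cholesky factor. For simplicity the example samples functions from a zero-mean prior with an RBF covariance ($\gamma=\sigma=1$) on a grid; the same helper would apply to a posterior mean and covariance. The helper name `sample_gaussian`, the grid, and the jitter value are assumptions.

```python
import numpy as np

def sample_gaussian(mean, cov, n_samples, rng, jitter=1e-6):
    """Draw samples from N(mean, cov) via Cholesky; jitter keeps the factorization stable."""
    L = np.linalg.cholesky(cov + jitter * np.eye(len(mean)))
    z = rng.standard_normal(size=(len(mean), n_samples))
    return (mean[:, None] + L @ z).T           # shape (n_samples, len(mean))

rng = np.random.default_rng(3)
xs = np.linspace(0.0, 5.0, 30)
cov = np.exp(-(xs[:, None] - xs[None, :]) ** 2)   # RBF covariance on the grid
mean = np.zeros_like(xs)

samples = sample_gaussian(mean, cov, n_samples=3, rng=rng)
print(samples.shape)
```

Each row of `samples` is one function evaluated on the grid. Nearby grid points are strongly correlated under the RBF covariance, so every draw looks like a smooth curve.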

Gaussian Process Summary

$y=w^Tx+\epsilon$, with noise $\epsilon\sim N(0,\sigma^2)$

Define noisy GP regression, in which every function evaluation carries noise.

$P(y|x,D)=\int_w P(y|w,D,x)P(w|D)dw$

Marginalize over $w$ to compute the distribution of $y$ given $x$.

$w\sim N(0, \Sigma_p)$

$P(w|D)\sim N(\sigma^{-2}A^{-1}Xy, A^{-1})$ where $A=\frac{1}{\sigma^2}XX^T+\Sigma_p^{-1}$

Assume a Gaussian prior. The covariance matrix is computed with a similarity kernel (RBF).

$P(D|w)=\prod_{i=1}^{n} P(y_i|x_i;w)$

The observations at all data points are assumed i.i.d., so their likelihoods are multiplied together. The product is again Gaussian.

[Figure: matrix 2]

$P(y_*|x_*,D) \sim N(\mu_{y_*|D},\Sigma_{y_*|D})$

$\mu_*=K_*^T(K+\sigma^2I)^{-1}y$

$\Sigma_*=K_{**}-K_*^T(K+\sigma^2I)^{-1}K_*$

Now, when a new data point $x_*$ arrives, we know what the covariance matrix of $y$ looks like, and we can build a predictive model from the resulting mean and variance.
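
Putting the pieces together, the predictive mean and covariance can be sketched as below. This is one common way to evaluate the formulas, using a Cholesky factorization instead of an explicit inverse; it is not necessarily how the referenced lectures implement it. The function names, the 1-D toy data, and the hyperparameter values are assumptions.

```python
import numpy as np

def rbf(x1, x2, gamma=1.0, length_scale=1.0):
    """RBF kernel for 1-D inputs: gamma * exp(-(x - x')^2 / length_scale^2)."""
    return gamma * np.exp(-((x1[:, None] - x2[None, :]) ** 2) / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_std=0.1):
    """Posterior mean and covariance of the noise-free function values at x_test."""
    K = rbf(x_train, x_train)
    K_star = rbf(x_train, x_test)
    K_ss = rbf(x_test, x_test)
    # Factor K + sigma^2 I once, then reuse it for both the mean and the covariance.
    L = np.linalg.cholesky(K + noise_std**2 * np.eye(len(x_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # (K + s^2 I)^{-1} y
    v = np.linalg.solve(L, K_star)                             # L^{-1} K_*
    mean = K_star.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

x_train = np.array([-2.0, -1.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.array([0.0, 8.0])      # one point inside the data, one far away

mean, cov = gp_predict(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(mean, std)
```

At the test point inside the data the predictive standard deviation is small, while at the far-away point the prediction reverts to the prior, with mean near 0 and standard deviation near 1.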

GPs are an elegant and powerful ML method.

We get a measure of uncertainty for predictions for free.

GPs work very well for regression problems with small training data set sizes.

Running time is $O(n^3)$ due to the matrix inversion, so GPs become slow when $n$ is large.

Reference

https://www.youtube.com/watch?v=R-NUdqxKjos&t=1s - lecture by Prof. Kilian, Cornell University

https://www.edwith.org/bayesiandeeplearning/joinLectures/14426 - lecture by Sungjoon Choi (최성준) on edwith

https://www.youtube.com/watch?v=4vGiHC35j9s - lecture by Prof. Nando de Freitas, UBC

https://www.youtube.com/watch?v=MfHKW5z-OOA - lecture by Prof. Nando de Freitas, UBC

ToBigs1617 Time-Series

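This is the record of the advanced time-series seminar held by the 16th and 17th cohorts of ToBig's (투빅스), a leading inter-university club for big data analysis and artificial intelligence.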
