Statistical Methodology W8

ese2o · June 15, 2024

Simple Regression Analysis and Correlation

Correlation Coefficient

Pearson's Correlation Coefficient

Sample coefficient of correlation

$$r=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2 \sum(y-\bar{y})^2}}=\frac{\sum x y-\frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2-\frac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2-\frac{\left(\sum y\right)^2}{n}\right)}}$$
  • $-1 \le r \le 1$
  • $r$ is symmetric in $x$ and $y$: $r_{xy} = r_{yx}$
  • Correlation does not imply causation
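As a quick sanity check, here is a minimal Python sketch of the deviation-score formula above, compared against NumPy's built-in correlation; the data values are made up for illustration:

```python
# Pearson's r from the definition vs. numpy's built-in.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r = sum((x - x̄)(y - ȳ)) / sqrt(sum((x - x̄)²) * sum((y - ȳ)²))
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(r)                        # hand-rolled estimate
print(np.corrcoef(x, y)[0, 1])  # should agree with numpy
```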

linear regression

$$y_i=\beta_0+\beta_1 x_i+\epsilon_i, \quad i=1, \ldots, n, \quad \text{where } \epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}\left(0, \sigma^2\right)$$

assumptions: $\epsilon_i \sim \mathcal{N}\left(0, \sigma^2\right)$, i.i.d.

Therefore, the $y_i$ are independent, with $y_i \sim \mathcal{N}\left(\beta_0+\beta_1 x_i, \sigma^2\right)$.
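A minimal simulation sketch of this model; the values of $\beta_0$, $\beta_1$, $\sigma$ below are arbitrary illustrative choices:

```python
# Simulate y_i = β0 + β1 x_i + ε_i with ε_i ~ N(0, σ²) i.i.d.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 1.0, 2.0, 0.5, 100  # illustrative parameters

x = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, sigma, size=n)   # i.i.d. N(0, σ²) errors
y = beta0 + beta1 * x + eps            # hence y_i ~ N(β0 + β1 x_i, σ²)
```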

prediction

We use the deterministic part of the model to predict the value of $y$:

$$\hat{y}=b_0+b_1 x$$

Least Squares

Minimize the difference between the predicted values and the observed values.

$$\min_{b_0, b_1} \sum_{i=1}^n\left(y_i-b_0-b_1 x_i\right)^2=\min_{b_0, b_1} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2$$

proof

$$\mathrm{SS}=\sum_{i=1}^n\left(y_i-b_0-b_1 x_i\right)^2$$

Step 1. Differentiate with respect to $b_0$ and set to 0

$$\frac{\partial \mathrm{SS}}{\partial b_0}=-2 \sum_{i=1}^n\left(y_i-b_0-b_1 x_i\right)=0$$

$$b_0=\frac{1}{n} \sum_{i=1}^n\left(y_i-b_1 x_i\right)=\bar{y}-b_1 \bar{x}$$

Step 2. Differentiate with respect to $b_1$ and set to 0

$$\frac{\partial \mathrm{SS}}{\partial b_1}=-2 \sum_{i=1}^n x_i\left(y_i-b_0-b_1 x_i\right)=0$$

Step 3. Substitute $b_0$ from above

$$\sum_{i=1}^n x_i\left(\left(y_i-\bar{y}\right)-b_1\left(x_i-\bar{x}\right)\right)=0$$

$$b_1=\frac{\sum_{i=1}^n x_i\left(y_i-\bar{y}\right)}{\sum_{i=1}^n x_i\left(x_i-\bar{x}\right)}$$

$$\frac{\sum_{i=1}^n x_i\left(y_i-\bar{y}\right)}{\sum_{i=1}^n x_i\left(x_i-\bar{x}\right)}=\frac{\sum_{i=1}^n\left(x_i-\bar{x}+\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^n\left(x_i-\bar{x}+\bar{x}\right)\left(x_i-\bar{x}\right)}=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)+\sum_{i=1}^n \bar{x}\left(y_i-\bar{y}\right)}{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2+\sum_{i=1}^n \bar{x}\left(x_i-\bar{x}\right)}=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}$$

using $\sum_{i=1}^n\left(x_i-\bar{x}\right)=0$ and $\sum_{i=1}^n\left(y_i-\bar{y}\right)=0$.

Result

$$\begin{aligned} b_0&=\bar{y}-b_1 \bar{x} \\ b_1&=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2} \end{aligned}$$
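These closed-form estimates can be sketched in a few lines and checked against `np.polyfit`; the data and the true coefficients below are synthetic, chosen only for illustration:

```python
# Closed-form least-squares estimates b0, b1, checked against np.polyfit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)                   # synthetic design points
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)   # true β0=1, β1=2 (illustrative)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b1, b0)
print(np.polyfit(x, y, 1))  # returns [slope, intercept]; should match b1, b0
```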

residual

The least-squares regression minimizes the sum of squared residuals $e_i = y_i - \hat{y}_i$:

$$\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2=\sum_{i=1}^n e_i^2$$

linear regression assumes

  • The model is linear

  • The error terms have constant variance

  • The error terms are independent

  • The error terms are normally distributed

Common violations of these assumptions:

  • the model is not linear

  • the errors do not have a constant variance

  • the errors are not independent

sum of squared residuals (SSE; the minimized SS from above)

$$\mathrm{SSE}=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2$$

mean square error

$$\mathrm{MSE}=\frac{\mathrm{SSE}}{n-2}$$

$$\mathbb{E}[\mathrm{MSE}]=\sigma^2$$

standard error of the estimate

$$s_e=\sqrt{\mathrm{MSE}}=\sqrt{\frac{\mathrm{SSE}}{n-2}}$$
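A short sketch computing SSE, MSE, and $s_e$ for a synthetic fit; the generating $\sigma = 0.5$ is an arbitrary illustrative choice, so $s_e$ should land near it:

```python
# Residuals, SSE, MSE = SSE/(n-2), and s_e = sqrt(MSE) on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)   # true σ = 0.5 (illustrative)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
SSE = np.sum(resid ** 2)
MSE = SSE / (len(x) - 2)   # two estimated parameters -> n - 2 degrees of freedom
s_e = np.sqrt(MSE)         # should be close to the true σ = 0.5
print(SSE, MSE, s_e)
```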

$r^2$

  • $0 \le r^2 \le 1$
  • The larger $r^2$ is, the more of the variation in $y$ the model explains.

observed data:

$$y_i=\hat{y}_i+\left(y_i-\hat{y}_i\right)$$

subtract $\bar{y}$ from both sides

$$y_i-\bar{y}=\left(\hat{y}_i-\bar{y}\right)+\left(y_i-\hat{y}_i\right)$$

three important quantities

$$\begin{aligned} \mathrm{SST}&=\sum\left(y_i-\bar{y}\right)^2 \\ \mathrm{SSR}&=\sum\left(\hat{y}_i-\bar{y}\right)^2 \\ \mathrm{SSE}&=\sum\left(y_i-\hat{y}_i\right)^2 \end{aligned}$$

SST: Total variation
SSR: Explained variation
SSE: Unexplained variation
SST = SSR + SSE

$$r^2:=\frac{\mathrm{SSR}}{\mathrm{SST}}=\frac{b_1^2 \mathrm{SS}_{xx}}{\mathrm{SS}_{yy}}$$

$$\begin{aligned} \mathrm{SST}&=\sum\left(y_i-\bar{y}\right)^2=\mathrm{SS}_{yy} \\ \mathrm{SSR}&=\sum\left(\hat{y}_i-\bar{y}\right)^2=b_1^2 \mathrm{SS}_{xx} \end{aligned}$$
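A sketch verifying the decomposition and that $r^2 = \mathrm{SSR}/\mathrm{SST}$ equals the squared Pearson correlation, again on made-up data:

```python
# Verify SST = SSR + SSE and r² = SSR/SST = (Pearson r)² numerically.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 50)   # illustrative data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
print(np.isclose(SST, SSR + SSE))                # decomposition holds
print(SSR / SST, np.corrcoef(x, y)[0, 1] ** 2)   # r² two ways, equal
```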

Here, in SS notation,

$$b_1=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}=\frac{\mathrm{SS}_{xy}}{\mathrm{SS}_{xx}}$$

$$b_1=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right) y_i}{\mathrm{SS}_{xx}}=\sum_{i=1}^n k_i y_i, \quad k_i=\frac{x_i-\bar{x}}{\mathrm{SS}_{xx}}$$
  • properties of $k_i$:

$$\sum_{i=1}^n k_i=0 \qquad \sum_{i=1}^n k_i x_i=1 \qquad \sum_{i=1}^n k_i^2=\frac{1}{\mathrm{SS}_{xx}}$$

  • expected value of $b_1$:

$$\mathbb{E}\left[b_1\right]=\mathbb{E}\left[\sum_{i=1}^n k_i y_i\right]=\sum_{i=1}^n k_i \mathbb{E}\left[y_i\right]=\sum_{i=1}^n k_i\left(\beta_0+\beta_1 x_i\right)=\beta_0 \sum_{i=1}^n k_i+\beta_1 \sum_{i=1}^n k_i x_i=\beta_1$$

  • similarly, the variance of $b_1$ can be computed as

$$\mathbb{V}\left[b_1\right]=\mathbb{V}\left[\sum_{i=1}^n k_i y_i\right]=\sum_{i=1}^n k_i^2 \mathbb{V}\left[y_i\right]=\sigma^2 \sum_{i=1}^n k_i^2=\frac{\sigma^2}{\mathrm{SS}_{xx}}$$

  • furthermore, recall that the unbiased estimator of $\sigma^2$ is

$$\mathrm{MSE}=\frac{\mathrm{SSE}}{n-2}=\frac{\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2}{n-2}=s_e^2$$

  • therefore, the unbiased estimator of $\mathbb{V}\left[b_1\right]$ is

$$s_{b_1}^2=\frac{s_e^2}{\mathrm{SS}_{xx}}=\frac{s_e^2}{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}$$
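The unbiasedness and variance of $b_1$ can be checked by Monte Carlo: over repeated samples from a fixed design, the sample mean and variance of $b_1$ should approach $\beta_1$ and $\sigma^2/\mathrm{SS}_{xx}$. A sketch under arbitrary illustrative parameters:

```python
# Monte Carlo check: E[b1] ≈ β1 (unbiasedness) and V[b1] ≈ σ²/SS_xx.
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1, sigma = 1.0, 2.0, 0.5      # illustrative true parameters
x = np.linspace(0, 10, 30)               # fixed design points
SS_xx = np.sum((x - x.mean()) ** 2)

b1s = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    b1s.append(np.sum((x - x.mean()) * (y - y.mean())) / SS_xx)
b1s = np.array(b1s)

print(b1s.mean(), beta1)                 # ≈ β1
print(b1s.var(), sigma ** 2 / SS_xx)     # ≈ σ²/SS_xx
```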

Hypothesis Test for the Slope

Step 1.

$$H_0: \beta_1=0 \qquad H_a: \beta_1 \neq 0$$

Step 2.

$$\frac{b_1-\beta_1}{s_{b_1}} \sim t_{n-2}$$

Step 3.

If the observed $|t|$ exceeds the critical value $t_{\alpha/2,\,n-2}$, reject $H_0$.
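A sketch of the full test on synthetic data, using `scipy.stats.t` for the critical value and p-value:

```python
# t-test for the slope: t = b1/s_{b1} under H0: β1 = 0, compared with t_{n-2}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)   # illustrative data

SS_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SS_xx
b0 = y.mean() - b1 * x.mean()
s_e = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
s_b1 = s_e / np.sqrt(SS_xx)               # estimated standard error of b1

t_obs = b1 / s_b1                         # test statistic under H0: β1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)     # two-sided, α = 0.05
p_val = 2 * stats.t.sf(abs(t_obs), df=n - 2)
print(t_obs, t_crit, p_val)               # reject H0 if |t_obs| > t_crit
```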

(An analogous test can be carried out for $b_0$.)

100(1-α)% Confidence Interval

estimate of the average value of $y$ at a given $x = x_0$

$$\mathbb{V}\left[\hat{y}_{x_0}\right]=\sigma^2\left(\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{\mathrm{SS}_{xx}}\right)$$

$$\frac{\hat{y}_{x_0}-\mathbb{E}\left[y_{x_0}\right]}{s_e \sqrt{\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{\mathrm{SS}_{xx}}}} \sim t_{n-2}$$

$$\hat{y}_{x_0} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{\mathrm{SS}_{xx}}}$$

100(1-α)% Prediction Interval

$$\mathrm{MSE}=s_e^2$$

$$\frac{\tilde{y}_{x_0}-\hat{y}_{x_0}}{s_e \sqrt{1+\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{\mathrm{SS}_{xx}}}} \sim t_{n-2}$$

$$\hat{y}_{x_0} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{1+\frac{1}{n}+\frac{\left(x_0-\bar{x}\right)^2}{\mathrm{SS}_{xx}}}$$
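A sketch computing both intervals at an illustrative point $x_0$ on synthetic data; `scipy.stats.t.ppf` supplies the $t$ critical value:

```python
# 100(1-α)% CI for the mean response and PI for a new observation at x0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)   # illustrative data

SS_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / SS_xx
b0 = y.mean() - b1 * x.mean()
s_e = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x0, alpha = 5.0, 0.05                        # illustrative choices
y0_hat = b0 + b1 * x0
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_half = t_crit * s_e * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / SS_xx)
pi_half = t_crit * s_e * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / SS_xx)
print((y0_hat - ci_half, y0_hat + ci_half))  # CI for E[y | x0]
print((y0_hat - pi_half, y0_hat + pi_half))  # PI: wider, from the extra "1 +"
```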

Confidence Interval & Prediction Interval

The confidence interval is narrower than the prediction interval.
Reason: the average value of y lies toward the middle of a group of y values, so there is less error in estimating a mean value than in predicting an individual value.
