[DetnEst] 8. General Bayesian Estimators


Overview

  • Previously
    • Introduced the idea of a priori information on $\theta$ $\rightarrow$ use a prior pdf $p(\theta)$
    • Defined a new optimality criterion $\rightarrow$ the Bayesian MSE (Bmse)
    • Showed that the Bmse is minimized by $E(\theta|\mathbf{x})$, the mean of the posterior pdf (conditional mean)
  • Now
    • Define a more general optimality criterion
      • leads to several different Bayesian approaches
      • includes the Bmse as a special case
    • Why?
      • Provides flexibility in balancing the model, performance, and computation

Bayesian Risk Functions

  • Previously, we used the Bmse as the Bayesian measure to minimize
    $$\text{Bmse}(\hat\theta)=E\left[(\theta-\hat\theta)^2\right] \quad\text{w.r.t. } p(\mathbf{x},\theta), \qquad \text{error } \epsilon=\theta-\hat\theta$$
  • This corresponds to the quadratic cost function $C(\epsilon)=\epsilon^2=(\theta-\hat\theta)^2$; in general, any cost function $C(\epsilon)$ of the error can be used
  • Bayes risk $\mathcal{R}$ : the average cost $\mathcal{R}=E[C(\epsilon)]$
    • Quadratic error : $C(\epsilon)=\epsilon^2$
    • Absolute error : $C(\epsilon)=|\epsilon|$
    • Hit-or-miss error : $C(\epsilon) = \begin{cases} 0, & |\epsilon| < \delta \\ 1, & |\epsilon| > \delta \end{cases}$
      $$\mathcal{R}=E[C(\epsilon)]=\int\!\!\int C(\theta-\hat\theta)\,p(\mathbf{x},\theta)\,d\mathbf{x}\,d\theta =\int\left[\int C(\theta-\hat\theta)\, p(\theta|\mathbf{x})\,d\theta\right]p(\mathbf{x})\,d\mathbf{x}$$

      $\rightarrow$ since $p(\mathbf{x})\geq 0$, minimize the inner integral for each $\mathbf{x}$


General Bayesian Estimators

  • For a given desired cost function, find the form of the optimal estimator
    1. Quadratic
      $$\mathcal{R}(\hat\theta)=\text{Bmse}(\hat\theta)=E\left[(\theta-\hat\theta)^2\right] \rightarrow\hat\theta=E(\theta|\mathbf{x})=\text{mean of }p(\theta|\mathbf{x})$$
    2. Absolute
      $$\mathcal{R}(\hat\theta)=E\left[|\theta-\hat\theta|\right]\rightarrow\hat\theta=\text{median of }p(\theta|\mathbf{x})$$
    3. Hit-or-Miss
      $$\hat\theta=\text{mode of }p(\theta|\mathbf{x})\;\text{(maximum a posteriori, MAP)}$$
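
As a quick numerical illustration of the three estimators above, here is a minimal sketch (not from the lecture) that evaluates a posterior on a grid and reads off its mean, median, and mode. It assumes the exponential-likelihood / exponential-prior model that appears in the MAP example later in these notes, with made-up values for $\theta$, $\lambda$, and $N$; the skewed posterior makes the three estimates visibly different (for a symmetric posterior they would coincide).

```python
import numpy as np

# Minimal sketch: mean / median / mode of a gridded posterior.
# Assumed model (also used in the MAP example later): p(x[n]|theta) = theta*exp(-theta*x[n]),
# prior p(theta) = lambda*exp(-lambda*theta); theta_true, lam, N are made-up values.
rng = np.random.default_rng(0)
theta_true, lam, N = 2.0, 1.0, 10
x = rng.exponential(scale=1.0 / theta_true, size=N)

theta = np.linspace(1e-3, 10.0, 4000)                         # grid over theta > 0
d = theta[1] - theta[0]
log_post = N * np.log(theta) - theta * x.sum() - lam * theta  # log p(x|theta) + log p(theta) + const
post = np.exp(log_post - log_post.max())
post /= post.sum() * d                                        # normalize p(theta|x) on the grid

mean_est = (theta * post).sum() * d                            # quadratic cost -> posterior mean (MMSE)
median_est = theta[np.searchsorted(np.cumsum(post) * d, 0.5)]  # absolute cost  -> posterior median
mode_est = theta[np.argmax(post)]                              # hit-or-miss    -> posterior mode (MAP)
print(f"mean={mean_est:.3f}  median={median_est:.3f}  mode(MAP)={mode_est:.3f}")
```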

  • Derivation for Absolute Cost Function
    $$g(\hat\theta)=\int|\theta-\hat\theta|\,p(\theta|\mathbf{x})\,d\theta=\int^{\hat\theta}_{-\infty}(\hat\theta-\theta)\,p(\theta|\mathbf{x})\,d\theta + \int_{\hat\theta}^{\infty}(\theta-\hat\theta)\,p(\theta|\mathbf{x})\,d\theta$$
  • Set $\frac{\partial g(\hat \theta)}{\partial\hat\theta}=0$ and use Leibniz's rule,
    $$\frac{\partial}{\partial x}\left(\int^{b(x)}_{a(x)}f(x, t)\,dt\right)=f(x,b(x))\,b'(x)-f(x,a(x))\,a'(x)+\int^{b(x)}_{a(x)}\frac{\partial}{\partial x}f(x, t)\,dt$$
  • 1st integral
    $$f(\hat\theta, \theta)=(\hat\theta-\theta)\,p(\theta|\mathbf{x})\rightarrow f(\hat\theta, b(\hat\theta))=0,\; a'(\hat\theta)=0$$
  • 2nd integral
    $$f(\hat\theta, \theta)=(\theta-\hat\theta)\,p(\theta|\mathbf{x})\rightarrow f(\hat\theta,a(\hat\theta))=0,\; b'(\hat\theta)=0\\[0.2cm] \rightarrow\frac{\partial g(\hat\theta)}{\partial\hat\theta}=\int^{\hat\theta}_{-\infty}p(\theta|\mathbf{x})\,d\theta-\int^\infty_{\hat\theta}p(\theta|\mathbf{x})\,d\theta=0\\[0.2cm] \rightarrow\int^{\hat\theta}_{-\infty}p(\theta|\mathbf{x})\,d\theta=\int^\infty_{\hat\theta}p(\theta|\mathbf{x})\,d\theta\\[0.2cm] \rightarrow \hat\theta \text{ is the median of the posterior PDF}\quad\left(\Pr\{\theta\leq\hat\theta\,|\,\mathbf{x}\}=\tfrac{1}{2}\right)$$

  • Derivation for Hit-or-Miss Cost Function
    $$g(\hat{\theta}) = \int_{-\infty}^{\hat{\theta} - \delta} 1 \cdot p(\theta | \mathbf{x})\, d\theta + \int_{\hat{\theta} + \delta}^{\infty} 1 \cdot p(\theta | \mathbf{x})\, d\theta = 1 - \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} p(\theta | \mathbf{x})\, d\theta$$
  • Then maximize
    $$\int^{\hat\theta+\delta}_{\hat\theta-\delta}p(\theta|\mathbf{x})\,d\theta$$
  • For arbitrarily small $\delta$, $\hat\theta$ corresponds to the location of the maximum, or mode, of the posterior PDF $p(\theta|\mathbf{x})$

    maximum a posteriori (MAP) estimator
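
To tie these derivations back to the Bayes risk directly, here is a self-contained check (my own sketch, not from the lecture): it evaluates the inner integral of $\mathcal{R}$ on a grid for each cost function, using an assumed skewed posterior $p(\theta|\mathbf{x})\propto\theta e^{-\theta}$ and an assumed $\delta$, and confirms that the minimizers are the posterior mean, median, and mode, respectively.

```python
import numpy as np

# Assumed skewed posterior p(theta|x) ∝ theta*exp(-theta) (a Gamma(2,1) shape) on a grid;
# delta = 0.05 for the hit-or-miss cost is also an assumed value.
theta = np.linspace(1e-3, 12.0, 2000)
d = theta[1] - theta[0]
post = theta * np.exp(-theta)
post /= post.sum() * d                                   # normalize on the grid

def bayes_risk(cost):
    # Inner integral of R: E[C(theta - t) | x], evaluated for every candidate estimate t on the grid.
    return np.array([(cost(theta - t) * post).sum() * d for t in theta])

t_quad = theta[np.argmin(bayes_risk(lambda e: e ** 2))]                           # -> mean   (~2.0)
t_abs = theta[np.argmin(bayes_risk(np.abs))]                                      # -> median (~1.68)
t_hit = theta[np.argmin(bayes_risk(lambda e: (np.abs(e) > 0.05).astype(float)))]  # -> mode   (~1.0)
print(f"quadratic: {t_quad:.2f}, absolute: {t_abs:.2f}, hit-or-miss: {t_hit:.2f}")
```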


Minimum Mean Square Error Estimators

  • For the scalar parameter case
    $$\hat\theta=E(\theta|\mathbf{x})=\text{mean of }p(\theta|\mathbf{x})$$
  • Vector MMSE estimator
    $\hat\theta=E(\theta|\mathbf{x})$, which minimizes the MSE
    for each component of the unknown vector parameter:
    $$E\left[(\theta_i-\hat\theta_i)^2\right]=\int(\theta_i-\hat\theta_i)^2\,p(\theta_i|\mathbf{x})\,d\theta_i\\[0.2cm] \rightarrow\hat\theta_i=\int\theta_i\,p(\theta_i|\mathbf{x})\,d\theta_i=E(\theta_i|\mathbf{x})\\[0.2cm] \hat\theta=\left[\hat\theta_1\;\hat\theta_2\;\dots\;\hat\theta_p\right]^T=\left[E(\theta_1|\mathbf{x})\;E(\theta_2|\mathbf{x})\;\dots\;E(\theta_p|\mathbf{x})\right]^T=E(\theta|\mathbf{x})$$

  • Ex) Bayesian Fourier analysis
  • Signal model : $x[n] = a\cos2\pi f_0n+b\sin2\pi f_0n+w[n],\; n=0,1,\cdots,N-1$
    where $f_0$ is a multiple of $1/N$ (excluding $0$ and $\frac{1}{2}$), and $w[n]$ is WGN with variance $\sigma^2$
    $\theta=[a\;b]^T$ with prior PDF $\theta\sim N(0,\sigma^2_\theta I)$
    $\rightarrow$ a common propagation model called Rayleigh fading
    $$\mathbf{x}=H\theta+\mathbf{w},\qquad H=\left[\begin{matrix}1&0\\\cos2\pi f_0&\sin2\pi f_0\\ \vdots&\vdots\\ \cos[2\pi f_0(N-1)]&\sin[2\pi f_0(N-1)]\end{matrix}\right]$$

$$\hat\theta=E(\theta|\mathbf{x})=\sigma^2_\theta H^T(H\sigma^2_\theta H^T+\sigma^2I)^{-1}\mathbf{x}\\[0.2cm] C_{\theta|x}=\sigma^2_\theta I-\sigma^2_\theta H^T(H\sigma^2_\theta H^T+\sigma^2I)^{-1}H\sigma^2_\theta$$

From $\mu_\theta=0,\;C_\theta=\sigma^2_\theta I,\;C_w=\sigma^2 I$ and the multivariate Gaussian results:

$$E(\mathbf{y}|\mathbf{x})=E(\mathbf{y})+C_{yx}C^{-1}_{xx}(\mathbf{x}-E(\mathbf{x})),\qquad C_{y|x}=C_{yy}-C_{yx}C^{-1}_{xx}C_{xy}$$

Alternatively, using the matrix inversion lemma,

$$\hat\theta=E(\theta|\mathbf{x})=\left(\frac{1}{\sigma^2_\theta} I+H^T\frac{1}{\sigma^2} H\right)^{-1}H^T\frac{1}{\sigma^2}\mathbf{x}\\[0.2cm] C_{\theta|x}=\left(\frac{1}{\sigma^2_\theta}I+H^T\frac{1}{\sigma^2}H\right)^{-1}$$

Since $H^TH=\frac{N}{2}I$,

$$\hat\theta=\left(\frac{1}{\sigma^2_\theta}I+\frac{N}{2\sigma^2}I\right)^{-1}H^T\frac{1}{\sigma^2}\mathbf{x}=\frac{\frac{1}{\sigma^2}}{\frac{1}{\sigma^2_\theta}+\frac{N}{2\sigma^2}}H^T\mathbf{x}\\[0.3cm] \begin{cases} \hat{a} = \frac{1}{1 + \frac{2\sigma^2 / N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n \right] \\ \hat{b} = \frac{1}{1 + \frac{2\sigma^2 / N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n \right] \end{cases}, \quad C_{\theta|\mathbf{x}} = \frac{1}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}} I\\[0.3cm] \rightarrow \begin{cases} \text{Bmse}(\hat{a}) = \frac{1}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}} \\ \text{Bmse}(\hat{b}) = \frac{1}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}} \end{cases}$$
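
The following short simulation sketch of this example (parameter values assumed, not from the lecture) builds $H$, draws $\theta$ and the data, and checks that the direct Gaussian form, the matrix-inversion-lemma form, and the closed form using $H^TH=\frac{N}{2}I$ all give the same estimate, along with the predicted Bmse.

```python
import numpy as np

# Sketch of the Bayesian Fourier analysis example; N, f0, sigma^2, sigma_theta^2 are assumed values.
rng = np.random.default_rng(1)
N, k = 64, 5
f0 = k / N                                   # a multiple of 1/N, excluding 0 and 1/2
sigma2, sigma2_theta = 1.0, 4.0

n = np.arange(N)
H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
theta = rng.normal(0.0, np.sqrt(sigma2_theta), size=2)      # theta = [a, b]^T ~ N(0, sigma_theta^2 I)
x = H @ theta + rng.normal(0.0, np.sqrt(sigma2), size=N)    # x = H theta + w

# Direct form: sigma_theta^2 H^T (sigma_theta^2 H H^T + sigma^2 I)^{-1} x
theta_hat1 = sigma2_theta * H.T @ np.linalg.solve(sigma2_theta * H @ H.T + sigma2 * np.eye(N), x)

# Matrix-inversion-lemma form: (I/sigma_theta^2 + H^T H / sigma^2)^{-1} H^T x / sigma^2
A = np.eye(2) / sigma2_theta + H.T @ H / sigma2
theta_hat2 = np.linalg.solve(A, H.T @ x / sigma2)

# Closed form using H^T H = (N/2) I
gain = (2.0 / N) / (1.0 + 2.0 * sigma2 / (N * sigma2_theta))
theta_hat3 = gain * (H.T @ x)

bmse = 1.0 / (1.0 / sigma2_theta + N / (2.0 * sigma2))      # Bmse(a_hat) = Bmse(b_hat)
print(theta, theta_hat1, theta_hat2, theta_hat3, bmse)
```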

Properties of the MMSE Estimator

  1. Commutes over affine mappings (linearity):
    $$\alpha=A\theta+b\rightarrow\hat\alpha=E(\alpha|\mathbf{x})=E(A\theta+b|\mathbf{x})=AE(\theta|\mathbf{x})+b=A\hat\theta+b$$
  2. Additive Property for independent data sets (a numerical sketch follows this list)
    $\hat\theta=E(\theta|\mathbf{x}_1,\mathbf{x}_2)$
    $\theta,\mathbf{x}_1,\mathbf{x}_2$ are jointly Gaussian and $\mathbf{x}_1,\mathbf{x}_2$ are independent
    Let $\mathbf{x}=[\mathbf{x}_1^T\;\mathbf{x}_2^T]^T$
    From the multivariate Gaussian results:
    $$\hat{\theta} = E(\theta|\mathbf{x}) = E(\theta) + C_{\theta x} C_{xx}^{-1} (\mathbf{x} - E(\mathbf{x})), \quad C_{xx}^{-1} = \begin{bmatrix} C_{x_1 x_1} & C_{x_1 x_2} \\ C_{x_2 x_1} & C_{x_2 x_2} \end{bmatrix}^{-1} = \begin{bmatrix} C_{x_1 x_1}^{-1} & 0 \\ 0 & C_{x_2 x_2}^{-1} \end{bmatrix}\\[0.2cm] \rightarrow \hat{\theta} = E(\theta) + \begin{bmatrix} C_{\theta x_1} & C_{\theta x_2} \end{bmatrix} \begin{bmatrix} C_{x_1 x_1}^{-1} & 0 \\ 0 & C_{x_2 x_2}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - E(\mathbf{x}_1) \\ \mathbf{x}_2 - E(\mathbf{x}_2) \end{bmatrix}\\[0.2cm] = E(\theta) + C_{\theta x_1} C_{x_1 x_1}^{-1} (\mathbf{x}_1 - E(\mathbf{x}_1)) + C_{\theta x_2} C_{x_2 x_2}^{-1} (\mathbf{x}_2 - E(\mathbf{x}_2))$$
  3. Jointly Gaussian case leads to a linear estimator : $\hat\theta=P\mathbf{x}+m$
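
As a small check of the additive property (item 2 above), here is a sketch using an assumed toy construction, not from the lecture: $\theta=s_1+s_2$ with independent $s_1,s_2$, and $x_1=s_1+n_1$, $x_2=s_2+n_2$, so that $x_1$ and $x_2$ are genuinely (marginally) independent while both remain correlated with $\theta$. The joint Gaussian form and the additive form give the same linear estimate.

```python
import numpy as np

# Assumed toy model: theta = s1 + s2, x1 = s1 + n1, x2 = s2 + n2 (all scalar, zero mean),
# so cov(x1, x2) = 0 and C_xx is block (here: element) diagonal.
rng = np.random.default_rng(2)
v1, v2, vn1, vn2 = 1.0, 2.0, 0.5, 0.3                # variances of s1, s2, n1, n2 (assumed)
s1, s2 = rng.normal(0, np.sqrt(v1)), rng.normal(0, np.sqrt(v2))
theta = s1 + s2
x1 = s1 + rng.normal(0, np.sqrt(vn1))
x2 = s2 + rng.normal(0, np.sqrt(vn2))

# Joint form: E(theta|x) = E(theta) + C_theta_x C_xx^{-1} (x - E(x)), with all means zero here
C_tx = np.array([v1, v2])                            # cov(theta, x1) = v1, cov(theta, x2) = v2
C_xx = np.diag([v1 + vn1, v2 + vn2])                 # off-diagonal terms vanish by independence
est_joint = C_tx @ np.linalg.solve(C_xx, np.array([x1, x2]))

# Additive form: one correction term per independent data set
est_add = (v1 / (v1 + vn1)) * x1 + (v2 / (v2 + vn2)) * x2
print(est_joint, est_add)                            # the two forms coincide
```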

Maximum a Posteriori (MAP) Estimators

  • Maximum a posteriori (MAP) estimators
    $$\hat\theta_{MAP}=\arg\max_\theta p(\theta|\mathbf{x})\\[0.2cm] \rightarrow\hat\theta_{MAP}=\arg\max_\theta p(\mathbf{x}|\theta)\,p(\theta)\\[0.2cm] \rightarrow\hat\theta_{MAP}=\arg\max_\theta\left[\ln p(\mathbf{x}|\theta)+\ln p(\theta)\right]$$
    • Note : the "hit-or-miss" cost function gave the MAP estimator $\rightarrow$ it maximizes the posterior PDF
    • Given that the MMSE estimator is "the most natural" one, why is the MAP estimator considered?
    • If $\mathbf{x}$ and $\theta$ are not jointly Gaussian, the MMSE estimate requires integration to find the conditional mean. The MAP estimator avoids this computational problem (no such integration is needed), trading the "natural criterion" (MMSE) for "computational ease" (MAP)
    • More flexibility to choose the prior PDF
  • Ex) Exponential PDF
    $$p(x[n]|\theta) = \begin{cases} \theta \exp(-\theta x[n]), & x[n] > 0 \\ 0, & x[n] < 0 \end{cases}, \quad x[n]\text{'s are conditionally IID}\\[0.2cm] p(\mathbf{x}|\theta) = \prod_{n=0}^{N-1} p(x[n]|\theta)\\[0.2cm] \text{The prior PDF: } p(\theta) = \begin{cases} \lambda \exp(-\lambda \theta), & \theta > 0 \\ 0, & \theta < 0 \end{cases}$$
    The MAP estimator is found by maximizing
    $$g(\theta) = \ln p(\mathbf{x}|\theta) + \ln p(\theta) = N \ln \theta - N \theta \bar{x} + \ln \lambda - \lambda \theta, \quad \theta > 0\\[0.2cm] \frac{dg(\theta)}{d\theta} = \frac{N}{\theta} - N \bar{x} - \lambda = 0\\[0.2cm] \rightarrow \hat{\theta}_{MAP} = \frac{1}{\bar{x} + \frac{\lambda}{N}}\\[0.2cm] \text{As } \lambda \rightarrow 0, \text{ the prior PDF becomes uniform} \rightarrow \text{the Bayesian MLE.}$$
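
A minimal numerical sketch of this result (with assumed values for the true $\theta$, $\lambda$, and $N$): draw conditionally IID exponential data and compare the MAP estimate $1/(\bar{x}+\lambda/N)$ with the classical MLE $1/\bar{x}$, which is its $\lambda\rightarrow 0$ limit.

```python
import numpy as np

# Sketch of the exponential-likelihood / exponential-prior MAP example; values are assumed.
rng = np.random.default_rng(4)
theta_true, lam, N = 2.0, 0.5, 50
x = rng.exponential(scale=1.0 / theta_true, size=N)   # p(x[n]|theta) = theta*exp(-theta*x[n])

xbar = x.mean()
theta_map = 1.0 / (xbar + lam / N)                     # MAP estimate from the derivation above
theta_mle = 1.0 / xbar                                 # classical MLE (lambda -> 0 limit)
print(f"MAP: {theta_map:.3f}   MLE: {theta_mle:.3f}")
```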

Bayesian MLE

  • As we keep getting good data, $p(\theta|\mathbf{x})$ becomes more concentrated as a function of $\theta$
    But since
    $$\hat\theta_{MAP}=\arg\max_\theta p(\theta|\mathbf{x})=\arg\max_\theta p(\mathbf{x}|\theta)\,p(\theta)$$
    $p(\mathbf{x}|\theta)$ should also become more concentrated as a function of $\theta$
    • Note that the prior PDF is nearly constant over the region where $p(\mathbf{x}|\theta)$ is non-zero
    • This approximation becomes better as $N\rightarrow\infty$ and $p(\mathbf{x}|\theta)$ gets more concentrated
      $$\rightarrow \arg \max_{\theta} p(\theta | \mathbf{x}) \approx \arg \max_{\theta} p(\mathbf{x} | \theta)\\[0.2cm] \text{MAP}\approx\text{Bayesian MLE}$$
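
A short sketch of this limiting behavior, reusing the exponential model from the MAP example with a deliberately strong (assumed) prior: as $N$ grows, the likelihood concentrates, the prior matters less, and the MAP estimate approaches the MLE.

```python
import numpy as np

# MAP -> MLE as N grows (exponential model from the previous example; lambda is an assumed,
# deliberately large value so the prior influence is visible for small N).
rng = np.random.default_rng(5)
theta_true, lam = 2.0, 5.0
for N in (5, 50, 500, 5000):
    x = rng.exponential(scale=1.0 / theta_true, size=N)
    xbar = x.mean()
    theta_map = 1.0 / (xbar + lam / N)
    theta_mle = 1.0 / xbar
    print(f"N={N:5d}  MAP={theta_map:.3f}  MLE={theta_mle:.3f}")   # the gap shrinks with N
```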

All content is based on the lecture of Prof. Eui-Seok Hwang at GIST (Detection and Estimation).
