[DetnEst] 8. General Bayesian Estimators


Overview

  • Previously
    • Introduced the idea of a priori information on $\theta$ $\rightarrow$ use a prior pdf $p(\theta)$
    • Defined a new optimality criterion $\rightarrow$ the Bayesian MSE (Bmse)
    • Showed that the Bmse is minimized by $E(\theta|\mathbf{x})$, the mean of the posterior pdf (conditional mean)
  • Now
    • Define a more general optimality criterion
      • leads to several different Bayesian approaches
      • includes the Bmse as a special case
    • Why?
      • Provides flexibility in balancing the model, performance, and computation

Bayesian Risk Functions

  • Previously, we used the Bmse as the Bayesian measure to minimize
    $$\text{Bmse}(\hat\theta)=E\left[(\theta-\hat\theta)^2\right] \quad\text{w.r.t. } p(\mathbf{x},\theta), \qquad \text{error } \epsilon=\theta-\hat\theta$$
  • This corresponds to the quadratic cost function $C(\epsilon)=\epsilon^2=(\theta-\hat\theta)^2$; in general, any cost function $C(\epsilon)$ of the error can be used
  • Bayes risk $\mathcal{R}$ : the average cost $\mathcal{R}=E[C(\epsilon)]$
    • Quadratic error : $C(\epsilon)=\epsilon^2$
    • Absolute error : $C(\epsilon)=|\epsilon|$
    • Hit-or-miss error : $C(\epsilon) = \begin{cases} 0, & |\epsilon| < \delta \\ 1, & |\epsilon| > \delta \end{cases}$
      $$\mathcal{R}=E[C(\epsilon)]=\int\!\!\int C(\theta-\hat\theta)\,p(\mathbf{x},\theta)\,d\mathbf{x}\,d\theta =\int\left[\int C(\theta-\hat\theta)\, p(\theta|\mathbf{x})\,d\theta\right]p(\mathbf{x})\,d\mathbf{x}$$

      $\rightarrow$ since $p(\mathbf{x})\geq 0$, minimize the inner integral for each $\mathbf{x}$


General Bayesian Estimators

  • For a given desired cost function, find the form of the optimal estimator
    1. Quadratic
      $$\mathcal{R}(\hat\theta)=\text{Bmse}(\hat\theta)=E\left[(\theta-\hat\theta)^2\right] \rightarrow\hat\theta=E(\theta|\mathbf{x})=\text{mean of }p(\theta|\mathbf{x})$$
    2. Absolute
      $$\mathcal{R}(\hat\theta)=E\left[|\theta-\hat\theta|\right]\rightarrow\hat\theta=\text{median of }p(\theta|\mathbf{x})$$
    3. Hit-or-Miss
      $$\hat\theta=\text{mode of }p(\theta|\mathbf{x})\;\text{(maximum a posteriori, MAP)}$$
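
As a quick numerical illustration of the three estimators above, here is a minimal sketch (not from the lecture) that evaluates a posterior on a grid and reads off its mean, median, and mode. It assumes the exponential-likelihood / exponential-prior model that appears in the MAP example later in these notes, with made-up values for $\theta$, $\lambda$, and $N$; the skewed posterior makes the three estimates visibly different (for a symmetric posterior they would coincide).

```python
import numpy as np

# Minimal sketch: mean / median / mode of a gridded posterior.
# Assumed model (also used in the MAP example later): p(x[n]|theta) = theta*exp(-theta*x[n]),
# prior p(theta) = lambda*exp(-lambda*theta); theta_true, lam, N are made-up values.
rng = np.random.default_rng(0)
theta_true, lam, N = 2.0, 1.0, 10
x = rng.exponential(scale=1.0 / theta_true, size=N)

theta = np.linspace(1e-3, 10.0, 4000)                         # grid over theta > 0
d = theta[1] - theta[0]
log_post = N * np.log(theta) - theta * x.sum() - lam * theta  # log p(x|theta) + log p(theta) + const
post = np.exp(log_post - log_post.max())
post /= post.sum() * d                                        # normalize p(theta|x) on the grid

mean_est = (theta * post).sum() * d                            # quadratic cost -> posterior mean (MMSE)
median_est = theta[np.searchsorted(np.cumsum(post) * d, 0.5)]  # absolute cost  -> posterior median
mode_est = theta[np.argmax(post)]                              # hit-or-miss    -> posterior mode (MAP)
print(f"mean={mean_est:.3f}  median={median_est:.3f}  mode(MAP)={mode_est:.3f}")
```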

  • Derivation for Absolute Cost Function
    $$g(\hat\theta)=\int|\theta-\hat\theta|\,p(\theta|\mathbf{x})\,d\theta=\int^{\hat\theta}_{-\infty}(\hat\theta-\theta)\,p(\theta|\mathbf{x})\,d\theta + \int_{\hat\theta}^{\infty}(\theta-\hat\theta)\,p(\theta|\mathbf{x})\,d\theta$$
  • Set $\frac{\partial g(\hat \theta)}{\partial\hat\theta}=0$ and use Leibniz's rule,
    $$\frac{\partial}{\partial x}\left(\int^{b(x)}_{a(x)}f(x, t)\,dt\right)=f(x,b(x))\,b'(x)-f(x,a(x))\,a'(x)+\int^{b(x)}_{a(x)}\frac{\partial}{\partial x}f(x, t)\,dt$$
  • 1st integral
    $$f(\hat\theta, \theta)=(\hat\theta-\theta)\,p(\theta|\mathbf{x})\rightarrow f(\hat\theta, b(\hat\theta))=0,\; a'(\hat\theta)=0$$
  • 2nd integral
    $$f(\hat\theta, \theta)=(\theta-\hat\theta)\,p(\theta|\mathbf{x})\rightarrow f(\hat\theta,a(\hat\theta))=0,\; b'(\hat\theta)=0\\[0.2cm] \rightarrow\frac{\partial g(\hat\theta)}{\partial\hat\theta}=\int^{\hat\theta}_{-\infty}p(\theta|\mathbf{x})\,d\theta-\int^\infty_{\hat\theta}p(\theta|\mathbf{x})\,d\theta=0\\[0.2cm] \rightarrow\int^{\hat\theta}_{-\infty}p(\theta|\mathbf{x})\,d\theta=\int^\infty_{\hat\theta}p(\theta|\mathbf{x})\,d\theta\\[0.2cm] \rightarrow \hat\theta \text{ is the median of the posterior PDF}\quad\left(\Pr\{\theta\leq\hat\theta\,|\,\mathbf{x}\}=\tfrac{1}{2}\right)$$

  • Derivation for Hit-or-Miss Cost Function
    $$g(\hat{\theta}) = \int_{-\infty}^{\hat{\theta} - \delta} 1 \cdot p(\theta | \mathbf{x})\, d\theta + \int_{\hat{\theta} + \delta}^{\infty} 1 \cdot p(\theta | \mathbf{x})\, d\theta = 1 - \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} p(\theta | \mathbf{x})\, d\theta$$
  • Then maximize
    $$\int^{\hat\theta+\delta}_{\hat\theta-\delta}p(\theta|\mathbf{x})\,d\theta$$
  • For arbitrarily small $\delta$, $\hat\theta$ corresponds to the location of the maximum, or mode, of the posterior PDF $p(\theta|\mathbf{x})$

    maximum a posteriori (MAP) estimator
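
To tie these derivations back to the Bayes risk directly, here is a self-contained check (my own sketch, not from the lecture): it evaluates the inner integral of $\mathcal{R}$ on a grid for each cost function, using an assumed skewed posterior $p(\theta|\mathbf{x})\propto\theta e^{-\theta}$ and an assumed $\delta$, and confirms that the minimizers are the posterior mean, median, and mode, respectively.

```python
import numpy as np

# Assumed skewed posterior p(theta|x) ∝ theta*exp(-theta) (a Gamma(2,1) shape) on a grid;
# delta = 0.05 for the hit-or-miss cost is also an assumed value.
theta = np.linspace(1e-3, 12.0, 2000)
d = theta[1] - theta[0]
post = theta * np.exp(-theta)
post /= post.sum() * d                                   # normalize on the grid

def bayes_risk(cost):
    # Inner integral of R: E[C(theta - t) | x], evaluated for every candidate estimate t on the grid.
    return np.array([(cost(theta - t) * post).sum() * d for t in theta])

t_quad = theta[np.argmin(bayes_risk(lambda e: e ** 2))]                           # -> mean   (~2.0)
t_abs = theta[np.argmin(bayes_risk(np.abs))]                                      # -> median (~1.68)
t_hit = theta[np.argmin(bayes_risk(lambda e: (np.abs(e) > 0.05).astype(float)))]  # -> mode   (~1.0)
print(f"quadratic: {t_quad:.2f}, absolute: {t_abs:.2f}, hit-or-miss: {t_hit:.2f}")
```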


Minimum Mean Square Error Estimators

  • For the scalar parameter case
    $$\hat\theta=E(\theta|\mathbf{x})=\text{mean of }p(\theta|\mathbf{x})$$
  • Vector MMSE estimator
    $\hat\theta=E(\theta|\mathbf{x})$, which minimizes the MSE
    for each component of the unknown vector parameter:
    $$E\left[(\theta_i-\hat\theta_i)^2\right]=\int(\theta_i-\hat\theta_i)^2\,p(\theta_i|\mathbf{x})\,d\theta_i\\[0.2cm] \rightarrow\hat\theta_i=\int\theta_i\,p(\theta_i|\mathbf{x})\,d\theta_i=E(\theta_i|\mathbf{x})\\[0.2cm] \hat\theta=\left[\hat\theta_1\;\hat\theta_2\;\dots\;\hat\theta_p\right]^T=\left[E(\theta_1|\mathbf{x})\;E(\theta_2|\mathbf{x})\;\dots\;E(\theta_p|\mathbf{x})\right]^T=E(\theta|\mathbf{x})$$

  • Ex) Bayesian Fourier analysis
  • Signal model : $x[n] = a\cos2\pi f_0n+b\sin2\pi f_0n+w[n],\; n=0,1,\cdots,N-1$
    where $f_0$ is a multiple of $1/N$ (excluding $0$ and $\frac{1}{2}$), and $w[n]$ is WGN with variance $\sigma^2$
    $\theta=[a\;b]^T$ with prior PDF $\theta\sim N(0,\sigma^2_\theta I)$
    $\rightarrow$ a common propagation model called Rayleigh fading
    $$\mathbf{x}=H\theta+\mathbf{w},\qquad H=\left[\begin{matrix}1&0\\\cos2\pi f_0&\sin2\pi f_0\\ \vdots&\vdots\\ \cos[2\pi f_0(N-1)]&\sin[2\pi f_0(N-1)]\end{matrix}\right]$$

$$\hat\theta=E(\theta|\mathbf{x})=\sigma^2_\theta H^T(H\sigma^2_\theta H^T+\sigma^2I)^{-1}\mathbf{x}\\[0.2cm] C_{\theta|x}=\sigma^2_\theta I-\sigma^2_\theta H^T(H\sigma^2_\theta H^T+\sigma^2I)^{-1}H\sigma^2_\theta$$

From $\mu_\theta=0,\;C_\theta=\sigma^2_\theta I,\;C_w=\sigma^2 I$ and the multivariate Gaussian results:

$$E(\mathbf{y}|\mathbf{x})=E(\mathbf{y})+C_{yx}C^{-1}_{xx}(\mathbf{x}-E(\mathbf{x})),\qquad C_{y|x}=C_{yy}-C_{yx}C^{-1}_{xx}C_{xy}$$

Alternatively, using the matrix inversion lemma,

$$\hat\theta=E(\theta|\mathbf{x})=\left(\frac{1}{\sigma^2_\theta} I+H^T\frac{1}{\sigma^2} H\right)^{-1}H^T\frac{1}{\sigma^2}\mathbf{x}\\[0.2cm] C_{\theta|x}=\left(\frac{1}{\sigma^2_\theta}I+H^T\frac{1}{\sigma^2}H\right)^{-1}$$

Since $H^TH=\frac{N}{2}I$,

$$\hat\theta=\left(\frac{1}{\sigma^2_\theta}I+\frac{N}{2\sigma^2}I\right)^{-1}H^T\frac{1}{\sigma^2}\mathbf{x}=\frac{\frac{1}{\sigma^2}}{\frac{1}{\sigma^2_\theta}+\frac{N}{2\sigma^2}}H^T\mathbf{x}\\[0.3cm] \begin{cases} \hat{a} = \frac{1}{1 + \frac{2\sigma^2 / N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n \right] \\ \hat{b} = \frac{1}{1 + \frac{2\sigma^2 / N}{\sigma_\theta^2}} \left[ \frac{2}{N} \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n \right] \end{cases}, \quad C_{\theta|\mathbf{x}} = \frac{1}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}} I\\[0.3cm] \rightarrow \begin{cases} \text{Bmse}(\hat{a}) = \frac{1}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}} \\ \text{Bmse}(\hat{b}) = \frac{1}{\frac{1}{\sigma_\theta^2} + \frac{N}{2\sigma^2}} \end{cases}$$
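
The following short simulation sketch of this example (parameter values assumed, not from the lecture) builds $H$, draws $\theta$ and the data, and checks that the direct Gaussian form, the matrix-inversion-lemma form, and the closed form using $H^TH=\frac{N}{2}I$ all give the same estimate, along with the predicted Bmse.

```python
import numpy as np

# Sketch of the Bayesian Fourier analysis example; N, f0, sigma^2, sigma_theta^2 are assumed values.
rng = np.random.default_rng(1)
N, k = 64, 5
f0 = k / N                                   # a multiple of 1/N, excluding 0 and 1/2
sigma2, sigma2_theta = 1.0, 4.0

n = np.arange(N)
H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
theta = rng.normal(0.0, np.sqrt(sigma2_theta), size=2)      # theta = [a, b]^T ~ N(0, sigma_theta^2 I)
x = H @ theta + rng.normal(0.0, np.sqrt(sigma2), size=N)    # x = H theta + w

# Direct form: sigma_theta^2 H^T (sigma_theta^2 H H^T + sigma^2 I)^{-1} x
theta_hat1 = sigma2_theta * H.T @ np.linalg.solve(sigma2_theta * H @ H.T + sigma2 * np.eye(N), x)

# Matrix-inversion-lemma form: (I/sigma_theta^2 + H^T H / sigma^2)^{-1} H^T x / sigma^2
A = np.eye(2) / sigma2_theta + H.T @ H / sigma2
theta_hat2 = np.linalg.solve(A, H.T @ x / sigma2)

# Closed form using H^T H = (N/2) I
gain = (2.0 / N) / (1.0 + 2.0 * sigma2 / (N * sigma2_theta))
theta_hat3 = gain * (H.T @ x)

bmse = 1.0 / (1.0 / sigma2_theta + N / (2.0 * sigma2))      # Bmse(a_hat) = Bmse(b_hat)
print(theta, theta_hat1, theta_hat2, theta_hat3, bmse)
```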

Properties of the MMSE Estimator

  1. Commutes over affine mappings (linearity):
    $$\alpha=A\theta+b\rightarrow\hat\alpha=E(\alpha|\mathbf{x})=E(A\theta+b|\mathbf{x})=AE(\theta|\mathbf{x})+b=A\hat\theta+b$$
  2. Additive Property for independent data sets (a numerical sketch follows this list)
    $\hat\theta=E(\theta|\mathbf{x}_1,\mathbf{x}_2)$
    $\theta,\mathbf{x}_1,\mathbf{x}_2$ are jointly Gaussian and $\mathbf{x}_1,\mathbf{x}_2$ are independent
    Let $\mathbf{x}=[\mathbf{x}_1^T\;\mathbf{x}_2^T]^T$
    From the multivariate Gaussian results:
    $$\hat{\theta} = E(\theta|\mathbf{x}) = E(\theta) + C_{\theta x} C_{xx}^{-1} (\mathbf{x} - E(\mathbf{x})), \quad C_{xx}^{-1} = \begin{bmatrix} C_{x_1 x_1} & C_{x_1 x_2} \\ C_{x_2 x_1} & C_{x_2 x_2} \end{bmatrix}^{-1} = \begin{bmatrix} C_{x_1 x_1}^{-1} & 0 \\ 0 & C_{x_2 x_2}^{-1} \end{bmatrix}\\[0.2cm] \rightarrow \hat{\theta} = E(\theta) + \begin{bmatrix} C_{\theta x_1} & C_{\theta x_2} \end{bmatrix} \begin{bmatrix} C_{x_1 x_1}^{-1} & 0 \\ 0 & C_{x_2 x_2}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - E(\mathbf{x}_1) \\ \mathbf{x}_2 - E(\mathbf{x}_2) \end{bmatrix}\\[0.2cm] = E(\theta) + C_{\theta x_1} C_{x_1 x_1}^{-1} (\mathbf{x}_1 - E(\mathbf{x}_1)) + C_{\theta x_2} C_{x_2 x_2}^{-1} (\mathbf{x}_2 - E(\mathbf{x}_2))$$
  3. Jointly Gaussian case leads to a linear estimator : $\hat\theta=P\mathbf{x}+m$
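
As a small check of the additive property (item 2 above), here is a sketch using an assumed toy construction, not from the lecture: $\theta=s_1+s_2$ with independent $s_1,s_2$, and $x_1=s_1+n_1$, $x_2=s_2+n_2$, so that $x_1$ and $x_2$ are genuinely (marginally) independent while both remain correlated with $\theta$. The joint Gaussian form and the additive form give the same linear estimate.

```python
import numpy as np

# Assumed toy model: theta = s1 + s2, x1 = s1 + n1, x2 = s2 + n2 (all scalar, zero mean),
# so cov(x1, x2) = 0 and C_xx is block (here: element) diagonal.
rng = np.random.default_rng(2)
v1, v2, vn1, vn2 = 1.0, 2.0, 0.5, 0.3                # variances of s1, s2, n1, n2 (assumed)
s1, s2 = rng.normal(0, np.sqrt(v1)), rng.normal(0, np.sqrt(v2))
theta = s1 + s2
x1 = s1 + rng.normal(0, np.sqrt(vn1))
x2 = s2 + rng.normal(0, np.sqrt(vn2))

# Joint form: E(theta|x) = E(theta) + C_theta_x C_xx^{-1} (x - E(x)), with all means zero here
C_tx = np.array([v1, v2])                            # cov(theta, x1) = v1, cov(theta, x2) = v2
C_xx = np.diag([v1 + vn1, v2 + vn2])                 # off-diagonal terms vanish by independence
est_joint = C_tx @ np.linalg.solve(C_xx, np.array([x1, x2]))

# Additive form: one correction term per independent data set
est_add = (v1 / (v1 + vn1)) * x1 + (v2 / (v2 + vn2)) * x2
print(est_joint, est_add)                            # the two forms coincide
```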

Maximum a Posteriori (MAP) Estimators

  • Maximum a posteriori (MAP) estimators
    $$\hat\theta_{MAP}=\arg\max_\theta p(\theta|\mathbf{x})\\[0.2cm] \rightarrow\hat\theta_{MAP}=\arg\max_\theta p(\mathbf{x}|\theta)\,p(\theta)\\[0.2cm] \rightarrow\hat\theta_{MAP}=\arg\max_\theta\left[\ln p(\mathbf{x}|\theta)+\ln p(\theta)\right]$$
    • Note : the "hit-or-miss" cost function gave the MAP estimator $\rightarrow$ it maximizes the posterior PDF
    • Given that the MMSE estimator is "the most natural" one, why is the MAP estimator considered?
    • If $\mathbf{x}$ and $\theta$ are not jointly Gaussian, the MMSE estimate requires integration to find the conditional mean. The MAP estimator avoids this computational problem (no such integration is needed), trading the "natural criterion" (MMSE) for "computational ease" (MAP)
    • More flexibility to choose the prior PDF
  • Ex) Exponential PDF
    $$p(x[n]|\theta) = \begin{cases} \theta \exp(-\theta x[n]), & x[n] > 0 \\ 0, & x[n] < 0 \end{cases}, \quad x[n]\text{'s are conditionally IID}\\[0.2cm] p(\mathbf{x}|\theta) = \prod_{n=0}^{N-1} p(x[n]|\theta)\\[0.2cm] \text{The prior PDF: } p(\theta) = \begin{cases} \lambda \exp(-\lambda \theta), & \theta > 0 \\ 0, & \theta < 0 \end{cases}$$
    The MAP estimator is found by maximizing
    $$g(\theta) = \ln p(\mathbf{x}|\theta) + \ln p(\theta) = N \ln \theta - N \theta \bar{x} + \ln \lambda - \lambda \theta, \quad \theta > 0\\[0.2cm] \frac{dg(\theta)}{d\theta} = \frac{N}{\theta} - N \bar{x} - \lambda = 0\\[0.2cm] \rightarrow \hat{\theta}_{MAP} = \frac{1}{\bar{x} + \frac{\lambda}{N}}\\[0.2cm] \text{As } \lambda \rightarrow 0, \text{ the prior PDF becomes uniform} \rightarrow \text{the Bayesian MLE.}$$
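
A minimal numerical sketch of this result (with assumed values for the true $\theta$, $\lambda$, and $N$): draw conditionally IID exponential data and compare the MAP estimate $1/(\bar{x}+\lambda/N)$ with the classical MLE $1/\bar{x}$, which is its $\lambda\rightarrow 0$ limit.

```python
import numpy as np

# Sketch of the exponential-likelihood / exponential-prior MAP example; values are assumed.
rng = np.random.default_rng(4)
theta_true, lam, N = 2.0, 0.5, 50
x = rng.exponential(scale=1.0 / theta_true, size=N)   # p(x[n]|theta) = theta*exp(-theta*x[n])

xbar = x.mean()
theta_map = 1.0 / (xbar + lam / N)                     # MAP estimate from the derivation above
theta_mle = 1.0 / xbar                                 # classical MLE (lambda -> 0 limit)
print(f"MAP: {theta_map:.3f}   MLE: {theta_mle:.3f}")
```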

Bayesian MLE

  • As we keep getting good data, $p(\theta|\mathbf{x})$ becomes more concentrated as a function of $\theta$
    But since
    $$\hat\theta_{MAP}=\arg\max_\theta p(\theta|\mathbf{x})=\arg\max_\theta p(\mathbf{x}|\theta)\,p(\theta)$$
    $p(\mathbf{x}|\theta)$ should also become more concentrated as a function of $\theta$
    • Note that the prior PDF is nearly constant over the region where $p(\mathbf{x}|\theta)$ is non-zero
    • This approximation becomes better as $N\rightarrow\infty$ and $p(\mathbf{x}|\theta)$ gets more concentrated
      $$\rightarrow \arg \max_{\theta} p(\theta | \mathbf{x}) \approx \arg \max_{\theta} p(\mathbf{x} | \theta)\\[0.2cm] \text{MAP}\approx\text{Bayesian MLE}$$
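
A short sketch of this limiting behavior, reusing the exponential model from the MAP example with a deliberately strong (assumed) prior: as $N$ grows, the likelihood concentrates, the prior matters less, and the MAP estimate approaches the MLE.

```python
import numpy as np

# MAP -> MLE as N grows (exponential model from the previous example; lambda is an assumed,
# deliberately large value so the prior influence is visible for small N).
rng = np.random.default_rng(5)
theta_true, lam = 2.0, 5.0
for N in (5, 50, 500, 5000):
    x = rng.exponential(scale=1.0 / theta_true, size=N)
    xbar = x.mean()
    theta_map = 1.0 / (xbar + lam / N)
    theta_mle = 1.0 / xbar
    print(f"N={N:5d}  MAP={theta_map:.3f}  MLE={theta_mle:.3f}")   # the gap shrinks with N
```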

All content is based on the lecture of Prof. Eui-Seok Hwang at GIST (Detection and Estimation).
