[DetnEst] 6. Maximum Likelihood Estimation

KBC · October 24, 2024

Review of Finding the MVUE

  • When the observations and the parameters are related in a linear way, \text{x}=\text{H}\theta+\text{w}
  • And the noise is Gaussian, then the MVUE is easy to find :
    θ^=(HTH)1HTx\hat\theta=(\text{H}^T\text{H})^{-1}\text{H}^T\text{x}
    with covariance matrix, Cθ^=I1(θ)=σ2(HTH)1C_{\hat\theta}=I^{-1}(\theta)=\sigma^2(\text{H}^T\text{H})^{-1}
  • Using the CRLB, if you got lucky, you could write
    lnp(x;θ)θ=I(θ)(g(x)θ)\frac{\partial \ln p(x;\theta)}{\partial \theta}=I(\theta)(g(x)-\theta)
    where θ^=g(x)\hat \theta = g(x) would be your MVUE, which would also be an efficient estimator(meeting the CRLB)
  • Even if no efficient estimator exists, an MVUE may exist

    A systematic way of determining the MVUE is introduced, if it exists


  • We wish to estimate the parameter(s) θ\theta from the observations x\text{x}
  • Step 1
    • Determine (if possible) a sufficient statistic T(\text{x}) for the parameter to be estimated, \theta
    • This may be done using the Neyman-Fisher factorization theorem
  • Step 2
    • Determine whether the sufficient statistic is also complete
    • This is generally hard to do
    • If it is not complete, we can say nothing more about the MVUE
    • If it is, continue to Step 3
  • Step 3
    • Find the MVUE \hat\theta from T(\text{x}) in one of two ways using the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem :
      • Find a function g()g(\cdot) of the sufficient statistic that yields an unbiased estimator θ^=g(T(x))\hat\theta = g(T(x)), the MVUE
      • By definition of completeness of the statistic, this will yield the MVUE
    • Find any unbiased estimator θˇ\check\theta for θ\theta
      • And then determine θ^=E[θˇT(x)]\hat\theta=E[\check\theta|T(x)]
      • This is usually very difficult to do
      • The expectation is taken over the distribution p(θˇT(x))p(\check\theta|T(x))

Motivation for MVUE

  • MVUE often does not exist or can't be found!

    Solution : If the PDF is known, then MLE can always be used!!
    (MLE is one of the most popular practical methods)

  • Advantages
    • It is a "Turn-The-Crank" method
    • Optimal for large enough data size
  • Disadvantages
    • Not optimal for small data size
    • Can be computationally complex, may require numerical methods

CRLB/RBLS Not Applicable Example

  • DC Level in WGN w/ unknown variance
    (Neither the CRLB theorem nor the RBLS theorem is applicable)
    x[n]=A+w[n],n=0,1,,N1x[n]=A+w[n],\quad n=0,1,\cdots,N-1
    AA : unknown level (A>0)(A>0), w[n]w[n] : WGN with unknown variance AA
    p(x;A)=1(2πA)N/2exp[12An=0N1(x[n]A)2]p(\text{x};A)=\frac{1}{(2\pi A)^{N/2}}\exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n]-A)^2\right]\\[0.3cm]
  • x[n]N(A,A)x[n]\sim\mathcal{N}(A,A)
  • E[x[n]]=AE[x[n]]=A
  • E[x2[n]]=Var[x[n]]+E2[x[n]]=A+A2E[x^2[n]]=\text{Var}[x[n]]+E^2[x[n]]=A+A^2

  • Firstly, check if CRLB theorem can give an efficient estimator :
    lnp(x;A)A=N2A+1An=0N1(x[n]A)+12A2n=0N1(x[n]A)2=N2A+1An=0N1x[n]N+12A2n=0N1x2[n]1An=0N1x[n]+N2=N2AN+12A2n=0N1x2[n]+N2=N2AN2+12A2n=0N1x2[n]\frac{\partial\ln p(x;A)}{\partial A}=-\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1} (x[n]-A)+\frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n]-A)^2\\[0.3cm] =-\frac{N}{2A}+\frac{1}{A}\sum_{n=0}^{N-1}x[n]-N+\frac{1}{2A^2}\sum_{n=0}^{N-1}x^2[n]-\frac{1}{A}\sum_{n=0}^{N-1}x[n]+\frac{N}{2}\\[0.3cm] =-\frac{N}{2A}-N+\frac{1}{2A^2}\sum_{n=0}^{N-1}x^2[n]+\frac{N}{2}\\[0.3cm] =-\frac{N}{2A}-\frac{N}{2}+\frac{1}{2A^2}\sum_{n=0}^{N-1}x^2[n]
    : cannot be factored as I(A)(g(\text{x})-A), so an efficient estimator cannot be found, but the CRLB can still be obtained
    \text{var}(\hat A)\geq\frac{1}{-E\left[\frac{\partial^2\ln p(\text{x};A)}{\partial A^2}\right]}=\frac{A^2}{N(A+\frac{1}{2})}\\[0.3cm] \frac{\partial^2\ln p(\text{x};A)}{\partial A^2}=\frac{N}{2A^2}-\frac{1}{A^3}\sum_{n=0}^{N-1}x^2[n]\\[0.3cm] -E\left[\frac{\partial^2\ln p(\text{x};A)}{\partial A^2}\right]=-\frac{N}{2A^2}+\frac{1}{A^3}N\cdot (A+A^2)=\frac{1}{A^2}N\left(A+\frac{1}{2}\right)

  • Next, we try to find the MVUE based on sufficient statistics
    p(x;A)=1(2πA)N/2exp[12A(n=0N1x[n]22An=0N1x[n]+NA2)]=1(2πA)N/2exp[12(1An=0N1x[n]2+NA)]exp(Nxˉ)p(\mathbf{x}; A) = \frac{1}{(2\pi A)^{N/2}} \exp \left[ - \frac{1}{2A} \left( \sum_{n=0}^{N-1} x[n]^2 - 2A \sum_{n=0}^{N-1} x[n] + NA^2 \right) \right]\\[0.3cm] = \frac{1}{(2\pi A)^{N/2}} \exp \left[ -\frac{1}{2} \left( \frac{1}{A} \sum_{n=0}^{N-1} x[n]^2 + NA \right) \right] \exp(N \bar{x})
  • T(x)=n=0N1x2[n]T(x)=\sum_{n=0}^{N-1}x^2[n] is a single sufficient statistic for AA by Neyman-Fisher factorization theorem
  • Assuming that T(x)T(x) is complete, find a function gg of T(x)T(x) that produces an unbiased estimator
    E[n=0N1x2[n]]=NE[x2[n]]=N[var(x[n])+E2(x[n])]=N(A+A2)E\left[\sum_{n=0}^{N-1}x^2[n]\right]=NE[x^2[n]]\\[0.3cm] =N[\text{var}(x[n])+E^2(x[n])]\\ =N(A+A^2)
  • It is not obvious how to choose gg
  • An alternative is to evaluate E[\check A|T(\text{x})]=E\left[x[0]\,\Big|\,\sum_{n=0}^{N-1}x^2[n]\right], which is quite challenging

Approximately Optimal Estimator

  • Cannot find an MVUE : propose an estimator that is approximately optimal
    As N,{E(A^)A(asymptotically unbiased)var(A^)CRLB(asymptotically efficient)\text{As } N \to \infty, \quad \begin{cases} E(\hat{A}) \to A \quad \text{(asymptotically unbiased)} \\ \text{var}(\hat{A}) \to \text{CRLB} \quad \text{(asymptotically efficient)} \end{cases}
  • For finite data records, we can say nothing about its optimality
  • A better estimator may exist, but finding it may not be easy

  • DC Level in WGN example, continued
  • Candidate estimator :
    A^=12+1Nn=0N1x2[n]+14\hat A = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]+\frac{1}{4}}
  • E(A^)AE(\hat A)\neq A : biased, but as NN \rightarrow \infty
    1Nn=0N1x2[n]E[x2[n]]=A+A2A^AA^=12+(A+12)2\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\rightarrow E[x^2[n]]=A+A^2 \rightarrow \hat A \rightarrow A\\[0.3cm] \hat A = -\frac{1}{2}+\sqrt{\left(A+\frac{1}{2}\right)^2}
  • A^\hat A is said to be a consistent estimator, which means that the estimator converges to the parameter in probability as NN\rightarrow\infty

    Asymptotically Unbiased

  • To find the mean and variance of \hat A as N\rightarrow \infty, statistical linearization can be applied
    Let u=1Nn=0N1x2[n]andA^=g(u)=12+u+14g(u)g(u0)+dg(u)duu=u0(uu0),where  u0=E(u)=A+A2A^=A+1/2A+1/2[1Nn=0N1x2[n](A+A2)]dg(u)duu=u0=12(u0+14)1/2=12(A+12)1\text{Let } u = \frac{1}{N} \sum_{n=0}^{N-1} x^2[n] \quad \text{and} \quad \hat{A} = g(u) = -\frac{1}{2} + \sqrt{u + \frac{1}{4}}\\[0.3cm] g(u) \approx g(u_0) + \left. \frac{dg(u)}{du} \right|_{u=u_0} (u - u_0), \quad \text{where} \; u_0 = E(u) = A + A^2\\[0.3cm] \Rightarrow \hat{A} = A + \frac{1/2}{A + 1/2} \left[ \frac{1}{N} \sum_{n=0}^{N-1} x^2[n] - (A + A^2) \right]\\[0.3cm] \left. \frac{dg(u)}{du} \right|_{u=u_0} = \frac{1}{2} \left( u_0 + \frac{1}{4} \right)^{-1/2}\\[0.3cm] =\frac{1}{2}\left(A+\frac{1}{2}\right)^{-1}
  • E(\hat A)=A as N\rightarrow \infty, so the estimator is asymptotically unbiased
  • Asymptotic variance
    var(A^)=(1/2A+1/2)2  var[1Nn=0N1x2[n]]=1/4N(A+1/2)24A2(A+12)=A2N(A+1/2)\text{var}(\hat A)=\left(\frac{1/2}{A + 1/2}\right)^2\;\text{var}\left[\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right]\\[0.3cm] =\frac{1/4}{N(A+1/2)^2}4A^2\left(A+\frac{1}{2}\right)\\[0.3cm] =\frac{A^2}{N(A+1/2)}

    The CRLB is achieved asymptotically


Rationale for MLE

  • Choose the parameter value that : makes the data you did observe the most likely data to have been observed!!!
  • Consider 2 possible parameter values : θ1\theta_1 and θ2\theta_2
    • Ask the following : If θi\theta_i were really the true value

      What is the probability that I would get the data set I really got?

    • Let this probability be PiP_i
      P1=p(x;θ1)dx  and  P2=p(x;θ2)dxP_1=p(\text{x};\theta_1)d\text{x}\;\text{and}\;P_2=p(\text{x};\theta_2)d\text{x}
    • If PiP_i is small..
      • It says you actually got a data set that is unlikely to occur!
      • Not a good guess for θi\theta_i!
    • So if P_2 > P_1, it is more likely that \theta = \theta_2 is the true value

Maximum Likelihood Estimator(MLE)

  • θ^ML\hat\theta_{ML} is the value of θ\theta that maximizes the Likelihood Function, p(x;θ)p(\text{x};\theta)

    θ^ML=argmaxθp(x;θ)=argmaxθlnp(x;θ)\hat\theta_{ML}=\arg\max_\theta p(x;\theta)\\ =\arg\max_\theta\ln p(x;\theta)
  • Note : Equivalently, maximizes the log-likelihood function ln(p(x;θ))\ln (p(x;\theta))

  • General Analytical Procedure to Find the MLE

    1. Find log-likelihood function : ln(p(x;θ))\ln (p(x;\theta))
    2. Differentiate w.r.t. θ\theta and set to 0 : ln(p(x;θ))/θ=0\partial\ln(p(x;\theta))/\partial\theta=0
    3. Solve for θ\theta value that satisfies the equation

      We are still discussing Classical estimation(θ\theta is not a random variable)

      • Bayesian MLE is θ\theta that maximizes p(xθ)p(x|\theta) with a prior probability p(θ)p(\theta)
      • While the MAP estimator maximizes p(\theta|\text{x}) (Chap. 11)

MLE Example

  • Same example : DC Level in WGN w/ unknown variance
    (Neither the CRLB theorem nor the RBLS theorem is applicable)
    p(x;A)=1(2πA)N/2exp[12An=0N1(x[n]A)2]lnp(x;A)A=N2A+1An=0N1(x[n]A)+12A2n=0N1(x[n]A)2p(\text{x};A)=\frac{1}{(2\pi A)^{N/2}}\exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(x[n]-A)^2\right]\\[0.3cm] \frac{\partial \ln p(\text{x};A)}{\partial A}=-\frac{N}{2A}+\frac{1}{A}\sum_{n=0}^{N-1}(x[n]-A)+\frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n]-A)^2
  • And A^\hat A is the value of AA that makes it 0
  • Replacing AA by A^\hat A
    N2A^+1A^n=0N1x[n]N+12A^2n=0N1x2[n]1A^n=0N1x[n]+N2=0A^2+A^1Nn=0N1x2[n]=0(A^+12)2=1Nn=0N1x2[n]+14A^=12+14+1Nn=0N1x2[n]-\frac{N}{2\hat{A}} + \frac{1}{\hat{A}} \sum_{n=0}^{N-1} x[n] - N + \frac{1}{2\hat{A}^2} \sum_{n=0}^{N-1} x^2[n] - \frac{1}{\hat{A}} \sum_{n=0}^{N-1} x[n] + \frac{N}{2} = 0\\[0.3cm] \Rightarrow \hat{A}^2 + \hat{A} - \frac{1}{N} \sum_{n=0}^{N-1} x^2[n] = 0\\[0.3cm] \Rightarrow \left( \hat{A} + \frac{1}{2} \right)^2 = \frac{1}{N} \sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}\\[0.3cm] \Rightarrow \hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{4} + \frac{1}{N} \sum_{n=0}^{N-1} x^2[n] }

    Finally, we check the second derivative to determine if it is a maximum or a minimum
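
Below is a minimal NumPy sketch of this MLE on simulated data (the values of A_true and N are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

A_true = 2.0   # true DC level (also the noise variance in this model)
N = 10000      # number of samples

# x[n] = A + w[n], with w[n] ~ N(0, A)
x = A_true + rng.normal(scale=np.sqrt(A_true), size=N)

# MLE derived above: A_hat = -1/2 + sqrt(1/4 + (1/N) * sum(x^2))
A_hat = -0.5 + np.sqrt(0.25 + np.mean(x**2))
print(A_hat)   # should be close to A_true for large N
```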

Properties of the MLE

  • The example MLE is asymptotically unbiased, efficient (i.e. achieves the CRLB), and has a Gaussian PDF
    θ^aN(θ,I1(θ))\hat \theta \sim ^a\mathcal{N}(\theta,I^{-1}(\theta))
  • If a truly efficient estimator exists, then the ML procedure finds it!
  • Use
    \frac{\partial\ln p(\text{x};\theta)}{\partial\theta}=I(\theta)(g(\text{x})-\theta)

    Setting \partial\ln p(\text{x};\theta)/\partial\theta=0 then gives \hat\theta_{ML}=g(\text{x}), the efficient estimator, so the CRLB is achieved

  • Theorem : Asymptotic properties of the MLE
    • If the PDF p(x;θ)p(\text{x};\theta) of the data x\text{x} satisfies some regularity conditions, then the MLE of the unknown parameter θ\theta is asymptotically distributed (for large data records) according to
      θ^MLaN(θ,I1(θ))Regularity : E[lnp(x;θ)θ]=0\hat \theta_{ML}\sim^a\mathcal{N}(\theta, I^{-1}(\theta))\\[0.2cm] \text{Regularity : }E\left[\frac{\partial\ln p(x;\theta)}{\partial\theta}\right]=0
      where I(θ)I(\theta) is the Fisher information evaluated at the true value of θ\theta, the unknown parameter
    • Regularity condition : existence of the derivatives of the log-likelihood function and nonzero Fisher information
    • Related to Central limit theorem

Monte Carlo Simulations(Appendix 7A)

  • A methodology for doing computer simulations to evaluate the performance of any estimation method, illustrated here for a deterministic signal s[n;\theta] in AWGN
  • Monte Carlo Simulation :
    1. Select a particular true parameter value, θtrue\theta_{true}
      Often interested in doing this for a variety of values of θ\theta, so you would run one MC simulation for each θ\theta value of interest
    2. Generate signal having true θ:s[n;θt]\theta:s[n;\theta_t]
    3. Generate WGN having unit variance, w=randn(size(s))w=\text{randn}(\text{size}(s))
    4. Form measured data : x=s+sigmawx=s+\text{sigma}*w
      • Choose σ\sigma to get the desired SNR
      • Usually want to run at many SNR values → do one MC simulation for each SNR value
    5. Compute estimate from data xx
    6. Repeat steps 3-5 MM times
    7. Store all MM estimates in a vector EST (assumes scalar θ\theta)
  • Statistical Evaluation :
    1. Compute bias
    2. Compute error RMS
    3. Compute the error Variance
    4. Plot Histogram or Scatter Plot (if desired)
  • Explore (via plots) how : Bias, RMS, and VAR vary with : θ\theta value, SNR value, NN value, etc.
  • DC Level in WGN(Asymptotic properties of the MLE)
    A=1,\quad E[\hat A]\rightarrow A=1,\quad \text{var}(\hat A)\rightarrow \text{CRLB}=\frac{A^2}{N(A+\frac{1}{2})}
    If N increases by a factor of 4, the standard deviation halves and the variance drops to 1/4 (both scale as 1/N); a Monte Carlo sketch for this example is given below
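
A possible NumPy implementation of the Monte Carlo procedure above for the DC-level-in-WGN example (the choices of A_true, N, and M are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A_true, N, M = 1.0, 100, 10000   # true parameter, record length, number of MC trials

est = np.empty(M)
for m in range(M):
    x = A_true + rng.normal(scale=np.sqrt(A_true), size=N)   # data for this trial
    est[m] = -0.5 + np.sqrt(0.25 + np.mean(x**2))            # MLE of A

bias = est.mean() - A_true
var  = est.var()
rmse = np.sqrt(np.mean((est - A_true)**2))
crlb = A_true**2 / (N * (A_true + 0.5))
print(bias, var, rmse, crlb)   # the variance should approach the CRLB as N grows
```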

Example : Phase Estimation in WGN

  • MLE of the sinusoidal phase in WGN
    x[n]=A\cos(2\pi f_0n+\phi)+w[n],\;n=0,1,\cdots,N-1\\[0.2cm] A,f_0,\sigma^2:\text{known},\;w[n]:\text{WGN}
  • All methods for finding the MVUE will fail! → So... Try MLE
  • Jointly sufficient statistics
    T1(x)=n=0N1x[n]cos2πf0n,  T2(x)=n=0N1x[n]sin2πf0nT_1(\text{x})=\sum_{n=0}^{N-1}x[n]\cos2\pi f_0n,\;T_2(\text{x})=\sum_{n=0}^{N-1}x[n]\sin 2\pi f_0n
  • MLE : ϕ\phi that maximizes
    p(x;ϕ)=1(2πσ2)N/2exp(12σ2n=0N1[x[n]Acos(2πf0n+ϕ)]2)p(\mathbf{x}; \phi) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} \left[ x[n] - A \cos(2\pi f_0 n + \phi) \right]^2 \right)
  • or minimize
    J(ϕ)=n=0N1[x[n]Acos(2πf0n+ϕ)]2J(\phi) = \sum_{n=0}^{N-1} \left[ x[n] - A \cos(2\pi f_0 n + \phi) \right]^2
  • Then
    J(ϕ)ϕ=2n=0N1[x[n]Acos(2πf0n+ϕ)]Asin(2πf0n+ϕ)=0n=0N1x[n]sin(2πf0n+ϕ^)=An=0N1sin(2πf0n+ϕ^)cos(2πf0n+ϕ^)\frac{\partial J(\phi)}{\partial \phi} = 2 \sum_{n=0}^{N-1} \left[ x[n] - A \cos(2\pi f_0 n + \phi) \right] A \sin(2\pi f_0 n + \phi) = 0 \\[0.3cm] \Rightarrow \sum_{n=0}^{N-1} x[n] \sin(2\pi f_0 n + \hat{\phi}) = A \sum_{n=0}^{N-1} \sin(2\pi f_0 n + \hat{\phi}) \cos(2\pi f_0 n + \hat{\phi})
  • Where for f0f_0 not near 00 or 1/21/2
    1Nn=0N1sin(2πf0n+ϕ^)cos(2πf0n+ϕ^)=12Nn=0N1sin(4πf0n+2ϕ^)0(n=0N1x[n]sin2πf0n)cosϕ^=(n=0N1x[n]cos2πf0n)sinϕ^\frac{1}{N} \sum_{n=0}^{N-1} \sin(2\pi f_0 n + \hat{\phi}) \cos(2\pi f_0 n + \hat{\phi}) = \frac{1}{2N} \sum_{n=0}^{N-1} \sin(4\pi f_0 n + 2\hat{\phi}) \approx 0\\[0.3cm] \Rightarrow \left( \sum_{n=0}^{N-1} x[n] \sin 2\pi f_0 n \right) \cos \hat{\phi} = -\left( \sum_{n=0}^{N-1} x[n] \cos 2\pi f_0 n \right) \sin \hat{\phi}
  • Therefore
    \hat\phi=-\arctan\frac{\sum_{n=0}^{N-1}x[n]\sin2\pi f_0n}{\sum_{n=0}^{N-1}x[n]\cos2\pi f_0n}=-\arctan\frac{T_2(\text{x})}{T_1(\text{x})}
  • MLE is function of the sufficient statistics since,
    p(x;θ)=g(T1(x),T2(x),,Tr(x),θ)h(x)p(\text{x};\theta)=g(T_1(\text{x}),T_2(\text{x}),\cdots,T_r(\text{x}),\theta)h(\text{x})
  • Asymptotic PDF of the estimator
    ϕ^N(ϕ,I1(ϕ))\hat \phi\sim\mathcal{N}\left(\phi,I^{-1}(\phi)\right)
    where I(ϕ)=NA22σ2I(\phi)=\frac{NA^2}{2\sigma^2}
  • Variance
    var(ϕ^)=1Nη,where η=A2/2σ2 is the SNR10log10var(ϕ^)=10log101Nη=10log10N10log10η\text{var}(\hat{\phi}) = \frac{1}{N \eta}, \quad \text{where } \eta = \frac{A^2 / 2}{\sigma^2} \text{ is the SNR}\\[0.3cm] 10 \log_{10} \text{var}(\hat{\phi}) = 10 \log_{10} \frac{1}{N \eta} = -10 \log_{10} N - 10 \log_{10} \eta
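
A small NumPy sketch of the phase MLE, assuming illustrative values for A, f0, sigma, N, and the true phase:

```python
import numpy as np

rng = np.random.default_rng(2)
A, f0, sigma, N, phi_true = 1.0, 0.08, 0.5, 200, 0.7

n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + phi_true) + sigma * rng.normal(size=N)

# Jointly sufficient statistics
T1 = np.sum(x * np.cos(2 * np.pi * f0 * n))
T2 = np.sum(x * np.sin(2 * np.pi * f0 * n))

# MLE: phi_hat = -arctan(T2 / T1); arctan2 is the quadrant-aware equivalent
phi_hat = -np.arctan2(T2, T1)
print(phi_hat, phi_true)
```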

MLE for Transformed Parameters

  • Given PDF p(x;θ)p(\text{x};\theta) but want an estimate of α=g(θ)\alpha = g(\theta)

  • What is the MLE for α\alpha?

    • Two cases :

      • α=g(θ)\alpha=g(\theta) is a one-to-one function

        α^ML maximizes p(x;g1(α))\hat \alpha_{ML} \text{ maximizes }p(\text{x};g^{-1}(\alpha))

      • α=g(θ)\alpha = g(\theta) is not a one-to-one function

        Need to define modified likelihood function

        pˉT(x;α)=max{θ;α=g(θ)}p(x;θ)α^ML maximizes pˉT(x;α)\bar p_T(\text{x};\alpha)=\max_{\{\theta;\alpha=g(\theta)\}}p(\text{x};\theta)\\[0.3cm] \hat \alpha_{ML} \text{ maximizes }\bar p_T(\text{x};\alpha)

Example 1

  • Example) Transformed DC Level in WGN
    x[n]=A+w[n],  n=0,1,,N1p(x;A)=1(2πσ2)N/2exp[12σ2n=0N1(x[n]A)2],  <A<x[n]=A+w[n],\;n=0,1,\cdots,N-1\\[0.3cm] p(\text{x};A)=\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right],\;-\infty<A<\infty
  • Wish to find the MLE of α=exp(A)\alpha = \exp(A)
    pT(x;α)=1(2πσ2)N/2exp[12σ2n=0N1(x[n]lnα)2],  α>0p_T(\text{x};\alpha)=\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-\ln \alpha)^2\right],\;\alpha>0
  • Maximizing pT(x;α)p_T(\text{x};\alpha) results in
    n=0N1(x[n]lnα^)1α^=0α^α^=exp(xˉ)=exp(A^)\sum_{n=0}^{N-1}(x[n]-\ln\hat\alpha)\frac{1}{\hat\alpha}=0\rightarrow\hat\alpha\\[0.3cm] \hat\alpha=\exp(\bar x)=\exp(\hat A)
  • The MLE of the transformed parameter is the transform of the MLE of the original parameter

    Invariance property
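
A quick numerical illustration of the invariance property for \alpha=\exp(A), assuming NumPy and illustrative parameter values; maximizing p_T(x;\alpha) over a grid and transforming the MLE agree up to grid resolution:

```python
import numpy as np

rng = np.random.default_rng(3)
A_true, sigma, N = 0.5, 1.0, 1000

x = A_true + sigma * rng.normal(size=N)   # DC level in WGN, known variance

A_hat = x.mean()                 # MLE of A
alpha_hat = np.exp(A_hat)        # invariance: MLE of alpha = exp(A) is exp(A_hat)

# Direct grid maximization of p_T(x; alpha) for comparison
alphas = np.linspace(1.0, 2.5, 1501)
loglik = -np.sum((x[None, :] - np.log(alphas)[:, None])**2, axis=1) / (2 * sigma**2)
print(alpha_hat, alphas[np.argmax(loglik)])   # both should be close to exp(A_true)
```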

Example 2

  • Transformed DC Level in WGN, α=A2A=±α\alpha=A^2\rightarrow A=\pm\sqrt \alpha : not one-to-one
    pT1(x;α)=1(2πσ2)N/2exp[12σ2n=0N1(x[n]α)2],α0pT2(x;α)=1(2πσ2)N/2exp[12σ2n=0N1(x[n]+α)2],α>0p_{T_1}(\mathbf{x}; \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp \left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} \left( x[n] - \sqrt{\alpha} \right)^2 \right], \quad \alpha \geq 0\\[0.3cm] p_{T_2}(\mathbf{x}; \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp \left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} \left( x[n] + \sqrt{\alpha} \right)^2 \right], \quad \alpha > 0
  • The MLE of α\alpha is the value that maximizes the maximum between pT1(x;α)p_{T_1}(\text{x};\alpha) and pT2(x;α)p_{T_2}(\text{x};\alpha)
    \hat \alpha=\arg\max_{\alpha\geq0}\{p_{T_1}(\text{x};\alpha),p_{T_2}(\text{x};\alpha)\}\\[0.2cm] \rightarrow\hat\alpha=\arg\max_{\alpha\geq0}\{p(\text{x};\sqrt\alpha),p(\text{x};-\sqrt\alpha)\}\\[0.2cm] =\left[\arg\max_{\sqrt\alpha\geq0}\{p(\text{x};\sqrt\alpha),p(\text{x};-\sqrt\alpha)\}\right]^2\\[0.2cm] =\left[\arg\max_A p(\text{x};A)\right]^2=\hat A^2=\bar x^2

Invariance Property of MLE

  • Theorem : Invariance property of the MLE
    • If parameter θ\theta is mapped according to α=g(θ)\alpha =g(\theta) then the MLE of α\alpha is given by
      α^=g(θ^)\hat \alpha =g(\hat\theta)
      where θ^\hat\theta is the MLE of θ\theta, found by maximizing p(x;θ)p(\text{x};\theta)

      If g(θ)g(\theta) is not a one-to-one function, then α^\hat\alpha maximizes the modified likelihood function pˉT(x;α)\bar p_T(\text{x};\alpha), defined as

      pˉT(x;α)=max{θ:α=g(θ)}p(x;θ)\bar p_T(\text{x};\alpha)=\max_{\{\theta:\alpha=g(\theta)\}}p(\text{x};\theta)

Numerical Determination of the MLE

  • In all previous examples we ended up with a closed-form expression for the MLE
    θ^ML=f(x)\hat\theta_{ML}=f(\text{x})
  • Numerical determination of the MLE
    A distinct advantage of the MLE is that we can find it for a given data set numerically
  • For a finite interval - grid search
  • But... for parameters that span an infinite interval
    1. Newton-Raphson method
    2. Scoring approach
    3. Expectation-Maximization(EM) algorithm

      Iterative methods ▷ will result in local maximum ▷ good initial guess is important

  • For parameters that span an infinite interval - iterative methods

    • Step 1 : Pick some initial estimate θ^0\hat\theta_0
    • Step 2 : Iteratively improve it using
      θ^k+1=f(θ^k,x) such thatlimkp(x;θk)=maxθp(x;θ)\hat\theta_{k+1}=f(\hat\theta_k,\text{x})\text{ such that}\\[0.2cm] \lim_{k\rightarrow\infty} p(\text{x};\theta_k)=\max_\theta p(\text{x};\theta)
  • Convergence Issues : May not converge, or may converge, but to local maximum

    Good initial guess is needed
    (Use rough grid search to initialize or multiple initialization)
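
A sketch of the grid-search approach for a parameter restricted to a finite interval, reusing the DC-level-in-WGN (noise variance A) model; the interval and grid size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
A_true, N = 2.0, 500
x = A_true + rng.normal(scale=np.sqrt(A_true), size=N)

# Evaluate the log-likelihood on a grid over a finite interval and pick the maximizer
A_grid = np.linspace(0.1, 10.0, 5000)
loglik = -(N / 2) * np.log(2 * np.pi * A_grid) \
         - np.sum((x[None, :] - A_grid[:, None])**2, axis=1) / (2 * A_grid)
A_hat = A_grid[np.argmax(loglik)]
print(A_hat, A_true)
```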

Exponential in WGN Example

  • Estimate rr
    x[n]=rn+w[n],  n=0,1,,N1E[x[n]]=rnp(x;r)=1(2πσ2)N/2exp[12σ2n=0N1(x[n]rn)2]x[n]=r^n+w[n],\;n=0,1,\cdots,N-1\\[0.2cm] E[x[n]]=r^n\\[0.2cm] \rightarrow p(\text{x};r)=\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-r^n)^2\right]
  • To find MLE
    \frac{\partial \ln p(\text{x};r)}{\partial r}=0\\[0.3cm] \rightarrow \sum_{n=0}^{N-1}(x[n]-r^n)nr^{n-1}=0:\text{can't be solved directly}
  • Iterative method : Newton-Raphson method
    g(θ)=lnp(x;θ)θ=0g(θ)g(θ0)+dg(θ)dθθ=θ0(θθ0)θ1=θ0g(θ0)dg(θ)dθθ=θ0θk+1=θkg(θk)dg(θ)dθθ=θkθk+1=θk[2lnp(x;θ)θ2]1lnp(x;θ)θθ=θkg(\theta) = \frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} = 0 \Rightarrow g(\theta) \approx g(\theta_0) + \left. \frac{dg(\theta)}{d\theta} \right|_{\theta = \theta_0} (\theta - \theta_0)\\[0.3cm] \theta_1 = \theta_0 - \frac{g(\theta_0)}{\left. \frac{dg(\theta)}{d\theta} \right|_{\theta = \theta_0}} \Rightarrow \theta_{k+1} = \theta_k - \frac{g(\theta_k)}{\left. \frac{dg(\theta)}{d\theta} \right|_{\theta = \theta_k}}\\[0.3cm] \Rightarrow \theta_{k+1} = \theta_k - \left[ \frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2} \right]^{-1} \left. \frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} \right|_{\theta = \theta_k}
  • The iteration may not converge
  • Even if the iteration converges, the point found may not be the global maximum but possibly only a local maximum or even a local minimum

    Good initial guess is important

    \frac{\partial \ln p(\mathbf{x}; r)}{\partial r} = \frac{1}{\sigma^2} \sum_{n=0}^{N-1} \left( x[n] - r^n \right) n r^{n-1}\\[0.3cm] \frac{\partial^2 \ln p(\mathbf{x}; r)}{\partial r^2} = \frac{1}{\sigma^2} \left[ \sum_{n=0}^{N-1} n(n-1) x[n] r^{n-2} - \sum_{n=0}^{N-1} n(2n-1) r^{2n-2} \right]\\[0.3cm] = \frac{1}{\sigma^2} \sum_{n=0}^{N-1} n r^{n-2} \left[ (n-1) x[n] - (2n-1) r^n \right]\\[0.3cm] \left(I(r)=-E\left[\frac{\partial^2\ln p(\mathbf{x};r)}{\partial r^2}\right]\text{ is used by the scoring method below}\right)\\[0.3cm] \Rightarrow r_{k+1} = r_k - \frac{\sum_{n=0}^{N-1} \left( x[n] - r_k^n \right) n r_k^{n-1}}{\sum_{n=0}^{N-1} n r_k^{n-2} \left[ (n-1) x[n] - (2n-1) r_k^n \right]}
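
A NumPy sketch of the Newton-Raphson iteration for this example (the true r, sigma, N, and the initial guess are illustrative; as noted above, a poor initial guess may prevent convergence):

```python
import numpy as np

rng = np.random.default_rng(4)
r_true, sigma, N = 0.9, 0.1, 50

n = np.arange(N)
x = r_true**n + sigma * rng.normal(size=N)

def dlnp(r):    # first derivative of the log-likelihood (the 1/sigma^2 factor cancels)
    return np.sum((x - r**n) * n * r**(n - 1))

def d2lnp(r):   # second derivative (same constant omitted)
    return np.sum(n * r**(n - 2) * ((n - 1) * x - (2 * n - 1) * r**n))

r = 0.8                          # initial guess (e.g., from a rough grid search)
for _ in range(20):
    r = r - dlnp(r) / d2lnp(r)   # Newton-Raphson update
print(r, r_true)
```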

  • Iterative method : the method of scoring
  • Motivation :
    2lnp(x;θ)θ2θ=θkI(θk)\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2} \Bigg|_{\theta = \theta_k} \approx -I(\theta_k)
  • For IID samples, by the law of large numbers
    2lnp(x;θ)θ2=n=0N12lnp(x[n];θ)θ2=N(1Nn=0N12lnp(x[n];θ)θ2)NE[2lnp(x[n];θ)θ2]=Ni(θ)=I(θ)\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2} = \sum_{n=0}^{N-1} \frac{\partial^2 \ln p(x[n]; \theta)}{\partial \theta^2} = N \left( \frac{1}{N} \sum_{n=0}^{N-1} \frac{\partial^2 \ln p(x[n]; \theta)}{\partial \theta^2} \right)\\[0.3cm] \approx N E\left[ \frac{\partial^2 \ln p(x[n]; \theta)}{\partial \theta^2} \right] = -N i(\theta) = -I(\theta)
  • The method of scoring
    θk+1=θk+I1(θ)lnp(x;θ)θθ=θk\theta_{k+1} = \theta_k + I^{-1}(\theta) \left. \frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} \right|_{\theta = \theta_k}

    The expectation provides more stable iteration

    I(r)=1σ2n=0N1n2r2n2rk+1=rk+n=0N1(x[n]rkn)nrkn1n=0N1n2rk2n2I(r) = \frac{1}{\sigma^2} \sum_{n=0}^{N-1} n^2 r^{2n-2}\\[0.3cm] \Rightarrow r_{k+1} = r_k + \frac{\sum_{n=0}^{N-1} \left( x[n] - r_k^n \right) n r_k^{n-1}}{\sum_{n=0}^{N-1} n^2 r_k^{2n-2}}
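
The same example with the scoring update, where the observed second derivative is replaced by -I(r) (same illustrative setup as the Newton-Raphson sketch above):

```python
import numpy as np

rng = np.random.default_rng(4)
r_true, sigma, N = 0.9, 0.1, 50
n = np.arange(N)
x = r_true**n + sigma * rng.normal(size=N)

r = 0.8                                          # initial guess
for _ in range(50):
    grad = np.sum((x - r**n) * n * r**(n - 1))   # d ln p / dr   (times sigma^2)
    fisher = np.sum(n**2 * r**(2 * n - 2))       # I(r) * sigma^2; the sigma^2 factors cancel
    r = r + grad / fisher                        # scoring update
print(r, r_true)
```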

MLE for a Vector Parameter

The vector parameter is: θ=[θ1θ2θp]T\text{The vector parameter is: } \mathbf{\theta} = [\theta_1 \, \theta_2 \, \dots \, \theta_p]^T
  • \hat\theta_{ML} is the vector that satisfies
    \frac{\partial \ln p(\text{x};\theta)}{\partial\theta}=\mathbf{0}
  • Example : DC Level in AWGN with unknown noise variance
    x[n]=A+w[n],\;n=0,1,\cdots,N-1\\[0.2cm] \theta=[A\;\;\sigma^2]^T\\[0.2cm] \frac{\partial \ln p(\mathbf{x}; \theta)}{\partial A} = \frac{1}{\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)\\[0.2cm] \frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=0}^{N-1} (x[n] - A)^2\\[0.2cm] \rightarrow \hat A=\bar x,\quad \hat{\sigma}^2 = \frac{1}{N} \sum_{n=0}^{N-1} (x[n] - \bar{x})^2
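
A minimal NumPy sketch of this vector MLE (A_true, sigma2_true, and N are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
A_true, sigma2_true, N = 1.0, 2.0, 1000

x = A_true + np.sqrt(sigma2_true) * rng.normal(size=N)

A_hat = x.mean()                       # MLE of A
sigma2_hat = np.mean((x - A_hat)**2)   # MLE of sigma^2 (biased by (N-1)/N for finite N)
print(A_hat, sigma2_hat)
```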

Asymptotic Properties of Vector MLE

  • Theorem : Asymptotic properties of the MLE (vector parameter)
    • If the PDF p(x;θ)p(\text{x};\theta) of the data x\text{x} satisfies some regularity conditions,
    • Then the MLE of the unknown parameter θ\theta is asymptotically distributed(for large data records) according to
      θ^aN(θ,I1(θ))\hat\theta\sim^a\mathcal{N}(\theta,I^{-1}(\theta))
      where I(θ)I(\theta) is the Fisher information matrix evaluated at the true value of the unknown parameter
  • So the vector MLE is asymptotically unbiased and efficient
  • Invariance Property Holds for Vector Case
    α=g(θ), then α^ML=g(θ^ML)\alpha=g(\theta),\text{ then }\hat\alpha_{ML}=g(\hat\theta_{ML})
  • Example : DC Level in AWGN with unknown noise variance
    xˉN(A,σ2N)T=n=0N1(x[n]xˉ)2σ2χN12\bar x \sim\mathcal{N}(A,\frac{\sigma^2}{N})\\[0.3cm] T=\sum_{n=0}^{N-1}\frac{(x[n]-\bar x)^2}{\sigma^2} \sim\chi^2_{N-1}
    and mutually independent
    • For large NN, by central limit theorem
      \chi^2_N \sim^a \mathcal{N}(N, 2N) \quad \Rightarrow \quad T \sim^a \mathcal{N}(N-1, 2(N-1))
      or
      E(\hat{\theta}) = \begin{bmatrix} A \\ \frac{(N-1)\sigma^2}{N} \end{bmatrix} \rightarrow \begin{bmatrix} A \\ \sigma^2 \end{bmatrix} = \theta, \\[0.3cm] C_{\hat\theta} = \begin{bmatrix} \frac{\sigma^2}{N} & 0 \\ 0 & \frac{2(N-1)\sigma^4}{N^2} \end{bmatrix} \rightarrow \begin{bmatrix} \frac{\sigma^2}{N} & 0 \\ 0 & \frac{2\sigma^4}{N} \end{bmatrix} = I^{-1}(\theta)

Expectation-Maximization Algorithm

  1. Newton-Raphson method
  2. Scoring approach
  3. Expectation-Maximization(EM) algorithm
    1. Increase the likelihood at each step
    2. Guaranteed to converge under mild conditions(to at least a local maximum)
    3. Uses complete and incomplete data concepts
    4. Good for complicated multi-parameter cases
  • Example : Multiple frequencies in WGN
    x[n]=i=1pcos2πfin+w[n],  n=0,1,,N1x[n]=\sum_{i=1}^p \cos2\pi f_in+w[n],\;n=0,1,\cdots,N-1
    • w[n]w[n] : WGN
    • Variance σ2\sigma^2
    • f=[f1f2fp]T\mathbf{f} = \begin{bmatrix} f_1 & f_2 & \dots & f_p \end{bmatrix}^T : parameter to estimate
    • Objective function to minimize :
      J(f)=n=0N1(x[n]i=1pcos(2πfin))2J(\mathbf{f}) = \sum_{n=0}^{N-1} \left( x[n] - \sum_{i=1}^{p} \cos(2\pi f_i n) \right)^2
    • If the original data could be replaced by the independent data sets
      yi[n]=cos2πfin+wi[n],  n=0,1,,N1,  i=1,2,,py_i[n]=\cos2\pi f_in+w_i[n],\;n=0,1,\cdots,N-1,\;i=1,2,\cdots,p
      where wi[n]w_i[n] is WGN with variance σi2\sigma_i^2, then the problem would be decoupled
    • Decoupled objective function to minimize :
      J(fi)=n=0N1(yi[n]cos2πfin)2,i=1,2,,pJ(f_i) = \sum_{n=0}^{N-1} \left( y_i[n] - \cos 2\pi f_i n \right)^2, \quad i = 1, 2, \dots, p
      p-D minimization reduced to p separate 1-D minimizations
    • The new data set {y1[n],y2[n],,yp[n]}\{y_1[n],y_2[n],\cdots,y_p[n]\} : complete data
    • x[n]x[n] : original or incomplete data
      x[n]=i=1pyi[n],w[n]=i=1pwi[n],σ2=i=1pσi2x[n] = \sum_{i=1}^{p} y_i[n], \quad w[n] = \sum_{i=1}^{p} w_i[n], \quad \sigma^2 = \sum_{i=1}^{p} \sigma_i^2
    • The decomposition is not unique :
      x[n]=y1[n]+y2[n],  y1[n]=i=1pcos2πfin,  y2[n]=w[n]x[n]=y_1[n]+y_2[n],\;y_1[n]=\sum_{i=1}^p\cos2\pi f_in,\;y_2[n]=w[n]

Expectation-Maximization Algorithm - General

  • Suppose that there is a complete to incomplete data transformation
    x=g(y1,y2,,yM)=g(y)\text{x}=g(y_1,y_2,\cdots,y_M)=g(y), where gg is a many-to-one transformation
  • MLE : θ\theta that maximizes lnpx(x;θ)\ln p_x(\text{x};\theta)
  • EM algorithm : maximize lnpy(y;θ)\ln p_y(y;\theta), but yy is not available, so maximize
    Eyx[lnpy(y;θ)]=lnpy(y;θ)p(yx;θ)dyE_{y|x}[\ln p_y(y;\theta)]=\int\ln p_y(y;\theta)p(y|x;\theta)dy
    • Expectation (E) Step : Determine the average log-likelihood of the complete data
      U(θ,θk)=lnpy(y;θ)p(yx;θk)dyU(\theta,\theta_k)=\int\ln p_y(y;\theta)p(y|x;\theta_k)dy
    • Maximization (M) Step : Maximize the average log-likelihood function of the complete data
  • Example : Multiple frequencies in WGN
    • Expectation (E) Step : For i=1,2,,pi=1,2,\cdots,p
      \hat y_i[n]=\cos2\pi f_{i_k}n+\beta_i\left(x[n]-\sum_{j=1}^p\cos2\pi f_{j_k}n\right)
    • Maximization (M) Step : For i=1,2,,pi=1,2,\cdots,p
      fik+1=argmaxfin=0N1y^i[n]cos2πfinf_{i_{k+1}}=\arg\max_{f_i}\sum_{n=0}^{N-1}\hat y_i[n]\cos 2\pi f_i n
      where βi\beta_i's can be arbitrarily chosen as long as i=1pβi=1\sum_{i=1}^p\beta_i=1
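
A compact NumPy sketch of these E and M steps for p = 2 frequencies, with a grid search used for each 1-D maximization; the frequencies, noise level, beta's, and initial guesses are illustrative assumptions (as emphasized above, the initial guesses should be reasonably close to the truth):

```python
import numpy as np

rng = np.random.default_rng(6)
N, sigma = 200, 0.2
f_true = np.array([0.10, 0.30])               # p = 2 true frequencies
n = np.arange(N)
x = np.sum(np.cos(2 * np.pi * f_true[:, None] * n), axis=0) + sigma * rng.normal(size=N)

p = len(f_true)
beta = np.full(p, 1.0 / p)                    # any beta_i with sum(beta_i) = 1
f = np.array([0.103, 0.296])                  # initial guesses, close to the truth
f_grid = np.linspace(0.01, 0.49, 4801)        # grid for each 1-D maximization
basis = np.cos(2 * np.pi * f_grid[:, None] * n)

for _ in range(20):                           # EM iterations
    s = np.cos(2 * np.pi * f[:, None] * n)    # current signal components, shape (p, N)
    resid = x - s.sum(axis=0)                 # incomplete-data residual
    y = s + beta[:, None] * resid             # E-step: estimated complete data y_i[n]
    for i in range(p):                        # M-step: p separate 1-D maximizations
        f[i] = f_grid[np.argmax(basis @ y[i])]
print(f, f_true)
```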

MLE Example - Range Estimation

  • Transmit pulse s(t)s(t), nonzero over t[0,Ts]t\in[0,T_s]
  • Receive reflection s(tτ0)s(t-\tau_0)
  • Estimate the time delay \tau_0, since the round-trip delay is \tau_0=2R/c, which determines the range R
  • Continuous time signal model
    x(t)=s(tτ0)+w(t),  0tT=Ts+τ0maxx(t)=s(t-\tau_0)+w(t),\;0\leq t\leq T=T_s+\tau_{0_\text{max}}
  • Discrete time signal model
    • Sample every =1/(2B)\triangle=1/(2B) sec, x[n]=s(nτ0)+w[n],  n=0,1,,N1x[n]=s(n\triangle-\tau_0)+w[n],\;n=0,1,\cdots, N-1
    • s(nτ0)s(n\triangle-\tau_0) has MM non-zero samples starting at n0n_0
      x[n]={w[n],0nn01s(nΔτ0)+w[n],n0nn0+M1w[n],n0+MnN1x[n] = \begin{cases} w[n], & 0 \leq n \leq n_0 - 1 \\ s(n\Delta - \tau_0) + w[n], & n_0 \leq n \leq n_0 + M - 1 \\ w[n], & n_0 + M \leq n \leq N - 1 \end{cases}
      [Figure: PSD of w(t) and ACF of w(t)]
    • rww(τ)=N0Bsin2πτB2πτBr_{ww}(\tau)=N_0B\frac{\sin2\pi\tau B}{2\pi \tau B}
  • Range Estimation Likelihood Function :
    p(\mathbf{x}; n_0) = \left( \prod_{n=0}^{n_0-1} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[ -\frac{x^2[n]}{2\sigma^2} \right] \right) \cdot \left( \prod_{n=n_0}^{n_0+M-1} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[ -\frac{(x[n] - s[n-n_0])^2}{2\sigma^2} \right] \right) \\[0.3cm]\cdot \left( \prod_{n=n_0+M}^{N-1} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[ -\frac{x^2[n]}{2\sigma^2} \right] \right)\\[0.3cm] = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} x^2[n] \right] \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=n_0}^{n_0+M-1} \left( -2s[n-n_0]x[n] + s^2[n-n_0] \right) \right]
  • Minimize
    2n=n0n0+M1x[n]s[nn0]+n=n0n0+M1s2[nn0]-2 \sum_{n=n_0}^{n_0+M-1} x[n] s[n-n_0] + \sum_{n=n_0}^{n_0+M-1} s^2[n-n_0]
  • Or, maximize
    n=n0n0+M1x[n]s[nn0]\sum_{n=n_0}^{n_0+M-1} x[n] s[n-n_0]

    So, MLE implementation is based on Cross-correlation :
    "Correlate" received signal x[n]x[n] with trasmitted signal s[n]s[n]

  • Therefore
    \hat n_0=\arg\max_{0\leq m\leq N-M}\{C_{XS}[m]\},\\[0.3cm] C_{XS}[m]=\sum_{n=m}^{m+M-1}x[n]s[n-m]\\[0.3cm] \hat R=(c\triangle/2)\hat n_0
    • Think of this as an inner product for each mm
    • Compare data x[n]x[n] to all possible delays of signal s[n]s[n]
      Pick n0n_0 to make them most alike
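
A NumPy sketch of this cross-correlation delay estimator; the pulse shape, n0_true, and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, n0_true, sigma = 200, 20, 73, 0.5

s = np.hanning(M)                      # an illustrative transmit pulse s[n] of length M
x = sigma * rng.normal(size=N)         # noise-only background
x[n0_true:n0_true + M] += s            # received data: delayed pulse plus noise

# Cross-correlate the data with the known pulse for every candidate delay m
C = np.array([np.dot(x[m:m + M], s) for m in range(N - M + 1)])
n0_hat = np.argmax(C)
print(n0_hat, n0_true)
# Range estimate: R_hat = (c * Delta / 2) * n0_hat, with c the propagation speed
# and Delta the sampling interval
```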

All content is based on the lectures of Prof. eui-seok.Hwang at GIST (Detection and Estimation)
