Estimation Theory Fundamentals

From Detection to Estimation

Detection chooses among a discrete set of hypotheses; estimation determines a continuous parameter from noisy observations. In communications, the parameter to estimate is often the channel itself: its gain, phase, delay, or frequency response. This section develops the mathematical framework for optimal estimation, starting with the fundamental limit (the Cramer-Rao bound) and then developing the two main paradigms: frequentist (ML) and Bayesian (MMSE/LMMSE).

Definition: Estimator, Bias, and Efficiency

An estimator $\hat{\theta}(\mathbf{y})$ is a function of the observed data $\mathbf{y}$ that produces an estimate of an unknown parameter $\theta$.

  • Bias: $b(\theta) = E[\hat{\theta}] - \theta$. An estimator is unbiased if $b(\theta) = 0$ for all $\theta$.

  • Mean-square error: $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + b^2(\theta)$.

  • Consistency: $\hat{\theta} \xrightarrow{p} \theta$ as the number of observations $N \to \infty$.

  • Efficiency: An unbiased estimator is efficient if it achieves the Cramer-Rao lower bound (CRLB) with equality for all $\theta$.

The MSE decomposition $\text{MSE} = \text{variance} + \text{bias}^2$ is fundamental: sometimes a biased estimator with lower variance can achieve lower MSE than the best unbiased estimator.

Theorem: Cramer-Rao Lower Bound (CRLB)

For any unbiased estimator $\hat{\theta}(\mathbf{y})$ of a scalar parameter $\theta$, the variance is lower bounded by

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where $I(\theta)$ is the Fisher information:

$$I(\theta) = -E\!\left[\frac{\partial^2 \ln p(\mathbf{y}; \theta)}{\partial \theta^2}\right] = E\!\left[\left(\frac{\partial \ln p(\mathbf{y}; \theta)}{\partial \theta}\right)^2\right]$$

For a vector parameter $\boldsymbol{\theta} \in \mathbb{R}^p$, the CRLB generalises to the matrix inequality:

$$\text{Cov}(\hat{\boldsymbol{\theta}}) \succeq \mathbf{I}^{-1}(\boldsymbol{\theta})$$

where $[\mathbf{I}(\boldsymbol{\theta})]_{ij} = -E[\partial^2 \ln p / \partial \theta_i \partial \theta_j]$ is the Fisher information matrix (FIM).

The Fisher information measures how "peaky" the likelihood function is around $\theta$: high curvature means the data are informative about $\theta$, so the estimation variance can be small. Low curvature means the data weakly constrain $\theta$, and the variance must be large.
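The curvature interpretation can be checked numerically. A minimal Python sketch, assuming the Gaussian mean model $y_n = \theta + w_n$ with illustrative values for $\theta$, $\sigma$, and $N$, estimates $I(\theta)$ from the finite-difference curvature of the log-likelihood and compares it with the closed form $N/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, trials = 1.0, 0.5, 20, 500  # illustrative values
eps = 1e-4  # finite-difference step

def loglik(th, y):
    # Gaussian log-likelihood ln p(y; theta) for y_n = theta + w_n
    return -0.5 * np.sum((y - th) ** 2) / sigma**2 - N * np.log(sigma * np.sqrt(2 * np.pi))

curv = []
for _ in range(trials):
    y = theta + sigma * rng.standard_normal(N)
    # negative second derivative of ln p via central differences
    d2 = (loglik(theta + eps, y) - 2 * loglik(theta, y) + loglik(theta - eps, y)) / eps**2
    curv.append(-d2)

# for this model the curvature happens to be constant in y;
# averaging over trials matches the definition I = -E[d^2 ln p / d theta^2]
print(f"numerical I(theta):    {np.mean(curv):.2f}")
print(f"closed form N/sigma^2: {N / sigma**2:.2f}")
```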


Definition: Fisher Information

The Fisher information about a parameter $\theta$ contained in an observation $\mathbf{y}$ is

$$I(\theta) = E\!\left[\left(\frac{\partial \ln p(\mathbf{y}; \theta)}{\partial \theta}\right)^2\right] = -E\!\left[\frac{\partial^2 \ln p(\mathbf{y}; \theta)}{\partial \theta^2}\right]$$

Key properties:

  • Additivity: for $N$ i.i.d. observations, $I_N(\theta) = N \cdot I_1(\theta)$
  • For $y = \theta + w$ with $w \sim \mathcal{N}(0, \sigma^2)$: $I(\theta) = 1/\sigma^2$
  • The CRLB for $N$ observations becomes $\text{Var}(\hat{\theta}) \geq \sigma^2/N$

The Fisher information determines the fundamental precision achievable for a given measurement model and noise level.

Definition: Maximum Likelihood (ML) Estimator

The ML estimator maximises the likelihood of the observed data:

$$\hat{\theta}_{\text{ML}} = \arg\max_{\theta}\; p(\mathbf{y}; \theta) = \arg\max_{\theta}\; \ln p(\mathbf{y}; \theta)$$

Properties of the ML estimator:

  • Consistent: $\hat{\theta}_{\text{ML}} \xrightarrow{p} \theta$ as $N \to \infty$
  • Asymptotically efficient: achieves the CRLB as $N \to \infty$
  • Asymptotically Gaussian: $\hat{\theta}_{\text{ML}} \sim \mathcal{N}(\theta, 1/I(\theta))$ for large $N$
  • Invariant: if $\hat{\theta}_{\text{ML}}$ is the ML estimate of $\theta$, then $g(\hat{\theta}_{\text{ML}})$ is the ML estimate of $g(\theta)$

The ML estimator does not require prior knowledge of $\theta$ (frequentist viewpoint) and is often computationally tractable via gradient methods.

For finite $N$, the ML estimator may be biased (e.g., the ML estimate of variance uses $1/N$ instead of $1/(N-1)$), but the bias vanishes as $N \to \infty$.
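A quick simulation makes this bias concrete. The sketch below (illustrative values for the true variance, $N$, and trial count) shows that the $1/N$ ML variance estimate underestimates the true variance by the factor $(N-1)/N$, while the $1/(N-1)$ version is unbiased:

```python
import numpy as np

rng = np.random.default_rng(1)
true_var, N, trials = 4.0, 10, 100_000  # illustrative values

y = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
s2_ml = np.var(y, axis=1, ddof=0)        # ML: divides by N
s2_unbiased = np.var(y, axis=1, ddof=1)  # divides by N-1

print(f"E[ML estimate]       ~ {s2_ml.mean():.3f}  (theory: {(N - 1) / N * true_var:.3f})")
print(f"E[unbiased estimate] ~ {s2_unbiased.mean():.3f}  (theory: {true_var:.3f})")
```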

Definition: MMSE Estimator (Bayesian)

The minimum mean-square error (MMSE) estimator minimises the Bayesian MSE $E[(\hat{\theta} - \theta)^2]$, where the expectation is over both $\mathbf{y}$ and $\theta$:

$$\hat{\theta}_{\text{MMSE}} = E[\theta \mid \mathbf{y}]$$

The MMSE estimator is the conditional mean of $\theta$ given the observations. It requires a prior distribution $p(\theta)$.

The resulting minimum MSE is the expected posterior variance:

$$\text{MMSE} = E[\text{Var}(\theta \mid \mathbf{y})]$$

When $\theta$ and $\mathbf{y}$ are jointly Gaussian, the conditional mean is a linear function of $\mathbf{y}$, and the MMSE estimator coincides with the LMMSE estimator.

The MMSE estimator is optimal in the MSE sense among all estimators (linear and nonlinear). The cost is that it requires knowledge of the prior $p(\theta)$ and computation of the posterior $p(\theta \mid \mathbf{y})$, which may be intractable for complex models.

Theorem: LMMSE Estimator for Jointly Gaussian Case

For the linear observation model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{w}$$

where $\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\mu}_\theta, \mathbf{C}_\theta)$ and $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_w)$ are independent, the LMMSE estimator is

$$\hat{\boldsymbol{\theta}}_{\text{LMMSE}} = \boldsymbol{\mu}_\theta + \mathbf{C}_\theta \mathbf{X}^H (\mathbf{X} \mathbf{C}_\theta \mathbf{X}^H + \mathbf{C}_w)^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\mu}_\theta)$$

The MSE matrix is

$$\mathbf{C}_e = \mathbf{C}_\theta - \mathbf{C}_\theta \mathbf{X}^H (\mathbf{X}\mathbf{C}_\theta\mathbf{X}^H + \mathbf{C}_w)^{-1} \mathbf{X}\mathbf{C}_\theta$$

For the jointly Gaussian case, this equals the MMSE estimator.

Scalar case: $y = x\theta + w$ with $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$ and $w \sim \mathcal{N}(0, \sigma_w^2)$:

$$\hat{\theta}_{\text{LMMSE}} = \frac{x\sigma_\theta^2}{|x|^2 \sigma_\theta^2 + \sigma_w^2}\, y$$

The LMMSE estimator is a regularised version of the LS estimator. At high SNR ($\sigma_w^2 \to 0$), it reduces to LS. At low SNR ($\sigma_w^2 \to \infty$), it shrinks toward the prior mean $\boldsymbol{\mu}_\theta$, relying more on prior knowledge than on the noisy data.
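The theorem translates directly into a few lines of code. A minimal sketch, assuming a real-valued model so that $\mathbf{X}^H$ becomes $\mathbf{X}^T$, with illustrative dimensions and covariances:

```python
import numpy as np

def lmmse(y, X, mu_theta, C_theta, C_w):
    """LMMSE estimate for y = X theta + w; real-valued, so .T replaces the Hermitian transpose."""
    G = C_theta @ X.T @ np.linalg.inv(X @ C_theta @ X.T + C_w)  # LMMSE gain matrix
    theta_hat = mu_theta + G @ (y - X @ mu_theta)
    C_e = C_theta - G @ X @ C_theta  # error covariance (MSE matrix) from the theorem
    return theta_hat, C_e

# illustrative dimensions: p = 2 parameters, n = 5 observations
rng = np.random.default_rng(2)
p, n = 2, 5
X = rng.standard_normal((n, p))
mu_theta, C_theta = np.zeros(p), np.eye(p)
C_w = 0.1 * np.eye(n)

theta = rng.multivariate_normal(mu_theta, C_theta)
y = X @ theta + rng.multivariate_normal(np.zeros(n), C_w)
theta_hat, C_e = lmmse(y, X, mu_theta, C_theta, C_w)
print("estimate:", theta_hat, "  true:", theta)
print("trace of MSE matrix:", np.trace(C_e))
```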

Example: ML Estimation of Signal Amplitude in AWGN

A constant signal $\theta$ is observed $N$ times in AWGN:

$$y_n = \theta + w_n, \qquad w_n \sim \mathcal{N}(0, \sigma^2), \quad n = 1, \ldots, N$$

(a) Find the ML estimate of $\theta$.

(b) Is it unbiased? Compute its variance.

(c) Does it achieve the CRLB?
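A Monte Carlo sketch bearing on (b) and (c), using the fact that the Gaussian likelihood here is maximised by the sample mean and assuming illustrative values for $\theta$, $\sigma$, and $N$; the empirical variance should sit at the CRLB $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, N, trials = 2.0, 1.0, 50, 50_000  # illustrative values

y = theta + sigma * rng.standard_normal((trials, N))
theta_ml = y.mean(axis=1)  # ML estimate: the sample mean maximises the Gaussian likelihood

print(f"mean of estimates:  {theta_ml.mean():.4f}  (unbiasedness: compare with theta = {theta})")
print(f"empirical variance: {theta_ml.var():.5f}  (CRLB sigma^2/N = {sigma**2 / N:.5f})")
```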

Example: LMMSE Channel Estimation

A single-tap channel $h \sim \mathcal{CN}(0, \sigma_h^2)$ is estimated from $N$ pilot observations:

$$y_n = x_n h + w_n, \qquad n = 1, \ldots, N$$

where $x_n$ are known pilots with $|x_n|^2 = E_p$ and $w_n \sim \mathcal{CN}(0, \sigma_w^2)$.

(a) Find the LS and LMMSE estimates of $h$.

(b) Compare their MSE.
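For part (b), a Monte Carlo sketch under assumed values of $\sigma_h^2$, $\sigma_w^2$, $E_p$, and $N$; the closed-form MSEs it prints, $\sigma_w^2/(N E_p)$ for LS and $\sigma_h^2\sigma_w^2/(N E_p \sigma_h^2 + \sigma_w^2)$ for LMMSE, follow by specialising the theorem above to a single tap:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_h2, sigma_w2, Ep, N, trials = 1.0, 0.5, 1.0, 8, 50_000  # illustrative values

x = np.sqrt(Ep) * np.exp(2j * np.pi * rng.random(N))  # pilots with |x_n|^2 = Ep
h = np.sqrt(sigma_h2 / 2) * (rng.standard_normal(trials) + 1j * rng.standard_normal(trials))
w = np.sqrt(sigma_w2 / 2) * (rng.standard_normal((trials, N)) + 1j * rng.standard_normal((trials, N)))
y = h[:, None] * x[None, :] + w

xHy = y @ x.conj()                                         # sum_n x_n^* y_n
h_ls = xHy / (N * Ep)                                      # LS estimate
h_lmmse = sigma_h2 * xHy / (N * Ep * sigma_h2 + sigma_w2)  # scalar LMMSE

print(f"LS    MSE: {np.mean(np.abs(h_ls - h) ** 2):.4f}  "
      f"(theory {sigma_w2 / (N * Ep):.4f})")
print(f"LMMSE MSE: {np.mean(np.abs(h_lmmse - h) ** 2):.4f}  "
      f"(theory {sigma_h2 * sigma_w2 / (N * Ep * sigma_h2 + sigma_w2):.4f})")
```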

MMSE vs LS Estimation

Compare the LS and MMSE estimators for a frequency-selective channel. The LS estimator is unbiased but noisy; the MMSE estimator uses channel correlation to smooth the estimate. Observe how the MSE gap between LS and MMSE increases at low SNR, where prior knowledge has the greatest value.


Quick Check

The Fisher information for estimating a channel gain $h$ from $N$ pilot observations at SNR $= E_p/\sigma_w^2$ is $I(h) = N E_p/\sigma_w^2$. What happens to the CRLB as the number of pilots doubles?

  • The CRLB halves (estimation variance floor decreases by 3 dB)
  • The CRLB doubles
  • The CRLB remains unchanged
  • The CRLB decreases to zero

Common Mistake: Biased Estimators Can Have Lower MSE Than Unbiased Ones

Mistake:

Always preferring unbiased estimators because "bias is bad."

Correction:

The MSE decomposes as $\text{MSE} = \text{variance} + \text{bias}^2$. A biased estimator with significantly lower variance can achieve lower MSE than the minimum-variance unbiased estimator (MVUE).

Example: the LMMSE estimator of a zero-mean channel gain is biased (it shrinks toward zero), yet it has lower MSE than the unbiased LS estimator at every SNR.

This is the essence of the bias-variance trade-off: accepting some bias can dramatically reduce variance, especially when data are limited or noisy. The MMSE criterion explicitly optimises the total MSE, not just the variance.
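A concrete code illustration, assuming Gaussian data with illustrative parameter values: among variance estimators of the form $c \sum_n (y_n - \bar{y})^2$, dividing by $N-1$ is the unbiased choice, by $N$ the ML choice, and by $N+1$ the minimum-MSE choice, so the most biased of the three achieves the lowest MSE:

```python
import numpy as np

rng = np.random.default_rng(5)
true_var, N, trials = 2.0, 8, 200_000  # illustrative values

y = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
ss = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2, axis=1)  # sum of squared deviations

for div, label in [(N - 1, "unbiased (N-1)"), (N, "ML (N)"), (N + 1, "min-MSE (N+1)")]:
    est = ss / div
    print(f"{label:>15}: bias {est.mean() - true_var:+.4f}, "
          f"MSE {np.mean((est - true_var) ** 2):.4f}")
```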

Bayesian vs Frequentist Estimation

Frequentist (classical) estimation treats $\theta$ as a fixed but unknown constant. The CRLB and ML estimator belong to this paradigm. Performance is measured by worst-case or average behaviour over the data distribution $p(\mathbf{y}; \theta)$.

Bayesian estimation treats $\theta$ as a random variable with a known prior $p(\theta)$. The MMSE and MAP estimators belong to this paradigm. Performance is measured by averaging over both $p(\mathbf{y} \mid \theta)$ and $p(\theta)$.

In wireless communications, the Bayesian viewpoint is natural: the channel is indeed random (due to fading), and its statistics (delay spread, Doppler, correlation) are often known from measurements or standards. The LMMSE channel estimator is the most prominent example of Bayesian estimation in practice.

Why This Matters: LMMSE Channel Estimation in OFDM

In OFDM systems (4G LTE, 5G NR, Wi-Fi), the channel is estimated at pilot subcarrier locations and then interpolated to data subcarriers. The LS estimator at pilot positions is

$$\hat{H}_{\text{LS}}[k] = Y[k] / X_p[k]$$

where $X_p[k]$ is the known pilot symbol. The LMMSE estimator exploits the frequency correlation of the channel:

$$\hat{\mathbf{H}}_{\text{LMMSE}} = \mathbf{R}_{HH} \left(\mathbf{R}_{HH} + \sigma_w^2 (\mathbf{X}_p^H\mathbf{X}_p)^{-1}\right)^{-1} \hat{\mathbf{H}}_{\text{LS}}$$

where $\mathbf{R}_{HH}$ is the channel frequency correlation matrix, determined by the power delay profile. The MSE gain of LMMSE over LS is typically 3-5 dB in practical scenarios, translating directly to improved detection performance.
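A sketch of this smoothing step, under assumed modelling choices: an exponential power delay profile generates $\mathbf{R}_{HH}$, and constant-modulus pilots of energy $E_p$ on every subcarrier reduce $\sigma_w^2(\mathbf{X}_p^H\mathbf{X}_p)^{-1}$ to $(\sigma_w^2/E_p)\mathbf{I}$:

```python
import numpy as np

rng = np.random.default_rng(6)
K, L, Ep, sigma_w2, trials = 64, 8, 1.0, 0.1, 200  # illustrative values

# assumed exponential power delay profile, normalised to unit total power
pdp = np.exp(-np.arange(L) / 3.0)
pdp /= pdp.sum()

# frequency correlation R_HH[k, m] = sum_l pdp[l] exp(-j 2 pi (k - m) l / K)
l = np.arange(L)
R = np.array([[np.sum(pdp * np.exp(-2j * np.pi * (k - m) * l / K))
               for m in range(K)] for k in range(K)])

# LMMSE smoothing matrix: constant-modulus pilots give sigma_w^2 (Xp^H Xp)^-1 = (sigma_w2/Ep) I
W = R @ np.linalg.inv(R + (sigma_w2 / Ep) * np.eye(K))

# channel realisations and per-subcarrier LS estimates H_LS = H + noise of variance sigma_w2/Ep
h = np.sqrt(pdp / 2) * (rng.standard_normal((trials, L)) + 1j * rng.standard_normal((trials, L)))
H = np.fft.fft(h, n=K, axis=1)
H_ls = H + np.sqrt(sigma_w2 / (2 * Ep)) * (rng.standard_normal((trials, K))
                                           + 1j * rng.standard_normal((trials, K)))
H_lmmse = H_ls @ W.T  # row-wise application of W

print(f"LS    MSE: {np.mean(np.abs(H_ls - H) ** 2):.4f}")     # ~ sigma_w2/Ep
print(f"LMMSE MSE: {np.mean(np.abs(H_lmmse - H) ** 2):.4f}")  # noticeably lower
```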

Why This Matters: Full Estimation Theory in the FSI Book

This section provides a condensed treatment of estimation theory sufficient for channel estimation and detection. For the complete theory — including sufficient statistics, Rao-Blackwell theorem, exponential families, MMSE with non-Gaussian priors, and the connection to Wiener and Kalman filtering — see the FSI book (Fundamentals of Statistical Inference), which is based on Caire's FSI course at TU Berlin.

Key extensions in the FSI book:

  • Ch 2: MMSE estimation with general priors, Wiener filter
  • Ch 3: MLE for complex models, EM algorithm, sufficient statistics
  • Chs 8-10: Compressed sensing and sparse estimation
  • Chs 11-13: Factor graphs, belief propagation, AMP/OAMP

Cramer-Rao Lower Bound (CRLB)

A lower bound on the variance of any unbiased estimator: $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$. It is the fundamental limit on estimation precision for a given statistical model.

Related: Fisher Information, Maximum Likelihood (ML) Estimator, Estimator, Bias, and Efficiency

Fisher Information

A measure of the information that an observation carries about an unknown parameter. Defined as the expected curvature of the log-likelihood: $I(\theta) = -E[\partial^2 \ln p / \partial\theta^2]$. Higher Fisher information means more precise estimation is possible.

Related: Cramer-Rao Lower Bound (CRLB), Maximum Likelihood (ML) Estimator, Score Function

LMMSE Estimator

The linear minimum mean-square error estimator: the best estimator of the form $\hat{\theta} = \mathbf{a}^H \mathbf{y} + b$ that minimises $E[|\hat{\theta} - \theta|^2]$. For jointly Gaussian variables, the LMMSE coincides with the (nonlinear) MMSE estimator.

Related: MMSE Estimator (Bayesian), Bayesian Estimation, Wiener Filter