Estimation Theory Fundamentals

From Detection to Estimation

Detection chooses among a discrete set of hypotheses; estimation determines a continuous parameter from noisy observations. In communications, the parameter to estimate is often the channel itself: its gain, phase, delay, or frequency response. This section develops the mathematical framework for optimal estimation, starting with the fundamental limit (the Cramer-Rao bound) and then developing the two main paradigms: frequentist (ML) and Bayesian (MMSE/LMMSE).

Definition: Estimator, Bias, and Efficiency

An estimator $\hat{\theta}(\mathbf{y})$ is a function of the observed data $\mathbf{y}$ that produces an estimate of an unknown parameter $\theta$.

  • Bias: $b(\theta) = E[\hat{\theta}] - \theta$. An estimator is unbiased if $b(\theta) = 0$ for all $\theta$.

  • Mean-square error: $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + b^2(\theta)$.

  • Consistency: $\hat{\theta} \xrightarrow{p} \theta$ as the number of observations $N \to \infty$.

  • Efficiency: An unbiased estimator is efficient if it achieves the Cramer-Rao lower bound (CRLB) with equality for all $\theta$.

The MSE decomposition $\text{MSE} = \text{variance} + \text{bias}^2$ is fundamental: sometimes a biased estimator with lower variance can achieve lower MSE than the best unbiased estimator.

Theorem: Cramer-Rao Lower Bound (CRLB)

For any unbiased estimator $\hat{\theta}(\mathbf{y})$ of a scalar parameter $\theta$, the variance is lower bounded by

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where $I(\theta)$ is the Fisher information:

$$I(\theta) = -E\!\left[\frac{\partial^2 \ln p(\mathbf{y}; \theta)}{\partial \theta^2}\right] = E\!\left[\left(\frac{\partial \ln p(\mathbf{y}; \theta)}{\partial \theta}\right)^2\right]$$

For a vector parameter $\boldsymbol{\theta} \in \mathbb{R}^p$, the CRLB generalises to the matrix inequality:

$$\text{Cov}(\hat{\boldsymbol{\theta}}) \succeq \mathbf{I}^{-1}(\boldsymbol{\theta})$$

where $[\mathbf{I}(\boldsymbol{\theta})]_{ij} = -E[\partial^2 \ln p / \partial \theta_i \partial \theta_j]$ is the Fisher information matrix (FIM).

The Fisher information measures how "peaky" the likelihood function is around $\theta$: high curvature means the data are informative about $\theta$, so the estimation variance can be small. Low curvature means the data weakly constrain $\theta$, and the variance must be large.
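The curvature interpretation can be checked numerically. A minimal Python sketch, assuming the Gaussian mean model $y_n = \theta + w_n$ with illustrative values for $\theta$, $\sigma$, and $N$, estimates $I(\theta)$ from the finite-difference curvature of the log-likelihood and compares it with the closed form $N/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, trials = 1.0, 0.5, 20, 500  # illustrative values
eps = 1e-4  # finite-difference step

def loglik(th, y):
    # Gaussian log-likelihood ln p(y; theta) for y_n = theta + w_n
    return -0.5 * np.sum((y - th) ** 2) / sigma**2 - N * np.log(sigma * np.sqrt(2 * np.pi))

curv = []
for _ in range(trials):
    y = theta + sigma * rng.standard_normal(N)
    # negative second derivative of ln p via central differences
    d2 = (loglik(theta + eps, y) - 2 * loglik(theta, y) + loglik(theta - eps, y)) / eps**2
    curv.append(-d2)

# for this model the curvature happens to be constant in y;
# averaging over trials matches the definition I = -E[d^2 ln p / d theta^2]
print(f"numerical I(theta):    {np.mean(curv):.2f}")
print(f"closed form N/sigma^2: {N / sigma**2:.2f}")
```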


Definition: Fisher Information

The Fisher information about a parameter $\theta$ contained in an observation $\mathbf{y}$ is

$$I(\theta) = E\!\left[\left(\frac{\partial \ln p(\mathbf{y}; \theta)}{\partial \theta}\right)^2\right] = -E\!\left[\frac{\partial^2 \ln p(\mathbf{y}; \theta)}{\partial \theta^2}\right]$$

Key properties:

  • Additivity: for $N$ i.i.d. observations, $I_N(\theta) = N \cdot I_1(\theta)$
  • For $y = \theta + w$ with $w \sim \mathcal{N}(0, \sigma^2)$: $I(\theta) = 1/\sigma^2$
  • The CRLB for $N$ observations becomes $\text{Var}(\hat{\theta}) \geq \sigma^2/N$

The Fisher information determines the fundamental precision achievable for a given measurement model and noise level.

Definition: Maximum Likelihood (ML) Estimator

The ML estimator maximises the likelihood of the observed data:

$$\hat{\theta}_{\text{ML}} = \arg\max_{\theta}\; p(\mathbf{y}; \theta) = \arg\max_{\theta}\; \ln p(\mathbf{y}; \theta)$$

Properties of the ML estimator:

  • Consistent: $\hat{\theta}_{\text{ML}} \xrightarrow{p} \theta$ as $N \to \infty$
  • Asymptotically efficient: achieves the CRLB as $N \to \infty$
  • Asymptotically Gaussian: $\hat{\theta}_{\text{ML}} \sim \mathcal{N}(\theta, 1/I(\theta))$ for large $N$
  • Invariant: if $\hat{\theta}_{\text{ML}}$ is the ML estimate of $\theta$, then $g(\hat{\theta}_{\text{ML}})$ is the ML estimate of $g(\theta)$

The ML estimator does not require prior knowledge of $\theta$ (frequentist viewpoint) and is often computationally tractable via gradient methods.

For finite $N$, the ML estimator may be biased (e.g., the ML estimate of variance uses $1/N$ instead of $1/(N-1)$), but the bias vanishes as $N \to \infty$.
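A quick simulation makes this bias concrete. The sketch below (illustrative values for the true variance, $N$, and trial count) shows that the $1/N$ ML variance estimate underestimates the true variance by the factor $(N-1)/N$, while the $1/(N-1)$ version is unbiased:

```python
import numpy as np

rng = np.random.default_rng(1)
true_var, N, trials = 4.0, 10, 100_000  # illustrative values

y = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
s2_ml = np.var(y, axis=1, ddof=0)        # ML: divides by N
s2_unbiased = np.var(y, axis=1, ddof=1)  # divides by N-1

print(f"E[ML estimate]       ~ {s2_ml.mean():.3f}  (theory: {(N - 1) / N * true_var:.3f})")
print(f"E[unbiased estimate] ~ {s2_unbiased.mean():.3f}  (theory: {true_var:.3f})")
```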

Definition: MMSE Estimator (Bayesian)

The minimum mean-square error (MMSE) estimator minimises the Bayesian MSE $E[(\hat{\theta} - \theta)^2]$, where the expectation is over both $\mathbf{y}$ and $\theta$:

$$\hat{\theta}_{\text{MMSE}} = E[\theta \mid \mathbf{y}]$$

The MMSE estimator is the conditional mean of $\theta$ given the observations. It requires a prior distribution $p(\theta)$.

The resulting minimum MSE is the expected posterior variance:

$$\text{MMSE} = E[\text{Var}(\theta \mid \mathbf{y})]$$

When $\theta$ and $\mathbf{y}$ are jointly Gaussian, the conditional mean is a linear function of $\mathbf{y}$, and the MMSE estimator coincides with the LMMSE estimator.

The MMSE estimator is optimal in the MSE sense among all estimators (linear and nonlinear). The cost is that it requires knowledge of the prior $p(\theta)$ and computation of the posterior $p(\theta \mid \mathbf{y})$, which may be intractable for complex models.

Theorem: LMMSE Estimator for Jointly Gaussian Case

For the linear observation model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{w}$$

where $\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\mu}_\theta, \mathbf{C}_\theta)$ and $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_w)$ are independent, the LMMSE estimator is

$$\hat{\boldsymbol{\theta}}_{\text{LMMSE}} = \boldsymbol{\mu}_\theta + \mathbf{C}_\theta \mathbf{X}^H (\mathbf{X} \mathbf{C}_\theta \mathbf{X}^H + \mathbf{C}_w)^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\mu}_\theta)$$

The MSE matrix is

$$\mathbf{C}_e = \mathbf{C}_\theta - \mathbf{C}_\theta \mathbf{X}^H (\mathbf{X}\mathbf{C}_\theta\mathbf{X}^H + \mathbf{C}_w)^{-1} \mathbf{X}\mathbf{C}_\theta$$

For the jointly Gaussian case, this equals the MMSE estimator.

Scalar case: $y = x\theta + w$ with $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$ and $w \sim \mathcal{N}(0, \sigma_w^2)$:

$$\hat{\theta}_{\text{LMMSE}} = \frac{x\sigma_\theta^2}{|x|^2 \sigma_\theta^2 + \sigma_w^2}\, y$$

The LMMSE estimator is a regularised version of the LS estimator. At high SNR ($\sigma_w^2 \to 0$), it reduces to LS. At low SNR ($\sigma_w^2 \to \infty$), it shrinks toward the prior mean $\boldsymbol{\mu}_\theta$, relying more on prior knowledge than on the noisy data.
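The theorem translates directly into a few lines of code. A minimal sketch, assuming a real-valued model so that $\mathbf{X}^H$ becomes $\mathbf{X}^T$, with illustrative dimensions and covariances:

```python
import numpy as np

def lmmse(y, X, mu_theta, C_theta, C_w):
    """LMMSE estimate for y = X theta + w; real-valued, so .T replaces the Hermitian transpose."""
    G = C_theta @ X.T @ np.linalg.inv(X @ C_theta @ X.T + C_w)  # LMMSE gain matrix
    theta_hat = mu_theta + G @ (y - X @ mu_theta)
    C_e = C_theta - G @ X @ C_theta  # error covariance (MSE matrix) from the theorem
    return theta_hat, C_e

# illustrative dimensions: p = 2 parameters, n = 5 observations
rng = np.random.default_rng(2)
p, n = 2, 5
X = rng.standard_normal((n, p))
mu_theta, C_theta = np.zeros(p), np.eye(p)
C_w = 0.1 * np.eye(n)

theta = rng.multivariate_normal(mu_theta, C_theta)
y = X @ theta + rng.multivariate_normal(np.zeros(n), C_w)
theta_hat, C_e = lmmse(y, X, mu_theta, C_theta, C_w)
print("estimate:", theta_hat, "  true:", theta)
print("trace of MSE matrix:", np.trace(C_e))
```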

Example: ML Estimation of Signal Amplitude in AWGN

A constant signal $\theta$ is observed $N$ times in AWGN:

$$y_n = \theta + w_n, \qquad w_n \sim \mathcal{N}(0, \sigma^2), \quad n = 1, \ldots, N$$

(a) Find the ML estimate of $\theta$.

(b) Is it unbiased? Compute its variance.

(c) Does it achieve the CRLB?
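A Monte Carlo sketch bearing on (b) and (c), using the fact that the Gaussian likelihood here is maximised by the sample mean and assuming illustrative values for $\theta$, $\sigma$, and $N$; the empirical variance should sit at the CRLB $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, N, trials = 2.0, 1.0, 50, 50_000  # illustrative values

y = theta + sigma * rng.standard_normal((trials, N))
theta_ml = y.mean(axis=1)  # ML estimate: the sample mean maximises the Gaussian likelihood

print(f"mean of estimates:  {theta_ml.mean():.4f}  (unbiasedness: compare with theta = {theta})")
print(f"empirical variance: {theta_ml.var():.5f}  (CRLB sigma^2/N = {sigma**2 / N:.5f})")
```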

Example: LMMSE Channel Estimation

A single-tap channel $h \sim \mathcal{CN}(0, \sigma_h^2)$ is estimated from $N$ pilot observations:

$$y_n = x_n h + w_n, \qquad n = 1, \ldots, N$$

where $x_n$ are known pilots with $|x_n|^2 = E_p$ and $w_n \sim \mathcal{CN}(0, \sigma_w^2)$.

(a) Find the LS and LMMSE estimates of $h$.

(b) Compare their MSE.
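For part (b), a Monte Carlo sketch under assumed values of $\sigma_h^2$, $\sigma_w^2$, $E_p$, and $N$; the closed-form MSEs it prints, $\sigma_w^2/(N E_p)$ for LS and $\sigma_h^2\sigma_w^2/(N E_p \sigma_h^2 + \sigma_w^2)$ for LMMSE, follow by specialising the theorem above to a single tap:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_h2, sigma_w2, Ep, N, trials = 1.0, 0.5, 1.0, 8, 50_000  # illustrative values

x = np.sqrt(Ep) * np.exp(2j * np.pi * rng.random(N))  # pilots with |x_n|^2 = Ep
h = np.sqrt(sigma_h2 / 2) * (rng.standard_normal(trials) + 1j * rng.standard_normal(trials))
w = np.sqrt(sigma_w2 / 2) * (rng.standard_normal((trials, N)) + 1j * rng.standard_normal((trials, N)))
y = h[:, None] * x[None, :] + w

xHy = y @ x.conj()                                         # sum_n x_n^* y_n
h_ls = xHy / (N * Ep)                                      # LS estimate
h_lmmse = sigma_h2 * xHy / (N * Ep * sigma_h2 + sigma_w2)  # scalar LMMSE

print(f"LS    MSE: {np.mean(np.abs(h_ls - h) ** 2):.4f}  "
      f"(theory {sigma_w2 / (N * Ep):.4f})")
print(f"LMMSE MSE: {np.mean(np.abs(h_lmmse - h) ** 2):.4f}  "
      f"(theory {sigma_h2 * sigma_w2 / (N * Ep * sigma_h2 + sigma_w2):.4f})")
```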

MMSE vs LS Estimation

Compare the LS and MMSE estimators for a frequency-selective channel. The LS estimator is unbiased but noisy; the MMSE estimator uses channel correlation to smooth the estimate. Observe how the MSE gap between LS and MMSE increases at low SNR, where prior knowledge has the greatest value.


Quick Check

The Fisher information for estimating a channel gain $h$ from $N$ pilot observations at SNR $= E_p/\sigma_w^2$ is $I(h) = N E_p/\sigma_w^2$. What happens to the CRLB as the number of pilots doubles?

  • The CRLB halves (estimation variance floor decreases by 3 dB)
  • The CRLB doubles
  • The CRLB remains unchanged
  • The CRLB decreases to zero

Common Mistake: Biased Estimators Can Have Lower MSE Than Unbiased Ones

Mistake:

Always preferring unbiased estimators because "bias is bad."

Correction:

The MSE decomposes as $\text{MSE} = \text{variance} + \text{bias}^2$. A biased estimator with significantly lower variance can achieve lower MSE than the minimum-variance unbiased estimator (MVUE).

Example: the LMMSE estimator of a zero-mean channel gain is biased (it shrinks toward zero), yet it has lower MSE than the unbiased LS estimator at every SNR.

This is the essence of the bias-variance trade-off: accepting some bias can dramatically reduce variance, especially when data are limited or noisy. The MMSE criterion explicitly optimises the total MSE, not just the variance.
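A concrete code illustration, assuming Gaussian data with illustrative parameter values: among variance estimators of the form $c \sum_n (y_n - \bar{y})^2$, dividing by $N-1$ is the unbiased choice, by $N$ the ML choice, and by $N+1$ the minimum-MSE choice, so the most biased of the three achieves the lowest MSE:

```python
import numpy as np

rng = np.random.default_rng(5)
true_var, N, trials = 2.0, 8, 200_000  # illustrative values

y = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
ss = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2, axis=1)  # sum of squared deviations

for div, label in [(N - 1, "unbiased (N-1)"), (N, "ML (N)"), (N + 1, "min-MSE (N+1)")]:
    est = ss / div
    print(f"{label:>15}: bias {est.mean() - true_var:+.4f}, "
          f"MSE {np.mean((est - true_var) ** 2):.4f}")
```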

Bayesian vs Frequentist Estimation

Frequentist (classical) estimation treats $\theta$ as a fixed but unknown constant. The CRLB and ML estimator belong to this paradigm. Performance is measured by worst-case or average behaviour over the data distribution $p(\mathbf{y}; \theta)$.

Bayesian estimation treats $\theta$ as a random variable with a known prior $p(\theta)$. The MMSE and MAP estimators belong to this paradigm. Performance is measured by averaging over both $p(\mathbf{y} \mid \theta)$ and $p(\theta)$.

In wireless communications, the Bayesian viewpoint is natural: the channel is indeed random (due to fading), and its statistics (delay spread, Doppler, correlation) are often known from measurements or standards. The LMMSE channel estimator is the most prominent example of Bayesian estimation in practice.

Why This Matters: LMMSE Channel Estimation in OFDM

In OFDM systems (4G LTE, 5G NR, Wi-Fi), the channel is estimated at pilot subcarrier locations and then interpolated to data subcarriers. The LS estimator at pilot positions is

$$\hat{H}_{\text{LS}}[k] = Y[k] / X_p[k]$$

where $X_p[k]$ is the known pilot symbol. The LMMSE estimator exploits the frequency correlation of the channel:

$$\hat{\mathbf{H}}_{\text{LMMSE}} = \mathbf{R}_{HH} \left(\mathbf{R}_{HH} + \sigma_w^2 (\mathbf{X}_p^H\mathbf{X}_p)^{-1}\right)^{-1} \hat{\mathbf{H}}_{\text{LS}}$$

where $\mathbf{R}_{HH}$ is the channel frequency correlation matrix, determined by the power delay profile. The MSE gain of LMMSE over LS is typically 3-5 dB in practical scenarios, translating directly to improved detection performance.
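A sketch of this smoothing step, under assumed modelling choices: an exponential power delay profile generates $\mathbf{R}_{HH}$, and constant-modulus pilots of energy $E_p$ on every subcarrier reduce $\sigma_w^2(\mathbf{X}_p^H\mathbf{X}_p)^{-1}$ to $(\sigma_w^2/E_p)\mathbf{I}$:

```python
import numpy as np

rng = np.random.default_rng(6)
K, L, Ep, sigma_w2, trials = 64, 8, 1.0, 0.1, 200  # illustrative values

# assumed exponential power delay profile, normalised to unit total power
pdp = np.exp(-np.arange(L) / 3.0)
pdp /= pdp.sum()

# frequency correlation R_HH[k, m] = sum_l pdp[l] exp(-j 2 pi (k - m) l / K)
l = np.arange(L)
R = np.array([[np.sum(pdp * np.exp(-2j * np.pi * (k - m) * l / K))
               for m in range(K)] for k in range(K)])

# LMMSE smoothing matrix: constant-modulus pilots give sigma_w^2 (Xp^H Xp)^-1 = (sigma_w2/Ep) I
W = R @ np.linalg.inv(R + (sigma_w2 / Ep) * np.eye(K))

# channel realisations and per-subcarrier LS estimates H_LS = H + noise of variance sigma_w2/Ep
h = np.sqrt(pdp / 2) * (rng.standard_normal((trials, L)) + 1j * rng.standard_normal((trials, L)))
H = np.fft.fft(h, n=K, axis=1)
H_ls = H + np.sqrt(sigma_w2 / (2 * Ep)) * (rng.standard_normal((trials, K))
                                           + 1j * rng.standard_normal((trials, K)))
H_lmmse = H_ls @ W.T  # row-wise application of W

print(f"LS    MSE: {np.mean(np.abs(H_ls - H) ** 2):.4f}")     # ~ sigma_w2/Ep
print(f"LMMSE MSE: {np.mean(np.abs(H_lmmse - H) ** 2):.4f}")  # noticeably lower
```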

Why This Matters: Full Estimation Theory in the FSI Book

This section provides a condensed treatment of estimation theory sufficient for channel estimation and detection. For the complete theory — including sufficient statistics, Rao-Blackwell theorem, exponential families, MMSE with non-Gaussian priors, and the connection to Wiener and Kalman filtering — see the FSI book (Fundamentals of Statistical Inference), which is based on Caire's FSI course at TU Berlin.

Key extensions in the FSI book:

  • Ch 2: MMSE estimation with general priors, Wiener filter
  • Ch 3: MLE for complex models, EM algorithm, sufficient statistics
  • Chs 8-10: Compressed sensing and sparse estimation
  • Chs 11-13: Factor graphs, belief propagation, AMP/OAMP

Cramer-Rao Lower Bound (CRLB)

A lower bound on the variance of any unbiased estimator: $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$. It is the fundamental limit on estimation precision for a given statistical model.

Related: Fisher Information, Maximum Likelihood (ML) Estimator, Estimator, Bias, and Efficiency

Fisher Information

A measure of the information that an observation carries about an unknown parameter. Defined as the expected curvature of the log-likelihood: $I(\theta) = -E[\partial^2 \ln p / \partial\theta^2]$. Higher Fisher information means more precise estimation is possible.

Related: Cramer-Rao Lower Bound (CRLB), Maximum Likelihood (ML) Estimator, Score Function

LMMSE Estimator

The linear minimum mean-square error estimator: the best estimator of the form $\hat{\theta} = \mathbf{a}^H \mathbf{y} + b$ that minimises $E[|\hat{\theta} - \theta|^2]$. For jointly Gaussian variables, the LMMSE coincides with the (nonlinear) MMSE estimator.

Related: MMSE Estimator (Bayesian), Bayesian Estimation, Wiener Filter