The Estimation Problem

Why Estimation Theory?

Before we can equalize a channel, synchronize a receiver, localize a user, or form a sensing beam, we have to estimate something from noisy data. Every receiver in every wireless system is an estimator in disguise. The question this chapter answers is: given a set of observations $\mathbf{y}$ drawn from a family of distributions $\{f_\theta\}$, what is the best we can hope to learn about $\theta$, and which estimator realizes that limit?

Two separate worlds live inside this question. In the Bayesian world the parameter is a random variable with a prior, and the natural loss is a Bayes risk. In the frequentist world the parameter is a fixed but unknown deterministic quantity, and we quantify performance through the sampling distribution of the estimator. This chapter develops the frequentist framework: bias, variance, MSE, sufficiency, the Cramér--Rao bound, and the Rao--Blackwell procedure. The Bayesian picture is the subject of Chapter 7.

Definition: The Estimation Problem

Let $\mathbf{Y}$ be an observation taking values in $\mathcal{Y}$, distributed according to one member of a parametric family $\{f_\theta(\mathbf{y}) : \theta \in \Lambda\}$, with parameter domain $\Lambda \subseteq \mathbb{R}^m$. An estimator is any measurable function
$$g : \mathcal{Y} \to \Lambda, \qquad \hat{\theta}(\mathbf{y}) \triangleq g(\mathbf{y}).$$
In the frequentist (non-Bayesian) setting $\theta$ is a fixed unknown; we write $\mathbb{E}_\theta[\cdot]$ for expectation when $\mathbf{Y} \sim f_\theta$, and we design $g$ so that the random variable $\hat{\theta}(\mathbf{Y})$ is concentrated near $\theta$ for every $\theta \in \Lambda$.

The parameter $\theta$ does not have a distribution. Probabilities and expectations are taken over the observation noise only, with $\theta$ held fixed. This is what distinguishes frequentist from Bayesian inference.

Definition: Bias, Variance, Mean-Squared Error

For a scalar parameter $\theta \in \Lambda \subseteq \mathbb{R}$ and estimator $\hat{\theta}(\mathbf{Y}) = g(\mathbf{Y})$, define
$$b(\theta) \triangleq \mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})] - \theta \qquad \text{(bias)},$$
$$\mathrm{Var}_\theta(\hat{\theta}) \triangleq \mathbb{E}_\theta\!\bigl[(\hat{\theta}(\mathbf{Y}) - \mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})])^2\bigr] \qquad \text{(variance)},$$
$$\mathrm{MSE}_\theta(\hat{\theta}) \triangleq \mathbb{E}_\theta\!\bigl[(\hat{\theta}(\mathbf{Y}) - \theta)^2\bigr] \qquad \text{(mean-squared error)}.$$
The estimator is unbiased if $b(\theta) = 0$ for every $\theta \in \Lambda$. For vector parameters, replace $\mathrm{Var}_\theta$ with the covariance matrix $\mathrm{Cov}_\theta(\hat{\boldsymbol{\theta}})$ and the scalar MSE by the MSE matrix $\mathbb{E}_\theta[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T]$.

Theorem: MSE Decomposition: Bias--Variance Identity

For any estimator $\hat{\theta}(\mathbf{Y})$ of a scalar parameter $\theta$,
$$\mathrm{MSE}_\theta(\hat{\theta}) \;=\; b(\theta)^2 + \mathrm{Var}_\theta(\hat{\theta}).$$
For a vector estimator $\hat{\boldsymbol{\theta}}$,
$$\mathbb{E}_\theta\!\bigl[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T\bigr] = \mathrm{Cov}_\theta(\hat{\boldsymbol{\theta}}) + \mathbf{b}(\boldsymbol{\theta})\,\mathbf{b}(\boldsymbol{\theta})^T.$$

Estimation error has two sources. One is the systematic offset of the estimator's mean from the true value (bias); the other is the stochastic fluctuation of the estimator around its own mean (variance). The identity follows by adding and subtracting $\mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})]$ inside the square: the cross term vanishes because the fluctuation has zero mean. Equivalently, bias and standard deviation add in quadrature in the root-MSE. This is why an unbiased estimator need not be the best: a little bias may be worth a lot less variance. We return to this tension throughout the book.
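To make the identity concrete, here is a minimal Monte Carlo check in Python (an illustration added here; the values of $\theta$, $\sigma$, $n$, and the shrinkage factor are arbitrary): a deliberately biased estimator of a Gaussian mean, whose empirical MSE matches $b^2 + \mathrm{Var}$ up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.0, 10, 200_000

# Each row is one experiment: n Gaussian samples with mean theta.
Y = rng.normal(theta, sigma, size=(trials, n))

# A deliberately biased estimator: the sample mean shrunk by 0.8.
theta_hat = 0.8 * Y.mean(axis=1)

bias = theta_hat.mean() - theta           # b(theta) = E[theta_hat] - theta
var = theta_hat.var()                     # Var_theta(theta_hat)
mse = np.mean((theta_hat - theta) ** 2)   # E[(theta_hat - theta)^2]

print(f"bias^2 + var = {bias**2 + var:.5f}")
print(f"MSE          = {mse:.5f}")  # agrees up to Monte Carlo error
```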


Key Takeaway

Two knobs, one budget. $\mathrm{MSE} = \text{bias}^2 + \text{variance}$. Tuning an estimator is a trade between these two terms, not a pursuit of either in isolation. Shrinkage, regularization, and ridge-type estimators all exploit this identity.

Example: Sample Mean and Sample Variance of a Gaussian

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown. Consider the sample mean $\bar{Y} = \tfrac{1}{n}\sum_{i=1}^n Y_i$ and the (biased) sample variance $S_n^2 = \tfrac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2$. Compute the bias and variance of each.
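For reference, the standard answers are $\mathbb{E}[\bar{Y}] = \mu$ and $\mathrm{Var}(\bar{Y}) = \sigma^2/n$, while $\mathbb{E}[S_n^2] = \tfrac{n-1}{n}\sigma^2$, so $S_n^2$ carries bias $-\sigma^2/n$. The sketch below (with illustrative values, not part of the original example) verifies these numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 1.0, 2.0, 8, 500_000

Y = rng.normal(mu, sigma, size=(trials, n))
ybar = Y.mean(axis=1)   # sample mean, one per experiment
s2n = Y.var(axis=1)     # biased sample variance (np.var divides by n)

print(f"bias(Ybar)  = {ybar.mean() - mu:+.4f}   (theory: 0)")
print(f"Var(Ybar)   = {ybar.var():.4f}   (theory: sigma^2/n = {sigma**2 / n:.4f})")
print(f"bias(S_n^2) = {s2n.mean() - sigma**2:+.4f}   (theory: -sigma^2/n = {-sigma**2 / n:.4f})")
```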

Historical Note: Fisher and the Birth of Estimation Theory (1922--1925)

The modern framework of parametric estimation --- likelihood, score, information, sufficiency, efficiency --- was laid down in two remarkable papers by R. A. Fisher: On the Mathematical Foundations of Theoretical Statistics (1922) and Theory of Statistical Estimation (1925). Fisher invented the words likelihood, sufficiency, consistency, and efficiency, and proposed maximum likelihood as the canonical method. His notion of "information in a sample" is exactly what we now call Fisher information. The word parameter, in its statistical sense, is also Fisher's. An unusually high density of permanent ideas per paper.

Bias--Variance Trade-off: James--Stein-type Shrinkage

We estimate $\theta$ from $\bar{Y} \sim \mathcal{N}(\theta, \sigma^2/n)$ and compare the unbiased mean $\hat{\theta}_{\text{ub}} = \bar{Y}$ with the shrinkage estimator $\hat{\theta}_\alpha = \alpha \bar{Y}$ for $\alpha \in [0, 1]$. Sweeping the shrinkage factor $\alpha$ shows MSE, bias$^2$, and variance trading against each other.

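In place of the interactive slider, here is a static sweep (a sketch with assumed illustrative values $\theta = 1$, $\sigma^2 = 1$, $n = 10$). From the definitions, $\hat{\theta}_\alpha$ has bias $(\alpha - 1)\theta$ and variance $\alpha^2 \sigma^2/n$, so $\mathrm{MSE}(\alpha) = (1-\alpha)^2\theta^2 + \alpha^2\sigma^2/n$, minimized at $\alpha^\star = \theta^2/(\theta^2 + \sigma^2/n) < 1$: some shrinkage always beats the unbiased estimator in MSE.

```python
import numpy as np

# Static stand-in for the interactive demo: sweep the shrinkage factor.
theta, sigma2, n = 1.0, 1.0, 10   # assumed illustrative values

alphas = np.linspace(0.0, 1.0, 11)
bias2 = (1 - alphas) ** 2 * theta**2   # bias^2 = ((alpha - 1) * theta)^2
var = alphas**2 * sigma2 / n           # Var(alpha * Ybar) = alpha^2 sigma^2 / n
mse = bias2 + var

alpha_star = theta**2 / (theta**2 + sigma2 / n)  # minimizer of MSE(alpha)
print(f"alpha* = {alpha_star:.3f}  (MSE there is below sigma^2/n = {sigma2 / n})")
for a, b, v, m in zip(alphas, bias2, var, mse):
    print(f"alpha={a:.1f}  bias^2={b:.3f}  var={v:.3f}  MSE={m:.3f}")
```

Note that $\alpha^\star$ depends on the unknown $\theta$, which is why practical shrinkage rules of James--Stein type estimate the shrinkage factor from the data itself.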

Unbiased Estimator

An estimator $\hat{\theta}(\mathbf{Y})$ is unbiased for $\theta$ if $\mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})] = \theta$ for every $\theta \in \Lambda$.

Related: Bias, Variance, Mean-Squared Error, A Procedure for Building the MVUE

Minimum-Variance Unbiased Estimator (MVUE)

An unbiased estimator whose variance is no larger than that of any other unbiased estimator, simultaneously for every $\theta \in \Lambda$. The MVUE is unique (almost surely) when it exists, and Lehmann--Scheffé gives a constructive route to it through complete sufficient statistics.

Related: Unbiased Estimator, Sufficient Statistic, Rao--Blackwell Theorem

Common Mistake: Unbiased Does Not Mean Best

Mistake:

It is tempting to restrict attention to unbiased estimators and then pick the one with smallest variance, treating the resulting MVUE as "the" optimal estimator.

Correction:

Unbiasedness is a constraint we impose for analytical tractability and interpretability --- not a criterion that the physical problem demands. Biased estimators routinely beat the MVUE in MSE (ridge regression, shrinkage, maximum-a-posteriori estimators, empirical Bayes). Use the MVUE as a benchmark, not a mandate.
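To see the point numerically, here is a hedged sketch (an illustration added here; the design, dimensions, and ridge penalty $\lambda$ are arbitrary choices) comparing ordinary least squares, which is unbiased, against ridge regression on an ill-conditioned design:

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-conditioned linear model y = X beta + noise: the last columns of X
# are weak, which inflates the variance of the unbiased OLS estimate.
n, p, lam, trials = 30, 10, 5.0, 2000
X = rng.standard_normal((n, p)) @ np.diag(np.linspace(1.0, 0.05, p))
beta = rng.standard_normal(p)

err_ols = err_ridge = 0.0
for _ in range(trials):
    y = X @ beta + rng.standard_normal(n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    err_ols += np.sum((b_ols - beta) ** 2)
    err_ridge += np.sum((b_ridge - beta) ** 2)

print(f"OLS   MSE ~ {err_ols / trials:.3f}")
print(f"ridge MSE ~ {err_ridge / trials:.3f}")  # typically smaller here
```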

Why This Matters: Every Receiver is an Estimator

Channel estimation, timing acquisition, carrier-frequency offset correction, angle-of-arrival estimation --- each problem presents a receiver with noisy observations $\mathbf{y}$ and asks for an unknown parameter (a complex gain, a delay, a frequency, an angle). Matched filters, correlators, and FFT bins are not ad-hoc circuits: they are the maximum-likelihood or MVU estimators for specific parametric models. The bias, variance, and Cramér--Rao bounds we compute in this chapter are the hardware specifications of these blocks.
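As one concrete instance: for the pilot-aided model $\mathbf{y} = h\,\mathbf{s} + \mathbf{n}$ with a known pilot $\mathbf{s}$ and white Gaussian noise, the ML estimate of the complex gain $h$ is exactly the correlator $\hat{h} = \mathbf{s}^H \mathbf{y} / \|\mathbf{s}\|^2$, which is also unbiased. A minimal sketch (the pilot, gain, and noise level are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pilot-aided gain estimation: y = h*s + n with known pilot s and
# complex AWGN n. The correlator below is the ML (and unbiased) estimate.
n_pilot = 64
s = np.exp(1j * (np.pi / 2) * rng.integers(0, 4, n_pilot))  # illustrative QPSK pilot
h = 0.7 * np.exp(1j * 0.3)                                  # true complex gain
noise = (rng.standard_normal(n_pilot) + 1j * rng.standard_normal(n_pilot)) / np.sqrt(2)
y = h * s + 0.1 * noise

h_hat = np.vdot(s, y) / np.vdot(s, s)   # correlator: s^H y / ||s||^2
print(f"true:     |h|={abs(h):.4f}, arg={np.angle(h):.4f}")
print(f"estimate: |h|={abs(h_hat):.4f}, arg={np.angle(h_hat):.4f}")
```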

Quick Check

An estimator $\hat{\theta}_1$ is unbiased with variance $9$. Another estimator $\hat{\theta}_2$ has bias $b = 1$ and variance $4$. Which has smaller MSE?

$\hat{\theta}_1$ (lower variance)

$\hat{\theta}_2$, because $1^2 + 4 = 5 < 9$

They have the same MSE

Cannot decide without knowing $\theta$