The Estimation Problem

Why Estimation Theory?

Before we can equalize a channel, synchronize a receiver, localize a user, or form a sensing beam, we have to estimate something from noisy data. Every receiver in every wireless system is an estimator in disguise. The question this chapter answers is: given a set of observations $\mathbf{y}$ drawn from a family of distributions $\{f_\theta\}$, what is the best we can hope to learn about $\theta$, and which estimator realizes that limit?

Two separate worlds live inside this question. In the Bayesian world the parameter is a random variable with a prior, and the natural loss is a Bayes risk. In the frequentist world the parameter is a fixed but unknown deterministic quantity, and we quantify performance through the sampling distribution of the estimator. This chapter develops the frequentist framework: bias, variance, MSE, sufficiency, the Cramér--Rao bound, and the Rao--Blackwell procedure. The Bayesian picture is the subject of Chapter 7.

Definition: The Estimation Problem

Let $\mathbf{Y}$ be an observation taking values in $\mathcal{Y}$, distributed according to one member of a parametric family $\{f_\theta(\mathbf{y}) : \theta \in \Lambda\}$, with parameter domain $\Lambda \subseteq \mathbb{R}^m$. An estimator is any measurable function
$$g : \mathcal{Y} \to \Lambda, \qquad \hat{\theta}(\mathbf{y}) \triangleq g(\mathbf{y}).$$
In the frequentist (non-Bayesian) setting $\theta$ is a fixed unknown; we write $\mathbb{E}_\theta[\cdot]$ for expectation when $\mathbf{Y} \sim f_\theta$, and we design $g$ so that the random variable $\hat{\theta}(\mathbf{Y})$ is concentrated near $\theta$ for every $\theta \in \Lambda$.

The parameter $\theta$ does not have a distribution. Probabilities and expectations are taken over the observation noise only, with $\theta$ held fixed. This is what distinguishes frequentist from Bayesian inference.

Definition: Bias, Variance, Mean-Squared Error

For a scalar parameter $\theta \in \Lambda \subseteq \mathbb{R}$ and estimator $\hat{\theta}(\mathbf{Y}) = g(\mathbf{Y})$, define
$$b(\theta) \triangleq \mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})] - \theta \qquad \text{(bias)},$$
$$\mathrm{Var}_\theta(\hat{\theta}) \triangleq \mathbb{E}_\theta\!\bigl[(\hat{\theta}(\mathbf{Y}) - \mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})])^2\bigr] \qquad \text{(variance)},$$
$$\mathrm{MSE}_\theta(\hat{\theta}) \triangleq \mathbb{E}_\theta\!\bigl[(\hat{\theta}(\mathbf{Y}) - \theta)^2\bigr] \qquad \text{(mean-squared error)}.$$
The estimator is unbiased if $b(\theta) = 0$ for every $\theta \in \Lambda$. For vector parameters, replace $\mathrm{Var}_\theta$ with the covariance matrix $\mathrm{Cov}_\theta(\hat{\boldsymbol{\theta}})$ and the scalar MSE by the MSE matrix $\mathbb{E}_\theta[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T]$.

Theorem: MSE Decomposition: Bias--Variance Identity

For any estimator $\hat{\theta}(\mathbf{Y})$ of a scalar parameter $\theta$,
$$\mathrm{MSE}_\theta(\hat{\theta}) \;=\; b(\theta)^2 + \mathrm{Var}_\theta(\hat{\theta}).$$
For a vector estimator $\hat{\boldsymbol{\theta}}$,
$$\mathbb{E}_\theta\!\bigl[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T\bigr] = \mathrm{Cov}_\theta(\hat{\boldsymbol{\theta}}) + \mathbf{b}(\boldsymbol{\theta})\,\mathbf{b}(\boldsymbol{\theta})^T.$$

Estimation error has two sources. One is the systematic offset of the estimator's mean from the true value (bias); the other is the stochastic fluctuation of the estimator around its own mean (variance). The identity follows by adding and subtracting $\mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})]$ inside the square: the cross term vanishes because the fluctuation has zero mean. Equivalently, bias and standard deviation add in quadrature in the root-MSE. This is why an unbiased estimator need not be the best: a little bias may be worth a lot less variance. We return to this tension throughout the book.
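To make the identity concrete, here is a minimal Monte Carlo check in Python (an illustration added here; the values of $\theta$, $\sigma$, $n$, and the shrinkage factor are arbitrary): a deliberately biased estimator of a Gaussian mean, whose empirical MSE matches $b^2 + \mathrm{Var}$ up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.0, 10, 200_000

# Each row is one experiment: n Gaussian samples with mean theta.
Y = rng.normal(theta, sigma, size=(trials, n))

# A deliberately biased estimator: the sample mean shrunk by 0.8.
theta_hat = 0.8 * Y.mean(axis=1)

bias = theta_hat.mean() - theta           # b(theta) = E[theta_hat] - theta
var = theta_hat.var()                     # Var_theta(theta_hat)
mse = np.mean((theta_hat - theta) ** 2)   # E[(theta_hat - theta)^2]

print(f"bias^2 + var = {bias**2 + var:.5f}")
print(f"MSE          = {mse:.5f}")  # agrees up to Monte Carlo error
```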


Key Takeaway

Two knobs, one budget. $\mathrm{MSE} = \text{bias}^2 + \text{variance}$. Tuning an estimator is a trade between these two terms, not a pursuit of either in isolation. Shrinkage, regularization, and ridge-type estimators all exploit this identity.

Example: Sample Mean and Sample Variance of a Gaussian

Let $Y_1, \ldots, Y_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown. Consider the sample mean $\bar{Y} = \tfrac{1}{n}\sum_{i=1}^n Y_i$ and the (biased) sample variance $S_n^2 = \tfrac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2$. Compute the bias and variance of each.
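For reference, the standard answers are $\mathbb{E}[\bar{Y}] = \mu$ and $\mathrm{Var}(\bar{Y}) = \sigma^2/n$, while $\mathbb{E}[S_n^2] = \tfrac{n-1}{n}\sigma^2$, so $S_n^2$ carries bias $-\sigma^2/n$. The sketch below (with illustrative values, not part of the original example) verifies these numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 1.0, 2.0, 8, 500_000

Y = rng.normal(mu, sigma, size=(trials, n))
ybar = Y.mean(axis=1)   # sample mean, one per experiment
s2n = Y.var(axis=1)     # biased sample variance (np.var divides by n)

print(f"bias(Ybar)  = {ybar.mean() - mu:+.4f}   (theory: 0)")
print(f"Var(Ybar)   = {ybar.var():.4f}   (theory: sigma^2/n = {sigma**2 / n:.4f})")
print(f"bias(S_n^2) = {s2n.mean() - sigma**2:+.4f}   (theory: -sigma^2/n = {-sigma**2 / n:.4f})")
```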

Historical Note: Fisher and the Birth of Estimation Theory (1922--1925)

The modern framework of parametric estimation --- likelihood, score, information, sufficiency, efficiency --- was laid down in two remarkable papers by R. A. Fisher: On the Mathematical Foundations of Theoretical Statistics (1922) and Theory of Statistical Estimation (1925). Fisher invented the words likelihood, sufficiency, consistency, and efficiency, and proposed maximum likelihood as the canonical method. His notion of "information in a sample" is exactly what we now call Fisher information. The word parameter, in its statistical sense, is also Fisher's. An unusually high density of permanent ideas per paper.

Bias--Variance Trade-off: James--Stein-type Shrinkage

We estimate $\theta$ from $\bar{Y} \sim \mathcal{N}(\theta, \sigma^2/n)$ and compare the unbiased mean $\hat{\theta}_{\text{ub}} = \bar{Y}$ with the shrinkage estimator $\hat{\theta}_\alpha = \alpha \bar{Y}$ for $\alpha \in [0, 1]$. Sweeping the shrinkage factor $\alpha$ shows MSE, bias$^2$, and variance trading against each other.

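In place of the interactive slider, here is a static sweep (a sketch with assumed illustrative values $\theta = 1$, $\sigma^2 = 1$, $n = 10$). From the definitions, $\hat{\theta}_\alpha$ has bias $(\alpha - 1)\theta$ and variance $\alpha^2 \sigma^2/n$, so $\mathrm{MSE}(\alpha) = (1-\alpha)^2\theta^2 + \alpha^2\sigma^2/n$, minimized at $\alpha^\star = \theta^2/(\theta^2 + \sigma^2/n) < 1$: some shrinkage always beats the unbiased estimator in MSE.

```python
import numpy as np

# Static stand-in for the interactive demo: sweep the shrinkage factor.
theta, sigma2, n = 1.0, 1.0, 10   # assumed illustrative values

alphas = np.linspace(0.0, 1.0, 11)
bias2 = (1 - alphas) ** 2 * theta**2   # bias^2 = ((alpha - 1) * theta)^2
var = alphas**2 * sigma2 / n           # Var(alpha * Ybar) = alpha^2 sigma^2 / n
mse = bias2 + var

alpha_star = theta**2 / (theta**2 + sigma2 / n)  # minimizer of MSE(alpha)
print(f"alpha* = {alpha_star:.3f}  (MSE there is below sigma^2/n = {sigma2 / n})")
for a, b, v, m in zip(alphas, bias2, var, mse):
    print(f"alpha={a:.1f}  bias^2={b:.3f}  var={v:.3f}  MSE={m:.3f}")
```

Note that $\alpha^\star$ depends on the unknown $\theta$, which is why practical shrinkage rules of James--Stein type estimate the shrinkage factor from the data itself.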

Unbiased Estimator

An estimator $\hat{\theta}(\mathbf{Y})$ is unbiased for $\theta$ if $\mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})] = \theta$ for every $\theta \in \Lambda$.

Related: Bias, Variance, Mean-Squared Error, A Procedure for Building the MVUE

Minimum-Variance Unbiased Estimator (MVUE)

An unbiased estimator whose variance is no larger than that of any other unbiased estimator, simultaneously for every $\theta \in \Lambda$. The MVUE is unique (almost surely) when it exists, and Lehmann--Scheffé gives a constructive route to it through complete sufficient statistics.

Related: Unbiased Estimator, Sufficient Statistic, Rao--Blackwell Theorem

Common Mistake: Unbiased Does Not Mean Best

Mistake:

It is tempting to restrict attention to unbiased estimators and then pick the one with smallest variance, treating the resulting MVUE as "the" optimal estimator.

Correction:

Unbiasedness is a constraint we impose for analytical tractability and interpretability --- not a criterion that the physical problem demands. Biased estimators routinely beat the MVUE in MSE (ridge regression, shrinkage, maximum-a-posteriori estimators, empirical Bayes). Use the MVUE as a benchmark, not a mandate.
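To see the point numerically, here is a hedged sketch (an illustration added here; the design, dimensions, and ridge penalty $\lambda$ are arbitrary choices) comparing ordinary least squares, which is unbiased, against ridge regression on an ill-conditioned design:

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-conditioned linear model y = X beta + noise: the last columns of X
# are weak, which inflates the variance of the unbiased OLS estimate.
n, p, lam, trials = 30, 10, 5.0, 2000
X = rng.standard_normal((n, p)) @ np.diag(np.linspace(1.0, 0.05, p))
beta = rng.standard_normal(p)

err_ols = err_ridge = 0.0
for _ in range(trials):
    y = X @ beta + rng.standard_normal(n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    err_ols += np.sum((b_ols - beta) ** 2)
    err_ridge += np.sum((b_ridge - beta) ** 2)

print(f"OLS   MSE ~ {err_ols / trials:.3f}")
print(f"ridge MSE ~ {err_ridge / trials:.3f}")  # typically smaller here
```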

Why This Matters: Every Receiver is an Estimator

Channel estimation, timing acquisition, carrier-frequency offset correction, angle-of-arrival estimation --- each problem presents a receiver with noisy observations $\mathbf{y}$ and asks for an unknown parameter (a complex gain, a delay, a frequency, an angle). Matched filters, correlators, and FFT bins are not ad-hoc circuits: they are the maximum-likelihood or MVU estimators for specific parametric models. The bias, variance, and Cramér--Rao bounds we compute in this chapter are the hardware specifications of these blocks.
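As one concrete instance: for the pilot-aided model $\mathbf{y} = h\,\mathbf{s} + \mathbf{n}$ with a known pilot $\mathbf{s}$ and white Gaussian noise, the ML estimate of the complex gain $h$ is exactly the correlator $\hat{h} = \mathbf{s}^H \mathbf{y} / \|\mathbf{s}\|^2$, which is also unbiased. A minimal sketch (the pilot, gain, and noise level are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pilot-aided gain estimation: y = h*s + n with known pilot s and
# complex AWGN n. The correlator below is the ML (and unbiased) estimate.
n_pilot = 64
s = np.exp(1j * (np.pi / 2) * rng.integers(0, 4, n_pilot))  # illustrative QPSK pilot
h = 0.7 * np.exp(1j * 0.3)                                  # true complex gain
noise = (rng.standard_normal(n_pilot) + 1j * rng.standard_normal(n_pilot)) / np.sqrt(2)
y = h * s + 0.1 * noise

h_hat = np.vdot(s, y) / np.vdot(s, s)   # correlator: s^H y / ||s||^2
print(f"true:     |h|={abs(h):.4f}, arg={np.angle(h):.4f}")
print(f"estimate: |h|={abs(h_hat):.4f}, arg={np.angle(h_hat):.4f}")
```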

Quick Check

An estimator $\hat{\theta}_1$ is unbiased with variance $9$. Another estimator $\hat{\theta}_2$ has bias $b = 1$ and variance $4$. Which has smaller MSE?

$\hat{\theta}_1$ (lower variance)

$\hat{\theta}_2$, because $1^2 + 4 = 5 < 9$

They have the same MSE

Cannot decide without knowing $\theta$