The MMSE Estimator
The Central Question of Estimation Theory
Suppose we observe a random variable $Y$ and want to predict another random variable $X$. We are free to choose any measurable function $g(Y)$ as our estimator. Which is best?
The answer depends on what "best" means. If we choose the mean square error $E\big[(X - g(Y))^2\big]$ as our loss function, the most common choice in signal processing, then the answer is remarkably clean: the optimal estimator is the conditional expectation $E[X \mid Y]$.
This section proves this fact and develops its most important consequence: the orthogonality principle.
Definition: Minimum Mean Square Error (MMSE) Estimator
Given random variables $X$ and $Y$ with $E[X^2] < \infty$, the MMSE estimator of $X$ given $Y$ is
$$\hat{X}_{\mathrm{MMSE}}(Y) = \arg\min_{g}\; E\big[(X - g(Y))^2\big],$$
where the minimization is over all measurable functions $g$.
The resulting minimum MSE is called the MMSE:
$$\mathrm{MMSE} = \min_{g}\; E\big[(X - g(Y))^2\big].$$
The minimization is over all functions, not just linear ones. This is what makes MMSE estimation powerful, and also what makes it hard to compute in general.
Theorem: MMSE Estimator Is the Conditional Expectation
For any random variables $X$ and $Y$ with $E[X^2] < \infty$:
$$E[X \mid Y] = \arg\min_{g}\; E\big[(X - g(Y))^2\big].$$
That is, the function $g^*(y) = E[X \mid Y = y]$ minimizes the mean square error over all measurable functions $g$.
The conditional mean is the "center of mass" of $X$ given each value of $Y$. Any other estimator deviates from this center, adding unnecessary error. This is the estimation-theoretic analogue of the fact that the sample mean minimizes the sum of squared deviations.
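The finite-sample fact is quick to verify: for data points $x_1, \dots, x_n$ and a candidate constant $c$,
$$\frac{d}{dc} \sum_{i=1}^{n} (x_i - c)^2 = -2 \sum_{i=1}^{n} (x_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},$$
so the sample mean is the constant that minimizes the sum of squared deviations, just as $E[X \mid Y = y]$ minimizes the expected squared deviation for each observed $y$.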
Decompose the MSE
For any function $g$, write
$$X - g(Y) = \big(X - E[X \mid Y]\big) + \big(E[X \mid Y] - g(Y)\big).$$
Expanding the square:
$$E\big[(X - g(Y))^2\big] = E\big[(X - E[X \mid Y])^2\big] + E\big[(E[X \mid Y] - g(Y))^2\big] + 2\,E\big[(X - E[X \mid Y])\big(E[X \mid Y] - g(Y)\big)\big].$$
Show the cross-term vanishes
The cross-term is $E\big[(X - E[X \mid Y])\, h(Y)\big]$ where $h(Y) = E[X \mid Y] - g(Y)$. By the tower property:
$$E\big[(X - E[X \mid Y])\, h(Y)\big] = E\Big[\, E\big[(X - E[X \mid Y])\, h(Y) \,\big|\, Y\big] \Big] = E\Big[ h(Y)\, E\big[X - E[X \mid Y] \,\big|\, Y\big] \Big].$$
Now $E\big[X - E[X \mid Y] \,\big|\, Y\big] = E[X \mid Y] - E[X \mid Y] = 0$. So the cross-term is zero.
Conclude
We are left with
$$E\big[(X - g(Y))^2\big] = E\big[(X - E[X \mid Y])^2\big] + E\big[(E[X \mid Y] - g(Y))^2\big].$$
The second term is non-negative and equals zero if and only if $g(Y) = E[X \mid Y]$ almost surely. Hence $E[X \mid Y]$ is the unique minimizer (up to almost-sure equality).
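A minimal Monte Carlo sketch of this conclusion, assuming a simple Gaussian model in which $E[X \mid Y]$ has the closed form $Y/(1+\sigma^2)$; the alternative estimators are arbitrary choices for comparison:

```python
import numpy as np

# Assumed model for illustration: X ~ N(0,1), Y = X + N, N ~ N(0, sigma2),
# for which E[X|Y] = Y / (1 + sigma2) and the theoretical MMSE is sigma2 / (1 + sigma2).
rng = np.random.default_rng(0)
n, sigma2 = 1_000_000, 0.5
x = rng.standard_normal(n)
y = x + np.sqrt(sigma2) * rng.standard_normal(n)

estimators = {
    "conditional mean E[X|Y]": y / (1 + sigma2),
    "g(Y) = Y":                y,
    "g(Y) = 0.5 Y":            0.5 * y,
    "g(Y) = 0":                np.zeros(n),
}
for name, x_hat in estimators.items():
    print(f"{name:24s}  MSE = {np.mean((x - x_hat) ** 2):.4f}")
# The conditional mean lands near sigma2 / (1 + sigma2) ~= 0.333, below every alternative.
```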
Theorem: The Orthogonality Principle
The estimation error $X - E[X \mid Y]$ is orthogonal to every function of $Y$. That is, for any measurable $g$ with $E[g(Y)^2] < \infty$:
$$E\big[(X - E[X \mid Y])\, g(Y)\big] = 0.$$
Equivalently, the error is uncorrelated with every function of $Y$: $\operatorname{Cov}\big(X - E[X \mid Y],\, g(Y)\big) = 0$.
The orthogonality principle says that the MMSE estimator extracts all the information in $Y$ that is relevant to predicting $X$. Whatever error remains cannot be reduced by any further processing of $Y$. Geometrically, $E[X \mid Y]$ is the projection of $X$ onto the "subspace" of all functions of $Y$, and the error is perpendicular to that subspace.
Apply the tower property
Let $e = X - E[X \mid Y]$. For any measurable $g$:
$$E\big[e\, g(Y)\big] = E\Big[\, E\big[e\, g(Y) \mid Y\big] \Big] = E\Big[ g(Y)\, E[e \mid Y] \Big].$$
We used the "pulling out what is known" property, since $g(Y)$ is a function of $Y$.
Show the conditional mean of the error is zero
$$E[e \mid Y] = E\big[X - E[X \mid Y] \,\big|\, Y\big] = E[X \mid Y] - E[X \mid Y] = 0.$$
Therefore $E\big[e\, g(Y)\big] = 0$.
Orthogonality Principle: Error vs. Function of $Y$
Generate samples from a joint distribution of $(X, Y)$, compute the MMSE estimate $E[X \mid Y]$, and scatter-plot the error $X - E[X \mid Y]$ against $g(Y)$ for various choices of $g$. The sample correlation should be near zero, confirming orthogonality.
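A sketch of that experiment, assuming the same Gaussian model as above so that $E[X \mid Y]$ is available in closed form; the test functions $g$ are arbitrary choices:

```python
import numpy as np

# Assumed model: X ~ N(0,1), Y = X + N, N ~ N(0, sigma2), so E[X|Y] = Y / (1 + sigma2).
rng = np.random.default_rng(1)
n, sigma2 = 500_000, 0.8
x = rng.standard_normal(n)
y = x + np.sqrt(sigma2) * rng.standard_normal(n)

error = x - y / (1 + sigma2)  # X - E[X|Y]
for name, g in [("g(Y) = Y", y), ("g(Y) = Y^2", y ** 2), ("g(Y) = sin(Y)", np.sin(y))]:
    corr = np.corrcoef(error, g)[0, 1]
    print(f"corr(error, {name}) = {corr:+.4f}")  # all close to zero
```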
Example: MMSE Estimation of a Binary Signal in Gaussian Noise
Let $X = \pm 1$ with equal probability, and $Y = X + N$ where $N \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$. Find $E[X \mid Y]$.
Compute the posterior probabilities
By Bayes' rule:
$$P(X = +1 \mid Y = y) = \frac{\tfrac{1}{2}\, e^{-(y-1)^2 / 2\sigma^2}}{\tfrac{1}{2}\, e^{-(y-1)^2 / 2\sigma^2} + \tfrac{1}{2}\, e^{-(y+1)^2 / 2\sigma^2}} = \frac{1}{1 + e^{-2y/\sigma^2}}.$$
Similarly, $P(X = -1 \mid Y = y) = \dfrac{1}{1 + e^{+2y/\sigma^2}}$.
Compute the conditional mean
$$E[X \mid Y = y] = (+1)\, P(X = +1 \mid Y = y) + (-1)\, P(X = -1 \mid Y = y) = \frac{1 - e^{-2y/\sigma^2}}{1 + e^{-2y/\sigma^2}} = \tanh\!\left(\frac{y}{\sigma^2}\right).$$
Interpret
The MMSE estimator is $\hat{X}(y) = \tanh(y/\sigma^2)$, a nonlinear function of $y$, even though the channel model is linear. At high SNR ($\sigma^2 \to 0$), $\tanh(y/\sigma^2) \to \operatorname{sign}(y)$, recovering the MAP/ML detector. At low SNR ($\sigma^2 \to \infty$), $\tanh(y/\sigma^2) \approx y/\sigma^2$, so the estimator is nearly linear.
MMSE vs. Linear Estimator for Non-Gaussian Model
Compare the nonlinear MMSE estimator with the best linear estimator for different joint distributions. For Gaussian data, they coincide; for non-Gaussian data, the MMSE can be significantly better.
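A sketch of that comparison for the binary-signal example above; the noise level is an assumption for illustration, and the best linear estimator here is $Y/(1+\sigma^2)$ because $X$ is zero-mean with unit variance:

```python
import numpy as np

# Assumed model: X in {+1, -1} equally likely, Y = X + N, N ~ N(0, sigma2).
# Nonlinear MMSE estimator: tanh(Y / sigma2).  Best linear (LMMSE): Y / (1 + sigma2).
rng = np.random.default_rng(2)
n, sigma2 = 1_000_000, 0.5
x = rng.choice([-1.0, 1.0], size=n)
y = x + np.sqrt(sigma2) * rng.standard_normal(n)

mse_mmse   = np.mean((x - np.tanh(y / sigma2)) ** 2)
mse_linear = np.mean((x - y / (1 + sigma2)) ** 2)
print(f"nonlinear MMSE (tanh): {mse_mmse:.4f}")
print(f"best linear (LMMSE):   {mse_linear:.4f}")  # strictly larger for this non-Gaussian X
```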
Common Mistake: MMSE Is Not the Same as MAP
Mistake:
Confusing the MMSE estimator (conditional mean) with the MAP estimator (mode of the posterior). For symmetric unimodal posteriors they happen to coincide, but in general they differ.
Correction:
The MMSE estimator minimizes the mean square error: $\hat{X}_{\mathrm{MMSE}} = E[X \mid Y]$. The MAP estimator maximizes the posterior: $\hat{X}_{\mathrm{MAP}} = \arg\max_x p(x \mid Y)$. For example, if the posterior is exponential, $p(x \mid y) = \lambda e^{-\lambda x}$ for $x \ge 0$, the mean (MMSE estimate) is $1/\lambda$ while the mode (MAP estimate) is $0$. The MMSE estimator is optimal under squared-error loss; the MAP estimator is optimal under 0-1 loss (for continuous parameters, this becomes a delta-function loss).
The Geometric Picture
Think of the space of all square-integrable random variables as a Hilbert space with inner product $\langle U, V \rangle = E[UV]$. The set of all square-integrable functions of $Y$, $\{ g(Y) : E[g(Y)^2] < \infty \}$, forms a closed subspace. The conditional expectation $E[X \mid Y]$ is the orthogonal projection of $X$ onto this subspace.
The orthogonality principle is then literally the statement that the projection error is perpendicular to the subspace, exactly as in finite-dimensional linear algebra.
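For comparison, the finite-dimensional statement looks like this: the orthogonal projection of a vector $v \in \mathbb{R}^n$ onto the column space of a full-column-rank matrix $A$ is
$$\hat{v} = A (A^\top A)^{-1} A^\top v, \qquad A^\top (v - \hat{v}) = 0,$$
so the residual is orthogonal to every vector in that subspace. Here $E[X \mid Y]$ plays the role of $\hat{v}$, and the functions of $Y$ play the role of the column space.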
Definition: MMSE and Conditional Variance
The minimum mean square error achieved by the MMSE estimator is
$$\mathrm{MMSE} = E\big[(X - E[X \mid Y])^2\big] = E\big[\operatorname{Var}(X \mid Y)\big].$$
This identity connects the MMSE to the average conditional variance: the remaining uncertainty in $X$ after observing $Y$ is exactly the conditional variance $\operatorname{Var}(X \mid Y)$, averaged over $Y$.
Quick Check
For $X$ and $Y$ jointly Gaussian with correlation coefficient $\rho$, what is the MMSE (i.e., $E\big[(X - E[X \mid Y])^2\big]$)?
The conditional variance is $\operatorname{Var}(X \mid Y = y) = \sigma_X^2 (1 - \rho^2)$ for all $y$ (it is constant for jointly Gaussian variables). So $\mathrm{MMSE} = \sigma_X^2 (1 - \rho^2)$.
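A quick numerical check of this answer, assuming standard normal marginals so that $E[X \mid Y] = \rho Y$ and the predicted MMSE is $1 - \rho^2$:

```python
import numpy as np

# Assumed setup: X, Y jointly Gaussian, both standard normal, correlation rho.
rng = np.random.default_rng(3)
n, rho = 1_000_000, 0.7
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)  # corr(X, Y) = rho

x_hat = rho * y  # E[X|Y] for standard normal marginals
print(f"empirical MMSE:    {np.mean((x - x_hat) ** 2):.4f}")
print(f"predicted 1-rho^2: {1 - rho ** 2:.4f}")
```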
MMSE Channel Estimation in 5G NR
In 5G NR, channel estimation for OFDM subcarriers uses pilot symbols to obtain noisy observations of the channel frequency response. The MMSE estimator requires knowledge of the channel covariance structure, which is estimated from second-order statistics. When the channel is modeled as Gaussian (Rayleigh fading), the MMSE estimator is linear: the LMMSE estimator of Section 12.3. A minimal numerical sketch follows the list below.
- Requires channel covariance matrix, typically estimated from data
- Complexity scales with number of pilots; reduced-rank methods used in practice
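A minimal sketch of pilot-based LMMSE channel estimation under simplifying assumptions: the known pilot symbols are already divided out so each observation is a channel coefficient plus noise, and the channel covariance $R_h$ is modeled as exponentially correlated across subcarriers (a stand-in for the covariance estimated from data in practice); the sizes and noise level are illustrative, not 5G NR parameters:

```python
import numpy as np

# Assumed model: per-pilot observation y = h + n, with h ~ CN(0, R_h) (Rayleigh fading)
# and n ~ CN(0, sigma2 * I). R_h below is an assumed exponential-correlation model.
rng = np.random.default_rng(4)
n_pilots, sigma2, corr = 32, 0.1, 0.9
idx = np.arange(n_pilots)
R_h = corr ** np.abs(idx[:, None] - idx[None, :])

# One channel realization and its noisy pilot observation.
L = np.linalg.cholesky(R_h)
w = (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots)) / np.sqrt(2)
h = L @ w
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots))
y = h + noise

# LMMSE estimate: h_hat = R_h (R_h + sigma2 I)^{-1} y, the MMSE estimator in the Gaussian case.
W = R_h @ np.linalg.inv(R_h + sigma2 * np.eye(n_pilots))
h_hat = W @ y
print(f"raw pilot error : {np.mean(np.abs(h - y) ** 2):.4f}")      # ~ sigma2
print(f"LMMSE error     : {np.mean(np.abs(h - h_hat) ** 2):.4f}")  # smaller on average
```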
MMSE (Minimum Mean Square Error)
The estimator that minimizes $E\big[(X - g(Y))^2\big]$ over all measurable functions $g$. It equals the conditional expectation $E[X \mid Y]$.
Related: Conditional Expectation, LMMSE Estimator (Vector Case: Wiener-Hopf Equation)
Orthogonality Principle
The property that the MMSE estimation error $X - E[X \mid Y]$ is uncorrelated with (orthogonal to) every function of the observation $Y$.
Related: Minimum Mean Square Error (MMSE) Estimator, LMMSE Requires Inversion of