The MMSE Estimator

The Central Question of Estimation Theory

Suppose we observe $Y$ and want to predict $X$. We are free to choose any function $g(Y)$ as our estimator. Which $g$ is best?

The answer depends on what "best" means. If we choose the mean square error $\mathbb{E}[(X - g(Y))^2]$ as our loss function (the most common choice in signal processing), then the answer is remarkably clean: the optimal $g$ is the conditional expectation $\mathbb{E}[X|Y]$.

This section proves this fact and develops its most important consequence: the orthogonality principle.

Definition:

Minimum Mean Square Error (MMSE) Estimator

Given random variables $X$ and $Y$, the MMSE estimator of $X$ given $Y$ is

$$\hat{X}_{\text{MMSE}} = \arg\min_{g} \, \mathbb{E}\bigl[(X - g(Y))^2\bigr]$$

where the minimization is over all measurable functions $g: \mathbb{R} \to \mathbb{R}$.

The resulting minimum MSE is called the MMSE:

$$\text{MMSE} = \mathbb{E}\bigl[(X - \hat{X}_{\text{MMSE}})^2\bigr] = \mathbb{E}\bigl[\text{Var}(X|Y)\bigr].$$

The minimization is over all functions, not just linear ones. This is what makes MMSE estimation powerful, and also what makes it hard to compute in general.

Theorem: MMSE Estimator Is the Conditional Expectation

For any random variables $X$ and $Y$ with $\mathbb{E}[X^2] < \infty$:

$$\hat{X}_{\text{MMSE}} = \mathbb{E}[X|Y].$$

That is, the function $g^*(y) = \mathbb{E}[X|Y=y]$ minimizes the mean square error $\mathbb{E}[(X - g(Y))^2]$ over all measurable functions $g$.

The conditional mean is the "center of mass" of XX given each value of YY. Any other estimator deviates from this center, adding unnecessary error. This is the estimation-theoretic analogue of the fact that the sample mean minimizes the sum of squared deviations.
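
A short proof sketch fills in the argument. For any candidate $g$, add and subtract the conditional mean inside the square:

$$\mathbb{E}\bigl[(X - g(Y))^2\bigr] = \mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[X|Y] - g(Y))^2\bigr],$$

because the cross term $2\,\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])(\mathbb{E}[X|Y] - g(Y))\bigr]$ vanishes: condition on $Y$ and use $\mathbb{E}[X - \mathbb{E}[X|Y] \mid Y] = 0$. The second term is nonnegative and equals zero exactly when $g(Y) = \mathbb{E}[X|Y]$, so the conditional mean is the minimizer.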

Theorem: The Orthogonality Principle

The estimation error $\tilde{X} = X - \mathbb{E}[X|Y]$ is orthogonal to every function of $Y$. That is, for any measurable $h$:

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|Y]) \cdot h(Y)\bigr] = 0.$$

Equivalently, the error $\tilde{X}$ is uncorrelated with every function of $Y$: $\text{Cov}(\tilde{X}, h(Y)) = 0$.

The orthogonality principle says that the MMSE estimator extracts all the information in $Y$ that is relevant to predicting $X$. Whatever error remains cannot be reduced by any further processing of $Y$. Geometrically, $\mathbb{E}[X|Y]$ is the projection of $X$ onto the "subspace" of all functions of $Y$, and the error is perpendicular to that subspace.
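
The proof is one line once the tower property is invoked: conditioning on $Y$,

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])\,h(Y)\bigr] = \mathbb{E}\Bigl[\mathbb{E}\bigl[X - \mathbb{E}[X|Y] \,\big|\, Y\bigr]\,h(Y)\Bigr] = \mathbb{E}\bigl[0 \cdot h(Y)\bigr] = 0.$$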

Orthogonality Principle: Error vs. Function of $Y$

Generate samples from a joint distribution $(X,Y)$, compute the MMSE estimate $\hat{X} = \mathbb{E}[X|Y]$, and scatter-plot the error $\tilde{X} = X - \hat{X}$ against $h(Y)$ for various choices of $h$. The sample correlation should be near zero, confirming orthogonality.
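
A minimal sketch of this experiment in Python with NumPy. It uses the binary-signal-in-Gaussian-noise model from the example below, for which the MMSE estimate has the closed form $\mathbb{E}[X|Y] = \tanh(Y/\sigma^2)$; the particular choices of $h$ and the parameter values are illustrative, not prescribed by the text.

```python
import numpy as np

# Illustrative model: X in {-1, +1} equally likely, Y = X + Z with Z ~ N(0, sigma^2).
# For this model the MMSE estimate is E[X|Y] = tanh(Y / sigma^2).
rng = np.random.default_rng(0)
n, sigma = 100_000, 0.8

X = rng.choice([-1.0, 1.0], size=n)
Y = X + sigma * rng.standard_normal(n)

X_hat = np.tanh(Y / sigma**2)   # MMSE estimate E[X|Y]
err = X - X_hat                 # estimation error

# Orthogonality check: the error should be (nearly) uncorrelated with any function of Y.
for name, h in [("Y", Y), ("Y^2", Y**2), ("sin(Y)", np.sin(Y)), ("|Y|", np.abs(Y))]:
    corr = np.corrcoef(err, h)[0, 1]
    print(f"sample corr(error, {name}) = {corr:+.4f}")   # close to 0 up to sampling noise
```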

Example: MMSE Estimation of a Binary Signal in Gaussian Noise

Let $X \in \{-1, +1\}$ with equal probability, and $Y = X + Z$ where $Z \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$. Find $\hat{X}_{\text{MMSE}} = \mathbb{E}[X|Y]$.
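
A sketch of the solution: by Bayes' rule, the posterior probability of $X = +1$ is

$$\mathbb{P}(X = +1 \mid Y = y) = \frac{\tfrac{1}{2}\,e^{-(y-1)^2/2\sigma^2}}{\tfrac{1}{2}\,e^{-(y-1)^2/2\sigma^2} + \tfrac{1}{2}\,e^{-(y+1)^2/2\sigma^2}} = \frac{1}{1 + e^{-2y/\sigma^2}},$$

so

$$\mathbb{E}[X \mid Y = y] = 2\,\mathbb{P}(X = +1 \mid Y = y) - 1 = \tanh\!\left(\frac{y}{\sigma^2}\right).$$

The estimator is a soft decision: nearly linear for small $|y|$, saturating at $\pm 1$ for large $|y|$.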

MMSE vs. Linear Estimator for Non-Gaussian Model

Compare the nonlinear MMSE estimator $\mathbb{E}[X|Y]$ with the best linear estimator for different joint distributions. For Gaussian data, they coincide; for non-Gaussian data, the MMSE can be significantly better.
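
A minimal sketch of such a comparison, again using the binary-in-Gaussian-noise example (an illustrative choice of model). With zero means, the best linear estimator is $\frac{\text{Cov}(X,Y)}{\text{Var}(Y)}\,Y = \frac{Y}{1+\sigma^2}$, while the MMSE estimator is $\tanh(Y/\sigma^2)$.

```python
import numpy as np

# Compare the nonlinear MMSE estimator with the best linear estimator for
# X in {-1, +1}, Y = X + Z, Z ~ N(0, sigma^2)  (illustrative non-Gaussian model).
rng = np.random.default_rng(1)
n = 200_000

for sigma in (0.3, 0.6, 1.0):
    X = rng.choice([-1.0, 1.0], size=n)
    Y = X + sigma * rng.standard_normal(n)

    mse_mmse = np.mean((X - np.tanh(Y / sigma**2)) ** 2)   # nonlinear MMSE
    mse_lin  = np.mean((X - Y / (1 + sigma**2)) ** 2)      # best linear estimator
    print(f"sigma = {sigma:.1f}:  MSE(MMSE) = {mse_mmse:.4f},  MSE(linear) = {mse_lin:.4f}")
```

At low noise the gap is dramatic: the nonlinear estimator effectively recovers the bit, while the linear one cannot.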

Common Mistake: MMSE Is Not the Same as MAP

Mistake:

Confusing the MMSE estimator (conditional mean) with the MAP estimator (mode of the posterior). For symmetric unimodal posteriors they happen to coincide, but in general they differ.

Correction:

The MMSE estimator minimizes the mean square error: $\hat{X}_{\text{MMSE}} = \mathbb{E}[X|Y]$. The MAP estimator maximizes the posterior: $\hat{X}_{\text{MAP}} = \arg\max_x f(x|y)$. For example, with a skewed posterior, the mean (MMSE) and mode (MAP) are different. The MMSE is optimal under squared-error loss; the MAP is optimal under 0-1 loss (for continuous parameters, this becomes a delta-function loss).
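
As a concrete one-line case: if the posterior is exponential, $f(x|y) = \lambda e^{-\lambda x}$ for $x \ge 0$, then the MAP estimate is the mode, $\hat{X}_{\text{MAP}} = 0$, while the MMSE estimate is the mean, $\hat{X}_{\text{MMSE}} = 1/\lambda$.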

The Geometric Picture

Think of the space of all square-integrable random variables as a Hilbert space with inner product $\langle U, V \rangle = \mathbb{E}[UV]$. The set of all functions of $Y$, namely $\{h(Y) : \mathbb{E}[h(Y)^2] < \infty\}$, forms a closed subspace. The conditional expectation $\mathbb{E}[X|Y]$ is the orthogonal projection of $X$ onto this subspace.

The orthogonality principle is then literally the statement that the projection error is perpendicular to the subspace, exactly as in finite-dimensional linear algebra.
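
In this notation, the decomposition used in the proof sketch above is exactly the Pythagorean theorem: with $\|U\|^2 = \mathbb{E}[U^2]$, for any $h(Y)$ in the subspace,

$$\|X - h(Y)\|^2 = \|X - \mathbb{E}[X|Y]\|^2 + \|\mathbb{E}[X|Y] - h(Y)\|^2 \;\ge\; \|X - \mathbb{E}[X|Y]\|^2.$$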

Definition:

MMSE and Conditional Variance

The minimum mean square error achieved by the MMSE estimator is

$$\text{MMSE} = \mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2\bigr] = \mathbb{E}\bigl[\text{Var}(X|Y)\bigr].$$

This identity connects the MMSE to the average conditional variance: the remaining uncertainty in XX after observing YY is exactly the conditional variance, averaged over YY.
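
The identity is a direct consequence of the tower property: conditioning on $Y$,

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2\bigr] = \mathbb{E}\Bigl[\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2 \,\big|\, Y\bigr]\Bigr] = \mathbb{E}\bigl[\text{Var}(X|Y)\bigr],$$

since the inner expectation is, by definition, the conditional variance of $X$ given $Y$.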

Quick Check

For jointly Gaussian $(X,Y)$ with correlation $\rho$, what is the MMSE (i.e., $\mathbb{E}[(X - \mathbb{E}[X|Y])^2]$)?

$\sigma_X^2$

$\sigma_X^2(1 - \rho^2)$

$\sigma_X^2 \rho^2$

$\sigma_X^2 / (1 + \text{SNR})$

⚠️Engineering Note

MMSE Channel Estimation in 5G NR

In 5G NR, channel estimation for OFDM subcarriers uses pilot symbols to obtain noisy observations of the channel frequency response. The MMSE estimator $\hat{H}_k = \mathbb{E}[H_k | Y_{\text{pilot}}, \ldots]$ requires knowledge of the channel covariance structure, which is estimated from second-order statistics. When the channel is modeled as Gaussian (Rayleigh fading), the MMSE estimator is linear: it reduces to the LMMSE estimator of Section 12.3 (a toy numerical sketch follows this note).

Practical Constraints

• Requires channel covariance matrix, typically estimated from data
• Complexity scales with number of pilots; reduced-rank methods used in practice

📋 Ref: 3GPP TS 38.211, Section 7.4.1
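
A toy numerical sketch of the linear form mentioned in the note above. This is not the 3GPP processing chain; the pilot pattern, covariance model, dimensions, and noise level are all illustrative assumptions. For a zero-mean Gaussian channel $\mathbf{h}$ with covariance $\mathbf{R}_{hh}$, observed at the pilots as $\mathbf{y} = \mathbf{h} + \mathbf{n}$, the LMMSE estimate is $\hat{\mathbf{h}} = \mathbf{R}_{hh}(\mathbf{R}_{hh} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.

```python
import numpy as np

# Toy LMMSE channel estimation at pilot subcarriers (illustrative model, not 3GPP processing).
# Assumptions: zero-mean Gaussian channel h with known covariance R_hh (exponential
# correlation across subcarriers), all-ones pilots, so the observation is y = h + noise.
rng = np.random.default_rng(2)
n_pilots, noise_var, rho = 16, 0.1, 0.9

idx = np.arange(n_pilots)
R_hh = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed frequency-domain covariance

# One channel realization (circularly symmetric complex Gaussian) and its noisy observation.
L = np.linalg.cholesky(R_hh)
h = L @ (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots)) / np.sqrt(2)
y = h + np.sqrt(noise_var / 2) * (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots))

# LMMSE estimate: h_hat = R_hh (R_hh + noise_var * I)^{-1} y
h_hat = R_hh @ np.linalg.solve(R_hh + noise_var * np.eye(n_pilots), y)

print("raw pilot MSE :", np.mean(np.abs(y - h) ** 2))
print("LMMSE MSE     :", np.mean(np.abs(h_hat - h) ** 2))
```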

MMSE (Minimum Mean Square Error)

The estimator that minimizes $\mathbb{E}[(X - g(Y))^2]$ over all measurable functions $g$. It equals the conditional expectation $\mathbb{E}[X|Y]$.

Related: Conditional Expectation, LMMSE Estimator (Vector Case: Wiener-Hopf Equation)

Orthogonality Principle

The property that the MMSE estimation error $X - \mathbb{E}[X|Y]$ is uncorrelated with (orthogonal to) every function of the observation $Y$.

Related: Minimum Mean Square Error (MMSE) Estimator, LMMSE Requires Inversion of $\mathbf{C}_{YY}$