The MMSE Estimator

The Central Question of Estimation Theory

Suppose we observe $Y$ and want to predict $X$. We are free to choose any function $g(Y)$ as our estimator. Which $g$ is best?

The answer depends on what "best" means. If we choose the mean square error $\mathbb{E}[(X - g(Y))^2]$ as our loss function (the most common choice in signal processing), then the answer is remarkably clean: the optimal $g$ is the conditional expectation $\mathbb{E}[X|Y]$.

This section proves this fact and develops its most important consequence: the orthogonality principle.

Definition:

Minimum Mean Square Error (MMSE) Estimator

Given random variables $X$ and $Y$, the MMSE estimator of $X$ given $Y$ is

$$\hat{X}_{\text{MMSE}} = \arg\min_{g} \, \mathbb{E}\bigl[(X - g(Y))^2\bigr]$$

where the minimization is over all measurable functions $g: \mathbb{R} \to \mathbb{R}$.

The resulting minimum MSE is called the MMSE:

$$\text{MMSE} = \mathbb{E}\bigl[(X - \hat{X}_{\text{MMSE}})^2\bigr] = \mathbb{E}\bigl[\text{Var}(X|Y)\bigr].$$

The minimization is over all functions, not just linear ones. This is what makes MMSE estimation powerful, and also what makes it hard to compute in general.

Theorem: MMSE Estimator Is the Conditional Expectation

For any random variables $X$ and $Y$ with $\mathbb{E}[X^2] < \infty$:

$$\hat{X}_{\text{MMSE}} = \mathbb{E}[X|Y].$$

That is, the function $g^*(y) = \mathbb{E}[X|Y=y]$ minimizes the mean square error $\mathbb{E}[(X - g(Y))^2]$ over all measurable functions $g$.

The conditional mean is the "center of mass" of XX given each value of YY. Any other estimator deviates from this center, adding unnecessary error. This is the estimation-theoretic analogue of the fact that the sample mean minimizes the sum of squared deviations.
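
A short proof sketch fills in the argument. For any candidate $g$, add and subtract the conditional mean inside the square:

$$\mathbb{E}\bigl[(X - g(Y))^2\bigr] = \mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[X|Y] - g(Y))^2\bigr],$$

because the cross term $2\,\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])(\mathbb{E}[X|Y] - g(Y))\bigr]$ vanishes: condition on $Y$ and use $\mathbb{E}[X - \mathbb{E}[X|Y] \mid Y] = 0$. The second term is nonnegative and equals zero exactly when $g(Y) = \mathbb{E}[X|Y]$, so the conditional mean is the minimizer.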

Theorem: The Orthogonality Principle

The estimation error $\tilde{X} = X - \mathbb{E}[X|Y]$ is orthogonal to every function of $Y$. That is, for any measurable $h$:

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|Y]) \cdot h(Y)\bigr] = 0.$$

Equivalently, the error $\tilde{X}$ is uncorrelated with every function of $Y$: $\text{Cov}(\tilde{X}, h(Y)) = 0$.

The orthogonality principle says that the MMSE estimator extracts all the information in $Y$ that is relevant to predicting $X$. Whatever error remains cannot be reduced by any further processing of $Y$. Geometrically, $\mathbb{E}[X|Y]$ is the projection of $X$ onto the "subspace" of all functions of $Y$, and the error is perpendicular to that subspace.
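
The proof is one line once the tower property is invoked: conditioning on $Y$,

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])\,h(Y)\bigr] = \mathbb{E}\Bigl[\mathbb{E}\bigl[X - \mathbb{E}[X|Y] \,\big|\, Y\bigr]\,h(Y)\Bigr] = \mathbb{E}\bigl[0 \cdot h(Y)\bigr] = 0.$$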

Orthogonality Principle: Error vs. Function of $Y$

Generate samples from a joint distribution $(X,Y)$, compute the MMSE estimate $\hat{X} = \mathbb{E}[X|Y]$, and scatter-plot the error $\tilde{X} = X - \hat{X}$ against $h(Y)$ for various choices of $h$. The sample correlation should be near zero, confirming orthogonality.
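
A minimal sketch of this experiment in Python with NumPy. It uses the binary-signal-in-Gaussian-noise model from the example below, for which the MMSE estimate has the closed form $\mathbb{E}[X|Y] = \tanh(Y/\sigma^2)$; the particular choices of $h$ and the parameter values are illustrative, not prescribed by the text.

```python
import numpy as np

# Illustrative model: X in {-1, +1} equally likely, Y = X + Z with Z ~ N(0, sigma^2).
# For this model the MMSE estimate is E[X|Y] = tanh(Y / sigma^2).
rng = np.random.default_rng(0)
n, sigma = 100_000, 0.8

X = rng.choice([-1.0, 1.0], size=n)
Y = X + sigma * rng.standard_normal(n)

X_hat = np.tanh(Y / sigma**2)   # MMSE estimate E[X|Y]
err = X - X_hat                 # estimation error

# Orthogonality check: the error should be (nearly) uncorrelated with any function of Y.
for name, h in [("Y", Y), ("Y^2", Y**2), ("sin(Y)", np.sin(Y)), ("|Y|", np.abs(Y))]:
    corr = np.corrcoef(err, h)[0, 1]
    print(f"sample corr(error, {name}) = {corr:+.4f}")   # close to 0 up to sampling noise
```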

Example: MMSE Estimation of a Binary Signal in Gaussian Noise

Let $X \in \{-1, +1\}$ with equal probability, and $Y = X + Z$ where $Z \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$. Find $\hat{X}_{\text{MMSE}} = \mathbb{E}[X|Y]$.
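
A sketch of the solution: by Bayes' rule, the posterior probability of $X = +1$ is

$$\mathbb{P}(X = +1 \mid Y = y) = \frac{\tfrac{1}{2}\,e^{-(y-1)^2/2\sigma^2}}{\tfrac{1}{2}\,e^{-(y-1)^2/2\sigma^2} + \tfrac{1}{2}\,e^{-(y+1)^2/2\sigma^2}} = \frac{1}{1 + e^{-2y/\sigma^2}},$$

so

$$\mathbb{E}[X \mid Y = y] = 2\,\mathbb{P}(X = +1 \mid Y = y) - 1 = \tanh\!\left(\frac{y}{\sigma^2}\right).$$

The estimator is a soft decision: nearly linear for small $|y|$, saturating at $\pm 1$ for large $|y|$.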

MMSE vs. Linear Estimator for Non-Gaussian Model

Compare the nonlinear MMSE estimator $\mathbb{E}[X|Y]$ with the best linear estimator for different joint distributions. For Gaussian data, they coincide; for non-Gaussian data, the MMSE can be significantly better.
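
A minimal sketch of such a comparison, again using the binary-in-Gaussian-noise example (an illustrative choice of model). With zero means, the best linear estimator is $\frac{\text{Cov}(X,Y)}{\text{Var}(Y)}\,Y = \frac{Y}{1+\sigma^2}$, while the MMSE estimator is $\tanh(Y/\sigma^2)$.

```python
import numpy as np

# Compare the nonlinear MMSE estimator with the best linear estimator for
# X in {-1, +1}, Y = X + Z, Z ~ N(0, sigma^2)  (illustrative non-Gaussian model).
rng = np.random.default_rng(1)
n = 200_000

for sigma in (0.3, 0.6, 1.0):
    X = rng.choice([-1.0, 1.0], size=n)
    Y = X + sigma * rng.standard_normal(n)

    mse_mmse = np.mean((X - np.tanh(Y / sigma**2)) ** 2)   # nonlinear MMSE
    mse_lin  = np.mean((X - Y / (1 + sigma**2)) ** 2)      # best linear estimator
    print(f"sigma = {sigma:.1f}:  MSE(MMSE) = {mse_mmse:.4f},  MSE(linear) = {mse_lin:.4f}")
```

At low noise the gap is dramatic: the nonlinear estimator effectively recovers the bit, while the linear one cannot.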

Common Mistake: MMSE Is Not the Same as MAP

Mistake:

Confusing the MMSE estimator (conditional mean) with the MAP estimator (mode of the posterior). For symmetric unimodal posteriors they happen to coincide, but in general they differ.

Correction:

The MMSE estimator minimizes the mean square error: $\hat{X}_{\text{MMSE}} = \mathbb{E}[X|Y]$. The MAP estimator maximizes the posterior: $\hat{X}_{\text{MAP}} = \arg\max_x f(x|y)$. For example, with a skewed posterior, the mean (MMSE) and mode (MAP) are different. The MMSE is optimal under squared-error loss; the MAP is optimal under 0-1 loss (for continuous parameters, this becomes a delta-function loss).
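
As a concrete one-line case: if the posterior is exponential, $f(x|y) = \lambda e^{-\lambda x}$ for $x \ge 0$, then the MAP estimate is the mode, $\hat{X}_{\text{MAP}} = 0$, while the MMSE estimate is the mean, $\hat{X}_{\text{MMSE}} = 1/\lambda$.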

The Geometric Picture

Think of the space of all square-integrable random variables as a Hilbert space with inner product $\langle U, V \rangle = \mathbb{E}[UV]$. The set of all functions of $Y$, namely $\{h(Y) : \mathbb{E}[h(Y)^2] < \infty\}$, forms a closed subspace. The conditional expectation $\mathbb{E}[X|Y]$ is the orthogonal projection of $X$ onto this subspace.

The orthogonality principle is then literally the statement that the projection error is perpendicular to the subspace, exactly as in finite-dimensional linear algebra.
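
In this notation, the decomposition used in the proof sketch above is exactly the Pythagorean theorem: with $\|U\|^2 = \mathbb{E}[U^2]$, for any $h(Y)$ in the subspace,

$$\|X - h(Y)\|^2 = \|X - \mathbb{E}[X|Y]\|^2 + \|\mathbb{E}[X|Y] - h(Y)\|^2 \;\ge\; \|X - \mathbb{E}[X|Y]\|^2.$$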

Definition:

MMSE and Conditional Variance

The minimum mean square error achieved by the MMSE estimator is

$$\text{MMSE} = \mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2\bigr] = \mathbb{E}\bigl[\text{Var}(X|Y)\bigr].$$

This identity connects the MMSE to the average conditional variance: the remaining uncertainty in XX after observing YY is exactly the conditional variance, averaged over YY.
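
The identity is a direct consequence of the tower property: conditioning on $Y$,

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2\bigr] = \mathbb{E}\Bigl[\mathbb{E}\bigl[(X - \mathbb{E}[X|Y])^2 \,\big|\, Y\bigr]\Bigr] = \mathbb{E}\bigl[\text{Var}(X|Y)\bigr],$$

since the inner expectation is, by definition, the conditional variance of $X$ given $Y$.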

Quick Check

For jointly Gaussian $(X,Y)$ with correlation $\rho$, what is the MMSE (i.e., $\mathbb{E}[(X - \mathbb{E}[X|Y])^2]$)?

$\sigma_X^2$

$\sigma_X^2(1 - \rho^2)$

$\sigma_X^2 \rho^2$

$\sigma_X^2 / (1 + \text{SNR})$

⚠️Engineering Note

MMSE Channel Estimation in 5G NR

In 5G NR, channel estimation for OFDM subcarriers uses pilot symbols to obtain noisy observations of the channel frequency response. The MMSE estimator $\hat{H}_k = \mathbb{E}[H_k | Y_{\text{pilot}}, \ldots]$ requires knowledge of the channel covariance structure, which is estimated from second-order statistics. When the channel is modeled as Gaussian (Rayleigh fading), the MMSE estimator is linear: it reduces to the LMMSE estimator of Section 12.3 (a toy numerical sketch follows this note).

Practical Constraints

• Requires channel covariance matrix, typically estimated from data
• Complexity scales with number of pilots; reduced-rank methods used in practice

📋 Ref: 3GPP TS 38.211, Section 7.4.1
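
A toy numerical sketch of the linear form mentioned in the note above. This is not the 3GPP processing chain; the pilot pattern, covariance model, dimensions, and noise level are all illustrative assumptions. For a zero-mean Gaussian channel $\mathbf{h}$ with covariance $\mathbf{R}_{hh}$, observed at the pilots as $\mathbf{y} = \mathbf{h} + \mathbf{n}$, the LMMSE estimate is $\hat{\mathbf{h}} = \mathbf{R}_{hh}(\mathbf{R}_{hh} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$.

```python
import numpy as np

# Toy LMMSE channel estimation at pilot subcarriers (illustrative model, not 3GPP processing).
# Assumptions: zero-mean Gaussian channel h with known covariance R_hh (exponential
# correlation across subcarriers), all-ones pilots, so the observation is y = h + noise.
rng = np.random.default_rng(2)
n_pilots, noise_var, rho = 16, 0.1, 0.9

idx = np.arange(n_pilots)
R_hh = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed frequency-domain covariance

# One channel realization (circularly symmetric complex Gaussian) and its noisy observation.
L = np.linalg.cholesky(R_hh)
h = L @ (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots)) / np.sqrt(2)
y = h + np.sqrt(noise_var / 2) * (rng.standard_normal(n_pilots) + 1j * rng.standard_normal(n_pilots))

# LMMSE estimate: h_hat = R_hh (R_hh + noise_var * I)^{-1} y
h_hat = R_hh @ np.linalg.solve(R_hh + noise_var * np.eye(n_pilots), y)

print("raw pilot MSE :", np.mean(np.abs(y - h) ** 2))
print("LMMSE MSE     :", np.mean(np.abs(h_hat - h) ** 2))
```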

MMSE (Minimum Mean Square Error)

The estimator that minimizes $\mathbb{E}[(X - g(Y))^2]$ over all measurable functions $g$. It equals the conditional expectation $\mathbb{E}[X|Y]$.

Related: Conditional Expectation, LMMSE Estimator (Vector Case: Wiener-Hopf Equation)

Orthogonality Principle

The property that the MMSE estimation error $X - \mathbb{E}[X|Y]$ is uncorrelated with (orthogonal to) every function of the observation $Y$.

Related: Minimum Mean Square Error (MMSE) Estimator, LMMSE Requires Inversion of $\mathbf{C}_{YY}$