The Orthogonality Principle

Estimation as Geometric Projection

The previous section proved that the MMSE estimator is the conditional mean by a direct calculation. A more powerful viewpoint recasts estimation as a projection in an inner-product space of random variables. In this geometry, "distance" is root-mean-square error and "perpendicular" means "uncorrelated". The MMSE estimator becomes the orthogonal projection of $\boldsymbol\theta$ onto the subspace of functions of $\mathbf{Y}$, and the optimality condition becomes the statement that the residual is perpendicular to that subspace: the orthogonality principle.

Definition:

Inner Product of Random Variables

Let $\mathcal{L}^2$ denote the space of (real or complex) random variables with finite second moment. For $X, Y \in \mathcal{L}^2$ define
$$\langle X, Y \rangle \;=\; \mathbb{E}[X\, \overline{Y}\,], \qquad \|X\| \;=\; \sqrt{\mathbb{E}[|X|^2]\,}.$$
With this inner product, $\mathcal{L}^2$ is a Hilbert space; two random variables are orthogonal when $\mathbb{E}[X\overline{Y}] = 0$.

For zero-mean random variables, orthogonality coincides with uncorrelatedness. The squared norm $\|X - \hat{X}\|^2 = \mathbb{E}[|X - \hat{X}|^2]$ is precisely the mean-square error, so MMSE = squared distance in this space.
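
As a quick numeric illustration (a minimal sketch, assuming a toy zero-mean pair $X \sim \mathcal{N}(0,1)$ and $Y = X + $ independent standard Gaussian noise), the inner product and norm can be estimated by sample averages: for zero-mean variables the empirical inner product matches the covariance, and the squared distance to an estimator is its mean-square error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy zero-mean pair (an assumption for illustration): X ~ N(0, 1), Y = X + independent N(0, 1) noise.
X = rng.standard_normal(n)
Y = X + rng.standard_normal(n)

inner_xy = np.mean(X * Y)                # empirical inner product <X, Y> = E[X Y]
cov_xy = np.cov(X, Y, bias=True)[0, 1]   # empirical covariance Cov(X, Y)
print(inner_xy, cov_xy)                  # nearly identical: with zero means the two coincide

X_hat = 0.5 * Y                          # posterior mean E[X | Y] for this particular toy model
print(np.mean((X - X_hat) ** 2))         # squared distance ||X - X_hat||^2 = mean-square error (about 0.5 here)
```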

Theorem: Orthogonality Principle (Unrestricted MMSE)

Let $(\boldsymbol\theta, \mathbf{Y})$ be jointly distributed with $\mathbb{E}[\|\boldsymbol\theta\|^2] < \infty$. Then the MMSE estimator $\hat\theta^\star(\mathbf{Y})$ is characterized by the following orthogonality condition: for every (measurable, $\mathcal{L}^2$) function $\phi: \mathcal{Y} \to \mathbb{R}^n$,
$$\boxed{\; \mathbb{E}\!\left[\, \big(\boldsymbol\theta - \hat\theta^\star(\mathbf{Y})\big)^\top \phi(\mathbf{Y})\,\right] \;=\; 0 .\; }$$
That is, the estimation error is uncorrelated with every function of the observation.
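
A one-line proof sketch of the forward direction, using the previous section's result that $\hat\theta^\star(\mathbf{Y}) = \mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]$ together with the tower property of conditional expectation (conditioning on $\mathbf{Y}$ lets $\phi(\mathbf{Y})$ factor out of the inner expectation):
$$
\mathbb{E}\big[(\boldsymbol\theta - \hat\theta^\star(\mathbf{Y}))^\top \phi(\mathbf{Y})\big]
= \mathbb{E}\Big[\mathbb{E}\big[(\boldsymbol\theta - \hat\theta^\star(\mathbf{Y}))^\top \phi(\mathbf{Y}) \,\big|\, \mathbf{Y}\big]\Big]
= \mathbb{E}\Big[\big(\mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}] - \hat\theta^\star(\mathbf{Y})\big)^\top \phi(\mathbf{Y})\Big] = 0 .
$$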

Think of the subspace $\mathcal{H}_Y = \{\phi(\mathbf{Y}) : \phi \in \mathcal{L}^2\}$ of all square-integrable functions of the observation. The MMSE estimator is the orthogonal projection of $\boldsymbol\theta$ onto $\mathcal{H}_Y$, so the residual lies in the orthogonal complement.

Key Takeaway

The orthogonality principle is a characterization: an estimator is MMSE-optimal if and only if its error is uncorrelated with every function of the observation. This dispenses with having to guess the optimal form — once you verify orthogonality, you have proved optimality.
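
To see why verifying orthogonality is enough, here is a sketch of the converse direction: for any competing estimator $g(\mathbf{Y})$, expand the squared error around $\hat\theta^\star$; the cross term vanishes because $\hat\theta^\star(\mathbf{Y}) - g(\mathbf{Y})$ is itself a (square-integrable) function of the observation:
$$
\mathbb{E}\big[\|\boldsymbol\theta - g(\mathbf{Y})\|^2\big]
= \mathbb{E}\big[\|\boldsymbol\theta - \hat\theta^\star(\mathbf{Y})\|^2\big]
+ 2\,\mathbb{E}\big[(\boldsymbol\theta - \hat\theta^\star(\mathbf{Y}))^\top(\hat\theta^\star(\mathbf{Y}) - g(\mathbf{Y}))\big]
+ \mathbb{E}\big[\|\hat\theta^\star(\mathbf{Y}) - g(\mathbf{Y})\|^2\big]
\;\ge\; \mathbb{E}\big[\|\boldsymbol\theta - \hat\theta^\star(\mathbf{Y})\|^2\big] .
$$
Equality holds only when $g(\mathbf{Y}) = \hat\theta^\star(\mathbf{Y})$ almost surely, so the projection is unique.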

Pythagoras for Estimation

Because the error is orthogonal to the estimator (both are in $\mathcal{L}^2$), the Pythagorean identity gives
$$\mathbb{E}[\|\boldsymbol\theta\|^2] \;=\; \mathbb{E}[\|\hat\theta^\star(\mathbf{Y})\|^2] \;+\; \mathbb{E}[\|\boldsymbol\theta - \hat\theta^\star(\mathbf{Y})\|^2].$$
In words, the energy of the parameter decomposes into the energy captured by the estimator plus the residual MMSE. With the means removed, this is the law of total variance: the variance of the parameter splits into the variance of the estimator plus the expected posterior variance (the residual MMSE).
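
A minimal Monte Carlo sketch of this decomposition, assuming the scalar Gaussian toy model $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$, $Y = \theta + W$ with independent $W \sim \mathcal{N}(0, \sigma^2)$, for which the posterior mean is the linear shrinkage $\hat\theta^\star(Y) = \tfrac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2}\,Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
sigma_theta, sigma = 1.0, 0.5   # prior and noise standard deviations (illustrative values)

# Scalar Gaussian model (an assumption for this sketch): theta ~ N(0, sigma_theta^2), Y = theta + noise.
theta = sigma_theta * rng.standard_normal(n)
Y = theta + sigma * rng.standard_normal(n)

# Posterior mean for this model: linear shrinkage of the observation.
theta_hat = (sigma_theta**2 / (sigma_theta**2 + sigma**2)) * Y

energy_theta = np.mean(theta**2)                                 # E[theta^2]
energy_split = np.mean(theta_hat**2) + np.mean((theta - theta_hat)**2)
print(energy_theta, energy_split)                                # agree up to Monte Carlo error
```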

Example: Verifying Orthogonality

In the scalar Gaussian example (MMSE for the Scalar Gaussian Model), verify the orthogonality condition $\mathbb{E}[(\theta - \hat\theta_{\text{MMSE}}(Y)) \cdot Y] = 0$ directly.
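
Assuming the standard form of that model, $Y = \theta + W$ with independent $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$ and $W \sim \mathcal{N}(0, \sigma^2)$, so that $\hat\theta_{\text{MMSE}}(Y) = \tfrac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2}\,Y$, the check takes two lines:
$$
\mathbb{E}\big[(\theta - \hat\theta_{\text{MMSE}}(Y))\,Y\big]
= \mathbb{E}[\theta Y] - \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2}\,\mathbb{E}[Y^2]
= \sigma_\theta^2 - \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2}\,(\sigma_\theta^2 + \sigma^2) = 0 ,
$$
using $\mathbb{E}[\theta Y] = \sigma_\theta^2$ and $\mathbb{E}[Y^2] = \sigma_\theta^2 + \sigma^2$.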

Orthogonality of the MMSE Residual

Monte Carlo samples of $(Y, e)$ where $e = \theta - \hat\theta_{\text{MMSE}}(Y)$. Compare the empirical correlation coefficient to zero. The cloud is uncorrelated even though $e$ and $Y$ are not, in general, independent when $\theta$ is non-Gaussian.
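
A sketch of the kind of Monte Carlo check the figure describes, assuming a binary signal $\theta \in \{\pm 1\}$ (equiprobable) observed in Gaussian noise of variance $\sigma^2$, for which the posterior mean is $\mathbb{E}[\theta \mid Y = y] = \tanh(y/\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
sigma = 0.8   # noise standard deviation (illustrative value)

# Binary-signal model (an assumption for this sketch): theta = +/-1 equiprobable, Y = theta + N(0, sigma^2).
theta = rng.choice([-1.0, 1.0], size=n)
Y = theta + sigma * rng.standard_normal(n)

theta_hat = np.tanh(Y / sigma**2)        # posterior mean for this model
e = theta - theta_hat                    # MMSE residual

print("corr(e, Y):", np.corrcoef(e, Y)[0, 1])   # approximately zero: residual uncorrelated with Y

# ...yet not independent: the conditional spread of the residual depends on where Y falls.
for lo, hi in [(-0.25, 0.25), (1.5, 2.5)]:
    mask = (Y > lo) & (Y < hi)
    print(f"Var(e | {lo} < Y < {hi}):", np.var(e[mask]))
```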


Orthogonality Principle

The characterization of MMSE estimators: an estimator $\hat\theta(\mathbf{Y})$ is MMSE-optimal if and only if the residual $\boldsymbol\theta - \hat\theta(\mathbf{Y})$ is uncorrelated with every (measurable, $\mathcal{L}^2$) function of the observation.

Related: Minimum Mean-Square Error (MMSE) Estimator, Conditional Expectation

Common Mistake: Uncorrelated ≠ Independent

Mistake:

"If the MMSE residual is uncorrelated with mathbfY\\mathbf{Y}, it must be independent of mathbfY\\mathbf{Y} — so I can treat it as noise."

Correction:

Uncorrelated means $\mathbb{E}[(\boldsymbol\theta - \hat\theta)^\top \phi(\mathbf{Y})] = 0$ for every function $\phi$. Higher-order statistics (e.g., the conditional variance $\text{Var}(e \mid \mathbf{Y} = \mathbf{y})$) can still depend on $\mathbf{Y}$. Only in the jointly Gaussian case is the residual actually independent of $\mathbf{Y}$. For the binary-signal example, $\text{Var}(e \mid Y = y)$ varies dramatically with $y$ (it is large near $y = 0$, where the posterior is ambiguous, and small for $|y| \gg 1$, where the posterior is nearly concentrated).
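
To make the binary-signal claim concrete, assume (as in the sketch above) $\theta \in \{\pm 1\}$ equiprobable and $Y = \theta + W$ with $W \sim \mathcal{N}(0, \sigma^2)$, so that $\mathbb{E}[\theta \mid Y = y] = \tanh(y/\sigma^2)$. Since $\theta^2 \equiv 1$,
$$
\text{Var}(e \mid Y = y) = \text{Var}(\theta \mid Y = y) = 1 - \tanh^2\!\big(y/\sigma^2\big) ,
$$
which is maximal at $y = 0$ and decays rapidly for $|y| \gg 1$, exactly the behavior described above.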