LMMSE Estimation

Why Restrict to Linear Estimators?

The MMSE estimator $\mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]$ is typically nonlinear in $\mathbf{Y}$ and requires knowledge of the full joint distribution of $(\boldsymbol\theta, \mathbf{Y})$. Both are obstacles in practice: full joint distributions are hard to specify, and nonlinear conditional expectations rarely admit closed forms. If we settle for the best estimator among affine functions $\hat{\boldsymbol\theta} = \mathbf{A}\mathbf{Y} + \mathbf{b}$, the optimization requires only second-order statistics (means and covariances) and yields a closed-form answer. The price is suboptimality whenever the true posterior mean is nonlinear. When $(\boldsymbol\theta, \mathbf{Y})$ is jointly Gaussian, there is no price at all.

Definition: Linear MMSE (LMMSE) Estimator

Given a joint distribution of $(\boldsymbol\theta, \mathbf{Y})$ with known first and second moments $\mathbf{m}_\theta, \mathbf{m}_y, \boldsymbol\Sigma_\theta, \boldsymbol\Sigma_y, \boldsymbol\Sigma_{\theta y}$, the LMMSE estimator is the affine function of $\mathbf{Y}$ minimizing the mean-square error:

$$\hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) \;=\; \arg\min_{\mathbf{A},\, \mathbf{b}}\, \mathbb{E}\!\left[\, \|\boldsymbol\theta - (\mathbf{A}\mathbf{Y} + \mathbf{b})\|^2 \,\right].$$

Theorem: The LMMSE Formula

Assume $\boldsymbol\Sigma_y$ is positive definite. The LMMSE estimator is

$$\boxed{\; \hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) \;=\; \mathbf{m}_\theta + \mathbf{A}\,(\mathbf{Y} - \mathbf{m}_y), \qquad \mathbf{A} = \boldsymbol\Sigma_{\theta y}\, \boldsymbol\Sigma_y^{-1}. \;}$$

The LMMSE error covariance and total MSE are

$$\boldsymbol\Sigma_{\theta|y} \;=\; \boldsymbol\Sigma_\theta - \boldsymbol\Sigma_{\theta y}\, \boldsymbol\Sigma_y^{-1}\, \boldsymbol\Sigma_{y\theta}, \qquad \mathrm{MSE}_{\text{LMMSE}} \;=\; \text{tr}\big(\boldsymbol\Sigma_{\theta|y}\big).$$

The estimator has three parts: start from the prior mean $\mathbf{m}_\theta$, form the innovation $\mathbf{Y} - \mathbf{m}_y$ (the observation minus what we'd predict without seeing it), and update by the gain matrix $\mathbf{A} = \boldsymbol\Sigma_{\theta y}\boldsymbol\Sigma_y^{-1}$, which is the correlation between $\boldsymbol\theta$ and $\mathbf{Y}$ normalized by the spread of $\mathbf{Y}$.
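
For intuition, in the scalar case the gain reduces to a familiar ratio:

$$\hat\theta_{\text{LMMSE}}(Y) \;=\; m_\theta + \frac{\sigma_{\theta y}}{\sigma_y^2}\,(Y - m_y), \qquad \sigma^2_{\theta|y} \;=\; \sigma_\theta^2 - \frac{\sigma_{\theta y}^2}{\sigma_y^2}.$$

The more strongly $\theta$ and $Y$ co-vary relative to the spread of $Y$, the larger the correction applied to the prior mean.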

Vector LMMSE Computation

Complexity: $O(m^3)$ for the Cholesky factorization of $\boldsymbol\Sigma_y$, where $m$ is the dimension of $\mathbf{Y}$.
Input: Means $\mathbf{m}_\theta, \mathbf{m}_y$, covariances $\boldsymbol\Sigma_\theta, \boldsymbol\Sigma_y, \boldsymbol\Sigma_{\theta y}$, observation $\mathbf{y}$.
Output: Estimate $\hat{\boldsymbol\theta}$ and error covariance $\boldsymbol\Sigma_{\theta|y}$.
1. Form the innovation: $\tilde{\mathbf{y}} \leftarrow \mathbf{y} - \mathbf{m}_y$
2. Solve $\boldsymbol\Sigma_y \mathbf{z} = \tilde{\mathbf{y}}$ for $\mathbf{z}$ (e.g., via Cholesky)
3. Apply the gain: $\hat{\boldsymbol\theta} \leftarrow \mathbf{m}_\theta + \boldsymbol\Sigma_{\theta y}\, \mathbf{z}$
4. Solve $\boldsymbol\Sigma_y \mathbf{M} = \boldsymbol\Sigma_{y\theta}$ for $\mathbf{M}$
5. $\boldsymbol\Sigma_{\theta|y} \leftarrow \boldsymbol\Sigma_\theta - \boldsymbol\Sigma_{\theta y}\, \mathbf{M}$

Never form $\boldsymbol\Sigma_y^{-1}$ explicitly in code; solve linear systems via Cholesky or LU instead. The posterior covariance does not depend on the observation $\mathbf{y}$, so steps 4 and 5 can be precomputed once and reused for all subsequent observations. A sketch of the procedure in code follows below.
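
A minimal NumPy/SciPy sketch of the procedure above; the function and argument names are illustrative, and it assumes $\boldsymbol\Sigma_y$ is symmetric positive definite:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lmmse(m_theta, m_y, Sigma_theta, Sigma_y, Sigma_theta_y, y):
    """LMMSE estimate and error covariance from first and second moments.

    Uses a single Cholesky factorization of Sigma_y; the inverse is never
    formed explicitly.
    """
    c, low = cho_factor(Sigma_y)                   # Sigma_y = L L^T
    z = cho_solve((c, low), y - m_y)               # step 2: Sigma_y z = y - m_y
    theta_hat = m_theta + Sigma_theta_y @ z        # step 3: apply the gain
    M = cho_solve((c, low), Sigma_theta_y.T)       # step 4: Sigma_y M = Sigma_{y,theta}
    Sigma_post = Sigma_theta - Sigma_theta_y @ M   # step 5: error covariance
    return theta_hat, Sigma_post
```

For a stream of observations sharing the same covariances, the factorization and steps 4 and 5 can be computed once and cached; only steps 1 through 3 need to run per observation.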

Theorem: Orthogonality for LMMSE

The LMMSE residual $\mathbf{e} = \boldsymbol\theta - \hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y})$ is zero-mean and uncorrelated with every affine function of $\mathbf{Y}$:

$$\mathbb{E}[\mathbf{e}] = \mathbf{0}, \qquad \mathbb{E}[\mathbf{e}\, \mathbf{Y}^\top] = \mathbf{0}.$$

LMMSE is the projection of $\boldsymbol\theta$ onto the span of the components of $\mathbf{Y}$ (plus the constant function). The residual is orthogonal to that subspace, hence uncorrelated with linear functions of $\mathbf{Y}$. Unlike the full MMSE, there is no guarantee that the residual is uncorrelated with nonlinear functions of $\mathbf{Y}$.
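
A quick Monte Carlo check of this property on a made-up jointly Gaussian pair (all numbers below are arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy model: theta in R^2, Y = H theta + W in R^3
H = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.3]])
Sigma_theta = np.array([[2.0, 0.4], [0.4, 1.0]])
sigma_w2 = 0.5

theta = rng.multivariate_normal(np.zeros(2), Sigma_theta, size=n)
Y = theta @ H.T + np.sqrt(sigma_w2) * rng.standard_normal((n, 3))

# Model second moments (all means are zero here)
Sigma_y = H @ Sigma_theta @ H.T + sigma_w2 * np.eye(3)
Sigma_theta_y = Sigma_theta @ H.T

A = np.linalg.solve(Sigma_y, Sigma_theta_y.T).T   # gain, without forming the inverse
e = theta - Y @ A.T                               # residual theta - theta_hat

print(e.mean(axis=0))    # ~ 0 (zero-mean residual)
print(e.T @ Y / n)       # ~ 0 matrix (uncorrelated with Y)
```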

Example: Vector Gaussian Observation Model

Let $\boldsymbol\theta \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma_\theta)$ and $\mathbf{Y} = \mathbf{H}\boldsymbol\theta + \mathbf{W}$, with $\mathbf{W} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$ independent of $\boldsymbol\theta$. Find the LMMSE estimator and its error covariance.
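
One way to work it out: because $\mathbf{W}$ is independent of $\boldsymbol\theta$ and everything is zero-mean, the required second moments follow directly from the model,

$$\boldsymbol\Sigma_{\theta y} = \boldsymbol\Sigma_\theta \mathbf{H}^\top, \qquad \boldsymbol\Sigma_y = \mathbf{H}\boldsymbol\Sigma_\theta\mathbf{H}^\top + \sigma_w^2 \mathbf{I},$$

so the LMMSE formula gives

$$\hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) = \boldsymbol\Sigma_\theta \mathbf{H}^\top \big(\mathbf{H}\boldsymbol\Sigma_\theta\mathbf{H}^\top + \sigma_w^2 \mathbf{I}\big)^{-1} \mathbf{Y}, \qquad \boldsymbol\Sigma_{\theta|y} = \boldsymbol\Sigma_\theta - \boldsymbol\Sigma_\theta \mathbf{H}^\top \big(\mathbf{H}\boldsymbol\Sigma_\theta\mathbf{H}^\top + \sigma_w^2 \mathbf{I}\big)^{-1} \mathbf{H}\boldsymbol\Sigma_\theta.$$

Because the model is jointly Gaussian, this linear estimator is also the full MMSE estimator.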

LMMSE MSE vs. SNR

For the model $Y = \theta + W$ with $\theta \sim \mathcal{N}(0,\sigma_\theta^2)$ and $W \sim \mathcal{N}(0,\sigma_w^2)$, compare the LMMSE MSE with the prior-only MSE ($\sigma_\theta^2$) and the noise floor ($\sigma_w^2$) as the SNR varies.

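A short numerical sketch of the comparison, taking SNR as $\sigma_\theta^2/\sigma_w^2$ (the dB grid below is illustrative):

```python
import numpy as np

sigma_theta2 = 1.0                      # prior variance (illustrative)
snr_db = np.linspace(-10, 30, 9)        # SNR grid in dB (illustrative)
sigma_w2 = sigma_theta2 / 10.0 ** (snr_db / 10.0)

# Scalar LMMSE MSE for Y = theta + W
mse_lmmse = sigma_theta2 * sigma_w2 / (sigma_theta2 + sigma_w2)

for db, m, w in zip(snr_db, mse_lmmse, sigma_w2):
    print(f"SNR {db:6.1f} dB | LMMSE MSE {m:8.4f} | prior {sigma_theta2:.4f} | noise floor {w:8.4f}")
```

At low SNR the LMMSE MSE approaches the prior-only MSE $\sigma_\theta^2$; at high SNR it approaches the noise floor $\sigma_w^2$; it is never larger than either.
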
⚠️Engineering Note

Numerical Conditioning of $\boldsymbol\Sigma_y$

In practice the LMMSE requires solving $\boldsymbol\Sigma_y \mathbf{z} = \mathbf{y} - \mathbf{m}_y$. When pilots are correlated or the noise is small, $\boldsymbol\Sigma_y$ can be ill-conditioned, and a naive solve amplifies numerical noise. The standard remedy is diagonal loading, $\boldsymbol\Sigma_y \leftarrow \boldsymbol\Sigma_y + \epsilon \mathbf{I}$, with $\epsilon$ chosen so that $\kappa(\boldsymbol\Sigma_y + \epsilon \mathbf{I})$ stays within a safe range (e.g., below $10^6$ in double precision). This is equivalent to adding $\epsilon \mathbf{I}$ to the assumed noise covariance, which trades a small increase in MSE for a large gain in numerical stability.
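
A rough sketch of one way to apply the loading; the threshold and the rule for choosing $\epsilon$ are illustrative choices, not a prescription:

```python
import numpy as np

def load_if_ill_conditioned(Sigma_y, kappa_max=1e6):
    """Return (possibly loaded) Sigma_y and the loading epsilon used.

    If the condition number exceeds kappa_max, epsilon is chosen from the
    extreme eigenvalues so the loaded matrix has condition number ~kappa_max.
    """
    eigvals = np.linalg.eigvalsh(Sigma_y)          # symmetric input assumed
    lam_min, lam_max = eigvals[0], eigvals[-1]
    kappa = lam_max / max(lam_min, np.finfo(float).tiny)
    if kappa <= kappa_max:
        return Sigma_y, 0.0
    # Choose eps so that (lam_max + eps) / (lam_min + eps) = kappa_max
    eps = (lam_max - kappa_max * lam_min) / (kappa_max - 1.0)
    return Sigma_y + eps * np.eye(Sigma_y.shape[0]), eps
```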

Practical Constraints
  • Double precision loses accuracy when $\kappa(\boldsymbol\Sigma_y) > 10^8$

  • Real-time systems use Cholesky updates instead of refactoring from scratch

Innovation

The zero-mean quantity $\mathbf{Y} - \mathbb{E}[\mathbf{Y}]$ (or, more generally, $\mathbf{Y} - \hat{\mathbf{Y}}$, where $\hat{\mathbf{Y}}$ is a linear predictor of $\mathbf{Y}$ from past information). The innovation carries the "new" information in the observation that could not be predicted from what was already known.

Related: Orthogonality Principle (Unrestricted MMSE)

LMMSE ≠ BLUE (in general)

In classical (non-Bayesian) estimation, the best linear unbiased estimator (BLUE) is the minimum-variance estimator among unbiased linear functions of the observation. In the Bayesian setting we do not constrain unbiasedness; the LMMSE can be biased (and usually is, via the shrinkage toward $\mathbf{m}_\theta$), trading a little bias for a lot of variance reduction. When the prior is non-informative ($\boldsymbol\Sigma_\theta^{-1} \to \mathbf{0}$), the two coincide.
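
For the linear model of the earlier example (assuming $\boldsymbol\Sigma_\theta$ is invertible and $\mathbf{H}$ has full column rank), this can be read off the equivalent information form of the estimator, obtained via the matrix inversion lemma:

$$\hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) \;=\; \Big(\boldsymbol\Sigma_\theta^{-1} + \tfrac{1}{\sigma_w^2}\,\mathbf{H}^\top\mathbf{H}\Big)^{-1} \tfrac{1}{\sigma_w^2}\,\mathbf{H}^\top \mathbf{Y} \;\longrightarrow\; (\mathbf{H}^\top\mathbf{H})^{-1}\mathbf{H}^\top \mathbf{Y} \quad \text{as } \boldsymbol\Sigma_\theta^{-1} \to \mathbf{0},$$

and the limit is exactly the BLUE (here, ordinary least squares) for this model.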

Quick Check

Which of the following quantities does the LMMSE estimator not need?

$\mathbb{E}[\boldsymbol\theta]$

$\boldsymbol\Sigma_y$

$\boldsymbol\Sigma_{\theta y}$

The third moment $\mathbb{E}[\theta^3]$