LMMSE Estimation

Why Restrict to Linear Estimators?

The MMSE estimator $\mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]$ is typically nonlinear in $\mathbf{Y}$ and requires knowledge of the full joint distribution of $(\boldsymbol\theta, \mathbf{Y})$. Both are obstacles in practice: full joint distributions are hard to specify, and nonlinear conditional expectations rarely admit closed forms. If we settle for the best estimator among affine functions $\hat{\boldsymbol\theta} = \mathbf{A}\mathbf{Y} + \mathbf{b}$, the optimization requires only second-order statistics (means and covariances) and yields a closed-form answer. The price is suboptimality whenever the true posterior mean is nonlinear. When $(\boldsymbol\theta, \mathbf{Y})$ is jointly Gaussian, there is no price at all.

Definition: Linear MMSE (LMMSE) Estimator

Given a joint distribution of $(\boldsymbol\theta, \mathbf{Y})$ with known first and second moments $\mathbf{m}_\theta, \mathbf{m}_y, \boldsymbol\Sigma_\theta, \boldsymbol\Sigma_y, \boldsymbol\Sigma_{\theta y}$, the LMMSE estimator is the affine function of $\mathbf{Y}$ minimizing the mean-square error:

$$\hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) \;=\; \arg\min_{\mathbf{A},\, \mathbf{b}}\, \mathbb{E}\!\left[\, \|\boldsymbol\theta - (\mathbf{A}\mathbf{Y} + \mathbf{b})\|^2 \,\right].$$

Theorem: The LMMSE Formula

Assume $\boldsymbol\Sigma_y$ is positive definite. The LMMSE estimator is

$$\boxed{\; \hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) \;=\; \mathbf{m}_\theta + \mathbf{A}\,(\mathbf{Y} - \mathbf{m}_y), \qquad \mathbf{A} = \boldsymbol\Sigma_{\theta y}\, \boldsymbol\Sigma_y^{-1}. \;}$$

The LMMSE error covariance and total MSE are

$$\boldsymbol\Sigma_{\theta|y} \;=\; \boldsymbol\Sigma_\theta - \boldsymbol\Sigma_{\theta y}\, \boldsymbol\Sigma_y^{-1}\, \boldsymbol\Sigma_{y\theta}, \qquad \mathrm{MSE}_{\text{LMMSE}} \;=\; \text{tr}\big(\boldsymbol\Sigma_{\theta|y}\big).$$

The estimator has three parts: start from the prior mean $\mathbf{m}_\theta$, form the innovation $\mathbf{Y} - \mathbf{m}_y$ (the observation minus what we'd predict without seeing it), and update by the gain matrix $\mathbf{A} = \boldsymbol\Sigma_{\theta y}\boldsymbol\Sigma_y^{-1}$, which is the correlation between $\boldsymbol\theta$ and $\mathbf{Y}$ normalized by the spread of $\mathbf{Y}$.
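
For intuition, in the scalar case the gain reduces to a familiar ratio:

$$\hat\theta_{\text{LMMSE}}(Y) \;=\; m_\theta + \frac{\sigma_{\theta y}}{\sigma_y^2}\,(Y - m_y), \qquad \sigma^2_{\theta|y} \;=\; \sigma_\theta^2 - \frac{\sigma_{\theta y}^2}{\sigma_y^2}.$$

The more strongly $\theta$ and $Y$ co-vary relative to the spread of $Y$, the larger the correction applied to the prior mean.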

Vector LMMSE Computation

Complexity: $O(m^3)$ for the Cholesky factorization of $\boldsymbol\Sigma_y$, where $m$ is the dimension of $\mathbf{Y}$.
Input: Means $\mathbf{m}_\theta, \mathbf{m}_y$, covariances $\boldsymbol\Sigma_\theta, \boldsymbol\Sigma_y, \boldsymbol\Sigma_{\theta y}$, observation $\mathbf{y}$.
Output: Estimate $\hat{\boldsymbol\theta}$ and error covariance $\boldsymbol\Sigma_{\theta|y}$.
1. Form the innovation: $\tilde{\mathbf{y}} \leftarrow \mathbf{y} - \mathbf{m}_y$
2. Solve $\boldsymbol\Sigma_y \mathbf{z} = \tilde{\mathbf{y}}$ for $\mathbf{z}$ (e.g., via Cholesky)
3. Apply the gain: $\hat{\boldsymbol\theta} \leftarrow \mathbf{m}_\theta + \boldsymbol\Sigma_{\theta y}\, \mathbf{z}$
4. Solve $\boldsymbol\Sigma_y \mathbf{M} = \boldsymbol\Sigma_{y\theta}$ for $\mathbf{M}$
5. $\boldsymbol\Sigma_{\theta|y} \leftarrow \boldsymbol\Sigma_\theta - \boldsymbol\Sigma_{\theta y}\, \mathbf{M}$

Never form $\boldsymbol\Sigma_y^{-1}$ explicitly in code; solve linear systems via Cholesky or LU instead. The posterior covariance does not depend on the observation $\mathbf{y}$, so steps 4 and 5 can be precomputed once and reused for all subsequent observations. A sketch of the procedure in code follows below.
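
A minimal NumPy/SciPy sketch of the procedure above; the function and argument names are illustrative, and it assumes $\boldsymbol\Sigma_y$ is symmetric positive definite:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lmmse(m_theta, m_y, Sigma_theta, Sigma_y, Sigma_theta_y, y):
    """LMMSE estimate and error covariance from first and second moments.

    Uses a single Cholesky factorization of Sigma_y; the inverse is never
    formed explicitly.
    """
    c, low = cho_factor(Sigma_y)                   # Sigma_y = L L^T
    z = cho_solve((c, low), y - m_y)               # step 2: Sigma_y z = y - m_y
    theta_hat = m_theta + Sigma_theta_y @ z        # step 3: apply the gain
    M = cho_solve((c, low), Sigma_theta_y.T)       # step 4: Sigma_y M = Sigma_{y,theta}
    Sigma_post = Sigma_theta - Sigma_theta_y @ M   # step 5: error covariance
    return theta_hat, Sigma_post
```

For a stream of observations sharing the same covariances, the factorization and steps 4 and 5 can be computed once and cached; only steps 1 through 3 need to run per observation.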

Theorem: Orthogonality for LMMSE

The LMMSE residual $\mathbf{e} = \boldsymbol\theta - \hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y})$ is zero-mean and uncorrelated with every affine function of $\mathbf{Y}$:

$$\mathbb{E}[\mathbf{e}] = \mathbf{0}, \qquad \mathbb{E}[\mathbf{e}\, \mathbf{Y}^\top] = \mathbf{0}.$$

LMMSE is the projection of $\boldsymbol\theta$ onto the span of the components of $\mathbf{Y}$ (plus the constant function). The residual is orthogonal to that subspace, hence uncorrelated with linear functions of $\mathbf{Y}$. Unlike the full MMSE, there is no guarantee that the residual is uncorrelated with nonlinear functions of $\mathbf{Y}$.
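
A quick Monte Carlo check of this property on a made-up jointly Gaussian pair (all numbers below are arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy model: theta in R^2, Y = H theta + W in R^3
H = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.3]])
Sigma_theta = np.array([[2.0, 0.4], [0.4, 1.0]])
sigma_w2 = 0.5

theta = rng.multivariate_normal(np.zeros(2), Sigma_theta, size=n)
Y = theta @ H.T + np.sqrt(sigma_w2) * rng.standard_normal((n, 3))

# Model second moments (all means are zero here)
Sigma_y = H @ Sigma_theta @ H.T + sigma_w2 * np.eye(3)
Sigma_theta_y = Sigma_theta @ H.T

A = np.linalg.solve(Sigma_y, Sigma_theta_y.T).T   # gain, without forming the inverse
e = theta - Y @ A.T                               # residual theta - theta_hat

print(e.mean(axis=0))    # ~ 0 (zero-mean residual)
print(e.T @ Y / n)       # ~ 0 matrix (uncorrelated with Y)
```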

Example: Vector Gaussian Observation Model

Let $\boldsymbol\theta \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma_\theta)$ and $\mathbf{Y} = \mathbf{H}\boldsymbol\theta + \mathbf{W}$, with $\mathbf{W} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$ independent of $\boldsymbol\theta$. Find the LMMSE estimator and its error covariance.
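
One way to work it out: because $\mathbf{W}$ is independent of $\boldsymbol\theta$ and everything is zero-mean, the required second moments follow directly from the model,

$$\boldsymbol\Sigma_{\theta y} = \boldsymbol\Sigma_\theta \mathbf{H}^\top, \qquad \boldsymbol\Sigma_y = \mathbf{H}\boldsymbol\Sigma_\theta\mathbf{H}^\top + \sigma_w^2 \mathbf{I},$$

so the LMMSE formula gives

$$\hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) = \boldsymbol\Sigma_\theta \mathbf{H}^\top \big(\mathbf{H}\boldsymbol\Sigma_\theta\mathbf{H}^\top + \sigma_w^2 \mathbf{I}\big)^{-1} \mathbf{Y}, \qquad \boldsymbol\Sigma_{\theta|y} = \boldsymbol\Sigma_\theta - \boldsymbol\Sigma_\theta \mathbf{H}^\top \big(\mathbf{H}\boldsymbol\Sigma_\theta\mathbf{H}^\top + \sigma_w^2 \mathbf{I}\big)^{-1} \mathbf{H}\boldsymbol\Sigma_\theta.$$

Because the model is jointly Gaussian, this linear estimator is also the full MMSE estimator.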

LMMSE MSE vs. SNR

For the model $Y = \theta + W$ with $\theta \sim \mathcal{N}(0,\sigma_\theta^2)$ and $W \sim \mathcal{N}(0,\sigma_w^2)$, compare the LMMSE MSE with the prior-only MSE ($\sigma_\theta^2$) and the noise floor ($\sigma_w^2$) as the SNR varies.

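A short numerical sketch of the comparison, taking SNR as $\sigma_\theta^2/\sigma_w^2$ (the dB grid below is illustrative):

```python
import numpy as np

sigma_theta2 = 1.0                      # prior variance (illustrative)
snr_db = np.linspace(-10, 30, 9)        # SNR grid in dB (illustrative)
sigma_w2 = sigma_theta2 / 10.0 ** (snr_db / 10.0)

# Scalar LMMSE MSE for Y = theta + W
mse_lmmse = sigma_theta2 * sigma_w2 / (sigma_theta2 + sigma_w2)

for db, m, w in zip(snr_db, mse_lmmse, sigma_w2):
    print(f"SNR {db:6.1f} dB | LMMSE MSE {m:8.4f} | prior {sigma_theta2:.4f} | noise floor {w:8.4f}")
```

At low SNR the LMMSE MSE approaches the prior-only MSE $\sigma_\theta^2$; at high SNR it approaches the noise floor $\sigma_w^2$; it is never larger than either.
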
⚠️Engineering Note

Numerical Conditioning of $\boldsymbol\Sigma_y$

In practice the LMMSE requires solving $\boldsymbol\Sigma_y \mathbf{z} = \mathbf{y} - \mathbf{m}_y$. When pilots are correlated or the noise is small, $\boldsymbol\Sigma_y$ can be ill-conditioned, and a naive solve amplifies numerical noise. The standard remedy is diagonal loading, $\boldsymbol\Sigma_y \leftarrow \boldsymbol\Sigma_y + \epsilon \mathbf{I}$, with $\epsilon$ chosen so that $\kappa(\boldsymbol\Sigma_y + \epsilon \mathbf{I})$ stays within a safe range (e.g., below $10^6$ in double precision). This is equivalent to adding $\epsilon \mathbf{I}$ to the assumed noise covariance, which trades a small increase in MSE for a large gain in numerical stability.
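
A rough sketch of one way to apply the loading; the threshold and the rule for choosing $\epsilon$ are illustrative choices, not a prescription:

```python
import numpy as np

def load_if_ill_conditioned(Sigma_y, kappa_max=1e6):
    """Return (possibly loaded) Sigma_y and the loading epsilon used.

    If the condition number exceeds kappa_max, epsilon is chosen from the
    extreme eigenvalues so the loaded matrix has condition number ~kappa_max.
    """
    eigvals = np.linalg.eigvalsh(Sigma_y)          # symmetric input assumed
    lam_min, lam_max = eigvals[0], eigvals[-1]
    kappa = lam_max / max(lam_min, np.finfo(float).tiny)
    if kappa <= kappa_max:
        return Sigma_y, 0.0
    # Choose eps so that (lam_max + eps) / (lam_min + eps) = kappa_max
    eps = (lam_max - kappa_max * lam_min) / (kappa_max - 1.0)
    return Sigma_y + eps * np.eye(Sigma_y.shape[0]), eps
```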

Practical Constraints
  • Double precision loses accuracy when $\kappa(\boldsymbol\Sigma_y) > 10^8$

  • Real-time systems use Cholesky updates instead of refactoring from scratch

Innovation

The zero-mean quantity $\mathbf{Y} - \mathbb{E}[\mathbf{Y}]$ (or, more generally, $\mathbf{Y} - \hat{\mathbf{Y}}$, where $\hat{\mathbf{Y}}$ is a linear predictor of $\mathbf{Y}$ from past information). The innovation carries the "new" information in the observation that could not be predicted from what was already known.

Related: Orthogonality Principle (Unrestricted MMSE)

LMMSE ≠ BLUE (in general)

In classical (non-Bayesian) estimation, the best linear unbiased estimator (BLUE) is the minimum-variance estimator among unbiased linear functions of the observation. In the Bayesian setting we do not constrain unbiasedness; the LMMSE can be biased (and usually is, via the shrinkage toward $\mathbf{m}_\theta$), trading a little bias for a lot of variance reduction. When the prior is non-informative ($\boldsymbol\Sigma_\theta^{-1} \to \mathbf{0}$), the two coincide.
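
For the linear model of the earlier example (assuming $\boldsymbol\Sigma_\theta$ is invertible and $\mathbf{H}$ has full column rank), this can be read off the equivalent information form of the estimator, obtained via the matrix inversion lemma:

$$\hat{\boldsymbol\theta}_{\text{LMMSE}}(\mathbf{Y}) \;=\; \Big(\boldsymbol\Sigma_\theta^{-1} + \tfrac{1}{\sigma_w^2}\,\mathbf{H}^\top\mathbf{H}\Big)^{-1} \tfrac{1}{\sigma_w^2}\,\mathbf{H}^\top \mathbf{Y} \;\longrightarrow\; (\mathbf{H}^\top\mathbf{H})^{-1}\mathbf{H}^\top \mathbf{Y} \quad \text{as } \boldsymbol\Sigma_\theta^{-1} \to \mathbf{0},$$

and the limit is exactly the BLUE (here, ordinary least squares) for this model.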

Quick Check

Which of the following quantities does the LMMSE estimator not need?

$\mathbb{E}[\boldsymbol\theta]$

$\boldsymbol\Sigma_y$

$\boldsymbol\Sigma_{\theta y}$

The third moment $\mathbb{E}[\theta^3]$