MAP and MMSE Estimators

From Posterior to Estimate

A posterior density is not yet a usable estimate: a receiver cannot act on a density, it needs a number. Converting a posterior into a single estimate requires choosing a cost function. Two cost functions dominate practice: the zero–one cost (credit only for hitting the right value exactly) and the squared-error cost (penalty grows quadratically with the miss). These two choices lead to the MAP and MMSE estimators, respectively.

Definition:

Maximum a Posteriori (MAP) Estimator

Given a Bayesian model with posterior $f_{\theta|Y}(\theta|y)$, the maximum a posteriori estimator is

$$\hat\theta_{\text{MAP}}(y) \;=\; \arg\max_{\theta \in \Theta}\, f_{\theta|Y}(\theta|y) \;=\; \arg\max_{\theta \in \Theta}\, f_{Y|\theta}(y|\theta)\, f_\theta(\theta),$$

where the second equality follows from Bayes' rule because the marginal $f_Y(y)$ does not depend on $\theta$.

With a flat prior (or in the limit $\sigma_0 \to \infty$), the MAP estimator reduces to the MLE: $\hat\theta_{\text{MAP}} = g_{\text{ml}}(y)$. MAP is the Bayesian estimator for the degenerate zero–one cost $c(\theta, \hat\theta) = \mathbb{1}\{|\theta - \hat\theta| > \epsilon\}$ in the limit $\epsilon \to 0$, so it is the "most likely parameter" answer but not the "best on average" answer.
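As a quick numerical illustration of the flat-prior limit, here is a minimal sketch (not from the text; the scalar-Gaussian model, sample size, and parameter values are illustrative assumptions) that finds the MAP estimate by a grid search over the log-posterior and compares it with the MLE as the prior widens:

```python
import numpy as np

# Illustrative scalar-Gaussian setup (assumed, not from the text):
# prior theta ~ N(0, sigma0^2), observations Y_i = theta + W_i, W_i ~ N(0, sigma_w^2).
rng = np.random.default_rng(0)
sigma_w, theta_true = 1.0, 1.5
y = theta_true + sigma_w * rng.standard_normal(20)

grid = np.linspace(-5.0, 5.0, 20001)                      # candidate theta values
log_lik = -0.5 * ((y[:, None] - grid) ** 2).sum(axis=0) / sigma_w**2

for sigma0 in (0.5, 2.0, 100.0):                          # increasingly flat prior
    log_prior = -0.5 * grid**2 / sigma0**2
    theta_map = grid[np.argmax(log_lik + log_prior)]      # maximize log-posterior
    print(f"sigma0={sigma0:6.1f}  MAP={theta_map:.3f}  MLE={y.mean():.3f}")
```

As $\sigma_0$ grows, the printed MAP value approaches the sample mean, which is the MLE for this model.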

Definition:

Minimum Mean-Square Error (MMSE) Estimator

The minimum mean-square error estimator is the function $\hat\theta: \mathcal{Y} \to \mathbb{R}^n$ minimizing the Bayes risk under squared-error cost:

$$\hat\theta_{\text{MMSE}} \;=\; \arg\min_{\hat\theta(\cdot)}\; \mathbb{E}\!\left[\, \|\boldsymbol\theta - \hat\theta(\mathbf{Y})\|^2 \,\right].$$

The minimization ranges over all (measurable) functions of the observation. We write the optimal estimator as $g_{\text{mmse}}(y)$.

The squared-error cost penalizes large deviations disproportionately and enjoys clean algebraic properties (quadratic, convex, differentiable) that make closed-form analysis possible.

Theorem: MMSE Estimator Equals the Conditional Mean

For any jointly distributed $(\boldsymbol\theta, \mathbf{Y})$ with $\mathbb{E}[\|\boldsymbol\theta\|^2] < \infty$, the MMSE estimator is

$$\boxed{\; g_{\text{mmse}}(\mathbf{Y}) \;=\; \mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]. \;}$$

In particular, the MMSE estimator depends on the posterior only through its mean.

For each fixed value $\mathbf{Y} = \mathbf{y}$, the MMSE task reduces to choosing a constant $\hat\theta$ to minimize $\mathbb{E}[\|\boldsymbol\theta - \hat\theta\|^2 \mid \mathbf{Y} = \mathbf{y}]$. A standard calculation shows that the minimizer of $\mathbb{E}[(X - a)^2]$ over scalars $a$ is $a = \mathbb{E}[X]$; the vector case is identical component by component.
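For completeness, the calculation referenced above is a one-line expansion of the quadratic:

$$\mathbb{E}\big[(X - a)^2\big] \;=\; \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] + \big(\mathbb{E}[X] - a\big)^2 \;=\; \mathrm{Var}(X) + \big(\mathbb{E}[X] - a\big)^2,$$

where the cross term vanishes because $\mathbb{E}[X - \mathbb{E}[X]] = 0$; the expression is minimized exactly at $a = \mathbb{E}[X]$.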

Theorem: Value of the MMSE

With $\hat\theta_{\text{MMSE}}(\mathbf{Y}) = \mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]$, the minimum achievable mean-square error equals the expected posterior variance:

$$\text{MMSE} \;=\; \mathbb{E}\!\left[\, \mathrm{tr}\big(\mathrm{Cov}(\boldsymbol\theta \mid \mathbf{Y})\big)\,\right] \;=\; \mathbb{E}\!\left[\|\boldsymbol\theta\|^2\right] - \mathbb{E}\!\left[\|\mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]\|^2\right].$$

Observing $\mathbf{Y}$ removes all variability of $\boldsymbol\theta$ that is predictable from $\mathbf{Y}$; what remains is the residual variance inside each "slice" $\mathbf{Y} = \mathbf{y}$. Averaging these residual variances over the marginal distribution of $\mathbf{Y}$ gives the MMSE.
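In the scalar case this is the law of total variance rearranged: since $\mathrm{Var}(\theta) = \mathbb{E}[\mathrm{Var}(\theta \mid Y)] + \mathrm{Var}(\mathbb{E}[\theta \mid Y])$,

$$\text{MMSE} \;=\; \mathbb{E}[\mathrm{Var}(\theta \mid Y)] \;=\; \mathrm{Var}(\theta) - \mathrm{Var}\big(\mathbb{E}[\theta \mid Y]\big) \;\le\; \mathrm{Var}(\theta),$$

so observing $Y$ can never do worse than simply reporting the prior mean.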

Example: MMSE for the Scalar Gaussian Model

Let $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$ and $Y = \theta + W$ with $W \sim \mathcal{N}(0, \sigma_w^2)$ independent of $\theta$. Compute $\hat\theta_{\text{MMSE}}(y)$, $\hat\theta_{\text{MAP}}(y)$, and the MMSE.
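One way to carry out the computation, as a sketch (using the standard fact that a Gaussian prior observed in independent Gaussian noise has a Gaussian posterior):

$$\theta \mid Y = y \;\sim\; \mathcal{N}\!\left(\frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_w^2}\, y,\; \frac{\sigma_\theta^2 \sigma_w^2}{\sigma_\theta^2 + \sigma_w^2}\right),$$

so the mean and the mode coincide,

$$\hat\theta_{\text{MMSE}}(y) = \hat\theta_{\text{MAP}}(y) = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_w^2}\, y, \qquad \text{MMSE} = \mathbb{E}[\mathrm{Var}(\theta \mid Y)] = \frac{\sigma_\theta^2 \sigma_w^2}{\sigma_\theta^2 + \sigma_w^2}.$$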

Example: Binary Signal in Gaussian Noise: MMSE is a Sigmoid

Let $\theta \in \{+1, -1\}$ equiprobable and $Y = h\theta + W$ with $W \sim \mathcal{N}(0, \sigma_w^2)$. Compute the MMSE estimator $\hat\theta_{\text{MMSE}}(y) = \mathbb{E}[\theta \mid Y = y]$.
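A sketch of the derivation, which yields the tanh form used in the comparison below: with equal priors, $P(\theta = \pm 1 \mid y) \propto \exp\!\big(-(y \mp h)^2 / (2\sigma_w^2)\big) \propto e^{\pm hy/\sigma_w^2}$ after cancelling the common factor $e^{-(y^2 + h^2)/(2\sigma_w^2)}$, so

$$\hat\theta_{\text{MMSE}}(y) = (+1)\,P(\theta = +1 \mid y) + (-1)\,P(\theta = -1 \mid y) = \frac{e^{hy/\sigma_w^2} - e^{-hy/\sigma_w^2}}{e^{hy/\sigma_w^2} + e^{-hy/\sigma_w^2}} = \tanh\!\left(\frac{hy}{\sigma_w^2}\right).$$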

MMSE vs. Linear Estimator for Binary $\theta$

Compare the true MMSE estimator $\tanh(hy/\sigma_w^2)$ (sigmoid) with the best linear estimator $\frac{h}{h^2 + \sigma_w^2}\, y$ for $\theta \in \{\pm 1\}$ in Gaussian noise. The linear curve extrapolates outside $[-1, 1]$; the true MMSE saturates.

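A minimal Monte Carlo sketch of this comparison (the values of $h$ and $\sigma_w$ below are illustrative choices, not fixed by the text):

```python
import numpy as np

# Compare the conditional-mean (MMSE) estimator tanh(h*y/sigma_w^2) with the
# best linear estimator h/(h^2 + sigma_w^2) * y for theta in {+1, -1}.
rng = np.random.default_rng(1)
h, sigma_w, n = 2.0, 1.0, 200_000          # illustrative parameters

theta = rng.choice([+1.0, -1.0], size=n)   # equiprobable symbols
y = h * theta + sigma_w * rng.standard_normal(n)

mmse_est = np.tanh(h * y / sigma_w**2)     # saturates at +/- 1
lin_est = h / (h**2 + sigma_w**2) * y      # straight line, extrapolates freely

print("MSE, conditional mean:", np.mean((theta - mmse_est) ** 2))
print("MSE, best linear     :", np.mean((theta - lin_est) ** 2))
# The conditional-mean estimator is never worse in mean-square error.
```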

MAP vs. MMSE Estimators

| Aspect | MAP | MMSE |
| --- | --- | --- |
| Cost function | Zero–one (peak of posterior) | Squared error |
| Formula | $\arg\max_\theta f_{\theta\mid Y}(\theta\mid y)$ | $\mathbb{E}[\theta \mid Y = y]$ |
| Shape of output | A mode of the posterior | The mean of the posterior |
| Flat-prior limit | Reduces to MLE | Conditional mean under flat prior |
| Computation | Optimization problem | Integral $\int \theta\, f_{\theta\mid Y}(\theta\mid y)\, d\theta$ |
| Unimodal symmetric posterior | Coincides with the mean | Coincides with the mode |
| Multi-modal posterior | Can jump discontinuously in $y$ | Smooth in $y$ (averaging) |
| Discrete $\theta$ | Natural (returns a single element of the set) | Unnatural (mean typically lies outside the set) |

MAP as Regularized MLE

Taking the logarithm of the MAP objective,

$$\hat\theta_{\text{MAP}}(y) = \arg\max_\theta \big[\, \log f_{Y|\theta}(y|\theta) + \log f_\theta(\theta) \,\big].$$

The first term is the log-likelihood from Chapter 6; the second is a regularizer that penalizes parameter values the prior considers unlikely. This is exactly the structure of ridge regression (Gaussian prior, quadratic penalty) and of the LASSO (Laplace prior, $\ell_1$ penalty), and it is why MAP is sometimes called "regularized ML".
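To make the ridge connection concrete (a standard identity; the symbols $\sigma_0^2$ for the prior variance and $b$ for the Laplace scale are introduced only for this note): with a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma_0^2 I)$, $\log f_\theta(\theta) = -\|\theta\|^2 / (2\sigma_0^2) + \text{const}$, so

$$\hat\theta_{\text{MAP}}(y) = \arg\max_\theta \Big[\, \log f_{Y|\theta}(y|\theta) - \lambda \|\theta\|^2 \,\Big], \qquad \lambda = \frac{1}{2\sigma_0^2},$$

which is the ridge objective when the likelihood is Gaussian; a Laplace prior $f_\theta(\theta) \propto e^{-\|\theta\|_1 / b}$ turns the penalty into $\|\theta\|_1 / b$, the LASSO.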

Quick Check

The posterior $f_{\theta|Y}(\theta|y)$ is the uniform density on the interval $[0, 2y]$ (for some fixed $y > 0$). Which estimator returns the value $\theta = y$?

MAP only

MMSE only

Both MAP and MMSE

Neither

Common Mistake: Don't Use MMSE on a Discrete Parameter

Mistake:

A student applies the MMSE formula $\hat\theta = \mathbb{E}[\theta \mid Y]$ to estimate a symbol $\theta \in \{+1, -1\}$ and reports an answer like $\hat\theta = 0.37$ for a hard-decision receiver.

Correction:

The MMSE estimator minimizes squared error, not symbol error. For a discrete parameter taking finitely many values, the MMSE estimate generally lies outside the parameter set, which is meaningless for a decoder that must output a symbol. For discrete $\theta$, use the MAP estimator (maximum posterior probability) or, equivalently under uniform priors, the MLE. The MMSE value of a discrete parameter is still useful as a soft-information statistic (e.g., for belief propagation), but it is not itself a decoded symbol.
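As a small illustration in the setting of the binary example above (parameter and sample values chosen only for the demonstration), the conditional mean is reported as soft information while the decoded symbol comes from a MAP hard decision:

```python
import numpy as np

# Soft value vs. hard decision for theta in {+1, -1} equiprobable, Y = h*theta + W.
h, sigma_w = 2.0, 1.0                     # illustrative parameters
y = 0.3                                   # a received sample

soft = np.tanh(h * y / sigma_w**2)        # MMSE / conditional mean: soft information
hard = 1.0 if y >= 0 else -1.0            # MAP (= ML here, h > 0): a valid symbol

print(f"soft value E[theta|y] = {soft:.3f},  decoded symbol = {hard:+.0f}")
```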