MAP and MMSE Estimators

From Posterior to Estimate

A posterior density is not yet a usable estimate: a receiver cannot act on a density, it needs a number. Converting a posterior into a single estimate requires choosing a cost function. Two cost functions dominate practice: the zero–one cost (credit only for hitting the right value exactly) and the squared-error cost (penalty grows quadratically with the miss). These two choices lead to the MAP and MMSE estimators, respectively.

Definition:

Maximum a Posteriori (MAP) Estimator

Given a Bayesian model with posterior $f_{\theta|Y}(\theta|y)$, the maximum a posteriori estimator is

$$\hat\theta_{\text{MAP}}(y) \;=\; \arg\max_{\theta \in \Theta}\, f_{\theta|Y}(\theta|y) \;=\; \arg\max_{\theta \in \Theta}\, f_{Y|\theta}(y|\theta)\, f_\theta(\theta),$$

where the second equality follows from Bayes' rule because the marginal $f_Y(y)$ does not depend on $\theta$.

With a flat prior (or in the limit $\sigma_0 \to \infty$), the MAP estimator reduces to the MLE: $\hat\theta_{\text{MAP}} = g_{\text{ml}}(y)$. MAP is the Bayesian estimator for the degenerate zero–one cost $c(\theta, \hat\theta) = \mathbb{1}\{|\theta - \hat\theta| > \epsilon\}$ in the limit $\epsilon \to 0$, so it is the "most likely parameter" answer but not the "best on average" answer.
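As a quick numerical illustration of the flat-prior limit, here is a minimal sketch (not from the text; the scalar-Gaussian model, sample size, and parameter values are illustrative assumptions) that finds the MAP estimate by a grid search over the log-posterior and compares it with the MLE as the prior widens:

```python
import numpy as np

# Illustrative scalar-Gaussian setup (assumed, not from the text):
# prior theta ~ N(0, sigma0^2), observations Y_i = theta + W_i, W_i ~ N(0, sigma_w^2).
rng = np.random.default_rng(0)
sigma_w, theta_true = 1.0, 1.5
y = theta_true + sigma_w * rng.standard_normal(20)

grid = np.linspace(-5.0, 5.0, 20001)                      # candidate theta values
log_lik = -0.5 * ((y[:, None] - grid) ** 2).sum(axis=0) / sigma_w**2

for sigma0 in (0.5, 2.0, 100.0):                          # increasingly flat prior
    log_prior = -0.5 * grid**2 / sigma0**2
    theta_map = grid[np.argmax(log_lik + log_prior)]      # maximize log-posterior
    print(f"sigma0={sigma0:6.1f}  MAP={theta_map:.3f}  MLE={y.mean():.3f}")
```

As $\sigma_0$ grows, the printed MAP value approaches the sample mean, which is the MLE for this model.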

Definition:

Minimum Mean-Square Error (MMSE) Estimator

The minimum mean-square error estimator is the function $\hat\theta: \mathcal{Y} \to \mathbb{R}^n$ minimizing the Bayes risk under squared-error cost:

$$\hat\theta_{\text{MMSE}} \;=\; \arg\min_{\hat\theta(\cdot)}\; \mathbb{E}\!\left[\, \|\boldsymbol\theta - \hat\theta(\mathbf{Y})\|^2 \,\right].$$

The minimization ranges over all (measurable) functions of the observation. We write the optimal estimator as $g_{\text{mmse}}(y)$.

The squared-error cost penalizes large deviations disproportionately and enjoys clean algebraic properties (quadratic, convex, differentiable) that make closed-form analysis possible.

Theorem: MMSE Estimator Equals the Conditional Mean

For any jointly distributed $(\boldsymbol\theta, \mathbf{Y})$ with $\mathbb{E}[\|\boldsymbol\theta\|^2] < \infty$, the MMSE estimator is

$$\boxed{\; g_{\text{mmse}}(\mathbf{Y}) \;=\; \mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]. \;}$$

In particular, the MMSE estimator depends on the posterior only through its mean.

For each fixed value $\mathbf{Y} = \mathbf{y}$, the MMSE task reduces to choosing a constant $\hat\theta$ to minimize $\mathbb{E}[\|\boldsymbol\theta - \hat\theta\|^2 \mid \mathbf{Y} = \mathbf{y}]$. A standard calculation shows that the minimizer of $\mathbb{E}[(X - a)^2]$ over scalars $a$ is $a = \mathbb{E}[X]$; the vector case is identical component by component.
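For completeness, the calculation referenced above is a one-line expansion of the quadratic:

$$\mathbb{E}\big[(X - a)^2\big] \;=\; \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] + \big(\mathbb{E}[X] - a\big)^2 \;=\; \mathrm{Var}(X) + \big(\mathbb{E}[X] - a\big)^2,$$

where the cross term vanishes because $\mathbb{E}[X - \mathbb{E}[X]] = 0$; the expression is minimized exactly at $a = \mathbb{E}[X]$.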

Theorem: Value of the MMSE

With $\hat\theta_{\text{MMSE}}(\mathbf{Y}) = \mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]$, the minimum achievable mean-square error equals the expected posterior variance:

$$\text{MMSE} \;=\; \mathbb{E}\!\left[\, \mathrm{tr}\big(\mathrm{Cov}(\boldsymbol\theta \mid \mathbf{Y})\big)\,\right] \;=\; \mathbb{E}\!\left[\|\boldsymbol\theta\|^2\right] - \mathbb{E}\!\left[\|\mathbb{E}[\boldsymbol\theta \mid \mathbf{Y}]\|^2\right].$$

Observing $\mathbf{Y}$ removes all variability of $\boldsymbol\theta$ that is predictable from $\mathbf{Y}$; what remains is the residual variance inside each "slice" $\mathbf{Y} = \mathbf{y}$. Averaging these residual variances over the marginal distribution of $\mathbf{Y}$ gives the MMSE.
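In the scalar case this is the law of total variance rearranged: since $\mathrm{Var}(\theta) = \mathbb{E}[\mathrm{Var}(\theta \mid Y)] + \mathrm{Var}(\mathbb{E}[\theta \mid Y])$,

$$\text{MMSE} \;=\; \mathbb{E}[\mathrm{Var}(\theta \mid Y)] \;=\; \mathrm{Var}(\theta) - \mathrm{Var}\big(\mathbb{E}[\theta \mid Y]\big) \;\le\; \mathrm{Var}(\theta),$$

so observing $Y$ can never do worse than simply reporting the prior mean.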

Example: MMSE for the Scalar Gaussian Model

Let $\theta \sim \mathcal{N}(0, \sigma_\theta^2)$ and $Y = \theta + W$ with $W \sim \mathcal{N}(0, \sigma_w^2)$ independent of $\theta$. Compute $\hat\theta_{\text{MMSE}}(y)$, $\hat\theta_{\text{MAP}}(y)$, and the MMSE.
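One way to carry out the computation, as a sketch (using the standard fact that a Gaussian prior observed in independent Gaussian noise has a Gaussian posterior):

$$\theta \mid Y = y \;\sim\; \mathcal{N}\!\left(\frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_w^2}\, y,\; \frac{\sigma_\theta^2 \sigma_w^2}{\sigma_\theta^2 + \sigma_w^2}\right),$$

so the mean and the mode coincide,

$$\hat\theta_{\text{MMSE}}(y) = \hat\theta_{\text{MAP}}(y) = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_w^2}\, y, \qquad \text{MMSE} = \mathbb{E}[\mathrm{Var}(\theta \mid Y)] = \frac{\sigma_\theta^2 \sigma_w^2}{\sigma_\theta^2 + \sigma_w^2}.$$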

Example: Binary Signal in Gaussian Noise: MMSE is a Sigmoid

Let $\theta \in \{+1, -1\}$ equiprobable and $Y = h\theta + W$ with $W \sim \mathcal{N}(0, \sigma_w^2)$. Compute the MMSE estimator $\hat\theta_{\text{MMSE}}(y) = \mathbb{E}[\theta \mid Y = y]$.
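A sketch of the derivation, which yields the tanh form used in the comparison below: with equal priors, $P(\theta = \pm 1 \mid y) \propto \exp\!\big(-(y \mp h)^2 / (2\sigma_w^2)\big) \propto e^{\pm hy/\sigma_w^2}$ after cancelling the common factor $e^{-(y^2 + h^2)/(2\sigma_w^2)}$, so

$$\hat\theta_{\text{MMSE}}(y) = (+1)\,P(\theta = +1 \mid y) + (-1)\,P(\theta = -1 \mid y) = \frac{e^{hy/\sigma_w^2} - e^{-hy/\sigma_w^2}}{e^{hy/\sigma_w^2} + e^{-hy/\sigma_w^2}} = \tanh\!\left(\frac{hy}{\sigma_w^2}\right).$$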

MMSE vs. Linear Estimator for Binary $\theta$

Compare the true MMSE estimator $\tanh(hy/\sigma_w^2)$ (sigmoid) with the best linear estimator $\frac{h}{h^2 + \sigma_w^2}\, y$ for $\theta \in \{\pm 1\}$ in Gaussian noise. The linear curve extrapolates outside $[-1, 1]$; the true MMSE saturates.

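A minimal Monte Carlo sketch of this comparison (the values of $h$ and $\sigma_w$ below are illustrative choices, not fixed by the text):

```python
import numpy as np

# Compare the conditional-mean (MMSE) estimator tanh(h*y/sigma_w^2) with the
# best linear estimator h/(h^2 + sigma_w^2) * y for theta in {+1, -1}.
rng = np.random.default_rng(1)
h, sigma_w, n = 2.0, 1.0, 200_000          # illustrative parameters

theta = rng.choice([+1.0, -1.0], size=n)   # equiprobable symbols
y = h * theta + sigma_w * rng.standard_normal(n)

mmse_est = np.tanh(h * y / sigma_w**2)     # saturates at +/- 1
lin_est = h / (h**2 + sigma_w**2) * y      # straight line, extrapolates freely

print("MSE, conditional mean:", np.mean((theta - mmse_est) ** 2))
print("MSE, best linear     :", np.mean((theta - lin_est) ** 2))
# The conditional-mean estimator is never worse in mean-square error.
```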

MAP vs. MMSE Estimators

| Aspect | MAP | MMSE |
| --- | --- | --- |
| Cost function | Zero–one (peak of posterior) | Squared error |
| Formula | $\arg\max_\theta f_{\theta\mid Y}(\theta\mid y)$ | $\mathbb{E}[\theta \mid Y = y]$ |
| Shape of output | A mode of the posterior | The mean of the posterior |
| Flat-prior limit | Reduces to MLE | Conditional mean under flat prior |
| Computation | Optimization problem | Integral $\int \theta\, f_{\theta\mid Y}(\theta\mid y)\, d\theta$ |
| Unimodal symmetric posterior | Coincides with the mean | Coincides with the mode |
| Multi-modal posterior | Can jump discontinuously in $y$ | Smooth in $y$ (averaging) |
| Discrete $\theta$ | Natural (returns a single element of the set) | Unnatural (mean typically lies outside the set) |

MAP as Regularized MLE

Taking the logarithm of the MAP objective,

$$\hat\theta_{\text{MAP}}(y) = \arg\max_\theta \big[\, \log f_{Y|\theta}(y|\theta) + \log f_\theta(\theta) \,\big].$$

The first term is the log-likelihood from Chapter 6; the second is a regularizer that penalizes parameter values the prior considers unlikely. This is exactly the structure of ridge regression (Gaussian prior, quadratic penalty) and of the LASSO (Laplace prior, $\ell_1$ penalty), and it is why MAP is sometimes called "regularized ML".
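To make the ridge connection concrete (a standard identity; the symbols $\sigma_0^2$ for the prior variance and $b$ for the Laplace scale are introduced only for this note): with a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma_0^2 I)$, $\log f_\theta(\theta) = -\|\theta\|^2 / (2\sigma_0^2) + \text{const}$, so

$$\hat\theta_{\text{MAP}}(y) = \arg\max_\theta \Big[\, \log f_{Y|\theta}(y|\theta) - \lambda \|\theta\|^2 \,\Big], \qquad \lambda = \frac{1}{2\sigma_0^2},$$

which is the ridge objective when the likelihood is Gaussian; a Laplace prior $f_\theta(\theta) \propto e^{-\|\theta\|_1 / b}$ turns the penalty into $\|\theta\|_1 / b$, the LASSO.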

Quick Check

The posterior $f_{\theta|Y}(\theta|y)$ is the uniform density on the interval $[0, 2y]$ (for some fixed $y > 0$). Which estimator returns the value $\theta = y$?

MAP only

MMSE only

Both MAP and MMSE

Neither

Common Mistake: Don't Use MMSE on a Discrete Parameter

Mistake:

A student applies the MMSE formula $\hat\theta = \mathbb{E}[\theta \mid Y]$ to estimate a symbol $\theta \in \{+1, -1\}$ and reports an answer like $\hat\theta = 0.37$ for a hard-decision receiver.

Correction:

The MMSE estimator minimizes squared error, not symbol error. For a discrete parameter taking finitely many values, the MMSE estimate generally lies outside the parameter set, which is meaningless for a decoder that must output a symbol. For discrete $\theta$, use the MAP estimator (maximum posterior probability) or, equivalently under uniform priors, the MLE. The MMSE value of a discrete parameter is still useful as a soft-information statistic (e.g., for belief propagation), but it is not itself a decoded symbol.
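As a small illustration in the setting of the binary example above (parameter and sample values chosen only for the demonstration), the conditional mean is reported as soft information while the decoded symbol comes from a MAP hard decision:

```python
import numpy as np

# Soft value vs. hard decision for theta in {+1, -1} equiprobable, Y = h*theta + W.
h, sigma_w = 2.0, 1.0                     # illustrative parameters
y = 0.3                                   # a received sample

soft = np.tanh(h * y / sigma_w**2)        # MMSE / conditional mean: soft information
hard = 1.0 if y >= 0 else -1.0            # MAP (= ML here, h > 0): a valid symbol

print(f"soft value E[theta|y] = {soft:.3f},  decoded symbol = {hard:+.0f}")
```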