The Bayesian Framework

Why Go Bayesian?

In Chapter 6 we treated the parameter $\theta$ as a deterministic but unknown constant and produced the maximum likelihood estimator. That worldview is powerful when we have no prior knowledge, but it is also wasteful when we do. In a communication system we know, for instance, that a Rayleigh fading coefficient is circularly symmetric complex Gaussian with known variance — ignoring that information and running the MLE throws away statistical structure that the receiver could exploit.

The Bayesian framework treats $\theta$ itself as a random variable with a prior distribution reflecting what we know before seeing the data. The observation then updates the prior into the posterior distribution, which encodes everything the data tell us about $\theta$. Every Bayesian estimator — MAP, MMSE, or any other — is nothing more than a summary statistic of the posterior chosen to minimize a specific cost.

Definition: Bayesian Estimation Model

A Bayesian estimation problem is specified by a joint distribution on the parameter–observation pair $(\theta, Y)$, where $\theta \in \Theta \subseteq \mathbb{R}^n$ is the quantity of interest and $Y \in \mathcal{Y} \subseteq \mathbb{R}^m$ is the observation. The joint distribution is determined by:

  • a prior density $f_\theta(\theta)$ on $\Theta$,
  • a likelihood $f_{Y|\theta}(y|\theta)$, the conditional density of the observation given $\theta$.

An estimator is a measurable function $\hat\theta: \mathcal{Y} \to \Theta$. A cost function assigns a cost $c(\theta, \hat\theta(y)) \geq 0$ to each outcome, and the Bayes-optimal estimator minimizes the expected cost, the Bayes risk:
$$\mathcal{R}(\hat\theta) \;=\; \mathbb{E}[\, c(\theta, \hat\theta(Y)) \,] \;=\; \int\!\!\int c(\theta, \hat\theta(y))\, f_{Y|\theta}(y|\theta)\, f_\theta(\theta)\, dy\, d\theta .$$

Both $\theta$ and $Y$ are random here. Expectations are taken with respect to the joint distribution $f_\theta f_{Y|\theta}$.
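The Bayes risk of any candidate estimator can be estimated by averaging the cost over draws from the joint distribution. The sketch below does this for squared-error cost under an illustrative Gaussian model (the parameter values and the two candidate estimators are assumptions for the demo, not from the text):

```python
import random

# Monte Carlo estimate of the Bayes risk R(theta_hat) = E[c(theta, theta_hat(Y))]
# for squared-error cost c(theta, t) = (theta - t)^2, under the illustrative
# model: theta ~ N(0, 1), Y = theta + W, W ~ N(0, 0.5^2).

random.seed(0)
MU0, SIGMA0, SIGMA_W = 0.0, 1.0, 0.5

def bayes_risk(estimator, n_trials=200_000):
    """Average the cost over draws from the joint distribution f_theta * f_{Y|theta}."""
    total = 0.0
    for _ in range(n_trials):
        theta = random.gauss(MU0, SIGMA0)        # draw theta from the prior
        y = theta + random.gauss(0.0, SIGMA_W)   # draw the observation given theta
        total += (theta - estimator(y)) ** 2
    return total / n_trials

# Two candidate estimators: use the raw observation, or shrink it toward the
# prior mean by the factor sigma0^2 / (sigma0^2 + sigma_w^2).
shrink = SIGMA0**2 / (SIGMA0**2 + SIGMA_W**2)
risk_raw = bayes_risk(lambda y: y)
risk_shrunk = bayes_risk(lambda y: MU0 + shrink * (y - MU0))

print(risk_raw)     # close to sigma_w^2 = 0.25
print(risk_shrunk)  # close to sigma0^2 * sigma_w^2 / (sigma0^2 + sigma_w^2) = 0.2
```

The shrinkage estimator achieves a strictly smaller Bayes risk than the raw observation, which is exactly the kind of gain that exploiting the prior buys.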

Theorem: Bayes' Rule for the Posterior

For any Bayesian model with strictly positive marginal $f_Y(y) > 0$, the posterior density is
$$f_{\theta|Y}(\theta|y) \;=\; \frac{f_{Y|\theta}(y|\theta)\, f_\theta(\theta)}{f_Y(y)}, \qquad f_Y(y) = \int f_{Y|\theta}(y|\theta')\, f_\theta(\theta')\, d\theta'.$$
The denominator is independent of $\theta$, so up to this normalizing constant,
$$f_{\theta|Y}(\theta|y) \;\propto\; f_{Y|\theta}(y|\theta)\, f_\theta(\theta).$$

The posterior is the prior reweighted by how well each value of $\theta$ explains the observed data. High-likelihood values of $\theta$ that were already likely a priori dominate; values that disagree with either the prior or the data are suppressed.
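The proportionality form lends itself to a direct numerical check: evaluate prior times likelihood on a grid of $\theta$ values and normalize by the Riemann sum. A minimal stdlib-only sketch, with illustrative model parameters (prior $\mathcal{N}(0,1)$, additive $\mathcal{N}(0, 0.5^2)$ noise) that are assumptions for the demo:

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

y_obs = 1.2
step = 0.001
grid = [i * step for i in range(-6000, 6001)]          # theta in [-6, 6]

unnormalized = [normal_pdf(t, 0.0, 1.0) *              # prior f_theta(theta)
                normal_pdf(y_obs, t, 0.5)              # likelihood f(y|theta)
                for t in grid]
marginal = sum(unnormalized) * step                    # Riemann sum for f_Y(y)
posterior = [u / marginal for u in unnormalized]       # normalized posterior density

# Posterior mean via the same Riemann sum: it sits between the prior mean (0)
# and the observation (1.2), pulled toward whichever source is more precise.
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(post_mean)  # ≈ 1.2 * 1.0 / (1.0 + 0.25) = 0.96
```

This grid approach works for any one-dimensional prior and likelihood, not just conjugate pairs, which makes it a useful sanity check on closed-form posteriors.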

Example: Gaussian Prior, Gaussian Likelihood

Let $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ (prior) and let the observation be $Y = \theta + W$ with $W \sim \mathcal{N}(0, \sigma_w^2)$ independent of $\theta$. Compute the posterior density of $\theta$ given $Y = y$.
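One way to carry out the computation: multiply prior and likelihood, drop factors not involving $\theta$, and complete the square in the exponent:
$$f_{\theta|Y}(\theta|y) \;\propto\; \exp\!\left(-\frac{(y-\theta)^2}{2\sigma_w^2}\right) \exp\!\left(-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right).$$
Collecting the quadratic and linear terms in $\theta$ shows the posterior is again Gaussian, $\theta \mid Y=y \sim \mathcal{N}(\mu_{\mathrm{post}}, \sigma_{\mathrm{post}}^2)$, with
$$\mu_{\mathrm{post}} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_w^2}\, y + \frac{\sigma_w^2}{\sigma_0^2 + \sigma_w^2}\, \mu_0, \qquad \sigma_{\mathrm{post}}^2 = \frac{\sigma_0^2\, \sigma_w^2}{\sigma_0^2 + \sigma_w^2}.$$
The posterior mean is a precision-weighted average of the observation and the prior mean, and the posterior variance is smaller than both $\sigma_0^2$ and $\sigma_w^2$.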

Prior, Likelihood, and Posterior

Change the prior mean $\mu_0$, prior variance $\sigma_0^2$, noise variance $\sigma_w^2$, and observation $y$, and watch how the posterior interpolates between the prior and the likelihood.


Key Takeaway

Every Bayesian estimator depends on the data only through the posterior density $f_{\theta|Y}(\theta|y)$. Once the posterior is known, the choice of estimator is reduced to choosing which summary statistic of the posterior best serves the application.

Prior distribution

The distribution $f_\theta(\theta)$ assigned to the parameter before any observation is made. The prior encodes side information (physical constraints, statistical models of the environment, previous measurements).

Related: Posterior distribution, Likelihood and Log-Likelihood

Posterior distribution

The conditional distribution $f_{\theta|Y}(\theta|y)$ of $\theta$ given the observation $Y=y$. It summarizes everything the data, combined with the prior, tell us about the parameter.

Related: Prior distribution, Likelihood and Log-Likelihood

Common Mistake: Flat Priors Are Not Innocent

Mistake:

A newcomer argues that choosing a "flat" prior $f_\theta(\theta) \propto 1$ on an unbounded parameter space is the neutral, assumption-free choice — after all, it seems to treat every value of $\theta$ equally.

Correction:

A uniform density on an unbounded set is improper (it cannot be normalized to integrate to one), so it is not a valid prior in the strict probabilistic sense. Improper priors sometimes yield valid posteriors — but not always, and whether they do must be checked. Moreover, a flat prior in one parameterization becomes non-flat after a nonlinear change of variables: "uniform on $\theta$" and "uniform on $\theta^2$" are not the same prior. When the likelihood is strong, the prior matters little and flat priors are harmless; when the likelihood is weak, the implicit parameterization choice can dominate the inference.
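The parameterization point can be checked by simulation: draw $\theta$ uniformly on $(0,1)$, and the induced distribution of $\phi = \theta^2$ is far from uniform. A quick stdlib-only sketch (the interval and sample size are illustrative choices):

```python
import random

# If theta is uniform on (0, 1), then phi = theta^2 has density 1/(2*sqrt(phi))
# on (0, 1) by the change-of-variables formula, so it piles mass near 0 rather
# than spreading it evenly. Sampling check:

random.seed(0)
samples = [random.random() ** 2 for _ in range(100_000)]  # phi = theta^2

# Fraction of phi samples below 0.25. A uniform phi would give 0.25; here
# P(theta^2 < 0.25) = P(theta < 0.5) = 0.5.
frac_below = sum(s < 0.25 for s in samples) / len(samples)
print(frac_below)  # ≈ 0.5, not 0.25
```

So "flat in $\theta$" silently commits to a non-flat opinion about $\theta^2$, which is exactly the implicit choice the correction above warns about.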

Historical Note: Bayes, Price, and Laplace

1763–1812

The posterior formula is named for Thomas Bayes, whose 1763 essay An Essay towards solving a Problem in the Doctrine of Chances was published posthumously by his friend Richard Price. Bayes worked out the special case of a uniform prior on the parameter of a Bernoulli distribution. The general form of the rule, and the full machinery of inverse probability, is due to Pierre-Simon Laplace, who rediscovered it independently in 1774 and developed it into a working inferential calculus over the following four decades. The modern name "Bayesian" was coined only in the 1950s.