The Bayesian Framework

Why Go Bayesian?

In Chapter 6 we treated the parameter $\theta$ as a deterministic but unknown constant and produced the maximum likelihood estimator. That worldview is powerful when we have no prior knowledge, but it is also wasteful when we do. In a communication system we know, for instance, that a Rayleigh fading coefficient is circularly symmetric complex Gaussian with known variance — ignoring that information and running the MLE throws away statistical structure that the receiver could exploit.

The Bayesian framework treats $\theta$ itself as a random variable with a prior distribution reflecting what we know before seeing the data. The observation then updates the prior into the posterior distribution, which encodes everything the data tell us about $\theta$. Every Bayesian estimator — MAP, MMSE, or any other — is nothing more than a summary statistic of the posterior chosen to minimize a specific cost.

Definition: Bayesian Estimation Model

A Bayesian estimation problem is specified by a joint distribution on the parameter–observation pair $(\theta, Y)$, where $\theta \in \Theta \subseteq \mathbb{R}^n$ is the quantity of interest and $Y \in \mathcal{Y} \subseteq \mathbb{R}^m$ is the observation. The joint distribution is determined by:

  • a prior density $f_\theta(\theta)$ on $\Theta$,
  • a likelihood $f_{Y|\theta}(y|\theta)$, the conditional density of the observation given $\theta$.

An estimator is a measurable function $\hat\theta: \mathcal{Y} \to \Theta$. A cost function assigns a cost $c(\theta, \hat\theta(y)) \geq 0$ to each outcome, and the Bayes-optimal estimator minimizes the expected cost, the Bayes risk:
$$\mathcal{R}(\hat\theta) \;=\; \mathbb{E}[\, c(\theta, \hat\theta(Y)) \,] \;=\; \int\!\!\int c(\theta, \hat\theta(y))\, f_{Y|\theta}(y|\theta)\, f_\theta(\theta)\, dy\, d\theta .$$

Both $\theta$ and $Y$ are random here. Expectations are taken with respect to the joint distribution $f_\theta f_{Y|\theta}$.
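The Bayes risk of any candidate estimator can be estimated by averaging the cost over draws from the joint distribution. The sketch below does this for squared-error cost under an illustrative Gaussian model (the parameter values and the two candidate estimators are assumptions for the demo, not from the text):

```python
import random

# Monte Carlo estimate of the Bayes risk R(theta_hat) = E[c(theta, theta_hat(Y))]
# for squared-error cost c(theta, t) = (theta - t)^2, under the illustrative
# model: theta ~ N(0, 1), Y = theta + W, W ~ N(0, 0.5^2).

random.seed(0)
MU0, SIGMA0, SIGMA_W = 0.0, 1.0, 0.5

def bayes_risk(estimator, n_trials=200_000):
    """Average the cost over draws from the joint distribution f_theta * f_{Y|theta}."""
    total = 0.0
    for _ in range(n_trials):
        theta = random.gauss(MU0, SIGMA0)        # draw theta from the prior
        y = theta + random.gauss(0.0, SIGMA_W)   # draw the observation given theta
        total += (theta - estimator(y)) ** 2
    return total / n_trials

# Two candidate estimators: use the raw observation, or shrink it toward the
# prior mean by the factor sigma0^2 / (sigma0^2 + sigma_w^2).
shrink = SIGMA0**2 / (SIGMA0**2 + SIGMA_W**2)
risk_raw = bayes_risk(lambda y: y)
risk_shrunk = bayes_risk(lambda y: MU0 + shrink * (y - MU0))

print(risk_raw)     # close to sigma_w^2 = 0.25
print(risk_shrunk)  # close to sigma0^2 * sigma_w^2 / (sigma0^2 + sigma_w^2) = 0.2
```

The shrinkage estimator achieves a strictly smaller Bayes risk than the raw observation, which is exactly the kind of gain that exploiting the prior buys.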

Theorem: Bayes' Rule for the Posterior

For any Bayesian model with strictly positive marginal $f_Y(y) > 0$, the posterior density is
$$f_{\theta|Y}(\theta|y) \;=\; \frac{f_{Y|\theta}(y|\theta)\, f_\theta(\theta)}{f_Y(y)}, \qquad f_Y(y) = \int f_{Y|\theta}(y|\theta')\, f_\theta(\theta')\, d\theta'.$$
The denominator is independent of $\theta$, so up to this normalizing constant,
$$f_{\theta|Y}(\theta|y) \;\propto\; f_{Y|\theta}(y|\theta)\, f_\theta(\theta).$$

The posterior is the prior reweighted by how well each value of $\theta$ explains the observed data. High-likelihood values of $\theta$ that were already likely a priori dominate; values that disagree with either the prior or the data are suppressed.
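The proportionality form lends itself to a direct numerical check: evaluate prior times likelihood on a grid of $\theta$ values and normalize by the Riemann sum. A minimal stdlib-only sketch, with illustrative model parameters (prior $\mathcal{N}(0,1)$, additive $\mathcal{N}(0, 0.5^2)$ noise) that are assumptions for the demo:

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

y_obs = 1.2
step = 0.001
grid = [i * step for i in range(-6000, 6001)]          # theta in [-6, 6]

unnormalized = [normal_pdf(t, 0.0, 1.0) *              # prior f_theta(theta)
                normal_pdf(y_obs, t, 0.5)              # likelihood f(y|theta)
                for t in grid]
marginal = sum(unnormalized) * step                    # Riemann sum for f_Y(y)
posterior = [u / marginal for u in unnormalized]       # normalized posterior density

# Posterior mean via the same Riemann sum: it sits between the prior mean (0)
# and the observation (1.2), pulled toward whichever source is more precise.
post_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(post_mean)  # ≈ 1.2 * 1.0 / (1.0 + 0.25) = 0.96
```

This grid approach works for any one-dimensional prior and likelihood, not just conjugate pairs, which makes it a useful sanity check on closed-form posteriors.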

Example: Gaussian Prior, Gaussian Likelihood

Let $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ (prior) and let the observation be $Y = \theta + W$ with $W \sim \mathcal{N}(0, \sigma_w^2)$ independent of $\theta$. Compute the posterior density of $\theta$ given $Y = y$.
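One way to carry out the computation: multiply prior and likelihood, drop factors not involving $\theta$, and complete the square in the exponent:
$$f_{\theta|Y}(\theta|y) \;\propto\; \exp\!\left(-\frac{(y-\theta)^2}{2\sigma_w^2}\right) \exp\!\left(-\frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right).$$
Collecting the quadratic and linear terms in $\theta$ shows the posterior is again Gaussian, $\theta \mid Y=y \sim \mathcal{N}(\mu_{\mathrm{post}}, \sigma_{\mathrm{post}}^2)$, with
$$\mu_{\mathrm{post}} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma_w^2}\, y + \frac{\sigma_w^2}{\sigma_0^2 + \sigma_w^2}\, \mu_0, \qquad \sigma_{\mathrm{post}}^2 = \frac{\sigma_0^2\, \sigma_w^2}{\sigma_0^2 + \sigma_w^2}.$$
The posterior mean is a precision-weighted average of the observation and the prior mean, and the posterior variance is smaller than both $\sigma_0^2$ and $\sigma_w^2$.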

Prior, Likelihood, and Posterior

Change the prior mean $\mu_0$, prior variance $\sigma_0^2$, noise variance $\sigma_w^2$, and observation $y$, and watch how the posterior interpolates between the prior and the likelihood.


Key Takeaway

Every Bayesian estimator depends on the data only through the posterior density $f_{\theta|Y}(\theta|y)$. Once the posterior is known, the choice of estimator is reduced to choosing which summary statistic of the posterior best serves the application.

Prior distribution

The distribution $f_\theta(\theta)$ assigned to the parameter before any observation is made. The prior encodes side information (physical constraints, statistical models of the environment, previous measurements).

Related: Posterior distribution, Likelihood and Log-Likelihood

Posterior distribution

The conditional distribution $f_{\theta|Y}(\theta|y)$ of $\theta$ given the observation $Y=y$. It summarizes everything the data, combined with the prior, tell us about the parameter.

Related: Prior distribution, Likelihood and Log-Likelihood

Common Mistake: Flat Priors Are Not Innocent

Mistake:

A newcomer argues that choosing a "flat" prior $f_\theta(\theta) \propto 1$ on an unbounded parameter space is the neutral, assumption-free choice — after all, it seems to treat every value of $\theta$ equally.

Correction:

A uniform density on an unbounded set is improper (it cannot be normalized to integrate to one), so it is not a valid prior in the strict probabilistic sense. Improper priors sometimes yield valid posteriors — but not always, and whether they do must be checked. Moreover, a flat prior in one parameterization becomes non-flat after a nonlinear change of variables: "uniform on $\theta$" and "uniform on $\theta^2$" are not the same prior. When the likelihood is strong, the prior matters little and flat priors are harmless; when the likelihood is weak, the implicit parameterization choice can dominate the inference.
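The parameterization point can be checked by simulation: draw $\theta$ uniformly on $(0,1)$, and the induced distribution of $\phi = \theta^2$ is far from uniform. A quick stdlib-only sketch (the interval and sample size are illustrative choices):

```python
import random

# If theta is uniform on (0, 1), then phi = theta^2 has density 1/(2*sqrt(phi))
# on (0, 1) by the change-of-variables formula, so it piles mass near 0 rather
# than spreading it evenly. Sampling check:

random.seed(0)
samples = [random.random() ** 2 for _ in range(100_000)]  # phi = theta^2

# Fraction of phi samples below 0.25. A uniform phi would give 0.25; here
# P(theta^2 < 0.25) = P(theta < 0.5) = 0.5.
frac_below = sum(s < 0.25 for s in samples) / len(samples)
print(frac_below)  # ≈ 0.5, not 0.25
```

So "flat in $\theta$" silently commits to a non-flat opinion about $\theta^2$, which is exactly the implicit choice the correction above warns about.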

Historical Note: Bayes, Price, and Laplace

1763–1812

The posterior formula is named for Thomas Bayes, whose 1763 essay An Essay towards solving a Problem in the Doctrine of Chances was published posthumously by his friend Richard Price. Bayes worked out the special case of a uniform prior on the parameter of a Bernoulli distribution. The general form of the rule, and the full machinery of inverse probability, is due to Pierre-Simon Laplace, who rediscovered it independently in 1774 and developed it into a working inferential calculus over the following four decades. The modern name "Bayesian" was coined only in the 1950s.