The Law of Total Probability and Bayes' Theorem

Partitioning the Unknown

Many probabilities are hard to compute directly but become tractable when we condition on an exhaustive set of mutually exclusive scenarios. If we know the probability of an event $A$ under each scenario, and we know the probability of each scenario, we can recover $\mathbb{P}(A)$ as a weighted average. This is the law of total probability, one of the workhorses of applied probability.

Bayes' theorem is the other side of the same coin: given that $A$ occurred, it tells us how to update our beliefs about which scenario was in play. This prior-to-posterior update is the mathematical engine of Bayesian inference, detection theory, and probabilistic decoding algorithms.

Definition: Partition of the Sample Space

A finite (or countable) collection of events $\{B_1, B_2, \ldots, B_n\}$ is a partition of $\Omega$ if:

  1. Exhaustive: $\bigcup_{i=1}^{n} B_i = \Omega$.
  2. Mutually exclusive: $B_i \cap B_j = \emptyset$ for all $i \neq j$.

Every outcome $\omega \in \Omega$ belongs to exactly one $B_i$.

Theorem: Law of Total Probability

Let $\{B_1, \ldots, B_n\}$ be a partition of $\Omega$ with $\mathbb{P}(B_i) > 0$ for all $i$. For any event $A$:

$$\mathbb{P}(A) = \sum_{i=1}^{n} \mathbb{P}(A \mid B_i)\,\mathbb{P}(B_i).$$

The event $A$ is split into disjoint pieces $A \cap B_i$, each contained in exactly one scenario $B_i$. The probability of each piece is $\mathbb{P}(A \cap B_i) = \mathbb{P}(A \mid B_i)\,\mathbb{P}(B_i)$, and summing over all scenarios recovers $\mathbb{P}(A)$.
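The weighted average can be checked numerically. The sketch below uses hypothetical scenario probabilities and likelihoods (not from the text):

```python
# Law of total probability over a hypothetical 3-scenario partition.
priors = [0.5, 0.3, 0.2]       # P(B_i); must sum to 1
likelihoods = [0.9, 0.5, 0.1]  # P(A | B_i)

# P(A) = sum_i P(A | B_i) * P(B_i)
# = 0.9*0.5 + 0.5*0.3 + 0.1*0.2 = 0.62
p_a = sum(l * p for l, p in zip(likelihoods, priors))
print(p_a)
```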


Theorem: Bayes' Theorem

Let $\{B_1, \ldots, B_n\}$ be a partition of $\Omega$ with $\mathbb{P}(B_i) > 0$ for all $i$. For any event $A$ with $\mathbb{P}(A) > 0$:

$$\mathbb{P}(B_k \mid A) = \frac{\mathbb{P}(A \mid B_k)\,\mathbb{P}(B_k)}{\sum_{i=1}^{n} \mathbb{P}(A \mid B_i)\,\mathbb{P}(B_i)}.$$

The terms have canonical names in Bayesian inference:

  • $\mathbb{P}(B_k)$: the prior probability of scenario $k$.
  • $\mathbb{P}(A \mid B_k)$: the likelihood of observation $A$ under scenario $k$.
  • $\mathbb{P}(B_k \mid A)$: the posterior probability of scenario $k$ given $A$.
  • $\mathbb{P}(A)$: the evidence (normalizing constant).

Bayes' theorem reverses the direction of conditioning. We know how to go from scenario to observation (the forward channel $\mathbb{P}(A \mid B_k)$). Bayes tells us how to go the other direction: from observation back to scenario. The prior encodes what we believed before observing $A$; the posterior encodes what we believe after.
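The prior-to-posterior update over a finite partition can be sketched in a few lines; the two-scenario numbers below are hypothetical:

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a finite partition:
    returns P(B_k | A) for each k."""
    # Evidence P(A) via the law of total probability.
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical two-scenario example: uniform prior, likelihoods 0.8 and 0.2.
post = posterior([0.5, 0.5], [0.8, 0.2])
# post[0] = 0.8*0.5 / (0.8*0.5 + 0.2*0.5) = 0.8
print(post)
```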


Historical Note: Thomas Bayes and the Inverse Probability Problem


Thomas Bayes (1702–1761), an English minister and amateur mathematician, posed the following question: given that an event has occurred some number of times, what can be inferred about the underlying probability? His posthumous 1763 essay, edited and communicated by Richard Price to the Royal Society, introduced what we now call Bayes' theorem in the context of a billiard-ball model on a square table.

Bayes' contribution was primarily philosophical: the idea that probability could represent degree of belief rather than mere frequency, and that this belief should be updated rationally in response to evidence. The formalization was refined by Pierre-Simon Laplace, who independently developed the same ideas around 1774. The modern Bayesian-versus-frequentist debate can be traced directly to this 18th-century dispute over the nature of probability.

Example: Binary Symmetric Channel: Posterior Computation

A binary symmetric channel flips each transmitted bit with probability $\epsilon \in (0, 1/2)$. The transmitter sends $0$ or $1$ with equal prior probabilities $\mathbb{P}(X=0) = \mathbb{P}(X=1) = 1/2$. The receiver observes $Y = 1$. Compute the posterior $\mathbb{P}(X = 0 \mid Y = 1)$.
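With equal priors the evidence is $\mathbb{P}(Y=1) = \tfrac{1}{2}\epsilon + \tfrac{1}{2}(1-\epsilon) = \tfrac{1}{2}$, so the posterior simplifies to $\mathbb{P}(X=0 \mid Y=1) = \epsilon$. A minimal sketch of that computation:

```python
def bsc_posterior_x0_given_y1(eps):
    """P(X=0 | Y=1) for a binary symmetric channel with
    crossover probability eps and equal priors."""
    p_y1_given_x0 = eps        # bit was flipped
    p_y1_given_x1 = 1 - eps    # bit passed through
    prior = 0.5
    evidence = p_y1_given_x0 * prior + p_y1_given_x1 * prior  # P(Y=1) = 1/2
    return p_y1_given_x0 * prior / evidence

# With equal priors the posterior equals the crossover probability itself.
print(bsc_posterior_x0_given_y1(0.1))
```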

Example: Two Factories and a Defective Chip

Factory A produces 60% of all chips; factory B produces 40%. Factory A's defect rate is 2%; factory B's defect rate is 5%. A randomly chosen chip is found to be defective. What is the probability that it came from factory A?
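The factory example works out to $0.012 / 0.032 = 0.375$, and the arithmetic can be checked directly:

```python
# Two factories: market shares and defect rates from the example.
p_factory = {"A": 0.6, "B": 0.4}          # P(factory)
p_defect_given = {"A": 0.02, "B": 0.05}   # P(defective | factory)

# Evidence: P(defective) by the law of total probability.
p_defect = sum(p_defect_given[f] * p_factory[f] for f in p_factory)

# Posterior: P(factory A | defective) = 0.012 / 0.032 = 0.375.
p_a_given_defect = p_defect_given["A"] * p_factory["A"] / p_defect
print(round(p_a_given_defect, 3))
```

Even though factory A makes more chips, its lower defect rate pulls the posterior below the 0.6 prior.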

Bayesian Posterior Updating

Explore how the posterior $\mathbb{P}(B_1 \mid A)$ evolves as the prior $\mathbb{P}(B_1)$ and likelihoods $\mathbb{P}(A \mid B_1)$, $\mathbb{P}(A \mid B_2)$ vary (two-hypothesis model).


Bayesian Updating: Prior to Posterior

Watch the posterior distribution evolve as sequential observations arrive and the prior is updated one observation at a time.
A binary model with two hypotheses: each observation shifts the posterior. After many observations the posterior concentrates on the true hypothesis, regardless of the initial prior (as long as it is non-zero).
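Sequential updating can be simulated: treat yesterday's posterior as today's prior and apply Bayes' theorem once per observation. The coin-flip model and its parameter values below are hypothetical:

```python
import random

random.seed(0)

# Two hypotheses about a coin's heads-probability.
hypotheses = [0.3, 0.7]     # H1: p = 0.3, H2: p = 0.7 (the true one)
posterior = [0.5, 0.5]      # start from a uniform prior
true_p = 0.7

for _ in range(200):
    heads = random.random() < true_p            # one coin flip
    # Likelihood of this observation under each hypothesis.
    like = [p if heads else 1 - p for p in hypotheses]
    evidence = sum(l * q for l, q in zip(like, posterior))
    # One Bayes update: the old posterior plays the role of the prior.
    posterior = [l * q / evidence for l, q in zip(like, posterior)]

print(posterior)  # mass concentrates on the true hypothesis p = 0.7
```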

Law of Total Probability: Partition Visualization

Visualize how $\mathbb{P}(A)$ is decomposed over a partition $\{B_1, B_2, B_3\}$. Adjust the scenario probabilities and the conditional probabilities $\mathbb{P}(A \mid B_i)$ to see the weighted average.


Why This Matters: Bayes' Theorem in Digital Detection

In digital communications, the receiver observes $y$ and must decide which symbol $s_k$ was transmitted. Bayes' theorem gives the maximum a posteriori (MAP) decoder:

$$\hat{k} = \arg\max_k \mathbb{P}(X = s_k \mid Y = y) = \arg\max_k f(y \mid X = s_k)\,\mathbb{P}(X = s_k),$$

where the total probability $f(y)$ cancels in the $\arg\max$. When symbols are equally likely ($\mathbb{P}(X = s_k) = 1/m$ for all $k$), MAP reduces to maximum likelihood (ML): $\hat{k} = \arg\max_k f(y \mid X = s_k)$. Bayes' theorem is the precise reason why equal priors make MAP and ML coincide.
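A minimal sketch of a MAP decision rule, assuming a Gaussian-noise channel and BPSK-style symbols $\pm 1$ (these modeling choices are illustrative, not from the text):

```python
import math

def map_decode(y, symbols, priors, sigma=1.0):
    """MAP decoder for Y = s_k + Gaussian noise with std sigma:
    pick the index k maximizing f(y | s_k) * P(s_k)."""
    def likelihood(y, s):
        # Unnormalized Gaussian density; the constant cancels in argmax.
        return math.exp(-(y - s) ** 2 / (2 * sigma ** 2))
    scores = [likelihood(y, s) * p for s, p in zip(symbols, priors)]
    return max(range(len(symbols)), key=lambda k: scores[k])

symbols = [-1.0, +1.0]
# Equal priors: MAP coincides with ML, i.e. nearest-symbol decision.
print(map_decode(0.2, symbols, [0.5, 0.5]))    # picks +1
# A strong prior toward -1 can override a weakly positive observation.
print(map_decode(0.2, symbols, [0.95, 0.05]))  # picks -1
```

The second call shows why unequal priors matter: the observation favors $+1$, but the prior tilts the decision back to $-1$.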

Common Mistake: The Prosecutor's Fallacy

Mistake:

In forensic science (and sometimes in wireless network analysis), evidence is presented as: "The probability of observing this evidence if the defendant is innocent is only $0.001$." This is then (incorrectly) interpreted as: "The probability that the defendant is innocent given this evidence is $0.001$."

Correction:

The first quantity is $\mathbb{P}(\text{evidence} \mid \text{innocent})$, the likelihood. The second is $\mathbb{P}(\text{innocent} \mid \text{evidence})$, the posterior. They are related by Bayes' theorem:

$$\mathbb{P}(\text{innocent} \mid \text{evidence}) = \frac{\mathbb{P}(\text{evidence} \mid \text{innocent})\,\mathbb{P}(\text{innocent})}{\mathbb{P}(\text{evidence})}.$$

If the prior $\mathbb{P}(\text{innocent})$ is high (most people are not criminals), the posterior can remain large even when the likelihood is small. The base rate (prior) is crucial and must not be ignored.
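The gap between likelihood and posterior is easy to quantify. The base rate and the likelihood-under-guilt below are hypothetical numbers chosen for illustration:

```python
# Prosecutor's fallacy: likelihood vs. posterior.
p_evidence_given_innocent = 0.001  # the likelihood quoted in court
p_evidence_given_guilty = 1.0      # assume the evidence is certain if guilty
p_innocent = 0.999                 # hypothetical base rate: 1 in 1,000 guilty

# Evidence term by the law of total probability.
p_evidence = (p_evidence_given_innocent * p_innocent
              + p_evidence_given_guilty * (1 - p_innocent))
p_innocent_given_evidence = p_evidence_given_innocent * p_innocent / p_evidence

print(round(p_innocent_given_evidence, 3))  # about 0.5, nowhere near 0.001
```

Despite a 1-in-1,000 likelihood, the posterior probability of innocence is roughly 50%, because guilt itself is rare a priori.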

Prior and Posterior

In Bayesian inference, the prior $\mathbb{P}(B_k)$ encodes belief about scenario $k$ before observing any data. The posterior $\mathbb{P}(B_k \mid A)$ encodes belief after observing event $A$. Bayes' theorem is the update rule that converts prior into posterior via the likelihood $\mathbb{P}(A \mid B_k)$.

Related: Conditional Probability, Bayes' Theorem

Quick Check

A medical test has sensitivity $\mathbb{P}(\text{positive} \mid \text{disease}) = 0.99$ and specificity $\mathbb{P}(\text{negative} \mid \text{no disease}) = 0.95$. The disease prevalence is $\mathbb{P}(\text{disease}) = 0.01$. A patient tests positive. Which expression correctly gives $\mathbb{P}(\text{disease} \mid \text{positive})$?

$0.99 \times 0.01 / (0.99 \times 0.01 + 0.05 \times 0.99) = 1/6 \approx 0.167$
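The quick-check arithmetic can be verified in a few lines:

```python
# Medical test posterior via Bayes' theorem.
sens = 0.99   # P(positive | disease)
spec = 0.95   # P(negative | no disease), so P(positive | no disease) = 0.05
prev = 0.01   # P(disease)

# Evidence: P(positive) by the law of total probability.
p_positive = sens * prev + (1 - spec) * (1 - prev)
p_disease_given_pos = sens * prev / p_positive  # 0.0099 / 0.0594 = 1/6
print(round(p_disease_given_pos, 3))
```

Even with a 99%-sensitive test, a positive result leaves only about a 1-in-6 chance of disease, because the 1% prevalence dominates.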

Key Takeaway

Bayes' theorem converts likelihoods into posteriors. The prior is what we believed before; the likelihood is how consistent the observation is with each hypothesis; the posterior is what we believe after. In detection theory (Book FSI), this update rule is the MAP decoder. In channel estimation (Book MIMO), it is the Bayesian estimator. In message-passing algorithms (belief propagation), it runs on every edge of the factor graph. Bayes' theorem is not merely a formula; it is a way of thinking.