Kolmogorov's Axioms

Axioms: Formalizing Frequency as Measure

Before Kolmogorov's 1933 monograph, probability was defined via frequencies: run an experiment $N$ times, count occurrences $N(A)$ of event $A$, define $\mathbb{P}(A) \approx N(A)/N$. This frequentist intuition is correct but not a definition: it merely says what probability approximates, not what it is.

Kolmogorov's insight was that the empirical frequency satisfies a small set of algebraic rules that can be elevated to axioms. The result is a clean, axiomatic foundation: three rules that characterize every valid probability model, from coin flips to Rayleigh fading to quantum measurements.

The three axioms below are the ones you will use in every proof in this book. Everything else (conditional probability, independence, expectation, the law of large numbers) follows as a consequence.

Historical Note: Andrei Kolmogorov and the Foundations of Probability (1933)


Andrei Nikolaevich Kolmogorov (1903–1987) was one of the most prolific mathematicians of the 20th century, with major contributions to analysis, topology, turbulence, information theory, and algorithmic complexity. His 1933 monograph Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability) placed probability on the same rigorous axiomatic footing as geometry and algebra, grounding it in Lebesgue's measure theory.

The axioms he proposed (non-negativity, normalization, and countable additivity) are now so universally accepted that we simply call them "the axioms of probability." Kolmogorov later made foundational contributions to information theory (Kolmogorov complexity) and to the ergodic theory of stochastic processes.

Definition: Probability Measure (Kolmogorov's Axioms)

A probability measure on $(\Omega, \mathcal{F})$ is a set function $\mathbb{P}: \mathcal{F} \to [0,1]$ satisfying:

Axiom 1 (Non-negativity): $\mathbb{P}(A) \geq 0$ for all $A \in \mathcal{F}$.

Axiom 2 (Normalization): $\mathbb{P}(\Omega) = 1$.

Axiom 3 (Countable Additivity): If $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint ($A_i \cap A_j = \emptyset$ for $i \neq j$), then
$$\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i).$$

The triple $(\Omega, \mathcal{F}, \mathbb{P})$ is called a probability space.

Axiom 3 with the union restricted to finitely many terms is called finite additivity. Countable additivity is the stronger requirement: it extends the sum to infinity, which is what makes continuity of probability (Lemma 2 below) possible. There are philosophical schools (notably de Finetti) that accept only finite additivity, but for the mathematical development in this book, countable additivity is indispensable.


Example: Constructing Probability Spaces

For each of the following, verify that the proposed $\mathbb{P}$ satisfies Kolmogorov's axioms: (a) $\Omega = \{H, T\}$, $\mathbb{P}(H) = p$, $\mathbb{P}(T) = 1 - p$ for $p \in [0,1]$. (b) $\Omega = \{T, HT, HHT, \ldots\}$ (toss until first Tail), $\mathbb{P}(H^k T) = (1-p)^k p$. (c) $\Omega = [0,1]$, $\mathcal{F} = \mathcal{B}([0,1])$, $\mathbb{P}(A) = \mu(A)$ (Lebesgue measure).
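For case (b), the verification reduces to a geometric series. A minimal numerical sketch (the value $p = 0.3$ and the truncation at $k = 200$ are illustrative choices, not from the text):

```python
import math

# Numerical check of Kolmogorov's axioms for case (b):
# Omega = {T, HT, HHT, ...}, P(H^k T) = (1-p)^k * p.
p = 0.3
probs = [(1 - p) ** k * p for k in range(200)]  # truncate the countable sum

assert all(q >= 0 for q in probs)                    # Axiom 1: non-negativity
assert math.isclose(sum(probs), 1.0, abs_tol=1e-9)   # Axiom 2: geometric series sums to 1
# Axiom 3 holds by construction here: the outcomes {H^k T} are disjoint
# singletons, and P of any union is defined as the sum of their probabilities.
print(f"total probability over Omega: {sum(probs):.9f}")
```

The tail $(1-p)^{200}$ is astronomically small, so the truncated sum matches the infinite series to well within the tolerance.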

Theorem: Elementary Consequences of the Axioms

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $A, B \in \mathcal{F}$. Then:

  1. $\mathbb{P}(\emptyset) = 0$.
  2. (Complementation) $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$.
  3. (Monotonicity) If $A \subseteq B$ then $\mathbb{P}(A) \leq \mathbb{P}(B)$.
  4. $0 \leq \mathbb{P}(A) \leq 1$.
  5. (Inclusion-Exclusion, $n = 2$) $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)$.

These are the workhorse identities of probability. The complementation rule is used constantly: if $\mathbb{P}(A)$ is hard to compute directly, compute $\mathbb{P}(A^c)$ and subtract from 1. Monotonicity is the foundation of all probability bounds.
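These identities can be checked exactly on any finite probability space. A sketch using a fair die (the events $A$, $B$ are arbitrary illustrative choices) with exact rational arithmetic:

```python
from fractions import Fraction

# Exact check of the elementary consequences on a finite space:
# Omega = {1,...,6} (fair die), P(E) = |E|/6, A = evens, B = {1,2,3}.
Omega = frozenset(range(1, 7))
P = lambda E: Fraction(len(E), len(Omega))
A = frozenset({2, 4, 6})
B = frozenset({1, 2, 3})

assert P(frozenset()) == 0                        # 1. P(empty) = 0
assert P(Omega - A) == 1 - P(A)                   # 2. complementation
assert P(A & B) <= P(B)                           # 3. monotonicity (A∩B ⊆ B)
assert 0 <= P(A) <= 1                             # 4. bounded in [0,1]
assert P(A | B) == P(A) + P(B) - P(A & B)         # 5. inclusion-exclusion, n = 2
print("all five elementary identities verified exactly")
```

Using `Fraction` instead of floats keeps every comparison exact, which is the right tool for identities rather than approximations.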


Theorem: General Inclusion-Exclusion Formula

For events $A_1, \ldots, A_n \in \mathcal{F}$:
$$\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} \mathbb{P}(A_i) - \sum_{i < j} \mathbb{P}(A_i \cap A_j) + \sum_{i < j < k} \mathbb{P}(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} \mathbb{P}\left(\bigcap_{i=1}^{n} A_i\right).$$

The formula corrects for overcounting: when we add $\mathbb{P}(A_i)$ for all $i$, points in two-way intersections are counted twice, so we subtract them; but then points in three-way intersections are subtracted once too many, so we add them back, and so on. The resulting alternating sum counts each point in the union exactly once.
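The alternating sum can be verified exactly by iterating over all subsets of events. A sketch on a toy uniform space (the three events are arbitrary illustrative choices):

```python
from itertools import combinations
from fractions import Fraction

# Exact check of general inclusion-exclusion on Omega = {0,...,29}, uniform.
Omega = frozenset(range(30))
P = lambda E: Fraction(len(E), len(Omega))
events = [frozenset(range(0, 15)),                      # A1
          frozenset(range(10, 25)),                     # A2
          frozenset(x for x in Omega if x % 3 == 0)]    # A3

lhs = P(frozenset().union(*events))          # P(A1 ∪ A2 ∪ A3), computed directly

rhs = Fraction(0)                            # alternating sum over all r-way intersections
for r in range(1, len(events) + 1):
    for combo in combinations(events, r):
        inter = frozenset(Omega)
        for E in combo:
            inter &= E
        rhs += (-1) ** (r + 1) * P(inter)

assert lhs == rhs
print(f"P(union) = {lhs}, computed both directly and via inclusion-exclusion")
```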


Theorem: Union Bound (Boole's Inequality)

For any finite or countably infinite collection of events $\{A_k\}$:
$$\mathbb{P}\left(\bigcup_{k} A_k\right) \leq \sum_{k} \mathbb{P}(A_k).$$

The union bound is the most-used probability inequality in communications engineering. It says that the probability that at least one of many bad events occurs is at most the sum of their individual probabilities. In error probability analysis, it bounds the block error probability by the sum of pairwise error probabilities: a simple but powerful tool that leads to the Bhattacharyya and Chernoff bounds in Book FSI.
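A quick Monte Carlo illustration of the slack in the bound, using three overlapping intervals on $\Omega = [0,1)$ with the uniform measure (events chosen for illustration):

```python
import random

# Three overlapping events on [0,1): union is [0, 0.6), so P(union) = 0.6,
# while the union bound gives 0.3 + 0.3 + 0.15 = 0.75 (overlap causes slack).
random.seed(0)
N = 200_000
events = [(0.0, 0.3), (0.2, 0.5), (0.45, 0.6)]   # intervals [a, b)

hits = sum(any(a <= u < b for a, b in events)
           for u in (random.random() for _ in range(N)))
p_union_hat = hits / N
bound = sum(b - a for a, b in events)            # sum of individual probabilities

print(f"estimated P(union) = {p_union_hat:.4f}, union bound = {bound:.2f}")
assert p_union_hat <= bound                      # the inequality holds
```

The bound is loose here precisely because the events overlap; for rare, nearly disjoint events it becomes nearly tight, which is why it works so well in the low-error-probability regime.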

Theorem: Continuity of Probability

Let $\{A_n\}$ be a monotone sequence of events.

Increasing: If $A_1 \subseteq A_2 \subseteq \cdots$, then $\displaystyle\lim_{n\to\infty} \mathbb{P}(A_n) = \mathbb{P}\left(\bigcup_{n=1}^{\infty} A_n\right)$.

Decreasing: If $A_1 \supseteq A_2 \supseteq \cdots$, then $\displaystyle\lim_{n\to\infty} \mathbb{P}(A_n) = \mathbb{P}\left(\bigcap_{n=1}^{\infty} A_n\right)$.

Probability behaves like a continuous function with respect to monotone limits of events: the probability of the limiting event equals the limit of the probabilities. This mirrors the continuity of a real function $f$ along a convergent sequence, $\lim_n f(x_n) = f(\lim_n x_n)$: probability "commutes with limits" along monotone sequences of events.
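A concrete instance with Lebesgue measure on $[0,1]$, where everything is computable in closed form: take $A_n = [0, 1 - 1/n]$, which increases to $[0, 1)$, so $\mathbb{P}(A_n) = 1 - 1/n \to 1 = \mathbb{P}\bigl(\bigcup_n A_n\bigr)$.

```python
# Continuity along an increasing sequence under Lebesgue measure on [0,1]:
# A_n = [0, 1 - 1/n] increases to [0, 1), and P(A_n) = 1 - 1/n -> 1.
probs = [1 - 1 / n for n in range(1, 10_001)]

# The probabilities are monotone because the events are nested...
assert all(p1 <= p2 for p1, p2 in zip(probs, probs[1:]))
# ...and they converge to the probability of the limiting event, P([0,1)) = 1.
assert abs(probs[-1] - 1.0) < 1e-3
print(f"P(A_10000) = {probs[-1]:.4f}  ->  P(union) = 1")
```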


Theorem: First Borel-Cantelli Lemma

Let $\{A_n\}$ be a sequence of events with $\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty$. Define the event "$A_n$ occurs infinitely often" as
$$\{A_n \text{ i.o.}\} \triangleq \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k.$$
Then $\mathbb{P}(A_n \text{ i.o.}) = 0$.

If the sum of probabilities converges, then the probability of infinitely many events occurring is zero. In reliability theory: if component $k$ fails with probability $p_k$ and $\sum_k p_k < \infty$, almost surely only finitely many components ever fail. In coding theory: if codeword $n$ is decoded incorrectly with probability $p_n$ and $\sum_n p_n < \infty$, almost surely only finitely many codewords are in error.
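A simulation sketch of this setting, with independent events $\mathbb{P}(A_n) = 1/n^2$ (the summable series of the packet example below; the run counts and horizon are illustrative). Note that the lemma itself does not require independence; independence is used here only to generate samples.

```python
import random

# Simulate independent events with P(A_n) = 1/n^2 (summable), and record
# the index of the last event that occurs in each run. Borel-Cantelli says
# that almost surely only finitely many A_n occur.
random.seed(1)
N_MAX, RUNS = 10_000, 200

last_occurrences = []
for _ in range(RUNS):
    last = 0                                  # index of the last A_n observed (0 if none)
    for n in range(1, N_MAX + 1):
        if random.random() < 1 / n**2:
            last = n
    last_occurrences.append(last)

# The events dry up quickly: the expected number occurring past n = 100
# is sum_{n>100} 1/n^2, which is about 0.01 per run.
print(f"largest 'last occurrence' index over {RUNS} runs: {max(last_occurrences)}")
```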


Why This Matters: Union Bound in Error Probability Analysis

In digital communications, a message from a codebook of size $M$ is transmitted over a noisy channel. A decoding error occurs if the received signal is closer (in some metric) to a codeword other than the one transmitted. Letting $\mathcal{E}_j$ be the event "the receiver prefers codeword $j$ over the true codeword," the block error probability is
$$P_e = \mathbb{P}\left(\bigcup_{j \neq j^*} \mathcal{E}_j\right) \leq \sum_{j \neq j^*} \mathbb{P}(\mathcal{E}_j).$$
This is the union bound applied to error events. The bound is tight when errors are rare (small $P_e$ regime) and is the starting point for the Bhattacharyya bound and the Chernoff–Gallager error exponent. Every error-probability derivation in Book FSI (Chapters 6–7) uses exactly this step.
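A sketch of this step for a toy "codebook": the four points of a QPSK constellation in the complex plane, with AWGN and nearest-neighbor (ML) decoding. The noise level and sample count are illustrative; the pairwise error probability between codewords at distance $d$ is $Q(d/2\sigma)$, with $Q(x) = \tfrac{1}{2}\mathrm{erfc}(x/\sqrt{2})$.

```python
import cmath, math, random

# Union bound vs. simulated block error probability for M = 4 unit-energy
# QPSK "codewords"; sigma is the noise std per real dimension (illustrative).
random.seed(2)
codebook = [cmath.exp(1j * math.pi * (2 * m + 1) / 4) for m in range(4)]
sigma = 0.4
Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))   # Gaussian tail function

# Union bound: sum of pairwise error probabilities Q(d_ij / (2*sigma)).
tx = codebook[0]
bound = sum(Q(abs(tx - c) / (2 * sigma)) for c in codebook if c is not tx)

# Monte Carlo estimate of the true P_e under nearest-neighbor decoding.
N = 100_000
errors = 0
for _ in range(N):
    y = tx + complex(random.gauss(0, sigma), random.gauss(0, sigma))
    if min(codebook, key=lambda c: abs(y - c)) is not tx:
        errors += 1

print(f"simulated P_e = {errors / N:.4f}, union bound = {bound:.4f}")
assert errors / N <= bound * 1.05   # bound holds (small tolerance for MC noise)
```

The simulated error rate sits just below the bound because the two adjacent-codeword error events overlap only slightly, which is the "rare errors" regime where the union bound is nearly tight.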

Interactive figure: Inclusion-Exclusion for Three Events. Visualize how the inclusion-exclusion formula counts the probability of $A \cup B \cup C$ by alternately adding and subtracting intersection probabilities; adjust the individual and pairwise probabilities to see the formula in action.

Interactive figure: Law of Large Numbers (Empirical Frequency Convergence). The empirical frequency $N(A)/N$ (blue) oscillates initially but converges to the true probability $\mathbb{P}(A) = p$ (red dashed) as $N \to \infty$, visualizing the frequentist motivation for Kolmogorov's axioms. This convergence is the Law of Large Numbers, proved rigorously in Chapter 11.

Interactive simulation: Empirical Frequency of Coin Flips. Simulate $n$ flips of a biased coin with $\mathbb{P}(\text{Heads}) = p$ and watch the empirical frequency converge to $p$ as $n$ increases.
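A plain-Python stand-in for the coin-flip demo (the bias $p$, flip count $n$, and seed are illustrative parameter choices):

```python
import random

# Empirical frequency of Heads for a biased coin, P(Heads) = p.
random.seed(42)
p, n = 0.3, 500

heads = 0
for i in range(1, n + 1):
    heads += random.random() < p            # one Bernoulli(p) flip
    if i in (10, 100, 500):
        print(f"after {i:4d} flips: empirical frequency = {heads / i:.3f}")

# For n = 500 the empirical frequency is already close to p = 0.3.
assert abs(heads / n - p) < 0.1
```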

Common Mistake: Finite Additivity Is Not Enough

Mistake:

A weaker version of Axiom 3 requires additivity only for finitely many disjoint events. It is tempting to assume this is sufficient; after all, every experiment in practice involves finitely many outcomes, doesn't it?

Correction:

Finite additivity is not sufficient to derive the continuity of probability, the Borel-Cantelli lemmas, or any convergence theorem. Without countable additivity, one cannot speak meaningfully about limits of events or define the probability of an event described as the intersection of a countably infinite family (e.g., the event that a random walk never exceeds a threshold). All of probability theory for stochastic processes and convergence, the backbone of communications analysis, requires Axiom 3 in its countably additive form.

🔧 Engineering Note

Choosing a Probability Space for a Channel Model

When modeling a wireless channel, the probability space is rarely written down explicitly, but it is always implicitly present. Consider a flat Rayleigh fading channel with received signal $Y = H \cdot x + W$, where $H \sim \mathcal{CN}(0,1)$ (channel gain) and $W \sim \mathcal{CN}(0, \sigma^2)$ (noise). The implicit probability space is:

  • $\Omega = \mathbb{C}^2$ (pairs of complex numbers $(h, w)$)
  • $\mathcal{F} = \mathcal{B}(\mathbb{C}^2)$ (Borel sets in the complex plane $\times$ complex plane)
  • $\mathbb{P}$ = the product of two complex Gaussian distributions

Axiom 3 ensures we can compute $\mathbb{P}(|Y|^2 > \gamma)$ for any threshold $\gamma$ by integrating over the corresponding Borel set. Without the sigma-algebra structure, the integral would have no meaning.
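For a fixed transmit symbol $x$, this probability has a closed form: $Y$ is then $\mathcal{CN}(0, |x|^2 + \sigma^2)$, so $|Y|^2$ is exponential and $\mathbb{P}(|Y|^2 > \gamma) = e^{-\gamma/(|x|^2 + \sigma^2)}$. A Monte Carlo sketch validating this (the values of $x$, $\sigma$, $\gamma$ are illustrative):

```python
import math, random

# Monte Carlo check of P(|Y|^2 > gamma) for Y = H*x + W with
# H ~ CN(0,1), W ~ CN(0, sigma^2), and x fixed.
random.seed(3)
x, sigma, gamma, N = 1.0, 0.5, 2.0, 200_000

def cn_sample(var):
    """Draw from CN(0, var): independent real/imag parts with variance var/2 each."""
    s = math.sqrt(var / 2)
    return complex(random.gauss(0, s), random.gauss(0, s))

hits = sum(abs(cn_sample(1.0) * x + cn_sample(sigma**2)) ** 2 > gamma
           for _ in range(N))
est = hits / N
exact = math.exp(-gamma / (abs(x) ** 2 + sigma**2))   # exponential tail of |Y|^2

print(f"Monte Carlo: {est:.4f}, closed form: {exact:.4f}")
assert abs(est - exact) < 0.01
```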

Practical Constraints
  • The sample space must be large enough to describe all relevant quantities
  • For analog channels, use $\Omega \subseteq \mathbb{R}^n$ or $\mathbb{C}^n$ with the Borel sigma-algebra
  • For discrete channels, the power set suffices

Probability Space

The triple $(\Omega, \mathcal{F}, \mathbb{P})$ consisting of a sample space, a sigma-algebra of events, and a probability measure satisfying Kolmogorov's three axioms.

Related: Sample Space, Sigma-Algebra

Almost Surely

An event $A$ holds almost surely (a.s.) if $\mathbb{P}(A) = 1$. Equivalently, $A^c$ is a null event: $\mathbb{P}(A^c) = 0$.

Related: Probability Space

Key Takeaway

The probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is the central object of probability theory. Kolmogorov's three axioms (non-negativity, normalization, countable additivity) are the minimal rules that make the entire theory work. Everything else, from conditional probability and random variables to expectations and the law of large numbers, is a derived concept built on this triad.

Quick Check

Which of the following is NOT a consequence of the three Kolmogorov axioms?

$\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$

$\mathbb{P}(A \cup B) \leq \mathbb{P}(A) + \mathbb{P}(B)$

$\mathbb{P}(A \mid B) = \mathbb{P}(A \cap B)/\mathbb{P}(B)$

$\mathbb{P}(\emptyset) = 0$

Quick Check

If $A_n$ is the event that the $n$-th transmitted packet is corrupted, and $\mathbb{P}(A_n) = 1/n^2$ for all $n$, what does the first Borel-Cantelli lemma say?

With probability 1, infinitely many packets are corrupted.

With probability 1, only finitely many packets are corrupted.

No packets are ever corrupted.

The conclusion requires independence of the $A_n$.