Probability Spaces and Axioms

Why Probability for Wireless Communications?

Chapter 1 equipped us with the deterministic linear-algebraic machinery to write the MIMO input--output relation

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}.$$

But in any real system, the channel matrix $\mathbf{H}$, the noise vector $\mathbf{n}$, and often the transmit symbol $\mathbf{x}$ itself are random. Wireless channels fade because of multipath propagation; thermal noise is an inherently stochastic phenomenon described by statistical mechanics; and coded data streams are designed to look random to maximise entropy.

A deterministic analysis can tell us what happens for a particular channel realisation, but system design demands answers to questions such as:

  • What is the probability that the bit-error rate exceeds $10^{-5}$?
  • What average data rate can a fading channel support?
  • How does spatial diversity across independent antenna paths reduce outage probability?

These questions are unanswerable without a rigorous probability framework. This section lays the measure-theoretic foundation --- sample spaces, $\sigma$-algebras, and the Kolmogorov axioms --- that underlies every probabilistic statement in the remainder of this text. The payoff is immediate: by the end of this section we will already be applying Bayes' theorem to optimal symbol detection.

Definition:

Sample Space

The sample space, denoted $\Omega$, is the set of all possible outcomes of a random experiment. Each element $\omega \in \Omega$ is called a sample point (or outcome).

The sample space may be:

  • Finite: e.g., $\Omega = \{+1, -1\}$ for a single BPSK symbol.
  • Countably infinite: e.g., $\Omega = \{0, 1, 2, \dots\}$ for the number of packet arrivals in a time slot.
  • Uncountable: e.g., $\Omega = \mathbb{R}$ for a continuous noise sample, or $\Omega = \mathbb{C}^{N_r \times N_t}$ for the set of all possible MIMO channel realisations.

In wireless communications, common sample spaces include:

| Experiment | Sample space $\Omega$ |
| --- | --- |
| Transmitted BPSK symbol | $\{+1, -1\}$ |
| Transmitted QPSK symbol | $\{(\pm 1 \pm j)/\sqrt{2}\}$ |
| Received baseband sample (AWGN) | $\mathbb{R}$ or $\mathbb{C}$ |
| Flat-fading SISO channel gain | $\mathbb{C}$ |
| MIMO channel matrix ($N_r \times N_t$) | $\mathbb{C}^{N_r \times N_t}$ |

The choice of sample space is a modelling decision. For BPSK over AWGN, one may take $\Omega = \{+1,-1\} \times \mathbb{R}$ (encoding both the transmitted symbol and the received signal) or, if the transmitted symbol is treated as deterministic, simply $\Omega = \mathbb{R}$.


Definition:

$\sigma$-Algebra (Sigma-Algebra)

Let $\Omega$ be a sample space. A $\sigma$-algebra (or $\sigma$-field) on $\Omega$ is a collection $\mathcal{F}$ of subsets of $\Omega$ satisfying the following three axioms:

  1. Contains the sample space: $\Omega \in \mathcal{F}$.

  2. Closed under complementation: If $A \in \mathcal{F}$, then $A^c = \Omega \setminus A \in \mathcal{F}$.

  3. Closed under countable unions: If $A_1, A_2, A_3, \dots \in \mathcal{F}$, then $\displaystyle\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$.

Each element $A \in \mathcal{F}$ is called an event.

Immediate consequences of the axioms:

  • $\varnothing = \Omega^c \in \mathcal{F}$ (by axioms 1 and 2).
  • $\mathcal{F}$ is closed under countable intersections (by De Morgan's laws and axioms 2--3): $\displaystyle\bigcap_{k=1}^{\infty} A_k = \Bigl(\bigcup_{k=1}^{\infty} A_k^c\Bigr)^c \in \mathcal{F}$.
  • $\mathcal{F}$ is closed under set differences: $A \setminus B = A \cap B^c \in \mathcal{F}$.

For finite or countable $\Omega$, one typically takes $\mathcal{F} = 2^{\Omega}$ (the power set). For uncountable sample spaces such as $\Omega = \mathbb{R}$, the power set is too large and one uses the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$, generated by all open intervals. This subtlety becomes essential when defining continuous random variables in Section 2.2.
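
For the smallest non-trivial illustration, take the BPSK sample space $\Omega = \{+1, -1\}$. Its power set is

$$\mathcal{F} = 2^{\Omega} = \bigl\{\varnothing,\ \{+1\},\ \{-1\},\ \Omega\bigr\},$$

and all three axioms can be checked directly: $\Omega \in \mathcal{F}$, each complement is present (e.g., $\{+1\}^c = \{-1\}$), and every union of members is again a member.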


Definition:

Probability Measure (Kolmogorov Axioms)

Let $(\Omega, \mathcal{F})$ be a measurable space (a sample space equipped with a $\sigma$-algebra). A probability measure is a function $P \colon \mathcal{F} \to [0,1]$ satisfying the three Kolmogorov axioms:

(K1) Non-negativity. For every event $A \in \mathcal{F}$,

$$P(A) \ge 0.$$

(K2) Normalisation.

$$P(\Omega) = 1.$$

(K3) Countable additivity ($\sigma$-additivity). If $A_1, A_2, A_3, \dots \in \mathcal{F}$ are pairwise disjoint (i.e., $A_i \cap A_j = \varnothing$ for $i \neq j$), then

$$P\!\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k).$$

These three axioms, together with the $\sigma$-algebra structure, are sufficient to derive all standard rules of probability.

Countable additivity (K3) is strictly stronger than finite additivity. The distinction matters when taking limits of event sequences --- a situation that arises naturally in coding theory (block length $n \to \infty$) and in ergodic arguments for stochastic processes.
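
One standard corollary of (K3) makes the point concrete: for an increasing sequence of events $A_1 \subseteq A_2 \subseteq \cdots$ in $\mathcal{F}$,

$$P\!\left(\bigcup_{k=1}^{\infty} A_k\right) = \lim_{k \to \infty} P(A_k).$$

This continuity property, which finite additivity alone cannot deliver, is exactly the tool behind such limiting arguments.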


Definition:

Probability Space

A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a sample space, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $P$ is a probability measure on $(\Omega, \mathcal{F})$.

Every probabilistic model in communications theory begins --- at least implicitly --- with the specification of a probability space. The triple $(\Omega, \mathcal{F}, P)$ provides:

  1. a complete catalogue of what can happen ($\Omega$),
  2. a specification of which collections of outcomes are "observable" or "measurable" ($\mathcal{F}$), and
  3. a consistent assignment of likelihoods ($P$).

Example: Probability Space for BPSK over AWGN

Construct an explicit probability space for the experiment of transmitting a single BPSK symbol $s \in \{+1, -1\}$ over an AWGN channel with received signal $r = s + n$, where $n \sim \mathcal{N}(0, \sigma^2)$.
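
A sketch of one construction (one of several, per the modelling discussion above; the Gaussian density used to define $P$ anticipates the formal treatment of Section 2.2): take

$$\Omega = \{+1, -1\} \times \mathbb{R}, \qquad \omega = (s, r),$$

with $\mathcal{F}$ the product $\sigma$-algebra $2^{\{+1,-1\}} \otimes \mathcal{B}(\mathbb{R})$, and define $P$ on rectangles by

$$P\bigl(\{s_0\} \times B\bigr) = P(s = s_0) \int_B \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(r - s_0)^2}{2\sigma^2}\right) \mathrm{d}r, \qquad s_0 \in \{+1, -1\},\ B \in \mathcal{B}(\mathbb{R}).$$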

Theorem: Basic Properties of Probability Measures

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then:

  1. $P(\varnothing) = 0$.
  2. Complement rule: $P(A^c) = 1 - P(A)$ for all $A \in \mathcal{F}$.
  3. Monotonicity: If $A \subseteq B$, then $P(A) \le P(B)$.
  4. Sub-additivity (union bound): $P\!\left(\displaystyle\bigcup_{k=1}^{n} A_k\right) \le \sum_{k=1}^{n} P(A_k)$.

These properties follow directly from the three Kolmogorov axioms. The union bound (property 4) is used extensively in communications for bounding error probabilities over constellations --- it is the foundation of the "nearest-neighbour union bound" on symbol-error rate.
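
The bound is easy to sanity-check numerically. A minimal Monte Carlo sketch in Python (the three interval events, sample size, and seed are illustrative assumptions, not from the text): the exact union probability here is $0.7$, while the union bound gives $0.9$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform sample omega ~ U[0, 1) with three overlapping interval events:
# A1 = [0.0, 0.3), A2 = [0.2, 0.5), A3 = [0.4, 0.7)
omega = rng.random(1_000_000)
A1 = omega < 0.3
A2 = (omega >= 0.2) & (omega < 0.5)
A3 = (omega >= 0.4) & (omega < 0.7)

p_union = np.mean(A1 | A2 | A3)              # Monte Carlo estimate of P(A1 ∪ A2 ∪ A3)
bound = A1.mean() + A2.mean() + A3.mean()    # union bound: sum of the marginals

print(f"P(union) ≈ {p_union:.3f}, union bound = {bound:.3f}")   # ≈ 0.700 vs 0.900
```

The bound always holds but over-counts the overlaps; it is tight when the events are nearly disjoint, which is why it works well for high-SNR error events.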


Theorem: Inclusion-Exclusion Principle

Let $(\Omega, \mathcal{F}, P)$ be a probability space. For any two events $A, B \in \mathcal{F}$,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

More generally, for $n$ events $A_1, \dots, A_n$,

$$P\!\left(\bigcup_{k=1}^{n} A_k\right) = \sum_{k} P(A_k) - \sum_{k < \ell} P(A_k \cap A_\ell) + \sum_{k < \ell < m} P(A_k \cap A_\ell \cap A_m) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n).$$

Summing $P(A) + P(B)$ double-counts the overlap $A \cap B$. Subtracting $P(A \cap B)$ corrects this. The general formula alternately adds and subtracts to correct for over- and under-counting of higher-order overlaps.
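
The three-event case is small enough to verify by direct enumeration. A minimal sketch in Python, using hypothetical sets on a six-outcome equally likely sample space:

```python
# Verify three-event inclusion-exclusion by enumeration over a
# six-outcome equally likely sample space (hypothetical example sets).
omega = set(range(6))
A, B, C = {0, 1, 2}, {1, 2, 3}, {2, 3, 4}

def p(event):
    return len(event) / len(omega)

lhs = p(A | B | C)
rhs = (p(A) + p(B) + p(C)
       - p(A & B) - p(A & C) - p(B & C)
       + p(A & B & C))
print(lhs, rhs)   # both 0.8333... = 5/6
```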


Definition:

Conditional Probability

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $B \in \mathcal{F}$ with $P(B) > 0$. The conditional probability of an event $A \in \mathcal{F}$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Key properties:

  • For fixed $B$ with $P(B) > 0$, the map $A \mapsto P(A \mid B)$ is itself a probability measure on $(\Omega, \mathcal{F})$.

  • Multiplication rule: $P(A \cap B) = P(A \mid B)\,P(B)$.

  • Chain rule (general): For events $A_1, \dots, A_n$ with $P(A_1 \cap \cdots \cap A_{n-1}) > 0$,

    $$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$

In detection theory, conditioning is the fundamental operation: the receiver observes $r$ and must compute $P(s \mid r)$ --- the conditional probability of each hypothesis given the observation.
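
Computationally, the definition is a "restrict and renormalise" recipe. A minimal Monte Carlo sketch for the BPSK-over-AWGN setting (prior, noise level, and sample count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1_000_000, 0.8                 # illustrative sample count and noise level

# BPSK over AWGN: s ∈ {+1, −1} equiprobable, r = s + n
s = rng.choice([+1.0, -1.0], size=n)
r = s + sigma * rng.standard_normal(n)

# P(s = +1 | r > 0) = P({s = +1} ∩ {r > 0}) / P(r > 0):
# keep only the outcomes where the conditioning event occurred, then renormalise.
B = r > 0
print(np.mean((s == +1.0) & B) / np.mean(B))   # well above 1/2: positive r favours s = +1
```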


Theorem: Law of Total Probability

Let $\{B_1, B_2, \dots, B_n\}$ be a partition of $\Omega$: the $B_i$ are pairwise disjoint, $\bigcup_{i=1}^n B_i = \Omega$, and $P(B_i) > 0$ for each $i$. Then for any event $A \in \mathcal{F}$,

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i).$$

The result extends to countable partitions provided the sum converges.

We "break up" the calculation of P(A)P(A) by conditioning on which element of the partition occurred. In communications, the partition is often the set of possible transmitted symbols {s1,,sM}\{s_1, \dots, s_M\}, and AA is the event of a detection error.


Theorem: Bayes' Theorem

Let $\{B_1, B_2, \dots, B_n\}$ be a partition of $\Omega$ with $P(B_i) > 0$ for each $i$, and let $A \in \mathcal{F}$ with $P(A) > 0$. Then for each $k = 1, \dots, n$,

$$P(B_k \mid A) = \frac{P(A \mid B_k)\,P(B_k)}{\displaystyle\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}.$$

The terms have standard names:

| Term | Name | Role |
| --- | --- | --- |
| $P(B_k)$ | Prior | Belief about $B_k$ before observing $A$ |
| $P(A \mid B_k)$ | Likelihood | How probable $A$ is under hypothesis $B_k$ |
| $P(B_k \mid A)$ | Posterior | Updated belief about $B_k$ after observing $A$ |
| $\sum_i P(A \mid B_i)\,P(B_i)$ | Evidence (marginal likelihood) | Normalising constant |

Bayes' theorem "inverts" the direction of conditioning. The channel model gives us P(observationsymbol)P(\text{observation} \mid \text{symbol}) --- the likelihood. The receiver needs P(symbolobservation)P(\text{symbol} \mid \text{observation}) --- the posterior. Bayes' theorem is the bridge.


Example: MAP Detection for BPSK via Bayes' Theorem

A BPSK transmitter sends $s \in \{+1, -1\}$ with equal prior probabilities $P(s = +1) = P(s = -1) = 1/2$ over an AWGN channel. The receiver observes

$$r = s + n, \qquad n \sim \mathcal{N}(0, \sigma^2).$$

Using Bayes' theorem, compute the posterior probability $P(s = +1 \mid r)$ and derive the Maximum A Posteriori (MAP) decision rule.
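
A sketch of the computation (the likelihoods are written with the Gaussian density, developed formally in Section 2.2). With equal priors, the prior factors and the common Gaussian normalisation cancel in Bayes' theorem, leaving

$$P(s = +1 \mid r) = \frac{p(r \mid s = +1)}{p(r \mid s = +1) + p(r \mid s = -1)} = \frac{e^{-(r-1)^2/2\sigma^2}}{e^{-(r-1)^2/2\sigma^2} + e^{-(r+1)^2/2\sigma^2}} = \frac{1}{1 + e^{-2r/\sigma^2}}.$$

The posterior exceeds $1/2$ exactly when $r > 0$, so the MAP rule reduces to $\hat{s} = \operatorname{sgn}(r)$.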


Bayes' Theorem in Action: BPSK Detection

[Interactive animation] The two Gaussian likelihoods (conditioned on $s = +1$ and $s = -1$) combine with the prior via Bayes' theorem to yield the posterior probability as the received signal sweeps across the decision boundary. The MAP decision rule selects the hypothesis with the higher posterior.

Definition:

Independence of Events

Let $(\Omega, \mathcal{F}, P)$ be a probability space.

Pairwise independence. Two events $A, B \in \mathcal{F}$ are independent if

$$P(A \cap B) = P(A)\,P(B).$$

Equivalently (when $P(B) > 0$), $A$ and $B$ are independent if and only if $P(A \mid B) = P(A)$: knowing that $B$ occurred does not change the probability of $A$.

Mutual independence. Events $A_1, A_2, \dots, A_n$ are mutually independent if for every subcollection $\{A_{i_1}, A_{i_2}, \dots, A_{i_k}\}$ with $2 \le k \le n$,

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1})\,P(A_{i_2}) \cdots P(A_{i_k}).$$

This requires $2^n - n - 1$ equalities to hold (one for each subset of size $\ge 2$), which is strictly stronger than pairwise independence alone (which only requires $\binom{n}{2}$ equalities).

Independence is a modelling assumption, not something derived from the axioms. In wireless, the assumption that fading coefficients across well-separated antennas are independent is justified by physical arguments (sufficient antenna spacing $\ge \lambda/2$ in a rich scattering environment), but must always be stated explicitly.


Why This Matters: Independent Fading and Diversity Gain

The concept of independence is at the heart of diversity in wireless systems. Consider a receiver with $L$ antennas, each observing a faded copy of the transmitted signal:

$$r_\ell = h_\ell\,s + n_\ell, \qquad \ell = 1, 2, \dots, L,$$

where $h_\ell$ is the fading coefficient on antenna $\ell$, $s$ is the transmitted symbol, and $n_\ell$ is additive noise.

If the fading coefficients $h_1, h_2, \dots, h_L$ are mutually independent (Definition: Independence of Events), the probability that all $L$ channels are simultaneously in a deep fade is

$$P\!\left(\bigcap_{\ell=1}^L \{|h_\ell|^2 < \gamma_{\mathrm{th}}\}\right) = \prod_{\ell=1}^{L} P(|h_\ell|^2 < \gamma_{\mathrm{th}}).$$

For Rayleigh fading with $P(|h_\ell|^2 < \gamma_{\mathrm{th}}) \approx \gamma_{\mathrm{th}}$ at low threshold, this product scales as $\gamma_{\mathrm{th}}^L$, yielding a diversity order of $L$. The error probability at high SNR then decays as $P_e \propto \text{SNR}^{-L}$ --- each additional independent antenna path steepens the decay by one more power of SNR.

Key insight: Without independence, adding antennas may not help. If $h_1 = h_2 = \cdots = h_L$ (fully correlated fading), all antennas fade together and the diversity order remains $1$. Independence is the mechanism that converts extra hardware into reliability.
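
The $\gamma_{\mathrm{th}}^L$ scaling is easy to see by simulation. A minimal Monte Carlo sketch (Rayleigh fading modelled through unit-mean exponential power gains; threshold, sample count, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma_th = 2_000_000, 0.1          # illustrative sample count and deep-fade threshold

# Rayleigh fading: |h|^2 ~ Exp(1), so P(|h|^2 < gamma_th) = 1 - exp(-gamma_th) ≈ gamma_th.
for L in (1, 2, 3, 4):
    h2 = rng.exponential(1.0, size=(n, L))           # independent per-antenna power gains
    p_all = np.mean(np.all(h2 < gamma_th, axis=1))   # all L branches simultaneously faded
    print(f"L={L}: P(all deep fade) ≈ {p_all:.1e}  vs  gamma_th^L = {gamma_th**L:.0e}")
```

Doubling the number of independent branches squares the deep-fade probability, which is the product rule of mutual independence at work.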

See full treatment in Chapter 4, Section 3.

Historical Note: Kolmogorov's Axiomatization (1933)

Prior to the 20th century, probability was a collection of useful but ad hoc calculation rules. Attempts to ground it rigorously --- by Laplace (equally likely outcomes), von Mises (frequency limits), and others --- each covered only special cases.

In 1933, the Russian mathematician Andrey Nikolaevich Kolmogorov (1903--1987) published Grundbegriffe der Wahrscheinlichkeitsrechnung ("Foundations of the Theory of Probability"), in which he showed that the entire theory could be derived from just three axioms by identifying probability with a normalised measure on a $\sigma$-algebra of events. This measure-theoretic framework unified discrete and continuous probability, resolved paradoxes, and put limit theorems (law of large numbers, central limit theorem) on rigorous footing.

Kolmogorov's axioms are precisely the axioms (K1)--(K3) of Definition: Probability Measure (Kolmogorov Axioms). Nearly a century later, they remain the universally accepted foundation of probability theory --- and, by extension, of all stochastic models in communications, information theory, and signal processing.

Historical irony: Kolmogorov initially trained in history before switching to mathematics. His axiomatization of probability was inspired by Lebesgue's theory of integration --- the same machinery that today underpins the definition of expectation, density functions, and information-theoretic integrals.

Common Mistake: Confusing Pairwise Independence with Mutual Independence

Mistake:

A common error is to assume that if events $A_1, A_2, \dots, A_n$ are pairwise independent (i.e., $P(A_i \cap A_j) = P(A_i)P(A_j)$ for every pair $i \neq j$), then they are automatically mutually independent.

Correction:

Pairwise independence does not imply mutual independence.

A classic counterexample uses two fair coin flips. Define:

  • A1A_1 = "first coin is heads,"
  • A2A_2 = "second coin is heads,"
  • A3A_3 = "the two coins show the same face."

One can verify:

  • $P(A_1) = P(A_2) = P(A_3) = 1/2$.
  • $P(A_1 \cap A_2) = 1/4 = P(A_1)P(A_2)$. (Pairwise independent.)
  • $P(A_1 \cap A_3) = 1/4 = P(A_1)P(A_3)$. (Pairwise independent.)
  • $P(A_2 \cap A_3) = 1/4 = P(A_2)P(A_3)$. (Pairwise independent.)
  • But $P(A_1 \cap A_2 \cap A_3) = 1/4 \neq 1/8 = P(A_1)P(A_2)P(A_3)$.

The triple intersection condition fails. Knowing both coins are heads ($A_1 \cap A_2$) makes $A_3$ certain, so the three events are not mutually independent.
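
The counterexample is small enough to check exhaustively. A minimal sketch in Python:

```python
from itertools import product

# All four equally likely outcomes of two fair coin flips.
outcomes = list(product("HT", repeat=2))

def p(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A1 = lambda w: w[0] == "H"      # first coin heads
A2 = lambda w: w[1] == "H"      # second coin heads
A3 = lambda w: w[0] == w[1]     # both coins show the same face

print(p(lambda w: A1(w) and A2(w)), p(A1) * p(A2))   # 0.25 0.25  -> pairwise holds
print(p(lambda w: A1(w) and A3(w)), p(A1) * p(A3))   # 0.25 0.25
print(p(lambda w: A2(w) and A3(w)), p(A2) * p(A3))   # 0.25 0.25
print(p(lambda w: A1(w) and A2(w) and A3(w)),
      p(A1) * p(A2) * p(A3))                         # 0.25 0.125 -> mutual fails
```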

In wireless context: When modelling fading across multiple antennas or subcarriers, one must verify (or assume) mutual independence, not merely pairwise independence, for diversity arguments to hold. Physical channel models based on independent scatterers typically guarantee mutual independence, but engineered systems with shared RF components may introduce subtle three-way (or higher) correlations.

Quick Check

In a binary communication system, $P(s = +1) = 0.6$ and $P(s = -1) = 0.4$. The channel has crossover probabilities $P(r = -1 \mid s = +1) = 0.1$ and $P(r = +1 \mid s = -1) = 0.2$. What is $P(s = +1 \mid r = +1)$?

$0.60$

$0.54$

$\frac{0.54}{0.62} \approx 0.871$

$\frac{0.54}{0.60} = 0.90$

Quick Check

Let $A$ and $B$ be independent events with $P(A) = 0.3$ and $P(B) = 0.5$. What is $P(A^c \cap B)$?

$0.15$

$0.35$

$0.50$

$0.65$

Quick Check

Events $A$ and $B$ satisfy $P(A) = 0.4$, $P(B) = 0.5$, and $P(A \cap B) = 0.2$. What is $P(A \cup B)$?

$0.9$

$0.7$

$0.5$

$0.8$

Sample space

The set $\Omega$ of all possible outcomes of a random experiment. Each element $\omega \in \Omega$ is a sample point. The sample space may be finite, countably infinite, or uncountable.

Related: Sample Space, Probability Space

$\sigma$-algebra

A collection $\mathcal{F}$ of subsets of $\Omega$ that contains $\Omega$, is closed under complementation, and is closed under countable unions. Elements of $\mathcal{F}$ are called events. For $\Omega = \mathbb{R}$, the standard choice is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$.

Related: $\sigma$-Algebra (Sigma-Algebra), Probability Space

Probability measure

A function $P \colon \mathcal{F} \to [0,1]$ satisfying the Kolmogorov axioms: non-negativity ($P(A) \ge 0$), normalisation ($P(\Omega) = 1$), and countable additivity for disjoint events.

Related: Probability Measure (Kolmogorov Axioms), Probability Space, Kolmogorov's Axiomatization (1933)

Conditional probability

The probability of event $A$ given that event $B$ has occurred, defined as $P(A \mid B) = P(A \cap B) / P(B)$ when $P(B) > 0$. Conditioning is the fundamental operation in Bayesian detection and estimation.

Related: Conditional Probability, Bayes' Theorem, MAP Detection for BPSK via Bayes' Theorem

Independence

Events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. Mutual independence of $n$ events requires the product rule to hold for every subcollection of size $\ge 2$, which is strictly stronger than pairwise independence.

Related: Independence of Events, Confusing Pairwise Independence with Mutual Independence, Independent Fading and Diversity Gain

Key Takeaway

The core messages of this section:

  1. Everything starts with $(\Omega, \mathcal{F}, P)$. The probability space triple is the rigorous foundation for every stochastic statement in communications. The three Kolmogorov axioms --- non-negativity, normalisation, countable additivity --- are minimal yet sufficient to derive all of probability theory.

  2. Bayes' theorem inverts the channel. The channel model gives us $P(\text{observation} \mid \text{hypothesis})$; Bayes' theorem converts this into the posterior $P(\text{hypothesis} \mid \text{observation})$, enabling optimal (MAP) detection. For BPSK with equal priors over AWGN, the MAP rule reduces to $\hat{s} = \operatorname{sgn}(r)$.

  3. Independence enables diversity. When fading paths are mutually independent, the probability of simultaneous deep fades decays as a product, yielding diversity order $L$ with $L$ antennas. Pairwise independence alone is insufficient --- mutual independence is required.