Probability Spaces and Axioms

Why Probability for Wireless Communications?

Chapter 1 equipped us with the deterministic linear-algebraic machinery to write the MIMO input--output relation

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}.$$

But in any real system, the channel matrix $\mathbf{H}$, the noise vector $\mathbf{n}$, and often the transmit symbol $\mathbf{x}$ itself are random. Wireless channels fade because of multipath propagation; thermal noise is an inherently stochastic phenomenon described by statistical mechanics; and coded data streams are designed to look random to maximise entropy.

A deterministic analysis can tell us what happens for a particular channel realisation, but system design demands answers to questions such as:

  • What is the probability that the bit-error rate exceeds $10^{-5}$?
  • What average data rate can a fading channel support?
  • How does spatial diversity across independent antenna paths reduce outage probability?

These questions are unanswerable without a rigorous probability framework. This section lays the measure-theoretic foundation --- sample spaces, $\sigma$-algebras, and the Kolmogorov axioms --- that underlies every probabilistic statement in the remainder of this text. The payoff is immediate: by the end of this section we will already be applying Bayes' theorem to optimal symbol detection.

Definition:

Sample Space

The sample space, denoted $\Omega$, is the set of all possible outcomes of a random experiment. Each element $\omega \in \Omega$ is called a sample point (or outcome).

The sample space may be:

  • Finite: e.g., $\Omega = \{+1, -1\}$ for a single BPSK symbol.
  • Countably infinite: e.g., $\Omega = \{0, 1, 2, \dots\}$ for the number of packet arrivals in a time slot.
  • Uncountable: e.g., $\Omega = \mathbb{R}$ for a continuous noise sample, or $\Omega = \mathbb{C}^{N_r \times N_t}$ for the set of all possible MIMO channel realisations.

In wireless communications, common sample spaces include:

| Experiment | Sample space $\Omega$ |
| --- | --- |
| Transmitted BPSK symbol | $\{+1, -1\}$ |
| Transmitted QPSK symbol | $\{(\pm 1 \pm j)/\sqrt{2}\}$ |
| Received baseband sample (AWGN) | $\mathbb{R}$ or $\mathbb{C}$ |
| Flat-fading SISO channel gain | $\mathbb{C}$ |
| MIMO channel matrix ($N_r \times N_t$) | $\mathbb{C}^{N_r \times N_t}$ |

The choice of sample space is a modelling decision. For BPSK over AWGN, one may take $\Omega = \{+1,-1\} \times \mathbb{R}$ (encoding both the transmitted symbol and the received signal) or, if the transmitted symbol is treated as deterministic, simply $\Omega = \mathbb{R}$.


Definition:

$\sigma$-Algebra (Sigma-Algebra)

Let $\Omega$ be a sample space. A $\sigma$-algebra (or $\sigma$-field) on $\Omega$ is a collection $\mathcal{F}$ of subsets of $\Omega$ satisfying the following three axioms:

  1. Contains the sample space: $\Omega \in \mathcal{F}$.

  2. Closed under complementation: If $A \in \mathcal{F}$, then $A^c = \Omega \setminus A \in \mathcal{F}$.

  3. Closed under countable unions: If $A_1, A_2, A_3, \dots \in \mathcal{F}$, then $\displaystyle\bigcup_{k=1}^{\infty} A_k \in \mathcal{F}$.

Each element $A \in \mathcal{F}$ is called an event.

Immediate consequences of the axioms:

  • $\varnothing = \Omega^c \in \mathcal{F}$ (by axioms 1 and 2).
  • $\mathcal{F}$ is closed under countable intersections (by De Morgan's laws and axioms 2--3): $\displaystyle\bigcap_{k=1}^{\infty} A_k = \Bigl(\bigcup_{k=1}^{\infty} A_k^c\Bigr)^c \in \mathcal{F}$.
  • $\mathcal{F}$ is closed under set differences: $A \setminus B = A \cap B^c \in \mathcal{F}$.

For finite or countable $\Omega$, one typically takes $\mathcal{F} = 2^{\Omega}$ (the power set). For uncountable sample spaces such as $\Omega = \mathbb{R}$, the power set is too large and one uses the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$, generated by all open intervals. This subtlety becomes essential when defining continuous random variables in Section 2.2.
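
For the smallest non-trivial illustration, take the BPSK sample space $\Omega = \{+1, -1\}$. Its power set is

$$\mathcal{F} = 2^{\Omega} = \bigl\{\varnothing,\ \{+1\},\ \{-1\},\ \Omega\bigr\},$$

and all three axioms can be checked directly: $\Omega \in \mathcal{F}$, each complement is present (e.g., $\{+1\}^c = \{-1\}$), and every union of members is again a member.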


Definition:

Probability Measure (Kolmogorov Axioms)

Let $(\Omega, \mathcal{F})$ be a measurable space (a sample space equipped with a $\sigma$-algebra). A probability measure is a function $P \colon \mathcal{F} \to [0,1]$ satisfying the three Kolmogorov axioms:

(K1) Non-negativity. For every event $A \in \mathcal{F}$,

$$P(A) \ge 0.$$

(K2) Normalisation.

$$P(\Omega) = 1.$$

(K3) Countable additivity ($\sigma$-additivity). If $A_1, A_2, A_3, \dots \in \mathcal{F}$ are pairwise disjoint (i.e., $A_i \cap A_j = \varnothing$ for $i \neq j$), then

$$P\!\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k).$$

These three axioms, together with the $\sigma$-algebra structure, are sufficient to derive all standard rules of probability.

Countable additivity (K3) is strictly stronger than finite additivity. The distinction matters when taking limits of event sequences --- a situation that arises naturally in coding theory (block length $n \to \infty$) and in ergodic arguments for stochastic processes.
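
One standard corollary of (K3) makes the point concrete: for an increasing sequence of events $A_1 \subseteq A_2 \subseteq \cdots$ in $\mathcal{F}$,

$$P\!\left(\bigcup_{k=1}^{\infty} A_k\right) = \lim_{k \to \infty} P(A_k).$$

This continuity property, which finite additivity alone cannot deliver, is exactly the tool behind such limiting arguments.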


Definition:

Probability Space

A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a sample space, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and $P$ is a probability measure on $(\Omega, \mathcal{F})$.

Every probabilistic model in communications theory begins --- at least implicitly --- with the specification of a probability space. The triple $(\Omega, \mathcal{F}, P)$ provides:

  1. a complete catalogue of what can happen ($\Omega$),
  2. a specification of which collections of outcomes are "observable" or "measurable" ($\mathcal{F}$), and
  3. a consistent assignment of likelihoods ($P$).

Example: Probability Space for BPSK over AWGN

Construct an explicit probability space for the experiment of transmitting a single BPSK symbol $s \in \{+1, -1\}$ over an AWGN channel with received signal $r = s + n$, where $n \sim \mathcal{N}(0, \sigma^2)$.
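
A sketch of one construction (one of several, per the modelling discussion above; the Gaussian density used to define $P$ anticipates the formal treatment of Section 2.2): take

$$\Omega = \{+1, -1\} \times \mathbb{R}, \qquad \omega = (s, r),$$

with $\mathcal{F}$ the product $\sigma$-algebra $2^{\{+1,-1\}} \otimes \mathcal{B}(\mathbb{R})$, and define $P$ on rectangles by

$$P\bigl(\{s_0\} \times B\bigr) = P(s = s_0) \int_B \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(r - s_0)^2}{2\sigma^2}\right) \mathrm{d}r, \qquad s_0 \in \{+1, -1\},\ B \in \mathcal{B}(\mathbb{R}).$$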

Theorem: Basic Properties of Probability Measures

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then:

  1. $P(\varnothing) = 0$.
  2. Complement rule: $P(A^c) = 1 - P(A)$ for all $A \in \mathcal{F}$.
  3. Monotonicity: If $A \subseteq B$, then $P(A) \le P(B)$.
  4. Sub-additivity (union bound): $P\!\left(\displaystyle\bigcup_{k=1}^{n} A_k\right) \le \sum_{k=1}^{n} P(A_k)$.

These properties follow directly from the three Kolmogorov axioms. The union bound (property 4) is used extensively in communications for bounding error probabilities over constellations --- it is the foundation of the "nearest-neighbour union bound" on symbol-error rate.
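
The bound is easy to sanity-check numerically. A minimal Monte Carlo sketch in Python (the three interval events, sample size, and seed are illustrative assumptions, not from the text): the exact union probability here is $0.7$, while the union bound gives $0.9$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform sample omega ~ U[0, 1) with three overlapping interval events:
# A1 = [0.0, 0.3), A2 = [0.2, 0.5), A3 = [0.4, 0.7)
omega = rng.random(1_000_000)
A1 = omega < 0.3
A2 = (omega >= 0.2) & (omega < 0.5)
A3 = (omega >= 0.4) & (omega < 0.7)

p_union = np.mean(A1 | A2 | A3)              # Monte Carlo estimate of P(A1 ∪ A2 ∪ A3)
bound = A1.mean() + A2.mean() + A3.mean()    # union bound: sum of the marginals

print(f"P(union) ≈ {p_union:.3f}, union bound = {bound:.3f}")   # ≈ 0.700 vs 0.900
```

The bound always holds but over-counts the overlaps; it is tight when the events are nearly disjoint, which is why it works well for high-SNR error events.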


Theorem: Inclusion-Exclusion Principle

Let $(\Omega, \mathcal{F}, P)$ be a probability space. For any two events $A, B \in \mathcal{F}$,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

More generally, for $n$ events $A_1, \dots, A_n$,

$$P\!\left(\bigcup_{k=1}^{n} A_k\right) = \sum_{k} P(A_k) - \sum_{k < \ell} P(A_k \cap A_\ell) + \sum_{k < \ell < m} P(A_k \cap A_\ell \cap A_m) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n).$$

Summing $P(A) + P(B)$ double-counts the overlap $A \cap B$. Subtracting $P(A \cap B)$ corrects this. The general formula alternately adds and subtracts to correct for over- and under-counting of higher-order overlaps.
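
The three-event case is small enough to verify by direct enumeration. A minimal sketch in Python, using hypothetical sets on a six-outcome equally likely sample space:

```python
# Verify three-event inclusion-exclusion by enumeration over a
# six-outcome equally likely sample space (hypothetical example sets).
omega = set(range(6))
A, B, C = {0, 1, 2}, {1, 2, 3}, {2, 3, 4}

def p(event):
    return len(event) / len(omega)

lhs = p(A | B | C)
rhs = (p(A) + p(B) + p(C)
       - p(A & B) - p(A & C) - p(B & C)
       + p(A & B & C))
print(lhs, rhs)   # both 0.8333... = 5/6
```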


Definition:

Conditional Probability

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $B \in \mathcal{F}$ with $P(B) > 0$. The conditional probability of an event $A \in \mathcal{F}$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Key properties:

  • For fixed $B$ with $P(B) > 0$, the map $A \mapsto P(A \mid B)$ is itself a probability measure on $(\Omega, \mathcal{F})$.

  • Multiplication rule: $P(A \cap B) = P(A \mid B)\,P(B)$.

  • Chain rule (general): For events $A_1, \dots, A_n$ with $P(A_1 \cap \cdots \cap A_{n-1}) > 0$,

    $$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$

In detection theory, conditioning is the fundamental operation: the receiver observes $r$ and must compute $P(s \mid r)$ --- the conditional probability of each hypothesis given the observation.
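
Computationally, the definition is a "restrict and renormalise" recipe. A minimal Monte Carlo sketch for the BPSK-over-AWGN setting (prior, noise level, and sample count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1_000_000, 0.8                 # illustrative sample count and noise level

# BPSK over AWGN: s ∈ {+1, −1} equiprobable, r = s + n
s = rng.choice([+1.0, -1.0], size=n)
r = s + sigma * rng.standard_normal(n)

# P(s = +1 | r > 0) = P({s = +1} ∩ {r > 0}) / P(r > 0):
# keep only the outcomes where the conditioning event occurred, then renormalise.
B = r > 0
print(np.mean((s == +1.0) & B) / np.mean(B))   # well above 1/2: positive r favours s = +1
```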


Theorem: Law of Total Probability

Let $\{B_1, B_2, \dots, B_n\}$ be a partition of $\Omega$: the $B_i$ are pairwise disjoint, $\bigcup_{i=1}^n B_i = \Omega$, and $P(B_i) > 0$ for each $i$. Then for any event $A \in \mathcal{F}$,

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i).$$

The result extends to countable partitions provided the sum converges.

We "break up" the calculation of P(A)P(A) by conditioning on which element of the partition occurred. In communications, the partition is often the set of possible transmitted symbols {s1,,sM}\{s_1, \dots, s_M\}, and AA is the event of a detection error.


Theorem: Bayes' Theorem

Let $\{B_1, B_2, \dots, B_n\}$ be a partition of $\Omega$ with $P(B_i) > 0$ for each $i$, and let $A \in \mathcal{F}$ with $P(A) > 0$. Then for each $k = 1, \dots, n$,

$$P(B_k \mid A) = \frac{P(A \mid B_k)\,P(B_k)}{\displaystyle\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}.$$

The terms have standard names:

| Term | Name | Role |
| --- | --- | --- |
| $P(B_k)$ | Prior | Belief about $B_k$ before observing $A$ |
| $P(A \mid B_k)$ | Likelihood | How probable $A$ is under hypothesis $B_k$ |
| $P(B_k \mid A)$ | Posterior | Updated belief about $B_k$ after observing $A$ |
| $\sum_i P(A \mid B_i)\,P(B_i)$ | Evidence (marginal likelihood) | Normalising constant |

Bayes' theorem "inverts" the direction of conditioning. The channel model gives us P(observationsymbol)P(\text{observation} \mid \text{symbol}) --- the likelihood. The receiver needs P(symbolobservation)P(\text{symbol} \mid \text{observation}) --- the posterior. Bayes' theorem is the bridge.


Example: MAP Detection for BPSK via Bayes' Theorem

A BPSK transmitter sends $s \in \{+1, -1\}$ with equal prior probabilities $P(s = +1) = P(s = -1) = 1/2$ over an AWGN channel. The receiver observes

$$r = s + n, \qquad n \sim \mathcal{N}(0, \sigma^2).$$

Using Bayes' theorem, compute the posterior probability $P(s = +1 \mid r)$ and derive the Maximum A Posteriori (MAP) decision rule.
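
A sketch of the computation (the likelihoods are written with the Gaussian density, developed formally in Section 2.2). With equal priors, the prior factors and the common Gaussian normalisation cancel in Bayes' theorem, leaving

$$P(s = +1 \mid r) = \frac{p(r \mid s = +1)}{p(r \mid s = +1) + p(r \mid s = -1)} = \frac{e^{-(r-1)^2/2\sigma^2}}{e^{-(r-1)^2/2\sigma^2} + e^{-(r+1)^2/2\sigma^2}} = \frac{1}{1 + e^{-2r/\sigma^2}}.$$

The posterior exceeds $1/2$ exactly when $r > 0$, so the MAP rule reduces to $\hat{s} = \operatorname{sgn}(r)$.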


Bayes' Theorem in Action: BPSK Detection

[Interactive animation] The two Gaussian likelihoods (conditioned on $s = +1$ and $s = -1$) combine with the prior via Bayes' theorem to yield the posterior probability as the received signal sweeps across the decision boundary. The MAP decision rule selects the hypothesis with the higher posterior.

Definition:

Independence of Events

Let $(\Omega, \mathcal{F}, P)$ be a probability space.

Pairwise independence. Two events $A, B \in \mathcal{F}$ are independent if

$$P(A \cap B) = P(A)\,P(B).$$

Equivalently (when $P(B) > 0$), $A$ and $B$ are independent if and only if $P(A \mid B) = P(A)$: knowing that $B$ occurred does not change the probability of $A$.

Mutual independence. Events $A_1, A_2, \dots, A_n$ are mutually independent if for every subcollection $\{A_{i_1}, A_{i_2}, \dots, A_{i_k}\}$ with $2 \le k \le n$,

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1})\,P(A_{i_2}) \cdots P(A_{i_k}).$$

This requires $2^n - n - 1$ equalities to hold (one for each subset of size $\ge 2$), which is strictly stronger than pairwise independence alone (which only requires $\binom{n}{2}$ equalities).

Independence is a modelling assumption, not something derived from the axioms. In wireless, the assumption that fading coefficients across well-separated antennas are independent is justified by physical arguments (sufficient antenna spacing $\ge \lambda/2$ in a rich scattering environment), but must always be stated explicitly.


Why This Matters: Independent Fading and Diversity Gain

The concept of independence is at the heart of diversity in wireless systems. Consider a receiver with $L$ antennas, each observing a faded copy of the transmitted signal:

$$r_\ell = h_\ell\,s + n_\ell, \qquad \ell = 1, 2, \dots, L,$$

where $h_\ell$ is the fading coefficient on antenna $\ell$, $s$ is the transmitted symbol, and $n_\ell$ is additive noise.

If the fading coefficients $h_1, h_2, \dots, h_L$ are mutually independent (Definition: Independence of Events), the probability that all $L$ channels are simultaneously in a deep fade is

$$P\!\left(\bigcap_{\ell=1}^L \{|h_\ell|^2 < \gamma_{\mathrm{th}}\}\right) = \prod_{\ell=1}^{L} P(|h_\ell|^2 < \gamma_{\mathrm{th}}).$$

For Rayleigh fading with $P(|h_\ell|^2 < \gamma_{\mathrm{th}}) \approx \gamma_{\mathrm{th}}$ at low threshold, this product scales as $\gamma_{\mathrm{th}}^L$, yielding a diversity order of $L$. The error probability at high SNR then decays as $P_e \propto \text{SNR}^{-L}$ --- each additional independent antenna path steepens the decay by one more power of SNR.

Key insight: Without independence, adding antennas may not help. If $h_1 = h_2 = \cdots = h_L$ (fully correlated fading), all antennas fade together and the diversity order remains $1$. Independence is the mechanism that converts extra hardware into reliability.
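
The $\gamma_{\mathrm{th}}^L$ scaling is easy to see by simulation. A minimal Monte Carlo sketch (Rayleigh fading modelled through unit-mean exponential power gains; threshold, sample count, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma_th = 2_000_000, 0.1          # illustrative sample count and deep-fade threshold

# Rayleigh fading: |h|^2 ~ Exp(1), so P(|h|^2 < gamma_th) = 1 - exp(-gamma_th) ≈ gamma_th.
for L in (1, 2, 3, 4):
    h2 = rng.exponential(1.0, size=(n, L))           # independent per-antenna power gains
    p_all = np.mean(np.all(h2 < gamma_th, axis=1))   # all L branches simultaneously faded
    print(f"L={L}: P(all deep fade) ≈ {p_all:.1e}  vs  gamma_th^L = {gamma_th**L:.0e}")
```

Doubling the number of independent branches squares the deep-fade probability, which is the product rule of mutual independence at work.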

See full treatment in Chapter 4, Section 3.

Historical Note: Kolmogorov's Axiomatization (1933)

Prior to the 20th century, probability was a collection of useful but ad hoc calculation rules. Attempts to ground it rigorously --- by Laplace (equally likely outcomes), von Mises (frequency limits), and others --- each covered only special cases.

In 1933, the Russian mathematician Andrey Nikolaevich Kolmogorov (1903--1987) published Grundbegriffe der Wahrscheinlichkeitsrechnung ("Foundations of the Theory of Probability"), in which he showed that the entire theory could be derived from just three axioms by identifying probability with a normalised measure on a $\sigma$-algebra of events. This measure-theoretic framework unified discrete and continuous probability, resolved paradoxes, and put limit theorems (law of large numbers, central limit theorem) on rigorous footing.

Kolmogorov's axioms are precisely the axioms (K1)--(K3) of Definition: Probability Measure (Kolmogorov Axioms). Nearly a century later, they remain the universally accepted foundation of probability theory --- and, by extension, of all stochastic models in communications, information theory, and signal processing.

Historical irony: Kolmogorov initially trained in history before switching to mathematics. His axiomatization of probability was inspired by Lebesgue's theory of integration --- the same machinery that today underpins the definition of expectation, density functions, and information-theoretic integrals.

Common Mistake: Confusing Pairwise Independence with Mutual Independence

Mistake:

A common error is to assume that if events $A_1, A_2, \dots, A_n$ are pairwise independent (i.e., $P(A_i \cap A_j) = P(A_i)P(A_j)$ for every pair $i \neq j$), then they are automatically mutually independent.

Correction:

Pairwise independence does not imply mutual independence.

A classic counterexample uses two fair coin flips. Define:

  • A1A_1 = "first coin is heads,"
  • A2A_2 = "second coin is heads,"
  • A3A_3 = "the two coins show the same face."

One can verify:

  • $P(A_1) = P(A_2) = P(A_3) = 1/2$.
  • $P(A_1 \cap A_2) = 1/4 = P(A_1)P(A_2)$. (Pairwise independent.)
  • $P(A_1 \cap A_3) = 1/4 = P(A_1)P(A_3)$. (Pairwise independent.)
  • $P(A_2 \cap A_3) = 1/4 = P(A_2)P(A_3)$. (Pairwise independent.)
  • But $P(A_1 \cap A_2 \cap A_3) = 1/4 \neq 1/8 = P(A_1)P(A_2)P(A_3)$.

The triple intersection condition fails. Knowing both coins are heads ($A_1 \cap A_2$) makes $A_3$ certain, so the three events are not mutually independent.
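
The counterexample is small enough to check exhaustively. A minimal sketch in Python:

```python
from itertools import product

# All four equally likely outcomes of two fair coin flips.
outcomes = list(product("HT", repeat=2))

def p(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A1 = lambda w: w[0] == "H"      # first coin heads
A2 = lambda w: w[1] == "H"      # second coin heads
A3 = lambda w: w[0] == w[1]     # both coins show the same face

print(p(lambda w: A1(w) and A2(w)), p(A1) * p(A2))   # 0.25 0.25  -> pairwise holds
print(p(lambda w: A1(w) and A3(w)), p(A1) * p(A3))   # 0.25 0.25
print(p(lambda w: A2(w) and A3(w)), p(A2) * p(A3))   # 0.25 0.25
print(p(lambda w: A1(w) and A2(w) and A3(w)),
      p(A1) * p(A2) * p(A3))                         # 0.25 0.125 -> mutual fails
```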

In wireless context: When modelling fading across multiple antennas or subcarriers, one must verify (or assume) mutual independence, not merely pairwise independence, for diversity arguments to hold. Physical channel models based on independent scatterers typically guarantee mutual independence, but engineered systems with shared RF components may introduce subtle three-way (or higher) correlations.

Quick Check

In a binary communication system, $P(s = +1) = 0.6$ and $P(s = -1) = 0.4$. The channel has crossover probabilities $P(r = -1 \mid s = +1) = 0.1$ and $P(r = +1 \mid s = -1) = 0.2$. What is $P(s = +1 \mid r = +1)$?

$0.60$

$0.54$

$\frac{0.54}{0.62} \approx 0.871$

$\frac{0.54}{0.60} = 0.90$

Quick Check

Let $A$ and $B$ be independent events with $P(A) = 0.3$ and $P(B) = 0.5$. What is $P(A^c \cap B)$?

$0.15$

$0.35$

$0.50$

$0.65$

Quick Check

Events $A$ and $B$ satisfy $P(A) = 0.4$, $P(B) = 0.5$, and $P(A \cap B) = 0.2$. What is $P(A \cup B)$?

$0.9$

$0.7$

$0.5$

$0.8$

Sample space

The set $\Omega$ of all possible outcomes of a random experiment. Each element $\omega \in \Omega$ is a sample point. The sample space may be finite, countably infinite, or uncountable.

Related: Sample Space, Probability Space

$\sigma$-algebra

A collection $\mathcal{F}$ of subsets of $\Omega$ that contains $\Omega$, is closed under complementation, and is closed under countable unions. Elements of $\mathcal{F}$ are called events. For $\Omega = \mathbb{R}$, the standard choice is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$.

Related: $\sigma$-Algebra (Sigma-Algebra), Probability Space

Probability measure

A function $P \colon \mathcal{F} \to [0,1]$ satisfying the Kolmogorov axioms: non-negativity ($P(A) \ge 0$), normalisation ($P(\Omega) = 1$), and countable additivity for disjoint events.

Related: Probability Measure (Kolmogorov Axioms), Probability Space, Kolmogorov's Axiomatization (1933)

Conditional probability

The probability of event $A$ given that event $B$ has occurred, defined as $P(A \mid B) = P(A \cap B) / P(B)$ when $P(B) > 0$. Conditioning is the fundamental operation in Bayesian detection and estimation.

Related: Conditional Probability, Bayes' Theorem, MAP Detection for BPSK via Bayes' Theorem

Independence

Events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. Mutual independence of $n$ events requires the product rule to hold for every subcollection of size $\ge 2$, which is strictly stronger than pairwise independence.

Related: Independence of Events, Confusing Pairwise Independence with Mutual Independence, Independent Fading and Diversity Gain

Key Takeaway

The core messages of this section:

  1. Everything starts with $(\Omega, \mathcal{F}, P)$. The probability space triple is the rigorous foundation for every stochastic statement in communications. The three Kolmogorov axioms --- non-negativity, normalisation, countable additivity --- are minimal yet sufficient to derive all of probability theory.

  2. Bayes' theorem inverts the channel. The channel model gives us $P(\text{observation} \mid \text{hypothesis})$; Bayes' theorem converts this into the posterior $P(\text{hypothesis} \mid \text{observation})$, enabling optimal (MAP) detection. For BPSK with equal priors over AWGN, the MAP rule reduces to $\hat{s} = \operatorname{sgn}(r)$.

  3. Independence enables diversity. When fading paths are mutually independent, the probability of simultaneous deep fades decays as a product, yielding diversity order $L$ with $L$ antennas. Pairwise independence alone is insufficient --- mutual independence is required.