Convergence Concepts and Limit Theorems
Why Convergence Concepts Matter in Telecommunications
Throughout wireless communications and information theory, we repeatedly encounter statements of the form "as the number of samples (or users, or antennas, or code length) grows, a random quantity approaches a deterministic limit." Making such statements precise requires a rigorous notion of convergence for random variables --- and it turns out that there are several inequivalent notions, each with different strengths and applications.
Two limit theorems dominate the field:
- The Law of Large Numbers (LLN) guarantees that the sample mean $\bar{X}_n$ converges to the true mean $\mu$ as $n \to \infty$. This underpins:
  - Channel estimation: averaging noisy pilot measurements to recover the true channel coefficient $h$;
  - Ergodic capacity: the time-averaged mutual information converges to the ergodic capacity over many fading realisations;
  - Monte Carlo simulation: computing bit-error rates by averaging over many independent trials.
- The Central Limit Theorem (CLT) asserts that the normalised sum of iid random variables converges in distribution to a Gaussian, regardless of the original distribution. This explains:
  - why thermal noise is well modelled as Gaussian (superposition of many independent microscopic contributions);
  - why aggregate interference in large cellular networks is approximately Gaussian;
  - why the Rayleigh fading model arises from many scattered paths (Section 2.3).
The Chernoff bound provides exponentially tight tail probabilities that are essential for analysing error exponents in coding theory and outage probabilities in fading channels.
This section formalises the four modes of convergence, establishes their hierarchy, proves the LLN and CLT, and derives the Chernoff bound --- equipping us with the limit-theorem toolkit needed for the remainder of this text.
Definition: Convergence in Probability
Convergence in Probability
A sequence of random variables $\{X_n\}_{n \ge 1}$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is said to converge in probability to a random variable $X$, written $X_n \xrightarrow{p} X$,
if for every $\epsilon > 0$,
$$\lim_{n \to \infty} P\big(|X_n - X| > \epsilon\big) = 0.$$
Equivalently, for every $\epsilon > 0$ and every $\delta > 0$, there exists $N$ such that for all $n \ge N$, $P(|X_n - X| > \epsilon) < \delta$.
Interpretation: For large $n$, the probability that $X_n$ deviates from $X$ by more than any prescribed tolerance becomes arbitrarily small. However, convergence in probability does not guarantee that $X_n(\omega) \to X(\omega)$ for every (or even almost every) sample point $\omega$.
In channel estimation, if $\hat{h}_n$ is an estimator of the channel coefficient $h$ based on $n$ pilot symbols, then $\hat{h}_n \xrightarrow{p} h$ means the estimator is consistent: with enough pilots, the estimate is arbitrarily close to the true channel with high probability.
Definition: Almost Sure Convergence
Almost Sure Convergence
A sequence $\{X_n\}$ converges almost surely (a.s.) to $X$, written $X_n \xrightarrow{\text{a.s.}} X$,
if
$$P\big(\{\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}\big) = 1.$$
That is, the set of sample points $\omega$ for which $X_n(\omega) \to X(\omega)$ as $n \to \infty$ has probability one.
Equivalent formulation (via limsup): $P\big(\limsup_{n \to \infty} |X_n - X| = 0\big) = 1$,
or equivalently, for every $\epsilon > 0$,
$$P\Big(\limsup_{n \to \infty} \{|X_n - X| > \epsilon\}\Big) = 0.$$
Comparison with convergence in probability: Almost sure convergence is pointwise convergence of $X_n(\omega)$ to $X(\omega)$ outside a null set, whereas convergence in probability only controls the probability of deviation at each fixed $n$. Almost sure convergence is strictly stronger.
The distinction matters in practice. If an adaptive equaliser's tap weights converge almost surely to the optimal Wiener solution, then on (almost) every sample path the equaliser eventually "locks on" and stays near the optimum. Mere convergence in probability would allow the equaliser to occasionally wander far from the optimum, even at late times --- though with vanishing probability.
Definition: Convergence in Distribution
Convergence in Distribution
A sequence $\{X_n\}$ converges in distribution to $X$, written $X_n \xrightarrow{d} X$,
if
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
at every point $x$ where $F_X$ is continuous. Here $F_{X_n}$ and $F_X$ denote the cumulative distribution functions of $X_n$ and $X$, respectively.
Key properties:
- Convergence in distribution is the weakest of the four convergence modes.
- It does not require and to be defined on the same probability space.
- By the Portmanteau theorem, $X_n \xrightarrow{d} X$ is equivalent to $E[g(X_n)] \to E[g(X)]$ for every bounded continuous function $g$.
- Lévy's continuity theorem: $X_n \xrightarrow{d} X$ if and only if $\varphi_{X_n}(t) \to \varphi_X(t)$ pointwise for all $t \in \mathbb{R}$, where $\varphi$ denotes the characteristic function.
The CLT is a statement about convergence in distribution: the standardised sum converges in distribution to $\mathcal{N}(0, 1)$. This does not mean the sum "becomes" Gaussian in any pathwise sense --- only that its CDF (and hence all tail probabilities) approaches the Gaussian CDF.
Definition: Convergence in Mean Square ($L^2$ Convergence)
Convergence in Mean Square ($L^2$ Convergence)
A sequence $\{X_n\}$ converges in mean square (or in $L^2$) to $X$, written $X_n \xrightarrow{L^2} X$,
if
$$\lim_{n \to \infty} E\big[|X_n - X|^2\big] = 0.$$
This requires $E[X_n^2] < \infty$ and $E[X^2] < \infty$ (i.e., $X_n$ and $X$ must be square-integrable).
Immediate consequence: If $X_n \xrightarrow{L^2} X$, then $E[X_n] \to E[X]$ and $E[X_n^2] \to E[X^2]$.
More generally, convergence in $L^p$ (i.e., $E[|X_n - X|^p] \to 0$) is defined analogously for $p \ge 1$.
In estimation theory, the mean-squared error $E[|\hat{h}_n - h|^2]$ is precisely the squared $L^2$ distance between the estimator and the true parameter. An estimator that converges in mean square is consistent in a particularly strong sense: both its bias and its variance vanish. The MMSE (minimum mean-squared error) estimator minimises this distance at every $n$.
Theorem: Hierarchy of Convergence Modes
The four convergence modes are related by the following implications:
$$X_n \xrightarrow{\text{a.s.}} X \;\Longrightarrow\; X_n \xrightarrow{p} X \;\Longrightarrow\; X_n \xrightarrow{d} X, \qquad X_n \xrightarrow{L^2} X \;\Longrightarrow\; X_n \xrightarrow{p} X.$$
In a diagram:

    a.s. ----\
              +---> in probability ---> in distribution
    L^2  ----/
No other implications hold in general:
- Convergence in probability does not imply a.s. convergence.
- Convergence in probability does not imply mean square convergence.
- Mean square convergence does not imply a.s. convergence.
- Convergence in distribution does not imply convergence in probability (unless the limit $X$ is a constant).
Special case: If $X_n \xrightarrow{d} c$ where $c$ is a deterministic constant, then $X_n \xrightarrow{p} c$.
The hierarchy reflects increasing "strength" of control over the random fluctuations of $X_n$ around $X$:
- In distribution controls only the shape of the distribution of $X_n$ (the CDF approaches the target CDF).
- In probability controls the probability of deviation at each fixed $n$ (the tail $P(|X_n - X| > \epsilon) \to 0$).
- Almost surely controls the sample paths (the sequence $X_n(\omega) \to X(\omega)$ for a.e. $\omega$).
- Mean square controls the average squared deviation ($E[|X_n - X|^2] \to 0$).
The key insight for "mean square $\Rightarrow$ in probability" is Markov's inequality: if the expected squared deviation is small, then the probability of a large deviation must also be small. The implication "a.s. $\Rightarrow$ in probability" uses the fact that pointwise convergence on a set of measure one is stronger than just having small deviation probabilities.
Use Markov's inequality to prove that mean square convergence implies convergence in probability.
Use the dominated convergence theorem for "a.s. $\Rightarrow$ in probability".
Mean square $\Rightarrow$ in probability (via Markov/Chebyshev)
By Markov's inequality applied to the nonnegative random variable $|X_n - X|^2$, for any $\epsilon > 0$:
$$P\big(|X_n - X| > \epsilon\big) = P\big(|X_n - X|^2 > \epsilon^2\big) \le \frac{E[|X_n - X|^2]}{\epsilon^2}.$$
If $E[|X_n - X|^2] \to 0$, then the right-hand side tends to zero, so $X_n \xrightarrow{p} X$.
A.s. $\Rightarrow$ in probability (sketch)
Fix $\epsilon > 0$ and define $A_n = \{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}$. Almost sure convergence means $P(\limsup_n A_n) = 0$ (by the Borel--Cantelli-type characterisation). Since $P(A_n) \le P\big(\bigcup_{m \ge n} A_m\big)$ and the latter decreases to $P(\limsup_n A_n) = 0$, we get $P(A_n) \to 0$, which is convergence in probability.
In probability $\Rightarrow$ in distribution (sketch)
Fix $x$ where $F_X$ is continuous and any $\epsilon > 0$. Then
$$F_{X_n}(x) = P(X_n \le x) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon),$$
and similarly
$$F_X(x - \epsilon) = P(X \le x - \epsilon) \le F_{X_n}(x) + P(|X_n - X| > \epsilon).$$
Taking $n \to \infty$ (so $P(|X_n - X| > \epsilon) \to 0$) and then $\epsilon \downarrow 0$ (using continuity of $F_X$ at $x$) yields $F_{X_n}(x) \to F_X(x)$.
Theorem: Weak Law of Large Numbers (WLLN)
Let $X_1, X_2, \ldots$ be independent and identically distributed (iid) random variables with finite mean $\mu$ and finite variance $\sigma^2$. Define the sample mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
Then $\bar{X}_n$ converges to $\mu$ in probability:
$$\bar{X}_n \xrightarrow{p} \mu.$$
That is, for every $\epsilon > 0$,
$$\lim_{n \to \infty} P\big(|\bar{X}_n - \mu| > \epsilon\big) = 0.$$
Averaging $n$ iid samples reduces the variance by a factor of $n$: $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality, the probability that $\bar{X}_n$ deviates from $\mu$ by more than $\epsilon$ is at most $\sigma^2/(n\epsilon^2)$, which vanishes as $n \to \infty$.
Physically: averaging many noisy channel measurements "averages out" the noise, leaving the true channel coefficient. The more pilots we transmit, the better the estimate --- this is the LLN in action.
Compute the mean and variance of $\bar{X}_n$.
Apply Chebyshev's inequality.
Step 1: Mean and variance of $\bar{X}_n$
By linearity of expectation,
$$E[\bar{X}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu.$$
Since the $X_i$ are independent with common variance $\sigma^2$,
$$\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.$$
Step 2: Apply Chebyshev's inequality
Chebyshev's inequality states that for any random variable $Y$ with finite variance and any $\epsilon > 0$,
$$P\big(|Y - E[Y]| > \epsilon\big) \le \frac{\operatorname{Var}(Y)}{\epsilon^2}.$$
Applying this with $Y = \bar{X}_n$:
$$P\big(|\bar{X}_n - \mu| > \epsilon\big) \le \frac{\sigma^2}{n\epsilon^2}.$$
Step 3: Take the limit
For any fixed $\epsilon > 0$,
$$0 \le P\big(|\bar{X}_n - \mu| > \epsilon\big) \le \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0 \quad (n \to \infty).$$
By the squeeze theorem, $P(|\bar{X}_n - \mu| > \epsilon) \to 0$, which is precisely $\bar{X}_n \xrightarrow{p} \mu$. $\blacksquare$
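The Chebyshev bound above is easy to check by simulation. The sketch below uses Uniform(0,1) samples (an assumed illustrative choice, so $\mu = 0.5$ and $\sigma^2 = 1/12$) and compares the empirical deviation probability with $\sigma^2/(n\epsilon^2)$:

```python
# Monte Carlo illustration of the WLLN: the deviation probability
# P(|Xbar_n - mu| > eps) stays below the Chebyshev bound sigma^2/(n eps^2)
# and shrinks as n grows. Uniform(0,1) samples: mu = 0.5, sigma^2 = 1/12.
import random

random.seed(0)
mu, var, eps = 0.5, 1.0 / 12.0, 0.05
trials = 2000

def deviation_prob(n):
    """Empirical P(|sample mean of n Uniform(0,1) draws - mu| > eps)."""
    count = 0
    for _ in range(trials):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - mu) > eps:
            count += 1
    return count / trials

for n in (10, 100, 1000):
    p_emp = deviation_prob(n)
    bound = min(var / (n * eps * eps), 1.0)
    print(f"n={n:5d}  empirical={p_emp:.4f}  Chebyshev bound={bound:.4f}")
```

At $n = 1000$ the deviation probability is already essentially zero, while the Chebyshev bound, though loose, decays at the guaranteed $1/n$ rate.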
Theorem: Strong Law of Large Numbers (SLLN)
Let $X_1, X_2, \ldots$ be iid random variables with finite mean $\mu = E[X_i]$. Then the sample mean converges to $\mu$ almost surely:
$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu.$$
That is,
$$P\Big(\lim_{n \to \infty} \bar{X}_n = \mu\Big) = 1.$$
Note that finite variance is not required; finite mean suffices.
The SLLN strengthens the WLLN from convergence in probability to almost sure convergence. It asserts that on (almost) every sample path $\omega$, the running average eventually settles down to $\mu$ and stays there. This is the rigorous justification for the "frequency interpretation" of probability: if we repeat an experiment indefinitely, the relative frequency of any event converges to its probability.
The proof requires more sophisticated tools from measure theory (e.g., the Borel--Cantelli lemma, truncation arguments, or Kolmogorov's inequality) and is beyond our scope here. The key point is that the SLLN provides a pathwise guarantee that the WLLN does not.
The proof is omitted (requires measure-theoretic tools beyond our scope).
The SLLN implies the WLLN by the convergence hierarchy.
Proof omitted
The proof of the SLLN under the minimal assumption of finite mean (Kolmogorov's version) relies on truncation arguments, Kolmogorov's maximal inequality, and the Borel--Cantelli lemma. These measure-theoretic tools are beyond the scope of this text.
Under the stronger assumption of a finite fourth moment ($E[X_i^4] < \infty$), a more elementary proof using the Borel--Cantelli lemma and a Chebyshev-type bound applied to $E[(\bar{X}_n - \mu)^4]$ is possible. We refer the interested reader to Billingsley (1995), Chapter 22.
What we can verify: Since a.s. convergence implies convergence in probability (the hierarchy of convergence modes above), the SLLN is indeed a stronger result than the WLLN.
Theorem: Central Limit Theorem (CLT)
Let $X_1, X_2, \ldots$ be iid random variables with finite mean $\mu$ and finite variance $\sigma^2 > 0$. Define the standardised partial sum
$$Z_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu) = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}.$$
Then $Z_n$ converges in distribution to a standard normal random variable:
$$Z_n \xrightarrow{d} \mathcal{N}(0, 1).$$
Equivalently, for every $z \in \mathbb{R}$,
$$\lim_{n \to \infty} P(Z_n \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.$$
The CLT is arguably the most important theorem in probability. It says that the sum of many independent, identically distributed random variables --- regardless of their individual distribution --- is approximately Gaussian after proper centering and scaling. The only requirements are a finite mean and a finite, positive variance.
The proof strategy via characteristic functions is elegant: show that the characteristic function of $Z_n$ converges pointwise to $e^{-t^2/2}$, which is the characteristic function of $\mathcal{N}(0, 1)$. By Lévy's continuity theorem, pointwise convergence of characteristic functions implies convergence in distribution.
Compute the characteristic function of $Z_n$ in terms of the common characteristic function of the summands.
Use the Taylor expansion of the characteristic function near the origin.
Apply the limit $(1 + a_n/n)^n \to e^{a}$ whenever $a_n \to a$.
Step 1: Standardise and express the CF of $Z_n$
Without loss of generality, assume $\mu = 0$ (otherwise replace $X_i$ by $X_i - \mu$). Let $Y_i = X_i/\sigma$, so $E[Y_i] = 0$, $\operatorname{Var}(Y_i) = 1$, and $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i$.
The characteristic function of $Z_n$ is
$$\varphi_{Z_n}(t) = E\big[e^{itZ_n}\big] = E\Big[\exp\Big(\frac{it}{\sqrt{n}} \sum_{i=1}^{n} Y_i\Big)\Big].$$
Since the $Y_i$ are iid,
$$\varphi_{Z_n}(t) = \Big[\varphi_Y\Big(\frac{t}{\sqrt{n}}\Big)\Big]^n,$$
where $\varphi_Y$ is the common CF of each $Y_i$.
Step 2: Taylor-expand the characteristic function
Since $E[Y_i] = 0$ and $E[Y_i^2] = 1$, the Taylor expansion of $\varphi_Y$ around $t = 0$ gives
$$\varphi_Y(t) = 1 - \frac{t^2}{2} + o(t^2) \quad (t \to 0).$$
Substituting $t/\sqrt{n}$:
$$\varphi_Y\Big(\frac{t}{\sqrt{n}}\Big) = 1 - \frac{t^2}{2n} + o\Big(\frac{1}{n}\Big).$$
Step 3: Take the $n$-th power and the limit
Therefore
$$\varphi_{Z_n}(t) = \Big[1 - \frac{t^2}{2n} + o\Big(\frac{1}{n}\Big)\Big]^n.$$
Using the fundamental limit $(1 + a_n/n)^n \to e^{a}$ whenever $a_n \to a$, with $a_n = -t^2/2 + n \cdot o(1/n) \to -t^2/2$:
$$\lim_{n \to \infty} \varphi_{Z_n}(t) = e^{-t^2/2} \quad \text{for every } t \in \mathbb{R}.$$
Step 4: Identify the limit and conclude
The function $e^{-t^2/2}$ is the characteristic function of $\mathcal{N}(0, 1)$. By Lévy's continuity theorem, pointwise convergence of characteristic functions to a function that is continuous at $t = 0$ (and $e^{-t^2/2}$ is continuous everywhere) implies convergence in distribution:
$$Z_n \xrightarrow{d} \mathcal{N}(0, 1). \qquad \blacksquare$$
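The convergence can also be watched numerically. The sketch below (using Uniform(0,1) summands as an assumed illustrative distribution) compares the empirical CDF of $Z_n$ at a point with $\Phi$, which is available in the standard library via `math.erf`:

```python
# Monte Carlo check of the CLT: the CDF of the standardised sum
# Z_n = (S_n - n*mu)/(sigma*sqrt(n)) of iid Uniform(0,1) variables
# approaches the standard normal CDF Phi as n grows.
import math
import random

random.seed(1)
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_cdf_gap(n, z=1.0, trials=5000):
    """|P_emp(Z_n <= z) - Phi(z)| for sums of n Uniform(0,1) draws."""
    hits = 0
    for _ in range(trials):
        s = sum(random.random() for _ in range(n))
        zn = (s - n * mu) / (sigma * math.sqrt(n))
        if zn <= z:
            hits += 1
    return abs(hits / trials - Phi(z))

for n in (2, 30):
    print(f"n={n:3d}  |F_emp(1) - Phi(1)| = {empirical_cdf_gap(n):.4f}")
```

Because the uniform distribution is symmetric and light-tailed, the gap is small even for tiny $n$; skewed summands would converge more slowly (see the discussion of the Berry--Esseen theorem later in this section).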
Interactive demo: Central Limit Theorem in Action (adjustable parameters).
Theorem: Chernoff Bound
Let $X$ be a random variable with moment-generating function $M_X(s) = E[e^{sX}]$, finite in a neighbourhood of the origin. Then for any $a \in \mathbb{R}$,
$$P(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s).$$
Similarly, for the left tail,
$$P(X \le a) \le \inf_{s < 0} e^{-sa} M_X(s).$$
The bound is obtained by optimising the free parameter $s$ to get the tightest possible exponential bound.
The Chernoff bound starts from Markov's inequality applied to the exponential function $e^{sX}$ (which is nonnegative, with $x \mapsto e^{sx}$ monotone increasing for $s > 0$). The extra parameter $s$ is then optimised to yield the tightest bound. Because the bound has the exponential form $e^{-sa} M_X(s)$, it typically decays exponentially in the "excess" $a - E[X]$, making it far sharper than Chebyshev's inequality for large deviations.
The Chernoff bound is the foundation of large deviation theory and is directly related to the error exponent in channel coding: the probability of decoding error decays exponentially with block length $n$, and the exponent is characterised via a Chernoff-type optimisation.
Start with the identity $\{X \ge a\} = \{e^{sX} \ge e^{sa}\}$ for $s > 0$.
Apply Markov's inequality to $e^{sX}$.
Optimise over $s > 0$.
Step 1: Exponentiate and apply Markov
For any $s > 0$, the function $x \mapsto e^{sx}$ is strictly increasing, so
$$P(X \ge a) = P\big(e^{sX} \ge e^{sa}\big).$$
By Markov's inequality (for the nonnegative random variable $e^{sX}$),
$$P\big(e^{sX} \ge e^{sa}\big) \le \frac{E[e^{sX}]}{e^{sa}} = e^{-sa} M_X(s).$$
Step 2: Optimise over $s$
The inequality holds for every $s > 0$. To obtain the tightest bound, we minimise the right-hand side over $s$:
$$P(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s).$$
The optimal $s^*$ satisfies the first-order condition
$$\frac{d}{ds}\Big[\ln M_X(s) - sa\Big]\Big|_{s = s^*} = 0,$$
or equivalently $M_X'(s^*)/M_X(s^*) = a$, which means $a$ equals the tilted mean under the exponentially twisted distribution.
Connection to rate function
Defining the log-moment generating function (cumulant generating function)
$$\psi_X(s) = \ln M_X(s) = \ln E[e^{sX}],$$
and its Legendre--Fenchel transform (the rate function)
$$I(a) = \sup_{s > 0} \big[sa - \psi_X(s)\big],$$
the Chernoff bound takes the compact form
$$P(X \ge a) \le e^{-I(a)}.$$
The rate function $I(\cdot)$ governs the exponential decay of tail probabilities and is the central object in Cramér's theorem (the foundation of large deviations theory).
Example: Chernoff Bound for the Sum of iid Bernoulli Random Variables
Let $X_1, \ldots, X_n$ be iid random variables with $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$, and let $S_n = \sum_{i=1}^{n} X_i$. Use the Chernoff bound to show that for any $q$ with $p < q < 1$ (i.e., above the mean),
$$P(S_n \ge nq) \le e^{-n D(q \,\|\, p)},$$
where
$$D(q \,\|\, p) = q \ln\frac{q}{p} + (1 - q) \ln\frac{1 - q}{1 - p}$$
is the Kullback--Leibler divergence (relative entropy) between $\mathrm{Bernoulli}(q)$ and $\mathrm{Bernoulli}(p)$.
Evaluate the bound numerically for concrete values of $n$, $p$, and $q$.
Step 1: MGF of a single Bernoulli
For $s \in \mathbb{R}$,
$$M_X(s) = E[e^{sX}] = (1 - p) + p e^{s}.$$
Step 2: MGF of the sum $S_n$
Since the $X_i$ are iid,
$$M_{S_n}(s) = \big[(1 - p) + p e^{s}\big]^n.$$
Step 3: Apply the Chernoff bound
By the Chernoff bound,
$$P(S_n \ge nq) \le e^{-snq} M_{S_n}(s) = \Big[e^{-sq}\big((1 - p) + p e^{s}\big)\Big]^n \quad \text{for every } s > 0.$$
Step 4: Optimise over $s$
Let $q$ be the "target fraction" of ones. Set the derivative of the per-sample exponent $-sq + \ln\big((1 - p) + p e^{s}\big)$ with respect to $s$ to zero:
$$\frac{p e^{s}}{(1 - p) + p e^{s}} = q.$$
Solving: $p e^{s}(1 - q) = q(1 - p)$, which gives
$$s^* = \ln\frac{q(1 - p)}{p(1 - q)}.$$
This is positive when $q > p$, confirming the regime where the upper-tail bound is nontrivial.
Step 5: Substitute and simplify
Substituting $s^*$ back:
$$P(S_n \ge nq) \le \exp\Big(-n\Big[q \ln\frac{q}{p} + (1 - q) \ln\frac{1 - q}{1 - p}\Big]\Big) = e^{-n D(q \,\|\, p)}.$$
The exponent is $n$ times the KL divergence between $\mathrm{Bernoulli}(q)$ and $\mathrm{Bernoulli}(p)$, showing that the tail probability decays exponentially in $n$.
Step 6: Numerical evaluation
For concrete parameter values, compute $D(q \,\|\, p)$ directly and compare $e^{-n D(q \| p)}$ with the exact binomial tail
$$P(S_n \ge nq) = \sum_{k \ge nq} \binom{n}{k} p^k (1 - p)^{n - k}.$$
The Chernoff bound typically overestimates the exact tail only by a modest constant factor --- remarkably tight for such a simple bound.
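As a concrete check, the sketch below evaluates both sides; the values $n = 100$, $p = 0.5$, $q = 0.6$ are illustrative choices (assumptions for this demo), and the exact tail is computed from the binomial pmf with `math.comb`:

```python
# Numerical check of the Bernoulli Chernoff bound: compare
# exp(-n * D(q||p)) with the exact binomial tail P(S_n >= nq)
# for the illustrative choice n = 100, p = 0.5, q = 0.6.
import math

def kl_bernoulli(q, p):
    """KL divergence D(q||p) between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def binom_tail(n, p, k):
    """Exact P(S_n >= k) for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

n, p, q = 100, 0.5, 0.6
bound = math.exp(-n * kl_bernoulli(q, p))
exact = binom_tail(n, p, 60)
print(f"Chernoff bound : {bound:.4f}")   # ~ 0.1335
print(f"Exact tail     : {exact:.4f}")   # ~ 0.0284
print(f"Ratio          : {bound / exact:.1f}")
```

For these values the bound exceeds the exact tail by roughly a factor of five, while both decay at the same exponential rate $e^{-n D(q\|p)}$ as $n$ grows.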
Why This Matters: CLT Justifies the Gaussian Interference Model
In a large cellular network with $K$ co-channel interferers, the aggregate interference at a receiver is
$$I = \sum_{k=1}^{K} \sqrt{P_k}\, h_k s_k,$$
where $P_k$ is the received power from the $k$-th interferer, $h_k$ is its fading coefficient, and $s_k$ is its data symbol. The terms are (approximately) independent and identically distributed.
When $K$ is large, the Central Limit Theorem guarantees that $I$ is approximately Gaussian, regardless of the distributions of $h_k$ and $s_k$. This justifies the widespread modelling assumption:
$$I \sim \mathcal{CN}(0, \sigma_I^2),$$
where $\sigma_I^2 = \sum_{k=1}^{K} P_k\, E[|h_k s_k|^2]$.
Practical implications:
- SINR analysis: Under the Gaussian interference assumption, the signal-to-interference-plus-noise ratio (SINR) fully determines the achievable rate via $\log_2(1 + \mathrm{SINR})$, just as for AWGN channels.
- Stochastic geometry: In Poisson cellular network models (e.g., the PPP model of Andrews et al., 2011), the aggregate interference from infinitely many base stations is analysed using the CLT and its refinements.
- Massive MIMO: When a base station with $M$ antennas serves $K$ users, the effective interference after matched filtering involves sums of $M$ iid terms. As $M \to \infty$, these sums "harden" (by the LLN) and their fluctuations are Gaussian (by the CLT), leading to channel hardening and favourable propagation --- the two pillars of massive MIMO theory.
See full treatment in Chapter 4, Section 5
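The channel-hardening effect mentioned above can be sketched numerically. The model below is a deliberately minimal assumption (iid circularly-symmetric complex-Gaussian channel entries with unit average power), not a full massive MIMO analysis: the normalised gain $\|h\|^2/M$ concentrates around $1$ (LLN) with fluctuations of order $1/\sqrt{M}$ (CLT).

```python
# Minimal channel-hardening sketch (assumed model): with M antennas and
# iid CN(0,1)-like entries, |h_i|^2 ~ Exp(1), so the normalised gain
# ||h||^2 / M has mean 1 and standard deviation 1/sqrt(M).
import math
import random
import statistics

random.seed(2)

def normalised_gain(M):
    """||h||^2 / M for h with iid real/imag N(0, 1/2) entries."""
    return sum(random.gauss(0, math.sqrt(0.5))**2 +
               random.gauss(0, math.sqrt(0.5))**2 for _ in range(M)) / M

stds = {}
for M in (4, 64, 1024):
    gains = [normalised_gain(M) for _ in range(500)]
    stds[M] = statistics.stdev(gains)
    print(f"M={M:5d}  mean={statistics.mean(gains):.3f}  "
          f"std={stds[M]:.4f}  (1/sqrt(M)={1 / math.sqrt(M):.4f})")
```

The printed standard deviations track $1/\sqrt{M}$ closely: quadrupling the antenna count halves the gain fluctuation, which is exactly the "hardening" exploited in massive MIMO link design.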
Historical Note: De Moivre, Laplace, and Lyapunov: The Long Road to the CLT
The Central Limit Theorem has one of the longest gestation periods in the history of mathematics, spanning nearly two centuries:
- Abraham de Moivre (1733) proved the earliest version: the binomial distribution $\mathrm{Bin}(n, \tfrac{1}{2})$, properly standardised, converges to what we now call the normal distribution. He published this in The Doctrine of Chances as a computational tool for approximating binomial probabilities. De Moivre did not have the concept of a "distribution" --- he worked directly with the ratio of the middle binomial coefficient to $2^n$.
- Pierre-Simon Laplace (1812) extended de Moivre's result to arbitrary $p$ in his monumental Théorie analytique des probabilités, obtaining what is now called the de Moivre--Laplace theorem. Laplace also recognised the broader principle: sums of "errors" tend to follow the "law of errors" (the Gaussian).
- Pafnuty Chebyshev (1867) and his student Andrey Markov (1900) proved versions of the CLT under increasingly general conditions, using the method of moments.
- Aleksandr Lyapunov (1901) gave the first rigorous proof of the CLT for independent (not necessarily identically distributed) random variables using characteristic functions, under a condition now known as the Lyapunov condition. This is essentially the modern proof strategy.
- Jarl Waldemar Lindeberg (1922) and William Feller (1935) established the Lindeberg--Feller theorem, giving necessary and sufficient conditions for the CLT to hold for independent (non-identically distributed) summands.
The CLT that we prove in this section (for iid summands with finite variance) is the simplest and most commonly used version. The general Lindeberg--Feller form is needed in wireless when interferers have unequal powers or different fading statistics.
Quick Check
Consider the following statement: "If $X_n \xrightarrow{p} X$, then $X_n \xrightarrow{\text{a.s.}} X$." Is this statement true or false?
True
False
Convergence in probability does not imply almost sure convergence in general. The hierarchy is a.s. $\Rightarrow$ in probability $\Rightarrow$ in distribution, and this chain of implications is strict.
Counterexample: Let $\Omega = [0, 1]$ with Lebesgue measure. Define $X_n$ to be the indicator of a "sliding block" of width $1/k$ that cycles through $[0, 1]$, where $k$ increases as $n$ grows. Then $X_n \xrightarrow{p} 0$ (convergence in probability to $0$), but for every $\omega$, $X_n(\omega) = 1$ infinitely often, so $X_n(\omega) \not\to 0$ for any $\omega$. Almost sure convergence fails completely.
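The sliding-block construction can be verified deterministically. The sketch below enumerates the blocks level by level (one possible concrete ordering of the counterexample): the block widths, which equal $P(X_n = 1)$, shrink to zero, yet any fixed $\omega$ is covered exactly once per level, hence infinitely often.

```python
# Deterministic check of the sliding-block counterexample: the n-th
# indicator has width 1/k (so P(X_n = 1) -> 0), yet every point of
# [0, 1) is covered exactly once at each level k.
from fractions import Fraction

def blocks(max_level):
    """Yield (k, left, right) for the sliding blocks of levels 1..max_level."""
    for k in range(1, max_level + 1):
        for j in range(k):
            yield k, Fraction(j, k), Fraction(j + 1, k)

omega = Fraction(1, 3)          # an arbitrary fixed sample point in [0, 1)
widths, hits = [], 0
for k, left, right in blocks(50):
    widths.append(Fraction(1, k))   # P(X_n = 1) for this n
    if left <= omega < right:
        hits += 1                   # X_n(omega) = 1

print("last block width P(X_n=1):", float(widths[-1]))   # 1/50 = 0.02
print("times X_n(omega) = 1 over 50 levels:", hits)      # 50: once per level
```

The deviation probability at step $n$ is the current block width and tends to $0$, while `hits` grows without bound: $X_n(\omega) = 1$ infinitely often for this (and every) $\omega$, so $X_n(\omega)$ never converges.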
Quick Check
Let $X_1, \ldots, X_n$ be iid with mean $\mu$ and variance $\sigma^2$. For large $n$, what is the approximate distribution of the sample mean $\bar{X}_n$ according to the CLT?
By the CLT, $\bar{X}_n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$ for large $n$: the fluctuations of $\bar{X}_n$ around $\mu$ shrink at rate $1/\sqrt{n}$.
The standard deviation of $\bar{X}_n$ is $\sigma/\sqrt{n}$.
Quick Check
The Chernoff bound is obtained by applying Markov's inequality to which random variable?
$X$ directly
$(X - E[X])^2$ (the Chebyshev approach)
$e^{sX}$ for an optimised $s > 0$
The Chernoff bound exploits the monotonicity of the exponential: $\{X \ge a\} = \{e^{sX} \ge e^{sa}\}$ for $s > 0$. Applying Markov's inequality to the nonnegative random variable $e^{sX}$ gives $P(X \ge a) \le e^{-sa} E[e^{sX}]$. Optimising over $s > 0$ yields the tightest exponential bound.
This is strictly tighter than Chebyshev's inequality (which applies Markov to $(X - E[X])^2$) because the exponential tilting provides an additional degree of freedom.
Convergence in Probability
A sequence converges in probability to $X$ ($X_n \xrightarrow{p} X$) if $P(|X_n - X| > \epsilon) \to 0$ for every $\epsilon > 0$. This is weaker than almost sure convergence but stronger than convergence in distribution. It is the mode of convergence established by the Weak Law of Large Numbers.
Related: Convergence in Probability, Weak Law of Large Numbers (WLLN), Hierarchy of Convergence Modes
Almost Sure Convergence
A sequence converges almost surely to $X$ ($X_n \xrightarrow{\text{a.s.}} X$) if $P(\lim_{n \to \infty} X_n = X) = 1$. This is pathwise convergence outside a null set and is strictly stronger than convergence in probability. It is the mode of convergence established by the Strong Law of Large Numbers.
Related: Almost Sure Convergence, Strong Law of Large Numbers (SLLN), Hierarchy of Convergence Modes
Convergence in Distribution
A sequence converges in distribution to $X$ ($X_n \xrightarrow{d} X$) if $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$. This is the weakest convergence mode and does not require the random variables to be defined on the same probability space. It is the mode of convergence in the Central Limit Theorem.
Related: Convergence in Distribution, Central Limit Theorem (CLT), Hierarchy of Convergence Modes
Law of Large Numbers (LLN)
The Law of Large Numbers states that the sample mean $\bar{X}_n$ of iid random variables converges to the population mean $\mu$. The Weak Law (WLLN) gives convergence in probability; the Strong Law (SLLN) gives almost sure convergence. The WLLN (as proved here via Chebyshev) requires finite variance; the SLLN requires only finite mean.
Related: Weak Law of Large Numbers (WLLN), Strong Law of Large Numbers (SLLN), Convergence in Probability, Almost Sure Convergence
Central Limit Theorem (CLT)
For iid random variables with mean $\mu$ and finite variance $\sigma^2 > 0$, the standardised sum $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges in distribution to $\mathcal{N}(0, 1)$. This is the fundamental reason why the Gaussian distribution appears ubiquitously in communications: thermal noise, aggregate interference, and fading envelopes all arise from summing many independent contributions.
Related: Central Limit Theorem (CLT), Central Limit Theorem in Action, CLT Justifies the Gaussian Interference Model, De Moivre, Laplace, and Lyapunov: The Long Road to the CLT
Chernoff Bound
An exponential tail bound: $P(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s)$, obtained by applying Markov's inequality to $e^{sX}$ and optimising over the tilting parameter $s$. The Chernoff bound provides exponentially tight estimates and is the basis for error exponent analysis in coding theory and large deviations theory.
Related: Chernoff Bound, Chernoff Bound for the Sum of iid Bernoulli Random Variables
Common Mistake: Using the CLT for Small Sample Sizes
Mistake:
"The CLT says the sum is Gaussian, so even for or samples the Gaussian approximation should be accurate."
Correction:
The CLT is an asymptotic result: it guarantees convergence in distribution as $n \to \infty$, but says nothing about the rate of convergence or the accuracy at any finite $n$. The quality of the Gaussian approximation at finite $n$ depends critically on the shape of the underlying distribution:
- Symmetric, light-tailed distributions (e.g., uniform, symmetric triangular): the approximation is excellent even for small $n$.
- Skewed distributions (e.g., exponential, chi-squared with few degrees of freedom): the Gaussian approximation can be poor for moderate $n$. The skewness of the sum decreases only as $O(1/\sqrt{n})$ (by the Berry--Esseen theorem, the CDF error is bounded by $C\rho/(\sigma^3\sqrt{n})$, where $\rho = E[|X - \mu|^3]$), so highly skewed distributions need larger $n$.
- Heavy-tailed distributions (e.g., Pareto with infinite variance): the CLT does not apply at all, and the sum converges to a stable distribution instead of a Gaussian.
In wireless: The Gaussian interference model (justified by the CLT) is reliable in dense networks with many interferers. For sparse networks with a few dominant interferers, the Gaussian assumption can significantly underestimate the tail of the interference distribution (and hence underestimate outage probability). In such cases, the exact interference distribution or a more refined approximation (e.g., using the Gamma distribution) should be used.
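The dependence of the convergence rate on skewness can be illustrated numerically. The sketch below (Uniform and Exponential summands are assumed illustrative choices) compares the error of the Gaussian approximation to the upper tail $P(Z_n > 2)$ at $n = 10$:

```python
# Monte Carlo illustration: at n = 10 the Gaussian approximation to
# P(Z_n > 2) is noticeably worse for skewed Exponential(1) summands
# than for symmetric Uniform(0,1) summands.
import math
import random

random.seed(3)

def tail_error(sampler, mean, std, n=10, trials=20000):
    """|P_emp(Z_n > 2) - (1 - Phi(2))| for standardised sums of n draws."""
    phi2 = 0.5 * (1.0 - math.erf(2.0 / math.sqrt(2.0)))   # 1 - Phi(2)
    hits = sum(1 for _ in range(trials)
               if (sum(sampler() for _ in range(n)) - n * mean)
                  / (std * math.sqrt(n)) > 2.0)
    return abs(hits / trials - phi2)

err_unif = tail_error(random.random, 0.5, math.sqrt(1 / 12))
err_expo = tail_error(lambda: random.expovariate(1.0), 1.0, 1.0)
print(f"uniform summands     |tail error| = {err_unif:.4f}")
print(f"exponential summands |tail error| = {err_expo:.4f}")
```

The exponential sum (a Gamma distribution with heavy right skew at $n = 10$) misestimates the $2\sigma$ tail by far more than the symmetric uniform sum, matching the Berry--Esseen prediction that the error scales with the third absolute moment.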
Key Takeaway
The core messages of this section:
- Four modes, one hierarchy. Almost sure and mean square convergence each imply convergence in probability, which in turn implies convergence in distribution. No other implications hold in general. Choosing the right convergence mode is a modelling decision: the WLLN uses convergence in probability, the SLLN uses almost sure convergence, and the CLT uses convergence in distribution.
- The LLN: averaging works. The sample mean of iid observations converges to the true mean. This is why pilot-based channel estimation, Monte Carlo simulation, and ergodic capacity arguments are all valid: with enough samples, the average faithfully represents the expectation.
- The CLT: sums become Gaussian. The standardised sum of iid random variables converges in distribution to $\mathcal{N}(0, 1)$, regardless of the original distribution (as long as the variance is finite). This single theorem explains why:
  - Thermal noise is Gaussian (many microscopic contributions).
  - Aggregate interference in large networks is Gaussian.
  - The Rayleigh fading model arises from many scattered paths (the in-phase and quadrature components are Gaussian by the CLT, so the envelope is Rayleigh).
- The Chernoff bound: exponential tail control. For sums of iid random variables, the Chernoff bound provides exponentially decaying tail probabilities. The decay rate is governed by the KL divergence (for Bernoulli sums) or more generally by the Legendre--Fenchel transform of the log-MGF. This is the key tool for analysing error exponents in coding theory.
- Respect the limits. The CLT is asymptotic; for small $n$ or heavy-tailed distributions, the Gaussian approximation can be dangerously inaccurate. Always check whether $n$ is "large enough" for the specific distribution at hand.