Convergence Concepts and Limit Theorems
Why Convergence Concepts Matter in Telecommunications
Throughout wireless communications and information theory, we repeatedly encounter statements of the form "as the number of samples (or users, or antennas, or code length) grows, a random quantity approaches a deterministic limit." Making such statements precise requires a rigorous notion of convergence for random variables --- and it turns out that there are several inequivalent notions, each with different strengths and applications.
Two limit theorems dominate the field:
- The Law of Large Numbers (LLN) guarantees that the sample mean $\bar{X}_n$ converges to the true mean $\mu$ as $n \to \infty$. This underpins:
  - Channel estimation: averaging noisy pilot measurements to recover the true channel coefficient $h$;
  - Ergodic capacity: the time-averaged mutual information converges to the ergodic capacity over many fading realisations;
  - Monte Carlo simulation: computing bit-error rates by averaging over many independent trials.
- The Central Limit Theorem (CLT) asserts that the normalised sum of iid random variables converges in distribution to a Gaussian, regardless of the original distribution. This explains:
  - why thermal noise is well modelled as Gaussian (superposition of many independent microscopic contributions);
  - why aggregate interference in large cellular networks is approximately Gaussian;
  - why the Rayleigh fading model arises from many scattered paths (Section 2.3).
The Chernoff bound provides exponentially tight tail probabilities that are essential for analysing error exponents in coding theory and outage probabilities in fading channels.
This section formalises the four modes of convergence, establishes their hierarchy, proves the LLN and CLT, and derives the Chernoff bound --- equipping us with the limit-theorem toolkit needed for the remainder of this text.
Definition: Convergence in Probability
Convergence in Probability
A sequence of random variables $\{X_n\}_{n \ge 1}$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is said to converge in probability to a random variable $X$, written $X_n \xrightarrow{p} X$,
if for every $\epsilon > 0$,
$$\lim_{n \to \infty} P\big(|X_n - X| > \epsilon\big) = 0.$$
Equivalently, for every $\epsilon > 0$ and every $\delta > 0$, there exists $N$ such that for all $n \ge N$, $P(|X_n - X| > \epsilon) < \delta$.
Interpretation: For large $n$, the probability that $X_n$ deviates from $X$ by more than any prescribed tolerance becomes arbitrarily small. However, convergence in probability does not guarantee that $X_n(\omega) \to X(\omega)$ for every (or even almost every) sample point $\omega$.
In channel estimation, if $\hat{h}_n$ is an estimator of the channel coefficient $h$ based on $n$ pilot symbols, then $\hat{h}_n \xrightarrow{p} h$ means the estimator is consistent: with enough pilots, the estimate is arbitrarily close to the true channel with high probability.
Definition: Almost Sure Convergence
Almost Sure Convergence
A sequence $\{X_n\}$ converges almost surely (a.s.) to $X$, written $X_n \xrightarrow{\text{a.s.}} X$,
if
$$P\big(\{\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}\big) = 1.$$
That is, the set of sample points $\omega$ for which $X_n(\omega) \to X(\omega)$ as $n \to \infty$ has probability one.
Equivalent formulation (via limsup): $P\big(\limsup_{n \to \infty} |X_n - X| = 0\big) = 1$,
or equivalently, for every $\epsilon > 0$,
$$P\Big(\limsup_{n \to \infty} \{|X_n - X| > \epsilon\}\Big) = 0.$$
Comparison with convergence in probability: Almost sure convergence is pointwise convergence of $X_n(\omega)$ to $X(\omega)$ outside a null set, whereas convergence in probability only controls the probability of deviation at each fixed $n$. Almost sure convergence is strictly stronger.
The distinction matters in practice. If an adaptive equaliser's tap weights converge almost surely to the optimal Wiener solution, then on (almost) every sample path the equaliser eventually "locks on" and stays near the optimum. Mere convergence in probability would allow the equaliser to occasionally wander far from the optimum, even at late times --- though with vanishing probability.
Definition: Convergence in Distribution
Convergence in Distribution
A sequence $\{X_n\}$ converges in distribution to $X$, written $X_n \xrightarrow{d} X$,
if
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
at every point $x$ where $F_X$ is continuous. Here $F_{X_n}$ and $F_X$ denote the cumulative distribution functions of $X_n$ and $X$, respectively.
Key properties:
- Convergence in distribution is the weakest of the four convergence modes.
- It does not require and to be defined on the same probability space.
- By the Portmanteau theorem, $X_n \xrightarrow{d} X$ is equivalent to $E[g(X_n)] \to E[g(X)]$ for every bounded continuous function $g$.
- Lévy's continuity theorem: $X_n \xrightarrow{d} X$ if and only if $\varphi_{X_n}(t) \to \varphi_X(t)$ pointwise for all $t \in \mathbb{R}$, where $\varphi$ denotes the characteristic function.
The CLT is a statement about convergence in distribution: the standardised sum converges in distribution to $\mathcal{N}(0, 1)$. This does not mean the sum "becomes" Gaussian in any pathwise sense --- only that its CDF (and hence all tail probabilities) approaches the Gaussian CDF.
Definition: Convergence in Mean Square ($L^2$ Convergence)
Convergence in Mean Square ($L^2$ Convergence)
A sequence $\{X_n\}$ converges in mean square (or in $L^2$) to $X$, written $X_n \xrightarrow{L^2} X$,
if
$$\lim_{n \to \infty} E\big[|X_n - X|^2\big] = 0.$$
This requires $E[X_n^2] < \infty$ and $E[X^2] < \infty$ (i.e., $X_n$ and $X$ must be square-integrable).
Immediate consequence: If $X_n \xrightarrow{L^2} X$, then $E[X_n] \to E[X]$ and $E[X_n^2] \to E[X^2]$.
More generally, convergence in $L^p$ (i.e., $E[|X_n - X|^p] \to 0$) is defined analogously for $p \ge 1$.
In estimation theory, the mean-squared error $E[|\hat{h}_n - h|^2]$ is precisely the squared $L^2$ distance between the estimator and the true parameter. An estimator that converges in mean square is consistent in a particularly strong sense: both its bias and its variance vanish. The MMSE (minimum mean-squared error) estimator minimises this distance at every $n$.
Theorem: Hierarchy of Convergence Modes
The four convergence modes are related by the following implications:
$$X_n \xrightarrow{\text{a.s.}} X \;\Longrightarrow\; X_n \xrightarrow{p} X \;\Longrightarrow\; X_n \xrightarrow{d} X, \qquad X_n \xrightarrow{L^2} X \;\Longrightarrow\; X_n \xrightarrow{p} X.$$
In a diagram:

    a.s. ----\
              +---> in probability ---> in distribution
    L^2  ----/
No other implications hold in general:
- Convergence in probability does not imply a.s. convergence.
- Convergence in probability does not imply mean square convergence.
- Mean square convergence does not imply a.s. convergence.
- Convergence in distribution does not imply convergence in probability (unless the limit $X$ is a constant).
Special case: If $X_n \xrightarrow{d} c$ where $c$ is a deterministic constant, then $X_n \xrightarrow{p} c$.
The hierarchy reflects increasing "strength" of control over the random fluctuations of $X_n$ around $X$:
- In distribution controls only the shape of the distribution of $X_n$ (the CDF approaches the target CDF).
- In probability controls the probability of deviation at each fixed $n$ (the tail $P(|X_n - X| > \epsilon) \to 0$).
- Almost surely controls the sample paths (the sequence $X_n(\omega) \to X(\omega)$ for a.e. $\omega$).
- Mean square controls the average squared deviation ($E[|X_n - X|^2] \to 0$).
The key insight for "mean square $\Rightarrow$ in probability" is Markov's inequality: if the expected squared deviation is small, then the probability of a large deviation must also be small. The implication "a.s. $\Rightarrow$ in probability" uses the fact that pointwise convergence on a set of measure one is stronger than just having small deviation probabilities.
Use Markov's inequality to prove that mean square convergence implies convergence in probability.
Use the dominated convergence theorem for "a.s. $\Rightarrow$ in probability".
Mean square $\Rightarrow$ in probability (via Markov/Chebyshev)
By Markov's inequality applied to the nonnegative random variable $|X_n - X|^2$, for any $\epsilon > 0$:
$$P\big(|X_n - X| > \epsilon\big) = P\big(|X_n - X|^2 > \epsilon^2\big) \le \frac{E[|X_n - X|^2]}{\epsilon^2}.$$
If $E[|X_n - X|^2] \to 0$, then the right-hand side tends to zero, so $X_n \xrightarrow{p} X$.
A.s. $\Rightarrow$ in probability (sketch)
Fix $\epsilon > 0$ and define $A_n = \{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}$. Almost sure convergence means $P(\limsup_n A_n) = 0$ (by the Borel--Cantelli-type characterisation). Since $P(A_n) \le P\big(\bigcup_{m \ge n} A_m\big)$ and the latter decreases to $P(\limsup_n A_n) = 0$, we get $P(A_n) \to 0$, which is convergence in probability.
In probability $\Rightarrow$ in distribution (sketch)
Fix $x$ where $F_X$ is continuous and any $\epsilon > 0$. Then
$$F_{X_n}(x) = P(X_n \le x) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon),$$
and similarly
$$F_X(x - \epsilon) = P(X \le x - \epsilon) \le F_{X_n}(x) + P(|X_n - X| > \epsilon).$$
Taking $n \to \infty$ (so $P(|X_n - X| > \epsilon) \to 0$) and then $\epsilon \downarrow 0$ (using continuity of $F_X$ at $x$) yields $F_{X_n}(x) \to F_X(x)$.
Theorem: Weak Law of Large Numbers (WLLN)
Let $X_1, X_2, \ldots$ be independent and identically distributed (iid) random variables with finite mean $\mu$ and finite variance $\sigma^2$. Define the sample mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
Then $\bar{X}_n$ converges to $\mu$ in probability:
$$\bar{X}_n \xrightarrow{p} \mu.$$
That is, for every $\epsilon > 0$,
$$\lim_{n \to \infty} P\big(|\bar{X}_n - \mu| > \epsilon\big) = 0.$$
Averaging $n$ iid samples reduces the variance by a factor of $n$: $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality, the probability that $\bar{X}_n$ deviates from $\mu$ by more than $\epsilon$ is at most $\sigma^2/(n\epsilon^2)$, which vanishes as $n \to \infty$.
Physically: averaging many noisy channel measurements "averages out" the noise, leaving the true channel coefficient. The more pilots we transmit, the better the estimate --- this is the LLN in action.
Compute the mean and variance of $\bar{X}_n$.
Apply Chebyshev's inequality.
Step 1: Mean and variance of $\bar{X}_n$
By linearity of expectation,
$$E[\bar{X}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu.$$
Since the $X_i$ are independent with common variance $\sigma^2$,
$$\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.$$
Step 2: Apply Chebyshev's inequality
Chebyshev's inequality states that for any random variable $Y$ with finite variance and any $\epsilon > 0$,
$$P\big(|Y - E[Y]| > \epsilon\big) \le \frac{\operatorname{Var}(Y)}{\epsilon^2}.$$
Applying this with $Y = \bar{X}_n$:
$$P\big(|\bar{X}_n - \mu| > \epsilon\big) \le \frac{\sigma^2}{n\epsilon^2}.$$
Step 3: Take the limit
For any fixed $\epsilon > 0$,
$$0 \le P\big(|\bar{X}_n - \mu| > \epsilon\big) \le \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0 \quad (n \to \infty).$$
By the squeeze theorem, $P(|\bar{X}_n - \mu| > \epsilon) \to 0$, which is precisely $\bar{X}_n \xrightarrow{p} \mu$. $\blacksquare$
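The Chebyshev bound above is easy to check by simulation. The sketch below uses Uniform(0,1) samples (an assumed illustrative choice, so $\mu = 0.5$ and $\sigma^2 = 1/12$) and compares the empirical deviation probability with $\sigma^2/(n\epsilon^2)$:

```python
# Monte Carlo illustration of the WLLN: the deviation probability
# P(|Xbar_n - mu| > eps) stays below the Chebyshev bound sigma^2/(n eps^2)
# and shrinks as n grows. Uniform(0,1) samples: mu = 0.5, sigma^2 = 1/12.
import random

random.seed(0)
mu, var, eps = 0.5, 1.0 / 12.0, 0.05
trials = 2000

def deviation_prob(n):
    """Empirical P(|sample mean of n Uniform(0,1) draws - mu| > eps)."""
    count = 0
    for _ in range(trials):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - mu) > eps:
            count += 1
    return count / trials

for n in (10, 100, 1000):
    p_emp = deviation_prob(n)
    bound = min(var / (n * eps * eps), 1.0)
    print(f"n={n:5d}  empirical={p_emp:.4f}  Chebyshev bound={bound:.4f}")
```

At $n = 1000$ the deviation probability is already essentially zero, while the Chebyshev bound, though loose, decays at the guaranteed $1/n$ rate.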
Theorem: Strong Law of Large Numbers (SLLN)
Let $X_1, X_2, \ldots$ be iid random variables with finite mean $\mu = E[X_i]$. Then the sample mean converges to $\mu$ almost surely:
$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu.$$
That is,
$$P\Big(\lim_{n \to \infty} \bar{X}_n = \mu\Big) = 1.$$
Note that finite variance is not required; finite mean suffices.
The SLLN strengthens the WLLN from convergence in probability to almost sure convergence. It asserts that on (almost) every sample path $\omega$, the running average eventually settles down to $\mu$ and stays there. This is the rigorous justification for the "frequency interpretation" of probability: if we repeat an experiment indefinitely, the relative frequency of any event converges to its probability.
The proof requires more sophisticated tools from measure theory (e.g., the Borel--Cantelli lemma, truncation arguments, or Kolmogorov's inequality) and is beyond our scope here. The key point is that the SLLN provides a pathwise guarantee that the WLLN does not.
The proof is omitted (requires measure-theoretic tools beyond our scope).
The SLLN implies the WLLN by the convergence hierarchy.
Proof omitted
The proof of the SLLN under the minimal assumption of finite mean (Kolmogorov's version) relies on truncation arguments, Kolmogorov's maximal inequality, and the Borel--Cantelli lemma. These measure-theoretic tools are beyond the scope of this text.
Under the stronger assumption of a finite fourth moment ($E[X_i^4] < \infty$), a more elementary proof using the Borel--Cantelli lemma and a Chebyshev-type bound applied to $E[(\bar{X}_n - \mu)^4]$ is possible. We refer the interested reader to Billingsley (1995), Chapter 22.
What we can verify: Since a.s. convergence implies convergence in probability (the hierarchy of convergence modes above), the SLLN is indeed a stronger result than the WLLN.
Theorem: Central Limit Theorem (CLT)
Let $X_1, X_2, \ldots$ be iid random variables with finite mean $\mu$ and finite variance $\sigma^2 > 0$. Define the standardised partial sum
$$Z_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu) = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}.$$
Then $Z_n$ converges in distribution to a standard normal random variable:
$$Z_n \xrightarrow{d} \mathcal{N}(0, 1).$$
Equivalently, for every $z \in \mathbb{R}$,
$$\lim_{n \to \infty} P(Z_n \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.$$
The CLT is arguably the most important theorem in probability. It says that the sum of many independent, identically distributed random variables --- regardless of their individual distribution --- is approximately Gaussian after proper centering and scaling. The only requirements are a finite mean and a finite, positive variance.
The proof strategy via characteristic functions is elegant: show that the characteristic function of $Z_n$ converges pointwise to $e^{-t^2/2}$, which is the characteristic function of $\mathcal{N}(0, 1)$. By Lévy's continuity theorem, pointwise convergence of characteristic functions implies convergence in distribution.
Compute the characteristic function of $Z_n$ in terms of the common characteristic function of the summands.
Use the Taylor expansion of the characteristic function near the origin.
Apply the limit $(1 + a_n/n)^n \to e^{a}$ whenever $a_n \to a$.
Step 1: Standardise and express the CF of $Z_n$
Without loss of generality, assume $\mu = 0$ (otherwise replace $X_i$ by $X_i - \mu$). Let $Y_i = X_i/\sigma$, so $E[Y_i] = 0$, $\operatorname{Var}(Y_i) = 1$, and $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i$.
The characteristic function of $Z_n$ is
$$\varphi_{Z_n}(t) = E\big[e^{itZ_n}\big] = E\Big[\exp\Big(\frac{it}{\sqrt{n}} \sum_{i=1}^{n} Y_i\Big)\Big].$$
Since the $Y_i$ are iid,
$$\varphi_{Z_n}(t) = \Big[\varphi_Y\Big(\frac{t}{\sqrt{n}}\Big)\Big]^n,$$
where $\varphi_Y$ is the common CF of each $Y_i$.
Step 2: Taylor-expand the characteristic function
Since $E[Y_i] = 0$ and $E[Y_i^2] = 1$, the Taylor expansion of $\varphi_Y$ around $t = 0$ gives
$$\varphi_Y(t) = 1 - \frac{t^2}{2} + o(t^2) \quad (t \to 0).$$
Substituting $t/\sqrt{n}$:
$$\varphi_Y\Big(\frac{t}{\sqrt{n}}\Big) = 1 - \frac{t^2}{2n} + o\Big(\frac{1}{n}\Big).$$
Step 3: Take the $n$-th power and the limit
Therefore
$$\varphi_{Z_n}(t) = \Big[1 - \frac{t^2}{2n} + o\Big(\frac{1}{n}\Big)\Big]^n.$$
Using the fundamental limit $(1 + a_n/n)^n \to e^{a}$ whenever $a_n \to a$, with $a_n = -t^2/2 + n \cdot o(1/n) \to -t^2/2$:
$$\lim_{n \to \infty} \varphi_{Z_n}(t) = e^{-t^2/2} \quad \text{for every } t \in \mathbb{R}.$$
Step 4: Identify the limit and conclude
The function $e^{-t^2/2}$ is the characteristic function of $\mathcal{N}(0, 1)$. By Lévy's continuity theorem, pointwise convergence of characteristic functions to a function that is continuous at $t = 0$ (and $e^{-t^2/2}$ is continuous everywhere) implies convergence in distribution:
$$Z_n \xrightarrow{d} \mathcal{N}(0, 1). \qquad \blacksquare$$
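The convergence can also be watched numerically. The sketch below (using Uniform(0,1) summands as an assumed illustrative distribution) compares the empirical CDF of $Z_n$ at a point with $\Phi$, which is available in the standard library via `math.erf`:

```python
# Monte Carlo check of the CLT: the CDF of the standardised sum
# Z_n = (S_n - n*mu)/(sigma*sqrt(n)) of iid Uniform(0,1) variables
# approaches the standard normal CDF Phi as n grows.
import math
import random

random.seed(1)
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_cdf_gap(n, z=1.0, trials=5000):
    """|P_emp(Z_n <= z) - Phi(z)| for sums of n Uniform(0,1) draws."""
    hits = 0
    for _ in range(trials):
        s = sum(random.random() for _ in range(n))
        zn = (s - n * mu) / (sigma * math.sqrt(n))
        if zn <= z:
            hits += 1
    return abs(hits / trials - Phi(z))

for n in (2, 30):
    print(f"n={n:3d}  |F_emp(1) - Phi(1)| = {empirical_cdf_gap(n):.4f}")
```

Because the uniform distribution is symmetric and light-tailed, the gap is small even for tiny $n$; skewed summands would converge more slowly (see the discussion of the Berry--Esseen theorem later in this section).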
Interactive demo: Central Limit Theorem in Action (adjustable parameters).
Theorem: Chernoff Bound
Let $X$ be a random variable with moment-generating function $M_X(s) = E[e^{sX}]$, finite in a neighbourhood of the origin. Then for any $a \in \mathbb{R}$,
$$P(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s).$$
Similarly, for the left tail,
$$P(X \le a) \le \inf_{s < 0} e^{-sa} M_X(s).$$
The bound is obtained by optimising the free parameter $s$ to get the tightest possible exponential bound.
The Chernoff bound starts from Markov's inequality applied to the exponential function $e^{sX}$ (which is nonnegative, with $x \mapsto e^{sx}$ monotone increasing for $s > 0$). The extra parameter $s$ is then optimised to yield the tightest bound. Because the bound has the exponential form $e^{-sa} M_X(s)$, it typically decays exponentially in the "excess" $a - E[X]$, making it far sharper than Chebyshev's inequality for large deviations.
The Chernoff bound is the foundation of large deviation theory and is directly related to the error exponent in channel coding: the probability of decoding error decays exponentially with block length $n$, and the exponent is characterised via a Chernoff-type optimisation.
Start with the identity $\{X \ge a\} = \{e^{sX} \ge e^{sa}\}$ for $s > 0$.
Apply Markov's inequality to $e^{sX}$.
Optimise over $s > 0$.
Step 1: Exponentiate and apply Markov
For any $s > 0$, the function $x \mapsto e^{sx}$ is strictly increasing, so
$$P(X \ge a) = P\big(e^{sX} \ge e^{sa}\big).$$
By Markov's inequality (for the nonnegative random variable $e^{sX}$),
$$P\big(e^{sX} \ge e^{sa}\big) \le \frac{E[e^{sX}]}{e^{sa}} = e^{-sa} M_X(s).$$
Step 2: Optimise over $s$
The inequality holds for every $s > 0$. To obtain the tightest bound, we minimise the right-hand side over $s$:
$$P(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s).$$
The optimal $s^*$ satisfies the first-order condition
$$\frac{d}{ds}\Big[\ln M_X(s) - sa\Big]\Big|_{s = s^*} = 0,$$
or equivalently $M_X'(s^*)/M_X(s^*) = a$, which means $a$ equals the tilted mean under the exponentially twisted distribution.
Connection to rate function
Defining the log-moment generating function (cumulant generating function)
$$\psi_X(s) = \ln M_X(s) = \ln E[e^{sX}],$$
and its Legendre--Fenchel transform (the rate function)
$$I(a) = \sup_{s > 0} \big[sa - \psi_X(s)\big],$$
the Chernoff bound takes the compact form
$$P(X \ge a) \le e^{-I(a)}.$$
The rate function $I(\cdot)$ governs the exponential decay of tail probabilities and is the central object in Cramér's theorem (the foundation of large deviations theory).
Example: Chernoff Bound for the Sum of iid Bernoulli Random Variables
Let $X_1, \ldots, X_n$ be iid random variables with $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$, and let $S_n = \sum_{i=1}^{n} X_i$. Use the Chernoff bound to show that for any $q$ with $p < q < 1$ (i.e., above the mean),
$$P(S_n \ge nq) \le e^{-n D(q \,\|\, p)},$$
where
$$D(q \,\|\, p) = q \ln\frac{q}{p} + (1 - q) \ln\frac{1 - q}{1 - p}$$
is the Kullback--Leibler divergence (relative entropy) between $\mathrm{Bernoulli}(q)$ and $\mathrm{Bernoulli}(p)$.
Evaluate the bound numerically for concrete values of $n$, $p$, and $q$.
Step 1: MGF of a single Bernoulli
For $s \in \mathbb{R}$,
$$M_X(s) = E[e^{sX}] = (1 - p) + p e^{s}.$$
Step 2: MGF of the sum $S_n$
Since the $X_i$ are iid,
$$M_{S_n}(s) = \big[(1 - p) + p e^{s}\big]^n.$$
Step 3: Apply the Chernoff bound
By the Chernoff bound,
$$P(S_n \ge nq) \le e^{-snq} M_{S_n}(s) = \Big[e^{-sq}\big((1 - p) + p e^{s}\big)\Big]^n \quad \text{for every } s > 0.$$
Step 4: Optimise over $s$
Let $q$ be the "target fraction" of ones. Set the derivative of the per-sample exponent $-sq + \ln\big((1 - p) + p e^{s}\big)$ with respect to $s$ to zero:
$$\frac{p e^{s}}{(1 - p) + p e^{s}} = q.$$
Solving: $p e^{s}(1 - q) = q(1 - p)$, which gives
$$s^* = \ln\frac{q(1 - p)}{p(1 - q)}.$$
This is positive when $q > p$, confirming the regime where the upper-tail bound is nontrivial.
Step 5: Substitute and simplify
Substituting $s^*$ back:
$$P(S_n \ge nq) \le \exp\Big(-n\Big[q \ln\frac{q}{p} + (1 - q) \ln\frac{1 - q}{1 - p}\Big]\Big) = e^{-n D(q \,\|\, p)}.$$
The exponent is $n$ times the KL divergence between $\mathrm{Bernoulli}(q)$ and $\mathrm{Bernoulli}(p)$, showing that the tail probability decays exponentially in $n$.
Step 6: Numerical evaluation
For concrete parameter values, compute $D(q \,\|\, p)$ directly and compare $e^{-n D(q \| p)}$ with the exact binomial tail
$$P(S_n \ge nq) = \sum_{k \ge nq} \binom{n}{k} p^k (1 - p)^{n - k}.$$
The Chernoff bound typically overestimates the exact tail only by a modest constant factor --- remarkably tight for such a simple bound.
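As a concrete check, the sketch below evaluates both sides; the values $n = 100$, $p = 0.5$, $q = 0.6$ are illustrative choices (assumptions for this demo), and the exact tail is computed from the binomial pmf with `math.comb`:

```python
# Numerical check of the Bernoulli Chernoff bound: compare
# exp(-n * D(q||p)) with the exact binomial tail P(S_n >= nq)
# for the illustrative choice n = 100, p = 0.5, q = 0.6.
import math

def kl_bernoulli(q, p):
    """KL divergence D(q||p) between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def binom_tail(n, p, k):
    """Exact P(S_n >= k) for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

n, p, q = 100, 0.5, 0.6
bound = math.exp(-n * kl_bernoulli(q, p))
exact = binom_tail(n, p, 60)
print(f"Chernoff bound : {bound:.4f}")   # ~ 0.1335
print(f"Exact tail     : {exact:.4f}")   # ~ 0.0284
print(f"Ratio          : {bound / exact:.1f}")
```

For these values the bound exceeds the exact tail by roughly a factor of five, while both decay at the same exponential rate $e^{-n D(q\|p)}$ as $n$ grows.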
Why This Matters: CLT Justifies the Gaussian Interference Model
In a large cellular network with $K$ co-channel interferers, the aggregate interference at a receiver is
$$I = \sum_{k=1}^{K} \sqrt{P_k}\, h_k s_k,$$
where $P_k$ is the received power from the $k$-th interferer, $h_k$ is its fading coefficient, and $s_k$ is its data symbol. The terms are (approximately) independent and identically distributed.
When $K$ is large, the Central Limit Theorem guarantees that $I$ is approximately Gaussian, regardless of the distributions of $h_k$ and $s_k$. This justifies the widespread modelling assumption:
$$I \sim \mathcal{CN}(0, \sigma_I^2),$$
where $\sigma_I^2 = \sum_{k=1}^{K} P_k\, E[|h_k s_k|^2]$.
Practical implications:
- SINR analysis: Under the Gaussian interference assumption, the signal-to-interference-plus-noise ratio (SINR) fully determines the achievable rate via $\log_2(1 + \mathrm{SINR})$, just as for AWGN channels.
- Stochastic geometry: In Poisson cellular network models (e.g., the PPP model of Andrews et al., 2011), the aggregate interference from infinitely many base stations is analysed using the CLT and its refinements.
- Massive MIMO: When a base station with $M$ antennas serves $K$ users, the effective interference after matched filtering involves sums of $M$ iid terms. As $M \to \infty$, these sums "harden" (by the LLN) and their fluctuations are Gaussian (by the CLT), leading to channel hardening and favourable propagation --- the two pillars of massive MIMO theory.
See full treatment in Chapter 4, Section 5
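The channel-hardening effect mentioned above can be sketched numerically. The model below is a deliberately minimal assumption (iid circularly-symmetric complex-Gaussian channel entries with unit average power), not a full massive MIMO analysis: the normalised gain $\|h\|^2/M$ concentrates around $1$ (LLN) with fluctuations of order $1/\sqrt{M}$ (CLT).

```python
# Minimal channel-hardening sketch (assumed model): with M antennas and
# iid CN(0,1)-like entries, |h_i|^2 ~ Exp(1), so the normalised gain
# ||h||^2 / M has mean 1 and standard deviation 1/sqrt(M).
import math
import random
import statistics

random.seed(2)

def normalised_gain(M):
    """||h||^2 / M for h with iid real/imag N(0, 1/2) entries."""
    return sum(random.gauss(0, math.sqrt(0.5))**2 +
               random.gauss(0, math.sqrt(0.5))**2 for _ in range(M)) / M

stds = {}
for M in (4, 64, 1024):
    gains = [normalised_gain(M) for _ in range(500)]
    stds[M] = statistics.stdev(gains)
    print(f"M={M:5d}  mean={statistics.mean(gains):.3f}  "
          f"std={stds[M]:.4f}  (1/sqrt(M)={1 / math.sqrt(M):.4f})")
```

The printed standard deviations track $1/\sqrt{M}$ closely: quadrupling the antenna count halves the gain fluctuation, which is exactly the "hardening" exploited in massive MIMO link design.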
Historical Note: De Moivre, Laplace, and Lyapunov: The Long Road to the CLT
The Central Limit Theorem has one of the longest gestation periods in the history of mathematics, spanning nearly two centuries:
- Abraham de Moivre (1733) proved the earliest version: the binomial distribution $\mathrm{Bin}(n, \tfrac{1}{2})$, properly standardised, converges to what we now call the normal distribution. He published this in The Doctrine of Chances as a computational tool for approximating binomial probabilities. De Moivre did not have the concept of a "distribution" --- he worked directly with the ratio of the middle binomial coefficient to $2^n$.
- Pierre-Simon Laplace (1812) extended de Moivre's result to arbitrary $p$ in his monumental Théorie analytique des probabilités, obtaining what is now called the de Moivre--Laplace theorem. Laplace also recognised the broader principle: sums of "errors" tend to follow the "law of errors" (the Gaussian).
- Pafnuty Chebyshev (1867) and his student Andrey Markov (1900) proved versions of the CLT under increasingly general conditions, using the method of moments.
- Aleksandr Lyapunov (1901) gave the first rigorous proof of the CLT for independent (not necessarily identically distributed) random variables using characteristic functions, under a condition now known as the Lyapunov condition. This is essentially the modern proof strategy.
- Jarl Waldemar Lindeberg (1922) and William Feller (1935) established the Lindeberg--Feller theorem, giving necessary and sufficient conditions for the CLT to hold for independent (non-identically distributed) summands.
The CLT that we prove in this section (for iid summands with finite variance) is the simplest and most commonly used version. The general Lindeberg--Feller form is needed in wireless when interferers have unequal powers or different fading statistics.
Quick Check
Consider the following statement: "If $X_n \xrightarrow{p} X$, then $X_n \xrightarrow{\text{a.s.}} X$." Is this statement true or false?
True
False
Convergence in probability does not imply almost sure convergence in general. The hierarchy is a.s. $\Rightarrow$ in probability $\Rightarrow$ in distribution, and this chain of implications is strict.
Counterexample: Let $\Omega = [0, 1]$ with Lebesgue measure. Define $X_n$ to be the indicator of a "sliding block" of width $1/k$ that cycles through $[0, 1]$, where $k$ increases as $n$ grows. Then $X_n \xrightarrow{p} 0$ (convergence in probability to $0$), but for every $\omega$, $X_n(\omega) = 1$ infinitely often, so $X_n(\omega) \not\to 0$ for any $\omega$. Almost sure convergence fails completely.
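The sliding-block construction can be verified deterministically. The sketch below enumerates the blocks level by level (one possible concrete ordering of the counterexample): the block widths, which equal $P(X_n = 1)$, shrink to zero, yet any fixed $\omega$ is covered exactly once per level, hence infinitely often.

```python
# Deterministic check of the sliding-block counterexample: the n-th
# indicator has width 1/k (so P(X_n = 1) -> 0), yet every point of
# [0, 1) is covered exactly once at each level k.
from fractions import Fraction

def blocks(max_level):
    """Yield (k, left, right) for the sliding blocks of levels 1..max_level."""
    for k in range(1, max_level + 1):
        for j in range(k):
            yield k, Fraction(j, k), Fraction(j + 1, k)

omega = Fraction(1, 3)          # an arbitrary fixed sample point in [0, 1)
widths, hits = [], 0
for k, left, right in blocks(50):
    widths.append(Fraction(1, k))   # P(X_n = 1) for this n
    if left <= omega < right:
        hits += 1                   # X_n(omega) = 1

print("last block width P(X_n=1):", float(widths[-1]))   # 1/50 = 0.02
print("times X_n(omega) = 1 over 50 levels:", hits)      # 50: once per level
```

The deviation probability at step $n$ is the current block width and tends to $0$, while `hits` grows without bound: $X_n(\omega) = 1$ infinitely often for this (and every) $\omega$, so $X_n(\omega)$ never converges.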
Quick Check
Let $X_1, \ldots, X_n$ be iid with mean $\mu$ and variance $\sigma^2$. For large $n$, what is the approximate distribution of the sample mean $\bar{X}_n$ according to the CLT?
By the CLT, $\bar{X}_n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$ for large $n$: the fluctuations of $\bar{X}_n$ around $\mu$ shrink at rate $1/\sqrt{n}$.
The standard deviation of $\bar{X}_n$ is $\sigma/\sqrt{n}$.
Quick Check
The Chernoff bound is obtained by applying Markov's inequality to which random variable?
$X$ directly
$(X - E[X])^2$ (the Chebyshev approach)
$e^{sX}$ for an optimised $s > 0$
The Chernoff bound exploits the monotonicity of the exponential: $\{X \ge a\} = \{e^{sX} \ge e^{sa}\}$ for $s > 0$. Applying Markov's inequality to the nonnegative random variable $e^{sX}$ gives $P(X \ge a) \le e^{-sa} E[e^{sX}]$. Optimising over $s > 0$ yields the tightest exponential bound.
This is strictly tighter than Chebyshev's inequality (which applies Markov to $(X - E[X])^2$) because the exponential tilting provides an additional degree of freedom.
Convergence in Probability
A sequence converges in probability to $X$ ($X_n \xrightarrow{p} X$) if $P(|X_n - X| > \epsilon) \to 0$ for every $\epsilon > 0$. This is weaker than almost sure convergence but stronger than convergence in distribution. It is the mode of convergence established by the Weak Law of Large Numbers.
Related: Convergence in Probability, Weak Law of Large Numbers (WLLN), Hierarchy of Convergence Modes
Almost Sure Convergence
A sequence converges almost surely to $X$ ($X_n \xrightarrow{\text{a.s.}} X$) if $P(\lim_{n \to \infty} X_n = X) = 1$. This is pathwise convergence outside a null set and is strictly stronger than convergence in probability. It is the mode of convergence established by the Strong Law of Large Numbers.
Related: Almost Sure Convergence, Strong Law of Large Numbers (SLLN), Hierarchy of Convergence Modes
Convergence in Distribution
A sequence converges in distribution to $X$ ($X_n \xrightarrow{d} X$) if $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$. This is the weakest convergence mode and does not require the random variables to be defined on the same probability space. It is the mode of convergence in the Central Limit Theorem.
Related: Convergence in Distribution, Central Limit Theorem (CLT), Hierarchy of Convergence Modes
Law of Large Numbers (LLN)
The Law of Large Numbers states that the sample mean $\bar{X}_n$ of iid random variables converges to the population mean $\mu$. The Weak Law (WLLN) gives convergence in probability; the Strong Law (SLLN) gives almost sure convergence. The WLLN (as proved here via Chebyshev) requires finite variance; the SLLN requires only finite mean.
Related: Weak Law of Large Numbers (WLLN), Strong Law of Large Numbers (SLLN), Convergence in Probability, Almost Sure Convergence
Central Limit Theorem (CLT)
For iid random variables with mean $\mu$ and finite variance $\sigma^2 > 0$, the standardised sum $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges in distribution to $\mathcal{N}(0, 1)$. This is the fundamental reason why the Gaussian distribution appears ubiquitously in communications: thermal noise, aggregate interference, and fading envelopes all arise from summing many independent contributions.
Related: Central Limit Theorem (CLT), Central Limit Theorem in Action, CLT Justifies the Gaussian Interference Model, De Moivre, Laplace, and Lyapunov: The Long Road to the CLT
Chernoff Bound
An exponential tail bound: $P(X \ge a) \le \inf_{s > 0} e^{-sa} M_X(s)$, obtained by applying Markov's inequality to $e^{sX}$ and optimising over the tilting parameter $s$. The Chernoff bound provides exponentially tight estimates and is the basis for error exponent analysis in coding theory and large deviations theory.
Related: Chernoff Bound, Chernoff Bound for the Sum of iid Bernoulli Random Variables
Common Mistake: Using the CLT for Small Sample Sizes
Mistake:
"The CLT says the sum is Gaussian, so even for or samples the Gaussian approximation should be accurate."
Correction:
The CLT is an asymptotic result: it guarantees convergence in distribution as $n \to \infty$, but says nothing about the rate of convergence or the accuracy at any finite $n$. The quality of the Gaussian approximation at finite $n$ depends critically on the shape of the underlying distribution:
- Symmetric, light-tailed distributions (e.g., uniform, symmetric triangular): the approximation is excellent even for small $n$.
- Skewed distributions (e.g., exponential, chi-squared with few degrees of freedom): the Gaussian approximation can be poor for moderate $n$. The skewness of the sum decreases only as $O(1/\sqrt{n})$ (by the Berry--Esseen theorem, the CDF error is bounded by $C\rho/(\sigma^3\sqrt{n})$, where $\rho = E[|X - \mu|^3]$), so highly skewed distributions need larger $n$.
- Heavy-tailed distributions (e.g., Pareto with infinite variance): the CLT does not apply at all, and the sum converges to a stable distribution instead of a Gaussian.
In wireless: The Gaussian interference model (justified by the CLT) is reliable in dense networks with many interferers. For sparse networks with a few dominant interferers, the Gaussian assumption can significantly underestimate the tail of the interference distribution (and hence underestimate outage probability). In such cases, the exact interference distribution or a more refined approximation (e.g., using the Gamma distribution) should be used.
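The dependence of the convergence rate on skewness can be illustrated numerically. The sketch below (Uniform and Exponential summands are assumed illustrative choices) compares the error of the Gaussian approximation to the upper tail $P(Z_n > 2)$ at $n = 10$:

```python
# Monte Carlo illustration: at n = 10 the Gaussian approximation to
# P(Z_n > 2) is noticeably worse for skewed Exponential(1) summands
# than for symmetric Uniform(0,1) summands.
import math
import random

random.seed(3)

def tail_error(sampler, mean, std, n=10, trials=20000):
    """|P_emp(Z_n > 2) - (1 - Phi(2))| for standardised sums of n draws."""
    phi2 = 0.5 * (1.0 - math.erf(2.0 / math.sqrt(2.0)))   # 1 - Phi(2)
    hits = sum(1 for _ in range(trials)
               if (sum(sampler() for _ in range(n)) - n * mean)
                  / (std * math.sqrt(n)) > 2.0)
    return abs(hits / trials - phi2)

err_unif = tail_error(random.random, 0.5, math.sqrt(1 / 12))
err_expo = tail_error(lambda: random.expovariate(1.0), 1.0, 1.0)
print(f"uniform summands     |tail error| = {err_unif:.4f}")
print(f"exponential summands |tail error| = {err_expo:.4f}")
```

The exponential sum (a Gamma distribution with heavy right skew at $n = 10$) misestimates the $2\sigma$ tail by far more than the symmetric uniform sum, matching the Berry--Esseen prediction that the error scales with the third absolute moment.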
Key Takeaway
The core messages of this section:
- Four modes, one hierarchy. Almost sure and mean square convergence each imply convergence in probability, which in turn implies convergence in distribution. No other implications hold in general. Choosing the right convergence mode is a modelling decision: the WLLN uses convergence in probability, the SLLN uses almost sure convergence, and the CLT uses convergence in distribution.
- The LLN: averaging works. The sample mean of iid observations converges to the true mean. This is why pilot-based channel estimation, Monte Carlo simulation, and ergodic capacity arguments are all valid: with enough samples, the average faithfully represents the expectation.
- The CLT: sums become Gaussian. The standardised sum of iid random variables converges in distribution to $\mathcal{N}(0, 1)$, regardless of the original distribution (as long as the variance is finite). This single theorem explains why:
  - Thermal noise is Gaussian (many microscopic contributions).
  - Aggregate interference in large networks is Gaussian.
  - The Rayleigh fading model arises from many scattered paths (the in-phase and quadrature components are Gaussian by the CLT, so the envelope is Rayleigh).
- The Chernoff bound: exponential tail control. For sums of iid random variables, the Chernoff bound provides exponentially decaying tail probabilities. The decay rate is governed by the KL divergence (for Bernoulli sums) or more generally by the Legendre--Fenchel transform of the log-MGF. This is the key tool for analysing error exponents in coding theory.
- Respect the limits. The CLT is asymptotic; for small $n$ or heavy-tailed distributions, the Gaussian approximation can be dangerously inaccurate. Always check whether $n$ is "large enough" for the specific distribution at hand.