Applications to Error Probability Analysis
Large Deviations Meet Hypothesis Testing
The machinery developed in this chapter (rate functions, KL divergence, sub-Gaussian concentration) finds its most natural application in the analysis of error probabilities. In hypothesis testing, the probability of confusing two distributions decays exponentially with the number of observations, and the exponent is governed by KL divergence. Stein's lemma makes this precise: the best achievable exponent for Type II error (while keeping Type I error bounded) is exactly $D(P_0 \| P_1)$. This connects large deviations to the Neyman-Pearson framework from detection theory (Book FSI, Chapter 2) and to the error exponents of channel coding (Book ITA, Chapter 4).
Definition: Error Exponent
Error Exponent
Consider a sequence of decision problems indexed by $n$ (the number of observations or the block length), with error probability $P_e(n)$. The error exponent is $E = \lim_{n\to\infty} -\frac{1}{n}\log P_e(n)$ when the limit exists. It measures the exponential rate at which the error probability decays: $P_e(n) \approx e^{-nE}$.
A positive error exponent means exponentially reliable communication or detection. The larger $E$, the faster errors vanish. Error exponents are the bridge between information-theoretic limits (which say whether reliable communication is possible) and practical performance (which asks how many observations are needed).
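As a quick numerical illustration of this definition, the sketch below reads the exponent off as the slope of $\log P_e(n)$ versus $n$. The error probabilities are synthetic placeholders (generated from an assumed exponent of 0.2), standing in for values measured from a real detector or decoder.

```python
import numpy as np

# Synthetic error probabilities P_e(n) = 0.5 * exp(-0.2 n); placeholders, not from the text.
n = np.arange(10, 201, 10)
p_err = 0.5 * np.exp(-0.2 * n)

# Fit a line to log P_e(n); the negated slope estimates the error exponent E.
slope, intercept = np.polyfit(n, np.log(p_err), 1)
print(f"estimated exponent E ~ {-slope:.3f}")   # ~ 0.200 by construction
```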
Theorem: Stein's Lemma
Let $X_1, \dots, X_n$ be i.i.d. with distribution either $P_0$ (hypothesis $H_0$) or $P_1$ (hypothesis $H_1$), where $P_0$ and $P_1$ are distributions on a finite alphabet $\mathcal{X}$. For a sequence of tests with Type I error $\alpha_n$ and Type II error $\beta_n$:
If $\alpha_n \le \epsilon$ for some fixed $\epsilon \in (0, 1)$ and all $n$, then $$\lim_{n\to\infty} -\frac{1}{n}\log \beta_n^* = D(P_0 \| P_1),$$ where $\beta_n^*$ is the minimum Type II error over all tests satisfying the Type I constraint.
Under $H_0$, the log-likelihood ratio concentrates around $D(P_0 \| P_1)$ by the law of large numbers. Under $H_1$, it concentrates around $-D(P_1 \| P_0)$. The Neyman-Pearson test thresholds this statistic, and the probability of the LLR falling on the wrong side decays at rate $D(P_0 \| P_1)$ under $H_1$; this is exactly Cramér's theorem applied to the LLR.
The log-likelihood ratio per sample is $L_n = \frac{1}{n}\sum_{i=1}^n \log \frac{P_0(X_i)}{P_1(X_i)}$.
Under $H_1$, this has mean $-D(P_1 \| P_0)$. Stein's lemma concerns the rate of the event that it exceeds the threshold.
Use Sanov's theorem: under $H_1$, the test erroneously accepts $H_0$ when the empirical distribution is close to $P_0$.
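The concentration of the per-sample LLR around $D(P_0 \| P_1)$ under $H_0$ and around $-D(P_1 \| P_0)$ under $H_1$ can be checked directly by simulation. The sketch below uses a hypothetical pair of Bernoulli hypotheses (the parameters 0.5 and 0.8 are assumptions for illustration, not values from the text).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hypotheses (assumed for illustration): P0 = Bern(0.5), P1 = Bern(0.8).
p0, p1 = 0.5, 0.8

def kl_bern(p, q):
    """KL divergence D(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def llr_per_sample(x, p0, p1):
    """(1/n) * sum_i log P0(x_i)/P1(x_i) for a batch of Bernoulli sequences."""
    log_ratio = np.where(x == 1, np.log(p0 / p1), np.log((1 - p0) / (1 - p1)))
    return log_ratio.mean(axis=-1)

n, trials = 200, 20000
x0 = (rng.random((trials, n)) < p0).astype(int)   # sequences drawn under H0
x1 = (rng.random((trials, n)) < p1).astype(int)   # sequences drawn under H1

print("mean LLR under H0:", llr_per_sample(x0, p0, p1).mean(),
      " vs  D(P0||P1) =", kl_bern(p0, p1))
print("mean LLR under H1:", llr_per_sample(x1, p0, p1).mean(),
      " vs -D(P1||P0) =", -kl_bern(p1, p0))
```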
Achievability (Type II exponent $\geq D(P_0 \| P_1)$)
Consider the likelihood ratio test that accepts $H_0$ when $L_n > D(P_0 \| P_1) - \delta$, for a small $\delta > 0$. By the WLLN, under $H_0$: $L_n \to D(P_0 \| P_1)$ in probability, so $\alpha_n \to 0$. Under $H_1$: by Cramér's theorem applied to the summands $\log\frac{P_0(X_i)}{P_1(X_i)}$ (which have mean $-D(P_1 \| P_0)$ under $H_1$): $$\beta_n = \Pr_{H_1}\!\left(L_n > D(P_0 \| P_1) - \delta\right) \le e^{-n\, I_1(D(P_0 \| P_1) - \delta)},$$ where $I_1$ is the rate function of the LLR under $H_1$. As $\delta \to 0$, the exponent approaches $D(P_0 \| P_1)$ (by properties of the rate function of the LLR under $H_1$).
Converse (Type II exponent $\leq D(P_0 \| P_1)$)
For any test with $\alpha_n \le \epsilon$, define the acceptance region $A_n \subseteq \mathcal{X}^n$ (the sequences for which the test accepts $H_0$). By the Neyman-Pearson lemma, the optimal $A_n$ consists of sequences with high likelihood ratio. Using Sanov's theorem (equivalently, restricting attention to sequences whose LLR is within $\delta$ of its typical value $D(P_0 \| P_1)$ under $P_0$): $$\beta_n = P_1(A_n) \ge e^{-n(D(P_0 \| P_1) + \delta)}\,\big(P_0(A_n) - o(1)\big) \ge e^{-n(D(P_0 \| P_1) + \delta)}\,(1 - \epsilon - o(1)).$$ Therefore the Type II exponent cannot exceed $D(P_0 \| P_1)$.
Example: Stein's Lemma for Binary Hypothesis Testing
A communication system sends i.i.d. binary symbols. Either $X_i \sim \mathrm{Bern}(p_0)$ (no interference) or $X_i \sim \mathrm{Bern}(p_1)$ (interference present). Using $n$ observations, what is the optimal Type II error exponent when the Type I error is constrained to at most $\epsilon$?
Apply Stein's lemma
The optimal exponent is $D(P_0 \| P_1)$, independent of the Type I constraint (as long as $0 < \epsilon < 1$).
Compute KL divergence
$$D(P_0 \| P_1) = p_0 \log\frac{p_0}{p_1} + (1 - p_0)\log\frac{1 - p_0}{1 - p_1}.$$
Interpret
The Type II error decays as $\beta_n^* \approx e^{-n D(P_0 \| P_1)}$. For $n$ observations, $\log \beta_n^* \approx -n\, D(P_0 \| P_1)$. Even moderate sample sizes give very small Type II error when the hypotheses are well-separated.
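A minimal sketch of this computation, assuming placeholder Bernoulli parameters (the example's original numbers are not reproduced here): it evaluates $D(P_0 \| P_1)$, the predicted Type II error $e^{-nD}$ for a few sample sizes, and the number of observations needed for a target error level.

```python
import numpy as np

def kl_bern(p, q):
    """D(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Placeholder hypotheses (assumed for illustration):
p0, p1 = 0.5, 0.8          # P0: no interference, P1: interference present
D = kl_bern(p0, p1)
print(f"D(P0||P1) = {D:.4f} nats")

# Stein's lemma: the optimal Type II error decays like exp(-n * D).
for n in (50, 100, 200):
    print(f"n = {n:4d}:  beta* ~ exp(-n D) = {np.exp(-n * D):.2e}")

# Observations needed for a target Type II error (to first order in the exponent):
beta_target = 1e-6
print("n needed for beta <= 1e-6:", int(np.ceil(np.log(1 / beta_target) / D)))
```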
Stein's Lemma: Type II Error Exponent
Visualize the exponential decay of the optimal Type II error probability as a function of $n$, for two Bernoulli hypotheses. The slope on the log scale is $-D(P_0 \| P_1)$.
Connection to the Neyman-Pearson Lemma
The Neyman-Pearson lemma (Book FSI, Chapter 2) tells us the structure of the optimal test: threshold the likelihood ratio. Stein's lemma tells us the performance: the best achievable error exponent. Together, they give a complete picture. The Neyman-Pearson test with threshold $\tau$ achieves $\alpha_n = \Pr_{H_0}(L_n \le \tau)$ and $\beta_n = \Pr_{H_1}(L_n > \tau)$. By large deviations, $\beta_n \approx e^{-n D(P_0 \| P_1)}$ when $\alpha_n$ is held fixed. This is not just a theoretical curiosity: it tells practitioners exactly how many samples are needed to achieve a target detection reliability.
Theorem: Chernoff Information (Symmetric Error Exponent)
For the Bayesian hypothesis testing problem with equal priors and minimum total error probability $P_e^{(n)}$: $$\lim_{n\to\infty} -\frac{1}{n}\log P_e^{(n)} = C(P_0, P_1),$$ where the Chernoff information is $$C(P_0, P_1) = -\min_{0 \le \lambda \le 1} \log \sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}.$$
When both types of error are penalized equally, the optimal exponent is the Chernoff information, a symmetric measure of "distance" between $P_0$ and $P_1$. It satisfies $C(P_0, P_1) \le \min\{D(P_0 \| P_1), D(P_1 \| P_0)\}$, and equals the Bhattacharyya distance $-\log\sum_x \sqrt{P_0(x)P_1(x)}$ when the optimizing $\lambda^* = 1/2$.
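For distributions that are less analytically convenient, the Chernoff information can be computed numerically by minimizing $\log\sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}$ over $\lambda \in [0, 1]$. The sketch below uses a simple grid search and two placeholder distributions on a three-letter alphabet (both are assumptions for illustration, and strictly positive probabilities are assumed).

```python
import numpy as np

def chernoff_information(P0, P1, grid=10001):
    """C(P0, P1) = -min_{0<=lam<=1} log sum_x P0(x)^lam * P1(x)^(1-lam),
    via a dense grid search over lambda (assumes P0, P1 > 0 everywhere)."""
    P0, P1 = np.asarray(P0, float), np.asarray(P1, float)
    lams = np.linspace(0.0, 1.0, grid)
    log_terms = lams[:, None] * np.log(P0) + (1 - lams[:, None]) * np.log(P1)
    log_sums = np.log(np.exp(log_terms).sum(axis=1))
    return -log_sums.min(), lams[log_sums.argmin()]

def kl(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P * np.log(P / Q))

# Placeholder distributions (illustrative only).
P0 = [0.5, 0.3, 0.2]
P1 = [0.2, 0.3, 0.5]
C, lam_star = chernoff_information(P0, P1)
print(f"C(P0,P1) = {C:.4f} at lambda* = {lam_star:.3f}")
print(f"D(P0||P1) = {kl(P0, P1):.4f},  D(P1||P0) = {kl(P1, P0):.4f}")
# Sanity check: C is at most the smaller of the two KL divergences.
```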
Upper bound
The MAP test yields $P_e^{(n)} = \frac{1}{2}\sum_{x^n}\min\{P_0(x^n), P_1(x^n)\}$. For any $\lambda \in [0, 1]$, $\min\{a, b\} \le a^\lambda b^{1-\lambda}$, so $P_e^{(n)} \le \frac{1}{2}\sum_{x^n} P_0(x^n)^\lambda P_1(x^n)^{1-\lambda}$.
Factor the product
Since the observations are i.i.d., the sum factorizes: $$\sum_{x^n} P_0(x^n)^\lambda P_1(x^n)^{1-\lambda} = \left(\sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}\right)^n = e^{-n C_\lambda}, \qquad C_\lambda = -\log\sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}.$$ Optimizing over $\lambda$ gives $P_e^{(n)} \le \frac{1}{2} e^{-n C(P_0, P_1)}$.
Matching lower bound
The lower bound follows from analyzing the MAP test via the saddle-point method on the likelihood ratio, showing the exponent cannot exceed $C(P_0, P_1)$.
Example: Chernoff Information for Gaussian Hypotheses
Compute the Chernoff information for $P_0 = \mathcal{N}(\mu_0, \sigma^2)$ and $P_1 = \mathcal{N}(\mu_1, \sigma^2)$ (same variance, different means).
Compute the Chernoff exponent
$$C_\lambda = -\log \int_{-\infty}^{\infty} p_0(x)^\lambda\, p_1(x)^{1-\lambda}\, dx = \frac{\lambda(1-\lambda)(\mu_0 - \mu_1)^2}{2\sigma^2}.$$
Optimize over $\lambda$
The function $\lambda(1 - \lambda)$ is maximized at $\lambda^* = 1/2$: $$C(P_0, P_1) = \frac{(\mu_0 - \mu_1)^2}{8\sigma^2}.$$
Compare with KL divergences
$D(P_0 \| P_1) = D(P_1 \| P_0) = \frac{(\mu_0 - \mu_1)^2}{2\sigma^2}$, so $C(P_0, P_1) = \frac{1}{4} D(P_0 \| P_1)$. The symmetric exponent is 4 times smaller than the Stein exponent: the price of penalizing both error types equally.
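The closed form can be cross-checked numerically. The sketch below evaluates $C_\lambda$ by quadrature on a grid and compares the maximum over $\lambda$ with $(\mu_0 - \mu_1)^2 / (8\sigma^2)$; the means and variance are placeholder values, not from the text.

```python
import numpy as np

# Placeholder parameters (illustrative only):
mu0, mu1, sigma = 0.0, 2.0, 1.0

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Numerical Chernoff information: maximize C_lambda over a grid of lambda values.
x = np.linspace(-10, 12, 200001)
dx = x[1] - x[0]
p0 = gauss_pdf(x, mu0, sigma)
p1 = gauss_pdf(x, mu1, sigma)
lams = np.linspace(0.01, 0.99, 99)
C_lam = [-np.log(np.sum(p0 ** lam * p1 ** (1 - lam)) * dx) for lam in lams]

print("numerical   C =", max(C_lam))
print("closed form C =", (mu0 - mu1) ** 2 / (8 * sigma ** 2))          # (delta mu)^2 / (8 sigma^2)
print("Stein exponent D(P0||P1) =", (mu0 - mu1) ** 2 / (2 * sigma ** 2))  # 4x larger
```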
Why This Matters: Error Exponents in Communication System Design
In digital communications, the error exponent determines how fast the block error rate decays with code length. For a discrete memoryless channel used at a rate $R$ below capacity, the random coding error exponent $E_r(R)$ is positive, and the probability of decoding error decays as $P_e \le e^{-n E_r(R)}$. In practice, this means doubling the block length roughly squares the error probability (it doubles the magnitude of the log-error). The error exponent at rate zero equals the channel's Rényi divergence and governs the performance of ultra-reliable low-latency communications (URLLC) in 5G NR, where block lengths are short and the exponent matters more than capacity.
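As a rough illustration of how such an exponent is computed, the sketch below evaluates Gallager's random coding exponent $E_r(R) = \max_{0 \le \rho \le 1}[E_0(\rho) - \rho R]$ for a binary symmetric channel with uniform inputs. The crossover probability and rate are assumed values, and this is a simplified stand-in for the channel models used in URLLC analysis.

```python
import numpy as np

# Assumed parameters (not from the text):
p = 0.05          # BSC crossover probability
R_nats = 0.3      # code rate in nats per channel use (below capacity for this p)

def E0(rho, p):
    """Gallager's E0(rho) for BSC(p) with the uniform input distribution."""
    s = 1.0 / (1.0 + rho)
    inner = 0.5 * ((1 - p) ** s + p ** s)      # identical for both outputs by symmetry
    return -np.log(2 * inner ** (1 + rho))

rhos = np.linspace(0.0, 1.0, 1001)
Er = max(E0(r, p) - r * R_nats for r in rhos)   # E_r(R) = max_rho [E0(rho) - rho R]
print(f"E_r(R) ~ {Er:.4f} nats  =>  P_e decays roughly like exp(-n * {Er:.4f})")
```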
Finite-Blocklength Bounds for MIMO Channels
While classical error exponent analysis assumes infinite block lengths, modern 5G systems (especially URLLC) operate at block lengths of 100--1000 symbols. Östman, Durisi, and Caire developed tight finite-blocklength bounds for MIMO channels with Rician fading, showing that the gap between the normal approximation (which uses the channel dispersion) and the achievable rate can be significant at these short lengths. Their analysis relies on concentration inequalities — specifically, Berry-Esseen refinements of the CLT — to bound the probability of atypical channel realizations. This work demonstrates that the concentration tools of this chapter are not merely theoretical: they are essential for characterizing 5G URLLC performance.
Common Mistake: Stein's Lemma Is Asymmetric in the Hypotheses
Mistake:
Assuming $D(P_0 \| P_1) = D(P_1 \| P_0)$, or that the optimal exponent is the same regardless of which hypothesis is constrained.
Correction:
If Type I error ($H_0$ wrongly rejected) is constrained, the Type II exponent is $D(P_0 \| P_1)$. If Type II error is constrained instead, the Type I exponent is $D(P_1 \| P_0)$. These are generally different because KL divergence is asymmetric. Only for the symmetric Bayesian problem does the Chernoff information apply.
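A quick numerical check of the asymmetry, with an assumed Bernoulli pair (the values are chosen only to make the gap visible):

```python
import numpy as np

def kl_bern(p, q):
    """D(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Placeholder pair (assumed for illustration):
p0, p1 = 0.5, 0.9
print("D(P0||P1) =", round(kl_bern(p0, p1), 4))   # Type II exponent when Type I is constrained
print("D(P1||P0) =", round(kl_bern(p1, p0), 4))   # Type I exponent when Type II is constrained
```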
Error Exponent
The exponential rate of decay of the error probability with the number of observations (or block length): $E = \lim_{n\to\infty} -\frac{1}{n}\log P_e(n)$. A positive error exponent means exponentially reliable performance.
Related: Stein's Lemma, Chernoff Information (Symmetric Error Exponent), Large Deviation Principle
Chernoff Information
A symmetric measure of distinguishability between two distributions: $C(P_0, P_1) = -\min_{0 \le \lambda \le 1} \log \sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}$. It governs the Bayesian error exponent with equal priors.
Related: Error Exponent, Bhattacharyya Distance
Quick Check
In Stein's lemma, if $D(P_0 \| P_1) = D$ nats and we constrain Type I error to 5%, approximately how many observations are needed for Type II error $\le \beta$?
The answer depends only on $D$ and the target $\beta$, not on the 5% Type I constraint.
We need $\beta_n^* \approx e^{-nD} \le \beta$, so $n \gtrsim \frac{\ln(1/\beta)}{D}$.
Key Takeaway
Large deviations theory provides the exponential rates at which error probabilities decay in hypothesis testing and communication. Stein's lemma identifies $D(P_0 \| P_1)$ as the optimal Type II error exponent under a Type I constraint. The Chernoff information governs the symmetric (Bayesian) case. These exponents are not abstract quantities: they directly determine how many observations (or how long a code) is needed to achieve a target error rate, making them essential for system design in telecommunications.