Applications to Error Probability Analysis

Large Deviations Meet Hypothesis Testing

The machinery developed in this chapter (rate functions, KL divergence, sub-Gaussian concentration) finds its most natural application in the analysis of error probabilities. In hypothesis testing, the probability of confusing two distributions decays exponentially with the number of observations, and the exponent is governed by KL divergence. Stein's lemma makes this precise: the best achievable exponent for Type II error (while keeping Type I error bounded) is exactly $D(P_0 \| P_1)$. This connects large deviations to the Neyman-Pearson framework from detection theory (Book FSI, Chapter 2) and to the error exponents of channel coding (Book ITA, Chapter 4).

Definition: Error Exponent

Consider a sequence of decision problems indexed by $n$ (the number of observations or the block length), with error probability $P_e(n)$. The error exponent is $$E_r = \lim_{n \to \infty} -\frac{1}{n}\log P_e(n),$$ when the limit exists. It measures the exponential rate at which the error probability decays: $P_e(n) \doteq e^{-nE_r}$.

A positive error exponent means exponentially reliable communication or detection. The larger $E_r$, the faster errors vanish. Error exponents are the bridge between information-theoretic limits (which say whether reliable communication is possible) and practical performance (which asks how many observations are needed).
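As a quick numerical illustration (a sketch with assumed parameters, not an example from the text): for equal-prior testing of $\mathcal{N}(0, \sigma^2)$ against $\mathcal{N}(\Delta, \sigma^2)$, the optimal test thresholds the sample mean at $\Delta/2$, so $P_e(n) = Q(\sqrt{n}\,\Delta/2\sigma)$ can be evaluated exactly and the limit in the definition checked directly.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy setup (not from the text): equal-prior testing of N(0, sigma^2)
# vs N(delta, sigma^2). The minimum error probability is Q(sqrt(n)*delta/(2*sigma)).
delta, sigma = 1.0, 1.0

for n in [10, 100, 1000, 10_000]:
    # logsf computes log Q(x) stably, avoiding underflow for tiny tail probabilities.
    log_pe = norm.logsf(np.sqrt(n) * delta / (2 * sigma))
    print(f"n={n:6d}   log P_e(n)={log_pe:10.2f}   -(1/n) log P_e(n)={-log_pe / n:.4f}")

# The empirical exponent approaches delta^2 / (8 sigma^2) = 0.125 as n grows.
print("limiting exponent:", delta**2 / (8 * sigma**2))
```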

Theorem: Stein's Lemma

Let $X_1, X_2, \ldots$ be i.i.d. with distribution either $P_0$ (hypothesis $H_0$) or $P_1$ (hypothesis $H_1$), where $P_0$ and $P_1$ are distributions on a finite alphabet $\mathcal{X}$. For a sequence of tests $\phi_n$ with Type I error $\alpha_n = P_0(\phi_n = 1)$ and Type II error $\beta_n = P_1(\phi_n = 0)$:

For any fixed $\alpha \in (0, 1)$, let $\beta_n^*$ be the minimum Type II error over all tests satisfying $\alpha_n \leq \alpha$. Then $$\lim_{n \to \infty} -\frac{1}{n}\log \beta_n^* = D(P_0 \| P_1).$$

Under $H_0$, the log-likelihood ratio $\frac{1}{n}\sum_{i=1}^n \log\frac{P_0(X_i)}{P_1(X_i)}$ concentrates around $D(P_0 \| P_1) > 0$ by the law of large numbers. Under $H_1$, it concentrates around $-D(P_1 \| P_0) < 0$. The Neyman-Pearson test thresholds this statistic, and the probability of the LLR falling on the wrong side of the threshold decays at rate $D(P_0 \| P_1)$ under $H_1$; this is exactly Cramér's theorem applied to the LLR.
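A minimal simulation sketch of this concentration; the three-letter alphabet and the two distributions below are assumed for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed alphabet {0, 1, 2} with illustrative distributions P0 and P1.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    """KL divergence D(p || q) in nats for distributions on a finite alphabet."""
    return float(np.sum(p * np.log(p / q)))

def llr_mean(p_true, n):
    """Normalized log-likelihood ratio (1/n) * sum_i log(P0(X_i)/P1(X_i)), X_i ~ p_true."""
    x = rng.choice(len(P0), size=n, p=p_true)
    return float(np.mean(np.log(P0[x] / P1[x])))

n = 100_000
print("under H0:", llr_mean(P0, n), "  vs  D(P0||P1) =", kl(P0, P1))
print("under H1:", llr_mean(P1, n), "  vs -D(P1||P0) =", -kl(P1, P0))
```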


Example: Stein's Lemma for Binary Hypothesis Testing

A communication system sends $X_i \in \{0, 1\}$ i.i.d. Either $P_0 = \text{Ber}(0.3)$ (no interference) or $P_1 = \text{Ber}(0.7)$ (interference present). Using $n$ observations, what is the optimal Type II error exponent when Type I error is $\leq 0.01$?
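By Stein's lemma the answer is $D(\text{Ber}(0.3) \| \text{Ber}(0.7)) \approx 0.339$ nats per observation. A short sketch of the computation; the $10^{-6}$ Type II target below is an assumed illustration, not part of the example.

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p0, p1 = 0.3, 0.7
D = kl_bernoulli(p0, p1)
print(f"Stein exponent D(P0||P1) = {D:.4f} nats")            # ~0.3389

# To first exponential order, beta_n* ~ exp(-n * D) once the Type I constraint
# (here 0.01) is fixed, so a target Type II error of 1e-6 needs roughly
# ln(1e6) / D observations (ignoring sub-exponential factors).
print("n for beta <= 1e-6:", int(np.ceil(np.log(1e6) / D)))  # ~41
```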

[Interactive visualization: Stein's lemma, Type II error exponent. Shows the exponential decay of the optimal Type II error probability $\beta_n^*$ as a function of $n$ for two Bernoulli hypotheses (default parameters $p_0 = 0.3$, $p_1 = 0.7$, $n$ up to 100); the slope on the log scale is $-D(P_0 \| P_1)$.]
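A sketch that reproduces this decay numerically (an assumed reimplementation, not the page's own code): it uses deterministic threshold tests only, so the computed $\beta_n$ slightly overestimates the randomized-test optimum $\beta_n^*$, and the empirical exponent converges slowly toward $D(P_0 \| P_1)$.

```python
import numpy as np
from scipy.stats import binom

p0, p1, alpha_max = 0.3, 0.7, 0.01
D = p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1))   # ~0.3389 nats

for n in [25, 50, 100, 200, 400]:
    ks = np.arange(n + 1)
    alpha = binom.sf(ks - 1, n, p0)      # Type I error P0(K >= k) for each threshold k
    t = ks[alpha <= alpha_max][0]        # smallest threshold meeting alpha <= 0.01
    beta = binom.cdf(t - 1, n, p1)       # Type II error P1(K < t)
    print(f"n={n:4d}  threshold={t:3d}  beta={beta:.3e}  -(1/n) log beta={-np.log(beta)/n:.4f}")

print(f"Stein exponent D(P0||P1) = {D:.4f}")
```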

Connection to the Neyman-Pearson Lemma

The Neyman-Pearson lemma (Book FSI, Chapter 2) tells us the structure of the optimal test: threshold the likelihood ratio. Stein's lemma tells us the performance: the best achievable error exponent. Together, they give a complete picture. Writing $\ell_n = \frac{1}{n}\sum_{i=1}^n \log\frac{P_0(X_i)}{P_1(X_i)}$, the Neyman-Pearson test with threshold $\gamma$ achieves $\alpha = \mathbb{P}_0(\ell_n < \gamma)$ and $\beta = \mathbb{P}_1(\ell_n \geq \gamma)$. By large deviations, $\beta \doteq e^{-nD(P_0 \| P_1)}$ when $\alpha$ is held fixed. This is not just a theoretical curiosity: it tells practitioners exactly how many samples are needed to achieve a target detection reliability.
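Concretely, to first exponential order (ignoring sub-exponential factors), hitting a Type II error target $\beta^\star$ with the Type I error held fixed requires roughly $$n \approx \frac{\ln(1/\beta^\star)}{D(P_0 \| P_1)},$$ so for the Bernoulli pair above ($D \approx 0.339$ nats), an assumed target of $\beta^\star = 10^{-8}$ needs $n \approx 18.4 / 0.339 \approx 55$ observations.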

Theorem: Chernoff Information (Symmetric Error Exponent)

For the Bayesian hypothesis testing problem with equal priors and minimum total error probability $P_e(n)$: $$\lim_{n \to \infty} -\frac{1}{n}\log P_e(n) = C(P_0, P_1),$$ where the Chernoff information is $$C(P_0, P_1) = -\min_{0 \leq \lambda \leq 1} \log\!\left(\sum_{x \in \mathcal{X}} P_0(x)^\lambda P_1(x)^{1-\lambda}\right).$$

When both types of error are penalized equally, the optimal exponent is the Chernoff information, a symmetric measure of "distance" between $P_0$ and $P_1$. It satisfies $C(P_0, P_1) \leq \min\{D(P_0 \| P_1), D(P_1 \| P_0)\}$, and equals the Bhattacharyya distance when $\lambda^* = 1/2$.
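A minimal sketch of computing $C(P_0, P_1)$ numerically via a scalar minimization over $\lambda$; the three-letter distributions are assumed for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_information(P0, P1):
    """C(P0, P1) = -min_{0<=lam<=1} log sum_x P0(x)^lam * P1(x)^(1-lam)."""
    P0, P1 = np.asarray(P0, float), np.asarray(P1, float)
    f = lambda lam: np.log(np.sum(P0**lam * P1**(1 - lam)))
    res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
    return -res.fun, res.x          # (Chernoff information, optimizing lambda*)

# Assumed example distributions on a 3-letter alphabet.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
C, lam_star = chernoff_information(P0, P1)
kl = lambda p, q: float(np.sum(p * np.log(p / q)))
print(f"C = {C:.4f} at lambda* = {lam_star:.3f}")
print("min KL bound:", min(kl(P0, P1), kl(P1, P0)))   # C <= min of the two KLs
```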

Example: Chernoff Information for Gaussian Hypotheses

Compute the Chernoff information for $P_0 = \mathcal{N}(\mu_0, \sigma^2)$ and $P_1 = \mathcal{N}(\mu_1, \sigma^2)$ (same variance, different means).
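A sketch of the computation: completing the square in the exponent of $p_0^\lambda p_1^{1-\lambda}$ gives $$\int p_0(x)^\lambda p_1(x)^{1-\lambda}\,dx = \exp\!\left(-\frac{\lambda(1-\lambda)(\mu_0 - \mu_1)^2}{2\sigma^2}\right),$$ so $C(P_0, P_1) = \max_{0 \leq \lambda \leq 1} \frac{\lambda(1-\lambda)(\mu_0 - \mu_1)^2}{2\sigma^2} = \frac{(\mu_0 - \mu_1)^2}{8\sigma^2}$, attained at $\lambda^* = 1/2$; for equal-variance Gaussians the Chernoff information therefore coincides with the Bhattacharyya distance.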

Why This Matters: Error Exponents in Communication System Design

In digital communications, the error exponent determines how fast the block error rate decays with code length. For a discrete memoryless channel at rate $R < C$, the random coding error exponent is $E_r(R) > 0$: the probability of decoding error decays as $e^{-nE_r(R)}$. In practice, this means doubling the block length roughly squares the error probability (it doubles the magnitude of the log-error). The error exponent at low rates is governed by Rényi-divergence (Bhattacharyya-type) quantities and controls the performance of ultra-reliable low-latency communications (URLLC) in 5G NR, where block lengths are short and the exponent matters more than capacity.

🎓 CommIT Contribution (2021)

Finite-Blocklength Bounds for MIMO Channels

J. Östman, G. Durisi, G. Caire. IEEE Transactions on Wireless Communications.

While classical error exponent analysis assumes infinite block lengths, modern 5G systems (especially URLLC) operate at block lengths of 100 to 1000 symbols. Östman, Durisi, and Caire developed tight finite-blocklength bounds for MIMO channels with Rician fading, showing that the gap between the normal approximation (which uses the channel dispersion) and the achievable rate can be significant at these short lengths. Their analysis relies on concentration inequalities (specifically, Berry-Esseen refinements of the CLT) to bound the probability of atypical channel realizations. This work demonstrates that the concentration tools of this chapter are not merely theoretical: they are essential for characterizing 5G URLLC performance.

Tags: finite-blocklength, MIMO, URLLC, concentration-inequalities

Common Mistake: Stein's Lemma Is Asymmetric in the Hypotheses

Mistake:

Assuming $D(P_0 \| P_1) = D(P_1 \| P_0)$, or that the optimal exponent is the same regardless of which hypothesis is constrained.

Correction:

If Type I error ($H_0$ rejected) is constrained, the Type II exponent is $D(P_0 \| P_1)$. If Type II error is constrained instead, the Type I exponent is $D(P_1 \| P_0)$. These are generally different because KL divergence is asymmetric. Only for the symmetric Bayesian problem does the Chernoff information $C(P_0, P_1)$ apply.
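A quick numerical check of the asymmetry, using an assumed lopsided Bernoulli pair; note that for the $\text{Ber}(0.3)$/$\text{Ber}(0.7)$ example earlier the two divergences happen to coincide by symmetry, so a different pair is used here.

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Assumed illustrative pair (not from the text).
p0, p1 = 0.1, 0.5
print("D(P0||P1) =", round(kl_bernoulli(p0, p1), 4))  # ~0.3681: Type II exponent when Type I is constrained
print("D(P1||P0) =", round(kl_bernoulli(p1, p0), 4))  # ~0.5108: Type I exponent when Type II is constrained
```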

Error Exponent

The exponential rate of decay of the error probability with the number of observations (or block length): $E_r = -\lim_{n \to \infty} \frac{1}{n}\log P_e(n)$. A positive error exponent means exponentially reliable performance.

Related: Stein's Lemma, Chernoff Information (Symmetric Error Exponent), Large Deviation Principle

Chernoff Information

A symmetric measure of distinguishability between two distributions: $C(P_0, P_1) = -\min_{\lambda \in [0,1]} \log\sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}$. It governs the Bayesian error exponent with equal priors.

Related: Error Exponent, Bhattacharyya Distance

Quick Check

In Stein's lemma, if $D(P_0 \| P_1) = 0.5$ nats and we constrain Type I error to 5%, approximately how many observations are needed for Type II error $\leq 10^{-6}$?

$n \approx 28$

$n \approx 12$

$n \approx 100$

Depends on $\alpha$

Key Takeaway

Large deviations theory provides the exponential rates at which error probabilities decay in hypothesis testing and communication. Stein's lemma identifies $D(P_0 \| P_1)$ as the optimal Type II error exponent under a Type I constraint. The Chernoff information governs the symmetric (Bayesian) case. These exponents are not abstract quantities: they directly determine the number of observations (or the code length) needed to achieve a target error rate, making them essential for system design in telecommunications.