Applications to Error Probability Analysis

Large Deviations Meet Hypothesis Testing

The machinery developed in this chapter (rate functions, KL divergence, sub-Gaussian concentration) finds its most natural application in the analysis of error probabilities. In hypothesis testing, the probability of confusing two distributions decays exponentially with the number of observations, and the exponent is governed by KL divergence. Stein's lemma makes this precise: the best achievable exponent for Type II error (while keeping Type I error bounded) is exactly $D(P_0 \| P_1)$. This connects large deviations to the Neyman-Pearson framework from detection theory (Book FSI, Chapter 2) and to the error exponents of channel coding (Book ITA, Chapter 4).

Definition: Error Exponent

Consider a sequence of decision problems indexed by $n$ (the number of observations or the block length), with error probability $P_e(n)$. The error exponent is $$E_r = \lim_{n \to \infty} -\frac{1}{n}\log P_e(n),$$ when the limit exists. It measures the exponential rate at which the error probability decays: $P_e(n) \doteq e^{-nE_r}$.

A positive error exponent means exponentially reliable communication or detection. The larger $E_r$, the faster errors vanish. Error exponents are the bridge between information-theoretic limits (which say whether reliable communication is possible) and practical performance (which asks how many observations are needed).
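As a quick numerical illustration (a sketch with assumed parameters, not an example from the text): for equal-prior testing of $\mathcal{N}(0, \sigma^2)$ against $\mathcal{N}(\Delta, \sigma^2)$, the optimal test thresholds the sample mean at $\Delta/2$, so $P_e(n) = Q(\sqrt{n}\,\Delta/2\sigma)$ can be evaluated exactly and the limit in the definition checked directly.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy setup (not from the text): equal-prior testing of N(0, sigma^2)
# vs N(delta, sigma^2). The minimum error probability is Q(sqrt(n)*delta/(2*sigma)).
delta, sigma = 1.0, 1.0

for n in [10, 100, 1000, 10_000]:
    # logsf computes log Q(x) stably, avoiding underflow for tiny tail probabilities.
    log_pe = norm.logsf(np.sqrt(n) * delta / (2 * sigma))
    print(f"n={n:6d}   log P_e(n)={log_pe:10.2f}   -(1/n) log P_e(n)={-log_pe / n:.4f}")

# The empirical exponent approaches delta^2 / (8 sigma^2) = 0.125 as n grows.
print("limiting exponent:", delta**2 / (8 * sigma**2))
```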

Theorem: Stein's Lemma

Let $X_1, X_2, \ldots$ be i.i.d. with distribution either $P_0$ (hypothesis $H_0$) or $P_1$ (hypothesis $H_1$), where $P_0$ and $P_1$ are distributions on a finite alphabet $\mathcal{X}$. For a sequence of tests $\phi_n$ with Type I error $\alpha_n = P_0(\phi_n = 1)$ and Type II error $\beta_n = P_1(\phi_n = 0)$:

For any fixed $\alpha \in (0, 1)$, let $\beta_n^*$ be the minimum Type II error over all tests satisfying $\alpha_n \leq \alpha$. Then $$\lim_{n \to \infty} -\frac{1}{n}\log \beta_n^* = D(P_0 \| P_1).$$

Under $H_0$, the log-likelihood ratio $\frac{1}{n}\sum_{i=1}^n \log\frac{P_0(X_i)}{P_1(X_i)}$ concentrates around $D(P_0 \| P_1) > 0$ by the law of large numbers. Under $H_1$, it concentrates around $-D(P_1 \| P_0) < 0$. The Neyman-Pearson test thresholds this statistic, and the probability of the LLR falling on the wrong side of the threshold decays at rate $D(P_0 \| P_1)$ under $H_1$; this is exactly Cramér's theorem applied to the LLR.
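A minimal simulation sketch of this concentration; the three-letter alphabet and the two distributions below are assumed for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed alphabet {0, 1, 2} with illustrative distributions P0 and P1.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])

def kl(p, q):
    """KL divergence D(p || q) in nats for distributions on a finite alphabet."""
    return float(np.sum(p * np.log(p / q)))

def llr_mean(p_true, n):
    """Normalized log-likelihood ratio (1/n) * sum_i log(P0(X_i)/P1(X_i)), X_i ~ p_true."""
    x = rng.choice(len(P0), size=n, p=p_true)
    return float(np.mean(np.log(P0[x] / P1[x])))

n = 100_000
print("under H0:", llr_mean(P0, n), "  vs  D(P0||P1) =", kl(P0, P1))
print("under H1:", llr_mean(P1, n), "  vs -D(P1||P0) =", -kl(P1, P0))
```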


Example: Stein's Lemma for Binary Hypothesis Testing

A communication system sends $X_i \in \{0, 1\}$ i.i.d. Either $P_0 = \text{Ber}(0.3)$ (no interference) or $P_1 = \text{Ber}(0.7)$ (interference present). Using $n$ observations, what is the optimal Type II error exponent when Type I error is $\leq 0.01$?
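By Stein's lemma the answer is $D(\text{Ber}(0.3) \| \text{Ber}(0.7)) \approx 0.339$ nats per observation. A short sketch of the computation; the $10^{-6}$ Type II target below is an assumed illustration, not part of the example.

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p0, p1 = 0.3, 0.7
D = kl_bernoulli(p0, p1)
print(f"Stein exponent D(P0||P1) = {D:.4f} nats")            # ~0.3389

# To first exponential order, beta_n* ~ exp(-n * D) once the Type I constraint
# (here 0.01) is fixed, so a target Type II error of 1e-6 needs roughly
# ln(1e6) / D observations (ignoring sub-exponential factors).
print("n for beta <= 1e-6:", int(np.ceil(np.log(1e6) / D)))  # ~41
```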

[Interactive visualization: Stein's lemma, Type II error exponent. Shows the exponential decay of the optimal Type II error probability $\beta_n^*$ as a function of $n$ for two Bernoulli hypotheses (default parameters $p_0 = 0.3$, $p_1 = 0.7$, $n$ up to 100); the slope on the log scale is $-D(P_0 \| P_1)$.]
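A sketch that reproduces this decay numerically (an assumed reimplementation, not the page's own code): it uses deterministic threshold tests only, so the computed $\beta_n$ slightly overestimates the randomized-test optimum $\beta_n^*$, and the empirical exponent converges slowly toward $D(P_0 \| P_1)$.

```python
import numpy as np
from scipy.stats import binom

p0, p1, alpha_max = 0.3, 0.7, 0.01
D = p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1))   # ~0.3389 nats

for n in [25, 50, 100, 200, 400]:
    ks = np.arange(n + 1)
    alpha = binom.sf(ks - 1, n, p0)      # Type I error P0(K >= k) for each threshold k
    t = ks[alpha <= alpha_max][0]        # smallest threshold meeting alpha <= 0.01
    beta = binom.cdf(t - 1, n, p1)       # Type II error P1(K < t)
    print(f"n={n:4d}  threshold={t:3d}  beta={beta:.3e}  -(1/n) log beta={-np.log(beta)/n:.4f}")

print(f"Stein exponent D(P0||P1) = {D:.4f}")
```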

Connection to the Neyman-Pearson Lemma

The Neyman-Pearson lemma (Book FSI, Chapter 2) tells us the structure of the optimal test: threshold the likelihood ratio. Stein's lemma tells us the performance: the best achievable error exponent. Together, they give a complete picture. Writing $\ell_n = \frac{1}{n}\sum_{i=1}^n \log\frac{P_0(X_i)}{P_1(X_i)}$, the Neyman-Pearson test with threshold $\gamma$ achieves $\alpha = \mathbb{P}_0(\ell_n < \gamma)$ and $\beta = \mathbb{P}_1(\ell_n \geq \gamma)$. By large deviations, $\beta \doteq e^{-nD(P_0 \| P_1)}$ when $\alpha$ is held fixed. This is not just a theoretical curiosity: it tells practitioners exactly how many samples are needed to achieve a target detection reliability.
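Concretely, to first exponential order (ignoring sub-exponential factors), hitting a Type II error target $\beta^\star$ with the Type I error held fixed requires roughly $$n \approx \frac{\ln(1/\beta^\star)}{D(P_0 \| P_1)},$$ so for the Bernoulli pair above ($D \approx 0.339$ nats), an assumed target of $\beta^\star = 10^{-8}$ needs $n \approx 18.4 / 0.339 \approx 55$ observations.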

Theorem: Chernoff Information (Symmetric Error Exponent)

For the Bayesian hypothesis testing problem with equal priors and minimum total error probability $P_e(n)$: $$\lim_{n \to \infty} -\frac{1}{n}\log P_e(n) = C(P_0, P_1),$$ where the Chernoff information is $$C(P_0, P_1) = -\min_{0 \leq \lambda \leq 1} \log\!\left(\sum_{x \in \mathcal{X}} P_0(x)^\lambda P_1(x)^{1-\lambda}\right).$$

When both types of error are penalized equally, the optimal exponent is the Chernoff information, a symmetric measure of "distance" between $P_0$ and $P_1$. It satisfies $C(P_0, P_1) \leq \min\{D(P_0 \| P_1), D(P_1 \| P_0)\}$, and equals the Bhattacharyya distance when $\lambda^* = 1/2$.
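A minimal sketch of computing $C(P_0, P_1)$ numerically via a scalar minimization over $\lambda$; the three-letter distributions are assumed for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_information(P0, P1):
    """C(P0, P1) = -min_{0<=lam<=1} log sum_x P0(x)^lam * P1(x)^(1-lam)."""
    P0, P1 = np.asarray(P0, float), np.asarray(P1, float)
    f = lambda lam: np.log(np.sum(P0**lam * P1**(1 - lam)))
    res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
    return -res.fun, res.x          # (Chernoff information, optimizing lambda*)

# Assumed example distributions on a 3-letter alphabet.
P0 = np.array([0.5, 0.3, 0.2])
P1 = np.array([0.2, 0.3, 0.5])
C, lam_star = chernoff_information(P0, P1)
kl = lambda p, q: float(np.sum(p * np.log(p / q)))
print(f"C = {C:.4f} at lambda* = {lam_star:.3f}")
print("min KL bound:", min(kl(P0, P1), kl(P1, P0)))   # C <= min of the two KLs
```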

Example: Chernoff Information for Gaussian Hypotheses

Compute the Chernoff information for $P_0 = \mathcal{N}(\mu_0, \sigma^2)$ and $P_1 = \mathcal{N}(\mu_1, \sigma^2)$ (same variance, different means).
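A sketch of the computation: completing the square in the exponent of $p_0^\lambda p_1^{1-\lambda}$ gives $$\int p_0(x)^\lambda p_1(x)^{1-\lambda}\,dx = \exp\!\left(-\frac{\lambda(1-\lambda)(\mu_0 - \mu_1)^2}{2\sigma^2}\right),$$ so $C(P_0, P_1) = \max_{0 \leq \lambda \leq 1} \frac{\lambda(1-\lambda)(\mu_0 - \mu_1)^2}{2\sigma^2} = \frac{(\mu_0 - \mu_1)^2}{8\sigma^2}$, attained at $\lambda^* = 1/2$; for equal-variance Gaussians the Chernoff information therefore coincides with the Bhattacharyya distance.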

Why This Matters: Error Exponents in Communication System Design

In digital communications, the error exponent determines how fast the block error rate decays with code length. For a discrete memoryless channel at rate $R < C$, the random coding error exponent is $E_r(R) > 0$: the probability of decoding error decays as $e^{-nE_r(R)}$. In practice, this means doubling the block length roughly squares the error probability (it doubles the magnitude of the log-error). The error exponent at low rates is governed by Rényi-divergence (Bhattacharyya-type) quantities and controls the performance of ultra-reliable low-latency communications (URLLC) in 5G NR, where block lengths are short and the exponent matters more than capacity.

🎓 CommIT Contribution (2021)

Finite-Blocklength Bounds for MIMO Channels

J. Östman, G. Durisi, G. Caire. IEEE Transactions on Wireless Communications.

While classical error exponent analysis assumes infinite block lengths, modern 5G systems (especially URLLC) operate at block lengths of 100 to 1000 symbols. Östman, Durisi, and Caire developed tight finite-blocklength bounds for MIMO channels with Rician fading, showing that the gap between the normal approximation (which uses the channel dispersion) and the achievable rate can be significant at these short lengths. Their analysis relies on concentration inequalities (specifically, Berry-Esseen refinements of the CLT) to bound the probability of atypical channel realizations. This work demonstrates that the concentration tools of this chapter are not merely theoretical: they are essential for characterizing 5G URLLC performance.

Tags: finite-blocklength, MIMO, URLLC, concentration-inequalities

Common Mistake: Stein's Lemma Is Asymmetric in the Hypotheses

Mistake:

Assuming $D(P_0 \| P_1) = D(P_1 \| P_0)$, or that the optimal exponent is the same regardless of which hypothesis is constrained.

Correction:

If Type I error ($H_0$ rejected) is constrained, the Type II exponent is $D(P_0 \| P_1)$. If Type II error is constrained instead, the Type I exponent is $D(P_1 \| P_0)$. These are generally different because KL divergence is asymmetric. Only for the symmetric Bayesian problem does the Chernoff information $C(P_0, P_1)$ apply.
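A quick numerical check of the asymmetry, using an assumed lopsided Bernoulli pair; note that for the $\text{Ber}(0.3)$/$\text{Ber}(0.7)$ example earlier the two divergences happen to coincide by symmetry, so a different pair is used here.

```python
import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Assumed illustrative pair (not from the text).
p0, p1 = 0.1, 0.5
print("D(P0||P1) =", round(kl_bernoulli(p0, p1), 4))  # ~0.3681: Type II exponent when Type I is constrained
print("D(P1||P0) =", round(kl_bernoulli(p1, p0), 4))  # ~0.5108: Type I exponent when Type II is constrained
```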

Error Exponent

The exponential rate of decay of the error probability with the number of observations (or block length): $E_r = -\lim_{n \to \infty} \frac{1}{n}\log P_e(n)$. A positive error exponent means exponentially reliable performance.

Related: Stein's Lemma, Chernoff Information (Symmetric Error Exponent), Large Deviation Principle

Chernoff Information

A symmetric measure of distinguishability between two distributions: $C(P_0, P_1) = -\min_{\lambda \in [0,1]} \log\sum_x P_0(x)^\lambda P_1(x)^{1-\lambda}$. It governs the Bayesian error exponent with equal priors.

Related: Error Exponent, Bhattacharyya Distance

Quick Check

In Stein's lemma, if $D(P_0 \| P_1) = 0.5$ nats and we constrain Type I error to 5%, approximately how many observations are needed for Type II error $\leq 10^{-6}$?

$n \approx 28$

$n \approx 12$

$n \approx 100$

Depends on $\alpha$

Key Takeaway

Large deviations theory provides the exponential rates at which error probabilities decay in hypothesis testing and communication. Stein's lemma identifies $D(P_0 \| P_1)$ as the optimal Type II error exponent under a Type I constraint. The Chernoff information governs the symmetric (Bayesian) case. These exponents are not abstract quantities: they directly determine the number of observations (or the code length) needed to achieve a target error rate, making them essential for system design in telecommunications.