Error Exponents and the KL Divergence
The Information-Theoretic Limit of Detection
In Chapter 1 we saw that binary Neyman-Pearson testing has an asymptotic error exponent equal to a KL divergence (Stein's lemma). This section closes the loop: the same KL divergence governs the high-SNR / large-$N$ error probability of $M$-ary ML detection, and in the continuous-channel limit it becomes the AWGN channel capacity. This is one of the most beautiful parallels in probability theory: detection and communication are two sides of the same KL coin.
Definition: Minimum Pairwise KL Divergence
Let $p_1, \dots, p_M$ be the per-sample likelihoods under the $M$ hypotheses. Given $N$ i.i.d. observations $x_1, \dots, x_N$, the likelihood under hypothesis $i$ is $\prod_{n=1}^{N} p_i(x_n)$. The minimum pairwise KL divergence of the problem is
$$D_{\min} = \min_{i \neq j} D(p_i \,\|\, p_j).$$
Its symmetrized counterpart, the Bhattacharyya exponent, is
$$B(p_i, p_j) = -\log \int \sqrt{p_i(x)\, p_j(x)}\, dx,$$
and we set $B_{\min} = \min_{i \neq j} B(p_i, p_j)$.
In general $B(p_i, p_j) \le D(p_i \,\|\, p_j)$, and the two are comparable for nearly symmetric hypothesis pairs. The Chernoff exponent, introduced below, refines both.
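As a small numerical illustration of the definition, the following sketch uses made-up distributions on a four-letter alphabet (the distributions and variable names are assumptions, not from the text) to compute every pairwise KL divergence and Bhattacharyya exponent and report $D_{\min}$ and $B_{\min}$.

```python
# Minimal sketch: pairwise KL divergences and Bhattacharyya exponents
# for three hypothetical discrete distributions on a 4-letter alphabet.
import itertools
import numpy as np

P = np.array([
    [0.70, 0.10, 0.10, 0.10],   # p_1
    [0.40, 0.30, 0.20, 0.10],   # p_2
    [0.25, 0.25, 0.25, 0.25],   # p_3
])

def kl(p, q):
    """D(p || q) in nats."""
    return np.sum(p * np.log(p / q))

def bhattacharyya(p, q):
    """B(p, q) = -log sum_x sqrt(p(x) q(x))."""
    return -np.log(np.sum(np.sqrt(p * q)))

pairs = list(itertools.permutations(range(len(P)), 2))
D_min = min(kl(P[i], P[j]) for i, j in pairs)
B_min = min(bhattacharyya(P[i], P[j]) for i, j in pairs)
print(f"D_min = {D_min:.4f} nats, B_min = {B_min:.4f} nats")
```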
Theorem: Exponential Decay of $P_e$ in the Large-$N$ Limit
Consider the $M$-ary ML detector applied to $N$ i.i.d. observations with equal priors. Then
$$P_e \doteq \exp\Big(-N \min_{i \neq j} C(p_i, p_j)\Big),$$
where $\doteq$ denotes exponential equality ($a_N \doteq b_N$ means $\tfrac{1}{N}\log\tfrac{a_N}{b_N} \to 0$) and $C(p_i, p_j) = -\min_{0 \le s \le 1} \log \int p_i(x)^{1-s} p_j(x)^{s}\, dx$ is the Chernoff exponent of the pair $(p_i, p_j)$. In particular, $-\tfrac{1}{N}\log P_e \to \min_{i \neq j} C(p_i, p_j)$, giving the detection-theoretic counterpart of the channel-coding exponent.
For large $N$, the empirical distribution of the samples concentrates near $p_i$ under hypothesis $i$. Confusing hypothesis $i$ with a neighbor $j$ requires the empirical distribution to drift toward the "tilted" distribution $p_s \propto p_i^{1-s} p_j^{s}$ that lies on the geodesic between $p_i$ and $p_j$; this event has exponentially small probability governed by the Chernoff exponent.
Reduce to pairwise via union bound
The union bound gives $P_e \le \frac{1}{M}\sum_{i} \sum_{j \neq i} P(i \to j) \le (M-1)\max_{i \neq j} P(i \to j)$. Conversely, $P_e \ge \frac{1}{M}\max_{i \neq j} P(i \to j)$. Both sides have the same exponent at large $N$, namely the worst (smallest) pairwise exponent.
Chernoff bound on pairwise error
For any $s \in [0,1]$, $P(i \to j) \le \left(\int p_i(x)^{1-s}\, p_j(x)^{s}\, dx\right)^{N}$. Optimizing over $s$ gives $P(i \to j) \le e^{-N\, C(p_i, p_j)}$.
Matching lower bound
Sanov's theorem applied to the set of types closer (in KL) to $p_j$ than to $p_i$ gives $P(i \to j) \ge e^{-N (C(p_i, p_j) + \epsilon)}$ for any $\epsilon > 0$ and all sufficiently large $N$. Combined with the upper bound, we obtain exponential equality $P(i \to j) \doteq e^{-N\, C(p_i, p_j)}$.
Conclude
Putting the pairwise results together, $P_e \doteq \exp\big(-N \min_{i \neq j} C(p_i, p_j)\big)$. The Bhattacharyya bound (at $s = \tfrac12$) gives the simpler, slightly looser estimate $P_e \le (M-1)\, e^{-N B_{\min}}$.
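The optimization over $s$ in the Chernoff bound is one-dimensional and easy to carry out numerically. The sketch below uses a hypothetical discrete pair and a plain grid search (an assumption for illustration, not the text's method) to evaluate $C(p, q)$ and check the ordering $B \le C \le \min\big(D(p\,\|\,q),\, D(q\,\|\,p)\big)$.

```python
# Numerical Chernoff exponent C(p, q) = -min_s log sum_x p^{1-s} q^s,
# evaluated on a grid of s values for two hypothetical discrete distributions.
import numpy as np

p = np.array([0.70, 0.10, 0.10, 0.10])
q = np.array([0.25, 0.25, 0.25, 0.25])

s_grid = np.linspace(0.0, 1.0, 1001)
log_integrand = np.array([np.log(np.sum(p**(1 - s) * q**s)) for s in s_grid])
C = -log_integrand.min()                     # Chernoff exponent (best s on the grid)
B = -np.log(np.sum(np.sqrt(p * q)))          # Bhattacharyya exponent (s = 1/2)
D_pq = np.sum(p * np.log(p / q))             # D(p || q)
D_qp = np.sum(q * np.log(q / p))             # D(q || p)

print(f"B = {B:.4f} <= C = {C:.4f} <= min KL = {min(D_pq, D_qp):.4f}")
assert B <= C + 1e-9 and C <= min(D_pq, D_qp) + 1e-9
```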
Example: Gaussian Shift-in-Mean: Exponent = Half the Euclidean Distance Squared
Let $p_i = \mathcal{N}(\mu_i, \sigma^2)$, $i = 1, \dots, M$, be the per-sample likelihoods under Gaussian hypotheses with the same variance. Compute the pairwise KL divergence, the Chernoff exponent, and the resulting exponent $\min_{i \neq j} C(p_i, p_j)$.
KL divergence
$D(p_i \,\|\, p_j) = \dfrac{(\mu_i - \mu_j)^2}{2\sigma^2}$: one half the squared distance normalized by the noise variance.
Chernoff exponent
By symmetry in $s$ (same variance), the minimizing $s$ is $s^\star = \tfrac12$, yielding $C(p_i, p_j) = \dfrac{(\mu_i - \mu_j)^2}{8\sigma^2}$. This equals the Bhattacharyya exponent, which matches the Chernoff exponent in the symmetric Gaussian case.
Minimum exponent
$\min_{i \neq j} C(p_i, p_j) = \dfrac{d_{\min}^2}{8\sigma^2}$, where $d_{\min} = \min_{i \neq j} |\mu_i - \mu_j|$. Thus at large $N$, $P_e \doteq e^{-N d_{\min}^2 / (8\sigma^2)}$: exactly the exponent matching the Gaussian tail, since $Q(x) \doteq e^{-x^2/2}$ with $x = \sqrt{N}\, d_{\min}/(2\sigma)$ gives the same exponent.
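A quick numerical sanity check of the closed form, assuming hypothetical values $\mu_i = 0$, $\mu_j = 2$, $\sigma = 1$ (so the predicted exponent is $4/8 = 0.5$) and a simple Riemann-sum integration:

```python
# Numerical check of the equal-variance Gaussian exponent d^2 / (8 sigma^2)
# for hypothetical parameters mu_i = 0, mu_j = 2, sigma = 1.
import numpy as np

mu_i, mu_j, sigma = 0.0, 2.0, 1.0
x = np.linspace(-10.0, 12.0, 100_001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Equal variances => the Chernoff optimum is s = 1/2, the Bhattacharyya point.
integrand = np.sqrt(gauss(x, mu_i, sigma) * gauss(x, mu_j, sigma))
C_numeric = -np.log(np.sum(integrand) * dx)
C_closed = (mu_i - mu_j)**2 / (8 * sigma**2)
print(f"numeric {C_numeric:.5f}  vs  closed form {C_closed:.5f}")   # both ~0.5
```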
Theorem: KL Divergence as the Bridge Between Detection and Capacity
Let $\{p(y \mid x)\}_{x \in \mathcal{X}}$ be a finite-alphabet channel family with capacity
$$C = \max_{p(x)} I(X; Y).$$
If one transmits at a rate $R < C$ using a codebook of $2^{NR}$ codewords, the maximum-likelihood decoder achieves $P_e \to 0$ exponentially in the block length. Conversely, if we test only two hypotheses with the same family, Stein's lemma gives error exponent $D\big(p(\cdot \mid x_1) \,\|\, p(\cdot \mid x_2)\big)$. The two statements share the same KL quantity: the detection exponent for a pair of inputs is a capacity term for a binary codebook using those two signals.
Detection asks "given $N$ samples from one of $M$ distributions, which one?", which is a coding problem at rate $(\log M)/N$. Capacity asks "how many hypotheses can I distinguish at vanishing error rate per channel use?", i.e., the optimal scaling of $\log M$ with $N$. KL divergence is the common currency because it is the natural measure of discriminability between two probability distributions.
Single-letter representation of MI
$I(X;Y) = \sum_{x} p(x)\, D\big(p(y \mid x) \,\|\, p(y)\big)$, where $p(y) = \sum_{x} p(x)\, p(y \mid x)$ is the output marginal.
Random-coding achievability
Pick a codebook of $2^{NR}$ codewords drawn i.i.d. from $p(x)$. Under ML decoding, the pairwise error exponent between two random codewords is given by a KL-like quantity (Gallager's $E_0$ function), and the resulting random-coding exponent is positive for every rate $R < C$.
Connection to detection
For a fixed pair of codewords (a 2-hypothesis problem), the error exponent is exactly the Chernoff exponent between their induced output distributions. Averaging over random codebooks recovers the capacity bound.
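To make the single-letter representation above concrete, here is a minimal sketch computing $I(X;Y) = \sum_x p(x)\, D\big(p(y \mid x)\,\|\,p(y)\big)$ for a binary symmetric channel; the crossover probability $\epsilon = 0.1$ and the uniform input are assumptions for illustration, and the result is compared against the closed form $1 - H_2(\epsilon)$ bits.

```python
# Mutual information of a BSC as an average KL divergence to the output marginal,
# compared with the closed form 1 - H2(eps); hypothetical crossover eps = 0.1.
import numpy as np

eps = 0.1
px = np.array([0.5, 0.5])                      # capacity-achieving uniform input
W = np.array([[1 - eps, eps],                  # W[x, y] = p(y | x = 0)
              [eps, 1 - eps]])                 #           p(y | x = 1)
py = px @ W                                    # output marginal p(y)

def kl_bits(p, q):
    return np.sum(p * np.log2(p / q))

I = sum(px[x] * kl_bits(W[x], py) for x in range(2))
H2 = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
print(f"I(X;Y) = {I:.4f} bits,  1 - H2(eps) = {1 - H2:.4f} bits")
```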
Converging to the Large-Deviations Limit
For a 3-ary Gaussian detection problem, plot the empirical exponent $-\tfrac{1}{N}\log \hat P_e$ (from Monte Carlo simulation) together with the theoretical limit $d_{\min}^2/(8\sigma^2)$ as $N$ grows; a simulation sketch follows below. The empirical curve converges to the horizontal line predicted by the large-deviations theorem.
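One possible simulation sketch, assuming hypothetical means $-1, 0, +1$ and $\sigma = 1$ (so the predicted limit is $d_{\min}^2/(8\sigma^2) = 0.125$); because the sample mean is sufficient for equal-variance Gaussians, each trial needs only one Gaussian draw.

```python
# Monte Carlo check of the large-deviations exponent for a 3-ary Gaussian
# shift-in-mean problem (hypothetical means -1, 0, +1 and sigma = 1).
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([-1.0, 0.0, 1.0])              # hypothesis means
sigma = 1.0
d_min = np.min(np.diff(np.sort(mus)))
theory = d_min**2 / (8 * sigma**2)            # min pairwise Chernoff exponent = 0.125

n_trials = 2_000_000
for N in (5, 10, 20, 40):
    true = rng.integers(0, 3, size=n_trials)                      # random true hypothesis
    # Sample mean of N observations: N(mu_true, sigma^2 / N).
    x_bar = mus[true] + rng.normal(0.0, sigma / np.sqrt(N), size=n_trials)
    decision = np.argmin(np.abs(x_bar[:, None] - mus[None, :]), axis=1)  # ML = nearest mean
    p_e = np.mean(decision != true)
    emp_exponent = -np.log(p_e) / N
    print(f"N={N:3d}  P_e={p_e:.2e}  empirical exponent={emp_exponent:.3f}  theory={theory:.3f}")
```

The empirical exponent approaches the theoretical value from above as $N$ grows, since the polynomial prefactor of the Gaussian tail only vanishes at rate $\tfrac{1}{N}\log N$.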
Stein vs. Chernoff Exponents for Binary Hypothesis Testing
| Regime | Error Controlled | Exponent |
|---|---|---|
| Stein (NP) | $P_{FA} \le \alpha$ fixed, minimize $P_M$ | $D(p_0 \Vert p_1)$ |
| Chernoff (Bayes) | Minimize average $P_e$ | $C(p_0, p_1) = -\min_{0 \le s \le 1} \log \int p_0^{1-s} p_1^{s}$ |
| Bhattacharyya | Upper bound on $P_e$ (simpler) | $B(p_0, p_1) = -\log \int \sqrt{p_0\, p_1}$ |
| Hoeffding | $P_{FA}$ exponent fixed at $r$, minimize $P_M$ | $\min_{q:\, D(q \Vert p_0) \le r} D(q \Vert p_1)$ |
Chernoff vs. Bhattacharyya Exponents
The Bhattacharyya exponent is obtained from the Chernoff exponent by fixing $s = \tfrac12$. For symmetric hypothesis pairs (equal variances in the Gaussian case, equal crossover probabilities in the BSC) the optimum is $s^\star = \tfrac12$ and the two coincide. For asymmetric pairs they differ, but the Bhattacharyya exponent is always less than or equal to the Chernoff exponent, and it is often used in practice because it admits closed forms for many common distribution families.
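For an asymmetric pair the optimum moves away from $s = \tfrac12$. The sketch below assumes two hypothetical Gaussians of unequal variance and uses a grid search over $s$ with Riemann-sum integration (illustrative choices, not the text's) to show $s^\star \neq \tfrac12$ and Bhattacharyya $\le$ Chernoff.

```python
# Chernoff vs. Bhattacharyya exponent for an asymmetric (unequal-variance) Gaussian
# pair: hypothetical p = N(0, 1) and q = N(1, 4).  The optimizer s* is no longer 1/2.
import numpy as np

x = np.linspace(-25.0, 25.0, 100_001)
dx = x[1] - x[0]

def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

p, q = gauss(x, 0.0, 1.0), gauss(x, 1.0, 4.0)

s_grid = np.linspace(0.0, 1.0, 401)
neg_log = np.array([-np.log(np.sum(p**(1 - s) * q**s) * dx) for s in s_grid])
s_star = s_grid[np.argmax(neg_log)]
C = neg_log.max()                              # Chernoff exponent
B = -np.log(np.sum(np.sqrt(p * q)) * dx)       # Bhattacharyya exponent (s = 1/2)
print(f"s* = {s_star:.3f},  Chernoff C = {C:.4f} >= Bhattacharyya B = {B:.4f}")
```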
Why This Matters: Error Exponents in Channel Coding
Gallager's random-coding exponent for channel coding, $E_r(R) = \max_{0 \le \rho \le 1} \max_{p(x)} \big[ E_0(\rho, p) - \rho R \big]$, is built from exactly the same Chernoff-style KL quantities as the detection exponent in this section. When you read that "LDPC codes achieve capacity with vanishing BER for rates below $C$", the underlying mechanism is the same large-deviations concentration that drives $P_e \to 0$ in detection. This is why advanced modulation-and-coding schemes are designed jointly: the code picks a set of codewords whose pairwise Chernoff exponents are uniformly large.
See full treatment in Chapter 12
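As a preview of that treatment, Gallager's exponent can be evaluated directly for a small channel. The sketch below assumes a hypothetical BSC with crossover $\epsilon = 0.1$ and a uniform input, and uses the general formula $E_0(\rho, Q) = -\log \sum_y \big[\sum_x Q(x)\, p(y \mid x)^{1/(1+\rho)}\big]^{1+\rho}$ together with $E_r(R) = \max_{0 \le \rho \le 1} [E_0(\rho) - \rho R]$ (logs base 2, rates in bits per channel use).

```python
# Gallager's random-coding exponent E_r(R) for a hypothetical BSC (eps = 0.1),
# uniform input; rates and exponents in bits per channel use.
import numpy as np

eps = 0.1
Q = np.array([0.5, 0.5])                       # input distribution
W = np.array([[1 - eps, eps],
              [eps, 1 - eps]])                 # W[x, y] = p(y | x)

def E0(rho):
    inner = (Q[:, None] * W ** (1.0 / (1.0 + rho))).sum(axis=0)   # sum over x
    return -np.log2((inner ** (1.0 + rho)).sum())                 # sum over y

rhos = np.linspace(0.0, 1.0, 1001)
for R in (0.2, 0.4, 0.5):                      # rates below C = 1 - H2(0.1) ~ 0.531
    Er = max(E0(rho) - rho * R for rho in rhos)
    print(f"R = {R:.2f}  ->  E_r(R) = {Er:.4f}")
```

The exponent shrinks as $R$ approaches capacity, which is why near-capacity codes need long blocklengths.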
Common Mistake: KL Divergence Is Not Symmetric
Mistake:
Students sometimes write "the KL distance between $p$ and $q$" and use $D(p \,\|\, q)$ interchangeably with $D(q \,\|\, p)$.
Correction:
KL divergence is not a metric: $D(p \,\|\, q) \neq D(q \,\|\, p)$ in general, and the triangle inequality fails. In detection contexts, $D(p_0 \,\|\, p_1)$ is the Stein exponent controlling the miss probability (fix $P_{FA}$, minimize $P_M$), while $D(p_1 \,\|\, p_0)$ controls the false-alarm probability when you fix $P_M$. Use the correct one by checking: "which distribution is the observation actually drawn from under the alternative I am trying to reject?"
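A short numerical illustration of the asymmetry, using a made-up pair of binary distributions:

```python
# KL divergence is not symmetric: D(p||q) != D(q||p) in general.
import numpy as np

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
D_pq = np.sum(p * np.log(p / q))   # ~ 0.368 nats
D_qp = np.sum(q * np.log(q / p))   # ~ 0.511 nats
print(f"D(p||q) = {D_pq:.3f},  D(q||p) = {D_qp:.3f}")
```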
Chernoff Exponent
Given two densities $p, q$, the Chernoff exponent is $C(p, q) = -\min_{0 \le s \le 1} \log \int p(x)^{1-s} q(x)^{s}\, dx$. It controls the asymptotic decay of the equal-prior Bayes error probability for binary hypothesis testing on $N$ i.i.d. samples.
Related: Bhattacharyya Bound, KL Divergence, Large Deviations
Quick Check
You use $M = 4$ equiprobable hypotheses. Two of the six hypothesis pairs share Chernoff exponent $3$, three share exponent $2$, and one pair has exponent $1$. What is the asymptotic exponent of the ML detector's error probability with $N$ i.i.d. samples?
3
2
1
The harmonic mean of the pairwise exponents
The ML error probability is dominated by the worst pair: $P_e \doteq e^{-N \min_{i \neq j} C_{ij}}$, and $\min_{i \neq j} C_{ij} = 1$.
Operational Gap Between Capacity and Uncoded BER
A common source of confusion: the AWGN channel has capacity $\log_2(1 + \mathrm{SNR})$ bits/channel use, but uncoded 16-QAM at SNR 10 dB produces a BER on the order of $10^{-2}$, hardly "error-free transmission at 4 bits/channel use." The resolution: capacity promises vanishing BER at any rate $R < C$ with long-blocklength coding. The uncoded SER of this chapter is the pre-decoder error probability; Gray coding maps it to BER $\approx$ SER$/\log_2 M$, and the outer code (LDPC, Polar) drives the BER toward zero. For 5G NR at high SNR, the outer code operates at a raw channel BER that would be unusable on its own and produces a post-decoding BER roughly four orders of magnitude lower, an improvement delivered by the KL-divergence-controlled error exponent of the code.
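The arithmetic behind the gap, as a sketch that assumes the complex-baseband capacity formula $\log_2(1+\mathrm{SNR})$, a hypothetical uncoded SER, and the Gray-coding rule of thumb BER $\approx$ SER$/\log_2 M$:

```python
# The operational gap in numbers: AWGN capacity at 10 dB SNR vs. the 4 bits/symbol
# carried by uncoded 16-QAM, plus the Gray-coding rule of thumb BER ~= SER / log2(M).
import numpy as np

snr_db = 10.0
snr = 10 ** (snr_db / 10)
capacity = np.log2(1 + snr)        # complex-baseband AWGN capacity, bits/channel use
print(f"capacity at {snr_db:.0f} dB = {capacity:.2f} bits/use vs. 4 bits/use for 16-QAM")

ser = 1e-2                         # hypothetical uncoded symbol error rate
ber = ser / np.log2(16)            # Gray coding: roughly one bit error per symbol error
print(f"SER {ser:.0e} -> approximate BER {ber:.1e}")
```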
Chernoff information
For two distributions $p, q$, the Chernoff information is $C(p, q) = -\min_{0 \le s \le 1} \log \int p^{1-s} q^{s}$, with the minimizing $s^\star$ defining a tilted distribution $p_{s^\star} \propto p^{1-s^\star} q^{s^\star}$ that satisfies $D(p_{s^\star} \,\|\, p) = D(p_{s^\star} \,\|\, q) = C(p, q)$. It is the exponential rate at which the minimum Bayes error probability decays with the number of i.i.d. observations.
Related: KL divergence, Bhattacharyya distance, Error exponent
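The tilted-distribution characterization can be checked numerically: sweep $s$, form $p_s \propto p^{1-s} q^{s}$, and locate the $s^\star$ at which $D(p_s \,\|\, p) = D(p_s \,\|\, q)$; the common value matches the Chernoff information. A sketch with a hypothetical discrete pair (the same illustrative pair used earlier):

```python
# Tilted-distribution characterization of Chernoff information, verified numerically
# for a hypothetical discrete pair p, q.
import numpy as np

p = np.array([0.70, 0.10, 0.10, 0.10])
q = np.array([0.25, 0.25, 0.25, 0.25])

def kl(a, b):
    return np.sum(a * np.log(a / b))

s_grid = np.linspace(1e-3, 1 - 1e-3, 2001)
gap = []
for s in s_grid:
    ps = p**(1 - s) * q**s
    ps /= ps.sum()                                  # tilted distribution p_s
    gap.append(kl(ps, p) - kl(ps, q))
s_star = s_grid[np.argmin(np.abs(gap))]             # where the two divergences cross
ps = p**(1 - s_star) * q**s_star
ps /= ps.sum()
C_direct = -min(np.log(np.sum(p**(1 - s) * q**s)) for s in s_grid)
print(f"s* = {s_star:.3f},  D(p_s||p) = {kl(ps, p):.4f},  D(p_s||q) = {kl(ps, q):.4f},"
      f"  Chernoff = {C_direct:.4f}")
```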