Error Exponents and the KL Divergence

The Information-Theoretic Limit of Detection

In Chapter 1 we saw that binary Neyman-Pearson testing has an asymptotic error exponent equal to a KL divergence (Stein's lemma). This section closes the loop: the same KL divergence governs the high-SNR / large-$n$ error probability of $M$-ary ML detection, and in the continuous-channel limit it becomes the AWGN channel capacity. This is one of the most beautiful parallels in probability theory: detection and communication are two sides of the same KL coin.

Definition: Minimum Pairwise KL Divergence

Let $\{f_m\}_{m=0}^{M-1}$ be the per-sample likelihoods under $M$ hypotheses. Given $n$ i.i.d. observations, the likelihood under $\mathcal{H}_m$ is $f_m^{\otimes n}(\mathbf{y}) = \prod_{i=1}^n f_m(y_i)$. The minimum pairwise KL divergence of the problem is
$$D_{\min} \;=\; \min_{m \neq j} D(f_m \| f_j) \;=\; \min_{m \neq j} \int f_m(y) \log \frac{f_m(y)}{f_j(y)}\,dy.$$
Its symmetrized counterpart, the Bhattacharyya exponent, is $\mu_{m,j} = -\log \int \sqrt{f_m(y) f_j(y)}\,dy$, and we set $\mu_{\min} = \min_{m\neq j}\mu_{m,j}$.
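
For scalar densities these quantities are easy to sanity-check numerically. The sketch below is illustrative (the helper names `kl` and `bhattacharyya` and the Gaussian test pair are not from the text); it evaluates $D(f\|g)$ and $\mu$ by quadrature and matches the closed forms for unit-variance Gaussians.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl(logf, logg, lo=-20.0, hi=20.0):
    """Numerical D(f || g) = ∫ f(y) log( f(y)/g(y) ) dy, with f, g given as log-densities."""
    return quad(lambda y: np.exp(logf(y)) * (logf(y) - logg(y)), lo, hi)[0]

def bhattacharyya(logf, logg, lo=-20.0, hi=20.0):
    """Numerical Bhattacharyya exponent mu = -log ∫ sqrt( f(y) g(y) ) dy."""
    return -np.log(quad(lambda y: np.exp(0.5 * (logf(y) + logg(y))), lo, hi)[0])

# Check against the closed forms for unit-variance Gaussians 1.5 apart:
# D = 1.5^2 / 2 = 1.125 and mu = 1.5^2 / 8 = 0.28125.
f0, f1 = norm(0, 1), norm(1.5, 1)
print(kl(f1.logpdf, f0.logpdf), bhattacharyya(f0.logpdf, f1.logpdf))
```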

In general $\mu_{m,j} \leq \tfrac{1}{2}\min\bigl(D(f_m\|f_j),\, D(f_j\|f_m)\bigr)$, and the two sides are comparable for nearly symmetric hypothesis pairs. The Chernoff exponent $e^*_{m,j} = -\min_{\lambda\in[0,1]}\log\int f_m^\lambda f_j^{1-\lambda}\,dy$ refines both: it satisfies $\mu_{m,j} \leq e^*_{m,j} \leq \min\bigl(D(f_m\|f_j),\, D(f_j\|f_m)\bigr)$ and is the exact Bayes exponent for the pair.

Theorem: Exponential Decay of $P_e$ in the Large-$n$ Limit

Consider the $M$-ary ML detector applied to $n$ i.i.d. observations with equal priors. Then
$$P_e^{(n)} \;\doteq\; e^{-n\, e^*_{\min}}, \qquad e^*_{\min} \;=\; \min_{m\neq j} e^*_{m,j},$$
where $\doteq$ denotes exponential equality ($\lim_{n\to\infty} -\tfrac{1}{n}\log P_e^{(n)} = e^*_{\min}$) and $e^*_{m,j}$ is the Chernoff exponent of the pair $(m,j)$. In particular, since $e^*_{\min} \leq D_{\min}$, we have $P_e^{(n)} \geq e^{-n D_{\min}(1 + o(1))}$: no detector can beat the minimum pairwise KL divergence. This is the detection-theoretic counterpart of the channel-coding error exponent.

For large $n$, the empirical distribution $\hat P_{\mathbf{y}}$ concentrates near $f_m$ under $\mathcal{H}_m$. Confusing $m$ with a neighbor $j$ requires $\hat P_{\mathbf{y}}$ to drift toward the "tilted" distribution that lies on the geodesic between $f_m$ and $f_j$; this event has exponentially small probability governed by the Chernoff exponent.
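
The matching upper bound is a union bound over the pairwise Chernoff bounds:
$$P_e^{(n)} \;\leq\; \frac{1}{M}\sum_{m}\sum_{j \neq m} e^{-n\, e^*_{m,j}} \;\leq\; (M-1)\, e^{-n\, e^*_{\min}},$$
so the worst (smallest-exponent) pair dominates as $n \to \infty$, and only $e^*_{\min}$ survives in the theorem.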

Example: Gaussian Shift-in-Mean: KL Divergence = Half the Noise-Normalized Squared Distance

Let $f_m = \mathcal{N}(\mu_m, \sigma^2)$ be the per-sample likelihoods under $M$ Gaussian hypotheses with the same variance. Compute the pairwise KL divergence, the Chernoff exponent, and $e^*_{\min}$.
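
For reference, the standard equal-variance Gaussian computations give
$$D(f_m \| f_j) = \frac{(\mu_m - \mu_j)^2}{2\sigma^2}, \qquad \int f_m^{\lambda} f_j^{1-\lambda}\,dy = \exp\!\Bigl(-\frac{\lambda(1-\lambda)(\mu_m-\mu_j)^2}{2\sigma^2}\Bigr),$$
so $e^*_{m,j} = \max_{\lambda\in[0,1]} \frac{\lambda(1-\lambda)(\mu_m-\mu_j)^2}{2\sigma^2} = \frac{(\mu_m-\mu_j)^2}{8\sigma^2}$ with $\lambda^* = 1/2$ (the pair is symmetric, so Chernoff and Bhattacharyya coincide), and $e^*_{\min} = d_{\min}^2 / (8\sigma^2)$ where $d_{\min} = \min_{m\neq j} |\mu_m - \mu_j|$.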

$-\tfrac{1}{n}\log P_e$ Converging to $e^*_{\min}$

For a 3-ary Gaussian detection problem, plot the empirical exponent $-\tfrac{1}{n}\log \hat P_e^{(n)}$ (from Monte-Carlo simulation) together with the theoretical limit $e^*_{\min}$ as $n$ grows. The empirical curve converges to the horizontal line predicted by the large-deviations theorem.

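A minimal Monte-Carlo sketch of this demo follows; the parameters are assumed to match the demo's values (mean spacing 1.5, $\sigma = 1$, $n$ up to 40), and the variable names and trial count are illustrative.

```python
import numpy as np

# Minimal Monte-Carlo sketch of the demo above. Assumed (illustrative) parameters:
# means 0, 1.5, 3.0 (spacing d_min = 1.5), sigma = 1, n up to 40, equal priors.
rng = np.random.default_rng(0)
means = np.array([0.0, 1.5, 3.0])
sigma, n_trials = 1.0, 200_000
M = len(means)

# Theory (equal-variance Gaussians): e*_min = d_min^2 / (8 sigma^2) = 0.28125.
e_star_min = 1.5**2 / (8 * sigma**2)

for n in (2, 5, 10, 20, 40):
    m_true = rng.integers(M, size=n_trials)               # equiprobable hypotheses
    y = means[m_true][:, None] + sigma * rng.standard_normal((n_trials, n))
    # With equal variances and equal priors, ML detection is nearest-mean
    # on the sample average (a sufficient statistic).
    m_hat = np.abs(y.mean(axis=1)[:, None] - means[None, :]).argmin(axis=1)
    p_e = np.mean(m_hat != m_true)
    if p_e > 0:                                            # skip n where no errors occurred
        print(f"n = {n:2d}:  -log(Pe)/n = {-np.log(p_e) / n:.3f}   (e*_min = {e_star_min:.3f})")
```

The empirical exponent approaches $e^*_{\min}$ from above, and the gap closes only like $(\log n)/n$ because of the sub-exponential prefactor, so the curve sits visibly above the asymptote at moderate $n$.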

Stein vs. Chernoff Exponents for Binary Hypothesis Testing

| Regime | Error controlled | Exponent |
| --- | --- | --- |
| Stein (NP) | $P_F \leq \alpha$ fixed, minimize $P_M$ | $D(f_0 \Vert f_1)$ |
| Chernoff (Bayes) | Minimize $\pi_0 P_F + \pi_1 P_M$ | $e^* = -\min_{\lambda \in [0,1]} \log \int f_0^{1-\lambda} f_1^{\lambda}$ |
| Bhattacharyya | Upper bound (simpler) | $-\log \int \sqrt{f_0 f_1}$ (Chernoff objective at $\lambda = 1/2$) |
| Hoeffding | $P_M \leq \beta$ fixed, minimize $P_F$ | $D(f_1 \Vert f_0)$ |

Chernoff vs. Bhattacharyya Exponents

The Bhattacharyya exponent is obtained from the Chernoff exponent by fixing $\lambda = 1/2$ instead of optimizing. For symmetric hypothesis pairs (equal variances in the Gaussian case, equal crossover probabilities in the BSC) the optimum is $\lambda^* = 1/2$ and the two coincide. For asymmetric pairs they differ, but the Bhattacharyya exponent is always $\leq$ the Chernoff exponent, and it is often used in practice because it admits closed forms for many common families (Gaussians and other exponential families in particular).
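
A numerical illustration of the gap; the asymmetric Gaussian pair and helper name below are hypothetical, chosen only so that the optimizer lands away from $\lambda = 1/2$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical asymmetric pair: different variances, so lambda* need not be 1/2.
f0, f1 = norm(0, 1), norm(1, 2)   # N(0, 1) and N(1, 2^2)

def mu(lam):
    """mu(lambda) = -log ∫ f0^(1-lambda) f1^lambda dy, evaluated by quadrature."""
    integrand = lambda y: np.exp((1 - lam) * f0.logpdf(y) + lam * f1.logpdf(y))
    return -np.log(quad(integrand, -np.inf, np.inf)[0])

res = minimize_scalar(lambda lam: -mu(lam), bounds=(0.0, 1.0), method="bounded")
print(f"lambda* = {res.x:.3f}   Chernoff = {-res.fun:.4f}   Bhattacharyya = {mu(0.5):.4f}")
```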

Why This Matters: Error Exponents in Channel Coding

Gallager's random-coding exponent for channel coding, $E_r(R) = \max_{0\leq\rho\leq 1}\,[E_0(\rho) - \rho R]$, is built from exactly the same Chernoff-style KL quantities as the detection exponent in this section. When you read that "LDPC codes achieve capacity with vanishing BER for rates below $C$", the underlying mechanism is the same large-deviations concentration that drives $P_e^{(n)} \to 0$ in detection. This is why advanced modulation-and-coding schemes are designed jointly: the code picks a set of codewords whose pairwise Chernoff exponents are uniformly large.
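
For reference, Gallager's function for a discrete memoryless channel $P(y\mid x)$ with input distribution $Q$ has the standard form
$$E_0(\rho) \;=\; -\log \sum_{y}\Bigl[\sum_{x} Q(x)\, P(y\mid x)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho},$$
and for any two codewords $\mathbf{x}, \mathbf{x}'$ the pairwise error probability obeys the union-Bhattacharyya bound $P(\mathbf{x}\to\mathbf{x}') \leq \prod_i \sum_y \sqrt{P(y\mid x_i)\, P(y\mid x_i')}$, a product of exactly the per-letter Bhattacharyya coefficients $e^{-\mu}$ from this section.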

See full treatment in Chapter 12

Common Mistake: KL Divergence Is Not Symmetric

Mistake:

Students sometimes write "the KL distance between $f$ and $g$" and use $D(f\|g)$ interchangeably with $D(g\|f)$.

Correction:

KL divergence is not a metric: $D(f\|g) \neq D(g\|f)$ in general, and the triangle inequality fails. In detection contexts, $D(f_0\|f_1)$ is the Stein exponent controlling the miss probability (fix $P_F$, minimize $P_M$), while $D(f_1\|f_0)$ controls the false-alarm exponent when you fix $P_M$. A reliable way to pick the right one: by Sanov's theorem the exponent is $D(P\|Q)$, where $Q$ is the distribution the observations are actually drawn from and $P$ is the distribution the detector is being fooled into accepting.
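
A quick numerical illustration with an asymmetric Gaussian pair (the closed-form helper below is for univariate Gaussians; the specific variances are illustrative):

```python
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form D( N(mu0, s0^2) || N(mu1, s1^2) ) for univariate Gaussians."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

# An asymmetric pair: the two directions disagree badly, so "the KL distance"
# is not well defined without specifying the order of the arguments.
print(kl_gauss(0, 1, 0, 3))   # D(N(0,1) || N(0,9)) ~= 0.65
print(kl_gauss(0, 3, 0, 1))   # D(N(0,9) || N(0,1)) ~= 2.90
```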

Chernoff Exponent

Given two densities $f_0, f_1$, the Chernoff exponent is $e^* = -\min_{\lambda \in [0,1]} \log\int f_0^{1-\lambda} f_1^\lambda\,dy$. It controls the asymptotic decay of the minimum Bayes error probability for binary hypothesis testing on $n$ i.i.d. samples.

Related: Bhattacharyya Bound, KL Divergence, Large Deviations

Quick Check

You use $M = 4$ equiprobable hypotheses. Two of the six pairs share exponent $e = 1$, three share $e = 2$, and one pair has $e = 3$. What is the asymptotic exponent of the ML detector's error probability with $n$ i.i.d. samples?

3

2

1

1.83 (harmonic mean)

🔧 Engineering Note

Operational Gap Between Capacity and Uncoded BER

A common source of confusion: the AWGN channel has capacity $C = \log_2(1 + \mathrm{SNR})$ bits/channel use, but uncoded 16-QAM at an SNR of 10 dB produces a BER on the order of $10^{-2}$, hardly "error-free transmission at 4 bits/channel use." The resolution: capacity promises vanishing BER at any rate $R < C$ with long-blocklength coding. The uncoded SER of this chapter is the pre-decoder error probability; Gray coding maps it to $\mathrm{BER} \approx \mathrm{SER}/\log_2 M$, and the outer code (LDPC, Polar) drives the BER toward zero. In 5G NR at high SNR, the outer code operates at a raw channel BER around $10^{-2}$ and produces a post-decoding BER below $10^{-6}$, a four-orders-of-magnitude improvement delivered by the KL-divergence-controlled error exponent of the code.
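
A small numerical sketch of this gap; the exact uncoded figure depends on whether "SNR" is read as $E_s/N_0$ or $E_b/N_0$, so both are shown (standard nearest-neighbour square-QAM approximation; the helper name is illustrative).

```python
import numpy as np
from scipy.stats import norm

def qam16_ber(es_n0):
    """Gray-coded 16-QAM BER from the nearest-neighbour square-QAM SER approximation."""
    ser = 4 * (1 - 1 / np.sqrt(16)) * norm.sf(np.sqrt(3 * es_n0 / 15))
    return ser / 4   # Gray mapping: roughly one bit error per symbol error

snr = 10 ** (10.0 / 10)   # 10 dB
print(f"AWGN capacity:            {np.log2(1 + snr):.2f} bits/channel use")
print(f"Uncoded BER (SNR=Es/N0):  {qam16_ber(snr):.1e}")
print(f"Uncoded BER (SNR=Eb/N0):  {qam16_ber(4 * snr):.1e}")
```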

📋 Ref: Shannon 1948; Gallager 1968

Chernoff information

For two distributions $f_0, f_1$, the Chernoff information is $C(f_0, f_1) = \max_{s \in [0,1]} \mu(s)$ with $\mu(s) = -\log \int f_0^{1-s} f_1^s$. It is the exponential rate at which the minimum Bayes error probability decays with the number of i.i.d. observations.

Related: KL divergence, Bhattacharyya distance, Error exponent