Error Exponents and the KL Divergence

The Information-Theoretic Limit of Detection

In Chapter 1 we saw that binary Neyman-Pearson testing has an asymptotic error exponent equal to a KL divergence (Stein's lemma). This section closes the loop: the same KL divergence governs the high-SNR / large-$n$ error probability of $M$-ary ML detection, and in the continuous-channel limit it becomes the AWGN channel capacity. This is one of the most beautiful parallels in probability theory: detection and communication are two sides of the same KL coin.

Definition: Minimum Pairwise KL Divergence

Let $\{f_m\}_{m=0}^{M-1}$ be the per-sample likelihoods under $M$ hypotheses. Given $n$ i.i.d. observations, the likelihood under $\mathcal{H}_m$ is $f_m^{\otimes n}(\mathbf{y}) = \prod_{i=1}^n f_m(y_i)$. The minimum pairwise KL divergence of the problem is
$$D_{\min} \;=\; \min_{m \neq j} D(f_m \| f_j) \;=\; \min_{m \neq j} \int f_m(y) \log \frac{f_m(y)}{f_j(y)}\,dy.$$
Its symmetrized counterpart, the Bhattacharyya exponent, is $\mu_{m,j} = -\log \int \sqrt{f_m(y) f_j(y)}\,dy$, and we set $\mu_{\min} = \min_{m\neq j}\mu_{m,j}$.
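
For scalar densities these quantities are easy to sanity-check numerically. The sketch below is illustrative (the helper names `kl` and `bhattacharyya` and the Gaussian test pair are not from the text); it evaluates $D(f\|g)$ and $\mu$ by quadrature and matches the closed forms for unit-variance Gaussians.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl(logf, logg, lo=-20.0, hi=20.0):
    """Numerical D(f || g) = ∫ f(y) log( f(y)/g(y) ) dy, with f, g given as log-densities."""
    return quad(lambda y: np.exp(logf(y)) * (logf(y) - logg(y)), lo, hi)[0]

def bhattacharyya(logf, logg, lo=-20.0, hi=20.0):
    """Numerical Bhattacharyya exponent mu = -log ∫ sqrt( f(y) g(y) ) dy."""
    return -np.log(quad(lambda y: np.exp(0.5 * (logf(y) + logg(y))), lo, hi)[0])

# Check against the closed forms for unit-variance Gaussians 1.5 apart:
# D = 1.5^2 / 2 = 1.125 and mu = 1.5^2 / 8 = 0.28125.
f0, f1 = norm(0, 1), norm(1.5, 1)
print(kl(f1.logpdf, f0.logpdf), bhattacharyya(f0.logpdf, f1.logpdf))
```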

In general $\mu_{m,j} \leq \tfrac{1}{2}\min\bigl(D(f_m\|f_j),\, D(f_j\|f_m)\bigr)$, and the two sides are comparable for nearly symmetric hypothesis pairs. The Chernoff exponent $e^*_{m,j} = -\min_{\lambda\in[0,1]}\log\int f_m^\lambda f_j^{1-\lambda}\,dy$ refines both: it satisfies $\mu_{m,j} \leq e^*_{m,j} \leq \min\bigl(D(f_m\|f_j),\, D(f_j\|f_m)\bigr)$ and is the exact Bayes exponent for the pair.

Theorem: Exponential Decay of $P_e$ in the Large-$n$ Limit

Consider the $M$-ary ML detector applied to $n$ i.i.d. observations with equal priors. Then
$$P_e^{(n)} \;\doteq\; e^{-n\, e^*_{\min}}, \qquad e^*_{\min} \;=\; \min_{m\neq j} e^*_{m,j},$$
where $\doteq$ denotes exponential equality ($\lim_{n\to\infty} -\tfrac{1}{n}\log P_e^{(n)} = e^*_{\min}$) and $e^*_{m,j}$ is the Chernoff exponent of the pair $(m,j)$. In particular, since $e^*_{\min} \leq D_{\min}$, we have $P_e^{(n)} \geq e^{-n D_{\min}(1 + o(1))}$: no detector can beat the minimum pairwise KL divergence. This is the detection-theoretic counterpart of the channel-coding error exponent.

For large $n$, the empirical distribution $\hat P_{\mathbf{y}}$ concentrates near $f_m$ under $\mathcal{H}_m$. Confusing $m$ with a neighbor $j$ requires $\hat P_{\mathbf{y}}$ to drift toward the "tilted" distribution that lies on the geodesic between $f_m$ and $f_j$; this event has exponentially small probability governed by the Chernoff exponent.
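
The matching upper bound is a union bound over the pairwise Chernoff bounds:
$$P_e^{(n)} \;\leq\; \frac{1}{M}\sum_{m}\sum_{j \neq m} e^{-n\, e^*_{m,j}} \;\leq\; (M-1)\, e^{-n\, e^*_{\min}},$$
so the worst (smallest-exponent) pair dominates as $n \to \infty$, and only $e^*_{\min}$ survives in the theorem.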

Example: Gaussian Shift-in-Mean: KL Divergence = Half the Noise-Normalized Squared Distance

Let $f_m = \mathcal{N}(\mu_m, \sigma^2)$ be the per-sample likelihoods under $M$ Gaussian hypotheses with the same variance. Compute the pairwise KL divergence, the Chernoff exponent, and $e^*_{\min}$.
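
For reference, the standard equal-variance Gaussian computations give
$$D(f_m \| f_j) = \frac{(\mu_m - \mu_j)^2}{2\sigma^2}, \qquad \int f_m^{\lambda} f_j^{1-\lambda}\,dy = \exp\!\Bigl(-\frac{\lambda(1-\lambda)(\mu_m-\mu_j)^2}{2\sigma^2}\Bigr),$$
so $e^*_{m,j} = \max_{\lambda\in[0,1]} \frac{\lambda(1-\lambda)(\mu_m-\mu_j)^2}{2\sigma^2} = \frac{(\mu_m-\mu_j)^2}{8\sigma^2}$ with $\lambda^* = 1/2$ (the pair is symmetric, so Chernoff and Bhattacharyya coincide), and $e^*_{\min} = d_{\min}^2 / (8\sigma^2)$ where $d_{\min} = \min_{m\neq j} |\mu_m - \mu_j|$.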

$-\tfrac{1}{n}\log P_e$ Converging to $e^*_{\min}$

For a 3-ary Gaussian detection problem, plot the empirical exponent $-\tfrac{1}{n}\log \hat P_e^{(n)}$ (from Monte-Carlo simulation) together with the theoretical limit $e^*_{\min}$ as $n$ grows. The empirical curve converges to the horizontal line predicted by the large-deviations theorem.

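A minimal Monte-Carlo sketch of this demo follows; the parameters are assumed to match the demo's values (mean spacing 1.5, $\sigma = 1$, $n$ up to 40), and the variable names and trial count are illustrative.

```python
import numpy as np

# Minimal Monte-Carlo sketch of the demo above. Assumed (illustrative) parameters:
# means 0, 1.5, 3.0 (spacing d_min = 1.5), sigma = 1, n up to 40, equal priors.
rng = np.random.default_rng(0)
means = np.array([0.0, 1.5, 3.0])
sigma, n_trials = 1.0, 200_000
M = len(means)

# Theory (equal-variance Gaussians): e*_min = d_min^2 / (8 sigma^2) = 0.28125.
e_star_min = 1.5**2 / (8 * sigma**2)

for n in (2, 5, 10, 20, 40):
    m_true = rng.integers(M, size=n_trials)               # equiprobable hypotheses
    y = means[m_true][:, None] + sigma * rng.standard_normal((n_trials, n))
    # With equal variances and equal priors, ML detection is nearest-mean
    # on the sample average (a sufficient statistic).
    m_hat = np.abs(y.mean(axis=1)[:, None] - means[None, :]).argmin(axis=1)
    p_e = np.mean(m_hat != m_true)
    if p_e > 0:                                            # skip n where no errors occurred
        print(f"n = {n:2d}:  -log(Pe)/n = {-np.log(p_e) / n:.3f}   (e*_min = {e_star_min:.3f})")
```

The empirical exponent approaches $e^*_{\min}$ from above, and the gap closes only like $(\log n)/n$ because of the sub-exponential prefactor, so the curve sits visibly above the asymptote at moderate $n$.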

Stein vs. Chernoff Exponents for Binary Hypothesis Testing

| Regime | Error controlled | Exponent |
| --- | --- | --- |
| Stein (NP) | $P_F \leq \alpha$ fixed, minimize $P_M$ | $D(f_0 \Vert f_1)$ |
| Chernoff (Bayes) | Minimize $\pi_0 P_F + \pi_1 P_M$ | $e^* = -\min_{\lambda \in [0,1]} \log \int f_0^{1-\lambda} f_1^{\lambda}$ |
| Bhattacharyya | Upper bound (simpler) | $-\log \int \sqrt{f_0 f_1}$ (Chernoff objective at $\lambda = 1/2$) |
| Hoeffding | $P_M \leq \beta$ fixed, minimize $P_F$ | $D(f_1 \Vert f_0)$ |

Chernoff vs. Bhattacharyya Exponents

The Bhattacharyya exponent is obtained from the Chernoff exponent by fixing $\lambda = 1/2$ instead of optimizing. For symmetric hypothesis pairs (equal variances in the Gaussian case, equal crossover probabilities in the BSC) the optimum is $\lambda^* = 1/2$ and the two coincide. For asymmetric pairs they differ, but the Bhattacharyya exponent is always $\leq$ the Chernoff exponent, and it is often used in practice because it admits closed forms for many common families (Gaussians and other exponential families in particular).
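
A numerical illustration of the gap; the asymmetric Gaussian pair and helper name below are hypothetical, chosen only so that the optimizer lands away from $\lambda = 1/2$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical asymmetric pair: different variances, so lambda* need not be 1/2.
f0, f1 = norm(0, 1), norm(1, 2)   # N(0, 1) and N(1, 2^2)

def mu(lam):
    """mu(lambda) = -log ∫ f0^(1-lambda) f1^lambda dy, evaluated by quadrature."""
    integrand = lambda y: np.exp((1 - lam) * f0.logpdf(y) + lam * f1.logpdf(y))
    return -np.log(quad(integrand, -np.inf, np.inf)[0])

res = minimize_scalar(lambda lam: -mu(lam), bounds=(0.0, 1.0), method="bounded")
print(f"lambda* = {res.x:.3f}   Chernoff = {-res.fun:.4f}   Bhattacharyya = {mu(0.5):.4f}")
```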

Why This Matters: Error Exponents in Channel Coding

Gallager's random-coding exponent for channel coding, $E_r(R) = \max_{0\leq\rho\leq 1}\,[E_0(\rho) - \rho R]$, is built from exactly the same Chernoff-style KL quantities as the detection exponent in this section. When you read that "LDPC codes achieve capacity with vanishing BER for rates below $C$", the underlying mechanism is the same large-deviations concentration that drives $P_e^{(n)} \to 0$ in detection. This is why advanced modulation-and-coding schemes are designed jointly: the code picks a set of codewords whose pairwise Chernoff exponents are uniformly large.
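
For reference, Gallager's function for a discrete memoryless channel $P(y\mid x)$ with input distribution $Q$ has the standard form
$$E_0(\rho) \;=\; -\log \sum_{y}\Bigl[\sum_{x} Q(x)\, P(y\mid x)^{\frac{1}{1+\rho}}\Bigr]^{1+\rho},$$
and for any two codewords $\mathbf{x}, \mathbf{x}'$ the pairwise error probability obeys the union-Bhattacharyya bound $P(\mathbf{x}\to\mathbf{x}') \leq \prod_i \sum_y \sqrt{P(y\mid x_i)\, P(y\mid x_i')}$, a product of exactly the per-letter Bhattacharyya coefficients $e^{-\mu}$ from this section.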

See full treatment in Chapter 12

Common Mistake: KL Divergence Is Not Symmetric

Mistake:

Students sometimes write "the KL distance between $f$ and $g$" and use $D(f\|g)$ interchangeably with $D(g\|f)$.

Correction:

KL divergence is not a metric: $D(f\|g) \neq D(g\|f)$ in general, and the triangle inequality fails. In detection contexts, $D(f_0\|f_1)$ is the Stein exponent controlling the miss probability (fix $P_F$, minimize $P_M$), while $D(f_1\|f_0)$ controls the false-alarm exponent when you fix $P_M$. A reliable way to pick the right one: by Sanov's theorem the exponent is $D(P\|Q)$, where $Q$ is the distribution the observations are actually drawn from and $P$ is the distribution the detector is being fooled into accepting.
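
A quick numerical illustration with an asymmetric Gaussian pair (the closed-form helper below is for univariate Gaussians; the specific variances are illustrative):

```python
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form D( N(mu0, s0^2) || N(mu1, s1^2) ) for univariate Gaussians."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

# An asymmetric pair: the two directions disagree badly, so "the KL distance"
# is not well defined without specifying the order of the arguments.
print(kl_gauss(0, 1, 0, 3))   # D(N(0,1) || N(0,9)) ~= 0.65
print(kl_gauss(0, 3, 0, 1))   # D(N(0,9) || N(0,1)) ~= 2.90
```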

Chernoff Exponent

Given two densities $f_0, f_1$, the Chernoff exponent is $e^* = -\min_{\lambda \in [0,1]} \log\int f_0^{1-\lambda} f_1^\lambda\,dy$. It controls the asymptotic decay of the minimum Bayes error probability for binary hypothesis testing on $n$ i.i.d. samples.

Related: Bhattacharyya Bound, KL Divergence, Large Deviations

Quick Check

You use $M = 4$ equiprobable hypotheses. Two of the six pairs share exponent $e = 1$, three share $e = 2$, and one pair has $e = 3$. What is the asymptotic exponent of the ML detector's error probability with $n$ i.i.d. samples?

3

2

1

1.83 (harmonic mean)

🔧 Engineering Note

Operational Gap Between Capacity and Uncoded BER

A common source of confusion: the AWGN channel has capacity $C = \log_2(1 + \mathrm{SNR})$ bits/channel use, but uncoded 16-QAM at an SNR of 10 dB produces a BER on the order of $10^{-2}$, hardly "error-free transmission at 4 bits/channel use." The resolution: capacity promises vanishing BER at any rate $R < C$ with long-blocklength coding. The uncoded SER of this chapter is the pre-decoder error probability; Gray coding maps it to $\mathrm{BER} \approx \mathrm{SER}/\log_2 M$, and the outer code (LDPC, Polar) drives the BER toward zero. In 5G NR at high SNR, the outer code operates at a raw channel BER around $10^{-2}$ and produces a post-decoding BER below $10^{-6}$, a four-orders-of-magnitude improvement delivered by the KL-divergence-controlled error exponent of the code.
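
A small numerical sketch of this gap; the exact uncoded figure depends on whether "SNR" is read as $E_s/N_0$ or $E_b/N_0$, so both are shown (standard nearest-neighbour square-QAM approximation; the helper name is illustrative).

```python
import numpy as np
from scipy.stats import norm

def qam16_ber(es_n0):
    """Gray-coded 16-QAM BER from the nearest-neighbour square-QAM SER approximation."""
    ser = 4 * (1 - 1 / np.sqrt(16)) * norm.sf(np.sqrt(3 * es_n0 / 15))
    return ser / 4   # Gray mapping: roughly one bit error per symbol error

snr = 10 ** (10.0 / 10)   # 10 dB
print(f"AWGN capacity:            {np.log2(1 + snr):.2f} bits/channel use")
print(f"Uncoded BER (SNR=Es/N0):  {qam16_ber(snr):.1e}")
print(f"Uncoded BER (SNR=Eb/N0):  {qam16_ber(4 * snr):.1e}")
```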

📋 Ref: Shannon 1948; Gallager 1968

Chernoff information

For two distributions $f_0, f_1$, the Chernoff information is $C(f_0, f_1) = \max_{s \in [0,1]} \mu(s)$ with $\mu(s) = -\log \int f_0^{1-s} f_1^s$. It is the exponential rate at which the minimum Bayes error probability decays with the number of i.i.d. observations.

Related: KL divergence, Bhattacharyya distance, Error exponent