Sanov's Theorem and Large Deviations

From Counting to Geometry

We now know how to compute the probability of individual type classes. But in most applications, we care about the probability of a set of distributions: for example, "what is the probability that the empirical distribution lands in a set $\mathcal{E}$?" Sanov's theorem gives a strikingly geometric answer: the probability decays exponentially at a rate given by the KL divergence of the closest distribution in $\mathcal{E}$ to the true distribution $P$. This turns probability questions into optimization problems on the simplex of distributions.

Definition: The Probability Simplex

The probability simplex on alphabet $\mathcal{X}$ is $$\mathcal{P}(\mathcal{X}) = \left\{Q : \mathcal{X} \to [0,1] \;\middle|\; \sum_{a \in \mathcal{X}} Q(a) = 1\right\}.$$ This is a $(|\mathcal{X}|-1)$-dimensional convex polytope. For a binary alphabet, it is the interval $[0,1]$. For a ternary alphabet, it is a triangle. Types $\mathcal{P}_n(\mathcal{X})$ form a grid of rational points on this simplex, becoming denser as $n$ grows.
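As a quick concrete check (a minimal sketch, not from the text; the helper `types` and the choice of a ternary alphabet are my own), the grid of types can be enumerated directly, and its size matches the stars-and-bars count $\binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$:

```python
import math
from itertools import product

def types(n, k=3):
    """All types (empirical distributions with denominator n) on a k-letter alphabet."""
    return [tuple(c / n for c in counts)
            for counts in product(range(n + 1), repeat=k)
            if sum(counts) == n]

for n in (2, 4, 8):
    grid = types(n)
    # |P_n(X)| = C(n + k - 1, k - 1), polynomial in n: the grid densifies as n grows
    print(n, len(grid), math.comb(n + 2, 2))
```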

Theorem: Sanov's Theorem

Let $X_1, \ldots, X_n$ be i.i.d. $\sim P$ on a finite alphabet $\mathcal{X}$, and let $\mathcal{E} \subseteq \mathcal{P}(\mathcal{X})$ be a set of distributions. Then:

Upper bound: For any set $\mathcal{E}$, $$P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \leq (n+1)^{|\mathcal{X}|} \cdot 2^{-n D^*(\mathcal{E} \| P)},$$ where $D^*(\mathcal{E} \| P) = \inf_{Q \in \mathcal{E}} D(Q \| P)$ is the divergence achieved by the information projection of $P$ onto $\mathcal{E}$.

Lower bound: For every type $Q \in \mathcal{E} \cap \mathcal{P}_n(\mathcal{X})$, $$P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \geq (n+1)^{-|\mathcal{X}|} \cdot 2^{-n D(Q \| P)}.$$ If $\mathcal{E}$ is the closure of its interior, types in $\mathcal{E}$ approach the minimizer as $n$ grows, so $-\frac{1}{n} \log P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \to D^*(\mathcal{E} \| P)$.

In exponential notation: $P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \doteq 2^{-n D^*(\mathcal{E} \| P)}$.
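To make the optimization concrete, here is a hedged numerical sketch of computing $D^*(\mathcal{E} \| P)$ when $\mathcal{E}$ is a linear-constraint set $\{Q : \mathbb{E}_Q[f] \geq t\}$. The function names (`kl_bits`, `sanov_exponent`), the use of scipy, and the fair-die example are my own illustrative choices, not part of the theorem:

```python
import numpy as np
from scipy.optimize import minimize

def kl_bits(q, p):
    """KL divergence D(q || p) in bits; terms with q(a) = 0 contribute zero."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

def sanov_exponent(p, f, t):
    """min_Q D(Q || P) over E = {Q : sum_a Q(a) f(a) >= t}, returned in bits."""
    k = len(p)
    cons = [{"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},   # Q is a distribution
            {"type": "ineq", "fun": lambda q: q @ f - t}]         # E_Q[f] >= t
    res = minimize(lambda q: kl_bits(q, p),
                   x0=np.full(k, 1.0 / k),
                   bounds=[(1e-12, 1.0)] * k,
                   constraints=cons)
    return res.fun, res.x

# Illustrative assumption: fair six-sided die, E = "empirical average face value >= 4.5"
p = np.full(6, 1.0 / 6.0)
f = np.arange(1.0, 7.0)
rate, q_star = sanov_exponent(p, f, 4.5)
print(f"exponent = {rate:.4f} bits/sample, I-projection Q* = {np.round(q_star, 4)}")
```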

Imagine the simplex of distributions with the true distribution $P$ sitting at some point. The set $\mathcal{E}$ is a region in this simplex. Sanov's theorem says: the probability of the empirical distribution falling in $\mathcal{E}$ is determined by the closest point in $\mathcal{E}$ to $P$, where "distance" is KL divergence. The KL divergence plays the role of a "cost" for the empirical distribution to deviate from the truth.

The point is that $D$ is not a true distance (it is not symmetric and does not satisfy the triangle inequality), but it behaves like one for this purpose. The minimizer $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P)$ is called the I-projection or information projection of $P$ onto $\mathcal{E}$.


Example: Probability of a Biased Empirical Frequency

A fair coin ($P(H) = P(T) = 1/2$) is flipped $n$ times. What is the exponential rate at which the probability of observing at least 60% heads decays? By Sanov's theorem, the rate is $\min_{Q : Q(H) \geq 0.6} D(Q \| P)$, attained at the I-projection $Q^* = (0.6, 0.4)$: the exponent is $D((0.6, 0.4) \| (0.5, 0.5)) = 0.6 \log_2 1.2 + 0.4 \log_2 0.8 \approx 0.029$ bits per flip.
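A small script (my own check, using scipy's exact binomial tail rather than anything from the text) confirms that the tail probability of "at least 60% heads" has an exponent approaching this rate from above:

```python
import numpy as np
from scipy.stats import binom

def kl_bits(q, p):
    """KL divergence D(q || p) in bits for distributions given as vectors."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

rate = kl_bits([0.6, 0.4], [0.5, 0.5])        # Sanov exponent, about 0.029 bits/flip
for n in (100, 1000, 10000):
    k = int(np.ceil(0.6 * n))                 # "at least 60% heads" threshold
    tail = binom.sf(k - 1, n, 0.5)            # exact P(#heads >= k) under the fair coin
    print(n, round(rate, 4), round(-np.log2(tail) / n, 4))   # empirical exponent -> rate
```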

Interactive figure: Sanov's Theorem on the Probability Simplex

Visualize the probability simplex for a ternary alphabet. Place the true distribution $P$, drag the region $\mathcal{E}$, and observe how the I-projection (the closest point in KL divergence) determines the decay exponent. Contours of constant KL divergence from $P$ are shown.

Definition: Information Projection (I-Projection)

Given a distribution $P$ and a closed convex set $\mathcal{E} \subseteq \mathcal{P}(\mathcal{X})$, the I-projection (or information projection) of $P$ onto $\mathcal{E}$ is $$Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P).$$ When $\mathcal{E}$ is a convex set, the I-projection is unique (because $D(Q \| P)$ is strictly convex in $Q$ for fixed $P$).

There is a dual concept: the M-projection (or moment projection) $P^* = \arg\min_{Q \in \mathcal{E}} D(P \| Q)$, which reverses the order of arguments. The I-projection preserves the support of $P$; the M-projection matches moments. In exponential families, these two projections have elegant geometric interpretations via Pythagorean-like identities for KL divergence.
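The sketch below (my own illustration; the moment-constraint set and score function $f$ are assumptions, not from the text) computes both projections numerically onto the same convex set $\mathcal{E} = \{Q : \mathbb{E}_Q[f] \geq t\}$, showing that they generally land at different points:

```python
import numpy as np
from scipy.optimize import minimize

def D_bits(a, b):
    """KL divergence D(a || b) in bits."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

def project(p, objective, f, t):
    """Minimize `objective` over {Q on the simplex : E_Q[f] >= t}."""
    cons = [{"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},
            {"type": "ineq", "fun": lambda q: q @ f - t}]
    res = minimize(objective, x0=np.full(len(p), 1.0 / len(p)),
                   bounds=[(1e-9, 1.0)] * len(p), constraints=cons)
    return res.x

p = np.array([0.5, 0.3, 0.2])
f = np.array([1.0, 2.0, 3.0])                          # a score function; E_P[f] = 1.7
i_proj = project(p, lambda q: D_bits(q, p), f, 2.5)    # I-projection: Q in first slot
m_proj = project(p, lambda q: D_bits(p, q), f, 2.5)    # M-projection: Q in second slot
print(np.round(i_proj, 4), np.round(m_proj, 4))        # generally different distributions
```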

Theorem: Application: Neyman–Pearson Hypothesis Testing Exponent

Consider testing $H_0: X_1, \ldots, X_n \sim P$ against $H_1: X_1, \ldots, X_n \sim Q$. Among all tests with type-I error probability at most $\alpha \in (0,1)$, the best achievable type-II error exponent (as $n \to \infty$) is $D(P \| Q)$. More precisely: for any test with $P^n(\text{reject } H_0) \leq \alpha$, $$\liminf_{n \to \infty} -\frac{1}{n} \log Q^n(\text{accept } H_0) \leq D(P \| Q),$$ and equality is achievable by the likelihood ratio test (Stein's lemma).

Sanov's theorem tells us that the probability of the empirical distribution looking like $P$ when the truth is $Q$ decays as $2^{-nD(P \| Q)}$. This is precisely the type-II error of the optimal test. The KL divergence thus has a direct operational meaning as the best achievable error exponent in binary hypothesis testing.
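A rough numerical illustration of Stein's lemma (the Bernoulli pair, the type-I budget, and the threshold-test formulation are my own choices): for $P = \mathrm{Bern}(0.5)$ against $Q = \mathrm{Bern}(0.8)$, fix the type-I error and watch the type-II error exponent of the threshold test (which is the likelihood-ratio test for this pair) climb toward $D(P \| Q) \approx 0.322$ bits; convergence is slow because of a $\sqrt{n}$ correction:

```python
import numpy as np
from scipy.stats import binom

p, q, alpha = 0.5, 0.8, 0.05          # H0: Bern(p), H1: Bern(q), type-I error budget
D_pq = p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))   # about 0.322 bits

for n in (100, 1000, 2000):
    k = int(binom.ppf(1 - alpha, n, p))   # reject H0 when #heads > k, so type-I <= alpha
    beta = binom.cdf(k, n, q)             # type-II error: accept H0 although H1 is true
    print(n, round(D_pq, 3), round(-np.log2(beta) / n, 3))   # exponent approaches D(P||Q)
```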

Quick Check

In Sanov's theorem, if the set $\mathcal{E}$ contains the true distribution $P$, then $D^*(\mathcal{E} \| P) = \min_{Q \in \mathcal{E}} D(Q \| P)$ equals:

$H(P)$

$0$

$+\infty$

Depends on $|\mathcal{X}|$

Common Mistake: KL Divergence Asymmetry in Sanov's Theorem

Mistake:

Writing the exponent in Sanov's theorem as $\min_{Q \in \mathcal{E}} D(P \| Q)$ (with $P$ in the first argument) instead of $\min_{Q \in \mathcal{E}} D(Q \| P)$.

Correction:

The correct exponent is $D(Q \| P)$: the type $Q$ appears first, the truth $P$ appears second. This is because $P^n(T_Q) \doteq 2^{-nD(Q \| P)}$: the probability of seeing type $Q$ under truth $P$ is governed by $D(Q \| P)$. Swapping the arguments gives Stein's lemma exponent for hypothesis testing, which is a different (though related) quantity.
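A two-line numeric check (my own toy numbers, not from the text) makes the asymmetry concrete:

```python
import numpy as np

def D_bits(a, b):
    """KL divergence D(a || b) in bits."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

P, Q = [0.5, 0.5], [0.9, 0.1]
print(D_bits(Q, P), D_bits(P, Q))   # about 0.531 vs 0.737 bits: the order matters
```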

I-projection (information projection)

The distribution $Q^* \in \mathcal{E}$ closest to $P$ in KL divergence: $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P)$. Determines the exponential rate in Sanov's theorem.

Related: Information Projection (I-Projection)

Large deviations

The study of exponential decay rates for probabilities of rare events. In the i.i.d. setting, Sanov's theorem provides the large deviations rate function for the empirical distribution. The rate function is the KL divergence $D(\cdot \| P)$.

Why This Matters: Sanov's Theorem in Spectrum Sensing

In cognitive radio, a secondary user must decide whether a primary user is transmitting (hypothesis testing on the received signal distribution). The error exponent from Sanov's theorem determines how many samples are needed to achieve a target detection reliability. See Book telecom, Ch. 9 for detection and estimation fundamentals.

Key Takeaway

Sanov's theorem is the master large deviations result for i.i.d. sequences: the probability of the empirical distribution falling in a set $\mathcal{E}$ decays as $2^{-n D^*(\mathcal{E} \| P)}$, where $D^*$ is the minimum KL divergence from any distribution in $\mathcal{E}$ to the truth $P$. This converts probability problems into geometry problems on the simplex.