Sanov's Theorem and Large Deviations

From Counting to Geometry

We now know how to compute the probability of individual type classes. But in most applications, we care about the probability of a set of distributions: for example, "what is the probability that the empirical distribution lands in a set $\mathcal{E}$?" Sanov's theorem gives a strikingly geometric answer: the probability decays exponentially at a rate given by the KL divergence of the closest distribution in $\mathcal{E}$ to the true distribution $P$. This turns probability questions into optimization problems on the simplex of distributions.

Definition: The Probability Simplex

The probability simplex on alphabet $\mathcal{X}$ is $$\mathcal{P}(\mathcal{X}) = \left\{Q : \mathcal{X} \to [0,1] \;\middle|\; \sum_{a \in \mathcal{X}} Q(a) = 1\right\}.$$ This is a $(|\mathcal{X}|-1)$-dimensional convex polytope. For a binary alphabet, it is the interval $[0,1]$. For a ternary alphabet, it is a triangle. Types $\mathcal{P}_n(\mathcal{X})$ form a grid of rational points on this simplex, becoming denser as $n$ grows.
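As a quick concrete check (a minimal sketch, not from the text; the helper `types` and the choice of a ternary alphabet are my own), the grid of types can be enumerated directly, and its size matches the stars-and-bars count $\binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$:

```python
import math
from itertools import product

def types(n, k=3):
    """All types (empirical distributions with denominator n) on a k-letter alphabet."""
    return [tuple(c / n for c in counts)
            for counts in product(range(n + 1), repeat=k)
            if sum(counts) == n]

for n in (2, 4, 8):
    grid = types(n)
    # |P_n(X)| = C(n + k - 1, k - 1), polynomial in n: the grid densifies as n grows
    print(n, len(grid), math.comb(n + 2, 2))
```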

Theorem: Sanov's Theorem

Let $X_1, \ldots, X_n$ be i.i.d. $\sim P$ on a finite alphabet $\mathcal{X}$, and let $\mathcal{E} \subseteq \mathcal{P}(\mathcal{X})$ be a set of distributions. Then:

Upper bound: For any set $\mathcal{E}$, $$P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \leq (n+1)^{|\mathcal{X}|} \cdot 2^{-n D^*(\mathcal{E} \| P)},$$ where $D^*(\mathcal{E} \| P) = \inf_{Q \in \mathcal{E}} D(Q \| P)$ is the divergence achieved by the information projection of $P$ onto $\mathcal{E}$.

Lower bound: For every type $Q \in \mathcal{E} \cap \mathcal{P}_n(\mathcal{X})$, $$P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \geq (n+1)^{-|\mathcal{X}|} \cdot 2^{-n D(Q \| P)}.$$ If $\mathcal{E}$ is the closure of its interior, types in $\mathcal{E}$ approach the minimizer as $n$ grows, so $-\frac{1}{n} \log P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \to D^*(\mathcal{E} \| P)$.

In exponential notation: $P^n(\hat{P}_{\mathbf{X}} \in \mathcal{E}) \doteq 2^{-n D^*(\mathcal{E} \| P)}$.
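To make the optimization concrete, here is a hedged numerical sketch of computing $D^*(\mathcal{E} \| P)$ when $\mathcal{E}$ is a linear-constraint set $\{Q : \mathbb{E}_Q[f] \geq t\}$. The function names (`kl_bits`, `sanov_exponent`), the use of scipy, and the fair-die example are my own illustrative choices, not part of the theorem:

```python
import numpy as np
from scipy.optimize import minimize

def kl_bits(q, p):
    """KL divergence D(q || p) in bits; terms with q(a) = 0 contribute zero."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

def sanov_exponent(p, f, t):
    """min_Q D(Q || P) over E = {Q : sum_a Q(a) f(a) >= t}, returned in bits."""
    k = len(p)
    cons = [{"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},   # Q is a distribution
            {"type": "ineq", "fun": lambda q: q @ f - t}]         # E_Q[f] >= t
    res = minimize(lambda q: kl_bits(q, p),
                   x0=np.full(k, 1.0 / k),
                   bounds=[(1e-12, 1.0)] * k,
                   constraints=cons)
    return res.fun, res.x

# Illustrative assumption: fair six-sided die, E = "empirical average face value >= 4.5"
p = np.full(6, 1.0 / 6.0)
f = np.arange(1.0, 7.0)
rate, q_star = sanov_exponent(p, f, 4.5)
print(f"exponent = {rate:.4f} bits/sample, I-projection Q* = {np.round(q_star, 4)}")
```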

Imagine the simplex of distributions with the true distribution $P$ sitting at some point. The set $\mathcal{E}$ is a region in this simplex. Sanov's theorem says: the probability of the empirical distribution falling in $\mathcal{E}$ is determined by the closest point in $\mathcal{E}$ to $P$, where "distance" is KL divergence. The KL divergence plays the role of a "cost" for the empirical distribution to deviate from the truth.

The point is that $D$ is not a true distance (it is not symmetric and does not satisfy the triangle inequality), but it behaves like one for this purpose. The minimizer $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P)$ is called the I-projection or information projection of $P$ onto $\mathcal{E}$.


Example: Probability of a Biased Empirical Frequency

A fair coin ($P(H) = P(T) = 1/2$) is flipped $n$ times. What is the exponential rate at which the probability of observing at least 60% heads decays? By Sanov's theorem, the rate is $\min_{Q : Q(H) \geq 0.6} D(Q \| P)$, attained at the I-projection $Q^* = (0.6, 0.4)$: the exponent is $D((0.6, 0.4) \| (0.5, 0.5)) = 0.6 \log_2 1.2 + 0.4 \log_2 0.8 \approx 0.029$ bits per flip.
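A small script (my own check, using scipy's exact binomial tail rather than anything from the text) confirms that the tail probability of "at least 60% heads" has an exponent approaching this rate from above:

```python
import numpy as np
from scipy.stats import binom

def kl_bits(q, p):
    """KL divergence D(q || p) in bits for distributions given as vectors."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log2(q[m] / p[m])))

rate = kl_bits([0.6, 0.4], [0.5, 0.5])        # Sanov exponent, about 0.029 bits/flip
for n in (100, 1000, 10000):
    k = int(np.ceil(0.6 * n))                 # "at least 60% heads" threshold
    tail = binom.sf(k - 1, n, 0.5)            # exact P(#heads >= k) under the fair coin
    print(n, round(rate, 4), round(-np.log2(tail) / n, 4))   # empirical exponent -> rate
```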

Interactive figure: Sanov's Theorem on the Probability Simplex

Visualize the probability simplex for a ternary alphabet. Place the true distribution $P$, drag the region $\mathcal{E}$, and observe how the I-projection (the closest point in KL divergence) determines the decay exponent. Contours of constant KL divergence from $P$ are shown.

Definition: Information Projection (I-Projection)

Given a distribution $P$ and a closed convex set $\mathcal{E} \subseteq \mathcal{P}(\mathcal{X})$, the I-projection (or information projection) of $P$ onto $\mathcal{E}$ is $$Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P).$$ When $\mathcal{E}$ is a convex set, the I-projection is unique (because $D(Q \| P)$ is strictly convex in $Q$ for fixed $P$).

There is a dual concept: the M-projection (or moment projection) $P^* = \arg\min_{Q \in \mathcal{E}} D(P \| Q)$, which reverses the order of arguments. The I-projection preserves the support of $P$; the M-projection matches moments. In exponential families, these two projections have elegant geometric interpretations via Pythagorean-like identities for KL divergence.
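The sketch below (my own illustration; the moment-constraint set and score function $f$ are assumptions, not from the text) computes both projections numerically onto the same convex set $\mathcal{E} = \{Q : \mathbb{E}_Q[f] \geq t\}$, showing that they generally land at different points:

```python
import numpy as np
from scipy.optimize import minimize

def D_bits(a, b):
    """KL divergence D(a || b) in bits."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

def project(p, objective, f, t):
    """Minimize `objective` over {Q on the simplex : E_Q[f] >= t}."""
    cons = [{"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},
            {"type": "ineq", "fun": lambda q: q @ f - t}]
    res = minimize(objective, x0=np.full(len(p), 1.0 / len(p)),
                   bounds=[(1e-9, 1.0)] * len(p), constraints=cons)
    return res.x

p = np.array([0.5, 0.3, 0.2])
f = np.array([1.0, 2.0, 3.0])                          # a score function; E_P[f] = 1.7
i_proj = project(p, lambda q: D_bits(q, p), f, 2.5)    # I-projection: Q in first slot
m_proj = project(p, lambda q: D_bits(p, q), f, 2.5)    # M-projection: Q in second slot
print(np.round(i_proj, 4), np.round(m_proj, 4))        # generally different distributions
```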

Theorem: Application: Neyman–Pearson Hypothesis Testing Exponent

Consider testing $H_0: X_1, \ldots, X_n \sim P$ against $H_1: X_1, \ldots, X_n \sim Q$. Among all tests with type-I error probability at most $\alpha \in (0,1)$, the best achievable type-II error exponent (as $n \to \infty$) is $D(P \| Q)$. More precisely: for any test with $P^n(\text{reject } H_0) \leq \alpha$, $$\liminf_{n \to \infty} -\frac{1}{n} \log Q^n(\text{accept } H_0) \leq D(P \| Q),$$ and equality is achievable by the likelihood ratio test (Stein's lemma).

Sanov's theorem tells us that the probability of the empirical distribution looking like $P$ when the truth is $Q$ decays as $2^{-nD(P \| Q)}$. This is precisely the type-II error of the optimal test. The KL divergence thus has a direct operational meaning as the best achievable error exponent in binary hypothesis testing.
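A rough numerical illustration of Stein's lemma (the Bernoulli pair, the type-I budget, and the threshold-test formulation are my own choices): for $P = \mathrm{Bern}(0.5)$ against $Q = \mathrm{Bern}(0.8)$, fix the type-I error and watch the type-II error exponent of the threshold test (which is the likelihood-ratio test for this pair) climb toward $D(P \| Q) \approx 0.322$ bits; convergence is slow because of a $\sqrt{n}$ correction:

```python
import numpy as np
from scipy.stats import binom

p, q, alpha = 0.5, 0.8, 0.05          # H0: Bern(p), H1: Bern(q), type-I error budget
D_pq = p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))   # about 0.322 bits

for n in (100, 1000, 2000):
    k = int(binom.ppf(1 - alpha, n, p))   # reject H0 when #heads > k, so type-I <= alpha
    beta = binom.cdf(k, n, q)             # type-II error: accept H0 although H1 is true
    print(n, round(D_pq, 3), round(-np.log2(beta) / n, 3))   # exponent approaches D(P||Q)
```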

Quick Check

In Sanov's theorem, if the set $\mathcal{E}$ contains the true distribution $P$, then $D^*(\mathcal{E} \| P) = \min_{Q \in \mathcal{E}} D(Q \| P)$ equals:

$H(P)$

$0$

$+\infty$

Depends on $|\mathcal{X}|$

Common Mistake: KL Divergence Asymmetry in Sanov's Theorem

Mistake:

Writing the exponent in Sanov's theorem as $\min_{Q \in \mathcal{E}} D(P \| Q)$ (with $P$ in the first argument) instead of $\min_{Q \in \mathcal{E}} D(Q \| P)$.

Correction:

The correct exponent is $D(Q \| P)$: the type $Q$ appears first, the truth $P$ appears second. This is because $P^n(T_Q) \doteq 2^{-nD(Q \| P)}$: the probability of seeing type $Q$ under truth $P$ is governed by $D(Q \| P)$. Swapping the arguments gives Stein's lemma exponent for hypothesis testing, which is a different (though related) quantity.
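A two-line numeric check (my own toy numbers, not from the text) makes the asymmetry concrete:

```python
import numpy as np

def D_bits(a, b):
    """KL divergence D(a || b) in bits."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = a > 0
    return float(np.sum(a[m] * np.log2(a[m] / b[m])))

P, Q = [0.5, 0.5], [0.9, 0.1]
print(D_bits(Q, P), D_bits(P, Q))   # about 0.531 vs 0.737 bits: the order matters
```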

I-projection (information projection)

The distribution $Q^* \in \mathcal{E}$ closest to $P$ in KL divergence: $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P)$. Determines the exponential rate in Sanov's theorem.

Related: Information Projection (I-Projection)

Large deviations

The study of exponential decay rates for probabilities of rare events. In the i.i.d. setting, Sanov's theorem provides the large deviations rate function for the empirical distribution. The rate function is the KL divergence $D(\cdot \| P)$.

Why This Matters: Sanov's Theorem in Spectrum Sensing

In cognitive radio, a secondary user must decide whether a primary user is transmitting (hypothesis testing on the received signal distribution). The error exponent from Sanov's theorem determines how many samples are needed to achieve a target detection reliability. See Book telecom, Ch. 9 for detection and estimation fundamentals.

Key Takeaway

Sanov's theorem is the master large deviations result for i.i.d. sequences: the probability of the empirical distribution falling in a set $\mathcal{E}$ decays as $2^{-n D^*(\mathcal{E} \| P)}$, where $D^*$ is the minimum KL divergence from any distribution in $\mathcal{E}$ to the truth $P$. This converts probability problems into geometry problems on the simplex.