Sanov's Theorem and Large Deviations
From Counting to Geometry
We now know how to compute the probability of individual type classes. But in most applications, we care about the probability of a whole set of distributions: for example, "what is the probability that the empirical distribution lands in a set $E$?" Sanov's theorem gives a strikingly geometric answer: the probability decays exponentially, at a rate given by the KL divergence from the closest distribution in $E$ to the true distribution $Q$. This turns probability questions into optimization problems on the simplex of distributions.
Definition: The Probability Simplex
The probability simplex on alphabet $\mathcal{X}$ is
$$\Delta_{\mathcal{X}} = \Big\{ P \in \mathbb{R}^{|\mathcal{X}|} : P(x) \ge 0 \text{ for all } x, \ \sum_{x \in \mathcal{X}} P(x) = 1 \Big\}.$$
This is a $(|\mathcal{X}| - 1)$-dimensional convex polytope. For a binary alphabet, it is the interval $[0, 1]$. For a ternary alphabet, it is a triangle. Types form a grid of rational points on this simplex, becoming denser as $n$ grows.
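To make the grid of types concrete, here is a minimal Python sketch (the helper `types` and the chosen values of `n` and `k` are illustrative, not from the text) that enumerates every type with denominator $n$ on a $k$-letter alphabet and compares their count to the crude bound $(n+1)^{|\mathcal{X}|}$:

```python
from itertools import product
from fractions import Fraction

def types(n, k):
    """Enumerate all types (empirical distributions) with denominator n on a k-letter alphabet."""
    # A type assigns counts (c_1, ..., c_k) summing to n; the type itself is (c_1/n, ..., c_k/n).
    for counts in product(range(n + 1), repeat=k - 1):
        if sum(counts) <= n:
            full = counts + (n - sum(counts),)
            yield tuple(Fraction(c, n) for c in full)

n, k = 10, 3
all_types = list(types(n, k))
print(len(all_types))   # number of types: C(n+k-1, k-1) = C(12, 2) = 66
print((n + 1) ** k)     # crude polynomial bound (n+1)^{|X|} = 1331
```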
Theorem: Sanov's Theorem
Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim Q$ on a finite alphabet $\mathcal{X}$, and let $E \subseteq \Delta_{\mathcal{X}}$ be a set of distributions. Then:

Upper bound: for every $n$,
$$Q^n(E) \le (n+1)^{|\mathcal{X}|} \, 2^{-n D(P^* \| Q)},$$
where $P^* = \arg\min_{P \in E} D(P \| Q)$ is the information projection of $Q$ onto $E$.

Lower bound: if $E$ is the closure of its interior (so that $E$ contains interior points arbitrarily close to $P^*$),
$$\liminf_{n \to \infty} \frac{1}{n} \log Q^n(E) \ge -D(P^* \| Q).$$

In exponential notation, when both bounds apply: $Q^n(E) \doteq 2^{-n D(P^* \| Q)}$.
Imagine the simplex of distributions with the true distribution $Q$ sitting at some point. The set $E$ is a region in this simplex. Sanov's theorem says: the probability of the empirical distribution falling in $E$ is determined by the closest point in $E$ to $Q$, where "distance" is KL divergence. The KL divergence plays the role of a "cost" for the empirical distribution to deviate from the truth.
The point is that $D(\cdot \| \cdot)$ is not a true distance (it is not symmetric and does not satisfy the triangle inequality), but it behaves like one for this purpose. The minimizer $P^* = \arg\min_{P \in E} D(P \| Q)$ is called the I-projection or information projection of $Q$ onto $E$.
Upper bound by union over types
We write the event as a union over types in $E$:
$$Q^n(E) = \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)).$$
Using $Q^n(T(P)) \le 2^{-n D(P \| Q)}$ and $D(P \| Q) \ge D(P^* \| Q)$ for all $P \in E$:
$$Q^n(E) \le \sum_{P \in E \cap \mathcal{P}_n} 2^{-n D(P \| Q)} \le |\mathcal{P}_n| \, 2^{-n D(P^* \| Q)} \le (n+1)^{|\mathcal{X}|} \, 2^{-n D(P^* \| Q)}.$$
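As a numerical sanity check on the union-over-types bound, the following sketch (binary alphabet, with the set $E = \{P : P(\text{H}) \ge 0.6\}$ from the example below; the variable names are mine) compares the exact probability, the sum of per-type bounds, and the final Sanov bound. At moderate $n$ the $(n+1)^{|\mathcal{X}|}$ prefactor makes the last bound loose, but the chain of inequalities holds:

```python
import math

def D(p, q):
    """Binary KL divergence D((p,1-p) || (q,1-q)) in bits, with 0 log 0 = 0."""
    return sum(a * math.log2(a / b) for a, b in ((p, q), (1 - p, 1 - q)) if a > 0)

n, q, thresh = 100, 0.5, 0.6
exact = sum(math.comb(n, k) * q**k * (1 - q)**(n - k)
            for k in range(n + 1) if k / n >= thresh)
per_type_sum = sum(2 ** (-n * D(k / n, q))
                   for k in range(n + 1) if k / n >= thresh)
sanov_bound = (n + 1) ** 2 * 2 ** (-n * D(thresh, q))

# exact <= per-type sum <= Sanov bound (the last exceeds 1 here: the prefactor is loose at this n)
print(exact, per_type_sum, sanov_bound)
```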
Lower bound via the best type
Let $P_n$ be the type in $E \cap \mathcal{P}_n$ closest to $P^*$. For large enough $n$, such a type exists if $E$ has an interior point, because the types form an ever-denser grid in the simplex. Then:
$$Q^n(E) \ge Q^n(T(P_n)) \ge \frac{1}{(n+1)^{|\mathcal{X}|}} \, 2^{-n D(P_n \| Q)}.$$
Since $P_n \to P^*$ and $D(P_n \| Q) \to D(P^* \| Q)$, the exponent is correct.
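A tiny sketch of how quickly the nearest type approaches a target point (using, purely for illustration, the fair coin $Q$ and the set $E = \{P : P(\text{H}) \ge 0.6\}$ from the example that follows):

```python
import math

def D(p, q):
    """KL divergence D(p || q) in bits for distributions given as lists."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

Q, P_star = [0.5, 0.5], [0.6, 0.4]      # illustrative truth and target point on the boundary of E

for n in (7, 23, 101, 997):
    k = math.ceil(0.6 * n)              # smallest count whose type lies inside E
    P_n = [k / n, 1 - k / n]
    print(n, P_n, round(D(P_n, Q), 5))  # D(P_n || Q) -> D(P* || Q) ≈ 0.02905 as n grows
```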
Example: Probability of a Biased Empirical Frequency
A fair coin ($Q = (1/2, 1/2)$) is flipped $n$ times. What is the exponential rate at which the probability of observing at least 60% heads decays?
Set up the problem
We have $\mathcal{X} = \{\text{H}, \text{T}\}$, $Q = (1/2, 1/2)$, and $E = \{P : P(\text{H}) \ge 0.6\}$.
Find the I-projection
We need $P^* = \arg\min_{P \in E} D(P \| Q)$. Since $Q$ is uniform, $D(P \| Q) = 1 - H(P)$ bits. Minimizing $D(P \| Q)$ is equivalent to maximizing $H(P)$ over $E$. The maximum entropy distribution in $E$ is the boundary point $P^* = (0.6, 0.4)$.
Compute the exponent
$$D(P^* \| Q) = 0.6 \log_2 \frac{0.6}{0.5} + 0.4 \log_2 \frac{0.4}{0.5} \approx 0.0290 \text{ bits},$$
so $\Pr[\text{at least 60\% heads}] \doteq 2^{-0.0290\, n}$. For $n = 100$, the exponential approximation $2^{-2.9} \approx 0.134$ overshoots the exact binomial probability $\approx 0.0284$: the polynomial prefactor matters at moderate $n$, but the exponent is what governs the decay.
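A small sketch (exact binomial computation in log space; the function and variable names are mine) showing the empirical rate $-\frac{1}{n}\log_2 \Pr[\hat{p} \ge 0.6]$ creeping toward the Sanov exponent as $n$ grows:

```python
import math

def log2_prob_at_least(n, q, thresh):
    """log2 of Pr[Binomial(n, q)/n >= thresh], computed exactly via a log-sum-exp."""
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log(1 - q)
            for k in range(n + 1) if k / n >= thresh]
    m = max(logs)
    return (m + math.log(sum(math.exp(x - m) for x in logs))) / math.log(2)

for n in (100, 1000, 10000, 100000):
    rate = -log2_prob_at_least(n, 0.5, 0.6) / n
    print(n, round(rate, 4))   # approaches D((0.6,0.4) || (0.5,0.5)) ≈ 0.0290 bits from above
```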
Sanov's Theorem on the Probability Simplex
Interactive: visualize the probability simplex for a ternary alphabet. Place the true distribution $Q$, drag the region $E$, and observe how the I-projection $P^*$ (the closest point in KL divergence) determines the decay exponent. Contours of constant KL divergence from $Q$ are shown.
Definition: Information Projection (I-Projection)
Given a distribution $Q$ and a closed convex set $E \subseteq \Delta_{\mathcal{X}}$, the I-projection (or information projection) of $Q$ onto $E$ is
$$P^* = \arg\min_{P \in E} D(P \| Q).$$
Because $E$ is convex and $D(P \| Q)$ is strictly convex in $P$ for fixed $Q$, the I-projection is unique.
There is a dual concept: the M-projection (or moment projection) $\arg\min_{P \in E} D(Q \| P)$, which reverses the order of the arguments. The I-projection stays within the support of $Q$; the M-projection matches moments. In exponential families, these two projections have elegant geometric interpretations via Pythagorean-like identities for KL divergence.
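A minimal numerical sketch of an I-projection, assuming `scipy` is available; the constraint set $E = \{P : \mathbb{E}_P[f] \ge t\}$ and the particular $Q$, $f$, $t$ below are made-up illustrations, not from the text:

```python
import numpy as np
from scipy.optimize import minimize

def kl(p, q):
    """D(p || q) in bits; clips p away from 0 for numerical safety."""
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log2(p / q)))

Q = np.array([0.5, 0.3, 0.2])   # true distribution (illustrative)
f = np.array([0.0, 1.0, 2.0])   # a score function on the alphabet (illustrative)
t = 1.0                         # E = {P : E_P[f] >= t} is a closed convex set

constraints = [{"type": "eq",   "fun": lambda p: p.sum() - 1.0},
               {"type": "ineq", "fun": lambda p: p @ f - t}]
res = minimize(lambda p: kl(p, Q), x0=Q, method="SLSQP",
               bounds=[(0.0, 1.0)] * len(Q), constraints=constraints)

P_star = res.x
print(P_star, kl(P_star, Q))    # I-projection of Q onto E and the exponent D(P* || Q)
```

For a single linear constraint like this, the I-projection is known to be an exponential tilt of $Q$, $P^*(x) \propto Q(x)\, e^{\lambda f(x)}$ with $\lambda$ chosen so the constraint is tight, which the numerical solution can be checked against.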
Theorem: Application: Neyman–Pearson Hypothesis Testing Exponent
Consider testing $H_0 : X_i \sim P_0$ against $H_1 : X_i \sim P_1$. Among all tests with type-I error probability at most $\epsilon \in (0, 1)$, the best achievable type-II error exponent (as $n \to \infty$) is $D(P_0 \| P_1)$. More precisely: for any test with type-I error $\alpha_n \le \epsilon$, the type-II error satisfies $\liminf_{n \to \infty} \frac{1}{n} \log \beta_n \ge -D(P_0 \| P_1)$, and equality is achievable by the likelihood ratio test (Stein's lemma).
Sanov's theorem tells us that the probability of the empirical distribution looking like $P_0$ when the truth is $P_1$ decays as $2^{-n D(P_0 \| P_1)}$. This is precisely the type-II error of the optimal test. The KL divergence thus has a direct operational meaning as the best achievable error exponent in binary hypothesis testing.
Connection to Sanov
The acceptance region of any test can be characterized by a set of types. Under $H_1$ (truth is $P_1$), the probability of accepting $H_0$ (deciding the empirical distribution looks like $P_0$) is dominated by the type in the acceptance region closest to $P_1$ in KL divergence. The optimal acceptance region is a small KL divergence ball around $P_0$; as the ball shrinks, its closest point to $P_1$ approaches $P_0$, and the type-II exponent approaches $D(P_0 \| P_1)$.
Achievability via the type-based test
Accept $H_0$ if $D(\hat{P}_{x^n} \| P_0) \le \delta_n$, for a threshold $\delta_n$ chosen to satisfy the type-I error constraint. Under $H_1$, the empirical distribution concentrates around $P_1$, and the probability of the acceptance event decays at a rate approaching $D(P_0 \| P_1)$ by Sanov's theorem applied to the KL ball around $P_0$.
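A sketch of this type-based test on a binary alphabet (the choices of $P_0$, $P_1$, $\epsilon$, and the sample sizes are illustrative; the test itself is the KL-ball rule described above):

```python
import math

def kl2(p, q):
    """Binary KL divergence D((p,1-p) || (q,1-q)) in bits, with 0 log 0 = 0."""
    return sum(a * math.log2(a / b) for a, b in ((p, q), (1 - p, 1 - q)) if a > 0)

def log2_pmf(n, k, p):
    """log2 of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p)) / math.log(2)

p0, p1, eps = 0.5, 0.7, 0.05   # illustrative null, alternative, and type-I budget

for n in (100, 400, 1600):
    # Grow a KL ball around p0 (adding counts in order of D(k/n || p0))
    # until it holds probability >= 1 - eps under H0: this is the type-based test.
    counts = sorted(range(n + 1), key=lambda k: kl2(k / n, p0))
    accept, mass = [], 0.0
    for k in counts:
        accept.append(k)
        mass += 2 ** log2_pmf(n, k, p0)
        if mass >= 1 - eps:
            break
    beta = sum(2 ** log2_pmf(n, k, p1) for k in accept)   # type-II error under H1
    print(n, round(1 - mass, 4), round(-math.log2(beta) / n, 4), round(kl2(p0, p1), 4))
    # Columns: n, type-I error, type-II exponent, Stein exponent D(P0 || P1).
    # The type-II exponent climbs toward D(P0 || P1) as n grows.
```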
Quick Check
In Sanov's theorem, if the set $E$ contains the true distribution $Q$, then the exponent $\min_{P \in E} D(P \| Q)$ equals:
Zero.
If $Q \in E$, then $P = Q$ is a feasible point with $D(Q \| Q) = 0$. So the probability does not decay exponentially: it stays bounded away from zero. This makes sense: the empirical distribution converges to $Q$, so the event $\{\hat{P}_{X^n} \in E\}$ has probability tending to 1.
Common Mistake: KL Divergence Asymmetry in Sanov's Theorem
Mistake:
Writing the exponent in Sanov's theorem as $D(Q \| P^*)$ (with the true distribution $Q$ in the first argument) instead of $D(P^* \| Q)$.
Correction:
The correct exponent is $D(P^* \| Q)$: the type appears first, the truth appears second. This is because $Q^n(T(P)) \doteq 2^{-n D(P \| Q)}$: the probability of seeing type $P$ under truth $Q$ is governed by $D(P \| Q)$. Swapping the arguments gives the Stein's-lemma exponent for hypothesis testing, which is a different (though related) quantity.
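A two-line numeric reminder that the order of arguments matters (the distributions are arbitrary illustrative choices):

```python
import math

def D(p, q):
    """KL divergence D(p || q) in bits for distributions given as lists."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

P, Q = [0.9, 0.1], [0.5, 0.5]   # illustrative pair of distributions
print(D(P, Q), D(Q, P))         # ≈ 0.531 vs ≈ 0.737: D is not symmetric
```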
I-projection (information projection)
The distribution in $E$ closest to $Q$ in KL divergence: $P^* = \arg\min_{P \in E} D(P \| Q)$. It determines the exponential rate in Sanov's theorem.
Large deviations
The study of exponential decay rates for probabilities of rare events. In the i.i.d. setting, Sanov's theorem provides the large deviations rate function for the empirical distribution: the rate function is the KL divergence $D(P \| Q)$.
Why This Matters: Sanov's Theorem in Spectrum Sensing
In cognitive radio, a secondary user must decide whether a primary user is transmitting (hypothesis testing on the received signal distribution). The error exponent from Sanov's theorem determines how many samples are needed to achieve a target detection reliability. See Book telecom, Ch. 9 for detection and estimation fundamentals.
Key Takeaway
Sanov's theorem is the master large deviations result for i.i.d. sequences: the probability of the empirical distribution falling in a set $E$ decays as $2^{-n D(P^* \| Q)}$, where $D(P^* \| Q)$ is the minimum KL divergence from any distribution in $E$ to the truth $Q$. This converts probability problems into geometry problems on the simplex.