Sanov's Theorem (Preview)
From Sample Means to Empirical Distributions
Cramér's theorem tells us how unlikely it is for the sample mean to deviate from the true mean. But in many problems (hypothesis testing, source coding, channel coding) we care about the entire empirical distribution of the data, not just the mean. Sanov's theorem extends the large deviations framework: the probability that the empirical distribution falls in a set $\mathcal{E}$ of distributions decays as $e^{-n D(Q^* \| P)}$, where $Q^*$ is the distribution in $\mathcal{E}$ closest to the true distribution $P$ in KL divergence. This makes KL divergence the natural "distance" for large deviations of types.
Definition: Empirical Distribution (Type)
Given $n$ i.i.d. samples $X_1, \dots, X_n$ from a distribution $P$ on a finite alphabet $\mathcal{X}$, the empirical distribution (or type) is $$\hat{P}_n(a) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = a\}, \qquad a \in \mathcal{X}.$$ The type $\hat{P}_n$ is a probability distribution on $\mathcal{X}$. The set of all possible types with denominator $n$ is denoted $\mathcal{P}_n$.
The number of types satisfies $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$, which is polynomial in $n$. This polynomial factor is sub-exponential and therefore invisible at the exponential scale of large deviations.
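To make the definition concrete, here is a minimal sketch (not part of the original text) that computes the empirical type of a small i.i.d. sample and checks the polynomial type-counting bound; the alphabet, the distribution, and the sample size are illustrative choices.

```python
# Compute the empirical type of an i.i.d. sample and compare the exact number
# of types with denominator n against the polynomial bound (n+1)^{|X|}.
from collections import Counter
from math import comb
import random

random.seed(0)
alphabet = ["a", "b", "c"]
P = [0.5, 0.3, 0.2]          # illustrative true distribution P on the alphabet
n = 20

sample = random.choices(alphabet, weights=P, k=n)
counts = Counter(sample)
type_hat = {sym: counts[sym] / n for sym in alphabet}   # empirical type: a distribution on the alphabet
print("empirical type:", type_hat)

# Types with denominator n = compositions of n into |X| nonnegative parts.
num_types = comb(n + len(alphabet) - 1, len(alphabet) - 1)
print("number of types:", num_types, "<= bound", (n + 1) ** len(alphabet))
```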
Definition: Kullback-Leibler Divergence
For two probability distributions $Q$ and $P$ on a finite alphabet $\mathcal{X}$, the Kullback-Leibler divergence (or relative entropy) is $$D(Q \,\|\, P) = \sum_{a \in \mathcal{X}} Q(a) \log \frac{Q(a)}{P(a)},$$ with the conventions $0 \log \frac{0}{p} = 0$ and $q \log \frac{q}{0} = +\infty$ for $q > 0$. We have $D(Q \,\|\, P) \ge 0$ with equality iff $Q = P$ (Gibbs' inequality). KL divergence is not symmetric and does not satisfy the triangle inequality, so it is not a metric.
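A minimal sketch of this definition in code (not from the original text), using the conventions above; the distributions and the function name `kl` are illustrative.

```python
# KL divergence D(Q || P) on a finite alphabet, in nats, with the
# 0*log(0/p) = 0 and q*log(q/0) = +inf conventions.
from math import log, inf

def kl(Q, P):
    """D(Q || P) for distributions given as equal-length lists of probabilities."""
    d = 0.0
    for q, p in zip(Q, P):
        if q == 0:
            continue            # convention: 0 * log(0/p) = 0
        if p == 0:
            return inf          # convention: q * log(q/0) = +inf for q > 0
        d += q * log(q / p)
    return d

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl(Q, P), kl(P, Q))       # nonnegative, and in general D(Q||P) != D(P||Q)
print(kl(P, P))                 # 0, by Gibbs' inequality
```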
Type (Empirical Distribution)
The histogram of symbol frequencies in a sequence. For $x^n = (x_1, \dots, x_n)$, the type $\hat{P}_{x^n}(a)$ counts the fraction of positions $i$ with $x_i = a$.
Related: Empirical Distribution (Type), Type Class
Theorem: Probability of a Type
Let $X_1, \dots, X_n$ be i.i.d. $P$ on finite $\mathcal{X}$, and let $Q \in \mathcal{P}_n$ be a type with denominator $n$. Then: $$\frac{1}{(n+1)^{|\mathcal{X}|}}\, e^{-n D(Q \,\|\, P)} \;\le\; \Pr(\hat{P}_n = Q) \;\le\; e^{-n D(Q \,\|\, P)}.$$ In particular, $\Pr(\hat{P}_n = Q) \doteq e^{-n D(Q \,\|\, P)}$ to first order in the exponent.
The probability that the empirical distribution equals a specific type $Q$ is governed exponentially by the KL divergence $D(Q \,\|\, P)$ between $Q$ and the true distribution $P$. The polynomial pre-factor is negligible at the exponential scale.
Upper bound
For a sequence $x^n$ of type $Q$: $$P^n(x^n) = \prod_{a \in \mathcal{X}} P(a)^{n Q(a)} = e^{-n\left(H(Q) + D(Q \,\|\, P)\right)}.$$ Summing over all sequences of type $Q$ (the type class $T(Q)$): $$\Pr(\hat{P}_n = Q) = |T(Q)|\, e^{-n\left(H(Q) + D(Q \,\|\, P)\right)}.$$ Since $|T(Q)| \le e^{n H(Q)}$ (where $H(Q) = -\sum_a Q(a) \log Q(a)$): $$\Pr(\hat{P}_n = Q) \le e^{-n D(Q \,\|\, P)}.$$
Lower bound
Using $|T(Q)| \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, e^{n H(Q)}$ (a counting bound): $$\Pr(\hat{P}_n = Q) = |T(Q)|\, e^{-n\left(H(Q) + D(Q \,\|\, P)\right)} \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, e^{-n D(Q \,\|\, P)}.$$
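These bounds can be checked numerically. The following sketch (assumed parameters, binary alphabet; not from the original text) compares the exact probability that the type equals a given $Q$ with the exponential upper and lower bounds.

```python
# Numerical check of (n+1)^{-|X|} e^{-n D(Q||P)} <= Pr(type = Q) <= e^{-n D(Q||P)}
# for a binary alphabet, where the exact probability is a binomial pmf value.
from math import comb, log, exp

def kl_bern(q, p):
    """D(Bern(q) || Bern(p)) in nats, with the 0 log 0 = 0 convention."""
    d = 0.0
    for a, b in ((q, p), (1 - q, 1 - p)):
        if a > 0:
            d += a * log(a / b)
    return d

p, n = 0.3, 40
for k in range(0, n + 1, 8):
    q = k / n                                        # a type with denominator n
    exact = comb(n, k) * p**k * (1 - p)**(n - k)     # exact Pr(empirical type = Bern(q))
    upper = exp(-n * kl_bern(q, p))
    lower = upper / (n + 1) ** 2                     # |X| = 2 for a binary alphabet
    assert lower <= exact <= upper * (1 + 1e-9)      # tiny slack for floating-point roundoff
    print(f"q={q:.2f}  lower={lower:.3e}  exact={exact:.3e}  upper={upper:.3e}")
```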
Theorem: Sanov's Theorem
Let $X_1, \dots, X_n$ be i.i.d. $P$ on a finite alphabet $\mathcal{X}$, and let $\mathcal{E}$ be a set of probability distributions on $\mathcal{X}$ whose closure equals the closure of its interior. Then: $$\lim_{n \to \infty} \frac{1}{n} \log \Pr(\hat{P}_n \in \mathcal{E}) = -D(Q^* \,\|\, P), \qquad Q^* = \arg\min_{Q \in \overline{\mathcal{E}}} D(Q \,\|\, P),$$ where $Q^*$ is the information projection (or I-projection) of $P$ onto the closure of $\mathcal{E}$.
Among all distributions in $\mathcal{E}$, the one closest to $P$ in KL divergence dominates the probability. The empirical distribution lands in $\mathcal{E}$ primarily by "aiming" at $Q^*$, the cheapest route. All other distributions in $\mathcal{E}$ contribute sub-exponentially relative to $Q^*$.
Upper bound sketch
Sum the probability-of-a-type bound over the (polynomially many) types in $\mathcal{E}$: $$\Pr(\hat{P}_n \in \mathcal{E}) = \sum_{Q \in \mathcal{E} \cap \mathcal{P}_n} \Pr(\hat{P}_n = Q) \le (n+1)^{|\mathcal{X}|} \max_{Q \in \mathcal{E} \cap \mathcal{P}_n} e^{-n D(Q \,\|\, P)} \le (n+1)^{|\mathcal{X}|}\, e^{-n D(Q^* \,\|\, P)}.$$ Taking logarithms and dividing by $n$ gives $\limsup_{n \to \infty} \frac{1}{n}\log \Pr(\hat{P}_n \in \mathcal{E}) \le -D(Q^* \,\|\, P)$.
Lower bound sketch
Choose a sequence of types $Q_n \in \mathcal{E} \cap \mathcal{P}_n$ converging to $Q^*$ (possible because the closure of $\mathcal{E}$ equals the closure of its interior): $$\Pr(\hat{P}_n \in \mathcal{E}) \ge \Pr(\hat{P}_n = Q_n) \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, e^{-n D(Q_n \,\|\, P)}.$$ Since $D(Q_n \,\|\, P) \to D(Q^* \,\|\, P)$, the lower bound matches.
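A numerical illustration of the theorem for a Bernoulli source (a sketch with assumed parameters, not from the original text): for the set $\mathcal{E} = \{Q : Q(1) \ge t\}$, the empirical exponent $-\frac{1}{n}\log \Pr(\hat{P}_n \in \mathcal{E})$ should approach $D(\mathrm{Bern}(t) \,\|\, \mathrm{Bern}(p))$ as $n$ grows.

```python
# Sanov's theorem for a Bernoulli(p) source and the event {fraction of ones >= t}:
# compare the exact tail exponent with the KL-divergence rate.
from math import lgamma, log, exp, ceil

def kl_bern(q, p):
    """D(Bern(q) || Bern(p)) in nats, assuming 0 < q < 1 and 0 < p < 1."""
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

def log_binom_pmf(n, k, p):
    """log of the Binomial(n, p) pmf at k, computed in log space to avoid underflow."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def log_tail(n, p, t):
    """log Pr(fraction of ones >= t) for n i.i.d. Bernoulli(p) samples."""
    logs = [log_binom_pmf(n, k, p) for k in range(ceil(n * t), n + 1)]
    m = max(logs)
    return m + log(sum(exp(x - m) for x in logs))       # log-sum-exp

p, t = 0.3, 0.5                 # illustrative true bias p and threshold t
print("Sanov rate D(Bern(t)||Bern(p)) =", kl_bern(t, p))
for n in (50, 200, 800, 3200):
    print(n, -log_tail(n, p, t) / n)    # approaches the Sanov rate (from above) as n grows
```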
Example: Sanov's Theorem Applied to Hypothesis Testing
A coin has unknown bias. Under $H_0$: $P = \mathrm{Bern}(0.4)$; under $H_1$: $P = \mathrm{Bern}(0.6)$. We flip $n$ times and reject $H_0$ if the fraction of heads is $\ge 0.5$. What is the exponential decay rate of the Type I error probability?
Identify the set $\mathcal{E}$
The rejection region in the space of types is $\mathcal{E} = \{Q : Q(\text{heads}) \ge 0.5\}$. Under $H_0$, $P = \mathrm{Bern}(0.4)$, so we need the decay rate of $\Pr(\hat{P}_n \in \mathcal{E})$, where $\hat{P}_n$ is the type of the $n$ flips.
Find the I-projection
Since $D(Q \,\|\, P)$ is minimized at $Q = P = \mathrm{Bern}(0.4)$ but the constraint requires $Q(\text{heads}) \ge 0.5$, the minimum over $\mathcal{E}$ is attained on the boundary, at $Q^* = \mathrm{Bern}(0.5)$.
Compute the rate
The rate is $$D(\mathrm{Bern}(0.5) \,\|\, \mathrm{Bern}(0.4)) = 0.5 \log\frac{0.5}{0.4} + 0.5 \log\frac{0.5}{0.6} \approx 0.0204 \text{ nats},$$ so the Type I error probability decays as $e^{-0.0204\, n}$. At $n = 100$, this exponential estimate is $e^{-2.04} \approx 0.13$.
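The numbers above can be reproduced with a short sketch, assuming the parameters as reconstructed in this example ($H_0$ bias $0.4$, rejection threshold $0.5$, $n = 100$); it also computes the exact binomial Type I error for comparison, which shares the exponent but differs by sub-exponential factors.

```python
# Reproduce the worked example: rate D(Bern(0.5)||Bern(0.4)), the exponential
# estimate at n = 100, and the exact binomial Type I error for comparison.
from math import comb, log, exp

p0, t, n = 0.4, 0.5, 100
rate = t * log(t / p0) + (1 - t) * log((1 - t) / (1 - p0))   # D(Bern(0.5) || Bern(0.4))
print("rate =", round(rate, 4), "nats")                       # ~0.0204
print("exponential estimate exp(-n*rate) =", round(exp(-n * rate), 3))   # ~0.13

# Exact Type I error: Pr(#heads >= 50) under H0 = Bern(0.4).
exact = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(int(n * t), n + 1))
print("exact Type I error =", round(exact, 3))   # smaller: Sanov gives the exponent, not the constant
```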
Historical Note: I. N. Sanov and the Method of Types
Ivan Nikolaevich Sanov (1919--1968) was a Soviet mathematician who proved his celebrated theorem in 1957. His work established the KL divergence as the natural "cost" for empirical distributions to deviate from the true distribution. The method of types, which provides the combinatorial machinery behind Sanov's theorem, was later systematized by Csiszár and Körner in their landmark 1981 monograph. The method of types has become one of the most powerful tools in information theory, yielding clean proofs of the channel coding theorem, the source coding theorem, and multi-user results.
Sanov's Theorem Implies Cramér's Theorem (Finite Alphabets)
For finite-alphabet i.i.d. sequences, Cramér's theorem is a corollary of Sanov's theorem. The event $\{\bar{X}_n \ge a\}$ corresponds to the set $\mathcal{E} = \{Q : \mathbb{E}_Q[X] \ge a\}$ in the space of distributions. The I-projection onto this set gives the rate $\min_{Q : \mathbb{E}_Q[X] \ge a} D(Q \,\|\, P)$, which equals the Legendre-transform rate $I(a) = \sup_{\lambda}\left[\lambda a - \log \mathbb{E}_P e^{\lambda X}\right]$ from Cramér's theorem. In this sense, Sanov's theorem is the more fundamental result, and KL divergence is the universal rate function for empirical distributions.
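A numerical sanity check of this correspondence on a small alphabet (a sketch with assumed support, $P$, and threshold $a$; not from the original text): the constrained minimum of $D(Q \,\|\, P)$ over $\{Q : \mathbb{E}_Q[X] \ge a\}$, found by brute-force search over the simplex, is compared with the Legendre-transform rate.

```python
# Compare the Sanov/I-projection rate with Cramér's Legendre-transform rate.
from math import log, exp

xs = [0.0, 1.0, 2.0]          # support of X
P = [0.5, 0.3, 0.2]           # illustrative true distribution P
a = 1.2                       # threshold for the sample-mean event {mean >= a}

def kl(Q, P):
    return sum(q * log(q / p) for q, p in zip(Q, P) if q > 0)

# Sanov side: brute-force grid search over the probability simplex on 3 points.
steps, best = 400, float("inf")
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        Q = [i / steps, j / steps, (steps - i - j) / steps]
        if sum(q * x for q, x in zip(Q, xs)) >= a:
            best = min(best, kl(Q, P))

# Cramér side: Legendre transform of the log-moment generating function (grid over lambda).
def lmgf(lam):
    """log E_P[exp(lam * X)]."""
    return log(sum(p * exp(lam * x) for p, x in zip(P, xs)))

legendre = max(lam * a - lmgf(lam) for lam in (k / 1000 for k in range(5001)))

print("I-projection rate:", round(best, 4))      # ~0.19
print("Legendre rate:    ", round(legendre, 4))  # ~0.19, agrees up to grid resolution
```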
Information Projection (I-Projection)
Given a set $\mathcal{E}$ of distributions and a reference distribution $P$, the I-projection is $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \,\|\, P)$. It is the distribution in $\mathcal{E}$ that is "closest" to $P$ in the KL sense. When $\mathcal{E}$ is convex, the I-projection is unique.
Related: Kullback-Leibler Divergence, Sanov Theorem
Common Mistake: The Direction of KL Divergence in Sanov's Theorem
Mistake:
Writing the exponent as $D(P \,\|\, Q^*)$ instead of $D(Q^* \,\|\, P)$.
Correction:
Sanov's theorem uses $D(Q^* \,\|\, P)$: the most likely "impostor" $Q^*$ sits in the first argument and the true distribution $P$ in the second. The order matters because KL divergence is asymmetric. The I-projection minimizes $D(Q \,\|\, P)$ over $Q \in \mathcal{E}$, not $D(P \,\|\, Q)$.
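A small sketch (assumed parameters, not from the text) showing why the direction matters: only $D(Q^* \,\|\, P)$ tracks the actual decay exponent of a tail event, while $D(P \,\|\, Q^*)$ does not.

```python
# For P = Bern(p) and the event {fraction of ones >= t}, Q* = Bern(t):
# the empirical exponent matches D(Q*||P), not D(P||Q*).
from math import comb, log

def kl_bern(q, p):
    return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

p, t, n = 0.1, 0.5, 500
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2, n + 1))
print("empirical exponent -(1/n) log Pr:", -log(tail) / n)   # ~0.52
print("D(Q*||P) =", kl_bern(t, p))   # ~0.51, close to the empirical exponent (they agree as n grows)
print("D(P||Q*) =", kl_bern(p, t))   # ~0.37, wrong direction; does not match
```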
Why This Matters: From Sanov's Theorem to Channel Coding Error Exponents
In channel coding, a decoder makes an error when the empirical joint type of the transmitted codeword and the received sequence "looks like" it came from a different codeword. Sanov's theorem (and the method of types more broadly) provides the exponential rate at which this confusion probability decays with block length. This is the foundation of error exponent analysis for discrete memoryless channels, treated fully in Book ITA, Chapter 4.
Quick Check
If $X_1, \dots, X_n$ are i.i.d. Bernoulli(0.5) and $A = \{Q : Q(1) \ge t\}$ for a threshold $t > 0.5$, what is the exponential rate of $\Pr(\hat{P}_n \in A)$?
The I-projection is $Q^* = \mathrm{Bern}(t)$ (the boundary point of $A$ closest to $\mathrm{Bern}(0.5)$ in KL), so the rate is $D(\mathrm{Bern}(t) \,\|\, \mathrm{Bern}(0.5))$ nats.
Key Takeaway
Sanov's theorem establishes KL divergence as the universal rate function for empirical distributions: the probability that $\hat{P}_n$ lands in a set $\mathcal{E}$ decays as $e^{-n D(Q^* \,\|\, P)}$, where $Q^*$ is the I-projection of the true distribution $P$ onto $\mathcal{E}$. This makes KL divergence the natural measure of "statistical distance" in exponential-rate problems, and underpins error exponent analysis in information theory.