Sanov's Theorem (Preview)

From Sample Means to Empirical Distributions

Cramér's theorem tells us how unlikely it is for the sample mean to deviate from the true mean. But in many problems — hypothesis testing, source coding, channel coding — we care about the entire empirical distribution of the data, not just the mean. Sanov's theorem extends the large deviations framework: the probability that the empirical distribution falls in a set $\mathcal{E}$ of distributions decays as $e^{-nD(Q^* \| P)}$, where $Q^*$ is the distribution in $\mathcal{E}$ closest to $P$ in KL divergence. This makes KL divergence the natural "distance" for large deviations of types.

Definition:

Empirical Distribution (Type)

Given i.i.d. samples $X_1, \ldots, X_n$ from a distribution $P$ on a finite alphabet $\mathcal{X}$, the empirical distribution (or type) is $\hat{P}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i = x\}, \quad x \in \mathcal{X}$. The type $\hat{P}_n$ is a probability distribution on $\mathcal{X}$. The set of all possible types with denominator $n$ is denoted $\mathcal{P}_n(\mathcal{X})$.
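As a quick illustration, here is a minimal Python sketch that computes the type of a sequence (the function name and sample data are illustrative, not from the original):

```python
from collections import Counter

def empirical_distribution(samples, alphabet):
    """Return the type (empirical distribution) of a sequence as a dict."""
    counts = Counter(samples)
    n = len(samples)
    return {x: counts.get(x, 0) / n for x in alphabet}

# Hypothetical sequence of length n = 10 over the alphabet {a, b, c}.
samples = ["a", "b", "a", "c", "a", "b", "a", "a", "c", "b"]
print(empirical_distribution(samples, ["a", "b", "c"]))
# {'a': 0.5, 'b': 0.3, 'c': 0.2} -- a type with denominator n = 10
```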

The number of types satisfies $|\mathcal{P}_n(\mathcal{X})| \leq (n+1)^{|\mathcal{X}|}$, which is polynomial in $n$. This polynomial factor is sub-exponential and therefore invisible at the exponential scale of large deviations.
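The exact count is the stars-and-bars quantity $\binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$; a short sketch (alphabet size chosen for illustration) confirms it stays under the $(n+1)^{|\mathcal{X}|}$ bound:

```python
from math import comb

def num_types(n, k):
    """Exact number of types with denominator n on an alphabet of size k."""
    return comb(n + k - 1, k - 1)   # stars and bars

k = 3  # alphabet size, chosen for illustration
for n in [10, 100, 1000]:
    print(n, num_types(n, k), (n + 1) ** k)   # count <= (n+1)^k in each row
```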

Definition:

Kullback-Leibler Divergence

For two probability distributions $Q$ and $P$ on a finite alphabet $\mathcal{X}$, the Kullback-Leibler divergence (or relative entropy) is $D(Q \| P) = \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{P(x)}$, with the conventions $0 \log 0 = 0$ and $q \log(q/0) = +\infty$ for $q > 0$. We have $D(Q \| P) \geq 0$ with equality iff $Q = P$ (Gibbs' inequality). KL divergence is not symmetric and does not satisfy the triangle inequality — it is not a metric.
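A direct translation of this definition into Python, using the natural log so values are in nats, with both conventions handled explicitly (a sketch, not a library routine):

```python
import math

def kl_divergence(Q, P):
    """D(Q || P) in nats for distributions given as dicts over the same alphabet."""
    d = 0.0
    for x, q in Q.items():
        if q == 0:
            continue            # convention: 0 log 0 = 0
        p = P.get(x, 0.0)
        if p == 0:
            return math.inf     # convention: q log(q/0) = +infinity for q > 0
        d += q * math.log(q / p)
    return d

print(kl_divergence({"H": 0.8, "T": 0.2}, {"H": 0.5, "T": 0.5}))  # ~0.193 nats
```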

Type (Empirical Distribution)

The histogram of symbol frequencies in a sequence. For $x^n \in \mathcal{X}^n$, the type is $\hat{P}_{x^n}(a) = N(a \mid x^n)/n$, the fraction of positions where symbol $a$ appears; here $N(a \mid x^n)$ counts the occurrences of $a$ in $x^n$.

Related: Empirical Distribution (Type), Type Class

Theorem: Probability of a Type

Let $X_1, \ldots, X_n$ be i.i.d. $\sim P$ on finite $\mathcal{X}$, and let $Q \in \mathcal{P}_n(\mathcal{X})$ be a type with denominator $n$. Then: $\frac{1}{(n+1)^{|\mathcal{X}|}} e^{-nD(Q \| P)} \leq \mathbb{P}(\hat{P}_n = Q) \leq e^{-nD(Q \| P)}$. In particular, $\mathbb{P}(\hat{P}_n = Q) \doteq e^{-nD(Q \| P)}$.
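A numeric sanity check of both bounds for a binary alphabet, where the type probability is an exact binomial term (the values of $n$ and $Q$ are illustrative):

```python
import math
from math import comb

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

n, p = 60, 0.5
k = 42                                        # type Q with Q(1) = 42/60 = 0.7
exact = comb(n, k) * p**k * (1 - p)**(n - k)  # P(type = Q), exact binomial term
upper = math.exp(-n * kl(k / n, p))
lower = upper / (n + 1) ** 2                  # |X| = 2
print(lower <= exact <= upper)                # True
print(lower, exact, upper)
```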

The probability that the empirical distribution equals a specific type $Q$ is governed exponentially by the KL divergence $D(Q \| P)$. The polynomial pre-factor $(n+1)^{|\mathcal{X}|}$ is negligible at the exponential scale.

Theorem: Sanov's Theorem

Let $X_1, \ldots, X_n$ be i.i.d. $\sim P$ on a finite alphabet $\mathcal{X}$, and let $\mathcal{E}$ be a set of probability distributions on $\mathcal{X}$ whose closure equals the closure of its interior. Then: $\mathbb{P}(\hat{P}_n \in \mathcal{E}) \doteq e^{-nD(Q^* \| P)}$, where $Q^* = \arg\min_{Q \in \overline{\mathcal{E}}} D(Q \| P)$ is the information projection (or I-projection) of $P$ onto the closure of $\mathcal{E}$.

Among all distributions in $\mathcal{E}$, the one closest to $P$ in KL divergence dominates the probability. The empirical distribution lands in $\mathcal{E}$ primarily by "aiming" at $Q^*$ — the cheapest route. All other distributions in $\mathcal{E}$ contribute sub-exponentially relative to $Q^*$.
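A sketch checking the exponent numerically for a Bernoulli source, with the same set $\mathcal{E} = \{Q : Q(1) \geq 0.8\}$ as in the Quick Check below: the finite-$n$ rate $-\frac{1}{n}\log \mathbb{P}(\hat{P}_n \in \mathcal{E})$ approaches $D(Q^* \| P)$ as $n$ grows, up to the polynomial correction.

```python
import math
from math import comb

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

p, threshold = 0.5, 0.8
print("D(Q* || P) =", kl(threshold, p))       # ~0.193 nats
for n in [50, 200, 1000]:
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k / n >= threshold)
    print(n, -math.log(prob) / n)             # approaches 0.193 from above
```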


Example: Sanov's Theorem Applied to Hypothesis Testing

A coin has unknown bias. Under $H_0$: $P = \text{Ber}(1/2)$; under $H_1$: $P = \text{Ber}(1/3)$. We flip $n$ times and reject $H_0$ if the fraction of heads is $\leq 0.4$. What is the exponential decay rate of the Type I error probability?
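By Sanov's theorem, the Type I exponent is $\min\{D(Q \| \text{Ber}(1/2)) : Q(\text{heads}) \leq 0.4\}$, attained at the boundary point $Q^* = \text{Ber}(0.4)$ since $\text{Ber}(1/2)$ lies outside the set. A one-line computation (a sketch; the numeric answer is not stated in the original):

```python
import math

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

print(kl(0.4, 0.5))   # ~0.0201 nats, so Type I error ~ e^(-0.0201 n)
```

The exponent is small because the rejection threshold 0.4 sits close to the true mean 1/2, so large samples are needed before the Type I error becomes negligible.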

Historical Note: I. N. Sanov and the Method of Types

1957–1981

Ivan Nikolaevich Sanov (1919–1968) was a Soviet mathematician who proved his celebrated theorem in 1957. His work established the KL divergence as the natural "cost" for empirical distributions to deviate from the true distribution. The method of types, which provides the combinatorial machinery behind Sanov's theorem, was later systematized by Csiszár and Körner in their landmark 1981 monograph. The method of types has become one of the most powerful tools in information theory, yielding clean proofs of the channel coding theorem, source coding theorem, and multi-user results.

Sanov's Theorem Implies Cramér's Theorem (Finite Alphabets)

For finite-alphabet i.i.d. sequences, Cramér's theorem is a corollary of Sanov's theorem. The event $\{\bar{X}_n \geq a\}$ corresponds to $\mathcal{E} = \{Q : \sum_x x\,Q(x) \geq a\}$ in the space of distributions. The I-projection $Q^*$ onto this set gives $D(Q^* \| P)$, which equals $I(a)$ from Cramér's theorem. In this sense, Sanov's theorem is the more fundamental result, and KL divergence is the universal rate function for empirical distributions.
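To make the correspondence concrete, here is a sketch (alphabet and values are illustrative) that finds the I-projection for a mean constraint by exponential tilting, $Q^*(x) \propto P(x)e^{\lambda x}$, choosing $\lambda$ so the tilted mean equals $a$; the resulting $D(Q^* \| P)$ then equals the Cramér rate $I(a) = \lambda a - \log \sum_x P(x)e^{\lambda x}$:

```python
import math

xs = [0.0, 1.0, 2.0]          # illustrative finite alphabet
P = [0.5, 0.3, 0.2]           # illustrative reference distribution
a = 1.2                       # target mean (above E_P[X] = 0.7)

def tilted(lam):
    """Exponentially tilted distribution Q(x) ~ P(x) e^(lam x) and its normalizer."""
    w = [p * math.exp(lam * x) for p, x in zip(P, xs)]
    z = sum(w)
    return [wi / z for wi in w], z

# Bisection on lambda: the tilted mean is increasing in lambda.
lo, hi = 0.0, 50.0
for _ in range(100):
    lam = (lo + hi) / 2
    Q, _ = tilted(lam)
    mean = sum(q * x for q, x in zip(Q, xs))
    lo, hi = (lam, hi) if mean < a else (lo, lam)

Q, z = tilted(lam)
D = sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)
print(Q)                            # the I-projection Q*
print(D, lam * a - math.log(z))     # D(Q* || P) equals I(a)
```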

Information Projection (I-Projection)

Given a set of distributions $\mathcal{E}$ and a reference distribution $P$, the I-projection is $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P)$. It is the distribution in $\mathcal{E}$ that is "closest" to $P$ in the KL sense. When $\mathcal{E}$ is convex, the I-projection is unique.
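When the set has no closed-form projection, the I-projection can be computed numerically as a convex program. A sketch, assuming NumPy and SciPy are available, for the set used in the Quick Check below:

```python
import numpy as np
from scipy.optimize import minimize

P = np.array([0.5, 0.5])        # reference distribution: Bernoulli(0.5)

def kl(Q):
    """D(Q || P) in nats, with the 0 log 0 = 0 convention."""
    Q = np.asarray(Q)
    mask = Q > 0
    return float(np.sum(Q[mask] * np.log(Q[mask] / P[mask])))

# Convex feasible set E = {Q : Q(1) >= 0.8}, intersected with the simplex.
constraints = [
    {"type": "eq",   "fun": lambda Q: np.sum(Q) - 1.0},   # probabilities sum to 1
    {"type": "ineq", "fun": lambda Q: Q[1] - 0.8},        # membership in E
]
res = minimize(kl, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)],
               constraints=constraints, method="SLSQP")
print(res.x, res.fun)   # ~[0.2, 0.8], D ~ 0.193 nats, matching the Quick Check
```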

Related: Kullback-Leibler Divergence, Sanov Theorem

Common Mistake: The Direction of KL Divergence in Sanov's Theorem

Mistake:

Writing the exponent as $D(P \| Q^*)$ instead of $D(Q^* \| P)$.

Correction:

Sanov's theorem uses $D(Q^* \| P)$ — the divergence from the true distribution $P$ to the most likely "impostor" $Q^*$. The order matters because KL divergence is asymmetric. The I-projection minimizes $D(Q \| P)$ over $Q \in \mathcal{E}$, not $D(P \| Q)$.

Why This Matters: From Sanov's Theorem to Channel Coding Error Exponents

In channel coding, a decoding error occurs when the empirical joint type of the transmitted codeword and the received sequence "looks like" it came from a different codeword. Sanov's theorem (and the method of types more broadly) provides the exponential rate at which this confusion probability decays with block length. This is the foundation of error exponent analysis for discrete memoryless channels, treated fully in Book ITA, Chapter 4.

Quick Check

If $X_1, \ldots, X_n$ are i.i.d. Bernoulli(0.5) and $\mathcal{E} = \{Q : Q(1) \geq 0.8\}$, what is the exponential rate of $\mathbb{P}(\hat{P}_n \in \mathcal{E})$?

$D(\text{Ber}(0.8) \| \text{Ber}(0.5)) \approx 0.193$ nats

$D(\text{Ber}(0.5) \| \text{Ber}(0.8)) \approx 0.223$ nats

$(0.8 - 0.5)^2 / (2 \cdot 0.25) = 0.18$

$H(\text{Ber}(0.8)) \approx 0.50$ nats
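The correct option is the first: by Sanov's theorem the rate is $D(Q^* \| P)$ with $Q^* = \text{Ber}(0.8)$, the I-projection of $\text{Ber}(0.5)$ onto $\mathcal{E}$. A sketch evaluating all four candidates:

```python
import math

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

def entropy(q):
    """Binary entropy H(Ber(q)) in nats."""
    return -sum(a * math.log(a) for a in (q, 1 - q) if a > 0)

print(kl(0.8, 0.5))                    # ~0.193 -- the Sanov rate (correct)
print(kl(0.5, 0.8))                    # ~0.223 -- wrong direction of KL
print((0.8 - 0.5) ** 2 / (2 * 0.25))   #  0.18  -- Gaussian approximation only
print(entropy(0.8))                    # ~0.500 -- an entropy, not a rate
```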

Key Takeaway

Sanov's theorem establishes KL divergence as the universal rate function for empirical distributions: the probability that $\hat{P}_n$ lands in a set $\mathcal{E}$ decays as $e^{-nD(Q^* \| P)}$ where $Q^*$ is the I-projection of the true distribution onto $\mathcal{E}$. This makes KL divergence the natural measure of "statistical distance" in exponential-rate problems, and underpins error exponent analysis in information theory.