Sanov's Theorem (Preview)

From Sample Means to Empirical Distributions

Cramér's theorem tells us how unlikely it is for the sample mean to deviate from the true mean. But in many problems — hypothesis testing, source coding, channel coding — we care about the entire empirical distribution of the data, not just the mean. Sanov's theorem extends the large deviations framework: the probability that the empirical distribution falls in a set $\mathcal{E}$ of distributions decays as $e^{-nD(Q^* \| P)}$, where $Q^*$ is the distribution in $\mathcal{E}$ closest to $P$ in KL divergence. This makes KL divergence the natural "distance" for large deviations of types.

Definition:

Empirical Distribution (Type)

Given i.i.d. samples $X_1, \ldots, X_n$ from a distribution $P$ on a finite alphabet $\mathcal{X}$, the empirical distribution (or type) is $\hat{P}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i = x\}, \quad x \in \mathcal{X}$. The type $\hat{P}_n$ is a probability distribution on $\mathcal{X}$. The set of all possible types with denominator $n$ is denoted $\mathcal{P}_n(\mathcal{X})$.
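As a quick illustration, here is a minimal Python sketch that computes the type of a sequence (the function name and sample data are illustrative, not from the original):

```python
from collections import Counter

def empirical_distribution(samples, alphabet):
    """Return the type (empirical distribution) of a sequence as a dict."""
    counts = Counter(samples)
    n = len(samples)
    return {x: counts.get(x, 0) / n for x in alphabet}

# Hypothetical sequence of length n = 10 over the alphabet {a, b, c}.
samples = ["a", "b", "a", "c", "a", "b", "a", "a", "c", "b"]
print(empirical_distribution(samples, ["a", "b", "c"]))
# {'a': 0.5, 'b': 0.3, 'c': 0.2} -- a type with denominator n = 10
```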

The number of types satisfies $|\mathcal{P}_n(\mathcal{X})| \leq (n+1)^{|\mathcal{X}|}$, which is polynomial in $n$. This polynomial factor is sub-exponential and therefore invisible at the exponential scale of large deviations.
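The exact count is the stars-and-bars quantity $\binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1}$; a short sketch (alphabet size chosen for illustration) confirms it stays under the $(n+1)^{|\mathcal{X}|}$ bound:

```python
from math import comb

def num_types(n, k):
    """Exact number of types with denominator n on an alphabet of size k."""
    return comb(n + k - 1, k - 1)   # stars and bars

k = 3  # alphabet size, chosen for illustration
for n in [10, 100, 1000]:
    print(n, num_types(n, k), (n + 1) ** k)   # count <= (n+1)^k in each row
```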

Definition:

Kullback-Leibler Divergence

For two probability distributions $Q$ and $P$ on a finite alphabet $\mathcal{X}$, the Kullback-Leibler divergence (or relative entropy) is $D(Q \| P) = \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{P(x)}$, with the conventions $0 \log 0 = 0$ and $q \log(q/0) = +\infty$ for $q > 0$. We have $D(Q \| P) \geq 0$ with equality iff $Q = P$ (Gibbs' inequality). KL divergence is not symmetric and does not satisfy the triangle inequality — it is not a metric.
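A direct translation of this definition into Python, using the natural log so values are in nats, with both conventions handled explicitly (a sketch, not a library routine):

```python
import math

def kl_divergence(Q, P):
    """D(Q || P) in nats for distributions given as dicts over the same alphabet."""
    d = 0.0
    for x, q in Q.items():
        if q == 0:
            continue            # convention: 0 log 0 = 0
        p = P.get(x, 0.0)
        if p == 0:
            return math.inf     # convention: q log(q/0) = +infinity for q > 0
        d += q * math.log(q / p)
    return d

print(kl_divergence({"H": 0.8, "T": 0.2}, {"H": 0.5, "T": 0.5}))  # ~0.193 nats
```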

Type (Empirical Distribution)

The histogram of symbol frequencies in a sequence. For $x^n \in \mathcal{X}^n$, the type is $\hat{P}_{x^n}(a) = N(a \mid x^n)/n$, the fraction of positions where symbol $a$ appears; here $N(a \mid x^n)$ counts the occurrences of $a$ in $x^n$.

Related: Empirical Distribution (Type), Type Class

Theorem: Probability of a Type

Let $X_1, \ldots, X_n$ be i.i.d. $\sim P$ on finite $\mathcal{X}$, and let $Q \in \mathcal{P}_n(\mathcal{X})$ be a type with denominator $n$. Then: $\frac{1}{(n+1)^{|\mathcal{X}|}} e^{-nD(Q \| P)} \leq \mathbb{P}(\hat{P}_n = Q) \leq e^{-nD(Q \| P)}$. In particular, $\mathbb{P}(\hat{P}_n = Q) \doteq e^{-nD(Q \| P)}$.
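A numeric sanity check of both bounds for a binary alphabet, where the type probability is an exact binomial term (the values of $n$ and $Q$ are illustrative):

```python
import math
from math import comb

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

n, p = 60, 0.5
k = 42                                        # type Q with Q(1) = 42/60 = 0.7
exact = comb(n, k) * p**k * (1 - p)**(n - k)  # P(type = Q), exact binomial term
upper = math.exp(-n * kl(k / n, p))
lower = upper / (n + 1) ** 2                  # |X| = 2
print(lower <= exact <= upper)                # True
print(lower, exact, upper)
```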

The probability that the empirical distribution equals a specific type $Q$ is governed exponentially by the KL divergence $D(Q \| P)$. The polynomial pre-factor $(n+1)^{|\mathcal{X}|}$ is negligible at the exponential scale.

Theorem: Sanov's Theorem

Let $X_1, \ldots, X_n$ be i.i.d. $\sim P$ on a finite alphabet $\mathcal{X}$, and let $\mathcal{E}$ be a set of probability distributions on $\mathcal{X}$ whose closure equals the closure of its interior. Then: $\mathbb{P}(\hat{P}_n \in \mathcal{E}) \doteq e^{-nD(Q^* \| P)}$, where $Q^* = \arg\min_{Q \in \overline{\mathcal{E}}} D(Q \| P)$ is the information projection (or I-projection) of $P$ onto the closure of $\mathcal{E}$.

Among all distributions in $\mathcal{E}$, the one closest to $P$ in KL divergence dominates the probability. The empirical distribution lands in $\mathcal{E}$ primarily by "aiming" at $Q^*$ — the cheapest route. All other distributions in $\mathcal{E}$ contribute sub-exponentially relative to $Q^*$.
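A sketch checking the exponent numerically for a Bernoulli source, with the same set $\mathcal{E} = \{Q : Q(1) \geq 0.8\}$ as in the Quick Check below: the finite-$n$ rate $-\frac{1}{n}\log \mathbb{P}(\hat{P}_n \in \mathcal{E})$ approaches $D(Q^* \| P)$ as $n$ grows, up to the polynomial correction.

```python
import math
from math import comb

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

p, threshold = 0.5, 0.8
print("D(Q* || P) =", kl(threshold, p))       # ~0.193 nats
for n in [50, 200, 1000]:
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k / n >= threshold)
    print(n, -math.log(prob) / n)             # approaches 0.193 from above
```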


Example: Sanov's Theorem Applied to Hypothesis Testing

A coin has unknown bias. Under $H_0$: $P = \text{Ber}(1/2)$; under $H_1$: $P = \text{Ber}(1/3)$. We flip $n$ times and reject $H_0$ if the fraction of heads is $\leq 0.4$. What is the exponential decay rate of the Type I error probability?
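By Sanov's theorem, the Type I exponent is $\min\{D(Q \| \text{Ber}(1/2)) : Q(\text{heads}) \leq 0.4\}$, attained at the boundary point $Q^* = \text{Ber}(0.4)$ since $\text{Ber}(1/2)$ lies outside the set. A one-line computation (a sketch; the numeric answer is not stated in the original):

```python
import math

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

print(kl(0.4, 0.5))   # ~0.0201 nats, so Type I error ~ e^(-0.0201 n)
```

The exponent is small because the rejection threshold 0.4 sits close to the true mean 1/2, so large samples are needed before the Type I error becomes negligible.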

Historical Note: I. N. Sanov and the Method of Types

1957–1981

Ivan Nikolaevich Sanov (1919–1968) was a Soviet mathematician who proved his celebrated theorem in 1957. His work established the KL divergence as the natural "cost" for empirical distributions to deviate from the true distribution. The method of types, which provides the combinatorial machinery behind Sanov's theorem, was later systematized by Csiszár and Körner in their landmark 1981 monograph. The method of types has become one of the most powerful tools in information theory, yielding clean proofs of the channel coding theorem, source coding theorem, and multi-user results.

Sanov's Theorem Implies Cramér's Theorem (Finite Alphabets)

For finite-alphabet i.i.d. sequences, Cramér's theorem is a corollary of Sanov's theorem. The event $\{\bar{X}_n \geq a\}$ corresponds to $\mathcal{E} = \{Q : \sum_x x\,Q(x) \geq a\}$ in the space of distributions. The I-projection $Q^*$ onto this set gives $D(Q^* \| P)$, which equals $I(a)$ from Cramér's theorem. In this sense, Sanov's theorem is the more fundamental result, and KL divergence is the universal rate function for empirical distributions.
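To make the correspondence concrete, here is a sketch (alphabet and values are illustrative) that finds the I-projection for a mean constraint by exponential tilting, $Q^*(x) \propto P(x)e^{\lambda x}$, choosing $\lambda$ so the tilted mean equals $a$; the resulting $D(Q^* \| P)$ then equals the Cramér rate $I(a) = \lambda a - \log \sum_x P(x)e^{\lambda x}$:

```python
import math

xs = [0.0, 1.0, 2.0]          # illustrative finite alphabet
P = [0.5, 0.3, 0.2]           # illustrative reference distribution
a = 1.2                       # target mean (above E_P[X] = 0.7)

def tilted(lam):
    """Exponentially tilted distribution Q(x) ~ P(x) e^(lam x) and its normalizer."""
    w = [p * math.exp(lam * x) for p, x in zip(P, xs)]
    z = sum(w)
    return [wi / z for wi in w], z

# Bisection on lambda: the tilted mean is increasing in lambda.
lo, hi = 0.0, 50.0
for _ in range(100):
    lam = (lo + hi) / 2
    Q, _ = tilted(lam)
    mean = sum(q * x for q, x in zip(Q, xs))
    lo, hi = (lam, hi) if mean < a else (lo, lam)

Q, z = tilted(lam)
D = sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)
print(Q)                            # the I-projection Q*
print(D, lam * a - math.log(z))     # D(Q* || P) equals I(a)
```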

Information Projection (I-Projection)

Given a set of distributions $\mathcal{E}$ and a reference distribution $P$, the I-projection is $Q^* = \arg\min_{Q \in \mathcal{E}} D(Q \| P)$. It is the distribution in $\mathcal{E}$ that is "closest" to $P$ in the KL sense. When $\mathcal{E}$ is convex, the I-projection is unique.
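When the set has no closed-form projection, the I-projection can be computed numerically as a convex program. A sketch, assuming NumPy and SciPy are available, for the set used in the Quick Check below:

```python
import numpy as np
from scipy.optimize import minimize

P = np.array([0.5, 0.5])        # reference distribution: Bernoulli(0.5)

def kl(Q):
    """D(Q || P) in nats, with the 0 log 0 = 0 convention."""
    Q = np.asarray(Q)
    mask = Q > 0
    return float(np.sum(Q[mask] * np.log(Q[mask] / P[mask])))

# Convex feasible set E = {Q : Q(1) >= 0.8}, intersected with the simplex.
constraints = [
    {"type": "eq",   "fun": lambda Q: np.sum(Q) - 1.0},   # probabilities sum to 1
    {"type": "ineq", "fun": lambda Q: Q[1] - 0.8},        # membership in E
]
res = minimize(kl, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)],
               constraints=constraints, method="SLSQP")
print(res.x, res.fun)   # ~[0.2, 0.8], D ~ 0.193 nats, matching the Quick Check
```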

Related: Kullback-Leibler Divergence, Sanov Theorem

Common Mistake: The Direction of KL Divergence in Sanov's Theorem

Mistake:

Writing the exponent as $D(P \| Q^*)$ instead of $D(Q^* \| P)$.

Correction:

Sanov's theorem uses $D(Q^* \| P)$ — the divergence from the true distribution $P$ to the most likely "impostor" $Q^*$. The order matters because KL divergence is asymmetric. The I-projection minimizes $D(Q \| P)$ over $Q \in \mathcal{E}$, not $D(P \| Q)$.

Why This Matters: From Sanov's Theorem to Channel Coding Error Exponents

In channel coding, a decoding error occurs when the empirical joint type of the transmitted codeword and the received sequence "looks like" it came from a different codeword. Sanov's theorem (and the method of types more broadly) provides the exponential rate at which this confusion probability decays with block length. This is the foundation of error exponent analysis for discrete memoryless channels, treated fully in Book ITA, Chapter 4.

Quick Check

If $X_1, \ldots, X_n$ are i.i.d. Bernoulli(0.5) and $\mathcal{E} = \{Q : Q(1) \geq 0.8\}$, what is the exponential rate of $\mathbb{P}(\hat{P}_n \in \mathcal{E})$?

$D(\text{Ber}(0.8) \| \text{Ber}(0.5)) \approx 0.193$ nats

$D(\text{Ber}(0.5) \| \text{Ber}(0.8)) \approx 0.223$ nats

$(0.8 - 0.5)^2 / (2 \cdot 0.25) = 0.18$

$H(\text{Ber}(0.8)) \approx 0.50$ nats
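The correct option is the first: by Sanov's theorem the rate is $D(Q^* \| P)$ with $Q^* = \text{Ber}(0.8)$, the I-projection of $\text{Ber}(0.5)$ onto $\mathcal{E}$. A sketch evaluating all four candidates:

```python
import math

def kl(q, p):
    """Binary KL divergence D(Ber(q) || Ber(p)) in nats."""
    return sum(a * math.log(a / b)
               for a, b in [(q, p), (1 - q, 1 - p)] if a > 0)

def entropy(q):
    """Binary entropy H(Ber(q)) in nats."""
    return -sum(a * math.log(a) for a in (q, 1 - q) if a > 0)

print(kl(0.8, 0.5))                    # ~0.193 -- the Sanov rate (correct)
print(kl(0.5, 0.8))                    # ~0.223 -- wrong direction of KL
print((0.8 - 0.5) ** 2 / (2 * 0.25))   #  0.18  -- Gaussian approximation only
print(entropy(0.8))                    # ~0.500 -- an entropy, not a rate
```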

Key Takeaway

Sanov's theorem establishes KL divergence as the universal rate function for empirical distributions: the probability that $\hat{P}_n$ lands in a set $\mathcal{E}$ decays as $e^{-nD(Q^* \| P)}$ where $Q^*$ is the I-projection of the true distribution onto $\mathcal{E}$. This makes KL divergence the natural measure of "statistical distance" in exponential-rate problems, and underpins error exponent analysis in information theory.