The Asymptotic Equipartition Property (AEP)

The Law of Large Numbers for Information Theory

The law of large numbers tells us that the sample mean of i.i.d. random variables converges to the true mean. The asymptotic equipartition property (AEP) is the information-theoretic analogue: for an i.i.d. source, the per-symbol negative log-probability of a typical sequence converges to the entropy $H(X)$. In other words, among the $|\mathcal{X}|^n$ possible sequences, only about $2^{nH(X)}$ of them carry appreciable probability, and each of these "typical" sequences has probability approximately $2^{-nH(X)}$.

This concentration phenomenon is the engine behind both source coding (we only need $nH(X)$ bits to describe the typical sequences) and channel coding (the code needs to "pack" only $2^{nR}$ codewords into the typical output set of size $2^{nH(Y)}$). Understanding typicality is the single most important step toward understanding the coding theorems.

Theorem: The Weak AEP

Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim P_X$ on a finite alphabet $\mathcal{X}$. Then:

$$-\frac{1}{n}\log P_{X^n}(X_1, \ldots, X_n) \xrightarrow{P} H(X).$$

That is, for any $\epsilon > 0$:

$$\Pr\!\left(\left|-\frac{1}{n}\log P_{X^n}(\mathbf{X}) - H(X)\right| > \epsilon\right) \to 0 \quad \text{as } n \to \infty.$$

For i.i.d. sequences, $-\frac{1}{n}\log P(\mathbf{x}) = -\frac{1}{n}\sum_{i=1}^n \log p(x_i)$. This is the sample mean of the i.i.d. random variables $-\log p(X_i)$, each with mean $H(X) = \mathbb{E}[-\log p(X)]$. The weak law of large numbers directly gives convergence in probability.
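This convergence is easy to watch numerically. A minimal Python sketch (the Bernoulli(0.3) source matches the example later in this section; the helper names are ours): it draws many sequences of each length and shows the spread of $-\frac{1}{n}\log_2 P(\mathbf{x})$ shrinking around $H(X)$ as $n$ grows.

```python
import math
import random

def entropy_bits(p):
    """Binary entropy H(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def neg_log_prob_rate(p, n, rng):
    """Draw one i.i.d. Bernoulli(p) sequence and return -(1/n) log2 P(x)."""
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        total -= math.log2(p if x == 1 else 1 - p)
    return total / n

rng = random.Random(0)
p = 0.3
print(f"H(X) = {entropy_bits(p):.4f} bits")
for n in (10, 100, 1000):
    vals = [neg_log_prob_rate(p, n, rng) for _ in range(200)]
    spread = max(vals) - min(vals)
    print(f"n={n:5d}: spread of -(1/n) log2 P over 200 samples = {spread:.3f}")
```

The printed spread collapses toward zero, which is exactly the weak-law statement above.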

Asymptotic equipartition property (AEP)

The property that for i.i.d. sequences, $-\frac{1}{n}\log P(\mathbf{X}) \to H(X)$ in probability. Implies that typical sequences all have roughly equal probability $\approx 2^{-nH(X)}$.

Related: Typical set

Definition:

Weakly Typical Set

The weakly typical set $A_\epsilon^{(n)}$ with respect to $P_X$ is:

$$A_\epsilon^{(n)} = \left\{\mathbf{x} \in \mathcal{X}^n : \left|-\frac{1}{n}\log P_{X^n}(\mathbf{x}) - H(X)\right| \leq \epsilon\right\}.$$

Equivalently, $\mathbf{x} \in A_\epsilon^{(n)}$ iff $2^{-n(H(X)+\epsilon)} \leq P_{X^n}(\mathbf{x}) \leq 2^{-n(H(X)-\epsilon)}$.
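The two forms of the condition can be checked against each other by brute force for small $n$. A Python sketch (the choices $p = 0.3$, $\epsilon = 0.1$, $n = 10$ are illustrative):

```python
import itertools
import math

p, eps = 0.3, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_log_form(x):
    """|-(1/n) log2 P(x) - H| <= eps"""
    n = len(x)
    log_p = sum(math.log2(p if xi else 1 - p) for xi in x)
    return abs(-log_p / n - H) <= eps

def typical_prob_form(x):
    """2^{-n(H+eps)} <= P(x) <= 2^{-n(H-eps)}"""
    n = len(x)
    prob = math.prod(p if xi else 1 - p for xi in x)
    return 2 ** (-n * (H + eps)) <= prob <= 2 ** (-n * (H - eps))

n = 10
seqs = list(itertools.product((0, 1), repeat=n))
agree = sum(typical_log_form(x) == typical_prob_form(x) for x in seqs)
count = sum(typical_log_form(x) for x in seqs)
print(f"{agree}/{len(seqs)} sequences classified identically; {count} typical")
```

For these parameters only the sequences with exactly three ones are typical, so the typical set has $\binom{10}{3} = 120$ members out of $1024$.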

Theorem: Properties of the Typical Set

For any $\epsilon > 0$ and sufficiently large $n$:

  1. High probability: $\Pr(\mathbf{X} \in A_\epsilon^{(n)}) > 1 - \epsilon$.
  2. Upper bound on size: $|A_\epsilon^{(n)}| \leq 2^{n(H(X) + \epsilon)}$.
  3. Lower bound on size: $|A_\epsilon^{(n)}| \geq (1-\epsilon)\,2^{n(H(X) - \epsilon)}$.

Property 1: almost all probability mass concentrates on the typical set. Properties 2-3: the typical set contains about $2^{nH(X)}$ sequences. The total number of sequences is $|\mathcal{X}|^n = 2^{n\log|\mathcal{X}|}$. Since $H(X) \leq \log|\mathcal{X}|$, with strict inequality unless $P_X$ is uniform, the typical set is an exponentially small fraction of all sequences, yet it carries almost all the probability.
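For a Bernoulli source these three properties can be verified exactly, because every sequence with $k$ ones has the same probability; grouping by $k$ avoids enumerating all $2^n$ sequences. A sketch (the choice $p = 0.3$, $\epsilon = 0.05$ is illustrative; the probability mass is accumulated in the log domain to avoid floating-point underflow at large $n$):

```python
import math

p, eps = 0.3, 0.05
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_set_stats(n):
    """Exact Pr(X^n typical) and |A_eps^(n)|, grouping sequences by number of ones."""
    prob_mass, size = 0.0, 0
    for k in range(n + 1):
        # every sequence with k ones has probability p^k (1-p)^(n-k)
        rate = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        if abs(rate - H) <= eps:
            count = math.comb(n, k)
            size += count
            # count * p^k (1-p)^(n-k), computed via logs to avoid underflow
            prob_mass += 2.0 ** (math.log2(count) + k * math.log2(p)
                                 + (n - k) * math.log2(1 - p))
    return prob_mass, size

for n in (100, 500, 2000):
    mass, size = typical_set_stats(n)
    print(f"n={n:4d}: Pr(typical) = {mass:.5f}, log2|A| = {math.log2(size):.1f}, "
          f"log2(fraction of all sequences) = {math.log2(size) - n:.1f}")
```

As $n$ grows, the probability of the typical set climbs toward 1 while its fraction of all $2^n$ sequences shrinks exponentially, matching Properties 1-3.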

Example: AEP for a Binary Source

Let $X_i \sim \text{Bernoulli}(0.3)$ i.i.d. For $n = 100$, estimate the size of the typical set and compare with $2^n$.
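One way to work this example, as a short sketch (the numbers follow directly from the binary entropy formula):

```python
import math

p, n = 0.3, 100
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"H(X) = {H:.4f} bits")              # about 0.8813 bits
print(f"|typical set| ~ 2^{n * H:.1f}")    # about 2^88.1
print(f"all sequences:  2^{n}")            # 2^100
print(f"fraction ~ 2^{n * H - n:.1f}")     # about 2^-11.9
```

So the typical set holds roughly $2^{88.1}$ of the $2^{100}$ sequences: only about one in every $2^{11.9} \approx 3800$ sequences is typical, yet those sequences carry nearly all the probability.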

AEP: Probability Concentration on the Typical Set

Visualize how the probability mass concentrates on the typical set as the sequence length $n$ increases. The histogram shows the distribution of $-\frac{1}{n}\log P(\mathbf{X})$ for random i.i.d. sequences, converging to a spike at $H(X)$.

Parameters

- $P(X=1)$ for the Bernoulli source (default 0.3)
- Length $n$ of the i.i.d. sequences (default 100)

Historical Note: Shannon's Original Insight


Shannon recognized the AEP in his 1948 paper, calling it the "equipartition property." He observed that for long sequences from an ergodic source, the sequences divide naturally into two classes: a "typical" set of about $2^{nH}$ sequences that are all roughly equally likely, and an "atypical" set that has negligible total probability despite containing vastly more sequences. This dichotomy is the conceptual foundation of both lossless and lossy source coding.

The name "asymptotic equipartition property" was coined later, by analogy with the equipartition of energy in statistical mechanics. The connection is not merely nominal: both phenomena arise from concentration of measure in high dimensions.

Quick Check

For a source with entropy $H(X) = 2$ bits and alphabet size $|\mathcal{X}| = 8$, approximately what fraction of all length-$n$ sequences are typical?

$2^{-n}$

$1/2$

$2^{2n}/2^{8n}$

Close to $1$ for large $n$

Common Mistake: Most Sequences Are Atypical, But Typical Sequences Carry All the Probability

Mistake:

Confusing "fraction of sequences that are typical" (which goes to zero) with "probability that a random sequence is typical" (which goes to one).

Correction:

The typical set contains about $2^{nH}$ sequences out of $|\mathcal{X}|^n = 2^{n\log|\mathcal{X}|}$ total. Since $H < \log|\mathcal{X}|$ for non-uniform sources, the fraction $2^{nH}/2^{n\log|\mathcal{X}|} = 2^{-n(\log|\mathcal{X}| - H)} \to 0$. Yet $\Pr(\text{typical}) \to 1$. The resolution: typical sequences are individually far more probable than almost all atypical ones, which compensates for their smaller count.
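The exponent in that vanishing fraction is easy to tabulate. A small sketch (the helper name is ours; the first call uses the Quick Check numbers, the second the Bernoulli(0.3) source from the earlier example):

```python
import math

def typical_fraction_log2(H, alphabet_size, n):
    """log2 of (|typical set| / |all sequences|) ~ nH - n log2|X|."""
    return n * (H - math.log2(alphabet_size))

# H = 2 bits over an alphabet of size 8: fraction ~ 2^-n
print(typical_fraction_log2(2.0, 8, 100))     # -> -100.0
# Bernoulli(0.3): H ~ 0.8813 bits over a binary alphabet
print(typical_fraction_log2(0.8813, 2, 100))  # about -11.87
```

The fraction decays like $2^{-n(\log|\mathcal{X}| - H)}$, so the gap $\log|\mathcal{X}| - H$ sets the decay rate; it is zero only for a uniform source, for which every sequence is typical.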

Typical set

The set of sequences whose empirical log-probability is close to the entropy: $|\frac{1}{n}\log P(\mathbf{x}) + H(X)| \leq \epsilon$. Contains about $2^{nH(X)}$ sequences and captures almost all the probability mass for large $n$.

Related: Asymptotic equipartition property (AEP), Strong typicality

Typical Set Concentration as n Grows

Visualizes how the typical set concentrates as the sequence length $n$ increases. The typical set shrinks as a fraction of all sequences but carries nearly all the probability.