The Rate-Distortion Theorem

The Covering Lemma at Work

We defined $R(D)$ as an optimization over test channels. But does this quantity actually equal the minimum achievable rate? Shannon's rate-distortion theorem says yes: any rate $R > R(D)$ is achievable, and any $R < R(D)$ is not. The achievability proof uses the covering lemma, the dual of the packing lemma used for channel coding. Where packing asks "how many non-overlapping spheres fit?", covering asks "how many spheres do we need to cover the space?" This duality between channel coding and source coding is one of the most beautiful structural features of information theory.

Theorem: Rate-Distortion Theorem (Shannon, 1959)

For a DMS with distribution $P_X$ and distortion measure $d$:

Achievability: For any $R > R(D)$ and any $\epsilon > 0$, there exists, for all $n$ sufficiently large, a $(2^{nR}, n)$ lossy code with $\mathbb{E}[d(\mathbf{X}, \hat{\mathbf{X}})] \leq D + \epsilon$.

Converse: For any sequence of $(2^{nR_n}, n)$ codes with $\mathbb{E}[d(\mathbf{X}, \hat{\mathbf{X}})] \leq D$, we must have $\liminf_{n \to \infty} R_n \geq R(D)$.
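Before sketching the proofs, note that the optimization defining $R(D)$ can be evaluated numerically with the Blahut-Arimoto algorithm. Below is a minimal Python sketch, not the text's own code; the Bernoulli(1/2) source, Hamming distortion matrix, and slope parameters beta are illustrative assumptions, checked against the known answer $R(D) = 1 - H(D)$ for this source.

```python
import numpy as np

def blahut_arimoto_rd(p_x, d, beta, n_iter=500):
    """One point on the R(D) curve for slope parameter beta >= 0.

    p_x  : source distribution, shape (|X|,)
    d    : distortion matrix d[x, xhat], shape (|X|, |Xhat|)
    Returns (D, R) with R in bits; larger beta gives smaller D.
    """
    q = np.full(d.shape[1], 1.0 / d.shape[1])   # reproduction marginal q(xhat)
    for _ in range(n_iter):
        # Optimal test channel for the current q: P(xhat|x) ~ q(xhat) exp(-beta d)
        w = q * np.exp(-beta * d)
        p_rev = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_rev                          # re-optimized reproduction marginal
    joint = p_x[:, None] * p_rev
    D = float(np.sum(joint * d))
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(joint > 0, np.log2(p_rev / q), 0.0)
    R = float(np.sum(joint * log_ratio))         # I(X; Xhat) under this channel
    return D, R

# Binary source with Hamming distortion: compare against R(D) = 1 - H(D).
h2 = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
for beta in (2.0, 4.0, 8.0):
    D, R = blahut_arimoto_rd(p_x, d, beta)
    print(f"beta={beta}: D={D:.3f}, R={R:.3f} bits, 1-H(D)={1 - h2(D):.3f}")
```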

The achievability proof is constructive: generate $2^{nR}$ random codewords from the optimal test channel output distribution, then map each source sequence to the nearest codeword. The covering lemma guarantees that $2^{nR}$ codewords suffice to "cover" the typical source sequences with high probability, as long as $R > I(X;\hat{X})$.
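The covering step can also be seen empirically. The Monte Carlo sketch below uses assumed toy parameters (blocklength $n = 20$, Bernoulli(1/2) source, Hamming distortion $D = 0.11$, so $R(D) = 1 - H(0.11) \approx 0.5$), draws $2^{nR}$ random codewords, and measures how often a fresh source sequence lands within distortion $D$ of the codebook. At this tiny blocklength the threshold is blurred, but coverage clearly improves once $R$ exceeds $R(D)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, trials = 20, 0.11, 200
h2 = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
print(f"R(D) = 1 - H({D}) = {1 - h2(D):.3f} bits")

for R in (0.4, 0.6, 0.8):                 # rates straddling R(D) ~ 0.5
    M = int(round(2 ** (n * R)))          # codebook size 2^{nR}
    codebook = rng.integers(0, 2, size=(M, n))  # i.i.d. Bernoulli(1/2) codewords
    covered = 0
    for _ in range(trials):
        x = rng.integers(0, 2, size=n)    # fresh source sequence
        dmin = np.min(np.sum(codebook != x, axis=1)) / n
        covered += dmin <= D              # nearest codeword within distortion?
    print(f"R = {R}: fraction covered = {covered / trials:.2f}")
```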

The converse uses the data-processing inequality to show that $nR \geq I(\mathbf{X}; \hat{\mathbf{X}})$, then single-letterizes: the block mutual information is lower-bounded by a sum of single-letter terms, and the convexity of $R(D)$ finishes the argument, as the chain below makes explicit.
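Written out, the standard chain of inequalities is the following, with $M$ denoting the encoder's index (a notational assumption here, as the text does not name it):

```latex
\begin{align*}
nR &\ge H(M) \ge I(\mathbf{X}; M) \ge I(\mathbf{X}; \hat{\mathbf{X}})
     && \text{(data processing)} \\
   &\ge \sum_{i=1}^{n} I(X_i; \hat{X}_i)
     && \text{(source is memoryless)} \\
   &\ge \sum_{i=1}^{n} R\!\left(\mathbb{E}[d(X_i, \hat{X}_i)]\right)
     \ge n\, R\!\left(\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[d(X_i, \hat{X}_i)]\right)
     && \text{(definition, then convexity of } R(D)\text{)} \\
   &\ge n\, R(D)
     && \text{(} R(D) \text{ non-increasing in } D\text{).}
\end{align*}
```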

Packing vs. Covering: The Grand Duality

Information theory rests on two complementary geometric operations:

  • Packing (channel coding): How many non-overlapping "decoding spheres" fit in the output space? Answer: $2^{nC}$. Each codeword needs its own sphere of size $2^{nH(Y|X)}$, and the output space has $2^{nH(Y)}$ typical sequences. Ratio: $2^{nI(X;Y)}$.

  • Covering (source coding): How many "reproduction spheres" do we need to cover the source space? Answer: $2^{nR(D)}$. Each codeword covers $2^{nH(X|\hat{X})}$ typical source sequences, and the source space has $2^{nH(X)}$ typical sequences. Ratio: $2^{nI(X;\hat{X})}$.

The same mutual information quantity governs both, but it is maximized for channels and minimized for sources. This is the source-channel duality at the heart of information theory.
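As a numeric instance of the duality (the crossover and distortion values below are assumptions chosen to make the symmetry stark, not values from the text): a BSC with crossover 0.11 and a Bernoulli(1/2) source with Hamming distortion $D = 0.11$ have identical exponents.

```python
import math

def h2(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, eps = 100, 0.11
# Packing: 2^{nH(Y)} typical outputs, 2^{nH(Y|X)} per decoding sphere
packing = 1 - h2(eps)    # exponent of 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{nC}
# Covering: 2^{nH(X)} typical sources, 2^{nH(X|Xhat)} per distortion ball
covering = 1 - h2(eps)   # exponent of 2^{nH(X)} / 2^{nH(X|Xhat)} = 2^{nR(D)}
print(f"BSC(0.11), n={n}:         about 2^{n * packing:.0f} spheres pack the output space")
print(f"Bern(1/2), D=0.11, n={n}: about 2^{n * covering:.0f} balls cover the source space")
```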

Channel Coding vs. Source Coding Duality

Aspect | Channel Coding | Source Coding (Lossy)
Goal | Maximize rate with $P_e \to 0$ | Minimize rate with $D \leq D_0$
Fundamental limit | $C = \max_{P_X} I(X;Y)$ | $R(D) = \min_{P_{\hat{X}\mid X}:\, \mathbb{E}[d] \leq D} I(X;\hat{X})$
Geometric operation | Packing (non-overlapping spheres) | Covering (overlapping spheres)
Key lemma | Packing lemma | Covering lemma
Optimization variable | Input distribution $P_X$ (maximized) | Test channel $P_{\hat{X}\mid X}$ (minimized)
Convexity | $I$ concave in $P_X$ | $I$ convex in $P_{\hat{X}\mid X}$

Quick Check

Shannon's separation theorem states that for transmitting a source over a channel, source and channel coding can be designed independently (separated) without loss of optimality. The condition for the source to be transmissible is:

  • $H(X) < C$

  • $R(D) < C$

  • $R(D) < H(X)$

  • $D < C$
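To put numbers on the separation condition (the source and channel below are assumptions for illustration, not from the text), here is a quick check in Python: a Bernoulli(0.11) source has $H(X) \approx 0.5$ bits, while a BSC(0.03) has $C \approx 0.806$ bits, so even lossless transmission fits through the channel.

```python
import math

h2 = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)

H_X = h2(0.11)        # entropy of a Bernoulli(0.11) source
C = 1 - h2(0.03)      # capacity of a BSC(0.03)
print(f"H(X) = {H_X:.3f} bits, C = {C:.3f} bits")
print(f"lossless transmissible (H(X) < C): {H_X < C}")
# With distortion allowed, R(D) <= H(X), so R(D) < C holds a fortiori.
```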

The Covering Lemma: Rate-Distortion Achievability

Geometric visualization of how $2^{nR}$ reproduction codewords "cover" the source typical set. Each codeword's distortion ball covers $\doteq 2^{nH(X|\hat{X})}$ source sequences. When $R > I(X;\hat{X})$, covering succeeds with high probability.

Historical Note: Shannon's Separation Theorem


Shannon proved the separation theorem in 1959, showing that source coding and channel coding can be designed independently without losing optimality. This was surprising: one might expect that joint source-channel coding (optimizing encoder and decoder together) would outperform the separated approach. For point-to-point communication with i.i.d. sources and memoryless channels, separation is optimal. However, for multi-user settings (broadcast, multiple access), separation is generally not optimal, leading to the rich theory of joint source-channel coding. Even in the point-to-point setting, separation requires infinite blocklength; for finite-blocklength systems, joint coding can provide modest gains, but the engineering convenience of separation (designing compression and error correction independently) is overwhelming.

Key Takeaway

The rate-distortion theorem confirms that $R(D)$ is the exact minimum rate for lossy compression at distortion $D$. Achievability uses the covering lemma (random codebook plus joint-typicality encoding); the converse uses data processing and the convexity of $R(D)$. The duality with channel coding (packing vs. covering, maximization vs. minimization) is one of the deepest structural insights in information theory. Shannon's separation theorem shows that source and channel coding can be designed independently: the source is transmissible iff $R(D) < C$.