The Rate-Distortion Theorem

The Covering Lemma at Work

We defined $R(D)$ as an optimization over test channels. But does this quantity actually equal the minimum achievable rate? Shannon's rate-distortion theorem says yes: any rate $R > R(D)$ is achievable, and any $R < R(D)$ is not. The achievability proof uses the covering lemma, the dual of the packing lemma used for channel coding. Where packing asks "how many non-overlapping spheres fit?", covering asks "how many spheres do we need to cover the space?" This duality between channel coding and source coding is one of the most beautiful structural features of information theory.

Theorem: Rate-Distortion Theorem (Shannon, 1959)

For a DMS with distribution $P_X$ and distortion measure $d$:

Achievability: For any $R > R(D)$ and any $\epsilon > 0$, there exists, for all $n$ sufficiently large, a $(2^{nR}, n)$ lossy code with $\mathbb{E}[d(\mathbf{X}, \hat{\mathbf{X}})] \leq D + \epsilon$.

Converse: For any sequence of $(2^{nR_n}, n)$ codes with $\mathbb{E}[d(\mathbf{X}, \hat{\mathbf{X}})] \leq D$, we must have $\liminf_{n \to \infty} R_n \geq R(D)$.
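Before sketching the proofs, note that the optimization defining $R(D)$ can be evaluated numerically with the Blahut-Arimoto algorithm. Below is a minimal Python sketch, not the text's own code; the Bernoulli(1/2) source, Hamming distortion matrix, and slope parameters beta are illustrative assumptions, checked against the known answer $R(D) = 1 - H(D)$ for this source.

```python
import numpy as np

def blahut_arimoto_rd(p_x, d, beta, n_iter=500):
    """One point on the R(D) curve for slope parameter beta >= 0.

    p_x  : source distribution, shape (|X|,)
    d    : distortion matrix d[x, xhat], shape (|X|, |Xhat|)
    Returns (D, R) with R in bits; larger beta gives smaller D.
    """
    q = np.full(d.shape[1], 1.0 / d.shape[1])   # reproduction marginal q(xhat)
    for _ in range(n_iter):
        # Optimal test channel for the current q: P(xhat|x) ~ q(xhat) exp(-beta d)
        w = q * np.exp(-beta * d)
        p_rev = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_rev                          # re-optimized reproduction marginal
    joint = p_x[:, None] * p_rev
    D = float(np.sum(joint * d))
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(joint > 0, np.log2(p_rev / q), 0.0)
    R = float(np.sum(joint * log_ratio))         # I(X; Xhat) under this channel
    return D, R

# Binary source with Hamming distortion: compare against R(D) = 1 - H(D).
h2 = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
for beta in (2.0, 4.0, 8.0):
    D, R = blahut_arimoto_rd(p_x, d, beta)
    print(f"beta={beta}: D={D:.3f}, R={R:.3f} bits, 1-H(D)={1 - h2(D):.3f}")
```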

The achievability proof is constructive: generate $2^{nR}$ random codewords from the optimal test channel output distribution, then map each source sequence to the nearest codeword. The covering lemma guarantees that $2^{nR}$ codewords suffice to "cover" the typical source sequences with high probability, as long as $R > I(X;\hat{X})$.
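The covering step can also be seen empirically. The Monte Carlo sketch below uses assumed toy parameters (blocklength $n = 20$, Bernoulli(1/2) source, Hamming distortion $D = 0.11$, so $R(D) = 1 - H(0.11) \approx 0.5$), draws $2^{nR}$ random codewords, and measures how often a fresh source sequence lands within distortion $D$ of the codebook. At this tiny blocklength the threshold is blurred, but coverage clearly improves once $R$ exceeds $R(D)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, trials = 20, 0.11, 200
h2 = lambda t: -t * np.log2(t) - (1 - t) * np.log2(1 - t)
print(f"R(D) = 1 - H({D}) = {1 - h2(D):.3f} bits")

for R in (0.4, 0.6, 0.8):                 # rates straddling R(D) ~ 0.5
    M = int(round(2 ** (n * R)))          # codebook size 2^{nR}
    codebook = rng.integers(0, 2, size=(M, n))  # i.i.d. Bernoulli(1/2) codewords
    covered = 0
    for _ in range(trials):
        x = rng.integers(0, 2, size=n)    # fresh source sequence
        dmin = np.min(np.sum(codebook != x, axis=1)) / n
        covered += dmin <= D              # nearest codeword within distortion?
    print(f"R = {R}: fraction covered = {covered / trials:.2f}")
```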

The converse uses the data-processing inequality to show that $nR \geq I(\mathbf{X}; \hat{\mathbf{X}})$, then single-letterizes: the block mutual information is lower-bounded by a sum of single-letter terms, and the convexity of $R(D)$ finishes the argument, as the chain below makes explicit.
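Written out, the standard chain of inequalities is the following, with $M$ denoting the encoder's index (a notational assumption here, as the text does not name it):

```latex
\begin{align*}
nR &\ge H(M) \ge I(\mathbf{X}; M) \ge I(\mathbf{X}; \hat{\mathbf{X}})
     && \text{(data processing)} \\
   &\ge \sum_{i=1}^{n} I(X_i; \hat{X}_i)
     && \text{(source is memoryless)} \\
   &\ge \sum_{i=1}^{n} R\!\left(\mathbb{E}[d(X_i, \hat{X}_i)]\right)
     \ge n\, R\!\left(\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[d(X_i, \hat{X}_i)]\right)
     && \text{(definition, then convexity of } R(D)\text{)} \\
   &\ge n\, R(D)
     && \text{(} R(D) \text{ non-increasing in } D\text{).}
\end{align*}
```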

Packing vs. Covering: The Grand Duality

Information theory rests on two complementary geometric operations:

  • Packing (channel coding): How many non-overlapping "decoding spheres" fit in the output space? Answer: $2^{nC}$. Each codeword needs its own sphere of size $2^{nH(Y|X)}$, and the output space has $2^{nH(Y)}$ typical sequences. Ratio: $2^{nI(X;Y)}$.

  • Covering (source coding): How many "reproduction spheres" do we need to cover the source space? Answer: $2^{nR(D)}$. Each codeword covers $2^{nH(X|\hat{X})}$ typical source sequences, and the source space has $2^{nH(X)}$ typical sequences. Ratio: $2^{nI(X;\hat{X})}$.

The same mutual information quantity governs both, but it is maximized for channels and minimized for sources. This is the source-channel duality at the heart of information theory.
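As a numeric instance of the duality (the crossover and distortion values below are assumptions chosen to make the symmetry stark, not values from the text): a BSC with crossover 0.11 and a Bernoulli(1/2) source with Hamming distortion $D = 0.11$ have identical exponents.

```python
import math

def h2(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, eps = 100, 0.11
# Packing: 2^{nH(Y)} typical outputs, 2^{nH(Y|X)} per decoding sphere
packing = 1 - h2(eps)    # exponent of 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{nC}
# Covering: 2^{nH(X)} typical sources, 2^{nH(X|Xhat)} per distortion ball
covering = 1 - h2(eps)   # exponent of 2^{nH(X)} / 2^{nH(X|Xhat)} = 2^{nR(D)}
print(f"BSC(0.11), n={n}:         about 2^{n * packing:.0f} spheres pack the output space")
print(f"Bern(1/2), D=0.11, n={n}: about 2^{n * covering:.0f} balls cover the source space")
```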

Channel Coding vs. Source Coding Duality

Aspect | Channel Coding | Source Coding (Lossy)
Goal | Maximize rate with $P_e \to 0$ | Minimize rate with $D \leq D_0$
Fundamental limit | $C = \max_{P_X} I(X;Y)$ | $R(D) = \min_{P_{\hat{X}\mid X}:\, \mathbb{E}[d] \leq D} I(X;\hat{X})$
Geometric operation | Packing (non-overlapping spheres) | Covering (overlapping spheres)
Key lemma | Packing lemma | Covering lemma
Optimization variable | Input distribution $P_X$ (maximized) | Test channel $P_{\hat{X}\mid X}$ (minimized)
Convexity | $I$ concave in $P_X$ | $I$ convex in $P_{\hat{X}\mid X}$

Quick Check

Shannon's separation theorem states that for transmitting a source over a channel, source and channel coding can be designed independently (separated) without loss of optimality. The condition for the source to be transmissible is:

  • $H(X) < C$

  • $R(D) < C$

  • $R(D) < H(X)$

  • $D < C$
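To put numbers on the separation condition (the source and channel below are assumptions for illustration, not from the text), here is a quick check in Python: a Bernoulli(0.11) source has $H(X) \approx 0.5$ bits, while a BSC(0.03) has $C \approx 0.806$ bits, so even lossless transmission fits through the channel.

```python
import math

h2 = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)

H_X = h2(0.11)        # entropy of a Bernoulli(0.11) source
C = 1 - h2(0.03)      # capacity of a BSC(0.03)
print(f"H(X) = {H_X:.3f} bits, C = {C:.3f} bits")
print(f"lossless transmissible (H(X) < C): {H_X < C}")
# With distortion allowed, R(D) <= H(X), so R(D) < C holds a fortiori.
```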

The Covering Lemma: Rate-Distortion Achievability

Geometric visualization of how $2^{nR}$ reproduction codewords "cover" the source typical set. Each codeword's distortion ball covers $\doteq 2^{nH(X|\hat{X})}$ source sequences. When $R > I(X;\hat{X})$, covering succeeds with high probability.

Historical Note: Shannon's Separation Theorem


Shannon proved the separation theorem in 1959, showing that source coding and channel coding can be designed independently without losing optimality. This was surprising: one might expect that joint source-channel coding (optimizing encoder and decoder together) would outperform the separated approach. For point-to-point communication with i.i.d. sources and memoryless channels, separation is optimal. However, for multi-user settings (broadcast, multiple access), separation is generally not optimal, leading to the rich theory of joint source-channel coding. Even in the point-to-point setting, separation requires infinite blocklength; for finite-blocklength systems, joint coding can provide modest gains, but the engineering convenience of separation (designing compression and error correction independently) is overwhelming.

Key Takeaway

The rate-distortion theorem confirms that $R(D)$ is the exact minimum rate for lossy compression at distortion $D$. Achievability uses the covering lemma (random codebook plus joint-typicality encoding); the converse uses data processing and the convexity of $R(D)$. The duality with channel coding (packing vs. covering, maximization vs. minimization) is one of the deepest structural insights in information theory. Shannon's separation theorem shows that source and channel coding can be designed independently: the source is transmissible iff $R(D) < C$.