The Rate-Distortion Theorem
The Covering Lemma at Work
We defined $R(D)$ as an optimization over test channels. But does this quantity actually equal the minimum achievable rate? Shannon's rate-distortion theorem says yes: any rate above $R(D)$ is achievable, and any rate below it is not. The achievability proof uses the covering lemma, the dual of the packing lemma used for channel coding. Where packing asks "how many non-overlapping spheres fit?", covering asks "how many spheres do we need to cover the space?" This duality between channel coding and source coding is one of the most beautiful structural features of information theory.
Theorem: Rate-Distortion Theorem (Shannon, 1959)
For a DMS $X \sim p(x)$ with distortion measure $d(x, \hat{x})$:

$$R(D) = \min_{p(\hat{x} \mid x)\,:\,\mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$$
Achievability: For any $\epsilon > 0$ and any $R > R(D)$, there exists, for all $n$ sufficiently large, a $(2^{nR}, n)$ lossy code with $\mathbb{E}[d(X^n, \hat{X}^n)] \le D + \epsilon$.
Converse: For any sequence of $(2^{nR}, n)$ codes with $\mathbb{E}[d(X^n, \hat{X}^n)] \le D$, we must have $R \ge R(D)$.
The achievability proof is constructive: generate $2^{nR}$ random codewords from the optimal test-channel output distribution, then map each source sequence to the nearest codeword. The covering lemma guarantees that $2^{nR}$ codewords suffice to "cover" the typical source sequences with high probability, as long as $R > I(X; \hat{X})$.
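As a concrete instance, the binary source with Hamming distortion has the closed form $R(D) = h_2(p) - h_2(D)$ for $D \le \min(p, 1-p)$. A minimal sketch of evaluating it (function names are illustrative, not from the text):

```python
import math

def h2(p):
    """Binary entropy in bits; endpoints give 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rd_binary(p, D):
    """R(D) = h2(p) - h2(D) for a Bernoulli(p) source under Hamming
    distortion, valid for 0 <= D <= min(p, 1 - p); zero beyond."""
    if D >= min(p, 1 - p):
        return 0.0
    return h2(p) - h2(D)

# A fair coin reproduced with at most 11% bit flips needs ~0.5 bits/symbol,
# so the covering lemma says roughly 2^(0.5 n) codewords cover the typical set.
print(f"R(0.11) = {rd_binary(0.5, 0.11):.3f} bits/symbol")
```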
The converse uses the data processing inequality and the convexity of $R(D)$ to show that $nR \ge I(X^n; \hat{X}^n)$ and that the single-letter mutual information lower-bounds the block mutual information.
Achievability: random codebook
Let $p^*(\hat{x} \mid x)$ achieve $R(D)$ and let $p^*(\hat{x})$ be the induced marginal on $\hat{X}$. Generate $2^{nR}$ codewords $\hat{x}^n(1), \dots, \hat{x}^n(2^{nR})$ i.i.d. $\sim \prod_{i=1}^n p^*(\hat{x}_i)$.
Achievability: encoding
Given source $x^n$, find an index $m$ such that $(x^n, \hat{x}^n(m))$ is jointly typical. If no such index exists, declare an error.
Achievability: covering lemma
By the covering lemma, the probability that no codeword is jointly typical with $X^n$ vanishes as long as $R > I(X; \hat{X})$. When a jointly typical codeword is found, the distortion is close to $\mathbb{E}[d(X, \hat{X})]$ by the typicality of the pair. Since the test channel satisfies $\mathbb{E}[d(X, \hat{X})] \le D$, the code achieves distortion at most $D + \epsilon$ with high probability.
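The covering step can be seen numerically: draw a random codebook at a fixed rate and encode fair-coin blocks by nearest-codeword search (a proxy for joint typicality encoding). A rough simulation, with blocklength and seed chosen for speed, so the measured distortion carries finite-$n$ slack above the asymptotic value:

```python
import random

random.seed(1)
n = 20                      # short block, so expect slack above the limit
rate = 0.5                  # for a fair coin, R(D) = 0.5 corresponds to D ~ 0.11
codebook = [[random.randint(0, 1) for _ in range(n)]
            for _ in range(int(2 ** (n * rate)))]   # 2^{nR} random codewords

def encode_distortion(source):
    """Hamming distortion per symbol of the nearest codeword."""
    return min(sum(s != c for s, c in zip(source, cw))
               for cw in codebook) / n

blocks = [[random.randint(0, 1) for _ in range(n)] for _ in range(50)]
avg = sum(encode_distortion(b) for b in blocks) / len(blocks)
print(f"average distortion at rate 0.5: {avg:.3f} (asymptotic value ~ 0.110)")
```

Raising the rate shrinks the achievable distortion, exactly as $R > 1 - h_2(D)$ predicts.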
Converse: data processing
The index $M$ satisfies $H(M) \le nR$. By the data processing inequality: $nR \ge H(M) \ge I(X^n; M) \ge I(X^n; \hat{X}^n)$, since $\hat{X}^n$ is a function of $M$.
Converse: single-letterization
By the chain rule and the memoryless property:

$$I(X^n; \hat{X}^n) = H(X^n) - H(X^n \mid \hat{X}^n) \ge \sum_{i=1}^n \left[ H(X_i) - H(X_i \mid \hat{X}_i) \right] = \sum_{i=1}^n I(X_i; \hat{X}_i).$$

Each term satisfies $I(X_i; \hat{X}_i) \ge R(D_i)$ where $D_i = \mathbb{E}[d(X_i, \hat{X}_i)]$. By convexity of $R(\cdot)$: $\frac{1}{n} \sum_i R(D_i) \ge R\left(\frac{1}{n} \sum_i D_i\right) \ge R(D)$. Therefore $R \ge R(D)$.
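The convexity step can be sanity-checked numerically in the binary case, where $R(D) = 1 - h_2(D)$ on $[0, 1/2]$. A small midpoint check of Jensen's inequality, the inequality the converse uses to pass from per-letter distortions $D_i$ to the average distortion (names illustrative):

```python
import math

def h2(p):
    """Binary entropy in bits; endpoints give 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def R(D):
    """R(D) = 1 - h2(D) for a fair coin under Hamming distortion."""
    return max(1.0 - h2(D), 0.0)

# Convexity (midpoint form): R((D1+D2)/2) <= (R(D1) + R(D2)) / 2.
for D1, D2 in [(0.05, 0.40), (0.10, 0.30), (0.01, 0.49)]:
    mid, chord = R((D1 + D2) / 2), (R(D1) + R(D2)) / 2
    assert mid <= chord + 1e-12
    print(f"R(mid) = {mid:.4f} <= chord = {chord:.4f}")
```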
Packing vs. Covering: The Grand Duality
Information theory rests on two complementary geometric operations:
- Packing (channel coding): How many non-overlapping "decoding spheres" fit in the output space? Answer: $2^{nI(X;Y)}$. Each codeword needs its own sphere of size $2^{nH(Y \mid X)}$, and the output space has $2^{nH(Y)}$ typical sequences. Ratio: $2^{nH(Y)} / 2^{nH(Y \mid X)} = 2^{nI(X;Y)}$.
- Covering (source coding): How many "reproduction spheres" do we need to cover the source space? Answer: $2^{nI(X;\hat{X})}$. Each codeword covers $2^{nH(X \mid \hat{X})}$ typical source sequences, and the source space has $2^{nH(X)}$ typical sequences. Ratio: $2^{nH(X)} / 2^{nH(X \mid \hat{X})} = 2^{nI(X;\hat{X})}$.
The same mutual information quantity governs both, maximized for channels and minimized for sources. This is the source-channel duality at the heart of information theory.
Channel Coding vs. Source Coding Duality
| Aspect | Channel Coding | Source Coding (Lossy) |
|---|---|---|
| Goal | Maximize rate with $P_e \to 0$ | Minimize rate with $\mathbb{E}[d] \le D$ |
| Fundamental limit | $C = \max_{p(x)} I(X;Y)$ | $R(D) = \min_{p(\hat{x} \mid x)} I(X;\hat{X})$ |
| Geometric operation | Packing (non-overlapping spheres) | Covering (overlapping spheres) |
| Key lemma | Packing lemma | Covering lemma |
| Optimization variable | Input distribution $p(x)$ (maximized) | Test channel $p(\hat{x} \mid x)$ (minimized) |
| Convexity | $I(X;Y)$ concave in $p(x)$ | $I(X;\hat{X})$ convex in $p(\hat{x} \mid x)$ |
Quick Check
Shannon's separation theorem states that for transmitting a source over a channel, source and channel coding can be designed independently (separated) without loss of optimality. The condition for the source to be transmissible is:
The source is transmissible at distortion $D$ iff $R(D) < C$ (assuming one channel use per source symbol). Separation: design a source code at a rate $R$ with $R(D) < R < C$, then a channel code at the same rate.
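The separation condition is easy to evaluate for concrete pairs. A sketch for a fair-coin source over a BSC($q$), where $C = 1 - h_2(q)$ and $R(D) = 1 - h_2(D)$, so $R(D) < C$ reduces to $D > q$ (parameter values are illustrative):

```python
import math

def h2(p):
    """Binary entropy in bits; endpoints give 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def transmissible(D, q):
    """Separation check for a fair-coin source over a BSC(q), one
    channel use per source symbol: transmissible iff R(D) < C."""
    R_D = max(1.0 - h2(D), 0.0)      # rate-distortion function
    C = 1.0 - h2(q)                  # BSC capacity
    return R_D < C

# Over a BSC(0.05): distortion 0.10 is reachable, distortion 0.02 is not,
# matching the closed-form condition D > q.
print(transmissible(0.10, 0.05), transmissible(0.02, 0.05))
```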
Historical Note: Shannon's Separation Theorem
Shannon proved the separation theorem in 1959, showing that source coding and channel coding can be designed independently without losing optimality. This was surprising: one might expect that joint source-channel coding, optimizing encoder and decoder together, would outperform the separated approach. For point-to-point communication with i.i.d. sources and memoryless channels, separation is optimal. However, for multi-user settings (broadcast, multiple access), separation is generally not optimal, leading to the rich theory of joint source-channel coding. Even in the point-to-point setting, separation requires infinite blocklength; for finite-blocklength systems, joint coding can provide modest gains, but the engineering convenience of separation (designing compression and error correction independently) is overwhelming.
Key Takeaway
The rate-distortion theorem confirms that $R(D)$ is the exact minimum rate for lossy compression at distortion $D$. Achievability uses the covering lemma (random codebook + joint typicality encoding); the converse uses data processing and the convexity of $R(D)$. The duality with channel coding (packing vs. covering, maximization vs. minimization) is one of the deepest structural insights in information theory. Shannon's separation theorem shows that source and channel coding can be designed independently: the source is transmissible iff $R(D) < C$.