Key Achievability Lemmas

The Reusable Building Blocks

Every achievability proof in information theory β€” channel coding, source coding, multiuser settings β€” can be built from a small number of modular lemmas. We now state the two most important: the packing lemma (for channel coding) and the covering lemma (for source coding). Once you understand these two lemmas, you can construct achievability proofs for channels and sources you have never seen before. This is the "proof pattern philosophy" that makes information theory so powerful: the same architecture β€” random codebook + typicality argument β€” works across dozens of settings.

Theorem: The Packing Lemma

Let $U^n(m)$, $m = 1, \ldots, 2^{nR}$, be codewords drawn i.i.d. $\sim P_U^n$, independently of each other. Let $Y^n$ be a sequence such that $(U^n(1), Y^n)$ is jointly typical (i.e., message $m = 1$ was sent). Then the probability that some incorrect codeword $U^n(m)$, $m \neq 1$, is also jointly typical with $Y^n$ satisfies:

$$\Pr\!\left(\bigcup_{m=2}^{2^{nR}} \left\{ (U^n(m), Y^n) \in \mathcal{T}_\epsilon^{(n)}(U,Y) \right\}\right) \to 0 \quad \text{if } R < I(U;Y).$$

The packing lemma says: if the code rate is below the mutual information, there is enough "room" in the jointly typical set to pack all codewords without confusion. Each incorrect codeword has probability $\approx 2^{-nI(U;Y)}$ of being jointly typical with $Y^n$. With $2^{nR}$ codewords, the union bound gives error probability $\leq 2^{nR} \cdot 2^{-nI(U;Y)} = 2^{-n(I(U;Y) - R)} \to 0$ when $R < I(U;Y)$.
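To make the union-bound argument concrete, here is a minimal Monte Carlo sketch for a BSC with uniform inputs, where joint typicality reduces to the empirical crossover fraction being close to $\epsilon$. All parameters ($n$, $R$, the slack $\delta$, the trial count) are illustrative choices, not values from the text.

```python
import numpy as np

# Packing lemma sketch on a BSC: with R < I(U;Y), an incorrect random
# codeword is rarely jointly typical with the output Y^n.
rng = np.random.default_rng(0)

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

n, eps, delta = 60, 0.11, 0.05      # blocklength, crossover prob, typicality slack
I_UY = 1 - h2(eps)                  # I(U;Y) for uniform inputs, about 0.5 bits
R = 0.3                             # rate below I(U;Y): packing should succeed
M = int(2 ** (n * R))               # codebook size 2^{nR}

confusions, trials = 0, 50
for _ in range(trials):
    codebook = rng.integers(0, 2, size=(M, n), dtype=np.int8)  # i.i.d. codewords
    noise = (rng.random(n) < eps).astype(np.int8)
    y = codebook[0] ^ noise                    # transmit codeword 0 over the BSC
    frac = np.mean(codebook != y, axis=1)      # empirical crossover per codeword
    if (np.abs(frac[1:] - eps) < delta).any(): # does a wrong codeword look typical?
        confusions += 1

print(f"I(U;Y) = {I_UY:.3f} bits, R = {R}: P(confusion) ≈ {confusions/trials:.2f}")
```

Pushing $R$ above $I(U;Y)$ would require more than $2^{nI(U;Y)}$ codewords, and the same test would flag confusions with probability approaching one for large $n$.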

Packing lemma

If the code rate $R < I(U;Y)$, then with high probability no incorrect codeword is jointly typical with the channel output. This is the core achievability tool for channel coding.

Related: Covering lemma, Jointly typical sequences

Theorem: The Covering Lemma

Let $\hat{X}^n(m)$, $m = 1, \ldots, 2^{nR}$, be drawn i.i.d. $\sim P_{\hat{X}}^n$. Let $X^n$ be a typical source sequence. Then:

$$\Pr\!\left(\text{no } m : (\hat{X}^n(m), X^n) \in \mathcal{T}_\epsilon^{(n)}(\hat{X}, X)\right) \to 0 \quad \text{if } R > I(X; \hat{X}).$$

The covering lemma says: if the codebook is large enough (rate above mutual information), there will be a codeword that is jointly typical with the source sequence. This is the "covering" condition: we need enough codewords to "cover" all typical source sequences. Each codeword covers a fraction $\approx 2^{-nI(X;\hat{X})}$ of typical source sequences, so we need $\approx 2^{nI(X;\hat{X})}$ codewords.
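The same style of experiment illustrates covering for a Bern(1/2) source under Hamming distortion $D$, where $R(D) = 1 - H_b(D)$ and "covering" means finding a codeword within distortion $D$. Again, all parameters are illustrative choices.

```python
import numpy as np

# Covering lemma sketch: a random codebook of rate above R(D) almost always
# contains a codeword within Hamming distortion D of the source sequence.
rng = np.random.default_rng(1)

def h2(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

n, D = 60, 0.25
RD = 1 - h2(D)                        # R(D) ≈ 0.189 bits/symbol
trials = 50

for R in (0.30, 0.12):                # one rate above R(D), one below
    M = int(2 ** (n * R))
    covered = 0
    for _ in range(trials):
        x = rng.integers(0, 2, n, dtype=np.int8)            # source sequence
        codebook = rng.integers(0, 2, (M, n), dtype=np.int8)
        dist = np.mean(codebook != x, axis=1)               # per-codeword distortion
        covered += (dist <= D).any()
    print(f"R = {R:.2f} vs R(D) = {RD:.3f}: P(covered) ≈ {covered/trials:.2f}")
```

Above $R(D)$ the source sequence is essentially always covered; below it, almost never.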

Covering lemma

If the codebook rate $R > I(X;\hat{X})$, then with high probability at least one codeword is jointly typical with any given typical source sequence. This is the core achievability tool for lossy source coding (rate-distortion).

Related: Packing lemma

Packing vs Covering: A Duality

The packing and covering lemmas are in a sense dual to each other:

  • Packing (channel coding): We need the rate to be below the mutual information to avoid confusion, $R < I(X;Y)$.
  • Covering (source coding): We need the rate to be above the mutual information to ensure coverage, $R > I(X;\hat{X})$.

In both cases, the critical threshold is a mutual information. The channel coding theorem says every rate $R < C = \max_{P_X} I(X;Y)$ is achievable. The rate-distortion theorem says that rates above $R(D) = \min_{P_{\hat{X}|X}} I(X;\hat{X})$ (the minimum taken over test channels meeting the distortion constraint) suffice for distortion $\leq D$, while no rate below $R(D)$ does.
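As a concrete instance of both thresholds, take the binary case: a BSC($\epsilon$) has $C = 1 - H_b(\epsilon)$, and a Bern($p$) source under Hamming distortion has $R(D) = H_b(p) - H_b(D)$ for $0 \leq D \leq \min(p, 1-p)$. A quick numerical check (the parameter values are just examples):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits; h2(0) = h2(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

eps = 0.11                   # BSC crossover probability
C = 1 - h2(eps)              # packing threshold: rates below C are achievable
print(f"BSC({eps}): C = {C:.3f} bits/use")

p, D = 0.5, 0.25             # Bern(p) source, Hamming distortion target D
RD = h2(p) - h2(D)           # covering threshold: rates above R(D) suffice
print(f"Bern({p}) source at D = {D}: R(D) = {RD:.3f} bits/symbol")
```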

This "same proof, different direction" pattern extends to multiuser settings. The MAC uses packing with multiple senders. The broadcast channel uses covering with multiple receivers. Dirty-paper coding combines both.

⚠️Engineering Note

Random Codes vs Practical Codes

The packing and covering lemmas prove existence of good codes via the random coding argument: a random codebook works with high probability. But random codes are impractical: encoding and decoding both require exponential complexity ($2^{nR}$ codewords, each of length $n$).
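A back-of-the-envelope count makes the blow-up vivid (the blocklength and rate below are illustrative):

```python
# Storage for an unstructured random codebook at blocklength n and rate R.
# Real systems use structured codes precisely to avoid this.
n, R = 1000, 0.5
codewords = 2 ** int(n * R)          # 2^{nR} codewords
bits = codewords * n                 # total bits just to store the codebook
print(f"{codewords:.3e} codewords, {bits:.3e} bits")   # about 3.3e150 codewords
```

Maximum-likelihood decoding would have to compare the received word against every one of them.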

Practical codes that approach the Shannon limit include:

  • LDPC codes (1960, rediscovered 1996): sparse graph structure enables iterative decoding in $O(n)$ per iteration
  • Turbo codes (1993): parallel concatenation with interleaving
  • Polar codes (2009): deterministic construction with $O(n \log n)$ encoding/decoding, provably capacity-achieving; a minimal sketch of the core transform follows below
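For a taste of why polar codes admit fast encoding, here is a minimal sketch of their core transform, the Kronecker power $F^{\otimes m}$ of $F = \bigl(\begin{smallmatrix}1 & 0\\ 1 & 1\end{smallmatrix}\bigr)$ over GF(2). Frozen-bit selection and the bit-reversal ordering of Arıkan's construction are omitted; this dense-matrix version is $O(n^2)$, while the butterfly implementation achieves the $O(n \log n)$ quoted above.

```python
import numpy as np

def polar_encode(u):
    """Encode a length-2^m binary vector with the transform x = u · F^{⊗m} mod 2."""
    F = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    G = np.array([[1]], dtype=np.uint8)
    while G.shape[0] < len(u):
        G = np.kron(G, F)            # build F^{⊗m} by repeated Kronecker products
    return (u @ G) % 2

u = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)  # 8 input bits
print(polar_encode(u))               # the corresponding length-8 codeword
```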

The gap between random coding theory and practical code design is one of the most fruitful areas in communications engineering.

Practical Constraints
  • Random codes require exponential storage and decoding complexity
  • Practical codes sacrifice some rate for polynomial-time encoding/decoding
  • The coding gap decreases with blocklength but never reaches zero at finite $n$

Why This Matters: From Lemmas to 5G NR

Modern wireless standards implement the ideas of this section directly. 5G NR uses LDPC codes for data channels (approaching the packing lemma guarantee) and polar codes for control channels. The code rates are chosen based on the estimated channel mutual information (obtained via channel state information feedback), and adaptive modulation and coding (AMC) selects the combination that operates closest to capacity.

The covering lemma appears in the form of vector quantization of CSI feedback β€” the codebook must "cover" the space of possible channels with enough precision to enable effective beamforming.
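As a toy illustration, here is a minimal nearest-codeword quantizer of a channel direction vector. The Gaussian codebook, dimensions, and feedback size are hypothetical stand-ins, not any standard's actual CSI codebook.

```python
import numpy as np

# Vector-quantization sketch: a random codebook "covers" the space of channel
# directions; the receiver feeds back only the index of the nearest codeword.
# Codebook design and sizes here are hypothetical, not from any standard.
rng = np.random.default_rng(2)

dim, num_codewords = 8, 256          # 8-dim channel vector, 8 feedback bits
codebook = rng.standard_normal((num_codewords, dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)   # unit directions

h = rng.standard_normal(dim)
h /= np.linalg.norm(h)               # normalized channel direction to feed back

idx = int(np.argmax(np.abs(codebook @ h)))    # most aligned codeword
alignment = abs(float(codebook[idx] @ h))     # 1.0 would be perfect covering
print(f"feedback index = {idx}, alignment = {alignment:.3f}")
```

More feedback bits mean a denser codebook, hence finer covering and better beamforming gain.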

Common Mistake: Random Coding Proves Existence, Not Construction

Mistake:

Interpreting the random coding argument as an algorithm. For example, claiming that "we draw a random codebook and it works" provides a practical coding scheme.

Correction:

Random coding is an existential argument: it shows that good codes exist (because the average over random codebooks is good, so at least one specific codebook must be good). Finding that specific codebook, or constructing an equivalent one efficiently, is a separate engineering problem. Polar codes are the first known construction that provably achieves capacity with polynomial complexity.

Quick Check

In the packing lemma, what happens if the rate $R$ exceeds $I(X;Y)$?

  • The union bound no longer decays to zero: too many codewords cause confusion
  • The code still works but with higher latency
  • The covering lemma takes over
  • We need a larger alphabet

Figure: Random Coding and the Packing Lemma. Visualizes the random coding argument: codewords scattered in output space, with a jointly typical "cloud" around the true codeword. Incorrect codewords are exponentially unlikely to fall in the cloud when $R < I(X;Y)$.

Figure: Packing vs Covering Duality. Side-by-side comparison of the packing lemma (channel coding: don't confuse codewords) and covering lemma (source coding: represent every source sequence). Shows the rate thresholds $R < I(X;Y)$ and $R > I(X;\hat{X})$.

⚠️Engineering Note

Finite-Blocklength Penalty

The packing and covering lemmas establish achievability in the limit $n \to \infty$. For finite blocklength $n$, the achievable rate is approximately:

Rβ‰ˆCβˆ’VnQβˆ’1(Ξ΅)R \approx C - \sqrt{\frac{V}{n}} Q^{-1}(\varepsilon)

where $V$ is the channel dispersion (the variance of the information density), $\varepsilon$ is the target error probability, and $Q^{-1}$ is the inverse Q-function. For the BSC with crossover $\epsilon = 0.11$ at $\varepsilon = 10^{-3}$ and $n = 1000$, the rate loss is about $0.09$ bits/use below the capacity $C \approx 0.5$. This "square-root penalty" is a fundamental feature of finite-blocklength communication (Chapter 26).
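A minimal numerical check of that approximation, using `scipy.stats.norm.isf` for the inverse Q-function:

```python
import numpy as np
from scipy.stats import norm

def h2(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

eps_ch, eps_err, n = 0.11, 1e-3, 1000   # BSC crossover, target error, blocklength
C = 1 - h2(eps_ch)                      # capacity, about 0.5 bits/use
V = eps_ch * (1 - eps_ch) * np.log2((1 - eps_ch) / eps_ch) ** 2  # BSC dispersion
loss = np.sqrt(V / n) * norm.isf(eps_err)   # norm.isf = inverse Q-function
print(f"C = {C:.3f}, V = {V:.3f}, rate loss ≈ {loss:.3f} bits/use")
# prints a loss of about 0.092, i.e. R ≈ 0.41 instead of 0.50
```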

Practical Constraints
  • Rate penalty scales as $1/\sqrt{n}$: doubling the blocklength shrinks the gap only by a factor of $\sqrt{2}$
  • Channel dispersion $V$ depends on the channel; the BSC has $V = \epsilon(1-\epsilon)\left(\log\tfrac{1-\epsilon}{\epsilon}\right)^2$
  • Low-latency systems (URLLC) operate at short blocklengths where this penalty dominates

Common Mistake: The AEP Requires i.i.d. (or at Least Stationarity and Ergodicity)

Mistake:

Applying the AEP to sequences that are not i.i.d. without verifying the required stationarity and ergodicity conditions. For example, assuming typicality arguments work for sources with long-range dependencies or non-stationary channels.

Correction:

The weak AEP as stated requires i.i.d. symbols. Extensions exist for stationary ergodic sources (the Shannon-McMillan-Breiman theorem), but convergence can be much slower. For non-stationary or non-ergodic settings, the notion of typicality must be generalized, and the simple $2^{nH}$ typical-set size no longer applies. In wireless channels, quasi-static fading (where the channel is constant for one codeword but varies across codewords) requires outage-capacity analysis instead of the ergodic typicality approach.
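A tiny empirical check of the i.i.d. statement makes the concentration visible (the Bernoulli parameter and blocklengths are arbitrary choices): for i.i.d. symbols, $-\frac{1}{n}\log_2 p(X^n)$ converges to $H(X)$.

```python
import numpy as np

# Weak AEP demo for an i.i.d. Bernoulli(p) source: the normalized
# log-probability of a random sequence concentrates around H(X).
rng = np.random.default_rng(3)

def h2(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.2
for n in (100, 1000, 10000):
    x = rng.random(n) < p                         # i.i.d. Bernoulli(p) draws
    k = int(x.sum())                              # number of ones
    logp = k * np.log2(p) + (n - k) * np.log2(1 - p)
    print(f"n = {n:>5}: -log2 p(x^n)/n = {-logp/n:.4f}   (H(X) = {h2(p):.4f})")
```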