Chapter Summary

Key Points

  1. Kraft's inequality and the tree. A binary prefix code with lengths $\ell_1, \ldots, \ell_m$ exists if and only if $\sum 2^{-\ell_i} \leq 1$. By McMillan's theorem, the same inequality holds for all uniquely decodable codes, so restricting to prefix codes costs nothing. The binary tree provides the geometric intuition: each codeword occupies a subtree, and the subtrees must not overlap. (A numeric check appears after this list.)

  2. Entropy as the compression limit. The expected codeword length of any prefix code satisfies $L \geq H(X)$, with equality achievable when all probabilities are powers of $1/2$. Shannon codes achieve $L < H(X) + 1$ (within 1 bit), and block codes of length $n$ achieve $L/n < H(X) + 1/n$ (within $1/n$ bits). The penalty for using the wrong distribution is exactly $D(P \| Q)$ bits/symbol. (A Shannon-code sketch follows the list.)

  3. Huffman codes: optimal symbol-by-symbol compression. The greedy merge algorithm produces the optimal prefix code among all symbol-by-symbol codes. The proof uses the sibling property and induction on alphabet size. Huffman codes are simple and fast but limited to integer codeword lengths, making them suboptimal for low-entropy sources. (A heap-based merge sketch follows the list.)

  4. Arithmetic coding: fractional bits per symbol. By representing an entire sequence as a single number in $[0,1)$, arithmetic coding uses approximately $-\log P(\mathbf{x})$ bits, the information-theoretic ideal, with only 2 bits of overhead per sequence. Combined with adaptive probability models, arithmetic coding underlies many modern compression systems (CABAC in H.264/HEVC, and the closely related ANS coder in Zstandard). (An interval-narrowing sketch follows the list.)

  5. Lempel-Ziv: universality without knowledge. LZ78 (along with the related LZ77 and LZW schemes) achieves the entropy rate of any stationary ergodic source without knowing the source statistics. The dictionary-based parsing learns structure implicitly, achieving $\ell_{LZ}/n \to H_\infty$ almost surely. Convergence is slow ($O(\log\log n / \log n)$), but practical implementations (gzip, Zstandard) compensate by combining LZ matching with entropy coding. (A parsing sketch follows the list.)

  6. Memory reduces entropy rate. For sources with dependencies, the entropy rate is strictly below the marginal entropy, $H_\infty < H(X)$. For first-order Markov chains, $H_\infty = \sum_i \pi_i H(\text{row}_i)$, a weighted average of the conditional entropies of the transition rows. Exploiting memory is essential for efficient compression of real-world sources. (A worked two-state example follows the list.)
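
The short Python sketches below illustrate the key points above in order; they are illustrative toys under assumed inputs, not the chapter's reference implementations. First, a direct check of Kraft's inequality for a proposed set of codeword lengths (the example lengths are assumptions chosen for illustration).

```python
# Point 1: a binary prefix code with these lengths exists iff sum(2^-l_i) <= 1.
def kraft_sum(lengths):
    """Return the Kraft sum of the proposed codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a prefix code exists, e.g. 0, 10, 110, 111
print(kraft_sum([1, 1, 2]))     # 1.25 -> no uniquely decodable code has these lengths
```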
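
A sketch of point 2, assuming a toy four-symbol distribution: Shannon code lengths $\lceil -\log_2 p_i \rceil$ satisfy Kraft's inequality and give an expected length between $H(X)$ and $H(X) + 1$.

```python
import math

p = [0.5, 0.25, 0.15, 0.10]                      # toy distribution (assumption)
H = -sum(pi * math.log2(pi) for pi in p)          # entropy H(X)
lengths = [math.ceil(-math.log2(pi)) for pi in p] # Shannon code lengths
L = sum(pi * li for pi, li in zip(p, lengths))    # expected codeword length

print(f"Kraft sum = {sum(2.0 ** -l for l in lengths):.4f}")  # <= 1
print(f"H = {H:.3f} bits, L = {L:.3f} bits")                  # H <= L < H + 1
```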
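
A minimal sketch of the greedy merge from point 3, using Python's heapq; the helper name huffman_code and the symbol probabilities are illustrative assumptions, not the chapter's notation.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; prepend a bit to each side.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
```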
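
A sketch of the interval-narrowing view from point 4, with an assumed i.i.d. toy model; it omits renormalization and actual bit output, showing only that the final interval width equals $P(\mathbf{x})$, so roughly $-\log P(\mathbf{x})$ bits identify a point inside it.

```python
import math

def encode_interval(seq, probs):
    """Narrow [low, high) once per symbol; return the final interval."""
    low, high = 0.0, 1.0
    symbols = sorted(probs)
    for s in seq:
        width = high - low
        cum = 0.0
        for t in symbols:
            if t == s:
                low, high = low + cum * width, low + (cum + probs[t]) * width
                break
            cum += probs[t]
    return low, high

low, high = encode_interval("abaab", {"a": 0.7, "b": 0.3})
print(high - low)                # interval width = P(x) = 0.7^3 * 0.3^2
print(-math.log2(high - low))    # ideal code length, about 5.02 bits
```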
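
A sketch of LZ78 incremental parsing from point 5; the function name lz78_parse and the test string are illustrative. Each phrase is the longest previously seen phrase plus one new symbol, encoded as a (dictionary index, new symbol) pair with no knowledge of the source statistics.

```python
def lz78_parse(s):
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    phrases, current = [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch         # keep extending the match
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                   # flush a trailing phrase already in the dictionary
        phrases.append((dictionary[current], ""))
    return phrases

print(lz78_parse("abababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]
```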
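
A worked two-state example for point 6, with an assumed transition matrix, comparing the marginal entropy $H(X)$ of the stationary distribution against the entropy rate $\sum_i \pi_i H(\text{row}_i)$.

```python
import math

# Toy transition matrix (assumption): state 0 is "sticky", state 1 less so.
P = [[0.9, 0.1],
     [0.3, 0.7]]

def H(dist):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Stationary distribution of a 2-state chain: pi_0 = P[1][0] / (P[0][1] + P[1][0]).
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

rate = sum(pi[i] * H(P[i]) for i in range(2))
print(f"H(X) = {H(pi):.3f} bits, entropy rate = {rate:.3f} bits")  # rate < H(X)
```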

Looking Ahead

Chapter 6 tackles the lossy compression problem: when we allow distortion in the reconstruction, how many bits do we need? The rate-distortion function $R(D)$ generalizes the lossless limit $H(X)$ to the entire rate-distortion tradeoff curve. The covering lemma (dual to the packing lemma used in channel coding) will be the key proof tool.