Chapter Summary
Key Points
1. Kraft's inequality and the tree. A binary prefix code with lengths $\ell_1, \ldots, \ell_m$ exists if and only if $\sum_{i=1}^{m} 2^{-\ell_i} \le 1$. By McMillan's theorem, the same inequality holds for all uniquely decodable codes, so restricting to prefix codes costs nothing. The binary tree provides the geometric intuition: each codeword occupies a subtree, and the subtrees must not overlap (see the Kraft-check sketch after this list).
2. Entropy as the compression limit. The expected codeword length of any prefix code satisfies $L \ge H(X)$, with equality achievable when all probabilities are powers of $1/2$. Shannon codes achieve $L < H(X) + 1$ (within 1 bit), and block codes of length $n$ achieve a per-symbol length within $1/n$ bits of $H(X)$. The penalty for using the wrong distribution $q$ in place of the true $p$ is exactly $D(p \| q)$ bits/symbol (checked numerically in the second sketch below).
3. Huffman codes: optimal symbol-by-symbol compression. The greedy merge algorithm produces the optimal prefix code among all symbol-by-symbol codes. The proof uses the sibling property and induction on the alphabet size. Huffman codes are simple and fast but limited to integer codeword lengths, making them suboptimal for low-entropy sources (a heap-based sketch follows this list).
4. Arithmetic coding: fractional bits per symbol. By representing an entire sequence as a single number in $[0, 1)$, arithmetic coding uses approximately $\log_2 \frac{1}{p(x^n)}$ bits, the information-theoretic ideal, with only 2 bits of overhead per sequence. Combined with adaptive probability models, arithmetic coding is the foundation of all modern compression systems (CABAC in H.264/HEVC, ANS in Zstandard); an idealized interval-narrowing sketch appears below.
5. Lempel-Ziv: universality without knowledge. LZ78 (and the related LZ77 and LZW) achieves the entropy rate of any stationary ergodic source without knowing the source statistics. The dictionary-based parsing learns structure implicitly, achieving $\frac{1}{n}\,\ell(X^n) \to H(\mathcal{X})$ almost surely. Convergence is slow (the redundancy decays only as $O(1/\log n)$), but practical implementations (gzip, Zstandard) compensate by combining LZ matching with entropy coding (see the parsing sketch below).
6. Memory reduces entropy rate. For sources with dependencies, the entropy rate $H(\mathcal{X})$ is strictly below the marginal entropy $H(X_1)$. For first-order Markov chains, $H(\mathcal{X}) = H(X_2 \mid X_1) = \sum_i \pi_i H(X_2 \mid X_1 = i)$, a weighted average of conditional entropies. Exploiting memory is essential for efficient compression of real-world sources (computed in the last sketch below).
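The sketches below illustrate the six points in Python. All function names, toy distributions, and example strings are ours, invented for illustration; none come from the chapter. First, point 1: the Kraft check, plus the canonical shortest-first tree construction that succeeds exactly when the inequality holds.

```python
def kraft_sum(lengths):
    """Sum of 2^(-l) over the proposed codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

def build_prefix_code(lengths):
    """Assign codewords shortest-first, walking left to right along each
    depth of the binary tree; this succeeds exactly when the Kraft sum <= 1."""
    assert kraft_sum(lengths) <= 1.0, "no prefix code with these lengths"
    code, node, depth = {}, 0, 0
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        node <<= l - depth                # descend to depth l
        code[i] = format(node, f"0{l}b")  # path from root = the codeword
        node += 1                         # next free subtree at this depth
        depth = l
    return code

print(build_prefix_code([1, 2, 3, 3]))  # {0: '0', 1: '10', 2: '110', 3: '111'}
```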
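Point 2, checked numerically: Shannon lengths $\lceil \log_2(1/p_i) \rceil$ land between $H(X)$ and $H(X) + 1$, and the wrong-model penalty is the relative entropy.

```python
import math

def entropy(p):
    """H(X) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    """D(p||q): the extra bits/symbol paid for coding p with q's code."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.15, 0.10]
lengths = [math.ceil(-math.log2(pi)) for pi in p]   # Shannon code lengths
L = sum(pi * li for pi, li in zip(p, lengths))
print(f"H = {entropy(p):.3f}, L = {L:.3f}")          # 1.743 <= 1.850 < 2.743
print(f"uniform-model penalty: {kl(p, [0.25]*4):.3f} bits/symbol")  # 0.257
```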
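Point 3's greedy merge, in one standard heap-based form (a sketch, not the chapter's pseudocode): repeatedly pop the two least-probable subtrees, prepend a bit to each side, and push the merged tree back.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Greedy Huffman merge over a min-heap of (probability, subtree) pairs."""
    ties = count()  # breaks probability ties so dicts are never compared
    heap = [(p, next(ties), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(ties), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'} -- lengths 1, 2, 3, 3
```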
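Point 4, idealized: exact rational arithmetic stands in for the finite-precision renormalization real coders use. The encoder narrows the interval by each symbol's probability slice and then emits $\lceil \log_2(1/p(x^n)) \rceil + 1$ bits, matching the 2-bit overhead quoted above.

```python
import math
from fractions import Fraction

def arithmetic_encode(seq, probs):
    """Narrow [low, low + width) by each symbol's slice, then emit just
    enough bits to pin down a point inside the final interval (encoder only)."""
    cum, c = {}, Fraction(0)                  # start of each symbol's slice
    for sym, p in probs.items():
        cum[sym] = c
        c += p
    low, width = Fraction(0), Fraction(1)
    for sym in seq:
        low += cum[sym] * width
        width *= probs[sym]                   # width ends up equal to p(x^n)
    nbits = math.ceil(-math.log2(width)) + 1  # <= log2(1/p(x^n)) + 2
    mid = low + width / 2
    return "".join(str(math.floor(mid * 2 ** (k + 1)) % 2) for k in range(nbits))

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
msg = "abcaab"
bits = arithmetic_encode(msg, probs)
print(len(bits), -sum(math.log2(probs[s]) for s in msg))  # 10 bits vs. ideal 9.0
```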
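Point 5's LZ78 parse: each phrase is the longest previously seen phrase plus one fresh symbol, emitted as a (phrase index, symbol) pair. A sketch, with the simplest possible handling of a trailing partial phrase.

```python
def lz78_parse(s):
    """Return LZ78 (index, symbol) pairs; index 0 is the empty phrase."""
    dictionary = {"": 0}
    pairs, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                     # keep extending the match
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                               # trailing phrase already in dict:
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))  # re-emit it
    return pairs

print(lz78_parse("abaababaabab"))
# [(0,'a'), (0,'b'), (1,'a'), (2,'a'), (4,'a'), (4,'b')]
# i.e. the parse  a | b | aa | ba | baa | bab
```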
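Finally, point 6's weighted-average formula for a first-order Markov chain, assuming NumPy; the sticky two-state transition matrix is a made-up example.

```python
import numpy as np

def markov_entropy_rate(P):
    """Entropy rate sum_i pi_i * H(X2 | X1 = i) for transition matrix P."""
    w, v = np.linalg.eig(P.T)                      # stationary pi solves pi P = pi
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi /= pi.sum()
    logs = np.log2(np.where(P > 0, P, 1.0))        # convention: 0 * log 0 = 0
    return float(pi @ -(P * logs).sum(axis=1))     # pi-weighted row entropies

P = np.array([[0.9, 0.1],
              [0.1, 0.9]])                         # chain that tends to stay put
print(markov_entropy_rate(P))  # ~0.469 bits/symbol, vs. marginal H(X1) = 1 bit
```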
Looking Ahead
Chapter 6 tackles the lossy compression problem: when we allow distortion in the reconstruction, how many bits do we need? The rate-distortion function $R(D)$ generalizes the lossless limit to a full curve of achievable rates, one for each allowed distortion level. The covering lemma (dual to the packing lemma used in channel coding) will be the key proof tool.