Chapter Summary

Key Points

  1. Kraft's inequality and the tree. A binary prefix code with lengths $\ell_1, \ldots, \ell_m$ exists if and only if $\sum 2^{-\ell_i} \leq 1$. By McMillan's theorem, the same inequality holds for all uniquely decodable codes, so restricting to prefix codes costs nothing. The binary tree provides the geometric intuition: each codeword occupies a subtree, and the subtrees must not overlap. (A numeric check appears after this list.)

  2. Entropy as the compression limit. The expected codeword length of any prefix code satisfies $L \geq H(X)$, with equality achievable when all probabilities are powers of $1/2$. Shannon codes achieve $L < H(X) + 1$ (within 1 bit), and block codes of length $n$ achieve $L/n < H(X) + 1/n$ (within $1/n$ bits). The penalty for using the wrong distribution is exactly $D(P \| Q)$ bits/symbol. (A Shannon-code sketch follows the list.)

  3. Huffman codes: optimal symbol-by-symbol compression. The greedy merge algorithm produces the optimal prefix code among all symbol-by-symbol codes. The proof uses the sibling property and induction on alphabet size. Huffman codes are simple and fast but limited to integer codeword lengths, making them suboptimal for low-entropy sources. (A heap-based merge sketch follows the list.)

  4. Arithmetic coding: fractional bits per symbol. By representing an entire sequence as a single number in $[0,1)$, arithmetic coding uses approximately $-\log P(\mathbf{x})$ bits, the information-theoretic ideal, with only 2 bits of overhead per sequence. Combined with adaptive probability models, arithmetic coding underlies many modern compression systems (CABAC in H.264/HEVC, and the closely related ANS coder in Zstandard). (An interval-narrowing sketch follows the list.)

  5. Lempel-Ziv: universality without knowledge. LZ78 (along with the related LZ77 and LZW schemes) achieves the entropy rate of any stationary ergodic source without knowing the source statistics. The dictionary-based parsing learns structure implicitly, achieving $\ell_{LZ}/n \to H_\infty$ almost surely. Convergence is slow ($O(\log\log n / \log n)$), but practical implementations (gzip, Zstandard) compensate by combining LZ matching with entropy coding. (A parsing sketch follows the list.)

  6. Memory reduces entropy rate. For sources with dependencies, the entropy rate is strictly below the marginal entropy, $H_\infty < H(X)$. For first-order Markov chains, $H_\infty = \sum_i \pi_i H(\text{row}_i)$, a weighted average of the conditional entropies of the transition rows. Exploiting memory is essential for efficient compression of real-world sources. (A worked two-state example follows the list.)
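
The short Python sketches below illustrate the key points above in order; they are illustrative toys under assumed inputs, not the chapter's reference implementations. First, a direct check of Kraft's inequality for a proposed set of codeword lengths (the example lengths are assumptions chosen for illustration).

```python
# Point 1: a binary prefix code with these lengths exists iff sum(2^-l_i) <= 1.
def kraft_sum(lengths):
    """Return the Kraft sum of the proposed codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a prefix code exists, e.g. 0, 10, 110, 111
print(kraft_sum([1, 1, 2]))     # 1.25 -> no uniquely decodable code has these lengths
```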
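
A sketch of point 2, assuming a toy four-symbol distribution: Shannon code lengths $\lceil -\log_2 p_i \rceil$ satisfy Kraft's inequality and give an expected length between $H(X)$ and $H(X) + 1$.

```python
import math

p = [0.5, 0.25, 0.15, 0.10]                      # toy distribution (assumption)
H = -sum(pi * math.log2(pi) for pi in p)          # entropy H(X)
lengths = [math.ceil(-math.log2(pi)) for pi in p] # Shannon code lengths
L = sum(pi * li for pi, li in zip(p, lengths))    # expected codeword length

print(f"Kraft sum = {sum(2.0 ** -l for l in lengths):.4f}")  # <= 1
print(f"H = {H:.3f} bits, L = {L:.3f} bits")                  # H <= L < H + 1
```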
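
A minimal sketch of the greedy merge from point 3, using Python's heapq; the helper name huffman_code and the symbol probabilities are illustrative assumptions, not the chapter's notation.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; prepend a bit to each side.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
```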
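
A sketch of the interval-narrowing view from point 4, with an assumed i.i.d. toy model; it omits renormalization and actual bit output, showing only that the final interval width equals $P(\mathbf{x})$, so roughly $-\log P(\mathbf{x})$ bits identify a point inside it.

```python
import math

def encode_interval(seq, probs):
    """Narrow [low, high) once per symbol; return the final interval."""
    low, high = 0.0, 1.0
    symbols = sorted(probs)
    for s in seq:
        width = high - low
        cum = 0.0
        for t in symbols:
            if t == s:
                low, high = low + cum * width, low + (cum + probs[t]) * width
                break
            cum += probs[t]
    return low, high

low, high = encode_interval("abaab", {"a": 0.7, "b": 0.3})
print(high - low)                # interval width = P(x) = 0.7^3 * 0.3^2
print(-math.log2(high - low))    # ideal code length, about 5.02 bits
```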
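
A sketch of LZ78 incremental parsing from point 5; the function name lz78_parse and the test string are illustrative. Each phrase is the longest previously seen phrase plus one new symbol, encoded as a (dictionary index, new symbol) pair with no knowledge of the source statistics.

```python
def lz78_parse(s):
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    phrases, current = [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch         # keep extending the match
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                   # flush a trailing phrase already in the dictionary
        phrases.append((dictionary[current], ""))
    return phrases

print(lz78_parse("abababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]
```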
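
A worked two-state example for point 6, with an assumed transition matrix, comparing the marginal entropy $H(X)$ of the stationary distribution against the entropy rate $\sum_i \pi_i H(\text{row}_i)$.

```python
import math

# Toy transition matrix (assumption): state 0 is "sticky", state 1 less so.
P = [[0.9, 0.1],
     [0.3, 0.7]]

def H(dist):
    """Entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Stationary distribution of a 2-state chain: pi_0 = P[1][0] / (P[0][1] + P[1][0]).
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

rate = sum(pi[i] * H(P[i]) for i in range(2))
print(f"H(X) = {H(pi):.3f} bits, entropy rate = {rate:.3f} bits")  # rate < H(X)
```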

Looking Ahead

Chapter 6 tackles the lossy compression problem: when we allow distortion in the reconstruction, how many bits do we need? The rate-distortion function $R(D)$ generalizes the lossless limit $H(X)$ to the entire rate-distortion tradeoff curve. The covering lemma (dual to the packing lemma used in channel coding) will be the key proof tool.