Shannon Codes, Huffman Codes, and Optimality
Constructing Optimal Codes
We know from Section 1 that any prefix code satisfies $L \ge H(X)$ and that Shannon codes achieve $L < H(X) + 1$. But can we do better? Is there a code that minimizes the expected length exactly? The answer is yes: Huffman's algorithm constructs an optimal prefix code for any source distribution, and the proof is a beautiful application of the greedy method. We also examine what happens when the code is designed for the wrong distribution: the penalty is exactly the KL divergence.
Definition: Shannon Code
Shannon Code
For a source with distribution $p(x)$, the Shannon code assigns codeword lengths $\ell(x) = \lceil \log_2(1/p(x)) \rceil$. The codewords are constructed by listing the symbols in order of decreasing probability and setting the codeword for $x$ to the first $\ell(x)$ bits of the binary expansion of the cumulative probability $F(x) = \sum_{a < x} p(a)$.
Shannon codes are simple to construct and guarantee $H(X) \le L < H(X) + 1$, but they are not optimal in general. Their importance is primarily theoretical: they demonstrate achievability of the entropy bound.
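As a concrete illustration, here is a minimal Python sketch of the construction above; the symbol names and the dictionary-based interface are illustrative choices, not part of the definition.

```python
from math import ceil, log2

def shannon_code(probs):
    """Shannon code: sort symbols by decreasing probability, then give each
    symbol the first ceil(log2(1/p)) bits of the binary expansion of the
    cumulative probability F(x) of all more-probable symbols."""
    symbols = sorted(probs, key=probs.get, reverse=True)
    code, cumulative = {}, 0.0
    for s in symbols:
        length = ceil(log2(1 / probs[s]))
        bits, frac = [], cumulative
        for _ in range(length):          # first `length` bits of F(x) in binary
            frac *= 2
            bits.append("1" if frac >= 1 else "0")
            frac -= int(frac)
        code[s] = "".join(bits)
        cumulative += probs[s]
    return code

print(shannon_code({"a": 0.35, "b": 0.25, "c": 0.20, "d": 0.12, "e": 0.08}))
# {'a': '00', 'b': '01', 'c': '100', 'd': '1100', 'e': '1110'}
```

Sorting by decreasing probability is what makes the result prefix-free: each cumulative value differs from all earlier ones within the first $\ell(x)$ bits.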
Huffman Coding Algorithm
Complexity: $O(n \log n)$ with a priority queue (heap); $O(n)$ if the probabilities are pre-sorted. Huffman's algorithm is greedy: at each step it merges the two least probable nodes. The optimality proof shows that any optimal code must have this property: the two least probable symbols must be siblings at the deepest level of the tree.
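A compact sketch of the algorithm using Python's heapq; the tie-breaking counter and the dictionary return format are implementation choices, not part of the algorithm.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Greedy Huffman construction: repeatedly merge the two least
    probable nodes until one tree remains. O(n log n) with a heap."""
    tiebreak = count()  # avoids comparing dict subtrees when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # least probable node
        p2, _, right = heapq.heappop(heap)  # second least probable node
        # Prepend 0 to every codeword in one subtree, 1 in the other.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"a": 0.35, "b": 0.25, "c": 0.20, "d": 0.12, "e": 0.08})
print(code)  # lengths: a, b, c get 2 bits; d, e get 3 bits (bit labels may vary)
```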
Theorem: Optimality of Huffman Codes
Among all prefix codes for a source with distribution $p$, the Huffman code achieves the minimum expected codeword length $L^* = \sum_x p(x)\,\ell^*(x)$. Furthermore: $H(X) \le L^* < H(X) + 1$.
Huffman's algorithm builds the code bottom-up by always merging the two least likely symbols. The optimality proof shows that any deviation from this greedy strategy can only increase the expected length. The proof proceeds by induction on the alphabet size, using the fact that an optimal code for a reduced alphabet (with the two least likely symbols merged) can be "expanded" into an optimal code for the original alphabet.
Two key lemmas
Lemma 1 (Sibling property): In any optimal code, the two longest codewords have the same length and differ only in the last bit (they are siblings in the code tree).
Lemma 2 (Least probable at deepest level): In any optimal code, the two least probable symbols can be assigned to the two longest codewords without increasing the expected length $L$.
Proof of Lemma 1
Suppose a codeword $c$ has maximum length but no sibling. Then its parent node has only one child, and we can shorten $c$ by one bit (replace it with the parent), reducing $L$ by $p(c)$, the probability of its symbol. The result is still a prefix code, since the parent had no other descendants. This contradicts optimality.
Proof of Lemma 2
If a less probable symbol has a shorter codeword than a more probable symbol ($p_i < p_j$ but $\ell_i < \ell_j$), swapping their codewords changes $L$ by $(p_i - p_j)(\ell_j - \ell_i) < 0$, improving the code. So we can assume the two least probable symbols have the two longest codewords.
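Spelling out the swap computation (same notation as the lemma): exchanging the codewords of symbols $i$ and $j$ changes the expected length by

$$\Delta L = \bigl(p_i \ell_j + p_j \ell_i\bigr) - \bigl(p_i \ell_i + p_j \ell_j\bigr) = (p_i - p_j)(\ell_j - \ell_i) < 0,$$

which is negative because $p_i - p_j < 0$ and $\ell_j - \ell_i > 0$.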
Inductive step
Combine the two least probable symbols into a super-symbol with probability $p_{m-1} + p_m$. By induction, an optimal code for the reduced alphabet exists (the Huffman code for that alphabet). Expanding the super-symbol's codeword into two codewords (appending 0 and 1) gives a code for the original alphabet. The expected length increases by exactly $p_{m-1} + p_m$ (one extra bit for each of the two symbols). Any other code for the original alphabet, when "collapsed" at the deepest level, gives a code for the reduced alphabet with equal or worse expected length. Therefore the expanded Huffman code is optimal.
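In symbols: if $C_{m-1}$ is any code for the reduced alphabet and $C_m$ its expansion, then

$$L(C_m) = L(C_{m-1}) + p_{m-1} + p_m,$$

so the two objectives differ by a constant, and minimizing $L(C_m)$ over codes with the sibling property is the same problem as minimizing $L(C_{m-1})$.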
Example: Huffman Code for a Five-Symbol Source
Construct the Huffman code for $\mathcal{X} = \{x_1, x_2, x_3, x_4, x_5\}$ with probabilities $(0.35, 0.25, 0.20, 0.12, 0.08)$ and compute the expected codeword length.
Step-by-step construction
- Round 1: Merge the two smallest: $0.12 + 0.08 = 0.20$. Remaining: $\{0.35, 0.25, 0.20, 0.20\}$.
- Round 2: Merge the smallest: $0.20 + 0.20 = 0.40$. Remaining: $\{0.40, 0.35, 0.25\}$.
- Round 3: Merge the smallest: $0.35 + 0.25 = 0.60$. Remaining: $\{0.60, 0.40\}$.
- Round 4: Merge: $0.60 + 0.40 = 1.0$. Done.
Read off codewords
Tracing the tree root-to-leaf (one consistent labeling of the branches):
- $x_1$ (0.35): 00 (length 2)
- $x_2$ (0.25): 01 (length 2)
- $x_3$ (0.20): 10 (length 2)
- $x_4$ (0.12): 110 (length 3)
- $x_5$ (0.08): 111 (length 3)
Expected length
$L = 0.35(2) + 0.25(2) + 0.20(2) + 0.12(3) + 0.08(3) = 0.70 + 0.50 + 0.40 + 0.36 + 0.24 = 2.20$ bits.
For comparison: $H(X) \approx 2.15$ bits. Shannon code: $L_{\text{Shannon}} = 2.60$ bits. Huffman achieves 2.20, much closer to entropy. The gap $L - H(X) \approx 0.05$ bits is well within the theoretical 1-bit gap.
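These figures are easy to verify numerically; a quick check, taking the Huffman lengths as read off the tree above:

```python
from math import ceil, log2

p = [0.35, 0.25, 0.20, 0.12, 0.08]
H = sum(pi * log2(1 / pi) for pi in p)
huffman_lengths = [2, 2, 2, 3, 3]                    # read off the tree above
shannon_lengths = [ceil(log2(1 / pi)) for pi in p]   # (2, 2, 3, 4, 4)

print(f"H(X)      = {H:.3f} bits")                                          # ~2.153
print(f"Huffman L = {sum(pi * l for pi, l in zip(p, huffman_lengths)):.2f}")  # 2.20
print(f"Shannon L = {sum(pi * l for pi, l in zip(p, shannon_lengths)):.2f}")  # 2.60
```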
Theorem: Penalty for Mismatched Source Codes
If a Shannon code is designed for distribution $q$ but the true source distribution is $p$, the expected codeword length satisfies $$H(p) + D(p\|q) \le L < H(p) + D(p\|q) + 1.$$ The penalty $D(p\|q)$ is the number of extra bits per symbol due to the mismatch.
A code designed for $q$ assigns length $\ell(x) = \lceil \log_2(1/q(x)) \rceil$ to symbol $x$. Under the true distribution $p$, the expected length is $L = \sum_x p(x)\,\ell(x)$. The KL divergence $D(p\|q)$ measures the cost of ignorance about the true distribution. The point is that the KL divergence has yet another operational meaning: it is the compression penalty for using the wrong code.
Direct computation
Shannon code for $q$: lengths $\ell(x) = \lceil \log_2(1/q(x)) \rceil$, so $\log_2\frac{1}{q(x)} \le \ell(x) < \log_2\frac{1}{q(x)} + 1$. Expected length under $p$: $$L = \sum_x p(x)\,\ell(x) < \sum_x p(x)\log_2\frac{1}{q(x)} + 1 = \sum_x p(x)\log_2\frac{p(x)}{q(x)} + \sum_x p(x)\log_2\frac{1}{p(x)} + 1 = D(p\|q) + H(p) + 1.$$ The lower bound follows from the source coding converse applied to the lengths $\ell(x)$.
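A small numeric check of the bound, with an illustrative pair of distributions (here $q$ is dyadic, so the ceiling introduces no rounding and the lower bound is met with equality):

```python
from math import ceil, log2

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (illustrative)
q = [0.25, 0.25, 0.25, 0.25]    # assumed (wrong) distribution

lengths = [ceil(log2(1 / qi)) for qi in q]           # Shannon code for q
L = sum(pi * l for pi, l in zip(p, lengths))         # expected length under p
H = sum(pi * log2(1 / pi) for pi in p)               # H(p) = 1.75
D = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))  # D(p||q) = 0.25

print(L, H + D, H + D + 1)   # 2.0  2.0  3.0: H(p)+D <= L < H(p)+D+1
```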
Competitive Optimality of Shannon Codes
Shannon codes have a remarkable property beyond average-case performance: they are competitively optimal. The codeword length for symbol $x$ is $\ell(x) = \lceil \log_2(1/p(x)) \rceil \le \log_2(1/p(x)) + 1$, which is within 1 bit of the ideal length $\log_2(1/p(x))$ for every symbol, not just on average. This means that even for individual "unlucky" symbols, the overhead is bounded. More precisely, for any competing code with lengths $\ell'(x)$, $\Pr[\ell(X) \ge \ell'(X) + c] \le 2^{-(c-1)}$: the probability of exceeding the competitor's length by $c$ or more bits decays exponentially in $c$.
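For a small alphabet the bound can be checked exactly; the distribution and the competing code lengths below are arbitrary illustrations (the competing lengths satisfy the Kraft inequality):

```python
from math import ceil, log2

p = [0.4, 0.3, 0.2, 0.05, 0.03, 0.02]         # illustrative source distribution
shannon = [ceil(log2(1 / pi)) for pi in p]    # Shannon code lengths for p
other = [1, 2, 3, 4, 5, 5]                    # a competing prefix code (Kraft sum = 1)

for c in range(1, 5):
    # Exact probability that the Shannon code is at least c bits longer.
    prob = sum(pi for pi, l, m in zip(p, shannon, other) if l >= m + c)
    print(f"c={c}: Pr = {prob:.3f}  <=  2^(1-c) = {2**(1 - c):.3f}")
```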
Quick Check
For a source with $p = (1/2, 1/4, 1/8, 1/8)$, what is the expected codeword length of the optimal (Huffman) code?
1.75 bits
2.0 bits
1.5 bits
2.25 bits
The Huffman code assigns lengths $(1, 2, 3, 3)$, giving $L = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75$ bits. Note that $H(X) = 1.75$ bits exactly, so the Huffman code is perfect: this happens when all probabilities are powers of 2.
Common Mistake: Huffman Codes Are Optimal Among Symbol-by-Symbol Codes Only
Mistake:
Claiming that Huffman codes are the best possible source codes for all blocklengths and all coding schemes.
Correction:
Huffman codes are optimal among prefix codes operating on individual symbols. Block Huffman codes (applied to blocks of symbols) are optimal among block prefix codes but may be outperformed by arithmetic coding (which can encode fractional bits) and Lempel-Ziv (which adapts to the source without knowing the distribution). For sources with memory, Huffman on individual symbols is far from optimal; block coding or sequential methods like arithmetic coding are necessary.
Huffman code
A prefix code constructed by the greedy algorithm that repeatedly merges the two least probable symbols. Optimal among all symbol-by-symbol prefix codes for a given distribution.
Related: Huffman Coding Algorithm, Optimality of Huffman Codes
[Figure: Huffman code construction, step by step]
Key Takeaway
Huffman codes are the optimal symbol-by-symbol prefix codes: the greedy merge algorithm produces a code with $H(X) \le L^* < H(X) + 1$. Using the wrong distribution incurs a penalty of exactly $D(p\|q)$ bits per symbol. When probabilities are dyadic ($p_i = 2^{-k_i}$), Huffman matches entropy exactly; otherwise, the 1-bit gap motivates arithmetic coding (Section 3) and block coding.