Shannon Codes, Huffman Codes, and Optimality

Constructing Optimal Codes

We know from Section 1 that any prefix code satisfies $L \geq H(X)$ and that Shannon codes achieve $L < H(X) + 1$. But can we do better? Is there a code that minimizes the expected length exactly? The answer is yes: Huffman's algorithm constructs an optimal prefix code for any source distribution, and the proof is a beautiful application of the greedy method. We also examine what happens when the code is designed for the wrong distribution: the penalty is exactly the KL divergence.

Definition:

Shannon Code

For a source $X$ with distribution $P(x_1) \geq P(x_2) \geq \cdots \geq P(x_m)$, the Shannon code assigns codeword lengths $\ell_i = \lceil -\log P(x_i) \rceil$. The codewords are constructed by setting $c(x_i)$ to the first $\ell_i$ bits of the binary expansion of the cumulative probability $F_i = \sum_{j=1}^{i-1} P(x_j)$.

Shannon codes are simple to construct and guarantee $L < H(X) + 1$, but they are not optimal in general. Their importance is primarily theoretical: they demonstrate achievability of the entropy bound.
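As a concrete illustration, here is a minimal sketch of the construction in Python; the probabilities are an arbitrary example, and the bit-extraction loop uses plain floating point, which is adequate for short codewords.

```python
import math

def shannon_code(probs):
    """Shannon code for a list of probabilities sorted in decreasing order.

    Codeword i is the first ceil(-log2 p_i) bits of the binary expansion
    of the cumulative probability F_i = p_0 + ... + p_{i-1}.
    """
    assert all(probs[i] >= probs[i + 1] for i in range(len(probs) - 1))
    codewords = []
    F = 0.0
    for p in probs:
        length = math.ceil(-math.log2(p))
        # Extract the first `length` bits of F's binary expansion
        bits, frac = "", F
        for _ in range(length):
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        codewords.append(bits)
        F += p
    return codewords

print(shannon_code([0.4, 0.3, 0.2, 0.1]))   # ['00', '01', '101', '1110']
```

For this source the Shannon code has expected length 2.4 bits per symbol, whereas the Huffman code constructed below for the same probabilities achieves 1.9 bits; both are under $H(X) + 1 \approx 2.85$ bits, but only the latter is optimal.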

Huffman Coding Algorithm

Complexity: $O(m \log m)$ with a priority queue (heap); $O(m)$ if probabilities are pre-sorted.
Input: Source alphabet $\mathcal{X} = \{x_1, \ldots, x_m\}$ with probabilities $p_1 \geq p_2 \geq \cdots \geq p_m$
Output: Optimal prefix code $c : \mathcal{X} \to \{0,1\}^*$
1. Create a leaf node for each symbol with weight $p_i$
2. While there is more than one node:
a. Take the two nodes with the smallest weights
b. Create a new internal node with these two as children
c. Set the weight of the new node to the sum of children's weights
d. Label the left edge 0 and the right edge 1
3. Read off codewords by tracing root-to-leaf paths

Huffman's algorithm is greedy: at each step it merges the two least probable nodes. The optimality proof shows that this greedy choice is always safe: there is always an optimal code in which the two least probable symbols are siblings at the deepest level of the tree.
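The following is a minimal Python sketch of the algorithm, using `heapq` as the priority queue; the symbol names, the tie-breaking counter, and the example probabilities are illustrative choices rather than part of the algorithm.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}.

    Returns {symbol: codeword}. Ties are broken by insertion order via a
    counter, so other, equally optimal codes exist for the same source.
    """
    # Each heap entry is (weight, tie_breaker, subtree); a subtree is either
    # a symbol (leaf) or a pair of subtrees (internal node).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)     # two smallest weights
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node
            walk(node[0], prefix + "0")       # left edge labelled 0
            walk(node[1], prefix + "1")       # right edge labelled 1
        else:                                 # leaf: record the codeword
            code[node] = prefix or "0"        # degenerate one-symbol source
    walk(heap[0][2], "")
    return code

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(probs)
L = sum(p * len(code[s]) for s, p in probs.items())
print(code, L)    # codeword lengths (1, 2, 3, 3); expected length 1.9 bits
```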

Theorem: Optimality of Huffman Codes

Among all prefix codes for a source with distribution $P$, the Huffman code achieves the minimum expected codeword length $L^*$. Furthermore, $H(X) \leq L^* < H(X) + 1$.

Huffman's algorithm builds the code bottom-up by always merging the two least likely symbols. The optimality proof shows that any deviation from this greedy strategy can only increase the expected length. The proof proceeds by induction on the alphabet size, using the fact that an optimal code for a reduced alphabet (with the two least likely symbols merged) can be "expanded" into an optimal code for the original alphabet.
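A sketch of the key identity behind the induction step, assuming the usual setup in which $x_{m-1}$ and $x_m$ are the two least likely symbols: merge them into a single symbol of probability $p_{m-1} + p_m$ and take a prefix code for the reduced alphabet with expected length $L'$. Expanding the merged leaf back into two children appends one bit to that codeword, which is used with probability $p_{m-1} + p_m$, so

$$L = L' + p_{m-1} + p_m.$$

Since the added term does not depend on the code, minimizing $L$ over codes with the sibling property is equivalent to minimizing $L'$, which closes the induction.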

Example: Huffman Code for a Five-Symbol Source

Construct the Huffman code for $X$ with probabilities $P = (0.35, 0.25, 0.20, 0.12, 0.08)$ and compute the expected codeword length.
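One valid merge sequence (ties may be broken either way without changing the expected length): first merge $0.12 + 0.08 = 0.20$, then $0.20 + 0.20 = 0.40$, then $0.35 + 0.25 = 0.60$, and finally $0.60 + 0.40 = 1$. The resulting codeword lengths are $(2, 2, 2, 3, 3)$, so

$$L = 2(0.35 + 0.25 + 0.20) + 3(0.12 + 0.08) = 2.20 \text{ bits},$$

compared with $H(X) \approx 2.15$ bits, comfortably within the guaranteed 1-bit gap.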

Theorem: Penalty for Mismatched Source Codes

If a Shannon code is designed for distribution $Q$ but the true source distribution is $P$, the expected codeword length satisfies $H(P) + D(P \| Q) \leq L < H(P) + D(P \| Q) + 1$. The penalty $D(P \| Q)$ is the number of extra bits per symbol due to the mismatch.

A code designed for $Q$ assigns length $\approx -\log Q(x)$ to symbol $x$. Under the true distribution $P$, these idealized lengths average to $\mathbb{E}_P[-\log Q(X)] = H(P) + D(P \| Q)$; rounding the lengths up to integers adds less than one extra bit. The KL divergence measures the cost of ignorance about the true distribution, and this gives it yet another operational meaning: it is the compression penalty for using the wrong code.
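A quick numerical check of the idealized identity $\mathbb{E}_P[-\log Q(X)] = H(P) + D(P \| Q)$; the two distributions below are arbitrary examples chosen for illustration.

```python
import math

P = [0.5, 0.25, 0.125, 0.125]   # true source distribution
Q = [0.25, 0.25, 0.25, 0.25]    # distribution the code was designed for

# Expected length of an idealized code with lengths -log2 Q(x), under P
L = sum(p * -math.log2(q) for p, q in zip(P, Q))

H = sum(-p * math.log2(p) for p in P)                  # entropy H(P)
D = sum(p * math.log2(p / q) for p, q in zip(P, Q))    # D(P || Q)

print(L, H + D)   # both equal 2.0: H(P) = 1.75, D(P||Q) = 0.25
```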

Competitive Optimality of Shannon Codes

Shannon codes have a remarkable property beyond average-case performance: they are competitively optimal. The codeword length for symbol $x$ is $\lceil -\log P(x) \rceil$, within 1 bit of the ideal length $-\log P(x)$ for every symbol, not just on average. Moreover, no competing code can do much better on any appreciable set of symbols: if $\ell'(x)$ is the length assignment of any other uniquely decodable code, then $\Pr[\ell(X) \geq \ell'(X) + c] \leq 2^{-(c-1)}$, so the probability that the rival saves $c$ or more bits decays exponentially in $c$.
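A sketch of why the bound holds, assuming $\ell(x) = \lceil -\log P(x) \rceil$ and that $\ell'$ satisfies the Kraft inequality: the event $\ell(X) \geq \ell'(X) + c$ forces $-\log P(X) > \ell'(X) + c - 1$, i.e. $P(X) < 2^{-\ell'(X) - c + 1}$, so

$$\Pr[\ell(X) \geq \ell'(X) + c] \leq \sum_{x:\; P(x) < 2^{-\ell'(x) - c + 1}} P(x) \leq \sum_{x} 2^{-\ell'(x) - c + 1} \leq 2^{-(c-1)},$$

where the last step uses the Kraft inequality $\sum_x 2^{-\ell'(x)} \leq 1$.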

Quick Check

For a source with $P = (1/2, 1/4, 1/8, 1/8)$, what is the expected codeword length of the optimal (Huffman) code?

1.75 bits

2.0 bits

1.5 bits

2.25 bits

Common Mistake: Huffman Codes Are Optimal Among Symbol-by-Symbol Codes Only

Mistake:

Claiming that Huffman codes are the best possible source codes for all blocklengths and all coding schemes.

Correction:

Huffman codes are optimal among prefix codes operating on individual symbols. Block Huffman codes (applied to blocks of $n$ symbols) are optimal among block prefix codes but may be outperformed by arithmetic coding (which in effect assigns fractional bits per symbol) and Lempel-Ziv (which adapts to the source without knowing the distribution). For sources with memory, Huffman coding of individual symbols is far from optimal; block coding or sequential methods like arithmetic coding are necessary.
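A standard illustration of the block-coding point: for a binary source with $P = (0.9, 0.1)$, the per-symbol Huffman code is forced to spend 1 bit per symbol even though $H(X) \approx 0.47$ bits. Huffman coding pairs of symbols (probabilities $0.81, 0.09, 0.09, 0.01$) yields lengths $1, 2, 3, 3$ and an expected $1.29$ bits per pair, i.e. about $0.645$ bits per symbol, already much closer to the entropy.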

Huffman code

A prefix code constructed by the greedy algorithm that repeatedly merges the two least probable symbols. Optimal among all symbol-by-symbol prefix codes for a given distribution.

Related: Huffman Coding Algorithm, Optimality of Huffman Codes

Huffman Code Construction Step by Step

Animated construction of a Huffman code for a five-symbol source, showing the greedy merge process, the resulting code tree, and the comparison of expected length $L$ with entropy $H(X)$.

Key Takeaway

Huffman codes are the optimal symbol-by-symbol prefix codes: the greedy merge algorithm produces a code with $H(X) \leq L^* < H(X) + 1$. Using a code designed for the wrong distribution incurs a penalty of $D(P \| Q)$ bits per symbol (up to the usual rounding of lengths to integers). When all probabilities are dyadic (of the form $2^{-k}$), Huffman matches the entropy exactly; otherwise, the up-to-1-bit gap motivates arithmetic coding (Section 3) and block coding.