Shannon Codes, Huffman Codes, and Optimality

Constructing Optimal Codes

We know from Section 1 that any prefix code satisfies $L \geq H(X)$ and that Shannon codes achieve $L < H(X) + 1$. But can we do better? Is there a code that minimizes the expected length exactly? The answer is yes: Huffman's algorithm constructs an optimal prefix code for any source distribution, and the proof is a beautiful application of the greedy method. We also examine what happens when the code is designed for the wrong distribution: the penalty is exactly the KL divergence.

Definition:

Shannon Code

For a source $X$ with distribution $P(x_1) \geq P(x_2) \geq \cdots \geq P(x_m)$, the Shannon code assigns codeword lengths $\ell_i = \lceil -\log P(x_i) \rceil$. The codewords are constructed by setting $c(x_i)$ to the first $\ell_i$ bits of the binary expansion of the cumulative probability $F_i = \sum_{j=1}^{i-1} P(x_j)$.

Shannon codes are simple to construct and guarantee $L < H(X) + 1$, but they are not optimal in general. Their importance is primarily theoretical: they demonstrate achievability of the entropy bound.
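As a concrete illustration, here is a minimal sketch of the construction in Python; the probabilities are an arbitrary example, and the bit-extraction loop uses plain floating point, which is adequate for short codewords.

```python
import math

def shannon_code(probs):
    """Shannon code for a list of probabilities sorted in decreasing order.

    Codeword i is the first ceil(-log2 p_i) bits of the binary expansion
    of the cumulative probability F_i = p_0 + ... + p_{i-1}.
    """
    assert all(probs[i] >= probs[i + 1] for i in range(len(probs) - 1))
    codewords = []
    F = 0.0
    for p in probs:
        length = math.ceil(-math.log2(p))
        # Extract the first `length` bits of F's binary expansion
        bits, frac = "", F
        for _ in range(length):
            frac *= 2
            bit = int(frac)
            bits += str(bit)
            frac -= bit
        codewords.append(bits)
        F += p
    return codewords

print(shannon_code([0.4, 0.3, 0.2, 0.1]))   # ['00', '01', '101', '1110']
```

For this source the Shannon code has expected length 2.4 bits per symbol, whereas the Huffman code constructed below for the same probabilities achieves 1.9 bits; both are under $H(X) + 1 \approx 2.85$ bits, but only the latter is optimal.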

Huffman Coding Algorithm

Complexity: $O(m \log m)$ with a priority queue (heap); $O(m)$ if probabilities are pre-sorted.
Input: Source alphabet $\mathcal{X} = \{x_1, \ldots, x_m\}$ with probabilities $p_1 \geq p_2 \geq \cdots \geq p_m$
Output: Optimal prefix code $c : \mathcal{X} \to \{0,1\}^*$
1. Create a leaf node for each symbol with weight $p_i$
2. While there is more than one node:
a. Take the two nodes with the smallest weights
b. Create a new internal node with these two as children
c. Set the weight of the new node to the sum of children's weights
d. Label the left edge 0 and the right edge 1
3. Read off codewords by tracing root-to-leaf paths

Huffman's algorithm is greedy: at each step it merges the two least probable nodes. The optimality proof shows that this greedy choice is always safe: there is always an optimal code in which the two least probable symbols are siblings at the deepest level of the tree.
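The following is a minimal Python sketch of the algorithm, using `heapq` as the priority queue; the symbol names, the tie-breaking counter, and the example probabilities are illustrative choices rather than part of the algorithm.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}.

    Returns {symbol: codeword}. Ties are broken by insertion order via a
    counter, so other, equally optimal codes exist for the same source.
    """
    # Each heap entry is (weight, tie_breaker, subtree); a subtree is either
    # a symbol (leaf) or a pair of subtrees (internal node).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)     # two smallest weights
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node
            walk(node[0], prefix + "0")       # left edge labelled 0
            walk(node[1], prefix + "1")       # right edge labelled 1
        else:                                 # leaf: record the codeword
            code[node] = prefix or "0"        # degenerate one-symbol source
    walk(heap[0][2], "")
    return code

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(probs)
L = sum(p * len(code[s]) for s, p in probs.items())
print(code, L)    # codeword lengths (1, 2, 3, 3); expected length 1.9 bits
```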

Theorem: Optimality of Huffman Codes

Among all prefix codes for a source with distribution $P$, the Huffman code achieves the minimum expected codeword length $L^*$. Furthermore, $H(X) \leq L^* < H(X) + 1$.

Huffman's algorithm builds the code bottom-up by always merging the two least likely symbols. The optimality proof shows that any deviation from this greedy strategy can only increase the expected length. The proof proceeds by induction on the alphabet size, using the fact that an optimal code for a reduced alphabet (with the two least likely symbols merged) can be "expanded" into an optimal code for the original alphabet.
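A sketch of the key identity behind the induction step, assuming the usual setup in which $x_{m-1}$ and $x_m$ are the two least likely symbols: merge them into a single symbol of probability $p_{m-1} + p_m$ and take a prefix code for the reduced alphabet with expected length $L'$. Expanding the merged leaf back into two children appends one bit to that codeword, which is used with probability $p_{m-1} + p_m$, so

$$L = L' + p_{m-1} + p_m.$$

Since the added term does not depend on the code, minimizing $L$ over codes with the sibling property is equivalent to minimizing $L'$, which closes the induction.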

Example: Huffman Code for a Five-Symbol Source

Construct the Huffman code for $X$ with probabilities $P = (0.35, 0.25, 0.20, 0.12, 0.08)$ and compute the expected codeword length.
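One valid merge sequence (ties may be broken either way without changing the expected length): first merge $0.12 + 0.08 = 0.20$, then $0.20 + 0.20 = 0.40$, then $0.35 + 0.25 = 0.60$, and finally $0.60 + 0.40 = 1$. The resulting codeword lengths are $(2, 2, 2, 3, 3)$, so

$$L = 2(0.35 + 0.25 + 0.20) + 3(0.12 + 0.08) = 2.20 \text{ bits},$$

compared with $H(X) \approx 2.15$ bits, comfortably within the guaranteed 1-bit gap.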

Theorem: Penalty for Mismatched Source Codes

If a Shannon code is designed for distribution $Q$ but the true source distribution is $P$, the expected codeword length satisfies $H(P) + D(P \| Q) \leq L < H(P) + D(P \| Q) + 1$. The penalty $D(P \| Q)$ is the number of extra bits per symbol due to the mismatch.

A code designed for $Q$ assigns length $\approx -\log Q(x)$ to symbol $x$. Under the true distribution $P$, these idealized lengths average to $\mathbb{E}_P[-\log Q(X)] = H(P) + D(P \| Q)$; rounding the lengths up to integers adds less than one extra bit. The KL divergence measures the cost of ignorance about the true distribution, and this gives it yet another operational meaning: it is the compression penalty for using the wrong code.
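A quick numerical check of the idealized identity $\mathbb{E}_P[-\log Q(X)] = H(P) + D(P \| Q)$; the two distributions below are arbitrary examples chosen for illustration.

```python
import math

P = [0.5, 0.25, 0.125, 0.125]   # true source distribution
Q = [0.25, 0.25, 0.25, 0.25]    # distribution the code was designed for

# Expected length of an idealized code with lengths -log2 Q(x), under P
L = sum(p * -math.log2(q) for p, q in zip(P, Q))

H = sum(-p * math.log2(p) for p in P)                  # entropy H(P)
D = sum(p * math.log2(p / q) for p, q in zip(P, Q))    # D(P || Q)

print(L, H + D)   # both equal 2.0: H(P) = 1.75, D(P||Q) = 0.25
```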

Competitive Optimality of Shannon Codes

Shannon codes have a remarkable property beyond average-case performance: they are competitively optimal. The codeword length for symbol $x$ is $\lceil -\log P(x) \rceil$, within 1 bit of the ideal length $-\log P(x)$ for every symbol, not just on average. Moreover, no competing code can do much better on any appreciable set of symbols: if $\ell'(x)$ is the length assignment of any other uniquely decodable code, then $\Pr[\ell(X) \geq \ell'(X) + c] \leq 2^{-(c-1)}$, so the probability that the rival saves $c$ or more bits decays exponentially in $c$.
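A sketch of why the bound holds, assuming $\ell(x) = \lceil -\log P(x) \rceil$ and that $\ell'$ satisfies the Kraft inequality: the event $\ell(X) \geq \ell'(X) + c$ forces $-\log P(X) > \ell'(X) + c - 1$, i.e. $P(X) < 2^{-\ell'(X) - c + 1}$, so

$$\Pr[\ell(X) \geq \ell'(X) + c] \leq \sum_{x:\; P(x) < 2^{-\ell'(x) - c + 1}} P(x) \leq \sum_{x} 2^{-\ell'(x) - c + 1} \leq 2^{-(c-1)},$$

where the last step uses the Kraft inequality $\sum_x 2^{-\ell'(x)} \leq 1$.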

Quick Check

For a source with $P = (1/2, 1/4, 1/8, 1/8)$, what is the expected codeword length of the optimal (Huffman) code?

1.75 bits

2.0 bits

1.5 bits

2.25 bits

Common Mistake: Huffman Codes Are Optimal Among Symbol-by-Symbol Codes Only

Mistake:

Claiming that Huffman codes are the best possible source codes for all blocklengths and all coding schemes.

Correction:

Huffman codes are optimal among prefix codes operating on individual symbols. Block Huffman codes (applied to blocks of $n$ symbols) are optimal among block prefix codes but may be outperformed by arithmetic coding (which in effect assigns fractional bits per symbol) and Lempel-Ziv (which adapts to the source without knowing the distribution). For sources with memory, Huffman coding of individual symbols is far from optimal; block coding or sequential methods like arithmetic coding are necessary.
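A standard illustration of the block-coding point: for a binary source with $P = (0.9, 0.1)$, the per-symbol Huffman code is forced to spend 1 bit per symbol even though $H(X) \approx 0.47$ bits. Huffman coding pairs of symbols (probabilities $0.81, 0.09, 0.09, 0.01$) yields lengths $1, 2, 3, 3$ and an expected $1.29$ bits per pair, i.e. about $0.645$ bits per symbol, already much closer to the entropy.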

Huffman code

A prefix code constructed by the greedy algorithm that repeatedly merges the two least probable symbols. Optimal among all symbol-by-symbol prefix codes for a given distribution.

Related: Huffman Coding Algorithm, Optimality of Huffman Codes

Huffman Code Construction Step by Step

Animated construction of a Huffman code for a five-symbol source, showing the greedy merge process, the resulting code tree, and the comparison of expected length $L$ with entropy $H(X)$.

Key Takeaway

Huffman codes are the optimal symbol-by-symbol prefix codes: the greedy merge algorithm produces a code with $H(X) \leq L^* < H(X) + 1$. Using a code designed for the wrong distribution incurs a penalty of $D(P \| Q)$ bits per symbol (up to the usual rounding of lengths to integers). When all probabilities are dyadic (of the form $2^{-k}$), Huffman matches the entropy exactly; otherwise, the up-to-1-bit gap motivates arithmetic coding (Section 3) and block coding.