Prefix Codes and Kraft's Inequality

The Fundamental Question of Source Coding

Suppose we observe a discrete random variable $X$ and want to describe it using a binary string. How short can we make the description, on average? Shannon's answer, $H(X)$ bits, is one of the most elegant results in all of mathematics. But to make this precise, we need to define what we mean by a "code" and what properties it must have to be useful. The key requirement is instantaneous decodability: a decoder must be able to identify each codeword the moment it is received, without waiting for future symbols. This leads us to prefix codes, and Kraft's inequality tells us exactly which codeword length assignments are possible.

Definition: Source Code

A source code $c$ for a random variable $X$ with alphabet $\mathcal{X} = \{x_1, \ldots, x_m\}$ is a mapping $c : \mathcal{X} \to \mathcal{D}^*$ from source symbols to finite-length strings over a code alphabet $\mathcal{D}$ (typically $\mathcal{D} = \{0, 1\}$). The string $c(x)$ is called the codeword for $x$, and $\ell(x) = |c(x)|$ is its length.

Definition: Uniquely Decodable Code

A code $c$ is uniquely decodable if the extension $c^* : \mathcal{X}^* \to \mathcal{D}^*$ defined by concatenation, $c^*(x_1 x_2 \cdots x_n) = c(x_1) c(x_2) \cdots c(x_n)$, is injective; that is, no two distinct source sequences produce the same encoded string.

Unique decodability is the minimum requirement for a useful code. Without it, the decoder cannot recover the source sequence from the binary representation.

Definition: Prefix Code (Instantaneous Code)

A code $c$ is a prefix code (also called an instantaneous code) if no codeword is a prefix of any other codeword. Formally: for all $x \neq x'$, $c(x)$ is not a prefix of $c(x')$.

Every prefix code is uniquely decodable, but not every uniquely decodable code is a prefix code. The advantage of prefix codes is that they can be decoded symbol by symbol without looking ahead: the decoder recognizes the end of each codeword immediately. This is essential for real-time applications.
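Instantaneous decoding is easy to demonstrate. The sketch below assumes a hypothetical four-symbol prefix code (the same codewords as Code 1 in the example that follows) and emits each symbol the moment its codeword completes:

```python
# A minimal sketch of instantaneous decoding, assuming this hypothetical
# prefix-free code table (no codeword is a prefix of another).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
decode_table = {cw: sym for sym, cw in code.items()}

def decode(bits: str) -> str:
    """Scan left to right; emit a symbol the moment a codeword completes."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode_table:          # codeword boundary recognized instantly
            out.append(decode_table[buf])
            buf = ""
    if buf:
        raise ValueError("stream ends mid-codeword")
    return "".join(out)

print(decode("0101101110"))  # prints "abcda"
```

No lookahead is ever needed: because no codeword extends another, the first match in the buffer is always the right one.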

Example: Prefix Code vs. Non-Prefix Code

Let $\mathcal{X} = \{a, b, c, d\}$. Consider two codes:

Symbol  Code 1  Code 2
a       0       0
b       10      01
c       110     011
d       111     0111

Determine which codes are prefix codes and which are uniquely decodable.
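Both questions can be answered mechanically: a pairwise scan settles the prefix property, and the standard Sardinas-Patterson test settles unique decodability. A sketch in Python, applied to the two codes from the table:

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    return not any(a != b and b.startswith(a) for a in code for b in code)

def dangling(A, B):
    # Suffixes left over when a word of A is a proper prefix of a word of B.
    return {b[len(a):] for a in A for b in B if a != b and b.startswith(a)}

def is_uniquely_decodable(code):
    """Sardinas-Patterson test: UD iff no dangling suffix is itself a codeword."""
    C = set(code)
    S, seen = dangling(C, C), set()
    while S:
        if S & C:
            return False
        seen |= S
        S = (dangling(C, S) | dangling(S, C)) - seen
    return True

code1 = ["0", "10", "110", "111"]
code2 = ["0", "01", "011", "0111"]
print(is_prefix_free(code1), is_uniquely_decodable(code1))  # True True
print(is_prefix_free(code2), is_uniquely_decodable(code2))  # False True
```

Code 1 is a prefix code (hence uniquely decodable). Code 2 is not prefix-free, since 0 is a prefix of every other codeword, yet it is uniquely decodable: each 0 marks the start of a new codeword.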

Theorem: Kraft's Inequality

For any prefix code over a $D$-ary alphabet with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$:

$$\sum_{i=1}^{m} D^{-\ell_i} \leq 1.$$

Conversely, if the lengths $\ell_1, \ldots, \ell_m$ satisfy this inequality, there exists a prefix code with these codeword lengths.

Think of the code as a $D$-ary tree. Each codeword is a leaf, and a codeword of length $\ell$ "uses up" a fraction $D^{-\ell}$ of the tree's capacity. The total capacity is 1 (the root), and the leaves must not overlap (prefix condition), so the fractions must sum to at most 1. The converse says that any length assignment respecting this budget can be realized as an actual prefix code: we can always arrange the codewords on the tree.
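The converse direction can be made constructive: sort the lengths, then assign each codeword the leftmost free node at its depth, which never collides as long as the Kraft sum stays within budget. A sketch (the length profile 1, 2, 3, 3 matches Code 1 from the earlier example):

```python
def kraft_sum(lengths, D=2):
    """Kraft sum for a D-ary code with the given codeword lengths."""
    return sum(D ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Greedy (canonical) construction: succeeds exactly when the Kraft sum <= 1."""
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft's inequality violated: no prefix code exists")
    codewords, val, prev = [], 0, 0
    for l in sorted(lengths):
        val <<= l - prev                 # descend to depth l in the binary tree
        codewords.append(format(val, f"0{l}b"))
        val += 1                         # step to the next free node at this depth
        prev = l
    return codewords

print(kraft_sum([1, 2, 3, 3]))                 # prints 1.0 (budget exactly used)
print(prefix_code_from_lengths([1, 2, 3, 3]))  # prints ['0', '10', '110', '111']
```

Because each new codeword is taken from the next unused node at its depth, no earlier codeword can be its prefix; the Kraft budget guarantees the tree never runs out of room.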

McMillan's Extension

A remarkable fact: Kraft's inequality holds not only for prefix codes but for all uniquely decodable codes (McMillan, 1956). This means that restricting attention to prefix codes costs nothing in terms of codeword lengths: any achievable length vector for a uniquely decodable code is also achievable by a prefix code. We therefore lose nothing by working exclusively with prefix codes, which have the additional advantage of instantaneous decodability.

Theorem: Source Coding Converse

For any uniquely decodable binary code for a source $X$ with distribution $P$, the expected codeword length satisfies $L = \mathbb{E}[\ell(X)] \geq H(X)$.

We cannot compress below entropy: any code that tries to use fewer than $H(X)$ bits per symbol on average will fail to be uniquely decodable. Entropy is not just a measure of randomness; it is the operational minimum description length.
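A quick numerical check of the bound, assuming a hypothetical dyadic source $P = (1/2, 1/4, 1/8, 1/8)$ paired with codeword lengths $(1, 2, 3, 3)$; because every probability is a power of two, the bound holds with equality:

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]      # hypothetical dyadic distribution
lengths = [1, 2, 3, 3]             # lengths of a matching prefix code

H = entropy(p)
L = sum(pi * li for pi, li in zip(p, lengths))
print(H, L)                        # both 1.75: L >= H, with equality here
```

Equality occurs precisely because $\ell_i = \log_2(1/p_i)$ exactly for each symbol; for non-dyadic distributions, $L$ strictly exceeds $H(X)$.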

Theorem: Shannon's Source Coding Theorem (Symbol-by-Symbol)

For a DMS with distribution $P$ on alphabet $\mathcal{X}$, there exists a prefix code with expected codeword length

$$H(X) \leq L^* < H(X) + 1.$$

More generally, for block coding of $n$ symbols:

$$H(X) \leq \frac{1}{n} L_n^* < H(X) + \frac{1}{n}.$$

Shannon's theorem says entropy is achievable: we can get within 1 bit of $H(X)$ with a symbol-by-symbol code, and within $1/n$ bits per symbol with a block code of length $n$. The gap shrinks to zero as $n \to \infty$, so in the limit, entropy is the exact compression rate.
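The upper bound comes from Shannon's explicit length choice $\ell_i = \lceil \log_2(1/p_i) \rceil$, which always satisfies Kraft's inequality and overshoots each ideal length $\log_2(1/p_i)$ by less than one bit. A sketch with a hypothetical distribution:

```python
from math import ceil, log2

def shannon_lengths(p):
    """Shannon's length choice: l_i = ceil(log2(1/p_i))."""
    return [ceil(-log2(pi)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]                     # hypothetical source distribution
lengths = shannon_lengths(p)
H = -sum(pi * log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

assert sum(2.0 ** -l for l in lengths) <= 1  # Kraft holds: a prefix code exists
assert H <= L < H + 1                        # within one bit of entropy
print(lengths, round(L, 2))                  # prints [2, 2, 3, 4] 2.4
```

Averaging $\ell_i < \log_2(1/p_i) + 1$ over $P$ gives $L < H(X) + 1$ directly; applying the same construction to blocks of $n$ symbols spreads the one-bit overhead across $n$ source letters.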

Binary Code Tree and Kraft's Inequality

Visualize a binary code tree. Select codeword lengths and see the tree structure, the Kraft sum, and whether the assignment is valid. The plot highlights available and occupied branches.

(Interactive demo; default codeword lengths: 1, 2, 3, 3.)

Common Mistake: Uniquely Decodable Does Not Mean Instantaneous

Mistake:

Assuming that any uniquely decodable code can be decoded symbol-by-symbol in real time.

Correction:

Only prefix codes are instantaneously decodable. A uniquely decodable code that is not prefix-free may require the decoder to wait until the entire message is received before it can determine the first symbol. By McMillan's theorem, any uniquely decodable code has the same length vector as some prefix code, so we never need non-prefix codes in practice.
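The delay can even be unbounded. Consider the hypothetical uniquely decodable, non-prefix code a → 1, b → 10, c → 00: after reading the initial 1, the decoder must count every trailing zero before it knows whether the first symbol was a or b.

```python
# Hypothetical uniquely decodable, non-prefix code: a->"1", b->"10", c->"00".
# The stream "1" followed by 2k zeros parses as a c^k   ("1"  + "00"*k);
# "1" followed by 2k+1 zeros parses as       b c^k   ("10" + "00"*k).
def parse(bits: str) -> str:
    """Decode a stream of the form '1' + '0'*n under this particular code."""
    n_zeros = len(bits) - 1
    first = "a" if n_zeros % 2 == 0 else "b"
    return first + "c" * (n_zeros // 2)

print(parse("1" + "0" * 6))  # prints "accc": first symbol is 'a'
print(parse("1" + "0" * 7))  # prints "bccc": one more bit flips it to 'b'
```

Decoding is unambiguous, but the first symbol cannot be emitted until the stream ends, so the code is not instantaneous.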

Prefix code

A code in which no codeword is a prefix of another. Equivalently, the codewords correspond to leaves of a binary tree. Prefix codes are instantaneously decodable: the decoder recognizes each codeword boundary without lookahead.

Related: Prefix Code (Instantaneous Code)

Kraft's inequality

The constraint $\sum_i 2^{-\ell_i} \leq 1$ that any binary prefix code must satisfy. Equivalently, the necessary and sufficient condition on codeword lengths for a prefix code to exist.

Related: Kraft's Inequality

Historical Note: Kraft, McMillan, and the Tree Argument

1949–1956

Kraft's inequality was proved in Leon Kraft's 1949 MIT master's thesis, supervised by Robert Fano. Brockway McMillan independently proved in 1956 that the same inequality holds for all uniquely decodable codes, not just prefix codes: a surprising strengthening that eliminated any reason to consider non-prefix codes. The tree argument sketched above is due to Kraft; McMillan's proof uses generating functions and is more algebraic in flavor. Together, these results established prefix codes as the "right" class of codes for lossless compression.

Key Takeaway

Prefix codes are the right framework for lossless source coding: they are instantaneously decodable, and by McMillan's theorem, they achieve all codeword lengths achievable by any uniquely decodable code. Kraft's inequality ($\sum_i 2^{-\ell_i} \leq 1$) is both necessary and sufficient for a prefix code to exist. The entropy $H(X)$ is a sharp lower bound on the expected codeword length, achievable to within 1 bit by Shannon codes and to within $1/n$ by block codes of length $n$.