Universal Source Coding: Lempel-Ziv

Compression Without Knowing the Source

Shannon codes and Huffman codes require knowledge of the source distribution $P$. Arithmetic coding can adapt to $P$ on the fly, but still needs a model. Is it possible to achieve entropy-rate compression for any stationary ergodic source, without knowing the source statistics at all? The Lempel-Ziv algorithm (1977–1978) gives a resounding yes. LZ parses the source into phrases, each extending a previously seen phrase by one symbol, and this simple scheme achieves the entropy rate universally. It is the engine behind gzip, PNG, GIF, and the entire compress/deflate family of algorithms.

Definition:

Universal Source Code

A source code is universal for a class $\mathcal{C}$ of sources if, for every source $\{X_i\} \in \mathcal{C}$, the per-symbol code length converges to the entropy rate:
$$\frac{1}{n}\ell(\mathbf{X}^n) \to H_\infty \quad \text{in probability (or a.s.)}$$
where $H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$ is the entropy rate of the source.

Universality is a remarkably strong property: a single code, with no knowledge of the source, matches the performance of the best code designed for that specific source.

Lempel-Ziv 78 (LZ78) Algorithm

Complexity: $O(n)$ time with a trie data structure.
Input: Source sequence $x_1, x_2, \ldots$
Output: Parsed phrases, each encoded as (pointer, new symbol)
Initialize: Dictionary $\mathcal{D} = \{\emptyset\}$ (empty string), phrase counter $c = 0$
Set current string $w = \emptyset$
For each source symbol $x_i$:
1. Append: $w \leftarrow w \cdot x_i$
2. If $w \notin \mathcal{D}$:
a. $c \leftarrow c + 1$
b. Output phrase: (index of $w$ with its last symbol removed, $x_i$)
(that is, a pointer to the longest matching prefix, plus the new symbol)
c. Add $w$ to $\mathcal{D}$ as phrase number $c$
d. Reset $w \leftarrow \emptyset$
3. Else continue to the next symbol
Encoding: Each phrase is encoded using $\lceil \log c \rceil$ bits for the pointer
and $\lceil \log |\mathcal{X}| \rceil$ bits for the new symbol.

LZ78 parses the sequence into phrases where each new phrase is the shortest string not previously seen. The parsed phrases form a tree (trie) that grows as the source is processed. LZ77 (the sliding-window variant) is more commonly used in practice (gzip uses it).
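Below is a minimal Python sketch of the parser just described, assuming a binary alphabet (so each new symbol costs one bit); the function name lz78_encode and the tuple-based dictionary are illustrative choices made here, not a reference implementation.

```python
import math

def lz78_encode(symbols):
    """Parse `symbols` with LZ78 and return (phrase list, total code length in bits).

    Each emitted phrase is a pair (pointer, new_symbol): the pointer is the
    dictionary index of the longest previously seen prefix (0 = empty string).
    Phrase number c is charged ceil(log2 c) pointer bits plus 1 bit for the new
    symbol (binary alphabet); a trailing incomplete phrase is simply ignored.
    """
    dictionary = {(): 0}          # phrase (tuple of symbols) -> dictionary index
    phrases = []                  # emitted (pointer, new_symbol) pairs
    bits = 0                      # total code length in bits
    w = ()                        # current string, grown one symbol at a time
    for x in symbols:
        if w + (x,) in dictionary:
            w = w + (x,)                          # still a known phrase; keep extending
        else:
            c = len(phrases) + 1                  # this is phrase number c
            phrases.append((dictionary[w], x))    # pointer to prefix + new symbol
            dictionary[w + (x,)] = c
            bits += math.ceil(math.log2(c)) + 1   # pointer bits + 1 new-symbol bit
            w = ()
    return phrases, bits

phrases, bits = lz78_encode("010010100101")
print(phrases, bits)
```

A production encoder would store the dictionary as an explicit trie, which is what makes the overall parse run in $O(n)$ time.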

Theorem: Optimality of Lempel-Ziv (LZ78)

For any stationary ergodic source $\{X_i\}$ with entropy rate $H_\infty$, the Lempel-Ziv (LZ78) algorithm achieves
$$\frac{1}{n}\ell_{LZ}(\mathbf{X}^n) \to H_\infty \quad \text{almost surely}$$
as $n \to \infty$.

The LZ algorithm learns the source statistics implicitly through the dictionary. Initially, the phrases are short and the code is inefficient. As the dictionary grows, phrases become longer, and the code rate improves. The key insight is that the number of distinct phrases $c(n)$ obtained from parsing $n$ symbols satisfies $c(n) \log c(n) \leq n(H_\infty + o(1))$, which gives the right compression rate. Intuitively, the parsing tree converges to the optimal suffix tree of the source.
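In outline, here is why this bound suffices: each of the $c(n)$ phrases is encoded with at most $\lceil \log c(n) \rceil + \lceil \log |\mathcal{X}| \rceil$ bits, so
$$\frac{1}{n}\ell_{LZ}(\mathbf{X}^n) \;\le\; \frac{c(n)\log c(n)}{n} + \frac{c(n)\,(\log|\mathcal{X}| + 2)}{n} \;\le\; H_\infty + o(1),$$
where the second term vanishes because the number of phrases satisfies $c(n) = O(n/\log n)$ for every sequence (the Lempel-Ziv phrase-counting lemma).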


Example: LZ78 Parsing Example

Parse the binary sequence $\mathbf{x} = 0100101001010100\ldots$ (the first 16 bits) using LZ78 and compute the encoded length.
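One worked solution, following the conventions in the algorithm box above (the trailing incomplete phrase is left pending): the parse is $0 \mid 1 \mid 00 \mid 10 \mid 100 \mid 101 \mid 01$, leaving the final two bits $00$ as an incomplete phrase. The emitted (pointer, symbol) pairs are $(0,0), (0,1), (1,0), (2,0), (4,0), (4,1), (1,1)$. Charging phrase $c$ a cost of $\lceil \log c \rceil + 1$ bits gives $1 + 2 + 3 + 3 + 4 + 4 + 4 = 21$ bits for the 14 source bits covered by complete phrases, so at this blocklength the code actually expands the data, exactly the short-sequence behavior discussed in the Common Mistake below.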

Lempel-Ziv Compression Rate vs. Blocklength

Simulate LZ78 compression on a binary memoryless source and plot the compression rate $\ell_{LZ}/n$ as a function of sequence length $n$. Observe convergence to $H(p)$.

Parameters: $p = 0.3$, $n = 5000$, 5 trials
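A self-contained sketch of this experiment (not the page's interactive demo; the parameter values are the assumed defaults above): it draws a Bernoulli($p$) sequence, computes the LZ78 code length of each prefix with the same bit accounting as in the algorithm box, and prints the per-symbol rate next to $H(p)$.

```python
import math
import random

def lz78_code_length(bits):
    """LZ78 code length in bits: phrase c costs ceil(log2 c) pointer bits + 1 symbol bit.
    The trailing incomplete phrase (if any) is not counted."""
    dictionary, w, c, total = {""}, "", 0, 0
    for b in bits:
        w += b
        if w not in dictionary:
            c += 1
            dictionary.add(w)
            total += math.ceil(math.log2(c)) + 1
            w = ""
    return total

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

random.seed(0)
p, n_max = 0.3, 5000                      # assumed demo defaults
source = "".join("1" if random.random() < p else "0" for _ in range(n_max))
for n in (100, 500, 1000, 2000, 5000):
    rate = lz78_code_length(source[:n]) / n
    print(f"n={n:5d}   LZ78 rate = {rate:.3f}   H(p) = {binary_entropy(p):.3f}")
```

Even at $n = 5000$ the printed rate sits well above $H(0.3) \approx 0.881$, illustrating the slow convergence discussed below.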

Historical Note: Lempel and Ziv: The Universality Revolution

1977–1978

Abraham Lempel and Jacob Ziv published their two foundational papers in 1977 (LZ77, sliding-window) and 1978 (LZ78, dictionary-based). The impact was enormous: for the first time, a practical algorithm achieved the entropy rate of any stationary ergodic source without knowing the source statistics. Before Lempel-Ziv, universal coding existed in theory (minimax codes, two-part codes) but not in practical form. The LZ family spawned an entire ecosystem: LZW (used in GIF, Unix compress), DEFLATE (used in gzip, ZIP, PNG), LZMA (used in 7-Zip, xz), and Zstandard (used in modern Linux kernels and databases). Ziv received the Claude E. Shannon Award in 1997 and the IEEE Medal of Honor in 2021; the Lempel-Ziv-Welch patent (LZW, 1984) was one of the most commercially significant patents in computing history.

🔧 Engineering Note

LZ in Modern Compression Pipelines

Modern compression tools rarely use LZ alone. The typical pipeline is:

  1. Preprocessing: transform the data to expose redundancy (e.g., BWT in bzip2, delta coding)
  2. LZ-style matching: find repeated patterns (LZ77 sliding window in DEFLATE/Zstandard)
  3. Entropy coding: encode the LZ output (lengths, distances) with Huffman or arithmetic coding

This layered approach achieves compression ratios far better than any single method. The key insight: LZ captures structural redundancy (repeated patterns), while entropy coding captures statistical redundancy (non-uniform symbol frequencies). Together they approach the true entropy rate of the source.
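As a small illustration of the layering, the sketch below uses Python's standard zlib module, whose compress function implements DEFLATE (LZ77 matching followed by Huffman coding). The two inputs are made up for this example: one has structural redundancy (long exact repeats), the other mostly statistical redundancy (skewed byte frequencies, few long repeats).

```python
import random
import zlib

# Structural redundancy: long exact repeats that LZ77 matching exploits.
structured = b"abcdefgh" * 1000                    # 8000 bytes

# Statistical redundancy: skewed symbol frequencies, little repeated structure.
random.seed(0)
skewed = bytes(random.choices(b"aab", k=8000))     # ~2/3 'a', ~1/3 'b'

for name, data in (("structured", structured), ("skewed", skewed)):
    out = zlib.compress(data, 9)                   # DEFLATE = LZ77 + Huffman coding
    print(f"{name:10s}: {len(data)} bytes -> {len(out)} bytes")
```

Typically the structured input compresses far more aggressively than the skewed one, reflecting the two kinds of redundancy described above.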

Common Mistake: LZ Convergence Is Slow

Mistake:

Expecting LZ compression to achieve near-entropy rates on short sequences (e.g., a few hundred symbols).

Correction:

LZ convergence to the entropy rate is notoriously slow: the rate of convergence is $O(\log \log n / \log n)$, much slower than Huffman or arithmetic coding with a known distribution. For short sequences, LZ may actually expand the data (the dictionary overhead exceeds the savings). In practice, this is mitigated by combining LZ with entropy coding and by tuning window sizes and dictionary resets.

Universal source code

A compression scheme that achieves the entropy rate of any source in a specified class (e.g., all stationary ergodic sources) without knowing the source distribution. Lempel-Ziv is the canonical example.

Related: Universal Source Code

Lempel-Ziv Parsing in Action

Step-by-step animation of LZ78 parsing a binary sequence, showing dictionary growth, phrase identification, and the encoding of each phrase as (pointer, new symbol). Demonstrates how the dictionary implicitly learns source statistics.

Key Takeaway

The Lempel-Ziv algorithm achieves the entropy rate of any stationary ergodic source without knowing the source distribution: this is universality. The parsing-based approach learns source statistics implicitly through the growing dictionary. While convergence is slow ($O(\log\log n / \log n)$), the practical impact is immense: LZ variants form the backbone of gzip, PNG, ZIP, Zstandard, and nearly all file compression in use today.