Universal Source Coding: Lempel-Ziv
Compression Without Knowing the Source
Shannon codes and Huffman codes require knowledge of the source distribution $p(x)$. Arithmetic coding can adapt to $p$ on the fly, but still needs a model. Is it possible to achieve entropy-rate compression for any stationary ergodic source, without knowing the source statistics at all? The Lempel-Ziv algorithms (1977–1978) give a resounding yes. LZ parses the source into phrases, each extending a previously seen phrase by one symbol, and this simple scheme achieves the entropy rate universally. It is the engine behind gzip, PNG, GIF, and the entire compress/deflate family of algorithms.
Definition: Universal Source Code
A source code is universal for a class of sources if, for every source in the class, the expected per-symbol code length converges to the entropy rate:
$$\lim_{n \to \infty} \frac{1}{n}\,\mathbb{E}\!\left[\ell(X_1^n)\right] = H_\infty,$$
where $H_\infty$ is the entropy rate of the source.
Universality is a remarkably strong property: a single code, with no knowledge of the source, matches the performance of the best code designed for that specific source.
Lempel-Ziv 78 (LZ78) Algorithm
Complexity: $O(n)$ time with a trie data structure. LZ78 parses the sequence into phrases, where each new phrase is the shortest prefix of the remaining input not previously seen; equivalently, each phrase extends a previously seen phrase by one symbol. The parsed phrases form a tree (trie) that grows as the source is processed. LZ77 (the sliding-window variant) is more commonly used in practice (gzip uses it).
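A minimal sketch of the parser in Python (the function name lz78_parse and the (pointer, symbol) output format are our choices for illustration):

```python
def lz78_parse(s):
    """Parse a string into LZ78 phrases.

    Returns a list of (pointer, symbol) pairs, where `pointer` is the
    dictionary index of the longest previously seen phrase that prefixes
    the new phrase (0 = the empty phrase) and `symbol` is the one new symbol.
    """
    dictionary = {"": 0}   # phrase -> index; the empty phrase has index 0
    output = []
    phrase = ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch   # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # register new phrase
            phrase = ""
    if phrase:             # a final, incomplete phrase may remain
        output.append((dictionary[phrase], None))
    return output
```

Running it on the 16-bit sequence of the worked example below reproduces the phrase list given there.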
Theorem: Optimality of Lempel-Ziv (LZ78)
For any stationary ergodic source with entropy rate $H_\infty$, the Lempel-Ziv (LZ78) algorithm achieves
$$\frac{\ell(X_1^n)}{n} \to H_\infty \quad \text{with probability } 1$$
as $n \to \infty$.
The LZ algorithm learns the source statistics implicitly through the dictionary. Initially, the phrases are short and the code is inefficient. As the dictionary grows, phrases become longer and the code rate improves. The key insight is that the number of distinct phrases $c(n)$ obtained by parsing $n$ symbols satisfies $c(n) \le \frac{n}{(1-\epsilon_n)\log_2 n}$ with $\epsilon_n \to 0$, which gives the right compression rate. Intuitively, the parsing tree converges to the optimal suffix tree of the source.
Counting phrases
Let $c(n)$ be the number of phrases when parsing $x_1^n$. Each phrase is encoded using $\log c(n) + 1$ bits: a pointer to a previously seen phrase plus one bit for the new symbol. The total code length is $\ell(x_1^n) \approx c(n)\,(\log c(n) + 1)$.
Phrase count bound (Ziv's lemma)
Ziv proved that for any stationary ergodic source:
$$\limsup_{n\to\infty} \frac{c(n)\log c(n)}{n} \le H_\infty \quad \text{with probability } 1.$$
This uses the fact that each phrase is distinct, so the phrases represent $c(n)$ distinct strings whose total length is $n$. The average phrase length is $n/c(n)$, and the logarithm of the number of typical strings of this length is approximately the entropy rate times the phrase length.
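Informally, the whole argument compresses into one chain of estimates (a heuristic sketch, not the full proof): the $c(n)$ distinct phrases have average length $\bar{\ell} = n/c(n)$, and by the AEP there are only about $2^{\bar{\ell} H_\infty}$ typical strings of that length to draw them from, so
$$c(n) \lesssim 2^{\bar{\ell}\, H_\infty} \quad\Longrightarrow\quad \log_2 c(n) \lesssim \frac{n}{c(n)}\, H_\infty \quad\Longrightarrow\quad \frac{c(n)\log_2 c(n)}{n} \lesssim H_\infty.$$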
Compression rate
Combining the code-length estimate $\ell(x_1^n) \approx c(n)(\log c(n) + 1)$ with Ziv's lemma gives
$$\limsup_{n\to\infty} \frac{\ell(X_1^n)}{n} \le H_\infty \quad \text{with probability } 1.$$
Since every lossless code satisfies $\liminf_{n\to\infty} \frac{\ell(X_1^n)}{n} \geq H_\infty$, LZ78 achieves $H_\infty$ exactly.
Example: LZ78 Parsing
Parse the binary sequence $0100010000100110\ldots$ (the first 16 bits) using LZ78 and compute the encoded length.
Parsing
Starting with a dictionary containing only the empty phrase (index 0):
- Phrase 1: $0$ (new). Output $(0, 0)$. Dictionary: $\{1: 0\}$.
- Phrase 2: $1$ (new). Output $(0, 1)$. Dictionary: $\{1: 0,\ 2: 1\}$.
- Phrase 3: $00$ ($0$ found, $00$ new). Output $(1, 0)$. Dictionary adds $3: 00$.
- Phrase 4: $01$ ($0$ found, $01$ new). Output $(1, 1)$. Dictionary adds $4: 01$.
- Phrase 5: $000$ ($00$ found, $000$ new). Output $(3, 0)$. Dictionary adds $5: 000$.
- Phrase 6: $010$ ($01$ found, $010$ new). Output $(4, 0)$. Dictionary adds $6: 010$.
- Phrase 7: $011$ ($01$ found, $011$ new). Output $(4, 1)$. Dictionary adds $7: 011$.
- Phrase 8: $0\ldots$ ($0$ found, continues).
After 7 phrases we have consumed 15 symbols.
Encoded length
Each phrase $i$ needs $\lceil \log_2 i \rceil$ bits for the pointer plus 1 bit for the new symbol. Phrase 1: $0 + 1 = 1$ bit. Phrase 2: $1 + 1 = 2$ bits. Phrases 3–7: $\lceil \log_2 i \rceil + 1$ bits with $\lceil \log_2 i \rceil$ growing ($2, 2, 3, 3, 3$). Total: $1 + 2 + 3 + 3 + 4 + 4 + 4 = 21$ bits for 15 source bits.
The overhead is large for short sequences ($21/15 = 1.4$ bits per source bit here), but as $n \to \infty$, the ratio $\ell(x_1^n)/n$ converges to the entropy rate $H_\infty$.
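A quick check of this count, reusing the lz78_parse sketch from above (assuming it is already defined):

```python
from math import ceil, log2

seq = "010001000010011"   # the first 15 bits: phrases 1-7 exactly
phrases = lz78_parse(seq)
print(phrases)  # [(0,'0'), (0,'1'), (1,'0'), (1,'1'), (3,'0'), (4,'0'), (4,'1')]

# Phrase i can point to any of the i previous phrases (indices 0..i-1),
# so it needs ceil(log2(i)) pointer bits plus 1 bit for the new symbol.
total = sum(ceil(log2(i)) + 1 for i in range(1, len(phrases) + 1))
print(total)    # 21
```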
Lempel-Ziv Compression Rate vs. Blocklength
Simulate LZ78 compression on a binary memoryless source with $P(X_i = 1) = p$ and plot the compression rate $\ell(X_1^n)/n$ as a function of sequence length $n$. Observe the (slow) convergence to the entropy $H(p)$.
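A minimal simulation sketch, assuming a Bernoulli($p$) source with $p = 0.1$ and a few blocklengths chosen for illustration (print statements stand in for the plot):

```python
import random
from math import ceil, log2

def lz78_code_length(s):
    """Total LZ78 code length in bits for a binary string
    (final partial phrase ignored; negligible for large n)."""
    dictionary, phrase, bits, count = {"": 0}, "", 0, 0
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            count += 1
            bits += ceil(log2(count)) + 1   # pointer bits + new-symbol bit
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    return bits

p = 0.1   # assumed source bias; H(0.1) is about 0.469 bits/symbol
random.seed(0)
for n in [10**3, 10**4, 10**5, 10**6]:
    s = "".join("1" if random.random() < p else "0" for _ in range(n))
    print(n, lz78_code_length(s) / n)   # rate slowly approaches H(p)
```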
Historical Note: Lempel, Ziv, and the Universality Revolution
1977–1978: Abraham Lempel and Jacob Ziv published their two foundational papers in 1977 (LZ77, sliding-window) and 1978 (LZ78, dictionary-based). The impact was enormous: for the first time, a practical algorithm achieved the entropy rate of any stationary ergodic source without knowing the source statistics. Before Lempel-Ziv, universal coding existed in theory (minimax codes, two-part codes) but not in practical form. The LZ family spawned an entire ecosystem: LZW (used in GIF, Unix compress), DEFLATE (used in gzip, ZIP, PNG), LZMA (used in 7-Zip, xz), and Zstandard (used in modern Linux kernels and databases). Ziv received the IEEE Medal of Honor in 1997; the Lempel-Ziv-Welch patent (LZW, 1984) was one of the most commercially significant patents in computing history.
LZ in Modern Compression Pipelines
Modern compression tools rarely use LZ alone. The typical pipeline is:
- Preprocessing – transform the data to expose redundancy (e.g., BWT in bzip2, delta coding)
- LZ-style matching – find repeated patterns (LZ77 sliding window in DEFLATE/Zstandard)
- Entropy coding – encode the LZ output (lengths, distances) with Huffman or arithmetic coding
This layered approach achieves compression ratios far better than any single method. The key insight: LZ captures structural redundancy (repeated patterns), while entropy coding captures statistical redundancy (non-uniform symbol frequencies). Together they approach the true entropy rate of the source.
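As a quick illustration, Python's built-in zlib module exposes DEFLATE (LZ77 matching followed by Huffman coding); the toy comparison below shows how it treats structured versus structureless input:

```python
import os
import zlib

structured = b"the quick brown fox " * 500   # 10,000 bytes of repeats: LZ77 shines
random_bytes = os.urandom(10_000)            # no structure: nothing to match

print(len(zlib.compress(structured)))    # far smaller than 10,000 bytes
print(len(zlib.compress(random_bytes)))  # roughly 10,000 bytes (incompressible)
```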
Common Mistake: LZ Convergence Is Slow
Mistake:
Expecting LZ compression to achieve near-entropy rates on short sequences (e.g., a few hundred symbols).
Correction:
LZ convergence to the entropy rate is notoriously slow – the rate of convergence is $O(\log\log n / \log n)$, much slower than Huffman or arithmetic coding with a known distribution. For short sequences, LZ may actually expand the data (the dictionary overhead exceeds the savings). In practice, this is mitigated by combining LZ with entropy coding and by tuning window sizes and dictionary resets.
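A small demonstration of the expansion effect, using DEFLATE via Python's zlib as a stand-in (any LZ-based compressor behaves similarly on short incompressible input):

```python
import os
import zlib

short = os.urandom(64)        # 64 incompressible random bytes
out = zlib.compress(short)
print(len(short), len(out))   # typically 64 vs ~75+: header and block
                              # overhead exceed any savings
```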
Universal source code
A compression scheme that achieves the entropy rate of any source in a specified class (e.g., all stationary ergodic sources) without knowing the source distribution. Lempel-Ziv is the canonical example.
Key Takeaway
The Lempel-Ziv algorithm achieves the entropy rate of any stationary ergodic source without knowing the source distribution – this is universality. The parsing-based approach learns source statistics implicitly through the growing dictionary. While convergence is slow ($O(\log\log n / \log n)$), the practical impact is immense: LZ variants form the backbone of gzip, PNG, ZIP, Zstandard, and nearly all file compression in use today.