Universal Source Coding: Lempel-Ziv

Compression Without Knowing the Source

Shannon codes and Huffman codes require knowledge of the source distribution $P$. Arithmetic coding can adapt to $P$ on the fly, but still needs a model. Is it possible to achieve entropy-rate compression for any stationary ergodic source, without knowing the source statistics at all? The Lempel-Ziv algorithm (1977–1978) gives a resounding yes. LZ parses the source into phrases, each extending a previously seen phrase by one symbol, and this simple scheme achieves the entropy rate universally. It is the engine behind gzip, PNG, GIF, and the entire compress/deflate family of algorithms.

Definition:

Universal Source Code

A source code is universal for a class $\mathcal{C}$ of sources if, for every source $\{X_i\} \in \mathcal{C}$, the per-symbol code length converges to the entropy rate:
$$\frac{1}{n}\ell(\mathbf{X}^n) \to H_\infty \quad \text{in probability (or a.s.)}$$
where $H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$ is the entropy rate of the source.

Universality is a remarkably strong property: a single code, with no knowledge of the source, matches the performance of the best code designed for that specific source.

Lempel-Ziv 78 (LZ78) Algorithm

Complexity: $O(n)$ time with a trie data structure.
Input: Source sequence $x_1, x_2, \ldots$
Output: Parsed phrases, each encoded as (pointer, new symbol)
Initialize: Dictionary $\mathcal{D} = \{\emptyset\}$ (empty string), phrase counter $c = 0$
Set current string $w = \emptyset$
For each source symbol $x_i$:
1. Append: $w \leftarrow w \cdot x_i$
2. If $w \notin \mathcal{D}$:
a. $c \leftarrow c + 1$
b. Output phrase: (index of $w$ with its last symbol removed, $x_i$)
(that is, a pointer to the longest matching prefix, plus the new symbol)
c. Add $w$ to $\mathcal{D}$ as phrase number $c$
d. Reset $w \leftarrow \emptyset$
3. Else continue to the next symbol
Encoding: Each phrase is encoded using $\lceil \log c \rceil$ bits for the pointer
and $\lceil \log |\mathcal{X}| \rceil$ bits for the new symbol.

LZ78 parses the sequence into phrases where each new phrase is the shortest string not previously seen. The parsed phrases form a tree (trie) that grows as the source is processed. LZ77 (the sliding-window variant) is more commonly used in practice (gzip uses it).
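Below is a minimal Python sketch of the parser just described, assuming a binary alphabet (so each new symbol costs one bit); the function name lz78_encode and the tuple-based dictionary are illustrative choices made here, not a reference implementation.

```python
import math

def lz78_encode(symbols):
    """Parse `symbols` with LZ78 and return (phrase list, total code length in bits).

    Each emitted phrase is a pair (pointer, new_symbol): the pointer is the
    dictionary index of the longest previously seen prefix (0 = empty string).
    Phrase number c is charged ceil(log2 c) pointer bits plus 1 bit for the new
    symbol (binary alphabet); a trailing incomplete phrase is simply ignored.
    """
    dictionary = {(): 0}          # phrase (tuple of symbols) -> dictionary index
    phrases = []                  # emitted (pointer, new_symbol) pairs
    bits = 0                      # total code length in bits
    w = ()                        # current string, grown one symbol at a time
    for x in symbols:
        if w + (x,) in dictionary:
            w = w + (x,)                          # still a known phrase; keep extending
        else:
            c = len(phrases) + 1                  # this is phrase number c
            phrases.append((dictionary[w], x))    # pointer to prefix + new symbol
            dictionary[w + (x,)] = c
            bits += math.ceil(math.log2(c)) + 1   # pointer bits + 1 new-symbol bit
            w = ()
    return phrases, bits

phrases, bits = lz78_encode("010010100101")
print(phrases, bits)
```

A production encoder would store the dictionary as an explicit trie, which is what makes the overall parse run in $O(n)$ time.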

Theorem: Optimality of Lempel-Ziv (LZ78)

For any stationary ergodic source $\{X_i\}$ with entropy rate $H_\infty$, the Lempel-Ziv (LZ78) algorithm achieves
$$\frac{1}{n}\ell_{LZ}(\mathbf{X}^n) \to H_\infty \quad \text{almost surely}$$
as $n \to \infty$.

The LZ algorithm learns the source statistics implicitly through the dictionary. Initially, the phrases are short and the code is inefficient. As the dictionary grows, phrases become longer, and the code rate improves. The key insight is that the number of distinct phrases $c(n)$ obtained from parsing $n$ symbols satisfies $c(n) \log c(n) \leq n(H_\infty + o(1))$, which gives the right compression rate. Intuitively, the parsing tree converges to the optimal suffix tree of the source.
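In outline, here is why this bound suffices: each of the $c(n)$ phrases is encoded with at most $\lceil \log c(n) \rceil + \lceil \log |\mathcal{X}| \rceil$ bits, so
$$\frac{1}{n}\ell_{LZ}(\mathbf{X}^n) \;\le\; \frac{c(n)\log c(n)}{n} + \frac{c(n)\,(\log|\mathcal{X}| + 2)}{n} \;\le\; H_\infty + o(1),$$
where the second term vanishes because the number of phrases satisfies $c(n) = O(n/\log n)$ for every sequence (the Lempel-Ziv phrase-counting lemma).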


Example: LZ78 Parsing Example

Parse the binary sequence $\mathbf{x} = 0100101001010100\ldots$ (the first 16 bits) using LZ78 and compute the encoded length.
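One worked solution, following the conventions in the algorithm box above (the trailing incomplete phrase is left pending): the parse is $0 \mid 1 \mid 00 \mid 10 \mid 100 \mid 101 \mid 01$, leaving the final two bits $00$ as an incomplete phrase. The emitted (pointer, symbol) pairs are $(0,0), (0,1), (1,0), (2,0), (4,0), (4,1), (1,1)$. Charging phrase $c$ a cost of $\lceil \log c \rceil + 1$ bits gives $1 + 2 + 3 + 3 + 4 + 4 + 4 = 21$ bits for the 14 source bits covered by complete phrases, so at this blocklength the code actually expands the data, exactly the short-sequence behavior discussed in the Common Mistake below.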

Lempel-Ziv Compression Rate vs. Blocklength

Simulate LZ78 compression on a binary memoryless source and plot the compression rate $\ell_{LZ}/n$ as a function of sequence length $n$. Observe convergence to $H(p)$.

Parameters: $p = 0.3$, $n = 5000$, 5 trials
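A self-contained sketch of this experiment (not the page's interactive demo; the parameter values are the assumed defaults above): it draws a Bernoulli($p$) sequence, computes the LZ78 code length of each prefix with the same bit accounting as in the algorithm box, and prints the per-symbol rate next to $H(p)$.

```python
import math
import random

def lz78_code_length(bits):
    """LZ78 code length in bits: phrase c costs ceil(log2 c) pointer bits + 1 symbol bit.
    The trailing incomplete phrase (if any) is not counted."""
    dictionary, w, c, total = {""}, "", 0, 0
    for b in bits:
        w += b
        if w not in dictionary:
            c += 1
            dictionary.add(w)
            total += math.ceil(math.log2(c)) + 1
            w = ""
    return total

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

random.seed(0)
p, n_max = 0.3, 5000                      # assumed demo defaults
source = "".join("1" if random.random() < p else "0" for _ in range(n_max))
for n in (100, 500, 1000, 2000, 5000):
    rate = lz78_code_length(source[:n]) / n
    print(f"n={n:5d}   LZ78 rate = {rate:.3f}   H(p) = {binary_entropy(p):.3f}")
```

Even at $n = 5000$ the printed rate sits well above $H(0.3) \approx 0.881$, illustrating the slow convergence discussed below.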

Historical Note: Lempel and Ziv: The Universality Revolution

1977–1978

Abraham Lempel and Jacob Ziv published their two foundational papers in 1977 (LZ77, sliding-window) and 1978 (LZ78, dictionary-based). The impact was enormous: for the first time, a practical algorithm achieved the entropy rate of any stationary ergodic source without knowing the source statistics. Before Lempel-Ziv, universal coding existed in theory (minimax codes, two-part codes) but not in practical form. The LZ family spawned an entire ecosystem: LZW (used in GIF, Unix compress), DEFLATE (used in gzip, ZIP, PNG), LZMA (used in 7-Zip, xz), and Zstandard (used in modern Linux kernels and databases). Ziv received the Claude E. Shannon Award in 1997 and the IEEE Medal of Honor in 2021; the Lempel-Ziv-Welch patent (LZW, 1984) was one of the most commercially significant patents in computing history.

🔧 Engineering Note

LZ in Modern Compression Pipelines

Modern compression tools rarely use LZ alone. The typical pipeline is:

  1. Preprocessing: transform the data to expose redundancy (e.g., BWT in bzip2, delta coding)
  2. LZ-style matching: find repeated patterns (LZ77 sliding window in DEFLATE/Zstandard)
  3. Entropy coding: encode the LZ output (lengths, distances) with Huffman or arithmetic coding

This layered approach achieves compression ratios far better than any single method. The key insight: LZ captures structural redundancy (repeated patterns), while entropy coding captures statistical redundancy (non-uniform symbol frequencies). Together they approach the true entropy rate of the source.
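As a small illustration of the layering, the sketch below uses Python's standard zlib module, whose compress function implements DEFLATE (LZ77 matching followed by Huffman coding). The two inputs are made up for this example: one has structural redundancy (long exact repeats), the other mostly statistical redundancy (skewed byte frequencies, few long repeats).

```python
import random
import zlib

# Structural redundancy: long exact repeats that LZ77 matching exploits.
structured = b"abcdefgh" * 1000                    # 8000 bytes

# Statistical redundancy: skewed symbol frequencies, little repeated structure.
random.seed(0)
skewed = bytes(random.choices(b"aab", k=8000))     # ~2/3 'a', ~1/3 'b'

for name, data in (("structured", structured), ("skewed", skewed)):
    out = zlib.compress(data, 9)                   # DEFLATE = LZ77 + Huffman coding
    print(f"{name:10s}: {len(data)} bytes -> {len(out)} bytes")
```

Typically the structured input compresses far more aggressively than the skewed one, reflecting the two kinds of redundancy described above.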

Common Mistake: LZ Convergence Is Slow

Mistake:

Expecting LZ compression to achieve near-entropy rates on short sequences (e.g., a few hundred symbols).

Correction:

LZ convergence to the entropy rate is notoriously slow: the rate of convergence is $O(\log \log n / \log n)$, much slower than Huffman or arithmetic coding with a known distribution. For short sequences, LZ may actually expand the data (the dictionary overhead exceeds the savings). In practice, this is mitigated by combining LZ with entropy coding and by tuning window sizes and dictionary resets.

Universal source code

A compression scheme that achieves the entropy rate of any source in a specified class (e.g., all stationary ergodic sources) without knowing the source distribution. Lempel-Ziv is the canonical example.

Related: Universal Source Code

Lempel-Ziv Parsing in Action

Step-by-step animation of LZ78 parsing a binary sequence, showing dictionary growth, phrase identification, and the encoding of each phrase as (pointer, new symbol). Demonstrates how the dictionary implicitly learns source statistics.

Key Takeaway

The Lempel-Ziv algorithm achieves the entropy rate of any stationary ergodic source without knowing the source distribution: this is universality. The parsing-based approach learns source statistics implicitly through the growing dictionary. While convergence is slow ($O(\log\log n / \log n)$), the practical impact is immense: LZ variants form the backbone of gzip, PNG, ZIP, Zstandard, and nearly all file compression in use today.