Source Coding for Sources with Memory
Exploiting Memory in Sources
Real sources (text, speech, images) are never memoryless. The letter 'q' in English is almost always followed by 'u'; a pixel's value is highly correlated with its neighbors. For sources with memory, the fundamental limit is the entropy rate $H(\mathcal{X})$, which is strictly less than the single-symbol entropy $H(X_1)$ when there are dependencies. Exploiting memory means compressing at $H(\mathcal{X})$ bits/symbol instead of $H(X_1)$, and for highly correlated sources, the savings can be dramatic.
Definition: Entropy Rate
The entropy rate of a stationary stochastic process $\{X_n\}$ is $H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots, X_n)$ when the limit exists. Equivalently, for stationary processes, $H(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \dots, X_1)$: the conditional entropy of the present given the entire past.
The entropy rate measures the "new information" per symbol: the irreducible uncertainty that remains after exploiting all dependencies. For an i.i.d. source, $H(\mathcal{X}) = H(X_1)$. For sources with strong dependencies, $H(\mathcal{X}) \ll H(X_1)$.
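For a stationary first-order Markov chain the chain rule gives $H(X_1, \dots, X_n) = H(X_1) + (n-1)\,H(X_2 \mid X_1)$, so the convergence of the per-symbol entropy can be watched in closed form. A minimal sketch, assuming an illustrative binary chain with $P(1 \mid 0) = 0.1$ and $P(0 \mid 1) = 0.3$:

```python
import math

def h_b(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p01, p10 = 0.1, 0.3              # illustrative transition probabilities
pi1 = p01 / (p01 + p10)          # stationary P(X=1), from pi0*p01 = pi1*p10
H1 = h_b(pi1)                    # marginal entropy H(X_1)
H_rate = (1 - pi1) * h_b(p01) + pi1 * h_b(p10)   # H(X_2|X_1) = entropy rate

# For a stationary first-order chain, H(X_1..X_n) = H(X_1) + (n-1) H(X_2|X_1),
# so the per-symbol entropy decreases monotonically toward the entropy rate.
for n in (1, 2, 5, 10, 100, 1000):
    print(f"n={n:5d}  H(X^n)/n = {(H1 + (n - 1) * H_rate) / n:.4f}")
print(f"entropy rate       = {H_rate:.4f}")
```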
Theorem: Existence of the Entropy Rate
For any stationary stochastic process over a finite alphabet:
- The limit $H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n)$ exists.
- $H(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \dots, X_1)$.
- $H(X_n \mid X_{n-1}, \dots, X_1)$ is non-increasing in $n$.
Non-increasing sequence
Because conditioning reduces entropy, $H(X_{n+1} \mid X_n, \dots, X_1) \le H(X_{n+1} \mid X_n, \dots, X_2) = H(X_n \mid X_{n-1}, \dots, X_1)$, where the equality uses stationarity (shifting all indices by one). Since the sequence is non-increasing and bounded below by 0, the limit exists.
Cesàro mean equals the limit
By the chain rule, $\frac{1}{n} H(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \dots, X_1)$. This is a Cesàro average of a convergent sequence, so it converges to the same limit.
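Both ingredients can be checked numerically. The sketch below, using an assumed illustrative second-order binary source (the table `p1` is hypothetical), verifies that conditioning on more of the past can only lower the entropy: $H(X_n) \ge H(X_n \mid X_{n-1}) \ge H(X_n \mid X_{n-1}, X_{n-2})$.

```python
import numpy as np

# Hypothetical 2nd-order binary source (illustrative parameters):
# p1[(a, b)] = P(X_n = 1 | X_{n-1} = a, X_{n-2} = b)
p1 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}

states = [(0, 0), (0, 1), (1, 0), (1, 1)]     # (x_{n-1}, x_{n-2})
idx = {s: i for i, s in enumerate(states)}

# The pair process (X_{n-1}, X_{n-2}) is a first-order chain; build its matrix.
T = np.zeros((4, 4))
for (a, b) in states:
    T[idx[(a, b)], idx[(1, a)]] = p1[(a, b)]
    T[idx[(a, b)], idx[(0, a)]] = 1 - p1[(a, b)]

pi = np.full(4, 0.25)                         # stationary distribution over pairs
for _ in range(1000):                         # power iteration
    pi = pi @ T

def h(ps):
    """Entropy in bits of a probability vector."""
    ps = np.asarray(ps, dtype=float)
    ps = ps[ps > 0]
    return float(-(ps * np.log2(ps)).sum())

# H(X_n | X_{n-1}, X_{n-2}): average the per-context entropies.
H2 = sum(pi[idx[s]] * h([p1[s], 1 - p1[s]]) for s in states)

# Marginalize out X_{n-2} to get the joint of (X_n, X_{n-1}).
joint = np.zeros((2, 2))                      # joint[x_n, x_{n-1}]
for (a, b) in states:
    joint[1, a] += pi[idx[(a, b)]] * p1[(a, b)]
    joint[0, a] += pi[idx[(a, b)]] * (1 - p1[(a, b)])
H1 = sum(joint[:, a].sum() * h(joint[:, a] / joint[:, a].sum()) for a in (0, 1))
H0 = h(joint.sum(axis=1))                     # marginal entropy H(X_n)

print(f"H(X_n) = {H0:.4f} >= H(X_n|X_n-1) = {H1:.4f} >= H(X_n|past 2) = {H2:.4f}")
```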
Definition: Markov Source
A $k$-th order Markov source is a stationary process satisfying $P(x_n \mid x_{n-1}, x_{n-2}, \dots) = P(x_n \mid x_{n-1}, \dots, x_{n-k})$ for all $n$. The first-order Markov chain has transition matrix $P_{ij} = P(X_n = j \mid X_{n-1} = i)$ and stationary distribution $\pi$ satisfying $\pi P = \pi$.
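The condition $\pi P = \pi$ says that $\pi$ is a left eigenvector of $P$ with eigenvalue 1. A brief sketch, using a hypothetical 3-state transition matrix:

```python
import numpy as np

# Hypothetical 3-state transition matrix; rows sum to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

# pi P = pi  <=>  pi is a left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()                     # normalize to a probability vector
print("pi        =", pi)
print("pi P - pi =", pi @ P - pi)  # ~0, verifying stationarity
```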
Markov sources are the simplest models of dependent sources. Despite their simplicity, they capture the essential phenomenon: memory reduces entropy rate below the marginal entropy.
Theorem: Entropy Rate of a Markov Chain
For an irreducible, aperiodic, first-order Markov chain with transition matrix $P$ and stationary distribution $\pi$:
$$H(\mathcal{X}) = -\sum_{i} \pi_i \sum_{j} P_{ij} \log P_{ij}$$
The entropy rate is the weighted average of the conditional entropies from each state, weighted by the stationary probability of being in that state. If some transitions are near-deterministic ($P_{ij} \approx 0$ or $P_{ij} \approx 1$), the entropy rate is much lower than the marginal entropy: the chain is predictable.
Direct from definition
For a stationary first-order Markov chain:
$$H(\mathcal{X}) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \dots, X_1) = \lim_{n \to \infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1) = -\sum_i \pi_i \sum_j P_{ij} \log P_{ij}$$
The second equality uses the Markov property; the third uses stationarity.
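The formula translates directly into code. A minimal sketch, assuming an irreducible, aperiodic chain (the helper `entropy_rate` is illustrative, not from the text):

```python
import numpy as np

def entropy_rate(P):
    """H(X) = -sum_i pi_i sum_j P_ij log2 P_ij for an irreducible, aperiodic chain."""
    P = np.asarray(P, dtype=float)
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])   # left eigenvector, eigenvalue 1
    pi /= pi.sum()
    plogp = P * np.log2(np.where(P > 0, P, 1.0))      # 0 log 0 = 0 by convention
    return float(-(pi @ plogp.sum(axis=1)))

# Binary chain from the example below: P(1|0) = 0.1, P(0|1) = 0.3.
print(entropy_rate([[0.9, 0.1], [0.3, 0.7]]))         # ~0.572 bits/symbol
```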
Example: Entropy Rate of a Binary Markov Chain
A binary Markov chain has transition probabilities $P(1 \mid 0) = 0.1$, $P(0 \mid 0) = 0.9$, $P(0 \mid 1) = 0.3$, $P(1 \mid 1) = 0.7$. Compute the stationary distribution, the marginal entropy $H(X_1)$, and the entropy rate $H(\mathcal{X})$.
Stationary distribution
Balance requires $\pi_0 P(1 \mid 0) = \pi_1 P(0 \mid 1)$ and $\pi_0 + \pi_1 = 1$. So $\pi_0 = 3/4$ and $\pi_1 = 1/4$.
Marginal entropy
$H(X_1) = H_b(1/4) \approx 0.811$ bits/symbol.
Entropy rate
$H(\mathcal{X}) = \pi_0 H_b(0.1) + \pi_1 H_b(0.3) \approx 0.75 \cdot 0.469 + 0.25 \cdot 0.881 \approx 0.572$ bits/symbol.
The memory saves $\approx 0.24$ bits/symbol. A memoryless code wastes almost 30% of its bits!
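A quick check of the arithmetic (a minimal script, using the example's values):

```python
import math

h_b = lambda p: -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

marginal = h_b(0.25)                       # H(X_1)  ~ 0.811
rate = 0.75 * h_b(0.1) + 0.25 * h_b(0.3)   # H(X)    ~ 0.572
print(marginal, rate)
print(f"saved: {marginal - rate:.3f} bits/symbol ({1 - rate/marginal:.1%})")
```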
Entropy Rate of a Binary Markov Chain (interactive demo)
Explore how the entropy rate $H(\mathcal{X})$ depends on the transition probabilities. Compare with the marginal entropy $H(X_1)$ and observe the savings from exploiting memory.
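In place of the interactive demo, a small parameter sweep (holding $P(0 \mid 1) = 0.3$ fixed is an illustrative choice) shows how the gap between $H(X_1)$ and $H(\mathcal{X})$ grows as transitions become more predictable:

```python
import math

h_b = lambda p: 0.0 if p in (0, 1) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

q = 0.3                                      # P(0|1), held fixed
for p in (0.01, 0.05, 0.1, 0.2, 0.3, 0.5):   # P(1|0), swept
    pi1 = p / (p + q)                        # stationary P(X=1)
    marginal = h_b(pi1)
    rate = (1 - pi1) * h_b(p) + pi1 * h_b(q)
    print(f"P(1|0)={p:.2f}  H(X1)={marginal:.3f}  rate={rate:.3f}  "
          f"savings={1 - rate/marginal:.1%}")
```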
Context Tree Weighting
For higher-order Markov sources, the context tree weighting (CTW) algorithm (Willems, Shtarkov, Tjalkens, 1995) provides a universal code that adapts to the correct Markov order without knowing it in advance. CTW maintains a weighted combination of models at all orders up to a maximum depth $D$, achieving a redundancy of $O((\log n)/n)$ bits/symbol, much faster convergence than LZ. CTW combined with arithmetic coding is one of the most effective known compression algorithms for text-like sources. The PAQ family of compressors, which holds many compression records, uses ideas descended from CTW.
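The core of CTW fits in a short program. The sketch below is an illustrative binary CTW probability estimator (my own minimal version, not the authors' reference code): each context-tree node runs a Krichevsky-Trofimov (KT) estimator, and each internal node mixes its own estimate with the product of its children's, $P_w = \tfrac{1}{2} P_e + \tfrac{1}{2} \prod_s P_w^{(s)}$.

```python
import math, random

def log2_add(a, b):
    """Numerically stable log2(2^a + 2^b)."""
    if a < b:
        a, b = b, a
    return a + math.log2(1.0 + 2.0 ** (b - a))

class Node:
    def __init__(self):
        self.counts = [0, 0]      # zeros/ones seen in this context
        self.log_pe = 0.0         # log2 of the KT estimate
        self.log_pw = 0.0         # log2 of the weighted (mixed) probability
        self.kids = [None, None]  # child per context bit

def ctw_log_prob(bits, depth):
    """log2 CTW probability of `bits` (context padded with 0s at the start)."""
    root = Node()
    for n, bit in enumerate(bits):
        # context: the `depth` preceding bits, most recent first
        ctx = [bits[n - 1 - d] if n - 1 - d >= 0 else 0 for d in range(depth)]
        path = [root]
        for s in ctx:
            if path[-1].kids[s] is None:
                path[-1].kids[s] = Node()
            path.append(path[-1].kids[s])
        for d in range(depth, -1, -1):        # update leaf -> root
            node = path[d]
            a, b = node.counts
            # sequential KT update: P(bit | counts) = (count_bit + 1/2)/(a + b + 1)
            node.log_pe += math.log2((node.counts[bit] + 0.5) / (a + b + 1.0))
            node.counts[bit] += 1
            if d == depth:
                node.log_pw = node.log_pe     # leaf: KT only
            else:                             # internal: mix KT with children
                kids_lp = sum(k.log_pw for k in node.kids if k is not None)
                node.log_pw = log2_add(node.log_pe, kids_lp) - 1.0  # two 1/2 weights
    return root.log_pw

# Sample from a first-order chain (P(1|0)=0.1, P(0|1)=0.3, entropy rate ~0.57):
random.seed(1)
bits, x = [], 0
for _ in range(5000):
    x = 1 if random.random() < (0.1 if x == 0 else 0.7) else 0
    bits.append(x)
# Code length per symbol approaches the entropy rate, well below H(X_1) ~ 0.81.
print(-ctw_log_prob(bits, depth=4) / len(bits), "bits/symbol")
```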
Quick Check
For a first-order binary Markov chain with $P(0 \mid 0) = P(1 \mid 1) = 1$ (deterministic), the entropy rate is:
1 bit
0.5 bits
0 bits
Undefined
If both transitions are deterministic, the chain either stays at 0 forever or stays at 1 forever. $H(\mathcal{X}) = 0$: the entire sequence is determined by the initial state, requiring zero bits per symbol (after the first symbol).
Entropy rate
The asymptotic per-symbol entropy of a stationary process. The fundamental compression limit for sources with memory.
Key Takeaway
For sources with memory, the entropy rate $H(\mathcal{X})$ is the fundamental compression limit, strictly below the marginal entropy $H(X_1)$ whenever there are dependencies. For Markov sources, $H(\mathcal{X}) = -\sum_i \pi_i \sum_j P_{ij} \log P_{ij}$, a weighted average of conditional entropies. Exploiting memory requires either block coding, adaptive arithmetic coding, or universal methods like Lempel-Ziv and context tree weighting.