Source Coding for Sources with Memory

Exploiting Memory in Sources

Real sources (text, speech, images) are never memoryless. The letter 'q' in English is almost always followed by 'u'; a pixel's value is highly correlated with its neighbors. For sources with memory, the fundamental limit is the entropy rate $H_\infty$, which is strictly less than the single-symbol entropy $H(X)$ when there are dependencies. Exploiting memory means compressing at $H_\infty$ bits/symbol instead of $H(X)$, and for highly correlated sources the savings can be dramatic.

Definition: Entropy Rate

The entropy rate of a stationary stochastic process $\{X_i\}$ is
$$H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
when the limit exists. Equivalently, for stationary processes,
$$H_\infty = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1) = H(X_n | X_{n-1}, X_{n-2}, \ldots),$$
the conditional entropy of the present given the entire past.

The entropy rate measures the "new information" per symbol: the irreducible uncertainty that remains after all dependencies have been exploited. For an i.i.d. source, $H_\infty = H(X)$, since then $H(X_1, \ldots, X_n) = nH(X)$. For sources with strong dependencies, $H_\infty \ll H(X)$.

Theorem: Existence of the Entropy Rate

For any stationary stochastic process $\{X_i\}$ over a finite alphabet:

  1. The limit $H_\infty = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1)$ exists.
  2. $H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$.
  3. $H(X_n | X_{n-1}, \ldots, X_1)$ is non-increasing in $n$.
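As a concrete check of properties 2 and 3, the per-symbol block entropies of a small binary Markov chain can be computed by brute force and watched converge from above. A minimal Python sketch (the symmetric stay-with-probability-0.9 chain is chosen purely for illustration and is not from the text):

```python
import itertools
import math

def block_entropy(n, P, pi):
    """H(X_1, ..., X_n) in bits for a stationary binary Markov chain,
    summing -p log2 p over all 2^n length-n sequences."""
    H = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        H -= p * math.log2(p)
    return H

# Symmetric binary chain: stays in the same state with probability 0.9.
P = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]                      # stationary by symmetry
H_rate = -sum(pi[i] * P[i][j] * math.log2(P[i][j])
              for i in (0, 1) for j in (0, 1))

for n in (1, 2, 4, 8, 12):
    print(f"H(X_1..X_{n})/{n} = {block_entropy(n, P, pi) / n:.4f}")
print(f"entropy rate H_inf   = {H_rate:.4f}")   # ~0.469 bits/symbol
```

The printed per-symbol entropies decrease monotonically (1.000, 0.735, 0.602, ...) toward the entropy rate, as the theorem asserts.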

Definition: Markov Source

A $k$-th order Markov source is a stationary process $\{X_i\}$ satisfying $P(X_n | X_{n-1}, \ldots, X_1) = P(X_n | X_{n-1}, \ldots, X_{n-k})$ for all $n > k$. The first-order Markov chain has transition matrix $P_{ij} = P(X_n = j | X_{n-1} = i)$ and stationary distribution $\pi$ satisfying $\pi = \pi \mathbf{P}$.

Markov sources are the simplest models of dependent sources. Despite their simplicity, they capture the essential phenomenon: memory reduces entropy rate below the marginal entropy.
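To make the stationary condition concrete, the following sketch solves $\pi = \pi \mathbf{P}$ numerically; the 3-state matrix is a made-up example, not taken from the text:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1), chosen only for illustration.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# pi = pi P means pi is a left eigenvector of P with eigenvalue 1,
# i.e. an eigenvector of P^T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()                 # normalize to a probability vector

print(pi)          # stationary distribution
print(pi @ P)      # should reproduce pi: it is invariant under one step of the chain
```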

Theorem: Entropy Rate of a Markov Chain

For an irreducible, aperiodic, first-order Markov chain with transition matrix $\mathbf{P}$ and stationary distribution $\pi$:
$$H_\infty = \sum_{i \in \mathcal{X}} \pi_i H(X_n | X_{n-1} = i) = -\sum_{i,j} \pi_i P_{ij} \log P_{ij}.$$

The entropy rate is the weighted average of the conditional entropies from each state, weighted by the stationary probability of being in that state. If some transitions are near-deterministic ($P_{ij} \approx 0$ or $1$), the entropy rate is much lower than the marginal entropy: the chain is predictable.
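The formula translates directly into a few lines of code. A minimal sketch (the function names and the use of power iteration to find $\pi$ are my own choices; it assumes the chain is irreducible and aperiodic so that the iteration converges):

```python
import numpy as np

def entropy_rate(P, iters=1000):
    """H_inf = -sum_{i,j} pi_i P_ij log2 P_ij for an irreducible, aperiodic chain."""
    P = np.asarray(P, dtype=float)
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(iters):                 # power iteration converges to pi = pi P
        pi = pi @ P
    # Treat 0 * log 0 as 0 for impossible transitions.
    logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return float(-np.sum(pi[:, None] * P * logP))

def marginal_entropy(P, iters=1000):
    """H(X) of the stationary marginal, for comparison with the entropy rate."""
    P = np.asarray(P, dtype=float)
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(iters):
        pi = pi @ P
    return float(-np.sum(pi * np.log2(pi)))

# For the symmetric chain sketched earlier, entropy_rate([[0.9, 0.1], [0.1, 0.9]])
# gives about 0.469 bits/symbol while marginal_entropy gives 1.0 bit/symbol.
```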

Example: Entropy Rate of a Binary Markov Chain

A binary Markov chain has transition probabilities $P(0|0) = 0.9$, $P(1|0) = 0.1$, $P(0|1) = 0.3$, $P(1|1) = 0.7$. Compute the stationary distribution, the marginal entropy $H(X)$, and the entropy rate $H_\infty$.
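One way to work it out, writing $h(\cdot)$ for the binary entropy function: for a two-state chain the stationary condition $\pi = \pi \mathbf{P}$ reduces to balancing the probability flow between the two states,
$$\pi_0 P(1|0) = \pi_1 P(0|1) \;\Longrightarrow\; 0.1\,\pi_0 = 0.3\,\pi_1 \;\Longrightarrow\; \pi = (0.75,\ 0.25).$$
The marginal entropy is $H(X) = h(0.25) \approx 0.811$ bits, while the entropy rate is
$$H_\infty = \pi_0\, h(0.1) + \pi_1\, h(0.3) \approx 0.75 \times 0.469 + 0.25 \times 0.881 \approx 0.572 \text{ bits}.$$
Exploiting the memory thus saves roughly 30% relative to coding each symbol with its marginal distribution, and over 40% relative to sending the bits uncoded.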

Entropy Rate of a Binary Markov Chain

Explore how the entropy rate depends on the transition probabilities. Compare $H_\infty$ with the marginal entropy $H(X)$ and observe the savings from exploiting memory.


Context Tree Weighting

For higher-order Markov sources, the context tree weighting (CTW) algorithm (Willems, Shtarkov, Tjalkens, 1995) provides a universal code that adapts to the correct Markov order without knowing it in advance. CTW maintains a weighted combination of models at all orders up to a maximum depth $D$, achieving a redundancy of $O(|\mathcal{X}|^D \log n / n)$, much faster convergence than LZ. CTW combined with arithmetic coding is one of the most effective known compression algorithms for text-like sources. The PAQ family of compressors, which holds many compression records, uses ideas descended from CTW.
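As a rough illustration of how the weighting works, here is a minimal log-domain CTW sketch for a binary alphabet. The class and function names are my own, and for simplicity the first few symbols just use whatever shorter context is available (textbook CTW instead treats the initial $D$ bits as known side information):

```python
import math
import random

def log2_add(a, b):
    """log2(2**a + 2**b), computed without overflow."""
    if a < b:
        a, b = b, a
    return a + math.log2(1.0 + 2.0 ** (b - a))

class Node:
    """One context node: KT counts plus log-probabilities."""
    __slots__ = ("counts", "log_pe", "log_pw", "children")
    def __init__(self):
        self.counts = [0, 0]   # zeros and ones observed in this context
        self.log_pe = 0.0      # log2 of the node's KT (memoryless) estimate
        self.log_pw = 0.0      # log2 of the node's weighted probability
        self.children = {}     # keyed by the next-older context bit

def ctw_update(node, bit, ctx):
    """Update one context path for one symbol; ctx lists past bits, most recent first."""
    a, b = node.counts
    node.log_pe += math.log2((node.counts[bit] + 0.5) / (a + b + 1.0))  # KT update
    node.counts[bit] += 1
    if not ctx:
        node.log_pw = node.log_pe               # leaf: no deeper context to mix in
    else:
        child = node.children.setdefault(ctx[0], Node())
        ctw_update(child, bit, ctx[1:])
        log_kids = sum(c.log_pw for c in node.children.values())
        # P_w = 1/2 * P_e + 1/2 * (product of children's P_w); unseen children count as 1.
        node.log_pw = log2_add(node.log_pe - 1.0, log_kids - 1.0)

def ctw_code_length(bits, depth=4):
    """Ideal code length in bits that CTW + arithmetic coding would spend on `bits`."""
    root = Node()
    for t, x in enumerate(bits):
        ctx = bits[max(0, t - depth):t][::-1]   # up to `depth` most recent past bits
        ctw_update(root, x, ctx)
    return -root.log_pw

# A strongly first-order-Markov bit stream compresses well below 1 bit/symbol.
random.seed(0)
bits, x = [], 0
for _ in range(2000):
    x = x if random.random() < 0.9 else 1 - x   # stay with prob 0.9 (entropy rate ~0.47)
    bits.append(x)
print(ctw_code_length(bits, depth=4) / len(bits))  # approaches ~0.47 bits/symbol for long streams
```

Each node mixes its own memoryless (KT) estimate with the product of its children's weighted probabilities; this recursive mixture over all tree depths is what lets the code adapt to the unknown Markov order.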

Quick Check

For a first-order binary Markov chain with $P(0|0) = P(1|1) = 1$ (deterministic), the entropy rate is:

1 bit

0.5 bits

0 bits

Undefined

Entropy rate

The asymptotic per-symbol entropy $H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$ of a stationary process. The fundamental compression limit for sources with memory.

Related: Entropy Rate

Key Takeaway

For sources with memory, the entropy rate $H_\infty$ is the fundamental compression limit, strictly below the marginal entropy $H(X)$ whenever dependencies exist. For Markov sources, $H_\infty = \sum_i \pi_i H(\text{row}_i(\mathbf{P}))$, a weighted average of conditional entropies. Exploiting memory requires either block coding, adaptive arithmetic coding, or universal methods like Lempel-Ziv and context tree weighting.