Source Coding for Sources with Memory

Exploiting Memory in Sources

Real sources (text, speech, images) are never memoryless. The letter 'q' in English is almost always followed by 'u'; a pixel's value is highly correlated with its neighbors. For sources with memory, the fundamental limit is the entropy rate $H_\infty$, which is strictly less than the single-symbol entropy $H(X)$ when there are dependencies. Exploiting memory means compressing at $H_\infty$ bits/symbol instead of $H(X)$, and for highly correlated sources the savings can be dramatic.

Definition: Entropy Rate

The entropy rate of a stationary stochastic process $\{X_i\}$ is
$$H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
when the limit exists. Equivalently, for stationary processes,
$$H_\infty = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1) = H(X_n | X_{n-1}, X_{n-2}, \ldots),$$
the conditional entropy of the present given the entire past.

The entropy rate measures the "new information" per symbol: the irreducible uncertainty that remains after all dependencies have been exploited. For an i.i.d. source, $H_\infty = H(X)$, since then $H(X_1, \ldots, X_n) = nH(X)$. For sources with strong dependencies, $H_\infty \ll H(X)$.

Theorem: Existence of the Entropy Rate

For any stationary stochastic process $\{X_i\}$ over a finite alphabet:

  1. The limit $H_\infty = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1)$ exists.
  2. $H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$.
  3. $H(X_n | X_{n-1}, \ldots, X_1)$ is non-increasing in $n$.
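As a concrete check of properties 2 and 3, the per-symbol block entropies of a small binary Markov chain can be computed by brute force and watched converge from above. A minimal Python sketch (the symmetric stay-with-probability-0.9 chain is chosen purely for illustration and is not from the text):

```python
import itertools
import math

def block_entropy(n, P, pi):
    """H(X_1, ..., X_n) in bits for a stationary binary Markov chain,
    summing -p log2 p over all 2^n length-n sequences."""
    H = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        H -= p * math.log2(p)
    return H

# Symmetric binary chain: stays in the same state with probability 0.9.
P = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]                      # stationary by symmetry
H_rate = -sum(pi[i] * P[i][j] * math.log2(P[i][j])
              for i in (0, 1) for j in (0, 1))

for n in (1, 2, 4, 8, 12):
    print(f"H(X_1..X_{n})/{n} = {block_entropy(n, P, pi) / n:.4f}")
print(f"entropy rate H_inf   = {H_rate:.4f}")   # ~0.469 bits/symbol
```

The printed per-symbol entropies decrease monotonically (1.000, 0.735, 0.602, ...) toward the entropy rate, as the theorem asserts.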

Definition: Markov Source

A $k$-th order Markov source is a stationary process $\{X_i\}$ satisfying $P(X_n | X_{n-1}, \ldots, X_1) = P(X_n | X_{n-1}, \ldots, X_{n-k})$ for all $n > k$. The first-order Markov chain has transition matrix $P_{ij} = P(X_n = j | X_{n-1} = i)$ and stationary distribution $\pi$ satisfying $\pi = \pi \mathbf{P}$.

Markov sources are the simplest models of dependent sources. Despite their simplicity, they capture the essential phenomenon: memory reduces entropy rate below the marginal entropy.
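To make the stationary condition concrete, the following sketch solves $\pi = \pi \mathbf{P}$ numerically; the 3-state matrix is a made-up example, not taken from the text:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1), chosen only for illustration.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# pi = pi P means pi is a left eigenvector of P with eigenvalue 1,
# i.e. an eigenvector of P^T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()                 # normalize to a probability vector

print(pi)          # stationary distribution
print(pi @ P)      # should reproduce pi: it is invariant under one step of the chain
```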

Theorem: Entropy Rate of a Markov Chain

For an irreducible, aperiodic, first-order Markov chain with transition matrix $\mathbf{P}$ and stationary distribution $\pi$:
$$H_\infty = \sum_{i \in \mathcal{X}} \pi_i H(X_n | X_{n-1} = i) = -\sum_{i,j} \pi_i P_{ij} \log P_{ij}.$$

The entropy rate is the weighted average of the conditional entropies from each state, weighted by the stationary probability of being in that state. If some transitions are near-deterministic ($P_{ij} \approx 0$ or $1$), the entropy rate is much lower than the marginal entropy: the chain is predictable.
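The formula translates directly into a few lines of code. A minimal sketch (the function names and the use of power iteration to find $\pi$ are my own choices; it assumes the chain is irreducible and aperiodic so that the iteration converges):

```python
import numpy as np

def entropy_rate(P, iters=1000):
    """H_inf = -sum_{i,j} pi_i P_ij log2 P_ij for an irreducible, aperiodic chain."""
    P = np.asarray(P, dtype=float)
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(iters):                 # power iteration converges to pi = pi P
        pi = pi @ P
    # Treat 0 * log 0 as 0 for impossible transitions.
    logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return float(-np.sum(pi[:, None] * P * logP))

def marginal_entropy(P, iters=1000):
    """H(X) of the stationary marginal, for comparison with the entropy rate."""
    P = np.asarray(P, dtype=float)
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(iters):
        pi = pi @ P
    return float(-np.sum(pi * np.log2(pi)))

# For the symmetric chain sketched earlier, entropy_rate([[0.9, 0.1], [0.1, 0.9]])
# gives about 0.469 bits/symbol while marginal_entropy gives 1.0 bit/symbol.
```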

Example: Entropy Rate of a Binary Markov Chain

A binary Markov chain has transition probabilities $P(0|0) = 0.9$, $P(1|0) = 0.1$, $P(0|1) = 0.3$, $P(1|1) = 0.7$. Compute the stationary distribution, the marginal entropy $H(X)$, and the entropy rate $H_\infty$.
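One way to work it out, writing $h(\cdot)$ for the binary entropy function: for a two-state chain the stationary condition $\pi = \pi \mathbf{P}$ reduces to balancing the probability flow between the two states,
$$\pi_0 P(1|0) = \pi_1 P(0|1) \;\Longrightarrow\; 0.1\,\pi_0 = 0.3\,\pi_1 \;\Longrightarrow\; \pi = (0.75,\ 0.25).$$
The marginal entropy is $H(X) = h(0.25) \approx 0.811$ bits, while the entropy rate is
$$H_\infty = \pi_0\, h(0.1) + \pi_1\, h(0.3) \approx 0.75 \times 0.469 + 0.25 \times 0.881 \approx 0.572 \text{ bits}.$$
Exploiting the memory thus saves roughly 30% relative to coding each symbol with its marginal distribution, and over 40% relative to sending the bits uncoded.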

Entropy Rate of a Binary Markov Chain

Explore how the entropy rate depends on the transition probabilities. Compare $H_\infty$ with the marginal entropy $H(X)$ and observe the savings from exploiting memory.


Context Tree Weighting

For higher-order Markov sources, the context tree weighting (CTW) algorithm (Willems, Shtarkov, Tjalkens, 1995) provides a universal code that adapts to the correct Markov order without knowing it in advance. CTW maintains a weighted combination of models at all orders up to a maximum depth $D$, achieving a redundancy of $O(|\mathcal{X}|^D \log n / n)$, much faster convergence than LZ. CTW combined with arithmetic coding is one of the most effective known compression algorithms for text-like sources. The PAQ family of compressors, which holds many compression records, uses ideas descended from CTW.
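As a rough illustration of how the weighting works, here is a minimal log-domain CTW sketch for a binary alphabet. The class and function names are my own, and for simplicity the first few symbols just use whatever shorter context is available (textbook CTW instead treats the initial $D$ bits as known side information):

```python
import math
import random

def log2_add(a, b):
    """log2(2**a + 2**b), computed without overflow."""
    if a < b:
        a, b = b, a
    return a + math.log2(1.0 + 2.0 ** (b - a))

class Node:
    """One context node: KT counts plus log-probabilities."""
    __slots__ = ("counts", "log_pe", "log_pw", "children")
    def __init__(self):
        self.counts = [0, 0]   # zeros and ones observed in this context
        self.log_pe = 0.0      # log2 of the node's KT (memoryless) estimate
        self.log_pw = 0.0      # log2 of the node's weighted probability
        self.children = {}     # keyed by the next-older context bit

def ctw_update(node, bit, ctx):
    """Update one context path for one symbol; ctx lists past bits, most recent first."""
    a, b = node.counts
    node.log_pe += math.log2((node.counts[bit] + 0.5) / (a + b + 1.0))  # KT update
    node.counts[bit] += 1
    if not ctx:
        node.log_pw = node.log_pe               # leaf: no deeper context to mix in
    else:
        child = node.children.setdefault(ctx[0], Node())
        ctw_update(child, bit, ctx[1:])
        log_kids = sum(c.log_pw for c in node.children.values())
        # P_w = 1/2 * P_e + 1/2 * (product of children's P_w); unseen children count as 1.
        node.log_pw = log2_add(node.log_pe - 1.0, log_kids - 1.0)

def ctw_code_length(bits, depth=4):
    """Ideal code length in bits that CTW + arithmetic coding would spend on `bits`."""
    root = Node()
    for t, x in enumerate(bits):
        ctx = bits[max(0, t - depth):t][::-1]   # up to `depth` most recent past bits
        ctw_update(root, x, ctx)
    return -root.log_pw

# A strongly first-order-Markov bit stream compresses well below 1 bit/symbol.
random.seed(0)
bits, x = [], 0
for _ in range(2000):
    x = x if random.random() < 0.9 else 1 - x   # stay with prob 0.9 (entropy rate ~0.47)
    bits.append(x)
print(ctw_code_length(bits, depth=4) / len(bits))  # approaches ~0.47 bits/symbol for long streams
```

Each node mixes its own memoryless (KT) estimate with the product of its children's weighted probabilities; this recursive mixture over all tree depths is what lets the code adapt to the unknown Markov order.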

Quick Check

For a first-order binary Markov chain with $P(0|0) = P(1|1) = 1$ (deterministic), the entropy rate is:

1 bit

0.5 bits

0 bits

Undefined

Entropy rate

The asymptotic per-symbol entropy $H_\infty = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$ of a stationary process. The fundamental compression limit for sources with memory.

Related: Entropy Rate

Key Takeaway

For sources with memory, the entropy rate $H_\infty$ is the fundamental compression limit, strictly below the marginal entropy $H(X)$ whenever dependencies exist. For Markov sources, $H_\infty = \sum_i \pi_i H(\text{row}_i(\mathbf{P}))$, a weighted average of conditional entropies. Exploiting memory requires either block coding, adaptive arithmetic coding, or universal methods like Lempel-Ziv and context tree weighting.