Exercises
ex-ch05-01
Easy
Verify Kraft's inequality for the given code and determine whether there is room to add another codeword.
Compute $\sum_i 2^{-\ell_i}$ and check whether it is $\le 1$.
Compute Kraft sum
With the given codeword lengths $\ell_i$, the Kraft sum is $\sum_i 2^{-\ell_i} = 1$. Kraft holds with equality, so there is no room for another codeword: the code tree is complete (every leaf is used).
ex-ch05-02
Easy
A source has four symbols with probabilities $(1/2, 1/4, 1/8, 1/8)$. Construct the Huffman code and verify that it achieves $\bar{L} = H(X)$.
When probabilities are dyadic (powers of 1/2), Huffman achieves entropy exactly.
Huffman construction
Merge 0.125 + 0.125 = 0.25. Remaining: (0.5, 0.25, 0.25). Merge 0.25 + 0.25 = 0.5. Remaining: (0.5, 0.5). Merge 0.5 + 0.5 = 1.0.
Codewords: 0 (length 1), 10 (length 2), 110 (length 3), 111 (length 3).
Verify optimality
$\bar{L} = 0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75$ bits. $H(X) = 0.5\log_2 2 + 0.25\log_2 4 + 2 \cdot 0.125\log_2 8 = 1.75$ bits. $\bar{L} = H(X)$ exactly.
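One way to double-check the construction is to rerun the merges mechanically; the sketch below uses a heap of subtree weights and reproduces the lengths (1, 2, 3, 3) and the 1.75-bit expected length.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    # Heap items: (subtree weight, unique tiebreak id, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every symbol under a merge gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
L_bar = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print(lengths, L_bar, H)   # [1, 2, 3, 3] 1.75 1.75
```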
ex-ch05-03
Easy
Compute the Shannon code (codeword lengths and expected length) for a given distribution $P = (p_1, \ldots, p_m)$. Compare with the entropy.
Shannon code lengths: $\ell_i = \lceil \log_2(1/p_i) \rceil$.
Compute lengths
For each symbol, round $\log_2(1/p_i)$ up to the nearest integer: $\ell_i = \lceil \log_2(1/p_i) \rceil$, so $\log_2(1/p_i) \le \ell_i < \log_2(1/p_i) + 1$.
Expected length and comparison
$\bar{L} = \sum_i p_i \ell_i < \sum_i p_i\big(\log_2(1/p_i) + 1\big) = H(P) + 1$ bits, and $\bar{L} \ge H(P)$ bits. The gap $\bar{L} - H(P)$ is within the guaranteed 1-bit bound.
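The exercise's specific distribution is not repeated above, so the sketch below uses $P = (0.4, 0.3, 0.2, 0.1)$ purely as an illustrative stand-in; substitute the intended $P$ to reproduce the exercise's numbers.

```python
from math import ceil, log2

def shannon_lengths(probs):
    return [ceil(log2(1 / p)) for p in probs]

P = [0.4, 0.3, 0.2, 0.1]          # illustrative stand-in distribution
lengths = shannon_lengths(P)
L_bar = sum(p * l for p, l in zip(P, lengths))
H = -sum(p * log2(p) for p in P)
print(lengths, L_bar, H, L_bar - H)   # lengths [2, 2, 3, 4]; the gap is below 1 bit
```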
ex-ch05-04
Easy
A binary Markov chain has given transition probabilities $p(1\,|\,0)$ and $p(0\,|\,1)$ (and their complements). Find the stationary distribution and the entropy rate.
Stationary: solve $\pi P = \pi$ with $\pi_0 + \pi_1 = 1$.
Stationary distribution
Balance requires $\pi_0\, p(1|0) = \pi_1\, p(0|1)$ and $\pi_0 + \pi_1 = 1$. Hence $\pi_0 = \frac{p(0|1)}{p(0|1) + p(1|0)}$ and $\pi_1 = \frac{p(1|0)}{p(0|1) + p(1|0)}$.
Entropy rate
$\mathcal{H}(\mathcal{X}) = \pi_0\, H_b\big(p(1|0)\big) + \pi_1\, H_b\big(p(0|1)\big)$ bits/symbol, where $H_b$ is the binary entropy function.
Marginal entropy: $H(X_1) = H_b(\pi_1)$ bits. Memory saves $H(X_1) - \mathcal{H}(\mathcal{X})$ bits/symbol (a 28% improvement for the given transition probabilities).
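As an illustration, the sketch below evaluates these formulas for placeholder transition probabilities $p(1|0) = 0.1$ and $p(0|1) = 0.3$ (chosen only for the example, not taken from the exercise).

```python
from math import log2

def h_b(p):
    # Binary entropy in bits, with h_b(0) = h_b(1) = 0.
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

p10 = 0.1   # placeholder P(X_{n+1} = 1 | X_n = 0)
p01 = 0.3   # placeholder P(X_{n+1} = 0 | X_n = 1)

pi0 = p01 / (p01 + p10)
pi1 = p10 / (p01 + p10)
rate = pi0 * h_b(p10) + pi1 * h_b(p01)
marginal = h_b(pi1)
print(pi0, pi1)                 # stationary distribution
print(rate, marginal)           # entropy rate vs. marginal entropy
print(1 - rate / marginal)      # fractional saving due to memory
```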
ex-ch05-05
Medium
Prove McMillan's inequality: if a code (not necessarily prefix) is uniquely decodable, then $\sum_i 2^{-\ell_i} \le 1$. Use the "power trick": expand $S^k = \big(\sum_i 2^{-\ell_i}\big)^k$ and bound the result.
Consider $S^k = \big(\sum_i 2^{-\ell_i}\big)^k$.
Group terms by total length $L = \ell_{i_1} + \cdots + \ell_{i_k}$.
The number of terms with total length $L$ is the number of $k$-symbol source sequences mapping to code strings of length $L$.
Expand the power
$S^k = \sum_{i_1} \cdots \sum_{i_k} 2^{-(\ell_{i_1} + \cdots + \ell_{i_k})} = \sum_{L=1}^{k\ell_{\max}} A_L\, 2^{-L}$, where $A_L$ is the number of $k$-tuples $(i_1, \ldots, i_k)$ with $\ell_{i_1} + \cdots + \ell_{i_k} = L$.
Bound $A_L$
$A_L$ counts the number of source sequences whose concatenated codeword has total length $L$. Since the code is uniquely decodable, each binary string of length $L$ can decode to at most one source sequence. There are $2^L$ binary strings of length $L$, so $A_L \le 2^L$.
Conclude
$S^k = \sum_{L=1}^{k\ell_{\max}} A_L\, 2^{-L} \le \sum_{L=1}^{k\ell_{\max}} 1 = k\ell_{\max}$, so $S \le (k\ell_{\max})^{1/k} \to 1$ as $k \to \infty$. Since $S$ does not depend on $k$, it follows that $S \le 1$.
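A small numerical illustration of the power trick: for lengths whose Kraft sum exceeds 1, $S^k$ eventually overtakes $k\ell_{\max}$, so no uniquely decodable code can use them.

```python
# Lengths whose Kraft sum exceeds 1: S = 1/2 + 3/4 = 1.25.
lengths = [1, 2, 2, 2]
S = sum(2.0 ** -l for l in lengths)
l_max = max(lengths)
for k in (1, 5, 10, 20, 50):
    # A uniquely decodable code would force S**k <= k * l_max for every k,
    # but here S**k overtakes k * l_max (around k = 16) and then explodes.
    print(k, S ** k, k * l_max)
```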
ex-ch05-06
Medium
A source has 8 equiprobable symbols. Compare the expected codeword length of: (a) a fixed-length code, (b) the Huffman code, (c) the Shannon code. What do you observe?
For equiprobable symbols, all three codes should give the same result.
All codes are equivalent
With $p_i = 1/8$ for all $i$:
- Fixed-length: 3 bits each. $\bar{L} = 3$.
- Huffman: all lengths equal to 3 (no asymmetry to exploit). $\bar{L} = 3$.
- Shannon: $\ell_i = \lceil \log_2 8 \rceil = 3$ for all $i$. $\bar{L} = 3$.
- Entropy: $H(X) = \log_2 8 = 3$ bits.
For uniform distributions, no compression is possible: the entropy equals the logarithm of the alphabet size, and all three codes achieve it exactly.
ex-ch05-07
Medium
A Shannon code is designed for distribution $Q$, but the true source distribution is $P$. Compute: (a) the expected codeword length under $P$, (b) the entropy $H(P)$, and (c) the mismatch penalty. Verify that $\bar{L} = H(P) + D(P\|Q)$, where $D(P\|Q) = \sum_x p(x)\log_2\frac{p(x)}{q(x)}$ (exactly when $Q$ is dyadic, and to within 1 bit in general).
The Shannon code for $Q$ has lengths $\ell_i = \lceil \log_2(1/q_i) \rceil$.
Compute Shannon code for $Q$
For each symbol, $\ell_i = \lceil \log_2(1/q_i) \rceil$; when $Q$ is dyadic these equal $\log_2(1/q_i)$ exactly.
Expected length under $P$
$\bar{L} = \sum_i p_i \ell_i = \sum_i p_i \log_2(1/q_i)$ bits (exactly, for dyadic $Q$).
Entropy and KL divergence
$H(P) = \sum_i p_i \log_2(1/p_i)$ bits. $D(P\|Q) = \sum_i p_i \log_2(p_i/q_i)$ bits. Adding them: $H(P) + D(P\|Q) = \sum_i p_i \log_2(1/q_i) = \bar{L}$ bits, as claimed.
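To see the identity numerically, the sketch below uses placeholder distributions with a dyadic $Q$ (these particular $P$ and $Q$ are illustrative assumptions, not the exercise's values).

```python
from math import ceil, log2

P = [0.25, 0.25, 0.5]   # placeholder true distribution
Q = [0.5, 0.25, 0.25]   # placeholder design distribution (dyadic)

lengths = [ceil(log2(1 / q)) for q in Q]          # Shannon code for Q
L_bar = sum(p * l for p, l in zip(P, lengths))    # expected length under P
H_P = -sum(p * log2(p) for p in P)
D_PQ = sum(p * log2(p / q) for p, q in zip(P, Q))
print(L_bar, H_P + D_PQ, D_PQ)   # 1.75 = 1.5 + 0.25: the penalty is D(P||Q)
```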
ex-ch05-08
Medium
Show that the entropy rate of a second-order binary Markov chain (where $X_{n+1}$ depends on $X_n$ and $X_{n-1}$) can be written as $\mathcal{H} = \sum_{x_1, x_2} \mu(x_1, x_2)\, H(X_{n+1} \mid X_n = x_2, X_{n-1} = x_1)$, where $\mu$ is the stationary distribution of the pair $(X_{n-1}, X_n)$.
A second-order Markov chain on $\{0, 1\}$ can be viewed as a first-order Markov chain on pairs $\{0, 1\}^2$.
Reduce to first-order
Define $Z_n = (X_{n-1}, X_n)$. Then $(Z_n)$ is a first-order Markov chain on a 4-symbol alphabet, with transition probabilities determined by the second-order conditional probabilities: from $(x_{n-1}, x_n)$ the chain can only move to states of the form $(x_n, x_{n+1})$, with probability $p(x_{n+1} \mid x_n, x_{n-1})$.
Apply first-order formula
The entropy rate of $Z$ is $H(Z_{n+1} \mid Z_n)$, but since $Z_{n+1} = (X_n, X_{n+1})$ shares one component with $Z_n$, $H(Z_{n+1} \mid Z_n) = H(X_{n+1} \mid X_n, X_{n-1}) = \sum_{x_1, x_2} \mu(x_1, x_2)\, H(X_{n+1} \mid X_n = x_2, X_{n-1} = x_1)$. The entropy rate of $X$ equals the entropy rate of $Z$ (they encode the same information).
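A short sketch of the reduction, using made-up second-order conditionals $q(a, b) = P(X_{n+1} = 1 \mid X_{n-1} = a, X_n = b)$ as placeholders:

```python
import numpy as np
from math import log2

def h_b(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Placeholder second-order conditionals q[(a, b)] = P(X_{n+1}=1 | X_{n-1}=a, X_n=b).
q = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.8}

# First-order chain on pairs: from (a, b) the only reachable states are (b, 0) and (b, 1).
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = np.zeros((4, 4))
for i, (a, b) in enumerate(states):
    for j, (c, d) in enumerate(states):
        if c == b:
            T[i, j] = q[(a, b)] if d == 1 else 1 - q[(a, b)]

# Stationary distribution of the pair chain (left eigenvector for eigenvalue 1).
w, v = np.linalg.eig(T.T)
mu = np.real(v[:, np.argmin(np.abs(w - 1))])
mu /= mu.sum()

rate = sum(mu[i] * h_b(q[s]) for i, s in enumerate(states))
print(dict(zip(states, mu.round(4))), round(rate, 4))
```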
ex-ch05-09
Medium
(Block Huffman.) For a binary memoryless source with $P(0) = 0.9$, $P(1) = 0.1$: (a) Compute $H(X)$. (b) What is the expected length of the symbol-by-symbol Huffman code? (c) Compute the expected length per symbol of the block-2 Huffman code (Huffman on pairs). (d) How does blocking improve the compression rate?
Block-2: the source has 4 "super-symbols" (00, 01, 10, 11) with probabilities $(0.81, 0.09, 0.09, 0.01)$.
Entropy
$H(X) = H_b(0.1) = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469$ bits.
Symbol-by-symbol Huffman
Only 2 symbols: Huffman assigns 0 and 1. $\bar{L} = 1$ bit/symbol. Gap: $1 - 0.469 = 0.531$ bits!
Block-2 Huffman
Probabilities: $(0.81, 0.09, 0.09, 0.01)$. Huffman: merge 0.09 + 0.01 = 0.10, then 0.10 + 0.09 = 0.19, then 0.81 + 0.19 = 1.0. Lengths: (1, 2, 3, 3). $\bar{L}_2 = 0.81(1) + 0.09(2) + 0.09(3) + 0.01(3) = 1.29$ bits per pair. Per symbol: $1.29 / 2 = 0.645$ bits/symbol. Gap: $0.645 - 0.469 = 0.176$ bits.
Improvement
Blocking from $n = 1$ to $n = 2$ reduced the gap from 0.531 to 0.176 bits/symbol, a 67% reduction. The theoretical bound for block-$n$ Huffman is $H(X) \le \bar{L}_n / n < H(X) + 1/n$. For low-entropy sources like this one, blocking (or arithmetic coding) is essential.
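The per-symbol figures (1 bit and 0.645 bits) can be reproduced, and extended to block length 3, with a short sketch that scores Huffman codes by summing merge weights.

```python
import heapq
from itertools import product
from math import log2

def huffman_expected_length(probs):
    # Expected codeword length equals the total weight of all merge nodes.
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

p1 = 0.1                    # P(X = 1); P(X = 0) = 0.9
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)
for n in (1, 2, 3):
    block_probs = [
        (1 - p1) ** block.count(0) * p1 ** block.count(1)
        for block in product((0, 1), repeat=n)
    ]
    per_symbol = huffman_expected_length(block_probs) / n
    print(n, round(per_symbol, 3), round(per_symbol - H, 3))
# n=1: 1.0 (gap 0.531), n=2: 0.645 (gap 0.176), n=3: the gap shrinks further
```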
ex-ch05-10
Medium
Prove that arithmetic coding uses at most $\lceil \log_2(1/p(x^n)) \rceil + 1$ bits to encode any sequence $x^n$. (Hint: show that a binary fraction with that many bits can uniquely identify a point in the interval assigned to $x^n$.)
A $k$-bit binary fraction has precision $2^{-k}$.
Show that among the fractions $j \cdot 2^{-k}$ for $j = 0, 1, \ldots, 2^k - 1$, at least one falls in the interval.
Interval width and precision
The interval for $x^n$ has width $p(x^n)$. The $k$-bit fractions have spacing $2^{-k}$. If $2^{-k} < p(x^n)$, then at least one fraction $j \cdot 2^{-k}$ falls in the interval.
Verify the bound
With $k = \lceil \log_2(1/p(x^n)) \rceil + 1$: $2^{-k} \le 2^{-\log_2(1/p(x^n)) - 1} = p(x^n)/2 < p(x^n)$.
So the spacing is strictly less than the interval width, guaranteeing at least one fraction in the interval. We use at most $\lceil \log_2(1/p(x^n)) \rceil + 1$ bits.
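A quick empirical check of the spacing argument, drawing random interval widths and positions (the specific ranges below are arbitrary test choices):

```python
import random
from math import ceil, floor, log2

random.seed(0)
for _ in range(100_000):
    p = random.uniform(1e-6, 1.0)        # interval width = sequence probability
    lo = random.uniform(0.0, 1.0 - p)    # interval [lo, lo + p) inside [0, 1)
    k = ceil(log2(1 / p)) + 1            # proposed number of bits
    j = floor(lo * 2 ** k) + 1           # smallest multiple of 2^-k strictly above lo
    assert lo < j * 2 ** -k < lo + p     # a k-bit fraction always lands in the interval
print("ok")
```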
ex-ch05-11
Hard
(Competitive optimality.) Prove that for the Shannon code with lengths $\ell(x) = \lceil \log_2(1/p(x)) \rceil$: $P\big(\ell(X) \ge \log_2(1/p(X)) + c\big) \le 2^{-(c-1)}$ for any $c > 0$. Interpret: the probability of using more than $c$ extra bits decays exponentially in $c$.
Note that $\ell(x) < \log_2(1/p(x)) + 1$ always, so $\ell(X) \ge \log_2(1/p(X)) + c$ is possible only for $c < 1$... unless we consider block codes.
For block coding of $n$ symbols, $\ell(x^n) = \lceil \log_2(1/p(x^n)) \rceil$, and the interesting event is that the per-symbol cost exceeds the entropy.
Per-symbol case
For a single symbol: $\ell(x) - \log_2(1/p(x)) < 1$, so the event is impossible for $c \ge 1$. For $0 < c < 1$, the bound is trivially true since $2^{-(c-1)} > 1$.
Block coding (more interesting)
For block-$n$: $\ell(x^n) = \lceil \log_2(1/p(x^n)) \rceil$. The event $\ell(X^n) \ge \log_2(1/p(X^n)) + c$ means $\lceil y \rceil - y \ge c$ with $y = \log_2(1/p(X^n))$; this requires the fractional part of $y$ to be at most $1 - c$, since $\lceil y \rceil - y = 1 - \{y\}$ for non-integer $y$, where $\{y\}$ denotes the fractional part. For the general bound, apply Markov's inequality to the random variable $2^{\ell(X^n) - \log_2(1/p(X^n))}$, whose mean is at most 2 because the exponent is always below 1: $P\big(\ell(X^n) \ge \log_2(1/p(X^n)) + c\big) \le 2 \cdot 2^{-c} = 2^{-(c-1)}$. For the statement that the per-symbol cost $\ell(X^n)/n$ rarely exceeds the entropy, a cleaner approach uses the AEP directly.
ex-ch05-12
Hard
(Huffman code for a geometric distribution.) Let $P(X = i) = (1-p)p^i$ for $i = 0, 1, 2, \ldots$. Show that the Huffman code assigns lengths $\ell_i = i + 1$ (unary coding) when $p \le 1/2$, and compute the expected length.
For $p = 1/2$: $P(X = i) = 2^{-(i+1)}$, which is dyadic.
Show that merging the two least probable symbols at each step yields a new geometric-like distribution.
Structure of Huffman for geometric
Since $P(X = i) = (1-p)p^i$ is decreasing in $i$ and $1 - p \ge p$ for $p \le 1/2$, symbol 0 is always at least as probable as all other symbols combined. Huffman pairs symbol 0 with the merged tail at the root, and the same structure repeats inside the tail.
Inductive argument
After merging all symbols $j > i$, the merged probability is $\sum_{j > i} (1-p)p^j = p^{i+1}$. At each step, we compare $P(X = i) = (1-p)p^i$ with the merged tail $p^{i+1}$. Since $(1-p)p^i \ge p^{i+1}$ iff $1 - p \ge p$ iff $p \le 1/2$, symbol $i$ is always merged with the tail, giving it one more bit than symbol $i-1$. This produces lengths $\ell_i = i + 1$.
Expected length
$\bar{L} = \sum_{i=0}^{\infty} (1-p)p^i\, (i+1) = \frac{1}{1-p}$. Entropy: $H(X) = \frac{H_b(p)}{1-p}$. The gap depends on $p$ and approaches 0 as $p \to 1/2$.
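A short numerical comparison of the unary expected length $\frac{1}{1-p}$ with the entropy $\frac{H_b(p)}{1-p}$, over a few illustrative values of $p$:

```python
from math import log2

def h_b(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.5, 0.4, 0.3, 0.2):
    L_bar = 1 / (1 - p)                                          # E[X + 1], unary code
    series = sum((1 - p) * p**i * (i + 1) for i in range(200))   # direct check of the sum
    H = h_b(p) / (1 - p)                                         # entropy of the geometric law
    print(p, round(L_bar, 4), round(series, 4), round(L_bar - H, 4))
# the gap is 0 at p = 0.5 and grows as p moves away from 1/2
```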
ex-ch05-13
Hard
(Entropy rate of a hidden Markov model.) Consider a binary Markov chain $(X_n)$ with states $\{0, 1\}$ and given transition probabilities. The observation $Y_n$ is a noisy version of $X_n$: $P(Y_n = X_n) = 1 - \epsilon$ and $P(Y_n \ne X_n) = \epsilon$. Explain why the entropy rate $\mathcal{H}(\mathcal{Y})$ does not have a simple closed form, and describe how to compute it numerically using the forward algorithm.
The observations form a hidden Markov model (HMM).
For HMMs, the entropy rate does not decompose as a simple weighted sum.
Why no closed form
For a first-order Markov chain, $\mathcal{H}(\mathcal{X}) = H(X_2 \mid X_1)$ because $H(X_n \mid X_{n-1}, \ldots, X_1) = H(X_n \mid X_{n-1})$. For an HMM, $(Y_n)$ is not Markov: $Y_n$ depends on all past observations through the hidden state $X_n$. The conditional entropy $H(Y_n \mid Y_{n-1}, \ldots, Y_1)$ does not simplify, because the posterior distribution of $X_n$ given $Y^{n-1}$ depends on the entire observation history.
Numerical computation via forward algorithm
The forward algorithm computes $\alpha_n(x) = P(y_1, \ldots, y_n, X_n = x)$ recursively: $\alpha_n(x) = \sum_{x'} \alpha_{n-1}(x')\, P(x \mid x')\, P(y_n \mid x)$. The entropy rate is $\mathcal{H}(\mathcal{Y}) = \lim_{n \to \infty} H(Y_n \mid Y^{n-1})$, which can be estimated by averaging $-\log_2 P(y_n \mid y^{n-1})$ over a long simulated sequence, where $P(y_n \mid y^{n-1})$ is computed from the normalized forward variables.
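A minimal simulation sketch of this procedure; the transition probabilities and noise level below are placeholder assumptions chosen only for illustration.

```python
import random
from math import log2

random.seed(1)
a, b, eps = 0.1, 0.2, 0.15        # placeholder P(1|0), P(0|1), flip probability
n = 200_000

# Simulate the hidden chain and its noisy observations.
x, ys = 0, []
for _ in range(n):
    if x == 0:
        x = 1 if random.random() < a else 0
    else:
        x = 0 if random.random() < b else 1
    ys.append(x ^ (random.random() < eps))

# Normalized forward recursion; accumulate -log2 P(y_n | y^{n-1}).
T = [[1 - a, a], [b, 1 - b]]                 # T[x'][x] = P(x | x')
alpha = [b / (a + b), a / (a + b)]           # start at the stationary distribution
total = 0.0
for y in ys:
    unnorm = [sum(alpha[xp] * T[xp][x] for xp in (0, 1)) * ((1 - eps) if x == y else eps)
              for x in (0, 1)]
    py = unnorm[0] + unnorm[1]               # P(y_n | y^{n-1})
    total += -log2(py)
    alpha = [u / py for u in unnorm]
print(total / n)                             # Monte Carlo estimate of the entropy rate of Y
```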
ex-ch05-14
Hard
(Redundancy of Huffman vs. arithmetic coding.) For a binary source with $P(1) = p$ and $p$ small (very low entropy), compute the Huffman redundancy and the arithmetic coding redundancy for block length $n$. Show that the Huffman redundancy can approach 1 bit per symbol, while the arithmetic coding redundancy vanishes.
For very low entropy, $H(X) = H_b(p) \to 0$ as $p \to 0$.
Huffman must use at least 1 bit per symbol.
Huffman redundancy
Huffman assigns lengths (1, 1) for a binary source, so $\bar{L} = 1$ bit/symbol. $H(X) = H_b(p) \approx p\log_2(1/p)$ for small $p$. Redundancy: $1 - H_b(p) \to 1$ bit as $p \to 0$. Almost the entire bit is wasted!
Arithmetic coding redundancy
Arithmetic coding on blocks of $n$: $\ell(x^n) < \log_2(1/p(x^n)) + 2$. Per symbol: $E[\ell(X^n)]/n < H(X) + 2/n$. Redundancy: at most $2/n \to 0$ as $n \to \infty$, regardless of how small $H(X)$ is.
For example, when $p$ is small enough that $H(X) \approx 0.01$ bits/symbol, Huffman still spends 1 bit/symbol (100x the entropy), while arithmetic coding with a moderately large block length spends about $0.01 + 2/n$ bits/symbol.
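A few illustrative numbers (the specific $p$ and $n$ values below are arbitrary choices, not the exercise's):

```python
from math import log2

def h_b(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.05, 0.01, 0.001):                  # illustrative values of P(1)
    H = h_b(p)
    huffman_redundancy = 1.0 - H               # 1 bit/symbol minus the entropy
    for n in (100, 1000):
        arithmetic_redundancy = 2.0 / n        # worst-case bits/symbol above entropy
        print(p, round(H, 4), round(huffman_redundancy, 4), n, arithmetic_redundancy)
```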
ex-ch05-15
Challenge
(Minimax redundancy.) Consider the class of all memoryless sources on a binary alphabet, parameterized by $\theta = P(X = 1) \in [0, 1]$. A universal code for blocks of length $n$ is designed without knowing $\theta$. The redundancy at parameter $\theta$ is $R_n(\theta) = E_\theta[\ell(X^n)] - nH(\theta)$. The minimax redundancy is $R_n^* = \min_{\text{codes}} \max_\theta R_n(\theta)$.
Show that $R_n^* = \frac{1}{2}\log_2 n + O(1)$, and that the optimal code uses a mixture (Jeffreys prior) over the source parameter.
This connects to the MDL (minimum description length) principle.
The optimal mixture assigns weight $w(\theta) \propto \frac{1}{\sqrt{\theta(1-\theta)}}$ (the Jeffreys prior) to each source.
The redundancy of a mixture code at $\theta$ equals (up to rounding) $D\big(P_\theta^n \,\|\, Q_n\big)$, where $Q_n$ is the mixture distribution.
Mixture coding approach
A mixture code assigns length $\ell(x^n) = \lceil \log_2(1/Q_n(x^n)) \rceil$ to sequence $x^n$, where $Q_n(x^n) = \int_0^1 w(\theta)\, P_\theta(x^n)\, d\theta$ is the Bayesian predictive distribution. The redundancy at $\theta$ is (within 1 bit): $R_n(\theta) = E_\theta\!\left[\log_2 \frac{P_\theta(X^n)}{Q_n(X^n)}\right] = D\big(P_\theta^n \,\|\, Q_n\big)$.
Jeffreys prior is minimax optimal
By Laplace's method, the KL divergence is approximately $\frac{1}{2}\log_2\frac{n}{2\pi e} + \log_2\frac{\sqrt{I(\theta)}}{w(\theta)}$, where $I(\theta) = \frac{1}{\theta(1-\theta)}$ is the Fisher information. The maximum over $\theta$ is minimized when $w(\theta) \propto \sqrt{I(\theta)}$, i.e., $w(\theta) \propto \frac{1}{\sqrt{\theta(1-\theta)}}$. This is the Jeffreys prior, and the resulting minimax redundancy is $\frac{1}{2}\log_2 n + O(1)$.
Interpretation
The $\frac{1}{2}\log_2 n$ term is the cost of "learning" one real parameter from $n$ samples. More generally, for a $k$-parameter source family, the minimax redundancy is $\frac{k}{2}\log_2 n + O(1)$, and the Jeffreys prior is asymptotically minimax optimal. This connects information theory to the MDL (minimum description length) principle in statistics.
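The $\frac{1}{2}\log_2 n$ behavior can be seen numerically with the Krichevsky–Trofimov estimator, the sequential form of the Jeffreys mixture for a Bernoulli source; the sketch below computes its expected redundancy exactly by summing over the number of ones.

```python
from math import comb, log2

def log2_kt_prob(k, n):
    # log2 of the KT (Jeffreys-mixture) probability of one sequence with k ones in n bits.
    lp = 0.0
    for i in range(k):                 # feed the k ones first (the order does not matter)
        lp += log2((i + 0.5) / (i + 1))
    for j in range(n - k):             # then the n - k zeros
        lp += log2((j + 0.5) / (k + j + 1))
    return lp

def kt_redundancy(theta, n):
    # E_theta[log2 P_theta(X^n) - log2 Q_n(X^n)] = expected extra bits over the entropy.
    r = 0.0
    for k in range(n + 1):
        pk = comb(n, k) * theta**k * (1 - theta) ** (n - k)
        if pk > 0.0:
            log2_p_seq = k * log2(theta) + (n - k) * log2(1 - theta)
            r += pk * (log2_p_seq - log2_kt_prob(k, n))
    return r

for n in (10, 100, 1000):
    worst = max(kt_redundancy(t, n) for t in (0.1, 0.3, 0.5, 0.7, 0.9))
    print(n, round(worst, 3), round(0.5 * log2(n), 3))   # tracks (1/2) log2 n + O(1)
```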
ex-ch05-16
Challenge
(LZ78 lower bound on phrase count.) Prove that for a binary i.i.d. source with entropy $H > 0$, the LZ78 phrase count $c(n)$ after parsing $n$ symbols satisfies $c(n) \ge (1-\epsilon)\frac{nH}{\log_2 n}$ almost surely for large $n$, for any $\epsilon > 0$. Deduce that the average phrase length is at most $\frac{\log_2 n}{(1-\epsilon)H}$.
Each of the $c(n)$ phrases is distinct, and their total length is $n$.
If the average phrase length is $\bar{\ell} = n/c(n)$, compare $c(n)$ with the number of distinct binary strings of length at most $\bar{\ell}$ (fewer than $2^{\bar{\ell}+1}$).
Phrase length constraint
The phrases are all distinct and have total length $n$. The number of distinct nonempty binary strings of length at most $l$ is $2^{l+1} - 2 < 2^{l+1}$. If the longest phrase has length at most $l$, then $c(n) < 2^{l+1}$.
Average phrase length
The average phrase length is $\bar{\ell} = n/c(n)$. Since all phrases are distinct and each phrase extends a previously seen phrase by one symbol, the phrase lengths grow slowly. By the counting above, if $c(n)$ distinct phrases fill $n$ positions, then $n \ge \sum_l l\, c_l$ with at most $2^l$ phrases of each length $l$, which gives $\bar{\ell} \ge (1 - o(1)) \log_2 c(n)$.
Lower bound on $c(n)$
The LZ78 description of $x^n$ spends about $\log_2 c(n) + 1$ bits per phrase (a pointer to an earlier phrase plus one new symbol), so its total length is about $c(n)\big(\log_2 c(n) + 1\big)$ bits, and the map from sequences to descriptions is one-to-one. Fewer than $2^{m+1}$ sequences can receive descriptions of at most $m$ bits. By the AEP, almost surely $p(X^n) \le 2^{-n(H - \delta)}$ for all large $n$, so the probability that a typical $X^n$ is describable in at most $(1-\epsilon)nH$ bits is at most $2^{(1-\epsilon)nH + 1}\, 2^{-n(H-\delta)}$, which is summable in $n$ for small enough $\delta$. By Borel–Cantelli, almost surely $c(n)\big(\log_2 c(n) + 1\big) \ge (1-\epsilon)nH$ for all large $n$. Since $\log_2 c(n) \le \log_2 n$, rearranging gives $c(n) \ge (1-\epsilon)\frac{nH}{\log_2 n + 1}$, and hence the average phrase length satisfies $n/c(n) \le \frac{\log_2 n + 1}{(1-\epsilon)H}$ almost surely.
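An empirical sanity check: parse a simulated Bernoulli($p$) sequence with LZ78 and compare $c(n)$ against $nH/\log_2 n$ and $n/\log_2 n$ (the value $p = 0.2$ is an arbitrary illustration).

```python
import random
from math import log2

def lz78_phrase_count(bits):
    # Number of phrases in the LZ78 incremental parsing of a bit sequence.
    seen = {(): 0}
    phrase = ()
    count = 0
    for bit in bits:
        phrase = phrase + (bit,)
        if phrase not in seen:
            seen[phrase] = len(seen)
            count += 1
            phrase = ()
    return count + (1 if phrase else 0)   # count a trailing partial phrase, if any

random.seed(0)
p = 0.2                                   # illustrative Bernoulli parameter
H = -p * log2(p) - (1 - p) * log2(1 - p)
for n in (10_000, 100_000, 1_000_000):
    bits = [1 if random.random() < p else 0 for _ in range(n)]
    c = lz78_phrase_count(bits)
    print(n, c, round(n * H / log2(n)), round(n / log2(n)))
# c(n) falls roughly between nH/log2(n) and n/log2(n), consistent with the bounds above
```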