Chapter Summary

Key Points

1. Entropy $H(X)$ quantifies average uncertainty. For a discrete RV with PMF $p$, $H(X) = -\sum_x p(x) \log p(x)$ is bounded by $0 \leq H(X) \leq \log|\mathcal{X}|$, with the upper bound achieved by the uniform distribution. Entropy equals the minimum average description length for an i.i.d. source. (See Sketch 1 below.)

2. The chain rule decomposes joint entropy. $H(X,Y) = H(X) + H(Y \mid X)$, and more generally $H(X_1, \ldots, X_n) = \sum_i H(X_i \mid X^{i-1})$, where $X^{i-1} = (X_1, \ldots, X_{i-1})$. This telescoping structure is the backbone of most converse proofs. (See Sketch 2 below.)

3. Mutual information $I(X;Y) = H(X) - H(X \mid Y)$ measures the information one variable provides about another. It is symmetric, non-negative, and equals zero iff $X \perp Y$. Channel capacity is $C = \max_{P_X} I(X;Y)$. (See Sketch 3 below.)

4. KL divergence $D(P \| Q) \geq 0$ is the mother inequality. Non-negativity of mutual information, the entropy upper bound, and the data processing inequality all follow from $D(P \| Q) \geq 0$. (See Sketch 4 below.)

5. Concavity of $I(X;Y)$ in $P_X$ makes capacity computable. The capacity optimization is a concave maximization: any local maximum is global, and algorithms like Blahut-Arimoto converge to it. (See Sketch 5 below.)

6. The data processing inequality says processing cannot create information. For a Markov chain $X \to Y \to Z$: $I(X;Z) \leq I(X;Y)$. Equality holds iff $X \to Z \to Y$ is also a Markov chain, i.e., iff $Z$ is a sufficient statistic for $X$. (See Sketch 6 below.)

7. Fano's inequality converts error probability into an entropy bound. $H(X \mid Y) \leq h_b(P_e) + P_e \log(M-1)$, where $M$ is the alphabet size of $X$ and $P_e$ is the probability that an estimate of $X$ from $Y$ errs. This is the key tool for proving that rates above capacity lead to unavoidable errors. (See Sketch 7 below.)

8. Maximum entropy distributions under constraints form exponential families. Uniform maximizes entropy without constraints; geometric under a mean constraint; discrete Gaussian under mean and variance constraints. This Lagrangian technique reappears as waterfilling in Gaussian channels. (See Sketch 8 below.)

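Numerical Sketches

The short Python sketches below are minimal numerical illustrations of the key points above. They assume NumPy, and the helper names (`entropy`, `H`, `mi`, `kl`, `blahut_arimoto`) are illustrative choices rather than a standard API.

Sketch 1 checks the entropy bounds of Point 1: the uniform distribution on four symbols attains the upper bound $\log_2 4 = 2$ bits, and a point mass attains the lower bound of 0.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative PMFs, chosen for this sketch.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy(np.ones(4) / 4))             # 2.0 bits = log2(4): the upper bound
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits: the lower bound
```
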
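Sketch 2 verifies the chain rule of Point 2, $H(X,Y) = H(X) + H(Y \mid X)$, on an arbitrary $2 \times 3$ joint PMF.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.10, 0.20, 0.10],        # arbitrary joint PMF p(x, y) for this sketch
                [0.25, 0.05, 0.30]])
px = pxy.sum(axis=1)                       # marginal p(x)
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
print(H(pxy))                              # H(X, Y), ~2.366 bits
print(H(px) + H_Y_given_X)                 # H(X) + H(Y|X): identical
```
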
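Sketch 3 computes the mutual information of Point 3 both ways, $H(X) - H(X \mid Y)$ and $H(Y) - H(Y \mid X)$, confirming symmetry and non-negativity.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.30, 0.10],              # illustrative joint PMF of (X, Y)
                [0.10, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
print(H(px) - (H(pxy) - H(py)))            # I(X;Y) = H(X) - H(X|Y), ~0.257 bits
print(H(py) - (H(pxy) - H(px)))            # H(Y) - H(Y|X): same value (symmetry)
```
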
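Sketch 4 ties Point 4 to Point 3: the mutual information of Sketch 3 is exactly the KL divergence between the joint distribution and the product of its marginals, so $I \geq 0$ is a special case of $D \geq 0$.

```python
import numpy as np

def kl(p, q):
    """D(P || Q) in bits; assumes supp(P) is contained in supp(Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

pxy = np.array([[0.30, 0.10],              # same joint PMF as Sketch 3
                [0.10, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
# I(X;Y) = D(P_XY || P_X P_Y): matches the ~0.257 bits from Sketch 3.
print(kl(pxy.ravel(), np.outer(px, py).ravel()))
```
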
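Sketch 5 runs a bare-bones Blahut-Arimoto iteration for Point 5 (a minimal version of the standard alternating update, not an optimized implementation) on a binary symmetric channel, where the closed-form capacity $1 - h_b(\epsilon)$ is available for comparison.

```python
import numpy as np

def blahut_arimoto(W, iters=500):
    """Capacity (bits/use) of a DMC whose rows are W[x] = p(y | x)."""
    n = W.shape[0]
    p = np.ones(n) / n                     # input distribution, start uniform
    for _ in range(iters):
        q = W * p[:, None]
        q /= q.sum(axis=0, keepdims=True)  # posterior q(x | y)
        logp = np.sum(W * np.log(np.where(W > 0, q, 1.0)), axis=1)
        p = np.exp(logp - logp.max())      # p(x) ~ exp(sum_y p(y|x) log q(x|y))
        p /= p.sum()
    pxy = W * p[:, None]
    py = pxy.sum(axis=0)
    m = pxy > 0
    return np.sum(pxy[m] * np.log2(pxy[m] / np.outer(p, py)[m])), p

eps = 0.1                                  # BSC(0.1), an illustrative channel
W = np.array([[1 - eps, eps], [eps, 1 - eps]])
C, p_star = blahut_arimoto(W)
hb = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
print(C, 1 - hb)                           # both ~0.531 bits per channel use
```

Because the objective is concave in $P_X$, this fixed-point iteration cannot stall at a spurious local optimum.
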
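Sketch 6 builds a Markov chain $X \to Y \to Z$ from two stochastic matrices and confirms the data processing inequality of Point 6.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pxy):
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

px  = np.array([0.4, 0.6])                 # illustrative source X
Wyx = np.array([[0.9, 0.1], [0.2, 0.8]])   # channel p(y | x)
Wzy = np.array([[0.7, 0.3], [0.3, 0.7]])   # processor p(z | y)
pxy = px[:, None] * Wyx                    # joint of (X, Y)
pxz = pxy @ Wzy                            # joint of (X, Z), using X -> Y -> Z
print(mi(pxy), mi(pxz))                    # I(X;Y) >= I(X;Z), as the DPI requires
```
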
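Sketch 7 evaluates Point 7 in the symmetric-error case (the wrong estimates are spread uniformly over the other $M-1$ symbols), where Fano's inequality holds with equality.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

M, Pe = 8, 0.1                             # illustrative alphabet size and error rate
# Symmetric estimator: Xhat = X w.p. 1 - Pe, else uniform over the other M - 1 values.
W = np.full((M, M), Pe / (M - 1))
np.fill_diagonal(W, 1 - Pe)
pxy = W / M                                # X uniform, joint p(x, xhat)
H_X_given_Y = H(pxy) - H(pxy.sum(axis=0))
hb = -(Pe * np.log2(Pe) + (1 - Pe) * np.log2(1 - Pe))
print(H_X_given_Y, hb + Pe * np.log2(M - 1))  # equal: the bound is tight here
```
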
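Sketch 8 illustrates Point 8: solving the mean-constrained maximum-entropy problem numerically (the exponential form $p(k) \propto e^{-\lambda k}$ from the Lagrangian, with $\lambda$ found by bisection) recovers a geometric distribution, and a Poisson distribution with the same mean has strictly lower entropy. The truncation at $K = 100$ is a simplification for this sketch.

```python
import numpy as np
from math import exp, factorial

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

K, mean = 100, 2.0                         # support {0, ..., K}, target mean
k = np.arange(K + 1)
lo, hi = 1e-6, 10.0                        # bisect lambda in p(k) ~ exp(-lambda k)
for _ in range(100):
    lam = 0.5 * (lo + hi)
    p = np.exp(-lam * k); p /= p.sum()
    lo, hi = (lam, hi) if p @ k > mean else (lo, lam)
print(p @ k, H(p))                         # mean 2.0; geometric entropy, ~2.76 bits

q = np.array([exp(-mean) * mean**i / factorial(i) for i in k])  # Poisson(2), same mean
q /= q.sum()
print(q @ k, H(q))                         # same mean, strictly smaller entropy
```
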
Looking Ahead

Chapter 2 extends these information measures to continuous random variables, where "differential entropy" replaces entropy and the Gaussian distribution plays a starring role as the continuous maximum-entropy distribution under a variance constraint. The fundamental result that Gaussian noise is the worst-case additive noise leads directly to the capacity formula $C = \frac{1}{2}\log(1 + \mathrm{SNR})$ for the AWGN channel.
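
As a concrete preview, evaluating that formula at a few SNRs (a minimal sketch; base-2 logs, so capacity is in bits per real channel use):

```python
import numpy as np

snr_db = np.array([0.0, 10.0, 20.0, 30.0])  # illustrative SNR values
snr = 10 ** (snr_db / 10)                    # convert dB to linear SNR
C = 0.5 * np.log2(1 + snr)                   # bits per real channel use
for db, c in zip(snr_db, C):
    print(f"{db:5.1f} dB SNR -> C = {c:.3f} bits/use")
```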