Chapter Summary

Key Points

1. Entropy $H(X)$ quantifies average uncertainty. For a discrete RV with PMF $p$, $H(X) = -\sum_x p(x) \log p(x)$ is bounded by $0 \leq H(X) \leq \log|\mathcal{X}|$, with the upper bound achieved by the uniform distribution. Entropy equals the minimum average description length for an i.i.d. source. (See Sketch 1 below.)

2. The chain rule decomposes joint entropy. $H(X,Y) = H(X) + H(Y \mid X)$, and more generally $H(X_1, \ldots, X_n) = \sum_i H(X_i \mid X^{i-1})$, where $X^{i-1} = (X_1, \ldots, X_{i-1})$. This telescoping structure is the backbone of most converse proofs. (See Sketch 2 below.)

3. Mutual information $I(X;Y) = H(X) - H(X \mid Y)$ measures the information one variable provides about another. It is symmetric, non-negative, and equals zero iff $X \perp Y$. Channel capacity is $C = \max_{P_X} I(X;Y)$. (See Sketch 3 below.)

4. KL divergence $D(P \| Q) \geq 0$ is the mother inequality. Non-negativity of mutual information, the entropy upper bound, and the data processing inequality all follow from $D(P \| Q) \geq 0$. (See Sketch 4 below.)

5. Concavity of $I(X;Y)$ in $P_X$ makes capacity computable. The capacity optimization is a concave maximization: any local maximum is global, and algorithms like Blahut-Arimoto converge to it. (See Sketch 5 below.)

6. The data processing inequality says processing cannot create information. For a Markov chain $X \to Y \to Z$: $I(X;Z) \leq I(X;Y)$. Equality holds iff $X \to Z \to Y$ is also a Markov chain, i.e., iff $Z$ is a sufficient statistic for $X$. (See Sketch 6 below.)

7. Fano's inequality converts error probability into an entropy bound. $H(X \mid Y) \leq h_b(P_e) + P_e \log(M-1)$, where $M$ is the alphabet size of $X$ and $P_e$ is the probability that an estimate of $X$ from $Y$ errs. This is the key tool for proving that rates above capacity lead to unavoidable errors. (See Sketch 7 below.)

8. Maximum entropy distributions under constraints form exponential families. Uniform maximizes entropy without constraints; geometric under a mean constraint; discrete Gaussian under mean and variance constraints. This Lagrangian technique reappears as waterfilling in Gaussian channels. (See Sketch 8 below.)

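Numerical Sketches

The short Python sketches below are minimal numerical illustrations of the key points above. They assume NumPy, and the helper names (`entropy`, `H`, `mi`, `kl`, `blahut_arimoto`) are illustrative choices rather than a standard API.

Sketch 1 checks the entropy bounds of Point 1: the uniform distribution on four symbols attains the upper bound $\log_2 4 = 2$ bits, and a point mass attains the lower bound of 0.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative PMFs, chosen for this sketch.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy(np.ones(4) / 4))             # 2.0 bits = log2(4): the upper bound
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits: the lower bound
```
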
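Sketch 2 verifies the chain rule of Point 2, $H(X,Y) = H(X) + H(Y \mid X)$, on an arbitrary $2 \times 3$ joint PMF.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.10, 0.20, 0.10],        # arbitrary joint PMF p(x, y) for this sketch
                [0.25, 0.05, 0.30]])
px = pxy.sum(axis=1)                       # marginal p(x)
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
print(H(pxy))                              # H(X, Y), ~2.366 bits
print(H(px) + H_Y_given_X)                 # H(X) + H(Y|X): identical
```
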
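Sketch 3 computes the mutual information of Point 3 both ways, $H(X) - H(X \mid Y)$ and $H(Y) - H(Y \mid X)$, confirming symmetry and non-negativity.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

pxy = np.array([[0.30, 0.10],              # illustrative joint PMF of (X, Y)
                [0.10, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
print(H(px) - (H(pxy) - H(py)))            # I(X;Y) = H(X) - H(X|Y), ~0.257 bits
print(H(py) - (H(pxy) - H(px)))            # H(Y) - H(Y|X): same value (symmetry)
```
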
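Sketch 4 ties Point 4 to Point 3: the mutual information of Sketch 3 is exactly the KL divergence between the joint distribution and the product of its marginals, so $I \geq 0$ is a special case of $D \geq 0$.

```python
import numpy as np

def kl(p, q):
    """D(P || Q) in bits; assumes supp(P) is contained in supp(Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

pxy = np.array([[0.30, 0.10],              # same joint PMF as Sketch 3
                [0.10, 0.50]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
# I(X;Y) = D(P_XY || P_X P_Y): matches the ~0.257 bits from Sketch 3.
print(kl(pxy.ravel(), np.outer(px, py).ravel()))
```
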
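Sketch 5 runs a bare-bones Blahut-Arimoto iteration for Point 5 (a minimal version of the standard alternating update, not an optimized implementation) on a binary symmetric channel, where the closed-form capacity $1 - h_b(\epsilon)$ is available for comparison.

```python
import numpy as np

def blahut_arimoto(W, iters=500):
    """Capacity (bits/use) of a DMC whose rows are W[x] = p(y | x)."""
    n = W.shape[0]
    p = np.ones(n) / n                     # input distribution, start uniform
    for _ in range(iters):
        q = W * p[:, None]
        q /= q.sum(axis=0, keepdims=True)  # posterior q(x | y)
        logp = np.sum(W * np.log(np.where(W > 0, q, 1.0)), axis=1)
        p = np.exp(logp - logp.max())      # p(x) ~ exp(sum_y p(y|x) log q(x|y))
        p /= p.sum()
    pxy = W * p[:, None]
    py = pxy.sum(axis=0)
    m = pxy > 0
    return np.sum(pxy[m] * np.log2(pxy[m] / np.outer(p, py)[m])), p

eps = 0.1                                  # BSC(0.1), an illustrative channel
W = np.array([[1 - eps, eps], [eps, 1 - eps]])
C, p_star = blahut_arimoto(W)
hb = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
print(C, 1 - hb)                           # both ~0.531 bits per channel use
```

Because the objective is concave in $P_X$, this fixed-point iteration cannot stall at a spurious local optimum.
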
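Sketch 6 builds a Markov chain $X \to Y \to Z$ from two stochastic matrices and confirms the data processing inequality of Point 6.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(pxy):
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

px  = np.array([0.4, 0.6])                 # illustrative source X
Wyx = np.array([[0.9, 0.1], [0.2, 0.8]])   # channel p(y | x)
Wzy = np.array([[0.7, 0.3], [0.3, 0.7]])   # processor p(z | y)
pxy = px[:, None] * Wyx                    # joint of (X, Y)
pxz = pxy @ Wzy                            # joint of (X, Z), using X -> Y -> Z
print(mi(pxy), mi(pxz))                    # I(X;Y) >= I(X;Z), as the DPI requires
```
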
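Sketch 7 evaluates Point 7 in the symmetric-error case (the wrong estimates are spread uniformly over the other $M-1$ symbols), where Fano's inequality holds with equality.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

M, Pe = 8, 0.1                             # illustrative alphabet size and error rate
# Symmetric estimator: Xhat = X w.p. 1 - Pe, else uniform over the other M - 1 values.
W = np.full((M, M), Pe / (M - 1))
np.fill_diagonal(W, 1 - Pe)
pxy = W / M                                # X uniform, joint p(x, xhat)
H_X_given_Y = H(pxy) - H(pxy.sum(axis=0))
hb = -(Pe * np.log2(Pe) + (1 - Pe) * np.log2(1 - Pe))
print(H_X_given_Y, hb + Pe * np.log2(M - 1))  # equal: the bound is tight here
```
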
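Sketch 8 illustrates Point 8: solving the mean-constrained maximum-entropy problem numerically (the exponential form $p(k) \propto e^{-\lambda k}$ from the Lagrangian, with $\lambda$ found by bisection) recovers a geometric distribution, and a Poisson distribution with the same mean has strictly lower entropy. The truncation at $K = 100$ is a simplification for this sketch.

```python
import numpy as np
from math import exp, factorial

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

K, mean = 100, 2.0                         # support {0, ..., K}, target mean
k = np.arange(K + 1)
lo, hi = 1e-6, 10.0                        # bisect lambda in p(k) ~ exp(-lambda k)
for _ in range(100):
    lam = 0.5 * (lo + hi)
    p = np.exp(-lam * k); p /= p.sum()
    lo, hi = (lam, hi) if p @ k > mean else (lo, lam)
print(p @ k, H(p))                         # mean 2.0; geometric entropy, ~2.76 bits

q = np.array([exp(-mean) * mean**i / factorial(i) for i in k])  # Poisson(2), same mean
q /= q.sum()
print(q @ k, H(q))                         # same mean, strictly smaller entropy
```
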
Looking Ahead

Chapter 2 extends these information measures to continuous random variables, where "differential entropy" replaces entropy and the Gaussian distribution plays a starring role as the continuous maximum-entropy distribution under a variance constraint. The fundamental result that Gaussian noise is the worst-case additive noise leads directly to the capacity formula $C = \frac{1}{2}\log(1 + \mathrm{SNR})$ for the AWGN channel.
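
As a concrete preview, evaluating that formula at a few SNRs (a minimal sketch; base-2 logs, so capacity is in bits per real channel use):

```python
import numpy as np

snr_db = np.array([0.0, 10.0, 20.0, 30.0])  # illustrative SNR values
snr = 10 ** (snr_db / 10)                    # convert dB to linear SNR
C = 0.5 * np.log2(1 + snr)                   # bits per real channel use
for db, c in zip(snr_db, C):
    print(f"{db:5.1f} dB SNR -> C = {c:.3f} bits/use")
```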