Chapter 8 Summary
Key Points
1. Source coding with a helper bridges standard compression and Slepian-Wolf coding: the helper sends a rate-limited description of the side information $Y$ to reduce the rate needed to describe the source $X$. Wyner's common information quantifies the shared structure.
2. Practical distributed video coding uses LDPC syndromes for Slepian-Wolf coding, shifting complexity from the encoder to the decoder. This makes it ideal for power-constrained sensors and multi-camera systems (a toy syndrome-coding sketch follows this list).
3. The information bottleneck is a rate-distortion problem where distortion measures relevance to a target variable $Y$: minimize $I(X;T) - \beta\, I(T;Y)$ over encoders $p(t|x)$. It provides the right framework for understanding representation learning.
4. The VAE loss is a variational upper bound on the rate-distortion function with log-loss distortion. The KL term upper-bounds the mutual information $I(X;Z)$ (the rate), and the reconstruction loss is the distortion.
5. The connections between information theory and machine learning are exact mathematical relationships, not merely analogies. Rate-distortion theory guides the design of learned compression, neural codecs, and semantic communication systems.
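To make key point 2 concrete, here is a minimal NumPy sketch of syndrome-based Slepian-Wolf coding. The 4x7 parity-check matrix `H`, the names `sw_encode`/`sw_decode`, and the exhaustive single-bit-flip decoder are all illustrative stand-ins: a practical system uses a large sparse LDPC matrix and belief-propagation decoding.

```python
import numpy as np

# Toy 4x7 parity-check matrix (hypothetical, for illustration only);
# a real system would use a large sparse LDPC matrix.
H = np.array([[1, 1, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0],
              [0, 0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 0, 1]], dtype=np.uint8)

def sw_encode(x):
    """Encoder: transmit only the syndrome s = H x (mod 2).

    4 syndrome bits replace the 7 source bits, so the rate drops from
    7 to 4 bits per block; the encoder never sees the side information.
    """
    return (H @ x) % 2

def sw_decode(s, y):
    """Decoder: find the word nearest the side information y (in Hamming
    distance) whose syndrome equals s. Exhaustive over zero- and one-bit
    flips; real decoders use iterative belief propagation instead.
    """
    if np.array_equal((H @ y) % 2, s):
        return y
    for i in range(len(y)):
        cand = y.copy()
        cand[i] ^= 1  # flip one bit of y and retest the syndrome
        if np.array_equal((H @ cand) % 2, s):
            return cand
    raise ValueError("side information too noisy for this toy code")

x = np.array([1, 0, 1, 1, 0, 0, 1], dtype=np.uint8)  # source (e.g. camera A)
y = np.array([1, 0, 1, 0, 0, 0, 1], dtype=np.uint8)  # side info (differs in 1 bit)
assert np.array_equal(sw_decode(sw_encode(x), y), x)
```

The complexity asymmetry is visible even in this toy: encoding is a single matrix-vector product, while all the search effort sits at the decoder.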
Looking Ahead
With Parts I and II complete, we have developed the full machinery of source coding: from entropy and typicality through lossless and lossy compression to distributed source coding and its connections to modern machine learning. Part III turns to the complementary problem: how much information can be reliably communicated over a noisy channel? Chapter 9 develops the channel coding theorem for discrete memoryless channels, establishing the other pillar of Shannon's theory.
Evidence lower bound (ELBO)
A variational lower bound on the log-marginal likelihood $\log p(x)$: $\mathrm{ELBO}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\mathrm{KL}}\!\left(q(z|x)\,\|\,p(z)\right) \le \log p(x)$. Maximizing the ELBO is equivalent to minimizing the VAE loss, which is a rate-distortion objective.
Related: VAE as a Rate-Distortion Upper Bound, Rate-distortion theory and the information bottleneck, Kullback-Leibler Divergence (Relative Entropy)
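Written out, the negative ELBO is exactly a rate-distortion Lagrangian, which is the equivalence this entry asserts; the second identity, with $q(z) = \mathbb{E}_{p(x)}[q(z|x)]$ the aggregate posterior, is the standard decomposition behind the claim that the averaged KL term upper-bounds the rate $I(X;Z)$:

$$
-\mathrm{ELBO}(x)
  = \underbrace{\mathbb{E}_{q(z|x)}\!\left[-\log p(x|z)\right]}_{\text{distortion (log loss)}}
  + \underbrace{D_{\mathrm{KL}}\!\left(q(z|x)\,\|\,p(z)\right)}_{\text{rate term}},
\qquad
\mathbb{E}_{p(x)}\!\left[D_{\mathrm{KL}}\!\left(q(z|x)\,\|\,p(z)\right)\right]
  = I(X;Z) + D_{\mathrm{KL}}\!\left(q(z)\,\|\,p(z)\right) \ge I(X;Z).
$$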
β-VAE
A variant of the variational autoencoder that weights the KL term by $\beta$: $\mathcal{L}_\beta = \mathbb{E}_{q(z|x)}[-\log p(x|z)] + \beta\, D_{\mathrm{KL}}\!\left(q(z|x)\,\|\,p(z)\right)$. Varying $\beta$ sweeps the rate-distortion curve, with $\beta > 1$ encouraging disentangled representations (more compression) and $\beta < 1$ favoring reconstruction quality.
Related: VAE as a Rate-Distortion Upper Bound, Information bottleneck, Rate-distortion theory and the information bottleneck
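A minimal sketch of the loss itself, assuming a diagonal-Gaussian encoder $q(z|x)$, a standard normal prior, and a Bernoulli decoder; the function and argument names (`beta_vae_loss`, `mu`, `log_var`) are illustrative, not from any particular library:

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Per-example beta-VAE loss: distortion + beta * rate.

    rate: closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )
    distortion: Bernoulli log loss for pixels x, x_hat in [0, 1]
    """
    eps = 1e-7  # numerical guard for the logs
    distortion = -np.sum(x * np.log(x_hat + eps)
                         + (1 - x) * np.log(1 - x_hat + eps))
    rate = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return distortion + beta * rate
```

Training the same model at several values of `beta` and plotting the resulting (rate, distortion) pairs traces out the operational rate-distortion curve this entry describes.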
The Information Bottleneck Method
Tishby, Pereira, and Bialek introduced the information bottleneck as a principled method for extracting relevant information from data. By framing representation learning as a rate-distortion problem with relevance-based distortion, they connected lossy source coding to statistical learning in a way that has influenced both fields. The IB method was later applied to deep learning theory, clustering, and feature selection, and remains a cornerstone of the information-theoretic approach to machine learning.