Connections to Statistics and Machine Learning
Information Theory Meets Machine Learning
Rate-distortion theory asks: what is the minimum description length for a source at a given quality level? Representation learning asks: what is the most useful compressed representation of data for a downstream task? These are the same question viewed from different angles.
This section explores three deep connections between lossy source coding and modern machine learning: the information bottleneck (a rate-distortion problem where "distortion" is measured by relevance to a target variable), mutual information as a regularizer in representation learning, and the variational autoencoder as a practical rate-distortion code with learned encoder and decoder.
The Information Bottleneck Curve
Definition: The Information Bottleneck Method
Given a joint distribution $p(x, y)$, the information bottleneck (IB) seeks a compressed representation $T$ of $X$ that preserves as much information as possible about a target variable $Y$:

$$\min_{p(t|x)} \; I(X; T) - \beta\, I(T; Y)$$

subject to the Markov chain $T \leftrightarrow X \leftrightarrow Y$ (so $T$ depends on $Y$ only through $X$).
Here:
- $I(X; T)$ is the compression cost (how many bits $T$ uses to describe $X$)
- $I(T; Y)$ is the relevance (how much $T$ tells us about $Y$)
- $\beta$ is a Lagrange multiplier trading off compression against relevance
The IB curve traces the optimal tradeoff between compression and relevance as $\beta$ varies from $0$ (maximum compression, $T$ is trivial) to $\infty$ (maximum relevance, $I(T; Y) \to I(X; Y)$).
The IB is a rate-distortion problem where the "distortion" is $I(X; Y) - I(T; Y)$ — the loss of predictive power about $Y$. The IB functional can be written as:

$$\mathcal{L}_{\mathrm{IB}} = I(X; T) - \beta\, I(T; Y)$$
At $\beta \to \infty$, the IB reduces to minimizing $I(X; T)$ subject to $I(T; Y) = I(X; Y)$ — keep only the information in $X$ that is relevant to $Y$.
Information bottleneck
An information-theoretic framework for finding compressed representations that preserve task-relevant information. It trades off compression rate against relevance via a Lagrange multiplier $\beta$.
Related: Rate-distortion theory and the information bottlen…, Representation learning, Sufficient statistics
Theorem: Optimal Information Bottleneck Solution
The optimal conditional distribution $p(t|x)$ for the information bottleneck satisfies the self-consistent equations:

$$p(t|x) = \frac{p(t)}{Z(x, \beta)} \exp\left(-\beta\, D_{\mathrm{KL}}\!\left(p(y|x) \,\|\, p(y|t)\right)\right)$$

where $Z(x, \beta)$ is a normalizing constant and:

$$p(t) = \sum_x p(t|x)\, p(x), \qquad p(y|t) = \frac{1}{p(t)} \sum_x p(y|x)\, p(t|x)\, p(x)$$

These equations define a Blahut-Arimoto-type alternating minimization that converges to a stationary point of the IB functional (the problem is not convex, so only local optimality is guaranteed).
The optimal encoding assigns $x$ to cluster $t$ based on how similar the conditional distribution $p(y|x)$ is to the cluster's average $p(y|t)$. The similarity is measured by KL divergence — the same quantity that appears in rate-distortion theory. The parameter $\beta$ controls the resolution of the clustering: large $\beta$ creates many fine-grained clusters (high relevance), small $\beta$ creates few coarse clusters (high compression).
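The three self-consistent equations can be iterated directly on a discrete joint distribution. The sketch below is illustrative (the function name, cluster count, iteration count, and random initialization are my assumptions, not prescribed by the text):

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Alternating IB updates for a discrete joint p(x, y).

    Illustrative sketch: cycles the self-consistent equations for
    p(t|x), p(t), and p(y|t) until (local) convergence.
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]           # conditional p(y|x)

    # Random soft assignment p(t|x); each row sums to 1.
    p_t_given_x = rng.random((len(p_x), n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_t_given_x.T @ p_x               # p(t) = sum_x p(t|x) p(x)
        # p(y|t) = (1/p(t)) sum_x p(y|x) p(t|x) p(x)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_t /= p_t[:, None]
        # d(x,t) = KL(p(y|x) || p(y|t)) for every pair (x, t)
        log_ratio = (np.log(p_y_given_x[:, None, :] + 1e-12)
                     - np.log(p_y_given_t[None, :, :] + 1e-12))
        d = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # p(t|x) proportional to p(t) exp(-beta * d(x, t)), then normalize
        p_t_given_x = p_t[None, :] * np.exp(-beta * d)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    return p_t_given_x
```

For small $\beta$ the rows of the returned encoder collapse toward the marginal $p(t)$ (coarse clusters); for large $\beta$ they sharpen toward deterministic assignments, matching the resolution behavior described above.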
Lagrangian formulation
The IB Lagrangian is:

$$\mathcal{L}_{\mathrm{IB}}[p(t|x)] = I(X; T) - \beta\, I(T; Y)$$

Taking the functional derivative with respect to $p(t|x)$ and setting it to zero (with normalization constraints) yields the self-consistent equations. The structure is identical to the Blahut-Arimoto algorithm for rate-distortion: alternating optimization over $p(t|x)$, $p(t)$, and $p(y|t)$.
Connection to rate-distortion
Define the distortion measure $d(x, t) = D_{\mathrm{KL}}\!\left(p(y|x) \,\|\, p(y|t)\right)$. Then the IB problem becomes:

$$\min_{p(t|x)} \; I(X; T) + \beta\, \mathbb{E}\left[d(X, T)\right]$$

This is exactly a rate-distortion problem with the KL divergence distortion measure, since under the Markov chain $\mathbb{E}[d(X, T)] = I(X; Y) - I(T; Y)$. The IB curve is the rate-distortion function for this choice of distortion.
Example: Gaussian Information Bottleneck
Let $(X, Y)$ be jointly Gaussian with $Y = X + N$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ and $N \sim \mathcal{N}(0, \sigma_N^2)$ is independent of $X$. Compute the information bottleneck curve.
Gaussian IB solution
For jointly Gaussian variables, the optimal representation is also Gaussian: $T = X + Z$ where $Z \sim \mathcal{N}(0, \sigma_Z^2)$, independent of $X$.
The compression rate is:

$$R = I(X; T) = \frac{1}{2}\log_2\left(1 + \frac{\sigma_X^2}{\sigma_Z^2}\right)$$

The relevance is:

$$I(T; Y) = \frac{1}{2}\log_2\frac{(\sigma_X^2 + \sigma_N^2)(\sigma_X^2 + \sigma_Z^2)}{(\sigma_X^2 + \sigma_N^2)(\sigma_X^2 + \sigma_Z^2) - \sigma_X^4}$$
IB curve
The IB curve, parameterized by the rate $R$ (equivalently by $\sigma_Z^2$), with $\mathrm{SNR} = \sigma_X^2 / \sigma_N^2$:

$$I(T; Y) = \frac{1}{2}\log_2\frac{1 + \mathrm{SNR}}{1 + \mathrm{SNR}\cdot 2^{-2R}}$$

As $R \to \infty$ (no compression): $I(T; Y) \to I(X; Y) = \frac{1}{2}\log_2(1 + \mathrm{SNR})$. As $R \to 0$ (maximum compression): $I(T; Y) \to 0$.
For the Gaussian case, the IB curve coincides with the rate-distortion function under MMSE distortion, reflecting the sufficiency of Gaussian test channels.
Information Bottleneck Curve
Explore the information bottleneck tradeoff between compression rate and relevance for the Gaussian case with varying SNR.
Parameter: signal-to-noise ratio between $X$ and $Y$, in dB.
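The closed-form Gaussian IB curve from the example can be evaluated numerically; this is an illustrative sketch (the function name and rate grid are my choices):

```python
import numpy as np

def gaussian_ib_curve(snr_db, rates):
    """Relevance I(T;Y) in bits as a function of rate R = I(X;T) in bits,
    for the scalar Gaussian IB with Y = X + N and snr = var(X)/var(N).
    """
    snr = 10 ** (snr_db / 10)
    rates = np.asarray(rates, dtype=float)
    # I(T;Y) = (1/2) log2( (1 + SNR) / (1 + SNR * 2^(-2R)) )
    return 0.5 * np.log2((1 + snr) / (1 + snr * 2.0 ** (-2 * rates)))
```

At $R = 0$ the relevance is zero, and as $R$ grows it saturates at $I(X; Y) = \frac{1}{2}\log_2(1 + \mathrm{SNR})$, reproducing the two limits stated above.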
Definition: Variational Autoencoders as Rate-Distortion Codes
A variational autoencoder (VAE) minimizes the loss:

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)$$

This is precisely a rate-distortion objective:
- The first term is the distortion: the expected reconstruction error under the learned decoder $p_\theta(x|z)$.
- The second term is the rate: the KL divergence bounds the mutual information $I(X; Z)$, measuring how many bits the latent code $Z$ uses to describe $X$.
The VAE loss equals: $\mathcal{L}_{\mathrm{VAE}} = \text{distortion} + \text{rate}$.
The connection to rate-distortion theory is exact: the VAE objective is a variational upper bound on the rate-distortion function, where the "distortion" is log-loss and the "rate" is measured by the KL term. The $\beta$-VAE (loss $= \text{distortion} + \beta \cdot \text{rate}$) directly traces the rate-distortion curve as $\beta$ varies — this is the same parametrization as the information bottleneck.
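The distortion-plus-rate form is easy to write down concretely. The sketch below assumes a Gaussian encoder with diagonal covariance and uses squared error as a stand-in for $-\log p_\theta(x|z)$ (exact up to constants for a fixed-variance Gaussian decoder); the function name and argument conventions are mine:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """beta-VAE loss in rate-distortion form (illustrative sketch).

    distortion: squared reconstruction error per example;
    rate: closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    """
    distortion = np.sum((x - x_recon) ** 2, axis=-1)
    rate = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
    return np.mean(distortion + beta * rate)
```

Sweeping `beta` moves the trained model along the rate-distortion curve: large `beta` forces the posterior toward the prior (low rate, high distortion), small `beta` favors reconstruction fidelity.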
Variational autoencoder (VAE)
A generative model that learns an encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ by maximizing the evidence lower bound (ELBO). From an information-theoretic perspective, the VAE solves a rate-distortion problem with log-loss distortion and a learned codebook.
Related: Rate-distortion theory and the information bottlen…, Information bottleneck, Evidence lower bound (ELBO)
Theorem: VAE as a Rate-Distortion Upper Bound
For data $X \sim p(x)$, latent variable $Z$, encoder $q_\phi(z|x)$, and decoder $p_\theta(x|z)$:

$$R(D) \;\le\; I(X; Z) \;\le\; \mathbb{E}_{p(x)}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)\right]$$

where $D = \mathbb{E}\left[-\log p_\theta(x|z)\right]$ is the expected log-loss distortion.
The left inequality is the rate-distortion bound (Shannon's source coding theorem). The right inequality bounds the true mutual information by the variational KL term — tight when the prior $p(z)$ equals the aggregate posterior $q_\phi(z) = \mathbb{E}_{p(x)}[q_\phi(z|x)]$.
The VAE provides a practical (trainable) upper bound on the rate-distortion function. As the encoder and decoder networks become more expressive, the bound tightens and the VAE approaches the fundamental compression limit. The $\beta$-VAE sweeps the rate-distortion curve by varying the weight $\beta$ on the KL term, exactly as the Lagrange multiplier does in the IB.
MI bound
Under the joint distribution $p(x)\, q_\phi(z|x)$, the mutual information decomposes as:

$$I(X; Z) = \mathbb{E}_{p(x)}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z) \,\|\, p(z)\right)$$

Since $D_{\mathrm{KL}}\!\left(q_\phi(z) \,\|\, p(z)\right) \ge 0$, the expected KL term upper-bounds $I(X; Z)$, with equality when the prior matches the aggregate posterior, $p(z) = q_\phi(z)$.
Rate-distortion connection
By Shannon's rate-distortion theorem, $R(D) \le I(X; Z)$ for any joint distribution achieving distortion $D$. Combined with the MI bound:

$$R(D) \;\le\; \mathbb{E}_{p(x)}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)\right]$$
The VAE loss is therefore an upper bound on the rate-distortion Lagrangian.
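The MI bound and its gap can be checked numerically on a discrete toy example (the encoder table, data distribution, and prior below are made-up illustrative numbers):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * (np.log(p) - np.log(q)), 0.0)))

p_x = np.array([0.5, 0.5])                   # data distribution p(x)
q_z_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])         # encoder rows q(z|x)
prior = np.array([0.5, 0.5])                 # prior p(z), need not match q(z)

q_z = p_x @ q_z_given_x                      # aggregate posterior q(z)
mi = sum(p_x[i] * kl(q_z_given_x[i], q_z) for i in range(2))      # I(X;Z)
rate = sum(p_x[i] * kl(q_z_given_x[i], prior) for i in range(2))  # E[KL]

assert mi <= rate + 1e-12                    # the MI bound
assert abs(rate - mi - kl(q_z, prior)) < 1e-12   # gap = KL(q(z) || p(z))
```

The final assertion verifies the decomposition exactly: the rate term overshoots $I(X; Z)$ by precisely $D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z))$, which vanishes when the prior equals the aggregate posterior.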
The Information-Theoretic Lens on Deep Learning
The connections between information theory and deep learning go beyond the VAE:
- Generalization bounds: Mutual information between training data and learned parameters bounds generalization error — a direct analogue of the rate-distortion tradeoff where "rate" is model complexity and "distortion" is generalization gap.
- Neural network compression: Pruning, quantization, and knowledge distillation are practical rate-distortion problems where the "source" is a large model and the "reconstruction" is a smaller model with bounded accuracy loss.
- Learned compression: Neural image/video codecs (e.g., those based on hyperprior models) directly optimize the rate-distortion objective with learned encoders and decoders, often outperforming hand-designed codecs like JPEG.
The point is that rate-distortion theory provides the right conceptual framework for understanding why and when compression (of data, models, or representations) helps rather than hurts performance.
Historical Note: The Information Bottleneck: From Physics to Deep Learning
1999–2017: The information bottleneck was introduced by Tishby, Pereira, and Bialek in 1999, building on ideas from statistical physics (the IB Lagrangian has the form of a free energy, with $\beta$ playing the role of inverse temperature). For nearly two decades, it was primarily a tool in computational linguistics and bioinformatics.
The IB gained renewed attention in 2017 when Shwartz-Ziv and Tishby proposed that deep neural networks implicitly perform information bottleneck optimization: early in training, the hidden layers increase $I(T; Y)$ (fitting the data), and later they decrease $I(X; T)$ (compressing the representation). While this "compression phase" hypothesis has been debated and partially refuted for ReLU networks, the IB framework remains a powerful lens for understanding representation learning.
Quick Check
In the information bottleneck, what happens as the Lagrange multiplier $\beta \to 0$?
The representation $T$ becomes trivial (independent of $X$)
The representation becomes a sufficient statistic for $Y$
The IB reduces to the standard rate-distortion problem
Correct! When $\beta \to 0$, the relevance term vanishes and the objective reduces to minimizing $I(X; T)$ alone. The optimal $T$ is independent of $X$ (maximum compression, zero relevance).
Quick Check
In a VAE, the KL divergence term $D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)$ serves as a bound on which information-theoretic quantity?
The entropy of the latent code
The mutual information between input and latent code
The reconstruction error
Correct! $I(X; Z) \le \mathbb{E}_{p(x)}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(Z|X) \,\|\, p(Z)\right)\right]$, with equality when the prior $p(z)$ matches the aggregate posterior $q_\phi(z)$. This is the "rate" in the rate-distortion interpretation of VAEs.
Common Mistake: The IB Compression Phase in Deep Networks
Mistake:
Claiming that all deep neural networks undergo an "information compression phase" during training, where $I(X; T)$ decreases in later epochs. This was the original claim of the Shwartz-Ziv and Tishby (2017) paper.
Correction:
Subsequent work (Saxe et al., 2018) showed that the compression phase depends on the activation function: networks with saturating activations (tanh, sigmoid) do compress, but networks with ReLU activations may not. The IB framework remains valuable for understanding representation learning, but the claim of universal compression during training is not supported for all architectures.
Why This Matters: Neural Compression for Wireless Communications
The rate-distortion perspective on deep learning has practical implications for wireless systems. Learned image and video codecs based on VAE-like architectures now match or exceed traditional codecs (JPEG, H.265) in rate-distortion performance. For wireless transmission:
- Joint source-channel coding: Neural networks can learn end-to-end mappings from source to channel input, bypassing the separation architecture. This is particularly attractive for low-latency, low-SNR regimes.
- Semantic communication: Rather than transmitting all bits faithfully, the encoder can learn to transmit only task-relevant information (the information bottleneck applied to communication). See Book telecom, Ch. 20 for semantic communication frameworks.
- Federated learning compression: Gradient compression in distributed training is a rate-distortion problem where the "source" is the gradient and the "distortion" is the impact on convergence.
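The gradient-compression bullet can be made concrete with a minimal top-k sparsifier. This is a hypothetical sketch of one common scheme (the function name and the residual-based distortion measure are my illustrative choices): the "rate" is roughly the k transmitted index/value pairs, and the "distortion" is the dropped residual.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Top-k gradient compression: keep the k largest-magnitude entries.

    Returns the compressed gradient (zeros elsewhere) and the residual
    that was dropped -- the rate-distortion tradeoff in miniature.
    """
    idx = np.argsort(np.abs(grad))[-k:]     # indices of the k largest |g_i|
    compressed = np.zeros_like(grad)
    compressed[idx] = grad[idx]
    residual = grad - compressed            # distortion left behind
    return compressed, residual
```

In practice the residual is often accumulated locally and added to the next round's gradient (error feedback), so the distortion affects convergence speed rather than the final solution.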
Neural Image Compression: From Theory to Deployment
Learned image codecs based on VAE architectures now match or exceed traditional codecs (JPEG, HEVC intra) in rate-distortion performance. The encoder and decoder are neural networks (typically convolutional), and the rate is controlled by a learned entropy model (hyperprior).
Key engineering constraints in deployment:
- Decode latency: Neural decoders on mobile GPUs take 50-200ms per 720p frame, compared to <10ms for hardware HEVC decoders.
- Rate control: Unlike traditional codecs with precise rate control via QP, neural codecs require retraining or sweeping the Lagrange multiplier $\beta$ to change the operating point on the rate-distortion curve.
- Standardization: MPEG is developing the Neural Network-based Video Coding (NNVC) standard. Interoperability requires standardized decoder architectures and fixed-point inference.
- Training data bias: Neural codecs trained on natural images may perform poorly on medical images, satellite imagery, or synthetic content.
- Decoder inference latency is 5-20x higher than traditional codecs
- Rate-distortion operating point requires retraining, not just a QP change
- Fixed-point quantization needed for hardware deployment
- Domain mismatch between training and deployment data
Key Takeaway
Rate-distortion theory and the information bottleneck provide the conceptual foundation for understanding representation learning: compression (low rate) and relevance (low distortion) must be traded off. The VAE loss is a variational upper bound on the rate-distortion function. These connections are not merely analogies — they are exact mathematical relationships that guide the design of learned compression systems and inform our understanding of what neural networks learn.