Connections to Statistics and Machine Learning

Information Theory Meets Machine Learning

Rate-distortion theory asks: what is the minimum description length for a source at a given quality level? Representation learning asks: what is the most useful compressed representation of data for a downstream task? These are the same question viewed from different angles.

This section explores three deep connections between lossy source coding and modern machine learning: the information bottleneck (a rate-distortion problem where "distortion" is measured by relevance to a target variable), mutual information as a regularizer in representation learning, and the variational autoencoder as a practical rate-distortion code with learned encoder and decoder.

The Information Bottleneck Curve

The information bottleneck tradeoff animated: a dot sweeps along the IB curve from maximum compression ($\beta \to 0$, trivial representation) to maximum relevance ($\beta \to \infty$, sufficient statistic), showing how the Lagrange multiplier $\beta$ controls the compression-relevance tradeoff.

Definition: The Information Bottleneck Method

Given a joint distribution $P_{XY}$, the information bottleneck (IB) seeks a compressed representation $T$ of $X$ that preserves as much information as possible about a target variable $Y$:

$$\min_{P_{T|X}} \; I(X; T) - \beta \cdot I(T; Y)$$

subject to the Markov chain $T - X - Y$.

Here:

  • $I(X; T)$ is the compression cost (how many bits $T$ uses to describe $X$)
  • $I(T; Y)$ is the relevance (how much $T$ tells us about $Y$)
  • $\beta > 0$ is a Lagrange multiplier trading off compression against relevance

The IB curve traces the optimal tradeoff between compression and relevance as $\beta$ varies from 0 (maximum compression, $T$ is trivial) to $\infty$ (maximum relevance, $T = X$).

The IB is a rate-distortion problem where the "distortion" between a source symbol $x$ and a cluster $t$ is $d(x, t) = D\big(P_{Y|X}(\cdot|x) \,\|\, P_{Y|T}(\cdot|t)\big)$ — the loss of predictive power about $Y$ incurred by summarizing $x$ as $t$. The IB functional can be written as:

$$\mathcal{L}_{\text{IB}} = I(X; T) - \beta \cdot I(T; Y) = I(X; T) - \beta \big[ I(X; Y) - I(X; Y \mid T) \big]$$

At $\beta = 1$, the IB reduces to minimizing $I(X; T \mid Y)$ — keep only the information in $X$ that is relevant to $Y$.
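Both rewritings follow from expanding a three-variable mutual information with the chain rule and using the Markov chain $T - X - Y$, which forces $I(T; Y \mid X) = 0$:

$$\begin{aligned}
I(X, T; Y) &= I(X; Y) + I(T; Y \mid X) = I(T; Y) + I(X; Y \mid T) &&\Rightarrow\quad I(T; Y) = I(X; Y) - I(X; Y \mid T), \\
I(X; T, Y) &= I(X; T) + I(X; Y \mid T) = I(X; Y) + I(X; T \mid Y) &&\Rightarrow\quad I(X; T) - I(T; Y) = I(X; T \mid Y).
\end{aligned}$$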

Information bottleneck

An information-theoretic framework for finding compressed representations that preserve task-relevant information. It trades off compression rate $I(X; T)$ against relevance $I(T; Y)$ via a Lagrange multiplier $\beta$.

Related: Rate-distortion theory and the information bottlen…, Representation learning, Sufficient statistics

Theorem: Optimal Information Bottleneck Solution

The optimal conditional distribution $P_{T|X}$ for the information bottleneck satisfies the self-consistent equations:

$$P_{T|X}(t|x) = \frac{P_T(t)}{Z(x, \beta)} \exp\Big(-\beta \, D\big(P_{Y|X}(\cdot|x) \,\|\, P_{Y|T}(\cdot|t)\big)\Big)$$

where $Z(x, \beta)$ is a normalizing constant and:

$$P_T(t) = \sum_x P_X(x) \, P_{T|X}(t|x), \qquad P_{Y|T}(y|t) = \sum_x P_{Y|X}(y|x) \, P_{X|T}(x|t)$$

These equations define a Blahut-Arimoto-type alternating minimization that converges to the optimal IB solution.

The optimal encoding assigns $x$ to cluster $t$ based on how similar the conditional distribution $P_{Y|X=x}$ is to the cluster's average $P_{Y|T=t}$. The similarity is measured by KL divergence — the same quantity that appears in rate-distortion theory. The parameter $\beta$ controls the resolution of the clustering: large $\beta$ creates many fine-grained clusters (high relevance), small $\beta$ creates few coarse clusters (high compression).
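A minimal numerical sketch of this alternating minimization for a discrete source (the function name, toy joint distribution, and smoothing constants are illustrative, not part of the formal result):

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Blahut-Arimoto-style alternating minimization for the discrete IB.

    p_xy : (nx, ny) joint distribution P(X, Y); beta : tradeoff parameter.
    Returns the encoder P(T|X) as an (nx, n_clusters) array.
    """
    rng = np.random.default_rng(seed)
    nx, ny = p_xy.shape
    p_x = p_xy.sum(axis=1)                       # P(X)
    p_y_given_x = p_xy / p_x[:, None]            # P(Y|X)

    # Random soft initialization of the encoder P(T|X)
    p_t_given_x = rng.random((nx, n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                  # P(T)
        # P(Y|T) = sum_x P(Y|X) P(X|T), with P(X|T) from Bayes' rule
        p_xt = p_t_given_x * p_x[:, None]        # joint P(X, T)
        p_y_given_t = (p_xt / np.maximum(p_t, 1e-12)).T @ p_y_given_x
        # KL( P(Y|X=x) || P(Y|T=t) ) for every (x, t) pair
        kl = np.sum(p_y_given_x[:, None, :] *
                    (np.log(p_y_given_x[:, None, :] + 1e-12) -
                     np.log(p_y_given_t[None, :, :] + 1e-12)), axis=2)
        # Self-consistent encoder update, then normalize over t
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x

# Toy usage: 4 source symbols, 2 labels, 2 clusters
p_xy = np.array([[0.20, 0.05], [0.18, 0.07], [0.05, 0.20], [0.07, 0.18]])
print(information_bottleneck(p_xy, n_clusters=2, beta=5.0).round(2))
```

Sweeping $\beta$ and recording $I(X; T)$ and $I(T; Y)$ at convergence traces out the IB curve.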

Example: Gaussian Information Bottleneck

Let $(X, Y)$ be jointly Gaussian with $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y = X + N$, where $N \sim \mathcal{N}(0, \sigma_N^2)$. Compute the information bottleneck curve.

Interactive figure: Information Bottleneck Curve — explore the tradeoff between compression rate $I(X; T)$ and relevance $I(T; Y)$ for the Gaussian case, with the signal-to-noise ratio between $X$ and $Y$ (in dB, default 10) as the adjustable parameter.
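For the scalar jointly Gaussian case, the optimal $T$ can be taken to be a noisy linear function of $X$, which gives a closed-form IB curve. A sketch of evaluating that curve, assuming the standard scalar Gaussian IB expression stated in the docstring (treat it as an assumption to verify against your derivation); the function name and rate grid are illustrative:

```python
import numpy as np

def gaussian_ib_curve(snr_db, rates_nats):
    """Scalar Gaussian IB curve for Y = X + N, assuming the optimal T is a
    noisy linear function of X. With rho^2 = SNR / (1 + SNR) and R = I(X;T)
    in nats:  I(T;Y) = -0.5 * log(1 - rho^2 * (1 - exp(-2R))).
    """
    snr = 10.0 ** (snr_db / 10.0)               # sigma_X^2 / sigma_N^2
    rho2 = snr / (1.0 + snr)                    # squared correlation of X and Y
    R = np.asarray(rates_nats, dtype=float)     # compression rates I(X;T)
    return -0.5 * np.log(1.0 - rho2 * (1.0 - np.exp(-2.0 * R)))   # relevance I(T;Y)

# Trace the curve for the widget's default SNR of 10 dB
R = np.linspace(0.0, 5.0, 6)
print(gaussian_ib_curve(10.0, R))
```

At $R = 0$ the relevance is zero (trivial $T$); as $R \to \infty$ it saturates at $I(X; Y) = \tfrac{1}{2}\log(1 + \mathrm{SNR})$.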

Definition: Variational Autoencoders as Rate-Distortion Codes

A variational autoencoder (VAE) minimizes the loss:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}\big[-\log p_\theta(x|z)\big] + D\big(q_\phi(z|x) \,\|\, p(z)\big)$$

This is precisely a rate-distortion objective:

  • The first term is the distortion: the expected reconstruction error under the learned decoder $p_\theta(x|z)$.
  • The second term is the rate: the KL divergence $D(q_\phi(z|x) \,\|\, p(z))$ bounds the mutual information $I(X; Z)$, measuring how many bits the latent code $Z$ uses to describe $X$.

The VAE loss equals the negative ELBO: $\mathcal{L}_{\text{VAE}} = -\text{ELBO} = -\log p_\theta(x) + D\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$.

The connection to rate-distortion theory is exact: the VAE objective is a variational upper bound on the rate-distortion function, where the "distortion" is log-loss and the "rate" is measured by the KL term. The $\beta$-VAE ($\mathcal{L} = \text{distortion} + \beta \cdot \text{rate}$) directly traces the rate-distortion curve as $\beta$ varies — this is the same parametrization as the information bottleneck.
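A minimal sketch of that objective for a Gaussian encoder and a Bernoulli (logit-output) decoder, written in PyTorch; the function name and tensor shapes are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=1.0):
    """Rate-distortion form of the beta-VAE objective (per example, in nats).

    distortion: E_q[-log p_theta(x|z)], here Bernoulli log-loss on x in [0, 1]
    rate:       KL( N(mu, diag(exp(logvar))) || N(0, I) ), in closed form
    """
    distortion = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=-1)
    rate = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return (distortion + beta * rate).mean(), distortion.mean(), rate.mean()
```

Sweeping `beta` and recording the converged (rate, distortion) pairs traces the operational rate-distortion curve of the learned codec.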


Variational autoencoder (VAE)

A generative model that learns an encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ by maximizing the evidence lower bound (ELBO). From an information-theoretic perspective, the VAE solves a rate-distortion problem with log-loss distortion and a learned codebook.

Related: Rate-distortion theory and the information bottlen…, Information bottleneck, Evidence lower bound (ELBO)

Theorem: VAE as a Rate-Distortion Upper Bound

For data $X \sim P_{\text{data}}$, latent variable $Z$, encoder $q_\phi(Z|X)$, and decoder $p_\theta(X|Z)$:

$$R(D) \;\leq\; I(X; Z) \;\leq\; \mathbb{E}_{P_{\text{data}}}\big[D\big(q_\phi(Z|X) \,\|\, p(Z)\big)\big]$$

where $D = \mathbb{E}\big[-\log p_\theta(X|Z)\big]$ is the expected log-loss distortion.

The left inequality holds because $R(D)$ is defined as the minimum of $I(X; Z)$ over all encoders achieving distortion at most $D$, and $q_\phi$ is one such encoder (Shannon's rate-distortion theorem gives this minimum its operational meaning). The right inequality bounds the true mutual information by the variational KL term — it is tight when the prior $p(Z)$ matches the aggregated posterior $q_\phi(Z) = \mathbb{E}_{P_{\text{data}}}[q_\phi(Z|X)]$.
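The right inequality follows from decomposing the expected KL term around the aggregated posterior:

$$\mathbb{E}_{P_{\text{data}}}\big[D\big(q_\phi(Z|X) \,\|\, p(Z)\big)\big] = I(X; Z) + D\big(q_\phi(Z) \,\|\, p(Z)\big) \;\geq\; I(X; Z),$$

with equality if and only if the prior equals the aggregated posterior.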

The VAE provides a practical (trainable) upper bound on the rate-distortion function. As the encoder and decoder networks become more expressive, the bound tightens and the VAE approaches the fundamental compression limit. The $\beta$-VAE sweeps the rate-distortion curve by varying the weight on the KL term, exactly as the Lagrange multiplier $\beta$ does in the IB.

The Information-Theoretic Lens on Deep Learning

The connections between information theory and deep learning go beyond the VAE:

  • Generalization bounds: Mutual information between training data and learned parameters, $I(S; W)$, bounds generalization error — a direct analogue of the rate-distortion tradeoff where "rate" is model complexity and "distortion" is the generalization gap.

  • Neural network compression: Pruning, quantization, and knowledge distillation are practical rate-distortion problems where the "source" is a large model and the "reconstruction" is a smaller model with bounded accuracy loss.

  • Learned compression: Neural image/video codecs (e.g., those based on hyperprior models) directly optimize the rate-distortion objective with learned encoders and decoders, often outperforming hand-designed codecs like JPEG.

The point is that rate-distortion theory provides the right conceptual framework for understanding why and when compression (of data, models, or representations) helps rather than hurts performance.

Historical Note: The Information Bottleneck: From Physics to Deep Learning

1999-2017

The information bottleneck was introduced by Tishby, Pereira, and Bialek in 1999, building on ideas from statistical physics (the IB Lagrangian has the form of a free energy, with $\beta$ playing the role of inverse temperature). For nearly two decades, it was primarily a tool in computational linguistics and bioinformatics.

The IB gained renewed attention in 2017 when Shwartz-Ziv and Tishby proposed that deep neural networks implicitly perform information bottleneck optimization: early in training, the hidden layers increase $I(T; Y)$ (fitting the data), and later they decrease $I(X; T)$ (compressing the representation). While this "compression phase" hypothesis has been debated and partially refuted for ReLU networks, the IB framework remains a powerful lens for understanding representation learning.

Quick Check

In the information bottleneck, what happens as the Lagrange multiplier $\beta \to 0$?

The representation $T$ becomes trivial (independent of $X$)

The representation $T$ becomes a sufficient statistic for $Y$

The IB reduces to the standard rate-distortion problem

Quick Check

In a VAE, the KL divergence term $D(q_\phi(z|x) \,\|\, p(z))$ serves as a bound on which information-theoretic quantity?

The entropy of the latent code $H(Z)$

The mutual information $I(X; Z)$ between input and latent code

The reconstruction error $\mathbb{E}[-\log p_\theta(x|z)]$

Common Mistake: The IB Compression Phase in Deep Networks

Mistake:

Claiming that all deep neural networks undergo an "information compression phase" during training, where $I(X; T)$ decreases in later epochs. This was the original claim of the Shwartz-Ziv and Tishby (2017) paper.

Correction:

Subsequent work (Saxe et al., 2018) showed that the compression phase depends on the activation function: networks with saturating activations (tanh, sigmoid) do compress, but networks with ReLU activations may not. The IB framework remains valuable for understanding representation learning, but the claim of universal compression during training is not supported for all architectures.

Why This Matters: Neural Compression for Wireless Communications

The rate-distortion perspective on deep learning has practical implications for wireless systems. Learned image and video codecs based on VAE-like architectures now match or exceed traditional codecs (JPEG, H.265) in rate-distortion performance. For wireless transmission:

  • Joint source-channel coding: Neural networks can learn end-to-end mappings from source to channel input, bypassing the separation architecture. This is particularly attractive for low-latency, low-SNR regimes.

  • Semantic communication: Rather than transmitting all bits faithfully, the encoder can learn to transmit only task-relevant information (the information bottleneck applied to communication). See Book telecom, Ch. 20 for semantic communication frameworks.

  • Federated learning compression: Gradient compression in distributed training is a rate-distortion problem where the "source" is the gradient and the "distortion" is the impact on convergence; see the sketch after this list.
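As a concrete instance of the last point, a minimal top-$k$ gradient sparsification sketch (purely illustrative; the function name and the informal bit accounting are assumptions, not a reference scheme):

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude gradient entries; zero the rest.

    Rate: roughly k (index, value) pairs instead of len(grad) dense values.
    Distortion: the squared error ||grad - compressed||^2 (dropped energy).
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of top-k magnitudes
    compressed = np.zeros_like(grad)
    compressed[idx] = grad[idx]
    distortion = float(np.sum((grad - compressed) ** 2))
    return compressed, distortion

g = np.random.default_rng(0).normal(size=1000)
g_hat, d = topk_compress(g, k=50)
print(f"kept 5% of entries, squared error = {d:.2f} (of {np.sum(g**2):.2f})")
```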

⚠️Engineering Note

Neural Image Compression: From Theory to Deployment

Learned image codecs based on VAE architectures now match or exceed traditional codecs (JPEG, HEVC intra) in rate-distortion performance. The encoder and decoder are neural networks (typically convolutional), and the rate is controlled by a learned entropy model (hyperprior).

Key engineering constraints in deployment:

  • Decode latency: Neural decoders on mobile GPUs take 50-200ms per 720p frame, compared to <10ms for hardware HEVC decoders.
  • Rate control: Unlike traditional codecs with precise rate control via QP, neural codecs require retraining or $\beta$-sweeping to change the operating point on the rate-distortion curve.
  • Standardization: MPEG is developing the Neural Network-based Video Coding (NNVC) standard. Interoperability requires standardized decoder architectures and fixed-point inference.
  • Training data bias: Neural codecs trained on natural images may perform poorly on medical images, satellite imagery, or synthetic content.

Practical Constraints
  • Decoder inference latency is 5-20x higher than traditional codecs

  • Rate-distortion operating point requires retraining, not just QP change

  • Fixed-point quantization needed for hardware deployment

  • Domain mismatch between training and deployment data

Key Takeaway

Rate-distortion theory and the information bottleneck provide the conceptual foundation for understanding representation learning: compression (low rate) and relevance (low distortion) must be traded off. The VAE loss is a variational upper bound on the rate-distortion function. These connections are not merely analogies — they are exact mathematical relationships that guide the design of learned compression systems and inform our understanding of what neural networks learn.