Connections to Statistics and Machine Learning

Information Theory Meets Machine Learning

Rate-distortion theory asks: what is the minimum description length for a source at a given quality level? Representation learning asks: what is the most useful compressed representation of data for a downstream task? These are the same question viewed from different angles.

This section explores three deep connections between lossy source coding and modern machine learning: the information bottleneck (a rate-distortion problem where "distortion" is measured by relevance to a target variable), mutual information as a regularizer in representation learning, and the variational autoencoder as a practical rate-distortion code with learned encoder and decoder.

The Information Bottleneck Curve

The information bottleneck tradeoff animated: a dot sweeps along the IB curve from maximum compression ($\beta \to 0$, trivial representation) to maximum relevance ($\beta \to \infty$, sufficient statistic), showing how the Lagrange multiplier $\beta$ controls the compression-relevance tradeoff.

Definition: The Information Bottleneck Method

Given a joint distribution $P_{XY}$, the information bottleneck (IB) seeks a compressed representation $T$ of $X$ that preserves as much information as possible about a target variable $Y$:

$$\min_{P_{T|X}} \; I(X; T) - \beta \cdot I(T; Y)$$

subject to the Markov chain $T - X - Y$.

Here:

  • $I(X; T)$ is the compression cost (how many bits $T$ uses to describe $X$)
  • $I(T; Y)$ is the relevance (how much $T$ tells us about $Y$)
  • $\beta > 0$ is a Lagrange multiplier trading off compression against relevance

The IB curve traces the optimal tradeoff between compression and relevance as $\beta$ varies from 0 (maximum compression, $T$ is trivial) to $\infty$ (maximum relevance, $T = X$).

The IB is a rate-distortion problem where the "distortion" between a source symbol $x$ and a cluster $t$ is $d(x, t) = D\big(P_{Y|X}(\cdot|x) \,\|\, P_{Y|T}(\cdot|t)\big)$ — the loss of predictive power about $Y$ incurred by summarizing $x$ as $t$. The IB functional can be written as:

$$\mathcal{L}_{\text{IB}} = I(X; T) - \beta \cdot I(T; Y) = I(X; T) - \beta \big[ I(X; Y) - I(X; Y \mid T) \big]$$

At $\beta = 1$, the IB reduces to minimizing $I(X; T \mid Y)$ — keep only the information in $X$ that is relevant to $Y$.
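Both rewritings follow from expanding a three-variable mutual information with the chain rule and using the Markov chain $T - X - Y$, which forces $I(T; Y \mid X) = 0$:

$$\begin{aligned}
I(X, T; Y) &= I(X; Y) + I(T; Y \mid X) = I(T; Y) + I(X; Y \mid T) &&\Rightarrow\quad I(T; Y) = I(X; Y) - I(X; Y \mid T), \\
I(X; T, Y) &= I(X; T) + I(X; Y \mid T) = I(X; Y) + I(X; T \mid Y) &&\Rightarrow\quad I(X; T) - I(T; Y) = I(X; T \mid Y).
\end{aligned}$$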

Information bottleneck

An information-theoretic framework for finding compressed representations that preserve task-relevant information. It trades off compression rate $I(X; T)$ against relevance $I(T; Y)$ via a Lagrange multiplier $\beta$.

Related: Rate-distortion theory and the information bottlen…, Representation learning, Sufficient statistics

Theorem: Optimal Information Bottleneck Solution

The optimal conditional distribution $P_{T|X}$ for the information bottleneck satisfies the self-consistent equations:

$$P_{T|X}(t|x) = \frac{P_T(t)}{Z(x, \beta)} \exp\Big(-\beta \, D\big(P_{Y|X}(\cdot|x) \,\|\, P_{Y|T}(\cdot|t)\big)\Big)$$

where $Z(x, \beta)$ is a normalizing constant and:

$$P_T(t) = \sum_x P_X(x) \, P_{T|X}(t|x), \qquad P_{Y|T}(y|t) = \sum_x P_{Y|X}(y|x) \, P_{X|T}(x|t)$$

These equations define a Blahut-Arimoto-type alternating minimization that converges to the optimal IB solution.

The optimal encoding assigns $x$ to cluster $t$ based on how similar the conditional distribution $P_{Y|X=x}$ is to the cluster's average $P_{Y|T=t}$. The similarity is measured by KL divergence — the same quantity that appears in rate-distortion theory. The parameter $\beta$ controls the resolution of the clustering: large $\beta$ creates many fine-grained clusters (high relevance), small $\beta$ creates few coarse clusters (high compression).
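A minimal numerical sketch of this alternating minimization for a discrete source (the function name, toy joint distribution, and smoothing constants are illustrative, not part of the formal result):

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Blahut-Arimoto-style alternating minimization for the discrete IB.

    p_xy : (nx, ny) joint distribution P(X, Y); beta : tradeoff parameter.
    Returns the encoder P(T|X) as an (nx, n_clusters) array.
    """
    rng = np.random.default_rng(seed)
    nx, ny = p_xy.shape
    p_x = p_xy.sum(axis=1)                       # P(X)
    p_y_given_x = p_xy / p_x[:, None]            # P(Y|X)

    # Random soft initialization of the encoder P(T|X)
    p_t_given_x = rng.random((nx, n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                  # P(T)
        # P(Y|T) = sum_x P(Y|X) P(X|T), with P(X|T) from Bayes' rule
        p_xt = p_t_given_x * p_x[:, None]        # joint P(X, T)
        p_y_given_t = (p_xt / np.maximum(p_t, 1e-12)).T @ p_y_given_x
        # KL( P(Y|X=x) || P(Y|T=t) ) for every (x, t) pair
        kl = np.sum(p_y_given_x[:, None, :] *
                    (np.log(p_y_given_x[:, None, :] + 1e-12) -
                     np.log(p_y_given_t[None, :, :] + 1e-12)), axis=2)
        # Self-consistent encoder update, then normalize over t
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x

# Toy usage: 4 source symbols, 2 labels, 2 clusters
p_xy = np.array([[0.20, 0.05], [0.18, 0.07], [0.05, 0.20], [0.07, 0.18]])
print(information_bottleneck(p_xy, n_clusters=2, beta=5.0).round(2))
```

Sweeping $\beta$ and recording $I(X; T)$ and $I(T; Y)$ at convergence traces out the IB curve.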

Example: Gaussian Information Bottleneck

Let $(X, Y)$ be jointly Gaussian with $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y = X + N$, where $N \sim \mathcal{N}(0, \sigma_N^2)$. Compute the information bottleneck curve.

Interactive figure: Information Bottleneck Curve — explore the tradeoff between compression rate $I(X; T)$ and relevance $I(T; Y)$ for the Gaussian case, with the signal-to-noise ratio between $X$ and $Y$ (in dB, default 10) as the adjustable parameter.
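For the scalar jointly Gaussian case, the optimal $T$ can be taken to be a noisy linear function of $X$, which gives a closed-form IB curve. A sketch of evaluating that curve, assuming the standard scalar Gaussian IB expression stated in the docstring (treat it as an assumption to verify against your derivation); the function name and rate grid are illustrative:

```python
import numpy as np

def gaussian_ib_curve(snr_db, rates_nats):
    """Scalar Gaussian IB curve for Y = X + N, assuming the optimal T is a
    noisy linear function of X. With rho^2 = SNR / (1 + SNR) and R = I(X;T)
    in nats:  I(T;Y) = -0.5 * log(1 - rho^2 * (1 - exp(-2R))).
    """
    snr = 10.0 ** (snr_db / 10.0)               # sigma_X^2 / sigma_N^2
    rho2 = snr / (1.0 + snr)                    # squared correlation of X and Y
    R = np.asarray(rates_nats, dtype=float)     # compression rates I(X;T)
    return -0.5 * np.log(1.0 - rho2 * (1.0 - np.exp(-2.0 * R)))   # relevance I(T;Y)

# Trace the curve for the widget's default SNR of 10 dB
R = np.linspace(0.0, 5.0, 6)
print(gaussian_ib_curve(10.0, R))
```

At $R = 0$ the relevance is zero (trivial $T$); as $R \to \infty$ it saturates at $I(X; Y) = \tfrac{1}{2}\log(1 + \mathrm{SNR})$.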

Definition: Variational Autoencoders as Rate-Distortion Codes

A variational autoencoder (VAE) minimizes the loss:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}\big[-\log p_\theta(x|z)\big] + D\big(q_\phi(z|x) \,\|\, p(z)\big)$$

This is precisely a rate-distortion objective:

  • The first term is the distortion: the expected reconstruction error under the learned decoder $p_\theta(x|z)$.
  • The second term is the rate: the KL divergence $D(q_\phi(z|x) \,\|\, p(z))$ bounds the mutual information $I(X; Z)$, measuring how many bits the latent code $Z$ uses to describe $X$.

The VAE loss equals the negative ELBO: $\mathcal{L}_{\text{VAE}} = -\text{ELBO} = -\log p_\theta(x) + D\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$.

The connection to rate-distortion theory is exact: the VAE objective is a variational upper bound on the rate-distortion function, where the "distortion" is log-loss and the "rate" is measured by the KL term. The $\beta$-VAE ($\mathcal{L} = \text{distortion} + \beta \cdot \text{rate}$) directly traces the rate-distortion curve as $\beta$ varies — this is the same parametrization as the information bottleneck.
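A minimal sketch of that objective for a Gaussian encoder and a Bernoulli (logit-output) decoder, written in PyTorch; the function name and tensor shapes are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, logvar, beta=1.0):
    """Rate-distortion form of the beta-VAE objective (per example, in nats).

    distortion: E_q[-log p_theta(x|z)], here Bernoulli log-loss on x in [0, 1]
    rate:       KL( N(mu, diag(exp(logvar))) || N(0, I) ), in closed form
    """
    distortion = F.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=-1)
    rate = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return (distortion + beta * rate).mean(), distortion.mean(), rate.mean()
```

Sweeping `beta` and recording the converged (rate, distortion) pairs traces the operational rate-distortion curve of the learned codec.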


Variational autoencoder (VAE)

A generative model that learns an encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ by maximizing the evidence lower bound (ELBO). From an information-theoretic perspective, the VAE solves a rate-distortion problem with log-loss distortion and a learned codebook.

Related: Rate-distortion theory and the information bottlen…, Information bottleneck, Evidence lower bound (ELBO)

Theorem: VAE as a Rate-Distortion Upper Bound

For data $X \sim P_{\text{data}}$, latent variable $Z$, encoder $q_\phi(Z|X)$, and decoder $p_\theta(X|Z)$:

$$R(D) \;\leq\; I(X; Z) \;\leq\; \mathbb{E}_{P_{\text{data}}}\big[D\big(q_\phi(Z|X) \,\|\, p(Z)\big)\big]$$

where $D = \mathbb{E}\big[-\log p_\theta(X|Z)\big]$ is the expected log-loss distortion.

The left inequality holds because $R(D)$ is defined as the minimum of $I(X; Z)$ over all encoders achieving distortion at most $D$, and $q_\phi$ is one such encoder (Shannon's rate-distortion theorem gives this minimum its operational meaning). The right inequality bounds the true mutual information by the variational KL term — it is tight when the prior $p(Z)$ matches the aggregated posterior $q_\phi(Z) = \mathbb{E}_{P_{\text{data}}}[q_\phi(Z|X)]$.
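The right inequality follows from decomposing the expected KL term around the aggregated posterior:

$$\mathbb{E}_{P_{\text{data}}}\big[D\big(q_\phi(Z|X) \,\|\, p(Z)\big)\big] = I(X; Z) + D\big(q_\phi(Z) \,\|\, p(Z)\big) \;\geq\; I(X; Z),$$

with equality if and only if the prior equals the aggregated posterior.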

The VAE provides a practical (trainable) upper bound on the rate-distortion function. As the encoder and decoder networks become more expressive, the bound tightens and the VAE approaches the fundamental compression limit. The $\beta$-VAE sweeps the rate-distortion curve by varying the weight on the KL term, exactly as the Lagrange multiplier $\beta$ does in the IB.

The Information-Theoretic Lens on Deep Learning

The connections between information theory and deep learning go beyond the VAE:

  • Generalization bounds: Mutual information between training data and learned parameters, $I(S; W)$, bounds generalization error — a direct analogue of the rate-distortion tradeoff where "rate" is model complexity and "distortion" is the generalization gap.

  • Neural network compression: Pruning, quantization, and knowledge distillation are practical rate-distortion problems where the "source" is a large model and the "reconstruction" is a smaller model with bounded accuracy loss.

  • Learned compression: Neural image/video codecs (e.g., those based on hyperprior models) directly optimize the rate-distortion objective with learned encoders and decoders, often outperforming hand-designed codecs like JPEG.

The point is that rate-distortion theory provides the right conceptual framework for understanding why and when compression (of data, models, or representations) helps rather than hurts performance.

Historical Note: The Information Bottleneck: From Physics to Deep Learning

1999-2017

The information bottleneck was introduced by Tishby, Pereira, and Bialek in 1999, building on ideas from statistical physics (the IB Lagrangian has the form of a free energy, with $\beta$ playing the role of inverse temperature). For nearly two decades, it was primarily a tool in computational linguistics and bioinformatics.

The IB gained renewed attention in 2017 when Shwartz-Ziv and Tishby proposed that deep neural networks implicitly perform information bottleneck optimization: early in training, the hidden layers increase $I(T; Y)$ (fitting the data), and later they decrease $I(X; T)$ (compressing the representation). While this "compression phase" hypothesis has been debated and partially refuted for ReLU networks, the IB framework remains a powerful lens for understanding representation learning.

Quick Check

In the information bottleneck, what happens as the Lagrange multiplier $\beta \to 0$?

The representation $T$ becomes trivial (independent of $X$)

The representation $T$ becomes a sufficient statistic for $Y$

The IB reduces to the standard rate-distortion problem

Quick Check

In a VAE, the KL divergence term $D(q_\phi(z|x) \,\|\, p(z))$ serves as a bound on which information-theoretic quantity?

The entropy of the latent code $H(Z)$

The mutual information $I(X; Z)$ between input and latent code

The reconstruction error $\mathbb{E}[-\log p_\theta(x|z)]$

Common Mistake: The IB Compression Phase in Deep Networks

Mistake:

Claiming that all deep neural networks undergo an "information compression phase" during training, where $I(X; T)$ decreases in later epochs. This was the original claim of the Shwartz-Ziv and Tishby (2017) paper.

Correction:

Subsequent work (Saxe et al., 2018) showed that the compression phase depends on the activation function: networks with saturating activations (tanh, sigmoid) do compress, but networks with ReLU activations may not. The IB framework remains valuable for understanding representation learning, but the claim of universal compression during training is not supported for all architectures.

Why This Matters: Neural Compression for Wireless Communications

The rate-distortion perspective on deep learning has practical implications for wireless systems. Learned image and video codecs based on VAE-like architectures now match or exceed traditional codecs (JPEG, H.265) in rate-distortion performance. For wireless transmission:

  • Joint source-channel coding: Neural networks can learn end-to-end mappings from source to channel input, bypassing the separation architecture. This is particularly attractive for low-latency, low-SNR regimes.

  • Semantic communication: Rather than transmitting all bits faithfully, the encoder can learn to transmit only task-relevant information (the information bottleneck applied to communication). See Book telecom, Ch. 20 for semantic communication frameworks.

  • Federated learning compression: Gradient compression in distributed training is a rate-distortion problem where the "source" is the gradient and the "distortion" is the impact on convergence; see the sketch after this list.
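As a concrete instance of the last point, a minimal top-$k$ gradient sparsification sketch (purely illustrative; the function name and the informal bit accounting are assumptions, not a reference scheme):

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude gradient entries; zero the rest.

    Rate: roughly k (index, value) pairs instead of len(grad) dense values.
    Distortion: the squared error ||grad - compressed||^2 (dropped energy).
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of top-k magnitudes
    compressed = np.zeros_like(grad)
    compressed[idx] = grad[idx]
    distortion = float(np.sum((grad - compressed) ** 2))
    return compressed, distortion

g = np.random.default_rng(0).normal(size=1000)
g_hat, d = topk_compress(g, k=50)
print(f"kept 5% of entries, squared error = {d:.2f} (of {np.sum(g**2):.2f})")
```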

⚠️Engineering Note

Neural Image Compression: From Theory to Deployment

Learned image codecs based on VAE architectures now match or exceed traditional codecs (JPEG, HEVC intra) in rate-distortion performance. The encoder and decoder are neural networks (typically convolutional), and the rate is controlled by a learned entropy model (hyperprior).

Key engineering constraints in deployment:

  • Decode latency: Neural decoders on mobile GPUs take 50-200ms per 720p frame, compared to <10ms for hardware HEVC decoders.
  • Rate control: Unlike traditional codecs with precise rate control via QP, neural codecs require retraining or $\beta$-sweeping to change the operating point on the rate-distortion curve.
  • Standardization: MPEG is developing the Neural Network-based Video Coding (NNVC) standard. Interoperability requires standardized decoder architectures and fixed-point inference.
  • Training data bias: Neural codecs trained on natural images may perform poorly on medical images, satellite imagery, or synthetic content.

Practical Constraints
  • Decoder inference latency is 5-20x higher than traditional codecs

  • Rate-distortion operating point requires retraining, not just QP change

  • Fixed-point quantization needed for hardware deployment

  • Domain mismatch between training and deployment data

Key Takeaway

Rate-distortion theory and the information bottleneck provide the conceptual foundation for understanding representation learning: compression (low rate) and relevance (low distortion) must be traded off. The VAE loss is a variational upper bound on the rate-distortion function. These connections are not merely analogies — they are exact mathematical relationships that guide the design of learned compression systems and inform our understanding of what neural networks learn.