The Information Bottleneck
Why the Information Bottleneck?
Suppose we observe a random variable $X$ and want to extract a compact representation $T$ that preserves as much information as possible about a relevant target $Y$. This is the central question of supervised learning, feature extraction, and representation learning. The information bottleneck (IB) framework, introduced by Tishby, Pereira, and Bialek (1999), answers this question using the language of information theory: compress $X$ into $T$ while retaining information about $Y$.
The key point is that the IB sits at the intersection of rate-distortion theory and sufficient statistics. When $\beta \to \infty$, the IB solution converges to a minimal sufficient statistic of $X$ for $Y$. When $\beta$ is finite, we get a controlled tradeoff between compression and relevance, and this tradeoff is precisely the kind of problem information theory was built to solve.
Definition: The Information Bottleneck Lagrangian
Given a joint distribution $p(x, y)$, the information bottleneck seeks a stochastic mapping $p(t \mid x)$ that minimizes the IB Lagrangian
$$\mathcal{L}\big[p(t \mid x)\big] = I(X;T) - \beta\, I(T;Y),$$
where $\beta \ge 0$ is a Lagrange multiplier controlling the tradeoff between compression ($I(X;T)$ small) and relevance ($I(T;Y)$ large), subject to the Markov chain $Y \leftrightarrow X \leftrightarrow T$.
The constraint means that $T$ can only access $Y$ through $X$. By the data processing inequality, $I(T;Y) \le I(X;Y)$, so perfect preservation of information about $Y$ requires $T$ to retain all information in $X$ about $Y$.
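As a concrete numerical companion to this definition, here is a minimal sketch that evaluates $I(X;T)$, $I(T;Y)$, and the IB Lagrangian for a given joint $p(x, y)$ and encoder $p(t \mid x)$ over discrete alphabets. The function names `mutual_information` and `ib_lagrangian` and the toy distribution are illustrative choices, not part of the original text.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits for a joint distribution given as a 2-D array p_joint[a, b]."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    nz = p_joint > 0
    return float((p_joint[nz] * np.log2(p_joint[nz] / (pa * pb)[nz])).sum())

def ib_lagrangian(p_xy, p_t_given_x, beta):
    """Evaluate L = I(X;T) - beta * I(T;Y) under the Markov chain Y - X - T."""
    p_x = p_xy.sum(axis=1)               # p(x)
    p_xt = p_t_given_x * p_x[:, None]    # p(x, t) = p(x) p(t|x)
    p_ty = p_t_given_x.T @ p_xy          # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(p_xt) - beta * mutual_information(p_ty)

# Toy check: uniform binary X, Y = X through a BSC(0.1), identity encoder T = X.
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
print(ib_lagrangian(p_xy, np.eye(2), beta=5.0))  # equals 1 - 5 * I(X;Y) here
```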
Theorem: Self-Consistent Equations for the IB Optimum
The optimal IB mapping satisfies the following self-consistent equations:
$$p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid t)\big)\Big),$$
where $Z(x, \beta)$ is a normalization constant, $p(t) = \sum_x p(x)\, p(t \mid x)$, and $p(y \mid t) = \frac{1}{p(t)} \sum_x p(y \mid x)\, p(t \mid x)\, p(x)$.
The optimal mapping assigns $x$ to cluster $t$ with probability that depends on how similar the conditional $p(y \mid x)$ is to the cluster's "prototype" $p(y \mid t)$, measured by KL divergence. This is exactly the structure of a rate-distortion problem where the distortion is the KL divergence, and the solution has the same exponential tilting form as the Blahut-Arimoto algorithm.
Lagrangian formulation
We begin from the constrained optimization. The IB seeks to minimize $I(X;T) - \beta\, I(T;Y)$ over all conditional distributions $p(t \mid x)$ satisfying the Markov chain $Y \leftrightarrow X \leftrightarrow T$. Expanding the mutual informations using their KL divergence representations:
$$I(X;T) = \sum_x p(x)\, D_{\mathrm{KL}}\big(p(t \mid x)\,\|\,p(t)\big), \qquad I(T;Y) = \sum_t p(t)\, D_{\mathrm{KL}}\big(p(y \mid t)\,\|\,p(y)\big).$$
Functional derivative
Taking the functional derivative of $\mathcal{L}$ with respect to $p(t \mid x)$ and setting it to zero (with a Lagrange multiplier $\lambda(x)$ for the normalization constraint $\sum_t p(t \mid x) = 1$), we obtain:
$$\log \frac{p(t \mid x)}{p(t)} + \beta\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid t)\big) = \tilde{\lambda}(x),$$
where $\tilde{\lambda}(x)$ collects $\lambda(x)$ together with all other terms that do not depend on $t$.
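For completeness, the individual functional derivatives behind this condition can be written out as follows (a standard calculation sketched here; it is not reproduced from the original text). Holding $p(t)$ and $p(y \mid t)$ at their induced values,
$$\frac{\delta I(X;T)}{\delta p(t \mid x)} = p(x) \log \frac{p(t \mid x)}{p(t)}, \qquad \frac{\delta I(T;Y)}{\delta p(t \mid x)} = p(x) \sum_y p(y \mid x) \log \frac{p(y \mid t)}{p(y)},$$
and since $\sum_y p(y \mid x) \log \frac{p(y \mid t)}{p(y)} = D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big) - D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid t)\big)$, the $t$-independent first term is absorbed into $\tilde{\lambda}(x)$.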
Exponentiate and normalize
Exponentiating both sides and recognizing the sum over $t$ as the normalizer $Z(x, \beta)$, we arrive at:
$$p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid t)\big)\Big).$$
The three equations (for $p(t \mid x)$, $p(t)$, and $p(y \mid t)$) are coupled and must be solved self-consistently, which motivates an iterative algorithm.
Definition: The Information Curve
The information curve traces the maximum achievable relevance $I(T;Y)$ as a function of the compression level $R$:
$$I_Y(R) = \max_{p(t \mid x)\,:\ I(X;T) \le R} I(T;Y),$$
subject to the Markov chain $Y \leftrightarrow X \leftrightarrow T$. This curve is concave and non-decreasing, with $I_Y(0) = 0$ and $I_Y(R) \to I(X;Y)$ as $R \to \infty$.
The information curve is the dual object to the rate-distortion function. In rate-distortion theory, we fix a distortion level and minimize rate. Here, we fix a compression level and maximize relevance.
Blahut-Arimoto Algorithm for the Information Bottleneck
Complexity: Each iteration is $O(|\mathcal{X}|\,|\mathcal{T}|\,|\mathcal{Y}|)$. Convergence is guaranteed since $\mathcal{L}$ is bounded below and decreases monotonically, but the rate of convergence depends on $\beta$ and initialization. This is structurally identical to the Blahut-Arimoto algorithm for rate-distortion: the "distortion" here is the KL divergence $D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid t)\big)$. The same alternating minimization pattern we saw in Chapter 6 reappears here.
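Since the algorithm's individual steps are not listed above, here is a minimal sketch of the alternating updates in Python, assuming discrete alphabets and the notation of this section. The function name `ib_blahut_arimoto`, the random initialization, and the stopping tolerance are illustrative choices rather than part of the original presentation.

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_t, beta, n_iter=500, tol=1e-10, seed=0):
    """Iterate the IB self-consistent equations for a discrete joint p(x, y).

    p_xy : array of shape (|X|, |Y|), the joint distribution.
    n_t  : size of the bottleneck alphabet |T|.
    Returns the encoder p(t|x) as an array of shape (|X|, |T|).
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                   # p(x)
    p_y_given_x = p_xy / p_x[:, None]        # p(y|x)

    # Random stochastic initialization of the encoder p(t|x)
    p_t_given_x = rng.random((len(p_x), n_t))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_t_given_x.T @ p_x                              # p(t) = sum_x p(x) p(t|x)
        p_ty = p_t_given_x.T @ p_xy                            # p(t, y)
        p_y_given_t = p_ty / np.maximum(p_t[:, None], 1e-30)   # p(y|t)

        # Distortion d(x, t) = KL( p(y|x) || p(y|t) )
        log_ratio = np.log(np.maximum(p_y_given_x[:, None, :], 1e-30)) \
                  - np.log(np.maximum(p_y_given_t[None, :, :], 1e-30))
        d = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)

        # Exponential-tilting update, then normalize over t
        new = p_t[None, :] * np.exp(-beta * d)
        new /= new.sum(axis=1, keepdims=True)

        if np.max(np.abs(new - p_t_given_x)) < tol:
            return new
        p_t_given_x = new
    return p_t_given_x
```

Running this for a sweep of $\beta$ values and recording $(I(X;T), I(T;Y))$ at each converged encoder traces out an approximation of the information curve defined above.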
Example: Information Bottleneck for a Binary Symmetric Source
Let $X$ be Bernoulli($1/2$), and $Y = X \oplus N$ where $N$ is independent noise with $\Pr(N = 1) = \epsilon$. The bottleneck alphabet is $\mathcal{T} = \{0, 1\}$. Compute the information curve $I(T;Y)$ vs. $I(X;T)$ as $\beta$ varies.
Compute the joint distribution
Since $X$ is uniform, $p(y \mid x) = 1 - \epsilon$ when $y = x$ and $p(y \mid x) = \epsilon$ when $y \ne x$. The mutual information is $I(X;Y) = 1 - H_b(\epsilon)$ bits.
Parametrize the bottleneck mapping
By symmetry, the optimal mapping has the form $p(t = x \mid x) = 1 - q$ and $p(t \ne x \mid x) = q$ for some $q \in [0, 1/2]$. When $q = 0$, $T = X$ (no compression), and when $q = 1/2$, $T$ is independent of $X$ (maximum compression).
Trace the information curve
For each $q$, we compute $I(X;T) = 1 - H_b(q)$ and $I(T;Y) = 1 - H_b(q \ast \epsilon)$, where $q \ast \epsilon = q(1-\epsilon) + (1-q)\epsilon$ (using Mrs. Gerber's Lemma for the BSC composition). Sweeping $q$ from $0$ to $1/2$ traces a concave curve from $(1,\, 1 - H_b(\epsilon))$ down to $(0, 0)$; see the numerical sketch after this example.
Interpretation
The slope of the information curve at any point equals $1/\beta$. At small $\beta$ (heavy compression), only the most informative bits about $Y$ survive. At large $\beta$, the mapping approaches the identity and $I(T;Y) \to I(X;Y)$.
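A minimal numerical sketch of this example follows; the value $\epsilon = 0.1$ and the grid over $q$ are arbitrary illustrative choices, not taken from the original text.

```python
import numpy as np

def hb(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

eps = 0.1                       # crossover probability of the BSC from X to Y
q = np.linspace(0.0, 0.5, 51)   # flip probability of the bottleneck map from X to T

I_XT = 1 - hb(q)                              # compression coordinate I(X;T)
I_TY = 1 - hb(q * (1 - eps) + (1 - q) * eps)  # relevance coordinate I(T;Y)

# (I_XT, I_TY) sweeps the curve from (1, 1 - hb(eps)) at q = 0 down to (0, 0) at q = 1/2.
for a, b in zip(I_XT[::10], I_TY[::10]):
    print(f"I(X;T) = {a:.3f}   I(T;Y) = {b:.3f}")
```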
Information Bottleneck Curve Explorer
Explore the information curve for a binary symmetric channel: adjust the crossover probability $\epsilon$ and observe how the IB tradeoff changes.
Historical Note: The Information Bottleneck: From Physics to Deep Learning
1999–present. Naftali Tishby introduced the information bottleneck in 1999, motivated by ideas from statistical mechanics: the parameter $\beta$ plays the role of an inverse temperature, and the IB Lagrangian is a free energy. For nearly two decades, the IB was primarily a tool for document clustering and computational linguistics. Then in 2017, Shwartz-Ziv and Tishby proposed the "information plane" hypothesis: that deep neural networks learn by first fitting the training data (increasing $I(T;Y)$) and then compressing (decreasing $I(X;T)$). This claim ignited a fierce debate; Saxe et al. (2018) showed that the compression phase depends on the activation function and does not occur with ReLU networks. The debate remains unresolved, but it brought the IB framework into the mainstream of deep learning theory.
Definition: The Information Plane
For a deep neural network with layers $T_1 \to T_2 \to \cdots \to T_L$, the information plane is the two-dimensional plot of $\big(I(X; T_\ell),\, I(T_\ell; Y)\big)$ for each layer $\ell$. By the data processing inequality, as we go deeper (composing more non-invertible transformations), $I(X; T_\ell)$ and $I(T_\ell; Y)$ can only decrease. The hypothesis is that a well-trained network traces a path in the information plane that approaches the information curve $I_Y(R)$.
The difficulty with this framework is that mutual information is notoriously hard to estimate for continuous, high-dimensional representations. Different estimators give qualitatively different results, which is why the debate about compression in deep networks has been so contentious.
Common Mistake: Mutual Information Estimation in High Dimensions
Mistake:
Assuming that mutual information between high-dimensional continuous variables can be accurately estimated from finite samples using binning or KDE.
Correction:
Mutual information estimation in high dimensions is fundamentally hard. The KSG estimator, MINE (mutual information neural estimator), and variational bounds all have significant biases or variances. When interpreting information plane plots, the estimation method matters as much as the network architecture. Always report which estimator was used and its known biases.
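As a small illustration of why this matters (a one-dimensional toy, not the high-dimensional setting, and not taken from the original text): the plug-in histogram estimator below reports clearly positive mutual information for independent samples whose true mutual information is zero, and the bias only worsens with finer binning and higher dimension.

```python
import numpy as np

def binned_mi(x, y, bins=30):
    """Plug-in (histogram) estimate of I(X;Y) in bits from samples."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = rng.standard_normal(2000)   # independent of x, so the true MI is 0
print(binned_mi(x, y))          # prints a value well above 0: pure estimation bias
```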
Quick Check
In the information bottleneck, what happens as $\beta \to \infty$?
$T$ becomes independent of $X$ (maximum compression)
$T$ converges to a minimal sufficient statistic of $X$ for $Y$
$T$ becomes a deterministic function of $X$
The IB Lagrangian becomes unbounded
As $\beta \to \infty$, the relevance term dominates and the optimal $T$ preserves all information about $Y$, which is exactly what a sufficient statistic does.
The IB as Rate-Distortion with KL Distortion
The IB is not merely analogous to rate-distortion theory; it is a rate-distortion problem. Define the distortion measure $d_{\mathrm{IB}}(x, t) = D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y \mid t)\big)$. Then:
$$I(X;T) - \beta\, I(T;Y) = I(X;T) + \beta\, \mathbb{E}\big[d_{\mathrm{IB}}(X, T)\big] - \beta\, I(X;Y).$$
The last term is a constant, so minimizing the IB Lagrangian is equivalent to minimizing $I(X;T) + \beta\, \mathbb{E}\big[d_{\mathrm{IB}}(X, T)\big]$, which is a rate-distortion problem with Lagrange multiplier $\beta$. This is the same structure we studied in Chapter 6; the only difference is the distortion function.
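The identity behind this equivalence is quick to verify (a short check using the Markov chain $Y \leftrightarrow X \leftrightarrow T$, spelled out here for convenience rather than quoted from the original):
$$\mathbb{E}_{p(x,t)}\big[d_{\mathrm{IB}}(X, T)\big] = \sum_{x,t,y} p(x,t)\, p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid t)} = H(Y \mid T) - H(Y \mid X) = I(X;Y) - I(T;Y),$$
where the middle step uses $\sum_x p(x, t)\, p(y \mid x) = p(t, y)$, a consequence of the Markov chain. Substituting $I(T;Y) = I(X;Y) - \mathbb{E}\big[d_{\mathrm{IB}}(X, T)\big]$ into the IB Lagrangian gives the displayed expression.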
The Information Bottleneck Tradeoff
Information Bottleneck (IB)
A framework for finding compressed representations $T$ of $X$ that preserve relevant information about $Y$, formulated as minimizing $I(X;T) - \beta\, I(T;Y)$ subject to the Markov chain $Y \leftrightarrow X \leftrightarrow T$.
Related: The Information Bottleneck Lagrangian, The Information Curve
Information Plane
A two-dimensional visualization plotting $I(T_\ell; Y)$ against $I(X; T_\ell)$ for each layer $\ell$ of a deep neural network, used to study how networks learn representations.
Related: The Information Plane