The Information Bottleneck

Why the Information Bottleneck?

Suppose we observe a random variable $X$ and want to extract a compact representation $T$ that preserves as much information as possible about a relevant target $Y$. This is the central question of supervised learning, feature extraction, and representation learning. The information bottleneck (IB) framework, introduced by Tishby, Pereira, and Bialek (1999), answers this question in the language of information theory: compress $X$ into $T$ while retaining information about $Y$.

The IB sits at the intersection of rate-distortion theory and sufficient statistics. When $\beta \to \infty$, the IB solution converges to a minimal sufficient statistic of $X$ for $Y$. When $\beta$ is finite, we get a controlled tradeoff between compression and relevance, and this tradeoff is precisely the kind of problem information theory was built to solve.

Definition: The Information Bottleneck Lagrangian

Given a joint distribution $P_{X,Y}$, the information bottleneck seeks a stochastic mapping $P_{T|X}$ that minimizes the IB Lagrangian
$$\mathcal{L}_{\text{IB}} = I(X;T) - \beta \, I(T;Y),$$
where $\beta > 0$ is a Lagrange multiplier controlling the tradeoff between compression ($I(X;T)$ small) and relevance ($I(T;Y)$ large), subject to the Markov chain $Y \multimap X \multimap T$.

The constraint $Y \multimap X \multimap T$ means that $T$ can only access $Y$ through $X$. By the data processing inequality, $I(T;Y) \leq I(X;Y)$, so perfect preservation of information about $Y$ requires $T$ to retain all of the information that $X$ carries about $Y$.
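To make the two terms concrete, here is a minimal sketch (assuming numpy; the joint distribution and encoder values are purely illustrative, not from the text) that evaluates $I(X;T)$, $I(T;Y)$, and the Lagrangian, and checks the data processing inequality:

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in bits for a joint distribution given as a 2-D array."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

# Illustrative joint P_{X,Y} (rows x, columns y) and an arbitrary encoder P_{T|X}.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_t_given_x = np.array([[0.9, 0.1],   # rows x, columns t
                        [0.2, 0.8]])

p_x = p_xy.sum(axis=1)
p_xt = p_x[:, None] * p_t_given_x     # joint P_{X,T}
p_ty = p_t_given_x.T @ p_xy           # joint P_{T,Y}: sum_x P(t|x) P(x,y), valid under Y -> X -> T

beta = 2.0
I_xt, I_ty, I_xy = map(mutual_information, (p_xt, p_ty, p_xy))
print(f"I(X;T)={I_xt:.3f}  I(T;Y)={I_ty:.3f}  I(X;Y)={I_xy:.3f}  L_IB={I_xt - beta * I_ty:.3f}")
assert I_ty <= I_xy + 1e-12           # data processing inequality
```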

Theorem: Self-Consistent Equations for the IB Optimum

The optimal IB mapping $P_{T|X}^*$ satisfies the following self-consistent equations:
$$P_{T|X}^*(t|x) = \frac{P_T^*(t)}{Z(x,\beta)} \exp\!\bigl(-\beta \, D\bigl(P_{Y|X=x} \,\|\, P_{Y|T=t}^*\bigr)\bigr),$$
where $Z(x,\beta)$ is a normalization constant, $P_T^*(t) = \sum_x P_X(x)\, P_{T|X}^*(t|x)$, and $P_{Y|T=t}^* = \sum_x P_{Y|X=x}\, P_{X|T=t}^*(x)$.

The optimal mapping assigns $x$ to cluster $t$ with a probability that depends on how similar the conditional $P_{Y|X=x}$ is to the cluster's "prototype" $P_{Y|T=t}^*$, measured by KL divergence. This is exactly the structure of a rate-distortion problem in which the distortion is the KL divergence, and the solution has the same exponential tilting form as the Blahut–Arimoto algorithm.

Definition: The Information Curve

The information curve $\mathcal{I}(I(X;T))$ traces the maximum achievable $I(T;Y)$ as a function of the compression level $I(X;T)$:
$$\mathcal{I}(R) = \max_{P_{T|X}:\, I(X;T) \leq R} I(T;Y)$$
subject to $Y \multimap X \multimap T$. This curve is concave and non-decreasing, with $\mathcal{I}(0) = 0$ and $\mathcal{I}(I(X;Y)) = I(X;Y)$.

The information curve is the dual object to the rate-distortion function. In rate-distortion theory, we fix a distortion level and minimize rate. Here, we fix a compression level and maximize relevance.

Blahut–Arimoto Algorithm for the Information Bottleneck

Complexity: Each iteration is $O(|\mathcal{X}| \cdot |\mathcal{T}| \cdot |\mathcal{Y}|)$. Convergence is guaranteed since $\mathcal{L}_{\text{IB}}$ is bounded below and decreases monotonically, but the rate of convergence depends on $\beta$ and initialization.
Input: Joint distribution $P_{X,Y}$, tradeoff parameter $\beta$, tolerance $\varepsilon$
Output: Optimal mapping $P_{T|X}^*$, bottleneck representation $P_T^*$
1. Initialize $P_{T|X}^{(0)}$ (e.g., random soft assignment to $|\mathcal{T}|$ clusters)
2. Repeat until convergence (change in $\mathcal{L}_{\text{IB}} < \varepsilon$):
a. Compute marginal: $P_T^{(k)}(t) = \sum_x P_X(x)\, P_{T|X}^{(k)}(t|x)$
b. Compute decoder: $P_{Y|T}^{(k)}(y|t) = \frac{1}{P_T^{(k)}(t)} \sum_x P_{Y|X}(y|x)\, P_X(x)\, P_{T|X}^{(k)}(t|x)$
c. Update encoder:
$$P_{T|X}^{(k+1)}(t|x) = \frac{P_T^{(k)}(t)}{Z^{(k)}(x)} \exp\!\bigl(-\beta\, D_{\text{KL}}\bigl(P_{Y|X=x} \,\|\, P_{Y|T=t}^{(k)}\bigr)\bigr)$$
3. Return $P_{T|X}^*$, $P_T^*$, $P_{Y|T}^*$

This is structurally identical to the Blahut–Arimoto algorithm for rate-distortion: the "distortion" here is the KL divergence $D(P_{Y|X=x} \,\|\, P_{Y|T=t})$. The same alternating minimization pattern we saw in Chapter 6 reappears here.
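For concreteness, here is a sketch of these updates in Python (assuming numpy and discrete alphabets; `ib_blahut_arimoto` and its arguments are illustrative names, not from any standard library):

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_t, beta, tol=1e-8, max_iter=5000, seed=0):
    """Alternating IB updates for a discrete joint P_{X,Y} (array of shape |X| x |Y|).

    Assumes P_X(x) > 0 for every x. Returns the encoder P_{T|X},
    the marginal P_T, and the decoder P_{Y|T}.
    """
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # Step 1: random soft initialization of the encoder P_{T|X}.
    p_t_given_x = rng.random((n_x, n_t))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    eps = 1e-300  # guards the logarithm against exact zeros
    for _ in range(max_iter):
        # Step 2a: marginal P_T.
        p_t = p_x @ p_t_given_x
        # Step 2b: decoder P_{Y|T}(y|t) = sum_x P(y|x) P(x) P(t|x) / P(t).
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / p_t[:, None]
        # Step 2c: encoder update, exponentially tilted by the KL "distortion".
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)
        kl = np.einsum('xy,xty->xt', p_y_given_x, log_ratio)
        new = p_t[None, :] * np.exp(-beta * kl)
        new /= new.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(new - p_t_given_x)) < tol
        p_t_given_x = new
        if converged:
            break

    return p_t_given_x, p_t, p_y_given_t
```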

Example: Information Bottleneck for a Binary Symmetric Source

Let $X \in \{0,1\}$ be Bernoulli($1/2$), and $Y = X \oplus N$ where $N \sim \text{Bernoulli}(p)$ is independent noise with $p = 0.1$. The bottleneck alphabet is $\mathcal{T} = \{0, 1\}$. Compute the information curve, $I(T;Y)$ vs. $I(X;T)$, as $\beta$ varies.
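One way to trace this curve is to sweep $\beta$ with the sketch above (again assuming numpy; the exact values depend on the $\beta$ grid and initialization, but the relevance term should approach $I(X;Y) = 1 - h_2(0.1) \approx 0.53$ bits as $\beta$ grows):

```python
import numpy as np

# Joint P_{X,Y}: X ~ Bernoulli(1/2), Y = X xor N with N ~ Bernoulli(0.1).
p = 0.1
p_xy = 0.5 * np.array([[1 - p, p],
                       [p, 1 - p]])

def mi_bits(p_ab):
    p_a, p_b = p_ab.sum(1, keepdims=True), p_ab.sum(0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (p_a @ p_b)[m])))

p_x = p_xy.sum(axis=1)
for beta in (0.5, 1.0, 2.0, 5.0, 20.0):
    enc, _, _ = ib_blahut_arimoto(p_xy, n_t=2, beta=beta)  # from the sketch above
    i_xt = mi_bits(p_x[:, None] * enc)   # compression term I(X;T)
    i_ty = mi_bits(enc.T @ p_xy)         # relevance term I(T;Y)
    print(f"beta={beta:5.1f}   I(X;T)={i_xt:.3f}   I(T;Y)={i_ty:.3f}  bits")
```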

Information Bottleneck Curve Explorer

Explore the information curve for a binary symmetric channel. Adjust the crossover probability $p$ and observe how the IB tradeoff changes.


Historical Note: The Information Bottleneck: From Physics to Deep Learning

1999–present

Naftali Tishby introduced the information bottleneck in 1999, motivated by ideas from statistical mechanics: the parameter $\beta$ plays the role of inverse temperature, and the IB Lagrangian is a free energy. For nearly two decades, the IB was primarily a tool for document clustering and computational linguistics. Then in 2017, Shwartz-Ziv and Tishby proposed the "information plane" hypothesis: that deep neural networks learn by first fitting the training data (increasing $I(T;Y)$) and then compressing (decreasing $I(X;T)$). This claim ignited a fierce debate; Saxe et al. (2018) showed that the compression phase depends on the activation function and does not occur with ReLU networks. The debate remains unresolved, but it brought the IB framework into the mainstream of deep learning theory.


Definition: The Information Plane

For a deep neural network with layers $T_1, T_2, \ldots, T_L$, the information plane is the two-dimensional plot of $(I(X; T_\ell), I(T_\ell; Y))$ for each layer $\ell = 1, \ldots, L$. By the data processing inequality, as we go deeper (composing more non-invertible transformations), $I(X; T_\ell)$ can only decrease. The hypothesis is that a well-trained network traces a path in the information plane that approaches the information curve $\mathcal{I}$.

The difficulty with this framework is that $I(X; T_\ell)$ is notoriously hard to estimate for continuous, high-dimensional representations. Different estimators give qualitatively different results, which is why the debate about compression in deep networks has been so contentious.

Common Mistake: Mutual Information Estimation in High Dimensions

Mistake:

Assuming that mutual information $I(X;T)$ between high-dimensional continuous variables can be accurately estimated from finite samples using binning or KDE.

Correction:

Mutual information estimation in high dimensions is fundamentally hard. The KSG estimator, MINE (mutual information neural estimation), and variational bounds all carry significant bias or variance. When interpreting information plane plots, the estimation method matters as much as the network architecture. Always report which estimator was used and its known biases.
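As a small illustration of this sensitivity (a sketch assuming numpy; the sample size and bin counts are arbitrary choices), a naive histogram plug-in estimator applied to a correlated Gaussian pair, whose true mutual information is $-\tfrac{1}{2}\ln(1-\rho^2)$, drifts upward as the number of bins grows:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 2000
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x, t = samples[:, 0], samples[:, 1]
true_mi = -0.5 * np.log(1.0 - rho**2)   # analytic value, in nats

def binned_mi(x, t, bins):
    """Plug-in MI estimate from a 2-D histogram; biased upward when bins are many."""
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    p = joint / joint.sum()
    p_x, p_t = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (p_x @ p_t)[m])))

print(f"true I(X;T) = {true_mi:.3f} nats")
for bins in (5, 20, 80, 320):
    print(f"bins = {bins:3d}   estimate = {binned_mi(x, t, bins):.3f} nats")
```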

Quick Check

In the information bottleneck, what happens as $\beta \to \infty$?

$T$ becomes independent of $X$ (maximum compression)

$T$ converges to a minimal sufficient statistic of $X$ for $Y$

$T$ becomes a deterministic function of $Y$

The IB Lagrangian becomes unbounded

The IB as Rate-Distortion with KL Distortion

The IB is not merely analogous to rate-distortion theory; it is a rate-distortion problem. Define the distortion measure $d(x, t) = D(P_{Y|X=x} \,\|\, P_{Y|T=t})$. Then:
$$I(X;T) - \beta\, I(T;Y) = I(X;T) + \beta\, \mathbb{E}[d(X,T)] - \beta\, I(X;Y).$$
The last term is a constant, so minimizing the IB Lagrangian is equivalent to minimizing $I(X;T) + \beta\, \mathbb{E}[d(X,T)]$, which is a rate-distortion problem with Lagrange multiplier $\beta$. This is the same structure we studied in Chapter 6; the only difference is the distortion function.
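The identity behind this rewriting is that the expected KL distortion equals the lost relevance; a short derivation, using the Markov chain $Y \multimap X \multimap T$:
$$\begin{aligned}
\mathbb{E}[d(X,T)]
  &= \sum_{x,t,y} P_{X,T}(x,t)\, P_{Y|X}(y|x)\, \log \frac{P_{Y|X}(y|x)}{P_{Y|T}(y|t)} \\
  &= \sum_{x,y} P_{X,Y}(x,y) \log P_{Y|X}(y|x) \;-\; \sum_{t,y} P_{T,Y}(t,y) \log P_{Y|T}(y|t) \\
  &= -H(Y|X) + H(Y|T) \;=\; I(X;Y) - I(T;Y),
\end{aligned}$$
where the second line uses $\sum_t P_{X,T}(x,t) = P_X(x)$ and $\sum_x P_{X,T}(x,t)\, P_{Y|X}(y|x) = P_{T,Y}(t,y)$; the latter holds precisely because of the Markov chain. Substituting $I(T;Y) = I(X;Y) - \mathbb{E}[d(X,T)]$ into $\mathcal{L}_{\text{IB}}$ gives the displayed identity.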

The Information Bottleneck Tradeoff

Animates the IB information curve as the tradeoff parameter $\beta$ sweeps from 0 (maximum compression, $T$ independent of $X$) to infinity (maximum relevance, $T$ a sufficient statistic). The curve traces the Pareto frontier of compression vs. relevance for a binary symmetric source.

Information Bottleneck (IB)

A framework for finding compressed representations that preserve relevant information, formulated as minimizing $I(X;T) - \beta I(T;Y)$ subject to the Markov chain $Y \multimap X \multimap T$.

Related: The Information Bottleneck Lagrangian, The Information Curve

Information Plane

A two-dimensional visualization plotting $I(X;T_\ell)$ against $I(T_\ell;Y)$ for each layer $T_\ell$ of a deep neural network, used to study how networks learn representations.

Related: The Information Plane