The Information Bottleneck

Why the Information Bottleneck?

Suppose we observe a random variable $X$ and want to extract a compact representation $T$ that preserves as much information as possible about a relevant target $Y$. This is the central question of supervised learning, feature extraction, and representation learning. The information bottleneck (IB) framework, introduced by Tishby, Pereira, and Bialek (1999), answers this question in the language of information theory: compress $X$ into $T$ while retaining information about $Y$.

The IB sits at the intersection of rate-distortion theory and sufficient statistics. When $\beta \to \infty$, the IB solution converges to a minimal sufficient statistic of $X$ for $Y$. When $\beta$ is finite, we get a controlled tradeoff between compression and relevance, and this tradeoff is precisely the kind of problem information theory was built to solve.

Definition: The Information Bottleneck Lagrangian

Given a joint distribution $P_{X,Y}$, the information bottleneck seeks a stochastic mapping $P_{T|X}$ that minimizes the IB Lagrangian
$$\mathcal{L}_{\text{IB}} = I(X;T) - \beta \, I(T;Y),$$
where $\beta > 0$ is a Lagrange multiplier controlling the tradeoff between compression ($I(X;T)$ small) and relevance ($I(T;Y)$ large), subject to the Markov chain $Y \multimap X \multimap T$.

The constraint $Y \multimap X \multimap T$ means that $T$ can only access $Y$ through $X$. By the data processing inequality, $I(T;Y) \leq I(X;Y)$, so perfect preservation of information about $Y$ requires $T$ to retain all of the information that $X$ carries about $Y$.
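To make the two terms concrete, here is a minimal sketch (assuming numpy; the joint distribution and encoder values are purely illustrative, not from the text) that evaluates $I(X;T)$, $I(T;Y)$, and the Lagrangian, and checks the data processing inequality:

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in bits for a joint distribution given as a 2-D array."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

# Illustrative joint P_{X,Y} (rows x, columns y) and an arbitrary encoder P_{T|X}.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_t_given_x = np.array([[0.9, 0.1],   # rows x, columns t
                        [0.2, 0.8]])

p_x = p_xy.sum(axis=1)
p_xt = p_x[:, None] * p_t_given_x     # joint P_{X,T}
p_ty = p_t_given_x.T @ p_xy           # joint P_{T,Y}: sum_x P(t|x) P(x,y), valid under Y -> X -> T

beta = 2.0
I_xt, I_ty, I_xy = map(mutual_information, (p_xt, p_ty, p_xy))
print(f"I(X;T)={I_xt:.3f}  I(T;Y)={I_ty:.3f}  I(X;Y)={I_xy:.3f}  L_IB={I_xt - beta * I_ty:.3f}")
assert I_ty <= I_xy + 1e-12           # data processing inequality
```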

Theorem: Self-Consistent Equations for the IB Optimum

The optimal IB mapping $P_{T|X}^*$ satisfies the following self-consistent equations:
$$P_{T|X}^*(t|x) = \frac{P_T^*(t)}{Z(x,\beta)} \exp\!\bigl(-\beta \, D\bigl(P_{Y|X=x} \,\|\, P_{Y|T=t}^*\bigr)\bigr),$$
where $Z(x,\beta)$ is a normalization constant, $P_T^*(t) = \sum_x P_X(x)\, P_{T|X}^*(t|x)$, and $P_{Y|T=t}^* = \sum_x P_{Y|X=x}\, P_{X|T=t}^*(x)$.

The optimal mapping assigns $x$ to cluster $t$ with a probability that depends on how similar the conditional $P_{Y|X=x}$ is to the cluster's "prototype" $P_{Y|T=t}^*$, measured by KL divergence. This is exactly the structure of a rate-distortion problem in which the distortion is the KL divergence, and the solution has the same exponential tilting form as the Blahut–Arimoto algorithm.

Definition: The Information Curve

The information curve $\mathcal{I}(I(X;T))$ traces the maximum achievable $I(T;Y)$ as a function of the compression level $I(X;T)$:
$$\mathcal{I}(R) = \max_{P_{T|X}:\, I(X;T) \leq R} I(T;Y)$$
subject to $Y \multimap X \multimap T$. This curve is concave and non-decreasing, with $\mathcal{I}(0) = 0$ and $\mathcal{I}(I(X;Y)) = I(X;Y)$.

The information curve is the dual object to the rate-distortion function. In rate-distortion theory, we fix a distortion level and minimize rate. Here, we fix a compression level and maximize relevance.

Blahut–Arimoto Algorithm for the Information Bottleneck

Complexity: Each iteration is $O(|\mathcal{X}| \cdot |\mathcal{T}| \cdot |\mathcal{Y}|)$. Convergence is guaranteed since $\mathcal{L}_{\text{IB}}$ is bounded below and decreases monotonically, but the rate of convergence depends on $\beta$ and initialization.
Input: Joint distribution $P_{X,Y}$, tradeoff parameter $\beta$, tolerance $\varepsilon$
Output: Optimal mapping $P_{T|X}^*$, bottleneck representation $P_T^*$
1. Initialize $P_{T|X}^{(0)}$ (e.g., random soft assignment to $|\mathcal{T}|$ clusters)
2. Repeat until convergence (change in $\mathcal{L}_{\text{IB}} < \varepsilon$):
a. Compute marginal: $P_T^{(k)}(t) = \sum_x P_X(x)\, P_{T|X}^{(k)}(t|x)$
b. Compute decoder: $P_{Y|T}^{(k)}(y|t) = \frac{1}{P_T^{(k)}(t)} \sum_x P_{Y|X}(y|x)\, P_X(x)\, P_{T|X}^{(k)}(t|x)$
c. Update encoder:
$$P_{T|X}^{(k+1)}(t|x) = \frac{P_T^{(k)}(t)}{Z^{(k)}(x)} \exp\!\bigl(-\beta\, D_{\text{KL}}\bigl(P_{Y|X=x} \,\|\, P_{Y|T=t}^{(k)}\bigr)\bigr)$$
3. Return $P_{T|X}^*$, $P_T^*$, $P_{Y|T}^*$

This is structurally identical to the Blahut–Arimoto algorithm for rate-distortion: the "distortion" here is the KL divergence $D(P_{Y|X=x} \,\|\, P_{Y|T=t})$. The same alternating minimization pattern we saw in Chapter 6 reappears here.
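For concreteness, here is a sketch of these updates in Python (assuming numpy and discrete alphabets; `ib_blahut_arimoto` and its arguments are illustrative names, not from any standard library):

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_t, beta, tol=1e-8, max_iter=5000, seed=0):
    """Alternating IB updates for a discrete joint P_{X,Y} (array of shape |X| x |Y|).

    Assumes P_X(x) > 0 for every x. Returns the encoder P_{T|X},
    the marginal P_T, and the decoder P_{Y|T}.
    """
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # Step 1: random soft initialization of the encoder P_{T|X}.
    p_t_given_x = rng.random((n_x, n_t))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    eps = 1e-300  # guards the logarithm against exact zeros
    for _ in range(max_iter):
        # Step 2a: marginal P_T.
        p_t = p_x @ p_t_given_x
        # Step 2b: decoder P_{Y|T}(y|t) = sum_x P(y|x) P(x) P(t|x) / P(t).
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / p_t[:, None]
        # Step 2c: encoder update, exponentially tilted by the KL "distortion".
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)
        kl = np.einsum('xy,xty->xt', p_y_given_x, log_ratio)
        new = p_t[None, :] * np.exp(-beta * kl)
        new /= new.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(new - p_t_given_x)) < tol
        p_t_given_x = new
        if converged:
            break

    return p_t_given_x, p_t, p_y_given_t
```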

Example: Information Bottleneck for a Binary Symmetric Source

Let $X \in \{0,1\}$ be Bernoulli($1/2$), and $Y = X \oplus N$ where $N \sim \text{Bernoulli}(p)$ is independent noise with $p = 0.1$. The bottleneck alphabet is $\mathcal{T} = \{0, 1\}$. Compute the information curve, $I(T;Y)$ vs. $I(X;T)$, as $\beta$ varies.
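One way to trace this curve is to sweep $\beta$ with the sketch above (again assuming numpy; the exact values depend on the $\beta$ grid and initialization, but the relevance term should approach $I(X;Y) = 1 - h_2(0.1) \approx 0.53$ bits as $\beta$ grows):

```python
import numpy as np

# Joint P_{X,Y}: X ~ Bernoulli(1/2), Y = X xor N with N ~ Bernoulli(0.1).
p = 0.1
p_xy = 0.5 * np.array([[1 - p, p],
                       [p, 1 - p]])

def mi_bits(p_ab):
    p_a, p_b = p_ab.sum(1, keepdims=True), p_ab.sum(0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (p_a @ p_b)[m])))

p_x = p_xy.sum(axis=1)
for beta in (0.5, 1.0, 2.0, 5.0, 20.0):
    enc, _, _ = ib_blahut_arimoto(p_xy, n_t=2, beta=beta)  # from the sketch above
    i_xt = mi_bits(p_x[:, None] * enc)   # compression term I(X;T)
    i_ty = mi_bits(enc.T @ p_xy)         # relevance term I(T;Y)
    print(f"beta={beta:5.1f}   I(X;T)={i_xt:.3f}   I(T;Y)={i_ty:.3f}  bits")
```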

Information Bottleneck Curve Explorer

Explore the information curve for a binary symmetric channel. Adjust the crossover probability $p$ and observe how the IB tradeoff changes.


Historical Note: The Information Bottleneck: From Physics to Deep Learning

1999–present

Naftali Tishby introduced the information bottleneck in 1999, motivated by ideas from statistical mechanics: the parameter $\beta$ plays the role of inverse temperature, and the IB Lagrangian is a free energy. For nearly two decades, the IB was primarily a tool for document clustering and computational linguistics. Then in 2017, Shwartz-Ziv and Tishby proposed the "information plane" hypothesis: that deep neural networks learn by first fitting the training data (increasing $I(T;Y)$) and then compressing (decreasing $I(X;T)$). This claim ignited a fierce debate; Saxe et al. (2018) showed that the compression phase depends on the activation function and does not occur with ReLU networks. The debate remains unresolved, but it brought the IB framework into the mainstream of deep learning theory.


Definition: The Information Plane

For a deep neural network with layers $T_1, T_2, \ldots, T_L$, the information plane is the two-dimensional plot of $(I(X; T_\ell), I(T_\ell; Y))$ for each layer $\ell = 1, \ldots, L$. By the data processing inequality, as we go deeper (composing more non-invertible transformations), $I(X; T_\ell)$ can only decrease. The hypothesis is that a well-trained network traces a path in the information plane that approaches the information curve $\mathcal{I}$.

The difficulty with this framework is that $I(X; T_\ell)$ is notoriously hard to estimate for continuous, high-dimensional representations. Different estimators give qualitatively different results, which is why the debate about compression in deep networks has been so contentious.

Common Mistake: Mutual Information Estimation in High Dimensions

Mistake:

Assuming that mutual information $I(X;T)$ between high-dimensional continuous variables can be accurately estimated from finite samples using binning or KDE.

Correction:

Mutual information estimation in high dimensions is fundamentally hard. The KSG estimator, MINE (mutual information neural estimation), and variational bounds all carry significant bias or variance. When interpreting information plane plots, the estimation method matters as much as the network architecture. Always report which estimator was used and its known biases.
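As a small illustration of this sensitivity (a sketch assuming numpy; the sample size and bin counts are arbitrary choices), a naive histogram plug-in estimator applied to a correlated Gaussian pair, whose true mutual information is $-\tfrac{1}{2}\ln(1-\rho^2)$, drifts upward as the number of bins grows:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 2000
samples = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x, t = samples[:, 0], samples[:, 1]
true_mi = -0.5 * np.log(1.0 - rho**2)   # analytic value, in nats

def binned_mi(x, t, bins):
    """Plug-in MI estimate from a 2-D histogram; biased upward when bins are many."""
    joint, _, _ = np.histogram2d(x, t, bins=bins)
    p = joint / joint.sum()
    p_x, p_t = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (p_x @ p_t)[m])))

print(f"true I(X;T) = {true_mi:.3f} nats")
for bins in (5, 20, 80, 320):
    print(f"bins = {bins:3d}   estimate = {binned_mi(x, t, bins):.3f} nats")
```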

Quick Check

In the information bottleneck, what happens as $\beta \to \infty$?

$T$ becomes independent of $X$ (maximum compression)

$T$ converges to a minimal sufficient statistic of $X$ for $Y$

$T$ becomes a deterministic function of $Y$

The IB Lagrangian becomes unbounded

The IB as Rate-Distortion with KL Distortion

The IB is not merely analogous to rate-distortion theory; it is a rate-distortion problem. Define the distortion measure $d(x, t) = D(P_{Y|X=x} \,\|\, P_{Y|T=t})$. Then:
$$I(X;T) - \beta\, I(T;Y) = I(X;T) + \beta\, \mathbb{E}[d(X,T)] - \beta\, I(X;Y).$$
The last term is a constant, so minimizing the IB Lagrangian is equivalent to minimizing $I(X;T) + \beta\, \mathbb{E}[d(X,T)]$, which is a rate-distortion problem with Lagrange multiplier $\beta$. This is the same structure we studied in Chapter 6; the only difference is the distortion function.
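The identity behind this rewriting is that the expected KL distortion equals the lost relevance; a short derivation, using the Markov chain $Y \multimap X \multimap T$:
$$\begin{aligned}
\mathbb{E}[d(X,T)]
  &= \sum_{x,t,y} P_{X,T}(x,t)\, P_{Y|X}(y|x)\, \log \frac{P_{Y|X}(y|x)}{P_{Y|T}(y|t)} \\
  &= \sum_{x,y} P_{X,Y}(x,y) \log P_{Y|X}(y|x) \;-\; \sum_{t,y} P_{T,Y}(t,y) \log P_{Y|T}(y|t) \\
  &= -H(Y|X) + H(Y|T) \;=\; I(X;Y) - I(T;Y),
\end{aligned}$$
where the second line uses $\sum_t P_{X,T}(x,t) = P_X(x)$ and $\sum_x P_{X,T}(x,t)\, P_{Y|X}(y|x) = P_{T,Y}(t,y)$; the latter holds precisely because of the Markov chain. Substituting $I(T;Y) = I(X;Y) - \mathbb{E}[d(X,T)]$ into $\mathcal{L}_{\text{IB}}$ gives the displayed identity.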

The Information Bottleneck Tradeoff

Animates the IB information curve as the tradeoff parameter $\beta$ sweeps from 0 (maximum compression, $T$ independent of $X$) to infinity (maximum relevance, $T$ a sufficient statistic). The curve traces the Pareto frontier of compression vs. relevance for a binary symmetric source.

Information Bottleneck (IB)

A framework for finding compressed representations that preserve relevant information, formulated as minimizing $I(X;T) - \beta I(T;Y)$ subject to the Markov chain $Y \multimap X \multimap T$.

Related: The Information Bottleneck Lagrangian, The Information Curve

Information Plane

A two-dimensional visualization plotting $I(X;T_\ell)$ against $I(T_\ell;Y)$ for each layer $T_\ell$ of a deep neural network, used to study how networks learn representations.

Related: The Information Plane