PAC-Bayes and Mutual Information Bounds for Generalization

Why Information Theory for Generalization?

The central question of statistical learning theory is: when does a model that performs well on training data also perform well on unseen data? Classical answers (VC dimension, Rademacher complexity) are combinatorial and often vacuous for overparameterized models like deep networks. Information theory offers a different perspective: the generalization error is bounded by the mutual information between the training data and the learned hypothesis. A learning algorithm that "memorizes" the training data (high mutual information) will generalize poorly, while one that extracts only the essential statistical structure (low mutual information) will generalize well. This is compression, viewed through the lens of learning.

Definition: Generalization Error

Let $S = (Z_1, \ldots, Z_n)$ be a training set of $n$ i.i.d. samples from distribution $P_Z$, and let $W = A(S)$ be the output of a (possibly randomized) learning algorithm $A$. The generalization error is
$$\text{gen}(W, S) = \mathbb{E}_{Z \sim P_Z}[\ell(W, Z)] - \frac{1}{n}\sum_{i=1}^n \ell(W, Z_i)$$
where $\ell(w, z)$ is the loss function. The generalization error measures the gap between the population risk and the empirical risk.

Theorem: Mutual Information Bound on Generalization (Xu-Raginsky)

Let $\ell(w, z) \in [0, 1]$ be a bounded loss. Then the expected generalization error of algorithm $A$ satisfies
$$|\mathbb{E}[\text{gen}(W, S)]| \leq \sqrt{\frac{2\, I(W; S)}{n}}$$
where $I(W; S)$ is the mutual information between the learned hypothesis $W$ and the training set $S$.

If the algorithm's output $W$ is nearly independent of the training data $S$ (low mutual information), then the empirical risk is a good estimator of the population risk. Intuitively, an algorithm with low $I(W; S)$ cannot "overfit" because it has not memorized the specific training samples. This bound makes precise the intuition that generalization requires compression.

Generalization as Compression

The Xu-Raginsky bound reveals a deep connection between learning and source coding. An algorithm that achieves low mutual information $I(W;S)$ is, in effect, compressing the training data into a compact representation $W$. The bound says that the "number of bits" the algorithm extracts from the data (measured by $I(W;S)$) controls generalization. This is the information-theoretic version of Occam's razor: simpler models (fewer bits) generalize better.

Intuitively, if we could describe the learned model in $B$ bits, then $I(W;S) \leq B$ bits, and the generalization error scales as $\sqrt{2B/n}$. This recovers (and generalizes) classical sample complexity bounds.
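To make the bit-counting heuristic concrete, here is a minimal numerical sketch. The bit budget and sample sizes are illustrative assumptions, not values from the text; the description length is converted from bits to nats before plugging into the bound.

```python
import math

def mi_generalization_bound(mi_nats: float, n: int) -> float:
    """Xu-Raginsky bound sqrt(2 * I(W;S) / n) for a loss bounded in [0, 1].

    `mi_nats` is the mutual information I(W;S) measured in nats.
    """
    return math.sqrt(2.0 * mi_nats / n)

# Hypothetical scenario: the learned model can be described in B bits,
# so I(W;S) <= B bits = B * ln(2) nats.
B_bits = 32 * 1000          # e.g. 1000 parameters at 32 bits each (illustrative)
mi = B_bits * math.log(2)   # convert bits to nats

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}   bound = {mi_generalization_bound(mi, n):.4f}")
```

At 32 bits per parameter the bound is vacuous until $n$ is very large, which is one reason compressed or coarsely quantized descriptions of the model are what make such bounds non-vacuous in practice.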

Definition: PAC-Bayes Bound

Let $P$ be a prior distribution over hypotheses (chosen before seeing data) and $Q$ be a posterior distribution (depending on the data $S$). For any $\delta > 0$, with probability at least $1 - \delta$ over $S$:
$$\mathbb{E}_{W \sim Q}[\text{risk}(W)] \leq \mathbb{E}_{W \sim Q}[\hat{\text{risk}}(W, S)] + \sqrt{\frac{D(Q \,\|\, P) + \log(2\sqrt{n}/\delta)}{2n}}$$
where $\text{risk}(W) = \mathbb{E}_{Z}[\ell(W,Z)]$ and $\hat{\text{risk}}(W,S) = \frac{1}{n}\sum_{i=1}^n \ell(W, Z_i)$.
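As a quick numerical sketch (the dimension, variances, empirical risk, and sample size below are illustrative assumptions), the bound is straightforward to evaluate when the prior and posterior are isotropic Gaussians, since the KL divergence then has a closed form:

```python
import numpy as np

def kl_gaussian_isotropic(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) in nats."""
    d = mu_q.shape[0]
    ratio = (sigma_q / sigma_p) ** 2
    return 0.5 * (d * (ratio - 1.0 - np.log(ratio))
                  + np.sum((mu_q - mu_p) ** 2) / sigma_p ** 2)

def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    """McAllester-style bound: emp_risk + sqrt((KL + log(2 sqrt(n)/delta)) / (2n))."""
    return emp_risk + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

# Illustrative numbers (assumptions, not from the text): a 10^4-dimensional
# model whose posterior mean moved slightly away from the prior mean.
d, n = 10_000, 60_000
rng = np.random.default_rng(0)
mu_p = np.zeros(d)                            # prior centered at initialization
mu_q = mu_p + 0.01 * rng.standard_normal(d)   # posterior mean after training
kl = kl_gaussian_isotropic(mu_q, sigma_q=0.05, mu_p=mu_p, sigma_p=0.05)

print(f"KL(Q||P) = {kl:.1f} nats")
print(f"test-risk bound = {pac_bayes_bound(emp_risk=0.02, kl=kl, n=n):.4f}")
```

Centering the prior at the initialization is a common choice in PAC-Bayes analyses, since the KL term then measures how far training moved the posterior.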

The PAC-Bayes bound is a high-probability counterpart of the Xu-Raginsky bound, which holds only in expectation. The KL divergence $D(Q \,\|\, P)$ plays the role of "description complexity": how far the learned posterior has moved from the prior. When $Q$ is close to $P$ (the algorithm has not changed much from its initialization), generalization is guaranteed.

Theorem: Individual-Sample Mutual Information Bound

Under the same setup as the Xu-Raginsky bound, a tighter bound holds:
$$|\mathbb{E}[\text{gen}(W, S)]| \leq \frac{1}{n} \sum_{i=1}^n \sqrt{2\, I(W; Z_i)}$$
where $I(W; Z_i)$ is the mutual information between the output hypothesis and each individual training sample.

This bound is tighter because it captures the idea that some training samples contribute more to the learned model than others. An outlier that the model memorizes contributes a large $I(W; Z_i)$, while a "typical" sample contributes less. The per-sample MI bound distinguishes between these cases, while the global bound $I(W; S)$ averages them.
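For a toy discrete problem where everything can be computed exactly (an illustrative construction, not from the text: the "hypothesis" $W$ is the majority vote of $n$ Bernoulli labels), the per-sample bound can be compared numerically to the global bound:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binary_entropy(q):
    """Entropy of a Bernoulli(q) variable in nats."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

# Toy learning problem (illustrative assumption): n i.i.d. Bernoulli(p) labels,
# and the "learned hypothesis" W is their majority vote.
n, p = 7, 0.7
threshold = (n + 1) // 2                     # majority needs at least this many 1s

p_w1 = sum(binom_pmf(k, n, p) for k in range(threshold, n + 1))
H_W = binary_entropy(p_w1)                   # W is a function of S, so I(W;S) = H(W)

# Per-sample MI: condition on one sample, the majority is decided by the other n-1.
p_w1_given_1 = sum(binom_pmf(k, n - 1, p) for k in range(threshold - 1, n))
p_w1_given_0 = sum(binom_pmf(k, n - 1, p) for k in range(threshold, n))
H_W_given_Zi = p * binary_entropy(p_w1_given_1) + (1 - p) * binary_entropy(p_w1_given_0)
I_W_Zi = H_W - H_W_given_Zi                  # identical for every i by symmetry

global_bound = math.sqrt(2 * H_W / n)
per_sample_bound = sum(math.sqrt(2 * I_W_Zi) for _ in range(n)) / n

print(f"I(W;S)   = {H_W:.4f} nats,  global bound     = {global_bound:.4f}")
print(f"I(W;Z_i) = {I_W_Zi:.4f} nats,  per-sample bound = {per_sample_bound:.4f}")
```

As expected, the per-sample bound comes out smaller, since for i.i.d. samples $\sum_i I(W;Z_i) \leq I(W;S)$.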

Example: Mutual Information of SGD with Gaussian Perturbation

Consider a learning algorithm that runs SGD and then adds Gaussian noise: $W = W_{\text{SGD}} + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$. If the SGD output satisfies $\|W_{\text{SGD}}\|^2 \leq B^2$ almost surely, bound the generalization error using the MI bound.
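One standard route to a bound (sketched here, using only the data-processing inequality and the Gaussian maximum-entropy property): since $S \to W_{\text{SGD}} \to W$ forms a Markov chain, $I(W; S) \leq I(W; W_{\text{SGD}})$. The conditional entropy is $h(W \mid W_{\text{SGD}}) = h(\xi) = \frac{d}{2}\log(2\pi e \sigma^2)$, while $\mathbb{E}\|W\|^2 \leq B^2 + d\sigma^2$ implies $h(W) \leq \frac{d}{2}\log\bigl(2\pi e(\sigma^2 + B^2/d)\bigr)$. Subtracting,
$$I(W; S) \leq \frac{d}{2}\log\Bigl(1 + \frac{B^2}{d\sigma^2}\Bigr), \qquad |\mathbb{E}[\text{gen}(W, S)]| \leq \sqrt{\frac{d}{n}\log\Bigl(1 + \frac{B^2}{d\sigma^2}\Bigr)}.$$
More noise (larger $\sigma^2$) means less mutual information and a smaller bound, at the cost of a noisier model.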

Mutual Information Generalization Bound

Visualize the MI generalization bound as a function of the number of training samples, model dimension, and noise level.

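A minimal sketch of the quantity such a visualization would plot (the parameter values below are placeholders, not the widget's defaults), using the bound derived in the example above:

```python
import math

def mi_bound(n, d, B, sigma):
    """Generalization bound sqrt((d/n) * log(1 + B^2 / (d * sigma^2)))
    for an SGD output of norm at most B followed by N(0, sigma^2 I_d) noise."""
    return math.sqrt((d / n) * math.log(1.0 + B**2 / (d * sigma**2)))

# Placeholder parameter values (assumptions for illustration).
d, B = 100, 1.0
for sigma in (0.05, 0.1, 0.2):
    row = [f"{mi_bound(n, d, B, sigma):.3f}" for n in (1_000, 10_000, 100_000)]
    print(f"sigma = {sigma:>4}:  n=1e3 -> {row[0]},  n=1e4 -> {row[1]},  n=1e5 -> {row[2]}")
```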

Generalization Bounds: Classical vs. Information-Theoretic

| Bound | Complexity Measure | Rate | Strengths | Limitations |
|---|---|---|---|---|
| VC dimension | $d_{\text{VC}}$ (combinatorial) | $O(\sqrt{d_{\text{VC}}/n})$ | Distribution-free, sharp for linear classifiers | Vacuous for overparameterized models |
| Rademacher | $\mathcal{R}_n(\mathcal{F})$ (empirical) | $O(\mathcal{R}_n + \sqrt{\log(1/\delta)/n})$ | Data-dependent, tighter than VC | Hard to compute for deep networks |
| PAC-Bayes | $D_{\text{KL}}(Q \,\Vert\, P)$ | $O(\sqrt{D_{\text{KL}}/n})$ | Non-vacuous for deep networks with the right prior | Requires choosing the prior $P$ before training |
| MI (Xu-Raginsky) | $I(W;S)$ | $O(\sqrt{I(W;S)/n})$ | Algorithm-dependent, captures compression | MI hard to estimate; can be infinite |

Common Mistake: Mutual Information Can Be Infinite for Deterministic Algorithms

Mistake:

Applying the MI generalization bound directly to a deterministic learning algorithm (e.g., standard SGD without noise injection), expecting a finite bound.

Correction:

For a deterministic algorithm, $W = A(S)$ is a function of $S$, so $I(W; S) = H(W)$, which is infinite for continuous-valued $W$. The MI bound is meaningful only for randomized algorithms or when using the conditional MI version with supersample techniques. Alternatively, use the PAC-Bayes bound, which remains finite through the KL divergence to a fixed prior.

Quick Check

A learning algorithm adds Gaussian noise $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$ to its output. If we double $\sigma^2$, what happens to the MI generalization bound?

It doubles

It decreases, roughly by a factor of $\sqrt{2}$ in the low-SNR regime

It stays the same because noise does not affect generalization

It increases because more noise means worse performance

Mutual Information Generalization Bound

Visualizes how the MI generalization bound $\sqrt{2I(W;S)/n}$ decreases with sample size, and how noise injection (reducing $I(W;S)$) shifts the curve downward: the information-theoretic view of regularization.

Why This Matters: Information Theory Meets Statistical Inference

The MI generalization bounds in this section connect deeply to the estimation theory in Book FSI. The Cramér-Rao bound limits parameter estimation accuracy via Fisher information, while the MI bound limits learning accuracy via mutual information. Both are information-theoretic bounds on the cost of extracting structure from noisy data; the difference is that the CRB assumes a known parametric model, while MI bounds are model-free. See Book FSI, Ch. 3 for the Bayesian estimation connections.

Generalization Error

The difference between the population risk (expected loss on new data) and the empirical risk (average loss on training data). Bounded by the mutual information between the learned model and the training set.


PAC-Bayes Bound

A family of generalization bounds that control the gap between train and test error using the KL divergence between a data-dependent posterior and a data-independent prior over hypotheses.
