PAC-Bayes and Mutual Information Bounds for Generalization

Why Information Theory for Generalization?

The central question of statistical learning theory is: when does a model that performs well on training data also perform well on unseen data? Classical answers (VC dimension, Rademacher complexity) are combinatorial and often vacuous for overparameterized models like deep networks. Information theory offers a different perspective: the generalization error is bounded by the mutual information between the training data and the learned hypothesis. A learning algorithm that "memorizes" the training data (high mutual information) will generalize poorly, while one that extracts only the essential statistical structure (low mutual information) will generalize well. This is compression, viewed through the lens of learning.

Definition: Generalization Error

Let $S = (Z_1, \ldots, Z_n)$ be a training set of $n$ i.i.d. samples from distribution $P_Z$, and let $W = A(S)$ be the output of a (possibly randomized) learning algorithm $A$. The generalization error is
$$\text{gen}(W, S) = \mathbb{E}_{Z \sim P_Z}[\ell(W, Z)] - \frac{1}{n}\sum_{i=1}^n \ell(W, Z_i)$$
where $\ell(w, z)$ is the loss function. The generalization error measures the gap between the population risk and the empirical risk.

Theorem: Mutual Information Bound on Generalization (Xu-Raginsky)

Let $\ell(w, z) \in [0, 1]$ be a bounded loss. Then the expected generalization error of algorithm $A$ satisfies
$$|\mathbb{E}[\text{gen}(W, S)]| \leq \sqrt{\frac{2\, I(W; S)}{n}}$$
where $I(W; S)$ is the mutual information between the learned hypothesis $W$ and the training set $S$.

If the algorithm's output $W$ is nearly independent of the training data $S$ (low mutual information), then the empirical risk is a good estimator of the population risk. Intuitively, an algorithm with low $I(W; S)$ cannot "overfit" because it has not memorized the specific training samples. This bound makes precise the intuition that generalization requires compression.

Generalization as Compression

The Xu-Raginsky bound reveals a deep connection between learning and source coding. An algorithm that achieves low mutual information $I(W;S)$ is, in effect, compressing the training data into a compact representation $W$. The bound says that the "number of bits" the algorithm extracts from the data (measured by $I(W;S)$) controls generalization. This is the information-theoretic version of Occam's razor: simpler models (fewer bits) generalize better.

Intuitively, if we could describe the learned model in $B$ bits, then $I(W;S) \leq B$ bits, and the generalization error scales as $\sqrt{2B/n}$. This recovers (and generalizes) classical sample complexity bounds.
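To make the bit-counting heuristic concrete, here is a minimal numerical sketch. The bit budget and sample sizes are illustrative assumptions, not values from the text; the description length is converted from bits to nats before plugging into the bound.

```python
import math

def mi_generalization_bound(mi_nats: float, n: int) -> float:
    """Xu-Raginsky bound sqrt(2 * I(W;S) / n) for a loss bounded in [0, 1].

    `mi_nats` is the mutual information I(W;S) measured in nats.
    """
    return math.sqrt(2.0 * mi_nats / n)

# Hypothetical scenario: the learned model can be described in B bits,
# so I(W;S) <= B bits = B * ln(2) nats.
B_bits = 32 * 1000          # e.g. 1000 parameters at 32 bits each (illustrative)
mi = B_bits * math.log(2)   # convert bits to nats

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}   bound = {mi_generalization_bound(mi, n):.4f}")
```

At 32 bits per parameter the bound is vacuous until $n$ is very large, which is one reason compressed or coarsely quantized descriptions of the model are what make such bounds non-vacuous in practice.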

Definition: PAC-Bayes Bound

Let $P$ be a prior distribution over hypotheses (chosen before seeing data) and $Q$ be a posterior distribution (depending on the data $S$). For any $\delta > 0$, with probability at least $1 - \delta$ over $S$:
$$\mathbb{E}_{W \sim Q}[\text{risk}(W)] \leq \mathbb{E}_{W \sim Q}[\hat{\text{risk}}(W, S)] + \sqrt{\frac{D(Q \,\|\, P) + \log(2\sqrt{n}/\delta)}{2n}}$$
where $\text{risk}(W) = \mathbb{E}_{Z}[\ell(W,Z)]$ and $\hat{\text{risk}}(W,S) = \frac{1}{n}\sum_{i=1}^n \ell(W, Z_i)$.
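As a quick numerical sketch (the dimension, variances, empirical risk, and sample size below are illustrative assumptions), the bound is straightforward to evaluate when the prior and posterior are isotropic Gaussians, since the KL divergence then has a closed form:

```python
import numpy as np

def kl_gaussian_isotropic(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) in nats."""
    d = mu_q.shape[0]
    ratio = (sigma_q / sigma_p) ** 2
    return 0.5 * (d * (ratio - 1.0 - np.log(ratio))
                  + np.sum((mu_q - mu_p) ** 2) / sigma_p ** 2)

def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    """McAllester-style bound: emp_risk + sqrt((KL + log(2 sqrt(n)/delta)) / (2n))."""
    return emp_risk + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

# Illustrative numbers (assumptions, not from the text): a 10^4-dimensional
# model whose posterior mean moved slightly away from the prior mean.
d, n = 10_000, 60_000
rng = np.random.default_rng(0)
mu_p = np.zeros(d)                            # prior centered at initialization
mu_q = mu_p + 0.01 * rng.standard_normal(d)   # posterior mean after training
kl = kl_gaussian_isotropic(mu_q, sigma_q=0.05, mu_p=mu_p, sigma_p=0.05)

print(f"KL(Q||P) = {kl:.1f} nats")
print(f"test-risk bound = {pac_bayes_bound(emp_risk=0.02, kl=kl, n=n):.4f}")
```

Centering the prior at the initialization is a common choice in PAC-Bayes analyses, since the KL term then measures how far training moved the posterior.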

The PAC-Bayes bound is a high-probability counterpart of the Xu-Raginsky bound, which holds only in expectation. The KL divergence $D(Q \,\|\, P)$ plays the role of "description complexity": how far the learned posterior has moved from the prior. When $Q$ is close to $P$ (the algorithm has not changed much from its initialization), generalization is guaranteed.

Theorem: Individual-Sample Mutual Information Bound

Under the same setup as the Xu-Raginsky bound, a tighter bound holds:
$$|\mathbb{E}[\text{gen}(W, S)]| \leq \frac{1}{n} \sum_{i=1}^n \sqrt{2\, I(W; Z_i)}$$
where $I(W; Z_i)$ is the mutual information between the output hypothesis and each individual training sample.

This bound is tighter because it captures the idea that some training samples contribute more to the learned model than others. An outlier that the model memorizes contributes a large $I(W; Z_i)$, while a "typical" sample contributes less. The per-sample MI bound distinguishes between these cases, while the global bound $I(W; S)$ averages them.
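For a toy discrete problem where everything can be computed exactly (an illustrative construction, not from the text: the "hypothesis" $W$ is the majority vote of $n$ Bernoulli labels), the per-sample bound can be compared numerically to the global bound:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binary_entropy(q):
    """Entropy of a Bernoulli(q) variable in nats."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

# Toy learning problem (illustrative assumption): n i.i.d. Bernoulli(p) labels,
# and the "learned hypothesis" W is their majority vote.
n, p = 7, 0.7
threshold = (n + 1) // 2                     # majority needs at least this many 1s

p_w1 = sum(binom_pmf(k, n, p) for k in range(threshold, n + 1))
H_W = binary_entropy(p_w1)                   # W is a function of S, so I(W;S) = H(W)

# Per-sample MI: condition on one sample, the majority is decided by the other n-1.
p_w1_given_1 = sum(binom_pmf(k, n - 1, p) for k in range(threshold - 1, n))
p_w1_given_0 = sum(binom_pmf(k, n - 1, p) for k in range(threshold, n))
H_W_given_Zi = p * binary_entropy(p_w1_given_1) + (1 - p) * binary_entropy(p_w1_given_0)
I_W_Zi = H_W - H_W_given_Zi                  # identical for every i by symmetry

global_bound = math.sqrt(2 * H_W / n)
per_sample_bound = sum(math.sqrt(2 * I_W_Zi) for _ in range(n)) / n

print(f"I(W;S)   = {H_W:.4f} nats,  global bound     = {global_bound:.4f}")
print(f"I(W;Z_i) = {I_W_Zi:.4f} nats,  per-sample bound = {per_sample_bound:.4f}")
```

As expected, the per-sample bound comes out smaller, since for i.i.d. samples $\sum_i I(W;Z_i) \leq I(W;S)$.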

Example: Mutual Information of SGD with Gaussian Perturbation

Consider a learning algorithm that runs SGD and then adds Gaussian noise: $W = W_{\text{SGD}} + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$. If the SGD output satisfies $\|W_{\text{SGD}}\|^2 \leq B^2$ almost surely, bound the generalization error using the MI bound.
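One standard route to a bound (sketched here, using only the data-processing inequality and the Gaussian maximum-entropy property): since $S \to W_{\text{SGD}} \to W$ forms a Markov chain, $I(W; S) \leq I(W; W_{\text{SGD}})$. The conditional entropy is $h(W \mid W_{\text{SGD}}) = h(\xi) = \frac{d}{2}\log(2\pi e \sigma^2)$, while $\mathbb{E}\|W\|^2 \leq B^2 + d\sigma^2$ implies $h(W) \leq \frac{d}{2}\log\bigl(2\pi e(\sigma^2 + B^2/d)\bigr)$. Subtracting,
$$I(W; S) \leq \frac{d}{2}\log\Bigl(1 + \frac{B^2}{d\sigma^2}\Bigr), \qquad |\mathbb{E}[\text{gen}(W, S)]| \leq \sqrt{\frac{d}{n}\log\Bigl(1 + \frac{B^2}{d\sigma^2}\Bigr)}.$$
More noise (larger $\sigma^2$) means less mutual information and a smaller bound, at the cost of a noisier model.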

Mutual Information Generalization Bound

Visualize the MI generalization bound as a function of the number of training samples, model dimension, and noise level.

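A minimal sketch of the quantity such a visualization would plot (the parameter values below are placeholders, not the widget's defaults), using the bound derived in the example above:

```python
import math

def mi_bound(n, d, B, sigma):
    """Generalization bound sqrt((d/n) * log(1 + B^2 / (d * sigma^2)))
    for an SGD output of norm at most B followed by N(0, sigma^2 I_d) noise."""
    return math.sqrt((d / n) * math.log(1.0 + B**2 / (d * sigma**2)))

# Placeholder parameter values (assumptions for illustration).
d, B = 100, 1.0
for sigma in (0.05, 0.1, 0.2):
    row = [f"{mi_bound(n, d, B, sigma):.3f}" for n in (1_000, 10_000, 100_000)]
    print(f"sigma = {sigma:>4}:  n=1e3 -> {row[0]},  n=1e4 -> {row[1]},  n=1e5 -> {row[2]}")
```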

Generalization Bounds: Classical vs. Information-Theoretic

| Bound | Complexity Measure | Rate | Strengths | Limitations |
|---|---|---|---|---|
| VC dimension | $d_{\text{VC}}$ (combinatorial) | $O(\sqrt{d_{\text{VC}}/n})$ | Distribution-free, sharp for linear classifiers | Vacuous for overparameterized models |
| Rademacher | $\mathcal{R}_n(\mathcal{F})$ (empirical) | $O(\mathcal{R}_n + \sqrt{\log(1/\delta)/n})$ | Data-dependent, tighter than VC | Hard to compute for deep networks |
| PAC-Bayes | $D_{\text{KL}}(Q \,\Vert\, P)$ | $O(\sqrt{D_{\text{KL}}/n})$ | Non-vacuous for deep networks with the right prior | Requires choosing the prior $P$ before training |
| MI (Xu-Raginsky) | $I(W;S)$ | $O(\sqrt{I(W;S)/n})$ | Algorithm-dependent, captures compression | MI hard to estimate; can be infinite |

Common Mistake: Mutual Information Can Be Infinite for Deterministic Algorithms

Mistake:

Applying the MI generalization bound directly to a deterministic learning algorithm (e.g., standard SGD without noise injection), expecting a finite bound.

Correction:

For a deterministic algorithm, $W = A(S)$ is a function of $S$, so $I(W; S) = H(W)$, which is infinite for continuous-valued $W$. The MI bound is meaningful only for randomized algorithms or when using the conditional MI version with supersample techniques. Alternatively, use the PAC-Bayes bound, which remains finite through the KL divergence to a fixed prior.

Quick Check

A learning algorithm adds Gaussian noise $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$ to its output. If we double $\sigma^2$, what happens to the MI generalization bound?

It doubles

It decreases, roughly by a factor of $\sqrt{2}$ in the low-SNR regime

It stays the same because noise does not affect generalization

It increases because more noise means worse performance

Mutual Information Generalization Bound

Visualizes how the MI generalization bound $\sqrt{2I(W;S)/n}$ decreases with sample size, and how noise injection (reducing $I(W;S)$) shifts the curve downward: the information-theoretic view of regularization.

Why This Matters: Information Theory Meets Statistical Inference

The MI generalization bounds in this section connect deeply to the estimation theory in Book FSI. The Cramér-Rao bound limits parameter estimation accuracy via Fisher information, while the MI bound limits learning accuracy via mutual information. Both are information-theoretic bounds on the cost of extracting structure from noisy data; the difference is that the CRB assumes a known parametric model, while MI bounds are model-free. See Book FSI, Ch. 3 for the Bayesian estimation connections.

Generalization Error

The difference between the population risk (expected loss on new data) and the empirical risk (average loss on training data). Bounded by the mutual information between the learned model and the training set.


PAC-Bayes Bound

A family of generalization bounds that control the gap between train and test error using the KL divergence between a data-dependent posterior and a data-independent prior over hypotheses.
