PAC-Bayes and Mutual Information Bounds for Generalization
Why Information Theory for Generalization?
The central question of statistical learning theory is: when does a model that performs well on training data also perform well on unseen data? Classical answers (VC dimension, Rademacher complexity) are combinatorial and often vacuous for overparameterized models like deep networks. Information theory offers a different perspective: the generalization error is bounded by the mutual information $I(W; S)$ between the training data $S$ and the learned hypothesis $W$. The point is that a learning algorithm that "memorizes" the training data (high $I(W; S)$) will generalize poorly, while one that extracts only the essential statistical structure (low $I(W; S)$) will generalize well. This is compression, viewed through the lens of learning.
Definition: Generalization Error
Generalization Error
Let $S = (Z_1, \dots, Z_n)$ be a training set of $n$ i.i.d. samples from distribution $\mu$, and let $W = \mathcal{A}(S)$ be the output of a (possibly randomized) learning algorithm $\mathcal{A}$. The generalization error is: $$\mathrm{gen}(\mu, \mathcal{A}) = \mathbb{E}\left[L_\mu(W) - L_S(W)\right], \qquad L_\mu(w) = \mathbb{E}_{Z \sim \mu}[\ell(w, Z)], \quad L_S(w) = \frac{1}{n}\sum_{i=1}^n \ell(w, Z_i),$$ where $\ell(w, z)$ is the loss function. The generalization error measures the gap between the population risk $L_\mu(W)$ and the empirical risk $L_S(W)$.
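To make the definition concrete, here is a minimal Monte Carlo sketch (the toy learner, the squared loss, and the `gen_gap` helper are illustrative assumptions, not part of the text) that estimates the expected gap between population and empirical risk for a learner that outputs the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_gap(n, trials=2000):
    """Monte Carlo estimate of E[L_mu(W) - L_S(W)] for a toy learner.

    Data: Z ~ N(0, 1); hypothesis W = sample mean; loss l(w, z) = (w - z)^2.
    Population risk L_mu(w) = w^2 + 1 (known in closed form for this toy case).
    """
    gaps = []
    for _ in range(trials):
        z = rng.normal(size=n)
        w = z.mean()                   # learned hypothesis
        emp = np.mean((w - z) ** 2)    # empirical risk L_S(W)
        pop = w ** 2 + 1.0             # population risk L_mu(W)
        gaps.append(pop - emp)
    return np.mean(gaps)

print(gen_gap(50))   # ~ 2/n = 0.04: the learner "uses up" information from S
```

The toy learner's expected gap decays on the order of $1/n$; the bounds in this section quantify this kind of behavior for general algorithms.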
Theorem: Mutual Information Bound on Generalization (Xu-Raginsky)
Let $\ell(w, z) \in [a, b]$ be a bounded loss. Then the expected generalization error of algorithm $\mathcal{A}$ satisfies: $$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \;\le\; \sqrt{\frac{(b-a)^2}{2n}\, I(W; S)},$$ where $I(W; S)$ is the mutual information between the learned hypothesis $W$ and the training set $S$.
If the algorithm's output is nearly independent of the training data (low mutual information), then the empirical risk is a good estimator of the population risk. Intuitively, an algorithm with low $I(W; S)$ cannot "overfit" because it has not memorized the specific training samples. This bound makes precise the intuition that generalization requires compression.
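As a quick numeric sanity check, the bound is a one-line formula once $I(W; S)$, $n$, and the loss range are fixed (a minimal sketch; the helper name and the example numbers are illustrative assumptions):

```python
import math

def xu_raginsky_bound(mi_nats: float, n: int, loss_range: float = 1.0) -> float:
    """Bound sqrt(loss_range^2 * I(W;S) / (2n)) on the expected generalization gap.

    mi_nats    : mutual information I(W; S) in nats
    n          : number of i.i.d. training samples
    loss_range : b - a for a loss bounded in [a, b]
    """
    return math.sqrt(loss_range**2 * mi_nats / (2 * n))

# Example: an algorithm that extracts ~1000 nats from 100k samples of a [0,1]-bounded loss.
print(xu_raginsky_bound(mi_nats=1000.0, n=100_000))  # ~0.07
```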
Decoupling via change of measure
Let $P_{W,S}$ be the joint distribution of the hypothesis and the data, and $P_W \otimes P_S$ the product of marginals, so that $I(W; S) = D(P_{W,S} \,\|\, P_W \otimes P_S)$. Define $f(w, s) = \lambda\,\big(L_\mu(w) - L_s(w)\big)$ for a free parameter $\lambda > 0$. By the Donsker-Varadhan variational representation of KL divergence: $$I(W; S) \;\ge\; \mathbb{E}_{P_{W,S}}\big[f(W, S)\big] \;-\; \log \mathbb{E}_{P_W \otimes P_S}\big[e^{f(\bar{W}, S)}\big],$$ where $\bar{W} \sim P_W$ is drawn independently of $S$.
Bound the cumulant generating function
Under the product measure $P_W \otimes P_S$, the hypothesis $\bar{W}$ is independent of $S$. Therefore $\mathbb{E}\big[L_S(\bar{W})\big] = \mathbb{E}\big[L_\mu(\bar{W})\big]$ (the empirical risk is an unbiased estimator of the population risk when evaluated on an independent hypothesis). By Hoeffding's lemma, for $\ell(w, z) \in [a, b]$: $$\log \mathbb{E}_{P_W \otimes P_S}\Big[e^{\lambda\big(L_\mu(\bar{W}) - L_S(\bar{W})\big)}\Big] \;\le\; \frac{\lambda^2 (b-a)^2}{8n}.$$
Apply the variational bound
Since $I(W; S) \ge \lambda\,\mathbb{E}\big[L_\mu(W) - L_S(W)\big] - \frac{\lambda^2 (b-a)^2}{8n}$, we get for any $\lambda > 0$: $$\mathbb{E}\big[L_\mu(W) - L_S(W)\big] \;\le\; \frac{I(W; S)}{\lambda} + \frac{\lambda (b-a)^2}{8n}.$$ Optimizing over $\lambda$ (setting $\lambda = \sqrt{8 n\, I(W; S)/(b-a)^2}$) yields: $$\mathbb{E}\big[L_\mu(W) - L_S(W)\big] \;\le\; \sqrt{\frac{(b-a)^2}{2n}\, I(W; S)}.$$ The same argument applied to $-f$ gives the two-sided bound.
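For completeness, the optimization over $\lambda$ is elementary calculus; writing $c = (b-a)^2/(8n)$, a worked version of the step reads:

```latex
% Minimize g(\lambda) = I(W;S)/\lambda + c\lambda over \lambda > 0, with c = (b-a)^2/(8n).
\[
  g'(\lambda) = -\frac{I(W;S)}{\lambda^2} + c = 0
  \quad\Longrightarrow\quad
  \lambda^\star = \sqrt{\frac{I(W;S)}{c}},
\]
\[
  g(\lambda^\star) = 2\sqrt{c\, I(W;S)}
  = 2\sqrt{\frac{(b-a)^2}{8n}\, I(W;S)}
  = \sqrt{\frac{(b-a)^2}{2n}\, I(W;S)}.
\]
```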
Generalization as Compression
The Xu-Raginsky bound reveals a deep connection between learning and source coding. An algorithm that achieves low mutual information is, in effect, compressing the training data $S$ into a compact representation $W$. The bound says that the "number of bits" the algorithm extracts from the data (measured by $I(W; S)$) controls generalization. This is the information-theoretic version of Occam's razor: simpler models (fewer bits) generalize better.
Intuitively, what happens is this: if we could describe the learned model in $k$ bits, then $I(W; S) \le k \ln 2$, and the generalization error scales as $O\big(\sqrt{k/n}\big)$. This recovers (and generalizes) classical sample complexity bounds.
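A short way to see the claim: a $k$-bit description means $W$ takes at most $2^k$ values, so (under the bounded-loss theorem above):

```latex
\[
  I(W; S) \;\le\; H(W) \;\le\; k \ln 2
  \quad\Longrightarrow\quad
  \left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right|
  \;\le\; (b - a)\sqrt{\frac{k \ln 2}{2n}}
  \;=\; O\!\left(\sqrt{k/n}\right).
\]
```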
Definition: PAC-Bayes Bound
PAC-Bayes Bound
Let $P$ be a prior distribution over hypotheses (chosen before seeing data) and $Q$ be a posterior distribution (depending on data $S$). For a loss bounded in $[0, 1]$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over $S$: $$\mathbb{E}_{W \sim Q}\big[L_\mu(W)\big] \;\le\; \mathbb{E}_{W \sim Q}\big[L_S(W)\big] + \sqrt{\frac{D(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}},$$ where $L_\mu(w) = \mathbb{E}_{Z \sim \mu}[\ell(w, Z)]$ is the population risk and $L_S(w) = \frac{1}{n}\sum_{i=1}^n \ell(w, Z_i)$ is the empirical risk.
The PAC-Bayes bound is the individual-sequence (high-probability) counterpart of the Xu-Raginsky bound. The KL divergence $D(Q \,\|\, P)$ plays the role of "description complexity": how far the learned posterior has moved from the prior. When $Q$ is close to $P$ (the algorithm has not changed much from its initialization), generalization is guaranteed.
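A common way to instantiate the bound is with Gaussian prior and posterior over the weights, where the KL divergence has a closed form. The sketch below is illustrative only (the helper names and the example numbers are assumptions; the bound form is the McAllester-style statement above):

```python
import math
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) between isotropic Gaussians N(mu_q, sigma_q^2 I) and N(mu_p, sigma_p^2 I)."""
    d = mu_q.size
    return (d * math.log(sigma_p / sigma_q)
            + (d * sigma_q**2 + np.sum((mu_q - mu_p)**2)) / (2 * sigma_p**2)
            - d / 2)

def mcallester_bound(emp_risk, kl, n, delta=0.05):
    """Empirical Q-risk plus the PAC-Bayes complexity term, loss assumed in [0, 1]."""
    return emp_risk + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Example: prior centred at initialization, posterior centred at trained weights (toy numbers).
w_init = np.zeros(1000)         # prior mean, fixed before seeing data
w_trained = w_init + 0.05       # posterior mean after training
kl = kl_gaussians(w_trained, 0.1, w_init, 0.1)
print(mcallester_bound(emp_risk=0.02, kl=kl, n=50_000))
```

The design choice that makes such bounds non-vacuous in practice is keeping the posterior close to the prior, for example by centring the prior at the initialization, exactly the intuition stated above.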
Theorem: Individual-Sample Mutual Information Bound
Under the same setup as the Xu-Raginsky bound, a tighter bound holds: $$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \;\le\; \frac{1}{n} \sum_{i=1}^n \sqrt{\frac{(b-a)^2}{2}\, I(W; Z_i)},$$ where $I(W; Z_i)$ is the mutual information between the output hypothesis and each individual training sample.
This bound is tighter because it captures the idea that some training samples contribute more to the learned model than others. An outlier that the model memorizes contributes a large $I(W; Z_i)$, while a "typical" sample contributes less. The per-sample MI bound distinguishes between these cases, while the global bound averages them.
Decompose via the chain rule
Write $I(W; S) = \sum_{i=1}^n I(W; Z_i \mid Z_1, \dots, Z_{i-1})$. This is the chain rule of mutual information.
Apply the single-sample bound
For each sample $Z_i$, we can apply the same Donsker-Varadhan argument to the pair $(W, Z_i)$, which gives $\big|\mathbb{E}\big[L_\mu(W) - \ell(W, Z_i)\big]\big| \le \sqrt{\tfrac{(b-a)^2}{2}\, I(W; Z_i)}$; averaging over $i$ yields the stated bound. Because the samples are i.i.d., the conditional MI satisfies $I(W; Z_i) \le I(W; Z_i \mid Z_1, \dots, Z_{i-1})$ (dropping the conditioning on independent components can only decrease MI), so $\sum_i I(W; Z_i) \le I(W; S)$. Applying the square-root bound to each term and using Jensen's inequality then shows the per-sample bound is never looser than the global Xu-Raginsky bound.
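The comparison with the global bound is a two-inequality chain; writing $c = (b-a)^2/2$:

```latex
\[
  \frac{1}{n}\sum_{i=1}^n \sqrt{c\, I(W; Z_i)}
  \;\le\;
  \sqrt{c \cdot \frac{1}{n}\sum_{i=1}^n I(W; Z_i)}
  \;\le\;
  \sqrt{\frac{c}{n}\, I(W; S)}
  \;=\;
  \sqrt{\frac{(b-a)^2}{2n}\, I(W; S)}.
\]
% First step: Jensen's inequality for the concave square root.
% Second step: \sum_i I(W; Z_i) \le I(W; S) for i.i.d. samples.
```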
Example: Mutual Information of SGD with Gaussian Perturbation
Consider a learning algorithm that runs SGD and then adds Gaussian noise: $W = W_{\mathrm{SGD}} + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$ is independent of the data. If the SGD output satisfies $\|W_{\mathrm{SGD}}\| \le B$ almost surely, bound the generalization error using the MI bound.
Bound the mutual information
Since $W = W_{\mathrm{SGD}} + \xi$ with independent Gaussian noise $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$, we have: $$I(W; S) \;\le\; I(W; W_{\mathrm{SGD}}) \;\le\; \frac{d}{2} \log\left(1 + \frac{B^2}{d\sigma^2}\right).$$ The first inequality uses data processing ($S \to W_{\mathrm{SGD}} \to W$ is a Markov chain; $W$ is a noisy version of $W_{\mathrm{SGD}}$). The second uses the fact that for a bounded variable perturbed by Gaussian noise, the MI is at most the capacity of a $d$-dimensional AWGN channel with SNR $B^2/(d\sigma^2)$.
Apply the Xu-Raginsky bound
Plugging this into the Xu-Raginsky bound gives: $$\left|\mathbb{E}\left[L_\mu(W) - L_S(W)\right]\right| \;\le\; \sqrt{\frac{(b-a)^2\, d}{4n}\,\log\left(1 + \frac{B^2}{d\sigma^2}\right)}.$$ The bound shrinks as the noise level $\sigma^2$ increases (more noise means less information about $S$ retained in $W$), and the dimension $d$ of the model appears naturally: more parameters require more samples, as expected.
Interpretation
This result connects to the AWGN channel capacity $C = \frac{1}{2}\log(1 + \mathrm{SNR})$. The learning algorithm is essentially communicating through a Gaussian channel with capacity $\frac{d}{2}\log\left(1 + \frac{B^2}{d\sigma^2}\right)$ bits per training run. The generalization bound says you need $n \gg d \log\left(1 + \frac{B^2}{d\sigma^2}\right)$ samples to generalize.
Mutual Information Generalization Bound
Visualize the MI generalization bound as a function of the number of training samples, model dimension, and noise level.
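For a static version of that visualization, a small script can plot the bound from the SGD-with-Gaussian-noise example (a sketch under the assumptions above: loss range $1$, $\|W_{\mathrm{SGD}}\| \le B$; the parameter values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def mi_gen_bound(n, d, sigma, B=1.0, loss_range=1.0):
    """Bound sqrt(loss_range^2 * I / (2n)), with I = (d/2) * ln(1 + B^2 / (d * sigma^2))."""
    mi = 0.5 * d * np.log1p(B**2 / (d * sigma**2))
    return np.sqrt(loss_range**2 * mi / (2 * n))

n = np.logspace(2, 6, 200)                       # number of training samples
for d, sigma in [(10, 0.1), (1000, 0.1), (1000, 1.0)]:
    plt.loglog(n, mi_gen_bound(n, d, sigma), label=f"d={d}, sigma={sigma}")
plt.xlabel("training samples n")
plt.ylabel("MI generalization bound")
plt.legend()
plt.show()
```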
Generalization Bounds: Classical vs. Information-Theoretic
| Bound | Complexity Measure | Rate | Strengths | Limitations |
|---|---|---|---|---|
| VC dimension | $d_{\mathrm{VC}}$ (combinatorial) | $O\big(\sqrt{d_{\mathrm{VC}}/n}\big)$ | Distribution-free, sharp for linear classifiers | Vacuous for overparameterized models |
| Rademacher | $\hat{\mathfrak{R}}_n(\mathcal{H})$ (empirical) | $O\big(\hat{\mathfrak{R}}_n(\mathcal{H})\big)$ | Data-dependent, tighter than VC | Hard to compute for deep networks |
| PAC-Bayes | $D(Q \,\Vert\, P)$ | $O\big(\sqrt{D(Q \,\Vert\, P)/n}\big)$ | Non-vacuous for deep networks with right prior | Requires choosing prior before training |
| MI (Xu-Raginsky) | $I(W; S)$ | $O\big(\sqrt{I(W; S)/n}\big)$ | Algorithm-dependent, captures compression | MI hard to estimate; can be infinite |
Common Mistake: Mutual Information Can Be Infinite for Deterministic Algorithms
Mistake:
Applying the MI generalization bound directly to a deterministic learning algorithm (e.g., standard SGD without noise injection), expecting a finite bound.
Correction:
For a deterministic algorithm, $W$ is a function of $S$, so $I(W; S) = H(W)$, which is infinite for continuous-valued $W$. The MI bound is meaningful only for randomized algorithms or when using the conditional MI version with supersample techniques. Alternatively, use the PAC-Bayes bound, which remains finite through the KL divergence to a fixed prior.
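The failure mode is easy to see numerically: with output noise, the bound from the SGD example stays finite, but it grows without limit as $\sigma \to 0$ and the algorithm becomes deterministic (a sketch reusing the illustrative bound from that example; the helper name is an assumption):

```python
import numpy as np

def noisy_output_bound(n, d, sigma, B=1.0):
    """Bound sqrt(I / (2n)) with I = (d/2) * ln(1 + B^2 / (d * sigma^2)), loss in [0, 1]."""
    return np.sqrt(0.5 * d * np.log1p(B**2 / (d * sigma**2)) / (2 * n))

n, d = 10_000, 100
for sigma in [1.0, 1e-2, 1e-4, 1e-8, 0.0]:
    try:
        print(f"sigma = {sigma:g}: bound = {noisy_output_bound(n, d, sigma):.3f}")
    except ZeroDivisionError:
        print("sigma = 0: bound is infinite (deterministic output, I(W; S) diverges)")
# The bound degrades only logarithmically as sigma shrinks, but in the deterministic
# limit it is no longer finite, which is exactly the mistake described above.
```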
Quick Check
A learning algorithm adds Gaussian noise $\mathcal{N}(0, \sigma^2 I_d)$ to its output. If we double $\sigma$, what happens to the MI generalization bound?
It doubles
It decreases: for large SNR, the MI term drops by roughly $d \ln 2$ nats
It stays the same because noise does not affect generalization
It increases because more noise means worse performance
For large $B^2/(d\sigma^2)$, $I(W; S) \lesssim \frac{d}{2}\log\frac{B^2}{d\sigma^2}$, and doubling $\sigma$ reduces this by $d \ln 2$ nats, shrinking the bound.
Why This Matters: Information Theory Meets Statistical Inference
The MI generalization bounds in this section connect deeply to the estimation theory in Book FSI. The Cramér-Rao bound limits parameter estimation accuracy via Fisher information, while the MI bound limits learning accuracy via mutual information. Both are information-theoretic lower bounds on the cost of extracting structure from noisy data; the difference is that the CRB assumes a known parametric model, while MI bounds are model-free. See Book FSI, Ch. 3 for the Bayesian estimation connections.
Generalization Error
The difference between the population risk (expected loss on new data) and the empirical risk (average loss on training data). Bounded by the mutual information between the learned model and the training set.
PAC-Bayes Bound
A family of generalization bounds that control the gap between train and test error using the KL divergence between a data-dependent posterior and a data-independent prior over hypotheses.