References & Further Reading

References

  1. N. Tishby, F. C. Pereira, and W. Bialek, The Information Bottleneck Method, 1999

    The foundational paper introducing the IB framework. Derives the self-consistent equations and the connection to rate-distortion theory (the objective and one of the self-consistent equations are restated after this list).

  2. R. Shwartz-Ziv and N. Tishby, Opening the Black Box of Deep Neural Networks via Information, 2017

    Proposes the information plane hypothesis for deep learning: networks first fit the training data, then compress their internal representations. Sparked intense debate about the role of compression in deep learning.

  3. A. Xu and M. Raginsky, Information-Theoretic Analysis of Generalization Capability of Learning Algorithms, 2017

    Derives the mutual information bound on generalization error, connecting learning theory to information theory via the Donsker-Varadhan representation.

  4. Y. Bu, S. Zou, and V. V. Veeravalli, Tightening Mutual Information-Based Bounds on Generalization Error, 2020

    Introduces the individual-sample MI bound, which is tighter than the Xu-Raginsky bound because it captures per-sample memorization; both bounds are restated after this list.

  5. Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, Information-Theoretic Lower Bounds for Distributed Statistical Estimation with Communication Constraints, 2013

    Establishes the minimax lower bounds for distributed estimation under communication constraints, showing the phase transition between statistical and communication regimes.

  6. A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan, Distributed Mean Estimation with Limited Communication, 2017

    Designs practical quantization schemes for distributed mean estimation and proves near-optimal communication complexity (a toy stochastic quantizer is sketched after this list).

  7. B. Nazer and M. Gastpar, Computation over Multiple-Access Channels, 2007

    Foundational work on function computation over the MAC. Introduces the computation capacity and characterizes it for structured functions.

  8. H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, 2017

    Introduces the FedAvg algorithm and the federated learning framework (the aggregation step is sketched after this list). The paper that launched modern federated learning research.

  9. T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 2nd ed., 2006

    The standard reference for information theory. Chapters on rate-distortion theory and the data processing inequality are prerequisites for this chapter.

  10. A. El Gamal and Y.-H. Kim, Network Information Theory, Cambridge University Press, 2011

    Comprehensive treatment of multi-user information theory including the MAC capacity region and distributed source coding, which are foundations for the AirComp analysis.
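
For item 1, a minimal restatement of what the self-consistent equations characterize (standard textbook form in our notation, not a quote from the paper): the representation T is chosen to compress X while preserving information about Y,

    \min_{p(t \mid x)} \; I(X;T) - \beta\, I(T;Y), \qquad \beta \ge 0,

and the stationary points satisfy, among the other self-consistent conditions,

    p(t \mid x) = \frac{p(t)}{Z(x,\beta)} \exp\!\Bigl(-\beta\, D_{\mathrm{KL}}\bigl(p(y \mid x)\,\Vert\,p(y \mid t)\bigr)\Bigr).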
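
For items 3 and 4, the two bounds can be restated as follows, assuming the loss is \sigma-sub-Gaussian and S = (Z_1, \dots, Z_n) is the i.i.d. training sample (notation is ours):

    \bigl|\mathbb{E}[\mathrm{gen}(W,S)]\bigr| \le \sqrt{\tfrac{2\sigma^2}{n}\, I(W;S)} \qquad \text{(Xu-Raginsky)}

    \bigl|\mathbb{E}[\mathrm{gen}(W,S)]\bigr| \le \frac{1}{n} \sum_{i=1}^{n} \sqrt{2\sigma^2\, I(W;Z_i)} \qquad \text{(Bu-Zou-Veeravalli)}

The second bound is never larger than the first: for independent samples \sum_i I(W;Z_i) \le I(W;S), and the square root is concave.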
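
For item 6, a toy version of the kind of unbiased stochastic quantizer the paper analyzes; the fixed [-1, 1] clipping range, the uniform grid, and the function names are our simplifying assumptions, not the paper's scheme.

    import numpy as np

    def stochastic_quantize(x, num_levels, lo=-1.0, hi=1.0):
        """Unbiased stochastic rounding of x onto num_levels evenly spaced
        points in [lo, hi]; the expected output equals clip(x, lo, hi)."""
        x = np.clip(x, lo, hi)
        scaled = (x - lo) / (hi - lo) * (num_levels - 1)   # position on the grid
        floor = np.floor(scaled)
        up = np.random.random(x.shape) < (scaled - floor)  # round up w.p. the fractional part
        return lo + (floor + up) * (hi - lo) / (num_levels - 1)

    def distributed_mean(client_vectors, bits_per_coord=4):
        """Each client sends a quantized vector; the server averages them."""
        levels = 2 ** bits_per_coord
        return np.mean([stochastic_quantize(v, levels) for v in client_vectors], axis=0)

    # 100 clients holding 1000-dimensional vectors in [-1, 1].
    np.random.seed(0)
    clients = [np.random.uniform(-1, 1, size=1000) for _ in range(100)]
    error = distributed_mean(clients) - np.mean(clients, axis=0)
    print("mean-squared error:", np.mean(error ** 2))

Because each client's quantization error has zero mean, averaging across clients shrinks the mean-squared error roughly as one over the number of clients, which is the trade-off the paper quantifies against the per-coordinate bit budget.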
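
For item 8, a minimal sketch of the server-side FedAvg aggregation step; weights are treated as a single flat array, client sampling is omitted, and local_update is a placeholder for local training (several epochs of minibatch SGD in the paper).

    import numpy as np

    def fedavg_round(global_weights, client_datasets, local_update):
        """One FedAvg communication round: each client trains locally starting
        from the current global weights, and the server returns the average of
        the client weights, weighted by local dataset size."""
        updates, sizes = [], []
        for data in client_datasets:
            updates.append(local_update(np.copy(global_weights), data))
            sizes.append(len(data))
        total = sum(sizes)
        return sum((n / total) * w for w, n in zip(updates, sizes))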

Further Reading

The intersection of information theory and machine learning is a rapidly growing field. These resources provide deeper coverage of specific topics.

  • Information-theoretic generalization bounds

    T. Steinke and L. Zakynthinou, "Reasoning About Generalization via Conditional Mutual Information," COLT 2020. A comprehensive treatment of tighter MI-based bounds.

    Extends the Xu-Raginsky framework with conditional MI and supersample techniques, which later work has built on to obtain non-vacuous bounds for deep networks.

  • The information bottleneck and deep learning

    A. M. Saxe et al., "On the Information Bottleneck Theory of Deep Learning," ICLR 2018. The counter-argument to the Shwartz-Ziv compression hypothesis.

    Essential reading for understanding the debate about whether deep networks compress. Shows that the observed compression depends on the choice of activation function and mutual-information estimator.

  • Over-the-air computation

    G. Zhu et al., "Over-the-Air Computation for 6G: Turning Air into a Computer," IEEE Communications Surveys & Tutorials, 2024.

    Comprehensive survey covering AirComp theory, practical schemes, and applications to federated learning, consensus, and distributed inference.

  • Communication-efficient distributed optimization

    J. Konečný et al., "Randomized Distributed Mean Estimation: Accuracy vs. Communication," Frontiers in Applied Mathematics and Statistics, 2018.

    Practical algorithms for compressing the vectors exchanged in distributed mean estimation, including random sparsification and quantization, with precise accuracy-communication trade-offs; a minimal sparsification sketch follows below.
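
As a small illustration of the random-sparsification idea in the last item, the sketch below keeps each gradient coordinate with probability keep_prob and rescales by 1/keep_prob so the compressed vector stays unbiased; the naming and the plain NumPy setting are ours, not the paper's protocol.

    import numpy as np

    def random_sparsify(grad, keep_prob):
        """Keep each coordinate independently with probability keep_prob and
        rescale the survivors by 1/keep_prob, so the expected output equals grad."""
        mask = np.random.random(grad.shape) < keep_prob
        return np.where(mask, grad / keep_prob, 0.0)

    # Each worker transmits roughly keep_prob * d nonzero coordinates; averaging
    # the sparsified gradients across workers recovers the true mean gradient in
    # expectation, at the cost of extra variance.
    print(random_sparsify(np.random.randn(10), keep_prob=0.2))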