References & Further Reading

References

  1. R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017. [Link]

    The headline reference. Introduces the cyclic gradient-coding construction and proves the matching lower bound; a small worked instance of the construction appears after this list. Should be read in full alongside this chapter.

  2. Z. Charles, D. Papailiopoulos, and J. Ellenberg, Approximate Gradient Coding via Sparse Random Graphs, 2017. [Link]

    Introduces approximate gradient coding with random sparse encoding. The basis for §6.3 of this chapter.

  3. N. Raviv, R. Tandon, A. G. Dimakis, and I. Tamo, Gradient Coding from Cyclic MDS Codes and Expander Graphs, 2020.

    Connects gradient coding to the larger MDS-code and expander-graph literature. Important for understanding the engineering choices in production deployments.

  4. W. Halbawi, N. Azizan-Ruhi, F. Salehi, and B. Hassibi, Improving Distributed Gradient Descent Using Reed-Solomon Codes, 2018.

    Reed-Solomon-based gradient-coding construction with reduced storage. Useful comparison to Tandon et al.'s cyclic construction.

  5. M. Ye and E. Abbe, Communication-Computation Efficient Gradient Coding, 2018. [Link]

    Generalizes gradient coding so that workers return shorter coded messages, optimizing the communication-computation tradeoff. Influential for handling heterogeneous workers.

  6. K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, Speeding Up Distributed Machine Learning Using Codes, 2018.

    The original coded-computation paper for ML straggler mitigation. Background reading; it develops coded matrix-vector multiplication, the straggler-mitigation idea that gradient coding carries over to sums of partial gradients.

  7. Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, Polynomial Codes: An Optimal Design for High-Dimensional Coded Matrix Multiplication, 2017. [Link]

    Polynomial codes, the coded matrix-multiplication framework of which gradient coding can be seen as a specialization. See Chapter 5 for the full development.

  8. S. U. Stich, J.-B. Cordonnier, and M. Jaggi, Sparsified SGD with Memory, 2018. [Link]

    Foundational reference for gradient sparsification with error feedback. Comparison and complement to the approximate gradient-coding approach of §6.3.

  9. J. So, B. Güler, and A. S. Avestimehr, Byzantine-Resilient Secure Federated Learning, 2021.

    Byzantine-resilient FL framework that composes coded gradient computation with secure aggregation. Useful background for Chapter 11.

  10. T. Jahani-Nezhad, M. A. Maddah-Ali, and G. Caire, Byzantine-Resilient Secure Aggregation for Federated Learning Based on Ramp Sharing, 2023.

    ByzSecAgg — the CommIT-group result that builds on coded gradient computation (this chapter) and ramp secret sharing (Chapter 3). Full development in Chapter 11.

  11. J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated Learning: Strategies for Improving Communication Efficiency, 2016. [Link]

    Quantization, sparsification, and other communication-efficient federated-learning techniques that complement coded gradient computation.

  12. S. Li and A. S. Avestimehr, Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning, 2020.

    Comprehensive survey including a chapter on gradient coding. The standard reference for the field.
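
A concrete taste of reference [1]: the sketch below numerically checks the n = 3 workers, s = 1 straggler example worked out in that paper, where each worker sends one fixed linear combination of its two assigned partial gradients and the full gradient sum is recoverable from any two responses. The random gradients and variable names here are illustrative stand-ins, not the paper's notation.

    import numpy as np

    # Cyclic gradient coding, n = 3 workers tolerating s = 1 straggler [1].
    # Random stand-ins for the three per-partition partial gradients.
    rng = np.random.default_rng(0)
    g1, g2, g3 = rng.standard_normal((3, 5))
    full = g1 + g2 + g3

    # Encoding: each worker stores s + 1 = 2 partitions (cyclically assigned)
    # and sends a single fixed linear combination of its partial gradients.
    f1 = 0.5 * g1 + g2    # worker 1 holds {g1, g2}
    f2 = g2 - g3          # worker 2 holds {g2, g3}
    f3 = 0.5 * g1 + g3    # worker 3 holds {g1, g3}

    # Decoding: any n - s = 2 responses recover the full gradient sum.
    assert np.allclose(2 * f1 - f2, full)  # workers {1, 2} respond
    assert np.allclose(f1 + f3, full)      # workers {1, 3} respond
    assert np.allclose(f2 + 2 * f3, full)  # workers {2, 3} respond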

Further Reading

Resources for going deeper into coded distributed gradient descent and related techniques.

  • Gradient coding — modern survey

    Li & Avestimehr, *Coded Computing*, FnT 2020, Ch. 4

    Comprehensive treatment of gradient coding including the cyclic, fractional, and Reed-Solomon-based variants — all in a unified framework.

  • Approximate gradient coding — extensions

    Wang et al., 'Approximate Gradient Coding for Federated Learning,' IEEE T-IT 2022

    Extends approximate gradient coding to the federated-learning setting, with explicit convergence rates for non-convex losses.

  • Asynchronous and stale gradient methods

    Lian et al., 'Asynchronous Decentralized Parallel SGD,' ICML 2018

    The complementary axis: how to handle stragglers via asynchrony rather than coding. Useful contrast to understand when each approach is preferred.

  • Production federated-learning systems

    Bonawitz et al., 'Towards Federated Learning at Scale: System Design,' MLSys 2019

    Engineering perspective on federated learning at Google scale. Provides the production context where coded gradient computation is sometimes (and sometimes not) deployed.