References & Further Reading
References
- R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017. [Link]
The headline reference. Introduces the cyclic gradient-coding construction and proves the matching lower bound; a worked three-worker example in code follows the reference list. Should be read in full alongside this chapter.
- Z. Charles, D. Papailiopoulos, and J. Ellenberg, Approximate Gradient Coding via Sparse Random Graphs, 2017. [Link]
Introduces approximate gradient coding with random sparse encoding. The basis for §6.3 of this chapter; a toy simulation of the idea follows the reference list.
- N. Raviv, R. Tandon, A. Dimakis, and I. Tamo, Gradient Coding from Cyclic MDS Codes and Expander Graphs, 2020.
Connects gradient coding to the larger MDS-code and expander-graph literature. Important for understanding the engineering choices in production deployments.
- W. Halbawi, N. Azizan-Ruhi, F. Salehi, and B. Hassibi, Improving Distributed Gradient Descent Using Reed-Solomon Codes, 2018.
Reed-Solomon-based gradient-coding construction with reduced storage. Useful comparison to Tandon et al.'s cyclic construction.
- M. Ye and E. Abbe, Communication-Computation Efficient Gradient Coding, 2018. [Link]
Generalizes gradient coding to non-uniform partitions, optimizing the storage/decoding tradeoff. Influential for handling heterogeneous workers.
- K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, Speeding Up Distributed Machine Learning Using Codes, 2018.
The original coded-computation paper for ML straggler mitigation, centered on coded matrix-vector multiplication. Background reading: gradient coding carries the same straggler-mitigation idea over from matrix-vector products to general gradient sums.
- Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, Polynomial Codes: An Optimal Design for High-Dimensional Coded Matrix Multiplication, 2017. [Link]
Polynomial codes, the coded matrix-multiplication parent framework of which gradient coding is a specialization. See Chapter 5 for the full development.
- S. U. Stich, J.-B. Cordonnier, and M. Jaggi, Sparsified SGD with Memory, 2018. [Link]
Foundational reference for gradient sparsification with error feedback. Comparison and complement to the approximate gradient-coding approach of §6.3; a minimal error-feedback sketch follows the reference list.
- J. So, B. Güler, and A. S. Avestimehr, Byzantine-Resilient Secure Federated Learning, 2021.
Byzantine-resilient FL framework that composes coded gradient computation with secure aggregation. Useful background for Chapter 11.
- T. Jahani-Nezhad, M. A. Maddah-Ali, and G. Caire, Byzantine-Resilient Secure Aggregation for Federated Learning Based on Ramp Sharing, 2023.
ByzSecAgg — the CommIT-group result that builds on coded gradient computation (this chapter) and ramp secret sharing (Chapter 3). Full development in Chapter 11.
- J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated Learning: Strategies for Improving Communication Efficiency, 2016. [Link]
Quantization, sparsification, and other communication-efficient federated-learning techniques that complement coded gradient computation.
- S. Li and A. S. Avestimehr, Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning, 2020.
Comprehensive survey including a chapter on gradient coding. The standard reference for the field.
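To make Tandon et al.'s construction concrete, here is a minimal sketch of the paper's canonical n = 3 workers, s = 1 straggler example. The encoding matrix matches the paper's introductory example; the gradient dimension and the random stand-in partial gradients are illustrative assumptions.

```python
import numpy as np

# Gradient coding for n = 3 workers tolerating s = 1 straggler
# (the three-worker example from Tandon et al.). The "partial
# gradients" below are random stand-ins for illustration.
rng = np.random.default_rng(0)
d = 4                                    # gradient dimension (assumed)
g = rng.standard_normal((3, d))          # partial gradients g1, g2, g3

# Encoding matrix B: row i is the combination worker i transmits.
B = np.array([[0.5, 1.0,  0.0],          # worker 1 sends g1/2 + g2
              [0.0, 1.0, -1.0],          # worker 2 sends g2 - g3
              [0.5, 0.0,  1.0]])         # worker 3 sends g1/2 + g3
transmissions = B @ g

# Any 2 of the 3 transmissions suffice: the master finds decoding
# coefficients a with a^T B_S = [1, 1, 1] over the survivors S.
full_gradient = g.sum(axis=0)
for straggler in range(3):
    S = [w for w in range(3) if w != straggler]
    a, *_ = np.linalg.lstsq(B[S].T, np.ones(3), rcond=None)
    assert np.allclose(a @ transmissions[S], full_gradient)
```

Note that each worker stores s + 1 = 2 data partitions, matching the paper's lower bound on the replication needed to tolerate s stragglers.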
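In the same spirit, a toy version of Charles et al.'s approximate scheme: replicate each partition on a few random workers, have every worker send an unweighted sum, and let the master rescale whatever arrives. The sizes, straggler model, and rescaling below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Approximate gradient coding via sparse random assignment (toy).
rng = np.random.default_rng(1)
n, d, dim = 50, 3, 10                   # workers/partitions, replication, gradient dim
g = rng.standard_normal((n, dim))       # one partial gradient per partition
full = g.sum(axis=0)

# Each partition lands on d distinct workers chosen uniformly at random.
loads = [[] for _ in range(n)]
for j in range(n):
    for w in rng.choice(n, size=d, replace=False):
        loads[w].append(j)

# Workers transmit the plain (uncoded) sum of their assigned gradients.
transmissions = np.array([g[parts].sum(axis=0) for parts in loads])

# A random ~20% of workers straggle; the master rescales the rest so
# that each partition is counted once in expectation.
survivors = rng.random(n) > 0.2
estimate = transmissions[survivors].sum(axis=0) / (d * survivors.mean())
print("relative error:", np.linalg.norm(estimate - full) / np.linalg.norm(full))
```

Unlike the exact cyclic scheme above, decoding here is trivial (a rescaled sum), and extra stragglers degrade accuracy gracefully rather than causing outright failure.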
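Last, a minimal sketch of the error-feedback ("memory") mechanism behind Stich et al.'s sparsified SGD: coordinates dropped by top-k compression are remembered locally and added back before the next round, so dropped mass is only delayed, never lost. The dimension, k, and the synthetic gradient stream are assumptions.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(2)
dim, k, steps = 100, 10, 200            # 90% of coordinates dropped per round
memory = np.zeros(dim)                  # residual not yet transmitted
sent_total = np.zeros(dim)
true_total = np.zeros(dim)

for _ in range(steps):
    grad = rng.standard_normal(dim)     # stand-in for a stochastic gradient
    corrected = grad + memory           # add back previously dropped mass
    update = top_k(corrected, k)        # transmit only k coordinates
    memory = corrected - update         # remember everything dropped
    sent_total += update
    true_total += grad

# The sums telescope: accumulated sent updates trail the accumulated
# true gradients by exactly the current memory, which stays bounded.
assert np.allclose(true_total - sent_total, memory)
```

This bounded-residual property is what the convergence analysis rests on, and it is the natural point of comparison with the approximation error of the coding approach in §6.3.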
Further Reading
Resources for going deeper into coded distributed gradient descent and related techniques.
Gradient coding — modern survey
Li & Avestimehr, *Coded Computing*, FnT 2020, Ch. 4
Comprehensive treatment of gradient coding, including the cyclic-repetition, fractional-repetition, and Reed-Solomon-based variants in a unified framework.
Approximate gradient coding — extensions
Wang et al., 'Approximate Gradient Coding for Federated Learning,' IEEE T-IT 2022
Extends approximate gradient coding to the federated-learning setting, with explicit convergence rates for non-convex losses.
Asynchronous and stale gradient methods
Lian et al., 'Asynchronous Decentralized Parallel SGD,' ICML 2018
The complementary axis: how to handle stragglers via asynchrony rather than coding. Useful contrast to understand when each approach is preferred.
Production federated-learning systems
Bonawitz et al., 'Towards Federated Learning at Scale: System Design,' MLSys 2019
Engineering perspective on federated learning at Google scale. Provides the production context where coded gradient computation is sometimes (and sometimes not) deployed.