References & Further Reading

References

  1. R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017. [Link]

    The headline reference. Introduces the cyclic gradient-coding construction and proves the matching lower bound; a small worked instance of the construction appears after this list. Should be read in full alongside this chapter.

  2. Z. Charles, D. Papailiopoulos, and J. Ellenberg, Approximate Gradient Coding via Sparse Random Graphs, 2017. [Link]

    Introduces approximate gradient coding with random sparse encoding. The basis for §6.3 of this chapter.

  3. N. Raviv, R. Tandon, A. G. Dimakis, and I. Tamo, Gradient Coding from Cyclic MDS Codes and Expander Graphs, 2020.

    Connects gradient coding to the larger MDS-code and expander-graph literature. Important for understanding the engineering choices in production deployments.

  4. W. Halbawi, N. Azizan-Ruhi, F. Salehi, and B. Hassibi, Improving Distributed Gradient Descent Using Reed-Solomon Codes, 2018.

    Reed-Solomon-based gradient-coding construction with reduced storage. Useful comparison to Tandon et al.'s cyclic construction.

  5. M. Ye and E. Abbe, Communication-Computation Efficient Gradient Coding, 2018. [Link]

    Generalizes gradient coding so that workers return shorter coded messages, optimizing the communication-computation tradeoff. Influential for handling heterogeneous workers.

  6. K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, Speeding Up Distributed Machine Learning Using Codes, 2018.

    The original coded-computation paper for ML straggler mitigation. Background reading; it develops coded matrix-vector multiplication, the straggler-mitigation idea that gradient coding carries over to sums of partial gradients.

  7. Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, Polynomial Codes: An Optimal Design for High-Dimensional Coded Matrix Multiplication, 2017. [Link]

    Polynomial codes, the coded matrix-multiplication framework of which gradient coding can be seen as a specialization. See Chapter 5 for the full development.

  8. S. U. Stich, J.-B. Cordonnier, and M. Jaggi, Sparsified SGD with Memory, 2018. [Link]

    Foundational reference for gradient sparsification with error feedback. Comparison and complement to the approximate gradient-coding approach of §6.3.

  9. J. So, B. Güler, and A. S. Avestimehr, Byzantine-Resilient Secure Federated Learning, 2021.

    Byzantine-resilient FL framework that composes coded gradient computation with secure aggregation. Useful background for Chapter 11.

  10. T. Jahani-Nezhad, M. A. Maddah-Ali, and G. Caire, Byzantine-Resilient Secure Aggregation for Federated Learning Based on Ramp Sharing, 2023.

    ByzSecAgg — the CommIT-group result that builds on coded gradient computation (this chapter) and ramp secret sharing (Chapter 3). Full development in Chapter 11.

  11. J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated Learning: Strategies for Improving Communication Efficiency, 2016. [Link]

    Quantization, sparsification, and other communication-efficient federated-learning techniques that complement coded gradient computation.

  12. S. Li and A. S. Avestimehr, Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning, 2020.

    Comprehensive survey including a chapter on gradient coding. The standard reference for the field.
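
A concrete taste of reference [1]: the sketch below numerically checks the n = 3 workers, s = 1 straggler example worked out in that paper, where each worker sends one fixed linear combination of its two assigned partial gradients and the full gradient sum is recoverable from any two responses. The random gradients and variable names here are illustrative stand-ins, not the paper's notation.

    import numpy as np

    # Cyclic gradient coding, n = 3 workers tolerating s = 1 straggler [1].
    # Random stand-ins for the three per-partition partial gradients.
    rng = np.random.default_rng(0)
    g1, g2, g3 = rng.standard_normal((3, 5))
    full = g1 + g2 + g3

    # Encoding: each worker stores s + 1 = 2 partitions (cyclically assigned)
    # and sends a single fixed linear combination of its partial gradients.
    f1 = 0.5 * g1 + g2    # worker 1 holds {g1, g2}
    f2 = g2 - g3          # worker 2 holds {g2, g3}
    f3 = 0.5 * g1 + g3    # worker 3 holds {g1, g3}

    # Decoding: any n - s = 2 responses recover the full gradient sum.
    assert np.allclose(2 * f1 - f2, full)  # workers {1, 2} respond
    assert np.allclose(f1 + f3, full)      # workers {1, 3} respond
    assert np.allclose(f2 + 2 * f3, full)  # workers {2, 3} respond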

Further Reading

Resources for going deeper into coded distributed gradient descent and related techniques.

  • Gradient coding — modern survey

    Li & Avestimehr, *Coded Computing*, FnT 2020, Ch. 4

    Comprehensive treatment of gradient coding including the cyclic, fractional, and Reed-Solomon-based variants — all in a unified framework.

  • Approximate gradient coding — extensions

    Wang et al., 'Approximate Gradient Coding for Federated Learning,' IEEE T-IT 2022

    Extends approximate gradient coding to the federated-learning setting, with explicit convergence rates for non-convex losses.

  • Asynchronous and stale gradient methods

    Lian et al., 'Asynchronous Decentralized Parallel SGD,' ICML 2018

    The complementary axis: how to handle stragglers via asynchrony rather than coding. Useful contrast to understand when each approach is preferred.

  • Production federated-learning systems

    Bonawitz et al., 'Towards Federated Learning at Scale: System Design,' MLSys 2019

    Engineering perspective on federated learning at Google scale. Provides the production context where coded gradient computation is sometimes (and sometimes not) deployed.