References & Further Reading

References

  1. K. Wan, D. Tuninetti, and G. Caire, Fundamental Limits of Distributed Data Shuffling, 2020

    Foundational CommIT paper on coded data shuffling. Establishes the $(1 + Ks)$ communication-reduction factor, the core result of Chapter 15; a brief numerical illustration follows at the end of this reference list.

  2. R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017

    Original gradient coding paper. Introduces straggler-tolerant redundant computation. Relevant for §15.3.

  3. K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, Speeding Up Distributed Machine Learning Using Codes, 2017

    Coded matrix multiplication for distributed ML. Complementary to gradient coding.

  4. S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, A Fundamental Tradeoff Between Computation and Communication in Distributed Computing, 2018

    Coded MapReduce — memory-communication tradeoff in distributed computing. Sets up the framework generalized by coded shuffling.

  5. M. A. Attia and R. Tandon, Near Optimal Coded Data Shuffling for Distributed Learning, 2019

    Alternative analysis and scheme for coded data shuffling; takes a slightly different approach that complements Wan-Tuninetti-Caire.

  6. M. A. Maddah-Ali and U. Niesen, Fundamental Limits of Caching, 2014

    The Maddah-Ali-Niesen (MAN) coded caching scheme. Chapter 15 applies the same combinatorial structure to data shuffling.

  7. J. Dean and L. A. Barroso, The Tail at Scale, 2013

    Introduces the straggler problem in distributed systems. Motivation for gradient coding (§15.3).

  8. A. Sergeev and M. Del Balso, Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, 2018

    Horovod (Uber) — production all-reduce framework. Context for §15.4 deployment discussion.

  9. M. Li, D. G. Andersen, A. J. Smola, and K. Yu, Parameter Server for Distributed Machine Learning, 2014

    Parameter-server architecture; production ML frameworks (BytePS, Tencent Angel) are variants.

  10. H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, 2017

    Foundational federated-learning paper (Google). Relevant for coded-FL extensions discussed in §15.4.

  11. J. Kim and G. Caire, Coded Federated Learning, 2023

    Caire-group collaboration extending coded techniques to FL; combines coded shuffling with differential privacy.

  12. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017

    Hyperscale ImageNet training. Illustrates the practical scale at which coded techniques could apply. Context for §15.4.
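
As a quick numerical illustration of the $(1 + Ks)$ reduction factor from reference 1 (a sketch only; $K$ denotes the number of workers and $s$ the normalized excess storage per worker, following the notation assumed in Chapter 15):

$$
\frac{R_{\text{uncoded}}}{R_{\text{coded}}} = 1 + Ks,
\qquad \text{e.g. } K = 10,\ s = 0.5 \;\Rightarrow\; 1 + 10 \cdot 0.5 = 6.
$$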

Further Reading

Resources for deeper study of coded computing and shuffling.

  • The Wan-Tuninetti-Caire foundational paper

    K. Wan, D. Tuninetti, G. Caire, 'Fundamental Limits of Distributed Data Shuffling,' IEEE TIT 2020

    Core CommIT contribution. Theorem 1 and the achievable scheme are the basis of Chapter 15.

  • Coded computing survey

    S. Li et al., 'Coded Computing: Tutorial and Recent Advances,' Foundations and Trends 2020

    Survey of the coded-computing field. Connects gradient coding, coded shuffling, and coded MapReduce.

  • Gradient coding origin

    R. Tandon, Q. Lei, A. G. Dimakis, N. Karampatziakis, 'Gradient Coding,' ICML 2017

    Gradient coding paper; essential for §15.3.

  • Practical federated learning

    H. B. McMahan et al., 'Communication-Efficient Learning of Deep Networks from Decentralized Data,' AISTATS 2017

    FL baseline against which coded techniques are benchmarked.