References & Further Reading
References
- K. Wan, D. Tuninetti, and G. Caire, Fundamental Limits of Distributed Data Shuffling, 2020
Foundational CommIT paper on coded data shuffling. Establishes the $(1 + Ks)$ communication reduction factor (a small numerical illustration follows this list). Core result of Chapter 15.
- R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017
Original gradient coding paper. Introduces straggler-tolerant redundant computation. Relevant for §15.3.
- K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, Speeding Up Distributed Machine Learning Using Codes, 2017
Coded matrix multiplication for distributed ML. Complementary to gradient coding.
- S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, A Fundamental Tradeoff Between Computation and Communication in Distributed Computing, 2018
Coded MapReduce — memory-communication tradeoff in distributed computing. Sets up the framework generalized by coded shuffling.
- M. A. Attia and R. Tandon, Near Optimal Coded Data Shuffling for Distributed Learning, 2019
An alternative achievable scheme and analysis for coded data shuffling, complementary to Wan-Tuninetti-Caire.
- M. A. Maddah-Ali and U. Niesen, Fundamental Limits of Caching, 2014
The Maddah-Ali and Niesen (MAN) coded-caching scheme. Chapter 15 applies the same combinatorial structure to data shuffling.
- J. Dean and L. A. Barroso, The Tail at Scale, 2013
Documents the tail-latency (straggler) problem in large-scale distributed systems. Motivation for gradient coding (§15.3).
- A. Sergeev and M. Del Balso, Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, 2018
Horovod (Uber) — production all-reduce framework. Context for §15.4 deployment discussion.
- M. Li, D. G. Andersen, A. J. Smola, and K. Yu, Parameter Server for Distributed Machine Learning, 2014
Parameter-server architecture; production ML frameworks (BytePS, Tencent Angel) are variants.
- H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, 2017
Foundational federated-learning paper (Google). Relevant for coded-FL extensions discussed in §15.4.
- J. Kim and G. Caire, Coded Federated Learning, 2023
Caire-group collaboration extending coded techniques to FL. Combines coded shuffling + differential privacy.
- P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017
Large-scale ImageNet training; illustrates the practical scale at which coded techniques could apply. Context for §15.4.
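As a purely numerical illustration of the reduction factor quoted in the Wan-Tuninetti-Caire entry above (reading $K$ as the number of workers and $s$ as the excess-storage parameter, following the chapter's notation; the specific numbers here are only an example):

$$
K = 10, \quad s = 2 \;\Longrightarrow\; 1 + Ks = 21,
$$

i.e., at that operating point the coded scheme would cut the shuffling communication to $1/21$ of the uncoded load.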
Further Reading
Resources for deeper study of coded computing and shuffling.
The Wan-Tuninetti-Caire foundational paper
K. Wan, D. Tuninetti, G. Caire, 'Fundamental Limits of Distributed Data Shuffling,' IEEE TIT 2020
Core CommIT contribution. Theorem 1 and the achievable scheme are the basis of Chapter 15.
Coded computing survey
S. Li and A. S. Avestimehr, 'Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning,' Foundations and Trends in Communications and Information Theory, 2020
Survey of the coded-computing field. Connects gradient coding, shuffling, MapReduce.
Gradient coding origin
R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, 'Gradient Coding: Avoiding Stragglers in Distributed Learning,' ICML 2017
The original gradient coding paper; essential for §15.3. A minimal numerical sketch of the classic 3-worker, 1-straggler construction appears at the end of this section.
Practical federated learning
H. B. McMahan et al., 'Communication-Efficient Learning of Deep Networks from Decentralized Data,' AISTATS 2017
FL baseline against which coded techniques are benchmarked.
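To make the gradient-coding entries concrete, here is a minimal sketch of the standard 3-worker, 1-straggler construction often used to introduce the scheme. The coded coefficients follow that textbook example; the synthetic gradients, random seed, and variable names below are ours.

```python
import numpy as np

# Toy per-partition gradients g1, g2, g3 (dimension 4, values arbitrary).
rng = np.random.default_rng(0)
g1, g2, g3 = rng.standard_normal((3, 4))

# Each of the 3 workers holds 2 of the 3 partitions and sends one coded combination:
#   worker 1 (partitions 1, 2): 0.5*g1 + g2
#   worker 2 (partitions 2, 3):     g2 - g3
#   worker 3 (partitions 1, 3): 0.5*g1 + g3
f1 = 0.5 * g1 + g2
f2 = g2 - g3
f3 = 0.5 * g1 + g3

full_sum = g1 + g2 + g3

# Any 2 of the 3 coded messages suffice to decode the full gradient sum,
# so the master can ignore one straggling worker:
assert np.allclose(f1 + f3, full_sum)        # worker 2 straggles
assert np.allclose(2 * f1 - f2, full_sum)    # worker 3 straggles
assert np.allclose(f2 + 2 * f3, full_sum)    # worker 1 straggles
print("Recovered g1 + g2 + g3 from any 2 of the 3 coded messages.")
```

Each partition is replicated on exactly two workers, so the full gradient sum survives the loss of any single worker at the cost of a 2x compute redundancy, which is the redundancy-versus-straggler-tolerance tradeoff at the heart of gradient coding.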