References & Further Reading

References

  1. K. Wan, D. Tuninetti, and G. Caire, Fundamental Limits of Distributed Data Shuffling, 2020

    Foundational CommIT paper on coded data shuffling. Establishes the $(1 + Ks)$ communication-reduction factor, the core result of Chapter 15; a brief numerical illustration follows at the end of this reference list.

  2. R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017

    Original gradient coding paper. Introduces straggler-tolerant redundant computation. Relevant for §15.3.

  3. K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, Speeding Up Distributed Machine Learning Using Codes, 2017

    Coded matrix multiplication for distributed ML. Complementary to gradient coding.

  4. S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, A Fundamental Tradeoff Between Computation and Communication in Distributed Computing, 2018

    Coded MapReduce — memory-communication tradeoff in distributed computing. Sets up the framework generalized by coded shuffling.

  5. M. A. Attia and R. Tandon, Near Optimal Coded Data Shuffling for Distributed Learning, 2019

    Alternative analysis and scheme for coded data shuffling; takes a slightly different approach that complements Wan-Tuninetti-Caire.

  6. M. A. Maddah-Ali and U. Niesen, Fundamental Limits of Caching, 2014

    The Maddah-Ali-Niesen (MAN) coded caching scheme. Chapter 15 applies the same combinatorial structure to data shuffling.

  7. J. Dean and L. A. Barroso, The Tail at Scale, 2013

    Introduces the straggler problem in distributed systems. Motivation for gradient coding (§15.3).

  8. A. Sergeev and M. Del Balso, Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, 2018

    Horovod (Uber) — production all-reduce framework. Context for §15.4 deployment discussion.

  9. M. Li, D. G. Andersen, A. J. Smola, and K. Yu, Parameter Server for Distributed Machine Learning, 2014

    Parameter-server architecture; production ML frameworks (BytePS, Tencent Angel) are variants.

  10. H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, 2017

    Foundational federated-learning paper (Google). Relevant for coded-FL extensions discussed in §15.4.

  11. J. Kim and G. Caire, Coded Federated Learning, 2023

    Caire-group collaboration extending coded techniques to FL; combines coded shuffling with differential privacy.

  12. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017

    Hyperscale ImageNet training. Illustrates the practical scale at which coded techniques could apply. Context for §15.4.
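
As a quick numerical illustration of the $(1 + Ks)$ reduction factor from reference 1 (a sketch only; $K$ denotes the number of workers and $s$ the normalized excess storage per worker, following the notation assumed in Chapter 15):

$$
\frac{R_{\text{uncoded}}}{R_{\text{coded}}} = 1 + Ks,
\qquad \text{e.g. } K = 10,\ s = 0.5 \;\Rightarrow\; 1 + 10 \cdot 0.5 = 6.
$$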

Further Reading

Resources for deeper study of coded computing and shuffling.

  • The Wan-Tuninetti-Caire foundational paper

    K. Wan, D. Tuninetti, G. Caire, 'Fundamental Limits of Distributed Data Shuffling,' IEEE TIT 2020

    Core CommIT contribution. Theorem 1 and the achievable scheme are the basis of Chapter 15.

  • Coded computing survey

    S. Li et al., 'Coded Computing: Tutorial and Recent Advances,' Foundations and Trends 2020

    Survey of the coded-computing field. Connects gradient coding, coded shuffling, and coded MapReduce.

  • Gradient coding origin

    R. Tandon, Q. Lei, A. G. Dimakis, N. Karampatziakis, 'Gradient Coding,' ICML 2017

    Gradient coding paper; essential for §15.3.

  • Practical federated learning

    H. B. McMahan et al., 'Communication-Efficient Learning of Deep Networks from Decentralized Data,' AISTATS 2017

    FL baseline against which coded techniques are benchmarked.