Chapter Summary
Key Points
1. Data shuffling in distributed ML carries a large communication cost: each worker receives a new random subset of the dataset every epoch, so large data volumes are sent and received across the cluster.
2. Wan-Tuninetti-Caire (2020) CommIT result: coded shuffling reduces per-epoch communication relative to uncoded shuffling by a factor on the order of the multicast gain $t + 1$ defined in point 3.
3. Coded-caching analogy: worker memory = cache, new assignment = demand, shuffling = delivery. MAN-style XOR messages shuffle data for $t + 1$ workers simultaneously, where $t = KM/N$ is the number of times the dataset fits in the cluster's aggregate memory ($K$ workers, each storing $M$ out of $N$ data units); see the coded-delivery sketch after this list.
4. Practical impact at scale: at representative values of $K$ and $M$, this yields a 10-fold reduction in shuffling bandwidth. For hyperscale deployments (1000 GPUs, PB datasets), the saved inter-DC bandwidth is worth billions of dollars per year.
5. Gradient coding: storing data redundantly across workers (a replication factor of $s + 1$) tolerates any $s$ stragglers. It is a different coded-computing primitive, trading storage for reliability; see the gradient-coding sketch after this list.
6. Coded computing umbrella: coded shuffling, gradient coding, coded matrix multiplication, and coded MapReduce all share the theme that memory and storage can replace communication or recomputation. The CommIT framework unifies them.
7. Deployment reality: production ML lags theory; gradient coding and coded shuffling are research-stage. Practical adoption awaits cluster-scale integration of coded communication primitives.
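A minimal sketch of the delivery step behind points 2-4, assuming the standard MAN placement on a toy cluster (the values of $K$, $t$, the subfile size, and the uncoded baseline count are illustrative, not the chapter's exact scheme). Each XOR message is simultaneously useful to $t + 1$ workers, which is the source of the bandwidth reduction.

```python
from itertools import combinations
import numpy as np

K = 4              # number of workers (toy value)
t = 2              # t = K*M/N: normalized aggregate memory (M/N = 1/2 here)
SUBFILE_BYTES = 8  # toy subfile size

def man_shuffle_demo(seed=0):
    rng = np.random.default_rng(seed)

    # MAN placement: each of K data blocks is split into C(K, t) subfiles,
    # one per t-subset of workers; worker k caches every subfile whose
    # label contains k.
    tsubsets = list(combinations(range(K), t))
    blocks = {b: {S: rng.integers(0, 256, SUBFILE_BYTES, dtype=np.uint8)
                  for S in tsubsets}
              for b in range(K)}
    cache = {k: {(b, S): blocks[b][S]
                 for b in range(K) for S in tsubsets if k in S}
             for k in range(K)}

    # Toy shuffle: worker k is newly assigned block d[k].
    d = [int(x) for x in rng.permutation(K)]

    # Coded delivery: one XOR per (t+1)-subset T of workers. Inside T,
    # worker k is missing exactly the subfile labeled T \ {k} and caches
    # every other term, so a single message serves t + 1 workers at once.
    messages = []
    for T in combinations(range(K), t + 1):
        xor = np.zeros(SUBFILE_BYTES, dtype=np.uint8)
        for k in T:
            S = tuple(j for j in T if j != k)
            xor = xor ^ blocks[d[k]][S]
        messages.append((T, xor))

    # Decoding check for worker 0: peel off cached terms from each XOR.
    k = 0
    for T, xor in messages:
        if k not in T:
            continue
        rec = xor.copy()
        for j in T:
            if j != k:
                rec ^= cache[k][(d[j], tuple(i for i in T if i != j))]
        assert np.array_equal(rec, blocks[d[k]][tuple(j for j in T if j != k)])

    uncoded = K * len(tsubsets) * (1 - t / K)  # missing subfiles sent one by one
    coded = len(messages)                      # one subfile-sized XOR each
    print(f"uncoded load ~ {uncoded:.0f} subfiles, coded load = {coded}, "
          f"gain ~ {uncoded / coded:.1f} (theory: t + 1 = {t + 1})")

man_shuffle_demo()
```

For this toy configuration the printed gain matches $t + 1 = 3$; in a regime where $t + 1 = 10$, the same delivery step gives the 10-fold reduction cited in point 4.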
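For point 5, a minimal sketch of gradient coding in the spirit of Tandon et al. (2017): three workers, replication factor $s + 1 = 2$, and a toy least-squares gradient (the model, data, and coefficients are illustrative assumptions). The full gradient is recoverable from any two of the three coded messages, so one straggler can be ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: gradient of 0.5 * ||X w - y||^2 is X^T (X w - y).
X = rng.normal(size=(9, 4))
y = rng.normal(size=9)
w = rng.normal(size=4)

# Split the data into 3 partitions; g[i] is the partial gradient on partition i.
parts = np.array_split(np.arange(9), 3)
g = [X[p].T @ (X[p] @ w - y[p]) for p in parts]
full_gradient = sum(g)

# Encoding: worker i stores partitions {i, i+1 mod 3} (replication factor 2)
# and sends one coded partial gradient (classic 3-worker construction).
coded = {
    0: 0.5 * g[0] + g[1],   # worker 0 holds partitions 0, 1
    1: g[1] - g[2],         # worker 1 holds partitions 1, 2
    2: 0.5 * g[0] + g[2],   # worker 2 holds partitions 2, 0
}

# Decoding: for every pair of surviving workers there is a linear combination
# of their messages that equals the full gradient.
decode = {
    (0, 1): {0: 2.0, 1: -1.0},
    (0, 2): {0: 1.0, 2: 1.0},
    (1, 2): {1: 1.0, 2: 2.0},
}

for survivors, coeffs in decode.items():
    straggler = ({0, 1, 2} - set(survivors)).pop()
    recovered = sum(coeffs[i] * coded[i] for i in survivors)
    assert np.allclose(recovered, full_gradient)
    print(f"worker {straggler} straggling: full gradient recovered from {survivors}")
```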
Looking Ahead
Chapters 16-18 cover additional extensions — coded computing in detail, secure delivery, multi-access networks. Chapters 19-22 move to research frontiers: ISAC, online coded caching, video streaming, and open problems. This completes the tour of coded caching from foundational MAN theory to practical ML-cluster deployments — a journey through 10 years of CommIT research and its impact on the broader information-theory + computer-science communities.