Prerequisites & Notation
Before You Begin
This chapter applies coded caching principles to distributed machine learning. Prerequisites: MAN basics and familiarity with distributed ML (stochastic gradient descent, parameter server architectures).
- Stochastic gradient descent / distributed ML
Self-check: What is data shuffling between epochs, and why is it needed? (A refresher sketch follows this list.)
- Parameter server architecture (Review ch26)
Self-check: Can you describe the all-reduce operation in PyTorch distributed training? (Covered in the same sketch after this list.)
- Basic combinatorics / XOR operations (Review ch02)
Self-check: Why do XOR-coded messages work in MAN delivery? (A toy XOR example follows this list.)
- Index coding perspective (Review ch04)
Self-check: What is the role of the conflict graph in coded delivery? (A small coloring sketch follows this list.)
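If the first two self-checks feel rusty, the minimal sketch below illustrates both ideas at once. It assumes PyTorch and a `torchrun` launch; the toy dataset, linear model, and learning rate are placeholders, not this chapter's setup. `DistributedSampler.set_epoch` re-shuffles the data assignment every epoch, and `dist.all_reduce` sums gradients across workers before each SGD step.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="gloo")  # one process per worker (env vars set by torchrun)

# Placeholder data and model, used only to exercise the two mechanisms.
dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))
sampler = DistributedSampler(dataset, shuffle=True)   # each worker sees one shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
model = torch.nn.Linear(8, 1)

for epoch in range(3):
    sampler.set_epoch(epoch)                 # re-shuffle the shard assignment every epoch
    for x, y in loader:
        loss = torch.nn.functional.mse_loss(model(x), y)
        model.zero_grad()
        loss.backward()
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over workers
            p.grad /= dist.get_world_size()                # ... then average them
        with torch.no_grad():
            for p in model.parameters():
                p -= 0.01 * p.grad           # plain SGD step with the averaged gradient

dist.destroy_process_group()
```

Run it with, for example, `torchrun --nproc_per_node=4 script.py`; each process plays the role of one worker.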
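For the XOR self-check, here is a two-worker toy with byte strings standing in for subfiles (the contents are arbitrary): a single XOR broadcast serves both workers because each one can cancel the piece it already caches.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

piece_for_1 = b"\x01\x02\x03\x04"  # missing at worker 1, cached at worker 2
piece_for_2 = b"\x05\x06\x07\x08"  # missing at worker 2, cached at worker 1

broadcast = xor(piece_for_1, piece_for_2)           # one coded transmission serves both

assert xor(broadcast, piece_for_2) == piece_for_1   # worker 1 cancels its cached piece
assert xor(broadcast, piece_for_1) == piece_for_2   # worker 2 does the same
```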
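For the conflict-graph self-check, the standard-library sketch below builds a small example (3 workers, 3 files, MAN placement with one cached subfile index per worker; the scenario is illustrative, not taken from this chapter). Vertices are the subfiles each worker still misses, edges join pairs that cannot share an XOR, and a greedy coloring groups the rest into coded transmissions.

```python
from itertools import combinations

# MAN placement with 3 workers and 3 files, t = 1: subfile X_i is cached at worker i.
# Worker w demands file "ABC"[w-1]; the vertices are the subfiles it still misses.
vertices = [(w, "ABC"[w - 1], i) for w in (1, 2, 3) for i in (1, 2, 3) if i != w]

def compatible(u, v):
    """Two missing subfiles can share one XOR iff each requester caches the other's piece."""
    (wu, _, iu), (wv, _, iv) = u, v
    return iv == wu and iu == wv

# Conflict graph: an edge joins every pair that cannot share a transmission.
conflicts = {v: set() for v in vertices}
for u, v in combinations(vertices, 2):
    if not compatible(u, v):
        conflicts[u].add(v)
        conflicts[v].add(u)

# Greedy coloring of the conflict graph: each color class becomes one XOR broadcast.
color = {}
for v in vertices:
    taken = {color[n] for n in conflicts[v] if n in color}
    color[v] = next(c for c in range(len(vertices)) if c not in taken)

transmissions = {}
for v, c in color.items():
    transmissions.setdefault(c, []).append(v)
print(len(transmissions), "transmissions:", list(transmissions.values()))
# -> 3 transmissions: [(1,'A',2),(2,'B',1)], [(1,'A',3),(3,'C',1)], [(2,'B',3),(3,'C',2)]
```

The number of colors upper-bounds the number of broadcasts; here the coloring recovers the three pairwise XORs of the MAN scheme for this demand.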
Notation for This Chapter
Symbols for distributed ML data shuffling.
| Symbol | Meaning | Introduced |
|---|---|---|
|  | Number of workers in distributed ML | s01 |
|  | Total dataset (analogous to the library in MAN) | s01 |
|  | Fraction of the dataset stored at each worker (the cache fraction, in MAN terms) | s01 |
|  | Communication cost per shuffling epoch (data units) | s01 |
|  | Number of epochs (rounds of shuffling + training) | s01 |
|  | Straggler tolerance in gradient coding | s03 |
|  | Caching-gain-like parameter for shuffling | s02 |
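As a quick sanity check on the notation, the hedged sketch below plugs placeholder values into the classic centralized MAN rate. The variable names stand in for the symbols in the table (number of workers, per-worker cache fraction); the formula is used purely for intuition about how coded shuffling could scale, not as this chapter's result.

```python
def uncoded_cost(num_workers: int, cache_fraction: float) -> float:
    # Every worker fetches the part of its new shard it does not already store.
    return num_workers * (1.0 - cache_fraction)

def man_style_cost(num_workers: int, cache_fraction: float) -> float:
    # Classic centralized MAN rate: the uncoded cost divided by the multicast gain.
    return num_workers * (1.0 - cache_fraction) / (1.0 + num_workers * cache_fraction)

for frac in (0.1, 0.25, 0.5):
    print(f"cache fraction {frac:.2f}: "
          f"uncoded {uncoded_cost(10, frac):.2f} vs coded {man_style_cost(10, frac):.2f}")
```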