The Data Shuffling Problem
ML Training Is Hungry for Communication
Distributed machine learning at scale requires splitting data across many workers. Each worker trains on its portion; gradients are aggregated to update a shared model. Between training epochs, data must be shuffled (randomly reassigned to workers) so that each worker sees a diverse stream of samples, which improves convergence.
This shuffling is communication-intensive: in a naive implementation, each worker exchanges a large fraction of its data with others at every epoch. For a 100-worker cluster training on terabytes of data, shuffling can dominate training time and bandwidth.
The CommIT insight (Wan-Tuninetti-Caire): worker memory in distributed ML plays the same role as cache memory in coded caching. Coded shuffling (XOR'ing shuffled data with cached portions) reduces communication cost by the same factor as MAN's coded multicasting.
This chapter shows how. The result is surprising: coded caching techniques apply directly to distributed ML, saving compute-cluster bandwidth.
Definition: Distributed Data Shuffling
Consider $K$ workers participating in distributed ML training. The dataset of $N$ samples is distributed across the workers such that each worker stores a fraction $\mu$ of the dataset: $M = \mu N$ samples per worker.
Between epochs, each worker must receive a new random subset of the dataset for the next epoch. Under epoch-wise shuffling, worker $k$'s new assignment $\mathcal{A}_k^{t+1}$ is a random size-$\mu N$ subset of $\{1, \dots, N\}$, independent of its previous assignment $\mathcal{A}_k^t$.
The data shuffling problem is: how does the cluster transfer the new assignments to the workers with minimum communication?
In practice, a full random permutation per epoch is not needed; approximate shuffling with smaller memory works fine. But the theoretical framework assumes full re-shuffling per epoch.
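To make the setup concrete, here is a minimal simulation sketch. The parameters and names are hypothetical, chosen purely for illustration: it draws two independent random assignments for one worker and checks that the overlap concentrates around the $\mu$ fraction used in the analysis below.

```python
import random

# Toy parameters (hypothetical, chosen for illustration only)
N = 100_000         # dataset samples
K = 10              # workers
mu = 0.05           # storage fraction per worker
S = int(mu * N)     # samples stored per worker (mu * N)

def new_assignment():
    """Draw a fresh random size-S subset, independent of past epochs."""
    return set(random.sample(range(N), S))

# One epoch of re-shuffling, seen from a single worker
old = new_assignment()   # assignment A_k^t
new = new_assignment()   # assignment A_k^{t+1}

overlap = len(old & new)
print(f"overlap fraction: {overlap / S:.3f} (expect ~mu = {mu})")
print(f"new samples to fetch: {S - overlap} (expect ~{(1 - mu) * S:.0f})")
```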
Theorem: Uncoded Data Shuffling Cost
Under uncoded shuffling (each worker receives its new assignment as raw data from other workers), the total communication cost per epoch is $K\mu(1-\mu)N$ samples, or $\mu(1-\mu)N$ per worker (each worker needs a fraction $1-\mu$ of its new assignment as new data).
Each worker's new assignment contains $\mu N$ samples. On average, a fraction $1-\mu$ of these are samples the worker did not previously store, so they must be received. Across $K$ workers, total communication scales as $K\mu(1-\mu)N$.
Per-worker analysis
Under random re-shuffling, worker $k$'s new assignment $\mathcal{A}_k^{t+1}$ and old assignment $\mathcal{A}_k^t$ are independent random size-$\mu N$ subsets. Expected overlap: $\mu N \cdot \mu = \mu^2 N$ samples in common. New samples needed: $\mu N - \mu^2 N = \mu(1-\mu)N$ per worker.
Aggregate cost
$K$ workers $\times$ $\mu(1-\mu)N$ new samples per worker: $K\mu(1-\mu)N$. The communication accounting is simple: each new sample must be received by exactly one worker and sent by one worker that stores it. Total communication: $K\mu(1-\mu)N$ data units.
Per-epoch rate
Per worker: $\mu(1-\mu)N$. Per epoch total: $R_{\text{uncoded}} = K\mu(1-\mu)N$. For small $\mu$ the overlap correction barely matters; the dominant term is $K\mu N$.
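As a sanity check on the theorem, the closed form is easy to evaluate. The function name and parameters below are my own, not from the text:

```python
def uncoded_shuffle_cost(K: int, N: int, mu: float) -> float:
    """Expected uncoded communication per epoch, in samples:
    each of the K workers fetches mu*(1-mu)*N new samples on average."""
    return K * mu * (1 - mu) * N

# Small numeric check: 10 workers, 100k samples, 5% storage each.
# Dominant term K*mu*N = 50,000; the overlap correction trims it to 47,500.
print(uncoded_shuffle_cost(K=10, N=100_000, mu=0.05))  # -> 47500.0
```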
Example: Uncoded Shuffling at Scale
Training a ResNet-50 on ImageNet (1.28M samples) with 256 workers, each storing 5% of the dataset (64,000 samples). The dataset is 150 GB. What is the per-epoch communication cost under uncoded shuffling?
Per-worker
Dataset size 150 GB; each worker's new assignment is $0.05 \times 150\text{ GB} = 7.5$ GB, of which a fraction $1-\mu = 0.95$ is new data: $\approx 7.1$ GB per worker per epoch.
Aggregate
256 workers $\times$ 7.1 GB $\approx$ 1.8 TB per epoch. At a 100 Gbps network (12.5 GB/s): $\approx$ 146 seconds of shuffling per epoch. Substantial.
Annual training
Typical training: 90 epochs. Shuffling alone: $90 \times 1.8\text{ TB} \approx 164$ TB ($\approx 0.16$ PB) of communication. Significant fraction of total training traffic.
Motivation
If coded shuffling can reduce this by a factor of roughly $K\mu \approx 12.8$, total shuffling traffic drops to roughly 13 TB, a huge operational saving.
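The example's arithmetic is reproduced in the short sketch below (variable names are mine; the numbers come from the text above):

```python
# Reproducing the ResNet-50 / ImageNet example arithmetic
N_samples  = 1_280_000   # ImageNet samples
K          = 256         # workers
mu         = 0.05        # 5% storage per worker
dataset_gb = 150.0       # total dataset size in GB

per_worker_gb = mu * (1 - mu) * dataset_gb        # new data per worker: ~7.1 GB
per_epoch_tb  = K * per_worker_gb / 1000          # aggregate: ~1.8 TB per epoch
gb_per_s      = 100 / 8                           # 100 Gbps = 12.5 GB/s
shuffle_s     = per_epoch_tb * 1000 / gb_per_s    # ~146 s of shuffling per epoch
total_90_tb   = 90 * per_epoch_tb                 # ~164 TB over a 90-epoch run

print(f"per worker: {per_worker_gb:.2f} GB, per epoch: {per_epoch_tb:.2f} TB")
print(f"shuffle time/epoch: {shuffle_s:.0f} s, 90 epochs: {total_90_tb:.0f} TB")
```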
The Coded-Caching ↔ Coded-Shuffling Analogy
The deep insight of Wan-Tuninetti-Caire (2020) is that data shuffling and coded caching are structurally the same problem:
| Coded Caching | Coded Shuffling |
|---|---|
| $K$ users | $K$ workers |
| $N$ files | $N$ dataset samples |
| Cache size $M$ | Worker storage $M = \mu N$ |
| Placement phase | Initial data partition |
| Delivery phase | Shuffling between epochs |
| Demand $d_k$ | New assignment $\mathcal{A}_k^{t+1}$ |
| Coded XOR messages | Coded shuffling messages |
The MAN analysis transfers with minimal changes. The multicast gain factor $1 + KM/N$ becomes $1 + K\mu$. Coded multicasting saves bandwidth in shuffling just as it saves bandwidth in CDN delivery.
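To give a flavor of the XOR messages, here is a minimal two-worker sketch. The payloads and helper are hypothetical, and this is only the simplest coded exchange, not the general Wan-Tuninetti-Caire scheme:

```python
# Two-worker XOR exchange: worker 1 needs sample B but already caches A;
# worker 2 needs A but caches B. A single broadcast of A XOR B serves
# both, halving the uncoded cost of sending A and B separately.

def xor(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length payloads."""
    return bytes(x ^ y for x, y in zip(a, b))

sample_A = b"sample-A-payload"
sample_B = b"sample-B-payload"

coded = xor(sample_A, sample_B)         # one coded multicast message

recovered_at_w1 = xor(coded, sample_A)  # worker 1 cancels cached A -> gets B
recovered_at_w2 = xor(coded, sample_B)  # worker 2 cancels cached B -> gets A

assert recovered_at_w1 == sample_B and recovered_at_w2 == sample_A
print("one coded message served two workers: a 2x multicast gain")
```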
This cross-domain insight opens a new application of coded caching beyond content delivery: distributed computing.
Key Takeaway
Coded caching generalizes to distributed ML shuffling. Worker memory replaces the cache; shuffled data replaces file delivery. The coded-multicasting gain saves ML-cluster bandwidth. The CommIT insight of Wan-Tuninetti-Caire extends coded caching from CDN theory to compute-cluster optimization.