The Data Shuffling Problem

ML Training Is Hungry for Communication

Distributed machine learning at scale requires splitting data across many workers. Each worker trains on its portion; gradients are aggregated to update a shared model. Between training epochs, data must be shuffled (randomly reassigned to workers) to ensure diverse exposure per worker and improve convergence.

This shuffling is communication-intensive: in a naive implementation, each worker exchanges a large fraction of its data with others at every epoch. For a 100-worker cluster training on terabytes of data, shuffling can dominate training time and bandwidth.

The CommIT insight (Wan-Tuninetti-Caire): worker memory in distributed ML plays the same role as cache memory in coded caching. Coded shuffling, which XORs shuffled data with cached portions, reduces communication cost by the same $1 + Ks$ factor as MAN's coded multicasting.

This chapter shows how. The result is surprising: coded caching techniques apply directly to distributed ML, saving compute-cluster bandwidth.

Definition: Distributed Data Shuffling

Consider $K$ workers participating in distributed ML training. The dataset $\mathcal{D}$ of $D$ samples is distributed across workers such that each worker stores a fraction $s$ of the dataset: $|D_k| = sD$.

Between epochs, each worker must receive a new random subset of the dataset for the next epoch. Under epoch-wise shuffling, worker $k$'s new assignment $\tilde D_k$ is a random size-$sD$ subset of $\mathcal{D}$, independent of its previous assignment $D_k$.

The data shuffling problem asks: how can the cluster deliver the new assignments to all workers with minimum communication?

In practice, a full random permutation per epoch is not needed; approximate shuffling with smaller memory works fine. But the theoretical framework assumes full re-shuffling per epoch.
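To make the setup concrete, here is a minimal NumPy sketch of epoch-wise shuffling as defined above; the worker count, dataset size, and storage fraction are illustrative placeholders, not values from the shuffling literature.

```python
import numpy as np

# Minimal sketch of the shuffling setup (illustrative parameters):
# K workers, a dataset of D sample indices, each worker holding a random
# size-sD subset that is redrawn independently every epoch.
K, D, s = 4, 1_000, 0.25
rng = np.random.default_rng(0)

def draw_assignments():
    """One epoch's assignments: a random size-sD subset of the dataset per worker."""
    return [rng.choice(D, size=int(s * D), replace=False) for _ in range(K)]

current_assignments = draw_assignments()   # D_k: what worker k stores now
next_assignments = draw_assignments()      # \tilde D_k: worker k's assignment for the next epoch
```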

Theorem: Uncoded Data Shuffling Cost

Under uncoded shuffling (each worker receives the missing part of its new assignment as raw data from other workers), the total communication cost per epoch is
$$R_\text{uncoded} \;=\; K(1-s)\,sD \ \text{ samples},$$
or $(1-s)\,sD$ per worker, since a $(1-s)$ fraction of each worker's new size-$sD$ assignment is data it does not already hold.

Each worker's new assignment contains $sD$ samples. On average, a $(1-s)$ fraction of these are samples the worker did not previously hold, so they must be received over the network. Across all $K$ workers, the total communication is $K(1-s)\,sD$ samples.
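A quick simulation (again with made-up parameters) of this counting argument: drawing independent size-$sD$ assignments and counting, per worker, the samples it does not already hold lands close to the predicted $K(1-s)\,sD$.

```python
import numpy as np

# Empirical check of the uncoded cost: count, for each worker, the samples in its
# new assignment that are absent from its current one. Expected total: K*(1-s)*s*D.
K, D, s = 8, 10_000, 0.1
size = int(s * D)
rng = np.random.default_rng(1)

old = [rng.choice(D, size=size, replace=False) for _ in range(K)]
new = [rng.choice(D, size=size, replace=False) for _ in range(K)]

moved = sum(len(np.setdiff1d(n, o)) for n, o in zip(new, old))
print(f"moved {moved} samples; theory predicts about {K * (1 - s) * s * D:.0f}")
```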

Example: Uncoded Shuffling at Scale

Training a ResNet-50 on ImageNet (1.28M samples) with 256 workers, each storing $s = 5\%$ of the data ($sD = 64{,}000$ samples). The dataset is 150 GB. What is the per-epoch communication cost of uncoded shuffling?
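A back-of-the-envelope answer, assuming the 150 GB is spread evenly over the 1.28M samples (about 117 KB per sample): each worker must fetch $(1-s)\,sD = 0.95 \times 64{,}000 = 60{,}800$ new samples, roughly 7.1 GB, so the cluster moves
$$R_\text{uncoded} \approx K(1-s)\,s \times 150\ \text{GB} = 256 \times 0.95 \times 0.05 \times 150\ \text{GB} \approx 1.8\ \text{TB}$$
per epoch, more than twelve times the dataset itself. With the $1 + Ks = 1 + 256 \times 0.05 = 13.8$ coded-multicasting factor discussed below, this would drop to roughly 130 GB.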

The Coded-Caching ↔ Coded-Shuffling Analogy

The deep insight of Wan-Tuninetti-Caire (2020) is that data shuffling and coded caching are structurally the same problem:

Coded Caching ↔ Coded Shuffling
$K$ users ↔ $K$ workers
$N$ files ↔ $D$ dataset samples
Cache size $M$ ↔ worker storage $M = sD$
Placement phase ↔ initial data partition
Delivery phase ↔ shuffling between epochs
Demand $d_k$ ↔ new assignment $\tilde D_k$
Coded XOR delivery messages ↔ coded shuffling messages

The MAN analysis transfers with minimal changes: the $1 + KM/N$ factor becomes $1 + Ks$. Coded multicasting saves bandwidth in shuffling just as it saves bandwidth in CDN delivery.
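As a concrete illustration of a coded shuffling message, here is a minimal sketch (assuming NumPy, with made-up 1 KB samples) of a single pairwise XOR multicast: a sample needed by worker 1 but cached at worker 2 is XORed with a sample needed by worker 2 but cached at worker 1, so one transmitted message serves both workers. The full MAN-style scheme generalizes this to multicast groups of size $1 + Ks$.

```python
import numpy as np

# Toy illustration of one coded-shuffling multicast (not the full MAN scheme):
# sample_a is needed by worker 1 and already cached at worker 2;
# sample_b is needed by worker 2 and already cached at worker 1.
rng = np.random.default_rng(0)
sample_a = rng.integers(0, 256, size=1024, dtype=np.uint8)  # needed by worker 1
sample_b = rng.integers(0, 256, size=1024, dtype=np.uint8)  # needed by worker 2

# One multicast message serves both workers at once.
coded_msg = sample_a ^ sample_b

# Each worker XORs out the sample it already caches to recover the one it needs.
recovered_at_worker1 = coded_msg ^ sample_b
recovered_at_worker2 = coded_msg ^ sample_a

assert np.array_equal(recovered_at_worker1, sample_a)
assert np.array_equal(recovered_at_worker2, sample_b)
print("one coded message replaced two uncoded transfers")
```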

This cross-domain insight opens a new application of coded caching beyond content delivery: distributed computing.

Key Takeaway

Coded caching generalizes to distributed ML shuffling. Worker memory replaces cache memory; shuffled data replaces file delivery. The $1 + Ks$ coded-multicasting gain saves ML-cluster bandwidth. The CommIT insight of Wan-Tuninetti-Caire extends coded caching from CDN theory to compute-cluster optimization.