Coded Shuffling (CommIT Wan-Tuninetti-Caire)

Adapting MAN to Shuffling

With the coded-caching analogy in mind, the scheme for coded shuffling is clear: apply MAN-style placement and delivery to the data-shuffling setting. The twist is that both the "cache" (worker memory) and the "demands" (new assignments) change every epoch. We need a scheme that reconfigures efficiently.

The Wan-Tuninetti-Caire 2020 paper establishes the rate formula and a matching achievable scheme. The rate matches MAN's (K − t)/(t + 1) with t = Ks, a direct analog.

Theorem: Wan-Tuninetti-Caire Coded Shuffling Rate

For data shuffling with K workers and per-worker storage fraction s, the achievable communication cost per epoch is R_coded(s) = K(1 − s)/(1 + Ks) · D data units, where D is the total dataset size. The reduction factor over uncoded shuffling is 1 + Ks.

Treat each sample as a "file" split into coded subfragments. MAN-style coded XOR messages simultaneously shuffle data for multiple workers. The "caching gain" parameter t = Ks drives the reduction factor 1 + Ks.
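The rate formula is easy to evaluate directly. A minimal sketch (function names are my own, not from the paper):

```python
def uncoded_rate(K, s, D=1.0):
    """Uncoded shuffling: each worker may need up to (1 - s)D fresh data per epoch."""
    return K * (1 - s) * D

def coded_rate(K, s, D=1.0):
    """Wan-Tuninetti-Caire coded shuffling rate: K(1 - s)D / (1 + Ks)."""
    return K * (1 - s) * D / (1 + K * s)

def gain(K, s):
    """Reduction factor of coded over uncoded shuffling: 1 + Ks."""
    return 1 + K * s

# Example: K = 10 workers, each storing s = 0.25 of the dataset.
print(uncoded_rate(10, 0.25))  # 7.5 dataset units
print(coded_rate(10, 0.25))    # ~2.14 dataset units
print(gain(10, 0.25))          # 3.5
```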

🎓 CommIT Contribution (2020)

Coded Data Shuffling

K. Wan, D. Tuninetti, G. Caire – IEEE Transactions on Information Theory

The Wan-Tuninetti-Caire 2020 CommIT paper establishes the fundamental limits of distributed data shuffling, recasting the problem in the coded-caching framework:

  1. Coded shuffling rate. Communication cost per epoch reduces from K(1 − s)D (uncoded) to K(1 − s)D/(1 + Ks) (coded), a factor-of-(1 + Ks) improvement.
  2. Order-optimal. The achievable rate matches the cut-set lower bound for small ss; within factor 2 for large ss.
  3. Cross-domain insight. Coded caching techniques apply to distributed computing. Worker memory is like cache; data reshuffling is like content delivery.

The paper opened the door to coded computing as an application of coded-caching theory. Subsequent work has extended this to federated learning, all-reduce communication, and gradient coding. The CommIT group continues to extend this framework with Tuninetti at UIC.

Practical impact: for 100+ worker ML clusters with 10% per-worker storage, coded shuffling can reduce inter-epoch bandwidth by roughly 11× (gain 1 + Ks = 11 at K = 100, s = 0.1). At ML scale (terabyte datasets, hundreds of epochs), this is a substantial operational saving.


Coded Data Shuffling for Distributed ML

Aggregator (left) and K workers (right) with per-worker storage sD. An XOR-coded shuffle message simultaneously transfers new data to multiple workers. Bandwidth saving factor: 1 + Ks, the CommIT Wan-Tuninetti-Caire result.
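The XOR mechanism can be shown in miniature. A toy sketch with two workers and hypothetical fragments (the simplest t + 1 = 2 multicast group; fragment contents are made up for illustration):

```python
def xor(a, b):
    """Bytewise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

# Toy setup: worker 1 needs fragment A but already caches B;
# worker 2 needs fragment B but already caches A.
A = b"frag-A!!"
B = b"frag-B!!"

coded_msg = xor(A, B)              # one multicast message serves both workers

decoded_at_w1 = xor(coded_msg, B)  # worker 1 cancels its cached B, recovering A
decoded_at_w2 = xor(coded_msg, A)  # worker 2 cancels its cached A, recovering B

assert decoded_at_w1 == A and decoded_at_w2 == B
```

One coded transmission replaces two uncoded ones; with larger t, each XOR serves t + 1 workers at once.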

Uncoded vs Coded Shuffling Cost

Communication cost per epoch: uncoded (linear in K) vs coded (saturating at (1 − s)/s). Coded saturation means that beyond moderate K, adding workers doesn't add aggregate communication. A major practical saving.
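The saturation is visible numerically. A quick check, fixing s = 0.25 and growing K (illustrative values):

```python
def coded_rate(K, s, D=1.0):
    # Coded shuffling cost per epoch: K(1 - s)D / (1 + Ks).
    return K * (1 - s) * D / (1 + K * s)

s = 0.25
limit = (1 - s) / s  # saturation value: 3.0 dataset units
for K in (10, 100, 1000, 10000):
    print(K, round(coded_rate(K, s), 3))
# The coded cost approaches (1 - s)/s = 3.0 as K grows,
# while the uncoded cost K(1 - s) grows without bound.
```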


Coded Shuffling Gain Factor

Gain (1 + Ks) vs per-worker storage s, for varying K. Larger K and s give larger gain. For typical ML clusters (K = 100, s = 0.1): gain factor 11, substantial.


Example: Shuffling Savings in Production ML

Revisit the ResNet-50 / ImageNet example: K = 256, s = 0.05. Compute the coded shuffling rate and total training savings.
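A back-of-the-envelope in Python, with costs in units of the dataset size D (the 50-epoch horizon is an assumption for illustration):

```python
K, s = 256, 0.05                  # workers, per-worker storage fraction
epochs = 50                       # illustrative training length (assumption)

g = 1 + K * s                     # reduction factor: 13.8
uncoded = K * (1 - s)             # per-epoch uncoded cost: 243.2 D
coded = uncoded / g               # per-epoch coded cost: ~17.6 D

print(f"gain factor       : {g:.1f}x")
print(f"uncoded per epoch : {uncoded:.1f} D")
print(f"coded per epoch   : {coded:.2f} D")
print(f"saved over {epochs} epochs: {(uncoded - coded) * epochs:.0f} D")
```

Even at a modest 5% storage fraction, the large worker count K drives a nearly 14× per-epoch reduction.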

Cumulative Training Communication

Cumulative communication cost over training epochs. Coded shuffling reduces the slope; at 50 epochs the gap is pronounced.


Key Takeaway

The Wan-Tuninetti-Caire coded shuffling scheme saves bandwidth by a factor of (1 + Ks). Worker memory serves as the coded cache; XOR shuffle messages replace raw transfers. For realistic ML clusters this is a 10–20× bandwidth reduction, a major operational gain imported directly from coded-caching theory.