Coded Shuffling (CommIT Wan-Tuninetti-Caire)
Adapting MAN to Shuffling
With the coded-caching analogy in mind, the scheme for coded shuffling is clear: apply MAN-style placement and delivery to the data-shuffling setting. The twist is that both the "cache" (worker memory) and the "demands" (new assignments) change every epoch. We need a scheme that reconfigures efficiently.
The Wan-Tuninetti-Caire 2020 paper establishes the exact rate formula and a matching achievable scheme. The rate matches MAN's, with the per-worker storage fraction $\mu$ a direct analog of the cache fraction $M/N$.
Theorem: Wan-Tuninetti-Caire Coded Shuffling Rate
For data shuffling with $K$ workers and per-worker storage fraction $\mu$, the achievable communication cost per epoch is $R = N(1-\mu)/(1+K\mu)$, where $N$ is the total dataset size. The reduction factor over uncoded shuffling is $1+K\mu$.
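A quick numeric check of the rate formula; the cluster size, storage fraction, and dataset size below are illustrative choices, not values from the text:

```python
def coded_shuffle_rate(N, K, mu):
    """Per-epoch communication (in samples) for K workers with
    per-worker storage fraction mu and an N-sample dataset."""
    return N * (1 - mu) / (1 + K * mu)

# Illustrative numbers: 100 workers, 10% storage, 1M samples.
N, K, mu = 1_000_000, 100, 0.1
uncoded = N * (1 - mu)                 # ship every missing fraction raw
coded = coded_shuffle_rate(N, K, mu)
print(uncoded / coded)                 # reduction factor 1 + K*mu, about 11
```

The $(1-\mu)$ factor cancels in the ratio, leaving exactly the $1+K\mu$ gain of the theorem.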
Treat each sample as a "file" split into coded subfragments. MAN-style coded XOR messages simultaneously shuffle data for multiple workers. The "caching gain" parameter $t = K\mu$ drives the reduction factor $1 + t = 1 + K\mu$.
Setup
Each worker stores a $\mu$ fraction of the dataset. Over an epoch, each worker needs the missing $(1-\mu)$ fraction of each of its $N/K$ newly assigned samples to fill its vacancy (to receive the next random subset).
MAN-style placement
Partition each sample into $\binom{K}{t}$ subparts, with $t = K\mu$. Worker $k$ stores the subparts indexed by $t$-subsets $T \subseteq [K]$ with $k \in T$.
Coded shuffle messages
Server (or any aggregator) sends $\binom{K}{t+1}$ XOR messages: for each $(t+1)$-subset $S \subseteq [K]$, broadcast $\bigoplus_{k \in S} W_{d_k,\, S \setminus \{k\}}$, where $d_k$ is worker $k$'s new assignment.
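For concreteness, a toy enumeration of the delivery messages for $K=3$, $t=1$ (a symbolic sketch using the $W_{d_k, T}$ notation above; the parameters are my own small example):

```python
from itertools import combinations

K, t = 3, 1                          # 3 workers, t = K*mu = 1
for S in combinations(range(1, K + 1), t + 1):    # the C(3,2) = 3 subsets
    terms = [f"W(d{k},{sorted(set(S) - {k})})" for k in S]
    print(" XOR ".join(terms))       # e.g. W(d1,[2]) XOR W(d2,[1]) for S={1,2}
```

Each message serves all $t+1$ workers in $S$ at once, which is where the coding gain comes from.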
Decoding
Each worker $k \in S$ already stores every other term $W_{d_j,\, S \setminus \{j\}}$ with $j \neq k$ (since $k \in S \setminus \{j\}$), so it cancels them by XOR and recovers its needed subpart $W_{d_k,\, S \setminus \{k\}}$. Follows MAN logic exactly.
Rate
Number of XOR messages: $\binom{K}{t+1}$. Each of size $F/\binom{K}{t}$ data units, where $F$ is the size of one sample. Total: $F\,\binom{K}{t+1}/\binom{K}{t} = F\,(K-t)/(t+1)$. Plugging $t = K\mu$: $F\,K(1-\mu)/(1+K\mu)$ per round of one new sample per worker; over the $N/K$ rounds of an epoch this gives $R = N(1-\mu)/(1+K\mu)$ (in units of $F$).
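The whole pipeline (placement, XOR delivery, decoding, message count) can be checked end to end with a small simulation. The parameters $K=6$, $t=2$ and random byte-valued subparts are my own illustrative choices:

```python
import random
from itertools import combinations
from math import comb

random.seed(0)
K, t = 6, 2                          # K workers, t = K*mu = 2 (mu = 1/3)
workers = list(range(K))
tsets = list(combinations(workers, t))
idx = {T: i for i, T in enumerate(tsets)}   # C(6,2) = 15 subparts per sample

# One new sample per worker this round; subparts are random bytes.
samples = {s: [random.randrange(256) for _ in tsets] for s in workers}

# Placement: worker w caches subpart T of *every* sample iff w is in T.
cache = {w: {(s, T): samples[s][idx[T]] for s in workers for T in tsets if w in T}
         for w in workers}

d = workers[:]                       # new assignment: a random permutation
random.shuffle(d)

# Delivery: one XOR message per (t+1)-subset S, as in the scheme above.
msgs = {}
for S in combinations(workers, t + 1):
    x = 0
    for k in S:
        x ^= samples[d[k]][idx[tuple(sorted(set(S) - {k}))]]
    msgs[S] = x

# Decoding: worker w cancels its cached terms and recovers subpart S\{w}.
for w in workers:
    got = {}
    for S, x in msgs.items():
        if w in S:
            for k in set(S) - {w}:
                x ^= cache[w][(d[k], tuple(sorted(set(S) - {k})))]
            got[tuple(sorted(set(S) - {w}))] = x
    full = [got[T] if w not in T else cache[w][(d[w], T)] for T in tsets]
    assert full == samples[d[w]]     # worker w now holds all of sample d[w]

# C(6,3) = 20 coded messages vs K*C(K-1,t) = 60 uncoded subpart sends:
print(len(msgs), "coded messages vs", K * comb(K - 1, t), "uncoded, factor", t + 1)
```

The factor-of-$(t+1)$ saving in the last line is exactly the $1+K\mu$ reduction of the rate formula.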
Coded Data Shuffling
The Wan-Tuninetti-Caire 2020 CommIT paper establishes the fundamental limits of distributed data shuffling, recasting the problem in the coded-caching framework:
- Coded shuffling rate. Communication cost per epoch reduces from $N(1-\mu)$ (uncoded) to $N(1-\mu)/(1+K\mu)$ (coded), a factor-of-$(1+K\mu)$ improvement.
- Order-optimal. The achievable rate matches the cut-set lower bound in certain regimes, and is within a factor of 2 of optimal in general.
- Cross-domain insight. Coded caching techniques apply to distributed computing. Worker memory is like cache; data reshuffling is like content delivery.
The paper opened the door to coded computing as an application of coded-caching theory. Subsequent work has extended this to federated learning, all-reduce communication, and gradient coding. The CommIT group continues to extend this framework with Tuninetti at UIC.
Practical impact: for 100+ worker ML clusters with 10% per-worker storage, coded shuffling can reduce inter-epoch bandwidth by roughly 10×. At ML scale (terabyte datasets, hundreds of epochs), this is a substantial operational saving.
Coded Data Shuffling for Distributed ML
Uncoded vs Coded Shuffling Cost
Communication cost per epoch: uncoded (linear in $K$) vs coded (saturating as $K$ grows). Coded saturation means: beyond moderate $K$, adding workers doesn't add aggregate communication; with per-worker assignment size $q$ fixed, the coded cost approaches $q(1-\mu)/\mu$. Major practical saving.
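A sketch of the saturation effect, holding the per-worker assignment size $q$ fixed so the dataset grows with $K$ ($q$ and $\mu$ below are illustrative):

```python
q, mu = 1000, 0.1                    # samples per worker, storage fraction
for K in (10, 100, 1000, 10000):
    N = K * q
    uncoded = N * (1 - mu)           # grows linearly in K
    coded = uncoded / (1 + K * mu)   # saturates near q*(1-mu)/mu = 9000
    print(f"K={K:>5}  uncoded={uncoded:>9.0f}  coded={coded:>7.0f}")
```

Uncoded cost grows without bound; the coded curve flattens out near 9000 samples per epoch regardless of cluster size.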
Coded Shuffling Gain Factor
Gain $1+K\mu$ vs per-worker storage $\mu$, for varying $K$. Larger $K$ and $\mu$ give larger gain. For typical ML clusters ($K = 100$, $\mu = 0.1$): gain factor 11, substantial.
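The gain factor itself is just $1+K\mu$; a small grid (my own choice of sweep values) reproduces the quoted gain of 11 at $K=100$, $\mu=0.1$:

```python
for K in (10, 50, 100, 500):
    gains = {mu: round(1 + K * mu, 1) for mu in (0.05, 0.1, 0.2)}
    print(f"K={K:>3}", gains)         # K=100 gives gain 11.0 at mu=0.1
```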
Example: Shuffling Savings in Production ML
Revisit the ResNet-50 / ImageNet example: $K = 128$ workers, $\mu = 0.1$. Compute coded shuffling rate and total training savings.
Coded rate
$1 + K\mu = 1 + 12.8 = 13.8$. Coded/uncoded ratio: $1/13.8 \approx 7.2\%$. Coded rate: about 7% of uncoded.
Per-epoch
Uncoded: 36.5 TB. Coded: $36.5/13.8 \approx 2.64$ TB per epoch.
Training total
90 epochs: uncoded 3.3 PB, coded 240 TB. Savings: ~3 PB per training run.
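The example's arithmetic, assuming $K=128$ and $\mu=0.1$ (the parameter pair consistent with the 13.8 reduction factor):

```python
K, mu = 128, 0.1                     # assumed cluster parameters
uncoded_epoch_tb = 36.5              # TB per epoch, from the example
gain = 1 + K * mu                    # 13.8
coded_epoch_tb = uncoded_epoch_tb / gain
epochs = 90
print(round(gain, 1))                    # 13.8
print(round(coded_epoch_tb, 2))          # ~2.64 TB per coded epoch
print(round(uncoded_epoch_tb * epochs))  # 3285 TB, i.e. ~3.3 PB uncoded
print(round(coded_epoch_tb * epochs))    # ~238 TB coded (quoted as ~240 TB)
```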
Cluster bandwidth
Shuffling time reduces from 12 s per epoch to roughly 0.9 s. Aggregate training speedup: 5-10% (shuffling is not the entire bottleneck, but the savings are meaningful).
Engineering payoff
For a 1000-run production model factory, coded shuffling saves petabytes of cross-server traffic per year. Not transformative for model training, but substantial for data center operators.
Cumulative Training Communication
Cumulative communication cost over training epochs. Coded shuffling reduces the slope; at 50 epochs the gap is pronounced.
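The cumulative curves are straight lines whose slopes differ by the 13.8 factor; a sketch using the running example's per-epoch numbers:

```python
per_epoch_tb, gain = 36.5, 13.8      # from the running example
cum_uncoded = [e * per_epoch_tb for e in range(51)]
cum_coded = [c / gain for c in cum_uncoded]
# gap at epoch 50: ~1825 TB uncoded vs ~132 TB coded
print(cum_uncoded[-1], round(cum_coded[-1], 1))
```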
Key Takeaway
The Wan-Tuninetti-Caire coded shuffling scheme saves bandwidth by a factor of $1+K\mu$. Worker memory serves as the coded cache; XOR shuffling messages replace raw transfers. For realistic ML clusters, this is a 10-20× bandwidth reduction, a major operational gain imported directly from coded caching theory.