The Data Shuffling Problem

ML Training Is Hungry for Communication

Distributed machine learning at scale requires splitting data across many workers. Each worker trains on its portion; gradients are aggregated to update a shared model. Between training epochs, data must be shuffled (randomly reassigned to workers) to ensure diverse exposure per worker and improve convergence.

This shuffling is communication-intensive: in a naive implementation, each worker exchanges a large fraction of its data with others at every epoch. For a 100-worker cluster training on terabytes of data, shuffling can dominate training time and bandwidth.

The CommIT insight (Wan-Tuninetti-Caire): worker memory in distributed ML plays the same role as cache memory in coded caching. Coded shuffling, which XORs shuffled data with cached portions, reduces communication cost by the same $1 + Ks$ factor as MAN's coded multicasting.

This chapter shows how. The result is surprising: coded caching techniques apply directly to distributed ML, saving compute-cluster bandwidth.

Definition: Distributed Data Shuffling

Consider $K$ workers participating in distributed ML training. The dataset $\mathcal{D}$ of $D$ samples is distributed across workers such that each worker stores a fraction $s$ of the dataset: $|D_k| = sD$.

Between epochs, each worker must receive a new random subset of the dataset for the next epoch. Under epoch-wise shuffling, worker $k$'s new assignment $\tilde D_k$ is a random size-$sD$ subset of $\mathcal{D}$, independent of its previous assignment $D_k$.

The data shuffling problem asks: how can the cluster deliver the new assignments to all workers with minimum communication?

In practice, a full random permutation per epoch is not needed; approximate shuffling with smaller memory works fine. But the theoretical framework assumes full re-shuffling per epoch.
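To make the setup concrete, here is a minimal NumPy sketch of epoch-wise shuffling as defined above; the worker count, dataset size, and storage fraction are illustrative placeholders, not values from the shuffling literature.

```python
import numpy as np

# Minimal sketch of the shuffling setup (illustrative parameters):
# K workers, a dataset of D sample indices, each worker holding a random
# size-sD subset that is redrawn independently every epoch.
K, D, s = 4, 1_000, 0.25
rng = np.random.default_rng(0)

def draw_assignments():
    """One epoch's assignments: a random size-sD subset of the dataset per worker."""
    return [rng.choice(D, size=int(s * D), replace=False) for _ in range(K)]

current_assignments = draw_assignments()   # D_k: what worker k stores now
next_assignments = draw_assignments()      # \tilde D_k: worker k's assignment for the next epoch
```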

Theorem: Uncoded Data Shuffling Cost

Under uncoded shuffling (each worker receives the missing part of its new assignment as raw data from other workers), the total communication cost per epoch is
$$R_\text{uncoded} \;=\; K(1-s)\,sD \ \text{ samples},$$
or $(1-s)\,sD$ per worker, since a $(1-s)$ fraction of each worker's new size-$sD$ assignment is data it does not already hold.

Each worker's new assignment contains $sD$ samples. On average, a $(1-s)$ fraction of these are samples the worker did not previously hold, so they must be received over the network. Across all $K$ workers, the total communication is $K(1-s)\,sD$ samples.
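A quick simulation (again with made-up parameters) of this counting argument: drawing independent size-$sD$ assignments and counting, per worker, the samples it does not already hold lands close to the predicted $K(1-s)\,sD$.

```python
import numpy as np

# Empirical check of the uncoded cost: count, for each worker, the samples in its
# new assignment that are absent from its current one. Expected total: K*(1-s)*s*D.
K, D, s = 8, 10_000, 0.1
size = int(s * D)
rng = np.random.default_rng(1)

old = [rng.choice(D, size=size, replace=False) for _ in range(K)]
new = [rng.choice(D, size=size, replace=False) for _ in range(K)]

moved = sum(len(np.setdiff1d(n, o)) for n, o in zip(new, old))
print(f"moved {moved} samples; theory predicts about {K * (1 - s) * s * D:.0f}")
```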

Example: Uncoded Shuffling at Scale

Training a ResNet-50 on ImageNet (1.28M samples) with 256 workers, each storing $s = 5\%$ of the data ($sD = 64{,}000$ samples). The dataset is 150 GB. What is the per-epoch communication cost of uncoded shuffling?
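A back-of-the-envelope answer, assuming the 150 GB is spread evenly over the 1.28M samples (about 117 KB per sample): each worker must fetch $(1-s)\,sD = 0.95 \times 64{,}000 = 60{,}800$ new samples, roughly 7.1 GB, so the cluster moves
$$R_\text{uncoded} \approx K(1-s)\,s \times 150\ \text{GB} = 256 \times 0.95 \times 0.05 \times 150\ \text{GB} \approx 1.8\ \text{TB}$$
per epoch, more than twelve times the dataset itself. With the $1 + Ks = 1 + 256 \times 0.05 = 13.8$ coded-multicasting factor discussed below, this would drop to roughly 130 GB.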

The Coded-Caching ↔ Coded-Shuffling Analogy

The deep insight of Wan-Tuninetti-Caire (2020) is that data shuffling and coded caching are structurally the same problem:

Coded Caching ↔ Coded Shuffling
$K$ users ↔ $K$ workers
$N$ files ↔ $D$ dataset samples
Cache size $M$ ↔ worker storage $M = sD$
Placement phase ↔ initial data partition
Delivery phase ↔ shuffling between epochs
Demand $d_k$ ↔ new assignment $\tilde D_k$
Coded XOR delivery messages ↔ coded shuffling messages

The MAN analysis transfers with minimal changes: the $1 + KM/N$ factor becomes $1 + Ks$. Coded multicasting saves bandwidth in shuffling just as it saves bandwidth in CDN delivery.
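As a concrete illustration of a coded shuffling message, here is a minimal sketch (assuming NumPy, with made-up 1 KB samples) of a single pairwise XOR multicast: a sample needed by worker 1 but cached at worker 2 is XORed with a sample needed by worker 2 but cached at worker 1, so one transmitted message serves both workers. The full MAN-style scheme generalizes this to multicast groups of size $1 + Ks$.

```python
import numpy as np

# Toy illustration of one coded-shuffling multicast (not the full MAN scheme):
# sample_a is needed by worker 1 and already cached at worker 2;
# sample_b is needed by worker 2 and already cached at worker 1.
rng = np.random.default_rng(0)
sample_a = rng.integers(0, 256, size=1024, dtype=np.uint8)  # needed by worker 1
sample_b = rng.integers(0, 256, size=1024, dtype=np.uint8)  # needed by worker 2

# One multicast message serves both workers at once.
coded_msg = sample_a ^ sample_b

# Each worker XORs out the sample it already caches to recover the one it needs.
recovered_at_worker1 = coded_msg ^ sample_b
recovered_at_worker2 = coded_msg ^ sample_a

assert np.array_equal(recovered_at_worker1, sample_a)
assert np.array_equal(recovered_at_worker2, sample_b)
print("one coded message replaced two uncoded transfers")
```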

This cross-domain insight opens a new application of coded caching beyond content delivery: distributed computing.

Key Takeaway

Coded caching generalizes to distributed ML shuffling. Worker memory replaces cache memory; shuffled data replaces file delivery. The $1 + Ks$ coded-multicasting gain saves ML-cluster bandwidth. The CommIT insight of Wan-Tuninetti-Caire extends coded caching from CDN theory to compute-cluster optimization.