Why ML Training Re-Shuffles Data
The Shuffling Step is a Real Distributed-Systems Cost
Every production SGD loop re-shuffles the training data at the start of each epoch. The reason comes straight from the convergence theory: without re-shuffling, stochastic gradients become dependent across iterations, and SGD's convergence rate degrades from the O(1/T²) of random reshuffling to O(1/T) or worse (Bertsekas; Nedić; Recht and Ré). In short: shuffling is not a cosmetic choice; it is a requirement for good convergence.
When training is distributed across workers, shuffling means moving data across workers. At the start of each epoch, each worker's mini-batch comes from a random permutation of the full dataset, which in general lives partially on other workers. The point is that this movement is exactly the computation-communication problem of Chapter 2, and the same coded-communication machinery lets us reduce the cost by a factor that grows with the aggregate cache redundancy KM/N.
The CommIT-group contribution by Wan, Tuninetti, and Caire (Section 7.3) gives the exact tradeoff between per-worker memory M and per-epoch communication R(M). The construction reuses the finite-field IA tools of Chapter 4.
Definition: Epoch and Shuffle in Distributed SGD
Epoch and Shuffle in Distributed SGD
In stochastic gradient descent with mini-batches of size B, one epoch consists of N/B iterations, processing every data point in the training set exactly once. A shuffle is a uniformly random permutation π_t applied to the N data indices at the start of epoch t.
In the distributed setting with K workers, the dataset is partitioned into K groups before training starts. At every epoch, worker k must receive (or generate locally from cached data) its share of the freshly-permuted dataset: specifically, the subset { x_{π_t(i)} : i ∈ A_k(t) }, where A_k(t) is worker k's assigned mini-batch index set at epoch t. Without re-shuffling, the mini-batches become correlated across epochs and convergence rates degrade.
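As a concrete sketch (the function name and seeding scheme here are illustrative, not from the text), each worker can derive the epoch-t permutation from a shared seed and then take its own 1/K slice, so agreeing on π_t itself costs no communication; only the induced data movement does:

```python
import random

def epoch_shard(num_points, num_workers, worker_id, epoch, seed=0):
    """Indices worker `worker_id` processes in epoch `epoch`.

    All workers seed a private RNG from the same (seed, epoch) pair,
    so they agree on the permutation pi_t without exchanging messages.
    """
    rng = random.Random(seed * 1_000_003 + epoch)  # shared derivation
    perm = list(range(num_points))
    rng.shuffle(perm)                              # fresh pi_t each epoch
    shard = num_points // num_workers
    return perm[worker_id * shard:(worker_id + 1) * shard]
```

With N = 12 and K = 3, the three shards are disjoint and jointly cover every index, so each epoch still visits every data point exactly once.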
In practice, many systems skip shuffling (treating the data as cyclically ordered) and absorb the convergence-rate penalty. The Wan / Tuninetti / Caire result makes proper shuffling cheap enough that the compromise is unnecessary.
Data Shuffling
The operation of re-permuting data-point assignments across workers between SGD epochs. Required for the best theoretical convergence rates of SGD. Coded shuffling exploits the overlap among per-worker caches to reduce the per-epoch communication below the uncoded N(1 − M/N), by a factor that grows with KM/N.
Per-Epoch Shuffling Cost
The total network traffic (per epoch, normalized by the dataset size) required to deliver each worker's newly-permuted data. For K workers each with memory M and dataset size N, the uncoded cost is 1 − M/N (i.e., N − M data points), and the coded minimum is smaller by a factor that scales with KM/N; the exact expression is derived in Section 7.3.
Theorem: Uncoded Shuffling Cost
Suppose K workers hold M-sized caches of the N-point dataset and must receive, at the start of each epoch, a fresh permutation assignment of size N/K per worker. The total uncoded per-epoch shuffling communication is

R_uncoded = K · (N/K) · (1 − M/N) = N(1 − M/N) = N − M,

or 1 − M/N when normalized by the dataset size. This is the baseline for the coded shuffling improvement.
Each worker has cached M of the N data points. Each worker needs N/K new points per epoch. On average, a fraction 1 − M/N of those points is not already cached and must be fetched over the network. Multiplying by the K workers gives N(1 − M/N), the uncoded rate, independent of any clever pre-caching coordination.
Per-worker deficit
Each worker is assigned N/K new data points. Of these, a fraction M/N is already cached (under a generic placement, the cached set looks uniformly random across all N points). The per-worker uncached fraction is therefore 1 − M/N.
Sum and normalize
Per-worker uncached points: (N/K)(1 − M/N). Summing over the K workers gives K · (N/K)(1 − M/N) = N(1 − M/N) data points of network traffic, which, normalized by the dataset size N, gives the rate 1 − M/N.
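The counting argument above can be checked numerically. The simulation below (parameters and function name are illustrative) draws random per-worker caches and a fresh permutation each epoch, and recovers the 1 − M/N rate:

```python
import random

def uncoded_shuffle_traffic(N, K, M, epochs=200, seed=1):
    """Monte-Carlo estimate of the normalized uncoded shuffling cost.

    Each worker caches a uniformly random M-subset of the N points;
    each epoch assigns it N//K fresh points from a new permutation.
    Returns the average fraction of the dataset fetched per epoch,
    which the theorem predicts to be 1 - M/N.
    """
    rng = random.Random(seed)
    caches = [set(rng.sample(range(N), M)) for _ in range(K)]
    fetched = 0
    for _ in range(epochs):
        perm = rng.sample(range(N), N)            # fresh permutation pi_t
        for k in range(K):
            shard = perm[k * (N // K):(k + 1) * (N // K)]
            fetched += sum(1 for i in shard if i not in caches[k])
    return fetched / (epochs * N)

est = uncoded_shuffle_traffic(N=1000, K=10, M=300)
print(est)   # close to 1 - M/N = 0.7
```

Because any fresh point misses a size-M cache with probability exactly 1 − M/N, the estimate concentrates tightly around 0.7 for these parameters.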
Example: Shuffling 1 GB Across 10 Workers
A 1 GB training set is shuffled across K = 10 workers, each with memory M = 0.3N (30% of the dataset cached). Compute the uncoded per-epoch network traffic.
Plug in
R = N(1 − M/N) = N(1 − 0.3) = 0.7N, i.e., 0.7 datasets' worth of data. Normalized by the dataset size N = 1 GB, that is 0.7 GB of network traffic per epoch.
Over 100 epochs
0.7 GB × 100 epochs = 70 GB total shuffling traffic over a training run, often greater than the gradient-aggregation traffic. This is why shuffling is a first-class system concern, not a background detail.
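The arithmetic, wrapped in a small helper (the function name is ours, not the text's), reproduces both the per-epoch and full-run figures:

```python
def shuffling_traffic_gb(dataset_gb, cache_fraction, epochs):
    """Uncoded cost N(1 - M/N) per epoch, scaled to GB and a full run."""
    per_epoch = dataset_gb * (1.0 - cache_fraction)
    return per_epoch, per_epoch * epochs

per_epoch, total = shuffling_traffic_gb(dataset_gb=1.0, cache_fraction=0.3,
                                        epochs=100)
print(per_epoch, total)  # ~0.7 GB per epoch, ~70 GB over the run
```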
Shuffling Cost vs. Gradient-Aggregation Cost
Plot the per-epoch shuffling traffic N(1 − M/N) against the per-epoch gradient-aggregation traffic as a function of training-set size N, with fixed model size d and a fixed number of epochs per training run. The relative cost depends on the balance between N and d: for very large datasets, shuffling dominates; for very large models, gradients dominate. In either regime, the shuffling traffic is substantial enough that coded approaches pay off.
Parameters
Number of workers: K = 10
Per-worker memory fraction: M/N = 0.3
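A minimal cost model for such a plot, under assumptions not fixed by the text (ring-allreduce aggregation moving roughly 2d parameters per worker per step, 4-byte parameters, and the hypothetical batch and sample sizes below):

```python
def per_epoch_traffic_bytes(n_samples, sample_bytes, d_params, workers,
                            param_bytes=4, batch=256, cache_frac=0.3):
    """Illustrative per-epoch totals: uncoded shuffling vs. gradient
    aggregation under a ring-allreduce model (~2d per worker per step)."""
    shuffle = n_samples * sample_bytes * (1 - cache_frac)   # N(1 - M/N)
    steps = n_samples / batch
    grad = steps * 2 * d_params * param_bytes * workers
    return shuffle, grad

# ImageNet-like guesses: ~100 KB/image, 25 M parameters, 16 workers
s, g = per_epoch_traffic_bytes(1_300_000, 100_000, 25_000_000, 16)
```

Whether shuffling or aggregation dominates depends on how sample size and model size compare; sweeping n_samples or d_params traces out the tradeoff the plot describes.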
Why No Shuffling Is Actually Bad
It is tempting to skip shuffling to save network traffic. The point is that SGD's convergence analysis depends on independent sample draws at each iteration. Without shuffling, the per-iteration samples become cyclically correlated with the mini-batch structure, and several convergence guarantees degrade:
- i.i.d. sampling with replacement: O(1/T) convergence rate for strongly-convex losses.
- Random-reshuffling sampling: O(1/T²) convergence rate under mild conditions, actually better than with-replacement (Gürbüzbalaban–Ozdaglar–Parrilo 2015).
- No reshuffling (cyclic order): O(1/T) convergence rate or worse, depending on the objective.
So shuffling is not just "nice to have" — re-shuffling actually accelerates convergence compared to even i.i.d. sampling with replacement. The reason is that each epoch visits every data point exactly once, which reduces variance in the stochastic gradient. Coded shuffling (§7.3) makes this optimal shuffling affordable.
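A toy experiment makes the two sampling schemes concrete (entirely illustrative: 1-D least squares with a constant step size, not an example from the text). Random reshuffling visits every point exactly once per epoch, which is the variance-reduction effect described above:

```python
import random

def sgd_final_error(reshuffle, epochs=50, n=64, lr=0.05, seed=0):
    """Minimize f(w) = mean_i (w - a_i)^2 / 2, whose optimum is mean(a).

    reshuffle=True : one fresh permutation per epoch (random reshuffling)
    reshuffle=False: i.i.d. with-replacement sampling
    Returns the final distance |w - mean(a)|.
    """
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    opt = sum(a) / n
    w = 10.0
    for _ in range(epochs):
        order = (rng.sample(range(n), n) if reshuffle
                 else [rng.randrange(n) for _ in range(n)])
        for i in order:
            w -= lr * (w - a[i])          # gradient of (w - a_i)^2 / 2
    return abs(w - opt)

print(sgd_final_error(True), sgd_final_error(False))
```

Both schemes converge from the far-away start w = 10; comparing the two printed errors across seeds illustrates the per-epoch advantage of reshuffling.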
Shuffling in Production Systems
Production ML training pipelines (Horovod, PyTorch DistributedDataParallel, TensorFlow MirroredStrategy) shuffle via random index permutations per epoch, with each worker iterating over its shard of the permuted dataset. The underlying shuffle mechanism is usually random-shuffle-on-read-from-disk, avoiding network traffic by pre-caching the entire dataset on each worker's SSD when feasible.
At the scale of ImageNet (1.3 M images, 150 GB) on a 16-worker cluster, the per-epoch shuffling cost becomes non-trivial if the dataset doesn't fit in each worker's memory. Coded shuffling becomes attractive when per-worker storage is on the order of N/K rather than N, i.e., when no single worker can cache the full dataset. For foundation-model training on 100 B+-token datasets, the per-epoch shuffling cost is a serious infrastructure concern.
- ImageNet-scale: 150 GB dataset, K = 16 workers
- LLM training: multi-TB token datasets, worker counts in the hundreds
- Per-epoch shuffling can match or exceed the gradient-aggregation cost
Common Mistake: Skipping Shuffling to Save Bandwidth
Mistake:
Train distributed SGD without epoch-level shuffling to avoid the per-epoch network cost.
Correction:
Without shuffling, the implicit sample ordering is cyclic, and SGD's convergence degrades measurably. Empirically, random-reshuffling converges noticeably faster per epoch than i.i.d. sampling with replacement on standard classification benchmarks (Gürbüzbalaban et al. 2015). Skipping shuffling saves one cost but pays a larger one in extra training epochs. Coded shuffling (§7.3) is the better answer.
Historical Note: The Random-Reshuffling Puzzle
2015–present. For decades, SGD theory was analyzed under the i.i.d.-sampling assumption: each iteration's gradient is computed on a sample drawn uniformly at random, with replacement. Meanwhile, practical implementations used random-reshuffling (one random permutation per epoch, processing the dataset in that order), and empirical results consistently showed reshuffling to be faster. Gürbüzbalaban, Ozdaglar, and Parrilo (2015), and independently Ying, Yuan, and Sayed (2017), closed the theoretical gap: they proved that random-reshuffling achieves O(1/T²) convergence, a meaningful speedup over the O(1/T) of i.i.d. sampling. This result theoretically justifies shuffling at every epoch, motivating the coded approaches of this chapter.
Key Takeaway
Shuffling is not optional for optimal SGD convergence. Random-reshuffling gives O(1/T²) vs. O(1/T) for i.i.d. sampling, a substantial speedup per epoch. In the distributed setting, the per-epoch cost of shuffling is comparable to gradient aggregation. Section 7.2 formulates this cost as an information-theoretic problem; Section 7.3 gives the CommIT-group optimal tradeoff.
Quick Check
Random-reshuffling (one random permutation per epoch) is preferred over i.i.d. sampling (with replacement) in SGD because:
i.i.d. sampling has higher implementation overhead.
Random-reshuffling has a better convergence rate: O(1/T²) vs. O(1/T) for i.i.d. sampling.
i.i.d. sampling requires more memory.
Random-reshuffling avoids the need for a random-number generator.
Visiting every data point exactly once per epoch reduces the variance of the stochastic gradient, accelerating convergence. The O(1/T²) vs. O(1/T) gap was proven by Gürbüzbalaban et al. in 2015.