Why ML Training Re-Shuffles Data

The Shuffling Step is a Real Distributed-Systems Cost

Every production SGD loop re-shuffles the training data at the start of each epoch. The reason comes straight from the convergence theory: without re-shuffling, stochastic gradients become dependent across iterations and SGD's convergence rate degrades from $O(1/\sqrt{T})$ to $O(1/T^{1/3})$ or worse (Bertsekas, Nedić, Recht, Ré). In short: shuffling is not a cosmetic choice; it is a requirement for good convergence.

When training is distributed across $N$ workers, shuffling means moving data across workers. At the start of each epoch, each worker's mini-batch comes from a random permutation of the full dataset, which in general lives partly on other workers. This movement is exactly the computation-communication problem of Chapter 2, and the same coded-communication machinery reduces the cost by a factor of $N\mu$.

The CommIT-group contribution by Wan, Tuninetti, and Caire (Section 7.3) gives the exact tradeoff between per-worker memory $M$ and per-epoch communication $R^*(M)$. The construction reuses the finite-field IA tools of Chapter 4.

Definition: Epoch and Shuffle in Distributed SGD

In stochastic gradient descent with mini-batches of size $B$, one epoch consists of $D/B$ iterations, processing every data point in the training set exactly once. A shuffle is a uniformly random permutation $\pi_t: [D] \to [D]$ applied to the data indices at the start of epoch $t$.

In the distributed setting with $N$ workers, the dataset is partitioned into $N$ groups before training starts. At every epoch, worker $k$ must receive (or generate locally from cached data) its share of the freshly permuted dataset, specifically the subset $\pi_t^{-1}(\mathcal{B}_k^{(t)})$, where $\mathcal{B}_k^{(t)}$ is worker $k$'s assigned mini-batch index set at epoch $t$. Without re-shuffling, the mini-batches become correlated across epochs and convergence rates degrade.
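As a concrete picture of the definition, here is a minimal sketch of the shuffle-then-assign step for one epoch, assuming the permuted index list is simply split into $N$ contiguous blocks (the function name and the block split are illustrative, not prescribed by the text):

```python
import numpy as np

def shuffle_and_assign(D, N, epoch_seed):
    """One epoch's shuffle: draw a fresh permutation pi_t of the D data
    indices and split it into N blocks of D/N indices, one per worker."""
    rng = np.random.default_rng(epoch_seed)
    pi_t = rng.permutation(D)            # uniformly random permutation of [D]
    shard = D // N                       # each worker processes D/N points per epoch
    # Worker k's assignment for this epoch: the k-th block of the permuted order.
    # Any assigned point not already in worker k's cache must cross the network.
    return {k: pi_t[k * shard:(k + 1) * shard] for k in range(N)}

# Example: 12 data points, 3 workers; each worker gets 4 freshly permuted indices.
assignments = shuffle_and_assign(D=12, N=3, epoch_seed=0)
```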

In practice, many systems skip shuffling (treating the data as cyclically ordered) and absorb the convergence-rate penalty. The Wan, Tuninetti, and Caire result makes proper shuffling cheap enough that this compromise is unnecessary.

Data Shuffling

The operation of re-permuting data-point assignments across workers between SGD epochs. Required for the theoretical convergence rates of SGD. Coded shuffling reduces the per-epoch communication from $O(N)$ to $O(1/\mu)$ by exploiting the redundancy of per-worker caches.

Per-Epoch Shuffling Cost

The total network traffic (per epoch, normalized to units of one worker's $D/N$-point share) required to deliver each worker's newly permuted data. For $N$ workers with memory $M$ and dataset size $D$, the uncoded cost is $N(1 - M/D)$ and the coded minimum is $N(1 - M/D)/(1 + NM/D)$.

Theorem: Uncoded Shuffling Cost

Suppose $N$ workers hold $M$-sized caches of the $D$-point dataset and must receive, at the start of each epoch, a fresh permutation assignment of size $D/N$ per worker. The total uncoded per-epoch shuffling communication is $R_{\text{uncoded}} = N\left(1 - \tfrac{M}{D}\right)$. This is the baseline for the coded shuffling improvement.

Each worker has cached $M$ of the $D$ data points. Each worker needs $D/N$ new points per epoch, of which on average $(D/N)(1 - M/D)$ are not already cached and must be fetched over the network. Summing over the $N$ workers and normalizing by the per-worker share $D/N$ gives $N(1 - M/D)$, the uncoded rate, independent of any clever pre-caching coordination.
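The expected-value argument can be checked numerically. A small simulation under the stated model, assuming each worker caches $M$ uniformly random points and a fresh uniform permutation is drawn every epoch (all names and parameters here are illustrative):

```python
import numpy as np

def simulated_uncoded_rate(D, N, M, epochs=200, seed=0):
    """Estimate the uncoded per-epoch shuffling traffic, normalized so that
    one unit equals one worker's per-epoch share of D/N points."""
    rng = np.random.default_rng(seed)
    # Model assumption: each worker caches M points chosen uniformly at random.
    caches = [set(rng.choice(D, size=M, replace=False)) for _ in range(N)]
    shard = D // N
    misses = 0
    for _ in range(epochs):
        pi = rng.permutation(D)
        for k in range(N):
            assigned = pi[k * shard:(k + 1) * shard]
            misses += sum(1 for x in assigned if x not in caches[k])
    return misses / (epochs * shard)      # normalized per-epoch traffic

# Theory: N * (1 - M/D) = 10 * (1 - 0.3) = 7.0; the estimate should land nearby.
print(simulated_uncoded_rate(D=1000, N=10, M=300))
```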


Example: Shuffling 1 GB Across 10 Workers

A 1 GB training set is shuffled across $N = 10$ workers, each with memory $M/D = 0.3$ (30% of the dataset cached). Compute the uncoded per-epoch network traffic.
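A back-of-envelope solution, plugging the example's parameters into the formulas above and treating 1 GB as $10^9$ bytes (the coded figure uses the minimum quoted in the Per-Epoch Shuffling Cost entry):

```python
D_bytes = 1_000_000_000        # 1 GB training set
N, mu = 10, 0.3                # workers, per-worker memory fraction M/D

share   = D_bytes / N                      # each worker's per-epoch share: 100 MB
uncoded = N * (1 - mu) * share             # N(1 - M/D) shares  -> 700 MB per epoch
coded   = uncoded / (1 + N * mu)           # divide by (1 + NM/D) -> 175 MB per epoch
print(f"uncoded: {uncoded/1e6:.0f} MB, coded: {coded/1e6:.0f} MB")
```

With 30% caching, uncoded shuffling moves about 700 MB per epoch; the coded minimum quoted above would cut this to 175 MB.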

Shuffling Cost vs. Gradient-Aggregation Cost

Plot the per-epoch shuffling traffic against the per-epoch gradient-aggregation traffic as a function of training-set size $D$, with fixed model size $d$ and a fixed number of epochs per training run. The relative cost depends on $D/d$: for very large datasets, shuffling dominates; for very large models, gradients dominate. In either regime, the shuffling traffic is substantial enough that coded approaches pay off. A sketch of the computation follows the parameters below.

Plot parameters: $N = 16$ workers; per-worker memory fraction $M/D = 0.3$.
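A sketch of how one might generate such a comparison. The section does not specify a gradient-aggregation model, so the snippet assumes a ring all-reduce costing roughly $2d$ values per worker per iteration, with $D/B$ iterations per epoch as in the definition above; the all-reduce cost is a modeling assumption, and both traffic figures are in abstract "values" that would still need scaling by bytes per data point and bytes per gradient entry:

```python
def shuffling_traffic(D, N, mu):
    """Uncoded per-epoch shuffling traffic in data points: N(1 - M/D) shares of D/N,
    i.e. D(1 - M/D) points in total."""
    return N * (1 - mu) * (D / N)

def allreduce_traffic(D, d, B, N):
    """Assumed per-epoch gradient traffic: D/B iterations, ~2d values per worker
    per ring all-reduce, summed over N workers (a modeling assumption)."""
    return (D / B) * 2 * d * N

N, mu, d, B = 16, 0.3, 10_000_000, 256       # illustrative parameters
for D in [1e6, 1e8, 1e10]:                   # sweep training-set size
    print(D, shuffling_traffic(D, N, mu), allreduce_traffic(D, d, B, N))
```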

Why No Shuffling Is Actually Bad

It is tempting to skip shuffling to save network traffic. The problem is that SGD's convergence analysis depends on independent sample draws at each iteration. Without shuffling, the per-iteration samples become cyclically correlated with the mini-batch structure, and several convergence guarantees degrade:

  • i.i.d. sampling with replacement: convergence rate $O(1/\sqrt{T})$ for strongly convex losses.
  • Random-reshuffling sampling: convergence rate $O(1/T)$ under mild conditions, actually better than with-replacement! (Gürbüzbalaban–Ozdaglar–Parrilo 2015).
  • No reshuffling (cyclic order): convergence rate $O(1/T^{1/3})$ or worse, depending on the objective.

So shuffling is not just "nice to have": re-shuffling actually accelerates convergence compared to even i.i.d. sampling with replacement. The reason is that each epoch visits every data point exactly once, which reduces the variance of the stochastic gradient. Coded shuffling (§7.3) makes this optimal shuffling affordable.
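To see the effect, one can compare the three sampling schemes on a toy strongly convex least-squares problem. This is an illustration only, with arbitrary problem and step-size choices, and the asymptotic rates above show up as a trend over many epochs rather than exact constants:

```python
import numpy as np

def sgd(mode, A, b, epochs=50, lr=0.05, seed=0):
    """Run SGD on 0.5*||Ax - b||^2 with one of three sampling schemes."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    for t in range(epochs):
        if mode == "cyclic":                 # no reshuffling: fixed order every epoch
            order = np.arange(n)
        elif mode == "reshuffle":            # fresh permutation each epoch
            order = rng.permutation(n)
        else:                                # i.i.d. sampling with replacement
            order = rng.integers(0, n, size=n)
        for i in order:
            grad = (A[i] @ x - b[i]) * A[i]  # per-sample least-squares gradient
            x -= lr / (t + 1) * grad         # decaying step size
    return np.linalg.norm(A @ x - b) ** 2 / n

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
for mode in ["iid", "reshuffle", "cyclic"]:
    print(mode, sgd(mode, A, b))
```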

⚠️ Engineering Note

Shuffling in Production Systems

Production ML training pipelines (Horovod, PyTorch DistributedDataParallel, TF MirroredStrategy) shuffle via random index permutations per epoch, with each worker iterating over its shard of the permuted dataset. The underlying shuffle mechanism is usually random-shuffle-on-read-from-disk, avoiding network traffic by pre-caching the entire dataset on each worker's SSD when feasible.
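In PyTorch's DistributedDataParallel setup, for instance, the per-epoch permutation comes from torch.utils.data.distributed.DistributedSampler; the key operational detail is calling set_epoch every epoch so the permutation actually changes. A minimal sketch, with the dataset and the training step as placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this is the (possibly sharded) training set.
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))

# Each worker (rank) iterates over its shard of a fresh permutation per epoch.
sampler = DistributedSampler(dataset, num_replicas=16, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)     # seeds this epoch's permutation; omitting it
                                 # makes every epoch reuse the same order
    for xb, yb in loader:
        pass                     # placeholder for the actual SGD step
```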

At the scale of ImageNet (1.3 M images, 150 GB) and on a 16-worker cluster, the per-epoch shuffling cost becomes non-trivial if the dataset doesn't fit in each worker's memory. Coded shuffling becomes attractive when each worker can cache only $O(D/N)$ of the data rather than all $O(D)$ of it. For foundation-model training on 100 B+-token datasets, the per-epoch shuffling cost is a serious systems concern.

Practical Constraints
  • ImageNet-scale: $D \approx 150$ GB, $N = 16$ workers, $M/D \approx 1/16$
  • LLM training: $D \approx 10$ TB, $N = 10^3$ workers, $M/D \approx 1/N$
  • Per-epoch shuffling can match or exceed the gradient-aggregation cost (a back-of-envelope estimate is sketched below)

📋 Ref: PyTorch DataLoader; TF tf.data; Jiang et al. 2018 VLDB
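Plugging the two regimes above into the uncoded and coded formulas gives a rough sense of the traffic involved. This is a back-of-envelope estimate only; the byte counts follow the section's approximate figures:

```python
def per_epoch_shuffle_bytes(D_bytes, N, mu):
    """Total bytes moved per epoch: uncoded D(1 - M/D); coded divides by (1 + N*M/D)."""
    uncoded = D_bytes * (1 - mu)
    return uncoded, uncoded / (1 + N * mu)

# ImageNet-scale: ~150 GB, 16 workers, M/D ~ 1/16  -> ~140 GB uncoded, ~70 GB coded
print(per_epoch_shuffle_bytes(150e9, 16, 1 / 16))
# LLM-scale: ~10 TB, 1000 workers, M/D ~ 1/N       -> ~10 TB uncoded, ~5 TB coded
print(per_epoch_shuffle_bytes(10e12, 1000, 1 / 1000))
```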

Common Mistake: Skipping Shuffling to Save Bandwidth

Mistake:

Train distributed SGD without epoch-level shuffling to avoid the per-epoch network cost.

Correction:

Without shuffling, the implicit sample ordering is cyclic, and SGD's convergence degrades measurably. Empirically, random-reshuffling converges about 1.5 to 2× faster per epoch than i.i.d. sampling with replacement (Gürbüzbalaban et al. 2015 analyze why). Skipping shuffling saves one cost but pays a larger one in extra training epochs. Coded shuffling (§7.3) is the better answer.

Historical Note: The Random-Reshuffling Puzzle

2015–present

For decades, SGD theory was analyzed under the i.i.d.-sampling assumption: each iteration's gradient is computed on a sample drawn uniformly at random, with replacement. Meanwhile, practical implementations used random reshuffling (one random permutation per epoch, processing the dataset in that order), and empirical results consistently showed reshuffling to be faster. Gürbüzbalaban, Ozdaglar, and Parrilo (2015), and independently Ying, Yuan, and Sayed (2017), closed the theoretical gap: they proved that random reshuffling achieves $O(1/T)$ convergence, a meaningful speedup over the $O(1/\sqrt{T})$ of i.i.d. sampling. This result gives a firm theoretical justification for shuffling at every epoch, motivating the coded approaches of this chapter.

Key Takeaway

Shuffling is not optional for optimal SGD convergence. Random-reshuffling gives $O(1/T)$ vs. $O(1/\sqrt{T})$ for i.i.d. sampling, a substantial speedup per epoch. In the distributed setting, the per-epoch cost of shuffling is comparable to gradient aggregation. Section 7.2 formulates this cost as an information-theoretic problem; Section 7.3 gives the CommIT-group optimal tradeoff.

Quick Check

Random-reshuffling (one random permutation per epoch) is preferred over i.i.d. sampling (with replacement) in SGD because:

  • i.i.d. sampling has higher implementation overhead.
  • Random-reshuffling has the better convergence rate: $O(1/T)$ vs. $O(1/\sqrt{T})$ for i.i.d. sampling.
  • i.i.d. sampling requires more memory.
  • Random-reshuffling avoids the need for a random-number generator.