Why ML Training Re-Shuffles Data

The Shuffling Step is a Real Distributed-Systems Cost

Every production SGD loop re-shuffles the training data at the start of each epoch. The reason comes straight from the convergence theory: without re-shuffling, stochastic gradients become dependent across iterations and SGD's convergence rate degrades from $O(1/\sqrt{T})$ to $O(1/T^{1/3})$ or worse (Bertsekas, Nedić, Recht, Ré). In short: shuffling is not a cosmetic choice; it is a requirement for good convergence.

When training is distributed across $N$ workers, shuffling means moving data across workers. At the start of each epoch, each worker's mini-batch comes from a random permutation of the full dataset, which in general lives partly on other workers. This movement is exactly the computation-communication problem of Chapter 2, and the same coded-communication machinery reduces the cost by a factor of $N\mu$.

The CommIT-group contribution by Wan, Tuninetti, and Caire (Section 7.3) gives the exact tradeoff between per-worker memory $M$ and per-epoch communication $R^*(M)$. The construction reuses the finite-field IA tools of Chapter 4.

Definition: Epoch and Shuffle in Distributed SGD

In stochastic gradient descent with mini-batches of size $B$, one epoch consists of $D/B$ iterations, processing every data point in the training set exactly once. A shuffle is a uniformly random permutation $\pi_t: [D] \to [D]$ applied to the data indices at the start of epoch $t$.

In the distributed setting with $N$ workers, the dataset is partitioned into $N$ groups before training starts. At every epoch, worker $k$ must receive (or generate locally from cached data) its share of the freshly permuted dataset, specifically the subset $\pi_t^{-1}(\mathcal{B}_k^{(t)})$, where $\mathcal{B}_k^{(t)}$ is worker $k$'s assigned mini-batch index set at epoch $t$. Without re-shuffling, the mini-batches become correlated across epochs and convergence rates degrade.
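As a concrete picture of the definition, here is a minimal sketch of the shuffle-then-assign step for one epoch, assuming the permuted index list is simply split into $N$ contiguous blocks (the function name and the block split are illustrative, not prescribed by the text):

```python
import numpy as np

def shuffle_and_assign(D, N, epoch_seed):
    """One epoch's shuffle: draw a fresh permutation pi_t of the D data
    indices and split it into N blocks of D/N indices, one per worker."""
    rng = np.random.default_rng(epoch_seed)
    pi_t = rng.permutation(D)            # uniformly random permutation of [D]
    shard = D // N                       # each worker processes D/N points per epoch
    # Worker k's assignment for this epoch: the k-th block of the permuted order.
    # Any assigned point not already in worker k's cache must cross the network.
    return {k: pi_t[k * shard:(k + 1) * shard] for k in range(N)}

# Example: 12 data points, 3 workers; each worker gets 4 freshly permuted indices.
assignments = shuffle_and_assign(D=12, N=3, epoch_seed=0)
```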

In practice, many systems skip shuffling (treating the data as cyclically ordered) and absorb the convergence-rate penalty. The Wan, Tuninetti, and Caire result makes proper shuffling cheap enough that this compromise is unnecessary.

Data Shuffling

The operation of re-permuting data-point assignments across workers between SGD epochs. Required for the theoretical convergence rates of SGD. Coded shuffling reduces the per-epoch communication from $O(N)$ to $O(1/\mu)$ by exploiting the redundancy of per-worker caches.

Per-Epoch Shuffling Cost

The total network traffic (per epoch, normalized to units of one worker's $D/N$-point share) required to deliver each worker's newly permuted data. For $N$ workers with memory $M$ and dataset size $D$, the uncoded cost is $N(1 - M/D)$ and the coded minimum is $N(1 - M/D)/(1 + NM/D)$.

Theorem: Uncoded Shuffling Cost

Suppose $N$ workers hold $M$-sized caches of the $D$-point dataset and must receive, at the start of each epoch, a fresh permutation assignment of size $D/N$ per worker. The total uncoded per-epoch shuffling communication is $R_{\text{uncoded}} = N\left(1 - \tfrac{M}{D}\right)$. This is the baseline for the coded shuffling improvement.

Each worker has cached $M$ of the $D$ data points. Each worker needs $D/N$ new points per epoch, of which on average $(D/N)(1 - M/D)$ are not already cached and must be fetched over the network. Summing over the $N$ workers and normalizing by the per-worker share $D/N$ gives $N(1 - M/D)$, the uncoded rate, independent of any clever pre-caching coordination.
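The expected-value argument can be checked numerically. A small simulation under the stated model, assuming each worker caches $M$ uniformly random points and a fresh uniform permutation is drawn every epoch (all names and parameters here are illustrative):

```python
import numpy as np

def simulated_uncoded_rate(D, N, M, epochs=200, seed=0):
    """Estimate the uncoded per-epoch shuffling traffic, normalized so that
    one unit equals one worker's per-epoch share of D/N points."""
    rng = np.random.default_rng(seed)
    # Model assumption: each worker caches M points chosen uniformly at random.
    caches = [set(rng.choice(D, size=M, replace=False)) for _ in range(N)]
    shard = D // N
    misses = 0
    for _ in range(epochs):
        pi = rng.permutation(D)
        for k in range(N):
            assigned = pi[k * shard:(k + 1) * shard]
            misses += sum(1 for x in assigned if x not in caches[k])
    return misses / (epochs * shard)      # normalized per-epoch traffic

# Theory: N * (1 - M/D) = 10 * (1 - 0.3) = 7.0; the estimate should land nearby.
print(simulated_uncoded_rate(D=1000, N=10, M=300))
```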


Example: Shuffling 1 GB Across 10 Workers

A 1 GB training set is shuffled across $N = 10$ workers, each with memory $M/D = 0.3$ (30% of the dataset cached). Compute the uncoded per-epoch network traffic.
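A back-of-envelope solution, plugging the example's parameters into the formulas above and treating 1 GB as $10^9$ bytes (the coded figure uses the minimum quoted in the Per-Epoch Shuffling Cost entry):

```python
D_bytes = 1_000_000_000        # 1 GB training set
N, mu = 10, 0.3                # workers, per-worker memory fraction M/D

share   = D_bytes / N                      # each worker's per-epoch share: 100 MB
uncoded = N * (1 - mu) * share             # N(1 - M/D) shares  -> 700 MB per epoch
coded   = uncoded / (1 + N * mu)           # divide by (1 + NM/D) -> 175 MB per epoch
print(f"uncoded: {uncoded/1e6:.0f} MB, coded: {coded/1e6:.0f} MB")
```

With 30% caching, uncoded shuffling moves about 700 MB per epoch; the coded minimum quoted above would cut this to 175 MB.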

Shuffling Cost vs. Gradient-Aggregation Cost

Plot the per-epoch shuffling traffic against the per-epoch gradient-aggregation traffic as a function of training-set size $D$, with fixed model size $d$ and a fixed number of epochs per training run. The relative cost depends on $D/d$: for very large datasets, shuffling dominates; for very large models, gradients dominate. In either regime, the shuffling traffic is substantial enough that coded approaches pay off. A sketch of the computation follows the parameters below.

Plot parameters: $N = 16$ workers; per-worker memory fraction $M/D = 0.3$.
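A sketch of how one might generate such a comparison. The section does not specify a gradient-aggregation model, so the snippet assumes a ring all-reduce costing roughly $2d$ values per worker per iteration, with $D/B$ iterations per epoch as in the definition above; the all-reduce cost is a modeling assumption, and both traffic figures are in abstract "values" that would still need scaling by bytes per data point and bytes per gradient entry:

```python
def shuffling_traffic(D, N, mu):
    """Uncoded per-epoch shuffling traffic in data points: N(1 - M/D) shares of D/N,
    i.e. D(1 - M/D) points in total."""
    return N * (1 - mu) * (D / N)

def allreduce_traffic(D, d, B, N):
    """Assumed per-epoch gradient traffic: D/B iterations, ~2d values per worker
    per ring all-reduce, summed over N workers (a modeling assumption)."""
    return (D / B) * 2 * d * N

N, mu, d, B = 16, 0.3, 10_000_000, 256       # illustrative parameters
for D in [1e6, 1e8, 1e10]:                   # sweep training-set size
    print(D, shuffling_traffic(D, N, mu), allreduce_traffic(D, d, B, N))
```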

Why No Shuffling Is Actually Bad

It is tempting to skip shuffling to save network traffic. The problem is that SGD's convergence analysis depends on independent sample draws at each iteration. Without shuffling, the per-iteration samples become cyclically correlated with the mini-batch structure, and several convergence guarantees degrade:

  • i.i.d. sampling with replacement: convergence rate $O(1/\sqrt{T})$ for strongly convex losses.
  • Random-reshuffling sampling: convergence rate $O(1/T)$ under mild conditions, actually better than with-replacement! (Gürbüzbalaban–Ozdaglar–Parrilo 2015).
  • No reshuffling (cyclic order): convergence rate $O(1/T^{1/3})$ or worse, depending on the objective.

So shuffling is not just "nice to have": re-shuffling actually accelerates convergence compared to even i.i.d. sampling with replacement. The reason is that each epoch visits every data point exactly once, which reduces the variance of the stochastic gradient. Coded shuffling (§7.3) makes this optimal shuffling affordable.
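To see the effect, one can compare the three sampling schemes on a toy strongly convex least-squares problem. This is an illustration only, with arbitrary problem and step-size choices, and the asymptotic rates above show up as a trend over many epochs rather than exact constants:

```python
import numpy as np

def sgd(mode, A, b, epochs=50, lr=0.05, seed=0):
    """Run SGD on 0.5*||Ax - b||^2 with one of three sampling schemes."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    x = np.zeros(p)
    for t in range(epochs):
        if mode == "cyclic":                 # no reshuffling: fixed order every epoch
            order = np.arange(n)
        elif mode == "reshuffle":            # fresh permutation each epoch
            order = rng.permutation(n)
        else:                                # i.i.d. sampling with replacement
            order = rng.integers(0, n, size=n)
        for i in order:
            grad = (A[i] @ x - b[i]) * A[i]  # per-sample least-squares gradient
            x -= lr / (t + 1) * grad         # decaying step size
    return np.linalg.norm(A @ x - b) ** 2 / n

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
for mode in ["iid", "reshuffle", "cyclic"]:
    print(mode, sgd(mode, A, b))
```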

⚠️ Engineering Note

Shuffling in Production Systems

Production ML training pipelines (Horovod, PyTorch DistributedDataParallel, TF MirroredStrategy) shuffle via random index permutations per epoch, with each worker iterating over its shard of the permuted dataset. The underlying shuffle mechanism is usually random-shuffle-on-read-from-disk, avoiding network traffic by pre-caching the entire dataset on each worker's SSD when feasible.
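In PyTorch's DistributedDataParallel setup, for instance, the per-epoch permutation comes from torch.utils.data.distributed.DistributedSampler; the key operational detail is calling set_epoch every epoch so the permutation actually changes. A minimal sketch, with the dataset and the training step as placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this is the (possibly sharded) training set.
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))

# Each worker (rank) iterates over its shard of a fresh permutation per epoch.
sampler = DistributedSampler(dataset, num_replicas=16, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)     # seeds this epoch's permutation; omitting it
                                 # makes every epoch reuse the same order
    for xb, yb in loader:
        pass                     # placeholder for the actual SGD step
```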

At the scale of ImageNet (1.3 M images, 150 GB) and on a 16-worker cluster, the per-epoch shuffling cost becomes non-trivial if the dataset doesn't fit in each worker's memory. Coded shuffling becomes attractive when each worker can cache only $O(D/N)$ of the data rather than all $O(D)$ of it. For foundation-model training on 100 B+-token datasets, the per-epoch shuffling cost is a serious systems concern.

Practical Constraints
  • ImageNet-scale: $D \approx 150$ GB, $N = 16$ workers, $M/D \approx 1/16$
  • LLM training: $D \approx 10$ TB, $N = 10^3$ workers, $M/D \approx 1/N$
  • Per-epoch shuffling can match or exceed the gradient-aggregation cost (a back-of-envelope estimate is sketched below)

📋 Ref: PyTorch DataLoader; TF tf.data; Jiang et al. 2018 VLDB
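Plugging the two regimes above into the uncoded and coded formulas gives a rough sense of the traffic involved. This is a back-of-envelope estimate only; the byte counts follow the section's approximate figures:

```python
def per_epoch_shuffle_bytes(D_bytes, N, mu):
    """Total bytes moved per epoch: uncoded D(1 - M/D); coded divides by (1 + N*M/D)."""
    uncoded = D_bytes * (1 - mu)
    return uncoded, uncoded / (1 + N * mu)

# ImageNet-scale: ~150 GB, 16 workers, M/D ~ 1/16  -> ~140 GB uncoded, ~70 GB coded
print(per_epoch_shuffle_bytes(150e9, 16, 1 / 16))
# LLM-scale: ~10 TB, 1000 workers, M/D ~ 1/N       -> ~10 TB uncoded, ~5 TB coded
print(per_epoch_shuffle_bytes(10e12, 1000, 1 / 1000))
```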

Common Mistake: Skipping Shuffling to Save Bandwidth

Mistake:

Train distributed SGD without epoch-level shuffling to avoid the per-epoch network cost.

Correction:

Without shuffling, the implicit sample ordering is cyclic, and SGD's convergence degrades measurably. Empirically, random-reshuffling converges about 1.5 to 2× faster per epoch than i.i.d. sampling with replacement (Gürbüzbalaban et al. 2015 analyze why). Skipping shuffling saves one cost but pays a larger one in extra training epochs. Coded shuffling (§7.3) is the better answer.

Historical Note: The Random-Reshuffling Puzzle

2015–present

For decades, SGD theory was analyzed under the i.i.d.-sampling assumption: each iteration's gradient is computed on a sample drawn uniformly at random, with replacement. Meanwhile, practical implementations used random reshuffling (one random permutation per epoch, processing the dataset in that order), and empirical results consistently showed reshuffling to be faster. Gürbüzbalaban, Ozdaglar, and Parrilo (2015), and independently Ying, Yuan, and Sayed (2017), closed the theoretical gap: they proved that random reshuffling achieves $O(1/T)$ convergence, a meaningful speedup over the $O(1/\sqrt{T})$ of i.i.d. sampling. This result gives a firm theoretical justification for shuffling at every epoch, motivating the coded approaches of this chapter.

Key Takeaway

Shuffling is not optional for optimal SGD convergence. Random-reshuffling gives $O(1/T)$ vs. $O(1/\sqrt{T})$ for i.i.d. sampling, a substantial speedup per epoch. In the distributed setting, the per-epoch cost of shuffling is comparable to gradient aggregation. Section 7.2 formulates this cost as an information-theoretic problem; Section 7.3 gives the CommIT-group optimal tradeoff.

Quick Check

Random-reshuffling (one random permutation per epoch) is preferred over i.i.d. sampling (with replacement) in SGD because:

  • i.i.d. sampling has higher implementation overhead.
  • Random-reshuffling has the better convergence rate: $O(1/T)$ vs. $O(1/\sqrt{T})$ for i.i.d. sampling.
  • i.i.d. sampling requires more memory.
  • Random-reshuffling avoids the need for a random-number generator.