The Data-Shuffling Problem

From a System Problem to an Information-Theoretic One

Section 7.1 gave the systems-level view: shuffling is a per-epoch data-movement cost. This section formalizes the problem as an information-theoretic coding problem, with well-defined inputs (the dataset, the permutations, the caches), well-defined outputs (the post-shuffle data distribution), and a precise communication-load metric.

Once we have the formal problem, the tools of Chapter 2 (cut-set converses), Chapter 4 (finite-field IA), and Chapter 5 (polynomial codes) apply directly. The point is that the coded-shuffling tradeoff of §7.3 is not a stand-alone trick; it is a specialization of the computation-communication framework we have been building for five chapters.

Definition:

(N, D, M)-Data-Shuffling Problem

Let the dataset be \mathcal{W} = \{W_1, \ldots, W_D\}, each W_d a fixed-size chunk. The (N, D, M)-data-shuffling problem operates over N workers as follows:

  1. Placement phase (one-time, before training): Each worker k stores a subset \mathcal{T}_k \subseteq \mathcal{W} of size |\mathcal{T}_k| = M. The placement is chosen centrally by the master and does not depend on the future permutations.

  2. Delivery phase (one per epoch): At the start of epoch t, a random permutation \pi_t: [D] \to [D] is announced. The permutation induces per-worker assignments \mathcal{A}_k^{(t)} = \pi_t^{-1}(\mathcal{B}_k), where \mathcal{B}_k = \{(k-1)D/N + 1, \ldots, kD/N\} is worker k's fixed slot in the epoch's processing order.

  3. Server broadcasts a sequence of bit-messages to (noiselessly) inform each worker of the new-epoch data it does not already have: worker k needs every W_j \in \mathcal{A}_k^{(t)} \setminus \mathcal{T}_k.

The per-epoch shuffling rate is \Delta(\pi_t) = \text{total broadcast bits} / D (normalized by the dataset size). The worst-case rate is R^*(M) = \max_{\pi} \Delta(\pi). The information-theoretic question is: what is the minimum achievable R^*(M)?
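The setup above is easy to simulate. The sketch below builds an uncoded baseline for one epoch: a simple cyclic placement, a fresh random permutation, and a count of the chunks the server must broadcast. The placement, the batch layout, and all function names are illustrative choices of this sketch, not the coded scheme of §7.3.

```python
import random

def uncoded_shuffle_load(N, D, M, seed=0):
    """Simulate one epoch of the (N, D, M) shuffling problem with a
    cyclic placement and a random permutation; return the uncoded
    rate Delta = (missing chunks broadcast) / D."""
    rng = random.Random(seed)
    # Placement phase: worker k caches M chunks (simple cyclic layout).
    caches = [set((k * M + i) % D for i in range(M)) for k in range(N)]
    # Delivery phase: a fresh permutation of the dataset is announced.
    pi = list(range(D))
    rng.shuffle(pi)
    per_worker = D // N
    missing = 0
    for k in range(N):
        # The slice of the shuffled order assigned to worker k.
        batch = set(pi[k * per_worker:(k + 1) * per_worker])
        missing += len(batch - caches[k])  # chunks the server must send
    return missing / D  # normalized by dataset size

print(uncoded_shuffle_load(N=3, D=6, M=2))
```

With M = D every worker caches the whole dataset and the load is zero; the coded schemes of §7.3 improve on this baseline for every intermediate M.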

The setup parallels coded caching (Chapter 4 §4.3): the placement phase knows nothing about future demands; the delivery phase answers the specific permutation via a single broadcast. The difference is that coded caching delivers files (one request per user), while data shuffling delivers groups of files (one mini-batch per worker).

Per-Epoch Shuffling Rate \Delta

The total network traffic required to re-shuffle the distributed dataset after a new random permutation is announced, normalized by the dataset size. The optimal R^*(M) is characterized by the CommIT result of §7.3.

Data Shuffling vs. MapReduce Shuffle

The term "shuffle" is overloaded. Two different operations in distributed systems use it:

  • MapReduce shuffle (Chapter 2): every worker reads its own intermediate values and sends them to the worker responsible for the corresponding reducer key. One-time, per-job operation.
  • Data shuffling in ML (this chapter): a fresh random permutation of the input dataset is computed, and each worker receives its slice of the permuted dataset. Per-epoch operation (potentially hundreds of times per training run).

The coding techniques are similar (both use finite-field IA / XOR alignment), but the rate regions are different. The data-shuffling bound R^*(M) = \frac{N(1 - M/D)}{1 + NM/D} of §7.3 specializes to the same formula as Maddah-Ali / Niesen caching under the identification K = N, F = D (users = workers, files = data points).
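The identification K = N, F = D can be checked mechanically: under it, the two rate expressions are the same function evaluated at the same point (function names here are mine).

```python
def caching_rate(K, M, F):
    """Maddah-Ali / Niesen caching rate K(1 - M/F)/(1 + K*M/F)."""
    return K * (1 - M / F) / (1 + K * M / F)

def shuffling_rate(N, M, D):
    """Data-shuffling rate N(1 - M/D)/(1 + N*M/D)."""
    return N * (1 - M / D) / (1 + N * M / D)

# Under the identification K = N, F = D the two formulas coincide.
assert caching_rate(K=20, M=5, F=100) == shuffling_rate(N=20, M=5, D=100)
```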

Theorem: Lower Bound: R^*(M) \geq R_{\text{cut}}(M)

For any (N, D, M)-data-shuffling scheme, R^*(M) \;\geq\; \frac{N(1 - M/D)}{1 + NM/D}. The proof is a cut-set argument specialized to the shuffling problem: any permutation-agnostic placement must admit at least this much per-epoch broadcast traffic for the worst-case permutation.

With N workers and per-worker memory M = \mu D, a broadcast bit can satisfy at most 1 + NM/D = 1 + N\mu distinct per-worker missing slots simultaneously (the alignment factor of Chapter 4's coded caching). The total number of missing-data-point deliveries is N(1 - \mu)D, i.e. N(1 - \mu) after normalizing by D, so the minimum broadcast traffic is N(1 - \mu)/(1 + N\mu). The argument is exactly the Maddah-Ali / Niesen cut-set from §4.3, specialized to the data-shuffling problem.

Example: N = 3 Workers, D = 6 Data Points, M = 2

Set up the (N, D, M) = (3, 6, 2) shuffling problem. Compute the minimum per-epoch shuffling rate, and verify the normalization.
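Working the numbers: \mu = M/D = 1/3, so R^*(M) = 3 \cdot (2/3)/(1 + 3 \cdot 1/3) = 2/2 = 1. A one-function check (the function name is mine):

```python
def shuffling_rate(N, D, M):
    """Minimum worst-case per-epoch shuffling rate N(1 - mu)/(1 + N*mu)."""
    mu = M / D
    return N * (1 - mu) / (1 + N * mu)

# (N, D, M) = (3, 6, 2): mu = 1/3, so R* = 3*(2/3)/(1 + 1) = 1.0,
# i.e. the broadcast per epoch equals exactly one dataset's worth of bits.
print(shuffling_rate(3, 6, 2))  # → 1.0
```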

Per-Epoch Shuffling Load vs. Per-Worker Memory

Plot the coded shuffling rate R^*(M) = N(1 - \mu)/(1 + N\mu) against the per-worker memory fraction \mu = M/D, with comparison to the uncoded baseline N(1 - \mu). The curves illustrate the multiplicative gain of 1 + N\mu from finite-field IA in the delivery phase. The gap grows as \mu increases: at \mu = 1/2 with N = 20 workers, the coded scheme is 11\times more efficient than uncoded.

Parameters: N = 20 workers; gap annotated at memory fraction \mu = 0.3.
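The curves can be tabulated without plotting machinery; a minimal sketch (function names are mine) that prints both rates and the 1 + N\mu gain for a few memory fractions:

```python
def coded_rate(N, mu):
    """Coded shuffling rate N(1 - mu)/(1 + N*mu)."""
    return N * (1 - mu) / (1 + N * mu)

def uncoded_rate(N, mu):
    """Uncoded baseline: every missing chunk is sent separately, load N(1 - mu)."""
    return N * (1 - mu)

N = 20
for mu in (0.1, 0.3, 0.5):
    gain = uncoded_rate(N, mu) / coded_rate(N, mu)  # equals 1 + N*mu
    print(f"mu={mu:.1f}  coded={coded_rate(N, mu):.2f}  "
          f"uncoded={uncoded_rate(N, mu):.1f}  gain={gain:.0f}x")
```

At \mu = 0.5 the printed gain is the 11\times quoted above.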

Data Shuffling vs. Coded Caching vs. MapReduce Shuffle

  • Coded caching (Ch. 4): delivers K user-specific file requests; one broadcast per round; rate K(1 - M/F)/(1 + KM/F).
  • Data shuffling (this chapter): delivers worker-specific slices of the permuted dataset; one broadcast per epoch; rate N(1 - M/D)/(1 + NM/D).
  • MapReduce shuffle (Ch. 2): delivers worker-specific intermediate-value partitions; one shuffle per job; communication load (1 - \mu)/(N\mu).

Common Mistake: Memory M vs. Memory Fraction \mu = M/D

Mistake:

Quoting shuffling rates in terms of M without specifying the dataset size D.

Correction:

The per-worker memory must be stated as a fraction of the dataset, \mu = M/D, to be meaningful. A memory M that is "a lot" for ImageNet (D \approx 1.3 million) may be negligible for trillion-parameter LLM pre-training datasets. The information-theoretic rates depend on \mu, not on M alone.
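The point is easy to see numerically. The sketch below holds the absolute memory M fixed while the dataset grows (the dataset sizes are illustrative, and the function name is mine): the rate is governed entirely by \mu = M/D.

```python
def shuffling_rate(N, D, M):
    """Worst-case per-epoch rate N(1 - M/D)/(1 + N*M/D); depends only on M/D."""
    mu = M / D
    return N * (1 - mu) / (1 + N * mu)

# Same absolute memory M = 130,000 chunks, very different datasets:
rate_small = shuffling_rate(N=20, D=1_300_000, M=130_000)      # mu = 0.10
rate_large = shuffling_rate(N=20, D=1_300_000_000, M=130_000)  # mu = 0.0001
print(rate_small, rate_large)  # the large-D rate is close to the uncoded N
```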

🔧Engineering Note

Shuffling in Federated Learning: A Special Case

In federated learning, each user's data is fixed and private — there is no cross-user data shuffling at all (this is one of FL's defining features). Each user repeatedly cycles through its own local dataset. The convergence-rate implications are subtle: with n users, each drawing from their local distribution, the effective "shuffling" comes from the random user selection per round (FedAvg selects a random subset of users each round, C \cdot n of them).

So in FL the shuffling cost is replaced by a user-selection cost. Coded shuffling (this chapter) therefore does not apply to vanilla FL; it applies to standard distributed-SGD deployments (data-center training). Chapter 9 treats FL in detail; Chapter 11 (ByzSecAgg) re-introduces coded-computing primitives into FL for Byzantine robustness.

Practical Constraints
  • FL: no cross-user shuffling; each user cycles its local data

  • Distributed SGD (data-center): cross-worker shuffling via coded shuffling

  • Hybrid: partial data shuffling across user subsets is an active research area
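The FedAvg-style per-round randomness described above can be sketched in a few lines; this is an illustrative sketch of uniform client sampling with fraction C, not the full FedAvg protocol.

```python
import random

def sample_clients(n, C, rng):
    """One FedAvg-style round: sample max(1, C*n) of the n clients
    uniformly without replacement. This per-round randomness plays the
    role that data shuffling plays in data-center SGD; no data leaves
    any client."""
    m = max(1, int(C * n))
    return rng.sample(range(n), m)

rng = random.Random(0)
for rnd in range(3):
    print(f"round {rnd}: clients {sorted(sample_clients(100, 0.05, rng))}")
```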

📋 Ref: McMahan et al. FedAvg 2017; PyTorch DataLoader

Key Takeaway

Data shuffling is an information-theoretic problem with a clean achievability-converse structure. The cut-set lower bound R^*(M) \geq N(1 - \mu)/(1 + N\mu) of §7.2 mirrors Maddah-Ali / Niesen caching. Section 7.3 gives the achievability (the CommIT-group result by Wan, Tuninetti, and Caire) that matches this lower bound via finite-field IA, closing the rate-region characterization.

Quick Check

For N = 20 workers, per-worker memory fraction \mu = 1/4, what is the minimum per-epoch shuffling rate R^*(M)?

15

2.5

7.5

20