Prerequisites & Notation

Before You Begin

Chapter 7 applies the coded-computing framework of Part II to a specific operational need in distributed machine learning: the re-shuffling of data across workers at every training epoch. The prerequisites are the coded-caching intuition of Chapter 4, the polynomial-code construction of Chapter 5, and enough SGD background to know why shuffling matters at all.

  • Coded caching delivery (Chapter 4 §4.3)

    Self-check: State the Maddah-Ali / Niesen global caching gain $1 + KM/F$ and identify where the alignment trick enters.

  • Polynomial codes for matrix multiplication (Chapter 5)

    Self-check: For $p = q$ partitions, what is the recovery threshold $K$ of the polynomial code?

  • Distributed SGD architecture (Chapter 1 §1.3)

    Self-check: Why does stochastic gradient descent require data shuffling across epochs?

  • The $(\mu, \Delta)$ tradeoff framework (Chapter 2)

    Self-check: State the optimal coded-shuffling load $\Delta^*(\mu) = (1 - \mu)/(N\mu)$. (A quick numerical check appears in the sketch after this list.)

  • Standard SGD convergence intuition (Chapter 5)

    Self-check: Why is random-order sampling (without replacement) related to cyclic re-ordering of mini-batches?
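
If you want to sanity-check the formulas quoted in these self-checks before reading on, the short sketch below plugs toy numbers into each of them. It is only an illustration: the function names and example values are choices made here, not notation from the chapters, and the $pq$ recovery threshold is the standard polynomial-code answer that the Chapter 5 self-check is pointing at.

```python
# Minimal numerical sketch of the three formulas quoted in the self-checks.
# Function names and toy values are illustrative, not the chapters' notation.

def caching_gain(K_users: int, M_cache: float, F_files: int) -> float:
    """Maddah-Ali / Niesen global caching gain, 1 + K*M/F (Chapter 4)."""
    return 1 + K_users * M_cache / F_files

def recovery_threshold(p: int, q: int) -> int:
    """Polynomial-code recovery threshold p*q; with p = q this is p**2 (Chapter 5)."""
    return p * q

def coded_shuffling_load(mu: float, N_workers: int) -> float:
    """Optimal coded-shuffling load Delta*(mu) = (1 - mu) / (N * mu) (Chapter 2)."""
    return (1 - mu) / (N_workers * mu)

if __name__ == "__main__":
    print(caching_gain(K_users=10, M_cache=2, F_files=20))   # -> 2.0
    print(recovery_threshold(p=3, q=3))                      # -> 9
    print(coded_shuffling_load(mu=0.25, N_workers=4))        # -> 0.75
```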

Notation for This Chapter

We follow the coded-caching convention: $N$ workers, $M$ is the per-worker memory (measured in number of data points), and $K$ is reserved for the number of distinct data points per "epoch" when the chapter uses the caching-style framing. Where the chapter inherits Chapter 2's convention, $K$ denotes the recovery threshold and the shift is noted explicitly.

| Symbol | Meaning | Introduced |
| --- | --- | --- |
| $N$ | Number of workers | s01 |
| $D$ | Total number of data points in the training set | s01 |
| $M$ | Per-worker memory (number of data points each worker can hold) | s01 |
| $\mu = M/D$ | Per-worker storage fraction (consistent with Chapter 2) | s01 |
| $\Delta$ | Per-epoch shuffling communication load (normalized) | s02 |
| $\pi_t$ | Permutation of the dataset used at epoch $t$ | s02 |
| $W_d$ | The $d$-th data point, $d = 1, \ldots, D$ | s02 |
| $\mathcal{T}_k^{(t)}$ | Set of data points stored at worker $k$ during epoch $t$ | s02 |
| $R^*(M)$ | Optimal per-epoch shuffling rate (the CommIT result) | s03 |
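
To see the notation in action, the toy simulation below (with sizes chosen arbitrarily for this sketch) draws a fresh permutation $\pi_t$ each epoch, forms the storage sets $\mathcal{T}_k^{(t)}$ by splitting the permuted dataset evenly across the $N$ workers, and measures the fraction of the $W_d$ that must be communicated because their new worker did not already hold them. This is the uncoded baseline that the chapter's coded shuffling schemes improve on; the simulation itself is an illustration, not a construction from the chapter.

```python
import random

# Toy illustration of the notation above (sizes are arbitrary choices here):
# D data points W_1..W_D, N workers, each storing M = D/N points, so mu = M/D.
# pi_t is a fresh permutation per epoch and T_k^(t) is worker k's storage set.
# We count the normalized uncoded shuffling load: the fraction of the dataset
# that moves because a worker's new points were not already in its storage.

D, N, EPOCHS = 24, 4, 5
M = D // N
mu = M / D
data_ids = list(range(D))  # indices d of the data points W_d

# Epoch 0: an arbitrary initial placement T_k^(0).
storage = [set(data_ids[k * M:(k + 1) * M]) for k in range(N)]

for t in range(1, EPOCHS + 1):
    pi_t = random.sample(data_ids, D)  # random permutation of the dataset
    new_storage = [set(pi_t[k * M:(k + 1) * M]) for k in range(N)]
    # Points worker k needs this epoch but did not store in the previous one.
    moved = sum(len(new_storage[k] - storage[k]) for k in range(N))
    print(f"epoch {t}: mu = {mu:.2f}, uncoded load = {moved / D:.2f}")
    storage = new_storage
```

With the minimal storage $\mu = 1/N$ used here, the measured load hovers around $1 - 1/N$; the gains promised by $\Delta^*(\mu)$ appear once each worker can hold more than $D/N$ points, i.e. $\mu > 1/N$.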