Prerequisites & Notation
Before You Begin
Chapter 6 specializes the polynomial-code framework of Chapter 5 to the gradient-aggregation problem of distributed SGD. The prerequisites are the matrix-multiplication background of Chapter 5, the basics of stochastic gradient descent (Chapter 1 §1.3), and standard convex/non-convex optimization vocabulary. Readers comfortable with the polynomial-code construction will find the gradient-coding extension immediate.
- Polynomial codes for matrix multiplication (Chapter 5)
Self-check: State the recovery threshold of the polynomial code for matrix multiplication with $m$ and $n$ partitions.
- Distributed SGD architecture (Chapter 1 §1.3)
Self-check: For a model with $d$ parameters and $N$ users, what is the per-round aggregate uplink cost?
- Smooth (Lipschitz-gradient) and convex / strongly-convex objectives
Self-check: Define the convergence rate of SGD on a $\mu$-strongly convex, $L$-smooth objective. (One standard rate is sketched after this list.)
- Order statistics for stragglers (Chapter 1 §1.2)
Self-check: Why does the expected wait for the $k$-th fastest worker, $\mathbb{E}[T_{(k)}]$, grow only logarithmically in $N$ when $k = N$ but converge to a constant when $k = \alpha N$ for fixed $\alpha < 1$? (A simulation follows this list.)
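For the convexity self-check, one standard statement is sketched below; the variance bound $\sigma^2$ and the step-size schedule are assumptions not spelled out in this section, and the exact constants depend on the analysis the book uses.

```latex
% A sketch, assuming F is \mu-strongly convex and L-smooth, stochastic
% gradients are unbiased with variance at most \sigma^2, and step sizes
% decay as \eta_t = \Theta(1/(\mu t)).
\mathbb{E}\!\left[F(w_T)\right] - F^{\star}
  \;=\; \mathcal{O}\!\left(\frac{\sigma^2}{\mu T}\right),
\qquad\text{versus } \mathcal{O}\!\left(1/\sqrt{T}\right)
\text{ for merely convex } F.
```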
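The order-statistics self-check can also be verified numerically. The sketch below is a Monte Carlo estimate assuming i.i.d. $\mathrm{Exp}(1)$ completion times, a common straggler model (the function name `expected_wait` and all parameters are illustrative): waiting for all $N$ workers costs $H_N \approx \ln N$, while waiting for the fastest 90% approaches the constant $-\ln 0.1 \approx 2.3$.

```python
# Monte Carlo check of straggler order statistics, assuming i.i.d. Exp(1)
# worker completion times (an illustrative model, not necessarily the book's).
import numpy as np

rng = np.random.default_rng(0)

def expected_wait(num_workers, k, trials=5_000):
    """Estimate E[T_(k)]: the time until the k-th fastest of num_workers
    i.i.d. Exp(1) workers has responded."""
    times = rng.exponential(1.0, size=(trials, num_workers))
    kth_fastest = np.sort(times, axis=1)[:, k - 1]   # k-th order statistic
    return kth_fastest.mean()

for n in (10, 100, 1000):
    wait_all = expected_wait(n, n)             # wait for every worker
    wait_90 = expected_wait(n, int(0.9 * n))   # wait for the fastest 90%
    print(f"N={n:4d}  E[T_(N)]={wait_all:4.2f} (ln N = {np.log(n):4.2f})"
          f"  E[T_(0.9N)]={wait_90:4.2f} (-ln 0.1 = {-np.log(0.1):4.2f})")
```

The contrast is the whole point of gradient coding: tolerating even a small fraction of stragglers replaces the $\ln N$ growth with a constant.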
Notation for This Chapter
Gradient-coding notation. We write $N$ for the number of users (matching the federated-learning conventions of Chapter 1), and the same count serves as the number of workers where the chapter inherits Chapter 5's matrix-multiplication framing. Each worker holds an $(s+1)/N$ fraction of the data; the encoding distributes "redundancy" across workers so that any $K$ responses suffice to reconstruct the full gradient sum. (A concrete construction is sketched after the notation table.)
| Symbol | Meaning | Introduced |
|---|---|---|
| $w_t$ | Model parameters at iteration $t$ | s01 |
| $g_i$ | Local gradient computed by worker $i$ on its data partition | s01 |
| $g = \sum_{i=1}^{N} g_i$ | Aggregated gradient, the master's target | s01 |
| $N$ | Number of workers | s01 |
| $K$ | Recovery threshold (any $K$ responses suffice) | s02 |
| $s$ | Per-worker straggler tolerance: each worker runs $s+1$ partitions | s02 |
| $B$ | Encoding matrix (assigns linear combos of partial gradients to workers) | s02 |
| $\eta$ | Learning rate | s01 |
| $T$ | Number of training rounds | s01 |
| $d$ | Model dimensionality | s01 |
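To make the notation concrete, here is a minimal sketch of one classical construction, the fractional-repetition scheme of Tandon et al. (2017), expressed with this chapter's symbols ($N$, $s$, $K = N - s$, $B$). The helper names (`fractional_repetition_B`, `decode`) and the toy parameters are illustrative; the chapter's own construction may differ.

```python
# A minimal sketch, assuming the fractional-repetition construction of
# Tandon et al. (2017): N workers, N data partitions, straggler tolerance s,
# encoding matrix B (all helper names here are illustrative).
import numpy as np

def fractional_repetition_B(N, s):
    """Build B: row i is the 0/1 indicator of the partial gradients that
    worker i computes and sums. Requires (s + 1) to divide N; workers are
    grouped in blocks of s + 1 that replicate the same s + 1 partitions,
    so each worker 'runs s + 1 partitions' as in the notation table."""
    assert N % (s + 1) == 0, "this construction needs (s + 1) | N"
    B = np.zeros((N, N))
    for group in range(N // (s + 1)):
        block = slice(group * (s + 1), (group + 1) * (s + 1))
        B[block, block] = 1.0
    return B

def decode(responses, s):
    """Recover the aggregate g = sum_i g_i from any K = N - s responses.
    At most s workers are missing and each group has s + 1 members, so
    every group has at least one survivor; sum one message per group."""
    seen, total = set(), 0.0
    for i, msg in responses.items():
        group = i // (s + 1)
        if group not in seen:
            seen.add(group)
            total += msg
    return total

# Toy run: N = 6 workers, s = 2 stragglers, scalar partial gradients g_i = i.
N, s = 6, 2
g_partials = np.arange(1.0, N + 1)        # true aggregate is 21
messages = fractional_repetition_B(N, s) @ g_partials
alive = {i: messages[i] for i in range(N) if i not in (1, 4)}  # 2 stragglers
print(decode(alive, s), "== expected", g_partials.sum())
```

The key property is that each group of $s+1$ workers replicates the same partitions, so any $K = N - s$ responses leave at least one survivor per group and the master never waits for the slowest $s$ workers.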