Gradient Coding (Tandon et al.)
The Tandon et al. Construction in One Sentence
Partition the dataset into $N$ subsets, assign each subset to multiple workers (with overlap controlled by the storage parameter $s$), and have each worker send a carefully chosen linear combination of its locally computed partial gradients. The combinations are designed so that any $N - s$ responses let the master cancel the contributions of the missing workers and recover the full gradient sum.
The point is that the encoding matrix has a cyclic redundancy structure that mirrors the polynomial-code construction of Chapter 5, but specialized to the gradient-aggregation setting. Tandon, Lei, Dimakis, and Karampatziakis (ICML 2017) gave the explicit construction that achieves recovery threshold $K = N - s$ at a per-worker storage of $s + 1$ partitions.
Definition: $(N, s)$-Gradient Coding
An $(N, s)$-gradient-coding scheme for distributed SGD on a dataset of $N$ partitions consists of:
- A storage assignment $S_i \subseteq \{1, \dots, N\}$ with $|S_i| = s + 1$ for each worker $i$, i.e., each worker stores $s + 1$ data partitions.
- An encoding matrix $B \in \mathbb{R}^{N \times N}$ whose row $i$ has support exactly on the partitions in $S_i$ that worker $i$ stores.
- A decoding vector $a_{\mathcal{T}} \in \mathbb{R}^{N - s}$ for each potential responder set $\mathcal{T} \subseteq \{1, \dots, N\}$ of size $N - s$.
Each worker $i$ computes $g_j$ for each $j \in S_i$ (its partial gradients) and sends the linear combination $c_i = \sum_{j \in S_i} B_{ij}\, g_j$.
The master, on receiving responses from $\mathcal{T}$, computes $\sum_{i \in \mathcal{T}} a_{\mathcal{T}, i}\, c_i$. The scheme is valid if $a_{\mathcal{T}}^\top B_{\mathcal{T}} = \mathbf{1}^\top$ for every $\mathcal{T}$ of size $N - s$, i.e., the master's weighted sum equals exactly $\sum_{j=1}^{N} g_j$.
The constraint $|S_i| = s + 1$ says each worker stores $s + 1$ out of $N$ partitions, an $(s+1)/N$ storage load. The recovery threshold $K = N - s$ says any $N - s$ workers suffice. The two are tied: more storage redundancy (larger $s + 1$) lets the master tolerate more stragglers (larger $s$).
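To make the mechanics concrete, here is a minimal numpy sketch (an illustration of the definition, not any particular library's API) that simulates one round: workers form $c_i = \sum_{j \in S_i} B_{ij} g_j$, and the master solves $a_{\mathcal{T}}^\top B_{\mathcal{T}} = \mathbf{1}^\top$ for whichever responder set $\mathcal{T}$ it hears from. The toy $B$ below is the $(N, s) = (4, 1)$ cyclic code worked out later in this section.

```python
import numpy as np

def encode(B, G):
    """Worker responses c_i = sum_j B[i, j] * g_j.
    G has shape (N, d): row j is the partial gradient g_j of partition j."""
    return B @ G  # in practice each worker only touches its s+1 stored partitions

def decode(B, C, T):
    """Recover sum_j g_j from the responders in T (|T| = N - s)."""
    N = B.shape[1]
    B_T = B[list(T)]
    # Solve a_T^T B_T = 1^T; the solution is exact whenever the scheme is valid for T.
    a_T, *_ = np.linalg.lstsq(B_T.T, np.ones(N), rcond=None)
    return a_T @ C[list(T)]

B = np.array([[1., 1, 0, 0],   # (4, 1) cyclic code: worker i stores partitions {i, i+1}
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])
G = np.random.default_rng(0).standard_normal((4, 3))  # four partial gradients in R^3
C = encode(B, G)
full_sum = decode(B, C, T=[0, 1, 2])                  # worker 4 (index 3) straggles
assert np.allclose(full_sum, G.sum(axis=0))
```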
Gradient Coding
A coded-computing scheme for distributed gradient aggregation: $N$ workers, each storing $s + 1$ data partitions, send weighted sums of partial gradients chosen so the master can decode the full gradient sum from any $N - s$ responses. Recovery threshold $K = N - s$.
Tandon et al.'s Cyclic Gradient-Coding Construction
Complexity: $s + 1$ partial-gradient computations per worker per iteration; $N - s$ scalar multiply-adds at the master per output coordinate. Tandon et al. show the cyclic construction always satisfies the decoder-existence condition ($\mathbf{1}^\top$ in the row span of $B_{\mathcal{T}}$) for every $\mathcal{T}$, regardless of which workers straggle. Other valid constructions exist (random Gaussian encoding works with high probability, as do random encodings over fields of suitable size).
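A hedged numpy sketch of one way to build such a cyclic-support encoding matrix (following the null-space idea behind Tandon et al.'s construction, but an illustrative reconstruction rather than a verbatim reproduction of their algorithm): draw a random $H \in \mathbb{R}^{s \times N}$ whose rows each sum to zero, then choose every row of $B$ inside $\operatorname{null}(H)$ on a cyclic support of size $s + 1$. Since $\mathbf{1} \in \operatorname{null}(H)$ and, generically, any $N - s$ such rows form a basis of the $(N - s)$-dimensional null space, every responder set can express the all-ones combination.

```python
import numpy as np

def cyclic_encoding_matrix(N, s, seed=0):
    """N x N encoding matrix: row i has support {i, ..., i+s} (mod N) and lies in null(H)."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((s, N))
    H[:, -1] = -H[:, :-1].sum(axis=1)          # force H @ ones(N) = 0, so 1 is in null(H)
    B = np.zeros((N, N))
    for i in range(N):
        supp = [(i + j) % N for j in range(s + 1)]
        B[i, supp[0]] = 1.0
        # The remaining s entries solve H[:, supp[1:]] x = -H[:, supp[0]],
        # so that H @ B[i] = 0, i.e. row i lies in null(H).
        B[i, supp[1:]] = np.linalg.solve(H[:, supp[1:]], -H[:, supp[0]])
    return B

B = cyclic_encoding_matrix(N=6, s=2)           # each worker stores 3 of 6 partitions
```

Each row has exactly $s + 1$ nonzeros, matching the storage and per-round compute figures above.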
Theorem: Gradient Coding Achieves Recovery Threshold $K = N - s$
The cyclic $(N, s)$-gradient-coding scheme (the construction above) satisfies:
- Correctness. For every responder set $\mathcal{T}$ with $|\mathcal{T}| = N - s$ and every gradient sequence $g_1, \dots, g_N$, the decoder outputs $\sum_{j=1}^{N} g_j$ exactly.
- Per-worker storage: $s + 1$ data partitions, i.e., storage load $(s+1)/N$.
- Per-worker computation: each worker computes $s + 1$ partial gradients per round.
The recovery threshold $K = N - s$ is information-theoretically optimal: any $(N, s)$-gradient-coding scheme with per-worker storage $s + 1$ has $K \geq N - s$.
Each worker's response is one linear equation in the partial gradients (with sparsity pattern determined by $S_i$). The cyclic structure ensures that for any responder set $\mathcal{T}$ of size $N - s$, the all-ones vector $\mathbf{1}^\top$ lies in the span of the rows $\{B_i : i \in \mathcal{T}\}$, so the master can recover the all-ones combination $\sum_{j=1}^{N} g_j$. The point is that this is a deterministic guarantee: no matter which $s$ workers straggle, the remaining rows span the desired combination.
Correctness via decoder existence
For every $\mathcal{T} \subseteq \{1, \dots, N\}$ of size $N - s$, the row submatrix $B_{\mathcal{T}}$ has rank $N - s$ (a property of the cyclic construction). The condition $a_{\mathcal{T}}^\top B_{\mathcal{T}} = \mathbf{1}^\top$ is a system of $N$ equations in $N - s$ unknowns; since $\mathbf{1}^\top$ lies in the row span of $B_{\mathcal{T}}$ (the cyclic construction guarantees this), a solution exists.
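The decoder-existence condition can also be spot-checked numerically. The sketch below reuses the hypothetical cyclic_encoding_matrix helper from the construction sketch above and verifies, for every responder set $\mathcal{T}$ of size $N - s$, that $\mathbf{1}^\top$ lies in the row span of $B_{\mathcal{T}}$ (zero least-squares residual).

```python
import itertools
import numpy as np

def decoder_exists_for_all(B, s, tol=1e-6):
    """True iff 1^T is in the row span of B_T for every responder set of size N - s."""
    N = B.shape[0]
    ones = np.ones(N)
    for T in itertools.combinations(range(N), N - s):
        B_T = B[list(T)]
        a, *_ = np.linalg.lstsq(B_T.T, ones, rcond=None)
        if np.linalg.norm(B_T.T @ a - ones) > tol:
            return False   # this straggler pattern would not be decodable
    return True

# reuses cyclic_encoding_matrix from the sketch above; expected output: True
print(decoder_exists_for_all(cyclic_encoding_matrix(N=8, s=3), s=3))
```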
Storage load
Each worker stores $s + 1$ partitions out of $N$, so the per-worker storage load is $(s+1)/N$.
Lower bound on $K$
With per-worker storage $s + 1$, no worker covers more than $s + 1$ of the $N$ partial gradients, so each response is informative about at most $s + 1$ of the $N$ unknowns. To recover the all-ones combination, the responders' rows of $B$ must cover all $N$ partitions, which by itself forces at least $\lceil N/(s+1) \rceil$ responders; a sharper linear-algebra argument (see the optimality theorem below) shows the tight bound is $N - s$. Hence $K \geq N - s$.
Example: $(N, s) = (4, 1)$: One Straggler Tolerated
Construct a $(4, 1)$-gradient-coding scheme. Verify that any 3 of 4 workers' responses suffice to recover $g_1 + g_2 + g_3 + g_4$.
Storage assignment
$S_1 = \{1, 2\}$, $S_2 = \{2, 3\}$, $S_3 = \{3, 4\}$, $S_4 = \{4, 1\}$: each worker stores 2 consecutive partitions (cyclic).
Encoding matrix
One valid choice, with row $i$ supported on $S_i$ (the explicit values are not unique; any matrix with this support pattern and the decodability property works):
$$B = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}$$
Worker responses
Worker $i$ sends $c_i = B_{i,i}\, g_i + B_{i,i+1}\, g_{i+1}$ (indices mod 4, using the cyclic encoding); with the matrix above, $c_i = g_i + g_{i+1}$.
Decoder for $\mathcal{T} = \{1, 2, 3\}$ (worker 4 straggles)
Find $a_{\mathcal{T}} \in \mathbb{R}^3$ such that $a_{\mathcal{T}}^\top B_{\mathcal{T}} = (1, 1, 1, 1)$; the explicit values depend on the encoding choice. With the $B$ above, $a_{\mathcal{T}} = (1, 0, 1)$ works. Verification: $1 \cdot c_1 + 0 \cdot c_2 + 1 \cdot c_3 = (g_1 + g_2) + (g_3 + g_4) = \sum_{j=1}^{4} g_j$.
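A few lines of numpy confirming the example end to end for all four straggler patterns (using the specific $B$ chosen above; any other valid choice verifies the same way):

```python
import itertools
import numpy as np

B = np.array([[1., 1, 0, 0],   # worker 1 sends g1 + g2
              [0, 1, 1, 0],    # worker 2 sends g2 + g3
              [0, 0, 1, 1],    # worker 3 sends g3 + g4
              [1, 0, 0, 1]])   # worker 4 sends g4 + g1
g = np.random.default_rng(1).standard_normal((4, 5))  # four partial gradients in R^5
c = B @ g                                             # the four worker responses

for T in itertools.combinations(range(4), 3):         # every set of 3 responders
    a, *_ = np.linalg.lstsq(B[list(T)].T, np.ones(4), rcond=None)
    assert np.allclose(a @ c[list(T)], g.sum(axis=0))
print("all four straggler patterns decode exactly")
```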
Lesson
Each missing worker can be "covered" by the surviving workers' overlapping storage. With $s + 1 = 2$, the cyclic scheme tolerates exactly $s = 1$ straggler. Adding more storage redundancy (larger $s + 1$) tolerates more stragglers at proportionally higher per-worker cost.
Figure: Gradient Coding: Cyclic Encoding (illustration of the cyclic storage and encoding pattern).
Gradient Coding: Storage vs. Straggler Tolerance
Plot the recovery threshold $K = N - s$ against the per-worker storage $s + 1$ for gradient coding; a short plotting sketch follows below. Each operating point trades storage for straggler tolerance: at $s + 1 = 1$ (one partition per worker), $K = N$ (no tolerance); at $s + 1 = N$ (full replication), $K = 1$ (any single response works). The linear tradeoff $K = N - s$ is the gradient-coding analogue of the polynomial-code tradeoff in Chapter 5.
Parameters: number of workers $N$.
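A short matplotlib sketch of the tradeoff curve; the choice $N = 12$ below is illustrative, not from the text.

```python
import matplotlib.pyplot as plt
import numpy as np

N = 12                         # number of workers (illustrative choice)
s = np.arange(0, N)            # stragglers tolerated: 0, ..., N-1
storage = s + 1                # partitions stored per worker
K = N - s                      # recovery threshold

plt.plot(storage, K, marker="o")
plt.xlabel("per-worker storage s + 1 (partitions)")
plt.ylabel("recovery threshold K = N - s")
plt.title(f"Gradient coding: storage vs. straggler tolerance (N = {N})")
plt.show()
```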
Theorem: Optimality of $K = N - s$
Any $(N, s)$-gradient-coding scheme with per-worker storage $s + 1$ has recovery threshold $K \geq N - s$.
The cyclic Tandon construction matches this bound, so it is information-theoretically optimal at storage level $s + 1$.
At storage $s + 1$, each worker's response is a linear combination of $s + 1$ of the $N$ partial gradients. Any $m$ workers cover at most $m(s + 1)$ of the $N$ unknowns. To cover all of $\{1, \dots, N\}$, which is needed to extract the all-ones combination, we need $m(s + 1) \geq N$, i.e., $m \geq \lceil N/(s+1) \rceil$. For the strict lower bound $K \geq N - s$, the argument is more delicate but follows the same spirit.
Subset coverage
Let $\mathcal{T}$ be a candidate responder set with $|\mathcal{T}| = m$. The union $\bigcup_{i \in \mathcal{T}} S_i$ has size at most $m(s + 1)$. Since the master must recover the all-ones combination of all $N$ gradients, the union must equal $\{1, \dots, N\}$, giving $m \geq \lceil N/(s+1) \rceil$.
Tightening to $K \geq N - s$
The looser bound $\lceil N/(s+1) \rceil$ does not match $N - s$; for example, with $N = 12$ and $s = 3$ the coverage argument forces only $3$ responders, far below $N - s = 9$. The strict bound requires a more careful argument: each "dropped" worker fails to contribute its unique partition, and the remaining workers must span the desired all-ones vector. Standard linear-algebra manipulations show the tight bound is $K \geq N - s$. See Tandon et al., Thm. 2, for the full proof.
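For intuition (not a proof), one can check numerically that the $(4, 1)$ example from earlier cannot do better than $K = 3$: some responder sets of size $2$ leave $\mathbf{1}^\top$ outside the row span of $B_{\mathcal{T}}$, so recovery from an arbitrary pair of responses is impossible for that scheme.

```python
import itertools
import numpy as np

B = np.array([[1., 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]])
ones = np.ones(4)
bad = []
for T in itertools.combinations(range(4), 2):            # candidate threshold K = 2
    a, *_ = np.linalg.lstsq(B[list(T)].T, ones, rcond=None)
    if np.linalg.norm(B[list(T)].T @ a - ones) > 1e-8:
        bad.append(T)
# non-empty list: at least one 2-responder pattern cannot decode, so K = 2 is unachievable here
print("size-2 responder sets that cannot decode:", bad)
```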
Tightness
The cyclic construction of Tandon et al. matches this bound exactly. The polynomial-code framework of Chapter 5 also achieves the same bound when specialized to gradient coding.
Key Takeaway
Gradient coding achieves recovery threshold $K = N - s$ at per-worker storage $s + 1$, which is information-theoretically optimal. Each unit of straggler tolerance costs one extra data partition stored per worker. The cyclic construction is explicit, deterministic, and matches the lower bound: a closed-form story for the coded-aggregation problem of distributed SGD.
Common Mistake: Storage Grows Linearly with Straggler Tolerance
Mistake:
Use $s = N - 1$ in gradient coding (each worker stores all $N$ partitions) for "maximum robustness".
Correction:
With $s = N - 1$, each worker stores the entire dataset and recovery is trivial ($K = 1$), but the storage cost is the full dataset size per worker, defeating the purpose of distribution. In practice, choosing $s$ to match the expected straggler fraction (roughly $0.05N$ to $0.2N$) is the sweet spot: tolerate 5-20% stragglers at $s$ extra partitions of per-worker storage. Section 6.3's approximate variant relaxes this storage cost further.
Gradient Coding in Production: A Modest Footprint
Gradient coding has been deployed in some production federated-learning systems (NVIDIA Flare's coded-FL extension, certain Amazon Forecast pipelines), but uptake has been slow because its overhead, $s + 1$ partitions of compute per worker instead of one, is paid by every worker every round, even when no stragglers actually appear. Approximate gradient coding (Section 6.3) mitigates this by paying the cost only when stragglers materialize.
The prevailing engineering view is: use plain synchronous SGD with timeouts and re-execution for typical workloads; deploy gradient coding when straggler tail latency is the bottleneck and per-worker storage is plentiful.
- Per-round overhead: $s + 1$ partial gradients per worker vs. $1$ for plain synchronous SGD
- Encoder/decoder cost: $O(s + 1)$ scalar ops per output coordinate at each worker, $O(N - s)$ at the master
- Best-fit deployment: cluster training of large models where straggler tail latency dominates
Historical Note: Gradient Coding's Birth at ICML 2017
2016–2020. Rashish Tandon, Qi Lei, Alexandros Dimakis, and Nikos Karampatziakis introduced gradient coding at ICML 2017, in direct response to Lee et al.'s coded-matrix-multiplication paper of the same year. The motivation was distributed training of deep networks: the synchronous-SGD bottleneck was empirically catastrophic on small-to-medium EC2 clusters, and the polynomial-code framework provided a turn-key fix. The follow-on literature is large: Halbawi et al. (2018) for fractional repetition codes, Charles et al. (2017) for approximate variants (Section 6.3), and Raviv et al. (2020) for the BCH-code interpretation. Modern coded-computing work (Caire et al. ByzSecAgg, Chapter 11) builds on these foundations.
Quick Check
An $(N, s)$-gradient-coding scheme tolerates $s$ stragglers out of $N$ workers. What is each worker's per-round computation cost (number of partial gradients computed)?
Each worker computes one partial gradient per partition in its storage assignment $S_i$, of size $s + 1$. The cost is therefore $s + 1$ partial gradients per round, linear in $s$.