Gradient Coding (Tandon et al.)

The Tandon et al. Construction in One Sentence

Partition the dataset into $N$ subsets, assign each subset to multiple workers (with overlap controlled by the storage parameter $s + 1$), and have each worker send a carefully chosen linear combination of its locally computed partial gradients. The combinations are designed so that any $K = N - s$ worker responses let the master cancel the contributions of the missing workers and recover the full gradient sum.

The point is that the encoding matrix $\mathbf{B}$ has a cyclic redundancy structure that mirrors the polynomial-code construction of Chapter 5, but specialized to the gradient-aggregation setting. Tandon, Lei, Dimakis, and Karampatziakis (ICML 2017) gave the explicit construction that achieves $K = N - s$ at per-worker storage of $s + 1$ partitions.

Definition:

$(s, N)$-Gradient Coding

An $(s, N)$-gradient-coding scheme for distributed SGD on a dataset of $N$ partitions $\{D_1, \ldots, D_N\}$ consists of:

  • A storage assignment $\mathcal{S}_k \subseteq [N]$ with $|\mathcal{S}_k| = s + 1$ for each worker $k$; that is, each worker stores $s + 1$ data partitions.
  • An encoding matrix $\mathbf{B} \in \mathbb{R}^{N \times N}$ with $\operatorname{supp}(\mathbf{B}_{k,:}) = \mathcal{S}_k$: row $k$ has support exactly on the partitions worker $k$ stores.
  • A decoding vector $\mathbf{d}_{\mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}$ for each potential responder set $\mathcal{T} \subseteq [N]$ of size $|\mathcal{T}| = N - s$.

Each worker $k$ computes $\mathbf{g}_i \triangleq \nabla F_{D_i}(\mathbf{w})$ for each $i \in \mathcal{S}_k$ (its $s + 1$ partial gradients) and sends the linear combination $\tilde{\mathbf{g}}_k = \sum_{i \in \mathcal{S}_k} \mathbf{B}_{k,i}\, \mathbf{g}_i$.

The master, on receiving responses from $\mathcal{T}$, computes $\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} d_k\, \tilde{\mathbf{g}}_k$, where $d_k$ denotes the entry of $\mathbf{d}_{\mathcal{T}}$ associated with worker $k$. The scheme is valid if $\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:} = \mathbf{1}_N^T$ for every $\mathcal{T}$ of size $N - s$; that is, the master's weighted sum equals exactly $\sum_i \mathbf{g}_i$.
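To see why the validity condition suffices, substitute the workers' messages into the master's weighted sum:

$$\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} d_k\, \tilde{\mathbf{g}}_k = \sum_{k \in \mathcal{T}} d_k \sum_{i \in \mathcal{S}_k} \mathbf{B}_{k,i}\, \mathbf{g}_i = \sum_{i=1}^{N} \big(\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:}\big)_i\, \mathbf{g}_i = \sum_{i=1}^{N} \mathbf{g}_i,$$

where the last equality uses $\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:} = \mathbf{1}_N^T$.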

The $|\mathcal{S}_k| = s + 1$ constraint says each worker stores $s + 1$ of the $N$ partitions, a $(s + 1)/N$ storage load. The $K = N - s$ recovery threshold says any $N - s$ workers suffice. The two are tied: more storage redundancy (larger $s + 1$) lets the master tolerate more stragglers (larger $s$).

Gradient Coding

A coded-computing scheme for distributed gradient aggregation: $N$ workers, each storing $s + 1$ data partitions, send weighted sums of partial gradients chosen so that the master can decode the full sum from any $N - s$ responses. Recovery threshold $K = N - s$.

Tandon et al.'s Cyclic Gradient-Coding Construction

Complexity: $O(sd)$ per worker per iteration; $O(K)$ scalar ops at the master per output coordinate.
Input: Number of workers $N$, straggler tolerance $s$ (with $s < N$), data partitions $\{D_1, \ldots, D_N\}$.
Output: Encoding matrix $\mathbf{B} \in \mathbb{R}^{N \times N}$ and decoding rule.
1. Storage assignment (cyclic): $\mathcal{S}_k \leftarrow \{k, k+1, \ldots, k+s\}$ (indices taken mod $N$) for $k = 1, \ldots, N$. Each worker stores $s + 1$ consecutive partitions.
2. Encoding matrix: Choose $\mathbf{B}$ so that its rows satisfy the support pattern from step 1 and the all-ones row vector $\mathbf{1}_N^T$ lies in the span of every $N - s$ rows. One recipe: draw a random $\mathbf{H} \in \mathbb{R}^{s \times N}$ whose rows each sum to zero, then fill each row of $\mathbf{B}$ (within its cyclic support) so that it lies in $\operatorname{null}(\mathbf{H})$; since $\mathbf{1}_N \in \operatorname{null}(\mathbf{H})$ and any $N - s$ rows of $\mathbf{B}$ form a basis of $\operatorname{null}(\mathbf{H})$ with probability 1, the span condition holds.
3. Decoder for a responder set $\mathcal{T}$: Solve the linear system $\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:} = \mathbf{1}_N^T$ for $\mathbf{d}_{\mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}$. By the span property from step 2, the system is solvable for every $|\mathcal{T}| \geq N - s$.
4. Per-iteration: Each worker computes its $s + 1$ partial gradients and sends $\tilde{\mathbf{g}}_k = \mathbf{B}_{k,:} [\mathbf{g}_1; \ldots; \mathbf{g}_N]$ (only the entries on $\mathcal{S}_k$ are nonzero). The master computes $\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} d_k\, \tilde{\mathbf{g}}_k$.

Tandon et al. show the cyclic construction satisfies the decoder-existence condition for every $|\mathcal{T}| = N - s$, regardless of which workers straggle. Other valid constructions exist; for example, a random Gaussian encoding matrix with the same support pattern works with high probability.
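A minimal numpy sketch of this construction, following the randomized null-space recipe in step 2; the helper names `cyclic_gradient_code` and `decode` are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def cyclic_gradient_code(N, s):
    """Encoding matrix B with cyclic support whose rows lie in null(H).

    H has rows summing to zero, so the all-ones vector is in null(H);
    generically, any N - s rows of B then contain it in their span.
    """
    H = rng.standard_normal((s, N))
    H[:, -1] = -H[:, :-1].sum(axis=1)        # force H @ ones(N) == 0
    B = np.zeros((N, N))
    for k in range(N):
        supp = [(k + j) % N for j in range(s + 1)]   # cyclic support of row k
        B[k, supp[0]] = 1.0
        # choose the remaining s entries so that H @ B[k] == 0
        B[k, supp[1:]] = np.linalg.solve(H[:, supp[1:]], -H[:, supp[0]])
    return B

def decode(B, responders):
    """Weights d with d @ B[responders] equal to the all-ones row vector."""
    d, *_ = np.linalg.lstsq(B[responders].T, np.ones(B.shape[1]), rcond=None)
    return d
```

Decoding is then just expressing $\mathbf{1}_N^T$ in terms of the responders' rows of $\mathbf{B}$, which the span property guarantees is possible.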

Theorem: Gradient Coding Achieves Recovery Threshold $K = N - s$

The cyclic $(s, N)$-gradient-coding scheme (Algorithm above) satisfies:

  1. Correctness. For every responder set $\mathcal{T} \subseteq [N]$ with $|\mathcal{T}| = N - s$ and every gradient sequence $\{\mathbf{g}_i\}$, the decoder outputs $\widehat{\mathbf{G}} = \sum_i \mathbf{g}_i$ exactly.
  2. Per-worker storage: $|\mathcal{S}_k| = s + 1$ data partitions, i.e., storage load $\mu = (s + 1)/N$.
  3. Per-worker computation: each worker computes $s + 1$ partial gradients per round.

The recovery threshold $K = N - s$ is information-theoretically optimal: any $(s, N)$-gradient-coding scheme with per-worker storage $s + 1$ has $K \geq N - s$.

Each worker's response is one linear equation in the $N$ partial gradients $\{\mathbf{g}_i\}$ (with sparsity pattern determined by $\mathcal{S}_k$). The cyclic structure ensures that any $N - s$ rows of $\mathbf{B}$ contain the all-ones row vector in their span, so the master can recover the combination $\mathbf{1}^T [\mathbf{g}_1; \ldots; \mathbf{g}_N] = \sum_i \mathbf{g}_i$. The point is that the cyclic structure gives a deterministic guarantee: no matter which $s$ workers straggle, the remaining $N - s$ rows span the all-ones direction.

Example: $N = 4$, $s = 1$: One Straggler Tolerated

Construct a $(1, 4)$-gradient-coding scheme. Verify that any 3 of the 4 workers' responses suffice to recover $\mathbf{G} = \mathbf{g}_1 + \mathbf{g}_2 + \mathbf{g}_3 + \mathbf{g}_4$.
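Under the same assumptions as the sketch above (reusing the hypothetical `cyclic_gradient_code` and `decode` helpers), a direct numerical check of the $(1, 4)$ scheme:

```python
N, s = 4, 1
B = cyclic_gradient_code(N, s)             # 4 workers, 2 partitions each
g = rng.standard_normal((N, 3))            # partial gradients g_1..g_4 in R^3
full = g.sum(axis=0)                       # target: g_1 + g_2 + g_3 + g_4

for straggler in range(N):                 # drop each worker in turn
    T = [k for k in range(N) if k != straggler]
    messages = B[T] @ g                    # the 3 responders' coded sums
    d = decode(B, T)                       # decoding weights for this T
    assert np.allclose(d @ messages, full)
print("any 3 of 4 responses recover the full gradient sum")
```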

Gradient Coding: Cyclic Encoding

Animation of the Tandon et al. $(s, N)$-gradient-coding scheme. Cyclic data assignment ensures that any $N - s$ workers' responses span the all-ones direction, recovering the full sum from any $K = N - s$ responses.

Gradient Coding: Storage vs. Straggler Tolerance

Plot the recovery threshold $K = N - s$ against the per-worker storage $\mu = (s + 1)/N$ for gradient coding. Each operating point trades storage for straggler tolerance: at $\mu = 1/N$ (one partition per worker), $K = N$ (no tolerance); at $\mu = 1$ (full replication), $K = 1$ (any single response works). The linear dependence $K = N(1 - \mu) + 1$ is the gradient-coding analogue of the polynomial-code tradeoff in Chapter 5.
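For concreteness, a few operating points on this curve at $N = 30$ workers (the plot's default parameter), computed straight from the two formulas; a minimal sketch:

```python
N = 30                          # number of workers, as in the plot above
for s in (0, 2, 6, 15, 29):     # a few straggler-tolerance levels
    mu = (s + 1) / N            # per-worker storage load
    K = N - s                   # recovery threshold
    print(f"s = {s:2d}   mu = {mu:.3f}   K = {K:2d}")
```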


Theorem: Optimality ($K \geq N - s$)

Any $(s, N)$-gradient-coding scheme with per-worker storage $|\mathcal{S}_k| = s + 1$ has recovery threshold $K \geq N - s$.

The cyclic Tandon construction matches this bound, so it is information-theoretically optimal at the storage level $\mu = (s + 1)/N$.

At storage $s + 1$, each worker's response is a linear combination of $s + 1$ of the $N$ partial gradients. Any $K$ responders jointly touch at most $(s + 1)K$ of the $N$ unknowns, and every partition must appear in some responder's support (an uncovered partition's gradient could never reach the output), so $(s + 1)K \geq N$, i.e., $K \geq N/(s + 1)$. This counting bound is weaker than the theorem (for $N = 4$, $s = 1$ it gives $K \geq 2$, while the true threshold is $K = 3$); the strict lower bound $K \geq N - s$ requires a more delicate argument in the same spirit.

Key Takeaway

Gradient coding achieves $K = N - s$ at storage load $(s + 1)/N$, which is information-theoretically optimal. Each unit of straggler tolerance costs one extra data partition stored per worker. The cyclic construction is explicit, deterministic, and matches the lower bound: a closed-form story for the coded-aggregation problem of distributed SGD.

Common Mistake: Storage Grows Linearly with Straggler Tolerance

Mistake:

Use $s = N - 1$ in gradient coding (each worker stores all $N$ partitions) for "maximum robustness".

Correction:

With $s = N - 1$ each worker stores the entire dataset and recovery is trivial ($K = 1$), but the storage cost is the full dataset size per worker, defeating the purpose of distribution. In practice, $s = \alpha N$ for $\alpha \in [0.05, 0.20]$ is the sweet spot: tolerate 5–20% stragglers while each worker stores roughly a 5–20% fraction of the dataset. Section 6.3's approximate variant relaxes this storage cost further.

⚠️ Engineering Note

Gradient Coding in Production: A Modest Footprint

Gradient coding has been deployed in some production federated-learning systems (NVIDIA Flare's coded-FL extension, certain Amazon Forecast pipelines), but uptake has been slow because its overhead ($s + 1$ partitions of compute per worker) is paid by every worker every round, even when no stragglers actually appear. Approximate gradient coding (Section 6.3) mitigates this by paying the cost only when stragglers materialize.

The prevailing engineering view is: use plain synchronous SGD with timeouts and re-execution for typical workloads; deploy gradient coding when straggler tail latency is the bottleneck and per-worker storage is plentiful.

Practical Constraints
  • Per-round overhead: $s + 1$ partial gradients vs. $1$ for plain SGD
  • Encoder/decoder cost: $O(N)$ per output coordinate
  • Best-fit deployment: cluster training of $\sim 10^8$-parameter models

📋 Ref: Raviv et al. 2020 IEEE T-IT; NVIDIA Flare CodedFL

Historical Note: Gradient Coding's Birth at ICML 2017

2016–2020

Rashish Tandon, Qi Lei, Alexandros Dimakis, and Nikos Karampatziakis introduced gradient coding at ICML 2017, in direct response to Lee et al.'s coded-matrix-multiplication work. The motivation was distributed training of deep networks: the synchronous-SGD bottleneck was empirically catastrophic on small-to-medium EC2 clusters, and the coded-computing framework provided a turn-key fix. The follow-on literature is large: Halbawi et al. (2018) on Reed-Solomon-based gradient coding, Charles et al. (2017) on approximate variants (Section 6.3), and Raviv et al. (2020) on constructions from cyclic MDS codes and expander graphs. Modern coded-computing work (ByzSecAgg, Chapter 11) builds on these foundations.

Quick Check

An $(s, N)$-gradient-coding scheme tolerates $s$ stragglers out of $N$ workers. What is each worker's per-round computation cost (number of partial gradients computed)?

$1$

$s + 1$

$N - s$

$N$
NN