Gradient Coding (Tandon et al.)

The Tandon et al. Construction in One Sentence

Partition the dataset into $N$ subsets, assign each subset to multiple workers (with overlap controlled by the storage parameter $s + 1$), and have each worker send a carefully chosen linear combination of its locally computed partial gradients. The combinations are designed so that any $K = N - s$ worker responses let the master cancel the contributions of the missing workers and recover the full gradient sum.

The point is that the encoding matrix $\mathbf{B}$ has a cyclic redundancy structure that mirrors the polynomial-code construction of Chapter 5, but specialized to the gradient-aggregation setting. Tandon, Lei, Dimakis, and Karampatziakis (ICML 2017) gave the explicit construction that achieves $K = N - s$ at per-worker storage of $s + 1$ partitions.

Definition:

$(s, N)$-Gradient Coding

An $(s, N)$-gradient-coding scheme for distributed SGD on a dataset of $N$ partitions $\{D_1, \ldots, D_N\}$ consists of:

  • A storage assignment $\mathcal{S}_k \subseteq [N]$ with $|\mathcal{S}_k| = s + 1$ for each worker $k$; that is, each worker stores $s + 1$ data partitions.
  • An encoding matrix $\mathbf{B} \in \mathbb{R}^{N \times N}$ with $\operatorname{supp}(\mathbf{B}_{k,:}) = \mathcal{S}_k$: row $k$ has support exactly on the partitions worker $k$ stores.
  • A decoding vector $\mathbf{d}_{\mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}$ for each potential responder set $\mathcal{T} \subseteq [N]$ of size $|\mathcal{T}| = N - s$.

Each worker $k$ computes $\mathbf{g}_i \triangleq \nabla F_{D_i}(\mathbf{w})$ for each $i \in \mathcal{S}_k$ (its $s + 1$ partial gradients) and sends the linear combination $\tilde{\mathbf{g}}_k = \sum_{i \in \mathcal{S}_k} \mathbf{B}_{k,i}\, \mathbf{g}_i$.

The master, on receiving responses from $\mathcal{T}$, computes $\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} d_k\, \tilde{\mathbf{g}}_k$, where $d_k$ denotes the entry of $\mathbf{d}_{\mathcal{T}}$ associated with worker $k$. The scheme is valid if $\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:} = \mathbf{1}_N^T$ for every $\mathcal{T}$ of size $N - s$; that is, the master's weighted sum equals exactly $\sum_i \mathbf{g}_i$.
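To see why the validity condition suffices, substitute the workers' messages into the master's weighted sum:

$$\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} d_k\, \tilde{\mathbf{g}}_k = \sum_{k \in \mathcal{T}} d_k \sum_{i \in \mathcal{S}_k} \mathbf{B}_{k,i}\, \mathbf{g}_i = \sum_{i=1}^{N} \big(\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:}\big)_i\, \mathbf{g}_i = \sum_{i=1}^{N} \mathbf{g}_i,$$

where the last equality uses $\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:} = \mathbf{1}_N^T$.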

The $|\mathcal{S}_k| = s + 1$ constraint says each worker stores $s + 1$ of the $N$ partitions, a $(s + 1)/N$ storage load. The $K = N - s$ recovery threshold says any $N - s$ workers suffice. The two are tied: more storage redundancy (larger $s + 1$) lets the master tolerate more stragglers (larger $s$).

Gradient Coding

A coded-computing scheme for distributed gradient aggregation: $N$ workers, each storing $s + 1$ data partitions, send weighted sums of partial gradients chosen so that the master can decode the full sum from any $N - s$ responses. Recovery threshold $K = N - s$.

Tandon et al.'s Cyclic Gradient-Coding Construction

Complexity: $O(sd)$ per worker per iteration; $O(K)$ scalar ops at the master per output coordinate.
Input: Number of workers $N$, straggler tolerance $s$ (with $s < N$), data partitions $\{D_1, \ldots, D_N\}$.
Output: Encoding matrix $\mathbf{B} \in \mathbb{R}^{N \times N}$ and decoding rule.
1. Storage assignment (cyclic): $\mathcal{S}_k \leftarrow \{k, k+1, \ldots, k+s\}$ (indices taken mod $N$) for $k = 1, \ldots, N$. Each worker stores $s + 1$ consecutive partitions.
2. Encoding matrix: Choose $\mathbf{B}$ so that its rows satisfy the support pattern from step 1 and the all-ones row vector $\mathbf{1}_N^T$ lies in the span of every $N - s$ rows. One recipe: draw a random $\mathbf{H} \in \mathbb{R}^{s \times N}$ whose rows each sum to zero, then fill each row of $\mathbf{B}$ (within its cyclic support) so that it lies in $\operatorname{null}(\mathbf{H})$; since $\mathbf{1}_N \in \operatorname{null}(\mathbf{H})$ and any $N - s$ rows of $\mathbf{B}$ form a basis of $\operatorname{null}(\mathbf{H})$ with probability 1, the span condition holds.
3. Decoder for a responder set $\mathcal{T}$: Solve the linear system $\mathbf{d}_{\mathcal{T}}^T \mathbf{B}_{\mathcal{T},:} = \mathbf{1}_N^T$ for $\mathbf{d}_{\mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}$. By the span property from step 2, the system is solvable for every $|\mathcal{T}| \geq N - s$.
4. Per-iteration: Each worker computes its $s + 1$ partial gradients and sends $\tilde{\mathbf{g}}_k = \mathbf{B}_{k,:} [\mathbf{g}_1; \ldots; \mathbf{g}_N]$ (only the entries on $\mathcal{S}_k$ are nonzero). The master computes $\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} d_k\, \tilde{\mathbf{g}}_k$.

Tandon et al. show the cyclic construction satisfies the decoder-existence condition for every $|\mathcal{T}| = N - s$, regardless of which workers straggle. Other valid constructions exist; for example, a random Gaussian encoding matrix with the same support pattern works with high probability.
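A minimal numpy sketch of this construction, following the randomized null-space recipe in step 2; the helper names `cyclic_gradient_code` and `decode` are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def cyclic_gradient_code(N, s):
    """Encoding matrix B with cyclic support whose rows lie in null(H).

    H has rows summing to zero, so the all-ones vector is in null(H);
    generically, any N - s rows of B then contain it in their span.
    """
    H = rng.standard_normal((s, N))
    H[:, -1] = -H[:, :-1].sum(axis=1)        # force H @ ones(N) == 0
    B = np.zeros((N, N))
    for k in range(N):
        supp = [(k + j) % N for j in range(s + 1)]   # cyclic support of row k
        B[k, supp[0]] = 1.0
        # choose the remaining s entries so that H @ B[k] == 0
        B[k, supp[1:]] = np.linalg.solve(H[:, supp[1:]], -H[:, supp[0]])
    return B

def decode(B, responders):
    """Weights d with d @ B[responders] equal to the all-ones row vector."""
    d, *_ = np.linalg.lstsq(B[responders].T, np.ones(B.shape[1]), rcond=None)
    return d
```

Decoding is then just expressing $\mathbf{1}_N^T$ in terms of the responders' rows of $\mathbf{B}$, which the span property guarantees is possible.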

Theorem: Gradient Coding Achieves Recovery Threshold $K = N - s$

The cyclic $(s, N)$-gradient-coding scheme (Algorithm above) satisfies:

  1. Correctness. For every responder set $\mathcal{T} \subseteq [N]$ with $|\mathcal{T}| = N - s$ and every gradient sequence $\{\mathbf{g}_i\}$, the decoder outputs $\widehat{\mathbf{G}} = \sum_i \mathbf{g}_i$ exactly.
  2. Per-worker storage: $|\mathcal{S}_k| = s + 1$ data partitions, i.e., storage load $\mu = (s + 1)/N$.
  3. Per-worker computation: each worker computes $s + 1$ partial gradients per round.

The recovery threshold $K = N - s$ is information-theoretically optimal: any $(s, N)$-gradient-coding scheme with per-worker storage $s + 1$ has $K \geq N - s$.

Each worker's response is one linear equation in the $N$ partial gradients $\{\mathbf{g}_i\}$ (with sparsity pattern determined by $\mathcal{S}_k$). The cyclic structure ensures that any $N - s$ rows of $\mathbf{B}$ contain the all-ones row vector in their span, so the master can recover the combination $\mathbf{1}^T [\mathbf{g}_1; \ldots; \mathbf{g}_N] = \sum_i \mathbf{g}_i$. The point is that the cyclic structure gives a deterministic guarantee: no matter which $s$ workers straggle, the remaining $N - s$ rows span the all-ones direction.

Example: $N = 4$, $s = 1$: One Straggler Tolerated

Construct a $(1, 4)$-gradient-coding scheme. Verify that any 3 of the 4 workers' responses suffice to recover $\mathbf{G} = \mathbf{g}_1 + \mathbf{g}_2 + \mathbf{g}_3 + \mathbf{g}_4$.
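Under the same assumptions as the sketch above (reusing the hypothetical `cyclic_gradient_code` and `decode` helpers), a direct numerical check of the $(1, 4)$ scheme:

```python
N, s = 4, 1
B = cyclic_gradient_code(N, s)             # 4 workers, 2 partitions each
g = rng.standard_normal((N, 3))            # partial gradients g_1..g_4 in R^3
full = g.sum(axis=0)                       # target: g_1 + g_2 + g_3 + g_4

for straggler in range(N):                 # drop each worker in turn
    T = [k for k in range(N) if k != straggler]
    messages = B[T] @ g                    # the 3 responders' coded sums
    d = decode(B, T)                       # decoding weights for this T
    assert np.allclose(d @ messages, full)
print("any 3 of 4 responses recover the full gradient sum")
```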

Gradient Coding: Cyclic Encoding

Animation of the Tandon et al. $(s, N)$-gradient-coding scheme. Cyclic data assignment ensures that any $N - s$ workers' responses span the all-ones direction, recovering the full sum from any $K = N - s$ responses.

Gradient Coding: Storage vs. Straggler Tolerance

Plot the recovery threshold $K = N - s$ against the per-worker storage $\mu = (s + 1)/N$ for gradient coding. Each operating point trades storage for straggler tolerance: at $\mu = 1/N$ (one partition per worker), $K = N$ (no tolerance); at $\mu = 1$ (full replication), $K = 1$ (any single response works). The linear dependence $K = N(1 - \mu) + 1$ is the gradient-coding analogue of the polynomial-code tradeoff in Chapter 5.
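For concreteness, a few operating points on this curve at $N = 30$ workers (the plot's default parameter), computed straight from the two formulas; a minimal sketch:

```python
N = 30                          # number of workers, as in the plot above
for s in (0, 2, 6, 15, 29):     # a few straggler-tolerance levels
    mu = (s + 1) / N            # per-worker storage load
    K = N - s                   # recovery threshold
    print(f"s = {s:2d}   mu = {mu:.3f}   K = {K:2d}")
```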


Theorem: Optimality ($K \geq N - s$)

Any $(s, N)$-gradient-coding scheme with per-worker storage $|\mathcal{S}_k| = s + 1$ has recovery threshold $K \geq N - s$.

The cyclic Tandon construction matches this bound, so it is information-theoretically optimal at the storage level $\mu = (s + 1)/N$.

At storage $s + 1$, each worker's response is a linear combination of $s + 1$ of the $N$ partial gradients. Any $K$ responders jointly touch at most $(s + 1)K$ of the $N$ unknowns, and every partition must appear in some responder's support (an uncovered partition's gradient could never reach the output), so $(s + 1)K \geq N$, i.e., $K \geq N/(s + 1)$. This counting bound is weaker than the theorem (for $N = 4$, $s = 1$ it gives $K \geq 2$, while the true threshold is $K = 3$); the strict lower bound $K \geq N - s$ requires a more delicate argument in the same spirit.

Key Takeaway

Gradient coding achieves $K = N - s$ at storage load $(s + 1)/N$, which is information-theoretically optimal. Each unit of straggler tolerance costs one extra data partition stored per worker. The cyclic construction is explicit, deterministic, and matches the lower bound: a closed-form story for the coded-aggregation problem of distributed SGD.

Common Mistake: Storage Grows Linearly with Straggler Tolerance

Mistake:

Use $s = N - 1$ in gradient coding (each worker stores all $N$ partitions) for "maximum robustness".

Correction:

With $s = N - 1$ each worker stores the entire dataset and recovery is trivial ($K = 1$), but the storage cost is the full dataset size per worker, defeating the purpose of distribution. In practice, $s = \alpha N$ for $\alpha \in [0.05, 0.20]$ is the sweet spot: tolerate 5–20% stragglers while each worker stores roughly a 5–20% fraction of the dataset. Section 6.3's approximate variant relaxes this storage cost further.

⚠️ Engineering Note

Gradient Coding in Production: A Modest Footprint

Gradient coding has been deployed in some production federated-learning systems (NVIDIA Flare's coded-FL extension, certain Amazon Forecast pipelines), but uptake has been slow because its overhead ($s + 1$ partitions of compute per worker) is paid by every worker every round, even when no stragglers actually appear. Approximate gradient coding (Section 6.3) mitigates this by paying the cost only when stragglers materialize.

The prevailing engineering view is: use plain synchronous SGD with timeouts and re-execution for typical workloads; deploy gradient coding when straggler tail latency is the bottleneck and per-worker storage is plentiful.

Practical Constraints
  • Per-round overhead: $s + 1$ partial gradients vs. $1$ for plain SGD
  • Encoder/decoder cost: $O(N)$ per output coordinate
  • Best-fit deployment: cluster training of $\sim 10^8$-parameter models

📋 Ref: Raviv et al. 2020 IEEE T-IT; NVIDIA Flare CodedFL

Historical Note: Gradient Coding's Birth at ICML 2017

2016–2020

Rashish Tandon, Qi Lei, Alexandros Dimakis, and Nikos Karampatziakis introduced gradient coding at ICML 2017, in direct response to Lee et al.'s coded-matrix-multiplication work. The motivation was distributed training of deep networks: the synchronous-SGD bottleneck was empirically catastrophic on small-to-medium EC2 clusters, and the coded-computing framework provided a turn-key fix. The follow-on literature is large: Halbawi et al. (2018) on Reed-Solomon-based gradient coding, Charles et al. (2017) on approximate variants (Section 6.3), and Raviv et al. (2020) on constructions from cyclic MDS codes and expander graphs. Modern coded-computing work (ByzSecAgg, Chapter 11) builds on these foundations.

Quick Check

An $(s, N)$-gradient-coding scheme tolerates $s$ stragglers out of $N$ workers. What is each worker's per-round computation cost (number of partial gradients computed)?

$1$

$s + 1$

$N - s$

$N$
NN