Distributed SGD as a Coded-Computing Problem

Revisiting the Aggregation Step

Chapter 1 introduced distributed SGD as the canonical FL workload and identified gradient aggregation as the dominant per-round cost. Chapter 5 showed how polynomial codes can distribute matrix multiplication efficiently in the presence of stragglers. This chapter makes the obvious connection: gradient aggregation is also a linear computation, and the same coding ideas apply. The point is that a gradient $\mathbf{g}_k$ is a vector — one column of a "matrix" $\mathbf{G}_{\text{stack}} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_N]$ — and the aggregation $\mathbf{G} = \sum_k \mathbf{g}_k$ is the matrix-vector product $\mathbf{G}_{\text{stack}} \mathbf{1}_N$. Encoding the gradients with the polynomial-code machinery of Chapter 5 gives straggler-resilient aggregation.

The chapter has two parts. Gradient coding (Section 6.2) is the exact-recovery setting: the master reconstructs the full sum $\mathbf{G}$ from any $K$ of $N$ responses. Approximate gradient coding (Section 6.3) relaxes exactness for lower communication: the master recovers $\mathbf{G}$ to within bounded error, with significantly less encoding overhead. Both are CommIT-relevant tools that appear in Part III.

Definition: Coded-Gradient Distributed SGD

A coded-gradient distributed SGD scheme operates over $N$ workers and proceeds as follows at each iteration $t$:

  1. Broadcast. The master broadcasts the current model $\mathbf{w}_t \in \mathbb{R}^d$ to all workers.
  2. Local computation. Each worker $k$ computes a coded gradient $\tilde{\mathbf{g}}_k(\mathbf{w}_t) \in \mathbb{R}^d$ — a linear combination of partial gradients on its assigned data subsets — and sends it to the master.
  3. Master aggregation. The master receives responses from a (possibly random) subset $\mathcal{T} \subseteq [N]$ of size $|\mathcal{T}| = K$, and decodes an estimate $\widehat{\mathbf{G}}_t$ of $\mathbf{G}_t = \sum_k \mathbf{g}_k(\mathbf{w}_t)$ from $\{\tilde{\mathbf{g}}_k\}_{k \in \mathcal{T}}$.
  4. Update. $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \widehat{\mathbf{G}}_t$.

The scheme is exact if $\widehat{\mathbf{G}}_t = \mathbf{G}_t$ for every $\mathcal{T}$ of size $\geq K$. Otherwise, the approximation error is $\|\widehat{\mathbf{G}}_t - \mathbf{G}_t\|^2 / \|\mathbf{G}_t\|^2$, which couples to the convergence rate of SGD.

Both exact and approximate variants share the architecture; the difference is in the encoding matrix and in whether the master tolerates an aggregation error. Section 6.2 develops exact gradient coding; Section 6.3 the approximate variant.
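To make the round structure concrete, here is a minimal Python/NumPy sketch of one coded-gradient iteration. It assumes a deliberately naive full-replication encoding (every worker holds all data partitions, so any single response decodes the exact sum); the data, dimensions, and helper names are illustrative, and the low-redundancy encodings that make this practical are the subject of Section 6.2.

```python
# Minimal sketch of coded-gradient distributed SGD (Definition above).
# Assumptions: synthetic least-squares data, full-replication encoding.
import numpy as np

rng = np.random.default_rng(0)
N, d, n_per = 8, 5, 20                      # workers, model dim, samples per partition
X = [rng.normal(size=(n_per, d)) for _ in range(N)]
y = [rng.normal(size=n_per) for _ in range(N)]

def partial_gradient(w, k):
    """Least-squares gradient on data partition k."""
    return X[k].T @ (X[k] @ w - y[k]) / n_per

def worker_encode(w, k):
    """Coded gradient sent by worker k: here the full sum (maximal redundancy)."""
    return sum(partial_gradient(w, j) for j in range(N))

def master_decode(responses):
    """With full replication, any single response already equals the exact sum."""
    return next(iter(responses.values()))

w = np.zeros(d)
eta = 0.1
for t in range(50):
    # Simulate stragglers: only a random subset T of workers responds in time.
    T = rng.choice(N, size=N - 3, replace=False)
    responses = {k: worker_encode(w, k) for k in T}
    G_hat = master_decode(responses)
    w = w - eta * G_hat                      # update with the decoded aggregate
```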

Aggregation Is a Matrix-Vector Product

Stack the partial gradients into a matrix $\mathbf{G}_{\text{stack}} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_N] \in \mathbb{R}^{d \times N}$. The aggregate is

$$\mathbf{G} = \mathbf{G}_{\text{stack}} \, \mathbf{1}_N \in \mathbb{R}^d.$$

This is exactly a matrix-vector product in which the "data matrix" is $\mathbf{G}_{\text{stack}}$ and the "query" is the all-ones vector $\mathbf{1}_N$. Consequently, any coded matrix-vector multiplication scheme of Chapter 5 (the $q = 1$ special case) directly gives a coded gradient-aggregation scheme. Gradient coding is just polynomial codes specialized to the matrix-vector setting, with one data partition per worker.
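A short NumPy check of this identity, with arbitrary illustrative dimensions:

```python
# Stacking the partial gradients as columns makes the aggregate a
# matrix-vector product with the all-ones query.
import numpy as np

rng = np.random.default_rng(1)
d, N = 6, 4
gradients = [rng.normal(size=d) for _ in range(N)]   # g_1, ..., g_N

G_stack = np.column_stack(gradients)                  # d x N
G = G_stack @ np.ones(N)                              # G_stack * 1_N

assert np.allclose(G, sum(gradients))                 # same as the plain sum
```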

The implication: anything we know about polynomial-code recovery thresholds applies. With $p = N$ "row" partitions of $\mathbf{G}_{\text{stack}}$ (one partial gradient each) and $q = 1$ column partition, the polynomial-code recovery threshold is $K = pq = N$ — i.e., no coding gain at this storage. Coding only helps when each worker stores more than $1/N$ of the data. The gradient-coding redundancy parameter $s$ in Section 6.2 measures exactly this excess storage.

Theorem: Uncoded Distributed SGD: Per-Round Tail Latency

Consider distributed SGD with $N$ workers and i.i.d. exponential local-computation times of rate $\lambda$. The expected per-round wall-clock time of uncoded synchronous aggregation (the master waits for all $N$ responses) is

$$\mathbb{E}[T_{\text{round}}] = \frac{H_N}{\lambda} \sim \frac{\ln N}{\lambda},$$

where $H_n = \sum_{i=1}^{n} 1/i$ is the $n$-th harmonic number. With coded aggregation tolerating $s$ stragglers (the master waits for any $K = N - s$ responses), the per-round time drops to

$$\mathbb{E}[T_{\text{round}}^{\text{coded}}] = \frac{H_N - H_s}{\lambda},$$

a constant in $N$ when $s = \alpha N$ for fixed $\alpha < 1$.

The proof is the same order-statistic argument as in Chapter 1. The coded scheme converts a logarithmic latency penalty into a constant by discarding the slowest $s$ workers each round. The cost is additional per-worker storage and computation — the redundancy factor $s/N$ — which Section 6.2 makes precise.
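A quick Monte Carlo sanity check of the coded expression under the theorem's assumptions (illustrative parameters, not part of the proof):

```python
# The time to collect K = N - s responses out of N i.i.d. Exp(lambda) workers
# is the K-th order statistic, with mean (H_N - H_s) / lambda.
import numpy as np

def H(n):
    """n-th harmonic number, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))

rng = np.random.default_rng(2)
N, s, lam, trials = 100, 20, 1.0, 20000
K = N - s

times = rng.exponential(1.0 / lam, size=(trials, N))
t_coded = np.sort(times, axis=1)[:, K - 1]             # wait for the K-th fastest

print("simulated mean :", t_coded.mean())               # close to 1.59 here
print("(H_N - H_s)/lam:", (H(N) - H(s)) / lam)           # 1.5896...
```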


Example: Speedup at $N = 100$, $s = 20$

A 100-worker distributed-SGD job has i.i.d. exponential local times with $\lambda = 1$. Compute the expected per-round wall-clock time for: (i) uncoded ($K = N = 100$), (ii) coded with $s = 20$ ($K = 80$), (iii) coded with $s = 50$ ($K = 50$).
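Evaluating the theorem's formula for the three cases (taking $H_0 = 0$ so that $s = 0$ recovers the uncoded case):

```python
# Expected per-round times for the example, using E[T] = (H_N - H_s) / lambda.
def H(n):
    return sum(1.0 / i for i in range(1, n + 1))

N, lam = 100, 1.0
for s in (0, 20, 50):                        # (i) uncoded, (ii) K = 80, (iii) K = 50
    t = (H(N) - H(s)) / lam
    print(f"s = {s:2d}  K = {N - s:3d}  E[T_round] ~ {t:.3f}")
# Roughly 5.19, 1.59, and 0.69 time units, respectively.
```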

Figure: Coded SGD Speedup vs. Straggler Tolerance $s$

Expected per-round speedup of coded SGD over uncoded aggregation as a function of the straggler tolerance $s$, for several worker counts $N$. The curve $H_N / (H_N - H_s)$ gives the per-round speedup: as $s$ grows, the speedup increases, but the per-worker storage and computation cost also grows.

Parameters: number of workers $N = 100$; per-worker exponential service rate $\lambda = 1$.
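The data behind such a speedup plot follows directly from the harmonic-number formula; the worker counts below are illustrative choices:

```python
# Per-round speedup H_N / (H_N - H_s) as a function of straggler tolerance s.
def H(n):
    return sum(1.0 / i for i in range(1, n + 1))

for N in (50, 100, 500):
    speedups = [H(N) / (H(N) - H(s)) for s in range(0, N // 2 + 1, N // 10)]
    print(N, [round(x, 2) for x in speedups])
```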

Coded vs. Asynchronous vs. Sparsified SGD

Three competing approaches to the straggler problem deserve naming and comparison up front:

  • Coded synchronous SGD (this chapter): each worker computes a coded gradient; master decodes from any $K$ responses; iteration semantics preserved.
  • Asynchronous SGD: master applies gradients as they arrive without waiting; staleness hurts convergence.
  • Gradient sparsification / quantization: each worker sends only the largest-magnitude components; bias from compression hurts convergence; orthogonal to straggler mitigation.

Each makes a different tradeoff. Coded synchronous SGD keeps the synchronous semantics (best convergence per round) at the cost of per-worker compute redundancy. The other two trade convergence rate for less work / less communication. In production federated learning, all three are sometimes combined.

Straggler Mitigation Strategies

| Strategy | Per-round latency | Convergence cost | Best for |
| --- | --- | --- | --- |
| Plain synchronous | $H_N / \lambda$ (worst) | Optimal (no penalty) | Cluster with negligible stragglers |
| Coded gradient (this chapter) | $(H_N - H_s)/\lambda$ | Optimal — exact aggregation | Synchronous semantics + straggler robustness |
| Approximate gradient coding (§6.3) | Same as coded | Slight bias from approximation | Communication-bound deployments |
| Asynchronous SGD | $\sim 1/(N\lambda)$ per update | Significant for non-convex models | Loose convergence requirements |
| Sparsification / quantization | Same as plain | Mild bias; error feedback often corrects it | Bandwidth-bound (orthogonal to stragglers) |

Common Mistake: Coded Gradient Recovers the Sum, Not Individual $\mathbf{g}_k$

Mistake:

Treating coded gradient aggregation as a substitute for Bonawitz-style secure aggregation, expecting the master not to learn individual gradients.

Correction:

Coded gradient computation is about straggler tolerance, not privacy. The master learns the sum $\mathbf{G} = \sum_k \mathbf{g}_k$, and the polynomial-code construction does not hide individual gradients from a sufficiently large coalition. Privacy requires the secret-sharing additive construction of Chapter 10. The two techniques are orthogonal and frequently combined: ByzSecAgg (Chapter 11) uses both.

⚠️ Engineering Note

When Does Coded Gradient Pay Off?

Coded gradient computation pays off when:

  1. Stragglers are real and persistent — slow GPUs, network jitter, background load. In production deep-learning clusters at FAANG scale, the slowest 5% of workers can be 5–10× slower than the median.
  2. Per-worker storage redundancy is affordable — coded gradient typically requires each worker to compute partial gradients on $1 + s$ data partitions instead of one. Linear cost in straggler tolerance.
  3. Exact aggregation matters β€” for convex problems and theoretically-grounded SGD analyses, exact gradient semantics are needed for the convergence proof. For practical deep learning, approximate aggregation is often acceptable (Section 6.3).

In production, gradient coding is most commonly deployed in distributed training of small-to-medium-scale models (under $10^8$ parameters) where the per-worker compute redundancy is small relative to model size. Larger models tend to use approximate methods or asynchronous aggregation.

Practical Constraints

  • Per-worker storage must accommodate $1 + s$ partitions
  • Decoder complexity grows as $O(K^2)$
  • Synchronization overhead remains — coded SGD is still synchronous

📋 Ref: Raviv et al. 2020, IEEE T-IT; PyTorch CodedDist research fork

Key Takeaway

Coded gradient computation is polynomial codes specialized to gradient aggregation. The master needs to compute one matrix-vector product per round; coding lets it do so with $K$-of-$N$ straggler tolerance. The cost is per-worker storage and compute redundancy; the gain is constant per-round latency at large $N$. Section 6.2 develops the explicit construction.

Quick Check

The aggregate gradient $\mathbf{G} = \sum_k \mathbf{g}_k$ can be written as a matrix-vector product with what query?

  • The all-ones vector $\mathbf{1}_N$
  • A random Gaussian vector
  • An indicator vector $\mathbf{e}_k$
  • The model parameters $\mathbf{w}_t$