Approximate Gradient Coding

Why Trade Exactness for Less Storage

Section 6.2's exact gradient coding requires per-worker storage of $s + 1$ partitions to tolerate $s$ stragglers, a linear cost in straggler tolerance. For modern federated learning with $N$ in the thousands, even $s = N/10$ becomes expensive. The natural question is whether approximate aggregation suffices: if the master recovers an estimate $\widehat{\mathbf{G}}$ with $\|\widehat{\mathbf{G}} - \mathbf{G}\|^2 \leq \epsilon \|\mathbf{G}\|^2$ rather than the exact sum, can the per-worker storage be much smaller while preserving convergence?

The answer (Charles, Papailiopoulos, and Ellenberg, 2017) is yes: relaxing the exactness requirement to $\epsilon$-bounded error allows the storage to drop substantially, often sub-linearly in $s$. The price is a small bias in each iteration; SGD's standard convergence analyses show this bias contributes at most $\mathcal{O}(\epsilon)$ to the asymptotic suboptimality. For deep learning, where gradient noise already dominates, the loss is often negligible.

The point is a recurring tradeoff: storage and computation redundancy versus aggregation accuracy. Section 6.3 occupies the corner of this design space favored by communication-bound deployments.

Definition: $\epsilon$-Approximate $(s, N)$-Gradient Coding

An $\epsilon$-approximate $(s, N)$-gradient-coding scheme relaxes the exact decoder requirement of Section 6.2. The decoder produces an estimate $\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} \mathbf{D}_{:,k} \tilde{\mathbf{g}}_k$ with the guarantee

$$\mathbb{E}\!\left[\frac{\|\widehat{\mathbf{G}} - \mathbf{G}\|^2}{\|\mathbf{G}\|^2}\right] \leq \epsilon$$

over the randomness of the encoding (and possibly of the straggler set). Setting $\epsilon = 0$ recovers exact gradient coding; $\epsilon > 0$ allows substantially less per-worker storage.

The expectation is over the random encoding matrix $\mathbf{B}$ (and any other randomness in the protocol). For a fixed deterministic encoding, the bound would have to hold worst-case over straggler sets, but the typical formulation uses random encodings.
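As a concrete reading of the guarantee, here is a minimal Python sketch of the relative-error metric inside the expectation (the function name `relative_error` and the use of NumPy are illustrative, not from the text):

```python
import numpy as np

def relative_error(G_hat: np.ndarray, G: np.ndarray) -> float:
    """Relative squared error ||G_hat - G||^2 / ||G||^2 from the definition."""
    return float(np.linalg.norm(G_hat - G) ** 2 / np.linalg.norm(G) ** 2)

# An epsilon-approximate scheme must satisfy E[relative_error(G_hat, G)] <= epsilon,
# with the expectation taken over the random encoding matrix B.
```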

Approximate Gradient Coding

A gradient-coding variant where the master tolerates a small relative error $\epsilon$ in the recovered gradient. It trades exactness for substantially lower per-worker storage: storage can drop to $\Theta(\log N / \epsilon)$ partitions per worker, sub-linear in $N$.

Theorem: Approximate Gradient Coding via Random Sparse Encoding

Let $\mathbf{B} \in \mathbb{R}^{N \times N}$ be a random sparse encoding matrix in which each row has $\rho N$ nonzero entries, their positions chosen uniformly at random and their weights drawn i.i.d. from $\{+1, -1\}$. For any responder set $\mathcal{T}$ with $|\mathcal{T}| = N - s$, the least-squares decoder

$$\widehat{\mathbf{D}} = \arg\min_{\mathbf{D}} \left\| \mathbf{D} \mathbf{B}_{\mathcal{T},:} - \mathbf{1}_N^T \right\|_2^2$$

achieves

$$\mathbb{E}\!\left[\frac{\|\widehat{\mathbf{G}} - \mathbf{G}\|^2}{\|\mathbf{G}\|^2}\right] \leq \mathcal{O}\!\left(\frac{1}{(N - s)\,\rho}\right).$$

To achieve relative error $\epsilon$, set $\rho = \Theta\!\left(\frac{1}{(N - s)\epsilon}\right)$. The per-worker storage is $\rho N = \Theta\!\left(\frac{N}{(N - s)\epsilon}\right) = \Theta\!\left(\frac{1}{(1 - s/N)\epsilon}\right)$, which is sub-linear in $N$ for fixed straggler fraction $s/N$ and tolerance $\epsilon$.

Random sparse encoding is the gradient-coding analogue of random sensing matrices in compressed sensing. The least-squares decoder finds the linear combination of received responses that best approximates the all-ones vector. With enough responders ($N - s$ of them), the random structure ensures this combination is close to the all-ones vector with high probability, giving the $\mathcal{O}(1/[(N - s)\rho])$ bound. Doubling $\rho$ halves the error; doubling $N - s$ also halves the error. The key efficiency: $\rho$ can be sub-constant (e.g., $\rho = \mathcal{O}(1/N)$), making the per-worker storage $\rho N = \mathcal{O}(1)$, independent of $N$.
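A minimal NumPy sketch of this construction follows: it draws a random sparse $\pm 1$ encoding matrix, simulates a responder set, runs the least-squares decoder, and measures the resulting relative error. All parameter values ($N$, $s$, $\rho$, the gradient dimension) are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

N, s, rho = 200, 40, 0.05      # workers, stragglers, row density (assumed values)
d_model = 1000                 # gradient dimension (illustrative)
k = int(rho * N)               # nonzeros per row = per-worker storage in partitions

# Random sparse encoding: each row has k nonzero +/-1 entries in random positions.
B = np.zeros((N, N))
for i in range(N):
    cols = rng.choice(N, size=k, replace=False)
    B[i, cols] = rng.choice([-1.0, 1.0], size=k)

# One gradient per partition, the exact aggregate, and each worker's coded response.
g = rng.standard_normal((N, d_model))   # g[j] = gradient of partition j
G = g.sum(axis=0)                       # exact aggregate the master wants
coded = B @ g                           # coded[i] = worker i's response

# A uniformly random responder set of size N - s.
T = rng.choice(N, size=N - s, replace=False)

# Least-squares decoder: weights w with w @ B[T] as close as possible to all-ones.
w, *_ = np.linalg.lstsq(B[T].T, np.ones(N), rcond=None)
G_hat = w @ coded[T]

err = np.linalg.norm(G_hat - G) ** 2 / np.linalg.norm(G) ** 2
print(f"relative error: {err:.4f}  (theory scale: 1/((N-s)*rho) = {1/((N-s)*rho):.4f})")
```

The hidden constant in the $\mathcal{O}(\cdot)$ bound means the printed error will only match the theoretical scale in order of magnitude.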

Example: Approximate vs. Exact at $N = 100$, $s = 50$

Compare the per-worker storage cost of (i) exact gradient coding for $s = 50$ stragglers out of $N = 100$ workers, and (ii) approximate gradient coding with tolerance $\epsilon = 0.01$.
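A back-of-the-envelope comparison is sketched below, using the theorem's scalings with the hidden $\Theta(\cdot)$ constant set to $1$ (an assumption). At $N = 100$ the constants dominate and the cap at $N$ partitions binds, so the sketch also scales $N$ up at fixed $s/N = 0.5$ to show where the asymptotic advantage appears: exact coding's cost grows linearly with $N$, while the approximate cost stays constant.

```python
import math

def exact_storage(N: int, s: int) -> int:
    # Section 6.2: exact gradient coding stores s + 1 partitions per worker.
    return s + 1

def approx_storage(N: int, s: int, eps: float, c: float = 1.0) -> int:
    # Theorem scaling Theta(1/((1 - s/N) * eps)); c = 1 is an assumed constant.
    # Storage can never usefully exceed all N partitions, hence the cap.
    return min(N, math.ceil(c / ((1 - s / N) * eps)))

for N in (100, 10_000):
    s = N // 2  # fixed straggler fraction s/N = 0.5
    print(f"N={N:>6}: exact={exact_storage(N, s):>5}, "
          f"approx={approx_storage(N, s, eps=0.01):>4} partitions/worker")
```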

Approximate Gradient Coding: Storage vs. Error

Plot the per-worker storage cost as a function of the error tolerance $\epsilon$, for several straggler fractions $s/N$. As $\epsilon \to 0$, the cost approaches the exact gradient-coding cost; for moderate $\epsilon$ ($\geq 0.01$), the cost is dramatically lower. The curve illustrates the rate-accuracy frontier for federated-learning aggregation.

[Interactive plot. Default parameters: $N = 200$ workers, straggler fraction $s/N = 0.2$.]
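A matplotlib sketch that reproduces the described curves under the theorem's $\Theta\!\left(\frac{1}{(1 - s/N)\epsilon}\right)$ scaling, with the hidden constant again taken as $1$ (an assumption) and the approximate cost capped at the exact-coding cost, since exact coding is always available as a fallback; the parameter values mirror the defaults above:

```python
import numpy as np
import matplotlib.pyplot as plt

N = 200                                  # number of workers (default above)
eps = np.logspace(-4, -1, 200)           # error tolerance range

for frac in (0.1, 0.2, 0.4):             # straggler fractions s/N (0.2 is the default)
    s = int(frac * N)
    exact_cost = s + 1                   # exact gradient coding (Section 6.2)
    # Theta(1/((1 - s/N) eps)) with constant 1 (assumption), capped at the
    # exact-coding cost.
    approx_cost = np.minimum(exact_cost, 1.0 / ((1 - frac) * eps))
    plt.loglog(eps, approx_cost, label=f"s/N = {frac}")

plt.xlabel(r"error tolerance $\epsilon$")
plt.ylabel("partitions stored per worker")
plt.title("Approximate gradient coding: storage vs. error")
plt.legend()
plt.show()
```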

Theorem: Convergence of Approximate-Coded SGD on Smooth Strongly-Convex Loss

For an $L$-smooth, $\mu$-strongly-convex loss function $F$ and learning rate $\eta = 1/L$, $\epsilon$-approximate gradient coding satisfies

$$\mathbb{E}[F(\mathbf{w}_T)] - F(\mathbf{w}^*) \leq \left(1 - \frac{\mu}{L}\right)^T \left(F(\mathbf{w}_0) - F(\mathbf{w}^*)\right) + \frac{\epsilon \sigma^2}{\mu},$$

where $\sigma^2$ is the gradient-variance bound. The first term is the standard exponential convergence; the second is an $\mathcal{O}(\epsilon)$ asymptotic floor due to the approximation error.

Approximate aggregation introduces an additional bias/variance term in the SGD update. For convex problems, this term prevents convergence to the exact optimum but bounds the asymptotic error to $\mathcal{O}(\epsilon)$. For deep learning (non-convex, but with strong empirical convergence behavior), the upshot is that a small $\epsilon$ causes negligible degradation. Practical deployments target $\epsilon \in [10^{-3}, 10^{-2}]$ for ImageNet-scale models.
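The floor can be seen in a small simulation. The sketch below assumes a strongly convex quadratic and models $\epsilon$-approximate aggregation as the exact gradient plus zero-mean noise with $\mathbb{E}\|\text{noise}\|^2 = \epsilon \sigma^2$; the dimension, curvature, and $\sigma^2$ values are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# F(w) = 0.5 * w^T A w: L-smooth, mu-strongly convex (mu = 1, L = 10; illustrative).
d = 50
eigs = np.linspace(1.0, 10.0, d)
A = np.diag(eigs)
mu, L = eigs[0], eigs[-1]
sigma2 = 100.0                      # assumed gradient-variance bound sigma^2

def run(eps: float, T: int = 2000) -> float:
    w = 5.0 * np.ones(d)
    for _ in range(T):
        g_exact = A @ w
        # epsilon-approximate aggregation modeled as additive zero-mean noise
        # with E||noise||^2 = eps * sigma^2 (a modeling assumption).
        noise = rng.standard_normal(d) * np.sqrt(eps * sigma2 / d)
        w = w - (1.0 / L) * (g_exact + noise)
    return 0.5 * w @ A @ w          # F(w_T) - F(w*), since w* = 0

for eps in (0.0, 1e-3, 1e-2):
    # Expect near-zero suboptimality for eps = 0 and a floor growing with eps.
    print(f"eps = {eps:g}: asymptotic suboptimality ~ {run(eps):.2e}")
```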

Key Takeaway

Approximate gradient coding trades a small asymptotic convergence floor for substantial storage savings. For $\epsilon \in [10^{-3}, 10^{-2}]$, the per-worker storage drops from $\Theta(s)$ (exact) to $\Theta(\log N / \epsilon)$ (approximate), and SGD convergence is only mildly affected. This is the right operating point for many federated-learning deployments where storage is the binding constraint.

Why This Matters: Approximate Aggregation Naturally Fits AirComp

The $\epsilon$-error tolerance of approximate gradient coding is naturally aligned with the noisy aggregation of analog over-the-air computation (Chapter 16). AirComp produces $\sum_k \mathbf{g}_k + \text{channel noise}$, with a noise variance that depends on transmit power and channel realization. This is exactly an approximate-aggregation primitive: the wireless physical layer "computes" the sum natively but with bounded MSE. Chapter 16 develops the AirComp construction in detail and shows how the FL convergence analysis carries over from this section.

Common Mistake: Convergence Result Is for Strongly-Convex Problems

Mistake:

Applying the $\mathcal{O}(\epsilon)$ asymptotic-floor result blindly to non-convex deep-learning workloads.

Correction:

The convergence guarantee of Theorem 6.3.2 assumes $\mu$-strong convexity, which deep-learning loss functions do not satisfy. Empirically, deep-learning training is often far more robust to gradient noise than convex theory predicts, so approximate gradient coding works well in practice. But the formal convergence guarantee requires strong convexity; in non-convex settings, only empirical evidence supports the approach. Always state the assumption.

🔧 Engineering Note

Approximate Gradient Coding vs. Sparsification

Sparsification (Top-$K$ thresholding) and approximate gradient coding both produce lossy gradients, but at different points in the pipeline:

  • Sparsification lossily compresses the gradients themselves (each user keeps only its largest-magnitude entries). It reduces per-user uplink and is orthogonal to straggler tolerance.
  • Approximate coding lossily approximates the aggregate (each worker computes its assigned partial gradients exactly, but the master aggregates them approximately). It reduces per-worker storage and tolerates stragglers.

They can be combined, as sketched below: each user sparsifies its partial gradients, then the coded scheme aggregates the sparsified versions. Production federated-learning systems sometimes layer both for compounded savings.
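A minimal sketch of the sparsification step that can precede the coded aggregation; the function and the choice of `k` are illustrative:

```python
import numpy as np

def top_k_sparsify(g: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of g, zero the rest (Top-K)."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

# Layering (illustrative): worker i sparsifies each of its partial gradients
# first, then applies its row of the encoding matrix B as in the coded scheme:
#   g_tilde_i = sum_j B[i, j] * top_k_sparsify(g_j, k)
```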

Practical Constraints
  • Sparsification: $\sim 100\times$ uplink reduction at $1\%$ error
  • Approximate coding: $\sim 6\times$ storage reduction at $1\%$ error
  • Combined: hard to characterize formally, but works empirically

Quick Check

Approximate gradient coding with random sparse encoding and error tolerance $\epsilon = 0.01$ requires per-worker storage that scales as:

$\Theta(s)$, the same as exact gradient coding

$\Theta(N/\epsilon)$ at high straggler fraction

$\Theta(\log N / \epsilon)$ for fixed straggler fraction

$\Theta(1/\sqrt{N})$