Approximate Gradient Coding
Why Trade Exactness for Less Storage
Section 6.2's exact gradient coding requires per-worker storage of $s+1$ partitions to tolerate $s$ stragglers, a cost that grows linearly in the straggler tolerance. For modern federated learning with the number of workers $n$ in the thousands, even a modest straggler fraction makes this expensive. The natural question is whether approximate aggregation suffices: if the master recovers an estimate $\hat{g}$ with $\mathbb{E}\|\hat{g} - g\|^2 \le \epsilon\,\|g\|^2$ rather than the exact sum $g$, can the per-worker storage be much smaller while preserving convergence?
The answer (Charles, Papailiopoulos, and Ellenberg, 2017) is yes: relaxing the exactness requirement to an $\epsilon$-bounded relative error allows the storage to drop substantially, often to an amount sub-linear in $s$. The price is a small bias in each iteration; SGD's standard convergence analyses show this bias contributes at most $O(\epsilon)$ to the asymptotic suboptimality. For deep learning, where gradient noise already dominates, the loss is often negligible.
The point is the recurring tradeoff: storage / computation redundancy vs. aggregation accuracy. Section 6.3 lives at the corner of the design space favored by communication-bound deployments.
Definition: $\epsilon$-Approximate $s$-Gradient Coding
$\epsilon$-Approximate $s$-Gradient Coding
An $\epsilon$-approximate $s$-gradient-coding scheme relaxes the exact-decoding requirement of Section 6.2. For any straggler set of size at most $s$, the decoder produces an estimate $\hat{g}$ with the guarantee $\mathbb{E}\|\hat{g} - g\|^2 \le \epsilon\,\|g\|^2$ over the randomness of the encoding (and possibly of the straggler set). Setting $\epsilon = 0$ recovers exact gradient coding; $\epsilon > 0$ allows substantially less per-worker storage.
The expectation is over the random encoding matrix $B$ (and any other randomness in the protocol). For a fixed deterministic encoding, the bound must hold worst-case over straggler sets, but the typical formulation uses random encodings.
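A minimal sketch of the quantity the definition asks a scheme to control (the function name is ours, not from the text): the decoder's estimate must keep this relative error at most $\epsilon$ in expectation.

```python
import numpy as np

def relative_agg_error(g_hat: np.ndarray, g: np.ndarray) -> float:
    """Relative squared aggregation error ||g_hat - g||^2 / ||g||^2.

    A scheme is eps-approximate if the expectation of this quantity, over the
    randomness of the encoding (and possibly the straggler set), is at most eps.
    """
    return float(np.sum((g_hat - g) ** 2) / np.sum(g ** 2))
```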
Approximate Gradient Coding
A gradient-coding variant in which the master tolerates a small relative error $\epsilon$ in the recovered gradient. Trades exactness for substantially lower per-worker storage: storage can drop to roughly $1/\big(\epsilon\,(1-\delta)\big)$ partitions per worker for straggler fraction $\delta = s/n$, sub-linear in $s$ and $n$.
Theorem: Approximate Gradient Coding via Random Sparse Encoding
Let $B \in \mathbb{R}^{n \times k}$ be a random sparse encoding matrix in which each row has $d$ nonzero entries chosen uniformly at random and assigned i.i.d. weights from a common distribution. For any responder set $R$ with $|R| \ge (1-\delta)\,n$, the least-squares decoder achieves $\mathbb{E}\,\|\hat{g} - g\|^2 \le c\,\tfrac{k}{d\,|R|}\,\|g\|^2$ for an absolute constant $c$. To achieve relative error $\epsilon$, set $d = \Theta\!\big(\tfrac{k}{\epsilon\,(1-\delta)\,n}\big)$, i.e. $d = \Theta\!\big(\tfrac{1}{\epsilon(1-\delta)}\big)$ when $k = n$. The per-worker storage is therefore $O\!\big(\tfrac{1}{\epsilon(1-\delta)}\big)$ partitions, sub-linear in $n$ and $s$ for fixed straggler fraction $\delta = s/n$ and tolerance $\epsilon$.
Random sparse encoding is the gradient-coding analogue of random sensing matrices in compressed sensing. The least-squares decoder finds the best linear combination of received responses approximating the all-ones vector $\mathbf{1}_k$. With enough responders ($|R| \ge (1-\delta)n$ of them), the random structure ensures the decoder's fitted combination is close to the all-ones vector with high probability, giving the bound. Doubling $d$ halves the error; doubling $|R|$ also halves the error. The key efficiency: the row density $d/k$ can be sub-constant, making the per-worker storage independent of $s$.
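The following sketch illustrates the construction under stated assumptions: unit weights on a random support (a Bernoulli-style code) rather than a particular weight distribution, and toy sizes chosen for readability. It encodes $k$ partial gradients with a random sparse matrix, drops half the workers, and decodes by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_encoding_matrix(n_workers, k_parts, d, rng):
    """Random sparse encoding: each row has d nonzero entries at uniformly
    random positions. Unit weights (a Bernoulli-style code) are an
    illustrative assumption."""
    B = np.zeros((n_workers, k_parts))
    for i in range(n_workers):
        B[i, rng.choice(k_parts, size=d, replace=False)] = 1.0
    return B

def least_squares_decode(B, responders, coded, k_parts):
    """Find v minimizing ||B_R^T v - 1||_2 and return sum_i v_i * c_i."""
    v, *_ = np.linalg.lstsq(B[responders].T, np.ones(k_parts), rcond=None)
    return v @ coded[responders]

# Toy end-to-end run (all sizes are illustrative, not from the text).
n, k, d, dim = 100, 100, 20, 50
G = rng.standard_normal((k, dim))        # partial gradients g_1, ..., g_k
g = G.sum(axis=0)                        # exact aggregate

B = sparse_encoding_matrix(n, k, d, rng)
coded = B @ G                            # worker i transmits sum_j B[i, j] * g_j

responders = rng.choice(n, size=50, replace=False)   # half the workers respond
g_hat = least_squares_decode(B, responders, coded, k)
print("relative error:", np.sum((g_hat - g) ** 2) / np.sum(g ** 2))
```

With these toy numbers $k/(d\,|R|) = 0.1$, so the printed relative error should land around that order of magnitude.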
Setup
The decoder receives the coded partial sums $c_i = \sum_{j=1}^{k} B_{ij}\, g_j$ from the responder set $R$ and solves $v^\star = \arg\min_{v} \|B_R^{\top} v - \mathbf{1}_k\|_2$, where $B_R$ is the submatrix of $B$ with rows in $R$. The fitted aggregate is $\hat{g} = \sum_{i \in R} v_i^\star c_i = G^{\top} B_R^{\top} v^\star$ (with $G$ the matrix whose rows are the partial gradients $g_j$), and the error is $\|\hat{g} - g\| = \|G^{\top}(B_R^{\top} v^\star - \mathbf{1}_k)\| \le \|G\|_{\mathrm{op}}\, \|B_R^{\top} v^\star - \mathbf{1}_k\|$.
Random matrix concentration
For a random sparse $B_R$ with row density $d/k$, the Bai-Yin and Marchenko-Pastur theorems on the singular-value distribution of random matrices give, with high probability over the random encoding, a least-squares residual $\min_v \|B_R^{\top} v - \mathbf{1}_k\|^2 = O\!\big(k^2/(d\,|R|)\big)$.
Decoder error bound
Plugging the concentration bound into the Cauchy–Schwarz step $\|\hat{g} - g\| \le \|G\|_{\mathrm{op}}\,\|B_R^{\top} v^\star - \mathbf{1}_k\|$ and normalizing by $\|g\|^2$ (assuming well-conditioned gradients, i.e. $\|G\|_{\mathrm{op}}^2 \approx \|g\|^2/k$), the relative error is at most on the order of $k/(d\,|R|)$, as claimed. The full proof involves controlling the smallest singular value of the random sparse submatrix $B_R$; see Charles et al. 2017 for details.
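As a rough numerical check of the qualitative scaling claimed above (unit weights and toy sizes are our assumptions), the least-squares residual should shrink roughly in proportion to $1/(d\,|R|)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_ls_residual(n, k, d, n_resp, trials=100):
    """Average least-squares residual ||B_R^T v* - 1||^2 over random sparse
    encodings with unit weights (an illustrative choice)."""
    total = 0.0
    for _ in range(trials):
        B = np.zeros((n, k))
        for i in range(n):
            B[i, rng.choice(k, size=d, replace=False)] = 1.0
        R = rng.choice(n, size=n_resp, replace=False)
        v, *_ = np.linalg.lstsq(B[R].T, np.ones(k), rcond=None)
        total += np.sum((B[R].T @ v - 1.0) ** 2)
    return total / trials

# Doubling d, or doubling the number of responders, should roughly halve the residual.
print(mean_ls_residual(200, 200, d=10, n_resp=50))
print(mean_ls_residual(200, 200, d=20, n_resp=50))
print(mean_ls_residual(200, 200, d=10, n_resp=100))
```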
Example: Approximate vs. Exact at a 50% Straggler Fraction
Compare the per-worker storage cost of (i) exact gradient coding for $s = 50$ stragglers out of $n = 100$ workers ($k = n = 100$ partitions), and (ii) approximate gradient coding with a moderate tolerance, say $\epsilon = 0.1$.
Exact
$s + 1 = 51$ partitions per worker: each worker stores 51% of the dataset. Per-worker compute: 51 partial gradients per round.
Approximate
Per-worker storage is $d \approx \tfrac{1}{\epsilon\,(1-\delta)}$ partitions, where $d/k$ is the row density (fraction of row entries that are nonzero). With $\delta = s/n = 0.5$ this is $d \approx 2/\epsilon$; at $\epsilon = 0.1$, roughly 20 partitions per worker, substantially less than the 51 required by the exact scheme.
Lesson
At a 10% tolerance, the approximate scheme uses roughly $2.5\times$ less per-worker storage than the exact scheme (about 20 vs. 51 partitions). This is the core advantage: communication-bound deployments can sacrifice a tiny amount of aggregation accuracy for significant storage savings.
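A small sketch of the storage arithmetic (the constant in the approximate formula is an assumption set to 1 for illustration); it also shows that for very small $\epsilon$ the approximate scheme can exceed the exact cost, so the advantage is specific to moderate tolerances.

```python
import math

def exact_storage(s: int) -> int:
    """Exact gradient coding: s + 1 partitions per worker."""
    return s + 1

def approx_storage(eps: float, delta: float, c: float = 1.0) -> int:
    """Approximate coding: roughly c / (eps * (1 - delta)) partitions per worker.
    The constant c is scheme-dependent; c = 1 is an illustrative assumption."""
    return math.ceil(c / (eps * (1.0 - delta)))

n, s = 100, 50                                   # the example's 50% straggler fraction
for eps in (0.01, 0.05, 0.10, 0.20):
    print(f"eps={eps:.2f}: exact={exact_storage(s)}, approx={approx_storage(eps, s / n)}")
```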
Approximate Gradient Coding: Storage vs. Error
Plot the per-worker storage cost as a function of the error tolerance $\epsilon$, for several straggler fractions $\delta$. As $\epsilon \to 0$, the cost approaches the exact gradient-coding cost; for moderate $\epsilon$ (e.g., $\epsilon \approx 0.1$), the cost is dramatically lower. The curve illustrates the rate-accuracy frontier for federated-learning aggregation.
Parameters: number of workers $n$; fraction of stragglers tolerated $\delta = s/n$.
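A short script along these lines (matplotlib assumed; the constant 1 in the storage formula is illustrative) reproduces the described curves:

```python
import numpy as np
import matplotlib.pyplot as plt

eps = np.linspace(0.01, 0.3, 200)                 # error tolerance
n = 100                                           # number of workers (illustrative)

plt.figure()
for delta in (0.1, 0.3, 0.5):                     # straggler fractions
    plt.plot(eps, 1.0 / (eps * (1.0 - delta)), label=f"approximate, delta={delta}")
    plt.axhline(delta * n + 1, linestyle="--", alpha=0.4)   # exact cost s + 1 = delta*n + 1
plt.xlabel("error tolerance eps")
plt.ylabel("partitions per worker")
plt.title("Approximate gradient coding: storage vs. error")
plt.legend()
plt.show()
```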
Theorem: Convergence of Approximate-Coded SGD on Smooth Strongly-Convex Loss
For an $L$-smooth, $\mu$-strongly-convex loss $F$ and learning rate $\eta \le 1/L$, $\epsilon$-approximate gradient coding satisfies $\mathbb{E}\,\|w_T - w^\star\|^2 \le (1 - \eta\mu)^T\,\|w_0 - w^\star\|^2 + \tfrac{\eta\,(\sigma^2 + \epsilon\,G^2)}{\mu}$, where $\sigma^2$ is the gradient variance bound and $G^2$ bounds the gradient norms along the trajectory. The first term is the standard exponential convergence; the second is an asymptotic floor due to the approximation error.
Approximate aggregation introduces an additional bias/variance term in the SGD update. For convex problems, this bias prevents convergence to the exact optimum but bounds the asymptotic error by a floor of order $\eta\,(\sigma^2 + \epsilon G^2)/\mu$. For deep learning (non-convex, but with strong empirical convergence behavior), the result is essentially that a small $\epsilon$ causes negligible degradation. Practical deployments target small $\epsilon$ for ImageNet-scale models.
Standard SGD inequality
$\|w_{t+1} - w^\star\|^2 = \|w_t - w^\star\|^2 - 2\eta\,\langle \hat{g}_t,\, w_t - w^\star\rangle + \eta^2 \|\hat{g}_t\|^2$. With approximate decoding, $\hat{g}_t = \nabla F(w_t) + e_t$ with $\mathbb{E}\|e_t\|^2 \le \epsilon\,\|\nabla F(w_t)\|^2 + \sigma^2 \le \epsilon\,G^2 + \sigma^2$.
Strong convexity
$\mu$-strong convexity gives $\langle \nabla F(w_t),\, w_t - w^\star\rangle \ge \mu\,\|w_t - w^\star\|^2$ (using $\nabla F(w^\star) = 0$). Combining with the SGD inequality, taking expectations, and absorbing the $\|\nabla F(w_t)\|^2$ term via smoothness and $\eta \le 1/L$ yields the recursion $\mathbb{E}\|w_{t+1} - w^\star\|^2 \le (1 - \eta\mu)\,\mathbb{E}\|w_t - w^\star\|^2 + \eta^2\,(\sigma^2 + \epsilon G^2)$.
Telescope
Summing the geometric series gives the asymptotic floor $\tfrac{\eta\,(\sigma^2 + \epsilon G^2)}{\mu}$ (the multiplicative factor $\eta^2$ becomes $\eta^2 \cdot \tfrac{1}{\eta\mu} = \tfrac{\eta}{\mu}$ after summation).
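A toy simulation of the floor (our construction, not from the text): SGD on a simple strongly convex quadratic with an additive aggregation error calibrated to $\epsilon G^2 + \sigma^2$, matching the recursion above. The final squared distance to the optimum should grow with $\epsilon$ roughly like the predicted floor.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_approx_sgd(eps, sigma=0.1, T=3000, dim=20, eta=0.1):
    """SGD on F(w) = 0.5 * ||w||^2 (mu = L = 1) with an approximate aggregate
    g_hat = grad + e, where E||e||^2 = eps * G^2 + sigma^2 and G bounds the
    gradient norms along the trajectory (here G = ||grad F(w_0)||)."""
    w = np.ones(dim)
    G2 = float(np.sum(w ** 2))              # ||grad F(w_0)||^2 for this quadratic
    for _ in range(T):
        grad = w                            # exact gradient of 0.5 * ||w||^2
        e = rng.standard_normal(dim) * np.sqrt((eps * G2 + sigma ** 2) / dim)
        w = w - eta * (grad + e)
    return float(np.sum(w ** 2))            # squared distance to the optimum w* = 0

# Predicted floor is roughly eta * (sigma^2 + eps * G^2) / mu.
for eps in (0.0, 0.05, 0.2):
    print(f"eps={eps}: ||w_T - w*||^2 ~ {run_approx_sgd(eps):.4f}")
```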
Key Takeaway
Approximate gradient coding trades a small asymptotic convergence floor for substantial storage savings. For a fixed straggler fraction $\delta$ and a moderate tolerance $\epsilon$, the per-worker storage drops from $s + 1$ partitions (exact) to roughly $1/\big(\epsilon(1-\delta)\big)$ partitions (approximate), and SGD convergence is only mildly affected. This is the right operating point for many federated-learning deployments where storage is the binding constraint.
Why This Matters: Approximate Aggregation Naturally Fits AirComp
The $\epsilon$-error tolerance of approximate gradient coding is naturally aligned with the noisy aggregation of analog over-the-air computation (Chapter 16). AirComp produces $\hat{g} = \sum_i g_i + z$, with a noise variance that depends on transmit power and channel realization. This is exactly an approximate-aggregation primitive: the wireless physical layer "computes" the sum natively but with bounded MSE. Chapter 16 develops the AirComp construction in detail and shows how the FL convergence analysis carries over from this section.
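A minimal model of that alignment (perfect power control and Gaussian receiver noise are our simplifying assumptions): over-the-air superposition returns the sum plus noise, and its measured relative error plays the role of $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(3)

def aircomp_aggregate(gradients: np.ndarray, noise_std: float) -> np.ndarray:
    """All users transmit simultaneously; the channel superimposes the signals
    and the receiver observes the sum plus additive Gaussian noise."""
    g_sum = gradients.sum(axis=0)
    return g_sum + noise_std * rng.standard_normal(g_sum.shape)

grads = rng.standard_normal((50, 1000))           # 50 users, 1000-dim gradients
g = grads.sum(axis=0)
g_hat = aircomp_aggregate(grads, noise_std=1.0)
print("effective epsilon:", np.sum((g_hat - g) ** 2) / np.sum(g ** 2))
```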
Common Mistake: Convergence Result Is for Strongly-Convex Problems
Mistake:
Applying the asymptotic floor result blindly to non-convex deep-learning workloads.
Correction:
The Theorem 6.3.2 convergence guarantee assumes $\mu$-strong convexity, which deep-learning loss functions do not satisfy. Empirically, deep-learning training is often robust to gradient noise (much more so than convex theory predicts), so approximate gradient coding works well in practice. But the formal convergence guarantee requires convexity; for non-convex settings, only empirical evidence supports the approach. Always state the assumption.
Approximate Gradient Coding vs. Sparsification
Sparsification (Top-$k$ thresholding) and approximate gradient coding both produce lossy gradients, but at different points in the pipeline:
- Sparsification lossily compresses the gradients themselves (each user keeps only its $k$ largest-magnitude entries). Reduces per-user uplink. Orthogonal to stragglers.
- Approximate coding lossily approximates the aggregate (each worker still computes full partial gradients but the master aggregates approximately). Reduces per-worker storage and tolerates stragglers.
They can be combined: each user sparsifies, then the coded scheme aggregates the sparsified gradients (a sketch follows the list below). Production federated-learning systems (Apple's PrivateFL, NVIDIA Flare CodedFL) sometimes layer both for compounded savings.
- Sparsification: uplink reduction at a small, controllable per-user gradient error
- Approximate coding: per-worker storage reduction at a small aggregate error $\epsilon$
- Combined: hard to characterize formally but works empirically
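A sketch of layering the two (the Top-$k$ helper and the additive-noise model of the coded-aggregation error are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def top_k(g: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries of g (Top-k sparsification)."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

grads = rng.standard_normal((20, 1000))                          # 20 users
g = grads.sum(axis=0)                                            # true aggregate

sparse_sum = np.sum([top_k(gi, 100) for gi in grads], axis=0)    # 10x uplink reduction
eps = 0.05                                                       # coded-aggregation tolerance
noise = rng.standard_normal(1000) * np.sqrt(eps * np.sum(sparse_sum ** 2) / 1000)
g_hat = sparse_sum + noise                                       # approximate aggregate

print("combined relative error:", np.sum((g_hat - g) ** 2) / np.sum(g ** 2))
```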
Quick Check
Approximate gradient coding with random sparse encoding and error tolerance $\epsilon$ requires per-worker storage that scales as:
$\Theta(s)$: the same as exact gradient coding
$\Theta(n)$ at high straggler fraction
$O\!\big(1/(\epsilon(1-\delta))\big)$ for fixed straggler fraction $\delta$
Correct: sub-linear in $n$ for fixed $\delta$ and $\epsilon$, the central efficiency advantage over exact coding.