Approximate Gradient Coding

Why Trade Exactness for Less Storage

Section 6.2's exact gradient coding requires per-worker storage of $s + 1$ partitions to tolerate $s$ stragglers, a linear cost in straggler tolerance. For modern federated learning with $N$ in the thousands, even $s = N/10$ becomes expensive. The natural question is whether approximate aggregation suffices: if the master recovers an estimate $\widehat{\mathbf{G}}$ with $\|\widehat{\mathbf{G}} - \mathbf{G}\|^2 \leq \epsilon \|\mathbf{G}\|^2$ rather than the exact sum, can the per-worker storage be much smaller while preserving convergence?

The answer (Charles, Papailiopoulos, and Ellenberg, 2017) is yes: relaxing the exactness requirement to $\epsilon$-bounded error allows the storage to drop substantially, often sub-linearly in $s$. The price is a small bias in each iteration; SGD's standard convergence analyses show this bias contributes at most $\mathcal{O}(\epsilon)$ to the asymptotic suboptimality. For deep learning, where gradient noise already dominates, the loss is often negligible.

The point is a recurring tradeoff: storage and computation redundancy versus aggregation accuracy. Section 6.3 occupies the corner of this design space favored by communication-bound deployments.

Definition: $\epsilon$-Approximate $(s, N)$-Gradient Coding

An $\epsilon$-approximate $(s, N)$-gradient-coding scheme relaxes the exact decoder requirement of Section 6.2. The decoder produces an estimate $\widehat{\mathbf{G}} = \sum_{k \in \mathcal{T}} \mathbf{D}_{:,k} \tilde{\mathbf{g}}_k$ with the guarantee

$$\mathbb{E}\!\left[\frac{\|\widehat{\mathbf{G}} - \mathbf{G}\|^2}{\|\mathbf{G}\|^2}\right] \leq \epsilon$$

over the randomness of the encoding (and possibly of the straggler set). Setting $\epsilon = 0$ recovers exact gradient coding; $\epsilon > 0$ allows substantially less per-worker storage.

The expectation is over the random encoding matrix $\mathbf{B}$ (and any other randomness in the protocol). For a fixed deterministic encoding, the bound would have to hold worst-case over straggler sets, but the typical formulation uses random encodings.
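As a concrete reading of the guarantee, here is a minimal Python sketch of the relative-error metric inside the expectation (the function name `relative_error` and the use of NumPy are illustrative, not from the text):

```python
import numpy as np

def relative_error(G_hat: np.ndarray, G: np.ndarray) -> float:
    """Relative squared error ||G_hat - G||^2 / ||G||^2 from the definition."""
    return float(np.linalg.norm(G_hat - G) ** 2 / np.linalg.norm(G) ** 2)

# An epsilon-approximate scheme must satisfy E[relative_error(G_hat, G)] <= epsilon,
# with the expectation taken over the random encoding matrix B.
```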

Approximate Gradient Coding

A gradient-coding variant where the master tolerates a small relative error $\epsilon$ in the recovered gradient. It trades exactness for substantially lower per-worker storage: storage can drop to $\Theta(\log N / \epsilon)$ partitions per worker, sub-linear in $N$.

Theorem: Approximate Gradient Coding via Random Sparse Encoding

Let $\mathbf{B} \in \mathbb{R}^{N \times N}$ be a random sparse encoding matrix in which each row has $\rho N$ nonzero entries, their positions chosen uniformly at random and their weights drawn i.i.d. from $\{+1, -1\}$. For any responder set $\mathcal{T}$ with $|\mathcal{T}| = N - s$, the least-squares decoder

$$\widehat{\mathbf{D}} = \arg\min_{\mathbf{D}} \left\| \mathbf{D} \mathbf{B}_{\mathcal{T},:} - \mathbf{1}_N^T \right\|_2^2$$

achieves

$$\mathbb{E}\!\left[\frac{\|\widehat{\mathbf{G}} - \mathbf{G}\|^2}{\|\mathbf{G}\|^2}\right] \leq \mathcal{O}\!\left(\frac{1}{(N - s)\,\rho}\right).$$

To achieve relative error $\epsilon$, set $\rho = \Theta\!\left(\frac{1}{(N - s)\epsilon}\right)$. The per-worker storage is $\rho N = \Theta\!\left(\frac{N}{(N - s)\epsilon}\right) = \Theta\!\left(\frac{1}{(1 - s/N)\epsilon}\right)$, which is sub-linear in $N$ for fixed straggler fraction $s/N$ and tolerance $\epsilon$.

Random sparse encoding is the gradient-coding analogue of random sensing matrices in compressed sensing. The least-squares decoder finds the linear combination of received responses that best approximates the all-ones vector. With enough responders ($N - s$ of them), the random structure ensures this combination is close to the all-ones vector with high probability, giving the $\mathcal{O}(1/[(N - s)\rho])$ bound. Doubling $\rho$ halves the error; doubling $N - s$ also halves the error. The key efficiency: $\rho$ can be sub-constant (e.g., $\rho = \mathcal{O}(1/N)$), making the per-worker storage $\rho N = \mathcal{O}(1)$, independent of $N$.
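A minimal NumPy sketch of this construction follows: it draws a random sparse $\pm 1$ encoding matrix, simulates a responder set, runs the least-squares decoder, and measures the resulting relative error. All parameter values ($N$, $s$, $\rho$, the gradient dimension) are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

N, s, rho = 200, 40, 0.05      # workers, stragglers, row density (assumed values)
d_model = 1000                 # gradient dimension (illustrative)
k = int(rho * N)               # nonzeros per row = per-worker storage in partitions

# Random sparse encoding: each row has k nonzero +/-1 entries in random positions.
B = np.zeros((N, N))
for i in range(N):
    cols = rng.choice(N, size=k, replace=False)
    B[i, cols] = rng.choice([-1.0, 1.0], size=k)

# One gradient per partition, the exact aggregate, and each worker's coded response.
g = rng.standard_normal((N, d_model))   # g[j] = gradient of partition j
G = g.sum(axis=0)                       # exact aggregate the master wants
coded = B @ g                           # coded[i] = worker i's response

# A uniformly random responder set of size N - s.
T = rng.choice(N, size=N - s, replace=False)

# Least-squares decoder: weights w with w @ B[T] as close as possible to all-ones.
w, *_ = np.linalg.lstsq(B[T].T, np.ones(N), rcond=None)
G_hat = w @ coded[T]

err = np.linalg.norm(G_hat - G) ** 2 / np.linalg.norm(G) ** 2
print(f"relative error: {err:.4f}  (theory scale: 1/((N-s)*rho) = {1/((N-s)*rho):.4f})")
```

The hidden constant in the $\mathcal{O}(\cdot)$ bound means the printed error will only match the theoretical scale in order of magnitude.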

Example: Approximate vs. Exact at $N = 100$, $s = 50$

Compare the per-worker storage cost of (i) exact gradient coding for $s = 50$ stragglers out of $N = 100$ workers, and (ii) approximate gradient coding with tolerance $\epsilon = 0.01$.
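A back-of-the-envelope comparison is sketched below, using the theorem's scalings with the hidden $\Theta(\cdot)$ constant set to $1$ (an assumption). At $N = 100$ the constants dominate and the cap at $N$ partitions binds, so the sketch also scales $N$ up at fixed $s/N = 0.5$ to show where the asymptotic advantage appears: exact coding's cost grows linearly with $N$, while the approximate cost stays constant.

```python
import math

def exact_storage(N: int, s: int) -> int:
    # Section 6.2: exact gradient coding stores s + 1 partitions per worker.
    return s + 1

def approx_storage(N: int, s: int, eps: float, c: float = 1.0) -> int:
    # Theorem scaling Theta(1/((1 - s/N) * eps)); c = 1 is an assumed constant.
    # Storage can never usefully exceed all N partitions, hence the cap.
    return min(N, math.ceil(c / ((1 - s / N) * eps)))

for N in (100, 10_000):
    s = N // 2  # fixed straggler fraction s/N = 0.5
    print(f"N={N:>6}: exact={exact_storage(N, s):>5}, "
          f"approx={approx_storage(N, s, eps=0.01):>4} partitions/worker")
```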

Approximate Gradient Coding: Storage vs. Error

Plot the per-worker storage cost as a function of the error tolerance $\epsilon$, for several straggler fractions $s/N$. As $\epsilon \to 0$, the cost approaches the exact gradient-coding cost; for moderate $\epsilon$ ($\geq 0.01$), the cost is dramatically lower. The curve illustrates the rate-accuracy frontier for federated-learning aggregation.

[Interactive plot. Default parameters: $N = 200$ workers, straggler fraction $s/N = 0.2$.]
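A matplotlib sketch that reproduces the described curves under the theorem's $\Theta\!\left(\frac{1}{(1 - s/N)\epsilon}\right)$ scaling, with the hidden constant again taken as $1$ (an assumption) and the approximate cost capped at the exact-coding cost, since exact coding is always available as a fallback; the parameter values mirror the defaults above:

```python
import numpy as np
import matplotlib.pyplot as plt

N = 200                                  # number of workers (default above)
eps = np.logspace(-4, -1, 200)           # error tolerance range

for frac in (0.1, 0.2, 0.4):             # straggler fractions s/N (0.2 is the default)
    s = int(frac * N)
    exact_cost = s + 1                   # exact gradient coding (Section 6.2)
    # Theta(1/((1 - s/N) eps)) with constant 1 (assumption), capped at the
    # exact-coding cost.
    approx_cost = np.minimum(exact_cost, 1.0 / ((1 - frac) * eps))
    plt.loglog(eps, approx_cost, label=f"s/N = {frac}")

plt.xlabel(r"error tolerance $\epsilon$")
plt.ylabel("partitions stored per worker")
plt.title("Approximate gradient coding: storage vs. error")
plt.legend()
plt.show()
```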

Theorem: Convergence of Approximate-Coded SGD on Smooth Strongly-Convex Loss

For an $L$-smooth, $\mu$-strongly-convex loss function $F$ and learning rate $\eta = 1/L$, $\epsilon$-approximate gradient coding satisfies

$$\mathbb{E}[F(\mathbf{w}_T)] - F(\mathbf{w}^*) \leq \left(1 - \frac{\mu}{L}\right)^T \left(F(\mathbf{w}_0) - F(\mathbf{w}^*)\right) + \frac{\epsilon \sigma^2}{\mu},$$

where $\sigma^2$ is the gradient-variance bound. The first term is the standard exponential convergence; the second is an $\mathcal{O}(\epsilon)$ asymptotic floor due to the approximation error.

Approximate aggregation introduces an additional bias/variance term in the SGD update. For convex problems, this term prevents convergence to the exact optimum but bounds the asymptotic error to $\mathcal{O}(\epsilon)$. For deep learning (non-convex, but with strong empirical convergence behavior), the upshot is that a small $\epsilon$ causes negligible degradation. Practical deployments target $\epsilon \in [10^{-3}, 10^{-2}]$ for ImageNet-scale models.
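The floor can be seen in a small simulation. The sketch below assumes a strongly convex quadratic and models $\epsilon$-approximate aggregation as the exact gradient plus zero-mean noise with $\mathbb{E}\|\text{noise}\|^2 = \epsilon \sigma^2$; the dimension, curvature, and $\sigma^2$ values are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# F(w) = 0.5 * w^T A w: L-smooth, mu-strongly convex (mu = 1, L = 10; illustrative).
d = 50
eigs = np.linspace(1.0, 10.0, d)
A = np.diag(eigs)
mu, L = eigs[0], eigs[-1]
sigma2 = 100.0                      # assumed gradient-variance bound sigma^2

def run(eps: float, T: int = 2000) -> float:
    w = 5.0 * np.ones(d)
    for _ in range(T):
        g_exact = A @ w
        # epsilon-approximate aggregation modeled as additive zero-mean noise
        # with E||noise||^2 = eps * sigma^2 (a modeling assumption).
        noise = rng.standard_normal(d) * np.sqrt(eps * sigma2 / d)
        w = w - (1.0 / L) * (g_exact + noise)
    return 0.5 * w @ A @ w          # F(w_T) - F(w*), since w* = 0

for eps in (0.0, 1e-3, 1e-2):
    # Expect near-zero suboptimality for eps = 0 and a floor growing with eps.
    print(f"eps = {eps:g}: asymptotic suboptimality ~ {run(eps):.2e}")
```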

Key Takeaway

Approximate gradient coding trades a small asymptotic convergence floor for substantial storage savings. For $\epsilon \in [10^{-3}, 10^{-2}]$, the per-worker storage drops from $\Theta(s)$ (exact) to $\Theta(\log N / \epsilon)$ (approximate), and SGD convergence is only mildly affected. This is the right operating point for many federated-learning deployments where storage is the binding constraint.

Why This Matters: Approximate Aggregation Naturally Fits AirComp

The $\epsilon$-error tolerance of approximate gradient coding is naturally aligned with the noisy aggregation of analog over-the-air computation (Chapter 16). AirComp produces $\sum_k \mathbf{g}_k + \text{channel noise}$, with a noise variance that depends on transmit power and channel realization. This is exactly an approximate-aggregation primitive: the wireless physical layer "computes" the sum natively but with bounded MSE. Chapter 16 develops the AirComp construction in detail and shows how the FL convergence analysis carries over from this section.

Common Mistake: Convergence Result Is for Strongly-Convex Problems

Mistake:

Applying the $\mathcal{O}(\epsilon)$ asymptotic-floor result blindly to non-convex deep-learning workloads.

Correction:

The convergence guarantee of Theorem 6.3.2 assumes $\mu$-strong convexity, which deep-learning loss functions do not satisfy. Empirically, deep-learning training is often far more robust to gradient noise than convex theory predicts, so approximate gradient coding works well in practice. But the formal convergence guarantee requires strong convexity; in non-convex settings, only empirical evidence supports the approach. Always state the assumption.

🔧 Engineering Note

Approximate Gradient Coding vs. Sparsification

Sparsification (Top-$K$ thresholding) and approximate gradient coding both produce lossy gradients, but at different points in the pipeline:

  • Sparsification lossily compresses the gradients themselves (each user keeps only its largest-magnitude entries). It reduces per-user uplink and is orthogonal to straggler tolerance.
  • Approximate coding lossily approximates the aggregate (each worker computes its assigned partial gradients exactly, but the master aggregates them approximately). It reduces per-worker storage and tolerates stragglers.

They can be combined, as sketched below: each user sparsifies its partial gradients, then the coded scheme aggregates the sparsified versions. Production federated-learning systems sometimes layer both for compounded savings.
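A minimal sketch of the sparsification step that can precede the coded aggregation; the function and the choice of `k` are illustrative:

```python
import numpy as np

def top_k_sparsify(g: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of g, zero the rest (Top-K)."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

# Layering (illustrative): worker i sparsifies each of its partial gradients
# first, then applies its row of the encoding matrix B as in the coded scheme:
#   g_tilde_i = sum_j B[i, j] * top_k_sparsify(g_j, k)
```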

Practical Constraints
  • Sparsification: $\sim 100\times$ uplink reduction at $1\%$ error
  • Approximate coding: $\sim 6\times$ storage reduction at $1\%$ error
  • Combined: hard to characterize formally, but works empirically

Quick Check

Approximate gradient coding with random sparse encoding and error tolerance $\epsilon = 0.01$ requires per-worker storage that scales as:

$\Theta(s)$, the same as exact gradient coding

$\Theta(N/\epsilon)$ at high straggler fraction

$\Theta(\log N / \epsilon)$ for fixed straggler fraction

$\Theta(1/\sqrt{N})$