Exercises
ex-ch06-01
Easy. For an $(n, s) = (50, 5)$ gradient-coding scheme, what are the per-worker storage, the recovery threshold, and the straggler tolerance?
Per-worker storage: $s + 1$ partitions; threshold: $n - s$.
Compute
Per-worker storage: $s + 1 = 6$ partitions. Recovery threshold: $n - s = 45$. Straggler tolerance: $s = 5$ (master ignores the slowest 5 of 50 workers).
ex-ch06-02
Easy. The aggregate gradient is a matrix-vector product. Identify the matrix and the vector.
Stack the partial gradients $g_i$ column-wise.
Identification
$g = \sum_{i=1}^{N} g_i = \mathbf{G}\,\mathbf{1}_N$, where $\mathbf{G} = [\,g_1 \;\cdots\; g_N\,]$ is the matrix of stacked partial gradients and $\mathbf{1}_N$ is the all-ones query vector.
Implication
Coded gradient computation is the specialization of the polynomial codes of Chapter 5 to the fixed all-ones query vector $\mathbf{1}_N$.
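A quick numerical sanity check of this identification; the toy dimensions below are assumed for illustration only:

```python
import numpy as np

# Stack N partial gradients as the columns of G; the aggregate gradient is the
# matrix-vector product G @ 1_N. Dimensions are illustrative assumptions.
N, d = 8, 5                                   # partitions, model dimension
rng = np.random.default_rng(0)
G = rng.standard_normal((d, N))               # column i holds partial gradient g_i
aggregate = G @ np.ones(N)                    # matrix-vector product with 1_N
assert np.allclose(aggregate, G.sum(axis=1))  # equals the plain sum of the g_i
```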
ex-ch06-03
Easy. For exact gradient coding with $n$ workers and i.i.d. exponential local compute times with rate $\lambda$, find the per-round latency speedup of $s$ stragglers tolerated vs. $s = 0$.
Speedup $= H_n / (H_n - H_s)$, where $H_m$ is the $m$-th harmonic number.
Compute harmonic numbers
Uncoded ($s = 0$): the master waits for the slowest of $n$ workers, with expected time $H_n/\lambda$; with tolerance $s$: it waits for the $(n-s)$-th order statistic, with expected time $(H_n - H_s)/\lambda$; here $H_m = \sum_{j=1}^{m} 1/j$.
Speedup
$\text{Speedup} = \dfrac{\mathbb{E}[T_{\text{uncoded}}]}{\mathbb{E}[T_{\text{coded}}]} = \dfrac{H_n}{H_n - H_s} > 1$.
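A short numerical check of this formula; the values $n = 50$, $s = 5$ are assumed for illustration (they match ex-ch06-01), not fixed by this exercise:

```python
import numpy as np

# Compute the expected per-round latency with and without straggler tolerance,
# in units of 1/lambda, and the resulting speedup.
def harmonic(m: int) -> float:
    """H_m = 1 + 1/2 + ... + 1/m."""
    return float(np.sum(1.0 / np.arange(1, m + 1)))

n, s = 50, 5                               # assumed parameters
latency_uncoded = harmonic(n)              # E[slowest of n]
latency_coded = harmonic(n) - harmonic(s)  # E[(n - s)-th order statistic]
print(f"speedup = {latency_uncoded / latency_coded:.2f}x")  # roughly 2x here
```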
ex-ch06-04
Easy. State the difference between exact gradient coding and approximate gradient coding in one sentence each.
Exact
Exact gradient coding (Tandon et al.) recovers the aggregate gradient $g$ exactly from any $n - s$ responses, at a per-worker storage of $s + 1$ partitions.
Approximate
Approximate gradient coding (Charles et al.) recovers an estimate $\hat{g}$ with $\|\hat{g} - g\| \le \epsilon \|g\|$, at a per-worker storage sub-linear in $s$.
ex-ch06-05
Medium. Design an $(n = 6, s = 2)$ gradient-coding scheme. Specify the storage assignment for each worker and write out an encoding matrix that satisfies the cyclic construction.
Use the cyclic support pattern: worker $i$ covers partitions $\{i, i+1, \ldots, i+s\} \pmod{n}$.
Storage assignment
$W_1: \{1, 2, 3\}$, $W_2: \{2, 3, 4\}$, $W_3: \{3, 4, 5\}$, $W_4: \{4, 5, 6\}$, $W_5: \{5, 6, 1\}$, $W_6: \{6, 1, 2\}$. Each worker stores $s + 1 = 3$ consecutive partitions.
Encoding matrix (one valid choice)
Row $i$ of $\mathbf{B}$ has support exactly on $\{i, i+1, i+2\} \pmod{6}$; the nonzero coefficients are chosen by the cyclic construction (Tandon et al., Algorithm 1) so that every set of $n - s = 4$ rows spans $\mathbf{1}_6$ (a numerically constructed instance is sketched below).
Recovery threshold
$n - s = 4$. The master can tolerate any $s = 2$ stragglers out of $n = 6$ workers.
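Since the explicit coefficients are not reproduced here, the following sketch constructs one valid cyclic encoding matrix numerically, following Algorithm 1 of Tandon et al., and checks the span condition for every responder set; the parameters $n = 6$, $s = 2$ match the assumption above:

```python
import itertools
import numpy as np

# Sketch of the cyclic gradient-coding construction (Tandon et al., Alg. 1),
# with assumed parameters n = 6, s = 2. Partitions are 0-indexed here.
n, s = 6, 2
rng = np.random.default_rng(0)

# Random s x n matrix H whose rows sum to zero, so that H @ 1 = 0.
H = rng.standard_normal((s, n))
H[:, -1] = -H[:, :-1].sum(axis=1)

# Row i of B is supported on the s+1 cyclically consecutive partitions
# {i, ..., i+s} (mod n) and lies in the null space of H.
B = np.zeros((n, n))
for i in range(n):
    idx = [(i + j) % n for j in range(s + 1)]
    B[i, idx[0]] = 1.0
    B[i, idx[1:]] = np.linalg.solve(H[:, idx[1:]], -H[:, idx[0]])

# Span condition: for every responder set T of size n - s, the all-ones
# vector lies in the row span of B restricted to T.
for T in itertools.combinations(range(n), n - s):
    a, *_ = np.linalg.lstsq(B[list(T)].T, np.ones(n), rcond=None)
    assert np.allclose(B[list(T)].T @ a, np.ones(n)), T
print("every responder set of size n - s can reconstruct the aggregate")
```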
ex-ch06-06
Medium. Prove that the uncoded SGD scheme (no redundancy) corresponds to $s = 0$ in gradient coding, and verify that the recovery threshold is correct.
$s = 0$ means each worker stores only its own partition.
$s = 0$ case
Storage assignment: $s + 1 = 1$ partition; each worker stores its own partition only. Encoding matrix: $\mathbf{B} = \mathbf{I}_n$ (the identity).
Verification
Recovery threshold: $n - s = n$. The master needs all $n$ responses (one per partial gradient). This is exactly uncoded SGD: every worker's response is needed for the full sum. Straggler tolerance: $s = 0$.
Implication
$s = 0$ recovers uncoded SGD as a degenerate case of gradient coding. The framework is strictly more general: increasing $s$ trades storage for straggler tolerance.
ex-ch06-07
Medium. Compute the per-worker storage for approximate gradient coding with error tolerance $\epsilon$, $n$ workers, and straggler tolerance $s$.
The density $p$ of the random encoding matrix determines the per-worker storage.
Density
From the theorem of Β§6.3, the density $p$ of the encoding matrix is set by the target accuracy $\epsilon$ and the straggler tolerance $s$; the per-worker storage is then $p$ times the number of partitions, i.e., the expected row support of the encoding matrix.
Comparison with exact
Exact gradient coding for the same $s$ requires $s + 1$ partitions per worker. Approximate coding at this $\epsilon$ requires comparable storage, but with a smooth tradeoff: a tighter $\epsilon$ requires more storage, a looser $\epsilon$ much less.
Lesson
At loose error tolerances (large $\epsilon$), approximate coding beats exact coding by a substantial factor in storage. For tight tolerances (small $\epsilon$), the gap narrows.
ex-ch06-08
Medium. Derive (informally) the convergence-rate result for approximate gradient coding on a smooth, strongly convex objective. Identify the asymptotic floor.
Standard SGD inequality + bounded approximation error.
Setup
$f$ is $L$-smooth and $\mu$-strongly convex; the decoded update direction $\hat{g}_t$ deviates from the true gradient $\nabla f(x_t)$ by a bounded approximation error controlled by $\epsilon$.
SGD inequality
Standard analysis (Bubeck 2015, Β§6.2) gives a bound of the form $\mathbb{E}[f(x_T)] - f^\star \le (1 - \mu/L)^T \bigl(f(x_0) - f^\star\bigr) + O(\epsilon^2)$, where the additional $O(\epsilon^2)$ term arises from the approximation noise.
Asymptotic floor
As $T \to \infty$, the first term vanishes and the $O(\epsilon^2)$ floor remains. For small $\epsilon$, the floor is small and often acceptable in practice.
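A minimal one-step sketch of where the floor comes from, assuming step size $\eta = 1/L$ and writing the decoding error as $e_t = \hat{g}_t - \nabla f(x_t)$ with a uniform bound $\|e_t\| \le \epsilon G$ (the symbols $e_t$ and $G$ are introduced only for this sketch): smoothness gives
$$f(x_{t+1}) \;\le\; f(x_t) - \tfrac{1}{L}\langle \nabla f(x_t), \hat{g}_t\rangle + \tfrac{1}{2L}\|\hat{g}_t\|^2 \;=\; f(x_t) - \tfrac{1}{2L}\|\nabla f(x_t)\|^2 + \tfrac{1}{2L}\|e_t\|^2,$$
and strong convexity gives $\|\nabla f(x_t)\|^2 \ge 2\mu\bigl(f(x_t) - f^\star\bigr)$, so unrolling yields
$$f(x_T) - f^\star \;\le\; \bigl(1 - \tfrac{\mu}{L}\bigr)^T \bigl(f(x_0) - f^\star\bigr) + \tfrac{\epsilon^2 G^2}{2\mu},$$
i.e., geometric decay plus an $O(\epsilon^2)$ floor.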
ex-ch06-09
Medium. Explain why coded gradient computation does not provide privacy against the master, and how it can be combined with Bonawitz-style secure aggregation to provide both.
Coded gradient just makes aggregation tolerate stragglers.
Why no privacy
The master decodes the exact sum from the coded responses. This is a deterministic function of the responses; no random masking is involved. The master could in principle invert the system to learn individual partial gradients, or at least linear combinations of them beyond the sum, since nothing random hides them.
Composition with SecAgg
Each user computes their coded gradient (Section 6.2), then adds pairwise random masks before sending. The masks cancel in the aggregate, so the master sees the aggregate gradient but not the individual coded responses, let alone the underlying partial gradients. The composition is well-defined because both operations are linear.
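A minimal sketch of this composition, with an assumed user count and dimension, and a seeded NumPy generator standing in for the shared PRG:

```python
import numpy as np

# Users upload masked coded gradients; the pairwise masks cancel in the
# server-side sum. All parameters here are illustrative assumptions.
n_users, d = 5, 8
rng = np.random.default_rng(0)
coded = rng.standard_normal((n_users, d))    # per-user coded gradients

def pairwise_mask(i: int, j: int) -> np.ndarray:
    """Mask expanded from the seed users i and j agreed on (i < j)."""
    return np.random.default_rng(seed=1000 * i + j).standard_normal(d)

uploads = []
for i in range(n_users):
    masked = coded[i].copy()
    for j in range(n_users):
        if i < j:
            masked += pairwise_mask(i, j)    # user i adds the shared mask ...
        elif j < i:
            masked -= pairwise_mask(j, i)    # ... and user j subtracts it
    uploads.append(masked)

# The server only ever sees `uploads`; their sum equals the unmasked sum.
assert np.allclose(np.sum(uploads, axis=0), coded.sum(axis=0))
```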
Cost
Composition is additive: per-user storage grows by $O(n)$ for the mask seeds; per-round communication grows by $O(n)$ for the pairwise mask exchange. The recovery threshold is unchanged ($n - s$).
ex-ch06-10
Medium. Compare gradient sparsification (top-$k$) and approximate gradient coding in terms of (i) what is approximated, (ii) where in the pipeline the approximation happens, and (iii) when each pays off.
What is approximated
Sparsification: each user's individual gradient is approximated by zeroing out small-magnitude entries. Approximate coding: the aggregate is approximated; individual gradients are still computed exactly.
Where in pipeline
Sparsification: per-user, before aggregation. Approximate coding: at the master, after all responses arrive.
When each pays off
Sparsification: when uplink bandwidth is the bottleneck (mobile FL, narrowband wireless). Approximate coding: when straggler tolerance and per-worker storage are the bottlenecks (datacenter clusters with heterogeneous compute).
Combination
The two are orthogonal and can be combined: each user sparsifies and the aggregation is done with approximate coding. Production systems sometimes layer both for compounded savings.
ex-ch06-11
Hard. Prove that the cyclic gradient-coding construction's encoding matrix $\mathbf{B}$ has the property: for every responder set $\mathcal{T}$ of size $n - s$, the linear system $\mathbf{B}_{\mathcal{T},:}^{\top}\,\mathbf{a} = \mathbf{1}_N$ has a solution $\mathbf{a}$.
Use the cyclic structure: $\mathbf{B}_{\mathcal{T},:}$ has rank $n - s$.
$\mathbf{1}_N$ is in the row span of $\mathbf{B}_{\mathcal{T},:}$.
Rank of $\mathbf{B}_{\mathcal{T}, :}$
The cyclic structure of $\mathbf{B}$ ensures that any $(n - s) \times N$ row submatrix $\mathbf{B}_{\mathcal{T},:}$ has rank $n - s$ (this is a standard property of the equiangular cyclic encoding; see Tandon et al., Lemma 1).
$\mathbf{1}_N$ in the row span
The all-ones vector $\mathbf{1}_N$ lies in the row span of $\mathbf{B}$ by construction (the column sums of $\mathbf{B}$ are equal across coordinates, so summing all rows yields a multiple of $\mathbf{1}_N$). Restricting to the responder rows preserves this: the cyclic encoding is designed so that any $n - s$ rows can reconstruct the all-ones row via a unique linear combination.
Solution exists
Since $\mathbf{B}_{\mathcal{T},:}$ has full row rank and $\mathbf{1}_N$ lies in its row span, the linear system $\mathbf{B}_{\mathcal{T},:}^{\top}\,\mathbf{a} = \mathbf{1}_N$ has a unique solution $\mathbf{a}$. The decoder applies $\mathbf{a}$ to the responses and recovers the aggregate gradient exactly.
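A quick numerical check of the rank claim, rebuilding a cyclic encoding matrix as in the ex-ch06-05 sketch (the parameters $n = N = 6$, $s = 2$ are assumed):

```python
import itertools
import numpy as np

# Build a cyclic encoding matrix (Tandon et al., Alg. 1) and verify that every
# (n - s)-row submatrix has full row rank n - s. Parameters are assumptions.
n, s = 6, 2
rng = np.random.default_rng(0)
H = rng.standard_normal((s, n))
H[:, -1] = -H[:, :-1].sum(axis=1)             # rows of H sum to zero, so H @ 1 = 0

B = np.zeros((n, n))
for i in range(n):
    idx = [(i + j) % n for j in range(s + 1)]  # cyclic support of row i
    B[i, idx[0]] = 1.0
    B[i, idx[1:]] = np.linalg.solve(H[:, idx[1:]], -H[:, idx[0]])

ranks = {int(np.linalg.matrix_rank(B[list(T)]))
         for T in itertools.combinations(range(n), n - s)}
print(ranks)  # expected: {4}, i.e., every responder submatrix has rank n - s
```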
ex-ch06-12
Hard. Derive the random-sparse approximate gradient-coding error bound informally. What is the role of the random-matrix concentration?
Use Marchenko-Pastur: smallest singular value of a random matrix.
Setup
Decode by least squares: $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \|\mathbf{B}_{\mathcal{T},:}^{\top}\mathbf{a} - \mathbf{1}_N\|_2$ (equivalently, apply the pseudo-inverse). Error: $\hat{g} - g = \sum_{k}\bigl(\mathbf{B}_{\mathcal{T},:}^{\top}\hat{\mathbf{a}} - \mathbf{1}_N\bigr)_k\, g_k$, controlled by the least-squares residual.
Random-matrix concentration
The smallest singular value of an $(n - s) \times N$ random matrix with density $p$ is, with high probability, bounded away from zero by an amount that grows with $p$ (Bai-Yin / Marchenko-Pastur). This bounds the conditioning of the pseudo-inverse.
Combine
The least-squares residual is bounded by a quantity inversely proportional to $\sigma_{\min}(\mathbf{B}_{\mathcal{T},:})$. Squared: the squared residual scales as $1/\sigma_{\min}^2$. Multiplying by the squared partial-gradient norms gives the error bound on $\|\hat{g} - g\|^2$.
Implication
Density acts as a "regularizer" for the decoder: more density means better conditioning and smaller error. The role of random-matrix concentration is to control $\sigma_{\min}$, which determines decoder accuracy.
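An empirical illustration of this concentration effect, using an assumed matrix model (Gaussian entries kept with probability $p$) and assumed sizes rather than the chapter's exact ensemble:

```python
import numpy as np

# The smallest singular value of a random sparse (tall) matrix grows with its
# density p, which improves the conditioning of the least-squares decoder.
rng = np.random.default_rng(0)
N, m, n_trials = 200, 160, 20      # think: N partitions, m = n - s responders

for p in (0.05, 0.1, 0.2, 0.4):
    sigma_min = np.mean([
        np.linalg.svd(
            rng.standard_normal((N, m)) * (rng.random((N, m)) < p),
            compute_uv=False,
        )[-1]                       # smallest singular value of one draw
        for _ in range(n_trials)
    ])
    print(f"density p = {p:4.2f}   average sigma_min ~= {sigma_min:.2f}")
```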
ex-ch06-13
Hard. Sketch a heterogeneous gradient-coding scheme where workers have different storage budgets $s_i + 1$. Conjecture the optimal recovery threshold and discuss what makes the problem harder than the symmetric case.
Recovery threshold should depend on the per-worker storage distribution.
Construction sketch
Assign worker $i$ to store $s_i + 1$ partitions. The encoding matrix $\mathbf{B}$ has row-$i$ support of size $s_i + 1$ (potentially varying across rows). The decoder finds $\mathbf{a}$ such that $\mathbf{B}_{\mathcal{T},:}^{\top}\,\mathbf{a} = \mathbf{1}_N$ for any valid responder set $\mathcal{T}$.
Recovery threshold
For symmetric storage ($s_i \equiv s$), the recovery threshold is $n - s$. Heterogeneous case: the threshold becomes the smallest number of responders that is guaranteed, in the worst case, to cover every partition with decodable redundancy. This is a covering problem on the worker-partition graph.
Why harder
Heterogeneous storage leads to a non-uniform partition assignment: some partitions are stored by many workers, others by few. The worst-case responder subset depends on which partitions the stragglers covered. Optimal scheme construction is a combinatorial-optimization problem with no general closed-form solution.
Status
Partial results exist (Ye-Abbe 2018, Wang et al. 2020). The optimal characterization in heterogeneous settings is an open research direction (see Chapter 18).
ex-ch06-14
Hard. Compute the joint storage / communication / latency trade-off for the composition of (i) $(n, s)$ gradient coding and (ii) Bonawitz secure aggregation, on a federated-learning round with $n$ users, $d$-dimensional gradients, and $b$-bit precision. Identify the dominant cost.
Sum the costs of each layer.
Per-user storage
Coded gradient: $s + 1$ partial gradients of $d$ scalars each $= (s+1)\,d$ scalars. Mask seeds: $O(n)$ bits, negligible. Total: $\approx (s+1)\,d$ scalars.
Per-user uplink
Coded gradient: one masked $d$-scalar gradient upload ($b\,d$ bits). Mask exchange: $O(n)$ pairwise interactions per round, negligible in bits. Total: $\approx b\,d$ bits per round.
Per-round latency
Stragglers absorbed: up to $s$ per round, so the master waits only for the fastest $n - s$ users. Mask-cancellation overhead: small constant. Total: the expected latency of the $(n - s)$-th fastest response plus a small decoding overhead.
Dominant cost
For typical FL parameters: per-user storage $\approx (s+1)\,d$ scalars; per-user uplink $\approx b\,d$ bits; per-round latency set by the $(n-s)$-th fastest user. The uplink dominates the per-round cost; the storage dominates the per-user memory budget.
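A back-of-the-envelope calculator for these per-user costs; every parameter value below (including the seed size) is an illustrative assumption, not the chapter's:

```python
# Plug in your deployment's numbers; these are placeholders.
n, s = 1000, 50          # users, straggler tolerance
d, b = 1_000_000, 32     # gradient dimension, bits per scalar

storage_scalars = (s + 1) * d            # coded-gradient workspace per user
uplink_bits = b * d + 128 * (n - 1)      # masked upload + 128-bit pairwise seeds

print(f"per-user storage ~ {storage_scalars / 1e6:.0f} M scalars "
      f"({storage_scalars * b / 8 / 1e9:.1f} GB at {b}-bit)")
print(f"per-user uplink  ~ {uplink_bits / 8 / 1e6:.1f} MB per round")
```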
Engineering implications
Coded gradient computation reduces latency substantially but does not reduce the per-user uplink. For uplink-bound deployments, combine with sparsification or quantization. For storage-bound deployments, consider approximate gradient coding (Β§6.3).
ex-ch06-15
Challenge. Open problem. For non-convex deep-learning models, derive (or empirically validate) a bound on the convergence-rate degradation of approximate gradient coding with relative error $\epsilon$. The convex-case bound (Β§6.3, Thm. 6.3.2) gives an asymptotic floor of order $\epsilon^2$; characterize whether the same holds for non-convex losses.
Try the smooth non-convex SGD analysis (Karimi et al. 2016).
The PL-condition can give similar guarantees.
Convex-case bound
For $L$-smooth, $\mu$-strongly-convex losses with approximate gradient coding, $\mathbb{E}[f(x_T)] - f^\star \le (1 - \mu/L)^T \bigl(f(x_0) - f^\star\bigr) + O(\epsilon^2)$.
Non-convex-case analysis
For smooth non-convex losses with approximate gradients of relative error $\epsilon$, the standard analysis gives $\min_{t \le T} \mathbb{E}\|\nabla f(x_t)\|^2 \le O(1/T) + O(\epsilon^2)$. That is, the gradient norm converges to a neighborhood of zero, not the function value to a neighborhood of the optimum.
PL-condition refinement
If the loss satisfies the Polyak-Łojasiewicz (PL) condition (which deep networks empirically often do, especially in the over-parameterized regime), the convex-case bound transfers and gives an asymptotic floor on function-value suboptimality. This is consistent with empirical observations in production FL.
Status
The full characterization for general non-convex losses is open. A research-grade exercise would be to validate the PL-based bound on standard FL benchmarks (FEMNIST, Shakespeare, CelebA-FL) and report whether the predicted scaling in $\epsilon$ holds empirically. This direction connects coded computing to the modern non-convex SGD analysis literature.