Exercises
ex-ch06-01
Easy. For an $(n, s) = (50, 5)$ gradient-coding scheme, what are the per-worker storage, the recovery threshold, and the straggler tolerance?
Per-worker storage: $s + 1$ partitions; threshold: $n - s$.
Compute
Per-worker storage: $s + 1 = 6$ partitions. Recovery threshold: $n - s = 45$. Straggler tolerance: $s = 5$ (master ignores the slowest 5 of 50 workers).
ex-ch06-02
Easy. The aggregate gradient is a matrix-vector product. Identify the matrix and the vector.
Stack the partial gradients $g_i$ column-wise.
Identification
$g = \sum_{i=1}^{N} g_i = \mathbf{G}\,\mathbf{1}_N$, where $\mathbf{G} = [\,g_1 \;\cdots\; g_N\,]$ is the matrix of stacked partial gradients and $\mathbf{1}_N$ is the all-ones query vector.
Implication
Coded gradient computation is the specialization of the polynomial codes of Chapter 5 to the fixed all-ones query vector $\mathbf{1}_N$.
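A quick numerical sanity check of this identification; the toy dimensions below are assumed for illustration only:

```python
import numpy as np

# Stack N partial gradients as the columns of G; the aggregate gradient is the
# matrix-vector product G @ 1_N. Dimensions are illustrative assumptions.
N, d = 8, 5                                   # partitions, model dimension
rng = np.random.default_rng(0)
G = rng.standard_normal((d, N))               # column i holds partial gradient g_i
aggregate = G @ np.ones(N)                    # matrix-vector product with 1_N
assert np.allclose(aggregate, G.sum(axis=1))  # equals the plain sum of the g_i
```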
ex-ch06-03
Easy. For exact gradient coding with $n$ workers and i.i.d. exponential local compute times with rate $\lambda$, find the per-round latency speedup of $s$ stragglers tolerated vs. $s = 0$.
Speedup $= H_n / (H_n - H_s)$, where $H_m$ is the $m$-th harmonic number.
Compute harmonic numbers
Uncoded ($s = 0$): the master waits for the slowest of $n$ workers, with expected time $H_n/\lambda$; with tolerance $s$: it waits for the $(n-s)$-th order statistic, with expected time $(H_n - H_s)/\lambda$; here $H_m = \sum_{j=1}^{m} 1/j$.
Speedup
$\text{Speedup} = \dfrac{\mathbb{E}[T_{\text{uncoded}}]}{\mathbb{E}[T_{\text{coded}}]} = \dfrac{H_n}{H_n - H_s} > 1$.
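A short numerical check of this formula; the values $n = 50$, $s = 5$ are assumed for illustration (they match ex-ch06-01), not fixed by this exercise:

```python
import numpy as np

# Compute the expected per-round latency with and without straggler tolerance,
# in units of 1/lambda, and the resulting speedup.
def harmonic(m: int) -> float:
    """H_m = 1 + 1/2 + ... + 1/m."""
    return float(np.sum(1.0 / np.arange(1, m + 1)))

n, s = 50, 5                               # assumed parameters
latency_uncoded = harmonic(n)              # E[slowest of n]
latency_coded = harmonic(n) - harmonic(s)  # E[(n - s)-th order statistic]
print(f"speedup = {latency_uncoded / latency_coded:.2f}x")  # roughly 2x here
```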
ex-ch06-04
Easy. State the difference between exact gradient coding and approximate gradient coding in one sentence each.
Exact
Exact gradient coding (Tandon et al.) recovers the aggregate gradient $g$ exactly from any $n - s$ responses, at a per-worker storage of $s + 1$ partitions.
Approximate
Approximate gradient coding (Charles et al.) recovers an estimate $\hat{g}$ with $\|\hat{g} - g\| \le \epsilon \|g\|$, at a per-worker storage sub-linear in $s$.
ex-ch06-05
Medium. Design an $(n = 6, s = 2)$ gradient-coding scheme. Specify the storage assignment for each worker and write out an encoding matrix that satisfies the cyclic construction.
Use the cyclic support pattern: worker $i$ covers partitions $\{i, i+1, \ldots, i+s\} \pmod{n}$.
Storage assignment
$W_1: \{1, 2, 3\}$, $W_2: \{2, 3, 4\}$, $W_3: \{3, 4, 5\}$, $W_4: \{4, 5, 6\}$, $W_5: \{5, 6, 1\}$, $W_6: \{6, 1, 2\}$. Each worker stores $s + 1 = 3$ consecutive partitions.
Encoding matrix (one valid choice)
Row $i$ of $\mathbf{B}$ has support exactly on $\{i, i+1, i+2\} \pmod{6}$; the nonzero coefficients are chosen by the cyclic construction (Tandon et al., Algorithm 1) so that every set of $n - s = 4$ rows spans $\mathbf{1}_6$ (a numerically constructed instance is sketched below).
Recovery threshold
$n - s = 4$. The master can tolerate any $s = 2$ stragglers out of $n = 6$ workers.
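Since the explicit coefficients are not reproduced here, the following sketch constructs one valid cyclic encoding matrix numerically, following Algorithm 1 of Tandon et al., and checks the span condition for every responder set; the parameters $n = 6$, $s = 2$ match the assumption above:

```python
import itertools
import numpy as np

# Sketch of the cyclic gradient-coding construction (Tandon et al., Alg. 1),
# with assumed parameters n = 6, s = 2. Partitions are 0-indexed here.
n, s = 6, 2
rng = np.random.default_rng(0)

# Random s x n matrix H whose rows sum to zero, so that H @ 1 = 0.
H = rng.standard_normal((s, n))
H[:, -1] = -H[:, :-1].sum(axis=1)

# Row i of B is supported on the s+1 cyclically consecutive partitions
# {i, ..., i+s} (mod n) and lies in the null space of H.
B = np.zeros((n, n))
for i in range(n):
    idx = [(i + j) % n for j in range(s + 1)]
    B[i, idx[0]] = 1.0
    B[i, idx[1:]] = np.linalg.solve(H[:, idx[1:]], -H[:, idx[0]])

# Span condition: for every responder set T of size n - s, the all-ones
# vector lies in the row span of B restricted to T.
for T in itertools.combinations(range(n), n - s):
    a, *_ = np.linalg.lstsq(B[list(T)].T, np.ones(n), rcond=None)
    assert np.allclose(B[list(T)].T @ a, np.ones(n)), T
print("every responder set of size n - s can reconstruct the aggregate")
```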
ex-ch06-06
Medium. Prove that the uncoded SGD scheme (no redundancy) corresponds to $s = 0$ in gradient coding, and verify that the recovery threshold is correct.
$s = 0$ means each worker stores only its own partition.
$s = 0$ case
Storage assignment: $s + 1 = 1$ partition; each worker stores its own partition only. Encoding matrix: $\mathbf{B} = \mathbf{I}_n$ (the identity).
Verification
Recovery threshold: $n - s = n$. The master needs all $n$ responses (one per partial gradient). This is exactly uncoded SGD: every worker's response is needed for the full sum. Straggler tolerance: $s = 0$.
Implication
$s = 0$ recovers uncoded SGD as a degenerate case of gradient coding. The framework is strictly more general: increasing $s$ trades storage for straggler tolerance.
ex-ch06-07
Medium. Compute the per-worker storage for approximate gradient coding with error tolerance $\epsilon$, $n$ workers, and straggler tolerance $s$.
The density $p$ of the random encoding matrix determines the per-worker storage.
Density
From the theorem of Β§6.3, the density $p$ of the encoding matrix is set by the target accuracy $\epsilon$ and the straggler tolerance $s$; the per-worker storage is then $p$ times the number of partitions, i.e., the expected row support of the encoding matrix.
Comparison with exact
Exact gradient coding for the same $s$ requires $s + 1$ partitions per worker. Approximate coding at this $\epsilon$ requires comparable storage, but with a smooth tradeoff: a tighter $\epsilon$ requires more storage, a looser $\epsilon$ much less.
Lesson
At loose error tolerances (large $\epsilon$), approximate coding beats exact coding by a substantial factor in storage. For tight tolerances (small $\epsilon$), the gap narrows.
ex-ch06-08
Medium. Derive (informally) the convergence-rate result for approximate gradient coding on a smooth, strongly convex objective. Identify the asymptotic floor.
Standard SGD inequality + bounded approximation error.
Setup
$f$ is $L$-smooth and $\mu$-strongly convex; the decoded update direction $\hat{g}_t$ deviates from the true gradient $\nabla f(x_t)$ by a bounded approximation error controlled by $\epsilon$.
SGD inequality
Standard analysis (Bubeck 2015, Β§6.2) gives a bound of the form $\mathbb{E}[f(x_T)] - f^\star \le (1 - \mu/L)^T \bigl(f(x_0) - f^\star\bigr) + O(\epsilon^2)$, where the additional $O(\epsilon^2)$ term arises from the approximation noise.
Asymptotic floor
As $T \to \infty$, the first term vanishes and the $O(\epsilon^2)$ floor remains. For small $\epsilon$, the floor is small and often acceptable in practice.
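A minimal one-step sketch of where the floor comes from, assuming step size $\eta = 1/L$ and writing the decoding error as $e_t = \hat{g}_t - \nabla f(x_t)$ with a uniform bound $\|e_t\| \le \epsilon G$ (the symbols $e_t$ and $G$ are introduced only for this sketch): smoothness gives
$$f(x_{t+1}) \;\le\; f(x_t) - \tfrac{1}{L}\langle \nabla f(x_t), \hat{g}_t\rangle + \tfrac{1}{2L}\|\hat{g}_t\|^2 \;=\; f(x_t) - \tfrac{1}{2L}\|\nabla f(x_t)\|^2 + \tfrac{1}{2L}\|e_t\|^2,$$
and strong convexity gives $\|\nabla f(x_t)\|^2 \ge 2\mu\bigl(f(x_t) - f^\star\bigr)$, so unrolling yields
$$f(x_T) - f^\star \;\le\; \bigl(1 - \tfrac{\mu}{L}\bigr)^T \bigl(f(x_0) - f^\star\bigr) + \tfrac{\epsilon^2 G^2}{2\mu},$$
i.e., geometric decay plus an $O(\epsilon^2)$ floor.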
ex-ch06-09
Medium. Explain why coded gradient computation does not provide privacy against the master, and how it can be combined with Bonawitz-style secure aggregation to provide both.
Coded gradient just makes aggregation tolerate stragglers.
Why no privacy
The master decodes the exact sum from the coded responses. This is a deterministic function of the responses; no random masking is involved. The master could in principle invert the system to learn individual partial gradients, or at least linear combinations of them beyond the sum, since nothing random hides them.
Composition with SecAgg
Each user computes their coded gradient (Section 6.2), then adds pairwise random masks before sending. The masks cancel in the aggregate, so the master sees the aggregate gradient but not the individual coded responses, let alone the underlying partial gradients. The composition is well-defined because both operations are linear.
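A minimal sketch of this composition, with an assumed user count and dimension, and a seeded NumPy generator standing in for the shared PRG:

```python
import numpy as np

# Users upload masked coded gradients; the pairwise masks cancel in the
# server-side sum. All parameters here are illustrative assumptions.
n_users, d = 5, 8
rng = np.random.default_rng(0)
coded = rng.standard_normal((n_users, d))    # per-user coded gradients

def pairwise_mask(i: int, j: int) -> np.ndarray:
    """Mask expanded from the seed users i and j agreed on (i < j)."""
    return np.random.default_rng(seed=1000 * i + j).standard_normal(d)

uploads = []
for i in range(n_users):
    masked = coded[i].copy()
    for j in range(n_users):
        if i < j:
            masked += pairwise_mask(i, j)    # user i adds the shared mask ...
        elif j < i:
            masked -= pairwise_mask(j, i)    # ... and user j subtracts it
    uploads.append(masked)

# The server only ever sees `uploads`; their sum equals the unmasked sum.
assert np.allclose(np.sum(uploads, axis=0), coded.sum(axis=0))
```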
Cost
Composition is additive: per-user storage grows by $O(n)$ for the mask seeds; per-round communication grows by $O(n)$ for the pairwise mask exchange. The recovery threshold is unchanged ($n - s$).
ex-ch06-10
Medium. Compare gradient sparsification (top-$k$) and approximate gradient coding in terms of (i) what is approximated, (ii) where in the pipeline the approximation happens, and (iii) when each pays off.
What is approximated
Sparsification: each user's individual gradient is approximated by zeroing out small-magnitude entries. Approximate coding: the aggregate is approximated; individual gradients are still computed exactly.
Where in pipeline
Sparsification: per-user, before aggregation. Approximate coding: at the master, after all responses arrive.
When each pays off
Sparsification: when uplink bandwidth is the bottleneck (mobile FL, narrowband wireless). Approximate coding: when straggler tolerance and per-worker storage are the bottlenecks (datacenter clusters with heterogeneous compute).
Combination
The two are orthogonal and can be combined: each user sparsifies and the aggregation is done with approximate coding. Production systems sometimes layer both for compounded savings.
ex-ch06-11
Hard. Prove that the cyclic gradient-coding construction's encoding matrix $\mathbf{B}$ has the property: for every responder set $\mathcal{T}$ of size $n - s$, the linear system $\mathbf{B}_{\mathcal{T},:}^{\top}\,\mathbf{a} = \mathbf{1}_N$ has a solution $\mathbf{a}$.
Use the cyclic structure: $\mathbf{B}_{\mathcal{T},:}$ has rank $n - s$.
$\mathbf{1}_N$ is in the row span of $\mathbf{B}_{\mathcal{T},:}$.
Rank of $\mathbf{B}_{\mathcal{T}, :}$
The cyclic structure of $\mathbf{B}$ ensures that any $(n - s) \times N$ row submatrix $\mathbf{B}_{\mathcal{T},:}$ has rank $n - s$ (this is a standard property of the equiangular cyclic encoding; see Tandon et al., Lemma 1).
$\mathbf{1}_N$ in the row span
The all-ones vector $\mathbf{1}_N$ lies in the row span of $\mathbf{B}$ by construction (the column sums of $\mathbf{B}$ are equal across coordinates, so summing all rows yields a multiple of $\mathbf{1}_N$). Restricting to the responder rows preserves this: the cyclic encoding is designed so that any $n - s$ rows can reconstruct the all-ones row via a unique linear combination.
Solution exists
Since $\mathbf{B}_{\mathcal{T},:}$ has full row rank and $\mathbf{1}_N$ lies in its row span, the linear system $\mathbf{B}_{\mathcal{T},:}^{\top}\,\mathbf{a} = \mathbf{1}_N$ has a unique solution $\mathbf{a}$. The decoder applies $\mathbf{a}$ to the responses and recovers the aggregate gradient exactly.
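A quick numerical check of the rank claim, rebuilding a cyclic encoding matrix as in the ex-ch06-05 sketch (the parameters $n = N = 6$, $s = 2$ are assumed):

```python
import itertools
import numpy as np

# Build a cyclic encoding matrix (Tandon et al., Alg. 1) and verify that every
# (n - s)-row submatrix has full row rank n - s. Parameters are assumptions.
n, s = 6, 2
rng = np.random.default_rng(0)
H = rng.standard_normal((s, n))
H[:, -1] = -H[:, :-1].sum(axis=1)             # rows of H sum to zero, so H @ 1 = 0

B = np.zeros((n, n))
for i in range(n):
    idx = [(i + j) % n for j in range(s + 1)]  # cyclic support of row i
    B[i, idx[0]] = 1.0
    B[i, idx[1:]] = np.linalg.solve(H[:, idx[1:]], -H[:, idx[0]])

ranks = {int(np.linalg.matrix_rank(B[list(T)]))
         for T in itertools.combinations(range(n), n - s)}
print(ranks)  # expected: {4}, i.e., every responder submatrix has rank n - s
```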
ex-ch06-12
Hard. Derive the random-sparse approximate gradient-coding error bound informally. What is the role of the random-matrix concentration?
Use Marchenko-Pastur: smallest singular value of a random matrix.
Setup
Decode by least squares: $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \|\mathbf{B}_{\mathcal{T},:}^{\top}\mathbf{a} - \mathbf{1}_N\|_2$ (equivalently, apply the pseudo-inverse). Error: $\hat{g} - g = \sum_{k}\bigl(\mathbf{B}_{\mathcal{T},:}^{\top}\hat{\mathbf{a}} - \mathbf{1}_N\bigr)_k\, g_k$, controlled by the least-squares residual.
Random-matrix concentration
The smallest singular value of an $(n - s) \times N$ random matrix with density $p$ is, with high probability, bounded away from zero by an amount that grows with $p$ (Bai-Yin / Marchenko-Pastur). This bounds the conditioning of the pseudo-inverse.
Combine
The least-squares residual is bounded by a quantity inversely proportional to $\sigma_{\min}(\mathbf{B}_{\mathcal{T},:})$. Squared: the squared residual scales as $1/\sigma_{\min}^2$. Multiplying by the squared partial-gradient norms gives the error bound on $\|\hat{g} - g\|^2$.
Implication
Density acts as a "regularizer" for the decoder: more density means better conditioning and smaller error. The role of random-matrix concentration is to control $\sigma_{\min}$, which determines decoder accuracy.
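An empirical illustration of this concentration effect, using an assumed matrix model (Gaussian entries kept with probability $p$) and assumed sizes rather than the chapter's exact ensemble:

```python
import numpy as np

# The smallest singular value of a random sparse (tall) matrix grows with its
# density p, which improves the conditioning of the least-squares decoder.
rng = np.random.default_rng(0)
N, m, n_trials = 200, 160, 20      # think: N partitions, m = n - s responders

for p in (0.05, 0.1, 0.2, 0.4):
    sigma_min = np.mean([
        np.linalg.svd(
            rng.standard_normal((N, m)) * (rng.random((N, m)) < p),
            compute_uv=False,
        )[-1]                       # smallest singular value of one draw
        for _ in range(n_trials)
    ])
    print(f"density p = {p:4.2f}   average sigma_min ~= {sigma_min:.2f}")
```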
ex-ch06-13
Hard. Sketch a heterogeneous gradient-coding scheme where workers have different storage budgets $s_i + 1$. Conjecture the optimal recovery threshold and discuss what makes the problem harder than the symmetric case.
Recovery threshold should depend on the per-worker storage distribution.
Construction sketch
Assign worker $i$ to store $s_i + 1$ partitions. The encoding matrix $\mathbf{B}$ has row-$i$ support of size $s_i + 1$ (potentially varying across rows). The decoder finds $\mathbf{a}$ such that $\mathbf{B}_{\mathcal{T},:}^{\top}\,\mathbf{a} = \mathbf{1}_N$ for any valid responder set $\mathcal{T}$.
Recovery threshold
For symmetric storage ($s_i \equiv s$), the recovery threshold is $n - s$. Heterogeneous case: the threshold becomes the smallest number of responders that is guaranteed, in the worst case, to cover every partition with decodable redundancy. This is a covering problem on the worker-partition graph.
Why harder
Heterogeneous storage leads to a non-uniform partition assignment: some partitions are stored by many workers, others by few. The worst-case responder subset depends on which partitions the stragglers covered. Optimal scheme construction is a combinatorial-optimization problem with no general closed-form solution.
Status
Partial results exist (Ye-Abbe 2018, Wang et al. 2020). The optimal characterization in heterogeneous settings is an open research direction (see Chapter 18).
ex-ch06-14
Hard. Compute the joint storage / communication / latency trade-off for the composition of (i) $(n, s)$ gradient coding and (ii) Bonawitz secure aggregation, on a federated-learning round with $n$ users, $d$-dimensional gradients, and $b$-bit precision. Identify the dominant cost.
Sum the costs of each layer.
Per-user storage
Coded gradient: $s + 1$ partial gradients of $d$ scalars each $= (s+1)\,d$ scalars. Mask seeds: $O(n)$ bits, negligible. Total: $\approx (s+1)\,d$ scalars.
Per-user uplink
Coded gradient: one masked $d$-scalar gradient upload ($b\,d$ bits). Mask exchange: $O(n)$ pairwise interactions per round, negligible in bits. Total: $\approx b\,d$ bits per round.
Per-round latency
Stragglers absorbed: up to $s$ per round, so the master waits only for the fastest $n - s$ users. Mask-cancellation overhead: small constant. Total: the expected latency of the $(n - s)$-th fastest response plus a small decoding overhead.
Dominant cost
For typical FL parameters: per-user storage $\approx (s+1)\,d$ scalars; per-user uplink $\approx b\,d$ bits; per-round latency set by the $(n-s)$-th fastest user. The uplink dominates the per-round cost; the storage dominates the per-user memory budget.
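A back-of-the-envelope calculator for these per-user costs; every parameter value below (including the seed size) is an illustrative assumption, not the chapter's:

```python
# Plug in your deployment's numbers; these are placeholders.
n, s = 1000, 50          # users, straggler tolerance
d, b = 1_000_000, 32     # gradient dimension, bits per scalar

storage_scalars = (s + 1) * d            # coded-gradient workspace per user
uplink_bits = b * d + 128 * (n - 1)      # masked upload + 128-bit pairwise seeds

print(f"per-user storage ~ {storage_scalars / 1e6:.0f} M scalars "
      f"({storage_scalars * b / 8 / 1e9:.1f} GB at {b}-bit)")
print(f"per-user uplink  ~ {uplink_bits / 8 / 1e6:.1f} MB per round")
```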
Engineering implications
Coded gradient computation reduces latency substantially but does not reduce the per-user uplink. For uplink-bound deployments, combine with sparsification or quantization. For storage-bound deployments, consider approximate gradient coding (Β§6.3).
ex-ch06-15
Challenge. Open problem. For non-convex deep-learning models, derive (or empirically validate) a bound on the convergence-rate degradation of approximate gradient coding with relative error $\epsilon$. The convex-case bound (Β§6.3, Thm. 6.3.2) gives an asymptotic floor of order $\epsilon^2$; characterize whether the same holds for non-convex losses.
Try the smooth non-convex SGD analysis (Karimi et al. 2016).
The PL-condition can give similar guarantees.
Convex-case bound
For $L$-smooth, $\mu$-strongly-convex losses with approximate gradient coding, $\mathbb{E}[f(x_T)] - f^\star \le (1 - \mu/L)^T \bigl(f(x_0) - f^\star\bigr) + O(\epsilon^2)$.
Non-convex-case analysis
For smooth non-convex losses with approximate gradients of relative error $\epsilon$, the standard analysis gives $\min_{t \le T} \mathbb{E}\|\nabla f(x_t)\|^2 \le O(1/T) + O(\epsilon^2)$. That is, the gradient norm converges to a neighborhood of zero, not the function value to a neighborhood of the optimum.
PL-condition refinement
If the loss satisfies the Polyak-Łojasiewicz (PL) condition (which deep networks empirically often do, especially in the over-parameterized regime), the convex-case bound transfers and gives an asymptotic floor on function-value suboptimality. This is consistent with empirical observations in production FL.
Status
The full characterization for general non-convex losses is open. A research-grade exercise would be to validate the PL-based bound on standard FL benchmarks (FEMNIST, Shakespeare, CelebA-FL) and report whether the predicted scaling in $\epsilon$ holds empirically. This direction connects coded computing to the modern non-convex SGD analysis literature.