Exercises
ex18-1
Easy: Show that the given function has exponentially decaying Chebyshev coefficients on the interval. What polynomial degree is needed for the target uniform error?
The function is analytic on the interval.
Chebyshev coefficients of analytic functions decay exponentially.
Analyticity
The function is analytic on a strip (a Bernstein ellipse) around the interval in the complex plane. By Bernstein's theorem, the Chebyshev coefficients decay as $O(\rho^{-n})$ for some $\rho > 1$ determined by the ellipse.
Degree for $10^{-3}$
Need $\rho^{-n} \le 10^{-3}$, i.e., $n \ge 3\ln 10 / \ln \rho$; the specific degree follows once $\rho$ is fixed.
Coded-computing cost
A degree-$n$ approximation needs a recovery threshold of roughly $n+1$ workers. Reasonable for moderate cluster sizes.
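The decay can be checked numerically; a minimal sketch, using $e^x$ on $[-1, 1]$ as a stand-in since the exercise's function is not reproduced here:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_error(f, n):
    """Max uniform error of the degree-n Chebyshev interpolant of f on [-1, 1]."""
    coef = C.chebinterpolate(f, n)
    x = np.linspace(-1.0, 1.0, 2001)
    return float(np.max(np.abs(C.chebval(x, coef) - f(x))))

# For an analytic function the error drops exponentially with the degree,
# so a modest degree already reaches 1e-3 uniform error.
for n in (2, 4, 8, 16):
    print(n, cheb_error(np.exp, n))
```

Doubling the degree roughly squares the accuracy, which is the signature of exponential coefficient decay.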
ex18-2
Easy: Explain why ReLU's polynomial approximation error decays only polynomially in the degree, not exponentially.
ReLU has a kink at $x = 0$.
Gibbs phenomenon for non-smooth functions.
The kink
ReLU is $\max(0, x)$; its derivative jumps from $0$ to $1$ at $x = 0$, so the function is not differentiable there.
Chebyshev convergence rate
For non-smooth functions, Chebyshev coefficients decay polynomially, typically as $O(n^{-2})$ when the derivative has bounded variation (as for ReLU).
Total error
The resulting uniform error decays as $O(1/n)$. Polynomial, not exponential, convergence.
Engineering implication
Reaching a small target error therefore requires a high degree, and hence a prohibitive recovery threshold. Production ReLU coded computing requires hybrid or learned schemes.
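The slow decay is easy to observe; a short sketch with the same Chebyshev-interpolation machinery as above:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def relu(x):
    return np.maximum(0.0, x)

def cheb_error(n):
    """Max uniform error of the degree-n Chebyshev interpolant of ReLU."""
    coef = C.chebinterpolate(relu, n)
    x = np.linspace(-1.0, 1.0, 4001)
    return float(np.max(np.abs(C.chebval(x, coef) - relu(x))))

# The error only shrinks roughly like 1/n: doubling the degree about
# halves the error, in contrast to the analytic case.
for n in (8, 16, 32, 64):
    print(n, cheb_error(n))
```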
ex18-3
Easy: For the given number of users, gradient dimension, privacy threshold, Byzantine tolerance, and DP level, compute the lower-bound communication per round.
Use Theorem 18.2.1.
Compute components
Evaluate each factor in the bound of Theorem 18.2.1 with the given parameters.
Multiply
The product of the factors gives the lower bound in bits per round.
Compare with FL-only
FL with no constraints needs only the raw gradient bits. The joint privacy, robustness, and DP axes multiply this baseline; the ratio is the joint-axis overhead.
ex18-4
Easy: An Erdős-Rényi graph has a spectral gap $\delta = 1 - \lambda_2$ (with $\lambda_2$ the second eigenvalue of the Metropolis-Hastings mixing matrix) that concentrates for large $n$. Compute $\delta$ for the given $n$ and $p$. What is the mixing time $\tau_{\mathrm{mix}}$?
The gap concentrates around its expectation for typical (connected) graphs.
Spectral gap
$\delta = 1 - \lambda_2$, computed from the Metropolis-Hastings matrix of the realized graph.
Mixing time
$\tau_{\mathrm{mix}} \approx \delta^{-1} \ln n$ rounds, up to the target accuracy.
Operational
Roughly $\tau_{\mathrm{mix}}$ rounds of mixing are needed per consensus step. For D-SGD with 1000 iterations, a roughly 20x overhead is tolerable.
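The gap and mixing time can be computed directly; a sketch with assumed values $n = 100$, $p = 0.1$ (the exercise's numbers are not reproduced here), using the standard Metropolis-Hastings weights for the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_gap(n, p):
    """Spectral gap of the Metropolis-Hastings matrix on a G(n, p) sample."""
    # sample an Erdos-Renyi adjacency matrix
    A = rng.random((n, n)) < p
    A = np.triu(A, 1)
    A = A + A.T
    deg = A.sum(1)
    # Metropolis-Hastings weights targeting the uniform distribution:
    # W[i, j] = 1 / (1 + max(deg_i, deg_j)) on edges; self-loops absorb the rest
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j]:
                W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(1))
    lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - lam[1]  # 1 - |second-largest eigenvalue|

delta = mh_gap(100, 0.1)
print("gap:", delta, "approx mixing time:", np.log(100) / delta)
```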
ex18-5
Medium: Compute the quantum PIR rate and the classical PIR capacity for the given parameters. By what percentage does quantum outperform?
Substitute directly.
Quantum
Substitute the parameters into the quantum rate formula of Theorem 18.4.1.
Classical
Substitute into the classical capacity formula.
Comparison
Actually, the classical rate is higher here. For large $N$, both rates approach their asymptotic limits. Quantum does not always outperform; the advantage depends on the regime.
Advantage regime
The quantum advantage arises for moderate parameters and for specific non-classical PIR variants (coded storage plus colluding servers). See Allaix et al. 2022 for the exact regime.
ex18-6
Medium: For $n$ users in D-SGD on a graph of edge density $p$, compute the total per-round peer exchanges and compare with centralized aggregation ($n$ uploads).
Expected edges: $\binom{n}{2} p$.
Each edge is two exchanges (one per direction).
Expected edges
$\mathbb{E}[|E|] = \binom{n}{2} p = n(n-1)p/2$.
Per-round exchanges
$2\,\mathbb{E}[|E|] = n(n-1)p$ edge-exchanges.
Comparison
Centralized: $n$ uploads. D-SGD with a sparse graph: $n(n-1)p$ exchanges, a factor of $(n-1)p$ more.
Operational
Sparse D-SGD doesn't save total bandwidth, but distributes it (no central bottleneck). The trade-off is resilience vs. raw efficiency.
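The count is simple arithmetic; a sketch with assumed illustrative values $n = 100$, $p = 0.1$ (the exercise's numbers are not shown here):

```python
def dsgd_exchanges(n, p):
    """Expected per-round peer exchanges in D-SGD on a G(n, p) graph."""
    expected_edges = n * (n - 1) / 2 * p
    return 2 * expected_edges  # each edge carries one exchange per direction

n, p = 100, 0.1  # assumed values
print(f"{dsgd_exchanges(n, p):.0f} peer exchanges vs {n} centralized uploads")
print(f"ratio: {(n - 1) * p:.1f}x")
```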
ex18-7
Medium: For coded transformer attention with a degree-8 Chebyshev-polynomial softmax approximation and a given pool of workers, compute the recovery threshold if (i) only $QK^T$ is coded, and (ii) both $QK^T$ and the softmax are coded.
Matrix product via Chapter 5: the threshold of simple polynomial codes.
Softmax approximation: degree-8 polynomial, .
(i) QK^T only
The threshold is that of the polynomial encoding of the matrix product alone (for most schemes); workers beyond the threshold are straggler slack.
(ii) Both
The degree-8 softmax requires a Lagrange code whose threshold grows with the composed polynomial degree. The total effective threshold rises accordingly, leaving fewer workers as straggler slack.
Engineering
Coding both stages reduces straggler tolerance but preserves more of the transformer pipeline. End-to-end optimization depends on the straggler rate.
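The threshold growth can be illustrated with the standard Lagrange coded computing formula $\deg(f)\,(K-1)+1$ (Yu et al.); the worker count $P$ and partition count $K$ below are assumed, not the exercise's:

```python
def lcc_threshold(deg, K):
    """Lagrange coded computing recovery threshold for a degree-`deg`
    polynomial evaluated over K data partitions."""
    return deg * (K - 1) + 1

P = 30  # assumed worker pool
K = 4   # assumed number of input partitions
only_matmul = lcc_threshold(2, K)       # QK^T is bilinear: degree 2
with_softmax = lcc_threshold(2 * 8, K)  # composing a degree-8 softmax approx
print("thresholds:", only_matmul, with_softmax)
print("straggler slack:", P - only_matmul, P - with_softmax)
```

With these assumed numbers the composed threshold already exceeds the pool, illustrating why coding both stages can be prohibitive.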
ex18-8
Medium: In an MoE model with $E$ experts and $k$ active per input, only a fraction $k/E$ of the computation is activated per input. Argue that coded computing must adapt to this sparsity.
Naive coded computing assumes all experts used.
Sparse coded computing exploits non-activation.
Naive coded MoE
If all experts are coded redundantly, the recovery threshold scales with the total expert count $E$. For large $E$, a large cluster is required.
Sparsity exploitation
Only $k$ experts are active per input. Each input has a known sparse activation pattern.
Adaptive scheme
Encode only the active experts' computations per input. The threshold then scales with $k$ rather than $E$: much smaller.
Challenge
The activation is data-dependent: different inputs use different experts. Scheduling straggler tolerance per input breaks uniform coding. Open research.
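The gap between the naive and adaptive schemes is a ratio of $k$ to $E$; a schematic sketch with an assumed linear threshold model and assumed values $E = 64$, $k = 2$:

```python
def recovery_threshold(num_coded_experts, redundancy=1):
    """Schematic threshold model (an assumption, not a specific scheme):
    coding x experts with the given redundancy needs
    x * (redundancy + 1) worker results."""
    return num_coded_experts * (redundancy + 1)

E, k = 64, 2  # assumed expert counts
naive = recovery_threshold(E)      # code every expert redundantly
adaptive = recovery_threshold(k)   # code only the k active experts per input
print(naive, adaptive, f"activated fraction: {k / E:.3f}")
```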
ex18-9
Medium: A RAG system performs several queries per session. The user caches documents from previous queries. Apply cache-aided PIR (Chapter 15) to estimate the retrieval-rate improvement.
The effective number of files reduces by the cache size (side-information PIR).
With $M$ documents and $N$ databases.
Without cache
PIR rate $R = (1 - 1/N)\big/\bigl(1 - (1/N)^M\bigr)$.
With cache of $5$
Effective file count $M - 5$. The rate is essentially unchanged: for large $M$ the rate saturates at $1 - 1/N$, so a cache of 5 is negligible relative to $M$.
When cache helps
The cache helps significantly when its size is comparable to $M$. With a cache covering most of the library, the effective file count is small and the rate approaches its maximum. Requires a large cache.
Operational
In RAG, the cache is typically tiny relative to the library (the user caches a few documents; the library has thousands). Cache-aided PIR gives a minor rate improvement; its main value is recurring-document reuse.
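The saturation effect is easy to see numerically; a sketch using the classical replicated-PIR capacity, with assumed values $N = 4$, $M = 1000$, cache $= 5$ (the exercise's numbers are not reproduced here):

```python
def pir_rate(N, M):
    """Classical replicated-PIR capacity with N databases and M files."""
    return (1 - 1 / N) / (1 - (1 / N) ** M)

N, M, cache = 4, 1000, 5  # assumed values
print(pir_rate(N, M), pir_rate(N, M - cache))  # nearly identical: cache too small
print(pir_rate(N, 10))                         # small effective M: rate rises
```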
ex18-10
Hard: Sketch an approach to proving an information-theoretic lower bound on the recovery threshold of non-linear coded computing. What is the converse argument?
Use cut-set with a non-linear function.
Think about how many workers must be non-stragglers to reconstruct $f(X)$.
Setup
User wants $f(X)$, where $X$ is split across the workers. Each worker computes partial information. Recovery threshold: the minimum number of workers needed.
Cut-set argument
Consider a subset $S$ of workers. The mutual information between $f(X)$ and the subset's outputs is bounded by the joint entropy of their inputs; in the non-linear case, this is not linearly separable.
Non-linear converse difficulty
Unlike the polynomial case, non-linear $f$ can create non-additive information. Cut-set bounds are not tight; tighter bounds require computing non-linear conditional entropies, an unsolved problem in general.
Open aspect
Tight converses for non-linear coded computing require new techniques, e.g., matroid theory or generalizations of Fano's inequality to non-linear regimes. An active area.
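The cut-set step can be written out explicitly; a hedged sketch, with $Y_S$ denoting the outputs of a worker subset $S$ and $\epsilon_n \to 0$ from Fano's inequality:

```latex
% Cut-set converse sketch (schematic): if a subset S of size K-1 sufficed,
% Fano's inequality would force
\begin{align}
  H(f(X)) &\le I\bigl(f(X); Y_S\bigr) + n\epsilon_n \\
          &\le H(Y_S) + n\epsilon_n
           \le \sum_{i \in S} H(Y_i) + n\epsilon_n .
\end{align}
% For linear f the right-hand side separates cleanly across workers;
% for non-linear f the joint entropy H(Y_S) need not be additive,
% so this bound is generally loose.
```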
ex18-11
Hard: Given a fixed total communication budget of $C$ bits, what is the optimal allocation among privacy ($C_P$), robustness ($C_R$), and DP ($C_D$)? Derive the Pareto frontier.
Lower bound: start from the joint communication bound of Theorem 18.2.1.
Each axis has its own cost-benefit.
Setup
Maximize total utility (e.g., user-level privacy + system-level robustness + aggregate DP) subject to the budget $C_P + C_R + C_D \le C$.
Lagrangian
$\mathcal{L} = U_P(C_P) + U_R(C_R) + U_D(C_D) - \lambda\,(C_P + C_R + C_D - C)$.
First-order conditions
Equate marginal utilities across axes: $U_P'(C_P^*) = U_R'(C_R^*) = U_D'(C_D^*) = \lambda$. The optimal allocation equalizes marginal benefits.
Frontier shape
For concave utilities, the Pareto frontier is a smooth curve in $(C_P, C_R, C_D)$ space. For specific per-axis utilities (e.g., privacy that is binary: either satisfied or not), the frontier is piecewise linear.
Engineering
In practice, operators set the allocation from requirements (privacy law, SLA, etc.) rather than maximizing utility. The Pareto frontier is then a check that the requirements are feasible.
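The equal-marginal-utility condition has a closed form under an assumed utility shape; a sketch assuming $U_i(C_i) = w_i \log C_i$ (my assumption for illustration, not the chapter's model):

```python
def allocate(C, w):
    """Optimal budget split for assumed utilities U_i(C_i) = w_i * log(C_i):
    equal marginal utility w_i / C_i = lambda gives C_i* = (w_i / sum(w)) * C."""
    s = sum(w)
    return [wi / s * C for wi in w]

# e.g. privacy weighted twice as heavily as robustness and DP (assumed weights)
print(allocate(1e9, [2.0, 1.0, 1.0]))
```

Under this utility shape, each axis simply receives its weight's share of the budget.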
ex18-12
Hard: Sketch how AirComp (Ch. 16) could be combined with D-SGD (§18.3). Each peer-to-peer pair uses a small MAC; aggregation is per-edge. What are the synchronization and scaling implications?
Per-edge AirComp: a constant number of channel uses per neighbor pair.
Synchronization across pairs.
Per-edge AirComp
For each edge $(i, j)$ in the graph, peers $i$ and $j$ mutually transmit their models. Using AirComp: one channel use per edge.
Total bandwidth
On the order of $|E|$ AirComp uses per round. For a sparse graph, $|E| \ll n^2$: sub-quadratic.
Synchronization
Each peer pair must synchronize independently. This is harder than synchronizing a single MAC, and pairwise synchronization is especially difficult in multi-hop networks.
Privacy benefit
Per-edge AirComp provides only weak privacy within each pair: the MAC superposition exposes just the sum of the two models. Stronger privacy requires additional masking.
Open problem
Convergence rate of D-SGD with AirComp: can the graph-mixing and per-edge MSE be combined into a unified convergence result? Open.
ex18-13
Hard: A federated RAG system has $N = 100$ users, each hosting documents. Retrieval must be $T$-private against any $T$ colluding users. What is the total retrieval overhead (per document) under different $T$?
This is $T$-colluding PIR (Ch. 14, §14.2) with $N = 100$ databases.
Asymptotic rate: $R = 1 - T/N$.
$T = 1$ (classical)
$R = 1 - 1/100 = 0.99$. Retrieve 1 document at rate 0.99: download $\approx 1.01$ documents' worth.
$T = 5$
$R = 1 - 5/100 = 0.95$. Download $\approx 1.05$ documents' worth.
$T = 50$ (half-colluding)
$R = 1 - 50/100 = 0.5$. Download $2$ documents' worth: $2\times$ overhead.
$T = 90$
$R = 1 - 90/100 = 0.1$. Download $10$ documents' worth: $10\times$ overhead.
Operational
For modest $T$, PIR overhead is negligible; for aggressive collusion tolerance it is substantial. Federated RAG deployments should pick $T$ based on the threat model.
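The overhead table above can be reproduced in two lines, using the asymptotic $T$-colluding rate $1 - T/N$ with $N = 100$:

```python
def retrieval_overhead(T, N=100):
    """Download overhead per retrieved document under asymptotic
    T-colluding PIR: rate 1 - T/N, overhead = 1 / rate."""
    return 1 / (1 - T / N)

for T in (1, 5, 50, 90):
    print(f"T={T}: {retrieval_overhead(T):.2f}x download overhead")
```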
ex18-14
Hard: For $T$-colluding PIR with the given $N$, compute (i) the classical PIR capacity and (ii) the quantum PIR rate (Theorem 18.4.1). For what range of $T$ does quantum strictly outperform classical?
Classical: $1 - T/N$ (asymptotic in the number of files).
Quantum: the rate of Theorem 18.4.1.
Classical vs. Quantum
Classical: $1 - T/N$. Quantum: substitute $T$ and $N$ into the rate of Theorem 18.4.1.
Find $T$ where quantum exceeds classical
Setting the two rates equal yields a quadratic in $T$ with no real roots, which would suggest one side always dominates. But a spot check at a concrete $T$ shows the quantum rate is lower there! The formula in Thm. 18.4.1 applies to specific PIR variants, not vanilla $T$-colluding.
Resolution
The quantum advantage is specific to coded-storage PIR or SPIR, not classical $T$-colluding. For classical $T$-colluding, the classical scheme can equal or beat quantum. The quantum advantage emerges in multi-constraint PIR variants (Allaix et al. 2022).
Open
Characterizing the exact classical-quantum gap for each PIR variant is open. Some variants have quantum advantage; others don't.
ex18-15
Challenge: The five CommIT contributions cover coded shuffling, SecAgg, ByzSecAgg, CCESA, and IT-secure FRL. Propose a joint problem, combining two or more contributions, that would constitute a sixth CommIT direction. State the problem, the achievability scheme, and the key open questions.
Combine ByzSecAgg with cache-aided PIR?
CCESA + AirComp?
FRL + ByzSecAgg?
Example candidate: CCESA + AirComp
Setup: wireless FL with users on a sparse graph (CCESA-like topology); within each neighborhood, AirComp aggregation. This combines CCESA's communication scaling with AirComp's low per-round cost.
Achievability
Modify CCESA's sparse-graph topology so that neighbors aggregate pairwise via AirComp: local edges use AirComp; global consensus uses gossip plus a digital filter.
Open questions
- Convergence rate as a function of graph sparsity, per-edge AirComp noise, and graph-mixing time.
- Privacy bound: do CCESA's guarantees hold under AirComp aggregation?
- Practical synchronization: neighbor-synchronous AirComp vs. full global synchronization.
Impact
A successful framework would give communication-optimal wireless FL with formal privacy-robustness-DP guarantees, closing gaps in all five prior CommIT contributions simultaneously. This is the kind of sixth contribution the CommIT community could aim for.
Research plan
(1) Simulate small-scale prototype. (2) Prove convergence bound. (3) Characterize threat model. (4) Test on realistic wireless channels. Each step is a year of research. The larger research program spans 3-5 years.