Exercises
ex-ch01-01
Easy. A MapReduce job has $K$ workers storing equal disjoint partitions of an $F$-GB intermediate file. How much aggregate network traffic does an uncoded shuffle generate?
Use $L_{\text{uncoded}} = \left(1 - \frac{1}{K}\right)F$.
Plug in the given $F$ (in GB) and $K$.
Apply the formula
Each reducer is responsible for $F/K$ GB of intermediate data, of which a fraction $(K-1)/K$ resides on other workers. Summing over all $K$ reducers, the shuffle moves $K \cdot \frac{F}{K} \cdot \frac{K-1}{K} = \left(1 - \frac{1}{K}\right)F$ GB.
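A quick numerical check of the formula (the values of $K$ and $F$ below are illustrative, not the exercise's):

```python
def uncoded_shuffle_traffic(K, F):
    """Aggregate uncoded-shuffle traffic in GB: each of the K reducers
    fetches the (K-1)/K fraction of its F/K-GB slice held remotely."""
    return (1 - 1 / K) * F

# illustrative: 10 workers, a 10-GB intermediate file
print(uncoded_shuffle_traffic(10, 10.0))  # 9.0 GB
```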
ex-ch01-02
Easy. Ten workers have i.i.d. exponential task-completion times with rate $\mu$. Find $\mathbb{E}[X_{(10)}]$, the expected wait for the slowest worker.
Recall $\mathbb{E}[X_{(n)}] = H_n/\mu$, where $H_n = \sum_{i=1}^{n} 1/i$.
Compute $H_{10}$ numerically.
Harmonic number
$H_{10} = 1 + \frac{1}{2} + \cdots + \frac{1}{10} \approx 2.93$.
Expected latency
$\mathbb{E}[X_{(10)}] = H_{10}/\mu \approx 2.93/\mu$ time units, almost three times the per-worker mean of $1/\mu$.
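The harmonic-number computation is easy to verify numerically (a minimal sketch; $\mu = 1$ is an illustrative rate):

```python
from math import fsum

def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return fsum(1 / i for i in range(1, n + 1))

mu = 1.0  # illustrative rate
print(harmonic(10) / mu)  # expected wait for the slowest of 10 workers, ~2.93
```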
ex-ch01-03
Easy. A federated-learning round uses $N$ users, model size $d$ parameters, and $b$ bits per scalar. Compute the total per-round uplink traffic.
Use Total $= N \cdot d \cdot b$ bits.
Plug in
Each user uploads $d \cdot b$ bits, so the round costs $N \cdot d \cdot b$ bits in total; dividing by $8 \times 10^9$ expresses this in GB per round.
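The count is a straight product; a sketch with illustrative values (the exercise's own $N$, $d$, $b$ are not assumed here):

```python
def uplink_bits(N, d, b):
    """Total per-round uplink: N users x d parameters x b bits per scalar."""
    return N * d * b

# illustrative: 1000 users, a 10^9-parameter model, 32-bit scalars
total = uplink_bits(1000, 10**9, 32)
print(total / 8e9, "GB per round")  # 4000.0 GB per round
```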
ex-ch01-04
Easy. True or false: adding more workers to a synchronous distributed iteration always decreases the per-iteration wall-clock time.
Think about the formula $\mathbb{E}[X_{(n)}] = H_n/\mu$.
Answer
False. The per-iteration latency is $\mathbb{E}[X_{(n)}] = H_n/\mu \approx (\ln n)/\mu$, which increases with $n$. Without redundancy, adding workers only slows things down: each extra worker adds one more exponential tail that can delay the iteration.
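The monotone growth of the barrier cost $H_n/\mu$ can be checked directly:

```python
from math import fsum

def harmonic(n):
    return fsum(1 / i for i in range(1, n + 1))

# Synchronous barrier cost (in units of 1/mu) for growing clusters:
costs = [harmonic(n) for n in (1, 2, 5, 10, 100)]
print(costs)
print(all(a < b for a, b in zip(costs, costs[1:])))  # True: strictly increasing
```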
ex-ch01-05
Medium. Derive the coded communication load $L_{\text{coded}}(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right)$ for MapReduce with computation load $r$, starting from the pair endpoints $\left(1,\, 1 - \frac{1}{K}\right)$ and $(K,\, 0)$. Verify that the curve is convex and interpolates the two endpoints.
Check the function's value at $r = 1$ and $r = K$.
Compute the second derivative with respect to $r$.
Endpoint checks
At $r = 1$: $L_{\text{coded}}(1) = 1 - \frac{1}{K}$. ✓ At $r = K$: $L_{\text{coded}}(K) = \frac{1}{K}\left(1 - \frac{K}{K}\right) = 0$. ✓
Convexity
Let $f(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right) = \frac{1}{r} - \frac{1}{K}$. Then $f'(r) = -\frac{1}{r^2}$ and $f''(r) = \frac{2}{r^3} > 0$ for $r > 0$. Hence $f$ is convex in $r$, as required.
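A numerical sanity check of the endpoints and of discrete convexity (illustrative $K = 10$):

```python
def coded_load(r, K):
    """Coded MapReduce communication load L(r) = (1/r) * (1 - r/K)."""
    return (1 / r) * (1 - r / K)

K = 10
print(coded_load(1, K), coded_load(K, K))  # endpoints: 0.9 and 0.0
# discrete convexity: all second differences are positive
vals = [coded_load(r, K) for r in range(1, K + 1)]
print(all(vals[i - 1] - 2 * vals[i] + vals[i + 1] > 0 for i in range(1, K - 1)))
```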
ex-ch01-06
Medium. For a redundant scheme with recovery threshold $k$ out of $n$ workers and i.i.d. exponential task times (rate $\mu$), derive $\mathbb{E}[X_{(k)}]$ and compute its limit as $n \to \infty$ with $k = \alpha n$.
Use the memoryless decomposition $X_{(i)} - X_{(i-1)} \sim \mathrm{Exp}\big((n - i + 1)\mu\big)$.
Recognize $\mathbb{E}[X_{(k)}]$ as a partial sum of the harmonic series.
Order-statistic gaps
$X_{(k)} = \sum_{i=1}^{k} \big(X_{(i)} - X_{(i-1)}\big)$ with independent gaps $X_{(i)} - X_{(i-1)} \sim \mathrm{Exp}\big((n-i+1)\mu\big)$, so $\mathbb{E}[X_{(k)}] = \frac{1}{\mu}\sum_{i=1}^{k}\frac{1}{n-i+1} = \frac{H_n - H_{n-k}}{\mu}$.
Asymptotic
With $k = \alpha n$, $H_n - H_{n-k} \to \ln\frac{n}{n-k} = \ln\frac{1}{1-\alpha}$. Hence $\mathbb{E}[X_{(\alpha n)}] \to \frac{1}{\mu}\ln\frac{1}{1-\alpha}$.
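The convergence to $\frac{1}{\mu}\ln\frac{1}{1-\alpha}$ can be observed numerically (illustrative $\alpha = 0.8$, $\mu = 1$):

```python
from math import fsum, log

def expected_kth(n, k, mu=1.0):
    """E[X_(k)] = (H_n - H_{n-k}) / mu for i.i.d. Exp(mu) task times."""
    return fsum(1 / i for i in range(n - k + 1, n + 1)) / mu

alpha = 0.8
for n in (100, 10_000, 1_000_000):
    print(n, expected_kth(n, int(alpha * n)))
print(log(1 / (1 - alpha)))  # limit ln(1/(1-alpha)) ~ 1.6094
```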
ex-ch01-07
Medium. A parameter server has $B$ Gbps of ingress bandwidth and a round must complete in under $1$ second. The model has $d$ parameters at $b$ bits per scalar. What is the maximum number of users that can be aggregated per round in the plaintext (no-privacy) baseline?
Compute $d \cdot b$ bits per user, then divide the available bandwidth by that.
Per-user uplink
Each user uploads the full gradient: $d \cdot b$ bits per user.
Users per second
At $B \times 10^9$ bits/s of ingress and $d \cdot b$ bits per user, the server can ingest $\frac{B \times 10^9}{d b} \approx 0.3$ users per second, or, stated in the opposite direction, just 0.3 users fit in the $1$-second budget. To aggregate many users per round we either need compression (e.g., 1-bit quantization cuts the per-user upload by a factor of $b$) or a parallel-aggregation architecture.
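The arithmetic, with illustrative constants chosen to reproduce the $\approx 0.3$ users/s figure (a 10 Gbps link and a $10^9$-parameter model at 32 bits are assumptions, not the exercise's stated values):

```python
B = 10e9   # ingress bandwidth in bits/s (assumed: 10 Gbps)
d = 10**9  # model parameters (assumed)
b = 32     # bits per scalar (assumed)

bits_per_user = d * b                 # full-gradient upload per user
users_per_second = B / bits_per_user  # ~0.31: under one user per second
print(bits_per_user, users_per_second)
```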
ex-ch01-08
Medium. Suppose a federated-learning service claims "privacy by design" because "raw data never leaves the device". Identify at least three ways the service can still leak information to the cloud provider, and map each to the chapter of this book that addresses it.
Gradient inversion.
Traffic analysis of activity patterns.
Server-side inference on aggregate gradients.
Gradient inversion (DLG)
A plaintext gradient can be inverted to reconstruct the training samples. Addressed in Chapter 10 (secure aggregation).
Activity side-channels
Timing and existence of uploads reveal that a particular user holds data matching the current model. Addressed by differential privacy and cover-traffic protocols in Chapter 18.
Model memorization
Even the aggregated model weights can memorize individual training samples. Addressed by DP-SGD (Chapter 9) and the PIR-style mechanisms of Part IV when applied to model serving.
ex-ch01-09
Medium. Consider a coded scheme with recovery threshold $k = 8$ for an iteration that uses $n = 10$ workers, each with i.i.d. exponential task times of rate $\mu$. Compute $\mathbb{E}[X_{(8)}]$ and compare with $\mathbb{E}[X_{(10)}]$.
Use $\mathbb{E}[X_{(k)}] = \frac{H_n - H_{n-k}}{\mu}$.
With recovery
$\mathbb{E}[X_{(8)}] = \frac{H_{10} - H_2}{\mu} = \frac{2.93 - 1.5}{\mu} \approx \frac{1.43}{\mu}$.
Without recovery
$\mathbb{E}[X_{(10)}] = \frac{H_{10}}{\mu} \approx \frac{2.93}{\mu}$.
Ratio
$\frac{\mathbb{E}[X_{(10)}]}{\mathbb{E}[X_{(8)}]} \approx \frac{2.93}{1.43} \approx 2.05$. Tolerating just 2 stragglers cuts the iteration latency by roughly a factor of $2$. The cost, of course, is the storage/computation redundancy that Chapter 5 will quantify.
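The comparison can be reproduced numerically (illustrative $\mu = 1$):

```python
from math import fsum

def expected_kth(n, k, mu=1.0):
    """E[X_(k)] = (H_n - H_{n-k}) / mu for i.i.d. Exp(mu) task times."""
    return fsum(1 / i for i in range(n - k + 1, n + 1)) / mu

with_recovery = expected_kth(10, 8)      # wait for the fastest 8 of 10, ~1.43
without_recovery = expected_kth(10, 10)  # wait for all 10, ~2.93
print(with_recovery, without_recovery, without_recovery / with_recovery)
```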
ex-ch01-10
Medium. Describe a threat model where an honest-but-curious parameter server can be implemented with no privacy-preserving protocol at all, and explain what has to be true of the deployment for this to be defensible.
Think about who the server is and what legal / contractual guarantees exist.
When does an attacker not have access to the server?
Within a single organization
If all users are employees of one organization, the parameter server is run by that organization's SRE team, and the data is not regulated externally (HIPAA, GDPR, export controls), then the server is not an adversary — it is part of the trusted computing base. In this setting no privacy-preserving protocol is required, because there is no adversary to defend against.
When the defense breaks
The defense fails the moment any of the following becomes true: (i) users include external parties, (ii) the server is subject to a court order or subpoena, (iii) the server can be compromised by an attacker, (iv) the server's personnel are not contractually bound to the data owner. Each of these is the threat model addressed in Chapters 10–12.
ex-ch01-11
Medium. Plot (analytically) the ratio $\mathbb{E}[X_{(n)}]/\mathbb{E}[X_1]$ as a function of $n$ for i.i.d. exponential task times. What does this tell you about the "price of synchrony"?
$\mathbb{E}[X_{(n)}] = H_n/\mu$, $\mathbb{E}[X_1] = 1/\mu$.
Compute ratio
$\frac{\mathbb{E}[X_{(n)}]}{\mathbb{E}[X_1]} = H_n \approx \ln n + \gamma$. For $n = 10$: $H_{10} \approx 2.93$. For $n = 100$: $H_{100} \approx 5.19$. For $n = 10\,000$: $H_{10\,000} \approx 9.79$.
Takeaway
The price of synchrony is sub-linear ($H_n = \Theta(\log n)$), but still substantial: a cluster of 10 000 workers pays almost ten times its per-worker mean time on every synchronous barrier. This is one of the central empirical reasons coded computing exists.
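The three values, with the $\ln n + \gamma$ approximation alongside:

```python
from math import fsum, log

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def harmonic(n):
    return fsum(1 / i for i in range(1, n + 1))

for n in (10, 100, 10_000):
    # exact harmonic number vs. its ln(n) + gamma approximation
    print(n, round(harmonic(n), 4), round(log(n) + GAMMA, 4))
```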
ex-ch01-12
Hard. Show that for arbitrary i.i.d. task-time distributions (not necessarily exponential) with finite variance, the expected latency $\mathbb{E}[X_{(n)}]$ is monotone non-decreasing in $n$. Find a counterexample to the claim "$\mathbb{E}[X_{(n+1)}] - \mathbb{E}[X_{(n)}]$ is decreasing in $n$".
Use a monotone coupling.
Consider heavy-tailed distributions.
Monotone coupling
Couple the $n$-sample system to the $(n+1)$-sample system by letting the first $n$ samples coincide. Then $\max(X_1,\dots,X_n) \le \max(X_1,\dots,X_{n+1})$ pointwise, hence $\mathbb{E}[X_{(n:n)}] \le \mathbb{E}[X_{(n+1:n+1)}]$.
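The coupling argument is mechanical to demonstrate: feed both systems one shared stream of draws and watch the running maximum (Pareto draws here are illustrative; the argument is distribution-free):

```python
import random

random.seed(0)
# Shared randomness couples every n-worker system to the (n+1)-worker one:
# the larger system sees the same draws plus one extra.
times = [random.paretovariate(2.5) for _ in range(1000)]
running_max = [max(times[: n + 1]) for n in range(len(times))]
# Pointwise monotone, hence monotone in expectation.
print(all(a <= b for a, b in zip(running_max, running_max[1:])))  # True
```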
Counter-example for the increment claim
For a Pareto distribution with shape parameter $\alpha$ just above $2$ (finite variance but a heavy tail), the increments $\mathbb{E}[X_{(n+1)}] - \mathbb{E}[X_{(n)}]$ can be non-monotone: they may grow in $n$ before shrinking, reflecting the heavy-tailed nature of the distribution. The intuition is that adding a new worker has a non-trivial chance of producing a very large outlier that dominates the order statistic.
ex-ch01-13
Hard. A distributed-training protocol needs to tolerate $t$ Byzantine workers (who send adversarially crafted outputs). Argue informally that any such protocol must have communication complexity at least $\Omega\big((N + t)\,d\big)$ in the large-$N$ regime. (A precise proof requires tools from Chapter 11.)
The plaintext baseline costs $N d$ scalars.
Each Byzantine worker's claim must be checkable, which requires additional scalars of redundancy.
Plaintext baseline
Section 1.3's aggregation-cost bound gives $N d$ scalars of uplink even without privacy or robustness.
Byzantine overhead
To detect (or at least bound) the corruption introduced by $t$ Byzantine workers, the protocol must receive enough redundant information to localize the corruption. Any linear-algebraic argument (e.g., Reed–Solomon error correction over gradient-sized vectors) requires at least $2t$ redundant scalars per coordinate, hence $\Omega(t d)$ in total.
Sum and interpret
Adding baseline and overhead gives $\Omega(N d + t d) = \Omega\big((N + t) d\big)$. Chapter 11 makes this precise and characterizes the achievable rate region.
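The resulting lower bound is a two-term sum; a sketch of the count (the $2t$ redundancy factor follows the Reed-Solomon-style argument, and the concrete values are illustrative):

```python
def min_uplink_scalars(N, t, d):
    """Informal lower bound on uplink scalars: N*d for the plaintext
    aggregate plus 2*t*d of redundancy to localize t corruptions."""
    return N * d + 2 * t * d

# illustrative: 100 workers, 5 Byzantine, a 10^6-parameter model
print(min_uplink_scalars(100, 5, 10**6))  # 110000000
```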
ex-ch01-14
Hard. Suppose you deploy a federated-learning service with $N = 10^6$ users, model size $d = 10^9$ parameters, 32-bit gradients, and a training budget of 100 GB of total per-round uplink (aggregate across users). What is the largest fraction of the model gradient that each user can upload? Propose a sparsification scheme consistent with that budget and discuss the convergence implications.
Per-user upload budget = total / $N$.
Fraction = budget / ($d \cdot b$).
Per-user budget
Total $= 100$ GB $= 8 \times 10^{11}$ bits. Divide by $N = 10^6$ users: $8 \times 10^5$ bits/user/round.
Fraction
Model $= d \cdot b = 10^9 \times 32 = 3.2 \times 10^{10}$ bits. Fraction $= \frac{8 \times 10^5}{3.2 \times 10^{10}} = 2.5 \times 10^{-5}$, about 25 000 out of $10^9$ coordinates.
Sparsification
Top-$k$ with $k = 25\,000$ keeps the largest-magnitude 0.0025% of gradient entries each round. Convergence remains provable under standard error-feedback techniques (Stich et al., 2018), at the cost of a slowdown in round count that grows as the kept fraction $k/d$ shrinks. The combined wall-clock cost (uplink × rounds) is actually smaller than plaintext at these parameters, illustrating why compression is standard in production FL.
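The budget arithmetic, assuming $10^6$ users and a $10^9$-parameter model at 32 bits (values consistent with the surviving 100-GB, 0.0025%, and 25 000-coordinate figures):

```python
N = 10**6               # users (assumed)
d = 10**9               # model parameters (assumed)
b = 32                  # bits per gradient entry
total_bits = 100 * 8e9  # 100 GB aggregate uplink budget, in bits

per_user_bits = total_bits / N      # 800000.0 bits per user per round
fraction = per_user_bits / (d * b)  # 2.5e-05 of the gradient
print(per_user_bits, fraction, round(fraction * d))  # k = 25000 coordinates
```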
ex-ch01-15
Challenge. Consider a hybrid scheme: $n$ workers, recovery threshold $k$, i.i.d. exponential task times with rate $\mu$, plus a limited asynchronous buffer that can absorb one worker's response one round early (i.e., the first-to-finish worker of the current round pre-starts its next round's work immediately). Derive the expected iteration latency and show where the scheme sits relative to the synchronous $k$-of-$n$ latency.
Write the iteration time as two nested order statistics.
Use independence across rounds, which follows from memorylessness.
Decompose
Let $T_{\text{sync}}$ be the latency of a pure synchronous $k$-of-$n$ round: $\mathbb{E}[T_{\text{sync}}] = \frac{H_n - H_{n-k}}{\mu}$. The hybrid scheme saves the head start that the buffer grants the first finisher of the previous round, which begins its next task at time $X_{(1)}$ instead of at the round barrier. By memorylessness this saving can be computed independently across rounds.
Expected savings
The first finisher's head start is $T_{\text{sync}} - X_{(1)}$, with $\mathbb{E}[X_{(1)}] = \frac{1}{n\mu}$. Amortized over many rounds this gives a per-round saving of at most $\mathbb{E}[T_{\text{sync}}] - \frac{1}{n\mu}$, realized only when the pre-started worker would otherwise have been among the $k$ responses the next round waits for.
Position
For moderate $k$ the hybrid scheme sits between the fully synchronous $k$-of-$n$ latency (an upper bound) and fully asynchronous operation (which eliminates waiting at the price of staleness). It is roughly equivalent to using a slightly smaller recovery threshold $k' < k$: buying latency at the cost of coded-storage overhead.
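The two quantities in the derivation can be sketched numerically (illustrative $n = 10$, $k = 8$, $\mu = 1$; the full hybrid dynamics are not simulated here):

```python
from math import fsum

def sync_latency(n, k, mu=1.0):
    """E[T_sync] = (H_n - H_{n-k}) / mu: wait for the fastest k of n."""
    return fsum(1 / i for i in range(n - k + 1, n + 1)) / mu

n, k, mu = 10, 8, 1.0
t_sync = sync_latency(n, k, mu)
head_start = t_sync - 1 / (n * mu)  # mean head start of the first finisher
print(t_sync, head_start)           # per-round saving is at most head_start
```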