Exercises
ex-ch09-01
Easy. Distinguish between federated learning and data-center distributed SGD. Give three differences.
Three differences
- Data location. Data-center SGD: data shuffled across workers. FL: data stays on user devices.
- Participation. Data-center: all workers participate every iteration. FL: a subset of users ($K \ll N$) per round.
- Heterogeneity. Data-center: homogeneous GPUs, fast network. FL: heterogeneous devices, slow / unreliable uplinks.
ex-ch09-02
Easy. A FedAvg deployment has $N = 1000$ users, $K = 100$ selected per round, $E = 5$ local epochs, model size $d = 10^7$ scalars, and $T = 200$ rounds. Compute the aggregate uplink traffic.
Hint: per round, (selected users) × (scalars per model) × 32 bits.
Per-round
$K = 100$ users upload. $100 \times 10^7 \times 32 = 3.2 \times 10^{10}$ bits per round = 4 GB.
Over training
$T = 200$ rounds × 4 GB = 800 GB aggregate uplink. Per-user average: $800 / 1000 = 0.8$ GB per user over the full training (some users are selected more often, others less). Note that $E$ does not enter: each selected user uploads once per round regardless of the number of local epochs.
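A quick check of this arithmetic (a minimal sketch; the values of $N$, $K$, $d$, $T$ are the ones assumed above):

```python
# Uplink accounting for the FedAvg deployment above.
N, K, d, T = 1_000, 100, 10**7, 200   # users, selected/round, scalars, rounds

bits_per_round = K * d * 32           # each selected user uploads d float32 scalars
gb_per_round = bits_per_round / 8 / 1e9
total_gb = T * gb_per_round
per_user_gb = total_gb / N            # average over the whole population

print(f"{gb_per_round:.1f} GB/round, {total_gb:.0f} GB total, {per_user_gb:.2f} GB/user")
# -> 4.0 GB/round, 800 GB total, 0.80 GB/user
```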
ex-ch09-03
Easy. Why does FedAvg save communication compared to vanilla distributed SGD? What is the savings factor?
Hint: $E$ local epochs per round.
Mechanism
FedAvg replaces per-step gradient exchanges with a single model-update exchange. Each round performs $E$ epochs of local SGD but requires only one upload.
Savings factor
An $E\times$ reduction in the number of communication rounds needed (on i.i.d. data) gives an $E\times$ total communication savings. On non-IID data, savings are limited by client drift.
ex-ch09-04
Easy. Why is 8-bit quantization not considered a privacy mechanism?
Compression vs. privacy
Quantization maps each scalar to a finite alphabet, saving bandwidth. The quantized gradient is still statistically informative about the underlying data: gradient-inversion attacks recover the training sample with only minor quality degradation.
What gives privacy
True privacy requires either information-theoretic masking (secure aggregation) or randomization (differential privacy). Mere bandwidth reduction does not suffice.
ex-ch09-05
Medium. State and briefly prove: for i.i.d. user data, FedAvg with $E$ local epochs achieves $O(1/T)$ convergence on a $\mu$-strongly-convex objective.
Hint: use the Li et al. 2020 Thm. 1 structure.
Statement
Under $\mu$-strong convexity, $L$-smoothness, bounded gradient variance, and decaying step sizes $\eta_t = O(1/t)$, FedAvg with $E$ local epochs on i.i.d. data satisfies $\mathbb{E}[F(\bar{w}_T)] - F^* = O(1/T)$.
Proof sketch
FedAvg's update is an unbiased, bounded-variance estimator of the centralized SGD step. The effective variance includes an $E$-dependent term from the local steps; on i.i.d. data, this variance is bounded. Standard SGD analysis with bounded variance then gives the $O(1/T)$ rate.
Interpretation
On i.i.d. data, local epochs are "free" in terms of convergence rate (only constants change). On non-IID data, the heterogeneity term enters and constrains $E$.
ex-ch09-06
Medium. Derive the variance floor of stochastic $b$-bit quantization on a gradient scalar with magnitude bounded by $G$.
Hint: a uniform quantizer with level spacing $\Delta$ has error at most $\Delta/2$ per scalar.
Uniform quantizer
Each scalar gets rounded to one of $2^b$ evenly spaced levels in $[-G, G]$. Level spacing $\Delta = 2G/(2^b - 1)$. The per-scalar quantization error is uniform in $[-\Delta/2, \Delta/2]$; variance $\Delta^2/12 = G^2/(3(2^b - 1)^2)$.
Total across $d$ scalars
Independent errors across scalars: total variance $d\,\Delta^2/12 = dG^2/(3(2^b - 1)^2)$.
Convergence implication
SGD with quantized gradients has a variance floor proportional to this quantity. For $b = 8$, the per-scalar floor is $G^2/(3 \cdot 255^2) \approx 5 \times 10^{-6}\,G^2$, so the floor is dominated by inherent stochastic-gradient noise. For $b = 1$, it is $G^2/3$, which is significant.
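An empirical check of the $\Delta^2/12$ per-scalar variance (a minimal sketch; the bound $G$, bit-width $b$, and round-to-nearest error model are the assumptions above):

```python
import numpy as np

rng = np.random.default_rng(0)
G, b, n = 1.0, 4, 1_000_000            # magnitude bound, bits, sample count
delta = 2 * G / (2**b - 1)             # level spacing over [-G, G]

x = rng.uniform(-G, G, size=n)         # gradient scalars with |x| <= G
q = np.round((x + G) / delta) * delta - G   # round to the nearest level
print(f"empirical {np.var(x - q):.3e} vs. Delta^2/12 = {delta**2 / 12:.3e}")
# the two agree closely, confirming the per-scalar variance floor
```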
ex-ch09-07
Medium. Implement pseudocode for Top-$k$ SGD with error feedback. Explain why error feedback is needed for convergence.
Hint: see Algorithm 9.3.1 in Section 9.3.
Pseudocode
See §9.3 Algorithm: accumulate discarded entries in $e$; add $e$ to the next gradient; sparsify; update $e$ with the newly discarded entries. A runnable sketch follows.
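A minimal runnable sketch of that loop (assumed NumPy implementation; the quadratic toy objective and step counts are illustrative):

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def topk_sgd_ef(grad_fn, w, k, lr=0.1, steps=300):
    """Top-k SGD with error feedback."""
    e = np.zeros_like(w)                # error buffer of discarded mass
    for _ in range(steps):
        g = grad_fn(w) + e              # add back previously discarded entries
        g_sparse = topk_sparsify(g, k)  # transmit only k entries
        e = g - g_sparse                # accumulate the newly discarded entries
        w = w - lr * g_sparse
    return w

# Toy usage: minimize ||w - w_star||^2 / 2, whose gradient is w - w_star.
w_star = np.arange(10.0)
w = topk_sgd_ef(lambda w: w - w_star, np.zeros(10), k=2)
print(np.round(w, 2))  # approaches w_star despite 80% sparsification
```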
Why error feedback
Without error feedback, the discarded entries are permanently lost. The sparsified gradient is a biased estimate of the true gradient — SGD converges to a biased optimum. Error feedback ensures all gradient entries are eventually transmitted (just with delay), recovering asymptotic unbiasedness.
Convergence
Stich et al. 2018 prove that Top- with error feedback matches baseline SGD convergence rate on strongly-convex problems. The proof uses the error buffer's bounded norm to control the bias.
ex-ch09-08
Medium. For a model with $d = 10^8$ parameters, compute the communication savings from (a) 4-bit quantization, (b) top-1% sparsification, (c) both combined.
(a) 4-bit quantization
Savings: $32/4 = 8\times$.
(b) Top-1% sparsification
Keep $0.01d = 10^6$ entries. Indexing overhead: 32 bits per kept entry. Total: $0.01d \times (32 + 32) = 0.64d$ bits vs. $32d$ original. Savings: $50\times$.
(c) Combined
Top-1%, then 4-bit quantize the kept values: $0.01d \times (4 + 32) = 0.36d$ bits vs. $32d$. Savings: $\approx 89\times$.
Observation
Composition is multiplicative in effect, but the fixed indexing overhead (32 bits per kept entry) limits the combined savings as sparsification becomes aggressive.
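The same bookkeeping in code (a minimal sketch; the 32-bit index width is the assumption above):

```python
d = 10**8                 # model parameters
FP, Q, IDX = 32, 4, 32    # full-precision, quantized, and index bits
KEEP = 0.01               # top-1% sparsification

baseline = d * FP
schemes = {
    "4-bit":  d * Q,                   # (a) quantize every entry
    "top-1%": KEEP * d * (FP + IDX),   # (b) 32-bit value + 32-bit index per kept entry
    "both":   KEEP * d * (Q + IDX),    # (c) 4-bit value + 32-bit index per kept entry
}
for name, bits in schemes.items():
    print(f"{name}: {baseline / bits:.1f}x savings")
# -> 4-bit: 8.0x, top-1%: 50.0x, both: 88.9x
```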
ex-ch09-09
Medium. Explain the DLG (Deep Leakage from Gradients) attack and why it succeeds on single-sample gradients.
Hint: solve $\hat{x} = \arg\min_{x'} \|\nabla_w \ell(x', w) - g\|^2$.
Attack
Given the weights $w$, the observed gradient $g = \nabla_w \ell(x, y; w)$, and the loss $\ell$, minimize $\|\nabla_w \ell(x', y'; w) - g\|^2$ over the dummy pair $(x', y')$ via gradient descent.
Why it succeeds
The gradient is a deterministic function of $(x, y)$. For generic models, this function is (locally) invertible. The optimization landscape has a unique global minimum at $(x', y') = (x, y)$.
Batch-size effect
With batch size $B > 1$, the target is the averaged gradient $\frac{1}{B}\sum_{i=1}^{B} \nabla_w \ell(x_i, y_i; w)$: $B$ superimposed signals. The attack must jointly reconstruct all $B$ samples. For small $B$ (up to $\approx 8$), this works; for large $B$, reconstructions degrade.
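A minimal sketch of the attack loop in PyTorch, following the DLG recipe of matching gradients with L-BFGS (the tiny MLP, dimensions, and iteration count are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 4))
x, y = torch.randn(1, 8), torch.tensor([2])          # the "secret" sample

# The observed quantity: the per-sample gradient sent by the client.
true_grads = torch.autograd.grad(nn.functional.cross_entropy(model(x), y),
                                 model.parameters())

# DLG: optimize a dummy input and soft label so their gradient matches.
x_hat = torch.randn(1, 8, requires_grad=True)
y_hat = torch.randn(1, 4, requires_grad=True)
opt = torch.optim.LBFGS([x_hat, y_hat])

def closure():
    opt.zero_grad()
    dummy_loss = torch.sum(-torch.softmax(y_hat, -1)
                           * torch.log_softmax(model(x_hat), -1))
    grads = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
    match = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
    match.backward()
    return match

for _ in range(50):
    opt.step(closure)
print(torch.norm(x_hat.detach() - x))  # should shrink: x_hat converges toward x
```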
ex-ch09-10
Medium. Explain why federated learning's "data stays on device" architecture is not sufficient for privacy.
Architectural promise
Data $x_k$ never leaves user $k$'s device. Only gradient updates $g_k$ are sent.
The gap
The update $g_k$ is a deterministic function of $x_k$ (and the model weights $w$). By the data processing inequality, $g_k$ can reveal at most as much information as $x_k$ itself, but "at most as much" can be all of it. Empirically, gradient inversion recovers training samples.
Implication
Raw data locality is not privacy. Privacy requires destroying information about $x_k$ in what leaves the device. Secure aggregation does this by masking; differential privacy does this by adding noise.
ex-ch09-11
Hard. Prove (informally) that FedAvg on non-IID data with large $E$ converges to a biased fixed point. Characterize the bias.
Hint: client drift, i.e., users' local optima differ.
Client drift
With $E$ local SGD epochs, user $k$ makes progress toward its local optimum $w_k^* = \arg\min_w F_k(w)$, not the global optimum $w^*$. For non-IID data, $w_k^* \neq w^*$.
Averaged update
FedAvg averages the user-local models: $\bar{w} = \sum_k p_k w_k$. As $E \to \infty$, $w_k \to w_k^*$, so $\bar{w} \to \sum_k p_k w_k^*$: the weighted average of user-local optima, not the global optimum.
Bias characterization
Bias scales with the heterogeneity, e.g. $\Gamma = \max_k \|w_k^* - w^*\|$. For IID data, $\Gamma \approx 0$; for highly non-IID data, $\Gamma$ is large. FedProx and SCAFFOLD address this by regularization and variance reduction, respectively.
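A toy simulation of the biased fixed point (a minimal sketch; the per-user quadratics $F_k(w) = \frac{a_k}{2}\|w - w_k^*\|^2$ with unequal curvatures $a_k$ are an assumption chosen so the bias is exactly computable):

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim, E, lr, rounds = 5, 3, 200, 0.05, 50
w_loc = rng.normal(size=(K, dim))      # heterogeneous local optima w_k*
a = rng.uniform(0.5, 2.0, size=K)      # per-user curvatures
p = np.full(K, 1 / K)                  # uniform user weights

w_global = (p * a) @ w_loc / (p @ a)   # true minimizer of sum_k p_k F_k

w = np.zeros(dim)
for _ in range(rounds):
    local = []
    for k in range(K):
        wk = w.copy()
        for _ in range(E):             # E local steps on F_k
            wk -= lr * a[k] * (wk - w_loc[k])
        local.append(wk)
    w = np.average(local, axis=0, weights=p)   # FedAvg aggregation

print("distance to global optimum:  ", np.linalg.norm(w - w_global))
print("distance to avg local optima:", np.linalg.norm(w - p @ w_loc))
# The second distance is ~0: with large E, FedAvg lands on the average of
# the local optima, which is biased away from the global optimum.
```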
ex-ch09-12
Hard. Compute the aggregate uplink traffic of a realistic FL deployment: $N = 10^6$ users, 1% participation per round ($K = 10^4$ selected), model size $d = 10^9$ scalars (a small foundation model), 8-bit quantization, top-1% sparsification, $T = 1000$ rounds.
Per-round per-user
Selected users: $K = 0.01N = 10^4$. Quantized: $8d = 8 \times 10^9$ bits per full gradient. Sparsified to 1%: $0.01d = 10^7$ kept entries, each $8 + 32 = 40$ bits (value + index), total $4 \times 10^8$ bits = 50 MB per user per round.
Per-round aggregate
$10^4$ users × 50 MB = 500 GB per round.
Over training
$T = 1000$ rounds × 500 GB = 500 TB aggregate uplink across all selected users.
Remarks
For a $10^6$-user population with 1% participation (= 10,000 active per round), the aggregate traffic is substantial even with aggressive compression. Adding secure aggregation (Chapter 10) roughly doubles the cost (pairwise masks). Per-user budget: 50 MB per round × ~10 rounds per user = 500 MB total. Feasible for Wi-Fi, challenging for 4G.
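The full computation in one place (a minimal sketch; $d = 10^9$ and the 32-bit index width are the assumptions above):

```python
N, C, d, T = 10**6, 0.01, 10**9, 1000    # users, participation, params, rounds
VAL, IDX, KEEP = 8, 32, 0.01             # 8-bit values, 32-bit indices, top-1%

K = int(C * N)                                 # selected users per round
mb_per_user = KEEP * d * (VAL + IDX) / 8 / 1e6 # -> 50 MB per user per round
gb_per_round = K * mb_per_user / 1e3           # -> 500 GB per round
tb_total = T * gb_per_round / 1e3              # -> 500 TB aggregate
mb_lifetime = (K * T / N) * mb_per_user        # ~10 selections -> 500 MB per user

print(mb_per_user, gb_per_round, tb_total, mb_lifetime)
```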
ex-ch09-13
Hard. Design an FL experiment to empirically verify the gradient-inversion attack. Specify the dataset, architecture, and evaluation metric.
Setup
Dataset: CIFAR-10 or ImageNet-100 (subset). Architecture: ResNet-18 or LeNet-5 (smaller for speed). Training sample: randomly select one image from the dataset.
Attack
Given the per-sample gradient, initialize a random image. Minimize the gradient-matching loss $\|\nabla_w \ell(x', w) - g\|^2$ via Adam (learning rate e.g. $0.1$, 500 iterations). Include a regularizer to encourage valid pixel values (e.g., total-variation regularization).
Evaluation
Compare the reconstruction $\hat{x}$ to the original sample $x$. Metrics:
- Pixel-level MSE or PSNR.
- Cosine similarity in feature space (via a pre-trained classifier).
- Visual inspection.
For LeNet on CIFAR-10, expect PSNR >30 dB and near-pixel-perfect reconstruction.
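A minimal sketch of the PSNR metric for images scaled to $[0, 1]$ (assumed NumPy helper):

```python
import numpy as np

def psnr(x, x_hat, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images in [0, max_val]."""
    mse = np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
    return 10 * np.log10(max_val**2 / mse)

# e.g. psnr(original, reconstruction) > 30 indicates near-pixel-perfect recovery
```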
Extensions
Increase the batch size $B$ and measure reconstruction quality. Plot PSNR vs. $B$: a direct empirical verification of the §9.4 theorem.
ex-ch09-14
Hard. Analyze the tension between compression and privacy: argue that aggressive compression can actually hurt privacy because it makes the gradient distinguish training samples more rather than less.
Naive intuition
Aggressive compression destroys information; therefore it destroys privacy-sensitive information too, and should improve privacy.
Counter-intuition
Compression keeps the largest-magnitude entries (top-$k$ sparsification) or the sign (1-bit quantization). These high-information features are precisely the ones most correlated with the training sample. The low-magnitude entries, which do little to distinguish samples, are discarded.
Result
The compressed gradient may be more informative per-bit about the sample than the full gradient. Empirical evidence: 1-bit SignSGD gradients still allow inversion, sometimes with higher efficiency than full gradients (fewer optimization iterations needed for the attacker).
Policy implication
Compression-as-privacy is not just insufficient — it can be counterproductive. Always pair compression with explicit privacy mechanisms (secure aggregation, DP).
ex-ch09-15
Challenge. Open problem: derive an information-theoretic lower bound on the leakage of gradients in FL as a function of model architecture (depth, width), loss function, and batch size. Characterize when gradient-inversion reconstruction becomes information-theoretically infeasible.
Hint: think mutual information between the gradient and the sample.
Setup
Measure leakage as the mutual information $I(x; g)$. For full leakage, $H(x \mid g) = 0$ (no ambiguity after seeing the gradient). For zero leakage, $I(x; g) = 0$ (the gradient is uninformative).
Single-sample case
For a deterministic loss and architecture, the gradient is a deterministic function of $x$. If this map is invertible, $H(x \mid g) = 0$: full leakage. Non-invertibility arises for some simple models (e.g., linear regression in regimes where the gradient map is many-to-one), giving partial leakage there.
Batch-size effect
Consider $I(x_1, \dots, x_B;\, \bar{g})$, the mutual information between the samples and their mean gradient. The averaged gradient has $d$ degrees of freedom; the joint sample information occupies $B \cdot \dim(x)$. When $B \cdot \dim(x) > d$, leakage is necessarily reduced. For large $B$, the aggregate gradient is an "insufficient statistic" of the samples.
Open characterization
The exact leakage as a function of depth, width, loss, and $B$ is open. Empirical studies give operating points. Chapter 18 lists this as one of the open problems for privacy-aware FL. The CommIT-group contributions of Chapters 10–12 provide operational protocols, but the information-theoretic lower bound on leakage remains a research frontier.