Exercises
ex-ch09-01
Easy. Distinguish between federated learning and data-center distributed SGD. Give three differences.
Three differences
- Data location. Data-center SGD: data shuffled across workers. FL: data stays on user devices.
- Participation. Data-center: all workers participate every iteration. FL: a subset of users ($K \ll N$) per round.
- Heterogeneity. Data-center: homogeneous GPUs, fast network. FL: heterogeneous devices, slow / unreliable uplinks.
ex-ch09-02
Easy. A FedAvg deployment has $N = 1000$ users, $K = 100$ selected per round, $E = 5$ local epochs, model size $d = 10^7$ scalars, and $T = 200$ rounds. Compute the aggregate uplink traffic.
Hint: per round, (selected users) × (scalars per model) × 32 bits.
Per-round
$K = 100$ users upload. $100 \times 10^7 \times 32 = 3.2 \times 10^{10}$ bits per round = 4 GB.
Over training
$T = 200$ rounds × 4 GB = 800 GB aggregate uplink. Per-user average: $800 / 1000 = 0.8$ GB per user over the full training (some users are selected more often, others less). Note that $E$ does not enter: each selected user uploads once per round regardless of the number of local epochs.
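A quick check of this arithmetic (a minimal sketch; the values of $N$, $K$, $d$, $T$ are the ones assumed above):

```python
# Uplink accounting for the FedAvg deployment above.
N, K, d, T = 1_000, 100, 10**7, 200   # users, selected/round, scalars, rounds

bits_per_round = K * d * 32           # each selected user uploads d float32 scalars
gb_per_round = bits_per_round / 8 / 1e9
total_gb = T * gb_per_round
per_user_gb = total_gb / N            # average over the whole population

print(f"{gb_per_round:.1f} GB/round, {total_gb:.0f} GB total, {per_user_gb:.2f} GB/user")
# -> 4.0 GB/round, 800 GB total, 0.80 GB/user
```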
ex-ch09-03
Easy. Why does FedAvg save communication compared to vanilla distributed SGD? What is the savings factor?
Hint: $E$ local epochs per round.
Mechanism
FedAvg replaces per-step gradient exchanges with a single model-update exchange. Each round performs $E$ epochs of local SGD but requires only one upload.
Savings factor
An $E\times$ reduction in the number of communication rounds needed (on i.i.d. data) gives an $E\times$ total communication savings. On non-IID data, savings are limited by client drift.
ex-ch09-04
Easy. Why is 8-bit quantization not considered a privacy mechanism?
Compression vs. privacy
Quantization maps each scalar to a finite alphabet, saving bandwidth. The quantized gradient is still statistically informative about the underlying data: gradient-inversion attacks recover the training sample with only minor quality degradation.
What gives privacy
True privacy requires either information-theoretic masking (secure aggregation) or randomization (differential privacy). Mere bandwidth reduction does not suffice.
ex-ch09-05
Medium. State and briefly prove: for i.i.d. user data, FedAvg with $E$ local epochs achieves $O(1/T)$ convergence on a $\mu$-strongly-convex objective.
Hint: use the Li et al. 2020 Thm. 1 structure.
Statement
Under $\mu$-strong convexity, $L$-smoothness, bounded gradient variance, and decaying step sizes $\eta_t = O(1/t)$, FedAvg with $E$ local epochs on i.i.d. data satisfies $\mathbb{E}[F(\bar{w}_T)] - F^* = O(1/T)$.
Proof sketch
FedAvg's update is an unbiased, bounded-variance estimator of the centralized SGD step. The effective variance includes an $E$-dependent term from the local steps; on i.i.d. data, this variance is bounded. Standard SGD analysis with bounded variance then gives the $O(1/T)$ rate.
Interpretation
On i.i.d. data, local epochs are "free" in terms of convergence rate (only constants change). On non-IID data, the heterogeneity term enters and constrains $E$.
ex-ch09-06
Medium. Derive the variance floor of stochastic $b$-bit quantization on a gradient scalar with magnitude bounded by $G$.
Hint: a uniform quantizer with level spacing $\Delta$ has error at most $\Delta/2$ per scalar.
Uniform quantizer
Each scalar gets rounded to one of $2^b$ evenly spaced levels in $[-G, G]$. Level spacing $\Delta = 2G/(2^b - 1)$. The per-scalar quantization error is uniform in $[-\Delta/2, \Delta/2]$; variance $\Delta^2/12 = G^2/(3(2^b - 1)^2)$.
Total across $d$ scalars
Independent errors across scalars: total variance $d\,\Delta^2/12 = dG^2/(3(2^b - 1)^2)$.
Convergence implication
SGD with quantized gradients has a variance floor proportional to this quantity. For $b = 8$, the per-scalar floor is $G^2/(3 \cdot 255^2) \approx 5 \times 10^{-6}\,G^2$, so the floor is dominated by inherent stochastic-gradient noise. For $b = 1$, it is $G^2/3$, which is significant.
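An empirical check of the $\Delta^2/12$ per-scalar variance (a minimal sketch; the bound $G$, bit-width $b$, and round-to-nearest error model are the assumptions above):

```python
import numpy as np

rng = np.random.default_rng(0)
G, b, n = 1.0, 4, 1_000_000            # magnitude bound, bits, sample count
delta = 2 * G / (2**b - 1)             # level spacing over [-G, G]

x = rng.uniform(-G, G, size=n)         # gradient scalars with |x| <= G
q = np.round((x + G) / delta) * delta - G   # round to the nearest level
print(f"empirical {np.var(x - q):.3e} vs. Delta^2/12 = {delta**2 / 12:.3e}")
# the two agree closely, confirming the per-scalar variance floor
```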
ex-ch09-07
Medium. Implement pseudocode for Top-$k$ SGD with error feedback. Explain why error feedback is needed for convergence.
Hint: see Algorithm 9.3.1 in Section 9.3.
Pseudocode
See §9.3 Algorithm: accumulate discarded entries in $e$; add $e$ to the next gradient; sparsify; update $e$ with the newly discarded entries. A runnable sketch follows.
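A minimal runnable sketch of that loop (assumed NumPy implementation; the quadratic toy objective and step counts are illustrative):

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def topk_sgd_ef(grad_fn, w, k, lr=0.1, steps=300):
    """Top-k SGD with error feedback."""
    e = np.zeros_like(w)                # error buffer of discarded mass
    for _ in range(steps):
        g = grad_fn(w) + e              # add back previously discarded entries
        g_sparse = topk_sparsify(g, k)  # transmit only k entries
        e = g - g_sparse                # accumulate the newly discarded entries
        w = w - lr * g_sparse
    return w

# Toy usage: minimize ||w - w_star||^2 / 2, whose gradient is w - w_star.
w_star = np.arange(10.0)
w = topk_sgd_ef(lambda w: w - w_star, np.zeros(10), k=2)
print(np.round(w, 2))  # approaches w_star despite 80% sparsification
```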
Why error feedback
Without error feedback, the discarded entries are permanently lost. The sparsified gradient is a biased estimate of the true gradient — SGD converges to a biased optimum. Error feedback ensures all gradient entries are eventually transmitted (just with delay), recovering asymptotic unbiasedness.
Convergence
Stich et al. 2018 prove that Top- with error feedback matches baseline SGD convergence rate on strongly-convex problems. The proof uses the error buffer's bounded norm to control the bias.
ex-ch09-08
Medium. For a model with $d = 10^8$ parameters, compute the communication savings from (a) 4-bit quantization, (b) top-1% sparsification, (c) both combined.
(a) 4-bit quantization
Savings: $32/4 = 8\times$.
(b) Top-1% sparsification
Keep $0.01d = 10^6$ entries. Indexing overhead: 32 bits per kept entry. Total: $0.01d \times (32 + 32) = 0.64d$ bits vs. $32d$ original. Savings: $50\times$.
(c) Combined
Top-1%, then 4-bit quantize the kept values: $0.01d \times (4 + 32) = 0.36d$ bits vs. $32d$. Savings: $\approx 89\times$.
Observation
Composition is multiplicative in effect, but the fixed indexing overhead (32 bits per kept entry) limits the combined savings as sparsification becomes aggressive.
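The same bookkeeping in code (a minimal sketch; the 32-bit index width is the assumption above):

```python
d = 10**8                 # model parameters
FP, Q, IDX = 32, 4, 32    # full-precision, quantized, and index bits
KEEP = 0.01               # top-1% sparsification

baseline = d * FP
schemes = {
    "4-bit":  d * Q,                   # (a) quantize every entry
    "top-1%": KEEP * d * (FP + IDX),   # (b) 32-bit value + 32-bit index per kept entry
    "both":   KEEP * d * (Q + IDX),    # (c) 4-bit value + 32-bit index per kept entry
}
for name, bits in schemes.items():
    print(f"{name}: {baseline / bits:.1f}x savings")
# -> 4-bit: 8.0x, top-1%: 50.0x, both: 88.9x
```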
ex-ch09-09
Medium. Explain the DLG (Deep Leakage from Gradients) attack and why it succeeds on single-sample gradients.
Hint: solve $\hat{x} = \arg\min_{x'} \|\nabla_w \ell(x', w) - g\|^2$.
Attack
Given the weights $w$, the observed gradient $g = \nabla_w \ell(x, y; w)$, and the loss $\ell$, minimize $\|\nabla_w \ell(x', y'; w) - g\|^2$ over the dummy pair $(x', y')$ via gradient descent.
Why it succeeds
The gradient is a deterministic function of $(x, y)$. For generic models, this function is (locally) invertible. The optimization landscape has a unique global minimum at $(x', y') = (x, y)$.
Batch-size effect
With batch size $B > 1$, the target is the averaged gradient $\frac{1}{B}\sum_{i=1}^{B} \nabla_w \ell(x_i, y_i; w)$: $B$ superimposed signals. The attack must jointly reconstruct all $B$ samples. For small $B$ (up to $\approx 8$), this works; for large $B$, reconstructions degrade.
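A minimal sketch of the attack loop in PyTorch, following the DLG recipe of matching gradients with L-BFGS (the tiny MLP, dimensions, and iteration count are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 4))
x, y = torch.randn(1, 8), torch.tensor([2])          # the "secret" sample

# The observed quantity: the per-sample gradient sent by the client.
true_grads = torch.autograd.grad(nn.functional.cross_entropy(model(x), y),
                                 model.parameters())

# DLG: optimize a dummy input and soft label so their gradient matches.
x_hat = torch.randn(1, 8, requires_grad=True)
y_hat = torch.randn(1, 4, requires_grad=True)
opt = torch.optim.LBFGS([x_hat, y_hat])

def closure():
    opt.zero_grad()
    dummy_loss = torch.sum(-torch.softmax(y_hat, -1)
                           * torch.log_softmax(model(x_hat), -1))
    grads = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
    match = sum(((g - t) ** 2).sum() for g, t in zip(grads, true_grads))
    match.backward()
    return match

for _ in range(50):
    opt.step(closure)
print(torch.norm(x_hat.detach() - x))  # should shrink: x_hat converges toward x
```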
ex-ch09-10
Medium. Explain why federated learning's "data stays on device" architecture is not sufficient for privacy.
Architectural promise
Data $x_k$ never leaves user $k$'s device. Only gradient updates $g_k$ are sent.
The gap
The update $g_k$ is a deterministic function of $x_k$ (and the model weights $w$). By the data processing inequality, $g_k$ can reveal at most as much information as $x_k$ itself, but "at most as much" can be all of it. Empirically, gradient inversion recovers training samples.
Implication
Raw data locality is not privacy. Privacy requires destroying information about $x_k$ in what leaves the device. Secure aggregation does this by masking; differential privacy does this by adding noise.
ex-ch09-11
Hard. Prove (informally) that FedAvg on non-IID data with large $E$ converges to a biased fixed point. Characterize the bias.
Hint: client drift, i.e., users' local optima differ.
Client drift
With $E$ local SGD epochs, user $k$ makes progress toward its local optimum $w_k^* = \arg\min_w F_k(w)$, not the global optimum $w^*$. For non-IID data, $w_k^* \neq w^*$.
Averaged update
FedAvg averages the user-local models: $\bar{w} = \sum_k p_k w_k$. As $E \to \infty$, $w_k \to w_k^*$, so $\bar{w} \to \sum_k p_k w_k^*$: the weighted average of user-local optima, not the global optimum.
Bias characterization
Bias scales with the heterogeneity, e.g. $\Gamma = \max_k \|w_k^* - w^*\|$. For IID data, $\Gamma \approx 0$; for highly non-IID data, $\Gamma$ is large. FedProx and SCAFFOLD address this by regularization and variance reduction, respectively.
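A toy simulation of the biased fixed point (a minimal sketch; the per-user quadratics $F_k(w) = \frac{a_k}{2}\|w - w_k^*\|^2$ with unequal curvatures $a_k$ are an assumption chosen so the bias is exactly computable):

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim, E, lr, rounds = 5, 3, 200, 0.05, 50
w_loc = rng.normal(size=(K, dim))      # heterogeneous local optima w_k*
a = rng.uniform(0.5, 2.0, size=K)      # per-user curvatures
p = np.full(K, 1 / K)                  # uniform user weights

w_global = (p * a) @ w_loc / (p @ a)   # true minimizer of sum_k p_k F_k

w = np.zeros(dim)
for _ in range(rounds):
    local = []
    for k in range(K):
        wk = w.copy()
        for _ in range(E):             # E local steps on F_k
            wk -= lr * a[k] * (wk - w_loc[k])
        local.append(wk)
    w = np.average(local, axis=0, weights=p)   # FedAvg aggregation

print("distance to global optimum:  ", np.linalg.norm(w - w_global))
print("distance to avg local optima:", np.linalg.norm(w - p @ w_loc))
# The second distance is ~0: with large E, FedAvg lands on the average of
# the local optima, which is biased away from the global optimum.
```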
ex-ch09-12
Hard. Compute the aggregate uplink traffic of a realistic FL deployment: $N = 10^6$ users, 1% participation per round ($K = 10^4$ selected), model size $d = 10^9$ scalars (a small foundation model), 8-bit quantization, top-1% sparsification, $T = 1000$ rounds.
Per-round per-user
Selected users: $K = 0.01N = 10^4$. Quantized: $8d = 8 \times 10^9$ bits per full gradient. Sparsified to 1%: $0.01d = 10^7$ kept entries, each $8 + 32 = 40$ bits (value + index), total $4 \times 10^8$ bits = 50 MB per user per round.
Per-round aggregate
$10^4$ users × 50 MB = 500 GB per round.
Over training
$T = 1000$ rounds × 500 GB = 500 TB aggregate uplink across all selected users.
Remarks
For a $10^6$-user population with 1% participation (= 10,000 active per round), the aggregate traffic is substantial even with aggressive compression. Adding secure aggregation (Chapter 10) roughly doubles the cost (pairwise masks). Per-user budget: 50 MB per round × ~10 rounds per user = 500 MB total. Feasible for Wi-Fi, challenging for 4G.
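The full computation in one place (a minimal sketch; $d = 10^9$ and the 32-bit index width are the assumptions above):

```python
N, C, d, T = 10**6, 0.01, 10**9, 1000    # users, participation, params, rounds
VAL, IDX, KEEP = 8, 32, 0.01             # 8-bit values, 32-bit indices, top-1%

K = int(C * N)                                 # selected users per round
mb_per_user = KEEP * d * (VAL + IDX) / 8 / 1e6 # -> 50 MB per user per round
gb_per_round = K * mb_per_user / 1e3           # -> 500 GB per round
tb_total = T * gb_per_round / 1e3              # -> 500 TB aggregate
mb_lifetime = (K * T / N) * mb_per_user        # ~10 selections -> 500 MB per user

print(mb_per_user, gb_per_round, tb_total, mb_lifetime)
```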
ex-ch09-13
Hard. Design an FL experiment to empirically verify the gradient-inversion attack. Specify the dataset, architecture, and evaluation metric.
Setup
Dataset: CIFAR-10 or ImageNet-100 (subset). Architecture: ResNet-18 or LeNet-5 (smaller for speed). Training sample: randomly select one image from the dataset.
Attack
Given the per-sample gradient, initialize a random image. Minimize the gradient-matching loss $\|\nabla_w \ell(x', w) - g\|^2$ via Adam (learning rate e.g. $0.1$, 500 iterations). Include a regularizer to encourage valid pixel values (e.g., total-variation regularization).
Evaluation
Compare the reconstruction $\hat{x}$ to the original sample $x$. Metrics:
- Pixel-level MSE or PSNR.
- Cosine similarity in feature space (via a pre-trained classifier).
- Visual inspection.
For LeNet on CIFAR-10, expect PSNR >30 dB and near-pixel-perfect reconstruction.
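A minimal sketch of the PSNR metric for images scaled to $[0, 1]$ (assumed NumPy helper):

```python
import numpy as np

def psnr(x, x_hat, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images in [0, max_val]."""
    mse = np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
    return 10 * np.log10(max_val**2 / mse)

# e.g. psnr(original, reconstruction) > 30 indicates near-pixel-perfect recovery
```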
Extensions
Increase the batch size $B$ and measure reconstruction quality. Plot PSNR vs. $B$: a direct empirical verification of the §9.4 theorem.
ex-ch09-14
Hard. Analyze the tension between compression and privacy: argue that aggressive compression can actually hurt privacy because it makes the gradient distinguish training samples more rather than less.
Naive intuition
Aggressive compression destroys information; therefore it destroys privacy-sensitive information too, and should improve privacy.
Counter-intuition
Compression keeps the largest-magnitude entries (top-$k$ sparsification) or the sign (1-bit quantization). These high-information features are precisely the ones most correlated with the training sample. The low-magnitude entries, which do little to distinguish samples, are discarded.
Result
The compressed gradient may be more informative per-bit about the sample than the full gradient. Empirical evidence: 1-bit SignSGD gradients still allow inversion, sometimes with higher efficiency than full gradients (fewer optimization iterations needed for the attacker).
Policy implication
Compression-as-privacy is not just insufficient — it can be counterproductive. Always pair compression with explicit privacy mechanisms (secure aggregation, DP).
ex-ch09-15
Challenge. Open problem: derive an information-theoretic lower bound on the leakage of gradients in FL as a function of model architecture (depth, width), loss function, and batch size. Characterize when gradient-inversion reconstruction becomes information-theoretically infeasible.
Hint: think mutual information between the gradient and the sample.
Setup
Measure leakage as the mutual information $I(x; g)$. For full leakage, $H(x \mid g) = 0$ (no ambiguity after seeing the gradient). For zero leakage, $I(x; g) = 0$ (the gradient is uninformative).
Single-sample case
For a deterministic loss and architecture, the gradient is a deterministic function of $x$. If this map is invertible, $H(x \mid g) = 0$: full leakage. Non-invertibility arises for some simple models (e.g., linear regression in regimes where the gradient map is many-to-one), giving partial leakage there.
Batch-size effect
Consider $I(x_1, \dots, x_B;\, \bar{g})$, the mutual information between the samples and their mean gradient. The averaged gradient has $d$ degrees of freedom; the joint sample information occupies $B \cdot \dim(x)$. When $B \cdot \dim(x) > d$, leakage is necessarily reduced. For large $B$, the aggregate gradient is an "insufficient statistic" of the samples.
Open characterization
The exact leakage as a function of depth, width, loss, and $B$ is open. Empirical studies give operating points. Chapter 18 lists this as one of the open problems for privacy-aware FL. The CommIT-group contributions of Chapters 10–12 provide operational protocols, but the information-theoretic lower bound on leakage remains a research frontier.