Convergence under Aggregation MSE

The Central Question: How Does MSE Cost Rounds?

Section §17.1 introduced the wireless-FL protocol. The aggregator's output is $\hat{\mathbf{G}}_t = \mathbf{G}_t + \mathbf{e}_t$, with per-round MSE $\mathrm{MSE}_{\mathrm{agg}}(t) = \mathbb{E}[\|\mathbf{e}_t\|^2]$. Bigger MSE should mean slower convergence, but how much slower?

The question is sharply answered by the standard SGD convergence analysis, adapted to our setting. For smooth and strongly convex FL losses, the convergence rate decomposes into (i) an exponentially decaying "deterministic" term and (ii) a persistent "noise floor" proportional to $\mathrm{MSE}_{\mathrm{agg}}$. The floor is the irreducible cost of noisy aggregation.

The point is that wireless-FL convergence is not about achieving zero error; it is about the error floor. Designers can choose to allocate power/bandwidth (to lower MSE), or to run more rounds (beating the floor with a decreasing learning rate). This section formalizes the trade-off.

Theorem 17.2.1: FedAvg Convergence Under Aggregation MSE

Assume the FL loss $F$ is $L$-smooth and $\mu$-strongly convex. Users have bounded gradient variance: $\mathbb{E}[\|\mathbf{g}_k^{(t)} - \nabla F(\boldsymbol{\theta}_t)\|^2] \leq \sigma_g^2$ for all $k$.

Run wireless-FL (Algorithm 17.1.1) with learning rate $\eta_{\text{lr}} \leq 1/L$, uniform scheduling $|\mathcal{S}_t| = n$, and per-round aggregation MSE $\mathrm{MSE}_{\mathrm{agg}}(t) \leq \mathrm{MSE}_{\mathrm{agg}}$. Then, after $T$ rounds,
$$\mathbb{E}[F(\boldsymbol{\theta}_T) - F^{\star}] \;\leq\; (1 - \eta_{\text{lr}}\mu)^T \cdot (F(\boldsymbol{\theta}_0) - F^{\star}) \;+\; \frac{\eta_{\text{lr}}}{2\mu}\left(\frac{\sigma_g^2}{n} + \frac{\mathrm{MSE}_{\mathrm{agg}}}{n^2}\right).$$

Interpretation. The first term decays exponentially: the FL loss improves toward the "noise floor" determined by the second term. Both the SGD variance ($\sigma_g^2/n$) and the aggregation MSE ($\mathrm{MSE}_{\mathrm{agg}}/n^2$) contribute to the floor.


Example: Convergence at Given MSE

A wireless-FL task has $L = 10$, $\mu = 1$, $\sigma_g^2 = 1$, $n = 50$ users, $\eta_{\text{lr}} = 0.1$, and initial loss gap $F(\boldsymbol{\theta}_0) - F^{\star} = 10$. Digital aggregation gives $\mathrm{MSE}_{\mathrm{agg}} = 0.1$; AirComp gives $\mathrm{MSE}_{\mathrm{agg}} = 1.0$. Compute the predicted final-loss floor for both and the round count $T$ needed to get within $2\times$ of the floor.
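A quick calculation settles the example; a minimal Python sketch that plugs the given numbers into Theorem 17.2.1, reading "within $2\times$ of the floor" as: the transient term has decayed below the floor.

```python
import math

# Parameters from the example (the given L = 10 only confirms eta <= 1/L)
mu, sigma_g2, n = 1.0, 1.0, 50
eta, gap0 = 0.1, 10.0

for name, mse_agg in [("digital", 0.1), ("AirComp", 1.0)]:
    # Noise floor from Theorem 17.2.1: (eta / 2mu) * (sigma_g^2/n + MSE_agg/n^2)
    floor = eta / (2 * mu) * (sigma_g2 / n + mse_agg / n**2)
    # Rounds until (1 - eta*mu)^T * gap0 <= floor, i.e. total bound <= 2x floor
    T = math.ceil(math.log(floor / gap0) / math.log(1 - eta * mu))
    print(f"{name}: floor = {floor:.6f}, T = {T}")
```

Both cases give a floor near $10^{-3}$ and $T \approx 88$ rounds: at $n = 50$, the $n^2$ amortization makes even the $10\times$ larger AirComp MSE nearly invisible in the floor.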

When Does Aggregation MSE Matter?

The convergence bound reveals a critical threshold:
$$\mathrm{MSE}_{\mathrm{agg}} \;\ll\; n\,\sigma_g^2 \;\Rightarrow\; \text{aggregation MSE is irrelevant.}$$
In this regime, the FL noise floor is set by the per-user SGD variance. Halving $\mathrm{MSE}_{\mathrm{agg}}$ does not halve the floor; the SGD variance dominates. Conversely:
$$\mathrm{MSE}_{\mathrm{agg}} \;\gtrsim\; n\,\sigma_g^2 \;\Rightarrow\; \text{aggregation MSE matters.}$$
Small $\sigma_g^2$ (low-variance per-user gradients, as in large-batch local SGD) and heterogeneous gradients can push the system into this regime.

Operational implication: engineer the aggregation MSE to match the SGD regime. Over-engineering AirComp (pushing $\mathrm{MSE}_{\mathrm{agg}}$ far below $n\sigma_g^2$) wastes power and bandwidth. Under-engineering (letting it rise far above $n\sigma_g^2$) costs convergence. The sweet spot is $\mathrm{MSE}_{\mathrm{agg}} \approx n\sigma_g^2$: the matched design. This is the golden thread in sharp form.
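As a sketch, the matched-design rule can be wrapped in a couple of helpers; the function names and the 10% cutoff below are illustrative choices, not from the text.

```python
def matched_mse_target(n: int, sigma_g2: float) -> float:
    """Sweet-spot aggregation MSE: MSE_agg ~ n * sigma_g^2."""
    return n * sigma_g2

def regime(mse_agg: float, n: int, sigma_g2: float) -> str:
    """Which term dominates the noise floor (10% cutoff is illustrative)."""
    threshold = n * sigma_g2
    if mse_agg < 0.1 * threshold:
        return "SGD-dominated: aggregation over-engineered, power wasted"
    if mse_agg > threshold:
        return "aggregation-dominated: MSE costs convergence"
    return "matched design"

print(matched_mse_target(50, 1.0))   # 50.0
print(regime(0.1, 50, 1.0))          # over-engineered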

Theorem: Beating the Floor with a Decreasing Learning Rate

Under the same assumptions as Theorem 17.2.1, if $\eta_{\text{lr},t} = 1/(\mu t)$ (decreasing):
$$\mathbb{E}[F(\boldsymbol{\theta}_T) - F^{\star}] \;\leq\; O\!\left(\frac{L\left(\sigma_g^2/n + \mathrm{MSE}_{\mathrm{agg}}/n^2\right)}{\mu^2 T}\right) \;\to\; 0.$$
The decreasing learning rate beats the noise floor, at rate $O(1/T)$: for large $T$, the loss converges to the exact optimum.
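A minimal sketch of the effect on a toy 1-D quadratic $F(\theta) = \tfrac{\mu}{2}\theta^2$, with a single Gaussian noise term standing in for the combined SGD-plus-aggregation variance; all values are illustrative.

```python
import random

mu, noise_std, T = 1.0, 0.1, 5000
random.seed(0)

def final_gap(schedule):
    """SGD on F(theta) = mu/2 * theta^2 with noisy gradients."""
    theta = 10.0
    for t in range(1, T + 1):
        grad = mu * theta + random.gauss(0.0, noise_std)
        theta -= schedule(t) * grad
    return 0.5 * mu * theta**2  # loss gap F(theta) - F*

print("constant eta = 0.1  :", final_gap(lambda t: 0.1))             # stuck at a noise floor
print("decreasing 1/(mu*t) :", final_gap(lambda t: 1.0 / (mu * t)))  # keeps shrinking
```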

FL Convergence: Exact Aggregation vs. AirComp

Simulate FL convergence under different aggregation-MSE levels. Compare the ideal case ($\mathrm{MSE}_{\mathrm{agg}} = 0$) with realistic values. The plot shows the exponential decay toward a noise floor set by Theorem 17.2.1. Change $n$ and $\sigma_g^2$ to see how the floor is dominated by either the SGD variance or the aggregation MSE.

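A non-interactive Python sketch of this experiment, assuming a toy quadratic loss; the dimension $d$ and the defaults $T = 200$, $n = 50$, $\sigma_g^2 = 1$ stand in for the demo's parameter settings.

```python
import numpy as np
import matplotlib.pyplot as plt

T, n, sigma_g2 = 200, 50, 1.0  # rounds, users, per-user gradient variance
mu, eta, d = 1.0, 0.1, 20      # strong convexity, learning rate, model dim
rng = np.random.default_rng(0)

def simulate(mse_agg):
    """Noisy-aggregation SGD on F(theta) = mu/2 ||theta||^2, start gap = 10."""
    theta = np.full(d, np.sqrt(2 * 10.0 / (mu * d)))
    gaps = []
    for _ in range(T):
        noise_var = sigma_g2 / n + mse_agg / n**2  # averaged SGD + aggregation
        grad = mu * theta + rng.normal(0.0, np.sqrt(noise_var / d), d)
        theta = theta - eta * grad
        gaps.append(0.5 * mu * np.sum(theta**2))
    return gaps

for mse_agg in [0.0, 0.1, 1.0, 100.0]:
    plt.semilogy(simulate(mse_agg), label=f"MSE_agg = {mse_agg}")
plt.xlabel("round t"); plt.ylabel("loss gap"); plt.legend(); plt.show()
```

With these defaults $n\sigma_g^2 = 50$, so only the $\mathrm{MSE}_{\mathrm{agg}} = 100$ curve plateaus visibly above the ideal one; the $0.1$ and $1.0$ curves sit on the SGD-variance floor, exactly the threshold behavior described above.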

Theorem: Non-Convex FL Under Aggregation MSE

For an $L$-smooth (possibly non-convex) FL loss, wireless-FL with constant $\eta_{\text{lr}} \leq 1/L$ satisfies
$$\min_{t \leq T} \mathbb{E}[\|\nabla F(\boldsymbol{\theta}_t)\|^2] \;\leq\; \frac{2(F(\boldsymbol{\theta}_0) - F^{\star})}{\eta_{\text{lr}} T} \;+\; L\eta_{\text{lr}}\left(\frac{\sigma_g^2}{n} + \frac{\mathrm{MSE}_{\mathrm{agg}}}{n^2}\right).$$
The gradient norm converges to a floor proportional to the same variance terms; the rate is $O(1/T)$ instead of exponential.
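A small sanity calculation on the non-convex bound, with illustrative numbers: the gradient-norm floor, and the round count at which the transient term $2(F_0 - F^{\star})/(\eta_{\text{lr}} T)$ has decayed to the floor.

```python
# Non-convex bound: 2*gap0/(eta*T) + L*eta*(sigma_g^2/n + MSE_agg/n^2)
gap0, L_smooth, eta = 10.0, 10.0, 0.01   # eta <= 1/L
sigma_g2, n, mse_agg = 1.0, 50, 1.0

floor = L_smooth * eta * (sigma_g2 / n + mse_agg / n**2)
T_match = 2 * gap0 / (eta * floor)       # transient term equals the floor here
print(f"grad-norm floor = {floor:.5f}, transient matches floor at T = {T_match:.0f}")
```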

⚠️ Engineering Note

Tuning for Target Accuracy

To achieve a target loss gap $\varepsilon$:

  • Compute the required floor. Need $\eta_{\text{lr}} V / (2\mu) \leq \varepsilon$, i.e., $V \leq 2\mu\varepsilon/\eta_{\text{lr}}$.

  • Allocate the variance budget. $V = \sigma_g^2/n + \mathrm{MSE}_{\mathrm{agg}}/n^2$. Assign $\sim 50\%$ to each source (rule of thumb).

  • Design $\mathrm{MSE}_{\mathrm{agg}}$. Given the budget, $\mathrm{MSE}_{\mathrm{agg}} \leq 0.5\,V n^2$, which becomes the AirComp power-control target.

  • Compute $T$. $T \geq \log(\varepsilon / (F_0 - F^{\star})) / \log(1 - \eta_{\text{lr}}\mu)$ gives the round count.

  • Match to the battery budget. Each round uses $E_k^{(t)}$ energy per user. Total: $T \cdot \bar{E}$; this must fit device capacity.

  • Rotate users. If the battery budget is tight, schedule users in rotating subsets $\mathcal{S}_t$ to average out the drain.

A spreadsheet or quick Python calculation settles the design, as in the sketch below. Over-engineering aggregation, e.g., $\mathrm{MSE}_{\mathrm{agg}} \ll n\sigma_g^2$, is always wasted.
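A minimal sketch of that calculation; the function and the 50/50 split follow the checklist above, while `E_per_round` and the example numbers are illustrative assumptions.

```python
import math

def design_wireless_fl(eps, mu, eta, sigma_g2, n, gap0, E_per_round):
    """Walk the checklist: variance budget, MSE target, rounds, energy."""
    V_budget = 2 * mu * eps / eta              # floor <= eps requires V <= this
    mse_target = 0.5 * V_budget * n**2         # 50% of budget to aggregation
    sgd_fits = sigma_g2 / n <= 0.5 * V_budget  # does SGD variance fit its half?
    T = math.ceil(math.log(eps / gap0) / math.log(1 - eta * mu))
    return dict(V_budget=V_budget, mse_target=mse_target,
                sgd_fits=sgd_fits, rounds=T, energy_per_user=T * E_per_round)

# Illustrative: target gap 0.01, 50 mJ per round per user (assumed)
print(design_wireless_fl(eps=0.01, mu=1.0, eta=0.1,
                         sigma_g2=1.0, n=50, gap0=10.0, E_per_round=0.05))
```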

Practical Constraints
  • Variance budget: $V \leq 2\mu\varepsilon/\eta_{\text{lr}}$

  • Split: $\sim 50\%$ SGD / $50\%$ aggregation

  • $T$ from the log-ratio formula

  • Battery: $T \cdot \bar{E}$

  • Rotate users for fairness

📋 Ref: Amiri & Gündüz 2020; Bottou et al. 2018

Common Mistake: Over-Engineering Aggregation MSE

Mistake:

Design an aggregator targeting $\mathrm{MSE}_{\mathrm{agg}} \ll n\sigma_g^2$, burning power/bandwidth unnecessarily.

Correction:

Per Theorem 17.2.1, once $\mathrm{MSE}_{\mathrm{agg}}/n^2 \ll \sigma_g^2/n$, further reductions do not help convergence. The golden thread says: aggregation fidelity is one axis, SGD variance is another; design both at matched levels. Use Theorem 17.2.1 as a design calculator: given $(n, \sigma_g^2, \eta_{\text{lr}}, \mu)$, compute the aggregation MSE that lands at the sweet spot. Pushing below it brings no convergence benefit (and no harm); it only wastes power and bandwidth.

Key Takeaway

Wireless-FL convergence has a noise floor $\eta_{\text{lr}}(\sigma_g^2/n + \mathrm{MSE}_{\mathrm{agg}}/n^2)/(2\mu)$ under constant $\eta_{\text{lr}}$. The aggregation MSE is amortized by $n^2$; for large $n$, the floor is dominated by the SGD-variance term. Match aggregation to the SGD regime: $\mathrm{MSE}_{\mathrm{agg}} \approx n\sigma_g^2$. Over-engineering is wasted. Decreasing learning rates eliminate the floor at an $O(1/T)$ cost in rate.

Quick Check

For wireless-FL with $n = 100$ users, $\sigma_g^2 = 2$, and a constant learning rate, at what aggregation MSE does the AirComp contribution equal the SGD-variance contribution to the noise floor?

  • $\mathrm{MSE}_{\mathrm{agg}} = 2$

  • $\mathrm{MSE}_{\mathrm{agg}} = 20$

  • $\mathrm{MSE}_{\mathrm{agg}} = 200$

  • $\mathrm{MSE}_{\mathrm{agg}} = 20000$