Convergence under Aggregation MSE

The Central Question: How Does MSE Cost Rounds?

Section §17.1 introduced the wireless-FL protocol. The aggregator's output is $\hat{\mathbf{G}}_t = \mathbf{G}_t + \mathbf{e}_t$, with per-round MSE $\mathrm{MSE}_{\mathrm{agg}}(t) = \mathbb{E}[\|\mathbf{e}_t\|^2]$. Bigger MSE should mean slower convergence, but how much slower?

The question is sharply answered by the standard SGD convergence analysis, adapted to our setting. For smooth and strongly convex FL losses, the convergence rate decomposes into (i) an exponentially decaying "deterministic" term and (ii) a persistent "noise floor" proportional to $\mathrm{MSE}_{\mathrm{agg}}$. The floor is the irreducible cost of noisy aggregation.

The point is that wireless-FL convergence is not about achieving zero error; it is about the error floor. Designers can choose to allocate power/bandwidth (to lower MSE), or to run more rounds (beating the floor with a decreasing learning rate). This section formalizes the trade-off.

Theorem 17.2.1: FedAvg Convergence Under Aggregation MSE

Assume the FL loss $F$ is $L$-smooth and $\mu$-strongly convex. Users have bounded gradient variance: $\mathbb{E}[\|\mathbf{g}_k^{(t)} - \nabla F(\boldsymbol{\theta}_t)\|^2] \leq \sigma_g^2$ for all $k$.

Run wireless-FL (Algorithm 17.1.1) with learning rate $\eta_{\text{lr}} \leq 1/L$, uniform scheduling $|\mathcal{S}_t| = n$, and per-round aggregation MSE $\mathrm{MSE}_{\mathrm{agg}}(t) \leq \mathrm{MSE}_{\mathrm{agg}}$. Then, after $T$ rounds,
$$\mathbb{E}[F(\boldsymbol{\theta}_T) - F^{\star}] \;\leq\; (1 - \eta_{\text{lr}}\mu)^T \cdot (F(\boldsymbol{\theta}_0) - F^{\star}) \;+\; \frac{\eta_{\text{lr}}}{2\mu}\left(\frac{\sigma_g^2}{n} + \frac{\mathrm{MSE}_{\mathrm{agg}}}{n^2}\right).$$

Interpretation. The first term decays exponentially: the FL loss improves toward the "noise floor" determined by the second term. Both the SGD variance ($\sigma_g^2/n$) and the aggregation MSE ($\mathrm{MSE}_{\mathrm{agg}}/n^2$) contribute to the floor.


Example: Convergence at Given MSE

A wireless-FL task has $L = 10$, $\mu = 1$, $\sigma_g^2 = 1$, $n = 50$ users, $\eta_{\text{lr}} = 0.1$, and initial loss gap $F(\boldsymbol{\theta}_0) - F^{\star} = 10$. Digital aggregation gives $\mathrm{MSE}_{\mathrm{agg}} = 0.1$; AirComp gives $\mathrm{MSE}_{\mathrm{agg}} = 1.0$. Compute the predicted final-loss floor for both and the round count $T$ needed to get within $2\times$ of the floor.
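A quick calculation settles the example; a minimal Python sketch that plugs the given numbers into Theorem 17.2.1, reading "within $2\times$ of the floor" as: the transient term has decayed below the floor.

```python
import math

# Parameters from the example (the given L = 10 only confirms eta <= 1/L)
mu, sigma_g2, n = 1.0, 1.0, 50
eta, gap0 = 0.1, 10.0

for name, mse_agg in [("digital", 0.1), ("AirComp", 1.0)]:
    # Noise floor from Theorem 17.2.1: (eta / 2mu) * (sigma_g^2/n + MSE_agg/n^2)
    floor = eta / (2 * mu) * (sigma_g2 / n + mse_agg / n**2)
    # Rounds until (1 - eta*mu)^T * gap0 <= floor, i.e. total bound <= 2x floor
    T = math.ceil(math.log(floor / gap0) / math.log(1 - eta * mu))
    print(f"{name}: floor = {floor:.6f}, T = {T}")
```

Both cases give a floor near $10^{-3}$ and $T \approx 88$ rounds: at $n = 50$, the $n^2$ amortization makes even the $10\times$ larger AirComp MSE nearly invisible in the floor.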

When Does Aggregation MSE Matter?

The convergence bound reveals a critical threshold:
$$\mathrm{MSE}_{\mathrm{agg}} \;\ll\; n\,\sigma_g^2 \;\Rightarrow\; \text{aggregation MSE is irrelevant.}$$
In this regime, the FL noise floor is set by the per-user SGD variance. Halving $\mathrm{MSE}_{\mathrm{agg}}$ does not halve the floor; the SGD variance dominates. Conversely:
$$\mathrm{MSE}_{\mathrm{agg}} \;\gtrsim\; n\,\sigma_g^2 \;\Rightarrow\; \text{aggregation MSE matters.}$$
Small $\sigma_g^2$ (low-variance per-user gradients, as in large-batch local SGD) and heterogeneous gradients can push the system into this regime.

Operational implication: engineer the aggregation MSE to match the SGD regime. Over-engineering AirComp (pushing $\mathrm{MSE}_{\mathrm{agg}}$ far below $n\sigma_g^2$) wastes power and bandwidth. Under-engineering (letting it rise far above $n\sigma_g^2$) costs convergence. The sweet spot is $\mathrm{MSE}_{\mathrm{agg}} \approx n\sigma_g^2$: the matched design. This is the golden thread in sharp form.
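As a sketch, the matched-design rule can be wrapped in a couple of helpers; the function names and the 10% cutoff below are illustrative choices, not from the text.

```python
def matched_mse_target(n: int, sigma_g2: float) -> float:
    """Sweet-spot aggregation MSE: MSE_agg ~ n * sigma_g^2."""
    return n * sigma_g2

def regime(mse_agg: float, n: int, sigma_g2: float) -> str:
    """Which term dominates the noise floor (10% cutoff is illustrative)."""
    threshold = n * sigma_g2
    if mse_agg < 0.1 * threshold:
        return "SGD-dominated: aggregation over-engineered, power wasted"
    if mse_agg > threshold:
        return "aggregation-dominated: MSE costs convergence"
    return "matched design"

print(matched_mse_target(50, 1.0))   # 50.0
print(regime(0.1, 50, 1.0))          # over-engineered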

Theorem: Beating the Floor with a Decreasing Learning Rate

Under the same assumptions as Theorem 17.2.1, if $\eta_{\text{lr},t} = 1/(\mu t)$ (decreasing):
$$\mathbb{E}[F(\boldsymbol{\theta}_T) - F^{\star}] \;\leq\; O\!\left(\frac{L\left(\sigma_g^2/n + \mathrm{MSE}_{\mathrm{agg}}/n^2\right)}{\mu^2 T}\right) \;\to\; 0.$$
The decreasing learning rate beats the noise floor, at rate $O(1/T)$: for large $T$, the loss converges to the exact optimum.
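A minimal sketch of the effect on a toy 1-D quadratic $F(\theta) = \tfrac{\mu}{2}\theta^2$, with a single Gaussian noise term standing in for the combined SGD-plus-aggregation variance; all values are illustrative.

```python
import random

mu, noise_std, T = 1.0, 0.1, 5000
random.seed(0)

def final_gap(schedule):
    """SGD on F(theta) = mu/2 * theta^2 with noisy gradients."""
    theta = 10.0
    for t in range(1, T + 1):
        grad = mu * theta + random.gauss(0.0, noise_std)
        theta -= schedule(t) * grad
    return 0.5 * mu * theta**2  # loss gap F(theta) - F*

print("constant eta = 0.1  :", final_gap(lambda t: 0.1))             # stuck at a noise floor
print("decreasing 1/(mu*t) :", final_gap(lambda t: 1.0 / (mu * t)))  # keeps shrinking
```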

FL Convergence: Exact Aggregation vs. AirComp

Simulate FL convergence under different aggregation-MSE levels. Compare the ideal case ($\mathrm{MSE}_{\mathrm{agg}} = 0$) with realistic values. The plot shows the exponential decay toward a noise floor set by Theorem 17.2.1. Change $n$ and $\sigma_g^2$ to see how the floor is dominated by either the SGD variance or the aggregation MSE.

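A non-interactive Python sketch of this experiment, assuming a toy quadratic loss; the dimension $d$ and the defaults $T = 200$, $n = 50$, $\sigma_g^2 = 1$ stand in for the demo's parameter settings.

```python
import numpy as np
import matplotlib.pyplot as plt

T, n, sigma_g2 = 200, 50, 1.0  # rounds, users, per-user gradient variance
mu, eta, d = 1.0, 0.1, 20      # strong convexity, learning rate, model dim
rng = np.random.default_rng(0)

def simulate(mse_agg):
    """Noisy-aggregation SGD on F(theta) = mu/2 ||theta||^2, start gap = 10."""
    theta = np.full(d, np.sqrt(2 * 10.0 / (mu * d)))
    gaps = []
    for _ in range(T):
        noise_var = sigma_g2 / n + mse_agg / n**2  # averaged SGD + aggregation
        grad = mu * theta + rng.normal(0.0, np.sqrt(noise_var / d), d)
        theta = theta - eta * grad
        gaps.append(0.5 * mu * np.sum(theta**2))
    return gaps

for mse_agg in [0.0, 0.1, 1.0, 100.0]:
    plt.semilogy(simulate(mse_agg), label=f"MSE_agg = {mse_agg}")
plt.xlabel("round t"); plt.ylabel("loss gap"); plt.legend(); plt.show()
```

With these defaults $n\sigma_g^2 = 50$, so only the $\mathrm{MSE}_{\mathrm{agg}} = 100$ curve plateaus visibly above the ideal one; the $0.1$ and $1.0$ curves sit on the SGD-variance floor, exactly the threshold behavior described above.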

Theorem: Non-Convex FL Under Aggregation MSE

For an $L$-smooth (possibly non-convex) FL loss, wireless-FL with constant $\eta_{\text{lr}} \leq 1/L$ satisfies
$$\min_{t \leq T} \mathbb{E}[\|\nabla F(\boldsymbol{\theta}_t)\|^2] \;\leq\; \frac{2(F(\boldsymbol{\theta}_0) - F^{\star})}{\eta_{\text{lr}} T} \;+\; L\eta_{\text{lr}}\left(\frac{\sigma_g^2}{n} + \frac{\mathrm{MSE}_{\mathrm{agg}}}{n^2}\right).$$
The gradient norm converges to a floor proportional to the same variance terms; the rate is $O(1/T)$ instead of exponential.
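A small sanity calculation on the non-convex bound, with illustrative numbers: the gradient-norm floor, and the round count at which the transient term $2(F_0 - F^{\star})/(\eta_{\text{lr}} T)$ has decayed to the floor.

```python
# Non-convex bound: 2*gap0/(eta*T) + L*eta*(sigma_g^2/n + MSE_agg/n^2)
gap0, L_smooth, eta = 10.0, 10.0, 0.01   # eta <= 1/L
sigma_g2, n, mse_agg = 1.0, 50, 1.0

floor = L_smooth * eta * (sigma_g2 / n + mse_agg / n**2)
T_match = 2 * gap0 / (eta * floor)       # transient term equals the floor here
print(f"grad-norm floor = {floor:.5f}, transient matches floor at T = {T_match:.0f}")
```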

⚠️ Engineering Note

Tuning for Target Accuracy

To achieve a target loss gap $\varepsilon$:

  • Compute the required floor. Need $\eta_{\text{lr}} V / (2\mu) \leq \varepsilon$, i.e., $V \leq 2\mu\varepsilon/\eta_{\text{lr}}$.

  • Allocate the variance budget. $V = \sigma_g^2/n + \mathrm{MSE}_{\mathrm{agg}}/n^2$. Assign $\sim 50\%$ to each source (rule of thumb).

  • Design $\mathrm{MSE}_{\mathrm{agg}}$. Given the budget, $\mathrm{MSE}_{\mathrm{agg}} \leq 0.5\,V n^2$, which becomes the AirComp power-control target.

  • Compute $T$. $T \geq \log(\varepsilon / (F_0 - F^{\star})) / \log(1 - \eta_{\text{lr}}\mu)$ gives the round count.

  • Match to the battery budget. Each round uses $E_k^{(t)}$ energy per user. Total: $T \cdot \bar{E}$; this must fit device capacity.

  • Rotate users. If the battery budget is tight, schedule users in rotating subsets $\mathcal{S}_t$ to average out the drain.

A spreadsheet or quick Python calculation settles the design, as in the sketch below. Over-engineering aggregation, e.g., $\mathrm{MSE}_{\mathrm{agg}} \ll n\sigma_g^2$, is always wasted.
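A minimal sketch of that calculation; the function and the 50/50 split follow the checklist above, while `E_per_round` and the example numbers are illustrative assumptions.

```python
import math

def design_wireless_fl(eps, mu, eta, sigma_g2, n, gap0, E_per_round):
    """Walk the checklist: variance budget, MSE target, rounds, energy."""
    V_budget = 2 * mu * eps / eta              # floor <= eps requires V <= this
    mse_target = 0.5 * V_budget * n**2         # 50% of budget to aggregation
    sgd_fits = sigma_g2 / n <= 0.5 * V_budget  # does SGD variance fit its half?
    T = math.ceil(math.log(eps / gap0) / math.log(1 - eta * mu))
    return dict(V_budget=V_budget, mse_target=mse_target,
                sgd_fits=sgd_fits, rounds=T, energy_per_user=T * E_per_round)

# Illustrative: target gap 0.01, 50 mJ per round per user (assumed)
print(design_wireless_fl(eps=0.01, mu=1.0, eta=0.1,
                         sigma_g2=1.0, n=50, gap0=10.0, E_per_round=0.05))
```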

Practical Constraints
  • Variance budget: $V \leq 2\mu\varepsilon/\eta_{\text{lr}}$

  • Split: $\sim 50\%$ SGD / $50\%$ aggregation

  • $T$ from the log-ratio formula

  • Battery: $T \cdot \bar{E}$

  • Rotate users for fairness

📋 Ref: Amiri & Gündüz 2020; Bottou et al. 2018

Common Mistake: Over-Engineering Aggregation MSE

Mistake:

Design an aggregator targeting $\mathrm{MSE}_{\mathrm{agg}} \ll n\sigma_g^2$, burning power/bandwidth unnecessarily.

Correction:

Per Theorem 17.2.1, once $\mathrm{MSE}_{\mathrm{agg}}/n^2 \ll \sigma_g^2/n$, further reductions do not help convergence. The golden thread says: aggregation fidelity is one axis, SGD variance is another; design both at matched levels. Use Theorem 17.2.1 as a design calculator: given $(n, \sigma_g^2, \eta_{\text{lr}}, \mu)$, compute the aggregation MSE that lands at the sweet spot. Pushing below it brings no convergence benefit (and no harm); it only wastes power and bandwidth.

Key Takeaway

Wireless-FL convergence has a noise floor $\eta_{\text{lr}}(\sigma_g^2/n + \mathrm{MSE}_{\mathrm{agg}}/n^2)/(2\mu)$ under constant $\eta_{\text{lr}}$. The aggregation MSE is amortized by $n^2$; for large $n$, the floor is dominated by the SGD-variance term. Match aggregation to the SGD regime: $\mathrm{MSE}_{\mathrm{agg}} \approx n\sigma_g^2$. Over-engineering is wasted. Decreasing learning rates eliminate the floor at an $O(1/T)$ cost in rate.

Quick Check

For wireless-FL with $n = 100$ users, $\sigma_g^2 = 2$, and a constant learning rate, at what aggregation MSE does the AirComp contribution equal the SGD-variance contribution to the noise floor?

  • $\mathrm{MSE}_{\mathrm{agg}} = 2$

  • $\mathrm{MSE}_{\mathrm{agg}} = 20$

  • $\mathrm{MSE}_{\mathrm{agg}} = 200$

  • $\mathrm{MSE}_{\mathrm{agg}} = 20000$