Convergence under Aggregation MSE
The Central Question: How Does MSE Cost Rounds?
Section §17.1 introduced the wireless-FL protocol. The aggregator's output is $\hat{\bar g}_t = \bar g_t + e_t$, the true averaged gradient plus a zero-mean error with per-round MSE $\sigma_{\mathrm{agg}}^2 = \mathbb{E}\|e_t\|^2$. Bigger MSE should mean slower convergence, but how much slower?
The question is sharply answered by the standard SGD convergence analysis, adapted to our setting. For smooth and strongly convex FL losses, the convergence rate decomposes into (i) an exponentially decaying "deterministic" term and (ii) a persistent "noise floor" proportional to $\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2$. The floor is the irreducible cost of noisy aggregation.
The point is that wireless-FL convergence is not about zero error; it is about the error floor. Designers can choose to allocate power/bandwidth (to lower MSE), or to run more rounds (beating the floor with a decreasing learning rate). This section formalizes the trade-off.
Theorem: FedAvg Convergence Under Aggregation MSE
Assume the FL loss $F$ is $L$-smooth and $\mu$-strongly convex. Users have bounded gradient variance: $\mathbb{E}\|g_k(w) - \nabla F_k(w)\|^2 \le \sigma^2$ for all $k$.
Run wireless-FL (Algorithm 17.1.1) with learning rate $\eta \le 1/L$, uniform scheduling of all $K$ users, and per-round aggregation MSE $\sigma_{\mathrm{agg}}^2$. Then, after $T$ rounds,
$$\mathbb{E}[F(w_T)] - F^\star \le (1-\eta\mu)^T \big(F(w_0) - F^\star\big) + \frac{\eta L}{2\mu}\Big(\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2\Big).$$
Interpretation. The first term decays exponentially: the FL loss improves geometrically toward the "noise floor" determined by the second term. Both the SGD variance ($\sigma^2/K$) and the aggregation MSE ($\sigma_{\mathrm{agg}}^2$) contribute to the floor.
Bias of the update
$\mathbb{E}[\hat{\bar g}_t \mid w_t] = \nabla F(w_t)$ (assuming zero-mean AirComp/digital error and unbiased, i.i.d. user gradients). Unbiasedness is key: the error perturbs the update but does not bias it.
Variance decomposition
$\mathbb{E}\|\hat{\bar g}_t - \nabla F(w_t)\|^2 = \frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2$. The two sources are independent under the stated assumptions, so their variances add.
Smooth-SGD recursion
Standard smooth-SGD analysis gives $\mathbb{E}[F(w_{t+1})] - F^\star \le (1-\eta\mu)\big(\mathbb{E}[F(w_t)] - F^\star\big) + \frac{\eta^2 L}{2} V$, where $V = \frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2$ is the variance bound above.
Telescoping
Telescoping over $T$ rounds yields the stated inequality, after simplification and using $\sum_{j=0}^{T-1}(1-\eta\mu)^j \le \frac{1}{\eta\mu}$, which converts the per-round noise term $\frac{\eta^2 L V}{2}$ into the floor $\frac{\eta L V}{2\mu}$.
The FL noise floor
As $T \to \infty$, the first term $\to 0$; the second term persists. This is the FL noise floor, the irreducible loss gap $\frac{\eta L}{2\mu}\big(\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2\big)$ under noisy aggregation.
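As a sanity check, the bound can be verified numerically on a synthetic strongly convex quadratic. The sketch below uses assumed illustrative parameters (the dimension, $\mu$, $L$, $\sigma^2$, $K$, and $\sigma_{\mathrm{agg}}^2$ are not from the text); it simulates noisy aggregation and compares the final loss gap against the predicted floor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters (not from the text)
d, K = 20, 10            # model dimension, number of users
mu, Lsm = 1.0, 10.0      # strong convexity / smoothness
sigma2 = 1.0             # per-user gradient variance E||noise||^2
sigma2_agg = 0.1         # aggregation MSE
eta = 1.0 / Lsm          # constant learning rate, eta <= 1/L
T = 300

# Quadratic loss F(w) = 0.5 * w^T H w with eigenvalues spread in [mu, L];
# the optimum is w = 0 with F* = 0
H = np.linspace(mu, Lsm, d)
F = lambda w: 0.5 * np.sum(H * w**2)

w = rng.normal(size=d)
for t in range(T):
    grad = H * w
    # average of K noisy user gradients: total variance sigma2/K
    g_avg = grad + rng.normal(scale=np.sqrt(sigma2 / (K * d)), size=d)
    # aggregation (AirComp/digital) error with MSE sigma2_agg
    g_hat = g_avg + rng.normal(scale=np.sqrt(sigma2_agg / d), size=d)
    w -= eta * g_hat

floor = eta * Lsm / (2 * mu) * (sigma2 / K + sigma2_agg)
print(f"final loss gap: {F(w):.4f}, predicted floor: {floor:.4f}")
```

In this run the observed gap sits well below the floor, as expected: the theorem gives an upper bound, not a tight prediction.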
Example: Convergence at Given MSE
A wireless-FL task has smoothness $L$, strong convexity $\mu$, per-user gradient variance $\sigma^2$, $K$ users, learning rate $\eta$, and initial loss gap $F(w_0) - F^\star$. Digital aggregation gives MSE $\sigma^2_{\mathrm{dig}}$; AirComp gives the larger MSE $\sigma^2_{\mathrm{air}}$. Compute the predicted final-loss floor for both and the round count needed to get within a factor of $2$ of the floor.
SGD variance contribution
The $K$-user average contributes $\frac{\sigma^2}{K}$ to the per-round update variance.
Digital aggregation floor
Total variance: $V_{\mathrm{dig}} = \frac{\sigma^2}{K} + \sigma^2_{\mathrm{dig}}$. Floor: $\frac{\eta L}{2\mu} V_{\mathrm{dig}}$.
AirComp floor
Total variance: $V_{\mathrm{air}} = \frac{\sigma^2}{K} + \sigma^2_{\mathrm{air}}$. Floor: $\frac{\eta L}{2\mu} V_{\mathrm{air}}$.
Rounds to 2× floor
Setting the exponential term below the floor, $(1-\eta\mu)^T\big(F(w_0) - F^\star\big) \le \mathrm{floor}$, gives $T \ge \frac{1}{\eta\mu}\ln\frac{F(w_0)-F^\star}{\mathrm{floor}}$ rounds.
Operational interpretation
When $\sigma^2/K$ dominates, the two floors are nearly equal: AirComp's larger MSE costs almost nothing in FL convergence because the SGD variance dominates. The wireless-FL bottleneck is the user-level stochasticity, not the aggregation fidelity. This is a key engineering insight: don't over-engineer the aggregation layer when SGD variance is the binding constraint.
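This style of calculation takes only a few lines. The plug-in numbers below are hypothetical (chosen so that the SGD term dominates); they are not the text's values.

```python
import math

# Hypothetical plug-in numbers (the text's specific values are not shown here)
mu, Lsm, eta = 1.0, 10.0, 0.1
sigma2, K = 1.0, 10
delta0 = 50.0                        # initial loss gap F(w0) - F*
mse = {"digital": 0.005, "aircomp": 0.02}

sgd_var = sigma2 / K                 # per-round SGD variance contribution
for name, s_agg in mse.items():
    floor = eta * Lsm / (2 * mu) * (sgd_var + s_agg)
    # rounds until the exponential term falls below the floor
    T = math.ceil(math.log(delta0 / floor) / (eta * mu))
    print(f"{name}: floor = {floor:.4f}, rounds to 2x floor = {T}")
```

With these numbers the two floors differ only modestly (the $\sigma^2/K$ term dominates both), and the round counts are nearly identical.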
When Does Aggregation MSE Matter?
The convergence bound reveals a critical threshold: $\sigma_{\mathrm{agg}}^2 \ll \sigma^2/K$. In this regime, the FL noise floor is set by per-user SGD variance. Halving $\sigma_{\mathrm{agg}}^2$ does not halve the floor; the SGD variance dominates. Conversely, when $\sigma_{\mathrm{agg}}^2 \gg \sigma^2/K$, the aggregation error sets the floor. Small $\sigma^2$ (low-variance per-user gradients, as in large-batch local SGD) and heterogeneous gradients can push the system into this aggregation-limited regime.
Operational implication: engineer aggregation MSE to match the SGD regime. Over-engineering AirComp (pushing $\sigma_{\mathrm{agg}}^2$ far below $\sigma^2/K$) wastes power and bandwidth. Under-engineering ($\sigma_{\mathrm{agg}}^2$ far above $\sigma^2/K$) costs convergence. The sweet spot is $\sigma_{\mathrm{agg}}^2 \approx \sigma^2/K$, the matched design. This is the golden thread in sharp form.
Theorem: Beating the Floor: Decreasing Learning Rate
Under the same assumptions as Theorem 17.2.1, if the learning rate decreases as $\eta_t = \Theta\big(\frac{1}{\mu t}\big)$, then $\mathbb{E}[F(w_T)] - F^\star = O(1/T)$. The decreasing learning rate beats the noise floor, converging at rate $O(1/T)$. For large $T$, the loss converges to the exact optimum.
Adaptive step size
With $\eta_t \to 0$, the per-step noise contribution $\frac{\eta_t^2 L V}{2}$ shrinks. The cumulative effect gives $O(1/T)$ convergence.
Price paid
A decreasing learning rate slows the deterministic descent (the first term of Theorem 17.2.1 is no longer exponential). For large $T$, the floor-beating dominates. For small $T$ (budget-limited), constant $\eta$ is better.
Operational choice
The round budget determines the strategy: constant $\eta$ for small $T$ (tight budget, the floor is never reached anyway); decreasing $\eta_t$ for large $T$ (many rounds, high target accuracy).
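The trade-off shows up even in a toy 1-D simulation (all parameters below are assumed for illustration): constant $\eta$ plateaus at its floor, while $\eta_t \propto 1/t$ keeps descending.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 0.2          # total per-round gradient variance (assumed)
T = 3000

def run(step):
    """Noisy gradient descent on F(w) = w^2/2 (mu = L = 1); returns final loss gap."""
    w = 5.0
    for t in range(T):
        g = w + rng.normal(scale=np.sqrt(V))   # noisy gradient
        w -= step(t) * g
    return 0.5 * w**2

trials = 100
gap_const = np.mean([run(lambda t: 0.1) for _ in range(trials)])
gap_decay = np.mean([run(lambda t: 1.0 / (t + 10)) for _ in range(trials)])
floor = 0.1 * 1.0 / (2 * 1.0) * V              # eta*L/(2*mu) * V at eta = 0.1
print(f"constant eta:   {gap_const:.5f}  (floor bound {floor:.3f})")
print(f"decreasing eta: {gap_decay:.5f}")
```

With this many rounds the decreasing schedule ends orders of magnitude below the constant-step plateau; with a much smaller $T$ the ordering flips, matching the operational rule above.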
FL Convergence: Exact Aggregation vs. AirComp
Simulate FL convergence under different aggregation MSE levels. Compare the ideal case ($\sigma_{\mathrm{agg}}^2 = 0$) with realistic values. The plot shows the exponential decay toward a noise floor set by Theorem 17.2.1. Change $\sigma^2$ and $\sigma_{\mathrm{agg}}^2$ to see how the floor is dominated by either SGD variance or aggregation MSE.
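A minimal non-interactive version of this simulation is sketched below (parameters assumed for illustration; it prints the final loss gap per MSE level instead of plotting, so the floor ordering is still visible).

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed illustrative parameters; tune sigma2 and sigma2_agg to see
# which term dominates the floor
d, K, mu, Lsm = 20, 10, 1.0, 10.0
sigma2, eta, T = 1.0, 0.1, 400
H = np.linspace(mu, Lsm, d)          # quadratic loss, F* = 0 at w = 0

def final_gap(sigma2_agg, trials=20):
    gaps = []
    for _ in range(trials):
        w = rng.normal(size=d)
        for _ in range(T):
            v = sigma2 / K + sigma2_agg                    # total update variance
            g = H * w + rng.normal(scale=np.sqrt(v / d), size=d)
            w -= eta * g
        gaps.append(0.5 * np.sum(H * w**2))
    return float(np.mean(gaps))

results = {}
for s in [0.0, 0.05, 0.1, 0.5]:
    results[s] = final_gap(s)
    floor = eta * Lsm / (2 * mu) * (sigma2 / K + s)
    print(f"sigma2_agg = {s}: final gap {results[s]:.4f}, floor bound {floor:.3f}")
```

The ideal case and the small-MSE cases land close together (SGD variance dominates); only the largest MSE visibly raises the floor.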
Theorem: Non-Convex FL Under Aggregation MSE
For an $L$-smooth (possibly non-convex) FL loss, wireless-FL with constant $\eta \le 1/L$ satisfies
$$\min_{t < T} \mathbb{E}\|\nabla F(w_t)\|^2 \le \frac{2\big(F(w_0) - F^\star\big)}{\eta T} + \eta L \Big(\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2\Big).$$
The gradient norm converges to a floor proportional to the same variance terms; the rate is $O(1/T)$ instead of $O\big((1-\eta\mu)^T\big)$.
Descent lemma
$L$-smoothness gives $F(w_{t+1}) \le F(w_t) + \langle \nabla F(w_t),\, w_{t+1} - w_t \rangle + \frac{L}{2}\|w_{t+1} - w_t\|^2$.
Expectation and telescoping
Take expectations and telescope over $t = 0, \dots, T-1$. Using unbiasedness of $\hat{\bar g}_t$, the variance bound $\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2$, and $\eta \le 1/L$, solve for the minimum gradient norm.
Operational interpretation
Non-convex FL (deep networks) loses the exponential decay but retains the noise-floor structure. The golden thread remains: pay aggregation MSE and/or SGD variance, reach a floor. Design the aggregator to fall below the floor dominated by SGD variance.
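The non-convex bound can be spot-checked on a simple 1-D non-convex test function. Everything below is an assumed toy setup (the function, $L$, $\eta$, the variance $V$, and $T$ are illustration choices, not from the text).

```python
import numpy as np

rng = np.random.default_rng(3)

# Non-convex 1-D test loss: F''(w) = 1 - 2*sin(w), so F is L-smooth with
# L = 3 but not convex
F = lambda w: 0.5 * w**2 + 2 * np.sin(w)
dF = lambda w: w + 2 * np.cos(w)
Lsm, eta, V, T = 3.0, 0.1, 0.2, 1000   # eta <= 1/L

w = 4.0
min_sq = np.inf
for t in range(T):
    min_sq = min(min_sq, dF(w)**2)      # track min ||grad F||^2 over iterates
    w -= eta * (dF(w) + rng.normal(scale=np.sqrt(V)))

grid = np.linspace(-10, 10, 100001)
delta0 = F(4.0) - F(grid).min()         # initial gap to the (grid) optimum
bound = 2 * delta0 / (eta * T) + eta * Lsm * V
print(f"min grad^2 = {min_sq:.6f} <= bound {bound:.3f}")
```

As with the convex case, the observed minimum gradient norm falls far below the bound; the bound's value is its scaling in $T$ and in the variance terms, not its tightness.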
Tuning for Target Accuracy
To achieve a target loss gap $\epsilon$:

- Compute the required floor. Need $\frac{\eta L}{2\mu}\big(\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2\big) \le \epsilon$, i.e., $\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2 \le \frac{2\mu\epsilon}{\eta L}$.
- Allocate the variance budget. $V_{\max} = \frac{2\mu\epsilon}{\eta L}$. Assign half to each source (rule of thumb).
- Design $\sigma_{\mathrm{agg}}^2$. Given the budget, $\sigma_{\mathrm{agg}}^2 \le V_{\max}/2$, which becomes the AirComp power-control target.
- Compute $T$. $T = \frac{1}{\eta\mu}\ln\frac{F(w_0)-F^\star}{\epsilon}$ gives the round count.
- Match to battery budget. Each round uses energy $E_{\mathrm{round}}$ per user. Total: $T \cdot E_{\mathrm{round}}$. Must fit device capacity.
- Rotate users. If the battery budget is tight, schedule users in rotating subsets to average the drain.
A spreadsheet or quick Python calculation settles the design. Over-engineering aggregation, e.g., $\sigma_{\mathrm{agg}}^2 \ll \sigma^2/K$, is always wasted.
- Variance budget: $\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2 \le \frac{2\mu\epsilon}{\eta L}$
- Split: half SGD / half aggregation
- $T$ from the log-ratio formula
- Battery: $T \cdot E_{\mathrm{round}}$
- Rotate users for fairness
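The checklist above really is a few lines of Python. The numbers below are assumed for illustration (the target $\epsilon$, energy per round, and battery budget are hypothetical).

```python
import math

# Worked design sketch with assumed numbers (not the text's values)
mu, Lsm, eta = 1.0, 10.0, 0.05
sigma2, K = 1.0, 20
delta0 = 50.0            # initial loss gap F(w0) - F*
eps = 0.05               # target loss gap
E_round = 0.5            # Joules per user per round (assumed)
battery = 200.0          # device energy budget in Joules (assumed)

V_max = 2 * mu * eps / (eta * Lsm)        # steps 1-2: variance budget
sgd_share = sigma2 / K                    # fixed by the learning setup
agg_budget = V_max / 2                    # step 3: half-half rule of thumb
assert sgd_share <= V_max / 2, "SGD variance alone exceeds its share"

T = math.ceil(math.log(delta0 / eps) / (eta * mu))   # step 4: round count
energy = T * E_round                                  # step 5: per-user energy
print(f"variance budget V_max = {V_max:.3f}")
print(f"aggregation MSE target <= {agg_budget:.3f}")
print(f"rounds T = {T}, energy per user = {energy:.1f} J "
      f"({'fits' if energy <= battery else 'rotate users'})")
```

If the final check fails, step 6 applies: rotate users in subsets so that no single device carries the full $T \cdot E_{\mathrm{round}}$ drain.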
Common Mistake: Over-Engineering Aggregation MSE
Mistake:
Design an aggregator targeting $\sigma_{\mathrm{agg}}^2 \ll \sigma^2/K$, burning power/bandwidth unnecessarily.
Correction:
Per Theorem 17.2.1, once $\sigma_{\mathrm{agg}}^2 \lesssim \sigma^2/K$, further reductions don't help convergence. The golden thread says: aggregation fidelity is one axis, SGD variance is another. Design both at matched levels. Use Theorem 17.2.1 as a design calculator: given $\epsilon$, compute the aggregation MSE that lands at the sweet spot $\sigma_{\mathrm{agg}}^2 \approx \sigma^2/K$. Anything below that brings no convergence benefit (and no convergence harm); it simply wastes power and bandwidth.
Key Takeaway
Wireless-FL convergence has a noise floor $\frac{\eta L}{2\mu}\big(\frac{\sigma^2}{K} + \sigma_{\mathrm{agg}}^2\big)$ under constant $\eta$. The aggregation MSE enters the floor additively with the SGD variance term $\sigma^2/K$; when the latter dominates, aggregation fidelity is nearly free. Match aggregation to the SGD regime: $\sigma_{\mathrm{agg}}^2 \approx \sigma^2/K$. Over-engineering is wasted. Decreasing learning rates eliminate the floor at the cost of an $O(1/T)$ rate instead of exponential decay.
Quick Check
For wireless-FL with $K$ users, per-user gradient variance $\sigma^2$, and a constant learning rate, at what aggregation MSE does the AirComp contribution equal the SGD-variance contribution to the noise floor?
At $\sigma_{\mathrm{agg}}^2 = \sigma^2/K$. The two variance terms enter the floor additively, so they are equal exactly when the aggregation MSE matches the $K$-user averaged SGD variance.