FedAvg and Its Convergence
Why FedAvg is the Canonical FL Algorithm
Vanilla distributed SGD sends each user's single-step gradient to the server every iteration. This has two problems for FL:
- High communication cost. With $T$ iterations and $d$ scalars of uplink per iteration, each user sends $Td$ scalars in total, too much for mobile uplink.
- Sparse participation. Only a fraction $C$ of users can participate per round; requiring every user in every iteration is infeasible.
FedAvg (McMahan et al. 2017) solves both: each selected user runs $E$ local SGD epochs before uploading. The result is that $T$ global rounds correspond to $ET$ local SGD epochs, so communication is reduced by a factor of $E$. Convergence theory guarantees this works for i.i.d. data and, with caveats, for non-i.i.d. data.
The point is that FedAvg is the baseline against which every later FL protocol (secure aggregation, differential privacy, AirComp) is measured. Section 9.2 formalizes the algorithm and states its convergence properties.
FedAvg (McMahan et al.)
Complexity: $O(d)$ uplink per user per round. The key parameter is $E$, the number of local epochs: with $E = 1$, FedAvg reduces to vanilla distributed SGD. Larger $E$ saves communication but can introduce client drift (users' local models diverging from the global one), especially on non-IID data.
Definition: FedAvg Algorithm (Formal)
The FedAvg algorithm for federated learning with $n$ users, participation rate $C$, local epochs $E$, and learning rate $\eta$ runs the loop:
At round $t$: the server selects a set $S_t$ of $m = \max(\lfloor Cn \rfloor, 1)$ users uniformly at random, broadcasts $w_t$, each selected user $k$ performs $E$ local SGD epochs starting from $w_t$ to produce $w_{t+1}^k$, and the server computes $w_{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, w_{t+1}^k$.
The aggregation is a weighted average by user data sizes, giving each user's contribution in proportion to their dataset size.
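The round above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the least-squares local objective, the function name `fedavg_round`, and all default parameter values are assumptions for the sketch.

```python
import numpy as np

def fedavg_round(w_global, users, data_sizes, C=0.1, E=5, lr=0.1, rng=None):
    """One FedAvg round: sample a fraction C of users, run E local epochs
    each, then aggregate by a data-size-weighted average."""
    rng = rng or np.random.default_rng(0)
    n = len(users)
    m = max(int(C * n), 1)
    selected = rng.choice(n, size=m, replace=False)

    local_models, weights = [], []
    for k in selected:
        w = w_global.copy()
        X, y = users[k]                            # user k's local dataset
        for _ in range(E):                         # E local epochs (full-batch here)
            grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
            w -= lr * grad
        local_models.append(w)
        weights.append(data_sizes[k])

    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                       # n_k / sum_j n_j over selected users
    return sum(p * w for p, w in zip(weights, local_models))
```

Iterating `fedavg_round` from a shared initial model reproduces the server loop of the definition above.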
FedAvg
Federated Averaging algorithm introduced by McMahan et al. (2017). Each round, a subset of users runs $E$ local SGD epochs and the server averages the resulting models. Canonical FL algorithm; reduces communication by a factor of $E$ compared to vanilla distributed SGD.
Client Drift
The phenomenon that with large $E$ (many local epochs per round), each user's local model drifts away from the global model toward its local objective. Causes FedAvg to converge to a biased stationary point on non-IID data.
Theorem: FedAvg Convergence on i.i.d. Data
Assume each user's local objective $F_k$ is $L$-smooth and $\mu$-strongly convex, user data is i.i.d. from a common distribution, and the stochastic gradient variance is bounded: $\mathbb{E}\,\|\nabla F_k(w;\xi) - \nabla F_k(w)\|^2 \le \sigma^2$.
For FedAvg with decaying learning rate $\eta_t = \frac{2}{\mu(\gamma + t)}$, after $T$ rounds the expected suboptimality satisfies
$$\mathbb{E}[F(w_T)] - F^* \;\le\; \frac{\kappa}{\gamma + T}\left(\frac{2B}{\mu} + \frac{\mu\gamma}{2}\,\mathbb{E}\,\|w_0 - w^*\|^2\right),$$
where $\kappa = L/\mu$ is the condition number, $G$ bounds $\mathbb{E}\,\|\nabla F_k(w;\xi)\|^2 \le G^2$, $B$ collects the gradient-noise and local-drift terms (see the proof sketch below), and $\gamma = \max(8\kappa, E)$ is the initial round count.
The rate is $O(1/T)$ asymptotically, matching centralized SGD up to constants that depend on $E$, $C$, and $n$.
FedAvg on i.i.d. data converges at the same asymptotic rate as centralized SGD; the local-epoch parameter $E$ enters the constants but not the scaling. Operationally, this means you can save communication by a factor of $E$ (fewer rounds) at modest convergence cost.
The point is that convergence theory supports FedAvg's communication-saving strategy: the gains are essentially "free" in the IID case. The non-IID case is harder; see the next theorem.
Aggregate update vs. centralized step
The FedAvg update after one round with $E$ local epochs looks similar to a single centralized SGD step with effective gradient $\eta \sum_k p_k \sum_{e=1}^{E} \nabla F_k(w_t^{k,e})$; only approximately, because the local iterates $w_t^{k,e}$ drift away from $w_t$.
Bound the drift
Each user's local trajectory stays within a ball of radius proportional to $\eta E G$ around the global model. On i.i.d. data, the per-user local gradient has mean equal to the global gradient, so local-epoch drift averages out.
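The "drift averages out" claim can be checked numerically. In this sketch the least-squares model, the user count, and all constants are assumptions for illustration: each user's full-batch gradient fluctuates, but the mean across users matches the population (global) gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
w = np.zeros(2)                        # current global model
global_grad = 2 * (w - w_true)         # population least-squares gradient at w

# Each user draws its own i.i.d. sample from the same distribution.
user_grads = []
for _ in range(500):                   # 500 users
    X = rng.normal(size=(50, 2))
    y = X @ w_true
    user_grads.append(2 * X.T @ (X @ w - y) / 50)

# Per-user gradients are noisy, but their mean matches the global gradient,
# so the drift accumulated over E local steps cancels across users.
mean_grad = np.mean(user_grads, axis=0)
```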
Apply centralized SGD analysis
With the drift bounded, the SGD inequality applies modulo the extra variance term from local trajectories. Algebraic manipulations yield the rate.
Constants depend on $E, C, n$
The variance term is $B = \sum_k p_k^2 \sigma_k^2 + 8(E-1)^2 G^2$: more local epochs inflate the variance (client drift), while averaging over more participants reduces it (the sampling variance shrinks with $m = Cn$). The full analysis is in Li et al. (2020), Thm. 1.
Theorem: FedAvg Convergence on non-i.i.d. Data
Under the same assumptions as Theorem 9.2.1, but with non-i.i.d. user data (user $k$'s local objective $F_k$ differs from the global $F = \sum_k p_k F_k$), FedAvg satisfies
$$\mathbb{E}[F(w_T)] - F^* \;\le\; \frac{\kappa}{\gamma + T}\left(\frac{2(B + 6L\Gamma)}{\mu} + \frac{\mu\gamma}{2}\,\mathbb{E}\,\|w_0 - w^*\|^2\right),$$
where $\Gamma = F^* - \sum_k p_k F_k^*$ measures the heterogeneity of user objectives ($\Gamma = 0$ on i.i.d. data; large on highly non-i.i.d. data).
The convergence rate is still $O(1/T)$, but the constant includes a term that grows with the local-epoch count $E$ on non-i.i.d. data. Hence: on non-i.i.d. data, too-large $E$ hurts convergence.
The heterogeneity term $\Gamma$ measures how much the user-local optima differ from the global optimum. On i.i.d. data, $\Gamma = 0$: users all pull toward the same optimum, so local steps help. On non-i.i.d. data, users pull in different directions; each local epoch is partially wasted on overfitting to the user's local data.
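For intuition, $\Gamma$ has a closed form for toy quadratic local objectives $F_k(w) = \tfrac{1}{2}(w - c_k)^2$: each $F_k^* = 0$ and the global minimizer is the weighted mean of the centers, so $\Gamma$ reduces to the weighted variance of the $c_k$. The model and function name below are illustrative assumptions.

```python
import numpy as np

def heterogeneity_gamma(centers, weights=None):
    """Gamma = F* - sum_k p_k F_k* for quadratic locals F_k(w) = 0.5*(w - c_k)^2.
    Each local optimum value F_k* is 0; the global minimizer is the weighted
    mean of the centers, so Gamma equals the weighted variance of the centers."""
    c = np.asarray(centers, dtype=float)
    p = np.full(len(c), 1 / len(c)) if weights is None else np.asarray(weights, float)
    w_star = p @ c                            # global minimizer
    F_star = 0.5 * p @ (w_star - c) ** 2      # global optimal value
    return F_star - 0.0                       # minus sum_k p_k F_k*, which is 0

# Identical local objectives (i.i.d.-like): Gamma = 0.
# Spread-out centers (non-i.i.d.): Gamma > 0.
```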
Operationally: choose $E$ carefully based on how non-i.i.d. the data is. For extremely heterogeneous data ($\Gamma$ large), $E = 1$ is safest. For mild heterogeneity, a moderate $E$ is usually acceptable.
Heterogeneity enters the drift bound
User $k$'s local trajectory over $E$ epochs has drift of order $\eta (E-1) G$: larger $E$ increases the drift and the effective variance.
The $\Gamma$ term
Algebraic analysis reveals that the $\Gamma$ term is the additional cost of non-i.i.d. local training: each local epoch moves the user's model toward its own local optimum, away from the global one. Averaging more often (smaller $E$) reduces the bias at the cost of more communication rounds.
Combine
Li et al. (2020) Thm. 2 establishes the bound rigorously. The key takeaway: non-i.i.d. data is not fatal, but it requires care in choosing $E$.
FedAvg Convergence: Effect of Local Epochs
Plot (simulated): FedAvg loss as a function of the number of rounds, for several values of $E$ (local epochs). On i.i.d. data, larger $E$ is strictly better (fewer rounds to reach a target loss). On non-i.i.d. data, too-large $E$ causes divergence or a plateau. The plot illustrates why $E$ is a critical hyperparameter.
Parameters: non-IID strength (0 = IID, 2 = highly non-IID); number of global rounds.
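A minimal simulation in the spirit of this plot, assuming scalar quadratic local objectives with per-user curvatures (the function name and all constants are illustrative): it shows that larger $E$ is harmless on IID data but leaves a residual bias relative to the global optimum on non-IID data.

```python
import numpy as np

def simulate_fedavg(E, noniid=1.0, rounds=200, n=20, lr=0.1, seed=0):
    """FedAvg on heterogeneous quadratics F_k(w) = 0.5 * a_k * (w - c_k)^2.
    `noniid` scales the spread of the per-user optima c_k (0 = identical
    objectives). Returns the final global model and its distance to the
    true global optimum."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.5, 2.0, size=n)          # per-user curvatures
    c = noniid * rng.normal(size=n)            # per-user optima
    w_star = (a @ c) / a.sum()                 # minimizer of the global objective
    w = 5.0                                    # shared initial model
    for _ in range(rounds):
        locals_ = []
        for ak, ck in zip(a, c):               # full participation for simplicity
            wk = w
            for _ in range(E):                 # E local gradient steps
                wk -= lr * ak * (wk - ck)
            locals_.append(wk)
        w = float(np.mean(locals_))            # unweighted average (equal n_k)
    return w, abs(w - w_star)

# Sweeping E: with noniid=0 every E converges to the optimum; with noniid>0,
# E=1 is unbiased while E>1 converges to a slightly biased fixed point.
```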
Example: Communication Speedup of FedAvg vs. Vanilla SGD
For a CIFAR-10 training run with $n$ users, participation rate $C$, $E = 5$ local epochs, and an i.i.d. data split, compare the communication cost against $E = 1$ (vanilla distributed SGD) to reach the same training loss.
Per-round cost (both)
Selected users per round: $m = Cn$. Per-round uplink: $m \cdot d$ scalars (one $d$-dimensional model per selected user), identical in both settings.
Rounds to target (from McMahan et al.)
$E = 1$ (vanilla): roughly $5\times$ as many rounds to the target accuracy as $E = 5$ (empirical result from the FedAvg paper).
Communication ratio
Vanilla: $\sim 5T \cdot md$ scalars, where $T$ is the number of rounds FedAvg with $E = 5$ needs. FedAvg: $T \cdot md$ scalars. Savings factor: $5\times$, exactly $E$, matching theory.
Local compute increase
FedAvg does more local compute per round but runs fewer rounds. Per user, the total compute is the same ($\sim 5T$ epochs' worth of local steps either way). The savings are in communication, not compute.
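The bookkeeping above can be checked in a few lines; the numeric values ($d$, $m$, $T$) are illustrative placeholders, not figures from the paper.

```python
# Communication-cost comparison for FedAvg (E local epochs, T rounds)
# vs. vanilla distributed SGD (E=1, hence ~E*T rounds to the same target).
def uplink_scalars(rounds, m, d):
    """Total uplink: rounds * (selected users per round) * (model dimension)."""
    return rounds * m * d

d, m, E, T = 1_000_000, 10, 5, 400    # illustrative values
vanilla = uplink_scalars(E * T, m, d) # vanilla needs ~E times as many rounds
fedavg = uplink_scalars(T, m, d)
assert vanilla / fedavg == E          # savings factor is exactly E
```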
Common Mistake: Too-Large Local Epochs on Non-IID Data
Mistake:
Set $E$ very large on a heavily non-IID federated-learning task to maximize communication savings.
Correction:
Too-large $E$ on non-IID data causes client drift: each user's local model overfits to its local data, diverging from the global optimum, and FedAvg convergence stalls or degrades. Typical production settings use a small $E$ (a few local epochs), tuned by validation-accuracy monitoring. SCAFFOLD (Karimireddy et al.) and FedProx (Li et al.) explicitly mitigate client drift and allow larger $E$.
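FedProx's mitigation is a proximal penalty $\frac{\mu}{2}\|w - w_t\|^2$ added to each local objective, pulling local iterates back toward the global model. The sketch below assumes a least-squares local objective; the function name and parameter values are illustrative, not from the FedProx paper's code.

```python
import numpy as np

def fedprox_local_update(w_global, X, y, E=10, lr=0.1, mu=0.01):
    """FedProx-style local update (sketch): local SGD on the least-squares
    loss plus a proximal term mu/2 * ||w - w_global||^2 that limits how far
    the local model can drift from the global one, even for large E."""
    w = w_global.copy()
    for _ in range(E):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # local least-squares gradient
        grad += mu * (w - w_global)            # proximal gradient term
        w -= lr * grad
    return w
```

With `mu = 0` this is exactly the FedAvg local step; increasing `mu` trades local progress for reduced drift.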
FedAvg in Production
Production FedAvg deployments (Google Gboard, NVIDIA Flare, Apple on-device FL) typically use:
- A few local epochs $E$ per round
- A small participation fraction $C$ per round
- 100–1000 rounds per training phase
- Per-device budget: 10 MB–100 MB uplink per session
Hyperparameter tuning is critical and typically done via offline simulations on held-out data. Production pipelines include automatic drift detection and early-stopping to guard against non-IID-induced divergence.
- Typical: small $E$, small $C$
- Per-session uplink: 10–100 MB
- Drift detection via validation accuracy
Key Takeaway
FedAvg saves communication by a factor of $E$ (local epochs) per round, with $O(1/T)$ convergence on i.i.d. data. On non-i.i.d. data, client drift limits the usable $E$; special algorithms (FedProx, SCAFFOLD) can partially mitigate this. FedAvg is the baseline against which every FL protocol, including the privacy-preserving variants of Chapters 10–12, is measured.
Quick Check
Compared to vanilla distributed SGD, FedAvg with $E$ local epochs and the same convergence target:
Uses roughly $E\times$ less total communication (fewer rounds) on i.i.d. data.
Uses the same communication as vanilla SGD.
Uses more communication.
Always converges faster than vanilla SGD, regardless of data distribution.
$E$ local epochs per round reduce the required number of global rounds by a factor of about $E$, provided convergence is not degraded by non-IID drift.