FedAvg and Its Convergence

Why FedAvg is the Canonical FL Algorithm

Vanilla distributed SGD sends each user's single-step gradient to the server every iteration. This has two problems for FL:

  1. High communication cost. With $T$ iterations and an uplink payload of $d$ scalars per iteration, each user sends $T d$ scalars in total β€” too much for mobile uplink.
  2. Sparse participation. Only $C \cdot n$ of the $n$ users can participate per round; requiring every user in every iteration is infeasible.

FedAvg (McMahan et al. 2017) solves both: each selected user runs $E$ local SGD epochs before uploading. As a result, $T$ global rounds correspond to $T \cdot E$ local SGD epochs β€” communication is reduced by a factor of $E$. Convergence theory guarantees this works for i.i.d. data and, with caveats, for non-i.i.d. data.

The point is that FedAvg is the baseline against which every later FL protocol (secure aggregation, differential privacy, AirComp) is measured. Section 9.2 formalizes the algorithm and states its convergence properties.

FedAvg (McMahan et al.)

Complexity: $O(E \cdot |\mathcal{D}_k| / B \cdot C_{\text{forward-backward}})$ per user per round, where $B$ is the mini-batch size and $C_{\text{forward-backward}}$ is the cost of one forward-backward pass.
Server executes:
1. Initialize $\mathbf{w}_0$.
2. for round $t = 0, 1, \ldots, T - 1$ do
3. \quad Select a subset $\mathcal{S}_t \subseteq [n]$ of size $C n$ uniformly at random.
4. \quad Broadcast $\mathbf{w}_t$ to users in $\mathcal{S}_t$.
5. \quad for each $k \in \mathcal{S}_t$ in parallel do
6. \qquad $\mathbf{w}_t^{(k)} \leftarrow \text{ClientUpdate}(k, \mathbf{w}_t)$
7. \quad end for
8. \quad $\mathbf{w}_{t+1} \leftarrow \sum_{k \in \mathcal{S}_t} (n_k / n_{\text{tot}}) \, \mathbf{w}_t^{(k)}$
9. end for
ClientUpdate$(k, \mathbf{w})$:
1. for each local epoch $e = 1, \ldots, E$ do
2. \quad for each mini-batch $\mathcal{B}$ of $\mathcal{D}_k$ do
3. \qquad $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla \ell(\mathbf{w}; \mathcal{B})$
4. \quad end for
5. end for
6. return $\mathbf{w}$
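The pseudocode above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the local loss $\ell$ is stood in by a least-squares objective, the synthetic data and all hyperparameter values are hypothetical, and the aggregation normalizes over the sampled users' data (one common choice, so the weights sum to one even when $C < 1$).

```python
import numpy as np

rng = np.random.default_rng(0)

def local_grad(w, X, y):
    # Gradient of the mean-squared error 0.5 * ||Xw - y||^2 / m on a batch.
    return X.T @ (X @ w - y) / len(y)

def client_update(w, X, y, lr=0.05, epochs=5, batch_size=8):
    # ClientUpdate(k, w): E epochs of mini-batch SGD on user k's data.
    w = w.copy()
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for s in range(0, len(y), batch_size):
            b = order[s:s + batch_size]
            w -= lr * local_grad(w, X[b], y[b])
    return w

def fedavg(clients, d, rounds=50, C=0.6, epochs=5):
    # Server loop: sample C*n users, run local updates, weighted-average.
    w = np.zeros(d)
    m = max(1, int(C * len(clients)))
    for _ in range(rounds):
        sel = rng.choice(len(clients), size=m, replace=False)
        sizes = [len(clients[k][1]) for k in sel]
        updates = [client_update(w, *clients[k], epochs=epochs) for k in sel]
        # Weighted average by dataset size over the sampled users.
        w = sum(nk * wk for nk, wk in zip(sizes, updates)) / sum(sizes)
    return w

# Demo: 5 users with i.i.d. (noiseless) linear-regression data.
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(5):
    X = rng.normal(size=(40, 3))
    clients.append((X, X @ w_true))
w_hat = fedavg(clients, d=3)
```

On this i.i.d. toy problem the aggregated model recovers the shared weights, which is the behavior the i.i.d. convergence theorem below formalizes.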

The key parameter is $E$, the number of local epochs. With $E = 1$, full-batch local updates, and $C = 1$, FedAvg reduces to vanilla distributed SGD. Larger $E$ saves communication but can introduce client drift (users' local models diverging from the global one), especially on non-IID data.

Definition:

FedAvg Algorithm (Formal)

The FedAvg algorithm for federated learning with $n$ users, participation rate $C$, local epochs $E$, and learning rate $\eta$ runs the following loop:

At round $t$: the server selects $C n$ users uniformly at random and broadcasts $\mathbf{w}_t$; each selected user performs $E$ local SGD epochs starting from $\mathbf{w}_t$ to produce $\mathbf{w}_t^{(k)}$; the server then computes $\mathbf{w}_{t+1} = \sum_{k \in \mathcal{S}_t} (n_k / n_{\text{tot}}) \mathbf{w}_t^{(k)}$.

The aggregation is a weighted average by user data sizes, giving each user's contribution in proportion to their dataset size.
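The weighted average is easy to check numerically. The round below is hypothetical: three selected users with dataset sizes 10, 30, and 60 (so $n_{\text{tot}} = 100$) and toy two-dimensional local models.

```python
import numpy as np

sizes  = np.array([10, 30, 60])     # n_k for the three selected users
models = np.array([[1.0, 0.0],      # w_t^(1) after its local epochs
                   [2.0, 1.0],      # w_t^(2)
                   [3.0, 2.0]])     # w_t^(3)

weights = sizes / sizes.sum()       # n_k / n_tot -> [0.1, 0.3, 0.6]
w_next = weights @ models           # weighted model average
print(w_next)                       # -> [2.5 1.5]
```

The user holding 60 of the 100 samples pulls the average strongly toward its model, exactly the proportionality the definition prescribes.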

FedAvg

Federated Averaging, introduced by McMahan et al. (2017). Each round, a subset of users runs $E$ local SGD epochs and the server averages the resulting models. The canonical FL algorithm; reduces communication by a factor of $E$ compared to vanilla distributed SGD.

Client Drift

The phenomenon that, with large $E$ (many local epochs per round), each user's local model drifts away from the global model toward its local objective. Causes FedAvg to converge to a biased stationary point on non-IID data.
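Client drift can be illustrated with a toy one-dimensional example (hypothetical numbers throughout): two users whose local objectives $f_k(w) = \tfrac{1}{2}(w - c_k)^2$ have optima at $c_1 = -4$ and $c_2 = +4$, so the global optimum is $w = 0$. Each runs $E$ full-batch gradient steps from the global model.

```python
# E gradient steps on the local quadratic f(w) = 0.5 * (w - c)^2.
def local_steps(w, c, E, lr=0.2):
    for _ in range(E):
        w -= lr * (w - c)          # gradient of 0.5 * (w - c)^2
    return w

for E in (1, 5, 50):
    w1 = local_steps(0.0, -4.0, E)
    w2 = local_steps(0.0, +4.0, E)
    print(E, abs(w1 - w2))         # inter-client spread grows with E
```

With $E = 1$ the two local models stay close to the global one; with $E = 50$ each has essentially converged to its own local optimum, and the spread between them saturates at the full distance between the local optima.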

Theorem: FedAvg Convergence on i.i.d. Data

Assume each user's local objective $F_k$ is $L$-smooth and $\mu$-strongly convex, user data is i.i.d. from a common distribution, and the stochastic gradient variance is bounded: $\mathbb{E}\|\nabla F_k(\mathbf{w}; \xi) - \nabla F_k(\mathbf{w})\|^2 \leq \sigma^2$.

For FedAvg with learning rate $\eta_t = 1/(L + \mu t)$, after $t$ rounds the expected suboptimality satisfies
$$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \;\leq\; \frac{\kappa^2}{T + t} \left(\sigma^2 \cdot \frac{E}{C n} + G^2\right),$$
where $\kappa = L/\mu$ is the condition number, $G^2$ bounds $\|\nabla F_k\|^2$, and $T$ is the initial round count.

The rate is $O(1/T)$ asymptotically, matching centralized SGD up to constants that depend on $E$, $C$, and $n$.

FedAvg on i.i.d. data converges at the same asymptotic rate as centralized SGD; the local-epoch parameter $E$ enters the constants but not the scaling. Operationally, this means you can save communication by a factor of $E$ (fewer rounds) at modest convergence cost.
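One can see both claims by plugging constants into the bound. All numbers below are illustrative placeholders, not values from the theorem.

```python
# The i.i.d. bound: kappa^2 / (T + t) * (sigma^2 * E / (C * n) + G^2).
def iid_bound(t, E, kappa=10.0, T=10.0, sigma2=1.0, C=0.1, n=100, G2=1.0):
    return kappa**2 / (T + t) * (sigma2 * E / (C * n) + G2)

# O(1/t): doubling the effective round count halves the bound.
print(iid_bound(t=990, E=1), iid_bound(t=1990, E=1))

# E enters only the constant, not the 1/t scaling.
for E in (1, 10, 100):
    print(E, round(iid_bound(t=990, E=E), 4))
```

The first print shows the exact halving; the loop shows that growing $E$ inflates the constant linearly while leaving the decay in $t$ untouched.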

The point is that convergence theory supports FedAvg's communication-saving strategy β€” the gains are "free" in the IID case. The non-IID case is harder; see the next theorem.

Theorem: FedAvg Convergence on non-i.i.d. Data

Under the same assumptions as Theorem 9.2.1, but with non-i.i.d. user data (user $k$'s local objective $F_k$ differs from the global $F$), FedAvg satisfies
$$\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \;\leq\; \frac{\kappa^2}{T + t} \left(\sigma^2 \frac{E}{C n} + G^2 + E \cdot \Gamma\right),$$
where $\Gamma = F(\mathbf{w}^*) - \sum_k (n_k/n_{\text{tot}}) F_k(\mathbf{w}^*_k) \geq 0$ measures the heterogeneity of user objectives ($0$ on i.i.d. data; large on highly non-i.i.d. data).

The convergence rate is still $O(1/T)$, but the constant includes an $E \cdot \Gamma$ term that grows with the number of local epochs $E$ on non-i.i.d. data. Hence: on non-i.i.d. data, too large an $E$ hurts convergence.

The heterogeneity term $\Gamma$ measures how much the user-local optima differ from the global optimum. On i.i.d. data, $\Gamma = 0$ β€” users all pull toward the same optimum, so local steps help. On non-i.i.d. data, users pull in different directions; each local epoch is partially wasted on overfitting to the user's local data.

Operationally: choose $E$ based on how non-i.i.d. the data is. For extremely heterogeneous data ($\Gamma$ large), $E = 1$ is safest. For mild heterogeneity, $E = 5$–$10$ is usually acceptable.
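The interaction between $E$ and $\Gamma$ is visible in the bound's constant alone. The parameter values below are hypothetical, chosen only to show how the $E \cdot \Gamma$ term comes to dominate as heterogeneity grows.

```python
# Constant factor in the non-i.i.d. bound: sigma^2*E/(C*n) + G^2 + E*Gamma.
def noniid_constant(E, Gamma, sigma2=1.0, C=0.1, n=100, G2=1.0):
    return sigma2 * E / (C * n) + G2 + E * Gamma

for Gamma in (0.0, 0.1, 1.0):      # 0 = i.i.d.; larger = more heterogeneous
    for E in (1, 5, 20):
        print(f"Gamma={Gamma}, E={E}: {noniid_constant(E, Gamma):.2f}")
```

At $\Gamma = 0$ the constant barely grows with $E$; at $\Gamma = 1$ it is dominated by $E \cdot \Gamma$, which is the formal reason $E = 1$ is safest under strong heterogeneity.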

FedAvg Convergence: Effect of Local Epochs $E$

The (simulated) plot shows FedAvg convergence loss as a function of the number of rounds, for several values of $E$ (local epochs). On i.i.d. data, larger $E$ is strictly better (fewer rounds to reach a target loss). On non-i.i.d. data, too-large $E$ causes divergence or a plateau. The plot illustrates why $E$ is a critical hyperparameter.

[Interactive figure parameters: non-IID strength (0 = IID, 2 = highly non-IID; default 0.5) and number of global rounds (default 100).]

Example: Communication Speedup of FedAvg vs. Vanilla SGD

Consider a CIFAR-10 training run with $n = 100$ users, $C = 0.1$, $E = 10$ local epochs, and an i.i.d. data split. Compare the communication cost against $E = 1$ (vanilla distributed SGD) to reach the same training loss.
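A back-of-envelope version of this comparison, assuming the idealized i.i.d. claim that $E$ local epochs cut the required number of rounds by a factor of $E$. The model size and round count are hypothetical example values, not measurements.

```python
n, C, E = 100, 0.1, 10             # users, participation rate, local epochs
d = 1_000_000                      # hypothetical model size in scalars
T_sgd = 1000                       # hypothetical rounds needed at E = 1

per_round = int(C * n) * d         # 10 selected users upload d scalars each
sgd_total = T_sgd * per_round      # total uplink for E = 1
fedavg_total = (T_sgd // E) * per_round  # ~T/E rounds at E = 10

print(sgd_total / fedavg_total)    # -> 10.0
```

The per-round uplink is identical in both cases; the entire $10\times$ saving comes from needing roughly $T/E$ rounds instead of $T$.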

Common Mistake: Too-Large Local Epochs on Non-IID Data

Mistake:

Set $E = 100$ on a heavily non-IID federated-learning task to maximize communication savings.

Correction:

Too large an $E$ on non-IID data causes client drift β€” each user's local model overfits to its local data, diverging from the global optimum. FedAvg convergence stalls or degrades. Typical production settings: $E \in \{1, 5, 10\}$, tuned by validation-accuracy monitoring. SCAFFOLD (Karimireddy et al.) and FedProx (Li et al.) explicitly mitigate client drift and allow larger $E$.

⚠️Engineering Note

FedAvg in Production

Production FedAvg deployments (Google Gboard, NVIDIA Flare, Apple on-device FL) typically use:

  • $E = 1$–$5$ local epochs per round
  • $C = 0.01$–$0.10$ participation per round
  • 100–1000 rounds per training phase
  • Per-device budget: 10 MB–100 MB uplink per session

Hyperparameter tuning is critical and typically done via offline simulations on held-out data. Production pipelines include automatic drift detection and early-stopping to guard against non-IID-induced divergence.
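A quick sanity check against the uplink budget above: one full-model upload at float32 precision, for a hypothetical parameter count (the model size is an example value, not from the text).

```python
params = 5_000_000                 # hypothetical model parameter count
upload_mb = params * 4 / 1e6       # float32 -> 4 bytes per parameter
print(upload_mb)                   # -> 20.0 (MB), inside the 10-100 MB budget
```

A 5M-parameter model costs 20 MB per upload, so a session budget of 10–100 MB allows only one to a handful of rounds per session β€” one more reason production deployments keep rounds (and hence communication) scarce.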

Practical Constraints
  • Typical $E = 1$–$5$, $C = 0.01$–$0.10$
  • Per-session uplink: 10–100 MB
  • Drift detection via validation accuracy

πŸ“‹ Ref: Google Gboard FL papers; NVIDIA Flare documentation

Key Takeaway

FedAvg saves communication by a factor of $E$ (local epochs) per round, with convergence $O(1/T)$ on i.i.d. data. On non-i.i.d. data, client drift limits the usable $E$; special algorithms (FedProx, SCAFFOLD) can partially mitigate this. FedAvg is the baseline against which every FL protocol β€” including the privacy-preserving variants of Chapters 10–12 β€” is measured.

Quick Check

Compared to vanilla distributed SGD, FedAvg with $E = 20$ local epochs and the same convergence target:

Uses $20\times$ less total communication (fewer rounds) on i.i.d. data.

Uses the same communication as vanilla SGD.

Uses $20\times$ more communication.

Always converges faster than vanilla SGD, regardless of data distribution.