FedAvg and Its Convergence
Why FedAvg is the Canonical FL Algorithm
Vanilla distributed SGD sends each user's single-step gradient to the server every iteration. This has two problems for FL:
- High communication cost. With $T$ iterations and $d$ scalars of uplink per iteration, each user sends $Td$ scalars in total, too much for mobile uplink.
- Sparse participation. Only a fraction $C$ of users can participate per round; requiring every user in every iteration is infeasible.
FedAvg (McMahan et al. 2017) solves both: each selected user runs $E$ local SGD epochs before uploading. The result is that $T$ global rounds correspond to $ET$ local SGD epochs, so communication is reduced by a factor of $E$. Convergence theory guarantees this works for i.i.d. data and, with caveats, for non-i.i.d. data.
The point is that FedAvg is the baseline against which every later FL protocol (secure aggregation, differential privacy, AirComp) is measured. Section 9.2 formalizes the algorithm and states its convergence properties.
FedAvg (McMahan et al.)
Complexity: $O(d)$ uplink per user per round. The key parameter is $E$, the number of local epochs: with $E = 1$, FedAvg reduces to vanilla distributed SGD. Larger $E$ saves communication but can introduce client drift (users' local models diverging from the global one), especially on non-IID data.
Definition: FedAvg Algorithm (Formal)
The FedAvg algorithm for federated learning with $n$ users, participation rate $C$, local epochs $E$, and learning rate $\eta$ runs the loop:
At round $t$: the server selects a set $S_t$ of $m = \max(\lfloor Cn \rfloor, 1)$ users uniformly at random, broadcasts $w_t$, each selected user $k$ performs $E$ local SGD epochs starting from $w_t$ to produce $w_{t+1}^k$, and the server computes $w_{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, w_{t+1}^k$.
The aggregation is a weighted average by user data sizes, giving each user's contribution in proportion to their dataset size.
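The round above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the least-squares local objective, the function name `fedavg_round`, and all default parameter values are assumptions for the sketch.

```python
import numpy as np

def fedavg_round(w_global, users, data_sizes, C=0.1, E=5, lr=0.1, rng=None):
    """One FedAvg round: sample a fraction C of users, run E local epochs
    each, then aggregate by a data-size-weighted average."""
    rng = rng or np.random.default_rng(0)
    n = len(users)
    m = max(int(C * n), 1)
    selected = rng.choice(n, size=m, replace=False)

    local_models, weights = [], []
    for k in selected:
        w = w_global.copy()
        X, y = users[k]                            # user k's local dataset
        for _ in range(E):                         # E local epochs (full-batch here)
            grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
            w -= lr * grad
        local_models.append(w)
        weights.append(data_sizes[k])

    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                       # n_k / sum_j n_j over selected users
    return sum(p * w for p, w in zip(weights, local_models))
```

Iterating `fedavg_round` from a shared initial model reproduces the server loop of the definition above.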
FedAvg
Federated Averaging algorithm introduced by McMahan et al. (2017). Each round, a subset of users runs $E$ local SGD epochs and the server averages the resulting models. Canonical FL algorithm; reduces communication by a factor of $E$ compared to vanilla distributed SGD.
Client Drift
The phenomenon that with large $E$ (many local epochs per round), each user's local model drifts away from the global model toward its local objective. Causes FedAvg to converge to a biased stationary point on non-IID data.
Theorem: FedAvg Convergence on i.i.d. Data
Assume each user's local objective $F_k$ is $L$-smooth and $\mu$-strongly convex, user data is i.i.d. from a common distribution, and the stochastic gradient variance is bounded: $\mathbb{E}\,\|\nabla F_k(w;\xi) - \nabla F_k(w)\|^2 \le \sigma^2$.
For FedAvg with decaying learning rate $\eta_t = \frac{2}{\mu(\gamma + t)}$, after $T$ rounds the expected suboptimality satisfies
$$\mathbb{E}[F(w_T)] - F^* \;\le\; \frac{\kappa}{\gamma + T}\left(\frac{2B}{\mu} + \frac{\mu\gamma}{2}\,\mathbb{E}\,\|w_0 - w^*\|^2\right),$$
where $\kappa = L/\mu$ is the condition number, $G$ bounds $\mathbb{E}\,\|\nabla F_k(w;\xi)\|^2 \le G^2$, $B$ collects the gradient-noise and local-drift terms (see the proof sketch below), and $\gamma = \max(8\kappa, E)$ is the initial round count.
The rate is $O(1/T)$ asymptotically, matching centralized SGD up to constants that depend on $E$, $C$, and $n$.
FedAvg on i.i.d. data converges at the same asymptotic rate as centralized SGD; the local-epoch parameter $E$ enters the constants but not the scaling. Operationally, this means you can save communication by a factor of $E$ (fewer rounds) at modest convergence cost.
The point is that convergence theory supports FedAvg's communication-saving strategy: the gains are essentially "free" in the IID case. The non-IID case is harder; see the next theorem.
Aggregate update vs. centralized step
The FedAvg update after one round with $E$ local epochs looks similar to a single centralized SGD step with effective gradient $\eta \sum_k p_k \sum_{e=1}^{E} \nabla F_k(w_t^{k,e})$; only approximately, because the local iterates $w_t^{k,e}$ drift away from $w_t$.
Bound the drift
Each user's local trajectory stays within a ball of radius proportional to $\eta E G$ around the global model. On i.i.d. data, the per-user local gradient has mean equal to the global gradient, so local-epoch drift averages out.
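The "drift averages out" claim can be checked numerically. In this sketch the least-squares model, the user count, and all constants are assumptions for illustration: each user's full-batch gradient fluctuates, but the mean across users matches the population (global) gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
w = np.zeros(2)                        # current global model
global_grad = 2 * (w - w_true)         # population least-squares gradient at w

# Each user draws its own i.i.d. sample from the same distribution.
user_grads = []
for _ in range(500):                   # 500 users
    X = rng.normal(size=(50, 2))
    y = X @ w_true
    user_grads.append(2 * X.T @ (X @ w - y) / 50)

# Per-user gradients are noisy, but their mean matches the global gradient,
# so the drift accumulated over E local steps cancels across users.
mean_grad = np.mean(user_grads, axis=0)
```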
Apply centralized SGD analysis
With the drift bounded, the SGD inequality applies modulo the extra variance term from local trajectories. Algebraic manipulations yield the rate.
Constants depend on $E, C, n$
The variance term is $B = \sum_k p_k^2 \sigma_k^2 + 8(E-1)^2 G^2$: more local epochs inflate the variance (client drift), while averaging over more participants reduces it (the sampling variance shrinks with $m = Cn$). The full analysis is in Li et al. (2020), Thm. 1.
Theorem: FedAvg Convergence on non-i.i.d. Data
Under the same assumptions as Theorem 9.2.1, but with non-i.i.d. user data (user $k$'s local objective $F_k$ differs from the global $F = \sum_k p_k F_k$), FedAvg satisfies
$$\mathbb{E}[F(w_T)] - F^* \;\le\; \frac{\kappa}{\gamma + T}\left(\frac{2(B + 6L\Gamma)}{\mu} + \frac{\mu\gamma}{2}\,\mathbb{E}\,\|w_0 - w^*\|^2\right),$$
where $\Gamma = F^* - \sum_k p_k F_k^*$ measures the heterogeneity of user objectives ($\Gamma = 0$ on i.i.d. data; large on highly non-i.i.d. data).
The convergence rate is still $O(1/T)$, but the constant includes a term that grows with the local-epoch count $E$ on non-i.i.d. data. Hence: on non-i.i.d. data, too-large $E$ hurts convergence.
The heterogeneity term $\Gamma$ measures how much the user-local optima differ from the global optimum. On i.i.d. data, $\Gamma = 0$: users all pull toward the same optimum, so local steps help. On non-i.i.d. data, users pull in different directions; each local epoch is partially wasted on overfitting to the user's local data.
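For intuition, $\Gamma$ has a closed form for toy quadratic local objectives $F_k(w) = \tfrac{1}{2}(w - c_k)^2$: each $F_k^* = 0$ and the global minimizer is the weighted mean of the centers, so $\Gamma$ reduces to the weighted variance of the $c_k$. The model and function name below are illustrative assumptions.

```python
import numpy as np

def heterogeneity_gamma(centers, weights=None):
    """Gamma = F* - sum_k p_k F_k* for quadratic locals F_k(w) = 0.5*(w - c_k)^2.
    Each local optimum value F_k* is 0; the global minimizer is the weighted
    mean of the centers, so Gamma equals the weighted variance of the centers."""
    c = np.asarray(centers, dtype=float)
    p = np.full(len(c), 1 / len(c)) if weights is None else np.asarray(weights, float)
    w_star = p @ c                            # global minimizer
    F_star = 0.5 * p @ (w_star - c) ** 2      # global optimal value
    return F_star - 0.0                       # minus sum_k p_k F_k*, which is 0

# Identical local objectives (i.i.d.-like): Gamma = 0.
# Spread-out centers (non-i.i.d.): Gamma > 0.
```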
Operationally: choose $E$ carefully based on how non-i.i.d. the data is. For extremely heterogeneous data ($\Gamma$ large), $E = 1$ is safest. For mild heterogeneity, a moderate $E$ is usually acceptable.
Heterogeneity enters the drift bound
User $k$'s local trajectory over $E$ epochs has drift of order $\eta (E-1) G$: larger $E$ increases the drift and the effective variance.
The $\Gamma$ term
Algebraic analysis reveals that the $\Gamma$ term is the additional cost of non-i.i.d. local training: each local epoch moves the user's model toward its own local optimum, away from the global one. Averaging more often (smaller $E$) reduces the bias at the cost of more communication rounds.
Combine
Li et al. (2020) Thm. 2 establishes the bound rigorously. The key takeaway: non-i.i.d. data is not fatal, but it requires care in choosing $E$.
FedAvg Convergence: Effect of Local Epochs
Plot (simulated): FedAvg loss as a function of the number of rounds, for several values of $E$ (local epochs). On i.i.d. data, larger $E$ is strictly better (fewer rounds to reach a target loss). On non-i.i.d. data, too-large $E$ causes divergence or a plateau. The plot illustrates why $E$ is a critical hyperparameter.
Parameters: non-IID strength (0 = IID, 2 = highly non-IID); number of global rounds.
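A minimal simulation in the spirit of this plot, assuming scalar quadratic local objectives with per-user curvatures (the function name and all constants are illustrative): it shows that larger $E$ is harmless on IID data but leaves a residual bias relative to the global optimum on non-IID data.

```python
import numpy as np

def simulate_fedavg(E, noniid=1.0, rounds=200, n=20, lr=0.1, seed=0):
    """FedAvg on heterogeneous quadratics F_k(w) = 0.5 * a_k * (w - c_k)^2.
    `noniid` scales the spread of the per-user optima c_k (0 = identical
    objectives). Returns the final global model and its distance to the
    true global optimum."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.5, 2.0, size=n)          # per-user curvatures
    c = noniid * rng.normal(size=n)            # per-user optima
    w_star = (a @ c) / a.sum()                 # minimizer of the global objective
    w = 5.0                                    # shared initial model
    for _ in range(rounds):
        locals_ = []
        for ak, ck in zip(a, c):               # full participation for simplicity
            wk = w
            for _ in range(E):                 # E local gradient steps
                wk -= lr * ak * (wk - ck)
            locals_.append(wk)
        w = float(np.mean(locals_))            # unweighted average (equal n_k)
    return w, abs(w - w_star)

# Sweeping E: with noniid=0 every E converges to the optimum; with noniid>0,
# E=1 is unbiased while E>1 converges to a slightly biased fixed point.
```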
Example: Communication Speedup of FedAvg vs. Vanilla SGD
For a CIFAR-10 training run with $n$ users, participation rate $C$, $E = 5$ local epochs, and an i.i.d. data split, compare the communication cost against $E = 1$ (vanilla distributed SGD) to reach the same training loss.
Per-round cost (both)
Selected users per round: $m = Cn$. Per-round uplink: $m \cdot d$ scalars (one $d$-dimensional model per selected user), identical in both settings.
Rounds to target (from McMahan et al.)
$E = 1$ (vanilla): roughly $5\times$ as many rounds to the target accuracy as $E = 5$ (empirical result from the FedAvg paper).
Communication ratio
Vanilla: $\sim 5T \cdot md$ scalars, where $T$ is the number of rounds FedAvg with $E = 5$ needs. FedAvg: $T \cdot md$ scalars. Savings factor: $5\times$, exactly $E$, matching theory.
Local compute increase
FedAvg does more local compute per round but runs fewer rounds. Per user, the total compute is the same ($\sim 5T$ epochs' worth of local steps either way). The savings are in communication, not compute.
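The bookkeeping above can be checked in a few lines; the numeric values ($d$, $m$, $T$) are illustrative placeholders, not figures from the paper.

```python
# Communication-cost comparison for FedAvg (E local epochs, T rounds)
# vs. vanilla distributed SGD (E=1, hence ~E*T rounds to the same target).
def uplink_scalars(rounds, m, d):
    """Total uplink: rounds * (selected users per round) * (model dimension)."""
    return rounds * m * d

d, m, E, T = 1_000_000, 10, 5, 400    # illustrative values
vanilla = uplink_scalars(E * T, m, d) # vanilla needs ~E times as many rounds
fedavg = uplink_scalars(T, m, d)
assert vanilla / fedavg == E          # savings factor is exactly E
```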
Common Mistake: Too-Large Local Epochs on Non-IID Data
Mistake:
Set $E$ very large on a heavily non-IID federated-learning task to maximize communication savings.
Correction:
Too-large $E$ on non-IID data causes client drift: each user's local model overfits to its local data, diverging from the global optimum, and FedAvg convergence stalls or degrades. Typical production settings use a small $E$ (a few local epochs), tuned by validation-accuracy monitoring. SCAFFOLD (Karimireddy et al.) and FedProx (Li et al.) explicitly mitigate client drift and allow larger $E$.
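FedProx's mitigation is a proximal penalty $\frac{\mu}{2}\|w - w_t\|^2$ added to each local objective, pulling local iterates back toward the global model. The sketch below assumes a least-squares local objective; the function name and parameter values are illustrative, not from the FedProx paper's code.

```python
import numpy as np

def fedprox_local_update(w_global, X, y, E=10, lr=0.1, mu=0.01):
    """FedProx-style local update (sketch): local SGD on the least-squares
    loss plus a proximal term mu/2 * ||w - w_global||^2 that limits how far
    the local model can drift from the global one, even for large E."""
    w = w_global.copy()
    for _ in range(E):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # local least-squares gradient
        grad += mu * (w - w_global)            # proximal gradient term
        w -= lr * grad
    return w
```

With `mu = 0` this is exactly the FedAvg local step; increasing `mu` trades local progress for reduced drift.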
FedAvg in Production
Production FedAvg deployments (Google Gboard, NVIDIA Flare, Apple on-device FL) typically use:
- A few local epochs $E$ per round
- A small participation fraction $C$ per round
- 100–1000 rounds per training phase
- Per-device budget: 10 MB–100 MB uplink per session
Hyperparameter tuning is critical and typically done via offline simulations on held-out data. Production pipelines include automatic drift detection and early-stopping to guard against non-IID-induced divergence.
- Typical: small $E$, small $C$
- Per-session uplink: 10–100 MB
- Drift detection via validation accuracy
Key Takeaway
FedAvg saves communication by a factor of $E$ (local epochs) per round, with $O(1/T)$ convergence on i.i.d. data. On non-i.i.d. data, client drift limits the usable $E$; special algorithms (FedProx, SCAFFOLD) can partially mitigate this. FedAvg is the baseline against which every FL protocol, including the privacy-preserving variants of Chapters 10–12, is measured.
Quick Check
Compared to vanilla distributed SGD, FedAvg with $E$ local epochs and the same convergence target:
Uses roughly $E\times$ less total communication (fewer rounds) on i.i.d. data.
Uses the same communication as vanilla SGD.
Uses more communication.
Always converges faster than vanilla SGD, regardless of data distribution.
$E$ local epochs per round reduce the required number of global rounds by a factor of about $E$, provided convergence is not degraded by non-IID drift.