The Federated Learning Paradigm

Why Federated Learning Exists

Consider the GPS app on a smartphone, the typing-suggestion feature of a mobile keyboard, the speech recognizer in a smart speaker, the recommendation engine of a streaming service. Each of these benefits from training on user data. But the data — typing patterns, voice samples, viewing histories, location traces — is highly private. Shipping it to a central server to train a model creates privacy risks, regulatory exposure (GDPR, CCPA), and user mistrust.

Federated Learning (FL), introduced by Google in 2017, reverses the data flow: the model goes to the data, not the other way around. Each user's device trains a local copy of the model on its private data, and only model updates (gradients or weights) are sent to the server. The server aggregates updates from many users to produce a new global model, which is redistributed.

The paradigm is elegant but raises every information-theoretic question in this book:

  • Privacy. Do gradient updates leak user data? (Yes — §9.4 and Chapter 10.)
  • Robustness. What if some users send corrupted updates? (Chapter 11's ByzSecAgg addresses this.)
  • Communication efficiency. How much traffic can we save without hurting convergence? (§9.3 and Chapter 12.)
  • Over-the-air transmission. Can we exploit the wireless channel? (Chapter 16's AirComp.)

Part III of this book takes each of these questions in turn. Chapter 9 sets the stage: the FL paradigm, the FedAvg algorithm, the communication costs, and the privacy concerns that motivate everything that follows.

Definition:

Federated Learning

A federated-learning system consists of:

  • $n$ users (clients), each holding a private local dataset $\mathcal{D}_k$ of size $|\mathcal{D}_k| = n_k$, $k = 1, \ldots, n$.
  • A central server (or parameter server) that holds the global model $\mathbf{w}_t \in \mathbb{R}^d$ at each round $t$.
  • A global objective $F(\mathbf{w}) = \sum_{k=1}^n \frac{n_k}{n_{\text{tot}}} F_k(\mathbf{w})$, where $F_k(\mathbf{w}) = \frac{1}{n_k} \sum_{\xi \in \mathcal{D}_k} \ell(\mathbf{w}; \xi)$ is user $k$'s local objective and $n_{\text{tot}} = \sum_k n_k$.

Constraint: users' local data $\mathcal{D}_k$ never leaves their devices. Communication consists only of model parameters or gradient updates.
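To make the notation concrete, here is a minimal sketch of the global objective as a weighted sum of local averages; the names `global_objective` and `loss` are illustrative, not from the text:

```python
import numpy as np

def global_objective(w, local_datasets, loss):
    """F(w) = sum_k (n_k / n_tot) * F_k(w), where F_k is the mean local loss."""
    n_tot = sum(len(D) for D in local_datasets)
    return sum(
        (len(D) / n_tot) * np.mean([loss(w, xi) for xi in D])
        for D in local_datasets
    )

# Toy check with a scalar squared loss: each sample xi is a pair (x, y).
loss = lambda w, xi: (w * xi[0] - xi[1]) ** 2
datasets = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 1.0)]]  # two users: n_1 = 2, n_2 = 1
F = global_objective(2.0, datasets, loss)            # user 1 fits perfectly at w = 2
# F = (2/3) * 0 + (1/3) * 1 = 1/3
```

Note that the server never needs the samples themselves to minimize $F$; it only needs each user's contribution to the gradient, which is exactly what the protocol below communicates.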

At each round $t$, the protocol proceeds as follows:

  1. Broadcast. The server sends $\mathbf{w}_t$ to a subset of users (those participating this round).
  2. Local training. Each participating user $k$ runs some number of local SGD steps on $\mathcal{D}_k$ to produce an updated model $\mathbf{w}_t^{(k)}$.
  3. Upload. The user sends back either its full model $\mathbf{w}_t^{(k)}$ or the model update $\mathbf{g}_k = \mathbf{w}_t^{(k)} - \mathbf{w}_t$.
  4. Aggregate. The server computes a weighted average, producing $\mathbf{w}_{t+1}$.
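The four steps above can be sketched in a few lines of NumPy on a toy least-squares problem. This is a minimal illustration, not a production implementation; the helper names `local_sgd` and `fedavg_round` are our own:

```python
import numpy as np

rng = np.random.default_rng(42)

def local_sgd(w, X, y, epochs=5, lr=0.05):
    """Step 2: a few epochs of full-batch gradient descent on a local least-squares objective."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(w_global, clients, C=0.1):
    n = len(clients)
    m = max(1, int(C * n))
    chosen = rng.choice(n, size=m, replace=False)   # step 1: sample the participants
    updates, sizes = [], []
    for k in chosen:
        X, y = clients[k]
        updates.append(local_sgd(w_global, X, y))   # steps 2-3: local training + upload
        sizes.append(len(y))
    # step 4: data-size-weighted average of the returned models
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

d = 5
clients = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(50)]
w = np.zeros(d)
for t in range(10):
    w = fedavg_round(w, clients, C=0.2)
```

Note that the raw pairs `(X, y)` are only ever touched inside `local_sgd`; the server sees nothing but the returned weight vectors, which is exactly the constraint stated above.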

The choice of aggregation rule and local-training schedule determines the specific FL algorithm. FedAvg (§9.2) is the canonical choice.

The privacy story is built into the architecture: raw data never leaves the device. But — as §9.4 develops in detail — the gradient updates that do leave the device can still leak information. Parts III–V of this book progressively strengthen the privacy guarantees with explicit cryptographic / information-theoretic protocols.

Federated Learning

A distributed machine-learning paradigm in which user devices train locally on their private data and only model updates are shared with a central server. The server aggregates updates from many users into a global model. Introduced by Google in 2017 (McMahan et al.).

Client Participation Rate CC

The fraction of the $n$ users selected to participate in each FL round, $C \in (0, 1]$. Typical values in production systems: $C = 0.1$ (Google Gboard) or smaller — only a small fraction of devices trains per round.

Federated Learning vs. Data-Center Distributed SGD

| Property | Data-Center SGD (Ch. 1) | Federated Learning (this chapter) |
| --- | --- | --- |
| Workers | Homogeneous (same GPUs, fast network) | Heterogeneous (mobile, IoT, slow networks) |
| Data distribution | Shuffled across workers (§7) | Stays on user devices (non-IID) |
| Participation | All workers, every round | Subset ($C \cdot n$) per round |
| Bandwidth | High (10–100 Gbps backhaul) | Low (mobile: Mbps; Wi-Fi: tens of Mbps) |
| Privacy | Trust boundary = single data center | Trust boundary = each user device |
| Dropouts | Rare; re-execute on failure | Common (users go offline, switch networks) |
| Round frequency | High (seconds or minutes) | Low (hours or days per round) |

The Non-IID Challenge

In data-center SGD (Chapter 1), the dataset is centrally shuffled (Chapter 7), so every worker's mini-batch is effectively drawn i.i.d. from the global distribution. In federated learning, this is not true: each user's data reflects their personal patterns — one user types mostly in English, another in Portuguese; one photographs mostly food, another mostly landscapes. The per-user datasets $\mathcal{D}_k$ are heterogeneous, often highly skewed.

Any convergence analysis of FL must therefore handle non-IID data. The consequence is concrete: standard FedAvg can diverge or converge very slowly on non-IID FL setups. Fixing this is a central research direction: FedProx (Li et al., 2018), SCAFFOLD (Karimireddy et al., 2020), and many other algorithms target non-IID convergence. The FedAvg analysis in §9.2 assumes i.i.d. data for simplicity; the non-IID case is discussed briefly, with pointers to the literature.
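A common way to produce the kind of label skew described above in experiments is a Dirichlet partition: each user's class proportions are drawn from $\mathrm{Dir}(\alpha)$, with small $\alpha$ giving highly skewed (non-IID) splits. A minimal sketch, with the helper name `label_skew_partition` our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def label_skew_partition(labels, n_users, alpha=0.5):
    """Dirichlet label-skew split: each user's class mix ~ Dir(alpha).
    Small alpha -> highly skewed (non-IID); large alpha -> near-uniform."""
    n_classes = labels.max() + 1
    user_idx = [[] for _ in range(n_users)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # split this class's samples across users with Dirichlet proportions
        props = rng.dirichlet(alpha * np.ones(n_users))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for u, part in enumerate(np.split(idx, cuts)):
            user_idx[u].extend(part.tolist())
    return user_idx

labels = rng.integers(0, 10, size=1000)   # toy 10-class labels
parts = label_skew_partition(labels, n_users=20, alpha=0.1)
```

With `alpha=0.1`, most users end up holding samples from only one or two classes — the setting in which vanilla FedAvg struggles.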

FL is NOT Privacy-Preserving by Default

A common misconception: "Federated learning keeps data on devices, therefore it's private." This is wrong. The gradient updates shipped to the server are rich with information about the underlying training data. Gradient-inversion attacks (DLG, iDLG, GradInversion; Zhu-Liu-Han 2019 and follow-ons) have demonstrated that individual training samples can be reconstructed from gradient updates with varying accuracy.

FL's privacy guarantee is architectural (data stays local) but not information-theoretic (gradients leak data). Section 9.4 develops this point quantitatively. The formal privacy guarantees come from the cryptographic / information-theoretic protocols of Chapter 10 onward.
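The leakage is easy to see in the simplest case. For a single-sample squared loss $(\mathbf{w}^\top \mathbf{x} - y)^2$, the uploaded gradient is $2(\mathbf{w}^\top \mathbf{x} - y)\,\mathbf{x}$ — a scalar multiple of the private sample itself. A toy demonstration (illustrative, far simpler than DLG-style attacks on deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)          # private training sample
y = 1.0                         # private label
w = rng.normal(size=d)          # current global model

# Gradient of the squared loss (w.x - y)^2 with respect to w:
residual = w @ x - y
g = 2 * residual * x            # this is what the client uploads

# Server-side "attack": g is collinear with x, so the direction of the
# private sample is fully exposed by the update.
cos = abs((g / np.linalg.norm(g)) @ (x / np.linalg.norm(x)))
# cos = 1.0: the update points exactly along the private sample
```

Deep networks do not leak quite this directly, but the gradient-inversion attacks cited above show that iterative optimization can recover recognizable training samples from their gradients all the same.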

Example: A Single FedAvg Round

Describe a single FedAvg round for $n = 1000$ users, participation rate $C = 0.05$, and $E = 5$ local epochs. Identify the per-round communication traffic.
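A back-of-the-envelope sketch of the traffic for these parameters, assuming float32 parameters and an illustrative model size of $d = 10^6$ (the value of $d$ is our assumption, not from the exercise):

```python
n, C, E, d = 1000, 0.05, 5, 1_000_000   # d is an assumed model size
m = int(C * n)                          # 50 participants this round
bytes_per_model = 4 * d                 # float32: 4 bytes per parameter
downlink = m * bytes_per_model          # server broadcasts w_t to each participant
uplink = m * bytes_per_model            # each participant returns its update
total_gb = (downlink + uplink) / 1e9    # 0.4 GB per round
# Note: the E = 5 local epochs cost device compute, not communication --
# per-round traffic depends only on C, n, and d.
```

This $C \cdot n \cdot d$ scaling is exactly what the compression techniques of §9.3 attack.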

Per-Round FL Communication Cost

Plot the per-round communication cost of FedAvg as a function of the model size $d$ and client participation rate $C$, for a fixed number of users $n$. The communication scales as $C \cdot n \cdot d$ — reducing any of the three factors saves cost. This motivates the compression techniques of §9.3.

[Interactive plot: per-round communication cost vs. model dimension $d$ (up to $10^8$ on the x-axis), for a total user population $n = 1000$ and participation fraction $C = 0.1$.]

⚠️ Engineering Note

Federated Learning in Production

Major production FL deployments:

  • Google Gboard (keyboard prediction). The first large-scale FL deployment, circa 2017. Trains next-word prediction on user typing patterns without raw text leaving devices.
  • Apple Siri / autocorrect. On-device personalization with federated updates.
  • NVIDIA Flare. Production-grade FL framework for healthcare and research.
  • OpenFL / IBM Federated Learning. Research-to-production FL platforms.

Common deployment patterns:

  1. Cross-device FL — millions of heterogeneous devices (mobile phones, IoT), participation rate $C < 0.01$.
  2. Cross-silo FL — a small number (tens) of well-connected institutions (hospitals, banks). Higher $C$, more compute per user.
  3. Hybrid — cross-silo aggregation of cross-device FL rounds.

Each pattern has different system constraints, which motivate different coded / privacy / robustness techniques. Cross-device FL emphasizes communication efficiency; cross-silo FL emphasizes privacy guarantees against institutional adversaries.

Practical Constraints
  • Cross-device: $n \sim 10^6$ devices, $C < 0.01$ per round, Wi-Fi / 4G uplink

  • Cross-silo: $n \sim 10$ institutions, $C = 1$, dedicated network

  • Round latency: minutes (cross-silo) to hours (cross-device)

📋 Ref: Bonawitz et al. 2019; NVIDIA Flare v2; Google FL whitepaper

Historical Note: From Distributed SGD to FedAvg

2017–present

The term "federated learning" was introduced by H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas in their 2017 AISTATS paper "Communication-Efficient Learning of Deep Networks from Decentralized Data." The motivation was practical: Google's Gboard team wanted to train next-word prediction on user typing without sending the typed text to the server. The paper introduced the FedAvg algorithm and demonstrated that it converged on both i.i.d. and non-i.i.d. splits of MNIST and CIFAR-10.

The name has since stuck, and the field has exploded: over 10,000 papers on FL since 2017, major production deployments at Google, Apple, NVIDIA, IBM, and others, and a rich theoretical literature on non-IID convergence, privacy, and robustness. The CommIT group has made several key contributions: secure aggregation optimality (Chapter 10), ByzSecAgg (Chapter 11), CCESA (Chapter 12), and information-theoretically secure federated representation learning (Chapter 17).

Key Takeaway

Federated learning keeps data on user devices but exposes gradient updates. The three-axis golden thread of this book (privacy, robustness, communication efficiency) is most urgently needed in FL. Section 9.2 introduces FedAvg; §9.3 covers communication efficiency; §9.4 makes the privacy concern precise. Chapters 10–12 then develop formal information-theoretic protocols that address each axis in turn.

Common Mistake: FL is Not Just Distributed SGD

Mistake:

Apply Chapter 1's distributed-SGD analysis directly to federated learning.

Correction:

FL differs from data-center distributed SGD in several crucial ways: (i) data stays on user devices (privacy concern), (ii) data is non-IID across users (FedAvg can diverge on non-IID data with vanilla local SGD), (iii) participation is sparse (only $C \cdot n$ users per round), (iv) users can drop out mid-round, (v) uplink bandwidth is scarce. Each difference motivates a different algorithmic or information-theoretic tool. Do not conflate the two.

Quick Check

In federated learning, the central server learns:

The aggregate gradient $\mathbf{G} = \sum_k \mathbf{g}_k$ — nothing else about individual user data.

Each individual user's gradient update. This is the default in vanilla FL and leaks information about local data.

Nothing about user data — FL is private by design.

Only the aggregate of users with high enough data quality.