The Federated Learning Paradigm
Why Federated Learning Exists
Consider the GPS app on a smartphone, the typing-suggestion feature of a mobile keyboard, the speech recognizer in a smart speaker, the recommendation engine of a streaming service. Each of these benefits from training on user data. But the data — typing patterns, voice samples, viewing histories, location traces — is highly private. Shipping it to a central server to train a model creates privacy risks, regulatory exposure (GDPR, CCPA), and user mistrust.
Federated Learning (FL), introduced by Google in 2017, reverses the data flow: the model goes to the data, not the other way around. Each user's device trains a local copy of the model on its private data, and only model updates (gradients or weights) are sent to the server. The server aggregates updates from many users to produce a new global model, which is redistributed.
The paradigm is elegant but raises every information-theoretic question in this book:
- Privacy. Do gradient updates leak user data? (Yes — §9.4 and Chapter 10.)
- Robustness. What if some users send corrupted updates? (Chapter 11's ByzSecAgg addresses this.)
- Communication efficiency. How much traffic can we save without hurting convergence? (§9.3 and Chapter 12.)
- Over-the-air transmission. Can we exploit the wireless channel? (Chapter 16's AirComp.)
Part III of this book takes each of these questions in turn. Chapter 9 sets the stage: the FL paradigm, the FedAvg algorithm, the communication costs, and the privacy concerns that motivate everything that follows.
Definition: Federated Learning
A federated-learning system consists of:
- $N$ users (clients), indexed $i = 1, \dots, N$, each holding a private local dataset $\mathcal{D}_i$ of size $n_i$.
- A central server (or parameter server) that holds the global model $\mathbf{w}^{(t)}$ at each round $t$.
- A global objective $F(\mathbf{w}) = \sum_{i=1}^{N} p_i F_i(\mathbf{w})$, where $F_i$ is user $i$'s local objective and $p_i = n_i / \sum_{j=1}^{N} n_j$.
Constraint: users' local data never leaves their devices. Communication is only of model parameters or gradient updates.
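Written out explicitly (assuming, as is standard, that each local objective is an empirical risk over $\mathcal{D}_i$ with some per-sample loss $\ell$, which is an assumption not spelled out above):

```latex
F(\mathbf{w}) = \sum_{i=1}^{N} p_i\, F_i(\mathbf{w}),
\qquad
F_i(\mathbf{w}) = \frac{1}{n_i} \sum_{\xi \in \mathcal{D}_i} \ell(\mathbf{w}; \xi),
\qquad
p_i = \frac{n_i}{\sum_{j=1}^{N} n_j}.
```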
At each round $t$, the protocol proceeds:
- Broadcast. Server sends $\mathbf{w}^{(t)}$ to a subset $\mathcal{S}_t$ of users (the ones participating this round).
- Local training. Each participating user $i$ runs some number of local SGD steps on $\mathcal{D}_i$ to produce an updated model $\mathbf{w}_i^{(t+1)}$.
- Upload. User $i$ sends back either its full model $\mathbf{w}_i^{(t+1)}$ or the gradient update $\Delta_i^{(t)} = \mathbf{w}_i^{(t+1)} - \mathbf{w}^{(t)}$.
- Aggregate. Server computes a weighted average, producing $\mathbf{w}^{(t+1)}$.
The choice of aggregation rule and local-training schedule determines the specific FL algorithm. FedAvg (§9.2) is the canonical choice.
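To make the round concrete, here is a minimal NumPy sketch of one FedAvg round with size-weighted aggregation, under illustrative assumptions: the toy least-squares loss and the helper `local_sgd` are placeholders, not part of the protocol definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(w, data, epochs=1, lr=0.1):
    """Illustrative local update: a few epochs of SGD on a toy least-squares loss."""
    X, y = data
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]   # per-sample gradient of 0.5*(x^T w - y)^2
            w = w - lr * grad
    return w

def fedavg_round(w_global, datasets, q=0.1):
    """One FedAvg round: select clients, train locally, aggregate weighted by data size."""
    N = len(datasets)
    K = max(1, int(q * N))
    selected = rng.choice(N, size=K, replace=False)           # server picks S_t
    local_models, sizes = [], []
    for i in selected:                                         # broadcast + local training
        local_models.append(local_sgd(w_global.copy(), datasets[i]))
        sizes.append(len(datasets[i][1]))                      # n_i, known a priori
    p = np.array(sizes) / sum(sizes)
    return sum(p_i * w_i for p_i, w_i in zip(p, local_models)) # weighted average w^{(t+1)}

# Toy usage: 100 clients, each with 20 samples of dimension d = 5
d = 5
datasets = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(100)]
w = np.zeros(d)
for t in range(10):
    w = fedavg_round(w, datasets, q=0.1)
```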
The privacy story is built into the architecture: raw data never leaves the device. But — as §9.4 develops in detail — the gradient updates that do leave the device can still leak information. Parts III–V of this book progressively strengthen the privacy guarantees with explicit cryptographic / information-theoretic protocols.
Federated Learning
A distributed machine-learning paradigm in which user devices train locally on their private data and only model updates are shared with a central server. The server aggregates updates from many users into a global model. Introduced by Google in 2017 (McMahan et al.).
Client Participation Rate
The fraction $q$ of the $N$ users selected to participate in each FL round, so $K = qN$ users train per round. Typical production values (e.g., Google Gboard) are very small — only a small fraction of devices trains per round.
Federated Learning vs. Data-Center Distributed SGD
| Property | Data-Center SGD (Ch. 1) | Federated Learning (this chapter) |
|---|---|---|
| Workers | Homogeneous (same GPUs, fast network) | Heterogeneous (mobile, IoT, slow networks) |
| Data distribution | Shuffled across workers (§7) | Stays on user devices (non-IID) |
| Participation | All workers participate every round | Subset ($K = qN \ll N$) participates per round |
| Bandwidth | High (10-100 Gbps backhaul) | Low (mobile: Mbps-scale; Wi-Fi: tens of Mbps) |
| Privacy | Trust boundary = single data center | Trust boundary = each user device |
| Dropouts | Rare; re-execute on failure | Common (users go offline, switch networks) |
| Round frequency | High (seconds or minutes) | Low (hours or days per round) |
The Non-IID Challenge
In data-center SGD (Chapter 1), the dataset is centrally shuffled (Chapter 7), so every worker's mini-batch is effectively drawn i.i.d. from the global distribution. In federated learning, this is not true: each user's data reflects their personal patterns — one user types mostly in English, another in Portuguese; one photographs mostly food, another mostly landscapes. The per-user distributions are heterogeneous, often highly skewed.
The upshot is that FL's convergence analysis must handle non-IID data. The consequence: standard FedAvg can diverge, or converge very slowly, on non-IID FL setups. Fixing this is a central research direction: FedProx (Li et al. 2018), SCAFFOLD (Karimireddy et al. 2020), and many other algorithms address non-IID convergence. Chapter 9's FedAvg analysis (§9.2) assumes i.i.d. data for simplicity; the non-IID case is discussed briefly, with pointers to the literature; one representative fix is sketched below.
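FedProx, for instance, modifies each client's local objective by adding a proximal term $\frac{\mu}{2}\lVert \mathbf{w} - \mathbf{w}^{(t)} \rVert^2$ that discourages local models from drifting far from the global one. A minimal sketch of the modified local step, assuming a toy least-squares loss (the loss and step size are illustrative, not taken from the FedProx paper):

```python
import numpy as np

def fedprox_local_step(w, w_global, x, y, mu=0.01, lr=0.1):
    """One SGD step on the FedProx local objective:
    per-sample loss + (mu/2) * ||w - w_global||^2."""
    grad_loss = (x @ w - y) * x        # gradient of the toy least-squares loss
    grad_prox = mu * (w - w_global)    # gradient of the proximal term
    return w - lr * (grad_loss + grad_prox)
```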
FL is NOT Privacy-Preserving by Default
A common misconception: "Federated learning keeps data on devices, therefore it's private." This is wrong. The gradient updates shipped to the server are rich with information about the underlying training data. Gradient-inversion attacks (DLG, iDLG, GradInversion; Zhu-Liu-Han 2019 and follow-ons) have demonstrated that individual training samples can be reconstructed from gradient updates with varying accuracy.
FL's privacy guarantee is architectural (data stays local) but not information-theoretic (gradients leak data). Section 9.4 develops this point quantitatively. The formal privacy guarantees come from the cryptographic / information-theoretic protocols of Chapter 10 onward.
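To see why gradients leak, consider the simplest possible case: a single linear layer and a batch of one example. The weight gradient is then a rank-one outer product of the output error and the input, so the server can read off the input (up to scale) with no iterative attack at all. A minimal sketch with illustrative dimensions and a squared-error loss (not the setup of the cited attacks):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 784, 10
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)              # one private training sample
y = np.eye(d_out)[3]                   # its one-hot label

# Squared-error loss on a single example: grad_W = (W x - y) x^T, a rank-one matrix
err = W @ x - y
grad_W = np.outer(err, x)

# The server, seeing only grad_W, recovers x up to scale from any nonzero row
row = grad_W[np.argmax(np.abs(err))]
x_hat = row / np.linalg.norm(row)
print(np.abs(x_hat @ x) / np.linalg.norm(x))   # ~1.0: x_hat is aligned with x
```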
Example: A Single FedAvg Round
Describe a single FedAvg round for $N$ users, participation rate $q$ (so $K = qN = 50$ users participate), and $E$ local epochs. Identify the per-round communication traffic.
Participating users
Server selects $K = 50$ users at random. Each receives $\mathbf{w}^{(t)}$ ($d$ scalars, the model size). Downlink: $50d$ scalars.
Local training
Each of the 50 users runs $E$ local epochs of SGD on its local data $\mathcal{D}_i$, producing a locally updated model $\mathbf{w}_i^{(t+1)}$.
Upload
Each user sends back either $\mathbf{w}_i^{(t+1)}$ or the delta $\Delta_i^{(t)} = \mathbf{w}_i^{(t+1)} - \mathbf{w}^{(t)}$, which is still $d$ scalars. Uplink: $50d$ scalars.
Aggregate
Server computes $\mathbf{w}^{(t+1)} = \sum_{i \in \mathcal{S}_t} \frac{n_i}{\sum_{j \in \mathcal{S}_t} n_j} \mathbf{w}_i^{(t+1)}$, the weighted average. The weights $n_i$ are known a priori.
Total round cost
Downlink: $50d$ scalars (broadcast via multicast if supported; else unicast). Uplink: $50d$ scalars. Total: $100d$ scalars per round.
Over training
Assuming $T$ rounds: each selected user moves $2d$ scalars per round, and each user is selected in roughly 50 rounds, so about $100d$ scalars per selected user over training — roughly 4 TB at fp32 for the model size considered here. A non-trivial wireless burden.
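A short script to sanity-check the traffic arithmetic above; the model size and round count passed in are placeholders to vary, not values fixed by this example:

```python
def fedavg_traffic(d, clients_per_round=50, rounds=1, bytes_per_scalar=4):
    """FedAvg traffic: per-round scalars and cumulative bytes (fp32 by default)."""
    downlink = clients_per_round * d          # broadcast of w^{(t)} to each selected user
    uplink = clients_per_round * d            # each selected user uploads d scalars
    per_round_scalars = downlink + uplink     # 100 d scalars for 50 clients per round
    total_bytes = per_round_scalars * rounds * bytes_per_scalar
    return per_round_scalars, total_bytes

# Placeholder model size and round count, chosen only for illustration:
per_round, total = fedavg_traffic(d=10_000_000, rounds=1_000)
print(f"{per_round:.3g} scalars per round, {total / 1e12:.1f} TB total at fp32")
```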
Per-Round FL Communication Cost
Plot the per-round communication cost of FedAvg as a function of the model size $d$ and client participation rate $q$, for a fixed number of users $N$. The communication scales as $qNd$ (about $2qNd$ scalars per round, counting downlink and uplink) — reducing any of the three factors saves cost. This motivates the compression techniques of §9.3.
Parameters: total user population $N$; fraction of users $q$ selected per round; maximum model dimension $d$ shown on the x-axis.
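A minimal matplotlib sketch of the figure's content; the population size, participation rates, and model-size range below are illustrative placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

N = 10_000                                   # placeholder total user population
d = np.linspace(1e5, 1e8, 200)               # model dimension (x-axis)
for q in (0.001, 0.01, 0.1):                 # placeholder participation rates
    cost = 2 * q * N * d                     # downlink + uplink scalars per round
    plt.loglog(d, cost, label=f"q = {q}")
plt.xlabel("model dimension d")
plt.ylabel("per-round traffic (scalars)")
plt.legend()
plt.title("FedAvg per-round communication cost scales as qNd")
plt.show()
```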
Federated Learning in Production
Major production FL deployments:
- Google Gboard (keyboard prediction). The first large-scale FL deployment, circa 2017. Trains next-word prediction on user typing patterns without raw text leaving devices.
- Apple Siri / autocorrect. On-device personalization with federated updates.
- NVIDIA Flare. Production-grade FL framework for healthcare and research.
- OpenFL / IBM Federated Learning. Research-to-production FL platforms.
Common deployment patterns:
- Cross-device FL — millions of heterogeneous devices (mobile phones, IoT), very small participation rate $q$ per round.
- Cross-silo FL — small number (10s) of well-connected institutions (hospitals, banks). Higher participation rate $q$, more compute per user.
- Hybrid — cross-silo aggregation of cross-device FL rounds.
Each pattern has different system constraints, which motivate different coded / privacy / robustness techniques. Cross-device FL emphasizes communication efficiency; cross-silo FL emphasizes privacy guarantees against institutional adversaries.
- Cross-device: millions of devices, a small cohort per round, Wi-Fi / 4G uplink
- Cross-silo: tens of institutions, dedicated network
- Round latency: minutes (cross-silo) to hours (cross-device)
Historical Note: From Distributed SGD to FedAvg
2017–present. The term "federated learning" was introduced by H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas in their 2017 AISTATS paper "Communication-Efficient Learning of Deep Networks from Decentralized Data." The motivation was practical: Google's Gboard team wanted to train next-word prediction on user typing without pulling the typing to the server. The paper introduced the FedAvg algorithm and demonstrated that it converged on both i.i.d. and non-i.i.d. splits of MNIST and CIFAR-10.
The name has since stuck, and the field has exploded: over 10,000 papers on FL since 2017, major production deployments at Google, Apple, NVIDIA, IBM, and others, and a rich theoretical literature on non-IID convergence, privacy, and robustness. The CommIT group has made several key contributions: secure aggregation optimality (Chapter 10), ByzSecAgg (Chapter 11), CCESA (Chapter 12), and information-theoretically secure federated representation learning (Chapter 17).
Key Takeaway
Federated learning keeps data on user devices but exposes gradient updates. The three-axis golden thread of this book (privacy, robustness, communication efficiency) is most urgently needed in FL. Section 9.2 introduces FedAvg; §9.3 covers communication efficiency; §9.4 makes the privacy concern precise. Chapters 10–12 then develop formal information-theoretic protocols that address each axis in turn.
Common Mistake: FL is Not Just Distributed SGD
Mistake:
Apply Chapter 1's distributed-SGD analysis directly to federated learning.
Correction:
FL differs from data-center distributed SGD in several crucial ways: (i) data stays on user devices (privacy concern), (ii) data is non-IID across users (FedAvg can diverge on non-IID data with vanilla local SGD), (iii) participation is sparse (only a subset of $qN$ users per round), (iv) users can drop out mid-round, (v) uplink bandwidth is scarce. Each difference motivates a different algorithmic or information-theoretic tool. Do not conflate the two.
Quick Check
In federated learning, the central server learns:
- The aggregate gradient — nothing else about individual user data.
- Each individual user's gradient update. This is the default in vanilla FL and leaks information about local data.
- Nothing about user data — FL is private by design.
- Only the aggregate of users with high enough data quality.
Plaintext FL shares individual gradients with the server. The server's "knowledge" of each user's data is bounded by what can be inferred from its gradient — which, via gradient inversion, can be substantial.