Federated and Distributed Learning
Learning Without Centralising Data
The supervised and reinforcement learning methods of the previous sections assume that training data is available at a central server (e.g., a cloud data centre). In wireless networks, this assumption is often violated:
- Data is generated at the edge: Channel measurements, CSI reports, and traffic patterns are observed locally at each base station or user device.
- Privacy constraints: User-level data (location, traffic, channel fingerprints) may be subject to regulations (GDPR) or corporate policy that prohibit centralisation.
- Communication cost: Uploading raw high-dimensional data (e.g., channel matrices spanning 64 antennas $\times$ 1200 subcarriers $\times$ 100 time slots) from hundreds of base stations to a central server is prohibitively expensive.
Federated learning (FL) addresses these challenges by keeping data local and only sharing model updates (gradient vectors or model parameters). This section develops the FedAvg algorithm, analyses its convergence, and discusses its application to wireless communications.
Definition: Federated Learning (FedAvg)
Consider $K$ clients (e.g., base stations), each with a local dataset $\mathcal{D}_k$ of size $n_k$. The global objective is to minimise the weighted empirical risk:
$$F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i),$$
where $n = \sum_{k=1}^{K} n_k$ is the total number of samples and $\ell$ is the loss function.
Federated Averaging (FedAvg) (McMahan et al., 2017) proceeds in communication rounds $t = 0, 1, \dots, T-1$:
- Broadcast: The server sends the current global model $w^{(t)}$ to all (or a random subset of) clients.
- Local training: Each client $k$ initialises $w_k^{(t,0)} = w^{(t)}$ and performs $E$ epochs of SGD on its local data:
$$w_k^{(t,j+1)} = w_k^{(t,j)} - \eta \, \nabla F_k\big(w_k^{(t,j)}\big).$$
- Upload: Each client sends $w_k^{(t,E)}$ (or the update $\Delta_k^{(t)} = w_k^{(t,E)} - w^{(t)}$) to the server.
- Aggregation: The server averages:
$$w^{(t+1)} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_k^{(t,E)}.$$
After $T$ rounds, the global model $w^{(T)}$ is deployed at all clients.
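The round structure above can be sketched in a few lines of NumPy. This is a toy linear-regression setup with equal local dataset sizes; all dimensions, learning rates, and client data here are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K clients, each with an IID local linear-regression dataset.
K, n_k, dim = 5, 200, 8
true_w = rng.normal(size=dim)
datasets = []
for k in range(K):
    X = rng.normal(size=(n_k, dim))
    y = X @ true_w + 0.1 * rng.normal(size=n_k)
    datasets.append((X, y))

def local_sgd(w, X, y, epochs=5, lr=0.01, batch=20):
    """E epochs of mini-batch SGD on the local squared loss."""
    w = w.copy()
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# FedAvg: broadcast -> local training -> upload -> aggregate, repeated.
w_global = np.zeros(dim)
for t in range(20):  # communication rounds
    local_models = [local_sgd(w_global, X, y) for X, y in datasets]
    # Equal n_k here, so the weighted average reduces to a plain mean.
    w_global = np.mean(local_models, axis=0)

print(np.linalg.norm(w_global - true_w))  # should be small
```

With IID client data the averaged model recovers the ground truth; the heterogeneity issues discussed later arise when each client's data distribution differs.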
The communication cost per round is $O(d)$ per client (the model size $d$), which is much less than uploading raw data for most practical models. For a model with $d$ parameters and a client holding $n_k$ data samples of dimension $p$, the communication reduction over $T$ rounds is roughly $n_k p / (T d)$.
Theorem: Convergence of FedAvg under Non-IID Data
Federated Learning: One Round of FedAvg
Federated Learning vs Centralised Training
Compare FedAvg with centralised training on a simple regression task. Each client has a non-IID variant of the ground-truth model (rotated input distribution), simulating the heterogeneity that arises when different base stations see different channel statistics. The $x$-axis shows communication rounds and the $y$-axis shows the global MSE loss (log scale). Centralised training (which has access to all data) converges faster and to a lower floor. FedAvg converges to a neighbourhood of the optimum, with the gap increasing under higher heterogeneity. More clients generally help (more data) but also increase heterogeneity; observe the trade-off by adjusting the number of clients.
Federated Learning Applications in Wireless Networks
Federated learning is particularly well-suited to wireless networks, both as a tool for wireless and as a system that operates over wireless:
FL for wireless (the application):
- Channel prediction: Each BS trains a local channel prediction model on its own CSI data. FedAvg aggregates the models to learn a general predictor that captures common channel statistics while the local data captures site-specific features.
- Beam management: UEs collaboratively learn a model that predicts optimal beams from position or uplink measurements, without sharing raw position data.
- Anomaly detection: Distributed intrusion detection across edge nodes, where sharing raw traffic data is prohibited by privacy regulations.
FL over wireless (the system):
The FL communication rounds share the wireless medium with data traffic. Key co-design challenges include:
- Over-the-air aggregation: Instead of orthogonal uploads, clients transmit simultaneously and the superposition of signals implements the averaging: the received signal $y = \sum_{k} h_k x_k + z$ directly yields the sum of the updates when each client pre-scales its transmission to invert its channel $h_k$. This exploits the MAC channel for free aggregation but requires power control for alignment.
- Client selection: Under limited uplink bandwidth, only a subset of clients can participate per round. Selecting clients with informative (high-gradient-norm) updates accelerates convergence.
- Compression: Gradient quantisation (1-bit SGD, SignSGD) and sparsification reduce the per-round communication cost at the expense of convergence speed.
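Top-$k$ sparsification, mentioned in the last point above, can be sketched as follows. This is a minimal version without the error-feedback memory that practical systems add; all sizes are illustrative:

```python
import numpy as np

def top_k_sparsify(update, keep_fraction=0.5):
    """Keep only the largest-magnitude entries of a model update.

    Returns (values, indices) -- the pair a client would upload
    instead of the dense vector.
    """
    k = max(1, int(keep_fraction * update.size))
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return update[idx], idx

def densify(values, indices, size):
    """Server side: rebuild a dense vector from the sparse upload."""
    out = np.zeros(size)
    out[indices] = values
    return out

rng = np.random.default_rng(1)
delta = rng.normal(size=1000)            # a client's model update
vals, idx = top_k_sparsify(delta, 0.5)   # upload 50% of entries
recovered = densify(vals, idx, delta.size)

# Communication drops to ~50% of entries (plus index overhead);
# the discarded mass is the compression error.
err = np.linalg.norm(delta - recovered) / np.linalg.norm(delta)
print(f"relative error: {err:.2f}")
```

Because small-magnitude entries carry little of the update's energy, halving the payload costs far less than half the information, which is why sparsification trades communication for only a modest slowdown in convergence.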
Privacy Guarantees and Limitations
Federated learning is often described as "privacy-preserving" because raw data stays on-device. However, model updates can leak information about the local data:
- Gradient inversion attacks: An adversary who observes the gradient $\nabla F_k(w)$ can approximately reconstruct training samples, especially for small batches and simple models.
- Membership inference: Determining whether a specific sample was in client $k$'s dataset is possible from the model updates.
Differential privacy (DP) provides a formal guarantee: each client adds calibrated Gaussian noise to its update:
$$\tilde{\Delta}_k = \Delta_k + \mathcal{N}(0, \sigma^2 I).$$
The noise variance $\sigma^2$ is calibrated to ensure $(\epsilon, \delta)$-differential privacy, where $\epsilon$ controls the privacy-utility trade-off: smaller $\epsilon$ means stronger privacy but slower convergence.
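The clip-and-noise step can be sketched as follows. The `dp_sanitise` name, clip norm, and noise multiplier are illustrative assumptions; calibrating $\sigma$ from a target $(\epsilon, \delta)$ budget is done offline with a privacy accountant:

```python
import numpy as np

def dp_sanitise(update, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Clip a client update to a fixed L2 norm, then add Gaussian noise.

    noise_mult (sigma / clip_norm) would be calibrated from the
    target (epsilon, delta) budget; the value here is illustrative.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    # Clipping bounds the sensitivity of the aggregate to any one client.
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    clipped = update * scale
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(2)
delta = rng.normal(size=100)
private = dp_sanitise(delta, clip_norm=1.0, noise_mult=0.5, rng=rng)
print(np.linalg.norm(private))
```

Clipping is essential: without a bounded norm, a single client's update has unbounded sensitivity and no finite noise level yields a DP guarantee.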
Secure aggregation provides another layer: cryptographic protocols ensure that the server only sees the aggregate and cannot inspect individual updates. This prevents gradient inversion attacks at the server.
Example: Communication Cost of Federated Learning
A cellular network has $K$ base stations. Each BS collects $N$ CSI samples per day, each of dimension $p = 256$ (128 subcarriers $\times$ real/imaginary parts). A channel prediction model has $d$ parameters (32-bit floats). FedAvg runs $T = 100$ rounds.
(a) Compute the total communication cost (in MB) for centralised training (uploading all raw data) vs federated learning (uploading model updates).
(b) If each communication round uses 5 ms of uplink airtime at 100 Mbps per BS, compute the total training time.
Centralised cost
Each sample: $256$ floats $\times$ 4 bytes $= 1{,}024$ bytes $\approx 1$ KB.
Per BS: $N \times 1$ KB $= N$ KB.
Total: $KN$ KB across all base stations.
This is a one-time upload cost, but the raw data may contain sensitive location and channel information.
Federated cost
Per round, each client uploads $4d$ bytes ($d$ parameters $\times$ 4 bytes).
Per round, total: $4dK$ bytes across all clients.
Over $T = 100$ rounds: $400\,dK$ bytes.
The federated cost is comparable to the centralised cost in this example (it matches whenever $400\,d \approx 1024\,N$, i.e. $d \approx 2.5N$), but the federated approach transmits model parameters (no privacy-sensitive raw data) and the communication is spread over 100 rounds rather than concentrated in a single upload.
Training time
Upload per BS per round: $4d$ bytes $= 32d$ bits.
At 100 Mbps: $32d / 10^8$ s. The stated 5 ms airtime budget accommodates at most $100\ \text{Mbps} \times 5\ \text{ms} = 62.5$ KB per round; the uncompressed upload exceeds this, so with compression (e.g., top-$k$ sparsification keeping 50% of entries), the upload fits within the budget.
Total wall-clock time: $T \times (\text{local training} + \text{upload})$. If local training takes 1 s per round (on BS hardware) and uploads fit within the airtime budget, the total is $\approx 100$ s $\approx 2$ minutes.
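The cost accounting above can be checked numerically. The values of $K$, $N$, and $d$ below are illustrative assumptions chosen to be consistent with the 5 ms budget discussion, not values from the text:

```python
# Illustrative values (assumptions, not from the worked example):
K = 100        # base stations
N = 10_000     # CSI samples per BS per day
p = 256        # floats per sample (128 subcarriers x re/im)
d = 25_000     # model parameters
T = 100        # FedAvg rounds
BYTES = 4      # 32-bit floats

centralised_mb = K * N * p * BYTES / 1e6   # one-time raw-data upload
per_round_kb = d * BYTES / 1e3             # per client, per round
federated_mb = K * T * d * BYTES / 1e6     # all rounds, all BSs

uplink_bps = 100e6
upload_s = d * BYTES * 8 / uplink_bps      # per BS, per round
# With 50% top-k sparsification the upload time halves,
# bringing an 8 ms upload inside a 5 ms airtime budget.

print(f"centralised: {centralised_mb:.0f} MB")
print(f"federated:   {federated_mb:.0f} MB")
print(f"upload/round: {upload_s * 1e3:.1f} ms (budget: 5 ms)")
```

With these assumed values the two totals land in the same ballpark (about 1 GB each), illustrating the point that FedAvg's advantage here is privacy and traffic smoothing rather than raw byte count.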
ByzSecAgg: Byzantine-Resilient Secure Aggregation
Jahani-Nezhad, Maddah-Ali, and Caire developed ByzSecAgg, a federated learning aggregation protocol that is simultaneously:
- Secure: The server learns only the aggregate model update, not individual client updates (protecting against gradient inversion attacks).
- Byzantine-resilient: The protocol tolerates up to a bounded number of malicious clients that may send arbitrary (adversarial) updates designed to corrupt the global model.
- Communication-efficient: The overhead scales logarithmically with the number of clients.
The key insight is combining secret sharing (for privacy) with coded computing (for Byzantine resilience) in a single framework. Prior work addressed these properties separately; ByzSecAgg achieves both simultaneously. This is critical for wireless FL deployments where edge devices may be compromised and where the wireless channel itself can be exploited for adversarial attacks.
Federated Learning (FL)
A distributed machine learning paradigm where clients train a shared model by exchanging model updates (not raw data) with a central server. FedAvg is the standard algorithm. FL improves data privacy (raw data never leaves the client) and reduces communication cost compared to centralised training.
Related: Supervised Learning
Quick Check
In FedAvg, what happens when the data across clients becomes increasingly non-IID (e.g., each BS sees a completely different channel model)?
FedAvg converges faster because each client specialises
FedAvg still converges to the exact global optimum but requires more rounds
FedAvg converges to a neighbourhood of the optimum, and the gap grows with data heterogeneity; the bias does not vanish with more rounds
FedAvg diverges immediately under non-IID data
Under non-IID data, the heterogeneity term in the convergence bound is nonzero, creating a persistent bias. Local SGD steps drive each client toward its own local optimum, and averaging cannot fully correct this drift. Techniques like FedProx (adding a proximal term) or SCAFFOLD (variance reduction) can reduce this bias.
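The FedProx fix mentioned above, adding a proximal term $\frac{\mu}{2}\|w - w^{(t)}\|^2$ to each local objective, can be sketched in one update rule (the value of $\mu$ and the toy gradient are illustrative):

```python
import numpy as np

def fedprox_local_step(w, w_global, grad_local, lr=0.01, mu=0.1):
    """One FedProx local SGD step: the local gradient plus a proximal
    pull toward the current global model, damping client drift."""
    return w - lr * (grad_local + mu * (w - w_global))

# Toy check: with a proximal term, a heterogeneous client's iterate
# stays anchored near the global model instead of drifting without
# bound toward its own local optimum.
w_global = np.zeros(3)
w = w_global.copy()
for _ in range(100):
    grad_local = np.array([1.0, -2.0, 0.5])   # constant local pull
    w = fedprox_local_step(w, w_global, grad_local, lr=0.1, mu=1.0)
print(w)  # settles near -grad_local / mu rather than diverging
```

The fixed point satisfies $\nabla F_k(w) + \mu (w - w^{(t)}) = 0$, so larger $\mu$ keeps local iterates closer to the global model, directly shrinking the heterogeneity bias at the cost of slower local progress.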