Federated and Distributed Learning

Learning Without Centralising Data

The supervised and reinforcement learning methods of the previous sections assume that training data is available at a central server (e.g., a cloud data centre). In wireless networks, this assumption is often violated:

  • Data is generated at the edge: Channel measurements, CSI reports, and traffic patterns are observed locally at each base station or user device.
  • Privacy constraints: User-level data (location, traffic, channel fingerprints) may be subject to regulations (GDPR) or corporate policy that prohibit centralisation.
  • Communication cost: Uploading raw high-dimensional data (e.g., channel matrices from 64 antennas Γ—\times 1200 subcarriers Γ—\times 100 time slots) from hundreds of base stations to a central server is prohibitively expensive.

Federated learning (FL) addresses these challenges by keeping data local and only sharing model updates (gradient vectors or model parameters). This section develops the FedAvg algorithm, analyses its convergence, and discusses its application to wireless communications.

Definition: Federated Learning (FedAvg)

Consider $C$ clients (e.g., base stations), each with a local dataset $\mathcal{D}_c$ of size $n_c$. The global objective is to minimise the weighted empirical risk:

$$\min_{\mathbf{w}} \; F(\mathbf{w}) = \sum_{c=1}^{C} \frac{n_c}{n} F_c(\mathbf{w}), \qquad F_c(\mathbf{w}) = \frac{1}{n_c} \sum_{(\mathbf{x},\mathbf{y}) \in \mathcal{D}_c} \ell(f_{\mathbf{w}}(\mathbf{x}), \mathbf{y})$$

where $n = \sum_c n_c$ is the total number of samples and $\ell$ is the loss function.

Federated Averaging (FedAvg) (McMahan et al., 2017) proceeds in communication rounds $r = 1, 2, \ldots, R$:

  1. Broadcast: The server sends the current global model $\mathbf{w}_{\text{global}}^{(r)}$ to all (or a random subset of) clients.
  2. Local training: Each client $c$ initialises $\mathbf{w}_c \leftarrow \mathbf{w}_{\text{global}}^{(r)}$ and performs $E$ epochs of SGD on its local data: $\mathbf{w}_c \leftarrow \mathbf{w}_c - \eta \, \nabla F_c(\mathbf{w}_c)$
  3. Upload: Each client sends $\mathbf{w}_c$ (or the update $\Delta\mathbf{w}_c = \mathbf{w}_c - \mathbf{w}_{\text{global}}^{(r)}$) to the server.
  4. Aggregation: The server averages: $\mathbf{w}_{\text{global}}^{(r+1)} = \sum_{c=1}^{C} \frac{n_c}{n}\, \mathbf{w}_c$

After $R$ rounds, the global model $\mathbf{w}_{\text{global}}^{(R)}$ is deployed at all clients.
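
To make the four steps concrete, here is a minimal sketch of FedAvg in NumPy on a toy linear-regression task. The synthetic client datasets, hyperparameters, and helper names are illustrative assumptions (not from any specific library), and full client participation is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-IID setup: C clients, each with its own local linear-regression dataset.
C, d, E, R, eta = 5, 8, 3, 50, 0.05
w_true = rng.normal(size=d)
clients = []
for c in range(C):
    n_c = int(rng.integers(50, 200))                # unequal dataset sizes
    X = rng.normal(loc=0.3 * c, size=(n_c, d))      # shifted inputs -> heterogeneity
    y = X @ w_true + 0.1 * rng.normal(size=n_c)
    clients.append((X, y))
n_total = sum(len(y) for _, y in clients)

def local_sgd(w, X, y, epochs, lr, batch=32):
    """A few epochs of mini-batch SGD on one client's data (squared loss)."""
    w = w.copy()
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = order[start:start + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w_global = np.zeros(d)
for r in range(R):
    # 1) broadcast w_global, 2) local training, 3) upload, 4) weighted aggregation
    local_models = [local_sgd(w_global, X, y, E, eta) for X, y in clients]
    w_global = sum(len(y) / n_total * w_c
                   for (_, y), w_c in zip(clients, local_models))

print("distance to ground-truth weights:", np.linalg.norm(w_global - w_true))
```

Each round transmits only the $d$-dimensional model in both directions, never the local samples, which is exactly the communication saving quantified next.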

The communication cost per round is $O(|\mathbf{w}|)$ per client (the model size), which is much less than uploading raw data, $O(n_c \cdot d_{\text{input}})$, for most practical models. For a model with $10^4$ parameters and clients with $10^5$ data samples of dimension $128$, the communication reduction is roughly $1000\times$.
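
Concretely, with these numbers the raw dataset at one client contains $n_c \cdot d_{\text{input}} = 10^5 \times 128 = 1.28 \times 10^7$ values, whereas one model upload contains $|\mathbf{w}| = 10^4$ values:

$$\frac{10^5 \times 128}{10^4} = 1280 \approx 1000\times.$$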

Theorem: Convergence of FedAvg under Non-IID Data

Informally: when client data distributions differ, the $E$ local SGD epochs pull each client's model towards its own local optimum (client drift). With a fixed learning rate, FedAvg then converges not to the global optimum but to a neighbourhood of it, whose size grows with the degree of data heterogeneity and does not shrink simply by running more rounds. The interactive demonstrations below illustrate one round of the algorithm and this convergence gap.

Federated Learning: One Round of FedAvg

Animated walkthrough of one FedAvg communication round: the server broadcasts the global model, clients train locally on their private data, upload model updates, and the server aggregates them into a new global model.
FedAvg keeps data local and only shares model parameters, reducing communication cost and preserving privacy.

Federated Learning vs Centralised Training

Compare FedAvg with centralised training on a simple regression task. Each client has a non-IID variant of the ground-truth model (rotated input distribution), simulating the heterogeneity that arises when different base stations see different channel statistics. The $x$-axis shows communication rounds and the $y$-axis shows the global MSE loss (log scale). Centralised training (which has access to all data) converges faster and to a lower floor. FedAvg converges to a neighbourhood of the optimum, with the gap increasing under higher heterogeneity. More clients generally help (more data) but also increase heterogeneity; observe the trade-off by adjusting the number of clients.

Federated Learning Applications in Wireless Networks

Federated learning is particularly well-suited to wireless networks, both as a learning tool for wireless problems and as a distributed system that itself runs over the wireless network:

FL for wireless (the application):

  • Channel prediction: Each BS trains a local channel prediction model on its own CSI data. FedAvg aggregates the models to learn a general predictor that captures common channel statistics while the local data captures site-specific features.
  • Beam management: UEs collaboratively learn a model that predicts optimal beams from position or uplink measurements, without sharing raw position data.
  • Anomaly detection: Distributed intrusion detection across edge nodes, where sharing raw traffic data is prohibited by privacy regulations.

FL over wireless (the system):

The FL communication rounds share the wireless medium with data traffic. Key co-design challenges include:

  • Over-the-air aggregation: Instead of orthogonal uploads, clients transmit simultaneously and the superposition of signals implements the averaging: wglobalβˆβˆ‘chc wc\mathbf{w}_{\text{global}} \propto \sum_c h_c \, \mathbf{w}_c. This exploits the MAC channel for free aggregation but requires power control for alignment.
  • Client selection: Under limited uplink bandwidth, only a subset of clients can participate per round. Selecting clients with informative (high-gradient-norm) updates accelerates convergence.
  • Compression: Gradient quantisation (1-bit SGD, SignSGD) and sparsification reduce the per-round communication cost at the expense of convergence speed.

Privacy Guarantees and Limitations

Federated learning is often described as "privacy-preserving" because raw data stays on-device. However, model updates can leak information about the local data:

  • Gradient inversion attacks: An adversary who observes the gradient βˆ‡Fc(w)\nabla F_c(\mathbf{w}) can approximately reconstruct training samples, especially for small batches and simple models.
  • Membership inference: Determining whether a specific sample was in client cc's dataset is possible from the model updates.

Differential privacy (DP) provides a formal guarantee: each client adds calibrated Gaussian noise to its update:

$$\tilde{\mathbf{w}}_c = \mathbf{w}_c + \mathcal{N}(\mathbf{0}, \sigma_{\mathrm{DP}}^2 \mathbf{I})$$

The noise variance $\sigma_{\mathrm{DP}}^2$ is calibrated to ensure $(\epsilon, \delta)$-differential privacy, where $\epsilon$ controls the privacy-utility trade-off: smaller $\epsilon$ means stronger privacy but slower convergence.
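
The sketch below illustrates one common client-side recipe: clip the update to a bounded $\ell_2$ norm (which bounds its sensitivity) and add Gaussian noise calibrated via the classical Gaussian-mechanism bound $\sigma_{\mathrm{DP}} = \Delta_2 \sqrt{2\ln(1.25/\delta)}/\epsilon$ (valid for $\epsilon \le 1$). The clipping norm and privacy parameters are illustrative, and a full DP-FL system would also track the privacy budget across rounds.

```python
import numpy as np

def privatize_update(delta_w, clip_norm=1.0, eps=0.5, delta=1e-5, rng=None):
    """Clip a local update to bounded L2 norm, then add calibrated Gaussian noise.

    Clipping bounds the update's L2 sensitivity at `clip_norm`; the noise scale
    follows the classical Gaussian-mechanism calibration for (eps, delta)-DP.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(delta_w)
    clipped = delta_w * min(1.0, clip_norm / (norm + 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped + rng.normal(scale=sigma, size=delta_w.shape)

# Example: privatise a 20,000-parameter model update before uploading it.
noisy = privatize_update(np.random.default_rng(0).normal(size=20_000))
```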

Secure aggregation provides another layer: cryptographic protocols ensure that the server only sees the aggregate $\sum_c \mathbf{w}_c$ and cannot inspect individual updates. This prevents gradient inversion attacks at the server.
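
As a toy illustration of the pairwise-masking idea behind secure aggregation (in the spirit of Bonawitz et al., but omitting the key agreement, finite-field arithmetic, and dropout handling a real protocol requires): each pair of clients agrees on a random mask that one adds and the other subtracts, so every individual upload looks random to the server while the masks cancel in the sum.

```python
import numpy as np

rng = np.random.default_rng(2)
C, d = 4, 6
w = rng.normal(size=(C, d))                     # true local updates

# Pairwise masks: client i adds m[(i, j)], client j subtracts it (for i < j).
masks = {(i, j): rng.normal(size=d) for i in range(C) for j in range(i + 1, C)}

uploads = []
for c in range(C):
    masked = w[c].copy()
    for (i, j), m in masks.items():
        if c == i:
            masked += m
        elif c == j:
            masked -= m
    uploads.append(masked)

# The server sees only the masked uploads; their sum equals the true sum.
print(np.allclose(sum(uploads), w.sum(axis=0)))  # True
```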

Example: Communication Cost of Federated Learning

A cellular network has $C = 50$ base stations. Each BS collects $n_c = 10\,000$ CSI samples per day, each of dimension $d = 256$ (128 subcarriers $\times$ real/imaginary). A channel prediction model has $|\mathbf{w}| = 20\,000$ parameters (32-bit floats). FedAvg runs $R = 100$ rounds.

(a) Compute the total communication cost (in MB) for centralised training (uploading all raw data) vs federated learning (uploading model updates).

(b) If each communication round uses 5 ms of uplink airtime at 100 Mbps per BS, compute the total training time.
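
A back-of-the-envelope calculation for this example (a sketch: part (b) takes the stated 5 ms per round at face value, assumes the base stations transmit on orthogonal resources in parallel, and ignores downlink and local-computation time):

```python
# Given values from the example statement.
C, n_c, d, n_params, R = 50, 10_000, 256, 20_000, 100
bytes_per_value = 4                                        # 32-bit floats

# (a) Total uplink volume in MB.
centralised_MB = C * n_c * d * bytes_per_value / 1e6       # upload all raw CSI samples once
federated_MB   = C * R * n_params * bytes_per_value / 1e6  # upload a model update every round
print(centralised_MB, federated_MB)                        # 512.0 MB vs 400.0 MB

# (b) Uplink airtime: 5 ms per BS per round, over R rounds.
# (Note: an uncompressed 20,000 x 32-bit update is 0.64 Mbit, i.e. 6.4 ms at
# 100 Mbps, so the 5 ms figure implicitly assumes mild compression.)
airtime_per_bs_s = R * 5e-3
print(airtime_per_bs_s)                                    # 0.5 s of FL uplink airtime per BS
```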

🎓 CommIT Contribution (2022)

ByzSecAgg: Byzantine-Resilient Secure Aggregation

T. Jahani-Nezhad, M. A. Maddah-Ali, and G. Caire, IEEE Journal on Selected Areas in Communications

Jahani-Nezhad, Maddah-Ali, and Caire developed ByzSecAgg, a federated learning aggregation protocol that is simultaneously:

  1. Secure: The server learns only the aggregate model update, not individual client updates (protecting against gradient inversion attacks).
  2. Byzantine-resilient: The protocol tolerates up to $f$ malicious clients that may send arbitrary (adversarial) updates designed to corrupt the global model.
  3. Communication-efficient: The overhead scales logarithmically with the number of clients.

The key insight is combining secret sharing (for privacy) with coded computing (for Byzantine resilience) in a single framework. Prior work addressed these properties separately; ByzSecAgg achieves both simultaneously. This is critical for wireless FL deployments where edge devices may be compromised and where the wireless channel itself can be exploited for adversarial attacks.
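
ByzSecAgg itself combines secret sharing with coded computing and does not reduce to a short snippet. Purely as an illustration of what Byzantine resilience means for the aggregation step (and explicitly not the ByzSecAgg protocol), the sketch below replaces the weighted mean with a coordinate-wise median, which tolerates a minority of arbitrarily corrupted updates:

```python
import numpy as np

rng = np.random.default_rng(3)
C, d, f = 10, 5, 2                          # clients, model dimension, Byzantine clients
updates = rng.normal(loc=1.0, scale=0.1, size=(C, d))
updates[:f] = 100.0                         # f malicious clients send garbage

mean_agg   = updates.mean(axis=0)           # plain averaging: badly corrupted (~20 per coordinate)
median_agg = np.median(updates, axis=0)     # coordinate-wise median: stays close to 1.0
print(mean_agg.round(2), median_agg.round(2))
```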

Tags: federated-learning, secure-aggregation, byzantine-resilience, privacy

Federated Learning (FL)

A distributed machine learning paradigm where $C$ clients train a shared model by exchanging model updates (not raw data) with a central server. FedAvg is the standard algorithm. FL keeps raw data on-device and typically reduces communication cost compared to centralised training.

Related: Supervised Learning

Quick Check

In FedAvg, what happens when the data across clients becomes increasingly non-IID (e.g., each BS sees a completely different channel model)?

  • FedAvg converges faster because each client specialises
  • FedAvg still converges to the exact global optimum but requires more rounds
  • FedAvg converges to a neighbourhood of the optimum, and the gap grows with data heterogeneity; the bias does not vanish with more rounds
  • FedAvg diverges immediately under non-IID data