Federated and Distributed Learning
Learning Without Centralising Data
The supervised and reinforcement learning methods of the previous sections assume that training data is available at a central server (e.g., a cloud data centre). In wireless networks, this assumption is often violated:
- Data is generated at the edge: Channel measurements, CSI reports, and traffic patterns are observed locally at each base station or user device.
- Privacy constraints: User-level data (location, traffic, channel fingerprints) may be subject to regulations (GDPR) or corporate policy that prohibit centralisation.
- Communication cost: Uploading raw high-dimensional data (e.g., channel matrices spanning 64 antennas $\times$ 1200 subcarriers $\times$ 100 time slots) from hundreds of base stations to a central server is prohibitively expensive.
Federated learning (FL) addresses these challenges by keeping data local and only sharing model updates (gradient vectors or model parameters). This section develops the FedAvg algorithm, analyses its convergence, and discusses its application to wireless communications.
Definition: Federated Learning (FedAvg)
Consider $K$ clients (e.g., base stations), each with a local dataset $\mathcal{D}_k$ of size $n_k$. The global objective is to minimise the weighted empirical risk:
$$F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i),$$
where $n = \sum_{k=1}^{K} n_k$ is the total number of samples and $\ell$ is the loss function.
Federated Averaging (FedAvg) (McMahan et al., 2017) proceeds in communication rounds $t = 0, 1, \dots, T-1$:
- Broadcast: The server sends the current global model $w^{(t)}$ to all (or a random subset of) clients.
- Local training: Each client $k$ initialises $w_k^{(t,0)} = w^{(t)}$ and performs $E$ epochs of SGD on its local data:
$$w_k^{(t,j+1)} = w_k^{(t,j)} - \eta \, \nabla F_k\big(w_k^{(t,j)}\big).$$
- Upload: Each client sends $w_k^{(t,E)}$ (or the update $\Delta_k^{(t)} = w_k^{(t,E)} - w^{(t)}$) to the server.
- Aggregation: The server averages:
$$w^{(t+1)} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_k^{(t,E)}.$$
After $T$ rounds, the global model $w^{(T)}$ is deployed at all clients.
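The round structure above can be sketched in a few lines of NumPy. This is a toy linear-regression setup with equal local dataset sizes; all dimensions, learning rates, and client data here are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K clients, each with an IID local linear-regression dataset.
K, n_k, dim = 5, 200, 8
true_w = rng.normal(size=dim)
datasets = []
for k in range(K):
    X = rng.normal(size=(n_k, dim))
    y = X @ true_w + 0.1 * rng.normal(size=n_k)
    datasets.append((X, y))

def local_sgd(w, X, y, epochs=5, lr=0.01, batch=20):
    """E epochs of mini-batch SGD on the local squared loss."""
    w = w.copy()
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            b = idx[start:start + batch]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# FedAvg: broadcast -> local training -> upload -> aggregate, repeated.
w_global = np.zeros(dim)
for t in range(20):  # communication rounds
    local_models = [local_sgd(w_global, X, y) for X, y in datasets]
    # Equal n_k here, so the weighted average reduces to a plain mean.
    w_global = np.mean(local_models, axis=0)

print(np.linalg.norm(w_global - true_w))  # should be small
```

With IID client data the averaged model recovers the ground truth; the heterogeneity issues discussed later arise when each client's data distribution differs.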
The communication cost per round is $O(d)$ per client (the model size $d$), which is much less than uploading raw data for most practical models. For a model with $d$ parameters and a client holding $n_k$ data samples of dimension $p$, the communication reduction over $T$ rounds is roughly $n_k p / (T d)$.
Theorem: Convergence of FedAvg under Non-IID Data
Federated Learning: One Round of FedAvg
Federated Learning vs Centralised Training
Compare FedAvg with centralised training on a simple regression task. Each client has a non-IID variant of the ground-truth model (rotated input distribution), simulating the heterogeneity that arises when different base stations see different channel statistics. The $x$-axis shows communication rounds and the $y$-axis shows the global MSE loss (log scale). Centralised training (which has access to all data) converges faster and to a lower floor. FedAvg converges to a neighbourhood of the optimum, with the gap increasing under higher heterogeneity. More clients generally help (more data) but also increase heterogeneity; observe the trade-off by adjusting the number of clients.
Federated Learning Applications in Wireless Networks
Federated learning is particularly well-suited to wireless networks, both as a tool for wireless and as a system that operates over wireless:
FL for wireless (the application):
- Channel prediction: Each BS trains a local channel prediction model on its own CSI data. FedAvg aggregates the models to learn a general predictor that captures common channel statistics while the local data captures site-specific features.
- Beam management: UEs collaboratively learn a model that predicts optimal beams from position or uplink measurements, without sharing raw position data.
- Anomaly detection: Distributed intrusion detection across edge nodes, where sharing raw traffic data is prohibited by privacy regulations.
FL over wireless (the system):
The FL communication rounds share the wireless medium with data traffic. Key co-design challenges include:
- Over-the-air aggregation: Instead of orthogonal uploads, clients transmit simultaneously and the superposition of signals implements the averaging: the received signal $y = \sum_{k} h_k x_k + z$ directly yields the sum of the updates when each client pre-scales its transmission to invert its channel $h_k$. This exploits the MAC channel for free aggregation but requires power control for alignment.
- Client selection: Under limited uplink bandwidth, only a subset of clients can participate per round. Selecting clients with informative (high-gradient-norm) updates accelerates convergence.
- Compression: Gradient quantisation (1-bit SGD, SignSGD) and sparsification reduce the per-round communication cost at the expense of convergence speed.
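Top-$k$ sparsification, mentioned in the last point above, can be sketched as follows. This is a minimal version without the error-feedback memory that practical systems add; all sizes are illustrative:

```python
import numpy as np

def top_k_sparsify(update, keep_fraction=0.5):
    """Keep only the largest-magnitude entries of a model update.

    Returns (values, indices) -- the pair a client would upload
    instead of the dense vector.
    """
    k = max(1, int(keep_fraction * update.size))
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return update[idx], idx

def densify(values, indices, size):
    """Server side: rebuild a dense vector from the sparse upload."""
    out = np.zeros(size)
    out[indices] = values
    return out

rng = np.random.default_rng(1)
delta = rng.normal(size=1000)            # a client's model update
vals, idx = top_k_sparsify(delta, 0.5)   # upload 50% of entries
recovered = densify(vals, idx, delta.size)

# Communication drops to ~50% of entries (plus index overhead);
# the discarded mass is the compression error.
err = np.linalg.norm(delta - recovered) / np.linalg.norm(delta)
print(f"relative error: {err:.2f}")
```

Because small-magnitude entries carry little of the update's energy, halving the payload costs far less than half the information, which is why sparsification trades communication for only a modest slowdown in convergence.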
Privacy Guarantees and Limitations
Federated learning is often described as "privacy-preserving" because raw data stays on-device. However, model updates can leak information about the local data:
- Gradient inversion attacks: An adversary who observes the gradient $\nabla F_k(w)$ can approximately reconstruct training samples, especially for small batches and simple models.
- Membership inference: Determining whether a specific sample was in client $k$'s dataset is possible from the model updates.
Differential privacy (DP) provides a formal guarantee: each client adds calibrated Gaussian noise to its update:
$$\tilde{\Delta}_k = \Delta_k + \mathcal{N}(0, \sigma^2 I).$$
The noise variance $\sigma^2$ is calibrated to ensure $(\epsilon, \delta)$-differential privacy, where $\epsilon$ controls the privacy-utility trade-off: smaller $\epsilon$ means stronger privacy but slower convergence.
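The clip-and-noise step can be sketched as follows. The `dp_sanitise` name, clip norm, and noise multiplier are illustrative assumptions; calibrating $\sigma$ from a target $(\epsilon, \delta)$ budget is done offline with a privacy accountant:

```python
import numpy as np

def dp_sanitise(update, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Clip a client update to a fixed L2 norm, then add Gaussian noise.

    noise_mult (sigma / clip_norm) would be calibrated from the
    target (epsilon, delta) budget; the value here is illustrative.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    # Clipping bounds the sensitivity of the aggregate to any one client.
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    clipped = update * scale
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(2)
delta = rng.normal(size=100)
private = dp_sanitise(delta, clip_norm=1.0, noise_mult=0.5, rng=rng)
print(np.linalg.norm(private))
```

Clipping is essential: without a bounded norm, a single client's update has unbounded sensitivity and no finite noise level yields a DP guarantee.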
Secure aggregation provides another layer: cryptographic protocols ensure that the server only sees the aggregate and cannot inspect individual updates. This prevents gradient inversion attacks at the server.
Example: Communication Cost of Federated Learning
A cellular network has $K$ base stations. Each BS collects $N$ CSI samples per day, each of dimension $p = 256$ (128 subcarriers $\times$ real/imaginary parts). A channel prediction model has $d$ parameters (32-bit floats). FedAvg runs $T = 100$ rounds.
(a) Compute the total communication cost (in MB) for centralised training (uploading all raw data) vs federated learning (uploading model updates).
(b) If each communication round uses 5 ms of uplink airtime at 100 Mbps per BS, compute the total training time.
Centralised cost
Each sample: $256$ floats $\times$ 4 bytes $= 1{,}024$ bytes $\approx 1$ KB.
Per BS: $N \times 1$ KB $= N$ KB.
Total: $KN$ KB across all base stations.
This is a one-time upload cost, but the raw data may contain sensitive location and channel information.
Federated cost
Per round, each client uploads $4d$ bytes ($d$ parameters $\times$ 4 bytes).
Per round, total: $4dK$ bytes across all clients.
Over $T = 100$ rounds: $400\,dK$ bytes.
The federated cost is comparable to the centralised cost in this example (it matches whenever $400\,d \approx 1024\,N$, i.e. $d \approx 2.5N$), but the federated approach transmits model parameters (no privacy-sensitive raw data) and the communication is spread over 100 rounds rather than concentrated in a single upload.
Training time
Upload per BS per round: $4d$ bytes $= 32d$ bits.
At 100 Mbps: $32d / 10^8$ s. The stated 5 ms airtime budget accommodates at most $100\ \text{Mbps} \times 5\ \text{ms} = 62.5$ KB per round; the uncompressed upload exceeds this, so with compression (e.g., top-$k$ sparsification keeping 50% of entries), the upload fits within the budget.
Total wall-clock time: $T \times (\text{local training} + \text{upload})$. If local training takes 1 s per round (on BS hardware) and uploads fit within the airtime budget, the total is $\approx 100$ s $\approx 2$ minutes.
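The cost accounting above can be checked numerically. The values of $K$, $N$, and $d$ below are illustrative assumptions chosen to be consistent with the 5 ms budget discussion, not values from the text:

```python
# Illustrative values (assumptions, not from the worked example):
K = 100        # base stations
N = 10_000     # CSI samples per BS per day
p = 256        # floats per sample (128 subcarriers x re/im)
d = 25_000     # model parameters
T = 100        # FedAvg rounds
BYTES = 4      # 32-bit floats

centralised_mb = K * N * p * BYTES / 1e6   # one-time raw-data upload
per_round_kb = d * BYTES / 1e3             # per client, per round
federated_mb = K * T * d * BYTES / 1e6     # all rounds, all BSs

uplink_bps = 100e6
upload_s = d * BYTES * 8 / uplink_bps      # per BS, per round
# With 50% top-k sparsification the upload time halves,
# bringing an 8 ms upload inside a 5 ms airtime budget.

print(f"centralised: {centralised_mb:.0f} MB")
print(f"federated:   {federated_mb:.0f} MB")
print(f"upload/round: {upload_s * 1e3:.1f} ms (budget: 5 ms)")
```

With these assumed values the two totals land in the same ballpark (about 1 GB each), illustrating the point that FedAvg's advantage here is privacy and traffic smoothing rather than raw byte count.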
ByzSecAgg: Byzantine-Resilient Secure Aggregation
Jahani-Nezhad, Maddah-Ali, and Caire developed ByzSecAgg, a federated learning aggregation protocol that is simultaneously:
- Secure: The server learns only the aggregate model update, not individual client updates (protecting against gradient inversion attacks).
- Byzantine-resilient: The protocol tolerates up to a bounded number of malicious clients that may send arbitrary (adversarial) updates designed to corrupt the global model.
- Communication-efficient: The overhead scales logarithmically with the number of clients.
The key insight is combining secret sharing (for privacy) with coded computing (for Byzantine resilience) in a single framework. Prior work addressed these properties separately; ByzSecAgg achieves both simultaneously. This is critical for wireless FL deployments where edge devices may be compromised and where the wireless channel itself can be exploited for adversarial attacks.
Federated Learning (FL)
A distributed machine learning paradigm where clients train a shared model by exchanging model updates (not raw data) with a central server. FedAvg is the standard algorithm. FL improves data privacy (raw data never leaves the client) and reduces communication cost compared to centralised training.
Related: Supervised Learning
Quick Check
In FedAvg, what happens when the data across clients becomes increasingly non-IID (e.g., each BS sees a completely different channel model)?
FedAvg converges faster because each client specialises
FedAvg still converges to the exact global optimum but requires more rounds
FedAvg converges to a neighbourhood of the optimum, and the gap grows with data heterogeneity; the bias does not vanish with more rounds
FedAvg diverges immediately under non-IID data
Under non-IID data, the heterogeneity term in the convergence bound is nonzero, creating a persistent bias. Local SGD steps drive each client toward its own local optimum, and averaging cannot fully correct this drift. Techniques like FedProx (adding a proximal term) or SCAFFOLD (variance reduction) can reduce this bias.
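The FedProx fix mentioned above, adding a proximal term $\frac{\mu}{2}\|w - w^{(t)}\|^2$ to each local objective, can be sketched in one update rule (the value of $\mu$ and the toy gradient are illustrative):

```python
import numpy as np

def fedprox_local_step(w, w_global, grad_local, lr=0.01, mu=0.1):
    """One FedProx local SGD step: the local gradient plus a proximal
    pull toward the current global model, damping client drift."""
    return w - lr * (grad_local + mu * (w - w_global))

# Toy check: with a proximal term, a heterogeneous client's iterate
# stays anchored near the global model instead of drifting without
# bound toward its own local optimum.
w_global = np.zeros(3)
w = w_global.copy()
for _ in range(100):
    grad_local = np.array([1.0, -2.0, 0.5])   # constant local pull
    w = fedprox_local_step(w, w_global, grad_local, lr=0.1, mu=1.0)
print(w)  # settles near -grad_local / mu rather than diverging
```

The fixed point satisfies $\nabla F_k(w) + \mu (w - w^{(t)}) = 0$, so larger $\mu$ keeps local iterates closer to the global model, directly shrinking the heterogeneity bias at the cost of slower local progress.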