Distributed Gradient Descent for Machine Learning

Why Gradient Descent is the Common Workload

Most modern machine learning is the empirical-risk-minimization problem $\min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w}) \triangleq \frac{1}{n} \sum_{k=1}^n \ell(\mathbf{w}; \xi_k)$, where $\xi_k$ is the $k$-th data sample and $\ell$ is a per-sample loss. First-order methods (SGD, mini-batch SGD, Adam, ...) iterate $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \widehat{\nabla F}(\mathbf{w}_t)$ by computing a gradient estimate at each round. Once $n$ is too large for one machine, the gradient estimate must be computed in parallel across $N$ workers (or, in the federated setting, across $n$ users) and aggregated by a central server. That aggregation, happening every iteration, possibly tens of thousands of times, is where the cost of distribution actually lives.

This section formalizes the synchronous distributed-SGD architecture, isolates the aggregation step as the dominant cost, and motivates the privacy concerns that make Part III necessary.

Definition:

Synchronous Distributed Gradient Descent

A synchronous distributed gradient descent iteration over $n$ workers proceeds as follows. At round $t$ the master holds the model $\mathbf{w}_t \in \mathbb{R}^d$ and broadcasts it to all workers. Worker $k$ holds a local data partition $\mathcal{D}_k$ and computes the local gradient $\mathbf{g}_k(\mathbf{w}_t) = \frac{1}{|\mathcal{D}_k|} \sum_{\xi \in \mathcal{D}_k} \nabla_{\mathbf{w}} \ell(\mathbf{w}_t; \xi) \in \mathbb{R}^d$, then sends it to the master, which applies $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \frac{1}{n} \sum_{k=1}^n \mathbf{g}_k(\mathbf{w}_t)$. The aggregated gradient $\mathbf{G}_t = \sum_k \mathbf{g}_k$ is the only quantity the master needs from the round; individual gradients are intermediate values, not deliverables.
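The iteration above can be sketched in a few lines. This is a single-process simulation, with workers as a Python list and a least-squares loss standing in for $\ell$, not a real multi-machine deployment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n workers, each holding a partition of a least-squares
# problem with per-sample loss l(w; (x, y)) = 0.5 * (x @ w - y)^2.
n_workers, d, samples_per_worker = 4, 10, 50
w_true = rng.normal(size=d)
partitions = []
for _ in range(n_workers):
    X = rng.normal(size=(samples_per_worker, d))
    y = X @ w_true + 0.01 * rng.normal(size=samples_per_worker)
    partitions.append((X, y))

def local_gradient(w, X, y):
    """g_k(w): average per-sample gradient over worker k's partition."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)
eta = 0.1
for t in range(200):
    # Master broadcasts w; each worker returns its local gradient.
    grads = [local_gradient(w, X, y) for X, y in partitions]
    # The master only needs the average (equivalently, the sum) of the
    # gradients -- never the individual g_k beyond forming that sum.
    w = w - eta * np.mean(grads, axis=0)

print(np.linalg.norm(w - w_true))  # small residual after convergence
```

The aggregation step is the single line `np.mean(grads, axis=0)`; everything in Part III is about computing that line without revealing its inputs.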

The fact that the master only needs $\sum_k \mathbf{g}_k$, not the individual gradients, is the structural opening that secure aggregation (Chapter 10) and over-the-air computation (Chapter 16) will exploit. From an information-theoretic standpoint, computing a function of the inputs can cost less than learning each input.

Distributed SGD

Synchronous distributed stochastic gradient descent: each round, every worker computes a local gradient on its data partition, the master averages all gradients, and updates the model. The dominant per-round costs are gradient broadcast ($d$ floats down per worker) and gradient aggregation ($d$ floats up per worker).

Parameter Server

The master node in a distributed-SGD architecture: it holds the canonical model parameters, broadcasts them to workers each round, and aggregates returned gradients. In federated learning the parameter server is the cloud and the workers are user devices.

Theorem: Per-Round Communication Cost of Distributed SGD

For synchronous distributed SGD with $n$ workers, model dimension $d$, and $b$-bit floating-point representation, the aggregate communication per training round is $C_{\text{round}} = n d b \;(\text{uplink}) + n d b \;(\text{downlink}) = 2 n d b$ bits. Over $T$ rounds the total communication is $2 n d b T$. For a contemporary deep network with $d \sim 10^9$ parameters, $n \sim 10^4$ workers, $b = 32$ bits, and $T \sim 10^5$ rounds, the total inter-machine traffic is $\approx 6.4 \cdot 10^{19}$ bits, many orders of magnitude larger than the dataset itself.
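A quick sanity check of the theorem's arithmetic, plugging in the numbers from the statement:

```python
# C_round = 2*n*d*b bits per round; total over T rounds is C_round * T.
n, d, b, T = 10**4, 10**9, 32, 10**5

c_round = 2 * n * d * b   # uplink + downlink bits per round
total = c_round * T       # bits over the whole training run

print(c_round)  # 640000000000000  (6.4e14 bits per round)
print(total)    # 64000000000000000000  (6.4e19 bits total)
```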

Distributed training is dominated by gradient communication, not by data movement. The dataset is partitioned and read once; the gradient cycles through every worker every iteration. This asymmetry is why gradient compression, sparsification, and quantization are now standard in production federated systems, and why the privacy of gradients, not raw data, is the central concern of Chapter 10.

Distributed-SGD Communication Cost vs. Workers and Model Size

Sweep the number of users $n$ and observe how the per-round communication cost scales. We plot the total uplink traffic (bits per round) for three model sizes that bracket modern deep learning: $d = 10^7$ (a small CNN), $d = 10^9$ (a large transformer), and $d = 10^{11}$ (a frontier mixture-of-experts model). The vertical axis uses a log scale. The shaded region marks the bandwidth available on a typical 1 Gbps backhaul during a one-second round; anything above the shaded region incurs per-round communication latency that exceeds the round itself.

Parameters: maximum number of users to plot $n = 10{,}000$; quantization precision for gradients $b = 32$ bits; training rounds aggregated $T = 100$.

Example: How Long Does a Single Distributed-SGD Round Take?

A model has $d = 10^9$ parameters at $b = 32$ bits each. We train with $n = 1000$ workers. Each worker has a 1 Gbps uplink. Ignoring computation time and broadcast cost, what is the minimum per-round communication time?
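A worked answer, under two assumptions the problem statement leaves open: first that all worker uplinks run fully in parallel with unconstrained master ingress, then that the master's own ingress is also limited to 1 Gbps:

```python
d, b, n = 10**9, 32, 1000
uplink_bps = 10**9                       # 1 Gbps per worker

bits_per_worker = d * b                  # 3.2e10 bits of gradient per worker
t_parallel = bits_per_worker / uplink_bps
print(t_parallel)                        # 32.0 s if all uplinks run concurrently

# If the master's ingress is also capped at 1 Gbps, it must absorb
# n * d * b bits serially:
t_serial = n * bits_per_worker / uplink_bps
print(t_serial)                          # 32000.0 s -- nearly nine hours per round
```

Either way, a 32-second floor per round (before any computation) explains why 32-bit plaintext gradients are rarely shipped at this scale.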

The Gradient Is Not Just a Number: It Leaks the Data

A common misconception is that distributed SGD is "private" because the raw data $\mathcal{D}_k$ never leaves user $k$'s device; only the gradient $\mathbf{g}_k$ does. This is wrong. A series of recent "gradient inversion" attacks has demonstrated that the gradient $\mathbf{g}_k = \nabla \ell(\mathbf{w}; \xi_k)$ contains enough information to reconstruct individual training samples, sometimes pixel-perfectly, given knowledge of the model architecture. This works even for batch sizes up to a few dozen, and even when only a few successive rounds of gradients are observed.
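A minimal numeric illustration of the leak, far simpler than the DLG attack itself: for a single-sample least-squares model, the gradient is a scalar multiple of the input, so an observer who sees the gradient recovers the input's direction exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-sample least-squares loss: l(w; (x, y)) = 0.5 * (w @ x - y)^2.
d = 8
x, y = rng.normal(size=d), 1.0
w = rng.normal(size=d)

# The gradient the worker would send: g = (w @ x - y) * x.
g = (w @ x - y) * x

# g is proportional to x, so its direction *is* the data's direction:
cos_sim = abs((g / np.linalg.norm(g)) @ (x / np.linalg.norm(x)))
print(cos_sim)  # 1.0 up to floating-point error
```

Deep networks are not this transparent, but DLG-style attacks close the gap by optimizing a dummy input until its gradient matches the observed one.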

The implication for system design is sharp: shipping plaintext gradients to a central server is not a privacy guarantee. The server can reconstruct user data from gradients alone, and so can any party who eavesdrops on the network. This is the operational motivation for secure aggregation (Chapter 10), where the protocol is designed so that the server learns only the aggregate $\mathbf{G} = \sum_k \mathbf{g}_k$ and nothing about any individual $\mathbf{g}_k$.

Historical Note: From DLG to Inversion of Foundation Models

2019–2024

The first influential gradient-inversion attack, Deep Leakage from Gradients (DLG) by Zhu, Liu, and Han (NeurIPS 2019), showed that for a small image classifier and a single training sample, the input pixels can be recovered to near-perfect fidelity from the gradient alone, by solving an inverse optimization problem. The technique was extended to mini-batches (iDLG, GradInversion) and subsequently to federated learning of language models, where partial reconstruction of training text from gradient updates has been demonstrated. The takeaway for federated-learning system designers is unambiguous: a plaintext gradient is, for cryptographic purposes, a noisy view of the underlying training samples, not an opaque numeric vector.

🚨 Critical Engineering Note

Gradient Compression Is Standard, But Not Privacy

Production federated-learning systems (Google's TensorFlow Federated, NVIDIA Flare, Apple's PrivateFL stack) routinely ship gradients in 8-bit or even 1-bit precision (signSGD), and combine quantization with top-$K$ sparsification. These tricks reduce $C_{\text{round}}$ by 1–2 orders of magnitude without measurably hurting convergence in many workloads. They are not, however, a privacy mechanism: a quantized gradient is still informative enough for inversion attacks. Privacy has to be added explicitly, either via cryptographic secure aggregation (Chapter 10) or via differential-privacy noise addition (Chapter 18). Compression and privacy are orthogonal axes; engineering one does not buy the other.

Practical Constraints
  • 8-bit gradient quantization typically loses < 0.5% top-1 accuracy on ImageNet-scale models
  • Top-$1\%$ sparsification reduces uplink size by $\sim 100\times$ but introduces stale gradient components
  • The gradient inversion threat is independent of compression strength

📋 Ref: Google FL system; TF Federated; OpenFL
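A toy sketch of the two compression tricks, using uniform 8-bit quantization and top-$K$ magnitude selection. This is illustrative only; production systems layer on refinements such as stochastic rounding and error feedback:

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.normal(size=10_000).astype(np.float32)   # stand-in gradient vector

# --- 8-bit uniform quantization: map [min, max] onto 256 levels ---
lo, hi = float(g.min()), float(g.max())
scale = (hi - lo) / 255
q = np.round((g - lo) / scale).astype(np.uint8)  # 1 byte/entry instead of 4
g_deq = lo + q.astype(np.float32) * scale        # what the master reconstructs

# --- top-1% sparsification: keep the largest-magnitude 1% of entries ---
k = len(g) // 100
idx = np.argpartition(np.abs(g), -k)[-k:]
g_sparse = np.zeros_like(g)
g_sparse[idx] = g[idx]

max_err = float(np.abs(g - g_deq).max())
print(max_err <= scale)                  # quantization error bounded by one step
print(np.count_nonzero(g_sparse) == k)   # only k of 10,000 entries survive
```

Note what the sketch does not change: `g_deq` is still a close copy of `g`, entry by entry, which is exactly why quantization alone defeats no inversion attack.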

Common Mistake: Federated Learning Is Not "Private by Default"

Mistake:

Federated learning keeps raw data on user devices, therefore the server learns nothing private about individual users.

Correction:

The server receives gradients, and gradients leak the underlying samples. Without secure aggregation or differential privacy, a federated-learning protocol provides less privacy than a well-implemented data-center pipeline with strict access controls. The privacy guarantee in FL has to come from the cryptographic or information-theoretic protocol on top of it, not from the architecture itself.

Why This Matters: Aggregation Is a Function, and Over-the-Air Computation Exploits That

Notice the mathematical structure of the aggregation step: the master only needs $\sum_k \mathbf{g}_k$, not the individual gradients. In a wireless network, this aggregation can be performed in the channel itself by letting all users transmit simultaneously: the multiple-access channel naturally superposes the signals to produce $\sum_k \mathbf{g}_k + \text{noise}$. This is the analog over-the-air computation paradigm of Chapter 16, which cuts the per-round uplink cost from $n d b$ to $d b$ at the price of additive noise on the aggregate.
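A toy simulation of the idea, modeling the multiple-access channel as exact real-valued superposition plus Gaussian receiver noise. This is a strong idealization; real channels add fading and synchronization error:

```python
import numpy as np

rng = np.random.default_rng(3)

# n users each hold a d-dimensional gradient; the channel superposes
# simultaneous transmissions and adds receiver noise (idealized real MAC).
n, d, noise_std = 100, 1000, 0.1
grads = rng.normal(size=(n, d))

# Digital scheme: the server decodes each g_k separately (n*d*b uplink bits).
# Analog OTA scheme: all users transmit at once; the server receives only
# the sum plus noise -- one channel use per gradient coordinate.
received = grads.sum(axis=0) + noise_std * rng.normal(size=d)

true_sum = grads.sum(axis=0)
rel_err = np.linalg.norm(received - true_sum) / np.linalg.norm(true_sum)
print(rel_err)  # small: the aggregate survives; no g_k is individually decoded
```

The relative error stays small because the signal energy of the sum grows with $n$ while the receiver noise does not, which is the quantitative case Chapter 16 develops.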

Quick Check

A federated-learning round trains a model with $d = 10^8$ parameters on $n = 10^4$ users, each at $b = 16$ bits per gradient scalar. What is the per-round aggregate uplink traffic?

$1.6 \cdot 10^{13}$ bits = 2 TB

$10^8$ bits = 12.5 MB

$10^4$ bits = 1.25 KB

$10^{12}$ bits = 125 GB