Distributed Gradient Descent for Machine Learning

Why Gradient Descent is the Common Workload

Most modern machine learning is the empirical-risk-minimization problem $\min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w}) \triangleq \frac{1}{n} \sum_{k=1}^n \ell(\mathbf{w}; \xi_k)$, where $\xi_k$ is the $k$-th data sample and $\ell$ is a per-sample loss. First-order methods (SGD, mini-batch SGD, Adam, ...) iterate $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \widehat{\nabla F}(\mathbf{w}_t)$ by computing a gradient estimate at each round. Once $n$ is too large for one machine, the gradient estimate must be computed in parallel across $N$ workers (or, in the federated setting, across $n$ users) and aggregated by a central server. That aggregation, happening every iteration, possibly tens of thousands of times, is where the cost of distribution actually lives.

This section formalizes the synchronous distributed-SGD architecture, isolates the aggregation step as the dominant cost, and motivates the privacy concerns that make Part III necessary.

Definition:

Synchronous Distributed Gradient Descent

A synchronous distributed gradient descent iteration over $n$ workers proceeds as follows. At round $t$ the master holds the model $\mathbf{w}_t \in \mathbb{R}^d$ and broadcasts it to all workers. Worker $k$ holds a local data partition $\mathcal{D}_k$ and computes the local gradient $\mathbf{g}_k(\mathbf{w}_t) = \frac{1}{|\mathcal{D}_k|} \sum_{\xi \in \mathcal{D}_k} \nabla_{\mathbf{w}} \ell(\mathbf{w}_t; \xi) \in \mathbb{R}^d$, then sends it to the master, which applies $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \frac{1}{n} \sum_{k=1}^n \mathbf{g}_k(\mathbf{w}_t)$. The aggregated gradient $\mathbf{G}_t = \sum_k \mathbf{g}_k$ is the only quantity the master needs from the round; individual gradients are intermediate values, not deliverables.
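The iteration above can be sketched in a few lines. This is a single-process simulation, with workers as a Python list and a least-squares loss standing in for $\ell$, not a real multi-machine deployment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n workers, each holding a partition of a least-squares
# problem with per-sample loss l(w; (x, y)) = 0.5 * (x @ w - y)^2.
n_workers, d, samples_per_worker = 4, 10, 50
w_true = rng.normal(size=d)
partitions = []
for _ in range(n_workers):
    X = rng.normal(size=(samples_per_worker, d))
    y = X @ w_true + 0.01 * rng.normal(size=samples_per_worker)
    partitions.append((X, y))

def local_gradient(w, X, y):
    """g_k(w): average per-sample gradient over worker k's partition."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)
eta = 0.1
for t in range(200):
    # Master broadcasts w; each worker returns its local gradient.
    grads = [local_gradient(w, X, y) for X, y in partitions]
    # The master only needs the average (equivalently, the sum) of the
    # gradients -- never the individual g_k beyond forming that sum.
    w = w - eta * np.mean(grads, axis=0)

print(np.linalg.norm(w - w_true))  # small residual after convergence
```

The aggregation step is the single line `np.mean(grads, axis=0)`; everything in Part III is about computing that line without revealing its inputs.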

The fact that the master only needs $\sum_k \mathbf{g}_k$, not the individual gradients, is the structural opening that secure aggregation (Chapter 10) and over-the-air computation (Chapter 16) will exploit. From an information-theoretic standpoint, computing a function of the inputs can cost less than learning each input.

Distributed SGD

Synchronous distributed stochastic gradient descent: each round, every worker computes a local gradient on its data partition, the master averages all gradients, and updates the model. The dominant per-round costs are gradient broadcast ($d$ floats down per worker) and gradient aggregation ($d$ floats up per worker).

Parameter Server

The master node in a distributed-SGD architecture: it holds the canonical model parameters, broadcasts them to workers each round, and aggregates returned gradients. In federated learning the parameter server is the cloud and the workers are user devices.

Theorem: Per-Round Communication Cost of Distributed SGD

For synchronous distributed SGD with $n$ workers, model dimension $d$, and $b$-bit floating-point representation, the aggregate communication per training round is $C_{\text{round}} = n d b \;(\text{uplink}) + n d b \;(\text{downlink}) = 2 n d b$ bits. Over $T$ rounds the total communication is $2 n d b T$. For a contemporary deep network with $d \sim 10^9$ parameters, $n \sim 10^4$ workers, $b = 32$ bits, and $T \sim 10^5$ rounds, the total inter-machine traffic is $\approx 6.4 \cdot 10^{19}$ bits, many orders of magnitude larger than the dataset itself.
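A quick sanity check of the theorem's arithmetic, plugging in the numbers from the statement:

```python
# C_round = 2*n*d*b bits per round; total over T rounds is C_round * T.
n, d, b, T = 10**4, 10**9, 32, 10**5

c_round = 2 * n * d * b   # uplink + downlink bits per round
total = c_round * T       # bits over the whole training run

print(c_round)  # 640000000000000  (6.4e14 bits per round)
print(total)    # 64000000000000000000  (6.4e19 bits total)
```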

Distributed training is dominated by gradient communication, not by data movement. The dataset is partitioned and read once; the gradient cycles through every worker every iteration. This asymmetry is why gradient compression, sparsification, and quantization are now standard in production federated systems, and why the privacy of gradients, not raw data, is the central concern of Chapter 10.

Distributed-SGD Communication Cost vs. Workers and Model Size

Sweep the number of users $n$ and observe how the per-round communication cost scales. We plot the total uplink traffic (bits per round) for three model sizes that bracket modern deep learning: $d = 10^7$ (a small CNN), $d = 10^9$ (a large transformer), and $d = 10^{11}$ (a frontier mixture-of-experts model). The vertical axis uses a log scale. The shaded region marks the bandwidth available on a typical 1 Gbps backhaul during a one-second round; anything above the shaded region incurs per-round communication latency that exceeds the round itself.

Parameters: maximum number of users to plot $n = 10{,}000$; quantization precision for gradients $b = 32$ bits; training rounds aggregated $T = 100$.

Example: How Long Does a Single Distributed-SGD Round Take?

A model has $d = 10^9$ parameters at $b = 32$ bits each. We train with $n = 1000$ workers. Each worker has a 1 Gbps uplink. Ignoring computation time and broadcast cost, what is the minimum per-round communication time?
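A worked answer, under two assumptions the problem statement leaves open: first that all worker uplinks run fully in parallel with unconstrained master ingress, then that the master's own ingress is also limited to 1 Gbps:

```python
d, b, n = 10**9, 32, 1000
uplink_bps = 10**9                       # 1 Gbps per worker

bits_per_worker = d * b                  # 3.2e10 bits of gradient per worker
t_parallel = bits_per_worker / uplink_bps
print(t_parallel)                        # 32.0 s if all uplinks run concurrently

# If the master's ingress is also capped at 1 Gbps, it must absorb
# n * d * b bits serially:
t_serial = n * bits_per_worker / uplink_bps
print(t_serial)                          # 32000.0 s -- nearly nine hours per round
```

Either way, a 32-second floor per round (before any computation) explains why 32-bit plaintext gradients are rarely shipped at this scale.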

The Gradient Is Not Just a Number: It Leaks the Data

A common misconception is that distributed SGD is "private" because the raw data $\mathcal{D}_k$ never leaves user $k$'s device; only the gradient $\mathbf{g}_k$ does. This is wrong. A series of recent "gradient inversion" attacks has demonstrated that the gradient $\mathbf{g}_k = \nabla \ell(\mathbf{w}; \xi_k)$ contains enough information to reconstruct individual training samples, sometimes pixel-perfectly, given knowledge of the model architecture. This works even for batch sizes up to a few dozen, and even when only a few successive rounds of gradients are observed.
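A minimal numeric illustration of the leak, far simpler than the DLG attack itself: for a single-sample least-squares model, the gradient is a scalar multiple of the input, so an observer who sees the gradient recovers the input's direction exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-sample least-squares loss: l(w; (x, y)) = 0.5 * (w @ x - y)^2.
d = 8
x, y = rng.normal(size=d), 1.0
w = rng.normal(size=d)

# The gradient the worker would send: g = (w @ x - y) * x.
g = (w @ x - y) * x

# g is proportional to x, so its direction *is* the data's direction:
cos_sim = abs((g / np.linalg.norm(g)) @ (x / np.linalg.norm(x)))
print(cos_sim)  # 1.0 up to floating-point error
```

Deep networks are not this transparent, but DLG-style attacks close the gap by optimizing a dummy input until its gradient matches the observed one.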

The implication for system design is sharp: shipping plaintext gradients to a central server is not a privacy guarantee. The server can reconstruct user data from gradients alone, and so can any party who eavesdrops on the network. This is the operational motivation for secure aggregation (Chapter 10), where the protocol is designed so that the server learns only the aggregate $\mathbf{G} = \sum_k \mathbf{g}_k$ and nothing about any individual $\mathbf{g}_k$.

Historical Note: From DLG to Inversion of Foundation Models

2019–2024

The first influential gradient-inversion attack, Deep Leakage from Gradients (DLG) by Zhu, Liu, and Han (NeurIPS 2019), showed that for a small image classifier and a single training sample, the input pixels can be recovered to near-perfect fidelity from the gradient alone, by solving an inverse optimization problem. The technique was extended to mini-batches (iDLG, GradInversion) and subsequently to federated learning of language models, where partial reconstruction of training text from gradient updates has been demonstrated. The takeaway for federated-learning system designers is unambiguous: a plaintext gradient is, for cryptographic purposes, a noisy view of the underlying training samples, not an opaque numeric vector.

🚨 Critical Engineering Note

Gradient Compression Is Standard, But Not Privacy

Production federated-learning systems (Google's TensorFlow Federated, NVIDIA Flare, Apple's PrivateFL stack) routinely ship gradients in 8-bit or even 1-bit precision (signSGD), and combine quantization with top-$K$ sparsification. These tricks reduce $C_{\text{round}}$ by 1–2 orders of magnitude without measurably hurting convergence in many workloads. They are not, however, a privacy mechanism: a quantized gradient is still informative enough for inversion attacks. Privacy has to be added explicitly, either via cryptographic secure aggregation (Chapter 10) or via differential-privacy noise addition (Chapter 18). Compression and privacy are orthogonal axes; engineering one does not buy the other.

Practical Constraints
  • 8-bit gradient quantization typically loses < 0.5% top-1 accuracy on ImageNet-scale models
  • Top-$1\%$ sparsification reduces uplink size by $\sim 100\times$ but introduces stale gradient components
  • The gradient inversion threat is independent of compression strength

📋 Ref: Google FL system; TF Federated; OpenFL
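A toy sketch of the two compression tricks, using uniform 8-bit quantization and top-$K$ magnitude selection. This is illustrative only; production systems layer on refinements such as stochastic rounding and error feedback:

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.normal(size=10_000).astype(np.float32)   # stand-in gradient vector

# --- 8-bit uniform quantization: map [min, max] onto 256 levels ---
lo, hi = float(g.min()), float(g.max())
scale = (hi - lo) / 255
q = np.round((g - lo) / scale).astype(np.uint8)  # 1 byte/entry instead of 4
g_deq = lo + q.astype(np.float32) * scale        # what the master reconstructs

# --- top-1% sparsification: keep the largest-magnitude 1% of entries ---
k = len(g) // 100
idx = np.argpartition(np.abs(g), -k)[-k:]
g_sparse = np.zeros_like(g)
g_sparse[idx] = g[idx]

max_err = float(np.abs(g - g_deq).max())
print(max_err <= scale)                  # quantization error bounded by one step
print(np.count_nonzero(g_sparse) == k)   # only k of 10,000 entries survive
```

Note what the sketch does not change: `g_deq` is still a close copy of `g`, entry by entry, which is exactly why quantization alone defeats no inversion attack.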

Common Mistake: Federated Learning Is Not "Private by Default"

Mistake:

Federated learning keeps raw data on user devices, therefore the server learns nothing private about individual users.

Correction:

The server receives gradients, and gradients leak the underlying samples. Without secure aggregation or differential privacy, a federated-learning protocol provides less privacy than a well-implemented data-center pipeline with strict access controls. The privacy guarantee in FL has to come from the cryptographic or information-theoretic protocol on top of it, not from the architecture itself.

Why This Matters: Aggregation Is a Function, and Over-the-Air Computation Exploits That

Notice the mathematical structure of the aggregation step: the master only needs $\sum_k \mathbf{g}_k$, not the individual gradients. In a wireless network, this aggregation can be performed in the channel itself by letting all users transmit simultaneously: the multiple-access channel naturally superposes the signals to produce $\sum_k \mathbf{g}_k + \text{noise}$. This is the analog over-the-air computation paradigm of Chapter 16, which cuts the per-round uplink cost from $n d b$ to $d b$ at the price of additive noise on the aggregate.
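A toy simulation of the idea, modeling the multiple-access channel as exact real-valued superposition plus Gaussian receiver noise. This is a strong idealization; real channels add fading and synchronization error:

```python
import numpy as np

rng = np.random.default_rng(3)

# n users each hold a d-dimensional gradient; the channel superposes
# simultaneous transmissions and adds receiver noise (idealized real MAC).
n, d, noise_std = 100, 1000, 0.1
grads = rng.normal(size=(n, d))

# Digital scheme: the server decodes each g_k separately (n*d*b uplink bits).
# Analog OTA scheme: all users transmit at once; the server receives only
# the sum plus noise -- one channel use per gradient coordinate.
received = grads.sum(axis=0) + noise_std * rng.normal(size=d)

true_sum = grads.sum(axis=0)
rel_err = np.linalg.norm(received - true_sum) / np.linalg.norm(true_sum)
print(rel_err)  # small: the aggregate survives; no g_k is individually decoded
```

The relative error stays small because the signal energy of the sum grows with $n$ while the receiver noise does not, which is the quantitative case Chapter 16 develops.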

Quick Check

A federated-learning round trains a model with $d = 10^8$ parameters on $n = 10^4$ users, each at $b = 16$ bits per gradient scalar. What is the per-round aggregate uplink traffic?

$1.6 \cdot 10^{13}$ bits = 2 TB

$10^8$ bits = 12.5 MB

$10^4$ bits = 1.25 KB

$10^{12}$ bits = 125 GB