Distributed Learning and Information-Theoretic Limits
Why Communication Constraints Matter for Learning
In federated learning and distributed estimation, $m$ users each observe local data and must communicate with a central server to learn a global model. The fundamental question is: how many bits must each user transmit? This is a distributed source coding problem: each user compresses its local sufficient statistic, and the server must reconstruct the global parameter. The information-theoretic limits of this setup determine the minimum communication cost of distributed learning, and they reveal when communication is the bottleneck and when statistics is.
Definition: Distributed Statistical Estimation
Consider $m$ users, where user $i$ observes $n$ i.i.d. samples $X_{i,1}, \dots, X_{i,n}$ from a parametric family $\{P_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$. Each user sends a message $M_i$ of at most $B_i$ bits to a central server. The server produces an estimate $\hat\theta(M_1, \dots, M_m)$. The goal is to characterize the minimax risk $\inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}\|\hat\theta - \theta\|^2$ as a function of the total samples $mn$ and the total communication budget $B = \sum_{i=1}^m B_i$.
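A minimal simulation sketch of this protocol in the scalar Gaussian case, assuming each user sends a uniformly quantized version of its local sample mean; the quantizer, parameter values, and function names here are illustrative rather than part of the formal definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_uniform(x, b, lo=-1.0, hi=1.0):
    """Deterministic uniform b-bit scalar quantizer on [lo, hi]."""
    levels = 2 ** b
    x = np.clip(x, lo, hi)
    idx = np.round((x - lo) / (hi - lo) * (levels - 1))
    return lo + idx * (hi - lo) / (levels - 1)

def distributed_mean_estimate(theta, m, n, b, sigma=1.0):
    """Each of m users averages its n Gaussian samples and sends a b-bit
    quantized message; the server averages the m messages."""
    local_means = theta + sigma * rng.standard_normal((m, n)).mean(axis=1)
    messages = quantize_uniform(local_means, b)   # B_i = b bits per user
    return messages.mean()

theta, m, n = 0.3, 100, 10
for b in [1, 2, 4, 8]:
    mse = np.mean([(distributed_mean_estimate(theta, m, n, b) - theta) ** 2
                   for _ in range(200)])
    print(f"b = {b} bits/user: empirical MSE = {mse:.2e}  "
          f"(centralized limit sigma^2/(mn) = {1/(m*n):.2e})")
```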
Theorem: Communication Lower Bound for Distributed Mean Estimation
For the problem of estimating the mean $\theta \in \mathbb{R}^d$ of a Gaussian distribution $\mathcal{N}(\theta, \sigma^2 I_d)$ with $m$ users, each observing $n$ samples and communicating $B$ bits total, the minimax MSE satisfies:
$$\inf_{\hat\theta} \sup_{\theta} \mathbb{E}\|\hat\theta - \theta\|^2 \;\gtrsim\; \max\left\{ \frac{\sigma^2 d}{mn},\; \frac{\sigma^2 d^2}{B} \right\}$$
up to constant factors. The first term is the centralized (unlimited communication) rate; the second is the price of communication.
The $\sigma^2 d/(mn)$ term is the statistical limit: even with infinite communication, you cannot do better than the centralized MLE. The $\sigma^2 d^2/B$ term is the communication limit: each transmitted bit can contribute at most $O(1/\sigma^2)$ of Fisher information at the server, so $B$ bits cannot push the MSE below roughly $\sigma^2 d^2/B$. When $B \ll d\,mn$, communication is the bottleneck and the distributed system pays a penalty over the centralized one.
Statistical lower bound
The Cramér-Rao bound gives $\mathbb{E}\|\hat\theta - \theta\|^2 \ge \sigma^2 d/(mn)$ for the centralized problem. Since the distributed setting is more constrained (each user can only send $B_i$ bits), the distributed MSE is at least as large: $\mathbb{E}\|\hat\theta - \theta\|^2 \ge \sigma^2 d/(mn)$.
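Spelled out, the Fisher-information calculation behind this step is the standard Gaussian-location computation:
$$I_{mn}(\theta) \;=\; mn \cdot I_1(\theta) \;=\; \frac{mn}{\sigma^2}\, I_d, \qquad \mathbb{E}\|\hat\theta - \theta\|^2 \;\ge\; \operatorname{tr}\!\left(I_{mn}(\theta)^{-1}\right) \;=\; \frac{\sigma^2 d}{mn}.$$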
Communication lower bound via strong data processing
Each user's message $M_i$ satisfies $I(X_i; M_i) \le H(M_i) \le B_i$ (since $M_i$ takes at most $2^{B_i}$ values). By the van Trees inequality (Bayesian Cramér-Rao) with a uniform prior $\pi$ on a hypercube of side length $a$:
$$\mathbb{E}\|\hat\theta - \theta\|^2 \;\ge\; \frac{d^2}{\mathbb{E}_\pi\!\left[\operatorname{tr} I_M(\theta)\right] + \operatorname{tr} I(\pi)},$$
where $I_M(\theta)$ is the Fisher information available at the server from the messages $M_1, \dots, M_m$. Since mutual information constrains the information flow, each $B_i$-bit message contributes at most $O(B_i/\sigma^2)$ to $\operatorname{tr} I_M(\theta)$, so $\operatorname{tr} I_M(\theta) \lesssim B/\sigma^2$, yielding $\mathbb{E}\|\hat\theta - \theta\|^2 \gtrsim \sigma^2 d^2 / B$ when $B \lesssim d\,mn$.
Combine the bounds
Taking the maximum of the two lower bounds:
$$\mathbb{E}\|\hat\theta - \theta\|^2 \;\gtrsim\; \max\left\{ \frac{\sigma^2 d}{mn},\; \frac{\sigma^2 d^2}{B} \right\}.$$
The transition occurs at $B \asymp d\,mn$: below this threshold, communication is the bottleneck; above it, statistics is.
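As a quick numerical illustration (with made-up values $d = 100$, $m = 1000$, $n = 10$, $\sigma^2 = 1$):
$$\frac{\sigma^2 d}{mn} = \frac{100}{10^4} = 10^{-2}, \qquad \frac{\sigma^2 d^2}{B} = \frac{10^4}{B}, \qquad \text{so the two terms cross at } B = d\,mn = 10^6 \text{ bits}.$$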
Definition: Federated Learning
Federated learning is a distributed optimization framework where $m$ users collaboratively minimize a global objective $F(w) = \frac{1}{m}\sum_{i=1}^m F_i(w)$ without sharing raw data. Each user $i$ has a local dataset $D_i$ and computes local gradients $g_i = \nabla F_i(w)$. The communication protocol typically involves the following steps (a minimal code sketch of one round appears after the list):
- Downlink: Server broadcasts the current model $w^t$
- Local computation: Each user runs $K$ local SGD steps starting from $w^t$, producing $w_i^t$
- Uplink: Each user sends a compressed update $Q(\Delta_i^t)$, where $\Delta_i^t = w_i^t - w^t$
- Aggregation: Server updates $w^{t+1} = w^t + \frac{1}{m}\sum_{i=1}^m Q(\Delta_i^t)$
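A minimal sketch of one such round, assuming a least-squares local objective and an identity compressor $Q$ as placeholders; the function names (`local_sgd`, `fedavg_round`) and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_sgd(w, X, y, steps=5, lr=0.1):
    """K local gradient steps on a least-squares objective F_i(w) = ||Xw - y||^2 / (2|D_i|)."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg_round(w_global, datasets, Q=lambda v: v):
    """One round: broadcast w_global, run local SGD, compress updates, aggregate."""
    updates = []
    for X, y in datasets:                        # downlink: each user starts from w_global
        w_local = local_sgd(w_global.copy(), X, y)
        updates.append(Q(w_local - w_global))    # uplink: compressed update Delta_i
    return w_global + np.mean(updates, axis=0)   # aggregation

# Toy example: m users sharing one linear model, each with different local data.
d, m = 5, 8
w_true = rng.standard_normal(d)
datasets = []
for _ in range(m):
    X = rng.standard_normal((50, d))
    datasets.append((X, X @ w_true + 0.1 * rng.standard_normal(50)))

w = np.zeros(d)
for t in range(20):
    w = fedavg_round(w, datasets)
print("distance to w_true:", np.linalg.norm(w - w_true))
```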
The information-theoretic question is: what is the minimum number of bits per round (uplink communication) needed to achieve a target optimization accuracy $\epsilon$? This is fundamentally a joint source-channel coding problem when the uplink is a noisy channel.
Theorem: Communication Complexity of Federated SGD
For federated optimization of a $d$-dimensional convex objective with $L$-Lipschitz gradients, to achieve $\epsilon$-accuracy after $T$ rounds with $m$ users, the total uplink communication must satisfy:
$$C_{\text{uplink}} \;\gtrsim\; m\,d\,T \quad \text{bits}.$$
This is achievable (up to logarithmic factors) using stochastic gradient quantization with $O(1)$ bits per coordinate per user per round.
Each user must communicate at least enough information to specify its local gradient to within the accuracy of the intrinsic sampling noise (so that the average of the $m$ quantized gradients is essentially as accurate as the unquantized average). Since the gradient is $d$-dimensional and each coordinate requires $\Omega(1)$ bits to reach this accuracy, the total over $m$ users and $T$ rounds is $\Omega(m\,d\,T)$ bits.
Reduction to distributed mean estimation
At each round $t$, the server needs to estimate the average gradient $\nabla F(w^t) = \frac{1}{m}\sum_{i=1}^m g_i^t$, where $g_i^t = \nabla F_i(w^t)$. This is a distributed mean estimation problem.
Apply the distributed estimation lower bound
From the previous theorem, estimating a $d$-dimensional mean from $m$ users to MSE $\epsilon'$ requires $B \gtrsim \sigma^2 d^2/\epsilon'$ bits when the budget (rather than the sample size) is the binding constraint. More precisely, when each user has one sample (one gradient) and the per-round target error is the sampling-noise floor $\epsilon' \asymp \sigma^2 d/m$, the bound becomes $B \gtrsim m\,d$ bits per round, i.e. $\Omega(m\,d\,T)$ bits total across the $T$ rounds.
Achievability via quantized SGD
Random dithering quantization with $b$ bits per coordinate produces an unbiased estimator of the gradient with per-coordinate variance $O(R^2\,2^{-2b})$ for gradients with $\|g\|_\infty \le R$. Setting $b$ appropriately (a constant number of bits, or at most $O(\log(1/\epsilon))$) and accumulating over $T$ rounds achieves $\epsilon$-accuracy with total communication $O(m\,d\,T\,b)$ bits, matching the lower bound up to logarithmic factors.
Example: Random Dithering for Gradient Compression
User $i$ has a gradient vector $g \in \mathbb{R}^d$ with $\|g\|_\infty \le R$. Design a randomized quantizer $Q$ using $b$ bits per coordinate such that $\mathbb{E}[Q(g)] = g$ (unbiased) and the variance per coordinate is minimized.
Random dithering construction
For each coordinate $j$, define the quantization levels $\ell_k = -R + k\Delta$ for $k = 0, 1, \dots, 2^b - 1$, with step size $\Delta = \frac{2R}{2^b - 1}$. For $g_j \in [\ell_k, \ell_{k+1}]$, randomly round:
$$Q(g_j) = \begin{cases} \ell_{k+1} & \text{with probability } \dfrac{g_j - \ell_k}{\Delta}, \\[4pt] \ell_k & \text{otherwise}. \end{cases}$$
Verify unbiasedness
$\mathbb{E}[Q(g_j)] = \ell_k\left(1 - \frac{g_j - \ell_k}{\Delta}\right) + \ell_{k+1} \cdot \frac{g_j - \ell_k}{\Delta} = \ell_k + (g_j - \ell_k) = g_j$.
Compute the variance
$\mathrm{Var}(Q(g_j)) = (g_j - \ell_k)(\ell_{k+1} - g_j) \le \frac{\Delta^2}{4}$. Per user, over all $d$ coordinates, the total variance is at most $\frac{d\Delta^2}{4} = \frac{dR^2}{(2^b - 1)^2}$. Communication: $d\,b$ bits per user per round. The signal-to-quantization-noise ratio is roughly $6b$ dB (about 6 dB per bit), analogous to a $b$-bit ADC in signal processing.
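A short sketch of this quantizer (a direct reading of the construction above; the function and variable names are mine), with an empirical check of unbiasedness and of the $\Delta^2/4$ variance bound:

```python
import numpy as np

rng = np.random.default_rng(2)

def dither_quantize(g, b, R):
    """Randomized (dithered) rounding of each coordinate of g onto the grid
    {-R, -R+Delta, ..., R} with Delta = 2R/(2^b - 1); unbiased by construction."""
    delta = 2.0 * R / (2 ** b - 1)
    g = np.clip(g, -R, R)
    lower = np.floor((g + R) / delta) * delta - R     # nearest level below g
    p_up = (g - lower) / delta                        # round-up probability
    return lower + delta * (rng.random(g.shape) < p_up)

g = rng.uniform(-1, 1, size=1000)
b, R = 3, 1.0
samples = np.stack([dither_quantize(g, b, R) for _ in range(5000)])
print("max |bias|      :", np.abs(samples.mean(axis=0) - g).max())
print("max variance    :", samples.var(axis=0).max())
print("bound Delta^2/4 :", (2 * R / (2 ** b - 1)) ** 2 / 4)
```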
Federated Learning: Communication vs. Accuracy Tradeoff
Explore the tradeoff between per-user communication budget and convergence accuracy for federated SGD with gradient quantization.
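To make the tradeoff concrete, here is a small script that evaluates the per-round MSE of the quantized gradient average as a function of the per-coordinate bit budget $b$, using the dithering variance bound from the example above; the gradient bound $R$, noise level, and dimensions are illustrative:

```python
import numpy as np

def quantized_avg_mse(b, m, d, R=1.0, sigma_g=0.5):
    """Per-round MSE of the averaged gradient estimate:
    sampling noise of the mean plus b-bit dithering noise, both averaged over m users."""
    delta = 2.0 * R / (2 ** b - 1)
    sampling = sigma_g ** 2 * d / m          # noise of the unquantized average
    quantization = (delta ** 2 / 4) * d / m  # extra noise from b-bit dithering
    return sampling + quantization

m, d = 100, 10_000
for b in [1, 2, 4, 8, 16]:
    mse = quantized_avg_mse(b, m, d)
    print(f"b = {b:2d} bits/coord -> uplink = {b * d / 8 / 1024:.1f} KiB/user/round, "
          f"MSE = {mse:.3e}")
```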
Differential Privacy Amplifies the Communication Bottleneck
In practice, federated learning must also satisfy differential privacy constraints, which require adding noise to each user's message. This noise compounds the quantization noise, further increasing the communication cost. For $(\varepsilon_{\mathrm{DP}}, \delta)$-differential privacy with the Gaussian mechanism, each user adds noise $z_i \sim \mathcal{N}(0, \sigma_{\mathrm{DP}}^2 I_d)$ where $\sigma_{\mathrm{DP}} = \frac{\Delta_2 \sqrt{2\ln(1.25/\delta)}}{\varepsilon_{\mathrm{DP}}}$ and $\Delta_2$ is the $\ell_2$-sensitivity of the update. The total MSE per round becomes:
$$\mathrm{MSE} \;\approx\; \frac{d}{m}\left(\frac{\Delta^2}{4} + \sigma_{\mathrm{DP}}^2\right).$$
The privacy noise dominates when $\sigma_{\mathrm{DP}}^2 \gg \Delta^2/4$, i.e., when the privacy requirement is strict relative to the quantization resolution.
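A small numerical sketch of this comparison, assuming the standard Gaussian-mechanism calibration above; the clipping norm, bit budget, and privacy parameters are illustrative:

```python
import numpy as np

def gaussian_mechanism_sigma(l2_sensitivity, eps_dp, delta_dp):
    """Noise standard deviation of the Gaussian mechanism for (eps, delta)-DP."""
    return l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta_dp)) / eps_dp

# Illustrative values: per-round update clipped to L2 norm 1, 3-bit dithering on [-R, R].
R, b = 1.0, 3
delta_q = 2 * R / (2 ** b - 1)
quant_var = delta_q ** 2 / 4                 # per-coordinate quantization variance

for eps_dp in [0.5, 1.0, 4.0, 10.0]:
    sigma_dp = gaussian_mechanism_sigma(1.0, eps_dp, 1e-5)
    dominant = "privacy" if sigma_dp ** 2 > quant_var else "quantization"
    print(f"eps_DP = {eps_dp:4.1f}: sigma_DP^2 = {sigma_dp**2:.3f}, "
          f"Delta^2/4 = {quant_var:.3f} -> {dominant} noise dominates")
```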
Common Mistake: Non-IID Data Breaks Simple Averaging
Mistake:
Assuming that averaging local model updates produces a good global model when users have heterogeneous (non-i.i.d.) data distributions.
Correction:
When local data distributions differ significantly, local SGD on each user converges toward different optima. Simple averaging of these divergent models can perform worse than training on any single user's data. Techniques like FedProx (adding a proximal term to keep local models close to the global model), SCAFFOLD (variance reduction via control variates), or gradient tracking are needed; a minimal sketch of the FedProx-style local step follows. The information-theoretic analysis must account for the heterogeneity gap, e.g. $\zeta^2 = \frac{1}{m}\sum_{i=1}^m \|\nabla F_i(w^\star)\|^2$, the dispersion of local gradients at the global optimum $w^\star$.
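The sketch below shows a FedProx-style local update as mentioned above: the only change from the plain local SGD step is the proximal term $\frac{\mu}{2}\|w - w^t\|^2$. The value of $\mu$, the quadratic loss, and the function name are illustrative.

```python
import numpy as np

def fedprox_local_sgd(w_global, X, y, mu=0.1, steps=5, lr=0.1):
    """Local steps on F_i(w) + (mu/2) * ||w - w_global||^2.
    The proximal term pulls heterogeneous users back toward the global model."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the local least-squares loss
        grad += mu * (w - w_global)         # gradient of the proximal term
        w -= lr * grad
    return w
```

Setting $\mu = 0$ recovers the plain FedAvg local step; larger $\mu$ trades local progress for stability under heterogeneity.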
Historical Note: From Distributed Estimation to Federated Learning
2013–present: The information-theoretic study of distributed estimation dates back to the multiterminal estimation literature of the 1980s and 1990s, with the modern minimax treatment due to Zhang et al. and Duchi et al. around 2013, but the term "federated learning" was coined by McMahan et al. at Google in 2017, motivated by training models on mobile phone data without centralizing it. The original FedAvg algorithm is remarkably simple: each user runs local SGD and uploads the resulting model; the server averages them. The information-theoretic community quickly recognized this as a distributed source coding problem with an interesting twist: the "source" (gradients) changes at each round, creating a sequential source coding problem over a time-varying source.
Quick Check
In distributed mean estimation with $m$ users ($n$ samples each, dimension $d$) and total budget $B$ bits, when is communication the bottleneck (rather than statistics)?
- When $B \gg d\,mn$ (more bits than needed)
- When $B \ll d\,mn$ (fewer bits than the statistical limit would need)
- When $m = 1$ (single user)
- Always, regardless of $B$
When $B \ll d\,mn$, the communication term $\sigma^2 d^2/B$ exceeds the statistical term $\sigma^2 d/(mn)$, making communication the bottleneck.
Federated Learning
A distributed machine learning paradigm where multiple users collaboratively train a shared model under communication and privacy constraints, without sharing raw data.
Gradient Quantization
The process of reducing the bit precision of gradient vectors before communication in distributed learning, trading off quantization noise for reduced communication cost.