Distributed Learning and Information-Theoretic Limits

Why Communication Constraints Matter for Learning

In federated learning and distributed estimation, $K$ users each observe local data and must communicate with a central server to learn a global model. The fundamental question is: how many bits must each user transmit? This is a distributed source coding problem: each user compresses its local sufficient statistic, and the server must reconstruct the global parameter. The information-theoretic limits of this setup determine the minimum communication cost of distributed learning, and they reveal when communication is the bottleneck and when computation is.

Definition: Distributed Statistical Estimation

Consider $K$ users, where user $k$ observes $n_k$ i.i.d. samples $X_k^{n_k} \sim P_\theta^{n_k}$ from a parametric family $\{P_\theta : \theta \in \Theta\}$. Each user $k$ sends a message $M_k = f_k(X_k^{n_k})$ of at most $B_k$ bits to a central server. The server produces an estimate $\hat{\theta} = g(M_1, \ldots, M_K)$. The goal is to characterize the minimax risk
$$R^*(n, B) = \inf_{f_1, \ldots, f_K, g} \sup_{\theta \in \Theta} \mathbb{E}\left[\|\hat{\theta} - \theta\|^2\right]$$
as a function of the total number of samples $n = \sum_k n_k$ and the total communication budget $B = \sum_k B_k$.

Theorem: Communication Lower Bound for Distributed Mean Estimation

For the problem of estimating the mean $\theta \in \mathbb{R}^d$ of a Gaussian distribution $\mathcal{N}(\theta, I_d)$ with $K$ users, each observing $n/K$ samples and communicating $B$ bits total, the minimax MSE satisfies
$$R^*(n, B) \geq \max\left\{\frac{d}{n}, \frac{d^2}{B}\right\}$$
up to constant factors. The first term is the centralized (unlimited-communication) rate; the second is the price of communication.

The term $d/n$ is the statistical limit: even with unlimited communication, you cannot do better than the centralized MLE. The term $d^2/B$ is the communication limit: with a budget of $B$ bits, the MSE cannot fall below roughly $d^2/B$, no matter how many samples the users hold. When $B \ll dn$, communication is the bottleneck and the distributed system pays a penalty over the centralized one.
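To see where the two regimes trade off, here is a quick numeric check of the bound; the parameter values are illustrative choices, not from the text.

```python
# Evaluate the two terms of the lower bound max(d/n, d^2/B).
# The crossover sits at B = d*n (here 10^9 bits).
d, n = 1_000, 1_000_000
for B in [10**6, 10**8, d * n, 10**12]:
    stat, comm = d / n, d**2 / B
    regime = "communication-limited" if comm > stat else "statistics-limited"
    print(f"B = {B:.0e}: d/n = {stat:.1e}, d^2/B = {comm:.1e} -> {regime}")
```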

Definition: Federated Learning

Federated learning is a distributed optimization framework where $K$ users collaboratively minimize a global objective $F(\theta) = \frac{1}{K}\sum_{k=1}^K F_k(\theta)$ without sharing raw data. Each user $k$ has a local dataset and computes local gradients $\nabla F_k(\theta)$. The communication protocol typically involves the following steps (a minimal code sketch follows the list):

  1. Downlink: Server broadcasts the current model $\theta^{(t)}$
  2. Local computation: Each user runs local SGD steps
  3. Uplink: Each user sends a compressed update $\Delta_k^{(t)}$
  4. Aggregation: Server updates $\theta^{(t+1)} = \theta^{(t)} + \frac{1}{K}\sum_k \Delta_k^{(t)}$
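As referenced above, here is a minimal sketch of this protocol for local least-squares objectives on synthetic data. Everything here (data model, parameter values) is an assumption for illustration; it is not a reference FedAvg implementation, and the uplink updates are not yet compressed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, local_steps, lr, rounds = 20, 5, 10, 0.01, 300
w_true = rng.normal(size=d)                               # shared signal
A = [rng.normal(size=(50, d)) for _ in range(K)]          # user k's features
y = [a @ w_true + 0.1 * rng.normal(size=50) for a in A]   # user k's targets

def federated_round(theta):
    deltas = []
    for k in range(K):
        x = theta.copy()                     # 1. downlink: broadcast model
        for _ in range(local_steps):         # 2. local SGD on user k's data
            i = rng.integers(len(y[k]))
            x -= lr * (A[k][i] @ x - y[k][i]) * A[k][i]
        deltas.append(x - theta)             # 3. uplink: send the update
    return theta + np.mean(deltas, axis=0)   # 4. aggregation: average updates

theta = np.zeros(d)
for _ in range(rounds):
    theta = federated_round(theta)
print("distance to w_true:", np.linalg.norm(theta - w_true))
```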

The information-theoretic question is: what is the minimum number of bits per round (uplink communication) to achieve a target optimization accuracy $\epsilon$? This is fundamentally a joint source-channel coding problem when the uplink is a noisy channel.

Theorem: Communication Complexity of Federated SGD

For federated optimization of a $d$-dimensional convex objective with $L$-Lipschitz gradients, to achieve $\epsilon$-accuracy after $T$ rounds with $K$ users, the total uplink communication $B$ must satisfy
$$B \geq \Omega\left(\frac{dK}{\epsilon^2}\right) \text{ bits.}$$
This is achievable (up to logarithmic factors) using stochastic gradient quantization with $O(\log(d/\epsilon))$ bits per coordinate per user per round.

Each user must communicate at least enough information to specify its local gradient to within accuracy $\epsilon/\sqrt{K}$ (so that the average of $K$ noisy gradients has error $\epsilon$). Since the gradient is $d$-dimensional and each coordinate requires $\Omega(1/\epsilon^2)$ bits for this accuracy, the total is $\Omega(dK/\epsilon^2)$.
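To get a sense of scale, here is a plug-in calculation with illustrative numbers (ignoring the constants hidden by $\Omega(\cdot)$): for $d = 10^6$ parameters, $K = 100$ users, and target accuracy $\epsilon = 10^{-2}$,
$$B \gtrsim \frac{dK}{\epsilon^2} = \frac{10^6 \cdot 10^2}{10^{-4}} = 10^{12} \text{ bits} \approx 125 \text{ GB},$$
which is why aggressive gradient compression is standard practice at this scale.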

Example: Random Dithering for Gradient Compression

User $k$ has a gradient vector $g_k \in \mathbb{R}^d$ with $\|g_k\|_\infty \leq G$. Design a randomized quantizer $Q(g_k)$ using $b$ bits per coordinate such that $\mathbb{E}[Q(g_k)] = g_k$ (unbiased) and the variance per coordinate is minimized.
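One standard construction that meets these requirements is randomized (dithered) rounding onto a uniform grid of $2^b$ levels on $[-G, G]$: it is unbiased, and its per-coordinate variance is at most $\Delta^2/4$ for level spacing $\Delta = 2G/(2^b - 1)$. Below is a minimal NumPy sketch of this construction; the function name and the sanity check are ours, not from the text.

```python
import numpy as np

def dithered_quantize(g, G, b, rng):
    """Unbiased b-bit dithered quantizer for g with ||g||_inf <= G.

    Each coordinate is randomly rounded to one of the 2**b uniformly
    spaced levels on [-G, G]: round up with probability equal to the
    fractional position between the two adjacent levels, so E[Q(g)] = g.
    """
    delta = 2 * G / (2**b - 1)                 # level spacing
    pos = (np.clip(g, -G, G) + G) / delta      # position on the level grid
    lower = np.floor(pos)
    idx = lower + (rng.random(g.shape) < pos - lower)  # randomized rounding
    # 'idx' (b bits per coordinate) is what would actually be transmitted;
    # the server dequantizes it back to a point in [-G, G]:
    return idx * delta - G

# Sanity check: the empirical mean of Q(g) approaches g (unbiasedness).
rng = np.random.default_rng(0)
g = rng.uniform(-1, 1, size=5)
samples = np.stack([dithered_quantize(g, G=1.0, b=4, rng=rng)
                    for _ in range(20_000)])
print(np.abs(samples.mean(axis=0) - g).max())  # close to 0
```

Each extra bit halves $\Delta$ and so quarters the worst-case variance $\Delta^2/4$; this is the source of the $2^{-2b}$ factor in the MSE expression in the Engineering Note below.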

Federated Learning: Communication vs. Accuracy Tradeoff

[Interactive demo in the original: explore the tradeoff between the per-user communication budget and convergence accuracy for federated SGD with gradient quantization.]
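As a rough stand-in for the demo, the sketch below simulates federated SGD with $b$-bit dithered gradient quantization on simple local quadratics, whose global optimum is the mean of the local optima. The model and every parameter value are assumptions for illustration, not the demo's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T, lr, G = 100, 10, 1000, 0.1, 8.0
theta_k = rng.normal(size=(K, d))      # local optima (heterogeneous users)
opt = theta_k.mean(axis=0)             # global optimum of F = mean of F_k

def quantize(x, G, b):
    """Unbiased b-bit dithered quantizer on [-G, G] (see the example above)."""
    delta = 2 * G / (2**b - 1)
    pos = (np.clip(x, -G, G) + G) / delta
    low = np.floor(pos)
    return (low + (rng.random(x.shape) < pos - low)) * delta - G

for b in [1, 2, 4, 8]:
    x = np.zeros(d)
    for _ in range(T):
        grads = x - theta_k                      # user k's gradient of F_k
        q = np.stack([quantize(g, G, b) for g in grads])
        x -= lr * q.mean(axis=0)                 # server averages and steps
    print(f"b={b}: final error ||x - opt||^2 = {np.sum((x - opt)**2):.4f}")
```

Coarser quantization (small $b$) leaves a higher noise floor at convergence, while beyond a certain $b$ the error saturates and extra bits are wasted: exactly the tradeoff the demo explored.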
⚠️ Engineering Note

Differential Privacy Amplifies the Communication Bottleneck

In practice, federated learning must often also satisfy differential privacy constraints, which require adding noise to each user's message. This noise compounds the quantization noise, further increasing the communication cost. For $(\varepsilon, \delta)$-differential privacy with the Gaussian mechanism, each user adds noise $\mathcal{N}(0, \sigma_{\text{DP}}^2 I_d)$ where $\sigma_{\text{DP}} \propto 1/\varepsilon$. The total MSE per round becomes
$$\text{MSE} \approx \frac{d G^2 \cdot 2^{-2b}}{K} + \frac{d \sigma_{\text{DP}}^2}{K}.$$
The privacy noise dominates when $\sigma_{\text{DP}}^2 \gg G^2 \cdot 2^{-2b}$, i.e., when the privacy requirement is strict relative to the quantization resolution.
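A quick way to see which regime you are in is to plug numbers into the two terms of this MSE expression; the values below, including $\sigma_{\text{DP}}$, are illustrative assumptions rather than calibrated privacy parameters.

```python
# Compare the quantization and privacy terms of the per-round MSE.
d, K, G = 10_000, 100, 1.0
for b, sigma_dp in [(4, 0.01), (4, 0.5), (8, 0.01)]:
    quant_term = d * G**2 * 2.0 ** (-2 * b) / K
    dp_term = d * sigma_dp**2 / K
    winner = "privacy noise" if dp_term > quant_term else "quantization noise"
    print(f"b={b}, sigma_dp={sigma_dp}: quant={quant_term:.2e}, "
          f"dp={dp_term:.2e} -> {winner} dominates")
```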

Common Mistake: Non-IID Data Breaks Simple Averaging

Mistake:

Assuming that averaging local model updates produces a good global model when users have heterogeneous (non-i.i.d.) data distributions.

Correction:

When local data distributions differ significantly, local SGD on each user converges toward a different optimum. Simple averaging of these divergent models can perform worse than training on any single user's data. Techniques like FedProx (adding a proximal term to keep local models close to the global model), SCAFFOLD (variance reduction via control variates), or gradient tracking are needed. The information-theoretic analysis must account for the heterogeneity gap $\max_k \|\nabla F_k(\theta^*) - \nabla F(\theta^*)\|$.
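For concreteness, FedProx's local subproblem at round $t$ augments the local objective with a proximal term of weight $\mu > 0$ (a tunable hyperparameter) that pulls user $k$'s iterate back toward the broadcast model:
$$\min_{\theta} \; F_k(\theta) + \frac{\mu}{2} \left\|\theta - \theta^{(t)}\right\|^2$$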

Historical Note: From Distributed Estimation to Federated Learning

2013–present

The information-theoretic study of distributed estimation goes back decades: Zhang and Berger studied estimation from compressed information in the late 1980s, and Duchi, Jordan, Wainwright, and coauthors revived communication-constrained estimation in the 2010s. The term "federated learning" itself was coined by McMahan et al. at Google around 2016–2017, motivated by training models on mobile phone data without centralizing it. The original FedAvg algorithm is remarkably simple: each user runs local SGD and uploads the resulting model; the server averages them. The information-theoretic community quickly recognized this as a distributed source coding problem with an interesting twist: the "source" (the gradients) changes at each round, creating a sequential source coding problem over a time-varying source.

Quick Check

In distributed mean estimation with $K$ users and total budget $B$ bits, when is communication the bottleneck (rather than statistics)?

When $B > dn$ (more bits than needed)

When $B < dn$ (fewer bits than the statistical limit would need)

When $K = 1$ (single user)

Always, regardless of $B$

Federated Learning

A distributed machine learning paradigm where multiple users collaboratively train a shared model under communication and privacy constraints, without sharing raw data.


Gradient Quantization

The process of reducing the bit precision of gradient vectors before communication in distributed learning, trading off quantization noise for reduced communication cost.

Related: Random Dithering for Gradient Compression