Distributed Learning and Information-Theoretic Limits

Why Communication Constraints Matter for Learning

In federated learning and distributed estimation, $K$ users each observe local data and must communicate with a central server to learn a global model. The fundamental question is: how many bits must each user transmit? This is a distributed source coding problem: each user compresses its local sufficient statistic, and the server must reconstruct the global parameter. The information-theoretic limits of this setup determine the minimum communication cost of distributed learning, and they reveal when communication is the bottleneck and when computation is.

Definition: Distributed Statistical Estimation

Consider $K$ users, where user $k$ observes $n_k$ i.i.d. samples $X_k^{n_k} \sim P_\theta^{n_k}$ from a parametric family $\{P_\theta : \theta \in \Theta\}$. Each user $k$ sends a message $M_k = f_k(X_k^{n_k})$ of at most $B_k$ bits to a central server. The server produces an estimate $\hat{\theta} = g(M_1, \ldots, M_K)$. The goal is to characterize the minimax risk
$$R^*(n, B) = \inf_{f_1, \ldots, f_K, g} \sup_{\theta \in \Theta} \mathbb{E}\left[\|\hat{\theta} - \theta\|^2\right]$$
as a function of the total number of samples $n = \sum_k n_k$ and the total communication budget $B = \sum_k B_k$.

Theorem: Communication Lower Bound for Distributed Mean Estimation

For the problem of estimating the mean $\theta \in \mathbb{R}^d$ of a Gaussian distribution $\mathcal{N}(\theta, I_d)$ with $K$ users, each observing $n/K$ samples and communicating $B$ bits total, the minimax MSE satisfies
$$R^*(n, B) \geq \max\left\{\frac{d}{n}, \frac{d^2}{B}\right\}$$
up to constant factors. The first term is the centralized (unlimited-communication) rate; the second is the price of communication.

The term $d/n$ is the statistical limit: even with unlimited communication, you cannot do better than the centralized MLE. The term $d^2/B$ is the communication limit: with a budget of $B$ bits, the MSE cannot fall below roughly $d^2/B$, no matter how many samples the users hold. When $B \ll dn$, communication is the bottleneck and the distributed system pays a penalty over the centralized one.
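To see where the two regimes trade off, here is a quick numeric check of the bound; the parameter values are illustrative choices, not from the text.

```python
# Evaluate the two terms of the lower bound max(d/n, d^2/B).
# The crossover sits at B = d*n (here 10^9 bits).
d, n = 1_000, 1_000_000
for B in [10**6, 10**8, d * n, 10**12]:
    stat, comm = d / n, d**2 / B
    regime = "communication-limited" if comm > stat else "statistics-limited"
    print(f"B = {B:.0e}: d/n = {stat:.1e}, d^2/B = {comm:.1e} -> {regime}")
```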

Definition: Federated Learning

Federated learning is a distributed optimization framework where $K$ users collaboratively minimize a global objective $F(\theta) = \frac{1}{K}\sum_{k=1}^K F_k(\theta)$ without sharing raw data. Each user $k$ has a local dataset and computes local gradients $\nabla F_k(\theta)$. The communication protocol typically involves the following steps (a minimal code sketch follows the list):

  1. Downlink: Server broadcasts the current model $\theta^{(t)}$
  2. Local computation: Each user runs local SGD steps
  3. Uplink: Each user sends a compressed update $\Delta_k^{(t)}$
  4. Aggregation: Server updates $\theta^{(t+1)} = \theta^{(t)} + \frac{1}{K}\sum_k \Delta_k^{(t)}$
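As referenced above, here is a minimal sketch of this protocol for local least-squares objectives on synthetic data. Everything here (data model, parameter values) is an assumption for illustration; it is not a reference FedAvg implementation, and the uplink updates are not yet compressed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, local_steps, lr, rounds = 20, 5, 10, 0.01, 300
w_true = rng.normal(size=d)                               # shared signal
A = [rng.normal(size=(50, d)) for _ in range(K)]          # user k's features
y = [a @ w_true + 0.1 * rng.normal(size=50) for a in A]   # user k's targets

def federated_round(theta):
    deltas = []
    for k in range(K):
        x = theta.copy()                     # 1. downlink: broadcast model
        for _ in range(local_steps):         # 2. local SGD on user k's data
            i = rng.integers(len(y[k]))
            x -= lr * (A[k][i] @ x - y[k][i]) * A[k][i]
        deltas.append(x - theta)             # 3. uplink: send the update
    return theta + np.mean(deltas, axis=0)   # 4. aggregation: average updates

theta = np.zeros(d)
for _ in range(rounds):
    theta = federated_round(theta)
print("distance to w_true:", np.linalg.norm(theta - w_true))
```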

The information-theoretic question is: what is the minimum number of bits per round (uplink communication) to achieve a target optimization accuracy $\epsilon$? This is fundamentally a joint source-channel coding problem when the uplink is a noisy channel.

Theorem: Communication Complexity of Federated SGD

For federated optimization of a $d$-dimensional convex objective with $L$-Lipschitz gradients, to achieve $\epsilon$-accuracy after $T$ rounds with $K$ users, the total uplink communication $B$ must satisfy
$$B \geq \Omega\left(\frac{dK}{\epsilon^2}\right) \text{ bits.}$$
This is achievable (up to logarithmic factors) using stochastic gradient quantization with $O(\log(d/\epsilon))$ bits per coordinate per user per round.

Each user must communicate at least enough information to specify its local gradient to within accuracy $\epsilon/\sqrt{K}$ (so that the average of $K$ noisy gradients has error $\epsilon$). Since the gradient is $d$-dimensional and each coordinate requires $\Omega(1/\epsilon^2)$ bits for this accuracy, the total is $\Omega(dK/\epsilon^2)$.
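To get a sense of scale, here is a plug-in calculation with illustrative numbers (ignoring the constants hidden by $\Omega(\cdot)$): for $d = 10^6$ parameters, $K = 100$ users, and target accuracy $\epsilon = 10^{-2}$,
$$B \gtrsim \frac{dK}{\epsilon^2} = \frac{10^6 \cdot 10^2}{10^{-4}} = 10^{12} \text{ bits} \approx 125 \text{ GB},$$
which is why aggressive gradient compression is standard practice at this scale.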

Example: Random Dithering for Gradient Compression

User $k$ has a gradient vector $g_k \in \mathbb{R}^d$ with $\|g_k\|_\infty \leq G$. Design a randomized quantizer $Q(g_k)$ using $b$ bits per coordinate such that $\mathbb{E}[Q(g_k)] = g_k$ (unbiased) and the variance per coordinate is minimized.
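One standard construction that meets these requirements is randomized (dithered) rounding onto a uniform grid of $2^b$ levels on $[-G, G]$: it is unbiased, and its per-coordinate variance is at most $\Delta^2/4$ for level spacing $\Delta = 2G/(2^b - 1)$. Below is a minimal NumPy sketch of this construction; the function name and the sanity check are ours, not from the text.

```python
import numpy as np

def dithered_quantize(g, G, b, rng):
    """Unbiased b-bit dithered quantizer for g with ||g||_inf <= G.

    Each coordinate is randomly rounded to one of the 2**b uniformly
    spaced levels on [-G, G]: round up with probability equal to the
    fractional position between the two adjacent levels, so E[Q(g)] = g.
    """
    delta = 2 * G / (2**b - 1)                 # level spacing
    pos = (np.clip(g, -G, G) + G) / delta      # position on the level grid
    lower = np.floor(pos)
    idx = lower + (rng.random(g.shape) < pos - lower)  # randomized rounding
    # 'idx' (b bits per coordinate) is what would actually be transmitted;
    # the server dequantizes it back to a point in [-G, G]:
    return idx * delta - G

# Sanity check: the empirical mean of Q(g) approaches g (unbiasedness).
rng = np.random.default_rng(0)
g = rng.uniform(-1, 1, size=5)
samples = np.stack([dithered_quantize(g, G=1.0, b=4, rng=rng)
                    for _ in range(20_000)])
print(np.abs(samples.mean(axis=0) - g).max())  # close to 0
```

Each extra bit halves $\Delta$ and so quarters the worst-case variance $\Delta^2/4$; this is the source of the $2^{-2b}$ factor in the MSE expression in the Engineering Note below.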

Federated Learning: Communication vs. Accuracy Tradeoff

[Interactive demo in the original: explore the tradeoff between the per-user communication budget and convergence accuracy for federated SGD with gradient quantization.]
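As a rough stand-in for the demo, the sketch below simulates federated SGD with $b$-bit dithered gradient quantization on simple local quadratics, whose global optimum is the mean of the local optima. The model and every parameter value are assumptions for illustration, not the demo's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T, lr, G = 100, 10, 1000, 0.1, 8.0
theta_k = rng.normal(size=(K, d))      # local optima (heterogeneous users)
opt = theta_k.mean(axis=0)             # global optimum of F = mean of F_k

def quantize(x, G, b):
    """Unbiased b-bit dithered quantizer on [-G, G] (see the example above)."""
    delta = 2 * G / (2**b - 1)
    pos = (np.clip(x, -G, G) + G) / delta
    low = np.floor(pos)
    return (low + (rng.random(x.shape) < pos - low)) * delta - G

for b in [1, 2, 4, 8]:
    x = np.zeros(d)
    for _ in range(T):
        grads = x - theta_k                      # user k's gradient of F_k
        q = np.stack([quantize(g, G, b) for g in grads])
        x -= lr * q.mean(axis=0)                 # server averages and steps
    print(f"b={b}: final error ||x - opt||^2 = {np.sum((x - opt)**2):.4f}")
```

Coarser quantization (small $b$) leaves a higher noise floor at convergence, while beyond a certain $b$ the error saturates and extra bits are wasted: exactly the tradeoff the demo explored.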
⚠️ Engineering Note

Differential Privacy Amplifies the Communication Bottleneck

In practice, federated learning must often also satisfy differential privacy constraints, which require adding noise to each user's message. This noise compounds the quantization noise, further increasing the communication cost. For $(\varepsilon, \delta)$-differential privacy with the Gaussian mechanism, each user adds noise $\mathcal{N}(0, \sigma_{\text{DP}}^2 I_d)$ where $\sigma_{\text{DP}} \propto 1/\varepsilon$. The total MSE per round becomes
$$\text{MSE} \approx \frac{d G^2 \cdot 2^{-2b}}{K} + \frac{d \sigma_{\text{DP}}^2}{K}.$$
The privacy noise dominates when $\sigma_{\text{DP}}^2 \gg G^2 \cdot 2^{-2b}$, i.e., when the privacy requirement is strict relative to the quantization resolution.
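A quick way to see which regime you are in is to plug numbers into the two terms of this MSE expression; the values below, including $\sigma_{\text{DP}}$, are illustrative assumptions rather than calibrated privacy parameters.

```python
# Compare the quantization and privacy terms of the per-round MSE.
d, K, G = 10_000, 100, 1.0
for b, sigma_dp in [(4, 0.01), (4, 0.5), (8, 0.01)]:
    quant_term = d * G**2 * 2.0 ** (-2 * b) / K
    dp_term = d * sigma_dp**2 / K
    winner = "privacy noise" if dp_term > quant_term else "quantization noise"
    print(f"b={b}, sigma_dp={sigma_dp}: quant={quant_term:.2e}, "
          f"dp={dp_term:.2e} -> {winner} dominates")
```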

Common Mistake: Non-IID Data Breaks Simple Averaging

Mistake:

Assuming that averaging local model updates produces a good global model when users have heterogeneous (non-i.i.d.) data distributions.

Correction:

When local data distributions differ significantly, local SGD on each user converges toward a different optimum. Simple averaging of these divergent models can perform worse than training on any single user's data. Techniques like FedProx (adding a proximal term to keep local models close to the global model), SCAFFOLD (variance reduction via control variates), or gradient tracking are needed. The information-theoretic analysis must account for the heterogeneity gap $\max_k \|\nabla F_k(\theta^*) - \nabla F(\theta^*)\|$.
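For concreteness, FedProx's local subproblem at round $t$ augments the local objective with a proximal term of weight $\mu > 0$ (a tunable hyperparameter) that pulls user $k$'s iterate back toward the broadcast model:
$$\min_{\theta} \; F_k(\theta) + \frac{\mu}{2} \left\|\theta - \theta^{(t)}\right\|^2$$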

Historical Note: From Distributed Estimation to Federated Learning

2013–present

The information-theoretic study of distributed estimation goes back decades: Zhang and Berger studied estimation from compressed information in the late 1980s, and Duchi, Jordan, Wainwright, and coauthors revived communication-constrained estimation in the 2010s. The term "federated learning" itself was coined by McMahan et al. at Google around 2016–2017, motivated by training models on mobile phone data without centralizing it. The original FedAvg algorithm is remarkably simple: each user runs local SGD and uploads the resulting model; the server averages them. The information-theoretic community quickly recognized this as a distributed source coding problem with an interesting twist: the "source" (the gradients) changes at each round, creating a sequential source coding problem over a time-varying source.

Quick Check

In distributed mean estimation with $K$ users and total budget $B$ bits, when is communication the bottleneck (rather than statistics)?

When $B > dn$ (more bits than needed)

When $B < dn$ (fewer bits than the statistical limit would need)

When $K = 1$ (single user)

Always, regardless of $B$

Federated Learning

A distributed machine learning paradigm where multiple users collaboratively train a shared model under communication and privacy constraints, without sharing raw data.


Gradient Quantization

The process of reducing the bit precision of gradient vectors before communication in distributed learning, trading off quantization noise for reduced communication cost.

Related: Random Dithering for Gradient Compression