Decentralized FL and Serverless Aggregation

Beyond the Central Server

Chapters 9-17 all assume a central server that aggregates gradients. This model is convenient — a single trusted point coordinating the protocol — but has fundamental limitations:

  • Single point of failure. Server downtime stops all learning.
  • Single point of trust. All users must trust the server (honest-but-curious in SecAgg; honest in other contexts).
  • Bandwidth bottleneck. All gradient uploads converge to one endpoint.
  • Regulatory / sovereignty constraints. Cross-border FL may require that no single jurisdiction holds all aggregates.

Decentralized FL replaces the server with peer-to-peer gradient exchange over a communication graph $\mathcal{G}$. Users aggregate with their neighbors; the global aggregate emerges via gossip. The mathematical framework is decentralized SGD (D-SGD), studied since Lian-Zhang-Liu 2017.

The point is that decentralized FL transforms the book's framework: coded computing, secure aggregation, and PIR must be re-developed over communication graphs rather than star topologies. This is a major open area.

Definition: Decentralized FL over a Communication Graph

Setup:

  • $n$ users connected by a graph $\mathcal{G} = ([n], \mathcal{E})$.
  • Each user $k$ maintains a local model $\boldsymbol{\theta}_k^{(t)}$ (not a global one).
  • At each round, user $k$:
    1. Exchanges models with neighbors $\mathcal{N}(k) = \{j : (k, j) \in \mathcal{E}\}$.
    2. Averages the received neighbor models: $\boldsymbol{\theta}_k^{(t+1/2)} = \sum_{j \in \{k\} \cup \mathcal{N}(k)} w_{jk} \boldsymbol{\theta}_j^{(t)}$, where the $\{w_{jk}\}$ are mixing weights (rows of a doubly stochastic matrix).
    3. Takes a local gradient step: $\boldsymbol{\theta}_k^{(t+1)} = \boldsymbol{\theta}_k^{(t+1/2)} - \eta_{\text{lr}} \nabla F_k(\boldsymbol{\theta}_k^{(t+1/2)})$.

Consensus guarantee (under connectivity and doubly stochastic mixing): $\|\boldsymbol{\theta}_k^{(t)} - \bar{\boldsymbol{\theta}}^{(t)}\| \to 0$, where $\bar{\boldsymbol{\theta}}^{(t)} = \frac{1}{n}\sum_k \boldsymbol{\theta}_k^{(t)}$. All users converge to the global average, exponentially fast in the graph mixing time $T_{\text{mix}}$.

No central server. No single point of trust.
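
To make the round structure concrete, here is a minimal NumPy sketch of D-SGD on a toy least-squares problem. The ring topology, the function names (`ring_mixing_matrix`, `dsgd`), the step size, and the round count are illustrative assumptions, not prescriptions from the text.

```python
# Minimal D-SGD sketch: gossip-average with neighbors, then a local gradient step.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring: each user averages
    itself and its two neighbors with weight 1/3."""
    W = np.zeros((n, n))
    for k in range(n):
        for j in (k - 1, k, k + 1):
            W[k, j % n] = 1.0 / 3.0
    return W

def dsgd(grads, thetas0, W, lr=0.1, rounds=200):
    """Run D-SGD; thetas has shape (n, d), one local model per user."""
    thetas = thetas0.copy()
    for _ in range(rounds):
        thetas = W @ thetas            # step 2: mix with neighbors
        g = np.stack([grads[k](thetas[k]) for k in range(len(grads))])
        thetas = thetas - lr * g       # step 3: local gradient step
    return thetas

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 5
    targets = rng.normal(size=(n, d))                   # user k's local optimum
    grads = [lambda th, t=t: th - t for t in targets]   # F_k = 0.5 * ||theta - t_k||^2
    thetas = dsgd(grads, rng.normal(size=(n, d)), ring_mixing_matrix(n))
    print("max distance to the average model:",
          np.linalg.norm(thetas - thetas.mean(axis=0), axis=1).max())
    print("distance of the average model to the global optimum:",
          np.linalg.norm(thetas.mean(axis=0) - targets.mean(axis=0)))
```

Running the sketch shows both the consensus error (all local models near their average) and the optimization error (the average near the global optimum) shrinking, matching the guarantee above.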

Theorem: D-SGD Convergence over a Communication Graph

Under standard smoothness and bounded-variance assumptions, with a doubly stochastic mixing matrix $W$ satisfying $\|W - \mathbf{1}\mathbf{1}^T/n\|_2 \leq \rho < 1$, D-SGD converges at rate
\[
\mathbb{E}\|\nabla F(\bar{\boldsymbol{\theta}}_T)\|^2 \;\leq\; O\!\left(\frac{1}{\sqrt{nT}} + \frac{1}{T}\right).
\]

Interpretation. Decentralized SGD matches centralized SGD's $O(1/\sqrt{nT})$ rate, with a small additive $O(1/T)$ consensus error. The graph mixing rate $\rho$ enters the hidden constants but does not change the asymptotic scaling.
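
As a quick sanity check on the relative size of the two terms (illustrative values of $n$ and $T$, with all constants hidden): for $n = 100$ users and $T = 10^4$ rounds,
\[
\frac{1}{\sqrt{nT}} = \frac{1}{\sqrt{10^6}} = 10^{-3},
\qquad
\frac{1}{T} = 10^{-4},
\]
so the consensus term is an order of magnitude smaller and the $O(1/\sqrt{nT})$ linear-speedup term dominates, just as in centralized SGD.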

Security in Decentralized FL

Without a central server, the security landscape changes:

  • Privacy. There is no server to defend against — but neighbors might be honest-but-curious. Privacy schemes must now work over arbitrary graphs, not stars.

  • Byzantine robustness. Byzantine workers can now corrupt their neighbors' models directly (not just the aggregate). The aggregation filters (Krum, trimmed mean) must be applied per-user at each round.

  • DP. Local DP at each node's upload gives per-user privacy, but the graph structure affects aggregate-level DP differently than a star topology does.

Open research directions:

  1. Adapting ByzSecAgg (Ch 11) to graph topologies.
  2. CCESA-style sparse-graph communication-privacy trade-offs applied to D-SGD rather than SecAgg.
  3. Consensus-based Byzantine-resilient aggregation rules.
  4. A joint framework for graph design, privacy, robustness, and DP.

Each is a significant open problem.

Definition: Sparse Decentralized FL

Sparse D-SGD uses a sparse communication graph $\mathcal{G}$ (e.g., an Erdős–Rényi random graph with edge probability $p$) to reduce per-round communication.

Trade-off:

  • Per-round communication: $O(pn)$ neighbors per user $\times$ $O(n)$ users $= O(pn^2)$ total exchanges. Linear in $p$.
  • Consensus rate: the graph mixing rate is $\rho(p) \approx 1 - cp$ for small $p$, so consensus is slow on sparse graphs.
  • Convergence: $O\!\left(1/\sqrt{nT} + (1/T)\cdot(1-\rho)^{-1}\right)$; the sparsification cost appears in the hidden constants.

Finding the sparsest graph for a given convergence target parallels the CCESA (Ch 12) sparsification for centralized SecAgg.
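
The following is a rough numerical sketch of this trade-off under assumed parameters: an Erdős–Rényi graph, Metropolis-Hastings mixing weights, and a crude cost proxy of (rounds to a fixed consensus accuracy) $\times$ (edges per round) $\approx (1-\rho(p))^{-1} \cdot pn^2$. None of these choices are prescribed by the text.

```python
# Sketch: how the mixing rate rho and a total-communication proxy vary with p.
import numpy as np

def erdos_renyi(n, p, rng):
    """Symmetric 0/1 adjacency matrix with no self-loops."""
    A = np.triu(rng.random((n, n)) < p, 1)
    return (A + A.T).astype(int)

def metropolis_weights(A):
    """Doubly stochastic W with w_jk = 1/(1 + max(deg_j, deg_k)) on edges."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for k in range(n):
        for j in range(n):
            if A[k, j]:
                W[k, j] = 1.0 / (1.0 + max(deg[k], deg[j]))
        W[k, k] = 1.0 - W[k].sum()      # remaining mass stays on the self-loop
    return W

def mixing_rate(W):
    """rho = ||W - 11^T / n||_2 (second-largest singular value of W)."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100
    for p in (0.03, 0.05, 0.1, 0.2, 0.5):
        rho = mixing_rate(metropolis_weights(erdos_renyi(n, p, rng)))
        cost = (3.0 / max(1e-9, 1.0 - rho)) * p * n * n   # rounds x edges per round
        print(f"p={p:.2f}  rho={rho:.3f}  relative comm. cost={cost:,.0f}")
```

Below the connectivity threshold ($p \approx \log n / n$) the graph can disconnect, $\rho$ sticks near 1, and the cost proxy blows up; well above it, the extra edges buy little mixing improvement, which is the sweet-spot behavior described below.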

Figure (interactive): Sparse D-SGD, Graph Density vs. Convergence.

Edge probability $p$ determines per-round communication (linear in $p$) and the consensus rate (slow as $p \to 0$); the plot shows the total communication cost to reach a target loss versus $p$. The sweet spot is typically $p = \Theta(\log n / n)$, CCESA-like sparsity.

Centralized vs. Decentralized FL

Property | Centralized FL (Ch 9-17) | Decentralized FL (§18.3)
Single point of failure | Yes (server) | No
Convergence rate | $O(1/\sqrt{nT})$ | $O(1/\sqrt{nT} + \frac{1}{T}\cdot\frac{1}{1-\rho})$
Per-round bandwidth | $O(n)$ orthogonal uploads | $O(pn^2)$ peer exchanges (sparse)
Privacy model | Server + colluding users | Any neighbor subset
Byzantine robustness | Central filter | Per-node filters
Deployment complexity | Simpler (central orchestration) | More complex (peer-to-peer)
Regulatory flexibility | All aggregates in one jurisdiction | Aggregation stays local

Open Problems in Decentralized FL

Active research directions:

  1. Privacy-preserving gossip. How can information-theoretic privacy be added to peer-to-peer gradient exchange? SecAgg requires a central coordinator, which gossip lacks.

  2. Byzantine-resilient consensus. Each node applies a local Byzantine filter, but what if many of a node's neighbors are malicious? Consensus protocols (like PBFT) impose their own costs.

  3. Graph-aware coded computing. Can coded computing be extended to D-SGD? The input partitioning, encoding, and decoding must respect the graph topology. No general framework exists.

  4. AirComp for gossip. Each pairwise peer-to-peer exchange is a small MAC; can AirComp be applied per-edge or per-cluster? Preliminary work (Liu-Simeone 2021) shows promise.

  5. Convergence-communication-privacy trade-off. The three axes interact in sparse D-SGD; a full characterization is open.

  6. Asynchronous decentralized FL. Real P2P networks are asynchronous; the synchronous D-SGD analysis fails. Non-trivial convergence proofs needed.

This is perhaps the richest open area of the book — combining graph theory, optimization, and information theory.

⚠️ Engineering Note

Deploying Decentralized FL

For decentralized FL deployments:

  • Graph design. Use CCESA-style sparsification: $p = \log n / n$ gives $O(\log n)$ neighbors per user, sufficient for connectivity with high probability. Computation-storage-communication trade-off as in Ch 12.

  • Consensus weights. Metropolis-Hastings or Laplacian-based weights give doubly stochastic mixing matrices; they can be computed locally from neighbor degrees, with no central coordination.

  • Byzantine filter per node. Each user applies a local trimmed-mean filter before updating (a sketch of this per-node update follows this list). Costs $O(|\mathcal{N}(k)|^2)$ per user per round, tolerable for sparse graphs.

  • Privacy via local DP. If stronger privacy is needed, each user adds local Gaussian dither. No cryptographic setup required.

  • Asynchrony tolerance. Use bounded-delay assumptions: each message arrives within a known bound $\tau$. The analysis adapts; convergence is slower but still holds.

  • Failure recovery. If a user fails, the graph topology changes (its edges disappear). Convergence continues on the reduced graph with a slightly larger $\rho$.
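
Below is a minimal sketch of the per-node update described in the bullets above, combining a coordinate-wise trimmed-mean filter over the received neighbor models with optional local Gaussian dither. The function names, trim level, and noise scale `sigma` are hypothetical choices, and the trimmed mean stands in for the plain weighted mixing step.

```python
# One user's round: filter neighbor models, mix robustly, step, then dither.
import numpy as np

def trimmed_mean(models, trim):
    """Coordinate-wise trimmed mean over a stack of models of shape (m, d):
    drop the `trim` largest and `trim` smallest values in each coordinate."""
    s = np.sort(models, axis=0)
    return s[trim:models.shape[0] - trim].mean(axis=0)

def robust_local_round(theta_k, neighbor_models, local_grad,
                       lr=0.05, trim=1, sigma=0.0, rng=None):
    """D-SGD round at user k with a local Byzantine filter and local DP dither."""
    stacked = np.vstack([theta_k[None, :], neighbor_models])  # include own model
    mixed = trimmed_mean(stacked, trim)           # robust replacement for averaging
    theta_next = mixed - lr * local_grad(mixed)   # local gradient step
    if sigma > 0 and rng is not None:             # local DP: Gaussian dither before sending
        theta_next = theta_next + rng.normal(scale=sigma, size=theta_next.shape)
    return theta_next                             # model to broadcast to neighbors

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d = 4
    theta_k = rng.normal(size=d)
    neighbors = rng.normal(size=(5, d))
    neighbors[0] = 1e3                            # one Byzantine neighbor sends garbage
    grad = lambda th: th                          # toy objective F_k = 0.5 * ||theta||^2
    print(robust_local_round(theta_k, neighbors, grad, trim=1, sigma=0.01, rng=rng))
```

The trimming discards the garbage model along with one model at the other extreme of each coordinate, so the printed update stays near the honest neighbors' range.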

Practical Constraints
  • Graph density: $p = \log n / n$ (CCESA-like)

  • Metropolis-Hastings mixing weights

  • Per-node Byzantine filter: $O(|\mathcal{N}(k)|^2)$ per round

  • Local DP for privacy

  • Bounded-delay asynchrony tolerance

📋 Ref: Lalitha et al. 2018

Key Takeaway

Decentralized FL removes the central server — shifting the threat model, convergence analysis, and communication pattern. D-SGD achieves the same asymptotic rate as centralized but with graph-dependent constants. Extending coded computing, secure aggregation, and PIR to graph-based communication is a major open area. CCESA-style sparsification and AirComp-over-gossip are promising directions.

Quick Check

In decentralized FL, the convergence rate relative to centralized FL is:

Identical in asymptotic rate ($O(1/\sqrt{nT})$).

Much slower — impossible to match centralized without a server.

Faster — peer-to-peer parallelism.

Requires fewer users $n$.