Decentralized FL and Serverless Aggregation
Beyond the Central Server
Chapters 9-17 all assume a central server that aggregates gradients. This model is convenient — a single trusted point coordinating the protocol — but has fundamental limitations:
- Single point of failure. Server downtime stops all learning.
- Single point of trust. All users must trust the server (honest-but-curious in SecAgg; honest in other contexts).
- Bandwidth bottleneck. All gradient uploads converge to one endpoint.
- Regulatory / sovereignty constraints. Cross-border FL may require that no single jurisdiction hold all aggregates.
Decentralized FL replaces the server with peer-to-peer gradient exchange over a communication graph $G = (V, E)$. Users aggregate with their neighbors; the global aggregate emerges via gossip. The mathematical framework is decentralized SGD (D-SGD), studied since Lian-Zhang-Liu 2017.
The point is that decentralized FL transforms the book's framework: coded computing, secure aggregation, and PIR must be re-developed over communication graphs rather than star topologies. This is a major open area.
Definition: Decentralized FL over Communication Graph
Setup:
- $N$ users connected by a graph $G = (V, E)$.
- Each user $i$ maintains a local model $x_i^{(t)}$ (not a global one).
- At each round $t$, user $i$:
  - Exchanges models with its neighbors $\mathcal{N}(i)$.
  - Averages received neighbor models: $\tilde{x}_i^{(t)} = \sum_{j \in \mathcal{N}(i) \cup \{i\}} W_{ij}\, x_j^{(t)}$, where the $W_{ij}$ are mixing weights (rows of a doubly-stochastic matrix $W$).
  - Takes a local gradient step: $x_i^{(t+1)} = \tilde{x}_i^{(t)} - \eta\, \nabla f_i\big(\tilde{x}_i^{(t)}\big)$.
Consensus guarantee (under connectivity and doubly-stochastic mixing): $\max_i \big\|x_i^{(t)} - \bar{x}^{(t)}\big\| = \mathcal{O}(\lambda^t)$, where $\bar{x}^{(t)} = \frac{1}{N}\sum_i x_i^{(t)}$ and $\lambda = \lambda_2(W) < 1$ is the second-largest eigenvalue magnitude of $W$: all users converge to the global average exponentially in the graph mixing time $\tau_{\mathrm{mix}} \approx \frac{1}{1-\lambda}$.
No central server. No single point of trust.
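A minimal simulation sketch of this update rule, assuming each user has a simple quadratic local loss so the global optimum is the average of per-user targets; all names (`dsgd_round`, `A`, the ring graph) are illustrative choices, not from the chapter:

```python
import numpy as np

def dsgd_round(X, W, grad_fn, eta):
    """One D-SGD round: gossip-average with mixing matrix W, then a local gradient step.

    X       : (N, d) array, row i is user i's local model x_i
    W       : (N, N) doubly-stochastic mixing matrix (W[i, j] > 0 only for neighbors)
    grad_fn : callable mapping the (N, d) mixed models to (N, d) local gradients
    eta     : step size
    """
    X_mixed = W @ X                      # each user averages itself and its neighbors
    return X_mixed - eta * grad_fn(X_mixed)

# Toy example: N users, user i has loss ||x - a_i||^2 / 2, so the global
# optimum is the mean of the a_i, a quantity no single user knows.
rng = np.random.default_rng(0)
N, d, eta, T = 8, 5, 0.1, 300
A = rng.normal(size=(N, d))              # per-user targets a_i
X = rng.normal(size=(N, d))              # initial local models

# Ring graph with uniform weights 1/3 (symmetric, hence doubly stochastic).
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1 / 3

for _ in range(T):
    X = dsgd_round(X, W, lambda Xm: (Xm - A) + 0.01 * rng.normal(size=(N, d)), eta)

print("disagreement across users:", np.linalg.norm(X - X.mean(axis=0)))
print("distance to global optimum:", np.linalg.norm(X.mean(axis=0) - A.mean(axis=0)))
```

Both printed quantities shrink: the first reflects consensus, the second reflects optimization of the global objective.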
Theorem: D-SGD Convergence over a Communication Graph
Under standard smoothness and bounded-variance assumptions, with a doubly-stochastic mixing matrix $W$ satisfying $\lambda := \lambda_2(W) < 1$, D-SGD converges at rate $\mathcal{O}\!\left(\tfrac{1}{\sqrt{NT}}\right)$ over $T$ rounds with $N$ users, plus an additive consensus-error term whose constants depend on $\tfrac{1}{1-\lambda}$.
Interpretation. Decentralized SGD matches centralized SGD's rate, with a small additive consensus error. The graph mixing rate enters in the hidden constants, but does not change the asymptotic scaling.
Consensus error bound
Under doubly-stochastic mixing, $\max_i \big\|x_i^{(t)} - \bar{x}^{(t)}\big\| = \mathcal{O}(\lambda^t)$: exponential consensus.
Effective gradient update
The average model $\bar{x}^{(t)} = \frac{1}{N}\sum_i x_i^{(t)}$ satisfies a centralized-SGD-like recursion, with consensus error bounded by $\mathcal{O}\!\big(\tfrac{\eta}{1-\lambda}\big)$.
Rate matching
Standard SGD analysis applied to $\bar{x}^{(t)}$ gives the stated rate. The graph enters only in the hidden constants.
Caveat
For graphs with small $\lambda$ (strong connectivity), the rate is essentially centralized. For graphs close to disconnected ($\lambda \to 1$), consensus is slow and D-SGD may require more rounds.
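A quick numerical check of the consensus step: for a gossip-only recursion (no gradient step), the disagreement across users should shrink roughly like $\lambda^t$ with $\lambda = \lambda_2(W)$. A hedged sketch using the same illustrative ring graph as above:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10

# Illustrative ring mixing matrix (symmetric, doubly stochastic).
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1 / 3

eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))   # eigvalsh applies: W is symmetric
lam = eigs[-2]                                   # largest is 1; second-largest drives mixing

# Gossip-only recursion X <- W X: disagreement decays like lambda^t.
X = rng.normal(size=(N, 3))
for t in range(1, 31):
    X = W @ X
    if t % 10 == 0:
        disagreement = np.linalg.norm(X - X.mean(axis=0))
        print(f"t={t:2d}  disagreement={disagreement:.3e}  lambda^t={lam**t:.3e}")
```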
Security in Decentralized FL
Without a central server, the security landscape changes:
- Privacy. There is no server to defend against, but neighbors might be honest-but-curious. Privacy schemes must now work over arbitrary graphs, not stars.
- Byzantine robustness. Byzantine workers can now corrupt their neighbors' models directly (not just the aggregate). The aggregation filters (Krum, trimmed mean) must be applied per-user at each round (a sketch follows this list).
- DP. Local DP at each node's upload gives per-user privacy, but the graph structure affects aggregate-level DP differently than in star topologies.
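As referenced in the Byzantine bullet above, here is a hedged sketch of a per-node coordinate-wise trimmed-mean filter applied to received neighbor models; the threat scenario, numbers, and function name are illustrative assumptions:

```python
import numpy as np

def trimmed_mean_aggregate(own_model, neighbor_models, trim_k):
    """Per-node Byzantine filter: coordinate-wise trimmed mean over received models.

    own_model       : (d,) the user's current local model (included in the aggregate)
    neighbor_models : (m, d) models received from the m neighbors this round
    trim_k          : values dropped on each side, per coordinate (needs 2*trim_k < m + 1)
    """
    stacked = np.vstack([own_model, neighbor_models])
    sorted_vals = np.sort(stacked, axis=0)                 # sort each coordinate separately
    kept = sorted_vals[trim_k: stacked.shape[0] - trim_k]  # drop the extremes
    return kept.mean(axis=0)

# Illustrative scenario: 6 honest neighbors near the truth, 2 Byzantine neighbors
# sending arbitrary values; trimming 2 per side removes their influence per coordinate.
rng = np.random.default_rng(2)
truth = np.ones(4)
honest = truth + 0.05 * rng.normal(size=(6, 4))
byzantine = 100.0 * rng.normal(size=(2, 4))
received = np.vstack([honest, byzantine])

print(np.round(trimmed_mean_aggregate(truth, received, trim_k=2), 3))
# Output stays close to [1, 1, 1, 1] despite the outliers.
```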
Open research directions:
- Adapting ByzSecAgg (Ch 11) to graph topologies.
- CCESA-style sparse-graph communication-privacy trade-offs applied to D-SGD rather than SecAgg.
- Consensus-based Byzantine-resilient aggregation rules.
- A joint graph-design + privacy-robustness-DP framework.
Each is a significant open problem.
Definition: Sparse Decentralized FL
Sparse D-SGD uses a sparse communication graph (e.g., an Erdős-Rényi random graph with edge probability $p$) to reduce per-round communication.
Trade-off:
- Per-round communication: $\approx pN$ neighbors per user $\times$ $N$ users $= pN^2$ total exchanges. Linear in $p$.
- Consensus rate: the graph mixing rate $1 - \lambda$ shrinks for small $p$: slow consensus with sparse graphs.
- Convergence: still $\mathcal{O}\!\left(\tfrac{1}{\sqrt{NT}}\right)$; the sparsification cost appears in the hidden constants.
Finding the sparsest graph for a given convergence target parallels the CCESA (Ch 12) sparsification for centralized SecAgg.
Sparse D-SGD: Graph Density vs. Convergence
[Interactive plot. Edge probability $p$ determines per-round communication (linear in $p$) and consensus rate (slow as $p \to 0$). The plot shows the total communication cost to reach a target loss vs. $p$. The sweet spot is typically $p = \Theta\!\left(\tfrac{\log N}{N}\right)$, i.e., CCESA-like sparsity.]
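A sketch of the quantity the plot explores, under illustrative assumptions: an Erdős-Rényi graph with Metropolis-Hastings weights, and rounds-to-consensus estimated from $\lambda_2(W)$. The cost model and all names are assumptions for this sketch, not the chapter's exact simulation:

```python
import numpy as np

def erdos_renyi_mh_weights(N, p, rng):
    """Erdős-Rényi graph G(N, p) with Metropolis-Hastings weights (doubly stochastic)."""
    upper = np.triu(rng.random((N, N)) < p, k=1)
    A = upper | upper.T                                  # symmetric adjacency, no self-loops
    deg = A.sum(axis=1)
    W = np.zeros((N, N))
    for i in range(N):
        for j in np.nonzero(A[i])[0]:
            W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))    # Metropolis-Hastings rule
        W[i, i] = 1.0 - W[i].sum()                       # self-weight absorbs the rest
    return W, A

def total_cost_to_consensus(N, p, rng, target=1e-3):
    """Edge-exchanges needed to shrink disagreement by `target`, per the lambda^t model."""
    W, A = erdos_renyi_mh_weights(N, p, rng)
    lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]
    if lam >= 1.0 - 1e-12:                               # disconnected: consensus never reached
        return float("inf")
    lam = max(lam, 1e-12)                                # guard for (near-)complete graphs
    rounds = np.log(target) / np.log(lam)
    return rounds * A.sum() / 2                          # edges used per round * rounds

rng = np.random.default_rng(3)
N = 100
for p in [0.03, 0.05, 0.1, 0.3, 0.6]:
    print(f"p={p:4.2f}  total cost ~ {total_cost_to_consensus(N, p, rng):,.0f} edge-exchanges")
```

Very small $p$ risks disconnection (infinite cost), very large $p$ wastes edges per round; the minimum sits in between, which is the CCESA-like regime described above.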
Centralized vs. Decentralized FL
| Property | Centralized FL (Ch 9-17) | Decentralized FL (§18.3) |
|---|---|---|
| Single point of failure | Yes (server) | No |
| Convergence rate | $\mathcal{O}\!\left(\tfrac{1}{\sqrt{NT}}\right)$ | $\mathcal{O}\!\left(\tfrac{1}{\sqrt{NT}}\right)$ + consensus error |
| Per-round bandwidth | $N$ orthogonal uploads to one server | $\mathcal{O}(pN^2)$ peer exchanges (sparse) |
| Privacy model | Server + colluding users | Any neighbor subset |
| Byzantine robustness | Central filter | Per-node filters |
| Deployment complexity | Simpler (central orchestration) | More complex (peer-to-peer) |
| Regulatory flexibility | All aggregates in one jurisdiction | Aggregation stays local |
Open Problems in Decentralized FL
Active research directions:
- Privacy-preserving gossip. How to add information-theoretic privacy to peer-to-peer gradient exchange? SecAgg requires a central coordinator; gossip does not.
- Byzantine-resilient consensus. Each node applies a local Byzantine filter; but what if many neighbors are malicious? Consensus protocols (like PBFT) impose their own costs.
- Graph-aware coded computing. Can coded computing be extended to D-SGD? The input partitioning, encoding, and decoding must respect the graph topology. No general framework exists.
- AirComp for gossip. Each pairwise peer-to-peer exchange is a small multiple-access channel (MAC); can AirComp be applied per-edge or per-cluster? Preliminary work (Liu-Simeone 2021) shows promise.
- Convergence-communication-privacy trade-off. The three axes interact in sparse D-SGD; a full characterization is open.
- Asynchronous decentralized FL. Real P2P networks are asynchronous; the synchronous D-SGD analysis fails. Non-trivial convergence proofs are needed.
This is perhaps the richest open area of the book — combining graph theory, optimization, and information theory.
Deploying Decentralized FL
For decentralized FL deployments:
- Graph design. Use CCESA-style sparsification: $p = \Theta\!\left(\tfrac{\log N}{N}\right)$ gives $\mathcal{O}(\log N)$ neighbors per user, sufficient for connectivity with high probability. Computation-storage-communication trade-off as in Ch 12.
- Consensus weights. Metropolis-Hastings or Laplacian-based weights give doubly-stochastic mixing matrices. Local computation between neighbors only; no central coordination.
- Byzantine filter per node. Each user applies a local trimmed-mean filter before updating. Costs $\mathcal{O}(d \cdot \deg(i))$ per user per round, tolerable for sparse graphs.
- Privacy via local DP. If stronger privacy is needed, each user adds local Gaussian dither to what it broadcasts (a minimal sketch appears after the checklist below). No cryptographic setup required.
- Asynchrony tolerance. Use bounded-delay assumptions: each message arrives within a known bound $\tau_{\max}$. The analysis adapts; convergence is slower but still holds.
- Failure recovery. If a user fails, the graph topology changes (its incident edges disappear). Convergence continues on the reduced graph with a slightly larger $\lambda$.
- Graph density: $p = \Theta\!\left(\tfrac{\log N}{N}\right)$ (CCESA-like)
- Metropolis-Hastings mixing weights
- Per-node Byzantine filter: $\mathcal{O}(d \cdot \deg(i))$ per round
- Local DP for privacy
- Bounded-delay asynchrony tolerance
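A hedged sketch of the local-DP release step from the checklist: each user clips and perturbs the model it broadcasts to its neighbors each round. The clip norm, noise scale, and function name are illustrative choices, not values prescribed by the chapter:

```python
import numpy as np

def local_dp_release(model, clip_norm, noise_std, rng):
    """What user i actually broadcasts to its neighbors: a clipped, noised model.

    Clipping bounds each user's contribution (sensitivity); adding Gaussian noise
    then gives a local DP guarantee whose (epsilon, delta) depends on the ratio
    clip_norm / noise_std via the standard Gaussian-mechanism analysis.
    """
    scale = min(1.0, clip_norm / max(np.linalg.norm(model), 1e-12))
    return model * scale + noise_std * rng.normal(size=model.shape)

rng = np.random.default_rng(4)
x_i = rng.normal(size=16)                      # user i's local model this round
broadcast = local_dp_release(x_i, clip_norm=1.0, noise_std=0.5, rng=rng)

# Neighbors gossip-average these noised broadcasts; the injected noise averages
# down across many users, while no single neighbor ever sees x_i exactly.
print(np.round(broadcast, 2))
```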
Key Takeaway
Decentralized FL removes the central server — shifting the threat model, convergence analysis, and communication pattern. D-SGD achieves the same asymptotic rate as centralized but with graph-dependent constants. Extending coded computing, secure aggregation, and PIR to graph-based communication is a major open area. CCESA-style sparsification and AirComp-over-gossip are promising directions.
Quick Check
In decentralized FL, the convergence rate relative to centralized FL is:
Identical in asymptotic rate ($\mathcal{O}(1/\sqrt{NT})$).
Much slower — impossible to match centralized without a server.
Faster — peer-to-peer parallelism.
Requires fewer users.
D-SGD matches centralized SGD's asymptotic rate, with an additive consensus-error term and graph-mixing constants hidden in the bounds.