Decentralized FL and Serverless Aggregation

Beyond the Central Server

Chapters 9-17 all assume a central server that aggregates gradients. This model is convenient — a single trusted point coordinating the protocol — but has fundamental limitations:

  • Single point of failure. Server downtime stops all learning.
  • Single point of trust. All users must trust the server (honest-but-curious in SecAgg; honest in other contexts).
  • Bandwidth bottleneck. All gradient uploads converge to one endpoint.
  • Regulatory / sovereignty constraints. Cross-border FL may require that no single jurisdiction holds all aggregates.

Decentralized FL replaces the server with peer-to-peer gradient exchange over a communication graph $\mathcal{G}$. Users aggregate with their neighbors; the global aggregate emerges via gossip. The mathematical framework is decentralized SGD (D-SGD), studied since Lian-Zhang-Liu 2017.

The point is that decentralized FL transforms the book's framework: coded computing, secure aggregation, and PIR must be re-developed over communication graphs rather than star topologies. This is a major open area.

Definition: Decentralized FL over a Communication Graph

Setup:

  • $n$ users connected by a graph $\mathcal{G} = ([n], \mathcal{E})$.
  • Each user $k$ maintains a local model $\boldsymbol{\theta}_k^{(t)}$ (not a global one).
  • At each round, user $k$:
    1. Exchanges models with neighbors $\mathcal{N}(k) = \{j : (k, j) \in \mathcal{E}\}$.
    2. Averages the received neighbor models: $\boldsymbol{\theta}_k^{(t+1/2)} = \sum_{j \in \{k\} \cup \mathcal{N}(k)} w_{jk} \boldsymbol{\theta}_j^{(t)}$, where the $\{w_{jk}\}$ are mixing weights (rows of a doubly stochastic matrix).
    3. Takes a local gradient step: $\boldsymbol{\theta}_k^{(t+1)} = \boldsymbol{\theta}_k^{(t+1/2)} - \eta_{\text{lr}} \nabla F_k(\boldsymbol{\theta}_k^{(t+1/2)})$.

Consensus guarantee (under connectivity and doubly stochastic mixing): $\|\boldsymbol{\theta}_k^{(t)} - \bar{\boldsymbol{\theta}}^{(t)}\| \to 0$, where $\bar{\boldsymbol{\theta}}^{(t)} = \frac{1}{n}\sum_k \boldsymbol{\theta}_k^{(t)}$. All users converge to the global average, exponentially fast in the graph mixing time $T_{\text{mix}}$.

No central server. No single point of trust.
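
To make the round structure concrete, here is a minimal NumPy sketch of D-SGD on a toy least-squares problem. The ring topology, the function names (`ring_mixing_matrix`, `dsgd`), the step size, and the round count are illustrative assumptions, not prescriptions from the text.

```python
# Minimal D-SGD sketch: gossip-average with neighbors, then a local gradient step.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring: each user averages
    itself and its two neighbors with weight 1/3."""
    W = np.zeros((n, n))
    for k in range(n):
        for j in (k - 1, k, k + 1):
            W[k, j % n] = 1.0 / 3.0
    return W

def dsgd(grads, thetas0, W, lr=0.1, rounds=200):
    """Run D-SGD; thetas has shape (n, d), one local model per user."""
    thetas = thetas0.copy()
    for _ in range(rounds):
        thetas = W @ thetas            # step 2: mix with neighbors
        g = np.stack([grads[k](thetas[k]) for k in range(len(grads))])
        thetas = thetas - lr * g       # step 3: local gradient step
    return thetas

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 5
    targets = rng.normal(size=(n, d))                   # user k's local optimum
    grads = [lambda th, t=t: th - t for t in targets]   # F_k = 0.5 * ||theta - t_k||^2
    thetas = dsgd(grads, rng.normal(size=(n, d)), ring_mixing_matrix(n))
    print("max distance to the average model:",
          np.linalg.norm(thetas - thetas.mean(axis=0), axis=1).max())
    print("distance of the average model to the global optimum:",
          np.linalg.norm(thetas.mean(axis=0) - targets.mean(axis=0)))
```

Running the sketch shows both the consensus error (all local models near their average) and the optimization error (the average near the global optimum) shrinking, matching the guarantee above.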

Theorem: D-SGD Convergence over a Communication Graph

Under standard smoothness and bounded-variance assumptions, with a doubly stochastic mixing matrix $W$ satisfying $\|W - \mathbf{1}\mathbf{1}^T/n\|_2 \leq \rho < 1$, D-SGD converges at rate
\[
\mathbb{E}\|\nabla F(\bar{\boldsymbol{\theta}}_T)\|^2 \;\leq\; O\!\left(\frac{1}{\sqrt{nT}} + \frac{1}{T}\right).
\]

Interpretation. Decentralized SGD matches centralized SGD's $O(1/\sqrt{nT})$ rate, with a small additive $O(1/T)$ consensus error. The graph mixing rate $\rho$ enters the hidden constants but does not change the asymptotic scaling.
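
As a quick sanity check on the relative size of the two terms (illustrative values of $n$ and $T$, with all constants hidden): for $n = 100$ users and $T = 10^4$ rounds,
\[
\frac{1}{\sqrt{nT}} = \frac{1}{\sqrt{10^6}} = 10^{-3},
\qquad
\frac{1}{T} = 10^{-4},
\]
so the consensus term is an order of magnitude smaller and the $O(1/\sqrt{nT})$ linear-speedup term dominates, just as in centralized SGD.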

Security in Decentralized FL

Without a central server, the security landscape changes:

  • Privacy. There is no server to defend against — but neighbors might be honest-but-curious. Privacy schemes must now work over arbitrary graphs, not stars.

  • Byzantine robustness. Byzantine workers can now corrupt their neighbors' models directly (not just the aggregate). The aggregation filters (Krum, trimmed mean) must be applied per-user at each round.

  • DP. Local DP at each node's upload gives per-user privacy, but the graph structure affects aggregate-level DP differently than a star topology does.

Open research directions:

  1. Adapting ByzSecAgg (Ch 11) to graph topologies.
  2. CCESA-style sparse-graph communication-privacy trade-offs applied to D-SGD rather than SecAgg.
  3. Consensus-based Byzantine-resilient aggregation rules.
  4. A joint framework for graph design, privacy, robustness, and DP.

Each is a significant open problem.

Definition: Sparse Decentralized FL

Sparse D-SGD uses a sparse communication graph $\mathcal{G}$ (e.g., an Erdős–Rényi random graph with edge probability $p$) to reduce per-round communication.

Trade-off:

  • Per-round communication: $O(pn)$ neighbors per user $\times$ $O(n)$ users $= O(pn^2)$ total exchanges. Linear in $p$.
  • Consensus rate: the graph mixing rate is $\rho(p) \approx 1 - cp$ for small $p$, so consensus is slow on sparse graphs.
  • Convergence: $O\!\left(1/\sqrt{nT} + (1/T)\cdot(1-\rho)^{-1}\right)$; the sparsification cost appears in the hidden constants.

Finding the sparsest graph for a given convergence target parallels the CCESA (Ch 12) sparsification for centralized SecAgg.
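
The following is a rough numerical sketch of this trade-off under assumed parameters: an Erdős–Rényi graph, Metropolis-Hastings mixing weights, and a crude cost proxy of (rounds to a fixed consensus accuracy) $\times$ (edges per round) $\approx (1-\rho(p))^{-1} \cdot pn^2$. None of these choices are prescribed by the text.

```python
# Sketch: how the mixing rate rho and a total-communication proxy vary with p.
import numpy as np

def erdos_renyi(n, p, rng):
    """Symmetric 0/1 adjacency matrix with no self-loops."""
    A = np.triu(rng.random((n, n)) < p, 1)
    return (A + A.T).astype(int)

def metropolis_weights(A):
    """Doubly stochastic W with w_jk = 1/(1 + max(deg_j, deg_k)) on edges."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for k in range(n):
        for j in range(n):
            if A[k, j]:
                W[k, j] = 1.0 / (1.0 + max(deg[k], deg[j]))
        W[k, k] = 1.0 - W[k].sum()      # remaining mass stays on the self-loop
    return W

def mixing_rate(W):
    """rho = ||W - 11^T / n||_2 (second-largest singular value of W)."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100
    for p in (0.03, 0.05, 0.1, 0.2, 0.5):
        rho = mixing_rate(metropolis_weights(erdos_renyi(n, p, rng)))
        cost = (3.0 / max(1e-9, 1.0 - rho)) * p * n * n   # rounds x edges per round
        print(f"p={p:.2f}  rho={rho:.3f}  relative comm. cost={cost:,.0f}")
```

Below the connectivity threshold ($p \approx \log n / n$) the graph can disconnect, $\rho$ sticks near 1, and the cost proxy blows up; well above it, the extra edges buy little mixing improvement, which is the sweet-spot behavior described below.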

Figure (interactive): Sparse D-SGD, Graph Density vs. Convergence.

Edge probability $p$ determines per-round communication (linear in $p$) and the consensus rate (slow as $p \to 0$); the plot shows the total communication cost to reach a target loss versus $p$. The sweet spot is typically $p = \Theta(\log n / n)$, CCESA-like sparsity.

Centralized vs. Decentralized FL

Property | Centralized FL (Ch 9-17) | Decentralized FL (§18.3)
Single point of failure | Yes (server) | No
Convergence rate | $O(1/\sqrt{nT})$ | $O(1/\sqrt{nT} + \frac{1}{T}\cdot\frac{1}{1-\rho})$
Per-round bandwidth | $O(n)$ orthogonal uploads | $O(pn^2)$ peer exchanges (sparse)
Privacy model | Server + colluding users | Any neighbor subset
Byzantine robustness | Central filter | Per-node filters
Deployment complexity | Simpler (central orchestration) | More complex (peer-to-peer)
Regulatory flexibility | All aggregates in one jurisdiction | Aggregation stays local

Open Problems in Decentralized FL

Active research directions:

  1. Privacy-preserving gossip. How can information-theoretic privacy be added to peer-to-peer gradient exchange? SecAgg requires a central coordinator, which gossip lacks.

  2. Byzantine-resilient consensus. Each node applies a local Byzantine filter, but what if many of a node's neighbors are malicious? Consensus protocols (like PBFT) impose their own costs.

  3. Graph-aware coded computing. Can coded computing be extended to D-SGD? The input partitioning, encoding, and decoding must respect the graph topology. No general framework exists.

  4. AirComp for gossip. Each pairwise peer-to-peer exchange is a small MAC; can AirComp be applied per-edge or per-cluster? Preliminary work (Liu-Simeone 2021) shows promise.

  5. Convergence-communication-privacy trade-off. The three axes interact in sparse D-SGD; a full characterization is open.

  6. Asynchronous decentralized FL. Real P2P networks are asynchronous; the synchronous D-SGD analysis fails. Non-trivial convergence proofs needed.

This is perhaps the richest open area of the book — combining graph theory, optimization, and information theory.

⚠️ Engineering Note

Deploying Decentralized FL

For decentralized FL deployments:

  • Graph design. Use CCESA-style sparsification: $p = \log n / n$ gives $O(\log n)$ neighbors per user, sufficient for connectivity with high probability. Computation-storage-communication trade-off as in Ch 12.

  • Consensus weights. Metropolis-Hastings or Laplacian-based weights give doubly stochastic mixing matrices; they can be computed locally from neighbor degrees, with no central coordination.

  • Byzantine filter per node. Each user applies a local trimmed-mean filter before updating (a sketch of this per-node update follows this list). Costs $O(|\mathcal{N}(k)|^2)$ per user per round, tolerable for sparse graphs.

  • Privacy via local DP. If stronger privacy is needed, each user adds local Gaussian dither. No cryptographic setup required.

  • Asynchrony tolerance. Use bounded-delay assumptions: each message arrives within a known bound $\tau$. The analysis adapts; convergence is slower but still holds.

  • Failure recovery. If a user fails, the graph topology changes (its edges disappear). Convergence continues on the reduced graph with a slightly larger $\rho$.
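
Below is a minimal sketch of the per-node update described in the bullets above, combining a coordinate-wise trimmed-mean filter over the received neighbor models with optional local Gaussian dither. The function names, trim level, and noise scale `sigma` are hypothetical choices, and the trimmed mean stands in for the plain weighted mixing step.

```python
# One user's round: filter neighbor models, mix robustly, step, then dither.
import numpy as np

def trimmed_mean(models, trim):
    """Coordinate-wise trimmed mean over a stack of models of shape (m, d):
    drop the `trim` largest and `trim` smallest values in each coordinate."""
    s = np.sort(models, axis=0)
    return s[trim:models.shape[0] - trim].mean(axis=0)

def robust_local_round(theta_k, neighbor_models, local_grad,
                       lr=0.05, trim=1, sigma=0.0, rng=None):
    """D-SGD round at user k with a local Byzantine filter and local DP dither."""
    stacked = np.vstack([theta_k[None, :], neighbor_models])  # include own model
    mixed = trimmed_mean(stacked, trim)           # robust replacement for averaging
    theta_next = mixed - lr * local_grad(mixed)   # local gradient step
    if sigma > 0 and rng is not None:             # local DP: Gaussian dither before sending
        theta_next = theta_next + rng.normal(scale=sigma, size=theta_next.shape)
    return theta_next                             # model to broadcast to neighbors

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d = 4
    theta_k = rng.normal(size=d)
    neighbors = rng.normal(size=(5, d))
    neighbors[0] = 1e3                            # one Byzantine neighbor sends garbage
    grad = lambda th: th                          # toy objective F_k = 0.5 * ||theta||^2
    print(robust_local_round(theta_k, neighbors, grad, trim=1, sigma=0.01, rng=rng))
```

The trimming discards the garbage model along with one model at the other extreme of each coordinate, so the printed update stays near the honest neighbors' range.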

Practical Constraints
  • Graph density: $p = \log n / n$ (CCESA-like)

  • Metropolis-Hastings mixing weights

  • Per-node Byzantine filter: $O(|\mathcal{N}(k)|^2)$ per round

  • Local DP for privacy

  • Bounded-delay asynchrony tolerance

📋 Ref: Lalitha et al. 2018

Key Takeaway

Decentralized FL removes the central server — shifting the threat model, convergence analysis, and communication pattern. D-SGD achieves the same asymptotic rate as centralized but with graph-dependent constants. Extending coded computing, secure aggregation, and PIR to graph-based communication is a major open area. CCESA-style sparsification and AirComp-over-gossip are promising directions.

Quick Check

In decentralized FL, the convergence rate relative to centralized FL is:

Identical in asymptotic rate ($O(1/\sqrt{nT})$).

Much slower — impossible to match centralized without a server.

Faster — peer-to-peer parallelism.

Requires fewer users $n$.