Handling User Dropouts

Dropouts Break the Naive Protocol

The Bonawitz protocol of Β§10.2 assumes all nn users successfully complete the round. In production FL, dropouts are frequent: mobile devices go offline, networks time out, users close apps. A Ξ΄=5\delta = 5–30%30\% dropout rate per round is typical.

The naive protocol breaks catastrophically under dropouts. If user kk drops out, the server never receives g~k=gk+βˆ‘jrkj\tilde{\mathbf{g}}_k = \mathbf{g}_k + \sum_j \mathbf{r}_{kj}. Summing only the surviving uploads gives βˆ‘k∈Sg~k=βˆ‘k∈Sgk+βˆ‘k∈S,j∈Drkj\sum_{k \in \mathcal{S}} \tilde{\mathbf{g}}_k = \sum_{k \in \mathcal{S}} \mathbf{g}_k + \sum_{k \in \mathcal{S}, j \in \mathcal{D}} \mathbf{r}_{kj}. The mask contributions between surviving and dropped users no longer cancel β€” the server gets a noisy aggregate at best, and privacy may also break.

The point of Section 10.3 is the fix: Shamir-shared seeds with per-user threshold reconstruction. Each user's seeds are Shamir-shared among the other users before the round; if a user drops, any tt surviving users can reconstruct its seeds and subtract the leftover masks. The threshold tt is chosen to tolerate the maximum number of colluders plus the expected dropout rate.

This fix preserves both correctness (aggregation succeeds under dropouts) and privacy (colluders cannot recover secrets of honest users). It is the machinery that makes Bonawitz's protocol production-viable.

Definition:

Dropout-Resilient Secure Aggregation

A dropout-resilient secure-aggregation protocol on nn users, privacy threshold TT, and tolerable dropout rate Ξ΄\delta satisfies:

  1. Correctness under dropouts. For any set DβŠ†[n]\mathcal{D} \subseteq [n] of dropouts with ∣Dβˆ£β‰€Ξ΄n|\mathcal{D}| \leq \delta n, and the surviving set S=[n]βˆ–D\mathcal{S} = [n] \setminus \mathcal{D}, the server correctly computes GS=βˆ‘k∈Sgk\mathbf{G}_\mathcal{S} = \sum_{k \in \mathcal{S}} \mathbf{g}_k.

  2. Privacy. The adversary (server + up to TT colluders) learns only GS\mathbf{G}_\mathcal{S} and nothing else about individual gk\mathbf{g}_k for k∈Sβˆ–Uk \in \mathcal{S} \setminus \mathcal{U}.

The design challenge is to preserve privacy even when the masks of dropped users must be reconstructed. Naive reconstruction (asking surviving users to reveal their seeds with dropped ones) would expose the masks to the server; the Shamir secret sharing of seeds fixes this by requiring a threshold of surviving users to cooperate.

The privacy threshold TT and dropout rate Ξ΄\delta are related: the Shamir threshold tt for seed sharing must satisfy tβ‰₯T+1t \geq T + 1 (ensuring TT colluders cannot reconstruct a seed alone) and t≀nβˆ’Ξ΄nβˆ’Tt \leq n - \delta n - T (ensuring enough honest survivors to reconstruct dropped-user seeds). This gives the feasibility constraint T+Ξ΄n≀nβˆ’Tβˆ’1T + \delta n \leq n - T - 1, or equivalently 2T+Ξ΄n≀nβˆ’12T + \delta n \leq n - 1.

Dropout Resilience

A property of a secure-aggregation protocol that allows some users to drop out mid-round without breaking correctness or privacy. Achieved via Shamir-shared seeds with per-user threshold reconstruction.

Bonawitz Protocol with Dropout Handling (Full)

Complexity: Per-round: O(n)O(n) DH exchanges, O(nlog⁑t)O(n \log t) share distribution, O(d)O(d) gradient upload per user. Server side: O(n2log⁑t)O(n^2 \log t) reconstruction work.
Setup. Each user kk has DH key pair
(skk,pkk)(\text{sk}_k, \text{pk}_k) and additional "self-mask"
seed bkb_k for handling the "did not drop" case.
Phase 0: Key Distribution.
Users exchange DH public keys.
Phase 1: Seed Sharing.
1. For each other user jj: user kk derives pairwise
seed skj=DH(skk,pkj)s_{kj} = \text{DH}(\text{sk}_k, \text{pk}_j).
2. For the self-mask bkb_k (another random seed), user
kk Shamir-shares bkb_k with threshold tt over
distinct evaluation points associated with other users.
3. For each pairwise seed skjs_{kj}, user kk
Shamir-shares skjs_{kj} with threshold tt over
the same set of evaluation points.
4. Each user kk receives nβˆ’1n - 1 share packets (one
per other user) holding Shamir shares of the
pairwise seeds and self-mask seeds.
Phase 2: Masked Upload.
5. User kk derives actual pairwise masks
rkj\mathbf{r}_{kj} (via PRG on skjs_{kj}) and
self-mask mk\mathbf{m}_k (via PRG on bkb_k).
6. User kk uploads
g~k=gk+βˆ‘jsign(k,j)rkj+mk\tilde{\mathbf{g}}_k = \mathbf{g}_k + \sum_j \text{sign}(k,j) \mathbf{r}_{kj} + \mathbf{m}_k to the server.
Phase 3: Dropout Handling.
7. Server identifies surviving users S\mathcal{S} and
dropped users D\mathcal{D}.
8. Server requests from each surviving user kk their
Shamir shares:
- For each dropped user j∈Dj \in \mathcal{D}: share
of pairwise seed skjs_{kj}.
- For each surviving user j∈Sj \in \mathcal{S}:
share of self-mask seed bjb_j (to cancel the
self-masks).
9. Server, with β‰₯t\geq t shares per dropped/surviving
user, reconstructs the needed seeds.
10. Server subtracts the leftover masks and produces
GS=βˆ‘k∈Sgk\mathbf{G}_\mathcal{S} = \sum_{k \in \mathcal{S}} \mathbf{g}_k.
Phase 4: Output. Server delivers GS\mathbf{G}_\mathcal{S}
to users (typically as the basis for the next round's
model update).

The self-mask mk\mathbf{m}_k is a subtle but important addition: if a user only has pairwise masks and never drops, the reconstruction of its seeds (triggered by the server to cancel masks of dropped users) would reveal its mask contribution. The self-mask adds an extra layer so that this reconstruction still preserves the user's individual gradient's privacy.

Theorem: Bonawitz with Dropouts: Correctness + Privacy Under Collusion

Let tt be the Shamir threshold, TT the privacy threshold (colluders), and Ξ΄\delta the tolerable dropout fraction. The Bonawitz protocol with dropout handling achieves:

  1. Correctness. For dropouts of size ≀nβˆ’t\leq n - t (enough surviving users to reconstruct seeds), the server computes GS\mathbf{G}_\mathcal{S} exactly.

  2. Privacy. Any coalition of size TT users + server cannot learn individual gradients beyond GS\mathbf{G}_\mathcal{S}, provided T<tT < t.

The feasibility constraint is T+1≀t≀nβˆ’Ξ΄nT + 1 \leq t \leq n - \delta n, i.e., T+1+Ξ΄n≀nT + 1 + \delta n \leq n, equivalently T+Ξ΄n≀nβˆ’1T + \delta n \leq n - 1. Production settings: T=O(n)T = O(\sqrt{n}), Ξ΄=0.2\delta = 0.2, t=⌈2n/3βŒ‰t = \lceil 2n/3 \rceil.

The Shamir threshold tt sits between the privacy threshold TT (colluders cannot reconstruct) and the dropout tolerance nβˆ’Ξ΄nn - \delta n (surviving users can reconstruct). Each seed is a secret in a (t,nβˆ’1)(t, n-1)-Shamir scheme. The server, with access to shares from β‰₯t\geq t surviving users, reconstructs dropped users' seeds; a coalition of T<tT < t users cannot.

The point is that the same algebraic primitive (Shamir) from Chapter 3 provides both robustness (dropout handling) and privacy (threshold). This is the standard modular-design pattern of modern MPC: build complex protocols from simple information- theoretic primitives.

Example: Feasibility Regions for Dropout + Privacy

For n=100n = 100 users, compute the feasibility region over (T,Ξ΄)(T, \delta) for the Bonawitz dropout-resilient protocol. What are the feasible (T,Ξ΄)(T, \delta) pairs?

Dropout-Resilience vs. Privacy-Collusion Tradeoff

Plot the feasibility region for the Bonawitz protocol in the (T,Ξ΄)(T, \delta) plane, for various nn. The feasibility boundary is T+Ξ΄n=nβˆ’1T + \delta n = n - 1; above this boundary, no combination of Shamir threshold and pairwise masks achieves both privacy and dropout resilience. Higher nn shifts the boundary outward β€” larger user populations tolerate more collusion and dropouts simultaneously.

Parameters
100

Total users

Common Mistake: Shamir Threshold Must Exceed Collusion Threshold

Mistake:

Choose the Shamir threshold t=Tt = T (= privacy threshold).

Correction:

Setting t=Tt = T allows a coalition of exactly TT colluders to reconstruct the Shamir-shared seeds, then compute the pairwise masks and extract individual gradients. The correct choice is tβ‰₯T+1t \geq T + 1 (strictly more than the collusion threshold), so any TT colluders have T<tT < t shares and cannot reconstruct.

Production deployments use t=⌈2n/3βŒ‰t = \lceil 2n/3 \rceil or similar large fractions, giving large safety margins over typical TT values. The cost of larger tt is that more surviving users are needed for reconstruction β€” but Shamir's per-share cost is small (Β§3.2), so this is affordable.

⚠️Engineering Note

Dropout Handling in Production

Production Bonawitz deployments:

  • Typical dropout rate: Ξ΄=0.15\delta = 0.15–0.300.30 (15–30% of users drop per round on mobile FL).
  • Typical privacy threshold: T=⌈n/5βŒ‰T = \lceil n/5 \rceil to ⌈n/3βŒ‰\lceil n/3 \rceil.
  • Shamir threshold: t=⌈2n/3βŒ‰t = \lceil 2n/3 \rceil (comfortable margin over TT, enough for typical dropouts).
  • Reconstruction latency: ∼1\sim 1 second per reconstruction; nβ‹…0.3=30n \cdot 0.3 = 30 reconstructions per round at n=100n = 100, so ∼30\sim 30 seconds added to each round.

The Shamir-reconstruction step dominates server-side CPU time in production. CCESA (Chapter 12) reduces the number of seeds to reconstruct (sparser graph), but within the Bonawitz framework, the overhead is intrinsic.

Practical Constraints
  • β€’

    Dropout rate: δ∈[0.15,0.30]\delta \in [0.15, 0.30] typical

  • β€’

    Shamir threshold: t=⌈2n/3βŒ‰t = \lceil 2n/3 \rceil standard

  • β€’

    Reconstruction: O(n2log⁑t)O(n^2 \log t) server CPU per round

πŸ“‹ Ref: Bonawitz et al. 2017 Β§4; Google FL systems 2019

Historical Note: Shamir Secret Sharing in Modern Cryptography

1979–2017 (Shamir's generalization); 2017 (Bonawitz application)

Shamir's secret sharing, introduced in 1979 (Chapter 3), found a natural home in secure aggregation nearly four decades later. The connection is apt: both problems require splitting a secret (in Β§10.3, the mask seed) among parties such that a threshold of them can reconstruct while fewer cannot. Bonawitz et al. (2017) brought Shamir to federated learning via the seed-sharing mechanism described in this section.

The pattern β€” secret-share a cryptographic primitive, then reconstruct on demand β€” is now standard in threshold cryptography (Chapter 3 Β§3.5), threshold signatures (Byzantine agreement), and verifiable secret sharing (Chapter 11's ByzSecAgg). The abstraction is clean, and the Shamir implementation is robust. Modern MPC libraries (SPDZ, MP-SPDZ, FRESCO) expose Shamir- shared computation as a primitive, with Bonawitz-style secure aggregation as one of the simplest applications.

,

Key Takeaway

Shamir secret sharing of the mask seeds bridges dropouts and privacy. The threshold tt is chosen so that TT colluders (fewer than tt) cannot reconstruct but the server, with tt shares from surviving users, can reconstruct dropped-user seeds and finalize the aggregate. This is the production-standard protocol; Β§10.4 proves its communication cost is information-theoretically optimal.

Quick Check

For a Bonawitz-protocol deployment with n=100n = 100 users, privacy threshold T=20T = 20 colluders, and dropout rate Ξ΄=0.20\delta = 0.20, what is an appropriate Shamir threshold tt?

t=20t = 20 (equal to TT)

t=67t = 67 (⌈2n/3βŒ‰\lceil 2n/3 \rceil)

t=80t = 80 (nβˆ’Ξ΄nn - \delta n)

t=99t = 99 (all but one user)