Ferkans — Interactive Telecom Tutor

Dropouts Break the Naive Protocol

The Bonawitz protocol of §10.2 assumes all $n$ users successfully complete the round. In production FL, dropouts are frequent: mobile devices go offline, networks time out, users close apps. A $\delta = 5$ – $30\%$ dropout rate per round is typical.

The naive protocol breaks catastrophically under dropouts. If user $k$ drops out, the server never receives $\tilde{\mathbf{g}}_k = \mathbf{g}_k + \sum_j \mathbf{r}_{kj}$ . Summing only the surviving uploads gives $\sum_{k \in \mathcal{S}} \tilde{\mathbf{g}}_k = \sum_{k \in \mathcal{S}} \mathbf{g}_k + \sum_{k \in \mathcal{S}, j \in \mathcal{D}} \mathbf{r}_{kj}$ . The mask contributions between surviving and dropped users no longer cancel — the server gets a noisy aggregate at best, and privacy may also break.

The point of Section 10.3 is the fix: Shamir-shared seeds with per-user threshold reconstruction. Each user's seeds are Shamir-shared among the other users before the round; if a user drops, any $t$ surviving users can reconstruct its seeds and subtract the leftover masks. The threshold $t$ is chosen to tolerate the maximum number of colluders plus the expected dropout rate.

This fix preserves both correctness (aggregation succeeds under dropouts) and privacy (colluders cannot recover secrets of honest users). It is the machinery that makes Bonawitz's protocol production-viable.

Definition:
Dropout-Resilient Secure Aggregation

A dropout-resilient secure-aggregation protocol on $n$ users, privacy threshold $T$ , and tolerable dropout rate $\delta$ satisfies:

Correctness under dropouts. For any set $\mathcal{D} \subseteq [n]$ of dropouts with $|\mathcal{D}| \leq \delta n$ , and the surviving set $\mathcal{S} = [n] \setminus \mathcal{D}$ , the server correctly computes $\mathbf{G}_\mathcal{S} = \sum_{k \in \mathcal{S}} \mathbf{g}_k$ .
Privacy. The adversary (server + up to $T$ colluders) learns only $\mathbf{G}_\mathcal{S}$ and nothing else about individual $\mathbf{g}_k$ for $k \in \mathcal{S} \setminus \mathcal{U}$ .

The design challenge is to preserve privacy even when the masks of dropped users must be reconstructed. Naive reconstruction (asking surviving users to reveal their seeds with dropped ones) would expose the masks to the server; the Shamir secret sharing of seeds fixes this by requiring a threshold of surviving users to cooperate.

The privacy threshold $T$ and dropout rate $\delta$ are related: the Shamir threshold $t$ for seed sharing must satisfy $t \geq T + 1$ (ensuring $T$ colluders cannot reconstruct a seed alone) and $t \leq n - \delta n - T$ (ensuring enough honest survivors to reconstruct dropped-user seeds). This gives the feasibility constraint $T + \delta n \leq n - T - 1$ , or equivalently $2T + \delta n \leq n - 1$ .

Dropout Resilience

A property of a secure-aggregation protocol that allows some users to drop out mid-round without breaking correctness or privacy. Achieved via Shamir-shared seeds with per-user threshold reconstruction.

Bonawitz Protocol with Dropout Handling (Full)

Complexity: Per-round:

O(n)

DH exchanges,

O(n \log t)

share distribution,

O(d)

gradient upload per user. Server side:

O(n^2 \log t)

reconstruction work.

Setup. Each user

k

has DH key pair

(\text{sk}_k, \text{pk}_k)

and additional "self-mask"

seed

b_k

for handling the "did not drop" case.

Phase 0: Key Distribution.

Users exchange DH public keys.

Phase 1: Seed Sharing.

1. For each other user

j

: user

k

derives pairwise

seed

s_{kj} = \text{DH}(\text{sk}_k, \text{pk}_j)

.

2. For the self-mask

b_k

(another random seed), user

k

Shamir-shares

b_k

with threshold

t

over

distinct evaluation points associated with other users.

3. For each pairwise seed

s_{kj}

, user

k

Shamir-shares

s_{kj}

with threshold

t

over

the same set of evaluation points.

4. Each user

k

receives

n - 1

share packets (one

per other user) holding Shamir shares of the

pairwise seeds and self-mask seeds.

Phase 2: Masked Upload.

5. User

k

derives actual pairwise masks

\mathbf{r}_{kj}

(via PRG on

s_{kj}

) and

self-mask

\mathbf{m}_k

(via PRG on

b_k

).

6. User

k

uploads

\tilde{\mathbf{g}}_k = \mathbf{g}_k + \sum_j \text{sign}(k,j) \mathbf{r}_{kj} + \mathbf{m}_k

to the server.

Phase 3: Dropout Handling.

7. Server identifies surviving users

\mathcal{S}

and

dropped users

\mathcal{D}

.

8. Server requests from each surviving user

k

their

Shamir shares:

- For each dropped user

j \in \mathcal{D}

: share

of pairwise seed

s_{kj}

.

- For each surviving user

j \in \mathcal{S}

:

share of self-mask seed

b_j

(to cancel the

self-masks).

9. Server, with

\geq t

shares per dropped/surviving

user, reconstructs the needed seeds.

10. Server subtracts the leftover masks and produces

\mathbf{G}_\mathcal{S} = \sum_{k \in \mathcal{S}} \mathbf{g}_k

.

Phase 4: Output. Server delivers

\mathbf{G}_\mathcal{S}

to users (typically as the basis for the next round's

model update).

The self-mask $\mathbf{m}_k$ is a subtle but important addition: if a user only has pairwise masks and never drops, the reconstruction of its seeds (triggered by the server to cancel masks of dropped users) would reveal its mask contribution. The self-mask adds an extra layer so that this reconstruction still preserves the user's individual gradient's privacy.

Theorem: Bonawitz with Dropouts: Correctness + Privacy Under Collusion

Let $t$ be the Shamir threshold, $T$ the privacy threshold (colluders), and $\delta$ the tolerable dropout fraction. The Bonawitz protocol with dropout handling achieves:

Correctness. For dropouts of size $\leq n - t$ (enough surviving users to reconstruct seeds), the server computes $\mathbf{G}_\mathcal{S}$ exactly.
Privacy. Any coalition of size $T$ users + server cannot learn individual gradients beyond $\mathbf{G}_\mathcal{S}$ , provided $T < t$ .

The feasibility constraint is $T + 1 \leq t \leq n - \delta n$ , i.e., $T + 1 + \delta n \leq n$ , equivalently $T + \delta n \leq n - 1$ . Production settings: $T = O(\sqrt{n})$ , $\delta = 0.2$ , $t = \lceil 2n/3 \rceil$ .

The Shamir threshold $t$ sits between the privacy threshold $T$ (colluders cannot reconstruct) and the dropout tolerance $n - \delta n$ (surviving users can reconstruct). Each seed is a secret in a $(t, n-1)$ -Shamir scheme. The server, with access to shares from $\geq t$ surviving users, reconstructs dropped users' seeds; a coalition of $T < t$ users cannot.

The point is that the same algebraic primitive (Shamir) from Chapter 3 provides both robustness (dropout handling) and privacy (threshold). This is the standard modular-design pattern of modern MPC: build complex protocols from simple information- theoretic primitives.

Proof

Correctness

After dropouts, the server has: (i) masked uploads from surviving users, (ii) $\geq t$ shares of each dropped user's pairwise seeds (from surviving users). With $\geq t$ shares, the server reconstructs each dropped user's seeds via Lagrange interpolation (Chapter 3 §3.2), derives the pairwise masks, and subtracts them from the surviving sum.

Self-masks

For each surviving user $k \in \mathcal{S}$ , the self-mask $\mathbf{m}_k$ is present in the sum. The server reconstructs $b_k$ from surviving-user shares (using threshold $t$ ) and subtracts $\mathbf{m}_k$ from the sum.

Privacy against $T < t$ colluders

A coalition of $T < t$ users sees: (i) $T$ user gradients $\{\mathbf{g}_k : k \in \mathcal{U}\}$ , (ii) $T$ shares of each seed. By Shamir's perfect- secrecy property, $T$ shares reveal nothing about any seed (which has $t$ -threshold privacy). Hence the pairwise masks cannot be reconstructed by the coalition alone, and the masked uploads of honest users are uniform modulo the aggregate.

Composition via hybrid argument

The full protocol composes these guarantees via a standard simulation-based proof (Bonawitz et al. 2017 §4.2). The reader is referred to the paper for the cryptographic details. The key observation is that dropouts and privacy are handled by the same Shamir-share threshold, with $T < t \leq n - \delta n$ . $\blacksquare$

Example: Feasibility Regions for Dropout + Privacy

For $n = 100$ users, compute the feasibility region over $(T, \delta)$ for the Bonawitz dropout-resilient protocol. What are the feasible $(T, \delta)$ pairs?

Solution

Constraint

$T + \delta n \leq n - 1$ , or equivalently $T + \delta \cdot 100 \leq 99$ , or $T \leq 99 - 100 \delta$ .

Example points

$T = 0$ : $\delta \leq 0.99$ (tolerate up to $99\%$ dropouts, no collusion).
$T = 10$ : $\delta \leq 0.89$ .
$T = 25$ : $\delta \leq 0.74$ .
$T = 50$ : $\delta \leq 0.49$ .
$T = 99$ : $\delta \leq 0$ (no dropouts, $T = n - 1$ max collusion).

Typical production setting

$T = 20$ (tolerate up to 20 colluders), $\delta = 0.2$ (tolerate 20 dropouts): feasibility check: $20 + 20 = 40 \leq 99$ . Feasible. Shamir threshold $t$ can be chosen in $[21, 80]$ — typically $t = \lceil 2n/3 \rceil = 67$ .

Tight regime

At high $\delta$ and high $T$ simultaneously (e.g., $T = 40, \delta = 0.5$ ), the constraint $T + \delta n \leq n - 1$ becomes $40 + 50 = 90 \leq 99$ — feasible but close to the boundary. Margins should be designed conservatively.

Dropout-Resilience vs. Privacy-Collusion Tradeoff

Plot the feasibility region for the Bonawitz protocol in the $(T, \delta)$ plane, for various $n$ . The feasibility boundary is $T + \delta n = n - 1$ ; above this boundary, no combination of Shamir threshold and pairwise masks achieves both privacy and dropout resilience. Higher $n$ shifts the boundary outward — larger user populations tolerate more collusion and dropouts simultaneously.

Parameters

n

— users100

Total users

Common Mistake: Shamir Threshold Must Exceed Collusion Threshold

Mistake:

Choose the Shamir threshold $t = T$ (= privacy threshold).

Correction:

Setting $t = T$ allows a coalition of exactly $T$ colluders to reconstruct the Shamir-shared seeds, then compute the pairwise masks and extract individual gradients. The correct choice is $t \geq T + 1$ (strictly more than the collusion threshold), so any $T$ colluders have $T < t$ shares and cannot reconstruct.

Production deployments use $t = \lceil 2n/3 \rceil$ or similar large fractions, giving large safety margins over typical $T$ values. The cost of larger $t$ is that more surviving users are needed for reconstruction — but Shamir's per-share cost is small (§3.2), so this is affordable.

⚠️Engineering Note

Dropout Handling in Production

Production Bonawitz deployments:

Typical dropout rate: $\delta = 0.15$ – $0.30$ (15–30% of users drop per round on mobile FL).
Typical privacy threshold: $T = \lceil n/5 \rceil$ to $\lceil n/3 \rceil$ .
Shamir threshold: $t = \lceil 2n/3 \rceil$ (comfortable margin over $T$ , enough for typical dropouts).
Reconstruction latency: $\sim 1$ second per reconstruction; $n \cdot 0.3 = 30$ reconstructions per round at $n = 100$ , so $\sim 30$ seconds added to each round.

The Shamir-reconstruction step dominates server-side CPU time in production. CCESA (Chapter 12) reduces the number of seeds to reconstruct (sparser graph), but within the Bonawitz framework, the overhead is intrinsic.

Practical Constraints

•
Dropout rate: $\delta \in [0.15, 0.30]$ typical
•
Shamir threshold: $t = \lceil 2n/3 \rceil$ standard
•
Reconstruction: $O(n^2 \log t)$ server CPU per round

📋 Ref: Bonawitz et al. 2017 §4; Google FL systems 2019

Historical Note: Shamir Secret Sharing in Modern Cryptography

1979–2017 (Shamir's generalization); 2017 (Bonawitz application)

Shamir's secret sharing, introduced in 1979 (Chapter 3), found a natural home in secure aggregation nearly four decades later. The connection is apt: both problems require splitting a secret (in §10.3, the mask seed) among parties such that a threshold of them can reconstruct while fewer cannot. Bonawitz et al. (2017) brought Shamir to federated learning via the seed-sharing mechanism described in this section.

The pattern — secret-share a cryptographic primitive, then reconstruct on demand — is now standard in threshold cryptography (Chapter 3 §3.5), threshold signatures (Byzantine agreement), and verifiable secret sharing (Chapter 11's ByzSecAgg). The abstraction is clean, and the Shamir implementation is robust. Modern MPC libraries (SPDZ, MP-SPDZ, FRESCO) expose Shamir- shared computation as a primitive, with Bonawitz-style secure aggregation as one of the simplest applications.

,

Key Takeaway

Shamir secret sharing of the mask seeds bridges dropouts and privacy. The threshold $t$ is chosen so that $T$ colluders (fewer than $t$ ) cannot reconstruct but the server, with $t$ shares from surviving users, can reconstruct dropped-user seeds and finalize the aggregate. This is the production-standard protocol; §10.4 proves its communication cost is information-theoretically optimal.

Quick Check

For a Bonawitz-protocol deployment with $n = 100$ users, privacy threshold $T = 20$ colluders, and dropout rate $\delta = 0.20$ , what is an appropriate Shamir threshold $t$ ?

$t = 20$ (equal to $T$ )

$t = 67$ ( $\lceil 2n/3 \rceil$ )

$t = 80$ ( $n - \delta n$ )

$t = 99$ (all but one user)