Coded Outlier Detection via Distance Aggregation

Computing Distances Without Seeing Gradients

Section 11.2 sketched ByzSecAgg's six-phase protocol and identified the core challenge: detecting Byzantine users requires some information about gradients, but privacy demands no leakage of individual values. The middle ground is to compute pairwise distances $\|\mathbf{g}_i - \mathbf{g}_j\|^2$ — a quadratic function of the gradients — without revealing the gradients themselves.

The key observation is that pairwise distances suffice for Krum-style outlier detection: Byzantine users typically have abnormal distance profiles (far from the honest cluster). Computing distances on shares (rather than on plaintext) preserves privacy while enabling filtering. This is exactly what Lagrange Coded Computing (Chapter 8 §8.3) does for general polynomial functions.

Section 11.3 develops the coded distance computation in detail. The construction is one of the few instances where Chapter 8's LCC framework is used in production-relevant settings β€” a direct application of coded computing to privacy-preserving machine learning.

Definition: Coded Pairwise Distance Computation

Given users $1, \ldots, n$ with ramp-shared gradients $\{\mathbf{g}_k\}$ over $\mathbb{F}_q$, the coded pairwise distance between users $i$ and $j$ is

$$d_{ij} \;=\; \|\mathbf{g}_i - \mathbf{g}_j\|^2 \;=\; \sum_{\ell=1}^{d} (g_{i,\ell} - g_{j,\ell})^2.$$

This is a quadratic function of the gradients — degree $d_f = 2$ in the LCC framework (Chapter 8 §8.3). The coded distance computation reuses the LCC infrastructure:

  1. Each user $k$ shares $\mathbf{g}_k$ via ramp sharing into pieces $\{s_{k,\ell}\}_{\ell}$ held by other users.
  2. Each pair of users $(i, j)$ engages in an LCC-style exchange: compute the share of $d_{ij}$ from the shares of $\mathbf{g}_i$ and $\mathbf{g}_j$ via polynomial multiplication and Lagrange interpolation.
  3. The server collects the $d_{ij}$ values for all pairs without learning any individual gradient.

The construction works because $d_{ij}$ is a polynomial of degree 2 in the inputs, and the LCC recovery threshold (Chapter 8 §8.3 Thm. 1) is $K_{\text{rec}} = d_f(n - 1) + 1 = 2(n - 1) + 1$. For ByzSecAgg's parameters, this is well within the available ramp shares.
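The recovery step can be illustrated with a minimal sketch in which a Shamir-style degree-$t$ sharing stands in for ByzSecAgg's ramp sharing (the field, degree, vectors, and party count here are toy choices, not the paper's parameters). Each party squares the difference of its local shares; that value is the evaluation of a degree-$2t$ polynomial whose constant term is $d_{ij}$, so interpolating $2t + 1$ evaluations at $x = 0$ recovers the distance without any party seeing a plaintext gradient:

```python
# Illustrative sketch only: degree-t Shamir sharing in place of ramp
# sharing; toy prime field and toy gradient vectors.
import random

P = 2_147_483_647   # prime modulus for the field F_p
T = 1               # degree of each sharing polynomial

def share(vec, n_parties, t=T):
    """Shamir-share each coordinate; party x (1..n_parties) gets one share vector."""
    shares = [[0] * len(vec) for _ in range(n_parties)]
    for c, secret in enumerate(vec):
        coeffs = [secret] + [random.randrange(P) for _ in range(t)]
        for x in range(1, n_parties + 1):
            shares[x - 1][c] = sum(a * pow(x, k, P) for k, a in enumerate(coeffs)) % P
    return shares

def lagrange_at_zero(points):
    """Interpolate the polynomial through (x, y) points and evaluate it at 0."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

g_i = [5, 1, 7]
g_j = [2, 3, 4]
n_parties = 2 * T + 1            # degree-2T product polynomial: 2T+1 evaluations
si = share(g_i, n_parties)
sj = share(g_j, n_parties)
# Each party squares the difference of its local share vectors (degree-2 step).
evals = [(x, sum((a - b) ** 2 for a, b in zip(si[x - 1], sj[x - 1])) % P)
         for x in range(1, n_parties + 1)]
d_ij = lagrange_at_zero(evals)
print(d_ij)   # 3^2 + 2^2 + 3^2 = 22
```

The real scheme replaces the Shamir sharing with ramp sharing and correspondingly different thresholds, but the degree-doubling and interpolation structure is the same.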


Coded Pairwise Distance

The squared Euclidean distance $\|\mathbf{g}_i - \mathbf{g}_j\|^2$ between two users' gradients, computed on shares via Lagrange Coded Computing without revealing the individual gradients to anyone (server or other users).

Definition: Krum-Style Filtering on Coded Distances

The Krum outlier-detection rule (Blanchard et al. 2017) operates as follows on the matrix of pairwise distances $\{d_{ij}\}_{i,j}$:

  1. For each user $k$, compute the Krum score $S_k = \sum_{j \in \text{nearest}(k,\, n - B - 1)} d_{kj}$, where $\text{nearest}(k, m)$ denotes the $m$ users with the smallest $d_{kj}$ (excluding $k$ itself).

  2. Identify the $B$ users with the largest $S_k$ values as the Byzantine set $\mathcal{B}$.

  3. Aggregate over the surviving honest users $\mathcal{H} = [n] \setminus \mathcal{B}$.

The intuition: Byzantine users typically have gradients far from the honest cluster (otherwise they would not be effective Byzantines). Their Krum scores are abnormally large because their nearest-neighbor sums include many "honest" distances that are still large. The honest users, by contrast, have small Krum scores (they cluster around the true gradient).

In ByzSecAgg, this filtering operates on coded distances β€” the server never sees individual gradients, only the pairwise distance values. The filtering decision is the same as plaintext Krum but without the privacy violation.

The choice $\text{nearest}(k, n - B - 1)$ is deliberate: by considering only the $n - B - 1$ nearest users (excluding the $B$ furthest, which might be Byzantines themselves), the score is robust to Byzantine influence. Bulyan (El Mhamdi et al. 2018) refines this further; ByzSecAgg can use any Byzantine-resilient aggregator that operates on pairwise distances.


Theorem: Krum Filtering Correctly Identifies Byzantines (Sketch)

Let the honest gradients $\{\mathbf{g}_k : k \in \mathcal{H}\}$ be drawn from a distribution concentrated around the true gradient $\mathbf{g}^*$ (variance $\sigma^2$), and let the Byzantine gradients be arbitrary. For sufficiently well-separated Byzantine gradients (i.e., distance from $\mathbf{g}^*$ exceeding a threshold $\Theta(\sigma)$ with high probability), Krum's filtering correctly identifies the Byzantine set $\mathcal{B}$ with probability $\geq 1 - 2\exp(-cn)$ for some constant $c > 0$.

The probability of error decays exponentially in nn, making Krum reliable for moderate-to-large user populations.

A Byzantine user's gradient, if far from $\mathbf{g}^*$, is also far from every honest gradient (which clusters around $\mathbf{g}^*$), so the Byzantine user's distances to honest users are large. Conversely, an honest user has small distances to its $n - B - 1$ nearest others (mostly honest), giving a small Krum score.

The exponential concentration follows from Chernoff-style inequalities on the empirical distance distribution — a standard concentration-of-measure argument. The point is that as $n$ grows, the filtering becomes effectively deterministic.
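As a hedged sketch of how the $2\exp(-cn)$ bound can arise (the threshold $\tau$ and the constants $c_1, c_2$ are illustrative assumptions, not taken from the source), one can union-bound the two failure modes:

```latex
\Pr[\text{filtering error}]
  \;\le\; \Pr\big[\exists\, k \in \mathcal{H} : S_k > \tau\big]
        + \Pr\big[\exists\, k \in \mathcal{B} : S_k \le \tau\big]
  \;\le\; |\mathcal{H}|\, e^{-c_1 n} + |\mathcal{B}|\, e^{-c_2 n}
  \;\le\; 2\, e^{-c n},
```

where each per-user tail probability comes from a Chernoff-style bound on the sum of $n - B - 1$ concentrated squared distances, and the last step absorbs the linear prefactors into $c$ for sufficiently large $n$.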

Example: Krum Filter on Coded Distances ($n = 10$, $B = 2$)

Walk through Krum's filtering for $n = 10$ users with $B = 2$ Byzantines, given a hypothetical distance matrix.
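A minimal sketch of this walk-through, using a hypothetical distance matrix: honest users $0$–$7$ sit at mutual distance $1$, while Byzantine users $8$–$9$ are at distance $100$ from everyone (the numbers are illustrative, not from the source):

```python
# Hypothetical distance matrix for n = 10, B = 2: users 0..7 honest,
# users 8 and 9 Byzantine (far from the honest cluster).
n, B = 10, 2
D = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            D[i][j] = 1 if (i < 8 and j < 8) else 100

m = n - B - 1                        # sum over the 7 nearest neighbors
scores = [sum(sorted(D[k][j] for j in range(n) if j != k)[:m])
          for k in range(n)]
byz = sorted(range(n), key=lambda k: scores[k], reverse=True)[:B]
print(scores)        # honest users score 7, Byzantine users score 700
print(sorted(byz))   # [8, 9]
```

Each honest user's seven nearest neighbors are all honest (score $7 \cdot 1 = 7$), while a Byzantine user's seven nearest distances are all $100$ (score $700$), so the two largest scores pick out exactly the Byzantine set.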

Coded Pairwise Distance Computation (LCC Specialization)

Complexity: per pair, $O(d/g + n \log n)$ communication and $O(n)$ field operations.
Input: Ramp shares $\{s_{k,\ell}\}$ of each user $k$'s gradient $\mathbf{g}_k$, distributed among the $n - 1$ other users; a pair $(i, j)$ for distance computation.
Output: Pairwise distance $d_{ij} = \|\mathbf{g}_i - \mathbf{g}_j\|^2$ at the server.

1. **Each user $\ell$ holds shares $s_{i,\ell}$ and $s_{j,\ell}$.**
2. Local LCC computation: user $\ell$ computes $\hat d^{(\ell)} = (s_{i,\ell} - s_{j,\ell})^T (s_{i,\ell} - s_{j,\ell}) = \|s_{i,\ell} - s_{j,\ell}\|^2$ — a quadratic function of the share vectors.
3. Each user uploads $\hat d^{(\ell)}$ to the server.
4. The server interpolates $d_{ij}$ from the collected $\{\hat d^{(\ell)}\}$. By the LCC recovery threshold (Chapter 8 §8.3) for degree-2 functions, $K_{\text{rec}} = 2(n - 1) + 1 = 2n - 1$ evaluations suffice; ByzSecAgg's ramp-sharing parameters are chosen so that at least this many share evaluations are available to the server.
5. Output $d_{ij}$.

The construction reuses the standard LCC framework of Chapter 8. The key adaptation is that the input is now ramp-shared (Chapter 3 §3.4) rather than Shamir-shared, which preserves the Byzantine-error-correction structure of the overall scheme. The polynomial degree is $d_f = 2$ (quadratic in the gradients), so the LCC recovery threshold is linear in $n$ — feasible.

Krum Filtering Accuracy vs. Byzantine Fraction

Plot the empirical accuracy of Krum-style filtering (correctly identifying Byzantine users) as a function of the Byzantine fraction $B/n$. As $B/n$ approaches $0.5$, filtering becomes unreliable; for $B/n \leq 0.25$, accuracy approaches 100% with sufficient $n$. The plot illustrates the operational regime where ByzSecAgg's filtering is effective.


Common Mistake: Krum Filtering Has a Bias for Sparse Byzantine Patterns

Mistake:

Assume Krum filtering works for arbitrary Byzantine strategies, including coordinated dense attacks.

Correction:

Krum's effectiveness depends on Byzantine gradients being outliers in the distance distribution. If multiple Byzantines coordinate to send gradients that are close to each other (and close to honest gradients but in the wrong direction), their Krum scores can match honest users' — and they survive the filter. Bulyan (El Mhamdi et al. 2018) and other refinements address this; in production, ByzSecAgg can use Bulyan or similar robust aggregators on top of the coded distance computation. The information-theoretic guarantees of §11.2 hold regardless of which specific aggregator is used.

🔧 Engineering Note

Krum, Bulyan, Trimmed Mean: Choices Within ByzSecAgg

ByzSecAgg's outlier-detection step (Phase 4) is pluggable. Production deployments choose among:

  • Krum: Simple, fast, $O(n^2)$ score computation. Works well for $B/n < 0.25$.
  • Bulyan: Krum + trimmed-mean refinement. Better tolerance against coordinated attacks, $O(n^2)$ computation. Industry default for Byzantine FL.
  • Trimmed Mean (Yin et al. 2018): Coordinate-wise removal of $\beta$-fraction extremes. Lighter computation; loses some information-theoretic privacy because per-coordinate operations interact with ramp shares.
  • FedSeg (Sun et al. 2021): ByzSecAgg-compatible variant with explicit segment-based filtering.

The choice depends on the attack model and the per-aggregator complexity budget. ByzSecAgg's information-theoretic privacy holds for any aggregator that operates on the pairwise distance matrix.
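For concreteness, a minimal sketch of the coordinate-wise trimmed mean mentioned above (plain floating-point gradients for illustration; the function name and test vectors are hypothetical, not from ByzSecAgg):

```python
# Sketch of coordinate-wise trimmed mean (Yin et al. 2018): drop the
# beta-fraction smallest and largest values in each coordinate, then
# average what remains.
def trimmed_mean(gradients, beta):
    n = len(gradients)
    k = int(beta * n)                    # values trimmed from each tail
    out = []
    for c in range(len(gradients[0])):
        col = sorted(g[c] for g in gradients)
        kept = col[k:n - k]              # middle (1 - 2*beta) fraction
        out.append(sum(kept) / len(kept))
    return out

# One outlier user; beta = 0.25 trims one value from each tail per coordinate.
grads = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -100.0]]
print(trimmed_mean(grads, 0.25))   # [2.5, 15.0]
```

The outlier's extreme values land in the trimmed tails of each coordinate, so the aggregate is unaffected by them.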

Practical Constraints
  • Krum: simplest, $B/n < 0.25$
  • Bulyan: more robust, slightly higher cost
  • Choice depends on attack model

📋 Ref: Blanchard et al. 2017; El Mhamdi et al. 2018; Yin et al. 2018

Historical Note: The Byzantine Aggregator Family Tree

2017–2023

The Byzantine-resilient aggregator family in FL began with Krum (Blanchard et al. 2017), inspired by Median-style robust statistics from the distributed-estimation literature. Yin et al. (2018) introduced Trimmed Mean, which has tighter statistical properties for sub-Gaussian noise. Bulyan (El Mhamdi et al. 2018) combined Krum with trimmed-mean post-processing for stronger guarantees.

Until ByzSecAgg (2023), all these aggregators operated on plaintext gradients — incompatible with secure aggregation's privacy guarantee. The coded-distance trick of §11.3, lifted from Lagrange Coded Computing (Yu et al. 2019, Chapter 8), enabled the combination. ByzSecAgg showed that any distance-based Byzantine aggregator can be combined with secure aggregation via this technique — a structural insight that has shaped the post-2023 literature.


Key Takeaway

Coded distance computation enables Byzantine detection without privacy violation. By computing pairwise distances on ramp shares using Lagrange Coded Computing, the server can apply any distance-based Byzantine aggregator (Krum, Bulyan, Trimmed Mean) without learning individual gradients. The construction is Chapter 8's LCC framework specialized to quadratic distance functions.

Quick Check

In ByzSecAgg, the server computes pairwise distances $d_{ij} = \|\mathbf{g}_i - \mathbf{g}_j\|^2$ via:

Plaintext gradient inspection by the server.

Lagrange Coded Computing on ramp shares β€” each user computes a quadratic LCC contribution on its shares; server interpolates the distance.

Encrypted gradient computation via homomorphic encryption.

Random projections that approximately preserve distances.