Coded Outlier Detection via Distance Aggregation
Computing Distances Without Seeing Gradients
Section 11.2 sketched ByzSecAgg's six-phase protocol and identified the core challenge: detecting Byzantine users requires some information about gradients, but privacy demands no leakage of individual values. The middle ground is to compute pairwise distances, a quadratic function of the gradients, without revealing the gradients themselves.
The point is that pairwise distance is enough for Krum-style outlier detection: Byzantine users typically have abnormal distance profiles (far from the honest cluster). Computing distances on shares (rather than on plaintext) preserves privacy while enabling filtering. This is exactly what Lagrange Coded Computing (Chapter 8 §8.3) does for general polynomial functions.
Section 11.3 develops the coded distance computation in detail. The construction is one of the few instances where Chapter 8's LCC framework is used in production-relevant settings: a direct application of coded computing to privacy-preserving machine learning.
Definition: Coded Pairwise Distance Computation
Given $N$ users with ramp-shared gradients $g_1, \dots, g_N$ over a finite field $\mathbb{F}_p$, the coded pairwise distance between users $i$ and $j$ is $d_{ij} = \|g_i - g_j\|^2$. This is a quadratic function of the gradients (degree 2 in the LCC framework, Chapter 8 §8.3). The coded distance computation reuses the LCC infrastructure:
- Each user $i$ shares $g_i$ via ramp sharing into pieces held by other users.
- Each pair of users $(i, j)$ engages in an LCC-style exchange: compute the share of $d_{ij}$ from the shares of $g_i$ and $g_j$ via polynomial multiplication and Lagrange interpolation.
- Server collects the $d_{ij}$ values for all pairs without learning any individual gradient.
The construction works because $d_{ij}$ is a polynomial of degree 2 in the inputs, so the LCC recovery threshold (Chapter 8 §8.3 Thm. 1), $(K + T - 1)\deg f + 1 = 2(K + T - 1) + 1$, applies. For ByzSecAgg's parameters, this is well within the available ramp shares.
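The mechanics can be illustrated with a toy sketch. This is our simplification, not ByzSecAgg's exact protocol: plain Shamir sharing over a small prime field stands in for ramp sharing, and a single pair of short vectors stands in for the full gradient exchange. Each shareholder locally squares and sums its shares of the difference, producing a share of a degree-$2t$ polynomial whose constant term is $\|g_i - g_j\|^2$; interpolating at 0 recovers the distance without revealing either gradient.

```python
# Toy sketch (assumed setup, not ByzSecAgg's exact protocol): Shamir-share
# two gradient vectors, let each shareholder compute its share of the
# squared distance locally, then Lagrange-interpolate at 0 to recover
# ||g_i - g_j||^2 without revealing g_i or g_j.
import random

P = 2**31 - 1  # prime field modulus

def share(secret_vec, t, points):
    """Degree-t Shamir shares of each coordinate; each point gets a vector."""
    shares = {x: [] for x in points}
    for s in secret_vec:
        coeffs = [s] + [random.randrange(P) for _ in range(t)]
        for x in points:
            y = sum(c * pow(x, e, P) for e, c in enumerate(coeffs)) % P
            shares[x].append(y)
    return shares

def lagrange_at_zero(pts):
    """Interpolate the polynomial through pts = {x: y} and evaluate at 0."""
    total = 0
    for xi, yi in pts.items():
        num, den = 1, 1
        for xj in pts:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

t = 1                           # privacy threshold
points = [1, 2, 3]              # 2t + 1 = 3 shares recover a degree-2 result
gi, gj = [5, 7, 2], [1, 3, 2]   # two toy "gradients" (small integers)

si, sj = share(gi, t, points), share(gj, t, points)
# Each shareholder locally computes its scalar share of ||gi - gj||^2:
dist_shares = {x: sum((a - b) ** 2 for a, b in zip(si[x], sj[x])) % P
               for x in points}
d = lagrange_at_zero(dist_shares)
print(d)  # 32 = (5-1)^2 + (7-3)^2 + 0^2
```

Note the privacy/threshold trade-off visible even in this sketch: squaring doubles the polynomial degree, so $2t + 1$ shares are needed instead of $t + 1$, mirroring the degree-2 recovery threshold above.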
Coded Pairwise Distance
The squared Euclidean distance between two users' gradients, computed on shares via Lagrange Coded Computing without revealing the individual gradients to anyone (server or other users).
Definition: Krum-Style Filtering on Coded Distances
The Krum outlier-detection rule (Blanchard et al. 2017) operates as follows on the matrix of pairwise distances $d_{ij}$:
1. For each user $i$, compute the Krum score $s(i) = \sum_{j \in \mathcal{N}_i} d_{ij}$, where $\mathcal{N}_i$ denotes the $N - D - 1$ users with the smallest $d_{ij}$ (excluding $i$ itself).
2. Identify the $D$ users with the largest $s(i)$ values as the Byzantine set $\mathcal{B}$.
3. Aggregate $\sum_{i \in \mathcal{S}} g_i$ over the surviving honest users $\mathcal{S} = \{1, \dots, N\} \setminus \mathcal{B}$.
The intuition: Byzantine users typically have gradients far from the honest cluster (otherwise they would not be effective Byzantines). Their Krum scores are abnormally large because their nearest-neighbor sums include many "honest" distances that are still large. The honest users, by contrast, have small Krum scores (they cluster around the true gradient).
In ByzSecAgg, this filtering operates on coded distances β the server never sees individual gradients, only the pairwise distance values. The filtering decision is the same as plaintext Krum but without the privacy violation.
The choice of neighborhood size is deliberate: by considering only the $N - D - 1$ nearest users (excluding the furthest, which might be Byzantines themselves), the score is robust to Byzantine influence. Bulyan (El Mhamdi et al. 2018) refines this further; ByzSecAgg can use any Byzantine-resilient aggregator that operates on pairwise distances.
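The rule can be written as a short NumPy sketch. The function names `krum_scores` and `krum_filter` are ours, and the neighborhood size $N - D - 1$ is the one consistent with this section's worked example ($7 = 10 - 2 - 1$):

```python
import numpy as np

def krum_scores(dist, n_byz):
    """Krum score per user: sum of distances to its N - n_byz - 1 nearest
    neighbors (self excluded), from the pairwise-distance matrix `dist`."""
    N = dist.shape[0]
    k = N - n_byz - 1
    return np.array([np.sort(np.delete(dist[i], i))[:k].sum()
                     for i in range(N)])

def krum_filter(dist, n_byz):
    """Flag the n_byz users with the largest Krum scores as Byzantine."""
    scores = krum_scores(dist, n_byz)
    return set(int(u) for u in np.argsort(scores)[-n_byz:])
```

On a distance matrix where one user sits far from a tight honest cluster, `krum_filter` returns exactly that user's index; the server can run this unchanged on the coded distances, since only the matrix $d_{ij}$ is consumed.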
Theorem: Krum Filtering Correctly Identifies Byzantines (Sketch)
Let the honest gradients be drawn from a distribution concentrated around the true gradient $\bar{g}$ (variance $\sigma^2$), and let the $D$ Byzantine gradients be arbitrary. For sufficiently well-separated Byzantine gradients (i.e., distance from $\bar{g}$ exceeding a threshold $\Delta \gg \sigma$ with high probability), Krum's filtering correctly identifies the Byzantine set with probability $1 - e^{-cN}$ for some constant $c > 0$.
The probability of error decays exponentially in $N$, making Krum reliable for moderate-to-large user populations.
A Byzantine user's gradient, if far from $\bar{g}$, is also far from every honest gradient (which clusters around $\bar{g}$), so the Byzantine user's distances to honest users are large. Conversely, an honest user has small distances to its nearest neighbors (mostly honest), giving a small Krum score.
The exponential concentration follows from Chernoff-style inequalities on the empirical distance distribution, a standard concentration-of-measure argument. The point is that as $N$ grows, the filtering becomes effectively deterministic.
Setup
Honest gradients i.i.d. with mean $\bar{g}$ and variance $\sigma^2$; Byzantine gradients arbitrary but assumed at distance at least $\Delta = C\sigma$ from $\bar{g}$ for a sufficiently large constant $C$.
Honest distance concentration
For honest $i, j$, the distance $d_{ij} = O(\sigma^2)$ with high probability (Chernoff). The Krum score $s(i) = O(N\sigma^2)$.
Byzantine distance separation
For Byzantine $b$, the distance to most honest users exceeds $\Delta^2/2$. Hence $s(b) \gtrsim (N - D - 1)\,\Delta^2/2$, much larger than honest scores.
Filtering correctness
With high probability, the $D$ users with the largest scores are Byzantine. Probability of misclassification is at most $e^{-cN}$ for some $c > 0$. Full proof: Blanchard et al. 2017 Thm. 1.
Example: Krum Filter on Coded Distances
Walk through Krum's filtering for $N = 10$ users with $D = 2$ Byzantines, given a hypothetical distance matrix.
Setup
$N = 10$, $D = 2$. Krum looks at the $N - D - 1 = 7$ nearest neighbors for each user.
Hypothetical distances
Imagine users 1–8 have honest gradients with pairwise distances $\approx 1$ (clustered). Users 9 and 10 are Byzantines with gradients far from the honest cluster, at distance $\approx 100$ from every honest user.
Krum scores
Honest user $i$: sum of the 7 nearest distances $\approx 7 \times 1 = 7$.
Byzantine user 9: sum of the 7 nearest, i.e., users 1, 2, …, 7 (all honest at distance $\approx 100$ each): $s(9) \approx 700$.
Same for user 10: $s(10) \approx 700$.
Filtering decision
Sort scores: 8 honest users at $\approx 7$, 2 Byzantines at $\approx 700$. The 2 largest scores belong to users 9 and 10, correctly identified as Byzantine.
Aggregation
Surviving set $\mathcal{S} = \{1, \dots, 8\}$. Server reconstructs $\sum_{i \in \mathcal{S}} g_i$ from ramp shares.
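The walk-through's arithmetic can be checked with a short script. The distance matrix is the hypothetical one above (users re-indexed 0–9, Byzantines at indices 8 and 9):

```python
import numpy as np

# Hypothetical matrix from the walk-through: users 0-7 honest (pairwise
# distance 1), users 8-9 Byzantine (distance 100 to everyone else).
N, n_byz = 10, 2
dist = np.full((N, N), 100.0)
dist[:8, :8] = 1.0
np.fill_diagonal(dist, 0.0)

k = N - n_byz - 1  # 7 nearest neighbors per user
scores = np.array([np.sort(np.delete(dist[i], i))[:k].sum()
                   for i in range(N)])
print(scores)  # honest users score 7.0 each; users 8 and 9 score 700.0 each
```

Sorting the scores and dropping the top two recovers the surviving set of eight honest users, exactly as in the example.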
Coded Pairwise Distance Computation (LCC Specialization)
Complexity: per pair, each helper communicates a single scalar share of $d_{ij}$ and performs $O(d)$ field operations for $d$-dimensional gradients. The construction reuses the standard LCC framework of Chapter 8. The key adaptation is that the input is now ramp-shared (Chapter 3 §3.4) instead of Shamir-shared, which preserves the Byzantine-error-correction structure of the overall scheme. The polynomial degree is 2 (quadratic in the gradients), so the LCC recovery threshold is linear in $K + T$, well within feasible parameters.
Krum Filtering Accuracy vs. Byzantine Fraction
Plot the empirical accuracy of Krum-style filtering (correctly identifying Byzantine users) as a function of the Byzantine fraction $D/N$. As $D/N$ approaches $1/2$, filtering becomes unreliable; for small $D/N$, accuracy approaches 100% with sufficient $N$. The plot illustrates the operational regime where ByzSecAgg's filtering is effective.
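A Monte-Carlo sketch that could generate such a curve. The experimental setup is our assumption (Gaussian honest gradients, Byzantines shifted by a constant offset), not the book's exact parameters:

```python
import numpy as np

def krum_accuracy(N, n_byz, trials=100, dim=20, shift=10.0, seed=1):
    """Fraction of trials in which the n_byz largest Krum scores fall
    exactly on the Byzantine users. Assumed setup: honest ~ N(0, I);
    Byzantine = honest draw plus a constant shift on every coordinate."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        G = rng.normal(size=(N, dim))
        G[-n_byz:] += shift                       # well-separated Byzantines
        dist = ((G[:, None] - G[None, :]) ** 2).sum(-1)
        k = N - n_byz - 1
        s = np.array([np.sort(np.delete(dist[i], i))[:k].sum()
                      for i in range(N)])
        hits += set(np.argsort(s)[-n_byz:]) == set(range(N - n_byz, N))
    return hits / trials
```

Sweeping `n_byz` upward toward $N/2$ at fixed $N$ (and shrinking `shift`) traces the accuracy fall-off described above.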
Common Mistake: Krum Filtering Has a Bias for Sparse Byzantine Patterns
Mistake:
Assume Krum filtering works for arbitrary Byzantine strategies, including coordinated dense attacks.
Correction:
Krum's effectiveness depends on Byzantine gradients being outliers in the distance distribution. If multiple Byzantines coordinate to send gradients that are close to each other (and close to honest gradients but in the wrong direction), their Krum scores can match honest users', and they survive the filter. Bulyan (El Mhamdi et al. 2018) and other refinements address this; in production, ByzSecAgg can use Bulyan or similar robust aggregators on top of the coded distance computation. The information-theoretic guarantees of §11.2 hold regardless of which specific aggregator is used.
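The failure mode is easy to demonstrate with a toy construction of ours: two colluding Byzantines submit the identical gradient, drawn from the honest distribution itself, so their mutual distance is zero and their distances to honest users look honest:

```python
import numpy as np

# Two colluders send the same gradient drawn from the honest distribution:
# their mutual distance is 0 and their honest-user distances look honest,
# so their Krum scores are not outliers and they evade the filter.
rng = np.random.default_rng(0)
honest = rng.normal(size=(8, 20))
colluder = rng.normal(size=20)
G = np.vstack([honest, colluder, colluder])      # users 8 and 9 collude
dist = ((G[:, None] - G[None, :]) ** 2).sum(-1)

k = 10 - 2 - 1
scores = np.array([np.sort(np.delete(dist[i], i))[:k].sum()
                   for i in range(10)])
# scores[8] == scores[9]: identical submissions give identical scores,
# each lowered by the free zero-distance to the twin.
```

Because each colluder's neighborhood includes a zero distance to its twin, its score tends to be at or below the honest range, so the two largest scores fall on honest users and the attack survives; this is exactly the gap Bulyan-style post-processing targets.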
Krum, Bulyan, Trimmed Mean: Choices Within ByzSecAgg
ByzSecAgg's outlier-detection step (Phase 4) is pluggable. Production deployments choose among:
- Krum: Simple and fast, with $O(N^2 d)$ score computation. Works well against uncoordinated Byzantines when $N > 2D + 2$.
- Bulyan: Krum + trimmed-mean refinement. Better tolerance against coordinated attacks, at higher computational cost. Industry default for Byzantine FL.
- Trimmed Mean (Yin et al. 2018): Coordinate-wise removal of a fixed fraction of extremes per tail. Lighter computation; loses some information-theoretic privacy because per-coordinate operations interact with ramp shares.
- FedSeg (Sun et al. 2021): ByzSecAgg-compatible variant with explicit segment-based filtering.
The choice depends on the attack model and the per-aggregator complexity budget. ByzSecAgg's information-theoretic privacy holds for any aggregator that operates on the pairwise distance matrix.
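For reference, the coordinate-wise trimmed mean can be sketched in a few lines (our minimal version of the Yin et al. 2018 rule; `beta` is the assumed trim fraction per tail):

```python
import numpy as np

def trimmed_mean(G, beta):
    """Coordinate-wise trimmed mean: per coordinate, drop the beta-fraction
    smallest and beta-fraction largest values, then average the rest.
    G has shape (N, d): one row per user's gradient."""
    N = G.shape[0]
    m = int(np.floor(beta * N))
    S = np.sort(G, axis=0)          # sort each coordinate independently
    return S[m:N - m].mean(axis=0)
```

With `beta = D/N`, up to $D$ malicious extremes per coordinate are guaranteed to be trimmed. Note the contrast with Krum: this rule needs per-coordinate access to the (shared) gradients rather than just the distance matrix, which is the source of the privacy caveat above.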
Historical Note: The Byzantine Aggregator Family Tree
2017–2023. The Byzantine-resilient aggregator family in FL began with Krum (Blanchard et al. 2017), inspired by median-style robust statistics from the distributed-estimation literature. Yin et al. (2018) introduced Trimmed Mean, which has tighter statistical properties for sub-Gaussian noise. Bulyan (El Mhamdi et al. 2018) combined Krum with trimmed-mean post-processing for stronger guarantees.
Until ByzSecAgg (2023), all these aggregators operated on plaintext gradients, incompatible with secure aggregation's privacy guarantee. The coded-distance trick of §11.3, lifted from Lagrange Coded Computing (Yu et al. 2019, Chapter 8), enabled the combination. ByzSecAgg showed that any distance-based Byzantine aggregator can be combined with secure aggregation via this technique, a structural insight that has shaped the post-2023 literature.
Key Takeaway
Coded distance computation enables Byzantine detection without privacy violation. By computing pairwise distances on ramp shares using Lagrange Coded Computing, the server can apply any distance-based Byzantine aggregator (Krum, Bulyan, Trimmed Mean) without learning individual gradients. The construction is Chapter 8's LCC framework specialized to quadratic distance functions.
Quick Check
In ByzSecAgg, the server computes pairwise distances via:
Plaintext gradient inspection by the server.
Lagrange Coded Computing on ramp shares: each user computes a quadratic LCC contribution on its shares; the server interpolates the distance.
Encrypted gradient computation via homomorphic encryption.
Random projections that approximately preserve distances.
Exactly. LCC handles the degree-2 distance function with recovery threshold $2(K + T - 1) + 1$, well within ByzSecAgg's parameters.