Coded Outlier Detection via Distance Aggregation

Computing Distances Without Seeing Gradients

Section 11.2 sketched ByzSecAgg's six-phase protocol and identified the core challenge: detecting Byzantine users requires some information about gradients, but privacy demands no leakage of individual values. The middle ground is to compute pairwise distances $\|\mathbf{g}_i - \mathbf{g}_j\|^2$ — a quadratic function of the gradients — without revealing the gradients themselves.

The key observation is that pairwise distances suffice for Krum-style outlier detection: Byzantine users typically have abnormal distance profiles (far from the honest cluster). Computing distances on shares (rather than on plaintext) preserves privacy while enabling filtering. This is exactly what Lagrange Coded Computing (Chapter 8 §8.3) does for general polynomial functions.

Section 11.3 develops the coded distance computation in detail. The construction is one of the few instances where Chapter 8's LCC framework is used in production-relevant settings β€” a direct application of coded computing to privacy-preserving machine learning.

Definition: Coded Pairwise Distance Computation

Given users $1, \ldots, n$ with ramp-shared gradients $\{\mathbf{g}_k\}$ over $\mathbb{F}_q$, the coded pairwise distance between users $i$ and $j$ is

$$d_{ij} \;=\; \|\mathbf{g}_i - \mathbf{g}_j\|^2 \;=\; \sum_{\ell=1}^{d} (g_{i,\ell} - g_{j,\ell})^2.$$

This is a quadratic function of the gradients — degree $d_f = 2$ in the LCC framework (Chapter 8 §8.3). The coded distance computation reuses the LCC infrastructure:

  1. Each user $k$ shares $\mathbf{g}_k$ via ramp sharing into pieces $\{s_{k,\ell}\}_{\ell}$ held by other users.
  2. Each pair of users $(i, j)$ engages in an LCC-style exchange: compute the share of $d_{ij}$ from the shares of $\mathbf{g}_i$ and $\mathbf{g}_j$ via polynomial multiplication and Lagrange interpolation.
  3. The server collects the $d_{ij}$ values for all pairs without learning any individual gradient.

The construction works because $d_{ij}$ is a polynomial of degree 2 in the inputs, and the LCC recovery threshold (Chapter 8 §8.3 Thm. 1) is $K_{\text{rec}} = d_f(n - 1) + 1 = 2(n - 1) + 1$. For ByzSecAgg's parameters, this is well within the available ramp shares.
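The recovery step can be illustrated with a minimal sketch in which a Shamir-style degree-$t$ sharing stands in for ByzSecAgg's ramp sharing (the field, degree, vectors, and party count here are toy choices, not the paper's parameters). Each party squares the difference of its local shares; that value is the evaluation of a degree-$2t$ polynomial whose constant term is $d_{ij}$, so interpolating $2t + 1$ evaluations at $x = 0$ recovers the distance without any party seeing a plaintext gradient:

```python
# Illustrative sketch only: degree-t Shamir sharing in place of ramp
# sharing; toy prime field and toy gradient vectors.
import random

P = 2_147_483_647   # prime modulus for the field F_p
T = 1               # degree of each sharing polynomial

def share(vec, n_parties, t=T):
    """Shamir-share each coordinate; party x (1..n_parties) gets one share vector."""
    shares = [[0] * len(vec) for _ in range(n_parties)]
    for c, secret in enumerate(vec):
        coeffs = [secret] + [random.randrange(P) for _ in range(t)]
        for x in range(1, n_parties + 1):
            shares[x - 1][c] = sum(a * pow(x, k, P) for k, a in enumerate(coeffs)) % P
    return shares

def lagrange_at_zero(points):
    """Interpolate the polynomial through (x, y) points and evaluate it at 0."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

g_i = [5, 1, 7]
g_j = [2, 3, 4]
n_parties = 2 * T + 1            # degree-2T product polynomial: 2T+1 evaluations
si = share(g_i, n_parties)
sj = share(g_j, n_parties)
# Each party squares the difference of its local share vectors (degree-2 step).
evals = [(x, sum((a - b) ** 2 for a, b in zip(si[x - 1], sj[x - 1])) % P)
         for x in range(1, n_parties + 1)]
d_ij = lagrange_at_zero(evals)
print(d_ij)   # 3^2 + 2^2 + 3^2 = 22
```

The real scheme replaces the Shamir sharing with ramp sharing and correspondingly different thresholds, but the degree-doubling and interpolation structure is the same.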


Coded Pairwise Distance

The squared Euclidean distance $\|\mathbf{g}_i - \mathbf{g}_j\|^2$ between two users' gradients, computed on shares via Lagrange Coded Computing without revealing the individual gradients to anyone (server or other users).

Definition: Krum-Style Filtering on Coded Distances

The Krum outlier-detection rule (Blanchard et al. 2017) operates as follows on the matrix of pairwise distances $\{d_{ij}\}_{i,j}$:

  1. For each user $k$, compute the Krum score $S_k = \sum_{j \in \text{nearest}(k,\, n - B - 1)} d_{kj}$, where $\text{nearest}(k, m)$ denotes the $m$ users with the smallest $d_{kj}$ (excluding $k$ itself).

  2. Identify the $B$ users with the largest $S_k$ values as the Byzantine set $\mathcal{B}$.

  3. Aggregate over the surviving honest users $\mathcal{H} = [n] \setminus \mathcal{B}$.

The intuition: Byzantine users typically have gradients far from the honest cluster (otherwise they would not be effective Byzantines). Their Krum scores are abnormally large because their nearest-neighbor sums include many "honest" distances that are still large. The honest users, by contrast, have small Krum scores (they cluster around the true gradient).

In ByzSecAgg, this filtering operates on coded distances β€” the server never sees individual gradients, only the pairwise distance values. The filtering decision is the same as plaintext Krum but without the privacy violation.

The choice $\text{nearest}(k, n - B - 1)$ is deliberate: by considering only the $n - B - 1$ nearest users (excluding the $B$ furthest, which might be Byzantines themselves), the score is robust to Byzantine influence. Bulyan (El Mhamdi et al. 2018) refines this further; ByzSecAgg can use any Byzantine-resilient aggregator that operates on pairwise distances.


Theorem: Krum Filtering Correctly Identifies Byzantines (Sketch)

Let the honest gradients $\{\mathbf{g}_k : k \in \mathcal{H}\}$ be drawn from a distribution concentrated around the true gradient $\mathbf{g}^*$ (variance $\sigma^2$), and let the Byzantine gradients be arbitrary. For sufficiently well-separated Byzantine gradients (i.e., distance from $\mathbf{g}^*$ exceeding a threshold $\Theta(\sigma)$ with high probability), Krum's filtering correctly identifies the Byzantine set $\mathcal{B}$ with probability $\geq 1 - 2\exp(-cn)$ for some constant $c > 0$.

The probability of error decays exponentially in nn, making Krum reliable for moderate-to-large user populations.

A Byzantine user's gradient, if far from $\mathbf{g}^*$, is also far from every honest gradient (which clusters around $\mathbf{g}^*$), so the Byzantine user's distances to honest users are large. Conversely, an honest user has small distances to its $n - B - 1$ nearest others (mostly honest), giving a small Krum score.

The exponential concentration follows from Chernoff-style inequalities on the empirical distance distribution — a standard concentration-of-measure argument. The point is that as $n$ grows, the filtering becomes effectively deterministic.
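As a hedged sketch of how the $2\exp(-cn)$ bound can arise (the threshold $\tau$ and the constants $c_1, c_2$ are illustrative assumptions, not taken from the source), one can union-bound the two failure modes:

```latex
\Pr[\text{filtering error}]
  \;\le\; \Pr\big[\exists\, k \in \mathcal{H} : S_k > \tau\big]
        + \Pr\big[\exists\, k \in \mathcal{B} : S_k \le \tau\big]
  \;\le\; |\mathcal{H}|\, e^{-c_1 n} + |\mathcal{B}|\, e^{-c_2 n}
  \;\le\; 2\, e^{-c n},
```

where each per-user tail probability comes from a Chernoff-style bound on the sum of $n - B - 1$ concentrated squared distances, and the last step absorbs the linear prefactors into $c$ for sufficiently large $n$.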

Example: Krum Filter on Coded Distances ($n = 10$, $B = 2$)

Walk through Krum's filtering for $n = 10$ users with $B = 2$ Byzantines, given a hypothetical distance matrix.
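A minimal sketch of this walk-through, using a hypothetical distance matrix: honest users $0$–$7$ sit at mutual distance $1$, while Byzantine users $8$–$9$ are at distance $100$ from everyone (the numbers are illustrative, not from the source):

```python
# Hypothetical distance matrix for n = 10, B = 2: users 0..7 honest,
# users 8 and 9 Byzantine (far from the honest cluster).
n, B = 10, 2
D = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            D[i][j] = 1 if (i < 8 and j < 8) else 100

m = n - B - 1                        # sum over the 7 nearest neighbors
scores = [sum(sorted(D[k][j] for j in range(n) if j != k)[:m])
          for k in range(n)]
byz = sorted(range(n), key=lambda k: scores[k], reverse=True)[:B]
print(scores)        # honest users score 7, Byzantine users score 700
print(sorted(byz))   # [8, 9]
```

Each honest user's seven nearest neighbors are all honest (score $7 \cdot 1 = 7$), while a Byzantine user's seven nearest distances are all $100$ (score $700$), so the two largest scores pick out exactly the Byzantine set.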

Coded Pairwise Distance Computation (LCC Specialization)

Complexity: per pair, $O(d/g + n \log n)$ communication and $O(n)$ field operations.
Input: Ramp shares $\{s_{k,\ell}\}$ of each user $k$'s gradient $\mathbf{g}_k$, distributed among the $n - 1$ other users; a pair $(i, j)$ for distance computation.
Output: Pairwise distance $d_{ij} = \|\mathbf{g}_i - \mathbf{g}_j\|^2$ at the server.

1. **Each user $\ell$ holds shares $s_{i,\ell}$ and $s_{j,\ell}$.**
2. Local LCC computation: user $\ell$ computes $\hat d^{(\ell)} = (s_{i,\ell} - s_{j,\ell})^T (s_{i,\ell} - s_{j,\ell}) = \|s_{i,\ell} - s_{j,\ell}\|^2$ — a quadratic function of the share vectors.
3. Each user uploads $\hat d^{(\ell)}$ to the server.
4. The server interpolates $d_{ij}$ from the collected $\{\hat d^{(\ell)}\}$. By the LCC recovery threshold (Chapter 8 §8.3) for degree-2 functions, $K_{\text{rec}} = 2(n - 1) + 1 = 2n - 1$ evaluations suffice; ByzSecAgg's ramp-sharing parameters are chosen so that at least this many share evaluations are available to the server.
5. Output $d_{ij}$.

The construction reuses the standard LCC framework of Chapter 8. The key adaptation is that the input is now ramp-shared (Chapter 3 §3.4) rather than Shamir-shared, which preserves the Byzantine-error-correction structure of the overall scheme. The polynomial degree is $d_f = 2$ (quadratic in the gradients), so the LCC recovery threshold is linear in $n$ — feasible.

Krum Filtering Accuracy vs. Byzantine Fraction

Plot the empirical accuracy of Krum-style filtering (correctly identifying Byzantine users) as a function of the Byzantine fraction $B/n$. As $B/n$ approaches $0.5$, filtering becomes unreliable; for $B/n \leq 0.25$, accuracy approaches 100% with sufficient $n$. The plot illustrates the operational regime where ByzSecAgg's filtering is effective.


Common Mistake: Krum Filtering Has a Bias for Sparse Byzantine Patterns

Mistake:

Assume Krum filtering works for arbitrary Byzantine strategies, including coordinated dense attacks.

Correction:

Krum's effectiveness depends on Byzantine gradients being outliers in the distance distribution. If multiple Byzantines coordinate to send gradients that are close to each other (and close to honest gradients but in the wrong direction), their Krum scores can match honest users' — and they survive the filter. Bulyan (El Mhamdi et al. 2018) and other refinements address this; in production, ByzSecAgg can use Bulyan or similar robust aggregators on top of the coded distance computation. The information-theoretic guarantees of §11.2 hold regardless of which specific aggregator is used.

🔧 Engineering Note

Krum, Bulyan, Trimmed Mean: Choices Within ByzSecAgg

ByzSecAgg's outlier-detection step (Phase 4) is pluggable. Production deployments choose among:

  • Krum: Simple, fast, $O(n^2)$ score computation. Works well for $B/n < 0.25$.
  • Bulyan: Krum + trimmed-mean refinement. Better tolerance against coordinated attacks, $O(n^2)$ computation. Industry default for Byzantine FL.
  • Trimmed Mean (Yin et al. 2018): Coordinate-wise removal of $\beta$-fraction extremes. Lighter computation; loses some information-theoretic privacy because per-coordinate operations interact with ramp shares.
  • FedSeg (Sun et al. 2021): ByzSecAgg-compatible variant with explicit segment-based filtering.

The choice depends on the attack model and the per-aggregator complexity budget. ByzSecAgg's information-theoretic privacy holds for any aggregator that operates on the pairwise distance matrix.
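For concreteness, a minimal sketch of the coordinate-wise trimmed mean mentioned above (plain floating-point gradients for illustration; the function name and test vectors are hypothetical, not from ByzSecAgg):

```python
# Sketch of coordinate-wise trimmed mean (Yin et al. 2018): drop the
# beta-fraction smallest and largest values in each coordinate, then
# average what remains.
def trimmed_mean(gradients, beta):
    n = len(gradients)
    k = int(beta * n)                    # values trimmed from each tail
    out = []
    for c in range(len(gradients[0])):
        col = sorted(g[c] for g in gradients)
        kept = col[k:n - k]              # middle (1 - 2*beta) fraction
        out.append(sum(kept) / len(kept))
    return out

# One outlier user; beta = 0.25 trims one value from each tail per coordinate.
grads = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -100.0]]
print(trimmed_mean(grads, 0.25))   # [2.5, 15.0]
```

The outlier's extreme values land in the trimmed tails of each coordinate, so the aggregate is unaffected by them.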

Practical Constraints
  • Krum: simplest, $B/n < 0.25$
  • Bulyan: more robust, slightly higher cost
  • Choice depends on attack model

📋 Ref: Blanchard et al. 2017; El Mhamdi et al. 2018; Yin et al. 2018

Historical Note: The Byzantine Aggregator Family Tree

2017–2023

The Byzantine-resilient aggregator family in FL began with Krum (Blanchard et al. 2017), inspired by Median-style robust statistics from the distributed-estimation literature. Yin et al. (2018) introduced Trimmed Mean, which has tighter statistical properties for sub-Gaussian noise. Bulyan (El Mhamdi et al. 2018) combined Krum with trimmed-mean post-processing for stronger guarantees.

Until ByzSecAgg (2023), all these aggregators operated on plaintext gradients — incompatible with secure aggregation's privacy guarantee. The coded-distance trick of §11.3, lifted from Lagrange Coded Computing (Yu et al. 2019, Chapter 8), enabled the combination. ByzSecAgg showed that any distance-based Byzantine aggregator can be combined with secure aggregation via this technique — a structural insight that has shaped the post-2023 literature.


Key Takeaway

Coded distance computation enables Byzantine detection without privacy violation. By computing pairwise distances on ramp shares using Lagrange Coded Computing, the server can apply any distance-based Byzantine aggregator (Krum, Bulyan, Trimmed Mean) without learning individual gradients. The construction is Chapter 8's LCC framework specialized to quadratic distance functions.

Quick Check

In ByzSecAgg, the server computes pairwise distances $d_{ij} = \|\mathbf{g}_i - \mathbf{g}_j\|^2$ via:

Plaintext gradient inspection by the server.

Lagrange Coded Computing on ramp shares β€” each user computes a quadratic LCC contribution on its shares; server interpolates the distance.

Encrypted gradient computation via homomorphic encryption.

Random projections that approximately preserve distances.