Polynomial Codes and the Recovery Threshold

The Polynomial-Code Idea in One Sentence

Encode the blocks of $\mathbf{A}$ and $\mathbf{B}$ into two polynomials $p_A(x)$ and $p_B(x)$ such that the product $p_A(x) \cdot p_B(x)$, when evaluated at $N$ distinct points, yields $N$ matrix products that interpolate back to the full output $\mathbf{A}^T \mathbf{B}$. Any $K = pq$ of the $N$ evaluations suffice.

The point is that this reuses the machinery of Shamir secret sharing (Chapter 3): a polynomial of degree $d$ is uniquely determined by $d + 1$ evaluations. The only new idea is choosing the exponents of $p_A, p_B$ so that the product's coefficients are exactly the block products $\mathbf{C}_{ij}$. Once the exponent choice is right, the rest follows from Lagrange interpolation.

Definition:

Polynomial Code (Yu–Maddah-Ali–Avestimehr)

The polynomial code with partition counts $(p, q)$ and $N$ workers over $\mathbb{F}_q$ (requiring $q \geq N + 1$, so that $N$ distinct nonzero evaluation points exist) consists of:

  • Evaluation points. Fix distinct $\alpha_1, \alpha_2, \ldots, \alpha_N \in \mathbb{F}_q^*$ (one per worker).
  • Encoding polynomials. $$p_A(x) = \sum_{i=0}^{p-1} \mathbf{A}_{i+1} \, x^{i}, \qquad p_B(x) = \sum_{j=0}^{q-1} \mathbf{B}_{j+1} \, x^{p \cdot j}.$$ The exponent pattern $(1, p)$ is chosen so that the product $p_A(x) p_B(x)$ enumerates all $(i, j)$ pairs with distinct exponents.
  • Storage at worker $k$. $\tilde{\mathbf{A}}_k = p_A(\alpha_k)$, $\tilde{\mathbf{B}}_k = p_B(\alpha_k)$.
  • Worker computation. $\tilde{\mathbf{C}}_k = \tilde{\mathbf{A}}_k^T \tilde{\mathbf{B}}_k = p_A(\alpha_k)^T p_B(\alpha_k) = p_C(\alpha_k)$, where $$p_C(x) \;\triangleq\; p_A(x)^T p_B(x) \;=\; \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \mathbf{C}_{i+1, j+1} \, x^{i + pj}.$$ $p_C$ is a polynomial of degree $p - 1 + p(q-1) = pq - 1$ whose coefficients are exactly the $pq$ desired output blocks.
  • Decoder. The master receives $\tilde{\mathbf{C}}_{k_1}, \ldots, \tilde{\mathbf{C}}_{k_K}$ for any $K = pq$ workers $\mathcal{T} = \{k_1, \ldots, k_K\}$ that have responded, and interpolates $p_C(x)$ via Lagrange on the points $\{(\alpha_{k_\ell}, \tilde{\mathbf{C}}_{k_\ell})\}_{\ell}$. The coefficients of $p_C$ are read off as the output blocks.
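As a sanity check on the exponent choice, here is a minimal Python sketch using scalar ($1 \times 1$) blocks, so that block products are plain field multiplications; the partition counts $p = 2$, $q = 3$ and the field $\mathbb{F}_{101}$ are illustrative choices, not fixed by the text.

```python
# Sanity check of the exponent trick with scalar (1x1) blocks, so block
# products are plain field multiplications. p = 2, q = 3 partitions and
# the field F_101 are illustrative choices.
F = 101                       # prime field size
p, q_part = 2, 3              # partition counts (p, q)
A = [3, 7]                    # blocks A_1..A_p
B = [2, 5, 11]                # blocks B_1..B_q

# Coefficient vectors: p_A puts A_{i+1} at exponent i,
# p_B puts B_{j+1} at exponent p*j.
pA = A[:]
pB = [0] * (p * (q_part - 1) + 1)
for j in range(q_part):
    pB[p * j] = B[j]

# Multiply the polynomials: p_C = p_A * p_B (coefficient convolution).
pC = [0] * (len(pA) + len(pB) - 1)
for e1, a in enumerate(pA):
    for e2, b in enumerate(pB):
        pC[e1 + e2] = (pC[e1 + e2] + a * b) % F

# Every pair (i, j) lands on the distinct exponent i + p*j, so each
# coefficient of p_C is exactly one block product.
for i in range(p):
    for j in range(q_part):
        assert pC[i + p * j] == (A[i] * B[j]) % F
```

Note that $\deg p_C = pq - 1 = 5$ here, so the six coefficients fill the whole polynomial, and any six evaluations would recover them.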

Polynomial Code

A coded-matrix-multiplication scheme in which the input matrices are encoded as polynomials $p_A(x), p_B(x)$, and each worker receives their values at a distinct point $\alpha_k$. The product polynomial $p_C(x)$ has the output blocks as its coefficients, recoverable by Lagrange interpolation of any $pq$ evaluations.

Polynomial-Code Encoding

Complexity: $O(N \cdot p \cdot md/p + N \cdot q \cdot md'/q) = O(N \cdot m(d + d'))$
Input: Matrices $\mathbf{A} \in \mathbb{F}_q^{m \times d}$, $\mathbf{B} \in \mathbb{F}_q^{m \times d'}$, partitions $(p, q)$, $N$ distinct $\alpha_1, \ldots, \alpha_N \in \mathbb{F}_q^*$.
Output: Per-worker encoded matrices $\{(\tilde{\mathbf{A}}_k, \tilde{\mathbf{B}}_k)\}_{k=1}^N$.
1. Partition $\mathbf{A} = [\mathbf{A}_1, \ldots, \mathbf{A}_p]$ and $\mathbf{B} = [\mathbf{B}_1, \ldots, \mathbf{B}_q]$.
2. for $k = 1, 2, \ldots, N$ do
3. $\quad \tilde{\mathbf{A}}_k \leftarrow \sum_{i=0}^{p-1} \alpha_k^{i}\, \mathbf{A}_{i+1}$
4. $\quad \tilde{\mathbf{B}}_k \leftarrow \sum_{j=0}^{q-1} \alpha_k^{pj}\, \mathbf{B}_{j+1}$
5. end for
6. return $\{(\tilde{\mathbf{A}}_k, \tilde{\mathbf{B}}_k)\}$

Encoding is linear in the input size per worker. For large matrices this is dominated by the worker's local matrix multiply, not the encoding.
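The loop above translates directly to code. A sketch, assuming each block is given as a list of rows over a prime field $F$; the function and helper names are illustrative:

```python
# A direct transcription of the encoding loop, assuming matrices are given
# as lists of column-blocks (A_1..A_p, B_1..B_q) of equal shape, with
# entries in a prime field F. Names are illustrative.
def encode(A_blocks, B_blocks, alphas, F):
    """Return [(A~_k, B~_k)] for each evaluation point alpha_k."""
    p, q_part = len(A_blocks), len(B_blocks)

    def lincomb(blocks, coeffs):
        # Entry-wise sum of coeff * block over the block list, mod F.
        rows, cols = len(blocks[0]), len(blocks[0][0])
        out = [[0] * cols for _ in range(rows)]
        for c, blk in zip(coeffs, blocks):
            for r in range(rows):
                for s in range(cols):
                    out[r][s] = (out[r][s] + c * blk[r][s]) % F
        return out

    encoded = []
    for a in alphas:
        cA = [pow(a, i, F) for i in range(p)]            # exponents i
        cB = [pow(a, p * j, F) for j in range(q_part)]   # exponents p*j
        encoded.append((lincomb(A_blocks, cA), lincomb(B_blocks, cB)))
    return encoded
```

Each worker's pair is one linear combination of the blocks, matching the $O(N \cdot m(d + d'))$ bound stated above.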

Polynomial-Code Decoding via Lagrange

Complexity: $O(pq \cdot K^2)$ scalar operations, or $O(K \log^2 K)$ with FFTs
Input: $K$ worker responses $\{(\alpha_k, \tilde{\mathbf{C}}_k)\}_{k \in \mathcal{T}}$, $|\mathcal{T}| = pq$.
Output: Output blocks $\{\mathbf{C}_{i, j}\}_{i, j}$ and hence $\mathbf{A}^T \mathbf{B}$.
1. For each $k \in \mathcal{T}$, expand the Lagrange basis polynomial $\ell_k(x) = \prod_{j \in \mathcal{T}, j \neq k} \frac{x - \alpha_j}{\alpha_k - \alpha_j}$ and record its coefficients $w_{k,\lambda} = [x^\lambda]\, \ell_k(x)$ for $\lambda \in \{0, \ldots, pq - 1\}$.
2. for $\lambda = 0, 1, \ldots, pq - 1$ do
3. $\quad \mathbf{c}_\lambda \leftarrow \sum_{k \in \mathcal{T}} w_{k,\lambda} \, \tilde{\mathbf{C}}_k$
4. $\quad (i, j) \leftarrow (\lambda \bmod p, \lfloor \lambda / p \rfloor)$
5. $\quad \mathbf{C}_{i+1, j+1} \leftarrow \mathbf{c}_\lambda$
6. end for
7. return $\{\mathbf{C}_{i, j}\}$
Step 3 is correct because interpolation gives $p_C(x) = \sum_{k \in \mathcal{T}} \ell_k(x)\, \tilde{\mathbf{C}}_k$, so the coefficient of $x^\lambda$ — the block $\mathbf{C}_{i+1, j+1}$ with $\lambda = i + pj$ — is $\sum_k w_{k,\lambda}\, \tilde{\mathbf{C}}_k$.

The decoder is identical to Reed–Solomon erasure decoding. For large $pq$, FFT-based Lagrange interpolation reduces the cost from $O(K^2)$ to $O(K \log^2 K)$. In production the Lagrange coefficients can be pre-computed for the expected straggler patterns.
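A Python sketch of the coefficient-extraction decoder, assuming a prime field and scalar responses (matrix blocks decode entry-wise in the same way); all names are illustrative:

```python
# Sketch of Lagrange-based coefficient extraction over a prime field F,
# with scalar responses. Names are illustrative.
def lagrange_coeffs(alphas, F):
    """W[k][lam] = [x^lam] ell_k(x) for each Lagrange basis polynomial."""
    W = []
    for k, ak in enumerate(alphas):
        num, denom = [1], 1              # expand prod_{j != k} (x - a_j)
        for j, aj in enumerate(alphas):
            if j == k:
                continue
            nxt = [0] * (len(num) + 1)   # multiply num by (x - aj)
            for e, c in enumerate(num):
                nxt[e] = (nxt[e] - aj * c) % F
                nxt[e + 1] = (nxt[e + 1] + c) % F
            num, denom = nxt, (denom * (ak - aj)) % F
        inv = pow(denom, F - 2, F)       # Fermat inverse of the denominator
        W.append([c * inv % F for c in num])
    return W

def decode(alphas, responses, F):
    """Interpolate p_C and return its coefficients c_0 .. c_{K-1}."""
    W, K = lagrange_coeffs(alphas, F), len(alphas)
    return [sum(W[k][lam] * responses[k] for k in range(K)) % F
            for lam in range(K)]

# Demo: recover p(x) = 3 + x + 4x^2 over F_7 from its values at 1, 2, 3.
assert decode([1, 2, 3], [1, 0, 0], 7) == [3, 1, 4]
```

The per-subset matrix `W` is exactly what a production decoder would cache for recurring straggler patterns.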

Theorem: Polynomial Code Achieves Recovery Threshold $K = pq$

The polynomial code with $N$ workers and partition counts $(p, q)$ over $\mathbb{F}_q$ (with $q \geq N + 1$) satisfies:

  1. Correctness. For every input $(\mathbf{A}, \mathbf{B})$ and every subset $\mathcal{T} \subseteq [N]$ with $|\mathcal{T}| = pq$, the decoder on responses $\{\tilde{\mathbf{C}}_k\}_{k \in \mathcal{T}}$ outputs the exact product $\mathbf{A}^T \mathbf{B}$.
  2. Storage. Each worker stores $|\mathbf{A}|/p + |\mathbf{B}|/q$ bits, for per-worker storage load $\mu = 1/p + 1/q$.
  3. Recovery threshold. $K = pq$, matching the information-theoretic lower bound of Chapter 4, §4.2.

The scheme is optimal within the class of $\mu = 1/p + 1/q$-storage schemes: no distributed matrix-multiplication scheme with the same per-worker storage can achieve a smaller recovery threshold.

Each worker's output is one evaluation of the degree-$(pq - 1)$ polynomial $p_C(x)$; $pq$ evaluations interpolate the whole polynomial, and the coefficients are exactly the $pq$ desired output blocks. The matching lower bound (§5.3) comes from a cut-set argument showing any scheme storing $\mu |\mathbf{A}| + \mu |\mathbf{B}|$ per worker must make at least $pq$ "independent contributions" available to the master.

Key Takeaway

Polynomial codes achieve $(K, \mu) = (pq, 1/p + 1/q)$, optimal for coded matrix multiplication. Straggler tolerance $N - K = N - pq$ can be set to any nonnegative value just by choosing $N$: $N = pq$ gives no slack (uncoded threshold), larger $N$ buys tolerance at the cost of redundancy. The same storage supports any $N$, a flexibility that uncoded schemes simply do not have.

Example: Polynomial Code for $p = q = 2$, $N = 5$

Work out the polynomial code for $p = q = 2$ and $N = 5$ workers with evaluation points $\alpha_k = k$ over $\mathbb{F}_7$. Verify the recovery threshold $K = 4$ by showing that any 4 workers suffice.
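One way to carry out this exercise numerically, using scalar ($1 \times 1$) blocks so that $\mathbf{A}^T \mathbf{B}$ reduces to the four products $\mathbf{A}_i \mathbf{B}_j$ in $\mathbb{F}_7$; the specific block values are illustrative:

```python
# Exercise check: p = q = 2, N = 5, alpha_k = k over F_7, scalar blocks.
from itertools import combinations

F = 7
p = 2
A = [1, 2]                        # blocks A_1, A_2 (illustrative values)
B = [3, 4]                        # blocks B_1, B_2
alphas = [1, 2, 3, 4, 5]          # alpha_k = k, N = 5

def pC_at(x):
    """Worker at point x returns p_A(x) * p_B(x) = p_C(x)."""
    pA = (A[0] + A[1] * x) % F                 # exponents 0, 1
    pB = (B[0] + B[1] * pow(x, p, F)) % F      # exponents 0, p
    return (pA * pB) % F

# Coefficient lambda = i + p*j holds the block product A_{i+1} B_{j+1}.
target = [(A[i] * B[j]) % F for j in range(2) for i in range(2)]

def interpolate(pts, vals):
    """Coefficients of the degree-(K-1) polynomial through (pts, vals)."""
    K = len(pts)
    coeffs = [0] * K
    for k, ak in enumerate(pts):
        num, denom = [1], 1
        for j, aj in enumerate(pts):
            if j == k:
                continue
            nxt = [0] * (len(num) + 1)         # multiply num by (x - aj)
            for e, c in enumerate(num):
                nxt[e] = (nxt[e] - aj * c) % F
                nxt[e + 1] = (nxt[e + 1] + c) % F
            num, denom = nxt, (denom * (ak - aj)) % F
        inv = pow(denom, F - 2, F)
        for lam in range(K):
            coeffs[lam] = (coeffs[lam] + num[lam] * inv * vals[k]) % F
    return coeffs

# Any K = 4 of the 5 workers recover all four block products.
for subset in combinations(alphas, 4):
    assert interpolate(list(subset), [pC_at(a) for a in subset]) == target
```

All $\binom{5}{4} = 5$ straggler patterns decode to the same four products, confirming $K = 4$.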

Polynomial Code: Encoding and Decoding

Animation of the polynomial-code construction. The master partitions $\mathbf{A}$ and $\mathbf{B}$, forms the encoding polynomials, and sends one evaluation to each worker. Workers compute local products; any $K = pq$ of them interpolate back to the polynomial whose coefficients are the output blocks.

Recovery Threshold Comparison: Polynomial vs. MDS vs. Uncoded

Plot the recovery threshold $K$ as a function of the partition count $p$ (with $q = p$) for three schemes: polynomial code ($K = pq = p^2$), MDS-coded replication ($K = 2p - 1$), and uncoded ($K = p^2$, same as polynomial but without straggler flexibility). The polynomial code dominates in flexibility, while MDS achieves a smaller threshold but requires more per-worker storage.
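The numbers behind the plot can be tabulated directly from the formulas quoted above; the range of $p$ is an illustrative choice:

```python
# Tabulate the recovery thresholds quoted above for q = p.
rows = [(p, p * p, 2 * p - 1) for p in range(2, 9)]
print(f"{'p':>3} {'polynomial K':>13} {'MDS K':>6}")
for p, k_poly, k_mds in rows:
    print(f"{p:>3} {k_poly:>13} {k_mds:>6}")
```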

Parameters: range of column-partition counts to plot: 8; number of workers (sets straggler budget): 24.

Common Mistake: Field Size Matters

Mistake:

Use a small field $\mathbb{F}_q$ (e.g., $q = 2$ or $q = 8$) with $N = 16$ workers.

Correction:

The polynomial code requires $N$ distinct nonzero evaluation points, which are elements of $\mathbb{F}_q^*$. For $N = 16$, $q \geq 17$ is needed (or a field extension such as $\mathbb{F}_{2^5} = \mathbb{F}_{32}$). Practical deployments use large prime fields ($q = 2^{32} - 5$, etc.) or $\mathbb{F}_{2^{64}}$ to have ample evaluation points and room for privacy padding.
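The constraint is just a count of nonzero field elements; a tiny sketch with a hypothetical helper name:

```python
# Does a field of size q have enough nonzero evaluation points for N
# workers? Need q - 1 >= N. Helper name is illustrative.
def enough_points(q, N):
    return q - 1 >= N

assert not enough_points(8, 16)    # F_8: only 7 nonzero points
assert enough_points(17, 16)       # F_17 suffices
assert enough_points(32, 16)       # F_32 suffices
```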

🔧Engineering Note

Decoder Cost in Practice

Naive Lagrange decoding is $O(K^2)$ per output scalar, which for a $10^4 \times 10^4$ product with $K = 16$ amounts to $\sim 10^{10}$ operations: a few seconds on a single CPU core, and comparable to the master's baseline in an uncoded scheme. For larger $K$ (say $K = 100$ in ByzSecAgg, Chapter 11), FFT-based decoding reduces the cost to $O(K \log^2 K)$ per scalar. Production implementations (Apache MXNet with coded primitives, recent PyTorch-XLA fork) include both algorithms and switch based on $K$.

Practical Constraints
  • Naive Lagrange: $O(K^2)$ per output scalar

  • FFT-based: $O(K \log^2 K)$, break-even around $K \geq 64$

  • Lagrange coefficients can be pre-computed for expected straggler subsets

📋 Ref: Yu/Maddah-Ali/Avestimehr 2017 §VI.C
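The pre-computation idea above can be sketched with a per-subset cache of the coefficient matrix; the field size, cache policy, and names here are illustrative assumptions:

```python
# Cache the K x K Lagrange coefficient matrix per responding subset, so
# repeated rounds with the same straggler pattern pay the O(K^2) setup
# only once. Prime field and cache size are illustrative.
from functools import lru_cache

F = 2**31 - 1                      # illustrative Mersenne prime field

@lru_cache(maxsize=128)
def coeff_matrix(subset):
    """W[k][lam] = [x^lam] ell_k(x) for the given evaluation points."""
    W = []
    for k, ak in enumerate(subset):
        num, denom = [1], 1
        for j, aj in enumerate(subset):
            if j == k:
                continue
            nxt = [0] * (len(num) + 1)     # multiply num by (x - aj)
            for e, c in enumerate(num):
                nxt[e] = (nxt[e] - aj * c) % F
                nxt[e + 1] = (nxt[e + 1] + c) % F
            num, denom = nxt, (denom * (ak - aj)) % F
        inv = pow(denom, F - 2, F)
        W.append(tuple(c * inv % F for c in num))
    return tuple(W)

# First call computes; later rounds with the same stragglers hit the cache.
W = coeff_matrix((1, 2, 3, 4))
```

In a real deployment the cache key would be the set of responding worker IDs; since $\sum_k \ell_k(x) = 1$ identically, the columns of `W` sum to $(1, 0, \ldots, 0)$, a cheap self-test.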

Quick Check

In the polynomial code with $(p, q) = (3, 4)$, the product polynomial $p_C(x)$ has degree:

$pq - 1 = 11$

$p + q - 1 = 6$

$\max(p, q) = 4$

$p + q = 7$