Polynomial Codes and the Recovery Threshold

The Polynomial-Code Idea in One Sentence

Encode the blocks of $\mathbf{A}$ and $\mathbf{B}$ into two polynomials $p_A(x)$ and $p_B(x)$ such that the product $p_A(x) \cdot p_B(x)$, when evaluated at $N$ distinct points, yields $N$ matrix products that interpolate back to the full output $\mathbf{A}^T \mathbf{B}$. Any $K = pq$ of the $N$ evaluations suffice.

The point is that this reuses the machinery of Shamir secret sharing (Chapter 3): a polynomial of degree $d$ is uniquely determined by $d + 1$ evaluations. The only new idea is choosing the exponents of $p_A, p_B$ so that the product's coefficients are exactly the block products $\mathbf{C}_{ij}$. Once the exponent choice is right, the rest follows from Lagrange interpolation.

Definition:

Polynomial Code (Yu–Maddah-Ali–Avestimehr)

The polynomial code with partition counts $(p, q)$ and $N$ workers over $\mathbb{F}_q$ (requiring $q \geq N + 1$, so that $N$ distinct nonzero evaluation points exist) consists of:

  • Evaluation points. Fix distinct $\alpha_1, \alpha_2, \ldots, \alpha_N \in \mathbb{F}_q^*$ (one per worker).
  • Encoding polynomials. $$p_A(x) = \sum_{i=0}^{p-1} \mathbf{A}_{i+1} \, x^{i}, \qquad p_B(x) = \sum_{j=0}^{q-1} \mathbf{B}_{j+1} \, x^{p \cdot j}.$$ The exponent pattern $(1, p)$ is chosen so that the product $p_A(x) p_B(x)$ enumerates all $(i, j)$ pairs with distinct exponents.
  • Storage at worker $k$. $\tilde{\mathbf{A}}_k = p_A(\alpha_k)$, $\tilde{\mathbf{B}}_k = p_B(\alpha_k)$.
  • Worker computation. $\tilde{\mathbf{C}}_k = \tilde{\mathbf{A}}_k^T \tilde{\mathbf{B}}_k = p_A(\alpha_k)^T p_B(\alpha_k) = p_C(\alpha_k)$, where $$p_C(x) \;\triangleq\; p_A(x)^T p_B(x) \;=\; \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \mathbf{C}_{i+1, j+1} \, x^{i + pj}.$$ $p_C$ is a polynomial of degree $p - 1 + p(q-1) = pq - 1$ whose coefficients are exactly the $pq$ desired output blocks.
  • Decoder. The master receives $\tilde{\mathbf{C}}_{k_1}, \ldots, \tilde{\mathbf{C}}_{k_K}$ for any $K = pq$ workers $\mathcal{T} = \{k_1, \ldots, k_K\}$ that have responded, and interpolates $p_C(x)$ via Lagrange on the points $\{(\alpha_{k_\ell}, \tilde{\mathbf{C}}_{k_\ell})\}_{\ell}$. The coefficients of $p_C$ are read off as the output blocks.
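As a sanity check on the exponent choice, here is a minimal Python sketch using scalar ($1 \times 1$) blocks, so that block products are plain field multiplications; the partition counts $p = 2$, $q = 3$ and the field $\mathbb{F}_{101}$ are illustrative choices, not fixed by the text.

```python
# Sanity check of the exponent trick with scalar (1x1) blocks, so block
# products are plain field multiplications. p = 2, q = 3 partitions and
# the field F_101 are illustrative choices.
F = 101                       # prime field size
p, q_part = 2, 3              # partition counts (p, q)
A = [3, 7]                    # blocks A_1..A_p
B = [2, 5, 11]                # blocks B_1..B_q

# Coefficient vectors: p_A puts A_{i+1} at exponent i,
# p_B puts B_{j+1} at exponent p*j.
pA = A[:]
pB = [0] * (p * (q_part - 1) + 1)
for j in range(q_part):
    pB[p * j] = B[j]

# Multiply the polynomials: p_C = p_A * p_B (coefficient convolution).
pC = [0] * (len(pA) + len(pB) - 1)
for e1, a in enumerate(pA):
    for e2, b in enumerate(pB):
        pC[e1 + e2] = (pC[e1 + e2] + a * b) % F

# Every pair (i, j) lands on the distinct exponent i + p*j, so each
# coefficient of p_C is exactly one block product.
for i in range(p):
    for j in range(q_part):
        assert pC[i + p * j] == (A[i] * B[j]) % F
```

Note that $\deg p_C = pq - 1 = 5$ here, so the six coefficients fill the whole polynomial, and any six evaluations would recover them.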

Polynomial Code

A coded-matrix-multiplication scheme in which the input matrices are encoded as polynomials $p_A(x), p_B(x)$, and each worker receives their values at a distinct point $\alpha_k$. The product polynomial $p_C(x)$ has the output blocks as its coefficients, recoverable by Lagrange interpolation of any $pq$ evaluations.

Polynomial-Code Encoding

Complexity: $O(N \cdot p \cdot md/p + N \cdot q \cdot md'/q) = O(N \cdot m(d + d'))$
Input: Matrices $\mathbf{A} \in \mathbb{F}_q^{m \times d}$, $\mathbf{B} \in \mathbb{F}_q^{m \times d'}$, partitions $(p, q)$, $N$ distinct $\alpha_1, \ldots, \alpha_N \in \mathbb{F}_q^*$.
Output: Per-worker encoded matrices $\{(\tilde{\mathbf{A}}_k, \tilde{\mathbf{B}}_k)\}_{k=1}^N$.
1. Partition $\mathbf{A} = [\mathbf{A}_1, \ldots, \mathbf{A}_p]$ and $\mathbf{B} = [\mathbf{B}_1, \ldots, \mathbf{B}_q]$.
2. for $k = 1, 2, \ldots, N$ do
3. $\quad \tilde{\mathbf{A}}_k \leftarrow \sum_{i=0}^{p-1} \alpha_k^{i}\, \mathbf{A}_{i+1}$
4. $\quad \tilde{\mathbf{B}}_k \leftarrow \sum_{j=0}^{q-1} \alpha_k^{pj}\, \mathbf{B}_{j+1}$
5. end for
6. return $\{(\tilde{\mathbf{A}}_k, \tilde{\mathbf{B}}_k)\}$

Encoding is linear in the input size per worker. For large matrices this is dominated by the worker's local matrix multiply, not the encoding.
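The loop above translates directly to code. A sketch, assuming each block is given as a list of rows over a prime field $F$; the function and helper names are illustrative:

```python
# A direct transcription of the encoding loop, assuming matrices are given
# as lists of column-blocks (A_1..A_p, B_1..B_q) of equal shape, with
# entries in a prime field F. Names are illustrative.
def encode(A_blocks, B_blocks, alphas, F):
    """Return [(A~_k, B~_k)] for each evaluation point alpha_k."""
    p, q_part = len(A_blocks), len(B_blocks)

    def lincomb(blocks, coeffs):
        # Entry-wise sum of coeff * block over the block list, mod F.
        rows, cols = len(blocks[0]), len(blocks[0][0])
        out = [[0] * cols for _ in range(rows)]
        for c, blk in zip(coeffs, blocks):
            for r in range(rows):
                for s in range(cols):
                    out[r][s] = (out[r][s] + c * blk[r][s]) % F
        return out

    encoded = []
    for a in alphas:
        cA = [pow(a, i, F) for i in range(p)]            # exponents i
        cB = [pow(a, p * j, F) for j in range(q_part)]   # exponents p*j
        encoded.append((lincomb(A_blocks, cA), lincomb(B_blocks, cB)))
    return encoded
```

Each worker's pair is one linear combination of the blocks, matching the $O(N \cdot m(d + d'))$ bound stated above.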

Polynomial-Code Decoding via Lagrange

Complexity: $O(pq \cdot K^2)$ scalar operations, or $O(K \log^2 K)$ with FFTs
Input: $K$ worker responses $\{(\alpha_k, \tilde{\mathbf{C}}_k)\}_{k \in \mathcal{T}}$, $|\mathcal{T}| = pq$.
Output: Output blocks $\{\mathbf{C}_{i, j}\}_{i, j}$ and hence $\mathbf{A}^T \mathbf{B}$.
1. For each $k \in \mathcal{T}$, expand the Lagrange basis polynomial $\ell_k(x) = \prod_{j \in \mathcal{T}, j \neq k} \frac{x - \alpha_j}{\alpha_k - \alpha_j}$ and record its coefficients $w_{k,\lambda} = [x^\lambda]\, \ell_k(x)$ for $\lambda \in \{0, \ldots, pq - 1\}$.
2. for $\lambda = 0, 1, \ldots, pq - 1$ do
3. $\quad \mathbf{c}_\lambda \leftarrow \sum_{k \in \mathcal{T}} w_{k,\lambda} \, \tilde{\mathbf{C}}_k$
4. $\quad (i, j) \leftarrow (\lambda \bmod p, \lfloor \lambda / p \rfloor)$
5. $\quad \mathbf{C}_{i+1, j+1} \leftarrow \mathbf{c}_\lambda$
6. end for
7. return $\{\mathbf{C}_{i, j}\}$
Step 3 is correct because interpolation gives $p_C(x) = \sum_{k \in \mathcal{T}} \ell_k(x)\, \tilde{\mathbf{C}}_k$, so the coefficient of $x^\lambda$ — the block $\mathbf{C}_{i+1, j+1}$ with $\lambda = i + pj$ — is $\sum_k w_{k,\lambda}\, \tilde{\mathbf{C}}_k$.

The decoder is identical to Reed–Solomon erasure decoding. For large $pq$, FFT-based Lagrange interpolation reduces the cost from $O(K^2)$ to $O(K \log^2 K)$. In production the Lagrange coefficients can be pre-computed for the expected straggler patterns.
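A Python sketch of the coefficient-extraction decoder, assuming a prime field and scalar responses (matrix blocks decode entry-wise in the same way); all names are illustrative:

```python
# Sketch of Lagrange-based coefficient extraction over a prime field F,
# with scalar responses. Names are illustrative.
def lagrange_coeffs(alphas, F):
    """W[k][lam] = [x^lam] ell_k(x) for each Lagrange basis polynomial."""
    W = []
    for k, ak in enumerate(alphas):
        num, denom = [1], 1              # expand prod_{j != k} (x - a_j)
        for j, aj in enumerate(alphas):
            if j == k:
                continue
            nxt = [0] * (len(num) + 1)   # multiply num by (x - aj)
            for e, c in enumerate(num):
                nxt[e] = (nxt[e] - aj * c) % F
                nxt[e + 1] = (nxt[e + 1] + c) % F
            num, denom = nxt, (denom * (ak - aj)) % F
        inv = pow(denom, F - 2, F)       # Fermat inverse of the denominator
        W.append([c * inv % F for c in num])
    return W

def decode(alphas, responses, F):
    """Interpolate p_C and return its coefficients c_0 .. c_{K-1}."""
    W, K = lagrange_coeffs(alphas, F), len(alphas)
    return [sum(W[k][lam] * responses[k] for k in range(K)) % F
            for lam in range(K)]

# Demo: recover p(x) = 3 + x + 4x^2 over F_7 from its values at 1, 2, 3.
assert decode([1, 2, 3], [1, 0, 0], 7) == [3, 1, 4]
```

The per-subset matrix `W` is exactly what a production decoder would cache for recurring straggler patterns.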

Theorem: Polynomial Code Achieves Recovery Threshold $K = pq$

The polynomial code with $N$ workers and partition counts $(p, q)$ over $\mathbb{F}_q$ (with $q \geq N + 1$) satisfies:

  1. Correctness. For every input $(\mathbf{A}, \mathbf{B})$ and every subset $\mathcal{T} \subseteq [N]$ with $|\mathcal{T}| = pq$, the decoder on responses $\{\tilde{\mathbf{C}}_k\}_{k \in \mathcal{T}}$ outputs the exact product $\mathbf{A}^T \mathbf{B}$.
  2. Storage. Each worker stores $|\mathbf{A}|/p + |\mathbf{B}|/q$ bits, for per-worker storage load $\mu = 1/p + 1/q$.
  3. Recovery threshold. $K = pq$, matching the information-theoretic lower bound of Chapter 4, §4.2.

The scheme is optimal within the class of $\mu = 1/p + 1/q$-storage schemes: no distributed matrix-multiplication scheme with the same per-worker storage can achieve a smaller recovery threshold.

Each worker's output is one evaluation of the degree-$(pq - 1)$ polynomial $p_C(x)$; $pq$ evaluations interpolate the whole polynomial, and the coefficients are exactly the $pq$ desired output blocks. The matching lower bound (§5.3) comes from a cut-set argument showing any scheme storing $\mu |\mathbf{A}| + \mu |\mathbf{B}|$ per worker must make at least $pq$ "independent contributions" available to the master.

Key Takeaway

Polynomial codes achieve $(K, \mu) = (pq, 1/p + 1/q)$, optimal for coded matrix multiplication. Straggler tolerance $N - K = N - pq$ can be set to any nonnegative value just by choosing $N$: $N = pq$ gives no slack (uncoded threshold), larger $N$ buys tolerance at the cost of redundancy. The same storage supports any $N$, a flexibility that uncoded schemes simply do not have.

Example: Polynomial Code for $p = q = 2$, $N = 5$

Work out the polynomial code for $p = q = 2$ and $N = 5$ workers with evaluation points $\alpha_k = k$ over $\mathbb{F}_7$. Verify the recovery threshold $K = 4$ by showing that any 4 workers suffice.
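One way to carry out this exercise numerically, using scalar ($1 \times 1$) blocks so that $\mathbf{A}^T \mathbf{B}$ reduces to the four products $\mathbf{A}_i \mathbf{B}_j$ in $\mathbb{F}_7$; the specific block values are illustrative:

```python
# Exercise check: p = q = 2, N = 5, alpha_k = k over F_7, scalar blocks.
from itertools import combinations

F = 7
p = 2
A = [1, 2]                        # blocks A_1, A_2 (illustrative values)
B = [3, 4]                        # blocks B_1, B_2
alphas = [1, 2, 3, 4, 5]          # alpha_k = k, N = 5

def pC_at(x):
    """Worker at point x returns p_A(x) * p_B(x) = p_C(x)."""
    pA = (A[0] + A[1] * x) % F                 # exponents 0, 1
    pB = (B[0] + B[1] * pow(x, p, F)) % F      # exponents 0, p
    return (pA * pB) % F

# Coefficient lambda = i + p*j holds the block product A_{i+1} B_{j+1}.
target = [(A[i] * B[j]) % F for j in range(2) for i in range(2)]

def interpolate(pts, vals):
    """Coefficients of the degree-(K-1) polynomial through (pts, vals)."""
    K = len(pts)
    coeffs = [0] * K
    for k, ak in enumerate(pts):
        num, denom = [1], 1
        for j, aj in enumerate(pts):
            if j == k:
                continue
            nxt = [0] * (len(num) + 1)         # multiply num by (x - aj)
            for e, c in enumerate(num):
                nxt[e] = (nxt[e] - aj * c) % F
                nxt[e + 1] = (nxt[e + 1] + c) % F
            num, denom = nxt, (denom * (ak - aj)) % F
        inv = pow(denom, F - 2, F)
        for lam in range(K):
            coeffs[lam] = (coeffs[lam] + num[lam] * inv * vals[k]) % F
    return coeffs

# Any K = 4 of the 5 workers recover all four block products.
for subset in combinations(alphas, 4):
    assert interpolate(list(subset), [pC_at(a) for a in subset]) == target
```

All $\binom{5}{4} = 5$ straggler patterns decode to the same four products, confirming $K = 4$.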

Polynomial Code: Encoding and Decoding

Animation of the polynomial-code construction. The master partitions $\mathbf{A}$ and $\mathbf{B}$, forms the encoding polynomials, and sends one evaluation to each worker. Workers compute local products; any $K = pq$ of them interpolate back to the polynomial whose coefficients are the output blocks.

Recovery Threshold Comparison: Polynomial vs. MDS vs. Uncoded

Plot the recovery threshold $K$ as a function of the partition count $p$ (with $q = p$) for three schemes: polynomial code ($K = pq = p^2$), MDS-coded replication ($K = 2p - 1$), and uncoded ($K = p^2$, same as polynomial but without straggler flexibility). The polynomial code dominates in flexibility, while MDS achieves a smaller threshold but requires more per-worker storage.
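The numbers behind the plot can be tabulated directly from the formulas quoted above; the range of $p$ is an illustrative choice:

```python
# Tabulate the recovery thresholds quoted above for q = p.
rows = [(p, p * p, 2 * p - 1) for p in range(2, 9)]
print(f"{'p':>3} {'polynomial K':>13} {'MDS K':>6}")
for p, k_poly, k_mds in rows:
    print(f"{p:>3} {k_poly:>13} {k_mds:>6}")
```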

Parameters: range of column-partition counts to plot: 8; number of workers (sets straggler budget): 24.

Common Mistake: Field Size Matters

Mistake:

Use a small field $\mathbb{F}_q$ (e.g., $q = 2$ or $q = 8$) with $N = 16$ workers.

Correction:

The polynomial code requires $N$ distinct nonzero evaluation points, which are elements of $\mathbb{F}_q^*$. For $N = 16$, $q \geq 17$ is needed (or a field extension such as $\mathbb{F}_{2^5} = \mathbb{F}_{32}$). Practical deployments use large prime fields ($q = 2^{32} - 5$, etc.) or $\mathbb{F}_{2^{64}}$ to have ample evaluation points and room for privacy padding.
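The constraint is just a count of nonzero field elements; a tiny sketch with a hypothetical helper name:

```python
# Does a field of size q have enough nonzero evaluation points for N
# workers? Need q - 1 >= N. Helper name is illustrative.
def enough_points(q, N):
    return q - 1 >= N

assert not enough_points(8, 16)    # F_8: only 7 nonzero points
assert enough_points(17, 16)       # F_17 suffices
assert enough_points(32, 16)       # F_32 suffices
```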

🔧Engineering Note

Decoder Cost in Practice

Naive Lagrange decoding is $O(K^2)$ per output scalar, which for a $10^4 \times 10^4$ product with $K = 16$ amounts to $\sim 10^{10}$ operations: a few seconds on a single CPU core, and comparable to the master's baseline in an uncoded scheme. For larger $K$ (say $K = 100$ in ByzSecAgg, Chapter 11), FFT-based decoding reduces the cost to $O(K \log^2 K)$ per scalar. Production implementations (Apache MXNet with coded primitives, recent PyTorch-XLA fork) include both algorithms and switch based on $K$.

Practical Constraints
  • Naive Lagrange: $O(K^2)$ per output scalar

  • FFT-based: $O(K \log^2 K)$, break-even around $K \geq 64$

  • Lagrange coefficients can be pre-computed for expected straggler subsets

📋 Ref: Yu/Maddah-Ali/Avestimehr 2017 §VI.C
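The pre-computation idea above can be sketched with a per-subset cache of the coefficient matrix; the field size, cache policy, and names here are illustrative assumptions:

```python
# Cache the K x K Lagrange coefficient matrix per responding subset, so
# repeated rounds with the same straggler pattern pay the O(K^2) setup
# only once. Prime field and cache size are illustrative.
from functools import lru_cache

F = 2**31 - 1                      # illustrative Mersenne prime field

@lru_cache(maxsize=128)
def coeff_matrix(subset):
    """W[k][lam] = [x^lam] ell_k(x) for the given evaluation points."""
    W = []
    for k, ak in enumerate(subset):
        num, denom = [1], 1
        for j, aj in enumerate(subset):
            if j == k:
                continue
            nxt = [0] * (len(num) + 1)     # multiply num by (x - aj)
            for e, c in enumerate(num):
                nxt[e] = (nxt[e] - aj * c) % F
                nxt[e + 1] = (nxt[e + 1] + c) % F
            num, denom = nxt, (denom * (ak - aj)) % F
        inv = pow(denom, F - 2, F)
        W.append(tuple(c * inv % F for c in num))
    return tuple(W)

# First call computes; later rounds with the same stragglers hit the cache.
W = coeff_matrix((1, 2, 3, 4))
```

In a real deployment the cache key would be the set of responding worker IDs; since $\sum_k \ell_k(x) = 1$ identically, the columns of `W` sum to $(1, 0, \ldots, 0)$, a cheap self-test.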

Quick Check

In the polynomial code with $(p, q) = (3, 4)$, the product polynomial $p_C(x)$ has degree:

$pq - 1 = 11$

$p + q - 1 = 6$

$\max(p, q) = 4$

$p + q = 7$