IA for Distributed Matrix Multiplication
Why Matrix Multiplication Looks Like Interference
Distributed matrix multiplication is the canonical computation we will study in detail in Chapter 5. The question is: given $A$ and $B$, compute $C = A^\top B$ using $N$ workers, each storing only fractions of $A$ and $B$.
The point is that each worker's local product $\tilde{A}_i^\top \tilde{B}_i$ is a sum of many sub-products $A_j^\top B_k$; most of these are "interferers" (blocks of $C$ the master needs to get from other workers, not this one). The master must invert $N$ such observations to recover the desired product $C$ block by block. The structure is exactly that of an interference channel where the cross-channel matrices are determined by the coding scheme: choose them to make the interference align, and the recovery becomes efficient.
This section formalizes the analogy and gives the IA-based achievability for a basic version of coded matrix multiplication; the generalization that hits the optimal recovery threshold is polynomial coding, the subject of Chapter 5. Reading this section first explains why polynomial codes work; Chapter 5 works out how.
Definition: Coded Matrix Multiplication as an Interference Channel
Partition $A$ and $B$ column-wise into $m$ and $n$ equal-size blocks respectively: $A = [A_0, \dots, A_{m-1}]$ and $B = [B_0, \dots, B_{n-1}]$. The desired product $C = A^\top B$ has $mn$ blocks $C_{jk} = A_j^\top B_k$. A storage scheme assigns to worker $i$ a pair of encoded matrices $\tilde{A}_i = \sum_j g_{ij} A_j$ and $\tilde{B}_i = \sum_k h_{ik} B_k$ (over $\mathbb{F}_q$). Worker $i$ computes $\tilde{A}_i^\top \tilde{B}_i = \sum_{j,k} g_{ij} h_{ik}\, A_j^\top B_k$, which is a linear combination of all $mn$ desired blocks. The master receives one such combination from each worker, giving an $N$-dimensional linear system in $mn$ unknowns: $y = G\,c$, where $c$ stacks the $mn$ desired blocks and $G \in \mathbb{F}_q^{N \times mn}$ is the encoding matrix with entries $G_{i,(j,k)} = g_{ij} h_{ik}$.
The interference-channel analogy is exact: the master is the receiver, the desired blocks $C_{jk}$ are the messages, the workers are the symbol slots, and the encoding matrix $G$ plays the role of the channel matrix. The master "decodes" by inverting $G$ on any $mn$ of its rows.
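To make the linear-system view concrete, here is a minimal Python sketch with scalar stand-ins for the matrix blocks; the prime field, the coefficient choices `g` and `h`, and all numeric values are illustrative assumptions, not from the text.

```python
# Interference-channel view of coded matrix multiplication, with scalar
# stand-ins for the blocks and an illustrative prime field F_p (p = 101).
p = 101
m, n, N = 2, 2, 5

A = [3, 7]   # m column blocks of A (scalars stand in for matrix blocks)
B = [5, 2]   # n column blocks of B
C = [[A[j] * B[k] % p for k in range(n)] for j in range(m)]  # desired blocks

# Illustrative encoding coefficients g[i][j], h[i][k] for worker i.
g = [[pow(i + 1, j, p) for j in range(m)] for i in range(N)]
h = [[pow(i + 2, k, p) for k in range(n)] for i in range(N)]

# Worker i's response: (sum_j g_ij A_j) * (sum_k h_ik B_k) over F_p.
responses = [sum(g[i][j] * A[j] for j in range(m))
             * sum(h[i][k] * B[k] for k in range(n)) % p
             for i in range(N)]

# The same responses written as the linear system y = G c, where
# G[i, (j,k)] = g[i][j] * h[i][k] and c stacks the mn desired blocks.
G = [[g[i][j] * h[i][k] % p for j in range(m) for k in range(n)]
     for i in range(N)]
c = [C[j][k] for j in range(m) for k in range(n)]
assert all(sum(Gi[t] * c[t] for t in range(m * n)) % p == y
           for Gi, y in zip(G, responses))
```

The final assertion checks the analogy directly: each worker's product is exactly one row of the system $y = Gc$.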
Encoding Matrix
The matrix $G \in \mathbb{F}_q^{N \times mn}$ whose $(i, (j,k))$ entry is the coefficient of $A_j^\top B_k$ in worker $i$'s response. The minimum number of rows the master needs in order to invert $G$ is the recovery threshold $K$ (Chapter 5).
Theorem: IA-Based Recovery Threshold for Coded Matrix Multiplication
For the distributed matrix-multiplication scheme above with partition counts $m$ and $n$, the recovery threshold $K^*$ (the minimum number of worker responses sufficient to reconstruct $C$) satisfies $K^* \geq mn$, with equality achieved by a generic random encoding matrix over a sufficiently large field $\mathbb{F}_q$, $q \to \infty$. The proof is a direct application of finite-field IA: any $mn$ rows of a generic $G$ form an invertible $mn \times mn$ submatrix.
The polynomial-code construction of Chapter 5 achieves $K = mn$ deterministically, without random sampling, by choosing $\tilde{A}_i = \sum_{j=0}^{m-1} A_j x_i^{j}$ and $\tilde{B}_i = \sum_{k=0}^{n-1} B_k x_i^{km}$ for distinct evaluation points $x_i$. The connection to IA is that this deterministic choice makes the encoding matrix a Vandermonde matrix, which has the same generic-rank property by construction.
Each worker contributes one linear equation in the $mn$ unknown blocks. Generically, $mn$ equations are needed to invert. The IA insight is that we can choose the encoding matrix so that any $mn$ rows are jointly informative (no coincidental rank deficiencies); this is the coding analogue of "interferers all aligning into the same subspace at every receiver".
Lower bound: counting
The master must reconstruct $mn$ unknown matrix blocks. Each worker response supplies at most one linear equation per block-coordinate: for each fixed coordinate position, it gives one scalar equation in the $mn$ corresponding unknowns. Hence at least $mn$ responses are required, giving $K^* \geq mn$.
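A one-line sanity check of the counting bound, using numpy (an assumed dependency): with fewer than $mn$ responses, the encoding matrix cannot have full column rank, so two different block vectors produce identical responses.

```python
import numpy as np

# With only mn - 1 responses, G has fewer rows than the mn unknowns, so its
# rank is at most mn - 1: the system y = G c is underdetermined and the
# master cannot distinguish all block vectors c.
m, n = 2, 3
G = np.random.default_rng(0).integers(0, 100, size=(m * n - 1, m * n))
assert np.linalg.matrix_rank(G) < m * n
```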
Achievability: generic random encoding
Let $G$ have i.i.d. uniform entries in $\mathbb{F}_q$. Any $mn$ rows form a square random matrix; the probability that it is singular is $O(1/q)$, vanishing as $q \to \infty$. A union bound over all $\binom{N}{mn}$ row subsets shows that, with high probability, the master can invert any $mn$ responses.
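A sketch of the generic-rank argument over a prime field: sample a random encoding matrix and check that every $mn$-row submatrix is invertible. The prime $p$ and the small Gaussian-elimination helper `rank_mod_p` are my own choices for illustration; the check succeeds except with probability $O(1/p)$.

```python
import itertools
import random

p = 10007                 # an assumed, reasonably large prime
m, n, N = 2, 2, 6
K = m * n

def rank_mod_p(rows, q):
    """Rank of an integer matrix over F_q via Gaussian elimination."""
    M = [list(r) for r in rows]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] % q), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][col], q - 2, q)          # Fermat inverse of the pivot
        M[r] = [x * inv % q for x in M[r]]
        for i in range(len(M)):
            if i != r and M[i][col] % q:
                f = M[i][col] % q
                M[i] = [(a - f * b) % q for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

random.seed(1)
G = [[random.randrange(p) for _ in range(K)] for _ in range(N)]
full_rank = all(rank_mod_p([G[i] for i in S], p) == K
                for S in itertools.combinations(range(N), K))
print(full_rank)   # True except with probability O(1/p) over the sampling
```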
Achievability: polynomial codes
Choose $\tilde{A}_i = \sum_{j=0}^{m-1} A_j x_i^{j}$ and $\tilde{B}_i = \sum_{k=0}^{n-1} B_k x_i^{km}$ for distinct $x_i$. The encoding matrix becomes Vandermonde-like, with all $mn \times mn$ submatrices invertible. The worker product is then $\tilde{A}_i^\top \tilde{B}_i = p(x_i)$, where $p(x) = \sum_{j,k} A_j^\top B_k\, x^{j+km}$ is a matrix-valued polynomial of degree $mn - 1$, and the master interpolates $p$ from any $mn$ evaluations to recover all coefficients, which are exactly the blocks $A_j^\top B_k$. This is the construction of Chapter 5.
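The whole pipeline fits in a short Python sketch over a prime field, with scalar stand-ins for the blocks; the helper names `respond`, `polymul`, `interpolate` and the prime $p = 257$ are illustrative assumptions.

```python
# Polynomial code over F_p with scalar stand-ins for the blocks A_j, B_k.
p = 257
m, n = 2, 2
K = m * n                      # recovery threshold mn
A = [3, 7]                     # blocks A_0, A_1
B = [5, 2]                     # blocks B_0, B_1
xs = [1, 2, 3, 4, 5, 6]        # distinct evaluation points, N = 6 workers

def respond(x):
    """Worker's product: evaluates p(x) = sum_{j,k} A_j B_k x^{j+km}."""
    a = sum(A[j] * pow(x, j, p) for j in range(m)) % p
    b = sum(B[k] * pow(x, k * m, p) for k in range(n)) % p
    return a * b % p

def polymul(f, g):
    """Product of coefficient lists over F_p."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % p
    return out

def interpolate(points):
    """Lagrange interpolation over F_p; returns the K coefficients of p(x)."""
    coeffs = [0] * len(points)
    for xi, yi in points:
        basis, denom = [1], 1
        for xj, _ in points:
            if xj != xi:
                basis = polymul(basis, [-xj % p, 1])   # multiply by (x - xj)
                denom = denom * (xi - xj) % p
        scale = yi * pow(denom, p - 2, p) % p          # Fermat inverse
        coeffs = [(c + scale * b) % p for c, b in zip(coeffs, basis)]
    return coeffs

# Decode from any K = 4 of the 6 responses (here workers 1 and 4 straggle).
pts = [(xs[i], respond(xs[i])) for i in (0, 2, 3, 5)]
recovered = interpolate(pts)
# The coefficient of x^{j+km} is A_j B_k:
assert recovered == [A[j] * B[k] % p for k in range(n) for j in range(m)]
```

The final assertion is the recovery-threshold claim in miniature: four evaluations of the degree-3 polynomial determine all four coefficients, regardless of which four workers respond.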
Example: $m = n = 2$, $N = 4$ Workers
Set $m = n = 2$ (so the desired product has $mn = 4$ blocks). With $N = 4$ workers, what is the minimum recovery threshold, and how does the polynomial code achieve it?
Lower bound
From the theorem, $K^* = mn = 4$. With only $N = 4$ workers and $K = 4$, the master needs all responses; there is no straggler tolerance.
Polynomial-code construction
Choose distinct evaluation points $x_1, x_2, x_3, x_4$ over $\mathbb{F}_q$. Worker $i$ stores $\tilde{A}_i = A_0 + A_1 x_i$ and $\tilde{B}_i = B_0 + B_1 x_i^{2}$. Worker $i$ computes $\tilde{A}_i^\top \tilde{B}_i = A_0^\top B_0 + A_1^\top B_0\, x_i + A_0^\top B_1\, x_i^{2} + A_1^\top B_1\, x_i^{3}$.
Master decoding
The four worker responses are evaluations of the degree-$3$ polynomial $p(x)$ at $x_1, \dots, x_4$. Lagrange interpolation recovers all four coefficients, i.e., all four blocks $A_j^\top B_k$.
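Equivalently, the master can solve the $4 \times 4$ Vandermonde system directly. A small sketch over the rationals, with scalar stand-ins for the blocks and illustrative values for $A$, $B$, and the evaluation points:

```python
from fractions import Fraction

A = [3, 7]                      # blocks A_0, A_1 (scalars for brevity)
B = [5, 2]                      # blocks B_0, B_1
xs = [1, 2, 3, 4]               # the four evaluation points

# Worker responses y_i = (A_0 + A_1 x)(B_0 + B_1 x^2), a degree-3 polynomial.
ys = [(A[0] + A[1] * x) * (B[0] + B[1] * x * x) for x in xs]

# Solve the Vandermonde system V c = y, V[i][t] = x_i^t, by Gauss-Jordan.
V = [[Fraction(x ** t) for t in range(4)] for x in xs]
c = [Fraction(y) for y in ys]
for col in range(4):
    piv = next(i for i in range(col, 4) if V[i][col])
    V[col], V[piv] = V[piv], V[col]
    c[col], c[piv] = c[piv], c[col]
    inv = 1 / V[col][col]
    V[col] = [v * inv for v in V[col]]
    c[col] *= inv
    for i in range(4):
        if i != col and V[i][col]:
            f = V[i][col]
            V[i] = [a - f * b for a, b in zip(V[i], V[col])]
            c[i] -= f * c[col]

print([int(v) for v in c])      # [15, 35, 6, 14]: A0B0, A1B0, A0B1, A1B1
```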
Adding redundancy
With $N = 6$ workers and the same polynomial-code construction, $K = 4$ is preserved; the master can tolerate up to $N - K = 2$ stragglers. This is exactly the gain coded matrix multiplication brings in Chapter 5.
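A quick check of the straggler claim: every 4-subset of 6 distinct evaluation points gives an invertible Vandermonde system, since its determinant $\prod_{i<j}(x_j - x_i)$ is nonzero (a sketch with assumed integer evaluation points):

```python
from itertools import combinations
from math import prod

xs = [1, 2, 3, 4, 5, 6]         # N = 6 distinct evaluation points
K = 4                           # recovery threshold mn

# The Vandermonde determinant of a point set S is prod_{i<j} (x_j - x_i);
# it is nonzero for distinct points, so every K-subset of responses decodes.
ok = all(prod(b - a for a, b in combinations(S, 2)) != 0
         for S in combinations(xs, K))
assert ok   # up to N - K = 2 stragglers tolerated
```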
Recovery Threshold vs. Workers at Various Partitions
Plot the recovery threshold $K = mn$ for coded matrix multiplication as a function of the partition counts $m, n$ and the number of workers $N$. As $N$ grows beyond $mn$, the straggler tolerance $N - mn$ improves, while the per-worker storage stays fixed (a $1/m$ fraction of $A$ plus a $1/n$ fraction of $B$). The plot shows the trade-off between recovery threshold (smaller is better for stragglers) and storage (smaller is better for memory).
Parameters: $N$ (total number of workers), $m$ (number of column blocks of $A$), $n$ (number of column blocks of $B$).
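The trade-off behind the plot can be tabulated directly; `straggler_tolerance` is a hypothetical helper name, and the parameter triples are illustrative.

```python
# K = m*n is fixed by the partition; each worker beyond K adds one unit of
# straggler tolerance at unchanged per-worker storage (1/m of A, 1/n of B).
def straggler_tolerance(m, n, N):
    K = m * n
    return N - K if N >= K else None    # None: too few workers to decode

for m, n, N in [(2, 2, 4), (2, 2, 6), (2, 3, 8), (3, 3, 12)]:
    print(f"m={m} n={n} N={N}  K={m * n}  "
          f"tolerance={straggler_tolerance(m, n, N)}")
```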
Why This Matters: Polynomial Codes Are the Optimal IA Construction
The deterministic polynomial-code construction of Chapter 5 achieves the same recovery threshold $K = mn$ as the generic random IA scheme of this section, but with two advantages: (i) zero failure probability (the Vandermonde submatrices are always invertible), and (ii) explicit decoding via Lagrange interpolation, cheaper than general matrix inversion. The theoretical content is identical; the engineering story is that explicit constructions matter when reproducibility and complexity are at stake.
Common Mistake: Recovery Threshold Is $mn$, Not $m + n$
Mistake:
Confusing the partition counts $m$ and $n$ with their sum, and expecting a recovery threshold of $m + n$.
Correction:
The desired output has $mn$ matrix blocks (the blocks of $C$ partitioned into an $m \times n$ grid). Each must be reconstructed, so $K^* = mn$. This is why coded matrix multiplication has a recovery threshold quadratic in the partition counts, and why approximate schemes (Chapter 6's gradient coding) try to beat it by settling for an approximation of the product rather than every block.
Why Polynomial Codes Replace Random IA in Practice
On a 24-node Amazon EC2 cluster, Yu et al. measured a substantial speedup of polynomial-coded matrix multiplication over the uncoded baseline. Random IA-based schemes achieve the same theoretical recovery threshold but suffer from (i) a non-zero failure probability over fields too small for exact alignment, and (ii) the cost of inverting an arbitrary $mn \times mn$ matrix versus Lagrange interpolation. In production, deterministic polynomial codes are the rule and randomized IA is reserved for theoretical existence proofs.
- Polynomial-code decoding runs in $O(K^2)$ field operations via Lagrange interpolation (or $O(K \log^2 K)$ with fast algorithms); random IA needs $O(K^3)$ for general matrix inversion.
- Deterministic constructions allow worst-case analysis; random IA gives only with-high-probability guarantees.
- Yu et al.: EC2 experiments show a clear speedup of the polynomial code over uncoded multiplication.
Key Takeaway
IA is the algebraic tool; polynomial codes are the deterministic construction. The IA framework explains why a recovery threshold of $mn$ is achievable for coded matrix multiplication, by reducing the question to invertibility of generic $mn \times mn$ submatrices. Polynomial codes (Chapter 5) make the construction explicit and remove all randomness from the analysis.
Quick Check
For coded matrix multiplication with column partitions $m$ and $n$, the optimal recovery threshold is:
( ) $m + n$
( ) $mn$
$K^* = mn$. The polynomial code achieves this exactly, and $mn$ is also a lower bound; the recovery threshold is information-theoretically tight.