Beyond Matrix Multiplication

Why Matrix Multiplication Is Not Enough

Chapters 5 and 6 gave optimal coded-computing schemes for matrix multiplication and gradient aggregation. The obvious question is: what about convolutions? What about general multivariate-polynomial tensor operations? Modern ML workloads are dominated by these: convolutions power every CNN layer; higher-order tensor contractions appear in attention mechanisms and in Einstein-summation-style ops. If coded computing cannot handle them, its practical impact is limited.

The good news is that the polynomial-code framework extends naturally to these operations. The key generalization is to encode inputs as polynomials whose product (or, more generally, composition) has the desired output blocks as coefficients. For convolutions, this is the Yu / Maddah-Ali / Avestimehr "entangled polynomial code" (§8.2). For arbitrary multivariate polynomials, this is the Lagrange Coded Computing framework of Yu / Raviv / Soleymani / Avestimehr (§8.3).

The extensions do have limits: coded computing handles polynomial (multilinear) operations well, but non-polynomial operations (ReLU activations, softmax) require approximations or hybrid approaches. Section 8.4 closes the chapter with these limitations and open problems.

Definition: Convolution as Polynomial Multiplication

For vectors $\mathbf{a} = (a_0, \ldots, a_{d-1})$ and $\mathbf{b} = (b_0, \ldots, b_{d'-1})$ over $\mathbb{F}_q$, the linear convolution $\mathbf{c} = \mathbf{a} * \mathbf{b}$ is the vector of length $d + d' - 1$ with entries

$$c_k = \sum_{i + j = k} a_i b_j, \qquad k = 0, 1, \ldots, d + d' - 2.$$

Equivalently, defining generating polynomials $p_{\mathbf{a}}(x) = \sum_i a_i x^i$ and $p_{\mathbf{b}}(x) = \sum_j b_j x^j$, the product $p_{\mathbf{a}}(x) \cdot p_{\mathbf{b}}(x) = p_{\mathbf{c}}(x) = \sum_k c_k x^k$ has the convolution output as its coefficients.

This algebraic identity is what makes coded convolution possible: evaluate the two input polynomials at $N$ distinct points, let each worker compute the local product, then interpolate the output polynomial from the responses.
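
A minimal sketch of this recipe in NumPy, working over the reals rather than $\mathbb{F}_q$ for readability (the vectors, evaluation points, worker count, and straggler pattern are illustrative assumptions, not a reference implementation):

```python
import numpy as np

# Toy inputs: a has length d = 3, b has length d' = 2.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])
K = len(a) + len(b) - 1              # responses needed: degree(p_a * p_b) + 1 = d + d' - 1

# "Encode": evaluate p_a and p_b at N distinct points, one pair per simulated worker.
N = 6
xs = np.arange(1.0, N + 1)
pa_vals = np.polyval(a[::-1], xs)    # np.polyval wants the highest-degree coefficient first
pb_vals = np.polyval(b[::-1], xs)

# Each worker multiplies its two scalars locally (evaluations of p_c = p_a * p_b).
worker_outputs = pa_vals * pb_vals

# "Decode": interpolate p_c from any K responses; workers 1 and 4 are treated as stragglers.
survivors = [0, 2, 3, 5]
coeffs = np.polyfit(xs[survivors], worker_outputs[survivors], deg=K - 1)
c = coeffs[::-1]                     # back to lowest-degree-first ordering

print(np.allclose(c, np.convolve(a, b)))   # True: coefficients of p_c are the convolution
```

Any $d + d' - 1$ of the $N$ responses determine the product polynomial, which is exactly the straggler tolerance the rest of the chapter quantifies.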

Why Naive Coded-Matrix-Mult Is Not Coded Convolution

Convolution can be expressed as a matrix-vector product $\mathbf{c} = \mathbf{M}(\mathbf{a}) \mathbf{b}$, where $\mathbf{M}(\mathbf{a})$ is the circulant or Toeplitz matrix built from $\mathbf{a}$. One could therefore apply the polynomial-code matrix-multiplication framework of Chapter 5 with $\mathbf{M}(\mathbf{a})$ as the data matrix.
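
A quick check of this identity (`conv_matrix` is a throwaway helper introduced here, not notation from the text):

```python
import numpy as np

def conv_matrix(a, d_prime):
    """Build the (d + d' - 1) x d' Toeplitz-structured matrix M(a) with M(a) @ b == a * b."""
    d = len(a)
    M = np.zeros((d + d_prime - 1, d_prime))
    for j in range(d_prime):
        M[j:j + d, j] = a            # column j is a, shifted down by j positions
    return M

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])
print(np.allclose(conv_matrix(a, len(b)) @ b, np.convolve(a, b)))   # True
```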

The catch: $\mathbf{M}(\mathbf{a})$ has size $(d + d' - 1) \times d'$ and is itself a function of $\mathbf{a}$. Encoding $\mathbf{M}(\mathbf{a})$ via polynomial codes treats $\mathbf{a}$ as a fixed data matrix: the encoder must commit to a specific $\mathbf{a}$ before distributing the computation, so only $\mathbf{b}$ is protected by the code. If both $\mathbf{a}$ and $\mathbf{b}$ are inputs (and we want to code over both), the naive approach fails.

The underlying issue is that convolution is bilinear in its two inputs, not merely linear in one. Entangled polynomial codes (§8.2) handle this by encoding both inputs jointly, in polynomials whose product structure matches the convolution.

Definition: Tensor Contraction

Let $\mathbf{T}_A \in \mathbb{F}_q^{d_1 \times d_2 \times d_3}$ and $\mathbf{T}_B \in \mathbb{F}_q^{d_3 \times d_4 \times d_5}$ be two order-3 tensors. Their tensor contraction along the common dimension $d_3$ is $\mathbf{T}_C \in \mathbb{F}_q^{d_1 \times d_2 \times d_4 \times d_5}$ with entries

$$T_C[i_1, i_2, i_4, i_5] = \sum_{i_3 = 1}^{d_3} T_A[i_1, i_2, i_3] \cdot T_B[i_3, i_4, i_5].$$

Matrix multiplication is the order-2 special case: collapse $d_2 = d_4 = 1$ and take $d_1 = m$, $d_3 = d$, $d_5 = d'$. Convolution is the special case with a Toeplitz-structured $\mathbf{T}_B$.
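
This definition is exactly an Einstein summation over the shared index $i_3$; a small NumPy check, with arbitrary illustrative sizes:

```python
import numpy as np

# Two random order-3 tensors sharing the dimension d3 (sizes chosen arbitrarily).
d1, d2, d3, d4, d5 = 2, 3, 4, 5, 6
TA = np.random.rand(d1, d2, d3)
TB = np.random.rand(d3, d4, d5)

# The contraction from the definition, summing over the shared index i3.
TC = np.einsum('abc,cde->abde', TA, TB)
print(TC.shape)                                            # (2, 3, 5, 6)

# Matrix multiplication as the order-2 special case (d2 = d4 = 1 collapsed away).
A, B = np.random.rand(4, 7), np.random.rand(7, 3)          # m x d  and  d x d'
print(np.allclose(np.einsum('ij,jk->ik', A, B), A @ B))    # True
```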

More general multilinear operations (multi-tensor contractions, polynomial evaluations on tensor inputs) can be composed from these primitives. The Lagrange Coded Computing framework of §8.3 handles arbitrary multivariate polynomial functions.

Tensor Contraction

Summation over a shared dimension between two tensors, generalizing matrix multiplication. Convolutions, attention mechanisms, and higher-order Einstein sums all reduce to tensor contractions.

Entangled Polynomial Code

A polynomial-code variant in which both input matrices are encoded jointly (their exponents "entangled" across $x$-powers) to achieve a lower recovery threshold $K = p + q - 1$ than the standard polynomial code's $K = pq$. Chapter 5 §5.4's MatDot codes are the prototype.

Theorem: Recovery Threshold for Coded Convolution

For the $(p, q)$-partitioned convolution $\mathbf{c} = \mathbf{a} * \mathbf{b}$, with $\mathbf{a}$ split into $p$ contiguous chunks and $\mathbf{b}$ split into $q$ contiguous chunks, the minimum recovery threshold is

$$K_{\text{conv}}^* = p + q - 1.$$

This is achieved by the entangled polynomial code (MatDot variant) of §5.4, specialized to the convolution setting. The recovery threshold is smaller than the polynomial code's $K = pq$ for general matrix multiplication because convolution's output has fewer degrees of freedom: it is far more structured than a full-rank matrix product.

A general matrix product has $pq$ distinct output blocks, all independent. The convolution output has only $p + q - 1$ distinct block coefficients: splitting $\mathbf{a}$ into $p$ chunks and $\mathbf{b}$ into $q$ chunks, the block-level product polynomial has degree $p + q - 2$, hence $p + q - 1$ coefficients, each a sum of chunk convolutions. So the recovery threshold scales linearly in $p + q$, not quadratically in the partition counts.
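
A sketch of this counting argument in NumPy, assuming for simplicity that all chunks have the same length $L$ (so block offsets depend only on $i + j$); the sizes are illustrative:

```python
import numpy as np

# Equal-length chunks: a is p chunks of length L, b is q chunks of length L.
p, q, L = 2, 3, 4
a, b = np.random.rand(p * L), np.random.rand(q * L)
A, B = a.reshape(p, L), b.reshape(q, L)

# Only p + q - 1 block coefficients C_k = sum_{i+j=k} A_i * B_j exist.
C = [np.zeros(2 * L - 1) for _ in range(p + q - 1)]
for i in range(p):
    for j in range(q):
        C[i + j] += np.convolve(A[i], B[j])

# Overlap-add the block coefficients at offsets k * L to rebuild the full convolution.
c = np.zeros(len(a) + len(b) - 1)
for k, Ck in enumerate(C):
    c[k * L : k * L + 2 * L - 1] += Ck

print(np.allclose(c, np.convolve(a, b)))   # True: p + q - 1 blocks carry all the information
```

Only the $p + q - 1$ block coefficients $C_k$ ever need to be recovered, which is where the linear recovery threshold comes from.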

Operationally, this means that convolutions can be distributed with better straggler tolerance than matrix multiplication at the same per-worker storage. This is why convolutions in CNN training have been a successful target for coded-computing deployments.

Example: Coded Convolution with $p = 2$, $q = 3$

Construct an entangled polynomial code for convolution with $\mathbf{a}$ split into $p = 2$ chunks and $\mathbf{b}$ split into $q = 3$ chunks. Specify the encoding polynomials and verify the recovery threshold.
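
One possible realization, sketched in NumPy over the reals with equal-length chunks; the chunk length, evaluation points, worker count, and straggler set are illustrative assumptions rather than the book's worked solution:

```python
import numpy as np

p, q, L = 2, 3, 4                        # a in p = 2 chunks, b in q = 3 chunks, length L each
K = p + q - 1                            # claimed recovery threshold: 4
N = 6                                    # simulated workers (any N >= K, distinct points)

a, b = np.random.rand(p * L), np.random.rand(q * L)
A, B = a.reshape(p, L), b.reshape(q, L)

# Encoding polynomials with vector coefficients:
#   p_A(x) = A_0 + A_1 x               (degree p - 1 = 1)
#   p_B(x) = B_0 + B_1 x + B_2 x^2     (degree q - 1 = 2)
xs = np.arange(1.0, N + 1)
enc_A = np.stack([sum(A[i] * x**i for i in range(p)) for x in xs])
enc_B = np.stack([sum(B[j] * x**j for j in range(q)) for x in xs])

# Worker n convolves its two coded chunks, producing p_A(x_n) * p_B(x_n).
outputs = np.stack([np.convolve(enc_A[n], enc_B[n]) for n in range(N)])

# Decode from any K = 4 responses: a Vandermonde solve recovers C_0, ..., C_{p+q-2}.
survivors = [0, 2, 3, 5]                 # workers 1 and 4 straggle
V = np.vander(xs[survivors], K, increasing=True)
C = np.linalg.solve(V, outputs[survivors])

# Overlap-add the recovered block coefficients and compare with the direct convolution.
c = np.zeros(len(a) + len(b) - 1)
for k in range(K):
    c[k * L : k * L + 2 * L - 1] += C[k]
print(np.allclose(c, np.convolve(a, b)))   # True with only 4 of the 6 workers
```

The decoder solves a $4 \times 4$ Vandermonde system, confirming the recovery threshold $K = p + q - 1 = 4$ regardless of which four workers respond.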

Convolution vs. Matrix Multiplication Recovery Threshold

Plot the recovery threshold $K$ for coded convolution ($K = p + q - 1$) versus coded matrix multiplication ($K = pq$) as a function of the partition counts, taking $q = p$ with partition counts up to 8 and $N = 24$ workers. Convolution's linear-in-partition-sum threshold is dramatically better than matrix multiplication's quadratic-in-partition-product for large partitions, though both schemes use the same per-worker storage.
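
A plain-Python version of the same comparison, using the plot's default parameters:

```python
# Recovery thresholds for q = p, mirroring the plot's defaults above.
N = 24                                    # workers
for p in range(1, 9):                     # partition counts 1..8
    k_conv, k_mm = 2 * p - 1, p * p       # p + q - 1 vs. pq with q = p
    mm_note = "infeasible with N workers" if k_mm > N else f"{N - k_mm} stragglers tolerated"
    print(f"p=q={p}: K_conv={k_conv:2d} ({N - k_conv} stragglers tolerated), "
          f"K_matmul={k_mm:2d} ({mm_note})")
```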

Common Mistake: Coded Convolution Is Not MatDot Coded Matrix Multiplication

Mistake:

Assume the MatDot coded-matrix-multiplication threshold $K = p + q - 1$ applies to general matrix multiplication at the same per-worker storage as the standard polynomial code.

Correction:

MatDot coded matrix multiplication (Chapter 5 §5.4) achieves $K = p + q - 1$ only at higher per-worker storage than the standard polynomial code, relaxing the $\mu = 1/p + 1/q$ optimality of polynomial codes. For convolution, $K = p + q - 1$ is the tight bound at any per-worker storage $\geq 1/p + 1/q$, because convolution's output structure has fewer degrees of freedom. The distinction matters: convolution benefits from entangled polynomial codes "for free"; general matrix multiplication pays a storage penalty.

🔧 Engineering Note

Coded Convolutions in CNN Training

CNN training distributes both the forward pass (where each layer is a convolution) and the backward pass (which involves transposed convolutions, themselves convolutions under Fourier conjugation). Coded convolution can be applied at each layer, distributing the $d \times d$ filter convolutions across workers. Early results (Dutta et al. 2018) showed 2–3× speedups on AlexNet-scale CNNs with moderate stragglers.

Practical adoption has nonetheless been limited: the per-layer coding overhead is not negligible, and many production systems use simple data-parallel replication instead. For very large convolutional networks (Transformer variants with attention cast as convolution), coded computing becomes attractive when the per-layer dimension exceeds a few thousand.

Practical Constraints
• Dutta et al. 2018: AlexNet with 2–3× straggler-resilience speedup

• Per-layer coding overhead: ~10–20% wall-clock

• Production systems (PyTorch, TF) do not routinely deploy coded convolutions

📋 Ref: Dutta et al. 2018 IEEE T-IT §III; NVIDIA cuDNN

Historical Note: The Rise of Coded Tensor Operations

2017–2019

The polynomial-code framework of Yu et al. (2017) for matrix multiplication was quickly extended: Dutta et al. (2018) added convolutions; Yu et al. (2019) generalized to arbitrary multivariate-polynomial functions via Lagrange Coded Computing (LCC). By 2020, coded computing covered essentially every linear and multilinear operation in standard ML pipelines. The extensions to non-polynomial operations (ReLU, softmax, cross-entropy) remain limited to approximations and hybrid schemes, a frontier discussed in §8.4.


Key Takeaway

Convolution's recovery threshold is $p + q - 1$, linear in the partition counts; matrix multiplication's is $pq$, quadratic. The improvement comes from the additional algebraic structure of the convolution output: fewer independent output coefficients to recover. Section 8.2 makes this precise via entangled polynomial codes; §8.3 generalizes to arbitrary multivariate polynomials via LCC.

Quick Check

For a $(p = 4, q = 5)$-partitioned convolution, the coded-convolution recovery threshold is:

$K = 8$ ($p + q - 1$)

$K = 20$ ($pq$)

$K = 9$ ($p + q$)

$K = \max(p, q) = 5$