Applications and Limitations
What the Framework Can and Cannot Do
Sections 8.1–8.3 built out the coded-computing toolbox: polynomial codes for matmul, entangled codes for convolution, LCC for arbitrary polynomials. This section takes stock of what the framework enables, where it falls short, and which open problems remain at the frontier of Part II.
The honest assessment: coded computing is a triumph for polynomial distributed computations but stops short of handling the non-polynomial operations (ReLU, softmax, cross-entropy) that dominate modern deep learning. This is not a defect (it is simply the scope of the framework), but it means that production ML pipelines use coded computing for parts of the workload, not the whole. Section 8.4 identifies the boundary precisely and points to research directions that seek to cross it.
Modern ML Applications of Coded Computing
Coded computing is deployable for the following common ML workloads:
- Linear layers (matmul). Polynomial codes or entangled-poly codes with per-worker storage give optimal straggler tolerance.
- Convolutional layers. Entangled polynomial codes give a recovery threshold linear in the number of partitions. Especially effective when channel / filter dimensions are large.
- Attention mechanism (in Transformers). The query-key dot product is a matrix multiplication, so coded matrix multiplication applies directly; the value-weighted output is also a matmul.
- Embedding layer. Sparse-matrix × dense-matrix product, reducible to structured sparse matmul. Coded schemes (Kai Wan, Daniela Tuninetti et al. 2020) handle the sparse case with approximate codes.
- Gradient aggregation. Chapter 6's gradient coding handles this directly.
- Batch normalization statistics. Per-batch mean and variance are linear functions; coded sketching (e.g., AMS sketches) handles them.
Operations with non-polynomial structure (activation functions, loss functions, normalization) are not directly coded; they run as usual on each worker.
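As a concrete illustration of the first item, here is a minimal NumPy sketch of a polynomial-coded matrix multiplication that tolerates one straggler. The block count (K = 2), the evaluation points, and all variable names are illustrative choices, not part of any specific scheme from Sections 8.1–8.3:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 5))          # data matrix, split into K=2 row blocks
B = rng.normal(size=(5, 3))
A0, A1 = np.array_split(A, 2, axis=0)

# Worker i stores the encoding polynomial p(a) = A0 + a*A1 evaluated at
# its own point a_i, and computes p(a_i) @ B.
points = [1.0, 2.0, 3.0]                     # N = 3 workers
results = {a: (A0 + a * A1) @ B for a in points}

results.pop(3.0)                             # worker 3 straggles; ignore it
(a1, Y1), (a2, Y2) = results.items()         # any K = 2 results suffice
A0B = (a2 * Y1 - a1 * Y2) / (a2 - a1)        # interpolate p(a) @ B at a = 0
A1B = (Y2 - Y1) / (a2 - a1)                  # leading coefficient: A1 @ B
recovered = np.vstack([A0B, A1B])            # equals A @ B
```

Because p(a) @ B is a degree-1 (matrix-valued) polynomial in a, any two of the three worker results determine it, which is exactly the straggler tolerance the list above refers to.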
Definition: The Non-Polynomial Barrier
The Non-Polynomial Barrier
A function is non-polynomial if it cannot be expressed as a finite-degree polynomial (or its total degree is infinite). Standard ML non-polynomials:
- ReLU: max(0, x). Not polynomial.
- Sigmoid: σ(x) = 1 / (1 + e^(−x)). Transcendental.
- Softmax: softmax(x)_i = e^(x_i) / Σ_j e^(x_j). Transcendental.
- Cross-entropy loss. Involves logarithms, so not polynomial.
- Batch / Layer normalization. Division by a norm: rational but not polynomial.
The non-polynomial barrier is that LCC and polynomial codes cannot handle these operations with bounded per-worker storage and bounded recovery threshold. Approximation via low-degree polynomial is possible but adds error; exact computation requires different tools (MPC, FHE).
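A one-line argument makes the first item precise (a standard fact, stated here for completeness):

```latex
\text{Suppose } p \text{ is a polynomial with } p(x) = \max(0, x) \text{ for all } x \in \mathbb{R}.
\text{Then } p(x) - x = 0 \text{ on } (0, \infty), \text{ so } p(x) \equiv x
\text{ everywhere (a nonzero polynomial has only finitely many roots),}
\text{ contradicting } p(x) = 0 \text{ on } (-\infty, 0).
```

The same roots argument rules out any piecewise-defined function whose pieces disagree, which is why the barrier applies to the whole family of standard activations, not just ReLU.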
Non-Polynomial Barrier
The open problem that coded-computing frameworks based on polynomial interpolation cannot exactly handle non-polynomial functions (ReLU, softmax, exponentials). Current workarounds: polynomial approximation, hybrid coding (polynomial layers coded, non-polynomial replicated), or cryptographic MPC.
Three Strategies for the Non-Polynomial Case
Practical approaches to coded computing on non-polynomial operations:
- Polynomial approximation. Replace f by a low-degree polynomial p (Taylor series or Chebyshev approximation) and apply LCC to p. The approximation error propagates through the computation, bounding the final output's deviation from the true f(x). For ReLU, a degree-5 polynomial approximation gives about 1% error; higher degree can reduce this.
- Layer-wise hybrid coding. In a deep net, code only the polynomial layers (linear, convolutional, attention) and replicate the non-polynomial layers (activations, normalization). The replicated layers are small enough that their straggler impact is bounded.
- Secure MPC for non-polynomial operations. Cryptographic MPC protocols (Yao's garbled circuits, BGW, SPDZ) can compute any function, including non-polynomial ones, but the per-operation cost is much higher than coded computing's. Used in privacy-preserving ML pipelines (Chapters 11 and 12) but not for performance-oriented coded distributed computing.
None of these is as clean as pure coded computing for polynomial operations. The research frontier is in characterizing which hybrid strategies give the best performance / accuracy tradeoffs.
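The layer-wise hybrid strategy can be sketched in a few lines. This toy forward pass (the two-block encoding, the evaluation points, and all names are illustrative, not a production scheme) codes the two linear layers and runs ReLU uncoded, as a replicated local operation:

```python
import numpy as np

rng = np.random.default_rng(0)

def coded_linear(W, x):
    """Polynomial-coded matvec: 2 row blocks, 3 workers, tolerates 1 straggler.
    A stand-in for the coded linear layers of Sections 8.1-8.2."""
    W0, W1 = np.array_split(W, 2, axis=0)
    shares = {a: (W0 + a * W1) @ x for a in (1.0, 2.0, 3.0)}  # worker results
    done = list(shares.items())[:2]                # any 2 of 3 finish first
    (a1, y1), (a2, y2) = done
    Y0 = (a2 * y1 - a1 * y2) / (a2 - a1)           # interpolate at a = 0: W0 @ x
    Y1 = (y2 - y1) / (a2 - a1)                     # leading coefficient: W1 @ x
    return np.concatenate([Y0, Y1])

def relu(x):
    # Non-polynomial layer: replicated / run locally, not coded.
    return np.maximum(0.0, x)

W_a, W_b = rng.normal(size=(4, 6)), rng.normal(size=(2, 4))
x = rng.normal(size=6)
out = coded_linear(W_b, relu(coded_linear(W_a, x)))   # hybrid forward pass
```

The key design point is that the decode happens between layers, so the activation sees plain (uncoded) values; this is exactly what makes the hybrid simple, and also what limits it, since every layer boundary is a synchronization point.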
Example: Polynomial Approximation of ReLU
Find a polynomial approximation of ReLU(x) = max(0, x) on the interval [-1, 1]. Compute the approximation error.
Degree-3 approximation
Chebyshev-series approximation of ReLU on [-1, 1]: ReLU(x) ≈ 1/π + x/2 + (2/(3π)) T₂(x), where T₂(x) = 2x² − 1 (first few terms; higher-order coefficients alternate in sign).
Max error
On [-1, 1], the maximum error is about 0.055, roughly 5.5% of the range.
Degree-10 approximation
Increasing to degree 10 reduces the max error to about 0.008 (0.8% of range). Higher degrees require a larger LCC recovery threshold, which grows linearly with the degree of the approximating polynomial: 10 vs. 3 for degree-3.
Tradeoff
Choose the degree by the accuracy requirements of the downstream task. For feed-forward networks, degree 5–7 polynomials give usable accuracy with modest recovery-threshold overhead. For deep networks, accumulated error across many layers can become significant.
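The degree-vs-error tradeoff above is easy to check numerically. The sketch below uses a least-squares Chebyshev fit at Chebyshev nodes (close to truncating the Chebyshev series); the exact error values depend on the fitting method, and a minimax (Remez) fit does somewhat better than this least-squares sketch:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

relu = lambda x: np.maximum(0.0, x)

# Sample at Chebyshev nodes on [-1, 1]; least-squares fit at these nodes
# closely tracks the truncated Chebyshev series.
nodes = np.cos(np.pi * (np.arange(2000) + 0.5) / 2000)
grid = np.linspace(-1.0, 1.0, 10_001)

def max_err(deg):
    coeffs = C.chebfit(nodes, relu(nodes), deg)      # least-squares fit
    return float(np.max(np.abs(C.chebval(grid, coeffs) - relu(grid))))

err3, err10 = max_err(3), max_err(10)
print(f"degree 3 max error: {err3:.3f}, degree 10 max error: {err10:.3f}")
```

Running this confirms the qualitative picture: the maximum error shrinks as the degree grows, with the worst-case deviation concentrated around the kink at x = 0.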
Open Problems at the Frontier
The coded-computing framework faces several research challenges:
-
Non-polynomial operations. Characterizing the optimal tradeoff between computation and communication for non-polynomial functions is open. Approximation-based approaches are empirical; a clean information-theoretic bound is elusive.
-
Dynamic operation graphs. Modern ML systems build computation graphs dynamically (PyTorch, JAX). Integrating coded computing with dynamic graphs requires either per-operation coding (high overhead) or static graph transformations (reduced flexibility).
-
Asymmetric / heterogeneous workers. Most LCC analyses assume homogeneous workers. Real clusters have varied compute and memory; an optimal heterogeneous LCC is only partially characterized.
-
Composition with privacy / Byzantine robustness. Combining LCC with secret sharing (for privacy) and with Byzantine-robust aggregation (for integrity) is partially characterized but not closed. Chapter 11's ByzSecAgg is a step in this direction.
-
Approximate coded computing. For applications tolerating bounded error, how small can the recovery threshold be? Chapter 6's approximate gradient coding is the canonical answer; the extension to general LCC is open.
Chapter 18 discusses several of these in more detail.
Coded Computing in Production ML: Current State
As of 2024, coded computing is a specialized tool in production ML:
- Data center training (Google, Meta, Amazon): polynomial codes are used for some large-scale linear-algebra operations, but most deployments stick to replication and synchronous SGD with timeouts.
- Edge / federated learning: coded shuffling (Chapter 7) and coded gradients (Chapter 6) are research tools; production deployments use compression and pairwise masking (secure aggregation, Chapter 10).
- Cryptographic ML: secure MPC with LCC for privacy-preserving inference is an active area (e.g., CryptoDL, TF-Encrypted), using LCC for linear layers and garbled circuits for non-linear ones.
- High-performance linear algebra libraries: NVIDIA cuBLAS and Intel MKL do not implement coded schemes; they rely on retry mechanisms and hardware fault tolerance.
The research-to-production gap remains the central open question. The elegance of the coded-computing framework has not yet translated into widespread industrial adoption outside of niche settings.
- Coded schemes: research tools with scattered production use
- Dominant production approach: replication + timeouts + retries
- Niche adoption: privacy-preserving ML, specific high-value workloads
Historical Note: The Arc of the Field
2016–present. The coded-computing research programme was born in 2016–2017 with Lee et al.'s paper on speeding up distributed ML using codes. Over the next five years, the field established: polynomial codes (Yu et al. 2017), entangled polynomial codes (Yu et al. 2020), Lagrange Coded Computing (Yu et al. 2019), coded shuffling (Wan / Tuninetti / Caire 2021, this book's §7.3), and approximate gradient coding (Charles et al. 2017).
By 2020, the theory was largely complete: each class of computations had its optimal scheme, and the rate regions were characterized. The field then pivoted to composition with privacy / Byzantine primitives (Part III of this book) and to the challenge of non-polynomial operations. The CommIT group's ByzSecAgg (Chapter 11) is a flagship contribution of this second wave.
The emerging story of the 2020s is coded + cryptographic hybrid systems: coded computing for the polynomial core, MPC for the non-polynomial rind, composed carefully. This is the vision that Parts IIIβV of this book develop in detail.
Why This Matters: Part II Concludes; Part III Turns to Privacy
With Chapter 8, Part II of this book (coded computing) concludes. Parts III–V turn to the privacy, robustness, and wireless axes of the golden thread. Chapters 9–12 cover federated learning, secure aggregation, Byzantine-resilient aggregation (ByzSecAgg, a CommIT contribution), and communication-efficient secure aggregation (CCESA, another CommIT contribution). Chapters 13–15 cover PIR. Chapters 16–18 cover AirComp and FL over wireless.
The framework built in Part II (polynomial codes, LCC, coded shuffling) appears throughout Part III as a building block. Chapter 11's ByzSecAgg uses polynomial codes for outlier detection; Chapter 13's Sun-Jafar PIR uses finite-field IA (Chapter 4's tool); Chapter 16's AirComp exploits the wireless channel's analog superposition to bypass coded computation entirely. Every chapter of Parts III–V is built on the foundation of Parts I–II.
Key Takeaway
Coded computing achieves optimal straggler tolerance for polynomial computations; non-polynomial operations remain the frontier. Polynomial codes, entangled codes, and LCC together cover essentially all polynomial distributed ML workloads. Non-polynomial operations (ReLU, softmax, loss functions) require approximation or hybrid coding. Part II closes here; Part III now turns to privacy and robustness.
Common Mistake: Coded Computing Is Not a 'Solved' Field
Mistake:
Conclude that because LCC is information-theoretically optimal for polynomial functions, coded computing is "solved" and no more research is needed.
Correction:
LCC is optimal for polynomial computations β a restricted class. Non-polynomial operations, dynamic graphs, heterogeneous workers, and composition with privacy primitives all remain active research areas. Production adoption is also limited. Chapter 18 of this book discusses the open directions in detail; readers entering the field should expect many interesting problems still to solve.
Quick Check
Why can't ReLU be computed exactly via Lagrange Coded Computing at a bounded recovery threshold?
ReLU is not a polynomial function, so its total degree is undefined (infinite).
ReLU requires too much per-worker storage.
ReLU is not Lipschitz-continuous.
ReLU's computation is not distributed-friendly.
LCC requires the target function f to be a polynomial of bounded degree. ReLU is piecewise-linear but not polynomial; it cannot be expressed as a finite-degree polynomial on any interval containing 0 in its interior.