Applications and Limitations

What the Framework Can β€” and Cannot β€” Do

Sections 8.1–8.3 built out the coded-computing toolbox: polynomial codes for matmul, entangled codes for convolution, LCC for arbitrary polynomials. This section takes stock of what the framework enables, where it falls short, and which open problems remain at the frontier of Part II.

The honest assessment: coded computing is a triumph for polynomial distributed computations but stops short of handling the non-polynomial operations (ReLU, softmax, cross-entropy) that dominate modern deep learning. This is not a defect β€” it's the scope of the framework β€” but it means that production ML pipelines use coded computing for parts of the workload, not the whole. Section 8.4 identifies the boundary precisely and points to research directions that seek to cross it.

Modern ML Applications of Coded Computing

Coded computing is deployable for the following common ML workloads:

  • Linear layers (matmul). Polynomial codes or entangled-poly codes with per-worker storage ΞΌ=1/p+1/q\mu = 1/p + 1/q give optimal straggler tolerance.

  • Convolutional layers. Entangled polynomial codes give K=p+qβˆ’1K = p + q - 1 β€” linear in partitions. Especially effective when channel / filter dimensions are large.

  • Attention mechanism (in Transformers). The Query-Key dot product QKT\mathbf{Q}\mathbf{K}^T is a matrix multiplication; coded matrix multiplication applies directly. The Value-weighted output is also a matmul.

  • Embedding layer. Sparse-matrix Γ— dense-matrix, reducible to structured sparse matmul. Coded schemes (Kai Wan, Daniela Tuninetti et al. 2020) handle the sparse case with approximate codes.

  • Gradient aggregation. Chapter 6's gradient coding handles this directly.

  • Batch normalization statistics. Per-batch mean and variance are linear functions; coded sketching (e.g., AMS sketches) handles them.

Operations with non-polynomial structure (activation functions, loss functions, normalization) are not directly coded; they run as usual on each worker.
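To make the first item in the list above concrete, here is a minimal numerical sketch of polynomial-coded matrix multiplication in the style of Yu et al. (2017), simulated in a single process with NumPy. The partition sizes, worker count, and evaluation points are illustrative choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: C = A @ B with A split into p row-blocks and B into q column-blocks.
p, q = 2, 2
A = rng.standard_normal((4, 3))               # p = 2 row-blocks of 2 rows each
B = rng.standard_normal((3, 4))               # q = 2 column-blocks of 2 columns each
A_blocks = np.split(A, p, axis=0)
B_blocks = np.split(B, q, axis=1)

# Polynomial-code encoding:
#   A~(x) = sum_j A_j x^j,   B~(x) = sum_k B_k x^{k*p},
# so A~(x) B~(x) carries each product A_j @ B_k as the coefficient of x^{j + k*p}.
n_workers = 6                                  # any p*q = 4 results will suffice
xs = np.arange(1, n_workers + 1, dtype=float)  # distinct evaluation points

def encode(blocks, x, stride):
    return sum(blk * x ** (i * stride) for i, blk in enumerate(blocks))

# In the real scheme each worker receives only its two coded blocks
# (1/p of A plus 1/q of B) and computes one small product.
results = {i: encode(A_blocks, xs[i], 1) @ encode(B_blocks, xs[i], p)
           for i in range(n_workers)}

# Master: suppose workers 1 and 4 straggle; decode from the other four.
fast = [0, 2, 3, 5]
V = np.vander(xs[fast], p * q, increasing=True)      # interpolation system
evals = np.stack([results[i] for i in fast])         # shape (4, 2, 2)
coeffs = np.linalg.solve(V, evals.reshape(len(fast), -1))

# The coefficient of x^{j + k*p} is the block C_{jk} = A_j @ B_k.
C = np.block([[coeffs[j + k * p].reshape(2, 2) for k in range(q)]
              for j in range(p)])
assert np.allclose(C, A @ B)
print("Recovered C = A @ B from 4 of 6 worker results.")
```

Each simulated worker stores one coded block of each factor, matching the per-worker storage $\mu = 1/p + 1/q$ quoted above, and any $pq$ of the $N$ results determine the product.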

Definition:

The Non-Polynomial Barrier

A function $f: \mathbb{F}^d \to \mathbb{F}^{d'}$ is non-polynomial if it cannot be expressed as a polynomial of finite total degree. Standard ML non-polynomials:

  • ReLU: f(x)=max⁑(0,x)f(x) = \max(0, x). Not polynomial.
  • Sigmoid: f(x)=1/(1+eβˆ’x)f(x) = 1/(1 + e^{-x}). Transcendental.
  • Softmax: f(x)i=exi/βˆ‘jexjf(\mathbf{x})_i = e^{x_i}/\sum_j e^{x_j}. Transcendental.
  • Cross-entropy loss. Involves logarithms β€” not polynomial.
  • Batch / Layer normalization. Division by a norm β€” rational but not polynomial.

The non-polynomial barrier is that LCC and polynomial codes cannot handle these operations with bounded per-worker storage and a bounded recovery threshold. Approximation via a low-degree polynomial is possible but adds error; exact computation requires different tools (MPC, FHE).
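To see why the barrier is structural rather than an artifact of one scheme, recall LCC's recovery threshold (stated here in its basic form, without the privacy and adversary terms of the full Yu et al. 2019 result): with the dataset split into $K$ blocks, the master can decode $f$ from any

$$R_{\mathrm{LCC}} = (K - 1)\,\deg f + 1$$

worker results, because the encoded computation is a polynomial of exactly that degree in the evaluation variable. A function with no finite total degree drives $R_{\mathrm{LCC}}$ to infinity: no bounded number of worker evaluations pins down $f$ exactly.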

Non-Polynomial Barrier

The open problem that coded-computing frameworks based on polynomial interpolation cannot exactly handle non-polynomial functions (ReLU, softmax, exponentials). Current workarounds: polynomial approximation, hybrid coding (polynomial layers coded, non-polynomial replicated), or cryptographic MPC.

Three Strategies for the Non-Polynomial Case

Practical approaches to coded computing on non-polynomial operations:

  1. Polynomial approximation. Replace $f$ by a low-degree polynomial $\hat f$ (a Taylor series or Chebyshev approximation) and apply LCC to $\hat f$. The approximation error propagates through the computation, bounding the final output's deviation from the true $f$. For ReLU on $[-1, 1]$, a degree-5 polynomial achieves uniform error of a few percent; higher degree reduces this (see the worked example below).

  2. Layer-wise hybrid coding. In a deep net, code only the polynomial layers (linear, convolutional, attention) and replicate the non-polynomial layers (activations, normalization). The replicated layers are small enough that their straggler impact is bounded (see the sketch below).

  3. Secure MPC for non-polynomial operations. Cryptographic MPC protocols (Yao's garbled circuits, BGW, SPDZ) can compute any function, including non-polynomial ones, but at a much higher per-operation cost than coded computing. Used in privacy-preserving ML pipelines (Chapters 11–12) but not for performance-oriented coded distributed computing.
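A minimal sketch of the layer-wise hybrid pattern (strategy 2), assuming a `coded_matmul` helper like the polynomial-code listing earlier in this section; the layer shapes are illustrative:

```python
import numpy as np

def coded_matmul(A, B):
    """Stand-in for the polynomial-coded multiply sketched earlier;
    computed directly here so the example stays self-contained."""
    return A @ B

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 16))    # batch of 8 inputs
W1 = rng.standard_normal((16, 32))
W2 = rng.standard_normal((32, 4))

# Layer-wise hybrid: polynomial layers coded, non-polynomial layers uncoded.
H = coded_matmul(X, W1)   # linear layer: straggler-tolerant, coded across workers
H = relu(H)               # activation: non-polynomial, applied locally / replicated
Y = coded_matmul(H, W2)   # linear layer: coded again
print("output shape:", Y.shape)
```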

None of these is as clean as pure coded computing for polynomial operations. The research frontier is in characterizing which hybrid strategies give the best performance / accuracy tradeoffs.

Example: Polynomial Approximation of ReLU

Find a polynomial approximation of $\mathrm{ReLU}(x) = \max(0, x)$ on the interval $[-1, 1]$. Compute the approximation error.
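One way to work this example numerically is a least-squares Chebyshev fit, a standard near-minimax recipe. This sketch uses NumPy; the degree and node count are illustrative choices.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

DEGREE = 5  # the degree cited in the strategies list above

# Fit a Chebyshev series to ReLU sampled at Chebyshev nodes on [-1, 1]
# (unweighted least squares at these nodes is a standard near-minimax recipe).
nodes = np.cos(np.pi * (np.arange(200) + 0.5) / 200)
cheb = np.polynomial.chebyshev.Chebyshev.fit(nodes, relu(nodes), DEGREE,
                                             domain=[-1, 1])

# Approximation error in the sup norm, measured on a dense grid.
grid = np.linspace(-1.0, 1.0, 10_001)
err = np.max(np.abs(cheb(grid) - relu(grid)))
print(f"degree-{DEGREE} Chebyshev fit: sup-norm error {err:.4f} on [-1, 1]")
```

For degree 5 this fit lands at a sup-norm error of roughly 0.06 (a minimax construction roughly halves that); raising the degree reduces it further, at the cost of a higher-degree polynomial for LCC to evaluate.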

Open Problems at the Frontier

The coded-computing framework faces several research challenges:

  1. Non-polynomial operations. Characterizing the optimal tradeoff between computation and communication for non-polynomial functions is open. Approximation-based approaches are empirical; a clean information-theoretic bound is elusive.

  2. Dynamic operation graphs. Modern ML systems build computation graphs dynamically (PyTorch, JAX). Integrating coded computing with dynamic graphs requires either per-operation coding (high overhead) or static graph transformations (reduced flexibility).

  3. Asymmetric / heterogeneous workers. Most LCC analyses assume homogeneous workers. Real clusters have varied compute and memory; the optimal heterogeneous LCC scheme is only partially characterized.

  4. Composition with privacy / Byzantine robustness. Combining LCC with secret sharing (for privacy) and with Byzantine-robust aggregation (for integrity) is partially characterized but not closed. Chapter 11's ByzSecAgg is a step in this direction.

  5. Approximate coded computing. For applications tolerating bounded error, how small can the recovery threshold be? Chapter 6's approximate gradient coding is the canonical answer; the extension to general LCC is open.

Chapter 18 discusses several of these in more detail.

πŸ”§Engineering Note

Coded Computing in Production ML: Current State

As of 2024, coded computing is a specialized tool in production ML:

  • Data center training (Google, Meta, Amazon): uses polynomial codes for some large-scale linear-algebra operations, but most deployments stick to replication and synchronous SGD with timeouts.

  • Edge / federated learning: coded shuffling (Chapter 7) and coded gradient (Chapter 6) are research tools; production deployments use compression and pairwise masking (secure aggregation, Chapter 10).

  • Cryptographic ML: Secure MPC with LCC for privacy-preserving inference β€” an active area (e.g., CryptoDL, TF-Encrypted). Uses LCC for linear layers, garbled circuits for non-linear.

  • High-performance linear algebra libraries: NVIDIA cuBLAS, Intel MKL β€” do not implement coded schemes; they use retry mechanisms and hardware fault-tolerance.

The research-to-production gap remains the central open question. The elegance of the coded-computing framework has not yet translated into widespread industrial adoption outside of niche settings.

Practical Constraints
  • β€’

    Coded schemes: research tools with scattered production use

  • β€’

    Dominant production approach: replication + timeouts + retries

  • β€’

    Niche adoption: privacy-preserving ML, specific high-value workloads

πŸ“‹ Ref: Li & Avestimehr survey 2020 Β§VII

Historical Note: The Arc of the Field

2016–present

The coded-computing research programme was born in 2016–2017 with Lee et al.'s "Speeding Up Distributed Machine Learning Using Codes." Over the next five years, the field established: polynomial codes (Yu et al. 2017), entangled polynomial codes (Yu et al. 2020), Lagrange Coded Computing (Yu et al. 2019), coded shuffling (Wan, Tuninetti, and Caire 2021; this book's Β§7.3), and approximate gradient coding (Charles et al. 2017).

By 2020, the theory was largely complete: each class of computations had its optimal scheme, and the rate regions were characterized. The field then pivoted to composition with privacy / Byzantine primitives (Part III of this book) and to the challenge of non-polynomial operations. The CommIT group's ByzSecAgg (Chapter 11) is a flagship contribution of this second wave.

The emerging story of the 2020s is coded + cryptographic hybrid systems: coded computing for the polynomial core, MPC for the non-polynomial rind, composed carefully. This is the vision that Parts III–V of this book develop in detail.

Why This Matters: Part II Concludes; Part III Turns to Privacy

With Chapter 8, Part II of this book (coded computing) concludes. Parts III–V turn to the privacy, robustness, and wireless axes of the golden thread. Chapters 9–12 cover federated learning, secure aggregation, Byzantine-resilient aggregation (ByzSecAgg β€” a CommIT contribution), and communication-efficient secure aggregation (CCESA β€” another CommIT contribution). Chapters 13–15 cover PIR. Chapters 16–18 cover AirComp and FL over wireless.

The framework built in Part II β€” polynomial codes, LCC, coded shuffling β€” appears throughout Part III as a building block. Chapter 11's ByzSecAgg uses polynomial codes for outlier detection; Chapter 13's Sun-Jafar PIR uses finite-field IA (Chapter 4's tool); Chapter 16's AirComp exploits the wireless channel's analog superposition to bypass coded computation entirely. Every chapter of Parts III–V is built on the foundation of Parts I–II.

Key Takeaway

Coded computing achieves optimal straggler tolerance for polynomial computations; non-polynomial operations remain the frontier. Polynomial codes, entangled codes, and LCC together cover essentially all polynomial distributed ML workloads. Non-polynomial operations (ReLU, softmax, loss functions) require approximation or hybrid coding. Part II closes here; Part III now turns to privacy and robustness.

Common Mistake: Coded Computing Is Not a 'Solved' Field

Mistake:

Conclude that because LCC is information-theoretically optimal for polynomial functions, coded computing is "solved" and no more research is needed.

Correction:

LCC is optimal for polynomial computations β€” a restricted class. Non-polynomial operations, dynamic graphs, heterogeneous workers, and composition with privacy primitives all remain active research areas. Production adoption is also limited. Chapter 18 of this book discusses the open directions in detail; readers entering the field should expect many interesting problems still to solve.

Quick Check

Why can't ReLU be computed exactly via Lagrange Coded Computing at a bounded recovery threshold?

  • ReLU is not a polynomial function, so its total degree is undefined (infinite).
  • ReLU requires too much per-worker storage.
  • ReLU is not Lipschitz-continuous.
  • ReLU's computation is not distributed-friendly.

Answer: the first option. LCC decodes by polynomial interpolation, so its recovery threshold scales with $\deg f$; ReLU has no finite total degree, so no bounded threshold suffices.