Coded Computing and Gradient Coding

Beyond Shuffling: Coded Computing

Coded shuffling (§15.2) addresses inter-epoch data movement. A related coded-computing problem addresses a different bottleneck: stragglers. In distributed ML, one or a few slow workers (due to load imbalance, GPU contention, or network issues) can hold up the entire training iteration. Coded-computing techniques add redundancy so that any $K - r$ of the $K$ workers can produce the aggregate gradient, tolerating $r$ stragglers.

This section surveys the gradient-coding problem and its relation to coded caching. The main idea: distribute encoded versions of the data across workers so that any sufficiently large subset can reconstruct the result without waiting for the slowest.

Definition: Gradient Coding

In gradient coding, the training dataset is encoded and distributed across workers with redundancy. Specifically:

  • Dataset of $D$ samples partitioned into $K$ groups of size $D/K$.
  • Each worker $k$ stores an $(r+1)/K$ fraction of the dataset (instead of $1/K$), where $r$ is the straggler tolerance.
  • At each iteration, workers compute partial gradients on their assigned (possibly overlapping) data.
  • Any $K - r$ workers' responses suffice to reconstruct the full gradient (via an appropriate combining operation).

The redundancy factor $(r+1)$ enables tolerance of $r$ slow workers.
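To make the definition concrete, here is a minimal simulation of one standard construction, the fractional-repetition scheme of Tandon et al. (2017), which assumes $(r+1)$ divides $K$. The least-squares gradient, the sizes, and all names are illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: least-squares gradient g = X^T (X w - y).
K, r = 12, 3                   # 12 workers, tolerate up to 3 stragglers
D, d = 120, 5                  # 120 samples, 5 features (D/K = 10 per partition)
X, y = rng.normal(size=(D, d)), rng.normal(size=D)
w = rng.normal(size=d)
parts = np.array_split(np.arange(D), K)        # the K data partitions

def partial_grad(p):
    """Gradient of 0.5 * ||X w - y||^2 restricted to partition p."""
    Xp, yp = X[p], y[p]
    return Xp.T @ (Xp @ w - yp)

# Fractional-repetition placement: workers split into K/(r+1) groups;
# every worker in group g stores the same block of r+1 partitions,
# i.e. an (r+1)/K fraction of the dataset.
assert K % (r + 1) == 0
group = lambda k: k // (r + 1)
block = lambda g: range(g * (r + 1), (g + 1) * (r + 1))

def worker_message(k):
    """Each worker sends ONE vector: the gradient sum over its block."""
    return sum(partial_grad(parts[i]) for i in block(group(k)))

# Simulate r stragglers; decode from the remaining K - r workers.
stragglers = set(rng.choice(K, size=r, replace=False))
alive = [k for k in range(K) if k not in stragglers]

decoded = np.zeros(d)
for g in range(K // (r + 1)):
    # At most r stragglers and group size r+1: each group has a survivor.
    k = next(k for k in alive if group(k) == g)
    decoded += worker_message(k)

assert np.allclose(decoded, X.T @ (X @ w - y))  # matches the uncoded gradient
```

Each group's block sum is replicated on $r+1$ workers, so any $r$ stragglers leave at least one copy of every block; the master simply adds one response per group.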

Gradient coding is cache-like: each worker stores more than its "fair share" ($1/K$) to provide redundancy. The analogy to coded caching: just as cache memory $M$ per user enables coded multicast gains, extra per-worker storage enables coded straggler robustness.

Theorem: Gradient-Coding Storage-Tolerance Tradeoff

For distributed gradient descent with $K$ workers tolerating $r$ stragglers, the minimum per-worker storage is
$$M_\text{gc}(r) \;=\; \frac{r+1}{K} \cdot D \text{ samples}.$$
That is, the storage overhead is an $(r+1)/K$ fraction of the dataset, matching the straggler tolerance.

To recover the full gradient from any $K - r$ workers, each worker sends a fixed linear combination of its locally computed partial gradients, chosen so that, for every possible set of $K - r$ responders, the full gradient can be recovered from their messages. Achieving this for all subsets forces each worker to hold at least $r+1$ data partitions.
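To see the recovery condition concretely, write it in the standard linear gradient-coding notation (assumed here, following Tandon et al. 2017): let $g_i$ denote the partial gradient of partition $i$, and let worker $k$ send $f_k = \sum_{i=1}^{K} b_{ki}\, g_i$, where $b_{ki} \neq 0$ only for the $r+1$ partitions worker $k$ stores. Recovery from a responding set $S$ of $K - r$ workers requires coefficients $a_k$ with

$$\sum_{k \in S} a_k f_k \;=\; \sum_{i=1}^{K} \Big( \sum_{k \in S} a_k b_{ki} \Big) g_i \;=\; \sum_{i=1}^{K} g_i, \quad\text{i.e.,}\quad \sum_{k \in S} a_k b_k = \mathbf{1},$$

so the scheme works exactly when the all-ones vector lies in the span of $\{b_k : k \in S\}$ for every such $S$.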

Figure: Gradient Coding, Redundancy vs. Straggler Tolerance

Per-worker redundancy (storage × compute) versus straggler tolerance $r$. Higher $r$ tolerates more slow workers but requires proportionally more per-worker storage and compute.

Example: Straggler Tolerance in Production

A 100-GPU training cluster experiences a typical straggler rate of 5% (about 5 slow GPUs per iteration). Design a gradient-coding scheme that maintains throughput.
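A back-of-the-envelope sizing sketch (assuming, purely for illustration, that stragglers strike independently with probability 5% per worker per iteration; neither the model nor the candidate $r$ values come from the source): since 5% of 100 is the mean straggler count, $r=5$ still overflows often, and $r$ should instead be chosen so that the probability of more than $r$ stragglers is acceptably small.

```python
from math import comb

K, p = 100, 0.05                 # 100 workers, 5% independent straggler rate

def p_overflow(r):
    """P[more than r of K workers straggle] under a Binomial(K, p) model."""
    return sum(comb(K, i) * p**i * (1 - p)**(K - i)
               for i in range(r + 1, K + 1))

for r in (5, 8, 10, 12):
    print(f"r={r:2d}: storage/compute overhead {(r + 1) / K:.0%}, "
          f"P[> {r} stragglers] = {p_overflow(r):.4f}")
# r=5 covers only the mean: iterations still stall ~38% of the time.
# r=10-12 (11-13% overhead) pushes the stall probability near or below 1%.
```

Under this model, the 6% storage of $r=5$ buys far less protection than "tolerates 5 stragglers" suggests; the binomial tail, not the mean, should drive the choice of $r$.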

Coded Computing as a Unifying Framework

Gradient coding is one member of a broader class of coded computing techniques:

  1. Coded matrix multiplication (Lee et al. 2017): distribute matrix blocks with polynomial redundancy to tolerate stragglers.
  2. Coded MapReduce (Li et al. 2018): reduces shuffle-phase communication in MapReduce via coded placement.
  3. Coded convolutions: straggler-tolerant convolution operations for distributed CNN training.
  4. Secure coded computing: combines coding with secret sharing for privacy.

All of these share a common principle: storage replaces recomputation, just as storage replaces bandwidth in coded caching. The CommIT framework's unified perspective has been influential in shaping distributed computing theory.
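As a concrete instance of the first item, here is a minimal sketch of MDS-coded matrix-vector multiplication in the spirit of Lee et al. (2017); the Vandermonde encoder, the dimensions, and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 6, 4                      # 6 workers; any 4 responses suffice (r = 2)
A = rng.normal(size=(8, 5))      # goal: compute A @ x; split A into k row blocks
x = rng.normal(size=5)
blocks = np.split(A, k)          # k blocks of shape (2, 5)

# Vandermonde generator with distinct nodes: every k x k submatrix of rows
# is invertible, which is exactly the MDS property (fine at this small scale;
# real systems prefer better-conditioned codes).
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)

# Encoding: worker i stores the coded block  sum_j G[i, j] * blocks[j].
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker multiplies its coded block by x; workers 0 and 3 straggle.
results = {i: coded[i] @ x for i in range(n) if i not in {0, 3}}

# Decoding: any k responses determine the k block products via a linear solve.
ids = sorted(results)[:k]
block_products = np.linalg.solve(G[ids], np.stack([results[i] for i in ids]))
assert np.allclose(block_products.reshape(-1), A @ x)
```

The structure mirrors gradient coding exactly: linear redundancy at placement time, a cheap linear decode at collection time, and tolerance of any $n - k$ stragglers.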

🔧Engineering Note

Gradient Coding in Practice

Gradient coding has seen limited production deployment:

  1. PyTorch/TensorFlow: Native support for data parallel training; straggler handling via timeouts. No native gradient coding.
  2. Ray (Anyscale): Supports user-defined retry and failure tolerance. Gradient coding can be layered on top as custom middleware (see the sketch after this list).
  3. Horovod / NCCL: All-reduce based; no coding. Relies on tight coupling across workers.
  4. ByteDance BytePS: Parameter-server based; some straggler-mitigation logic but not coded.
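To illustrate the second item, a sketch of gradient coding layered on Ray as user-level middleware (only `ray.remote`, `ray.wait`, `ray.get`, and `ray.init` are real Ray APIs; the fractional-repetition message and the stand-in gradients are illustrative):

```python
import numpy as np
import ray

ray.init(num_cpus=4)

K, r, d = 12, 3, 5
rng = np.random.default_rng(0)
# Stand-in per-partition gradients; in practice each worker would compute
# these from its locally stored data partitions.
g = rng.normal(size=(K, d))

@ray.remote
def coded_gradient(k):
    # Fractional-repetition message: the sum over this worker's block
    # of r+1 consecutive partitions.
    b = (k // (r + 1)) * (r + 1)
    return k, g[b:b + r + 1].sum(axis=0)

refs = [coded_gradient.remote(k) for k in range(K)]
done, _pending = ray.wait(refs, num_returns=K - r)  # first K-r responders win
msgs = dict(ray.get(done))                          # worker id -> coded message

# Decode: one response per group of r+1 workers rebuilds the full gradient.
per_group = {k // (r + 1): v for k, v in msgs.items()}
full_gradient = sum(per_group.values())
assert np.allclose(full_gradient, g.sum(axis=0))
```

The point is that no scheduler support is needed: `ray.wait` already expresses "return after any $K - r$ finish", so the coding layer is ordinary application code.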

Why limited deployment? (a) Implementation complexity versus the simplicity of timeouts and checkpoint-restart; (b) modest gains at typical straggler rates (2-5%); (c) the storage and compute overhead deters adoption. The coded-computing theory is mature; practical adoption lags.

The CommIT group and the Caire-Tuninetti collaboration have contributed to the coded-computing research program, but real-world deployment remains an open engineering question.

Practical Constraints
  • PyTorch DDP: data parallel, no native coding

  • Straggler tolerance via timeout: simple, common

  • Gradient coding: 3-6× storage overhead, complex to deploy

  • Coded MapReduce: research-only, no production systems

Common Mistake: Coded Computing Is Not Free

Mistake:

Claiming gradient coding eliminates straggler latency without acknowledging the storage/compute overhead.

Correction:

Gradient coding requires $r+1$ data partitions per worker (an $(r+1)/K$ fraction of the data), versus 1 partition without coding. Storage increase: $(r+1)$-fold. Compute increase: the same factor.

For a 100-worker cluster with $r=5$: each worker stores 6% instead of 1% of the data and computes 6 partial gradients instead of 1. Total cluster cost: 6× storage and 6× FLOPs. The trade-off is tolerance versus resources.

Deployments must weigh this cost; gradient coding is not a zero-cost optimization.