Coded Computing and Gradient Coding
Beyond Shuffling: Coded Computing
Coded shuffling (§15.2) addresses inter-epoch data movement. A related coded-computing problem addresses a different bottleneck: stragglers. In distributed ML, one or a few slow workers (due to load imbalance, GPU contention, or network issues) can stall the entire training iteration. Coded-computing techniques add redundancy so that the responses of any sufficiently large subset of workers suffice to produce the aggregate gradient, tolerating stragglers.
This section surveys the gradient-coding problem and its relation to coded caching. The main idea: distribute encoded versions of the data across workers so that any large-enough subset can contribute without waiting for the slowest.
Definition: Gradient Coding
In gradient coding, the training dataset is encoded and distributed across $n$ workers with redundancy. Specifically:
- A dataset of $N$ samples is partitioned into $n$ groups of size $N/n$.
- Each worker stores a $\frac{s+1}{n}$ fraction of the dataset (instead of $\frac{1}{n}$), where $s$ is the straggler tolerance.
- At each iteration, workers compute partial gradients on their assigned (possibly overlapping) data.
- Any subset of $n-s$ workers' responses suffices to reconstruct the full gradient (via an appropriate combining operation).
The redundancy factor $s+1$ enables tolerance to $s$ slow workers.
Gradient coding is cache-like: each worker stores more than its "fair share" ($\frac{1}{n}$ of the data) to provide redundancy. The analogy to coded caching: extra memory per worker enables coded robustness, just as cache memory enables coded delivery gains.
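To make the storage pattern concrete, here is a minimal sketch of the common cyclic-repetition placement (an illustrative choice; the definition above allows other placements), with the number of partitions equal to the number of workers $n$:

```python
def cyclic_assignment(n, s):
    """Cyclic-repetition placement: worker i stores the s+1
    consecutive partitions i, i+1, ..., i+s (indices mod n)."""
    return {i: [(i + j) % n for j in range(s + 1)] for i in range(n)}

placement = cyclic_assignment(n=5, s=2)
for worker, parts in placement.items():
    print(f"worker {worker}: partitions {parts}")
# Each worker stores (s+1)/n = 3/5 of the dataset;
# every partition appears on exactly s+1 = 3 workers.
```

Each partition lands on exactly $s+1$ workers, which is precisely the replication the converse below requires.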
Theorem: Gradient-Coding Storage-Tolerance Tradeoff
For distributed gradient descent with $n$ workers tolerating $s$ stragglers, the minimum per-worker storage is a $\frac{s+1}{n}$ fraction of the dataset. That is, the storage overhead matches the straggler tolerance: each additional tolerated straggler costs one more partition per worker.
To recover the full gradient from any $n-s$ workers, each worker returns a linear combination of its partial gradients, and any $n-s$ of these combinations must span a space containing the full-gradient sum. Each worker needs $s+1$ data partitions to achieve this.
Achievability
Assign each worker $s+1$ distinct data partitions (e.g., cyclically: worker $i$ gets partitions $i, i+1, \dots, i+s$, indices mod $n$). Workers compute partial gradients on their partitions, then combine them with Reed-Solomon-like coding coefficients. Any $n-s$ workers' linear combinations can be solved for the full gradient (with random coefficients, the decoding system is full rank with high probability; deterministic constructions decode exactly).
Converse
Information-theoretic argument: each partition must be replicated on at least $s+1$ workers to tolerate $s$ simultaneous failures. Per-worker storage: $s+1$ of the $n$ partitions, i.e., a $\frac{s+1}{n}$ fraction.
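The achievability argument can be sanity-checked in code. The sketch below uses the fractional-repetition variant (an illustrative special case, assuming $(s+1)$ divides $n$): workers are grouped into $n/(s+1)$ groups, every worker in a group stores the same $s+1$ partitions and returns their summed partial gradient, and the master adds one surviving response per group. It verifies exact recovery under every possible straggler pattern:

```python
import itertools
import numpy as np

def frs_demo(n=6, s=2, dim=4, seed=0):
    """Fractional-repetition gradient coding: n workers, s stragglers.
    Workers are grouped into n // (s+1) groups; every worker in group g
    stores the same s+1 partitions and returns their summed gradient."""
    assert n % (s + 1) == 0
    rng = np.random.default_rng(seed)
    partial = rng.standard_normal((n, dim))      # gradient of each partition
    full = partial.sum(axis=0)                   # target: the full gradient
    group = lambda w: w // (s + 1)               # worker -> group index
    # Worker w's response: sum over its group's s+1 partitions.
    response = {w: partial[group(w)*(s+1):(group(w)+1)*(s+1)].sum(axis=0)
                for w in range(n)}
    # Check every possible straggler set of size s.
    for stragglers in itertools.combinations(range(n), s):
        alive = [w for w in range(n) if w not in stragglers]
        picked, seen = [], set()                 # one survivor per group
        for w in alive:
            if group(w) not in seen:
                seen.add(group(w))
                picked.append(w)
        assert len(seen) == n // (s + 1)         # every group survives
        decoded = sum(response[w] for w in picked)
        assert np.allclose(decoded, full)
    return True

print(frs_demo())  # True: all straggler patterns decode exactly
```

Since each group has $s+1$ members, at most $s$ stragglers can never wipe out a whole group, so decoding always succeeds.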
[Figure: Gradient coding, redundancy vs straggler tolerance. Per-worker redundancy (storage × compute), a $\frac{s+1}{n}$ fraction of the dataset, plotted against straggler tolerance $s$: higher $s$ lets the scheme handle more slow workers, at a linearly growing storage cost.]
Example: Straggler Tolerance in Production
A 100-GPU training cluster experiences a typical straggler rate of 5% (~5 slow GPUs per iteration). Design gradient coding to maintain throughput.
Straggler tolerance
To tolerate 5 stragglers: $s = 5$. Storage overhead per worker: $\frac{s+1}{n} = \frac{6}{100} = 6\%$ of the dataset.
Cost
Without coding: 100 GPUs × 1% storage = 100% of the dataset stored across the cluster. With coding: 100 × 6% = 600%, a 6× storage overhead.
Tradeoff
6× storage is substantial but often acceptable for small datasets. For large (TB-scale) datasets, the storage cost becomes prohibitive; in that case, use approximate gradient coding with lower redundancy.
Practical
Production: $s = 2$ to $3$ is typically sufficient. Storage overhead: 3-4% per worker. An acceptable tradeoff for reliable training.
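The arithmetic in this example is easy to check programmatically; a minimal sketch (the function name is ours, for illustration):

```python
def gradient_coding_cost(n_workers, s):
    """Per-worker storage fraction and total cluster storage
    (in multiples of one dataset copy) under gradient coding."""
    per_worker = (s + 1) / n_workers
    total = n_workers * per_worker          # = s + 1 dataset copies
    return per_worker, total

per, total = gradient_coding_cost(100, 5)
print(f"per-worker: {per:.0%}, total: {total:.0f}x the dataset")
# per-worker: 6%, total: 6x the dataset

for s in (2, 3):
    per, _ = gradient_coding_cost(100, s)
    print(f"s={s}: per-worker storage {per:.0%}")
# s=2: 3%, s=3: 4% -- the production range above
```

Note that the total cluster storage is always $s+1$ dataset copies, independent of $n$: the overhead is set entirely by the tolerance target.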
Coded Computing as a Unifying Framework
Gradient coding is one member of a broader class of coded computing techniques:
- Coded matrix multiplication (Lee et al. 2017): distribute matrix blocks with polynomial redundancy to tolerate stragglers.
- Coded MapReduce (Li et al. 2018): reduce shuffle phase communication in MapReduce via coded placement.
- Coded convolutions: for distributed CNN training.
- Secure coded computing: combines coding with secret sharing for privacy.
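To make the coded matrix multiplication idea concrete, here is a minimal polynomial-code sketch (an illustrative construction in the spirit of the cited line of work, not necessarily the exact scheme of Lee et al.): $A$ is split into two row blocks and $B$ into two column blocks, each worker multiplies encoded blocks at its own evaluation point, and any 4 of the $n$ results suffice to interpolate all four product blocks:

```python
import numpy as np

def poly_coded_matmul_demo(n_workers=6, m=4, k=3, p=4, seed=1):
    """Polynomial-coded A @ B: split A into row blocks A0, A1 and
    B into column blocks B0, B1. Worker t computes
    (A0 + A1*x_t) @ (B0 + B1*x_t**2), a degree-3 matrix polynomial
    with coefficients A0B0, A1B0, A0B1, A1B1. Any 4 workers'
    results recover the full product by interpolation."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((2 * m, k))
    B = rng.standard_normal((k, 2 * p))
    A0, A1 = A[:m], A[m:]
    B0, B1 = B[:, :p], B[:, p:]
    xs = np.arange(1, n_workers + 1, dtype=float)   # distinct eval points
    results = {t: (A0 + A1 * x) @ (B0 + B1 * x**2)  # worker t's job
               for t, x in enumerate(xs)}
    # Master: any 4 responders (here 1, 2, 4, 5, i.e., workers 0 and 3
    # straggled); solve the Vandermonde system for the coefficients.
    chosen = [1, 2, 4, 5]
    V = np.vander(xs[chosen], N=4, increasing=True)  # rows [1, x, x^2, x^3]
    stacked = np.stack([results[t].ravel() for t in chosen])
    coeffs = np.linalg.solve(V, stacked).reshape(4, m, p)
    C = np.block([[coeffs[0], coeffs[2]],     # A0B0, A0B1
                  [coeffs[1], coeffs[3]]])    # A1B0, A1B1
    return np.allclose(C, A @ B)

print(poly_coded_matmul_demo())  # True
```

The same storage-for-stragglers trade appears here: each worker does 1/4 of the multiply, but $n > 4$ workers are provisioned so that any 2 can straggle.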
All of these share the common principle: storage replaces recomputation, just as in coded caching, storage replaces bandwidth. The CommIT framework's unified perspective has been influential in shaping distributed computing theory.
Gradient Coding in Practice
Gradient coding has seen limited production deployment:
- PyTorch/TensorFlow: Native support for data parallel training; straggler handling via timeouts. No native gradient coding.
- Ray (Anyscale): Supports user-defined retry and failure tolerance; gradient coding would have to be implemented as custom middleware.
- Horovod / NCCL: All-reduce based; no coding. Relies on tight coupling across workers.
- ByteDance BytePS: Parameter-server based; some straggler-mitigation logic but not coded.
Why limited deployment? (a) Implementation complexity vs the simplicity of timeouts and checkpoint-restart; (b) modest gains at typical straggler rates (2-5%); (c) the storage overhead deters adoption. The coded-computing theory is mature; practical adoption lags.
The CommIT group and the Caire-Tuninetti collaboration have contributed to the coded-computing research program, but real-world deployment remains an open engineering question.
- PyTorch DDP: data parallel, no native coding
- Straggler tolerance via timeout: simple, common
- Gradient coding: 3-6× storage overhead, complex to deploy
- Coded MapReduce: research-only, no production systems
Common Mistake: Coded Computing Is Not Free
Mistake:
Claiming gradient coding eliminates straggler latency without acknowledging the storage/compute overhead.
Correction:
Gradient coding requires $s+1$ data partitions per worker (a $\frac{s+1}{n}$ fraction of the data), vs 1 partition without coding. Storage increase: $(s+1)$-fold. Compute increase: the same factor.
For a 100-worker cluster with $s = 5$: workers store 6% vs 1% of the data and compute 6× the partial gradients. Total cluster cost: 6× storage and 6× FLOPs. The trade-off is tolerance vs resources.
Deployments must weigh this cost; gradient coding is not a zero-cost optimization.