Coded Computing and Gradient Coding
Beyond Shuffling: Coded Computing
Coded shuffling (§15.2) addresses inter-epoch data movement. A related coded-computing problem addresses a different bottleneck: stragglers. In distributed ML, one or a few slow workers (due to load imbalance, GPU contention, or network issues) can stall the entire training iteration. Coded-computing techniques add redundancy so that the responses of any sufficiently large subset of workers suffice to produce the aggregate gradient, tolerating stragglers.
This section surveys the gradient-coding problem and its relation to coded caching. The main idea: distribute encoded versions of the data across workers so that any large-enough subset can contribute without waiting for the slowest.
Definition: Gradient Coding
In gradient coding, the training dataset is encoded and distributed across $n$ workers with redundancy. Specifically:
- A dataset of $N$ samples is partitioned into $n$ groups of size $N/n$.
- Each worker stores a $\frac{s+1}{n}$ fraction of the dataset (instead of $\frac{1}{n}$), where $s$ is the straggler tolerance.
- At each iteration, workers compute partial gradients on their assigned (possibly overlapping) data.
- Any subset of $n-s$ workers' responses suffices to reconstruct the full gradient (via an appropriate combining operation).
The redundancy factor $s+1$ enables tolerance to $s$ slow workers.
Gradient coding is cache-like: each worker stores more than its "fair share" ($\frac{1}{n}$ of the data) to provide redundancy. The analogy to coded caching: extra memory per worker enables coded robustness, just as cache memory enables coded delivery gains.
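To make the storage pattern concrete, here is a minimal sketch of the common cyclic-repetition placement (an illustrative choice; the definition above allows other placements), with the number of partitions equal to the number of workers $n$:

```python
def cyclic_assignment(n, s):
    """Cyclic-repetition placement: worker i stores the s+1
    consecutive partitions i, i+1, ..., i+s (indices mod n)."""
    return {i: [(i + j) % n for j in range(s + 1)] for i in range(n)}

placement = cyclic_assignment(n=5, s=2)
for worker, parts in placement.items():
    print(f"worker {worker}: partitions {parts}")
# Each worker stores (s+1)/n = 3/5 of the dataset;
# every partition appears on exactly s+1 = 3 workers.
```

Each partition lands on exactly $s+1$ workers, which is precisely the replication the converse below requires.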
Theorem: Gradient-Coding Storage-Tolerance Tradeoff
For distributed gradient descent with $n$ workers tolerating $s$ stragglers, the minimum per-worker storage is a $\frac{s+1}{n}$ fraction of the dataset. That is, the storage overhead matches the straggler tolerance: each additional tolerated straggler costs one more partition per worker.
To recover the full gradient from any $n-s$ workers, each worker returns a linear combination of its partial gradients, and any $n-s$ of these combinations must span a space containing the full-gradient sum. Each worker needs $s+1$ data partitions to achieve this.
Achievability
Assign each worker $s+1$ distinct data partitions (e.g., cyclically: worker $i$ gets partitions $i, i+1, \dots, i+s$, indices mod $n$). Workers compute partial gradients on their partitions, then combine them with Reed-Solomon-like coding coefficients. Any $n-s$ workers' linear combinations can be solved for the full gradient (with random coefficients, the decoding system is full rank with high probability; deterministic constructions decode exactly).
Converse
Information-theoretic argument: each partition must be replicated on at least $s+1$ workers to tolerate $s$ simultaneous failures. Per-worker storage: $s+1$ of the $n$ partitions, i.e., a $\frac{s+1}{n}$ fraction.
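The achievability argument can be sanity-checked in code. The sketch below uses the fractional-repetition variant (an illustrative special case, assuming $(s+1)$ divides $n$): workers are grouped into $n/(s+1)$ groups, every worker in a group stores the same $s+1$ partitions and returns their summed partial gradient, and the master adds one surviving response per group. It verifies exact recovery under every possible straggler pattern:

```python
import itertools
import numpy as np

def frs_demo(n=6, s=2, dim=4, seed=0):
    """Fractional-repetition gradient coding: n workers, s stragglers.
    Workers are grouped into n // (s+1) groups; every worker in group g
    stores the same s+1 partitions and returns their summed gradient."""
    assert n % (s + 1) == 0
    rng = np.random.default_rng(seed)
    partial = rng.standard_normal((n, dim))      # gradient of each partition
    full = partial.sum(axis=0)                   # target: the full gradient
    group = lambda w: w // (s + 1)               # worker -> group index
    # Worker w's response: sum over its group's s+1 partitions.
    response = {w: partial[group(w)*(s+1):(group(w)+1)*(s+1)].sum(axis=0)
                for w in range(n)}
    # Check every possible straggler set of size s.
    for stragglers in itertools.combinations(range(n), s):
        alive = [w for w in range(n) if w not in stragglers]
        picked, seen = [], set()                 # one survivor per group
        for w in alive:
            if group(w) not in seen:
                seen.add(group(w))
                picked.append(w)
        assert len(seen) == n // (s + 1)         # every group survives
        decoded = sum(response[w] for w in picked)
        assert np.allclose(decoded, full)
    return True

print(frs_demo())  # True: all straggler patterns decode exactly
```

Since each group has $s+1$ members, at most $s$ stragglers can never wipe out a whole group, so decoding always succeeds.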
[Figure: Gradient coding, redundancy vs straggler tolerance. Per-worker redundancy (storage × compute), a $\frac{s+1}{n}$ fraction of the dataset, plotted against straggler tolerance $s$: higher $s$ lets the scheme handle more slow workers, at a linearly growing storage cost.]
Example: Straggler Tolerance in Production
A 100-GPU training cluster experiences a typical straggler rate of 5% (~5 slow GPUs per iteration). Design gradient coding to maintain throughput.
Straggler tolerance
To tolerate 5 stragglers: $s = 5$. Storage overhead per worker: $\frac{s+1}{n} = \frac{6}{100} = 6\%$ of the dataset.
Cost
Without coding: 100 GPUs × 1% storage = 100% of the dataset stored across the cluster. With coding: 100 × 6% = 600%, a 6× storage overhead.
Tradeoff
6× storage is substantial but often acceptable for small datasets. For large (TB-scale) datasets, the storage cost becomes prohibitive; in that case, use approximate gradient coding with lower redundancy.
Practical
Production: $s = 2$ to $3$ is typically sufficient. Storage overhead: 3-4% per worker. An acceptable tradeoff for reliable training.
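The arithmetic in this example is easy to check programmatically; a minimal sketch (the function name is ours, for illustration):

```python
def gradient_coding_cost(n_workers, s):
    """Per-worker storage fraction and total cluster storage
    (in multiples of one dataset copy) under gradient coding."""
    per_worker = (s + 1) / n_workers
    total = n_workers * per_worker          # = s + 1 dataset copies
    return per_worker, total

per, total = gradient_coding_cost(100, 5)
print(f"per-worker: {per:.0%}, total: {total:.0f}x the dataset")
# per-worker: 6%, total: 6x the dataset

for s in (2, 3):
    per, _ = gradient_coding_cost(100, s)
    print(f"s={s}: per-worker storage {per:.0%}")
# s=2: 3%, s=3: 4% -- the production range above
```

Note that the total cluster storage is always $s+1$ dataset copies, independent of $n$: the overhead is set entirely by the tolerance target.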
Coded Computing as a Unifying Framework
Gradient coding is one member of a broader class of coded computing techniques:
- Coded matrix multiplication (Lee et al. 2017): distribute matrix blocks with polynomial redundancy to tolerate stragglers.
- Coded MapReduce (Li et al. 2018): reduce shuffle phase communication in MapReduce via coded placement.
- Coded convolutions: for distributed CNN training.
- Secure coded computing: combines coding with secret sharing for privacy.
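To make the coded matrix multiplication idea concrete, here is a minimal polynomial-code sketch (an illustrative construction in the spirit of the cited line of work, not necessarily the exact scheme of Lee et al.): $A$ is split into two row blocks and $B$ into two column blocks, each worker multiplies encoded blocks at its own evaluation point, and any 4 of the $n$ results suffice to interpolate all four product blocks:

```python
import numpy as np

def poly_coded_matmul_demo(n_workers=6, m=4, k=3, p=4, seed=1):
    """Polynomial-coded A @ B: split A into row blocks A0, A1 and
    B into column blocks B0, B1. Worker t computes
    (A0 + A1*x_t) @ (B0 + B1*x_t**2), a degree-3 matrix polynomial
    with coefficients A0B0, A1B0, A0B1, A1B1. Any 4 workers'
    results recover the full product by interpolation."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((2 * m, k))
    B = rng.standard_normal((k, 2 * p))
    A0, A1 = A[:m], A[m:]
    B0, B1 = B[:, :p], B[:, p:]
    xs = np.arange(1, n_workers + 1, dtype=float)   # distinct eval points
    results = {t: (A0 + A1 * x) @ (B0 + B1 * x**2)  # worker t's job
               for t, x in enumerate(xs)}
    # Master: any 4 responders (here 1, 2, 4, 5, i.e., workers 0 and 3
    # straggled); solve the Vandermonde system for the coefficients.
    chosen = [1, 2, 4, 5]
    V = np.vander(xs[chosen], N=4, increasing=True)  # rows [1, x, x^2, x^3]
    stacked = np.stack([results[t].ravel() for t in chosen])
    coeffs = np.linalg.solve(V, stacked).reshape(4, m, p)
    C = np.block([[coeffs[0], coeffs[2]],     # A0B0, A0B1
                  [coeffs[1], coeffs[3]]])    # A1B0, A1B1
    return np.allclose(C, A @ B)

print(poly_coded_matmul_demo())  # True
```

The same storage-for-stragglers trade appears here: each worker does 1/4 of the multiply, but $n > 4$ workers are provisioned so that any 2 can straggle.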
All of these share the common principle: storage replaces recomputation, just as in coded caching, storage replaces bandwidth. The CommIT framework's unified perspective has been influential in shaping distributed computing theory.
Gradient Coding in Practice
Gradient coding has seen limited production deployment:
- PyTorch/TensorFlow: Native support for data parallel training; straggler handling via timeouts. No native gradient coding.
- Ray (Anyscale): Supports user-defined retry and failure tolerance; gradient coding would have to be implemented as custom middleware.
- Horovod / NCCL: All-reduce based; no coding. Relies on tight coupling across workers.
- ByteDance BytePS: Parameter-server based; some straggler-mitigation logic but not coded.
Why limited deployment? (a) Implementation complexity vs the simplicity of timeouts and checkpoint-restart; (b) modest gains at typical straggler rates (2-5%); (c) the storage overhead deters adoption. The coded-computing theory is mature; practical adoption lags.
The CommIT group and the Caire-Tuninetti collaboration have contributed to the coded-computing research program, but real-world deployment remains an open engineering question.
- PyTorch DDP: data parallel, no native coding
- Straggler tolerance via timeout: simple, common
- Gradient coding: 3-6× storage overhead, complex to deploy
- Coded MapReduce: research-only, no production systems
Common Mistake: Coded Computing Is Not Free
Mistake:
Claiming gradient coding eliminates straggler latency without acknowledging the storage/compute overhead.
Correction:
Gradient coding requires $s+1$ data partitions per worker (a $\frac{s+1}{n}$ fraction of the data), vs 1 partition without coding. Storage increase: $(s+1)$-fold. Compute increase: the same factor.
For a 100-worker cluster with $s = 5$: workers store 6% vs 1% of the data and compute 6× the partial gradients. Total cluster cost: 6× storage and 6× FLOPs. The trade-off is tolerance vs resources.
Deployments must weigh this cost; gradient coding is not a zero-cost optimization.