Gradient Coding and Coded Matrix Multiplication

Stragglers: The Tail at Scale

In distributed ML clusters, worker nodes experience heavy-tailed latency: a few workers are orders of magnitude slower than the median (Dean and Barroso 2013, "The Tail at Scale"). A synchronous step that waits for all workers is therefore bottlenecked by the slowest one.

Gradient coding (Tandon-Lei-Dimakis-Karampatziakis 2017) and coded matrix multiplication (Lee-Lam-Pedarsani-Papailiopoulos-Ramchandran 2017) use coding theory to tolerate stragglers: assign redundant work so that the master can decode the full output from any sufficiently large subset of workers.

This is a different coded-computing primitive from coded MapReduce (§16.2): LMYA trades compute for shuffle bandwidth, while gradient coding trades compute for latency (straggler) tolerance.

Theorem: Gradient Coding Tradeoff

For a distributed gradient computation with $K$ workers and a per-worker storage redundancy of $s+1$ (each data partition stored at $s+1$ workers), there exist encoding matrices such that the master can recover the full gradient $\sum_{i=1}^N \nabla f_i$ from any $K - s$ workers. The scheme tolerates any $s$ stragglers.

Each worker computes a linear combination of its local gradients (encoded at the coding layer). The encoding matrix is designed so that, for any $K - s$ of the $K$ workers, the all-ones vector lies in the span of their encoding rows, so the master can always recover the sum.

Storage/compute cost: each data partition is stored at $s + 1$ workers, a factor-$(s+1)$ storage overhead for $s$-straggler tolerance.
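
To make the construction concrete, here is a minimal NumPy sketch of the fractional-repetition variant of gradient coding (the simplest scheme in Tandon et al. 2017), assuming $(s+1)$ divides $K$ and using toy scalar gradients; all sizes, names, and the straggler simulation below are illustrative.

```python
import numpy as np

K, s = 6, 2                          # workers and stragglers to tolerate; (s+1) must divide K
rng = np.random.default_rng(0)
grads = rng.normal(size=K)           # gradient of data partition i (toy: scalars)

# Encoding: workers are split into K/(s+1) groups; every worker in group g
# stores that group's s+1 partitions and sends the plain sum of their gradients.
group_parts = [list(range(g * (s + 1), (g + 1) * (s + 1))) for g in range(K // (s + 1))]

def group_of(w):                     # worker w belongs to group w // (s + 1)
    return w // (s + 1)

def worker_output(w):                # plain sum over the worker's stored partitions
    return grads[group_parts[group_of(w)]].sum()

# Simulate s stragglers: the master only hears from the remaining K - s workers.
stragglers = set(rng.choice(K, size=s, replace=False).tolist())
alive = [w for w in range(K) if w not in stragglers]

# Decoding: a group of s+1 workers can lose at most s members, so every group
# has at least one survivor; pick one representative per group and add them up.
rep = {}
for w in alive:
    rep.setdefault(group_of(w), worker_output(w))
decoded = sum(rep.values())

assert np.isclose(decoded, grads.sum())   # equals the full gradient sum over all partitions
```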

Gradient Coding: Latency vs Straggler Tolerance

Expected epoch latency under a Poisson-$\tau$ straggler model, as a function of the tolerance $s$. Larger $s$ $\Rightarrow$ less waiting for slow workers. Tradeoff: more redundancy.

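The shape of this curve is easy to reproduce with a quick Monte Carlo sketch. Since the Poisson-$\tau$ model is not fully specified here, the sketch below assumes i.i.d. exponential worker latencies and $K = 30$ purely for illustration.

```python
import numpy as np

K, trials = 30, 10_000
rng = np.random.default_rng(1)
latency = rng.exponential(scale=1.0, size=(trials, K))   # per-worker epoch latency
latency.sort(axis=1)                                      # order statistics per trial

baseline = latency[:, -1].mean()                          # s = 0: wait for all K workers
for s in (0, 1, 2, 4, 8):
    # With tolerance s the master proceeds after the fastest K - s workers,
    # i.e. the epoch time is the (K - s)-th order statistic of the latencies.
    t = latency[:, K - s - 1].mean()
    print(f"s = {s}: expected epoch latency ~ {t:.2f} (speedup over s = 0: {baseline / t:.2f}x)")
```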

Theorem: Coded Matrix Multiplication

For distributed computation of $\mathbf{A}\mathbf{B}$ (a large matrix product) on $K$ workers using a $(K, k)$ Reed-Solomon-coded scheme, the master can reconstruct $\mathbf{A}\mathbf{B}$ from any $k$ of the $K$ workers' outputs. The expected completion time is the $k$-th order statistic of the worker latencies, substantially lower than the $K$-th (the straggler bottleneck).

Encode $\mathbf{A}$ by evaluating a polynomial in its blocks at $K$ distinct points (Reed-Solomon / Lagrange encoding). Each worker multiplies its polynomial evaluation by $\mathbf{B}$. The master reconstructs the original product via polynomial interpolation from any $k$ points.
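
A minimal sketch of this idea, over the reals for readability (a production scheme would work over a finite field or use numerically stable evaluation points); all sizes, evaluation points, and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
K, k = 6, 3                            # workers and recovery threshold
m, n, p = 12, 5, 4                     # A is m x n, B is n x p; k must divide m
A, B = rng.normal(size=(m, n)), rng.normal(size=(n, p))

# Encoding: split A row-wise into k blocks A_0..A_{k-1} and form the polynomial
# A_tilde(x) = sum_i A_i * x**i; worker j stores the evaluation A_tilde(x_j).
blocks = np.split(A, k)
x = np.linspace(-1.0, 1.0, K)          # K distinct evaluation points
encoded = [sum(blocks[i] * x[j] ** i for i in range(k)) for j in range(K)]

# Each worker multiplies its coded block by B.
outputs = [enc @ B for enc in encoded]

# Decoding: any k workers suffice. Their outputs are evaluations of the
# polynomial sum_i (A_i @ B) x**i, so invert a k x k Vandermonde system to get
# the coefficients A_i @ B, then stack them to recover A @ B.
fastest = sorted(rng.choice(K, size=k, replace=False).tolist())   # pretend these finish first
V = np.vander(x[fastest], N=k, increasing=True)                   # row j: [1, x_j, x_j**2, ...]
coeffs = np.tensordot(np.linalg.inv(V), np.stack([outputs[j] for j in fastest]), axes=1)
AB_decoded = coeffs.reshape(m, p)

assert np.allclose(AB_decoded, A @ B)
```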

Coded Matrix Multiplication: Speedup vs k

Expected speedup over the uncoded scheme ($k = K$) as a function of $k$. Smaller $k$ $\Rightarrow$ larger speedup (less waiting) but more storage per worker.

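For intuition about the numbers, here is a back-of-the-envelope sketch under the illustrative assumption of i.i.d. Exp(1) worker latencies, for which the expected $k$-th fastest of $K$ completion times is $H_K - H_{K-k}$ (harmonic numbers):

```python
K = 20                                     # illustrative worker count

def H(n):                                  # n-th harmonic number, with H(0) = 0
    return sum(1.0 / i for i in range(1, n + 1))

for k in (20, 16, 12, 8, 4):
    t_coded = H(K) - H(K - k)              # expected time of the k-th fastest worker
    print(f"k = {k:2d}: expected speedup over uncoded (k = K) ~ {H(K) / t_coded:.2f}x")
```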

Example: Gradient Coding for ImageNet Training

Train ResNet-50 on ImageNet with $K = 64$ GPUs. Each epoch consists of computing gradients over mini-batches. 5% of GPUs straggle with $10\times$ the median latency per epoch. Quantify the gradient-coding gains.
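
One way to quantify the gains is a rough back-of-the-envelope sketch: assume every healthy GPU takes time $T$ per data partition, the roughly 5% stragglers (3 of 64 GPUs) take $10T$, and there are no other overheads. The numbers below are illustrative estimates, not measurements.

```python
K, T = 64, 1.0
n_stragglers = round(0.05 * K)        # about 3 straggling GPUs at 10x latency

# Uncoded: each GPU processes 1 of the K partitions; the epoch waits for the slowest.
t_uncoded = 10 * T

# Gradient coding with tolerance s = 3: each GPU processes s + 1 = 4 partitions
# (roughly 4T of compute), but the master ignores the s slowest workers.
s = n_stragglers
t_coded = (s + 1) * T

print(f"tolerance s = {s}, storage/compute overhead = {s + 1}x")
print(f"uncoded epoch ~ {t_uncoded:.0f} T, coded epoch ~ {t_coded:.0f} T, "
      f"i.e. about {t_uncoded / t_coded:.1f}x faster per epoch")
```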

CommIT Contribution (2024)

CommIT Work on Coded Gradient Methods for Federated Learning

H. Joudeh and G. Caire, IEEE Transactions on Communications

The CommIT group extends coded gradient methods to the federated learning (FL) setting, where clients have heterogeneous data and stragglers are common. Key contributions:

  1. Straggler-robust FL with heterogeneous data. Unlike the original gradient-coding setup (which assumes homogeneous batch sizes), the CommIT work handles the non-IID client data typical of FL.
  2. Privacy-compatible coding. Integration with differential privacy and secure aggregation (connections to Ch 17), so that gradient coding does not compromise privacy.
  3. Communication-efficient variants. Reduce gradient upload bandwidth by a factor proportional to $K$, similar to Wan-Tuninetti-Caire shuffling (§15.2).

Impact: provides a path to deploying coded-computing techniques in cross-device FL at scale, integrating coded-computing theory with practical FL systems such as TensorFlow Federated and Flower.


Common Mistake: Don't Confuse Gradient Coding with Coded MapReduce

Mistake:

Treating all "coded computing" schemes as equivalent.

Correction:

Two distinct coded-computing primitives:

  • Coded MapReduce (LMYA): trades compute redundancy for shuffle bandwidth. A memory-communication tradeoff, paid in bandwidth.
  • Gradient coding / coded matmul: trades compute redundancy for straggler tolerance. A time-latency tradeoff.

They can be composed (store redundantly + code gradients) but address different bottlenecks. Confusing them leads to wrong system designs.

Key Takeaway

Gradient coding and coded matrix multiplication are distinct from coded MapReduce: they trade storage for straggler tolerance, not bandwidth. Both are foundational coded-computing primitives. Together with coded MapReduce, they form the three pillars of coded distributed computing, a subfield that unifies information theory, coding theory, and systems.