Deployment in Distributed ML Systems
From Theory to GPU Clusters
Coded shuffling and gradient coding are theoretically mature but have only recently begun to see serious attention from ML systems engineers. This section addresses how coded-caching-inspired techniques can be deployed in production ML infrastructure.
Definition: Distributed ML Architectures
Main paradigms for distributed ML training:
- Parameter server. Central server holds model parameters; workers compute gradients, send to server, receive updates. Easy to implement; central bottleneck.
- All-reduce (synchronous SGD). Workers communicate gradients peer-to-peer; no central server. Scales well to 100s-1000s of workers. Used by PyTorch DDP, Horovod, NCCL.
- Asynchronous. Workers send gradients as they finish; the server updates the model without waiting for stragglers. Faster per step, but convergence is less stable.
- Federated learning. Devices hold their own data, train locally, and send only model updates. Raw data never leaves the devices.
Coded shuffling applies primarily to the first two paradigms (parameter server and all-reduce), where data movement is the bottleneck. Federated learning is a different model with its own communication-efficiency techniques.
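The parameter-server and all-reduce paradigms can be contrasted in a few lines. The toy single-process sketch below (plain Python, no real networking) shows that both compute the same averaged gradient; only the communication pattern differs. The ring function is a simplified one-pass ring, not the chunked ring all-reduce that NCCL actually implements.

```python
# Toy comparison of parameter-server vs. all-reduce gradient averaging.
K = 4  # number of workers

# Each worker computes a local gradient (here: a 3-dim toy vector).
local_grads = [[float(w + d) for d in range(3)] for w in range(K)]

def parameter_server(grads):
    """Star topology: workers push to one server, which averages and broadcasts."""
    dim = len(grads[0])
    avg = [sum(g[d] for g in grads) / len(grads) for d in range(dim)]
    return [avg[:] for _ in grads]          # broadcast a copy to every worker

def ring_all_reduce(grads):
    """Peer-to-peer: a running sum travels once around the ring (reduce),
    then the average travels around again (broadcast). Simplified, unchunked."""
    n, dim = len(grads), len(grads[0])
    total = grads[0][:]
    for w in range(1, n):                   # reduce phase: n-1 hops
        total = [total[d] + grads[w][d] for d in range(dim)]
    avg = [t / n for t in total]
    return [avg[:] for _ in range(n)]       # broadcast phase: n-1 hops

# Both paradigms deliver the identical averaged gradient to every worker.
assert parameter_server(local_grads) == ring_all_reduce(local_grads)
```

The difference that matters at scale is not the result but the traffic pattern: the star concentrates all K transfers on one node, while the ring spreads them evenly across peers.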
Where Coding Helps
Coded techniques are most impactful at specific scales:
- Large datasets (terabyte-scale and beyond). Shuffling cost is significant, so the reduction is meaningful: coded shuffling is beneficial.
- Many workers (large K). Coordination and shuffling costs grow with K; coded techniques give concrete savings.
- Unreliable / elastic clusters. Gradient coding helps tolerate stragglers (cloud-based training at variable capacity).
- Privacy-constrained settings. Coded computing extends to secret-sharing for federated learning.
Less impactful:
- Small-scale deployments (a handful of workers): coding overhead exceeds the savings.
- Tightly coupled clusters (HPC, no stragglers): no coding benefit.
- Single-machine ML: irrelevant.
The target for practical coded-ML is large-scale production training pipelines (Netflix, Meta, Google). Even 5-10% speedup from coding is worth millions at that scale.
Example: Coded Shuffling at Hyperscale
A hyperscaler (Meta / Google) trains foundation models with 1000 GPUs on 10-PB datasets. Training run: 100 epochs. Naive shuffling cost per run: ?
Uncoded cost
1000 workers × ≈9.9 PB fetched per worker per epoch × 100 epochs = 990,000 PB = 990 EB (exabytes). Too much to shuffle explicitly.
Actual practice
In practice, hyperscalers don't shuffle the full dataset per epoch — they use approximate shuffling (rotate within smaller windows). Still: significant inter-epoch traffic.
Coded shuffling
The multicast gain at the assumed worker-memory fraction is 51, so communication is reduced 51×: total ≈ 19 EB. Still large, but far more feasible.
Business case
At roughly $0.01 per GB transferred (inter-DC bandwidth): uncoded cost ≈ $9.9B per run; coded cost ≈ $190M. Savings per run: ≈ $9.7B.
These numbers are inflated (hyperscalers don't pay rack-rate for internal traffic), but illustrate that coded techniques have real economic value at scale.
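A quick sanity check of the arithmetic above, treating the section's figures (1000 workers, 10 PB dataset, 100 epochs, 51× coded gain, ~$0.01/GB inter-DC pricing) as stated assumptions rather than measured hyperscaler numbers:

```python
# Back-of-envelope check of the hyperscale shuffling example.
workers = 1000
epochs = 100
per_worker_pb = 9.9       # PB fetched per worker per epoch (assumed: ~the
                          # full 10 PB dataset minus each worker's local share)

# Uncoded: every worker refetches nearly the whole dataset every epoch.
uncoded_pb = per_worker_pb * workers * epochs        # 990,000 PB = 990 EB

# Coded shuffling: the section's assumed 51x multicast gain.
gain = 51
coded_eb = uncoded_pb / 1000 / gain                  # ~19.4 EB

# Cost at an assumed $0.01 per GB of inter-DC traffic.
price_per_gb = 0.01
uncoded_cost = uncoded_pb * 1e6 * price_per_gb       # PB -> GB; ~$9.9B
coded_cost = coded_eb * 1e9 * price_per_gb           # EB -> GB; ~$190M
savings = uncoded_cost - coded_cost                  # ~$9.7B per run
```

Running this reproduces the section's figures: 990 EB uncoded, ~19 EB coded, and roughly $9.7B in (nominal) savings per training run.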
Bandwidth Bottlenecks in Hyperscale Training
In modern GPU training:
- Intra-node bandwidth. NVLink / PCIe: 600 GB/s GPU-to-GPU within a node. No bottleneck.
- Intra-rack bandwidth. 200-400 Gbps per GPU NIC; 8 GPUs per server × 400 Gbps = 3.2 Tbps aggregate per server. A modest constraint.
- Cross-rack bandwidth. Top-of-rack switches carry 2-4 Tbps per rack; across 100+ racks, hundreds of Tbps aggregate. Here bandwidth constraints start.
- Cross-DC bandwidth. Wide-area inter-datacenter: single-digit Gbps per link. Expensive; this is where coding helps.
Coded shuffling is most valuable at cross-DC training (federated / multi-DC). For single-DC all-reduce, communication dominates only in specific regimes. The CommIT group's ongoing research identifies when and where coding pays off.
- Intra-node: 600 GB/s (no bottleneck)
- Intra-rack: 2-4 Tbps (modest)
- Cross-rack: ~100 GB/s (start of constraint)
- Cross-DC: 10-100 Gbps (bottleneck for large models)
Federated Learning and Coded Techniques
Federated learning (FL) has its own communication-reduction techniques:
- Gradient sparsification. Send only the largest gradient components. Reduces bandwidth dramatically (~100×).
- Gradient quantization. 8-bit or 4-bit gradients instead of fp32. Reduces 4-8×.
- Local SGD. Workers perform multiple local updates before communication. Reduces frequency.
- Coded FL (emerging): apply coded shuffling + secret sharing for communication efficiency and privacy.
Coded techniques are complementary to sparsification, quantization, and local SGD. A production FL system could combine quantized gradients, coded shuffling, and local SGD for maximum efficiency.
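Sparsification and quantization are simple enough to sketch directly. This pure-Python toy (not a production FL stack) implements top-k gradient sparsification and uniform 8-bit quantization with its dequantizer:

```python
# Toy FL bandwidth reducers: top-k sparsification and 8-bit quantization.

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries, as index -> value pairs."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return {i: grad[i] for i in idx}

def quantize_8bit(grad):
    """Uniformly quantize values to 256 levels over the gradient's range."""
    lo, hi = min(grad), max(grad)
    scale = (hi - lo) / 255 or 1.0            # avoid div-by-zero on flat input
    q = [round((g - lo) / scale) for g in grad]   # ints in 0..255 (1 byte each)
    return q, lo, scale

def dequantize(q, lo, scale):
    return [lo + qi * scale for qi in q]

grad = [0.01, -2.0, 0.5, 3.0, -0.02, 0.0]
sparse = top_k_sparsify(grad, k=2)            # only 2 of 6 entries transmitted
q, lo, scale = quantize_8bit(grad)            # 1 byte/entry vs 4 for fp32
approx = dequantize(q, lo, scale)             # within scale/2 of the original
```

Note the two methods stack: one can quantize the surviving top-k values, which is the kind of composition the paragraph above describes.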
The CommIT group collaborates with UIC (Tuninetti) and others on coded FL. Early-stage research; prototype systems exist.
Key Takeaway
Coded shuffling and coded computing extend coded caching to distributed ML. Key deployment targets: hyperscale training (many workers, terabyte-scale datasets), cross-DC federated learning, and straggler-prone cloud clusters. Not immediately transformative for small-scale training, but a major lever at production scale, where bandwidth costs millions per year.
Quick Check
In coded shuffling (Wan-Tuninetti-Caire), worker memory corresponds to what in coded caching?
User demands
Cache contents (the files stored per user)
The shared-link bandwidth
The library size N
Correct. Worker memory (the fraction of the dataset each worker stores) is exactly analogous to the user cache (M of the N files). The gain mirrors the MAN coded-multicasting gain.
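In numbers, the analogy works as follows: with K users each caching a fraction μ = M/N of the library, the MAN multicast gain is 1 + Kμ, and coded shuffling inherits the same formula with worker memory in the role of the cache. (Matching this formula to the 51× gain in the hyperscale example would imply K = 1000, μ = 0.05; that specific μ is an inference, not stated in the text.)

```python
# The coded-caching / coded-shuffling analogy in one formula.
def multicast_gain(K, mem_fraction):
    """MAN coded-multicasting gain: 1 + K * (M/N).

    In coded shuffling, mem_fraction is the share of the dataset
    each worker stores locally."""
    return 1 + K * mem_fraction

assert multicast_gain(10, 0.2) == 3.0      # 10 workers caching 20% each
assert multicast_gain(1000, 0.05) == 51.0  # regime matching the 51x example
```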
Historical Note: Coded Computing's Origin Story
2016–2024. Coded computing as a field emerged around 2016–2017 through several parallel efforts:
- 2017: Tandon-Li-Ramchandran — gradient coding for straggler mitigation.
- 2017: Lee-Suh-Lopez — coded matrix multiplication.
- 2018: Li-Maddah-Ali-Yu-Avestimehr — coded MapReduce.
- 2020: Wan-Tuninetti-Caire — coded data shuffling, connecting to coded caching framework.
- 2021-2024: Privacy-preserving coded computing (Kim et al.), federated coding (Caire-Tuninetti collab).
The field is now mature theoretically but practical deployment is in early stages. Production ML systems are starting to adopt gradient coding for fault tolerance; coded shuffling remains mostly research.
The CommIT contribution (Wan-Tuninetti-Caire 2020) was important for the framing: coded computing as a natural extension of coded caching, not a separate topic. This unification has clarified research directions and suggested new problems.