Deployment in Distributed ML Systems

From Theory to GPU Clusters

Coded shuffling and gradient coding are theoretically mature but have only recently begun to receive serious attention from ML systems engineers. This section addresses how coded-caching-inspired techniques could roll out in production ML infrastructure.

Definition: Distributed ML Architectures

Main paradigms for distributed ML training:

  1. Parameter server. A central server holds the model parameters; workers compute gradients, send them to the server, and receive updated parameters. Easy to implement, but the server is a central bottleneck.
  2. All-reduce (synchronous SGD). Workers exchange gradients peer-to-peer; no central server. Scales well to 100s-1000s of workers. Used by PyTorch DDP, Horovod, NCCL.
  3. Asynchronous. Workers send gradients as they finish; the server updates the model without waiting for everyone. Higher throughput, but convergence is less stable.
  4. Federated learning. Devices hold their own data, train locally, and send only model updates. Raw data never leaves the device.

Coded shuffling applies primarily to paradigms 1 and 2, where data movement is the bottleneck. Federated learning is a different model with its own communication-efficiency techniques.
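To ground paradigm 2, here is a minimal sketch of one synchronous SGD step with peer-to-peer gradient averaging via torch.distributed. The toy model, batch, backend choice, and launch via torchrun are illustrative assumptions, not a recommendation for any particular stack.

```python
# Minimal sketch of paradigm 2 (all-reduce synchronous SGD) using torch.distributed.
# Assumes the script is launched with torchrun so rank/world-size env vars are set;
# the toy model, batch, and learning rate are placeholders, not a real training setup.
import torch
import torch.distributed as dist

def allreduce_sgd_step(model, loss_fn, batch, lr=0.01):
    """One synchronous step: local backward pass, then peer-to-peer gradient averaging."""
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Sum this gradient across all workers, then divide to get the average.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad  # plain SGD update, identical on every worker
    return loss.item()

if __name__ == "__main__":
    dist.init_process_group("gloo")   # "nccl" on GPU clusters
    torch.manual_seed(0)              # same init on every rank
    model = torch.nn.Linear(10, 1)
    batch = (torch.randn(32, 10), torch.randn(32, 1))
    print(allreduce_sgd_step(model, torch.nn.MSELoss(), batch))
```

Production systems wrap this pattern in PyTorch DDP or Horovod, which overlap the all-reduce with the backward pass; the sketch only exposes the communication step that coded techniques target.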

Where Coding Helps

Coded techniques are most impactful at specific scales:

  1. Large datasets (> TB). Shuffling cost is significant; a 1 + Ks reduction factor is meaningful. Coded shuffling is beneficial.
  2. Many workers (K > 100). Coordination and shuffling costs grow with K; coded techniques give concrete savings.
  3. Unreliable / elastic clusters. Gradient coding helps tolerate stragglers (e.g., cloud-based training at variable capacity); see the sketch at the end of this subsection.
  4. Privacy-constrained settings. Coded computing extends to secret-sharing for federated learning.

Less impactful:

  • Small-scale (K < 20): overhead > savings.
  • Tightly coupled clusters (HPC, no stragglers): no coding benefit.
  • Single-machine ML: irrelevant.

The target for practical coded-ML is large-scale production training pipelines (Netflix, Meta, Google). Even 5-10% speedup from coding is worth millions at that scale.
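To make the straggler point in item 3 concrete, below is a small sketch of the classic cyclic gradient code for K = 3 workers tolerating one straggler: each worker sends a single linear combination of two partial gradients, and the full gradient sum is recoverable from any two workers. The partial gradients are random placeholders; the coefficients follow the standard construction.

```python
# Sketch of gradient coding with K = 3 workers tolerating 1 straggler (cyclic scheme).
# g1, g2, g3 are random placeholders standing in for per-partition gradients; in
# practice each partition is assigned to two workers so every worker computes two of them.
import numpy as np

rng = np.random.default_rng(0)
g1, g2, g3 = (rng.standard_normal(5) for _ in range(3))
full_gradient = g1 + g2 + g3            # what the master ultimately needs

# Each worker computes and sends ONE coded message (a fixed linear combination):
worker_msgs = {
    1: 0.5 * g1 + g2,
    2: g2 - g3,
    3: 0.5 * g1 + g3,
}

# Decoding coefficients: for each possible straggler, how to combine the two survivors.
decoders = {
    3: {1: 2.0, 2: -1.0},   # worker 3 is slow:  2*msg1 - 1*msg2
    1: {2: 1.0, 3: 2.0},    # worker 1 is slow:  1*msg2 + 2*msg3
    2: {1: 1.0, 3: 1.0},    # worker 2 is slow:  1*msg1 + 1*msg3
}

for straggler, coeffs in decoders.items():
    recovered = sum(c * worker_msgs[w] for w, c in coeffs.items())
    assert np.allclose(recovered, full_gradient)
    print(f"straggler = worker {straggler}: full gradient recovered from the other two")
```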

Example: Coded Shuffling at Hyperscale

A hyperscaler (Meta / Google) trains foundation models with 1000 GPUs on 10-PB datasets. A training run lasts 100 epochs. Naive shuffling cost per run: if every epoch re-shuffles the full dataset across the network, that is 100 × 10 PB ≈ 1 EB of traffic per run; a rough estimate follows.
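A back-of-the-envelope sketch of the example, assuming each epoch re-shuffles the full dataset and using the 1 + Ks reduction factor quoted above as a rough multiplier; the excess-storage parameter s is an illustrative assumption, not a measured figure.

```python
# Back-of-the-envelope for the hyperscale example. All figures are illustrative assumptions.
dataset_bytes = 10e15        # 10 PB dataset
epochs = 100                 # epochs per training run
K = 1000                     # number of GPU workers
s = 4                        # assumed normalized excess storage per worker

# Naive shuffling: the full dataset crosses the network once per epoch.
naive_traffic = dataset_bytes * epochs          # ~1e18 bytes = 1 EB per run

# Coded shuffling: traffic reduced by roughly the 1 + Ks factor quoted earlier.
coded_traffic = naive_traffic / (1 + K * s)

print(f"naive shuffle traffic:  {naive_traffic / 1e18:.1f} EB per run")
print(f"coded shuffle traffic:  {coded_traffic / 1e15:.2f} PB per run "
      f"(reduction factor 1 + Ks = {1 + K * s})")
```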

⚠️Engineering Note

Bandwidth Bottlenecks in Hyperscale Training

In modern GPU training:

  1. Intra-node bandwidth. NVLink / PCIe: up to ~600 GB/s GPU-to-GPU within a node. Not a bottleneck.
  2. Intra-rack bandwidth. 200-400 Gbps per GPU NIC; 8 GPUs per server × 400 Gbps = 3.2 Tbps aggregate per server. Modest.
  3. Cross-rack bandwidth. Top-of-rack switches: 2-4 Tbps per rack; at 100+ racks, hundreds of Tbps aggregate. Here bandwidth constraints start.
  4. Cross-DC bandwidth. Wide-area inter-datacenter links: tens of Gbps per link. Expensive; this is where coding helps most.

Coded shuffling is most valuable at cross-DC training (federated / multi-DC). For single-DC all-reduce, communication dominates only in specific regimes. The CommIT group's ongoing research identifies when and where coding pays off.

Practical Constraints
  • Intra-node: ~600 GB/s (no bottleneck)
  • Intra-rack: ~3.2 Tbps per server (modest)
  • Cross-rack: 2-4 Tbps per rack (constraints begin)
  • Cross-DC: 10-100 Gbps (bottleneck for large models)
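To see where shuffling becomes communication-bound, here is a rough per-tier estimate of the time to move the 10-PB dataset from the earlier example; every aggregate-bandwidth figure is an illustrative assumption, not a measurement of any specific cluster.

```python
# Rough per-epoch shuffle-time estimate at each network tier for a 10 PB dataset.
# Every bandwidth figure below is an illustrative assumption.
dataset_bytes = 10e15

# Assumed usable aggregate bandwidth for the shuffle at each tier (bytes per second).
tiers = {
    "intra-node (NVLink, 125 nodes)": 600e9 * 125,   # 600 GB/s per node x 125 nodes
    "cross-rack (cluster fabric)":    400e12 / 8,    # ~400 Tbps aggregate
    "cross-DC (WAN)":                 100e9 / 8,     # ~100 Gbps link
}

for tier, bw in tiers.items():
    hours = dataset_bytes / bw / 3600
    print(f"{tier:32s} ~{hours:10.2f} hours per full shuffle")
```

Under these assumptions, only the cross-DC tier turns a full shuffle into a multi-day operation, which is consistent with the note above that coding matters most for cross-DC training.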

Federated Learning and Coded Techniques

Federated learning (FL) has its own communication-reduction techniques:

  1. Gradient sparsification. Send only the largest gradient components. Reduces bandwidth dramatically (~100×).
  2. Gradient quantization. Send 8-bit or 4-bit gradients instead of fp32. Reduces bandwidth 4-8×.
  3. Local SGD. Workers perform multiple local updates before communication. Reduces frequency.
  4. Coded FL (emerging): apply coded shuffling + secret sharing for communication efficiency and privacy.

Coded techniques are complementary to (1)-(3). A production FL system could combine quantized gradients, coded shuffling, and local SGD for maximum efficiency; a small sketch of (1) and (2) follows.
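The sketch below illustrates how techniques (1) and (2) compress a model update before transmission: top-k sparsification followed by 8-bit quantization. The gradient, the value of k, and the sizes are arbitrary illustrative choices; a real FL system would also accumulate the dropped residual (error feedback) across rounds.

```python
# Sketch of techniques (1) and (2): top-k gradient sparsification plus 8-bit quantization.
# The gradient, k, and sizes are arbitrary illustrative choices.
import numpy as np

def sparsify_topk(grad, k):
    """Keep only the k largest-magnitude entries; return their indices and values."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def quantize_int8(values):
    """Uniformly quantize float values to int8 plus one float32 scale."""
    scale = max(float(np.max(np.abs(values))), 1e-12) / 127.0
    return np.round(values / scale).astype(np.int8), scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
grad = rng.standard_normal(1_000_000).astype(np.float32)   # stand-in gradient

idx, vals = sparsify_topk(grad, k=10_000)     # keep ~1% of entries (~100x sparser)
q, scale = quantize_int8(vals)                # 4x fewer bytes per kept entry
recovered = dequantize_int8(q, scale)         # what the server would reconstruct

payload = idx.astype(np.int32).nbytes + q.nbytes + 4   # indices + int8 values + scale
print(f"raw gradient:      {grad.nbytes / 1e6:6.2f} MB")
print(f"compressed update: {payload / 1e6:6.2f} MB "
      f"(max dequantization error {np.max(np.abs(recovered - vals)):.4f})")
```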

The CommIT group collaborates with UIC (Tuninetti) and others on coded FL. Early-stage research; prototype systems exist.

Key Takeaway

Coded shuffling and coded computing extend coded caching to distributed ML. Key deployment targets: hyperscale training (K > 100 workers, datasets beyond the TB scale), cross-DC federated learning, and straggler-prone cloud clusters. Not immediately transformative for small-scale training, but major at production scale where bandwidth costs millions per year.

Quick Check

In coded shuffling (Wan-Tuninetti-Caire), worker memory corresponds to what in coded caching?

  • User demands
  • Cache contents (the M files stored per user)
  • The shared-link bandwidth
  • The library size N

Historical Note: Coded Computing's Origin Story

2016–2024

Coded computing as a field emerged around 2016-2017 through several parallel efforts:

  • 2017: Tandon-Li-Ramchandran — gradient coding for straggler mitigation.
  • 2017: Lee-Suh-Lopez — coded matrix multiplication.
  • 2018: Li-Maddah-Ali-Yu-Avestimehr — coded MapReduce.
  • 2020: Wan-Tuninetti-Caire — coded data shuffling, connecting to coded caching framework.
  • 2021-2024: Privacy-preserving coded computing (Kim et al.), federated coding (Caire-Tuninetti collab).

The field is now mature theoretically but practical deployment is in early stages. Production ML systems are starting to adopt gradient coding for fault tolerance; coded shuffling remains mostly research.

The CommIT contribution (Wan-Tuninetti-Caire 2020) was important for the framing: coded computing as a natural extension of coded caching, not a separate topic. This unification has clarified research directions and suggested new problems.