Exercises
ex-cc-ch15-01
(Easy) Compute the coded shuffling cost for given values of K, storage fraction μ, and dataset size in PB.
Apply R = K(1 − μ)N/(1 + Kμ).
Compute
Coded cost per epoch: K(1 − μ)N/(1 + Kμ), with N the dataset size in PB. Uncoded: K(1 − μ)N PB. Savings: a factor of 1 + Kμ.
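A minimal numeric sketch of the computation, assuming the chapter's accounting (uncoded cost K(1 − μ) × dataset size, coded gain 1 + Kμ); the example values K = 100, μ = 0.1, 1 PB are illustrative only.

```python
# Per-epoch shuffling cost (sketch): uncoded vs. coded, in dataset-size units.
def shuffling_cost_pb(K, mu, dataset_pb):
    uncoded = K * (1 - mu) * dataset_pb      # every worker fetches the (1-mu) fraction it lacks
    coded = uncoded / (1 + K * mu)           # coded multicasting gain of 1 + K*mu
    return uncoded, coded, 1 + K * mu

print(shuffling_cost_pb(K=100, mu=0.1, dataset_pb=1.0))   # (90.0, ~8.18, 11.0)
```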
ex-cc-ch15-02
(Easy) State the coded-caching / coded-shuffling analogy.
Worker memory = cache.
Table
Cache ↔ Worker memory
Files ↔ Dataset
Memory ratio M/N ↔ Storage fraction μ
Demand ↔ New assignment
Delivery ↔ Shuffling
Caching gain 1 + KM/N ↔ Shuffling gain 1 + Kμ
ex-cc-ch15-03
(Easy) Gradient coding with K = 50 workers tolerates s = 4 stragglers. What's the storage overhead per worker?
Each worker stores s + 1 of the K data partitions.
Compute
(s + 1)/K = 5/50 = 10%. Per-worker storage: 10% of the dataset (vs 2% = 1/K without coding). Storage overhead: 5× (= s + 1).
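A quick check of this arithmetic, using the values K = 50 and s = 4 implied by the answer:

```python
# Gradient-coding storage arithmetic: each worker stores s+1 of the K partitions.
K, s = 50, 4
uncoded_fraction = 1 / K            # 1/K = 2% of the dataset per worker without coding
coded_fraction = (s + 1) / K        # (s+1)/K = 10% with gradient coding
print(coded_fraction, uncoded_fraction, coded_fraction / uncoded_fraction)   # 0.1 0.02 5.0
```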
ex-cc-ch15-04
(Easy) Why does the coded shuffling rate saturate as K → ∞ at fixed μ?
(1 − μ)/μ asymptote.
Answer
At large K: R = K(1 − μ)/(1 + Kμ) → (1 − μ)/μ (asymptote), independent of K. Beyond moderate K, the total shuffling load is governed by μ alone, so adding workers no longer changes the communication relative to the dataset size. The storage fraction μ is the dominant lever.
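A short numeric sketch of the saturation, assuming the total coded load R(K, μ) = K(1 − μ)/(1 + Kμ) in dataset units:

```python
# The total coded load R(K, mu) = K*(1-mu)/(1+K*mu) climbs toward the ceiling (1-mu)/mu.
def coded_load(K, mu):
    return K * (1 - mu) / (1 + K * mu)

mu = 0.1
for K in (10, 100, 1000, 10000):
    print(K, round(coded_load(K, mu), 3))   # 4.5, 8.182, 8.911, 8.991
print("asymptote:", (1 - mu) / mu)          # 9.0
```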
ex-cc-ch15-05
(Easy) In coded shuffling, if μ = 1 (each worker stores the full dataset), what is the communication cost?
No shuffling needed.
Answer
μ = 1: each worker has the full dataset; no data transfer is needed per epoch. R = 0. Formula check: K(1 − μ)/(1 + Kμ) = 0 at μ = 1. ✓
ex-cc-ch15-06
(Medium) Converse proof sketch. Prove that the coded shuffling rate is order-optimal.
Cut-set argument on worker information.
Cut-set
Consider any single worker. At epoch t, it needs its new assignment; a (1 − μ) fraction of that assignment is new data, while the rest is already in its local storage.
Bound
Each worker must receive at least a (1 − μ) fraction of its assignment as new data per epoch. Summed across the K workers, the total demand is K(1 − μ) assignment-equivalents. Since a single coded multicast transmission can be useful to at most on the order of 1 + Kμ workers' stored side information, no scheme can do better than order K(1 − μ)/(1 + Kμ).
Matching achievability
The MAN-style coded scheme (§15.2) matches this bound asymptotically.
ex-cc-ch15-07
(Medium) Partial shuffling. If only a fraction α of the dataset is re-shuffled per epoch (the rest stays), what's the cost?
Linear scaling in α.
Formula
R(α) = α · K(1 − μ)/(1 + Kμ). Linear in α; the saturating gain structure is unchanged.
Practical
Production ML pipelines typically re-shuffle only a fraction of the data per epoch (e.g., α ≈ 0.1), not the full dataset, so the total cost is already about 10× less than full shuffling. Coding still helps on top of this; the reduction factor remains 1 + Kμ.
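A minimal sketch of the partial-shuffling cost under the linear-in-α model above; the values K = 100, μ = 0.1, 1 PB, and α = 0.1 are illustrative:

```python
# Partial shuffling cost (sketch): only a fraction alpha of the dataset is re-shuffled.
def partial_shuffle_cost_pb(alpha, K, mu, dataset_pb):
    return alpha * K * (1 - mu) * dataset_pb / (1 + K * mu)

full = partial_shuffle_cost_pb(1.0, 100, 0.1, 1.0)     # full re-shuffle, ~8.18 PB
partial = partial_shuffle_cost_pb(0.1, 100, 0.1, 1.0)  # alpha = 0.1, ~0.82 PB (10x less)
print(full, partial)
```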
ex-cc-ch15-08
(Medium) Wall-clock savings. ResNet-50 training at K = 256, μ = 0.05: compute the shuffling time savings per epoch and over a full training run.
Use earlier numbers: 36.5 TB per epoch uncoded.
Per epoch
Gain: 1 + Kμ = 1 + 256 × 0.05 = 13.8. Coded: 36.5/13.8 ≈ 2.64 TB per epoch. At 100 Gbps per worker across 256 workers (≈25.6 Tbps aggregate): shuffling time ≈ 0.82 seconds. Uncoded: ≈11 seconds.
Per training
90 epochs. Uncoded: ≈1000 s ≈ 17 min. Coded: ≈74 s ≈ 1.2 min. Saved: ~16 min per training run.
Scale
For 1000 training runs/year (hyperscale): ~260 hrs of compute cluster time saved per year, on shuffling alone.
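A back-of-the-envelope check of the figures above, assuming 36.5 TB uncoded per epoch, a gain of 1 + 256 × 0.05 = 13.8, and aggregate bandwidth of 256 × 100 Gbps:

```python
# Wall-clock sanity check for the ResNet-50 shuffling example (assumed parameters as above).
uncoded_tb = 36.5                      # uncoded shuffling volume per epoch, TB
gain = 1 + 256 * 0.05                  # coded gain 1 + K*mu = 13.8
coded_tb = uncoded_tb / gain           # ~2.64 TB per epoch
agg_bw = 256 * 100e9                   # aggregate bandwidth: 256 workers x 100 Gbps, bits/s

def transfer_seconds(tb, bw_bits_per_s):
    return tb * 1e12 * 8 / bw_bits_per_s

t_uncoded = transfer_seconds(uncoded_tb, agg_bw)    # ~11.4 s per epoch
t_coded = transfer_seconds(coded_tb, agg_bw)        # ~0.83 s per epoch
print(t_uncoded, t_coded)
print(90 * t_uncoded / 60, "min uncoded;", 90 * t_coded, "s coded")   # ~17 min vs ~74 s
```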
ex-cc-ch15-09
(Medium) Gradient coding + shuffling. Combine both techniques in a training loop. Does the combination compound?
Orthogonal mechanisms.
Gradient coding
Handles straggler tolerance via redundant data and computation. Storage overhead: (s + 1)× per worker.
Coded shuffling
Handles inter-epoch data transfer. Bandwidth reduction: 1 + Kμ.
Combination
Both techniques can coexist: gradient coding for reliability, coded shuffling for bandwidth. Storage overhead from gradient coding: μ = (s + 1)/K per worker. Shuffling gain: still 1 + Kμ. Compound gain possible.
Practical
Not yet standard in production. Research prototypes from Caire–Tuninetti collaborations explore combined schemes.
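A small sketch of how the two budgets above interact, assuming (optimistically) that the gradient-coding replicas can double as shuffling side information; that reuse is an assumption for illustration, not a claim of the combined schemes:

```python
# Combined budget (sketch): gradient-coding placement fixes the per-worker storage fraction,
# which coded shuffling could then exploit -- assuming the two placements are compatible.
K, s = 50, 4
mu = (s + 1) / K                  # storage fraction imposed by gradient coding (10%)
shuffling_gain = 1 + K * mu       # = s + 2 = 6, if the same stored data serves both roles
print(mu, s, shuffling_gain)
```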
ex-cc-ch15-10
(Medium) Cross-DC training. For federated training across 10 data centers, each with a 10 Gbps interconnect and a 1 PB dataset: compare uncoded and coded shuffling time.
Aggregated cross-DC bandwidth.
Total bandwidth
10 DCs × 10 Gbps = 100 Gbps aggregate cross-DC.
Uncoded
Per-epoch transfer (uncoded): with K = 100 workers at μ = 0.1, K(1 − μ) × 1 PB = 90 PB. At 100 Gbps: ≈7.2 × 10^6 seconds ≈ 2000 hours per epoch. Infeasible.
Coded
Reduce by 1 + Kμ = 11: ≈8.2 PB per epoch. At 100 Gbps: ≈180 hrs. Still slow; use smaller datasets or longer epochs.
Conclusion
Cross-DC federated training is bottlenecked by bandwidth. Coded shuffling helps but cannot eliminate the constraint. Future 100+ Tbps optical networks needed for full-scale federated training.
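A quick check of this arithmetic, assuming K = 100 workers, μ = 0.1, a 1 PB dataset, and 100 Gbps of aggregate cross-DC bandwidth as above:

```python
# Cross-DC shuffling time (sketch, with the parameters assumed above).
K, mu, dataset_pb = 100, 0.1, 1.0
agg_bw = 100e9                                     # 10 DCs x 10 Gbps, bits/s

uncoded_pb = K * (1 - mu) * dataset_pb             # 90 PB per epoch
coded_pb = uncoded_pb / (1 + K * mu)               # ~8.2 PB per epoch

def transfer_hours(pb, bw_bits_per_s):
    return pb * 1e15 * 8 / bw_bits_per_s / 3600

print(transfer_hours(uncoded_pb, agg_bw))          # ~2000 hours
print(transfer_hours(coded_pb, agg_bw))            # ~182 hours
```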
ex-cc-ch15-11
(Hard) Heterogeneous worker storage. Extend coded shuffling to workers with different storage fractions μ_k. Derive the achievable rate.
Analog of heterogeneous-cache analysis (Ch 13).
Formula
Rate: R ≈ K(1 − μ_avg)/(1 + Kμ_avg), where μ_avg = (1/K) Σ_k μ_k. The aggregate storage Σ_k μ_k determines the rate (up to constant factors).
Heterogeneity penalty
Factor-of-2 upper bound on the penalty vs. a homogeneous system with the same aggregate storage (analogous to Sengupta-Tandon-Clancy 2017 for caches).
Design
Rich workers (more storage) contribute more to coded multicast. Poor workers still receive via XOR.
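A minimal sketch of the rate estimate above; it uses only the average storage fraction, so it is an approximation rather than the exact optimal rate:

```python
# Heterogeneous-storage rate estimate (sketch): the approximation above depends only on
# the average storage fraction, so equal aggregate storage gives the same estimate.
def hetero_rate(mus):
    K = len(mus)
    mu_avg = sum(mus) / K
    return K * (1 - mu_avg) / (1 + K * mu_avg)

print(hetero_rate([0.05] * 4 + [0.15] * 4))   # mixed storage, average mu = 0.1
print(hetero_rate([0.10] * 8))                # homogeneous with the same aggregate -> same value
```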
ex-cc-ch15-12
HardDynamic assignments. Over epochs, worker assignments evolve. Under what assumption is the per-epoch rate achievable uniformly?
i.i.d. assignments.
Assumption
Each worker's new assignment is independent of its previous assignment and of other workers' assignments, drawn uniformly at random over subsets of the prescribed size.
Achievable
Under this assumption, the per-epoch rate is fixed at K(1 − μ)/(1 + Kμ). No dependence on the epoch index t.
Violation
If assignments are correlated (e.g., slow drift), the structure can be exploited for further reduction. Open research area.
ex-cc-ch15-13
(Challenge) Approximate coded shuffling for federated learning. Design a coded shuffling scheme for federated learning where each worker has local private data. Combine with differential privacy.
Coded FL (Caire-Tuninetti 2022+).
Setup
K devices, each with local private data. No raw data exchange allowed. Gradient updates are exchanged; coded shuffling is applied to the model parameters.
Combined scheme
Add differential-privacy noise to gradients; use coded shuffling to communicate masked updates efficiently.
Analysis
Tradeoff between privacy budget and communication cost. Rate savings of roughly 1 + Kμ while maintaining the differential-privacy guarantee.
Research
Active CommIT-Tuninetti research; papers 2023-2024 explore combined schemes.
ex-cc-ch15-14
(Challenge) Coded all-reduce. PyTorch DDP uses all-reduce for gradient aggregation. Can coded techniques speed up all-reduce?
All-reduce = sum of gradient vectors.
Problem
Standard ring all-reduce: 2(K − 1) communication phases; each worker moves a 1/K slice of the gradient per phase, about 2× the gradient size in total per iteration.
Coded approach
Apply coded aggregation: workers exchange XOR-coded gradient segments, reducing communication by a coded-caching-style gain at the cost of extra computation.
Research
Coded all-reduce is active research; reported speedups are marginal (5-20%) over NCCL-optimized schemes. Not yet standard in PyTorch.
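A small sketch of the volumes being compared, assuming a standard ring all-reduce baseline and a hypothetical coded gain in the reported 5-20% range; the model size and gain factor are illustrative:

```python
# Communication volume per worker: ring all-reduce baseline vs. an idealized coded variant.
def ring_allreduce_bytes(grad_bytes, K):
    # Ring all-reduce moves 2*(K-1)/K * grad_bytes per worker per iteration.
    return 2 * (K - 1) / K * grad_bytes

grad_bytes = 25e6 * 4                 # e.g. a ~25M-parameter (ResNet-50-sized) model in fp32
K = 256
baseline = ring_allreduce_bytes(grad_bytes, K)
coded = baseline / 1.15               # hypothetical 15% saving, mid-range of the 5-20% reported
print(baseline / 1e6, coded / 1e6)    # MB per worker per iteration
```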
ex-cc-ch15-15
(Challenge) Industrial adoption roadmap. What are the main barriers to production adoption of coded shuffling and coded computing? Propose a roadmap.
Integration, economics, measurement.
Barriers
- Integration with existing frameworks (PyTorch, TF, JAX).
- Operational complexity (storage overhead, coding logic).
- Limited measured benefit at typical scales.
- Engineer familiarity: coded techniques not widely taught.
Roadmap
2024-2025: Open-source implementations of coded shuffling in PyTorch Distributed; research benchmarks. 2026-2027: Cloud-provider-integrated (AWS, GCP) coded-compute services. 2028+: Standard adoption in hyperscale training.
Gating
Widespread adoption hinges on framework integration — researchers know the value; practitioners need turnkey tools.