Exercises
ex-cc-ch15-01
(Easy) Compute the coded shuffling cost for given values of K, storage fraction μ, and dataset size in PB.
Apply R = K(1 − μ)N/(1 + Kμ).
Compute
Coded cost per epoch: K(1 − μ)N/(1 + Kμ), with N the dataset size in PB. Uncoded: K(1 − μ)N PB. Savings: a factor of 1 + Kμ.
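A minimal numeric sketch of the computation, assuming the chapter's accounting (uncoded cost K(1 − μ) × dataset size, coded gain 1 + Kμ); the example values K = 100, μ = 0.1, 1 PB are illustrative only.

```python
# Per-epoch shuffling cost (sketch): uncoded vs. coded, in dataset-size units.
def shuffling_cost_pb(K, mu, dataset_pb):
    uncoded = K * (1 - mu) * dataset_pb      # every worker fetches the (1-mu) fraction it lacks
    coded = uncoded / (1 + K * mu)           # coded multicasting gain of 1 + K*mu
    return uncoded, coded, 1 + K * mu

print(shuffling_cost_pb(K=100, mu=0.1, dataset_pb=1.0))   # (90.0, ~8.18, 11.0)
```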
ex-cc-ch15-02
(Easy) State the coded-caching / coded-shuffling analogy.
Worker memory = cache.
Table
Cache ↔ Worker memory
Files ↔ Dataset
Memory ratio M/N ↔ Storage fraction μ
Demand ↔ New assignment
Delivery ↔ Shuffling
Caching gain 1 + KM/N ↔ Shuffling gain 1 + Kμ
ex-cc-ch15-03
(Easy) Gradient coding with K = 50 workers tolerates s = 4 stragglers. What's the storage overhead per worker?
Each worker stores s + 1 of the K data partitions.
Compute
(s + 1)/K = 5/50 = 10%. Per-worker storage: 10% of the dataset (vs 2% = 1/K without coding). Storage overhead: 5× (= s + 1).
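A quick check of this arithmetic, using the values K = 50 and s = 4 implied by the answer:

```python
# Gradient-coding storage arithmetic: each worker stores s+1 of the K partitions.
K, s = 50, 4
uncoded_fraction = 1 / K            # 1/K = 2% of the dataset per worker without coding
coded_fraction = (s + 1) / K        # (s+1)/K = 10% with gradient coding
print(coded_fraction, uncoded_fraction, coded_fraction / uncoded_fraction)   # 0.1 0.02 5.0
```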
ex-cc-ch15-04
(Easy) Why does the coded shuffling rate saturate as K → ∞ at fixed μ?
(1 − μ)/μ asymptote.
Answer
At large K: R = K(1 − μ)/(1 + Kμ) → (1 − μ)/μ (asymptote), independent of K. Beyond moderate K, the total shuffling load is governed by μ alone, so adding workers no longer changes the communication relative to the dataset size. The storage fraction μ is the dominant lever.
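A short numeric sketch of the saturation, assuming the total coded load R(K, μ) = K(1 − μ)/(1 + Kμ) in dataset units:

```python
# The total coded load R(K, mu) = K*(1-mu)/(1+K*mu) climbs toward the ceiling (1-mu)/mu.
def coded_load(K, mu):
    return K * (1 - mu) / (1 + K * mu)

mu = 0.1
for K in (10, 100, 1000, 10000):
    print(K, round(coded_load(K, mu), 3))   # 4.5, 8.182, 8.911, 8.991
print("asymptote:", (1 - mu) / mu)          # 9.0
```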
ex-cc-ch15-05
(Easy) In coded shuffling, if μ = 1 (each worker stores the full dataset), what is the communication cost?
No shuffling needed.
Answer
μ = 1: each worker has the full dataset; no data transfer is needed per epoch. R = 0. Formula check: K(1 − μ)/(1 + Kμ) = 0 at μ = 1. ✓
ex-cc-ch15-06
(Medium) Converse proof sketch. Prove that the coded shuffling rate is order-optimal.
Cut-set argument on worker information.
Cut-set
Consider any single worker. At epoch t, it needs its new assignment; a (1 − μ) fraction of that assignment is new data, while the rest is already in its local storage.
Bound
Each worker must receive at least a (1 − μ) fraction of its assignment as new data per epoch. Summed across the K workers, the total demand is K(1 − μ) assignment-equivalents. Since a single coded multicast transmission can be useful to at most on the order of 1 + Kμ workers' stored side information, no scheme can do better than order K(1 − μ)/(1 + Kμ).
Matching achievability
The MAN-style coded scheme (§15.2) matches this bound asymptotically.
ex-cc-ch15-07
(Medium) Partial shuffling. If only a fraction α of the dataset is re-shuffled per epoch (the rest stays), what's the cost?
Linear scaling in α.
Formula
R(α) = α · K(1 − μ)/(1 + Kμ). Linear in α; the saturating gain structure is unchanged.
Practical
Production ML pipelines typically re-shuffle only a fraction of the data per epoch (e.g., α ≈ 0.1), not the full dataset, so the total cost is already about 10× less than full shuffling. Coding still helps on top of this; the reduction factor remains 1 + Kμ.
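A minimal sketch of the partial-shuffling cost under the linear-in-α model above; the values K = 100, μ = 0.1, 1 PB, and α = 0.1 are illustrative:

```python
# Partial shuffling cost (sketch): only a fraction alpha of the dataset is re-shuffled.
def partial_shuffle_cost_pb(alpha, K, mu, dataset_pb):
    return alpha * K * (1 - mu) * dataset_pb / (1 + K * mu)

full = partial_shuffle_cost_pb(1.0, 100, 0.1, 1.0)     # full re-shuffle, ~8.18 PB
partial = partial_shuffle_cost_pb(0.1, 100, 0.1, 1.0)  # alpha = 0.1, ~0.82 PB (10x less)
print(full, partial)
```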
ex-cc-ch15-08
(Medium) Wall-clock savings. ResNet-50 training at K = 256, μ = 0.05: compute the shuffling time savings per epoch and over a full training run.
Use earlier numbers: 36.5 TB per epoch uncoded.
Per epoch
Gain: 1 + Kμ = 1 + 256 × 0.05 = 13.8. Coded: 36.5/13.8 ≈ 2.64 TB per epoch. At 100 Gbps per worker across 256 workers (≈25.6 Tbps aggregate): shuffling time ≈ 0.82 seconds. Uncoded: ≈11 seconds.
Per training
90 epochs. Uncoded: ≈1000 s ≈ 17 min. Coded: ≈74 s ≈ 1.2 min. Saved: ~16 min per training run.
Scale
For 1000 training runs/year (hyperscale): ~260 hrs of compute cluster time saved per year, on shuffling alone.
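A back-of-the-envelope check of the figures above, assuming 36.5 TB uncoded per epoch, a gain of 1 + 256 × 0.05 = 13.8, and aggregate bandwidth of 256 × 100 Gbps:

```python
# Wall-clock sanity check for the ResNet-50 shuffling example (assumed parameters as above).
uncoded_tb = 36.5                      # uncoded shuffling volume per epoch, TB
gain = 1 + 256 * 0.05                  # coded gain 1 + K*mu = 13.8
coded_tb = uncoded_tb / gain           # ~2.64 TB per epoch
agg_bw = 256 * 100e9                   # aggregate bandwidth: 256 workers x 100 Gbps, bits/s

def transfer_seconds(tb, bw_bits_per_s):
    return tb * 1e12 * 8 / bw_bits_per_s

t_uncoded = transfer_seconds(uncoded_tb, agg_bw)    # ~11.4 s per epoch
t_coded = transfer_seconds(coded_tb, agg_bw)        # ~0.83 s per epoch
print(t_uncoded, t_coded)
print(90 * t_uncoded / 60, "min uncoded;", 90 * t_coded, "s coded")   # ~17 min vs ~74 s
```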
ex-cc-ch15-09
(Medium) Gradient coding + shuffling. Combine both techniques in a training loop. Does the combination compound?
Orthogonal mechanisms.
Gradient coding
Handles straggler tolerance via redundant data and computation. Storage overhead: (s + 1)× per worker.
Coded shuffling
Handles inter-epoch data transfer. Bandwidth reduction: 1 + Kμ.
Combination
Both techniques can coexist: gradient coding for reliability, coded shuffling for bandwidth. Storage overhead from gradient coding: μ = (s + 1)/K per worker. Shuffling gain: still 1 + Kμ. Compound gain possible.
Practical
Not yet standard in production. Research prototypes from Caire–Tuninetti collaborations explore combined schemes.
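A small sketch of how the two budgets above interact, assuming (optimistically) that the gradient-coding replicas can double as shuffling side information; that reuse is an assumption for illustration, not a claim of the combined schemes:

```python
# Combined budget (sketch): gradient-coding placement fixes the per-worker storage fraction,
# which coded shuffling could then exploit -- assuming the two placements are compatible.
K, s = 50, 4
mu = (s + 1) / K                  # storage fraction imposed by gradient coding (10%)
shuffling_gain = 1 + K * mu       # = s + 2 = 6, if the same stored data serves both roles
print(mu, s, shuffling_gain)
```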
ex-cc-ch15-10
(Medium) Cross-DC training. For federated training across 10 data centers, each with a 10 Gbps interconnect and a 1 PB dataset: compare uncoded and coded shuffling time.
Aggregated cross-DC bandwidth.
Total bandwidth
10 DCs × 10 Gbps = 100 Gbps aggregate cross-DC.
Uncoded
Per-epoch transfer (uncoded): with K = 100 workers at μ = 0.1, K(1 − μ) × 1 PB = 90 PB. At 100 Gbps: ≈7.2 × 10^6 seconds ≈ 2000 hours per epoch. Infeasible.
Coded
Reduce by 1 + Kμ = 11: ≈8.2 PB per epoch. At 100 Gbps: ≈180 hrs. Still slow; use smaller datasets or longer epochs.
Conclusion
Cross-DC federated training is bottlenecked by bandwidth. Coded shuffling helps but cannot eliminate the constraint. Future 100+ Tbps optical networks needed for full-scale federated training.
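A quick check of this arithmetic, assuming K = 100 workers, μ = 0.1, a 1 PB dataset, and 100 Gbps of aggregate cross-DC bandwidth as above:

```python
# Cross-DC shuffling time (sketch, with the parameters assumed above).
K, mu, dataset_pb = 100, 0.1, 1.0
agg_bw = 100e9                                     # 10 DCs x 10 Gbps, bits/s

uncoded_pb = K * (1 - mu) * dataset_pb             # 90 PB per epoch
coded_pb = uncoded_pb / (1 + K * mu)               # ~8.2 PB per epoch

def transfer_hours(pb, bw_bits_per_s):
    return pb * 1e15 * 8 / bw_bits_per_s / 3600

print(transfer_hours(uncoded_pb, agg_bw))          # ~2000 hours
print(transfer_hours(coded_pb, agg_bw))            # ~182 hours
```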
ex-cc-ch15-11
(Hard) Heterogeneous worker storage. Extend coded shuffling to workers with different storage fractions μ_k. Derive the achievable rate.
Analog of heterogeneous-cache analysis (Ch 13).
Formula
Rate: R ≈ K(1 − μ_avg)/(1 + Kμ_avg), where μ_avg = (1/K) Σ_k μ_k. The aggregate storage Σ_k μ_k determines the rate (up to constant factors).
Heterogeneity penalty
Factor-of-2 upper bound on the penalty vs. a homogeneous system with the same aggregate storage (analogous to Sengupta-Tandon-Clancy 2017 for caches).
Design
Rich workers (more storage) contribute more to coded multicast. Poor workers still receive via XOR.
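A minimal sketch of the rate estimate above; it uses only the average storage fraction, so it is an approximation rather than the exact optimal rate:

```python
# Heterogeneous-storage rate estimate (sketch): the approximation above depends only on
# the average storage fraction, so equal aggregate storage gives the same estimate.
def hetero_rate(mus):
    K = len(mus)
    mu_avg = sum(mus) / K
    return K * (1 - mu_avg) / (1 + K * mu_avg)

print(hetero_rate([0.05] * 4 + [0.15] * 4))   # mixed storage, average mu = 0.1
print(hetero_rate([0.10] * 8))                # homogeneous with the same aggregate -> same value
```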
ex-cc-ch15-12
HardDynamic assignments. Over epochs, worker assignments evolve. Under what assumption is the per-epoch rate achievable uniformly?
i.i.d. assignments.
Assumption
Each worker's new assignment is independent of its previous assignment and of other workers' assignments, drawn uniformly at random over subsets of the prescribed size.
Achievable
Under this assumption, the per-epoch rate is fixed at K(1 − μ)/(1 + Kμ). No dependence on the epoch index t.
Violation
If assignments are correlated (e.g., slow drift), the structure can be exploited for further reduction. Open research area.
ex-cc-ch15-13
(Challenge) Approximate coded shuffling for federated learning. Design a coded shuffling scheme for federated learning where each worker has local private data. Combine with differential privacy.
Coded FL (Caire-Tuninetti 2022+).
Setup
K devices, each with local private data. No raw data exchange allowed. Gradient updates are exchanged; coded shuffling is applied to the model parameters.
Combined scheme
Add differential-privacy noise to gradients; use coded shuffling to communicate masked updates efficiently.
Analysis
Tradeoff between privacy budget and communication cost. Rate savings of roughly 1 + Kμ while maintaining the differential-privacy guarantee.
Research
Active CommIT-Tuninetti research; papers 2023-2024 explore combined schemes.
ex-cc-ch15-14
(Challenge) Coded all-reduce. PyTorch DDP uses all-reduce for gradient aggregation. Can coded techniques speed up all-reduce?
All-reduce = sum of gradient vectors.
Problem
Standard ring all-reduce: 2(K − 1) communication phases; each worker moves a 1/K slice of the gradient per phase, about 2× the gradient size in total per iteration.
Coded approach
Apply coded aggregation: workers exchange XOR-coded gradient segments, reducing communication by a coded-caching-style gain at the cost of extra computation.
Research
Coded all-reduce is active research; reported speedups are marginal (5-20%) over NCCL-optimized schemes. Not yet standard in PyTorch.
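A small sketch of the volumes being compared, assuming a standard ring all-reduce baseline and a hypothetical coded gain in the reported 5-20% range; the model size and gain factor are illustrative:

```python
# Communication volume per worker: ring all-reduce baseline vs. an idealized coded variant.
def ring_allreduce_bytes(grad_bytes, K):
    # Ring all-reduce moves 2*(K-1)/K * grad_bytes per worker per iteration.
    return 2 * (K - 1) / K * grad_bytes

grad_bytes = 25e6 * 4                 # e.g. a ~25M-parameter (ResNet-50-sized) model in fp32
K = 256
baseline = ring_allreduce_bytes(grad_bytes, K)
coded = baseline / 1.15               # hypothetical 15% saving, mid-range of the 5-20% reported
print(baseline / 1e6, coded / 1e6)    # MB per worker per iteration
```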
ex-cc-ch15-15
(Challenge) Industrial adoption roadmap. What are the main barriers to production adoption of coded shuffling and coded computing? Propose a roadmap.
Integration, economics, measurement.
Barriers
- Integration with existing frameworks (PyTorch, TF, JAX).
- Operational complexity (storage overhead, coding logic).
- Limited measured benefit at typical scales.
- Engineer familiarity: coded techniques not widely taught.
Roadmap
2024-2025: Open-source implementations of coded shuffling in PyTorch Distributed; research benchmarks. 2026-2027: Cloud-provider-integrated (AWS, GCP) coded-compute services. 2028+: Standard adoption in hyperscale training.
Gating
Widespread adoption hinges on framework integration — researchers know the value; practitioners need turnkey tools.