Fundamental Quantities: Storage, Load, Threshold
Three Numbers That Characterize a Scheme
Every distributed-computing scheme in this book lives at a point in a three-dimensional design space. Whether we are coding matrix multiplication, shuffling gradients, or aggregating federated models, we ultimately want to locate our scheme in this space and compare it to the best achievable operating point.
This section gives the three quantities their precise information-theoretic definitions. In later chapters the space is augmented with a privacy parameter (Part III), a PIR rate (Part IV), or a distortion (Part V), but the triple $(\mu, L, K^*)$ is always the core.
Definition: Computation (Storage) Load
Computation (Storage) Load
The computation load (equivalently, storage load) of a distributed-computing scheme is the average per-worker storage normalized by the full dataset entropy. corresponds to uncoded disjoint partitioning (each worker stores a fraction with no overlap); corresponds to full replication (every worker stores the entire dataset). Any intermediate value trades storage for redundancy.
In the deterministic case where $W$ is a fixed file of length $F$ bits and $W_k$ is a subfile of length $F_k$ bits, the definition reduces to $\mu = \frac{1}{K}\sum_{k=1}^{K} F_k / F$.
Computation / Storage Load
The average per-worker storage as a fraction of the full dataset size. $\mu = 1/K$ is no redundancy (disjoint partition); $\mu = 1$ is full replication. Larger $\mu$ enables coding gains but costs more memory/disk at each worker.
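For a quick numerical check, the storage fraction can be computed directly from a subfile-to-worker assignment under the deterministic model. This is a minimal sketch of my own; the helper name and the equal-size-subfile assumption are illustrative, not from the text:

```python
from typing import Sequence, Set


def storage_fraction(assignment: Sequence[Set[int]], num_subfiles: int) -> float:
    """Average per-worker storage as a fraction of the dataset (mu).

    assignment[k] is the set of (equal-size) subfile indices stored at
    worker k, so a worker holding s subfiles stores a fraction s / num_subfiles.
    """
    per_worker = [len(stored) / num_subfiles for stored in assignment]
    return sum(per_worker) / len(assignment)


# Disjoint partition of 6 subfiles over K = 3 workers: mu = 1/3 = 1/K.
print(storage_fraction([{0, 1}, {2, 3}, {4, 5}], num_subfiles=6))   # 0.333...

# Every worker stores all 6 subfiles: mu = 1 (full replication).
print(storage_fraction([set(range(6))] * 3, num_subfiles=6))        # 1.0
```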
Definition: Communication Load
Communication Load
The communication load $L$ of a distributed-computing scheme is the aggregate message entropy normalized by the reference intermediate-file entropy, $L \triangleq \sum_{k=1}^{K} H(X_k)\,/\,H(V)$, where $X_k$ is the message sent by worker $k$ and $H(V)$ is typically the size of the intermediate-value file in MapReduce, or the size of the desired output in matrix multiplication. Smaller $L$ means less network traffic per unit of useful output.
Different problems use slightly different normalizations. In PIR (Chapter 13) the analogous quantity is the PIR rate $R = F/D$, where $F$ is the file size and $D$ the download size; $1/R$ is then the natural communication cost. The essential idea — normalize by what the master actually needs — is common to all.
Communication Load
Aggregate inter-worker traffic normalized by the reference output size. In coded shuffling $L_{\text{coded}}(\mu) = \frac{1-\mu}{K\mu}$; in uncoded shuffling $L_{\text{uncoded}}(\mu) = 1 - \mu$. The computation–communication tradeoff curve is $L^*(\mu)$.
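The two shuffling formulas in the box above are easy to evaluate side by side. The following throwaway sketch (function names are my own) tabulates them for a small cluster:

```python
def l_uncoded(mu: float) -> float:
    """Uncoded shuffling: each worker fetches the (1 - mu) fraction it lacks."""
    return 1.0 - mu


def l_coded(mu: float, num_workers: int) -> float:
    """Coded shuffling: (1 - mu) / (K * mu), a multicast gain of K * mu over uncoded."""
    return (1.0 - mu) / (num_workers * mu)


K = 10
for mu in (1 / K, 0.2, 0.5, 1.0):
    print(f"mu = {mu:.2f}: uncoded L = {l_uncoded(mu):.3f}, coded L = {l_coded(mu, K):.3f}")
# At mu = 1/K the two loads coincide; for mu > 1/K the coded load is K*mu times smaller.
```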
Definition: Recovery Threshold
Recovery Threshold
A distributed-computing scheme has recovery threshold $K^*$ if the decoder satisfies $H\big(f(W) \mid X_{i_1}, \dots, X_{i_{K^*}}\big) = 0$ for every subset $\{i_1, \dots, i_{K^*}\} \subseteq \{1, \dots, K\}$, where $f(W)$ denotes the desired output, i.e., any $K^*$ of the $K$ encoded messages suffice to reconstruct the output. Smaller $K^*$ means more straggler tolerance — the master waits for fewer responses.
A scheme is optimal in the recovery-threshold sense if no other scheme with the same storage load $\mu$ achieves a smaller $K^*$.
Recovery Threshold
The minimum number $K^*$ of worker responses needed to reconstruct the output. Smaller $K^*$ gives better straggler tolerance. For polynomial-coded matrix multiplication (Chapter 5), the achieved $K^*$ matches an information-theoretic lower bound.
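To make "any $K^*$ responses suffice" concrete, here is a toy illustration of my own (not the book's construction): a length-$K^*$ data vector is encoded into $K > K^*$ worker shares by polynomial evaluation, and any $K^*$ of the shares recover it by interpolation.

```python
import numpy as np

K_STAR = 3   # recovery threshold: any K_STAR shares suffice
K = 6        # total number of workers

rng = np.random.default_rng(0)
data = rng.standard_normal(K_STAR)            # coefficients of a degree-(K_STAR - 1) polynomial

# Encoder: worker i stores the polynomial evaluated at point x_i (an MDS-style code).
eval_points = np.arange(1, K + 1, dtype=float)
shares = np.polyval(data[::-1], eval_points)  # np.polyval expects highest-degree coefficient first

# Decoder: pick ANY K_STAR responding workers and interpolate the polynomial.
responders = [1, 3, 5]                        # indices of the workers that responded first
coeffs = np.polyfit(eval_points[responders], shares[responders], deg=K_STAR - 1)

print(np.allclose(coeffs[::-1], data))        # True: output reconstructed from K_STAR shares
```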
Example: The Uncoded Benchmark
For the uncoded MapReduce scheme of Chapter 1 (each of $K$ workers stores a disjoint $1/K$ partition and transmits raw intermediate values), compute the triple $(\mu, L, K^*)$.
Storage load
Each worker stores $1/K$ of the dataset disjointly: $\mu = 1/K$.
Communication load
Each worker needs the $1 - 1/K$ fraction of intermediate values held by others, and transmits its own $1/K$ fraction to each of the $K - 1$ others. The total load is $L = 1 - \frac{1}{K}$.
Recovery threshold
The master needs all $K$ responses (no redundancy), so $K^* = K$.
Operating point
The uncoded scheme sits at $(\mu, L, K^*) = \left(\tfrac{1}{K},\, 1 - \tfrac{1}{K},\, K\right)$ — minimum storage, maximum communication, maximum straggler sensitivity. Every coded-computing scheme in this book trades off along the three axes simultaneously.
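As a throwaway check (the cluster size below is an arbitrary illustrative choice), the uncoded operating point can be written out numerically:

```python
K = 20                      # number of workers (illustrative)

mu = 1 / K                  # disjoint partition: minimum storage
L = 1 - 1 / K               # each worker fetches everything it does not store
k_star = K                  # no redundancy: the master must hear from every worker

print(f"uncoded operating point (mu, L, K*) = ({mu:.3f}, {L:.3f}, {k_star})")
# uncoded operating point (mu, L, K*) = (0.050, 0.950, 20)
```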
Definition: Computation and Communication Rates
Computation and Communication Rates
It is often convenient to express storage and communication as rates — bits per input sample rather than fractions of the dataset. Let $F = H(W)$ be the dataset entropy in bits. Define the storage rate $R_{\mathrm{s}} \triangleq \mu F$ and the communication rate $R_{\mathrm{c}} \triangleq L\, H(V) / F$: $R_{\mathrm{s}}$ is the per-worker storage cost in bits; $R_{\mathrm{c}}$ is the aggregate communication cost in bits per input bit of dataset. Expressing results in rates makes the coding-theoretic intuition (Reed–Solomon, MDS, Shamir) more transparent.
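A back-of-the-envelope conversion, using the rate symbols above; the dataset and intermediate-file sizes are made-up numbers for illustration only:

```python
F = 8 * 10**9        # dataset entropy in bits (a hypothetical 1 GB dataset)
H_V = 8 * 10**8      # reference intermediate-file entropy in bits (hypothetical)

mu, L = 0.2, 0.4     # example fractional loads

R_s = mu * F         # per-worker storage cost in bits
R_c = L * H_V / F    # aggregate communication in bits per input bit of dataset

print(f"R_s = {R_s:.2e} bits per worker, R_c = {R_c:.2e} bits per dataset bit")
```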
Key Takeaway
Three numbers characterize a distributed-computing scheme: storage load $\mu$, communication load $L$, recovery threshold $K^*$. Each chapter of this book is ultimately about locating the achievable region in this three-dimensional space. The converse of Section 2.4 will show that the uncoded baseline sits in the interior of the region — coded schemes can be strictly better on all three axes simultaneously.
Communication Load vs. Number of Workers
Plot the communication load $L$ as a function of the number of workers $K$, for three schemes: (i) uncoded shuffling ($\mu = 1/K$, $L = 1 - 1/K$), (ii) coded shuffling at fixed storage $\mu$ ($L = \frac{1-\mu}{K\mu}$), and (iii) full replication ($\mu = 1$, $L = 0$). Adjust $\mu$ to see how the coded curve shifts. Notice that the coded curve decays like $\frac{1}{K\mu}$ while the uncoded curve asymptotes to $1$ — this is the coding gain we quantify in Chapters 5–7.
Parameters (interactive): per-worker storage fraction $\mu$; maximum number of workers shown on the plot.
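Readers working offline can reproduce the three curves with a few lines of matplotlib. This is a sketch of the interactive figure; the default parameter values are chosen arbitrarily:

```python
import numpy as np
import matplotlib.pyplot as plt

mu = 0.2                              # per-worker storage fraction for the coded scheme (slider default)
K = np.arange(2, 51)                  # number of workers on the x-axis

L_uncoded = 1 - 1 / K                 # uncoded shuffling: mu = 1/K, so L = 1 - 1/K
L_coded = (1 - mu) / (K * mu)         # coded shuffling at fixed storage mu
L_full = np.zeros_like(K, dtype=float)  # full replication: mu = 1, nothing to shuffle

plt.plot(K, L_uncoded, label="uncoded shuffling (mu = 1/K)")
plt.plot(K, L_coded, label=f"coded shuffling, mu = {mu}")
plt.plot(K, L_full, label="full replication (L = 0)")
plt.xlabel("number of workers K")
plt.ylabel("communication load L")
plt.legend()
plt.show()
```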
Three Operating Regimes of the $(\mu, L)$ Plane
| Regime | $\mu$ | $L$ | $K^*$ | When used |
|---|---|---|---|---|
| Uncoded | $1/K$ | $1 - 1/K$ | $K$ | Legacy MapReduce, naive FL, no redundancy budget |
| Coded, minimum storage | $1/K$ | $1 - 1/K$ | $K$ | No improvement at min. storage (equivalent to uncoded) |
| Coded, intermediate | $1/K < \mu < 1$ | $\frac{1-\mu}{K\mu}$ | $< K$ (scheme-dependent) | Data-center FL, resilient coded computing (Chapters 5–7) |
| Full replication | $1$ | $0$ | $1$ | Small clusters, maximum straggler tolerance, luxury regime |
Common Mistake: Coding at Minimum Storage Does Not Help
Mistake:
Claim that applying a coding scheme at minimum storage $\mu = 1/K$ gives any improvement over uncoded.
Correction:
At $\mu = 1/K$ every worker stores a disjoint $1/K$ fraction — there is no redundancy to exploit, regardless of the coding. The coded-shuffling formula evaluated at $\mu = 1/K$ gives $L = \frac{1 - 1/K}{K \cdot (1/K)} = 1 - \frac{1}{K}$, exactly the uncoded load. Coding gains require extra storage, quantified by the replication factor $K\mu > 1$. This is easy to verify from the formula, but surprisingly often miscommunicated in system-design discussions.
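The degenerate case is easy to confirm numerically (a throwaway check; the cluster size is arbitrary):

```python
K = 16
mu_min = 1 / K                                # minimum storage: disjoint partition

coded_at_min = (1 - mu_min) / (K * mu_min)    # coded-shuffling formula at mu = 1/K
uncoded = 1 - mu_min                          # uncoded load

print(coded_at_min, uncoded)                  # 0.9375 0.9375: identical, no coding gain
```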
Quick Check
A coded shuffling scheme with $K$ workers uses storage load $\mu > 1/K$. What is the communication load $L$?
$L = \frac{1-\mu}{K\mu}$. A factor-$K\mu$ reduction over the uncoded $L_{\text{uncoded}} = 1 - \mu$.