Fundamental Quantities: Storage, Load, Threshold
Three Numbers That Characterize a Scheme
Every distributed-computing scheme in this book lives at a point in a three-dimensional design space. Whether we are coding matrix multiplication, shuffling gradients, or aggregating federated models, we ultimately want to locate our scheme in this space and compare it to the best achievable operating point.
This section gives the three quantities their precise information-theoretic definitions. In later chapters the space is augmented with a privacy parameter (Part III), a PIR rate (Part IV), or a distortion (Part V), but the triple $(\mu, L, K^*)$ is always the core.
Definition: Computation (Storage) Load
Computation (Storage) Load
The computation load (equivalently, storage load) of a distributed-computing scheme is the average per-worker storage normalized by the full dataset entropy. corresponds to uncoded disjoint partitioning (each worker stores a fraction with no overlap); corresponds to full replication (every worker stores the entire dataset). Any intermediate value trades storage for redundancy.
In the deterministic case where $W$ is a fixed file of length $F$ bits and $W_k$ is a subfile of length $F_k$ bits, the definition reduces to $\mu = \frac{1}{K}\sum_{k=1}^{K} F_k / F$.
Computation / Storage Load
The average per-worker storage as a fraction of the full dataset size. $\mu = 1/K$ is no redundancy (disjoint partition); $\mu = 1$ is full replication. Larger $\mu$ enables coding gains but costs more memory/disk at each worker.
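For a quick numerical check, the storage fraction can be computed directly from a subfile-to-worker assignment under the deterministic model. This is a minimal sketch of my own; the helper name and the equal-size-subfile assumption are illustrative, not from the text:

```python
from typing import Sequence, Set


def storage_fraction(assignment: Sequence[Set[int]], num_subfiles: int) -> float:
    """Average per-worker storage as a fraction of the dataset (mu).

    assignment[k] is the set of (equal-size) subfile indices stored at
    worker k, so a worker holding s subfiles stores a fraction s / num_subfiles.
    """
    per_worker = [len(stored) / num_subfiles for stored in assignment]
    return sum(per_worker) / len(assignment)


# Disjoint partition of 6 subfiles over K = 3 workers: mu = 1/3 = 1/K.
print(storage_fraction([{0, 1}, {2, 3}, {4, 5}], num_subfiles=6))   # 0.333...

# Every worker stores all 6 subfiles: mu = 1 (full replication).
print(storage_fraction([set(range(6))] * 3, num_subfiles=6))        # 1.0
```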
Definition: Communication Load
Communication Load
The communication load $L$ of a distributed-computing scheme is the aggregate message entropy normalized by the reference intermediate-file entropy, $L \triangleq \sum_{k=1}^{K} H(X_k)\,/\,H(V)$, where $X_k$ is the message sent by worker $k$ and $H(V)$ is typically the size of the intermediate-value file in MapReduce, or the size of the desired output in matrix multiplication. Smaller $L$ means less network traffic per unit of useful output.
Different problems use slightly different normalizations. In PIR (Chapter 13) the analogous quantity is the PIR rate $R = F/D$, where $F$ is the file size and $D$ the download size; $1/R$ is then the natural communication cost. The essential idea — normalize by what the master actually needs — is common to all.
Communication Load
Aggregate inter-worker traffic normalized by the reference output size. In coded shuffling $L_{\text{coded}}(\mu) = \frac{1-\mu}{K\mu}$; in uncoded shuffling $L_{\text{uncoded}}(\mu) = 1 - \mu$. The computation–communication tradeoff curve is $L^*(\mu)$.
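The two shuffling formulas in the box above are easy to evaluate side by side. The following throwaway sketch (function names are my own) tabulates them for a small cluster:

```python
def l_uncoded(mu: float) -> float:
    """Uncoded shuffling: each worker fetches the (1 - mu) fraction it lacks."""
    return 1.0 - mu


def l_coded(mu: float, num_workers: int) -> float:
    """Coded shuffling: (1 - mu) / (K * mu), a multicast gain of K * mu over uncoded."""
    return (1.0 - mu) / (num_workers * mu)


K = 10
for mu in (1 / K, 0.2, 0.5, 1.0):
    print(f"mu = {mu:.2f}: uncoded L = {l_uncoded(mu):.3f}, coded L = {l_coded(mu, K):.3f}")
# At mu = 1/K the two loads coincide; for mu > 1/K the coded load is K*mu times smaller.
```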
Definition: Recovery Threshold
Recovery Threshold
A distributed-computing scheme has recovery threshold $K^*$ if the decoder satisfies $H\big(f(W) \mid X_{i_1}, \dots, X_{i_{K^*}}\big) = 0$ for every subset $\{i_1, \dots, i_{K^*}\} \subseteq \{1, \dots, K\}$, where $f(W)$ denotes the desired output, i.e., any $K^*$ of the $K$ encoded messages suffice to reconstruct the output. Smaller $K^*$ means more straggler tolerance — the master waits for fewer responses.
A scheme is optimal in the recovery-threshold sense if no other scheme with the same storage load $\mu$ achieves a smaller $K^*$.
Recovery Threshold
The minimum number $K^*$ of worker responses needed to reconstruct the output. Smaller $K^*$ gives better straggler tolerance. For polynomial-coded matrix multiplication (Chapter 5), the achieved $K^*$ matches an information-theoretic lower bound.
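To make "any $K^*$ responses suffice" concrete, here is a toy illustration of my own (not the book's construction): a length-$K^*$ data vector is encoded into $K > K^*$ worker shares by polynomial evaluation, and any $K^*$ of the shares recover it by interpolation.

```python
import numpy as np

K_STAR = 3   # recovery threshold: any K_STAR shares suffice
K = 6        # total number of workers

rng = np.random.default_rng(0)
data = rng.standard_normal(K_STAR)            # coefficients of a degree-(K_STAR - 1) polynomial

# Encoder: worker i stores the polynomial evaluated at point x_i (an MDS-style code).
eval_points = np.arange(1, K + 1, dtype=float)
shares = np.polyval(data[::-1], eval_points)  # np.polyval expects highest-degree coefficient first

# Decoder: pick ANY K_STAR responding workers and interpolate the polynomial.
responders = [1, 3, 5]                        # indices of the workers that responded first
coeffs = np.polyfit(eval_points[responders], shares[responders], deg=K_STAR - 1)

print(np.allclose(coeffs[::-1], data))        # True: output reconstructed from K_STAR shares
```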
Example: The Uncoded Benchmark
For the uncoded MapReduce scheme of Chapter 1 (each of $K$ workers stores a disjoint $1/K$ partition and transmits raw intermediate values), compute the triple $(\mu, L, K^*)$.
Storage load
Each worker stores $1/K$ of the dataset disjointly: $\mu = 1/K$.
Communication load
Each worker needs the $1 - 1/K$ fraction of intermediate values held by others, and transmits its own $1/K$ fraction to each of the $K - 1$ others. The total load is $L = 1 - \frac{1}{K}$.
Recovery threshold
The master needs all $K$ responses (no redundancy), so $K^* = K$.
Operating point
The uncoded scheme sits at $(\mu, L, K^*) = \left(\tfrac{1}{K},\, 1 - \tfrac{1}{K},\, K\right)$ — minimum storage, maximum communication, maximum straggler sensitivity. Every coded-computing scheme in this book trades off along the three axes simultaneously.
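As a throwaway check (the cluster size below is an arbitrary illustrative choice), the uncoded operating point can be written out numerically:

```python
K = 20                      # number of workers (illustrative)

mu = 1 / K                  # disjoint partition: minimum storage
L = 1 - 1 / K               # each worker fetches everything it does not store
k_star = K                  # no redundancy: the master must hear from every worker

print(f"uncoded operating point (mu, L, K*) = ({mu:.3f}, {L:.3f}, {k_star})")
# uncoded operating point (mu, L, K*) = (0.050, 0.950, 20)
```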
Definition: Computation and Communication Rates
Computation and Communication Rates
It is often convenient to express storage and communication as rates — bits per input sample rather than fractions of the dataset. Let $F = H(W)$ be the dataset entropy in bits. Define the storage rate $R_{\mathrm{s}} \triangleq \mu F$ and the communication rate $R_{\mathrm{c}} \triangleq L\, H(V) / F$: $R_{\mathrm{s}}$ is the per-worker storage cost in bits; $R_{\mathrm{c}}$ is the aggregate communication cost in bits per input bit of dataset. Expressing results in rates makes the coding-theoretic intuition (Reed–Solomon, MDS, Shamir) more transparent.
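A back-of-the-envelope conversion, using the rate symbols above; the dataset and intermediate-file sizes are made-up numbers for illustration only:

```python
F = 8 * 10**9        # dataset entropy in bits (a hypothetical 1 GB dataset)
H_V = 8 * 10**8      # reference intermediate-file entropy in bits (hypothetical)

mu, L = 0.2, 0.4     # example fractional loads

R_s = mu * F         # per-worker storage cost in bits
R_c = L * H_V / F    # aggregate communication in bits per input bit of dataset

print(f"R_s = {R_s:.2e} bits per worker, R_c = {R_c:.2e} bits per dataset bit")
```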
Key Takeaway
Three numbers characterize a distributed-computing scheme: storage load $\mu$, communication load $L$, recovery threshold $K^*$. Each chapter of this book is ultimately about locating the achievable region in this three-dimensional space. The converse of Section 2.4 will show that the uncoded baseline sits in the interior of the region — coded schemes can be strictly better on all three axes simultaneously.
Communication Load vs. Number of Workers
Plot the communication load $L$ as a function of the number of workers $K$, for three schemes: (i) uncoded shuffling ($\mu = 1/K$, $L = 1 - 1/K$), (ii) coded shuffling at fixed storage $\mu$ ($L = \frac{1-\mu}{K\mu}$), and (iii) full replication ($\mu = 1$, $L = 0$). Adjust $\mu$ to see how the coded curve shifts. Notice that the coded curve decays like $\frac{1}{K\mu}$ while the uncoded curve asymptotes to $1$ — this is the coding gain we quantify in Chapters 5–7.
Parameters (interactive): per-worker storage fraction $\mu$; maximum number of workers shown on the plot.
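Readers working offline can reproduce the three curves with a few lines of matplotlib. This is a sketch of the interactive figure; the default parameter values are chosen arbitrarily:

```python
import numpy as np
import matplotlib.pyplot as plt

mu = 0.2                              # per-worker storage fraction for the coded scheme (slider default)
K = np.arange(2, 51)                  # number of workers on the x-axis

L_uncoded = 1 - 1 / K                 # uncoded shuffling: mu = 1/K, so L = 1 - 1/K
L_coded = (1 - mu) / (K * mu)         # coded shuffling at fixed storage mu
L_full = np.zeros_like(K, dtype=float)  # full replication: mu = 1, nothing to shuffle

plt.plot(K, L_uncoded, label="uncoded shuffling (mu = 1/K)")
plt.plot(K, L_coded, label=f"coded shuffling, mu = {mu}")
plt.plot(K, L_full, label="full replication (L = 0)")
plt.xlabel("number of workers K")
plt.ylabel("communication load L")
plt.legend()
plt.show()
```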
Three Operating Regimes of the $(\mu, L)$ Plane
| Regime | $\mu$ | $L$ | $K^*$ | When used |
|---|---|---|---|---|
| Uncoded | $1/K$ | $1 - 1/K$ | $K$ | Legacy MapReduce, naive FL, no redundancy budget |
| Coded, minimum storage | $1/K$ | $1 - 1/K$ | $K$ | No improvement at min. storage (equivalent to uncoded) |
| Coded, intermediate | $1/K < \mu < 1$ | $\frac{1-\mu}{K\mu}$ | $< K$ (scheme-dependent) | Data-center FL, resilient coded computing (Chapters 5–7) |
| Full replication | $1$ | $0$ | $1$ | Small clusters, maximum straggler tolerance, luxury regime |
Common Mistake: Coding at Minimum Storage Does Not Help
Mistake:
Claim that applying a coding scheme at minimum storage $\mu = 1/K$ gives any improvement over uncoded.
Correction:
At $\mu = 1/K$ every worker stores a disjoint $1/K$ fraction — there is no redundancy to exploit, regardless of the coding. The coded-shuffling formula evaluated at $\mu = 1/K$ gives $L = \frac{1 - 1/K}{K \cdot (1/K)} = 1 - \frac{1}{K}$, exactly the uncoded load. Coding gains require extra storage, quantified by the replication factor $K\mu > 1$. This is easy to verify from the formula, but surprisingly often miscommunicated in system-design discussions.
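The degenerate case is easy to confirm numerically (a throwaway check; the cluster size is arbitrary):

```python
K = 16
mu_min = 1 / K                                # minimum storage: disjoint partition

coded_at_min = (1 - mu_min) / (K * mu_min)    # coded-shuffling formula at mu = 1/K
uncoded = 1 - mu_min                          # uncoded load

print(coded_at_min, uncoded)                  # 0.9375 0.9375: identical, no coding gain
```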
Quick Check
A coded shuffling scheme with $K$ workers uses storage load $\mu > 1/K$. What is the communication load $L$?
$L = \frac{1-\mu}{K\mu}$. A factor-$K\mu$ reduction over the uncoded $L_{\text{uncoded}} = 1 - \mu$.