Workers as Encoders, Master as Decoder

From Engineering to Information Theory

Chapter 1 introduced distributed computing as a system-design problem. We now formalize it as a network information theory problem. The move is conceptually the same one Shannon made for point-to-point communication: strip the engineering away, keep the information flow, and ask what is fundamentally achievable regardless of implementation. The framework we build here will be the common language of the whole book — every subsequent chapter will specialize it (adding privacy, Byzantine robustness, a wireless channel, ...) but the skeleton stays the same.

The point is that once workers are encoders and the master is a decoder, the laws that govern source and channel coding (entropy bounds, cut-set converse, Fano's inequality) apply directly. This is what makes the trade-offs we identified in Chapter 1 provable — not merely plausible.

Definition: Distributed Computing as a Multi-Terminal Coding Problem

Let $\mathcal{D}$ be the input dataset, modeled as a random variable taking values in a finite alphabet (typically uniform over its support, so that $H(\mathcal{D}) = \log |\mathrm{supp}(\mathcal{D})|$). A distributed computing scheme with $N$ workers consists of:

  • Storage mapping. Each worker $k$ stores a function $\mathcal{D}_k = \varphi_k(\mathcal{D})$ of the dataset, where $|\mathcal{D}_k| \leq \mu \cdot |\mathcal{D}|$ for some $\mu \in [0, 1]$. The storage function $\varphi_k$ is fixed before the inputs are revealed.
  • Encoders. Each worker applies a local encoder $E_k$ that takes $\mathcal{D}_k$ (and any shared randomness) and produces a message $X_k = E_k(\mathcal{D}_k)$ in a finite alphabet.
  • Decoder. The master applies a decoder $\psi$ that reconstructs the output $\hat Y = \psi(X_1, X_2, \ldots, X_N)$.
  • Correctness. The output must satisfy $\Pr[\hat Y \neq Y] \to 0$, where $Y = f(\mathcal{D})$ is the ground-truth function of the dataset.

The communication load is the normalized total message length $\Delta = \sum_k \log|\mathrm{supp}(X_k)| / |V|$, where $|V|$ is a reference intermediate-file size. The storage load is $\mu$. A scheme is $(\mu, \Delta)$-achievable if there exists a sequence of storage mappings, encoders, and a decoder whose block error probability vanishes as the dataset size grows.
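
To make the definition concrete, here is a minimal Python sketch of an uncoded scheme, assuming a toy target function (the sum of all records) and hypothetical sizes: the storage mapping partitions the dataset, each encoder sends its local partial sum, and the decoder adds the messages. The final `mu` and `delta` correspond to the storage and communication loads just defined.

```python
import numpy as np

# Toy instance (hypothetical): the target function is Y = f(D) = sum of all records.
rng = np.random.default_rng(0)
D = rng.integers(0, 256, size=12)            # dataset of 12 records
N = 3                                        # number of workers

# Storage mapping phi_k: identity on disjoint partitions (uncoded storage).
partitions = np.array_split(np.arange(len(D)), N)
stored = [D[idx] for idx in partitions]      # D_k = phi_k(D)

# Worker encoders E_k: each worker sends its local partial sum as the message X_k.
messages = [int(D_k.sum()) for D_k in stored]

# Decoder psi: reconstruct Y_hat from the N messages.
Y_hat = sum(messages)
assert Y_hat == int(D.sum())                 # correctness: Y_hat equals Y = f(D)

# Loads, taking |V| = the size of one intermediate value (one partial sum).
mu = max(len(D_k) for D_k in stored) / len(D)   # storage load: fraction of D per worker
delta = len(messages)                           # communication load: N messages of size |V|
print(f"mu = {mu:.2f}, Delta = {delta}")
```

Coded schemes change only $\varphi_k$ and $E_k$; the decoder and the load bookkeeping stay the same.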


Storage Mapping

The (fixed) function $\varphi_k$ that determines what worker $k$ stores given the dataset $\mathcal{D}$. Chosen before the inputs are revealed; may be stochastic (shared randomness) or deterministic. Uncoded storage uses identity mappings on disjoint partitions; coded storage uses MDS or polynomial codes.
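
The contrast between the two styles can be sketched as follows — a toy example with hypothetical block sizes, not a scheme from this chapter: uncoded storage places each block on exactly one worker, while a simple XOR-parity (MDS-style) mapping adds a redundant worker so that any two of the three stored pieces determine the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
# Dataset split into two equal blocks D1, D2 (bytes).
D1 = rng.integers(0, 256, 8, dtype=np.uint8)
D2 = rng.integers(0, 256, 8, dtype=np.uint8)

# Uncoded storage: identity mappings on disjoint partitions (mu = 1/2, no redundancy).
uncoded = {1: D1, 2: D2}

# Coded storage: worker 3 stores the XOR parity, an MDS-style (3, 2) mapping.
coded = {1: D1, 2: D2, 3: D1 ^ D2}

# Any two coded shares recover the dataset, e.g. if worker 1 is unavailable:
D1_recovered = coded[2] ^ coded[3]
assert np.array_equal(D1_recovered, D1)
```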

Worker Encoder

Per-worker encoder $E_k: \mathcal{D}_k \mapsto X_k$ that produces the message sent to the master. In coded matrix multiplication, $E_k$ is polynomial evaluation at the worker's assigned evaluation point; in secure aggregation, $E_k$ adds pairwise masks to the gradient.
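
As a sketch of the secure-aggregation case (hypothetical names; the shared randomness is modeled by a per-pair seed), each encoder adds pairwise masks that cancel in the sum, so the master recovers $\sum_k \mathbf{g}_k$ while any single message on its own looks like uniform noise.

```python
import numpy as np

N, d, q = 3, 4, 2**16            # workers, gradient dimension, modulus for the arithmetic
rng = np.random.default_rng(2)
grads = [rng.integers(0, q, d) for _ in range(N)]   # local gradients g_k

def pair_mask(j, k):
    # Pairwise mask r_{jk}, derived from shared randomness (a common seed per pair j < k).
    return np.random.default_rng(1000 * j + k).integers(0, q, d)

def encode(k):
    # E_k: add masks toward higher-indexed peers, subtract those toward lower-indexed ones.
    x = grads[k].copy()
    for j in range(N):
        if j > k:
            x = (x + pair_mask(k, j)) % q
        elif j < k:
            x = (x - pair_mask(j, k)) % q
    return x

messages = [encode(k) for k in range(N)]
# Decoder: the masks cancel pairwise, leaving only the sum of the gradients mod q.
Y_hat = sum(messages) % q
assert np.array_equal(Y_hat, sum(grads) % q)
```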

The Distributed-Computing Network Model

Figure: Dataset $\mathcal{D}$ is stored across $N$ workers via storage mappings $\varphi_k$. Each worker runs an encoder $E_k$ and sends message $X_k$ to the master, which runs a decoder $\psi$ to produce $\hat Y$. The storage load $\mu$ bounds what each worker holds; the communication load $\Delta$ bounds the total message traffic; the recovery threshold $K$ bounds the number of responses the master waits for.

How Each Problem Specializes the Model

The same network model captures every problem in Parts II–V, with different constraints and adversary models:

| Problem | Extra constraint / adversary | Extra output constraint |
| --- | --- | --- |
| Coded matrix multiply (Ch 5) | Any $K$ messages suffice to decode | $Y = \mathbf{A}^T \mathbf{B}$ |
| Coded shuffling (Ch 7) | Storage $\mu$ fixed across workers | $Y$ = full shuffled set |
| Secure aggregation (Ch 10) | Server is honest-but-curious | $Y = \sum_k \mathbf{g}_k$; server learns nothing else |
| Private info. retrieval (Ch 13) | Any $T$ colluding databases learn nothing about the query | $Y$ = requested file |
| AirComp / wireless FL (Ch 16) | Multiple-access channel for uplink | $Y = \sum_k f_k(x_k)$; noise tolerated |

Each row specializes the decoder's correctness constraint, the adversary's view, or the physical channel between workers and master. The mathematical skeleton — encoders, decoder, rate region — is the same throughout. This is what makes the information-theoretic formulation worth the effort: one framework, many specializations.
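
To see one row concretely, here is a sketch of the coded-matrix-multiply specialization, simplified to a matrix–vector product with hypothetical dimensions: an MDS-style assignment over $N = 3$ workers gives recovery threshold $K = 2$, so the decoder succeeds from any two responses and can ignore a straggler.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))            # tall matrix, split row-wise into A1, A2
x = rng.standard_normal(6)
A1, A2 = A[:2], A[2:]

# Coded storage: three workers store A1, A2, and A1 + A2 (an MDS (3, 2) assignment).
shards = [A1, A2, A1 + A2]

# Worker encoders: each worker sends the product of its coded shard with x.
messages = [S @ x for S in shards]

# Decoder: any K = 2 responses suffice. Suppose worker 1 (storing A1) straggles.
y2, y3 = messages[1], messages[2]
y1 = y3 - y2                               # recover A1 @ x from the parity response
Y_hat = np.concatenate([y1, y2])
assert np.allclose(Y_hat, A @ x)
```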

Theorem: Output Entropy Bounds Aggregate Message Entropy

Any distributed-computing scheme satisfying the correctness condition $\Pr[\hat Y \neq Y] \to 0$ must satisfy
$$\sum_{k=1}^N H(X_k) \;\geq\; H(Y),$$
and, for any subset $\mathcal{S} \subseteq \{1, \ldots, N\}$ of workers whose responses the master uses for decoding,
$$H(X_{\mathcal{S}}) \;\geq\; H(Y \mid X_{\mathcal{S}^c}).$$
The second inequality is a cut-set bound: the traffic across the "cut" between $\mathcal{S}$ and the master must be at least the conditional entropy of the output given the other workers' messages.

Proof sketch. The decoder output $\hat Y = \psi(X_1, \ldots, X_N)$ is a function of the messages that agrees with $Y$ with probability approaching one. By the data-processing inequality and Fano's inequality, $I(Y; X_1, \ldots, X_N) \geq I(Y; \hat Y) \geq H(Y) - o(H(Y))$. Combining this with $\sum_{k=1}^N H(X_k) \geq H(X_1, \ldots, X_N) \geq I(Y; X_1, \ldots, X_N)$ yields the first bound. The cut-set version is the same argument applied to the subset $\mathcal{S}$ after conditioning on the messages $X_{\mathcal{S}^c}$ outside the cut.
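
A toy numerical check of both inequalities (an illustration, not part of the formal development): take $\mathcal{D}$ to be two fair bits, $Y$ their XOR, and let each worker store and send one bit uncoded.

```python
import numpy as np
from collections import Counter
from itertools import product

def H(counts):
    """Shannon entropy in bits of the distribution given by a collection of counts."""
    p = np.asarray(list(counts), dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# D = (D1, D2): two i.i.d. fair bits. Worker k stores bit D_k and sends X_k = D_k.
# The target function is Y = D1 XOR D2.
joint = list(product([0, 1], repeat=2))             # four equally likely realizations
cY  = Counter(d1 ^ d2 for d1, d2 in joint)
cX1 = Counter(d1 for d1, _ in joint)
cX2 = Counter(d2 for _, d2 in joint)

# sum_k H(X_k) >= H(Y):  1 + 1 >= 1.
assert H(cX1.values()) + H(cX2.values()) >= H(cY.values())

# Cut-set bound for S = {1}: H(X_1) >= H(Y | X_2). Knowing D2 alone says nothing
# about the XOR, so H(Y | X_2) = 1 bit and worker 1 must send at least 1 bit.
assert H(cX1.values()) >= 1.0
```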

Key Takeaway

The cut-set bound is the structural lower bound throughout this book. Every converse we prove — the coded-computing lower bound in Chapter 5, the secure-aggregation converse in Chapter 10, the PIR capacity upper bound in Chapter 13 — is a carefully chosen cut-set argument specialized to the problem at hand. It is worth understanding the generic form well.

What the Model Does and Does Not Assume

Our formal model assumes:

  1. Noiseless, bit-pipe communication between each worker and the master. Part V (AirComp) and the wireless-FL chapters replace this with a multiple-access channel.
  2. Full worker compliance with the protocol. Byzantine robustness is added in Chapter 11 by allowing up to $B$ workers to deviate arbitrarily.
  3. Honest-but-curious master (when privacy matters). Stronger adversarial models — e.g., a malicious server colluding with some workers — are introduced in Chapters 10–12 by relaxing this assumption.

The model is simplest when all three assumptions hold; each relaxation reshapes the rate region in a predictable way. The central value of the formal framework is precisely this modularity.

Historical Note: From Network Coding to Coded Computing

2000–2018

The move from thinking about distributed computing in system-design terms to thinking about it information-theoretically mirrors the arc that network coding took in the early 2000s. Ahlswede, Cai, Li, and Yeung's landmark paper (2000) proved that for multicast networks, coding across messages at intermediate nodes can achieve rates unreachable by routing alone. Their "butterfly network" example is structurally identical to the MapReduce shuffle: an all-to-all exchange in which coded messages at the edges allow the receivers to decode what they need from fewer total transmissions. The coded-computing programme of Part II of this book is, in a precise sense, network coding applied to the distributed-computation setting.

🔧Engineering Note

Bit-Pipes Are a Modeling Choice, Not a Law

Modeling worker-to-master links as noiseless bit-pipes is appropriate inside a data center or on a dedicated backhaul, but it is the worst choice for a wireless federated-learning deployment. Over a wireless multiple-access channel, users' signals superpose in the analog domain — the "channel" literally computes $\sum_k X_k$ for free, modulo receiver noise. This observation is the seed of over-the-air computation (Chapter 16) and is an example of how relaxing one modeling assumption can reveal a fundamentally different rate region. Designers who keep the bit-pipe model when the channel is wireless forfeit a free order of magnitude in spectral efficiency.
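
A minimal sketch of the superposition idea, assuming unit channel gains and a hypothetical noise level: all workers transmit their analog values in the same slot, the receiver observes the noisy sum in one channel use, whereas an orthogonal digital uplink spends one slot per worker.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
values = rng.standard_normal(N)            # each worker's scalar (e.g., one gradient entry)

# Over-the-air computation: all workers transmit simultaneously; the MAC superposes
# their analog signals, so the receiver sees sum_k x_k plus noise in ONE channel use.
noise = rng.normal(scale=0.01)
air_sum = values.sum() + noise

# Orthogonal digital uplink: one channel use per worker, then the master adds them.
digital_sum = sum(values[k] for k in range(N))      # N channel uses

print(f"true sum {values.sum():.3f}, AirComp {air_sum:.3f} (1 channel use), "
      f"digital {digital_sum:.3f} ({N} channel uses)")
```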

Practical Constraints
  • Wireless MAC superposition is analog; digital relaying reintroduces the per-user uplink cost

  • Analog AirComp requires synchronization (subcarrier, CFO, and timing) but no heavy coding

  • Current AirComp deployments (e.g., NEXT-G FL testbeds) achieve 5–20× spectrum savings over orthogonal uplink

📋 Ref: 3GPP TR 38.843 (AI/ML for NR air-interface)

Common Mistake: Load $\Delta$ Is Not the Same as Rate $R$

Mistake:

Treating the communication load $\Delta$ as a Shannon rate in bits per channel use.

Correction:

The load $\Delta$ is a normalized communication cost: total bits exchanged, divided by a reference intermediate-file size (or, in other conventions, per input bit). A Shannon rate is bits per channel use and depends on a channel model. For a noiseless bit-pipe the two agree up to units; for a wireless channel the rate is constrained by capacity while the load is not. This distinction will matter in Chapter 16.
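
A toy calculation with hypothetical numbers makes the units explicit: the load is a dimensionless, channel-agnostic count of normalized bits, while a rate (here, the capacity of an assumed real AWGN uplink) converts those bits into channel uses.

```python
import math

# Hypothetical scheme: N workers each send a 1 MB intermediate result to the master.
N = 10
bits_per_message = 8 * 2**20          # 1 MB in bits
V = 8 * 2**20                         # reference intermediate-file size |V|, also 1 MB

# Communication load: dimensionless and channel-agnostic.
delta = N * bits_per_message / V      # = 10

# Shannon rate of a real AWGN uplink at 10 dB SNR: bits per channel use.
snr = 10.0
rate = 0.5 * math.log2(1 + snr)       # ~1.73 bits per channel use

# The channel model turns the load into a time cost; the load alone cannot.
channel_uses = N * bits_per_message / rate
print(f"Delta = {delta}, rate = {rate:.2f} b/use, uses needed = {channel_uses:.2e}")
```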

Quick Check

In the formal network model of this section, what is the role of the cut-set inequality $H(X_{\mathcal{S}}) \geq H(Y \mid X_{\mathcal{S}^c})$?

  • It is an achievability bound telling us how to design the encoders.
  • It is a converse: any scheme must carry at least $H(Y \mid X_{\mathcal{S}^c})$ bits across the cut between $\mathcal{S}$ and the master.
  • It requires the master to know the sources' joint distribution.
  • It shows that uncoded routing is always optimal.