The Coded-Shuffling Tradeoff (CommIT Contribution)

The CommIT-Group Contribution

This section carries the central CommIT contribution of Chapter 7: the Wan / Tuninetti / Caire construction that achieves the information-theoretic lower bound $R^*(M) = N(1 - \mu)/(1 + N\mu)$ of §7.2 with a deterministic, explicit, and efficiently-decodable finite-field IA scheme. The construction mirrors the Maddah-Ali / Niesen coded-caching delivery (Chapter 4) but introduces modifications to handle the data-shuffling structure (per-worker groups, epoch-wise permutations).

The point is that this closes the rate-region of the data-shuffling problem: achievability matches the cut-set converse, giving the exact tradeoff between per-worker memory $M$ and per-epoch communication $R^*(M)$. The result is tagged as a CommIT contribution and is the first of two such contributions in Part II (the second being the ByzSecAgg-foundation polynomial-code extensions in Chapter 11).

Theorem: Optimal Coded-Shuffling Rate

Consider the $(N, D, M)$-data-shuffling problem with per-worker memory $M$ and $N$ workers. For every $\mu = M/D \in \{0, 1/N, 2/N, \ldots, 1\}$, the minimum worst-case per-epoch shuffling rate is
$$R^*(M) \;=\; \frac{N(1 - \mu)}{1 + N\mu}.$$
The Wan / Tuninetti / Caire coded-shuffling scheme (Section 7.3.2) achieves this rate exactly. For non-integer $N\mu$, memory-sharing between adjacent integer points gives a piecewise-linear upper envelope matching the cut-set converse of §7.2.

The rate $R^*(M)$ is a deterministic function of $(N, \mu)$ — not a stochastic bound or an average. This is what makes the result operational: a system architect with per-worker memory budget $M$ and $N$ workers can compute the exact per-epoch shuffling traffic before provisioning the network. The achievability construction (below) also provides the explicit broadcast schedule.

Operationally, this means that investing in $M$-sized per-worker caches pays off by a factor of $1 + N\mu$ in network traffic. For $N = 16$ and $\mu = 0.25$, the savings factor is $5\times$.
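Computed concretely (a minimal sketch; the function name is illustrative), with memory-sharing interpolation between the integer points of $N\mu$ where the formula is exact:

```python
from math import floor

def optimal_shuffle_rate(N: int, mu: float) -> float:
    """R*(M) = N(1 - mu) / (1 + N*mu) at integer values of N*mu;
    memory-sharing (linear interpolation) between adjacent integer points."""
    t = N * mu
    r = lambda t: N * (1 - t / N) / (1 + t)
    lo = floor(t)
    if t == lo or lo >= N:
        return r(min(t, N))
    # piecewise-linear envelope between the two nearest integer points
    w = t - lo
    return (1 - w) * r(lo) + w * r(lo + 1)

N, mu = 16, 0.25
R = optimal_shuffle_rate(N, mu)     # N*mu = 4 is an integer point
uncoded = N * (1 - mu)              # baseline: broadcast all uncached data
print(R, uncoded / R)               # savings factor is 1 + N*mu
```

For the example in the text, $R^*(0.25) = 2.4$ against an uncoded rate of $12$, a $5\times$ reduction.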

🎓 CommIT Contribution (2021)

Coded Data Shuffling for Distributed Machine Learning

K. Wan, D. Tuninetti, G. Caire. IEEE Transactions on Information Theory.

The optimal rate-memory tradeoff $R^*(M) = N(1 - \mu)/(1 + N\mu)$ for coded data shuffling in distributed machine learning training was established by Kai Wan, Daniela Tuninetti, and Giuseppe Caire (CommIT, TU Berlin) in a series of papers culminating in their 2021 IEEE T-IT result. The construction uses finite-field interference alignment in the delivery phase — the same algebraic machinery that powers the coded-caching delivery of §4.3.

Key technical contributions:

  1. Optimal achievability for worst-case demands. The construction works for every permutation $\pi$, not just in expectation. The per-epoch rate is deterministic at $R^*(M)$.

  2. Matching converse via cut-set. The lower bound closes the rate region — no scheme can do better at the same per-worker memory $M$.

  3. Decentralized variant. When the master cannot centrally coordinate placement (federated-learning-style settings), Wan et al. also give a random-placement variant with a mild rate penalty that vanishes as $N \to \infty$.

  4. Demand-private extension (Wan, Tuninetti, Caire 2020 ISIT): the shuffling protocol can be extended to hide which data point each worker is processing from every other worker, at a rate penalty that quantifies the privacy / communication tradeoff of the shuffling setting. This is the closest predecessor to the PIR framework of Chapter 13.

The result is the third CommIT group contribution tagged in this book, after Shamir-based MPC-foundations (Chapter 3's commit-ch03-foundational) and the polynomial-code extensions (Chapter 5). Chapter 15 (cache-aided PIR) extends the framework further, also with CommIT involvement.


Wan / Tuninetti / Caire Coded-Shuffling Delivery

Complexity: $\binom{N}{N\mu + 1}$ broadcasts per epoch. Total bits $= N(1 - \mu)D/(1 + N\mu)$.
Input: $N$ workers, dataset $\{W_1, \ldots, W_D\}$;
placement: each worker $k$ stores subfiles $\{W_{d, \mathcal{S}} : k \in \mathcal{S}\}$ for all subsets $\mathcal{S} \subset [N]$
of size $N\mu$. Epoch permutation $\pi$.
Output: Broadcast messages that deliver each worker's
needed data.
1. for each subset $\mathcal{S} \subset [N]$ of size $N\mu + 1$ do
2. $\quad$ For each $k \in \mathcal{S}$, let $d_k = \pi^{-1}(\text{slot}_k)$, the data point worker $k$ needs from within $\mathcal{S}$.
3. $\quad$ Broadcast:
$$M_{\mathcal{S}} \;=\; \bigoplus_{k \in \mathcal{S}} W_{d_k,\, \mathcal{S} \setminus \{k\}}$$
4. end for
Decoding: Worker $k$ receives every $M_{\mathcal{S}}$ with $k \in \mathcal{S}$. From its cached subfiles, it XOR-cancels the contributions of the other members of $\mathcal{S}$ and extracts $W_{d_k, \mathcal{S} \setminus \{k\}}$. Iterating across all $\mathcal{S}$ reconstructs every needed subfile.

The construction is identical in spirit to Maddah-Ali / Niesen coded-caching delivery (§4.3), with two modifications: (i) per-epoch demands (each worker's slot in the permutation) replace per-user requests (each user's file request in caching), (ii) the subfile indexing is adapted to the data-shuffling structure. The overall broadcast count and per-epoch rate are identical.
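The delivery and decoding steps above can be simulated end-to-end in a few dozen lines. The sketch below (variable names are illustrative, not from the paper) runs one delivery round, i.e. one demanded data point per worker, for $N = 3$ and $t = N\mu = 1$; a full epoch repeats the round for each of a worker's $D/N$ assigned points.

```python
from itertools import combinations
import os

N, D, t = 3, 6, 1          # workers, data points, t = N*mu

# Placement: split each data point into binom(N, t) subfiles, one per
# size-t subset; worker k caches every subfile whose index set contains k.
subfile = {(d, S): os.urandom(8)
           for d in range(D)
           for S in combinations(range(N), t)}
cache = {k: {key: val for key, val in subfile.items() if k in key[1]}
         for k in range(N)}

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One round of demands: worker k needs data point demand[k] this round.
demand = {0: 3, 1: 4, 2: 5}

# Delivery: one XOR broadcast per subset S of size t + 1.
broadcasts = {}
for S in combinations(range(N), t + 1):
    msg = bytes(8)
    for k in S:
        msg = xor(msg, subfile[(demand[k], tuple(j for j in S if j != k))])
    broadcasts[S] = msg

# Decoding: worker k XOR-cancels the cached terms of the other members
# of S and is left with its missing subfile W_{demand[k], S \ {k}}.
for k in range(N):
    for S, msg in broadcasts.items():
        if k not in S:
            continue
        recovered = msg
        for j in S:
            if j != k:
                recovered = xor(recovered,
                                cache[k][(demand[j], tuple(i for i in S if i != j))])
        assert recovered == subfile[(demand[k], tuple(j for j in S if j != k))]

print(f"{len(broadcasts)} broadcasts, all subfiles decoded")
```

Each broadcast serves $t + 1 = N\mu + 1$ workers at once, which is exactly the source of the $1 + N\mu$ gain over uncoded delivery.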

Worked Example: $N = 3$, $D = 6$, $M = 2$

Illustrate the Wan / Tuninetti / Caire coded-shuffling scheme on $N = 3$ workers, $D = 6$ data points, per-worker memory $M = 2$. Work out one epoch's broadcast schedule and verify the rate matches the bound.
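As a quick numeric check of these parameters (a sketch; the binomial counts follow directly from the placement and delivery rules above, using exact rational arithmetic to avoid float noise):

```python
from math import comb
from fractions import Fraction

N, D, M = 3, 6, 2
mu = Fraction(M, D)              # 1/3
t = int(N * mu)                  # N*mu = 1: subfiles indexed by size-1 subsets

subfiles_per_point = comb(N, t)          # each point splits into binom(3,1) = 3 subfiles
broadcasts_per_round = comb(N, t + 1)    # binom(3,2) = 3 XOR broadcasts per round
R_star = N * (1 - mu) / (1 + N * mu)     # optimal per-epoch rate

print(subfiles_per_point, broadcasts_per_round, R_star)   # -> 3 3 1
```

So each data point is split into 3 subfiles, each delivery round uses 3 XOR broadcasts, and the optimal rate evaluates to $R^* = 3 \cdot (2/3)/2 = 1$, matching the bound.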

Coded Data Shuffling: XOR Delivery

Animation of the Wan / Tuninetti / Caire coded-shuffling scheme: placement assigns subfiles via subsets, delivery XORs the per-subset demands into single broadcasts, each worker decodes by XOR-cancelling with its cache.

Optimal Rate-Memory Tradeoff: Wan / Tuninetti / Caire

Plot the optimal shuffling rate $R^*(M) = N(1-\mu)/(1+N\mu)$ against the per-worker memory $\mu = M/D$, for several values of $N$. Also show: (i) the uncoded baseline, and (ii) the rate savings factor $1 + N\mu$. The Wan / Tuninetti / Caire achievability matches the cut-set converse, closing the rate region.
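The curve data behind such a plot can be generated in a few lines (a sketch; only the integer points of $N\mu$ are emitted, where the closed form is exact — memory-sharing interpolates linearly between them):

```python
def tradeoff_points(N):
    """(mu, R*) pairs at the integer points N*mu = 0, 1, ..., N."""
    return [(t / N, N * (1 - t / N) / (1 + t)) for t in range(N + 1)]

for N in (4, 16, 64):
    pts = tradeoff_points(N)
    # mu = 0 gives the uncoded rate N; mu = 1 gives rate 0
    print(N, pts[0], pts[-1])
```

The resulting point lists can be fed to any plotting tool to reproduce the tradeoff curves for the chosen values of $N$.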

Parameters: number of workers $N$ (default 16).

Key Takeaway

Wan / Tuninetti / Caire close the rate region of distributed data shuffling. The achievability matches the cut-set lower bound at $R^*(M) = N(1 - \mu)/(1 + N\mu)$, giving the exact minimum per-epoch shuffling traffic. Each unit of per-worker memory buys a $1 + N\mu$ factor reduction in network traffic — operationally comparable to the coded-caching gain but specialized to the ML training setting. This is the CommIT group's signature Part-II contribution.

Common Mistake: Rate Savings Factor Depends on $N\mu$, Not Just $\mu$

Mistake:

Assume a moderate memory fraction $\mu$ gives modest gains independent of $N$.

Correction:

The savings factor is $1 + N\mu$ — it scales with both $N$ and $\mu$. For $\mu = 0.1$ at $N = 100$, the savings factor is $11\times$; at $N = 10$ it is only $2\times$. This is why coded shuffling becomes more attractive for larger clusters, not less: the opposite of what one might naively expect.
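The scaling is easy to tabulate (a minimal sketch; memory fraction fixed at $\mu = 0.1$ while the cluster grows):

```python
mu = 0.1
for N in (10, 50, 100, 500):
    # savings factor 1 + N*mu grows linearly with N at fixed mu
    print(N, 1 + N * mu)
```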

⚠️Engineering Note

Deployment of Coded Shuffling

Production deployment of coded shuffling has been limited, despite the clean information-theoretic result. The main engineering barriers are:

  1. Centralized placement coordination. The Wan et al. scheme requires a centralized controller to assign subfiles to workers during placement. In distributed training, the natural controller is the master / coordinator role. This works in data-center training but is harder in hybrid wireless / edge settings.

  2. Subfile granularity. The construction requires each data point to be split into $\binom{N}{N\mu}$ subfiles, which for $N = 100$ and $\mu = 0.2$ means $\binom{100}{20} \approx 5 \cdot 10^{20}$ subfiles per data point — clearly infeasible. In practice, the dataset is pre-sharded into $\binom{N}{N\mu}$ equivalence classes of data points.

  3. Decentralized variants. Wan et al. also give a decentralized placement (independent per-worker random caching), which loses a sub-logarithmic factor in the rate but is far easier to deploy.

For data-center training of moderate-scale models (10–100 workers, TB-scale datasets), the scheme is deployable and yields 5–10× shuffling-cost reductions. For federated learning (no cross-user data movement), the scheme does not directly apply; Chapter 9 discusses alternatives.
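The subfile-granularity barrier from point 2 above is easy to quantify (a sketch; counting subfiles per data point at a fixed $\mu = 0.2$ as the cluster grows):

```python
from math import comb

mu = 0.2
for N in (10, 20, 50, 100):
    t = round(N * mu)
    # binom(N, N*mu) subfiles per data point under the centralized placement
    print(N, comb(N, t))
```

The count grows exponentially in $N$, which is why production deployments shard the dataset into equivalence classes rather than physically splitting every data point.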

Practical Constraints
  • Centralized placement: feasible in data centers; harder in federated / edge

  • Subfile explosion: $\binom{N}{N\mu}$ — sharded in practice

  • Decentralized variant: near-optimal with random placement

📋 Ref: Wan/Tuninetti/Caire 2021 IEEE T-IT §VI

Historical Note: The Wan / Tuninetti / Caire Program

2017–2021

The coded-data-shuffling line of work by Kai Wan (TU Berlin, then Shanghai), Daniela Tuninetti (UIC), and Giuseppe Caire (TU Berlin) began around 2017, as a natural extension of the Maddah-Ali / Niesen coded-caching framework to distributed machine learning workloads. Early papers (ISIT 2018, 2019) established achievability and lower bounds for progressively more general settings: worst-case demands, random permutations, privacy constraints. The 2020 ISIT paper introduced demand-private shuffling (a predecessor to the PIR extensions of Chapter 15), and the 2021 IEEE T-IT paper gave the complete rate region. The framework is one of the CommIT group's most-cited coded-computing contributions, bridging classical caching theory with modern ML-systems concerns.

Quick Check

For $N = 10$ workers and per-worker memory fraction $\mu = 0.3$, the Wan / Tuninetti / Caire optimal shuffling rate is:

$R^* = 1.75$

$R^* = 7$

$R^* = 10$

$R^* = 0.3$