Exercises
ex-ch07-01
Easy. For $K$ workers, dataset size $D$, and per-worker memory $M$ (with $t = KM/D$ an integer), compute the uncoded shuffling cost $R_{\text{uncoded}}$ and the Wan / Tuninetti / Caire coded rate $R_{\text{coded}}$.
$R_{\text{uncoded}} = D(1 - M/D)$; $R_{\text{coded}} = \frac{K-t}{t+1}\cdot\frac{D}{K}$.
Plug in
$R_{\text{uncoded}} = D(1 - M/D) = D - M$ and $R_{\text{coded}} = \frac{K-t}{t+1}\cdot\frac{D}{K}$ with $t = KM/D$. Note $D - M = \frac{(K-t)D}{K}$, so the two costs differ only by the factor $t+1$.
Ratio
Savings factor: $\frac{R_{\text{uncoded}}}{R_{\text{coded}}} = t + 1 = 1 + \frac{KM}{D}$, matching the theoretical improvement.
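A quick numeric sanity check of these formulas; the parameter values below are illustrative assumptions, not the ones from the exercise statement.

```python
# Sketch: uncoded vs. coded shuffling cost; K, D, M are illustrative assumptions.
K, D, M = 10, 100, 30                  # workers, data points, per-worker memory
t = K * M // D                          # replication parameter t = KM/D = 3 here

r_uncoded = D - M                       # D(1 - M/D): every uncached point is shipped
r_coded = (K - t) / (t + 1) * D / K     # Wan / Tuninetti / Caire rate

print(r_uncoded, r_coded, r_uncoded / r_coded)   # 70 17.5 4.0 -> savings = t + 1
```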
ex-ch07-02
Easy. Why does the Wan / Tuninetti / Caire coded-shuffling scheme achieve the optimal rate, i.e., why do achievability and converse match?
Construction is finite-field IA; converse is cut-set.
Achievability
The coded-shuffling construction uses finite-field IA to XOR-combine per-subset demands into single broadcasts, achieving rate exactly $\frac{K-t}{t+1}\cdot\frac{D}{K}$ (for integer $t = KM/D$).
Converse
The cut-set argument of §7.2 shows that no scheme can achieve a smaller rate at the same per-worker memory, giving a matching lower bound.
Implication
Matching upper and lower bounds close the rate region: no cleverer scheme exists at this memory level. The result is information-theoretically tight.
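In symbols, the closed rate region reads (a restatement in the per-epoch normalization used in ex-ch07-05; non-integer $t$ follows by memory sharing between adjacent integer points):

$$R^\ast(M) \;=\; \frac{K - t}{t + 1}\cdot\frac{D}{K}, \qquad t = \frac{KM}{D} \in \{0, 1, \dots, K\},$$

with achievability from the finite-field IA construction and the converse from the cut-set bound.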
ex-ch07-03
Easy. State why federated learning does not benefit from the coded-shuffling framework of this chapter, and what alternative shuffling strategies FL uses instead.
Why FL doesn't shuffle cross-user
In FL, each user's data is private and stays on the local device — there is no cross-user data transfer. Instead, FL samples users (random subset per round) and each selected user processes its own local data.
The "shuffle" replacement
User sampling per round gives the random-access property that SGD needs; no data movement is required. This is one of FL's advantages over data-center training for privacy-sensitive applications.
When coded shuffling does apply to FL
Hybrid FL systems (e.g., cross-silo with shared data centers) can use coded shuffling within each silo. Pure cross-device FL cannot.
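A minimal sketch of the "shuffle replacement" in cross-device FL; the function and parameter names are illustrative, not from any FL framework.

```python
import random

def sample_round_participants(num_users, users_per_round, rng):
    """One FL 'shuffle': pick a fresh random user subset for this round.

    Randomness enters through user sampling; each selected user then trains
    on its own local data, so no data point ever crosses devices.
    """
    return rng.sample(range(num_users), users_per_round)

rng = random.Random(0)
for rnd in range(3):
    print(f"round {rnd}: users {sample_round_participants(1000, 5, rng)}")
```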
ex-ch07-04
Easy. For random-reshuffling SGD vs. i.i.d. sampling, state the convergence rates and explain why the former is preferred.
$O(1/T)$ vs. $O(1/T^2)$.
Rates
i.i.d. sampling with replacement: $O(1/T)$ after $T$ epochs. Random reshuffling: $O(1/T^2)$ (for strongly convex objectives, once $T$ is large enough).
Why
Random reshuffling visits every data point exactly once per epoch, reducing gradient variance compared to i.i.d. sampling (which may sample some points multiple times per epoch, others not at all).
Practical upshot
Reshuffling is faster per unit of "effective" iterations: the error after $T$ epochs is smaller by a factor on the order of $T$, hence the importance of the shuffling step.
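A toy simulation of the gap; the quadratic objective $f(x) = \frac{1}{2n}\sum_i (x - a_i)^2$ and the $1/k$ step size are illustrative assumptions. With this step size the iterate is exactly the running mean of the sampled points, so reshuffling recovers the minimizer after each full pass, while i.i.d. sampling keeps residual sampling noise.

```python
import random

random.seed(0)
n = 100
a = [random.gauss(0.0, 1.0) for _ in range(n)]    # f(x) = (1/2n) sum_i (x - a_i)^2
x_star = sum(a) / n                                # minimizer: the mean of the a_i

def sgd_error(epochs, reshuffle):
    x, k, idx = 0.0, 0, list(range(n))
    for _ in range(epochs):
        if reshuffle:
            random.shuffle(idx)                               # one pass, no repeats
            order = list(idx)
        else:
            order = [random.randrange(n) for _ in range(n)]   # i.i.d. w/ replacement
        for i in order:
            k += 1
            x -= (1.0 / k) * (x - a[i])            # gradient step on (x - a_i)^2 / 2
    return abs(x - x_star)

print("i.i.d. :", sgd_error(200, reshuffle=False))   # sampling noise remains
print("reshuf.:", sgd_error(200, reshuffle=True))    # ~1e-16: exact after full passes
```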
ex-ch07-05
Medium. Construct a concrete coded-shuffling placement and one epoch's delivery for $K = 4$ workers, $D = 8$ data points, $M = 2$ (so $t = KM/D = 1$). Specify subfile partitions, placement, and broadcast messages for a particular permutation $\pi$.
Each data point splits into $\binom{K}{t} = 4$ subfiles.
Placement
Split each data point $d_n$, $n = 1, \dots, 8$, into subfiles $d_{n,\{i\}}$, $i \in \{1,2,3,4\}$. Each $d_n$ has 4 subfiles. Worker $i$ stores $d_{n,\{i\}}$ for all $n$: 8 subfiles × 1/4 = 2 data points $= M$. ✓
Example permutation
$\pi$ assigns worker 1 the points $\{d_1, d_2\}$, worker 2 $\{d_3, d_4\}$, worker 3 $\{d_5, d_6\}$, worker 4 $\{d_7, d_8\}$.
Delivery (pick subset $\{1, 2\}$)
Worker 1 needs $d_{1,\{2\}}, d_{2,\{2\}}$ from worker 2's cache; worker 2 needs $d_{3,\{1\}}, d_{4,\{1\}}$ from worker 1's cache. Broadcasts: $d_{1,\{2\}} \oplus d_{3,\{1\}}$ and $d_{2,\{2\}} \oplus d_{4,\{1\}}$. Worker 1 XORs the first broadcast with its cached $d_{3,\{1\}}$ to recover $d_{1,\{2\}}$; worker 2 XORs with its cached $d_{1,\{2\}}$ to recover $d_{3,\{1\}}$. Each worker decodes by XOR-ing with its cached companion subfile.
Rate
Total broadcasts: $\binom{4}{2} = 6$ subsets, each contributing 2 broadcasts of subfile size $\frac{1}{4}$, for $6 \times 2 \times \frac{1}{4} = 3$ data points, equal to the theoretical rate $\frac{K-t}{t+1}\cdot\frac{D}{K} = \frac{3}{2}\cdot 2 = 3$. Uncoded would be $D(1 - M/D) = 6$: a $2\times$ reduction.
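A sketch that enumerates this epoch's broadcasts and checks the count; the block assignment matches the permutation chosen above, and pairing the two demands on each side of a subset is one convenient choice.

```python
from itertools import combinations

K, D, t = 4, 8, 1
assign = {w: [2 * w - 1, 2 * w] for w in range(1, K + 1)}   # pi: worker w gets d_{2w-1}, d_{2w}

# Worker w caches d_{n,{w}} for every n; within subset {i, j} it is missing
# d_{n,{j}} for its two assigned points, so the pair exchanges 2 XOR broadcasts.
broadcasts = []
for i, j in combinations(range(1, K + 1), t + 1):
    for n_i, n_j in zip(assign[i], assign[j]):
        broadcasts.append(f"d[{n_i},{{{j}}}] xor d[{n_j},{{{i}}}]")

print(len(broadcasts), len(broadcasts) / 4)   # 12 subfile-sized XORs = 3 data points
```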
ex-ch07-06
Medium. Prove that the coded-shuffling rate $R$ is decreasing in the per-worker memory $M$. What is the physical interpretation?
Differentiate w.r.t. $t = KM/D$.
Compute derivative
$R(t) = \frac{K-t}{t+1}\cdot\frac{D}{K}$. Let $t = KM/D$, so $\frac{dR}{dM} = \frac{K}{D}\cdot\frac{dR}{dt}$. The quotient rule gives $\frac{dR}{dt} < 0$ (displayed below). Hence $R$ is strictly decreasing in $M$.
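Explicitly (both derivatives are negative for all $0 \le t < K$):

$$\frac{dR}{dt} = \frac{-(t+1) - (K-t)}{(t+1)^2}\cdot\frac{D}{K} = -\frac{(K+1)\,D}{K\,(t+1)^2} < 0, \qquad \frac{dR}{dM} = \frac{K}{D}\,\frac{dR}{dt} = -\frac{K+1}{(t+1)^2} < 0.$$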
Physical interpretation
More per-worker memory $\Rightarrow$ each worker already holds more of the required permuted data $\Rightarrow$ fewer broadcasts needed. Memory helps twice: it shrinks the demand $D - M$ and it enlarges the alignment gain $t+1$, so the decrease is steeper than linear, reflecting the combinatorial alignment gain.
ex-ch07-07
Medium. For $K$ workers with integer $t = KM/D$, compute both the centralized coded-shuffling rate and the decentralized variant. Quantify the gap.
Centralized: $\frac{K-t}{t+1}$. Decentralized: $\frac{K-t}{t}$.
Centralized
$R_{\text{cen}} = \frac{K-t}{t+1}\cdot\frac{D}{K}$.
Decentralized
$R_{\text{dec}} = \frac{K-t}{t}\cdot\frac{D}{K}$.
Gap
$R_{\text{dec}}/R_{\text{cen}} = \frac{t+1}{t}$: about $100/t\,\%$ overhead. At small $t$ (the borderline case $t = 1$ gives a full $2\times$), the decentralized scheme is noticeably worse. For large $t$ the gap shrinks to $1$.
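Tabulating the $\frac{t+1}{t}$ overhead (per-worker normalized rates; the decentralized formula is the achievable rate assumed in the solution above):

```python
# Gap between decentralized and centralized shuffling as memory (t) grows.
for t in (1, 2, 4, 8, 16):
    K = 4 * t                              # illustrative: keep M/D = 1/4 fixed
    r_cen = (K - t) / (t + 1)
    r_dec = (K - t) / t
    print(f"t={t:2d}  gap={r_dec / r_cen:.3f}")   # (t+1)/t: 2.0, 1.5, ... -> 1
```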
ex-ch07-08
Medium. Compute the demand-private shuffling rate for $K$ workers, $t = KM/D = 4$, collusion threshold $L = 2$. Compare with the non-private rate.
$R_{\text{private}} = \frac{K-t}{t+1-L}$.
Non-private
$R_{\text{non-private}} = \frac{K-t}{t+1} = \frac{K-4}{5}$.
Private
$R_{\text{private}} = \frac{K-t}{t+1-L} = \frac{K-4}{3}$.
Penalty
Relative penalty: $\frac{t+1}{t+1-L} = \frac{5}{3}$, i.e., a $67\%$ rate increase to hide demands from any 2 colluding workers. Higher $L$ inflates more; at $L = t = 4$ the rate becomes $K - 4$, $5\times$ the non-private cost.
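The privacy penalty as a function of the collusion threshold, for the solution's $t = 4$; the ratio $\frac{t+1}{t+1-L}$ does not depend on $K$.

```python
t = 4
for L in range(t + 1):
    print(L, (t + 1) / (t + 1 - L))   # 1.0, 1.25, 1.67 (L=2), 2.5, 5.0 (L=t)
```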
ex-ch07-09
Medium. Explain how the coded-shuffling construction of §7.3 reuses the finite-field IA machinery of Chapter 4 directly. Identify the specific alignment happening in the XOR broadcasts.
Each broadcast aligns $t+1$ demands into one transmission.
Alignment in broadcasts
Each subset of size $t+1$ contributes one broadcast that simultaneously serves $t+1$ workers' demands. The workers' cached subfiles "align" onto a common subspace where the XOR-cancellation happens, freeing the orthogonal direction for the intended demand.
Parallel with §4.3
In §4.3's coded caching, each broadcast satisfies $t+1$ user demands; here each broadcast satisfies $t+1$ worker demands. The mechanism is identical: cached content cancels interferers, leaving only the desired information.
The key transfer
The alignment capacity $t+1$ is exactly the DoF gain of finite-field IA specialized to this communication structure. Chapter 7's Wan / Tuninetti / Caire result inherits this algebraic machinery.
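The alignment in one concrete step: an XOR broadcast serves two workers at once because each cancels the interfering subfile from its own cache. Byte strings stand in for subfiles; the names are illustrative.

```python
# Subfile demanded by worker 1 but cached only at worker 2, and vice versa.
d_a2 = b"subfile-A2"      # worker 1's demand
d_b1 = b"subfile-B1"      # worker 2's demand

xor = lambda u, v: bytes(x ^ y for x, y in zip(u, v))

bcast = xor(d_a2, d_b1)               # one broadcast, aligned for both workers
assert xor(bcast, d_b1) == d_a2       # worker 1 cancels its cached d_b1
assert xor(bcast, d_a2) == d_b1       # worker 2 cancels its cached d_a2
```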
ex-ch07-10
Medium. Why is centralized placement more efficient than decentralized placement? What is the underlying combinatorial reason?
Centralized placement ensures each subset is 'covered' exactly once.
Centralized optimality
Centralized placement ensures every size-$t$ subset of workers stores its corresponding subfile at exactly those $t$ workers. The broadcast structure perfectly exploits this combinatorial regularity.
Decentralized inefficiency
Random placement creates probabilistic coverage: some subsets are over-covered (redundant storage), others under-covered (extra broadcasts needed). The mismatch between the actual and ideal subset-cardinality distributions causes the rate gap.
Asymptotic
As $K \to \infty$, the law of large numbers smooths out the randomness, and the decentralized coverage approaches the centralized one. The gap is $O(1/t)$, consistent with the $\frac{t+1}{t}$ ratio of ex-ch07-07.
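A Monte Carlo sketch of the coverage mismatch: under independent random placement each subfile is stored at a Binomial$(K, M/D)$ number of workers, whereas centralized placement hits $t = KM/D$ exactly. Parameters are illustrative.

```python
import random

random.seed(1)
K, p, subfiles = 12, 0.25, 20000      # workers, caching prob M/D, sampled subfiles
t = round(K * p)                       # centralized coverage: exactly t = 3 workers

hits = sum(
    sum(random.random() < p for _ in range(K)) == t   # coverage of one subfile
    for _ in range(subfiles)
)
print(f"P(coverage == t) ~ {hits / subfiles:.2f}")   # ~0.26: most subfiles mis-covered
```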
ex-ch07-11
Hard. Prove the cut-set lower bound for coded shuffling: $R \ge \frac{K-t}{t+1}\cdot\frac{D}{K}$. Sketch the key inequalities.
Use the output-entropy bound from Chapter 2's §2.1.
Output-entropy bound
For any scheme, the broadcast messages $X$ and worker $i$'s cache $Z_i$ must together determine the data $\pi$ assigns to worker $i$, so $H(X) \ge H(d_{\pi(i)} \mid Z_i)$. With $M = tD/K$, summing over the $K$ workers, the total uncached demand is $D - M = \frac{(K-t)D}{K}$.
Alignment factor
Each broadcast bit can simultaneously serve at most $t+1$ distinct workers' demands (the finite-field IA alignment capacity). Hence the broadcast length is at least $\frac{(D/K)(1 - M/D)}{t+1}$ per worker, or, normalized, $\frac{1 - M/D}{t+1}$ per assigned data point. Scaling by $K$ workers gives $R \ge \frac{D(1 - M/D)}{t+1} = \frac{K-t}{t+1}\cdot\frac{D}{K}$.
Match
The Wan et al. construction achieves this bound with equality, closing the rate region.
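The whole argument compresses into one chain (same normalization as ex-ch07-05):

$$R \;\ge\; \frac{\text{total uncached demand}}{\text{demands served per broadcast unit}} \;=\; \frac{D\,(1 - M/D)}{t + 1} \;=\; \frac{K - t}{t + 1}\cdot\frac{D}{K}.$$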
ex-ch07-12
Hard. Extend the Wan / Tuninetti / Caire scheme to $L$-demand-private shuffling. Show that the rate becomes $\frac{K-t}{t+1-L}$ and explain where the $-L$ comes from.
Use ramp secret sharing to mask the demands.
Construction
Combine the Wan et al. coded-shuffling placement with an $L$-private ramp secret sharing of the demand indices. Each broadcast carries one informative unit per served demand and $L$ random mask units.
Alignment capacity reduction
Each broadcast bit can serve $t+1-L$ demands (the remaining capacity after $L$ units are spent on masking). So the number of broadcasts per epoch is inflated by $\frac{t+1}{t+1-L}$, giving rate $\frac{K-t}{t+1-L}$.
Privacy guarantee
Any $L$ colluding workers see only masked shares; by the ramp secret-sharing property, they learn nothing about the demand of any other worker. Privacy is information-theoretic.
Feasibility
The scheme requires $L \le t$; beyond this, the alignment capacity is exhausted and no scheme achieves demand privacy at the given memory level.
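Numerically, the rate inflation and the feasibility wall at $L = t$; the values of $K$ and $t$ below are illustrative.

```python
K, t = 8, 4
for L in range(t + 2):
    if L > t:
        print(L, "infeasible: alignment capacity t + 1 - L exhausted")
    else:
        print(L, (K - t) / (t + 1 - L))   # 0.8 at L=0, up to 4.0 at L=t
```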
ex-ch07-13
Hard. Consider heterogeneous per-worker memory budgets: $K_1$ workers with memory $M_1$, $K_2$ workers with memory $M_2$ ($M_1 < M_2$). Conjecture the optimal rate and argue why the homogeneous lower bound must weaken.
The rate region becomes piecewise-defined.
Guess form
Each subset of workers contributes broadcasts in proportion to its alignment capacity, which is limited by the smallest memory in the subset. The rate therefore depends on the joint distribution of memory budgets.
Example at $M_1 = 0$, $M_2 = D$
The $K_1$ workers cache nothing, the $K_2$ workers cache everything. The rate reduces to $K_1 \cdot \frac{D}{K}$ (with $K = K_1 + K_2$): just the no-cache subset of workers must be served. The high-memory workers need no broadcasts.
General bound
The rate is at least the homogeneous cut-set bound applied to the total memory $K_1 M_1 + K_2 M_2$, a piecewise-linear function of the memory distribution. Full characterization is an open problem (see Chapter 18).
ex-ch07-14
Hard. Describe a hypothetical hybrid scheme combining coded gradient computation (Chapter 6) with coded data shuffling (this chapter). What benefits arise and what additional costs?
Both operations are linear; they can be composed.
Composition idea
At each epoch, the master broadcasts the shuffle via coded shuffling. Each worker then computes its coded gradient (over $s+1$ partitions, for straggler tolerance $s$) on the newly shuffled local data. The master aggregates via gradient coding.
Benefits
Both shuffling and aggregation bottlenecks are addressed simultaneously. Per-round latency is reduced; straggler tolerance is improved.
Costs
Per-worker storage must cover both the shuffling cache and the gradient-coding redundancy of $s+1$ partitions. This can be significant for large models and datasets.
Status
Such hybrid schemes are a natural research direction. Partial results exist in the Ye-Abbe 2018 paper on joint communication-computation tradeoffs. A complete rate-region characterization is open.
ex-ch07-15
Challenge. Open problem. Characterize the optimal achievability-converse closure for the demand-private coded-shuffling problem with $L$ colluders and straggler tolerance $S$. Is there a clean formula $R(L, S)$?
Think about what the cut-set adds for stragglers.
Partial result
Achievability by composition: Wan et al. coded shuffling + ramp secret sharing (privacy) + gradient-coding-style straggler tolerance. The composed rate is at most $\frac{K-t}{t+1-L-S}$.
Conjecture
The matching converse may be $R \ge \frac{K-t}{t+1-L-S}$, assuming $L + S \le t$. The privacy and straggler penalties each subtract one unit from the alignment capacity.
Status
The conjecture is consistent with special cases ($L = 0$ recovers gradient coding; $S = 0$ recovers Wan et al. private shuffling), but a general converse is open. This is a research-level exercise at the intersection of coded shuffling, gradient coding, and PIR.