Exercises
ex-ch07-01
Easy. For $K$ workers, dataset size $D$, and per-worker memory $M$ (with $t = KM/D$ an integer), compute the uncoded shuffling cost $R_{\text{uncoded}}$ and the Wan / Tuninetti / Caire coded rate $R_{\text{coded}}$.
$R_{\text{uncoded}} = D(1 - M/D)$; $R_{\text{coded}} = \frac{K-t}{t+1}\cdot\frac{D}{K}$.
Plug in
$R_{\text{uncoded}} = D(1 - M/D) = D - M$ and $R_{\text{coded}} = \frac{K-t}{t+1}\cdot\frac{D}{K}$ with $t = KM/D$. Note $D - M = \frac{(K-t)D}{K}$, so the two costs differ only by the factor $t+1$.
Ratio
Savings factor: $\frac{R_{\text{uncoded}}}{R_{\text{coded}}} = t + 1 = 1 + \frac{KM}{D}$, matching the theoretical improvement.
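A quick numeric sanity check of these formulas; the parameter values below are illustrative assumptions, not the ones from the exercise statement.

```python
# Sketch: uncoded vs. coded shuffling cost; K, D, M are illustrative assumptions.
K, D, M = 10, 100, 30                  # workers, data points, per-worker memory
t = K * M // D                          # replication parameter t = KM/D = 3 here

r_uncoded = D - M                       # D(1 - M/D): every uncached point is shipped
r_coded = (K - t) / (t + 1) * D / K     # Wan / Tuninetti / Caire rate

print(r_uncoded, r_coded, r_uncoded / r_coded)   # 70 17.5 4.0 -> savings = t + 1
```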
ex-ch07-02
Easy. Why does the Wan / Tuninetti / Caire coded-shuffling scheme achieve the optimal rate, i.e., why do achievability and converse match?
Construction is finite-field IA; converse is cut-set.
Achievability
The coded-shuffling construction uses finite-field IA to XOR-combine per-subset demands into single broadcasts, achieving rate exactly $\frac{K-t}{t+1}\cdot\frac{D}{K}$ (for integer $t = KM/D$).
Converse
The cut-set argument of §7.2 shows that no scheme can achieve a smaller rate at the same per-worker memory, giving a matching lower bound.
Implication
Matching upper and lower bounds close the rate region: no cleverer scheme exists at this memory level. The result is information-theoretically tight.
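In symbols, the closed rate region reads (a restatement in the per-epoch normalization used in ex-ch07-05; non-integer $t$ follows by memory sharing between adjacent integer points):

$$R^\ast(M) \;=\; \frac{K - t}{t + 1}\cdot\frac{D}{K}, \qquad t = \frac{KM}{D} \in \{0, 1, \dots, K\},$$

with achievability from the finite-field IA construction and the converse from the cut-set bound.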
ex-ch07-03
Easy. State why federated learning does not benefit from the coded-shuffling framework of this chapter, and what alternative shuffling strategies FL uses instead.
Why FL doesn't shuffle cross-user
In FL, each user's data is private and stays on the local device — there is no cross-user data transfer. Instead, FL samples users (random subset per round) and each selected user processes its own local data.
The "shuffle" replacement
User sampling per round gives the random-access property that SGD needs; no data movement is required. This is one of FL's advantages over data-center training for privacy-sensitive applications.
When coded shuffling does apply to FL
Hybrid FL systems (e.g., cross-silo with shared data centers) can use coded shuffling within each silo. Pure cross-device FL cannot.
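A minimal sketch of the "shuffle replacement" in cross-device FL; the function and parameter names are illustrative, not from any FL framework.

```python
import random

def sample_round_participants(num_users, users_per_round, rng):
    """One FL 'shuffle': pick a fresh random user subset for this round.

    Randomness enters through user sampling; each selected user then trains
    on its own local data, so no data point ever crosses devices.
    """
    return rng.sample(range(num_users), users_per_round)

rng = random.Random(0)
for rnd in range(3):
    print(f"round {rnd}: users {sample_round_participants(1000, 5, rng)}")
```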
ex-ch07-04
Easy. For random-reshuffling SGD vs. i.i.d. sampling, state the convergence rates and explain why the former is preferred.
$O(1/T)$ vs. $O(1/T^2)$.
Rates
i.i.d. sampling with replacement: $O(1/T)$ after $T$ epochs. Random reshuffling: $O(1/T^2)$ (for strongly convex objectives, once $T$ is large enough).
Why
Random reshuffling visits every data point exactly once per epoch, reducing gradient variance compared to i.i.d. sampling (which may sample some points multiple times per epoch, others not at all).
Practical upshot
Reshuffling is faster per unit of "effective" iterations: the error after $T$ epochs is smaller by a factor on the order of $T$, hence the importance of the shuffling step.
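A toy simulation of the gap; the quadratic objective $f(x) = \frac{1}{2n}\sum_i (x - a_i)^2$ and the $1/k$ step size are illustrative assumptions. With this step size the iterate is exactly the running mean of the sampled points, so reshuffling recovers the minimizer after each full pass, while i.i.d. sampling keeps residual sampling noise.

```python
import random

random.seed(0)
n = 100
a = [random.gauss(0.0, 1.0) for _ in range(n)]    # f(x) = (1/2n) sum_i (x - a_i)^2
x_star = sum(a) / n                                # minimizer: the mean of the a_i

def sgd_error(epochs, reshuffle):
    x, k, idx = 0.0, 0, list(range(n))
    for _ in range(epochs):
        if reshuffle:
            random.shuffle(idx)                               # one pass, no repeats
            order = list(idx)
        else:
            order = [random.randrange(n) for _ in range(n)]   # i.i.d. w/ replacement
        for i in order:
            k += 1
            x -= (1.0 / k) * (x - a[i])            # gradient step on (x - a_i)^2 / 2
    return abs(x - x_star)

print("i.i.d. :", sgd_error(200, reshuffle=False))   # sampling noise remains
print("reshuf.:", sgd_error(200, reshuffle=True))    # ~1e-16: exact after full passes
```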
ex-ch07-05
Medium. Construct a concrete coded-shuffling placement and one epoch's delivery for $K = 4$ workers, $D = 8$ data points, $M = 2$ (so $t = KM/D = 1$). Specify subfile partitions, placement, and broadcast messages for a particular permutation $\pi$.
Each data point splits into $\binom{K}{t} = 4$ subfiles.
Placement
Split each data point $d_n$, $n = 1, \dots, 8$, into subfiles $d_{n,\{i\}}$, $i \in \{1,2,3,4\}$. Each $d_n$ has 4 subfiles. Worker $i$ stores $d_{n,\{i\}}$ for all $n$: 8 subfiles × 1/4 = 2 data points $= M$. ✓
Example permutation
$\pi$ assigns worker 1 the points $\{d_1, d_2\}$, worker 2 $\{d_3, d_4\}$, worker 3 $\{d_5, d_6\}$, worker 4 $\{d_7, d_8\}$.
Delivery (pick subset $\{1, 2\}$)
Worker 1 needs $d_{1,\{2\}}, d_{2,\{2\}}$ from worker 2's cache; worker 2 needs $d_{3,\{1\}}, d_{4,\{1\}}$ from worker 1's cache. Broadcasts: $d_{1,\{2\}} \oplus d_{3,\{1\}}$ and $d_{2,\{2\}} \oplus d_{4,\{1\}}$. Worker 1 XORs the first broadcast with its cached $d_{3,\{1\}}$ to recover $d_{1,\{2\}}$; worker 2 XORs with its cached $d_{1,\{2\}}$ to recover $d_{3,\{1\}}$. Each worker decodes by XOR-ing with its cached companion subfile.
Rate
Total broadcasts: $\binom{4}{2} = 6$ subsets, each contributing 2 broadcasts of subfile size $\frac{1}{4}$, for $6 \times 2 \times \frac{1}{4} = 3$ data points, equal to the theoretical rate $\frac{K-t}{t+1}\cdot\frac{D}{K} = \frac{3}{2}\cdot 2 = 3$. Uncoded would be $D(1 - M/D) = 6$: a $2\times$ reduction.
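A sketch that enumerates this epoch's broadcasts and checks the count; the block assignment matches the permutation chosen above, and pairing the two demands on each side of a subset is one convenient choice.

```python
from itertools import combinations

K, D, t = 4, 8, 1
assign = {w: [2 * w - 1, 2 * w] for w in range(1, K + 1)}   # pi: worker w gets d_{2w-1}, d_{2w}

# Worker w caches d_{n,{w}} for every n; within subset {i, j} it is missing
# d_{n,{j}} for its two assigned points, so the pair exchanges 2 XOR broadcasts.
broadcasts = []
for i, j in combinations(range(1, K + 1), t + 1):
    for n_i, n_j in zip(assign[i], assign[j]):
        broadcasts.append(f"d[{n_i},{{{j}}}] xor d[{n_j},{{{i}}}]")

print(len(broadcasts), len(broadcasts) / 4)   # 12 subfile-sized XORs = 3 data points
```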
ex-ch07-06
Medium. Prove that the coded-shuffling rate $R$ is decreasing in the per-worker memory $M$. What is the physical interpretation?
Differentiate w.r.t. $t = KM/D$.
Compute derivative
$R(t) = \frac{K-t}{t+1}\cdot\frac{D}{K}$. Let $t = KM/D$, so $\frac{dR}{dM} = \frac{K}{D}\cdot\frac{dR}{dt}$. The quotient rule gives $\frac{dR}{dt} < 0$ (displayed below). Hence $R$ is strictly decreasing in $M$.
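Explicitly (both derivatives are negative for all $0 \le t < K$):

$$\frac{dR}{dt} = \frac{-(t+1) - (K-t)}{(t+1)^2}\cdot\frac{D}{K} = -\frac{(K+1)\,D}{K\,(t+1)^2} < 0, \qquad \frac{dR}{dM} = \frac{K}{D}\,\frac{dR}{dt} = -\frac{K+1}{(t+1)^2} < 0.$$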
Physical interpretation
More per-worker memory $\Rightarrow$ each worker already holds more of the required permuted data $\Rightarrow$ fewer broadcasts needed. Memory helps twice: it shrinks the demand $D - M$ and it enlarges the alignment gain $t+1$, so the decrease is steeper than linear, reflecting the combinatorial alignment gain.
ex-ch07-07
Medium. For $K$ workers with integer $t = KM/D$, compute both the centralized coded-shuffling rate and the decentralized variant. Quantify the gap.
Centralized: $\frac{K-t}{t+1}$. Decentralized: $\frac{K-t}{t}$.
Centralized
$R_{\text{cen}} = \frac{K-t}{t+1}\cdot\frac{D}{K}$.
Decentralized
$R_{\text{dec}} = \frac{K-t}{t}\cdot\frac{D}{K}$.
Gap
$R_{\text{dec}}/R_{\text{cen}} = \frac{t+1}{t}$: about $100/t\,\%$ overhead. At small $t$ (the borderline case $t = 1$ gives a full $2\times$), the decentralized scheme is noticeably worse. For large $t$ the gap shrinks to $1$.
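Tabulating the $\frac{t+1}{t}$ overhead (per-worker normalized rates; the decentralized formula is the achievable rate assumed in the solution above):

```python
# Gap between decentralized and centralized shuffling as memory (t) grows.
for t in (1, 2, 4, 8, 16):
    K = 4 * t                              # illustrative: keep M/D = 1/4 fixed
    r_cen = (K - t) / (t + 1)
    r_dec = (K - t) / t
    print(f"t={t:2d}  gap={r_dec / r_cen:.3f}")   # (t+1)/t: 2.0, 1.5, ... -> 1
```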
ex-ch07-08
Medium. Compute the demand-private shuffling rate for $K$ workers, $t = KM/D = 4$, collusion threshold $L = 2$. Compare with the non-private rate.
$R_{\text{private}} = \frac{K-t}{t+1-L}$.
Non-private
$R_{\text{non-private}} = \frac{K-t}{t+1} = \frac{K-4}{5}$.
Private
$R_{\text{private}} = \frac{K-t}{t+1-L} = \frac{K-4}{3}$.
Penalty
Relative penalty: $\frac{t+1}{t+1-L} = \frac{5}{3}$, i.e., a $67\%$ rate increase to hide demands from any 2 colluding workers. Higher $L$ inflates more; at $L = t = 4$ the rate becomes $K - 4$, $5\times$ the non-private cost.
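The privacy penalty as a function of the collusion threshold, for the solution's $t = 4$; the ratio $\frac{t+1}{t+1-L}$ does not depend on $K$.

```python
t = 4
for L in range(t + 1):
    print(L, (t + 1) / (t + 1 - L))   # 1.0, 1.25, 1.67 (L=2), 2.5, 5.0 (L=t)
```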
ex-ch07-09
Medium. Explain how the coded-shuffling construction of §7.3 reuses the finite-field IA machinery of Chapter 4 directly. Identify the specific alignment happening in the XOR broadcasts.
Each broadcast aligns $t+1$ demands into one transmission.
Alignment in broadcasts
Each subset of size $t+1$ contributes one broadcast that simultaneously serves $t+1$ workers' demands. The workers' cached subfiles "align" onto a common subspace where the XOR-cancellation happens, freeing the orthogonal direction for the intended demand.
Parallel with §4.3
In §4.3's coded caching, each broadcast satisfies $t+1$ user demands; here each broadcast satisfies $t+1$ worker demands. The mechanism is identical: cached content cancels interferers, leaving only the desired information.
The key transfer
The alignment capacity $t+1$ is exactly the DoF gain of finite-field IA specialized to this communication structure. Chapter 7's Wan / Tuninetti / Caire result inherits this algebraic machinery.
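The alignment in one concrete step: an XOR broadcast serves two workers at once because each cancels the interfering subfile from its own cache. Byte strings stand in for subfiles; the names are illustrative.

```python
# Subfile demanded by worker 1 but cached only at worker 2, and vice versa.
d_a2 = b"subfile-A2"      # worker 1's demand
d_b1 = b"subfile-B1"      # worker 2's demand

xor = lambda u, v: bytes(x ^ y for x, y in zip(u, v))

bcast = xor(d_a2, d_b1)               # one broadcast, aligned for both workers
assert xor(bcast, d_b1) == d_a2       # worker 1 cancels its cached d_b1
assert xor(bcast, d_a2) == d_b1       # worker 2 cancels its cached d_a2
```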
ex-ch07-10
Medium. Why is centralized placement more efficient than decentralized placement? What is the underlying combinatorial reason?
Centralized placement ensures each subset is 'covered' exactly once.
Centralized optimality
Centralized placement ensures every size-$t$ subset of workers stores its corresponding subfile at exactly those $t$ workers. The broadcast structure perfectly exploits this combinatorial regularity.
Decentralized inefficiency
Random placement creates probabilistic coverage: some subsets are over-covered (redundant storage), others under-covered (extra broadcasts needed). The mismatch between the actual and ideal subset-cardinality distributions causes the rate gap.
Asymptotic
As $K \to \infty$, the law of large numbers smooths out the randomness, and the decentralized coverage approaches the centralized one. The gap is $O(1/t)$, consistent with the $\frac{t+1}{t}$ ratio of ex-ch07-07.
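A Monte Carlo sketch of the coverage mismatch: under independent random placement each subfile is stored at a Binomial$(K, M/D)$ number of workers, whereas centralized placement hits $t = KM/D$ exactly. Parameters are illustrative.

```python
import random

random.seed(1)
K, p, subfiles = 12, 0.25, 20000      # workers, caching prob M/D, sampled subfiles
t = round(K * p)                       # centralized coverage: exactly t = 3 workers

hits = sum(
    sum(random.random() < p for _ in range(K)) == t   # coverage of one subfile
    for _ in range(subfiles)
)
print(f"P(coverage == t) ~ {hits / subfiles:.2f}")   # ~0.26: most subfiles mis-covered
```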
ex-ch07-11
Hard. Prove the cut-set lower bound for coded shuffling: $R \ge \frac{K-t}{t+1}\cdot\frac{D}{K}$. Sketch the key inequalities.
Use the output-entropy bound from Chapter 2's §2.1.
Output-entropy bound
For any scheme, the broadcast messages $X$ and worker $i$'s cache $Z_i$ must together determine the data $\pi$ assigns to worker $i$, so $H(X) \ge H(d_{\pi(i)} \mid Z_i)$. With $M = tD/K$, summing over the $K$ workers, the total uncached demand is $D - M = \frac{(K-t)D}{K}$.
Alignment factor
Each broadcast bit can simultaneously serve at most $t+1$ distinct workers' demands (the finite-field IA alignment capacity). Hence the broadcast length is at least $\frac{(D/K)(1 - M/D)}{t+1}$ per worker, or, normalized, $\frac{1 - M/D}{t+1}$ per assigned data point. Scaling by $K$ workers gives $R \ge \frac{D(1 - M/D)}{t+1} = \frac{K-t}{t+1}\cdot\frac{D}{K}$.
Match
The Wan et al. construction achieves this bound with equality, closing the rate region.
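The whole argument compresses into one chain (same normalization as ex-ch07-05):

$$R \;\ge\; \frac{\text{total uncached demand}}{\text{demands served per broadcast unit}} \;=\; \frac{D\,(1 - M/D)}{t + 1} \;=\; \frac{K - t}{t + 1}\cdot\frac{D}{K}.$$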
ex-ch07-12
Hard. Extend the Wan / Tuninetti / Caire scheme to $L$-demand-private shuffling. Show that the rate becomes $\frac{K-t}{t+1-L}$ and explain where the $-L$ comes from.
Use ramp secret sharing to mask the demands.
Construction
Combine the Wan et al. coded-shuffling placement with an $L$-private ramp secret sharing of the demand indices. Each broadcast carries one informative unit per served demand and $L$ random mask units.
Alignment capacity reduction
Each broadcast bit can serve $t+1-L$ demands (the remaining capacity after $L$ units are spent on masking). So the number of broadcasts per epoch is inflated by $\frac{t+1}{t+1-L}$, giving rate $\frac{K-t}{t+1-L}$.
Privacy guarantee
Any $L$ colluding workers see only masked shares; by the ramp secret-sharing property, they learn nothing about the demand of any other worker. Privacy is information-theoretic.
Feasibility
The scheme requires $L \le t$; beyond this, the alignment capacity is exhausted and no scheme achieves demand privacy at the given memory level.
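Numerically, the rate inflation and the feasibility wall at $L = t$; the values of $K$ and $t$ below are illustrative.

```python
K, t = 8, 4
for L in range(t + 2):
    if L > t:
        print(L, "infeasible: alignment capacity t + 1 - L exhausted")
    else:
        print(L, (K - t) / (t + 1 - L))   # 0.8 at L=0, up to 4.0 at L=t
```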
ex-ch07-13
Hard. Consider heterogeneous per-worker memory budgets: $K_1$ workers with memory $M_1$, $K_2$ workers with memory $M_2$ ($M_1 < M_2$). Conjecture the optimal rate and argue why the homogeneous lower bound must weaken.
The rate region becomes piecewise-defined.
Guess form
Each subset of workers contributes broadcasts in proportion to its alignment capacity, which is limited by the smallest memory in the subset. The rate therefore depends on the joint distribution of memory budgets.
Example at $M_1 = 0$, $M_2 = D$
The $K_1$ workers cache nothing, the $K_2$ workers cache everything. The rate reduces to $K_1 \cdot \frac{D}{K}$ (with $K = K_1 + K_2$): just the no-cache subset of workers must be served. The high-memory workers need no broadcasts.
General bound
The rate is at least the homogeneous cut-set bound applied to the total memory $K_1 M_1 + K_2 M_2$, a piecewise-linear function of the memory distribution. Full characterization is an open problem (see Chapter 18).
ex-ch07-14
Hard. Describe a hypothetical hybrid scheme combining coded gradient computation (Chapter 6) with coded data shuffling (this chapter). What benefits arise and what additional costs?
Both operations are linear; they can be composed.
Composition idea
At each epoch, the master broadcasts the shuffle via coded shuffling. Each worker then computes its coded gradient (over $s+1$ partitions, for straggler tolerance $s$) on the newly shuffled local data. The master aggregates via gradient coding.
Benefits
Both shuffling and aggregation bottlenecks are addressed simultaneously. Per-round latency is reduced; straggler tolerance is improved.
Costs
Per-worker storage must cover both the shuffling cache and the gradient-coding redundancy of $s+1$ partitions. This can be significant for large models and datasets.
Status
Such hybrid schemes are a natural research direction. Partial results exist in the Ye-Abbe 2018 paper on joint communication-computation tradeoffs. A complete rate-region characterization is open.
ex-ch07-15
Challenge. Open problem. Characterize the optimal achievability-converse closure for the demand-private coded-shuffling problem with $L$ colluders and straggler tolerance $S$. Is there a clean formula $R(L, S)$?
Think about what the cut-set adds for stragglers.
Partial result
Achievability by composition: Wan et al. coded shuffling + ramp secret sharing (privacy) + gradient-coding-style straggler tolerance. The composed rate is at most $\frac{K-t}{t+1-L-S}$.
Conjecture
The matching converse may be $R \ge \frac{K-t}{t+1-L-S}$, assuming $L + S \le t$. The privacy and straggler penalties each subtract one unit from the alignment capacity.
Status
The conjecture is consistent with special cases ($L = 0$ recovers gradient coding; $S = 0$ recovers Wan et al. private shuffling), but a general converse is open. This is a research-level exercise at the intersection of coded shuffling, gradient coding, and PIR.