Exercises
ex18-1
Easy: Show that the given function has exponentially decaying Chebyshev coefficients on the interval. What polynomial degree is needed for the target uniform error?
The function is analytic on the interval.
Chebyshev coefficients of analytic functions decay exponentially.
Analyticity
The function is analytic on a strip (a Bernstein ellipse) around the interval in the complex plane. By Bernstein's theorem, the Chebyshev coefficients decay as $O(\rho^{-n})$ for some $\rho > 1$ determined by the ellipse.
Degree for $10^{-3}$
Need $\rho^{-n} \le 10^{-3}$, i.e., $n \ge 3\ln 10 / \ln \rho$; the specific degree follows once $\rho$ is fixed.
Coded-computing cost
A degree-$n$ approximation needs a recovery threshold of roughly $n+1$ workers. Reasonable for moderate cluster sizes.
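The decay can be checked numerically; a minimal sketch, using $e^x$ on $[-1, 1]$ as a stand-in since the exercise's function is not reproduced here:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_error(f, n):
    """Max uniform error of the degree-n Chebyshev interpolant of f on [-1, 1]."""
    coef = C.chebinterpolate(f, n)
    x = np.linspace(-1.0, 1.0, 2001)
    return float(np.max(np.abs(C.chebval(x, coef) - f(x))))

# For an analytic function the error drops exponentially with the degree,
# so a modest degree already reaches 1e-3 uniform error.
for n in (2, 4, 8, 16):
    print(n, cheb_error(np.exp, n))
```

Doubling the degree roughly squares the accuracy, which is the signature of exponential coefficient decay.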
ex18-2
Easy: Explain why ReLU's polynomial approximation error decays only polynomially in the degree, not exponentially.
ReLU has a kink at $x = 0$.
Gibbs phenomenon for non-smooth functions.
The kink
ReLU is $\max(0, x)$; its derivative jumps from $0$ to $1$ at $x = 0$, so the function is not differentiable there.
Chebyshev convergence rate
For non-smooth functions, Chebyshev coefficients decay polynomially, typically as $O(n^{-2})$ when the derivative has bounded variation (as for ReLU).
Total error
The resulting uniform error decays as $O(1/n)$. Polynomial, not exponential, convergence.
Engineering implication
Reaching a small target error therefore requires a high degree, and hence a prohibitive recovery threshold. Production ReLU coded computing requires hybrid or learned schemes.
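The slow decay is easy to observe; a short sketch with the same Chebyshev-interpolation machinery as above:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def relu(x):
    return np.maximum(0.0, x)

def cheb_error(n):
    """Max uniform error of the degree-n Chebyshev interpolant of ReLU."""
    coef = C.chebinterpolate(relu, n)
    x = np.linspace(-1.0, 1.0, 4001)
    return float(np.max(np.abs(C.chebval(x, coef) - relu(x))))

# The error only shrinks roughly like 1/n: doubling the degree about
# halves the error, in contrast to the analytic case.
for n in (8, 16, 32, 64):
    print(n, cheb_error(n))
```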
ex18-3
Easy: For the given number of users, gradient dimension, privacy threshold, Byzantine tolerance, and DP level, compute the lower-bound communication per round.
Use Theorem 18.2.1.
Compute components
Evaluate each factor in the bound of Theorem 18.2.1 with the given parameters.
Multiply
The product of the factors gives the lower bound in bits per round.
Compare with FL-only
FL with no constraints needs only the raw gradient bits. The joint privacy, robustness, and DP axes multiply this baseline; the ratio is the joint-axis overhead.
ex18-4
Easy: An Erdős-Rényi graph has a spectral gap $\delta = 1 - \lambda_2$ (with $\lambda_2$ the second eigenvalue of the Metropolis-Hastings mixing matrix) that concentrates for large $n$. Compute $\delta$ for the given $n$ and $p$. What is the mixing time $\tau_{\mathrm{mix}}$?
The gap concentrates around its expectation for typical (connected) graphs.
Spectral gap
$\delta = 1 - \lambda_2$, computed from the Metropolis-Hastings matrix of the realized graph.
Mixing time
$\tau_{\mathrm{mix}} \approx \delta^{-1} \ln n$ rounds, up to the target accuracy.
Operational
Roughly $\tau_{\mathrm{mix}}$ rounds of mixing are needed per consensus step. For D-SGD with 1000 iterations, a roughly 20x overhead is tolerable.
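The gap and mixing time can be computed directly; a sketch with assumed values $n = 100$, $p = 0.1$ (the exercise's numbers are not reproduced here), using the standard Metropolis-Hastings weights for the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_gap(n, p):
    """Spectral gap of the Metropolis-Hastings matrix on a G(n, p) sample."""
    # sample an Erdos-Renyi adjacency matrix
    A = rng.random((n, n)) < p
    A = np.triu(A, 1)
    A = A + A.T
    deg = A.sum(1)
    # Metropolis-Hastings weights targeting the uniform distribution:
    # W[i, j] = 1 / (1 + max(deg_i, deg_j)) on edges; self-loops absorb the rest
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j]:
                W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(1))
    lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - lam[1]  # 1 - |second-largest eigenvalue|

delta = mh_gap(100, 0.1)
print("gap:", delta, "approx mixing time:", np.log(100) / delta)
```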
ex18-5
Medium: Compute the quantum PIR rate and the classical PIR capacity for the given parameters. By what percentage does quantum outperform?
Substitute directly.
Quantum
Substitute the parameters into the quantum rate formula of Theorem 18.4.1.
Classical
Substitute into the classical capacity formula.
Comparison
Actually, the classical rate is higher here. For large $N$, both rates approach their asymptotic limits. Quantum does not always outperform; the advantage depends on the regime.
Advantage regime
The quantum advantage arises for moderate parameters and for specific non-classical PIR variants (coded storage plus colluding servers). See Allaix et al. 2022 for the exact regime.
ex18-6
Medium: For $n$ users in D-SGD on a graph of edge density $p$, compute the total per-round peer exchanges and compare with centralized aggregation ($n$ uploads).
Expected edges: $\binom{n}{2} p$.
Each edge is two exchanges (one per direction).
Expected edges
$\mathbb{E}[|E|] = \binom{n}{2} p = n(n-1)p/2$.
Per-round exchanges
$2\,\mathbb{E}[|E|] = n(n-1)p$ edge-exchanges.
Comparison
Centralized: $n$ uploads. D-SGD with a sparse graph: $n(n-1)p$ exchanges, a factor of $(n-1)p$ more.
Operational
Sparse D-SGD doesn't save total bandwidth, but distributes it (no central bottleneck). The trade-off is resilience vs. raw efficiency.
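The count is simple arithmetic; a sketch with assumed illustrative values $n = 100$, $p = 0.1$ (the exercise's numbers are not shown here):

```python
def dsgd_exchanges(n, p):
    """Expected per-round peer exchanges in D-SGD on a G(n, p) graph."""
    expected_edges = n * (n - 1) / 2 * p
    return 2 * expected_edges  # each edge carries one exchange per direction

n, p = 100, 0.1  # assumed values
print(f"{dsgd_exchanges(n, p):.0f} peer exchanges vs {n} centralized uploads")
print(f"ratio: {(n - 1) * p:.1f}x")
```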
ex18-7
Medium: For coded transformer attention with a degree-8 Chebyshev-polynomial softmax approximation and a given pool of workers, compute the recovery threshold if (i) only $QK^T$ is coded, and (ii) both $QK^T$ and the softmax are coded.
Matrix product via Chapter 5: the threshold of simple polynomial codes.
Softmax approximation: degree-8 polynomial, .
(i) QK^T only
The threshold is that of the polynomial encoding of the matrix product alone (for most schemes); workers beyond the threshold are straggler slack.
(ii) Both
The degree-8 softmax requires a Lagrange code whose threshold grows with the composed polynomial degree. The total effective threshold rises accordingly, leaving fewer workers as straggler slack.
Engineering
Coding both stages reduces straggler tolerance but preserves more of the transformer pipeline. End-to-end optimization depends on the straggler rate.
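The threshold growth can be illustrated with the standard Lagrange coded computing formula $\deg(f)\,(K-1)+1$ (Yu et al.); the worker count $P$ and partition count $K$ below are assumed, not the exercise's:

```python
def lcc_threshold(deg, K):
    """Lagrange coded computing recovery threshold for a degree-`deg`
    polynomial evaluated over K data partitions."""
    return deg * (K - 1) + 1

P = 30  # assumed worker pool
K = 4   # assumed number of input partitions
only_matmul = lcc_threshold(2, K)       # QK^T is bilinear: degree 2
with_softmax = lcc_threshold(2 * 8, K)  # composing a degree-8 softmax approx
print("thresholds:", only_matmul, with_softmax)
print("straggler slack:", P - only_matmul, P - with_softmax)
```

With these assumed numbers the composed threshold already exceeds the pool, illustrating why coding both stages can be prohibitive.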
ex18-8
Medium: In an MoE model with $E$ experts and $k$ active per input, only a fraction $k/E$ of the computation is activated per input. Argue that coded computing must adapt to this sparsity.
Naive coded computing assumes all experts used.
Sparse coded computing exploits non-activation.
Naive coded MoE
If all experts are coded redundantly, the recovery threshold scales with the total expert count $E$. For large $E$, a large cluster is required.
Sparsity exploitation
Only $k$ experts are active per input. Each input has a known sparse activation pattern.
Adaptive scheme
Encode only the active experts' computations per input. The threshold then scales with $k$ rather than $E$: much smaller.
Challenge
The activation is data-dependent: different inputs use different experts. Scheduling straggler tolerance per input breaks uniform coding. Open research.
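The gap between the naive and adaptive schemes is a ratio of $k$ to $E$; a schematic sketch with an assumed linear threshold model and assumed values $E = 64$, $k = 2$:

```python
def recovery_threshold(num_coded_experts, redundancy=1):
    """Schematic threshold model (an assumption, not a specific scheme):
    coding x experts with the given redundancy needs
    x * (redundancy + 1) worker results."""
    return num_coded_experts * (redundancy + 1)

E, k = 64, 2  # assumed expert counts
naive = recovery_threshold(E)      # code every expert redundantly
adaptive = recovery_threshold(k)   # code only the k active experts per input
print(naive, adaptive, f"activated fraction: {k / E:.3f}")
```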
ex18-9
Medium: A RAG system performs several queries per session. The user caches documents from previous queries. Apply cache-aided PIR (Chapter 15) to estimate the retrieval-rate improvement.
The effective number of files reduces by the cache size (side-information PIR).
With $M$ documents and $N$ databases.
Without cache
PIR rate $R = (1 - 1/N)\big/\bigl(1 - (1/N)^M\bigr)$.
With cache of $5$
Effective file count $M - 5$. The rate is essentially unchanged: for large $M$ the rate saturates at $1 - 1/N$, so a cache of 5 is negligible relative to $M$.
When cache helps
The cache helps significantly when its size is comparable to $M$. With a cache covering most of the library, the effective file count is small and the rate approaches its maximum. Requires a large cache.
Operational
In RAG, the cache is typically tiny relative to the library (the user caches a few documents; the library has thousands). Cache-aided PIR gives a minor rate improvement; its main value is recurring-document reuse.
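The saturation effect is easy to see numerically; a sketch using the classical replicated-PIR capacity, with assumed values $N = 4$, $M = 1000$, cache $= 5$ (the exercise's numbers are not reproduced here):

```python
def pir_rate(N, M):
    """Classical replicated-PIR capacity with N databases and M files."""
    return (1 - 1 / N) / (1 - (1 / N) ** M)

N, M, cache = 4, 1000, 5  # assumed values
print(pir_rate(N, M), pir_rate(N, M - cache))  # nearly identical: cache too small
print(pir_rate(N, 10))                         # small effective M: rate rises
```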
ex18-10
Hard: Sketch an approach to proving an information-theoretic lower bound on the recovery threshold of non-linear coded computing. What is the converse argument?
Use cut-set with a non-linear function.
Think about how many workers must be non-stragglers to reconstruct $f(X)$.
Setup
User wants $f(X)$, where $X$ is split across the workers. Each worker computes partial information. Recovery threshold: the minimum number of workers needed.
Cut-set argument
Consider a subset $S$ of workers. The mutual information between $f(X)$ and the subset's outputs is bounded by the joint entropy of their inputs; in the non-linear case, this is not linearly separable.
Non-linear converse difficulty
Unlike the polynomial case, non-linear $f$ can create non-additive information. Cut-set bounds are not tight; tighter bounds require computing non-linear conditional entropies, an unsolved problem in general.
Open aspect
Tight converses for non-linear coded computing require new techniques, e.g., matroid theory or generalizations of Fano's inequality to non-linear regimes. An active area.
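The cut-set step can be written out explicitly; a hedged sketch, with $Y_S$ denoting the outputs of a worker subset $S$ and $\epsilon_n \to 0$ from Fano's inequality:

```latex
% Cut-set converse sketch (schematic): if a subset S of size K-1 sufficed,
% Fano's inequality would force
\begin{align}
  H(f(X)) &\le I\bigl(f(X); Y_S\bigr) + n\epsilon_n \\
          &\le H(Y_S) + n\epsilon_n
           \le \sum_{i \in S} H(Y_i) + n\epsilon_n .
\end{align}
% For linear f the right-hand side separates cleanly across workers;
% for non-linear f the joint entropy H(Y_S) need not be additive,
% so this bound is generally loose.
```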
ex18-11
Hard: Given a fixed total communication budget of $C$ bits, what is the optimal allocation among privacy ($C_P$), robustness ($C_R$), and DP ($C_D$)? Derive the Pareto frontier.
Lower bound: start from the joint communication bound of Theorem 18.2.1.
Each axis has its own cost-benefit.
Setup
Maximize total utility (e.g., user-level privacy + system-level robustness + aggregate DP) subject to the budget $C_P + C_R + C_D \le C$.
Lagrangian
$\mathcal{L} = U_P(C_P) + U_R(C_R) + U_D(C_D) - \lambda\,(C_P + C_R + C_D - C)$.
First-order conditions
Equate marginal utilities across axes: $U_P'(C_P^*) = U_R'(C_R^*) = U_D'(C_D^*) = \lambda$. The optimal allocation equalizes marginal benefits.
Frontier shape
For concave utilities, the Pareto frontier is a smooth curve in $(C_P, C_R, C_D)$ space. For specific per-axis utilities (e.g., privacy that is binary: either satisfied or not), the frontier is piecewise linear.
Engineering
In practice, operators set the allocation from requirements (privacy law, SLA, etc.) rather than maximizing utility. The Pareto frontier is then a check that the requirements are feasible.
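The equal-marginal-utility condition has a closed form under an assumed utility shape; a sketch assuming $U_i(C_i) = w_i \log C_i$ (my assumption for illustration, not the chapter's model):

```python
def allocate(C, w):
    """Optimal budget split for assumed utilities U_i(C_i) = w_i * log(C_i):
    equal marginal utility w_i / C_i = lambda gives C_i* = (w_i / sum(w)) * C."""
    s = sum(w)
    return [wi / s * C for wi in w]

# e.g. privacy weighted twice as heavily as robustness and DP (assumed weights)
print(allocate(1e9, [2.0, 1.0, 1.0]))
```

Under this utility shape, each axis simply receives its weight's share of the budget.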
ex18-12
Hard: Sketch how AirComp (Ch. 16) could be combined with D-SGD (§18.3). Each peer-to-peer pair uses a small MAC; aggregation is per-edge. What are the synchronization and scaling implications?
Per-edge AirComp: a constant number of channel uses per neighbor pair.
Synchronization across pairs.
Per-edge AirComp
For each edge $(i, j)$ in the graph, peers $i$ and $j$ mutually transmit their models. Using AirComp: one channel use per edge.
Total bandwidth
On the order of $|E|$ AirComp uses per round. For a sparse graph, $|E| \ll n^2$: sub-quadratic.
Synchronization
Each peer pair must synchronize independently. This is harder than synchronizing a single MAC, and pairwise synchronization is especially difficult in multi-hop networks.
Privacy benefit
Per-edge AirComp provides only weak privacy within each pair: the MAC superposition exposes just the sum of the two models. Stronger privacy requires additional masking.
Open problem
Convergence rate of D-SGD with AirComp: can the graph-mixing and per-edge MSE be combined into a unified convergence result? Open.
ex18-13
Hard: A federated RAG system has $N = 100$ users, each hosting documents. Retrieval must be $T$-private against any $T$ colluding users. What is the total retrieval overhead (per document) under different $T$?
This is $T$-colluding PIR (Ch. 14, §14.2) with $N = 100$ databases.
Asymptotic rate: $R = 1 - T/N$.
$T = 1$ (classical)
$R = 1 - 1/100 = 0.99$. Retrieve 1 document at rate 0.99: download $\approx 1.01$ documents' worth.
$T = 5$
$R = 1 - 5/100 = 0.95$. Download $\approx 1.05$ documents' worth.
$T = 50$ (half-colluding)
$R = 1 - 50/100 = 0.5$. Download $2$ documents' worth: $2\times$ overhead.
$T = 90$
$R = 1 - 90/100 = 0.1$. Download $10$ documents' worth: $10\times$ overhead.
Operational
For modest $T$, PIR overhead is negligible; for aggressive collusion tolerance it is substantial. Federated RAG deployments should pick $T$ based on the threat model.
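The overhead table above can be reproduced in two lines, using the asymptotic $T$-colluding rate $1 - T/N$ with $N = 100$:

```python
def retrieval_overhead(T, N=100):
    """Download overhead per retrieved document under asymptotic
    T-colluding PIR: rate 1 - T/N, overhead = 1 / rate."""
    return 1 / (1 - T / N)

for T in (1, 5, 50, 90):
    print(f"T={T}: {retrieval_overhead(T):.2f}x download overhead")
```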
ex18-14
Hard: For $T$-colluding PIR with the given $N$, compute (i) the classical PIR capacity and (ii) the quantum PIR rate (Theorem 18.4.1). For what range of $T$ does quantum strictly outperform classical?
Classical: $1 - T/N$ (asymptotic in the number of files).
Quantum: the rate of Theorem 18.4.1.
Classical vs. Quantum
Classical: $1 - T/N$. Quantum: substitute $T$ and $N$ into the rate of Theorem 18.4.1.
Find $T$ where quantum exceeds classical
Setting the two rates equal yields a quadratic in $T$ with no real roots, which would suggest one side always dominates. But a spot check at a concrete $T$ shows the quantum rate is lower there! The formula in Thm. 18.4.1 applies to specific PIR variants, not vanilla $T$-colluding.
Resolution
The quantum advantage is specific to coded-storage PIR or SPIR, not classical $T$-colluding. For classical $T$-colluding, the classical scheme can equal or beat quantum. The quantum advantage emerges in multi-constraint PIR variants (Allaix et al. 2022).
Open
Characterizing the exact classical-quantum gap for each PIR variant is open. Some variants have quantum advantage; others don't.
ex18-15
Challenge: The five CommIT contributions cover coded shuffling, SecAgg, ByzSecAgg, CCESA, and IT-secure FRL. Propose a joint problem, combining two or more contributions, that would constitute a sixth CommIT direction. State the problem, the achievability scheme, and the key open questions.
Combine ByzSecAgg with cache-aided PIR?
CCESA + AirComp?
FRL + ByzSecAgg?
Example candidate: CCESA + AirComp
Setup: wireless FL with users on a sparse graph (CCESA-like topology); within each neighborhood, AirComp aggregation. This combines CCESA's communication scaling with AirComp's low per-round cost.
Achievability
Modify CCESA's sparse-graph topology so that neighbors aggregate pairwise via AirComp: local edges use AirComp; global consensus uses gossip plus a digital filter.
Open questions
- Convergence rate as a function of graph sparsity, per-edge AirComp noise, and graph-mixing time.
- Privacy bound: do CCESA's guarantees hold under AirComp aggregation?
- Practical synchronization: neighbor-synchronous AirComp vs. full global synchronization.
Impact
A successful framework would give communication-optimal wireless FL with formal privacy-robustness-DP guarantees, closing gaps in all five prior CommIT contributions simultaneously. This is the kind of sixth contribution the CommIT community could aim for.
Research plan
(1) Simulate small-scale prototype. (2) Prove convergence bound. (3) Characterize threat model. (4) Test on realistic wireless channels. Each step is a year of research. The larger research program spans 3-5 years.