Modern ML Architectures and Quantum Frontiers

The ML Landscape Has Moved

The coded-computing framework developed in Part II was inspired by matrix multiplication and gradient averaging — the dominant computations of 2017-2019 FL. Modern ML — transformer attention, mixture-of-experts, retrieval-augmented generation — imposes new computational patterns:

  • Attention softmax: an $n \times n$ non-linear computation.
  • Mixture-of-experts: sparse activation of a subset of experts per input; the "computation pattern" is data-dependent.
  • Retrieval-augmented generation: each inference involves a PIR-like database lookup against a frozen knowledge base.

Each of these lies outside the coded-computing framework's reach. Extending coded computing to modern ML is a research frontier — and this section maps the major directions, along with a second frontier (quantum PIR) that completes the book's information-theoretic horizon.


Definition: Coded Attention (Problem Statement)

The transformer attention operation is $\text{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right) \mathbf{V}$ for query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$ matrices.

The central question: can attention be distributed across workers with coded redundancy for straggler tolerance?

The three sub-operations:

  1. $\mathbf{Q}\mathbf{K}^T$: matrix multiplication; coded computing works (Chapter 5).

  2. $\mathrm{softmax}(\cdot)$: non-linear; polynomial approximation or hybrid (Section 18.1).

  3. $\mathrm{softmax}(\cdot)\,\mathbf{V}$: matrix multiplication of the softmax output by $\mathbf{V}$; can use Chapter 5 if the softmax output is available.

The composition structure allows a partial coded solution: the first and third steps use standard coded computing, and the middle (softmax) uses approximation. End-to-end optimality is open.
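This three-step composition can be sketched numerically. In the snippet below, plain numpy matmuls stand in for the coded matrix multiplications of steps 1 and 3, and a degree-10 Chebyshev fit of $\exp$ (an assumed softmax surrogate, fit on the clipped range $[-6, 0]$) stands in for step 2; the dimensions and the fit range are illustrative choices, not part of any published scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 8, 4
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

def softmax(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Step-2 surrogate: degree-10 Chebyshev fit of exp on [-6, 0]
# (scores are shifted so each row's max is 0, then clipped).
xs = np.linspace(-6.0, 0.0, 400)
cheb = np.polynomial.Chebyshev.fit(xs, np.exp(xs), deg=10)

def softmax_poly(S):
    Z = np.clip(S - S.max(axis=1, keepdims=True), -6.0, 0.0)
    E = cheb(Z)
    return E / E.sum(axis=1, keepdims=True)

S = (Q @ K.T) / np.sqrt(d_k)    # step 1: a coded matmul in the full scheme
A_exact = softmax(S) @ V        # reference attention output
A_hybrid = softmax_poly(S) @ V  # step 2 approximated, step 3: coded matmul
err = np.max(np.abs(A_hybrid - A_exact))
print(f"max abs attention error: {err:.2e}")
```

The point of the sketch is the composition: only the middle step carries approximation error, and that error propagates through the final matmul into the attention output.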

Challenges in Coded Attention

Practical challenges in deploying coded attention:

  • Softmax is highly non-linear. A Chebyshev approximation of degree $\sim 10$ is needed for $1\%$ error on typical attention distributions. This shrinks the straggler tolerance compared to pure matrix multiplication.

  • Matrix dimensions are large. Attention in modern transformers has $d_k \sim 100$ and $n \sim 10^3$. Coding schemes must scale.

  • Attention patterns are sparse. Many attention values are near-zero; sparse coded computing is more efficient but less studied.

  • End-to-end optimization. The coded scheme should minimize end-to-end attention error, not the error of any intermediate step; a compositional optimization problem.

Preliminary results (Lu et al. 2023) achieve a $\sim 20\%$ recovery-threshold improvement over naive composition; fundamental optimality is open.
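The degree-versus-error trade-off quoted above can be checked numerically; the score range $[-8, 0]$ below is an assumed stand-in for "typical attention distributions," not a value from the text.

```python
import numpy as np

# Assumed "typical" post-shift score range for attention logits: [-8, 0].
xs = np.linspace(-8.0, 0.0, 400)
target = np.exp(xs)

# Sweep the approximation degree and record the worst-case error of each fit.
errs = {}
for deg in (2, 4, 6, 8, 10):
    p = np.polynomial.Chebyshev.fit(xs, target, deg=deg)
    errs[deg] = np.max(np.abs(p(xs) - target))
    print(f"degree {deg:2d}: max abs error {errs[deg]:.1e}")
```

The error decays rapidly with degree; on this range, degree 10 lands well under the $1\%$ mark, consistent with the figure cited above.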

Definition: Coded Computing for Mixture-of-Experts

Mixture-of-Experts (MoE) models select, per input, a subset of $k$ out of $M$ experts. Each input uses a different subset; the "routing" depends on the input.

Coded-computing implications:

  • Different inputs use different expert subsets. Straggler-aware coding must account for this variability: a straggler on expert $e$ only affects inputs that use $e$.

  • Expert imbalance. Some experts are used more than others (load imbalance). Coded schemes should balance expert load as well as provide straggler tolerance.

  • Sparse activation. Only a $k/M$ fraction of the experts is active per input. The effective computation is sparse; coded schemes that exploit this can reduce communication.

Open framework: no coded-computing theory yet exists for MoE models. The routing structure creates a data-dependent workload distribution that breaks uniform straggler analysis.
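The data-dependent straggler effect can be illustrated by a small simulation. The uniform top-$k$ routing below is a simplifying assumption (real gating is learned and imbalanced); it shows that a single straggling expert delays only the inputs routed to it, roughly a $k/M$ fraction.

```python
import numpy as np

rng = np.random.default_rng(1)
M, k, n_inputs = 16, 2, 10_000   # illustrative expert count and top-k

# Uniform top-k routing: each input independently picks k distinct experts.
routes = np.array([rng.choice(M, size=k, replace=False)
                   for _ in range(n_inputs)])

straggler = 3                    # index of one hypothetical slow expert
affected = np.any(routes == straggler, axis=1).mean()
print(f"fraction of inputs delayed by one straggler: {affected:.3f}")
print(f"k/M prediction: {k / M:.3f}")
```

Under learned (non-uniform) gating the affected fraction concentrates on popular experts, which is exactly why uniform straggler analysis breaks down here.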


Retrieval-Augmented Generation as PIR

Retrieval-augmented generation (RAG) augments a language model with a retrieval step: given a query, retrieve relevant documents from a database and condition the generation on them.

For privacy-sensitive RAG (e.g., medical RAG systems), the retrieval step is essentially a PIR problem: the user wants one document from a database without revealing which one. The PIR framework of Chapters 13-15 applies directly.

Emerging research directions:

  1. PIR-augmented generation. Apply Sun-Jafar PIR (Ch 13) to the retrieval step of RAG. Trade-off: retrieval rate vs. query privacy.

  2. Cache-aided RAG. The language model may cache previously-retrieved documents. Chapter 15's cache-aided PIR applies. The CommIT contribution (Wan-Tuninetti-Caire 2021) on demand-private cached delivery is immediately relevant.

  3. Federated RAG. Users host local documents; the retrieval is distributed. Combines Part IV's PIR with Part V's wireless FL.

This is a live bridge between information-theoretic PIR (Part IV) and modern generative AI. The CommIT portfolio is well-positioned to contribute to this frontier.
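The retrieval-as-PIR point can be made concrete with the classic two-server XOR scheme of Chor et al., a much simpler ancestor of the Sun-Jafar codes of Chapter 13. The toy document store and fixed document length below are illustrative; the scheme assumes two non-colluding replicas.

```python
import secrets
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(db, indices):
    # A server XORs the requested documents; each server sees a subset of
    # indices whose distribution is uniform regardless of the target i.
    return reduce(xor, (db[j] for j in indices), bytes(len(db[0])))

# Toy replicated document store (fixed-length docs for simplicity).
db = [f"doc-{j}".encode().ljust(8, b" ") for j in range(8)]

i = 5                                  # desired document index
S = {j for j in range(len(db)) if secrets.randbits(1)}
query_a = S
query_b = S ^ {i}                      # symmetric difference toggles index i

# Documents in both queries cancel under XOR; only doc i survives.
answer = xor(server_answer(db, query_a), server_answer(db, query_b))
assert answer == db[i]
print(answer)
```

Neither server alone learns anything about $i$, but the download is the whole-document size times two; the Sun-Jafar codes of Chapter 13 are precisely about improving this rate.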

Definition: Quantum PIR (Problem Statement)

Quantum PIR (QPIR) considers the PIR problem where databases and/or communication channels are quantum.

Three variants:

  1. Classical queries, quantum databases. The databases hold quantum states; the user queries classically. Can the user retrieve a classical description of a quantum-stored file at a higher rate than classical PIR?

  2. Quantum queries, classical databases. User queries with quantum states; databases respond classically or quantumly. Quantum superposition may allow simultaneously querying many files privately.

  3. Quantum end-to-end. Both databases and communication are quantum.

Notable results:

  • Song-Hayashi 2019: QPIR capacity matches classical PIR in some regimes.
  • Allaix et al. 2022: $T$-colluding QPIR offers a strictly better rate than classical $T$-colluding PIR in some parameter ranges, a quantum advantage.
  • Full quantum-classical PIR capacity gap is open.

The operational implication: for ultra-sensitive retrieval (e.g., state secrets, genomic queries), quantum PIR may offer a better privacy-rate trade-off than classical PIR.


Theorem: Quantum PIR Rate Advantage

For $T$-colluding PIR with $N$ databases and $K$ files, quantum entanglement between databases allows a retrieval rate $Q_{\text{PIR}}(N, K, T) \;\geq\; \frac{N}{N + T}$ (for $T < N/2$). Compare with the classical $C_{\text{PIR}}(N, K, T) \leq 1 - T/N$ for large $K$.

Quantum advantage. Since $N/(N+T) > (N-T)/N$ whenever $T > 0$, we have $Q_{\text{PIR}} > C_{\text{PIR}}$, and the advantage grows with $T$: for $T = N/3$, $Q_{\text{PIR}} = N/(N + N/3) = 3/4$ vs. classical $2/3$, about a $12.5\%$ relative improvement.

Open Frontiers Mapped

| Frontier | Status | Key open problem |
|---|---|---|
| Non-linear coded computing | Approximation-based schemes known | Information-theoretic lower bound |
| Three-axis privacy-robustness-DP | 2-axis results known; 3-axis lower bound | Jointly-optimal achievability |
| Decentralized FL | Gossip D-SGD known | Joint privacy-robustness-DP over graphs |
| Coded attention | Partial; hybrid approximations | End-to-end optimal transformer coding |
| Coded MoE | Open; no framework | Data-dependent workload distribution |
| RAG / federated RAG | Open; PIR connection unexplored | Privacy-rate trade-off for RAG |
| Quantum PIR | Advantage known in some regimes | Full quantum-classical gap |
| Quantum secure aggregation | Open; preliminary schemes | Capacity characterization |

Quantum vs. Classical PIR: The Advantage Region

Visualize the regime where quantum $T$-colluding PIR strictly outperforms classical PIR. Sweep $T/N$ and $N$; observe the quantum advantage growing in certain regimes, then shrinking as $T \to N$. For small $T$, the advantage is negligible; production quantum PIR deployments would be justified only for moderate $T$.
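A minimal numeric version of this sweep, taking the quantum rate as $N/(N+T)$ for $T < N/2$ and the classical large-$K$ capacity as $1 - T/N$; both expressions are assumptions of this sketch, and the choice $N = 12$ is arbitrary.

```python
# Assumed rate expressions for this sketch:
# quantum N/(N+T) (valid for T < N/2), classical 1 - T/N (large K).
N = 12
rows = []
for T in range(1, N // 2):
    q = N / (N + T)   # assumed quantum T-colluding rate
    c = 1 - T / N     # classical large-K capacity under T collusion
    rows.append((T, q, c))
    print(f"T={T:2d}  T/N={T / N:.2f}  Q={q:.3f}  C={c:.3f}  Q/C={q / c:.3f}")
```

The ratio $Q/C = N^2/(N^2 - T^2)$ is near 1 for small $T$ and grows with $T$, matching the qualitative picture above.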


Historical Note: Convergence of Fields

Twenty years ago, information theory, cryptography, wireless communications, and machine learning were largely separate disciplines. This book — secure and distributed computing over wireless networks — is the product of their convergence:

  • Information theory provides the converse bounds (Shannon capacity, mutual information).
  • Cryptography contributes the privacy primitives (secret sharing, commitments).
  • Wireless communications supplies the channel models and AirComp.
  • Machine learning motivates the computational workloads (gradient descent, representation learning).

The five CommIT contributions in this book span all four fields: coded shuffling (wireless + ML), SecAgg (crypto + wireless + ML), ByzSecAgg (coding + wireless + ML), CCESA (graph theory + wireless + crypto), IT-secure FRL (information theory + wireless + ML). Each closes a gap; each opens new questions.

The open problems in this chapter — non-linear coded computing, joint privacy-robustness-DP, decentralized FL, modern-ML coding, quantum PIR — are joint problems across these fields. The next generation of results will come from readers who bring all four fluencies together.


Key Takeaway

The research frontier spans modern ML and quantum information theory. Coded attention, coded MoE, and federated RAG are bridges between this book's framework and the current generative-AI landscape. Quantum PIR offers rate advantages in specific regimes. Each direction involves joint information-theoretic, cryptographic, and ML work, precisely the convergence that produced this book. The next chapter (for the reader who goes beyond this book) is an open research program.

Closing the Book

This book developed the information-theoretic foundations of secure and distributed computing over wireless networks. The golden thread — privacy, robustness, and communication efficiency as fundamentally coupled — ran through each chapter:

  • Part I: The three-way coupling in the distributed-computing problem statement and its information-theoretic formulation.
  • Part II: Coded computing trades storage (computation load) for communication (straggler tolerance).
  • Part III: Secure aggregation trades cryptographic overhead for privacy guarantees.
  • Part IV: Private information retrieval trades database replication for retrieval rate.
  • Part V: Wireless aggregation trades synchronization for bandwidth efficiency.

The five CommIT contributions mark the current research frontier; the open problems of this chapter mark where the next generation of results will appear. Information theory provides the converses; engineering gives the achievability; wireless systems bring the reality check; machine learning provides the application pressure.

The next reader — you — is invited to continue the development. There is no finite curriculum for wireless distributed computing; there are only new chapters.

Quick Check

Which of the following is NOT an open direction listed in this chapter?

  • Non-linear coded computing with an information-theoretic recovery-threshold lower bound.

  • Joint privacy-robustness-DP achievability.

  • Quantum PIR with classical queries.

  • Classical Sun-Jafar PIR capacity for non-colluding databases.