Ferkans — Interactive Telecom Tutor

Why PIR Is an IA Problem

Private Information Retrieval (PIR) is the topic of Chapter 13: a user wants one file from a library replicated across $N$ databases, without revealing which file to any single database. The trick that makes PIR efficient (beating the trivial rate of downloading the whole library) is again finite-field interference alignment. Each database's answer is a linear combination of symbols from all $F$ files, specified by the user's query; the user cancels the "interference" (undesired files) using the algebraic structure of the answers across databases. The privacy requirement and the efficiency requirement are in tension: stronger privacy (resistance against more colluding databases) lowers the achievable rate. That tension is exactly the golden thread of this book.

This section previews the PIR capacity result (Sun–Jafar 2017) and shows where the alignment occurs. Chapter 13 will prove the capacity formula rigorously and extend to coded storage, $T$-colluding databases, and symmetric PIR. Reading this preview is enough to understand why Chapter 4, a "tool" chapter, is placed in Part I: the tool is used in at least three later chapters.

Definition:
Classical PIR Problem

A classical PIR system has:

A library of $F$ files $W_1, \ldots, W_F$ , each replicated on $N$ databases.
A user who wants to retrieve a single file $W_{\theta}$ with $\theta \in [F]$ chosen uniformly.
A privacy requirement: the query $Q^{(k)}$ sent to database $k$ must be statistically independent of $\theta$ , $I(\theta; Q^{(k)}) = 0$ .
A download budget: database $k$ returns $A^{(k)}$ as a deterministic function of $(Q^{(k)}, W_1, \ldots, W_F)$ .

The PIR rate is the fraction of useful bits in the total download: $R_{\text{PIR}} = |W| / \sum_k |A^{(k)}|$ . The PIR capacity $C_{\text{PIR}}(N, F)$ is the supremum of achievable rates.

Private Information Retrieval (PIR)

A protocol that lets a user retrieve file $W_\theta$ from $N$ databases without revealing $\theta$ to any single database. The PIR rate is bits-of-file per bit-downloaded; the classical PIR capacity is $C_{\text{PIR}} = (1 + 1/N + \cdots + 1/N^{F-1})^{-1}$ (Sun-Jafar 2017).

Theorem: Classical PIR Capacity (Sun–Jafar)

For $N \geq 2$ replicated databases and $F \geq 2$ files, the classical PIR capacity is $C_{\text{PIR}}(N, F) \;=\; \left(1 + \frac{1}{N} + \frac{1}{N^2} + \cdots + \frac{1}{N^{F-1}}\right)^{-1} \;=\; \frac{1 - 1/N}{1 - 1/N^F}.$ The capacity is matched by an explicit finite-field IA achievability scheme over $\mathbb{F}_q$ for $q \geq N$ .

The formula counts "coded interference": each database returns sums of sub-files scaled so that, when all $N$ answers are combined, the undesired file components align into a common $(F - 1)/N$ -dimensional subspace and can be cancelled. The factor $1/N^i$ at depth $i$ measures how much of the $i$ -th "interference layer" survives the alignment; the geometric sum is the total surviving overhead.

Operationally, the capacity approaches $1$ as $N \to \infty$ (more databases means less structural overhead) and drops towards $1/F$ as $N \to 1$ (with a single database, the only private strategy is to download the whole library). Chapter 13 develops the full converse proof.
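The two forms of the capacity formula and the limiting behaviour just described can be checked numerically. The following is a minimal sketch using exact rational arithmetic; the function names `pir_capacity` and `pir_capacity_closed_form` are illustrative and not part of the book's notation.

```python
from fractions import Fraction


def pir_capacity(n_databases: int, n_files: int) -> Fraction:
    """Geometric-sum form: (1 + 1/N + ... + 1/N^(F-1))^(-1)."""
    if n_databases < 1 or n_files < 1:
        raise ValueError("need at least one database and one file")
    overhead = sum(Fraction(1, n_databases ** i) for i in range(n_files))
    return 1 / overhead


def pir_capacity_closed_form(n_databases: int, n_files: int) -> Fraction:
    """Closed form (1 - 1/N) / (1 - 1/N^F); only defined for N >= 2."""
    if n_databases < 2:
        raise ValueError("closed form needs N >= 2")
    n = Fraction(n_databases)
    return (1 - 1 / n) / (1 - 1 / n ** n_files)


# The two forms agree wherever both are defined.
for n in range(2, 8):
    for f in range(1, 8):
        assert pir_capacity(n, f) == pir_capacity_closed_form(n, f)

# N = 1 collapses to the trivial rate 1/F; large N pushes the capacity towards 1.
assert pir_capacity(1, 5) == Fraction(1, 5)
print(float(pir_capacity(2, 5)), float(pir_capacity(1000, 5)))
```

Using `Fraction` rather than floats keeps the equality checks exact, which is convenient when comparing the two forms.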

Proof

Achievability sketch

Split each file $W_f$ into $N^F$ equal-size chunks, indexed by $(i_1, \ldots, i_F) \in [N]^F$ . The user generates a random permutation $\sigma$ of $[N]^F$ and sends to database $n$ a query that is a linear combination of chunks, with coefficients chosen so that (i) the interference from undesired files aligns into a common subspace, and (ii) each database's view is independent of $\theta$ . The user decodes by projecting along the alignment subspace.

Converse (cut-set-style)

For any scheme, a careful entropy-inequality argument bounds the total download: after conditioning on the desired file's partial reveal, the remaining download is at least $1/N$ of the file size on average — yielding the geometric-series sum. The details are in Sun–Jafar 2017 Thm. 1; Chapter 13 gives a full presentation.

Match

Achievability equals converse at every $(N, F)$ point, establishing the capacity exactly. $\blacksquare$

Example: PIR Capacity for $N = 2$ , $F = 2$

Compute the PIR capacity for 2 databases and 2 files, and describe an explicit scheme achieving the rate.

Solution

Capacity

$C_{\text{PIR}}(2, 2) = (1 + 1/2)^{-1} = 2/3$ .

Scheme

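Split each file into $N^F = 4$ chunks: $W_i = (W_{i,1}, W_{i,2}, W_{i,3}, W_{i,4})$. The user privately and independently permutes the chunk indices of each file; the indices below are the permuted ones. Suppose the user wants $W_1$.

Query to DB 1: one chunk of each file plus one mixed sum, $(W_{1,1},\; W_{2,1},\; W_{1,3} \oplus W_{2,2})$.
Query to DB 2: a fresh chunk of each file plus one mixed sum, $(W_{1,2},\; W_{2,2},\; W_{1,4} \oplus W_{2,1})$.

Each database returns the requested chunks and sum without learning $\theta$: its query is symmetric in the two files and, because of the private permutations, the indices look uniform. The user recovers $W_{1,1}$ and $W_{1,2}$ directly, $W_{1,3}$ by cancelling $W_{2,2}$ (downloaded from DB 2) from DB 1's sum, and $W_{1,4}$ by cancelling $W_{2,1}$ (downloaded from DB 1) from DB 2's sum. Total download: $6$ chunks, file size $4$ chunks, rate $= 4/6 = 2/3$. ✓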
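A minimal simulation sketch of the standard Sun–Jafar construction for $N = 2$, $F = 2$ follows. It assumes each file is split into $N^F = 4$ chunks, as in the achievability sketch, and that each database returns one chunk of each file plus one XOR of a fresh desired chunk with the undesired chunk the other database sends alone. Files are indexed 0 and 1 here rather than 1 and 2. The names `make_queries`, `answer`, and `retrieve` are hypothetical, and `random.Random` stands in for the cryptographic randomness a real client would need.

```python
import random

CHUNKS = 4  # N**F chunks per file for N = 2, F = 2


def xor(x: bytes, y: bytes) -> bytes:
    """Bytewise XOR of two equal-length chunks."""
    return bytes(p ^ q for p, q in zip(x, y))


def make_queries(theta: int, rng: random.Random):
    """Build both queries. queries[n][f] = (index sent alone, index used in the sum)."""
    other = 1 - theta
    # Independent private permutations of the chunk indices of each file.
    d = rng.sample(range(CHUNKS), CHUNKS)  # desired file
    u = rng.sample(range(CHUNKS), CHUNKS)  # undesired file
    query_db1 = {theta: (d[0], d[2]), other: (u[0], u[1])}
    query_db2 = {theta: (d[1], d[3]), other: (u[1], u[0])}
    return [query_db1, query_db2]


def answer(files, query):
    """Database side: one chunk of each file plus one XOR; theta is never an input."""
    singles = {f: files[f][query[f][0]] for f in (0, 1)}
    mixed = xor(files[0][query[0][1]], files[1][query[1][1]])
    return singles, mixed


def retrieve(files, theta: int, rng: random.Random):
    """User side: query both databases, cancel undesired chunks, rebuild the file."""
    other = 1 - theta
    queries = make_queries(theta, rng)
    answers = [answer(files, q) for q in queries]
    recovered = {}
    for n, (singles, mixed) in enumerate(answers):
        alone_idx, sum_idx = queries[n][theta]
        recovered[alone_idx] = singles[theta]
        # The undesired chunk inside this sum was sent alone by the other database.
        side_info = answers[1 - n][0][other]
        recovered[sum_idx] = xor(mixed, side_info)
    downloaded_chunks = sum(len(singles) + 1 for singles, _ in answers)
    return [recovered[i] for i in range(CHUNKS)], downloaded_chunks


rng = random.Random(0)  # illustration only; not a cryptographic RNG
files = [[bytes(rng.randrange(256) for _ in range(8)) for _ in range(CHUNKS)]
         for _ in range(2)]
for theta in (0, 1):
    rebuilt, downloaded = retrieve(files, theta, rng)
    assert rebuilt == files[theta]
    print(f"theta={theta}: downloaded {downloaded} chunks, rate {CHUNKS}/{downloaded}")
```

The database-side function `answer` takes only the query and the stored files, which mirrors the requirement that $A^{(k)}$ is a deterministic function of $(Q^{(k)}, W_1, \ldots, W_F)$.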

Generality

The same construction generalizes to arbitrary $(N, F)$ — each "level" of the alignment cancels a fraction $1/N$ of the interference, and the total cost is the geometric sum.
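The privacy condition $I(\theta; Q^{(k)}) = 0$ can be verified by brute force for the $N = 2$, $F = 2$ case. The sketch below assumes the four-chunk construction in which each database's query names, for each file, one chunk index to return alone and one chunk index to include in the sum, with the indices drawn from private uniform permutations. It enumerates all $4! \cdot 4!$ permutation pairs and compares the exact query distribution under $\theta = 0$ and $\theta = 1$ (files indexed from 0). The helper names are illustrative.

```python
from collections import Counter
from itertools import permutations


def query_for(db: int, theta: int, perm_desired, perm_other):
    """Query seen by database `db` (0 or 1): per file, (index sent alone, index in the sum)."""
    other = 1 - theta
    d, u = perm_desired, perm_other
    if db == 0:
        q = {theta: (d[0], d[2]), other: (u[0], u[1])}
    else:
        q = {theta: (d[1], d[3]), other: (u[1], u[0])}
    return (q[0], q[1])  # ordered by file index, exactly as the database sees it


def query_distribution(db: int, theta: int) -> Counter:
    """Exact query distribution over all 4! * 4! equally likely permutation pairs."""
    perms = list(permutations(range(4)))
    return Counter(query_for(db, theta, d, u) for d in perms for u in perms)


for db in (0, 1):
    # Identical distributions for both values of theta: the query carries no information.
    assert query_distribution(db, 0) == query_distribution(db, 1)
print("each database's query distribution is independent of theta")
```

The check covers a single database's view only; two colluding databases could compare queries and identify the desired file, which is why the $T$-colluding extension needs a different construction.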

PIR Capacity vs. Databases (Preview)

Plot the classical PIR capacity $C_{\text{PIR}}(N, F)$ as a function of $N$ for several library sizes $F$. As $N \to \infty$, $C_{\text{PIR}} \to 1$ (no privacy overhead). As $F$ grows, the capacity decreases because there is more "interference" to hide. Chapter 13 develops the extensions ($T$-colluding databases, coded storage, symmetric PIR) and explores further tradeoffs.

Parameters: maximum number of databases $N_{\max}$ (range of databases to plot, set to 15); library size $F$ (number of files, set to 5).
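The curves of the interactive plot can be reproduced offline as a table. The sketch below is an assumption about what the plot shows, namely $C_{\text{PIR}}(N, F)$ for $N = 2, \ldots, N_{\max}$; the choice of library sizes `(2, 3, 5, 10)` and the function names are illustrative.

```python
def pir_capacity(n_databases: int, n_files: int) -> float:
    """Classical PIR capacity (1 + 1/N + ... + 1/N^(F-1))^(-1) as a float."""
    return 1.0 / sum(n_databases ** -i for i in range(n_files))


def capacity_curves(n_max: int = 15, library_sizes=(2, 3, 5, 10)):
    """Return the N axis and one capacity curve per library size F."""
    ns = list(range(2, n_max + 1))
    return ns, {f: [pir_capacity(n, f) for n in ns] for f in library_sizes}


ns, curves = capacity_curves()
print("N   " + "  ".join(f"F={f:<5}" for f in curves))
for k, n in enumerate(ns):
    print(f"{n:<3} " + "  ".join(f"{curves[f][k]:.4f} " for f in curves))
```

Each column increases with $N$ and each row decreases with $F$. From the closed form, the curves for large $F$ approach the floor $1 - 1/N$, so adding files beyond the first few changes the capacity very little.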

Why This Matters: The Full Treatment Is Chapter 13

This preview has two goals: to justify Chapter 4's position in Part I (a tool chapter feeding multiple later parts) and to plant the vocabulary needed for Part IV. Chapter 13 begins with the Sun–Jafar scheme in detail, then treats $T$-colluding databases (any set of up to $T$ databases may collude without learning the query), coded-storage PIR (files stored via MDS codes rather than replicated), and symmetric PIR (the user learns nothing about the files it did not request). Each extension refines the alignment argument of this section.

Key Takeaway

Finite-field IA is the common thread. Coded matrix multiplication (§4.2), coded caching delivery (§4.3), and classical PIR (§4.4) are all IA constructions — the same algebraic machinery, specialized to different communication objectives. The tradeoff in each case is between the parameter whose privacy / redundancy we buy (storage, users, colluders) and the parameter whose cost we pay (communication rate, download, delivery rate). Knowing Section 4.1's alignment recipe is enough to understand why the numerical capacity formulas of Chapters 5, 7, and 13 come out the way they do.

Common Mistake: PIR Is NOT "One-Time Pad Over All Files"

Mistake:

Claiming that private retrieval is a solved problem because one can simply download the entire library: "no single database learns the query because every query looks the same".

Correction:

Downloading the entire library does hide the query, but at rate $1/F$ (one file's worth of useful data per $F$ files of download). The Sun–Jafar capacity is $(1 - 1/N) / (1 - 1/N^F)$, which approaches $1$ as $N \to \infty$ and is strictly better than $1/F$ whenever $N \geq 2$ and $F \geq 2$. The non-trivial content of PIR is precisely that the trivial rate can be beaten via finite-field IA.
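The gap is easiest to see in terms of normalized download, that is, download per unit of file size, which is the reciprocal of the rate. A short sketch with illustrative function names:

```python
from fractions import Fraction


def download_per_file_unit(n_databases: int, n_files: int) -> Fraction:
    """Capacity-achieving normalized download: 1 + 1/N + ... + 1/N^(F-1)."""
    return sum(Fraction(1, n_databases ** i) for i in range(n_files))


def savings_over_trivial(n_databases: int, n_files: int) -> Fraction:
    """Factor by which the trivial download (F file units) exceeds the optimum."""
    return Fraction(n_files) / download_per_file_unit(n_databases, n_files)


for n_files in (2, 10, 100):
    optimal = download_per_file_unit(2, n_files)
    # The trivial scheme costs F units; the optimum stays below N/(N-1) = 2 for N = 2.
    assert optimal < 2
    print(n_files, float(optimal), float(savings_over_trivial(2, n_files)))
```

The trivial download grows linearly in $F$, whereas the geometric sum stays below $N/(N-1)$ for every $F$, so the saving grows without bound as the library gets larger.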

🔧Engineering Note

PIR in Modern Cloud-Storage Systems

Classical information-theoretic PIR has seen limited deployment in production clouds: the requirement of $N \geq 2$ non-colluding databases is a strong trust assumption. Computationally secure PIR (Kushilevitz-Ostrovsky 1997 and successors) is the variant used in some niche systems (ProtonMail message retrieval, certain academic medical-record stores). The information-theoretic PIR of Part IV of this book remains the benchmark: it represents what is achievable without computational assumptions and so establishes how much the computationally secure variants give up.

Encrypted-query systems such as CryptDB offer a middle ground, but with their own tradeoffs in latency and database-side compute.

Practical Constraints

• Classical IT-PIR requires non-colluding databases, which is often unrealistic
• Computationally-secure PIR (SealPIR, XPIR) adds latency, weaker guarantees
• Multi-server PIR in production systems remains a research area

📋 Ref: Microsoft SealPIR 2018; Beimel survey 2007

Quick Check

For $N = 4$ databases and $F = 3$ files, the classical PIR capacity is:

$(1 + 1/4 + 1/16)^{-1} \approx 0.762$

$1/F = 1/3$

$1 - 1/N = 0.75$

$1/N = 0.25$