The PIR Problem and Threat Model

Why Hide Which File You Want?

Suppose you run a private database query: financial transactions, medical records, web searches, patent filings. You want the result, but the fact of your asking — which record you wanted — is itself sensitive. A nosy database operator who knows "Alice queried record 7" can infer Alice's financial situation, health status, business plans. Even with strong access control on the contents, the query pattern leaks information.

Private Information Retrieval (PIR) is the information-theoretic answer: a protocol that delivers the requested record while hiding the identity of the request from the database. PIR is a different privacy concept than the secure aggregation of Part III: there we hide what a user contributed; here we hide which item a user requested.

The point is that PIR has its own beautiful information-theoretic structure — the Sun-Jafar capacity formula $C_{\text{PIR}} = (1 + 1/N + 1/N^2 + \cdots + 1/N^{K-1})^{-1}$ that tightly characterizes the achievable retrieval rate. This chapter develops the formula, the achievability scheme, and the matching converse. Chapters 14–15 extend to coded storage, colluding databases, and cache-aided settings — including a CommIT-relevant cache-aided variant in Chapter 15.

Definition:
Classical (Replicated-Storage) PIR

A classical PIR system has:

A library of $K$ files $W_1, \ldots, W_K$ , each file of length $L$ bits.
$N$ databases, each storing a complete copy of the entire library.
A user who wants to retrieve file $W_\theta$ where $\theta \in [K]$ is the user's desired index, drawn uniformly at random from a known prior.

The protocol operates in two steps:

Query phase. The user generates queries $Q^{(\theta, 1)}, \ldots, Q^{(\theta, N)}$ — one per database — using only $\theta$ and private randomness. The query for database $n$ is sent only to database $n$ .
Answer phase. Each database $n$ computes its answer $A^{(\theta, n)} = f_n(Q^{(\theta, n)}, W_1, \ldots, W_K)$ as a deterministic function of its received query and stored files. The user collects the $N$ answers and reconstructs $W_\theta$ .

Two requirements must hold:

Correctness: from the answers $\{A^{(\theta, n)}\}_n$ , the user reconstructs $W_\theta$ exactly (zero-error or vanishing error).
Privacy: for every database $n$ in isolation, the query $Q^{(\theta, n)}$ is statistically independent of $\theta$ . Formally, $I(\theta; Q^{(\theta, n)}) = 0$ .

The PIR rate is $R = L / D$ where $D$ is the total download size. The PIR capacity is $C_{\text{PIR}} = \sup_{\text{schemes}} R$ .

The classical setting assumes non-colluding databases: each database sees only its own query. Chapter 14 §14.2 generalizes to $T$ -colluding databases, where any $T$ databases pool their queries and the privacy guarantee must still hold against this stronger adversary.

Private Information Retrieval (PIR)

A protocol that retrieves one file from a multi- database library without revealing the identity of the requested file to any single database (or, in extensions, to any colluding subset of databases up to a threshold $T$ ).

PIR Rate $R$

The fraction of useful information per downloaded bit: $R = L / D$ where $L$ is the file size and $D$ the total download. Higher $R$ means more efficient PIR. The capacity $C_{\text{PIR}}$ is the supremum of achievable rates.

PIR Capacity $C_{\text{PIR}}$

The supremum of PIR rates over all valid schemes. For classical (replicated) PIR with $N$ databases and $K$ files: $C_{\text{PIR}} = (1 + 1/N + \\cdots + 1/N^{K-1})^{-1}$ (Sun-Jafar 2017).

Two Trivial Baselines and Why They're Suboptimal

Before constructing capacity-achieving PIR, consider two trivial baselines:

Download everything. The user downloads all $K$ files from one database, then locally extracts $W_\theta$ . Privacy holds (the database doesn't know which file the user uses), but the rate is $R = L/(KL) = 1/K$ — extremely wasteful.
Download from all databases. The user downloads $W_\theta$ from every database (replicated copies). No privacy: each database knows exactly which file was requested.

The first achieves privacy at the cost of efficiency ( $1/K$ rate); the second achieves efficiency at the cost of privacy. The Sun-Jafar PIR scheme breaks this tradeoff: it achieves $C_{\text{PIR}}(N, K) > 1/K$ for all $N \geq 2$ while preserving information-theoretic privacy against each individual database.

The point is that coordinated queries across multiple databases can achieve substantially higher rates than naive single-database approaches — exactly the finite-field IA story from Chapter 4.

Example: Trivial vs. Capacity for $N = 4, K = 3$

Compute the rate of the two trivial baselines and compare to the Sun-Jafar capacity for $N = 4$ databases, $K = 3$ files.

Solution

Trivial 1 — download all files

$R = 1/K = 1/3 \approx 0.333$ . Privacy holds.

Trivial 2 — download $W_\theta$

$R = 1$ from one database (or $1/N = 1/4$ if we naively replicate from each), but no privacy.

Sun-Jafar capacity

$C_{\text{PIR}} = (1 + 1/4 + 1/16)^{-1} = 1/1.3125 \approx 0.762$ .

Conclusion

Sun-Jafar achieves $0.762$ — much higher than the trivial $1/K = 0.333$ , while preserving privacy. The savings is the $K/(K-1)$ -style gain that comes from coordinated finite-field IA across databases.

PIR Capacity vs. Trivial Baselines

Plot the Sun-Jafar PIR capacity $C_{\text{PIR}}(N, K)$ as a function of $N$ for several library sizes $K$ , alongside the trivial baseline $1/K$ (privacy-preserving but wasteful). The capacity strictly dominates the trivial baseline for $N \geq 2$ and approaches $1$ as $N \to \infty$ — illustrating the operational power of coordinated multi-database PIR.

Parameters

K

— files8

Number of files in the library

N

max — databases15

Range of databases to plot

Definition:
PIR Threat Model

The classical PIR threat model has:

Honest-but-curious databases. Each database follows the protocol (returns the correct $A$ for its received $Q$ ) but may analyze its received query to try to infer $\theta$ .
Non-colluding databases. Databases do not share queries with each other. Each database sees only its own query in isolation.
Privacy guarantee. For every database $n$ , $I(\theta; Q^{(\theta, n)}) = 0$ . The query is information-theoretically independent of the desired index.

The classical setting is symmetric: the privacy guarantee must hold against every individual database, not just one specific one. Hence the user's strategy is non-adaptive — the queries are designed once and for all, before the protocol runs.

Variants in Chapter 14:

$T$ -colluding PIR: the same guarantee but against any $T$ -colluding subset of databases.
Symmetric PIR (SPIR): also requires the user learns nothing about other files.

The classical (non-colluding) setting is the cleanest information-theoretic problem and is the starting point.

The non-colluding assumption is strong but realistic in many deployments: PIR databases run by competing entities (different cloud providers, institutional silos) have no incentive to share query information with each other. The $T$ -colluding extension (Chapter 14) handles softer trust models.

PIR vs. Secure Aggregation: Different Privacy Problems

Property	Secure Aggregation (Part III)	PIR (Part IV)
What is hidden	Individual gradient $\mathbf{g}_k$	Desired file index $\theta$
Adversary	Server (sees aggregate)	Database (sees query)
Information leaked	Aggregate $\sum_k \mathbf{g}_k$	Identity of accessed file
Communication structure	Many users → one server	One user → many databases
Capacity formula	$\Theta(n^2)$ overhead (Bonawitz)	$(1 + 1/N + \cdots + 1/N^{K-1})^{-1}$ (Sun-Jafar)
CommIT contributions	Optimality (Ch. 10), ByzSecAgg (Ch. 11), CCESA (Ch. 12)	Cache-aided PIR (Ch. 15)

Common Mistake: PIR Is Not Encryption

Mistake:

Confuse PIR with database encryption: assume encrypting the database hides the access pattern.

Correction:

Encryption hides the content of the database from unauthorized parties. PIR hides the access pattern — which encrypted record was requested — even from the authorized database operator who can decrypt everything. The two are orthogonal:

No encryption + PIR: server sees content but not which record is requested.
Encryption + no PIR: server can't read content but knows which record is requested.
Encryption + PIR: full hiding (typical production goal).

The Sun-Jafar capacity is for the access-pattern privacy problem — it does not address content confidentiality, which is a separate (and typically computational, not information-theoretic) concern.

⚠️Engineering Note

Where PIR Matters in Practice

Real-world PIR applications:

Private medical records: a researcher queries a hospital database without revealing which patient they are studying.
Patent search: a company queries the patent database without revealing which technology area they are exploring.
Anonymous web browsing: a user retrieves web pages from an indexed cache without revealing which page they visit.
Privacy-preserving advertising: an ad network targets ads without learning user browsing history.
Encrypted DNS: DNS queries can be PIR-protected to hide which domains a user visits.

Production PIR deployments are limited (Microsoft SealPIR, Google's Private Set Intersection libraries) because the information-theoretic setting requires non-colluding databases — a strong trust assumption. Computationally-secure PIR (single-database, based on cryptographic assumptions) is more common in practice but with weaker guarantees.

Practical Constraints

•
Information-theoretic PIR: requires $N \geq 2$ non-colluding databases
•
Computationally-secure PIR: weaker guarantees, single database
•
Production deployments: SealPIR, XPIR research libraries

📋 Ref: Beimel survey 2007; Microsoft SealPIR

Historical Note: From Chor et al. (1995) to Sun-Jafar (2017)

1995–present

PIR was introduced by Benny Chor, Oded Goldreich, Eyal Kushilevitz, and Madhu Sudan in their 1995 FOCS paper "Private Information Retrieval." Their construction achieved sub-linear-in- $K$ download cost — a substantial improvement over the trivial $1/K$ rate but not matching the information-theoretic optimum.

For 22 years, the capacity of PIR — the maximum achievable rate — remained open. Hua Sun and Syed Jafar settled the question in 2017 with their landmark paper "The Capacity of Private Information Retrieval." Their result is twofold: (i) an explicit capacity-achieving scheme via finite-field IA, (ii) a matching cut-set converse. The capacity formula $C_{\text{PIR}} = (1 + 1/N + \cdots + 1/N^{K-1})^{-1}$ has the elegant geometric-series form that reveals the algebraic structure of the problem.

Sun-Jafar's result spawned a decade of follow-on work: coded-storage PIR (Tajeddine, El Rouayheb), $T$ -colluding PIR (Sun, Jafar), symmetric PIR, cache-aided PIR (Wei et al., and the CommIT group's contributions in Chapter 15).

Key Takeaway

PIR hides which file you ask for. The classical setting: $N$ replicated databases, $K$ files, user retrieves one without revealing the index to any single database. Sun-Jafar (2017) established the capacity formula $C_{\text{PIR}} = (1 + 1/N + \cdots + 1/N^{K-1})^{-1}$ — much higher than the trivial $1/K$ rate, especially for many databases. Section 13.2 develops the achievability; §13.3 proves the matching converse.

Quick Check

PIR differs from database encryption in that:

PIR hides the database content; encryption hides the access pattern.

PIR hides which record is being accessed; encryption hides what the records contain.

PIR is a special case of encryption.

PIR is computationally more efficient than encryption.