The PIR Problem and Threat Model
Why Hide Which File You Want?
Suppose you run a private database query: financial transactions, medical records, web searches, patent filings. You want the result, but the fact of your asking — which record you wanted — is itself sensitive. A nosy database operator who knows "Alice queried record 7" can infer Alice's financial situation, health status, business plans. Even with strong access control on the contents, the query pattern leaks information.
Private Information Retrieval (PIR) is the information-theoretic answer: a protocol that delivers the requested record while hiding the identity of the request from the database. PIR is a different privacy concept than the secure aggregation of Part III: there we hide what a user contributed; here we hide which item a user requested.
The point is that PIR has its own beautiful information-theoretic structure — the Sun-Jafar capacity formula that tightly characterizes the achievable retrieval rate. This chapter develops the formula, the achievability scheme, and the matching converse. Chapters 14–15 extend to coded storage, colluding databases, and cache-aided settings — including a CommIT-relevant cache-aided variant in Chapter 15.
Definition: Classical (Replicated-Storage) PIR
Classical (Replicated-Storage) PIR
A classical PIR system has:
- A library of files , each file of length bits.
- databases, each storing a complete copy of the entire library.
- A user who wants to retrieve file where is the user's desired index, drawn uniformly at random from a known prior.
The protocol operates in two steps:
-
Query phase. The user generates queries — one per database — using only and private randomness. The query for database is sent only to database .
-
Answer phase. Each database computes its answer as a deterministic function of its received query and stored files. The user collects the answers and reconstructs .
Two requirements must hold:
- Correctness: from the answers , the user reconstructs exactly (zero-error or vanishing error).
- Privacy: for every database in isolation, the query is statistically independent of . Formally, .
The PIR rate is where is the total download size. The PIR capacity is .
The classical setting assumes non-colluding databases: each database sees only its own query. Chapter 14 §14.2 generalizes to -colluding databases, where any databases pool their queries and the privacy guarantee must still hold against this stronger adversary.
Private Information Retrieval (PIR)
A protocol that retrieves one file from a multi- database library without revealing the identity of the requested file to any single database (or, in extensions, to any colluding subset of databases up to a threshold ).
PIR Rate
The fraction of useful information per downloaded bit: where is the file size and the total download. Higher means more efficient PIR. The capacity is the supremum of achievable rates.
PIR Capacity
The supremum of PIR rates over all valid schemes. For classical (replicated) PIR with databases and files: (Sun-Jafar 2017).
Two Trivial Baselines and Why They're Suboptimal
Before constructing capacity-achieving PIR, consider two trivial baselines:
-
Download everything. The user downloads all files from one database, then locally extracts . Privacy holds (the database doesn't know which file the user uses), but the rate is — extremely wasteful.
-
Download from all databases. The user downloads from every database (replicated copies). No privacy: each database knows exactly which file was requested.
The first achieves privacy at the cost of efficiency ( rate); the second achieves efficiency at the cost of privacy. The Sun-Jafar PIR scheme breaks this tradeoff: it achieves for all while preserving information-theoretic privacy against each individual database.
The point is that coordinated queries across multiple databases can achieve substantially higher rates than naive single-database approaches — exactly the finite-field IA story from Chapter 4.
Example: Trivial vs. Capacity for
Compute the rate of the two trivial baselines and compare to the Sun-Jafar capacity for databases, files.
Trivial 1 — download all files
. Privacy holds.
Trivial 2 — download $W_\theta$
from one database (or if we naively replicate from each), but no privacy.
Sun-Jafar capacity
.
Conclusion
Sun-Jafar achieves — much higher than the trivial , while preserving privacy. The savings is the -style gain that comes from coordinated finite-field IA across databases.
PIR Capacity vs. Trivial Baselines
Plot the Sun-Jafar PIR capacity as a function of for several library sizes , alongside the trivial baseline (privacy-preserving but wasteful). The capacity strictly dominates the trivial baseline for and approaches as — illustrating the operational power of coordinated multi-database PIR.
Parameters
Number of files in the library
Range of databases to plot
Definition: PIR Threat Model
PIR Threat Model
The classical PIR threat model has:
-
Honest-but-curious databases. Each database follows the protocol (returns the correct for its received ) but may analyze its received query to try to infer .
-
Non-colluding databases. Databases do not share queries with each other. Each database sees only its own query in isolation.
-
Privacy guarantee. For every database , . The query is information-theoretically independent of the desired index.
The classical setting is symmetric: the privacy guarantee must hold against every individual database, not just one specific one. Hence the user's strategy is non-adaptive — the queries are designed once and for all, before the protocol runs.
Variants in Chapter 14:
- -colluding PIR: the same guarantee but against any -colluding subset of databases.
- Symmetric PIR (SPIR): also requires the user learns nothing about other files.
The classical (non-colluding) setting is the cleanest information-theoretic problem and is the starting point.
The non-colluding assumption is strong but realistic in many deployments: PIR databases run by competing entities (different cloud providers, institutional silos) have no incentive to share query information with each other. The -colluding extension (Chapter 14) handles softer trust models.
PIR vs. Secure Aggregation: Different Privacy Problems
| Property | Secure Aggregation (Part III) | PIR (Part IV) |
|---|---|---|
| What is hidden | Individual gradient | Desired file index |
| Adversary | Server (sees aggregate) | Database (sees query) |
| Information leaked | Aggregate | Identity of accessed file |
| Communication structure | Many users → one server | One user → many databases |
| Capacity formula | overhead (Bonawitz) | (Sun-Jafar) |
| CommIT contributions | Optimality (Ch. 10), ByzSecAgg (Ch. 11), CCESA (Ch. 12) | Cache-aided PIR (Ch. 15) |
Common Mistake: PIR Is Not Encryption
Mistake:
Confuse PIR with database encryption: assume encrypting the database hides the access pattern.
Correction:
Encryption hides the content of the database from unauthorized parties. PIR hides the access pattern — which encrypted record was requested — even from the authorized database operator who can decrypt everything. The two are orthogonal:
- No encryption + PIR: server sees content but not which record is requested.
- Encryption + no PIR: server can't read content but knows which record is requested.
- Encryption + PIR: full hiding (typical production goal).
The Sun-Jafar capacity is for the access-pattern privacy problem — it does not address content confidentiality, which is a separate (and typically computational, not information-theoretic) concern.
Where PIR Matters in Practice
Real-world PIR applications:
- Private medical records: a researcher queries a hospital database without revealing which patient they are studying.
- Patent search: a company queries the patent database without revealing which technology area they are exploring.
- Anonymous web browsing: a user retrieves web pages from an indexed cache without revealing which page they visit.
- Privacy-preserving advertising: an ad network targets ads without learning user browsing history.
- Encrypted DNS: DNS queries can be PIR-protected to hide which domains a user visits.
Production PIR deployments are limited (Microsoft SealPIR, Google's Private Set Intersection libraries) because the information-theoretic setting requires non-colluding databases — a strong trust assumption. Computationally-secure PIR (single-database, based on cryptographic assumptions) is more common in practice but with weaker guarantees.
- •
Information-theoretic PIR: requires non-colluding databases
- •
Computationally-secure PIR: weaker guarantees, single database
- •
Production deployments: SealPIR, XPIR research libraries
Historical Note: From Chor et al. (1995) to Sun-Jafar (2017)
1995–presentPIR was introduced by Benny Chor, Oded Goldreich, Eyal Kushilevitz, and Madhu Sudan in their 1995 FOCS paper "Private Information Retrieval." Their construction achieved sub-linear-in- download cost — a substantial improvement over the trivial rate but not matching the information-theoretic optimum.
For 22 years, the capacity of PIR — the maximum achievable rate — remained open. Hua Sun and Syed Jafar settled the question in 2017 with their landmark paper "The Capacity of Private Information Retrieval." Their result is twofold: (i) an explicit capacity-achieving scheme via finite-field IA, (ii) a matching cut-set converse. The capacity formula has the elegant geometric-series form that reveals the algebraic structure of the problem.
Sun-Jafar's result spawned a decade of follow-on work: coded-storage PIR (Tajeddine, El Rouayheb), -colluding PIR (Sun, Jafar), symmetric PIR, cache-aided PIR (Wei et al., and the CommIT group's contributions in Chapter 15).
Key Takeaway
PIR hides which file you ask for. The classical setting: replicated databases, files, user retrieves one without revealing the index to any single database. Sun-Jafar (2017) established the capacity formula — much higher than the trivial rate, especially for many databases. Section 13.2 develops the achievability; §13.3 proves the matching converse.
Quick Check
PIR differs from database encryption in that:
PIR hides the database content; encryption hides the access pattern.
PIR hides which record is being accessed; encryption hides what the records contain.
PIR is a special case of encryption.
PIR is computationally more efficient than encryption.
Exactly. PIR addresses access-pattern privacy; encryption addresses content confidentiality. The two are orthogonal — production systems often combine them.