The PIR Problem and Threat Model

Why Hide Which File You Want?

Suppose you run a private database query: financial transactions, medical records, web searches, patent filings. You want the result, but the fact of your asking — which record you wanted — is itself sensitive. A nosy database operator who knows "Alice queried record 7" can infer Alice's financial situation, health status, business plans. Even with strong access control on the contents, the query pattern leaks information.

Private Information Retrieval (PIR) is the information-theoretic answer: a protocol that delivers the requested record while hiding the identity of the request from the database. PIR is a different privacy concept than the secure aggregation of Part III: there we hide what a user contributed; here we hide which item a user requested.

The point is that PIR has its own beautiful information-theoretic structure — the Sun-Jafar capacity formula CPIR=(1+1/N+1/N2++1/NK1)1C_{\text{PIR}} = (1 + 1/N + 1/N^2 + \cdots + 1/N^{K-1})^{-1} that tightly characterizes the achievable retrieval rate. This chapter develops the formula, the achievability scheme, and the matching converse. Chapters 14–15 extend to coded storage, colluding databases, and cache-aided settings — including a CommIT-relevant cache-aided variant in Chapter 15.

,

Definition:

Classical (Replicated-Storage) PIR

A classical PIR system has:

  • A library of KK files W1,,WKW_1, \ldots, W_K, each file of length LL bits.
  • NN databases, each storing a complete copy of the entire library.
  • A user who wants to retrieve file WθW_\theta where θ[K]\theta \in [K] is the user's desired index, drawn uniformly at random from a known prior.

The protocol operates in two steps:

  1. Query phase. The user generates queries Q(θ,1),,Q(θ,N)Q^{(\theta, 1)}, \ldots, Q^{(\theta, N)} — one per database — using only θ\theta and private randomness. The query for database nn is sent only to database nn.

  2. Answer phase. Each database nn computes its answer A(θ,n)=fn(Q(θ,n),W1,,WK)A^{(\theta, n)} = f_n(Q^{(\theta, n)}, W_1, \ldots, W_K) as a deterministic function of its received query and stored files. The user collects the NN answers and reconstructs WθW_\theta.

Two requirements must hold:

  • Correctness: from the answers {A(θ,n)}n\{A^{(\theta, n)}\}_n, the user reconstructs WθW_\theta exactly (zero-error or vanishing error).
  • Privacy: for every database nn in isolation, the query Q(θ,n)Q^{(\theta, n)} is statistically independent of θ\theta. Formally, I(θ;Q(θ,n))=0I(\theta; Q^{(\theta, n)}) = 0.

The PIR rate is R=L/DR = L / D where DD is the total download size. The PIR capacity is CPIR=supschemesRC_{\text{PIR}} = \sup_{\text{schemes}} R.

The classical setting assumes non-colluding databases: each database sees only its own query. Chapter 14 §14.2 generalizes to TT-colluding databases, where any TT databases pool their queries and the privacy guarantee must still hold against this stronger adversary.

,

Private Information Retrieval (PIR)

A protocol that retrieves one file from a multi- database library without revealing the identity of the requested file to any single database (or, in extensions, to any colluding subset of databases up to a threshold TT).

PIR Rate RR

The fraction of useful information per downloaded bit: R=L/DR = L / D where LL is the file size and DD the total download. Higher RR means more efficient PIR. The capacity CPIRC_{\text{PIR}} is the supremum of achievable rates.

PIR Capacity CPIRC_{\text{PIR}}

The supremum of PIR rates over all valid schemes. For classical (replicated) PIR with NN databases and KK files: CPIR=(1+1/N+cdots+1/NK1)1C_{\text{PIR}} = (1 + 1/N + \\cdots + 1/N^{K-1})^{-1} (Sun-Jafar 2017).

Two Trivial Baselines and Why They're Suboptimal

Before constructing capacity-achieving PIR, consider two trivial baselines:

  1. Download everything. The user downloads all KK files from one database, then locally extracts WθW_\theta. Privacy holds (the database doesn't know which file the user uses), but the rate is R=L/(KL)=1/KR = L/(KL) = 1/K — extremely wasteful.

  2. Download from all databases. The user downloads WθW_\theta from every database (replicated copies). No privacy: each database knows exactly which file was requested.

The first achieves privacy at the cost of efficiency (1/K1/K rate); the second achieves efficiency at the cost of privacy. The Sun-Jafar PIR scheme breaks this tradeoff: it achieves CPIR(N,K)>1/KC_{\text{PIR}}(N, K) > 1/K for all N2N \geq 2 while preserving information-theoretic privacy against each individual database.

The point is that coordinated queries across multiple databases can achieve substantially higher rates than naive single-database approaches — exactly the finite-field IA story from Chapter 4.

,

Example: Trivial vs. Capacity for N=4,K=3N = 4, K = 3

Compute the rate of the two trivial baselines and compare to the Sun-Jafar capacity for N=4N = 4 databases, K=3K = 3 files.

PIR Capacity vs. Trivial Baselines

Plot the Sun-Jafar PIR capacity CPIR(N,K)C_{\text{PIR}}(N, K) as a function of NN for several library sizes KK, alongside the trivial baseline 1/K1/K (privacy-preserving but wasteful). The capacity strictly dominates the trivial baseline for N2N \geq 2 and approaches 11 as NN \to \infty — illustrating the operational power of coordinated multi-database PIR.

Parameters
8

Number of files in the library

15

Range of databases to plot

Definition:

PIR Threat Model

The classical PIR threat model has:

  1. Honest-but-curious databases. Each database follows the protocol (returns the correct AA for its received QQ) but may analyze its received query to try to infer θ\theta.

  2. Non-colluding databases. Databases do not share queries with each other. Each database sees only its own query in isolation.

  3. Privacy guarantee. For every database nn, I(θ;Q(θ,n))=0I(\theta; Q^{(\theta, n)}) = 0. The query is information-theoretically independent of the desired index.

The classical setting is symmetric: the privacy guarantee must hold against every individual database, not just one specific one. Hence the user's strategy is non-adaptive — the queries are designed once and for all, before the protocol runs.

Variants in Chapter 14:

  • TT-colluding PIR: the same guarantee but against any TT-colluding subset of databases.
  • Symmetric PIR (SPIR): also requires the user learns nothing about other files.

The classical (non-colluding) setting is the cleanest information-theoretic problem and is the starting point.

The non-colluding assumption is strong but realistic in many deployments: PIR databases run by competing entities (different cloud providers, institutional silos) have no incentive to share query information with each other. The TT-colluding extension (Chapter 14) handles softer trust models.

PIR vs. Secure Aggregation: Different Privacy Problems

PropertySecure Aggregation (Part III)PIR (Part IV)
What is hiddenIndividual gradient gk\mathbf{g}_kDesired file index θ\theta
AdversaryServer (sees aggregate)Database (sees query)
Information leakedAggregate kgk\sum_k \mathbf{g}_kIdentity of accessed file
Communication structureMany users → one serverOne user → many databases
Capacity formulaΘ(n2)\Theta(n^2) overhead (Bonawitz)(1+1/N++1/NK1)1(1 + 1/N + \cdots + 1/N^{K-1})^{-1} (Sun-Jafar)
CommIT contributionsOptimality (Ch. 10), ByzSecAgg (Ch. 11), CCESA (Ch. 12)Cache-aided PIR (Ch. 15)

Common Mistake: PIR Is Not Encryption

Mistake:

Confuse PIR with database encryption: assume encrypting the database hides the access pattern.

Correction:

Encryption hides the content of the database from unauthorized parties. PIR hides the access pattern — which encrypted record was requested — even from the authorized database operator who can decrypt everything. The two are orthogonal:

  • No encryption + PIR: server sees content but not which record is requested.
  • Encryption + no PIR: server can't read content but knows which record is requested.
  • Encryption + PIR: full hiding (typical production goal).

The Sun-Jafar capacity is for the access-pattern privacy problem — it does not address content confidentiality, which is a separate (and typically computational, not information-theoretic) concern.

⚠️Engineering Note

Where PIR Matters in Practice

Real-world PIR applications:

  • Private medical records: a researcher queries a hospital database without revealing which patient they are studying.
  • Patent search: a company queries the patent database without revealing which technology area they are exploring.
  • Anonymous web browsing: a user retrieves web pages from an indexed cache without revealing which page they visit.
  • Privacy-preserving advertising: an ad network targets ads without learning user browsing history.
  • Encrypted DNS: DNS queries can be PIR-protected to hide which domains a user visits.

Production PIR deployments are limited (Microsoft SealPIR, Google's Private Set Intersection libraries) because the information-theoretic setting requires non-colluding databases — a strong trust assumption. Computationally-secure PIR (single-database, based on cryptographic assumptions) is more common in practice but with weaker guarantees.

Practical Constraints
  • Information-theoretic PIR: requires N2N \geq 2 non-colluding databases

  • Computationally-secure PIR: weaker guarantees, single database

  • Production deployments: SealPIR, XPIR research libraries

📋 Ref: Beimel survey 2007; Microsoft SealPIR

Historical Note: From Chor et al. (1995) to Sun-Jafar (2017)

1995–present

PIR was introduced by Benny Chor, Oded Goldreich, Eyal Kushilevitz, and Madhu Sudan in their 1995 FOCS paper "Private Information Retrieval." Their construction achieved sub-linear-in-KK download cost — a substantial improvement over the trivial 1/K1/K rate but not matching the information-theoretic optimum.

For 22 years, the capacity of PIR — the maximum achievable rate — remained open. Hua Sun and Syed Jafar settled the question in 2017 with their landmark paper "The Capacity of Private Information Retrieval." Their result is twofold: (i) an explicit capacity-achieving scheme via finite-field IA, (ii) a matching cut-set converse. The capacity formula CPIR=(1+1/N++1/NK1)1C_{\text{PIR}} = (1 + 1/N + \cdots + 1/N^{K-1})^{-1} has the elegant geometric-series form that reveals the algebraic structure of the problem.

Sun-Jafar's result spawned a decade of follow-on work: coded-storage PIR (Tajeddine, El Rouayheb), TT-colluding PIR (Sun, Jafar), symmetric PIR, cache-aided PIR (Wei et al., and the CommIT group's contributions in Chapter 15).

,

Key Takeaway

PIR hides which file you ask for. The classical setting: NN replicated databases, KK files, user retrieves one without revealing the index to any single database. Sun-Jafar (2017) established the capacity formula CPIR=(1+1/N++1/NK1)1C_{\text{PIR}} = (1 + 1/N + \cdots + 1/N^{K-1})^{-1} — much higher than the trivial 1/K1/K rate, especially for many databases. Section 13.2 develops the achievability; §13.3 proves the matching converse.

Quick Check

PIR differs from database encryption in that:

PIR hides the database content; encryption hides the access pattern.

PIR hides which record is being accessed; encryption hides what the records contain.

PIR is a special case of encryption.

PIR is computationally more efficient than encryption.