Coded-Storage PIR

Why Coded Storage?

Classical Sun-Jafar PIR (Chapter 13) assumes replicated storage: every one of the NN databases stores all KK files. Aggregate storage is KNK \cdot N file-units.

Production cloud storage rarely replicates. Reed- Solomon coded storage is the norm: an (N,r)(N, r)-MDS code stores each file as NN shares such that any rr shares reconstruct the file. Aggregate storage drops to KN/rK \cdot N / r file-units — typically a 3×3\times5×5\times reduction.

Question: can the user retrieve WθW_\theta privately when the databases hold MDS-coded shares (not full files)? If yes, at what rate? The answer is yes, with a capacity formula due to Tajeddine and El Rouayheb (2018).

Definition:

MDS-Coded Storage Model

Let Fq\mathbb{F}_q be a finite field, and let GFqr×N\mathbf{G} \in \mathbb{F}_q^{r \times N} be the generator matrix of an (N,r)(N, r)-MDS code.

Each file WkW_k is partitioned into rr sub-files Wk=(Wk(1),,Wk(r))W_k = (W_k^{(1)}, \ldots, W_k^{(r)}), each of length L/rL/r symbols. Database nn stores the nn-th column of [Wk(1),,Wk(r)]G.[W_k^{(1)}, \ldots, W_k^{(r)}] \cdot \mathbf{G}.

MDS property: any rr databases collectively store enough information to reconstruct WkW_k. No subset of fewer than rr databases stores any information about WkW_k (information-theoretically).

Total storage: KLK \cdot L symbols across NN databases, each storing KL/rK \cdot L / r symbols. Aggregate: KN/rK \cdot N / r file-units (vs. KNK \cdot N for replication).

Example: MDS Storage at (N,r)=(10,3)(N, r) = (10, 3)

A library has K=1000K = 1000 files, each of size L=1L = 1 MB. Compute the aggregate storage cost for: (a) replication across N=10N = 10 databases, (b) (10,3)(10, 3)-MDS coded storage.

Theorem: Coded-Storage PIR Capacity (Tajeddine–El Rouayheb)

For PIR with KK files stored across NN databases via an (N,r)(N, r)-MDS code, the PIR capacity is CPIR-MDS(N,K,r)  =  (1+rN+r2N2++rK1NK1)1.C_{\text{PIR-MDS}}(N, K, r) \;=\; \left(1 + \frac{r}{N} + \frac{r^2}{N^2} + \cdots + \frac{r^{K-1}}{N^{K-1}}\right)^{-1}. Equivalently: CPIR-MDS(N,K,r)  =  NrN(1(r/N)K)(1rN)  =  1r/N1(r/N)K.C_{\text{PIR-MDS}}(N, K, r) \;=\; \frac{N - r}{N \cdot \left(1 - (r/N)^K\right)} \cdot \left(1 - \frac{r}{N}\right) \;=\; \frac{1 - r/N}{1 - (r/N)^K}. Setting r=1r = 1 recovers the classical Sun-Jafar rate (since the geometric series in r/N=1/Nr/N = 1/N matches Chapter 13's formula).

Example: Coded-Storage Capacity at N=5N = 5, K=4K = 4, varying rr

Compute CPIR-MDS(5,4,r)C_{\text{PIR-MDS}}(5, 4, r) for r=1,2,3,4r = 1, 2, 3, 4. Compare with 1r/N1 - r/N (the asymptotic KK \to \infty rate).

Coded-Storage PIR: Storage vs. Rate Trade-off

For fixed NN databases and KK files, plot the PIR rate CPIR-MDS(N,K,r)C_{\text{PIR-MDS}}(N, K, r) as a function of MDS dimension rr. Each point on the curve corresponds to a different storage-rate operating point: small rr = high rate but high storage; large rr = low rate but small storage. The curve traces the Pareto-optimal frontier for coded-storage PIR.

Parameters
10
5

Storage-Rate Pareto Points at N=10,K=5N = 10, K = 5

rrAggregate Storage (file-units)Storage ReductionPIR RateRate / Storage Index
1 (replication)50501×1\times0.901\approx 0.9010.0180.018
225252×2\times0.831\approx 0.8310.0330.033
316.716.73×3\times0.745\approx 0.7450.0450.045
510105×5\times0.555\approx 0.5550.0560.056
95.65.69×9\times0.150\approx 0.1500.0270.027

Reading the Pareto Frontier

The rate/storage index column above is rate per file-unit of aggregate storage — a simple efficiency proxy. It peaks somewhere in the middle (here, r=5r = 5). This is the throughput-optimal operating point for the given cost.

In production, the choice of rr depends on economics: storage cost per byte versus bandwidth cost per byte. When storage is cheap (e.g., archival), small rr (replicate). When storage is expensive (e.g., DRAM caching layer), larger rr (heavy MDS coding). A cloud PIR deployment should benchmark both axes to find its optimum.

⚠️Engineering Note

Deploying Coded-Storage PIR

Practical guidelines for production coded-storage PIR deployments:

  • Reuse existing erasure-coded storage: cloud providers already use Reed-Solomon coded storage for redundancy. The PIR layer can be retrofitted on top.

  • Choose (N,r)(N, r) from existing cluster structure: if storage is replicated 3×3\times across regions, r=3r = 3 is natural. Modifying the storage layer for PIR is typically not worth it.

  • Field size: Fq\mathbb{F}_q with qNq \geq N suffices for MDS codes (Reed-Solomon). Use q=28q = 2^8 for byte-aligned arithmetic.

  • Latency budget: each query round requires one network round-trip. The PIR scheme uses O(K)O(K) rounds; for large KK, latency dominates.

  • Verifiability: coded-storage PIR is not Byzantine-robust unless extended (see Tajeddine & El Rouayheb 2018, §VI). Production deployments with untrusted databases require additional layers.

Practical Constraints
  • Field size: qNq \geq N for MDS codes

  • Latency: O(K)O(K) network round-trips

  • Storage already coded: rr from existing redundancy

  • Byzantine robustness: requires extension layer

📋 Ref: Tajeddine-El Rouayheb 2018; AWS Redundancy Zone codes

Common Mistake: Replication vs. MDS — Storage Cost Confusion

Mistake:

Assume that an (N,r)(N, r)-MDS code with r=1r = 1 means "one database stores everything" (and the others store nothing).

Correction:

An (N,r)(N, r)-MDS code with r=1r = 1 is replication: each of the NN databases stores a complete copy of every file. The MDS rate r=1r = 1 means "any 1 database reconstructs the library." For r=Nr = N, each database stores a unique 1/N1/N-th portion (no redundancy beyond the bare minimum). General r[1,N]r \in [1, N] interpolates between full replication (r=1r = 1) and zero redundancy (r=Nr = N). This nomenclature is standard in coding theory but can confuse information-theory readers used to (n,k)(n, k) Reed-Solomon notation.

Key Takeaway

Coded-storage PIR generalizes Sun-Jafar to MDS-coded databases. The capacity formula CPIR-MDS(N,K,r)=(i=0K1(r/N)i)1C_{\text{PIR-MDS}}(N, K, r) = (\sum_{i=0}^{K-1} (r/N)^i)^{-1} replaces classical r=1r = 1 with arbitrary rr. Storage cost reduces by factor rr; PIR rate decreases monotonically with rr. Production deployments should match rr to the existing storage redundancy.

Quick Check

For a coded-storage PIR system with N=6N = 6 databases and K=3K = 3 files, what is the rate cost of doubling the MDS dimension from r=1r = 1 to r=2r = 2 (relative to classical PIR)?

Rate decreases from 0.857\approx 0.857 to 0.692\approx 0.692 (a 19%\sim 19\% relative reduction).

Rate doubles, since storage halves.

Rate stays the same; only storage changes.

Rate becomes zero (impossible to do PIR with coded storage).