Ferkans — Interactive Telecom Tutor

Why Coded Storage?

Classical Sun-Jafar PIR (Chapter 13) assumes replicated storage: every one of the $N$ databases stores all $K$ files. Aggregate storage is $K \cdot N$ file-units.

Production cloud storage rarely relies on pure replication. Reed-Solomon coded storage is the norm: an $(N, r)$-MDS code stores each file as $N$ shares, each $1/r$ of the file size, such that any $r$ shares reconstruct the file. Aggregate storage drops to $K \cdot N / r$ file-units, an $r$-fold reduction (typically $3\times$ to $5\times$).

Question: can the user retrieve $W_\theta$ privately when the databases hold MDS-coded shares (not full files)? If yes, at what rate? The answer is yes, with a capacity formula due to Tajeddine and El Rouayheb (2018).

Definition:
MDS-Coded Storage Model

Let $\mathbb{F}_q$ be a finite field, and let $\mathbf{G} \in \mathbb{F}_q^{r \times N}$ be the generator matrix of an $(N, r)$ -MDS code.

Each file $W_k$ is partitioned into $r$ sub-files $W_k = (W_k^{(1)}, \ldots, W_k^{(r)})$ , each of length $L/r$ symbols. Database $n$ stores the $n$ -th column of $[W_k^{(1)}, \ldots, W_k^{(r)}] \cdot \mathbf{G}.$

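MDS property: any $r$ databases collectively store enough information to reconstruct $W_k$, and no set of fewer than $r$ databases can reconstruct it in full. Fewer than $r$ shares may still reveal partial information about $W_k$ (with a systematic code, $r$ of the databases hold sub-files in the clear): an MDS code is an erasure code, not a secret-sharing scheme.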

Total storage: each database stores $K \cdot L / r$ symbols, so the $N$ databases together store $N \cdot K \cdot L / r$ symbols. In file-units (one file-unit is $L$ symbols) the aggregate is $K \cdot N / r$, versus $K \cdot N$ for replication.
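The storage model above can be sketched in a few lines. This is a minimal illustration, not a production encoder: it assumes a prime field $\mathbb{F}_{257}$ and a Vandermonde generator matrix (one common way to obtain an MDS code), and the names `generator`, `encode`, and `decode` are hypothetical.

```python
P = 257  # prime modulus (assumption): arithmetic is in GF(257), which needs P > N

def generator(N, r):
    """r x N Vandermonde matrix; column n is (1, a, a^2, ...) with a = n + 1."""
    return [[pow(n + 1, i, P) for n in range(N)] for i in range(r)]

def encode(subfiles, G):
    """Return N shares; share n is the n-th column of [W^(1) ... W^(r)] * G."""
    r, N, m = len(G), len(G[0]), len(subfiles[0])
    return [[sum(subfiles[i][t] * G[i][n] for i in range(r)) % P
             for t in range(m)] for n in range(N)]

def decode(idx, shares, G):
    """Recover the r sub-files from the shares of the r databases listed in idx."""
    r = len(G)
    # Augmented system: row j states sum_i W^(i) * G[i][idx[j]] = shares[j].
    M = [[G[i][n] for i in range(r)] + list(s) for n, s in zip(idx, shares)]
    for c in range(r):  # Gauss-Jordan elimination mod P
        piv = next(j for j in range(c, r) if M[j][c])
        M[c], M[piv] = M[piv], M[c]
        inv = pow(M[c][c], P - 2, P)  # modular inverse via Fermat's little theorem
        M[c] = [x * inv % P for x in M[c]]
        for j in range(r):
            if j != c and M[j][c]:
                f = M[j][c]
                M[j] = [(x - f * y) % P for x, y in zip(M[j], M[c])]
    return [row[r:] for row in M]

# One file split into r = 3 sub-files of L/r = 2 symbols, stored on N = 5 databases.
subfiles = [[1, 2], [3, 4], [5, 6]]
G = generator(5, 3)
shares = encode(subfiles, G)
recovered = decode([0, 2, 4], [shares[0], shares[2], shares[4]], G)
```

Each share has $L/r$ symbols, and any $r$ of the $N$ shares give an invertible $r \times r$ Vandermonde system, which is what the MDS property requires.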

Example: MDS Storage at $(N, r) = (10, 3)$

A library has $K = 1000$ files, each of size $L = 1$ MB. Compute the aggregate storage cost for: (a) replication across $N = 10$ databases, (b) $(10, 3)$ -MDS coded storage.

Solution

Replication baseline

Each database stores all $1000$ files, $1000 \times 1\text{ MB} = 1\text{ GB}$ . Across $10$ databases: $10\text{ GB}$ aggregate.

$(10, 3)$-MDS coded storage

Each database stores $1/3$ of every file: $1000 \times (1/3)\text{ MB} \approx 333\text{ MB}$ . Across $10$ databases: $\approx 3.33\text{ GB}$ aggregate.

Reduction factor

$10\text{ GB} / 3.33\text{ GB} = 3$, exactly the MDS dimension $r$. This holds in general: MDS storage saves a factor of $r$ in aggregate storage versus replication.
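The arithmetic of this example can be checked with a short helper. The function name `storage_mb` is hypothetical, and 1 GB is taken as 1000 MB to match the figures above.

```python
def storage_mb(K, file_mb, N, r):
    """Return (per-database, aggregate) storage in MB for an (N, r)-MDS layout.

    r = 1 is replication: every database holds every file in full.
    """
    per_db = K * file_mb / r   # each database holds 1/r of every file
    return per_db, N * per_db

rep_per_db, rep_total = storage_mb(K=1000, file_mb=1, N=10, r=1)
mds_per_db, mds_total = storage_mb(K=1000, file_mb=1, N=10, r=3)
reduction = rep_total / mds_total
```

The reduction factor does not depend on $K$, $L$, or $N$; they cancel in the ratio, leaving only $r$.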

PIR rate cost

The savings come at a cost: PIR rate drops below the replicated Sun-Jafar capacity. Exact PIR rate: see Theorem 14.1.1.

Theorem: Coded-Storage PIR Capacity (Tajeddine–El Rouayheb)

For PIR with $K$ files stored across $N$ databases via an $(N, r)$-MDS code, the PIR capacity is $C_{\text{PIR-MDS}}(N, K, r) \;=\; \left(1 + \frac{r}{N} + \frac{r^2}{N^2} + \cdots + \frac{r^{K-1}}{N^{K-1}}\right)^{-1}.$ Summing the geometric series (for $r < N$) gives the equivalent closed form $C_{\text{PIR-MDS}}(N, K, r) \;=\; \frac{N - r}{N \cdot \left(1 - (r/N)^K\right)} \;=\; \frac{1 - r/N}{1 - (r/N)^K}.$ Setting $r = 1$ recovers the classical Sun-Jafar capacity $\left(1 + 1/N + \cdots + 1/N^{K-1}\right)^{-1}$ of Chapter 13.
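As a sanity check on the theorem statement, the series form and the closed form can be compared in exact rational arithmetic. This is a small sketch; the function names are hypothetical, and the case $r = N$ (where the closed form is $0/0$) is handled separately using the series value $1/K$.

```python
from fractions import Fraction

def capacity_series(N, K, r):
    """Capacity as the inverse of 1 + r/N + ... + (r/N)^(K-1)."""
    x = Fraction(r, N)
    return 1 / sum(x**i for i in range(K))

def capacity_closed(N, K, r):
    """Capacity as (1 - r/N) / (1 - (r/N)^K); r = N falls back to 1/K."""
    x = Fraction(r, N)
    if x == 1:
        return Fraction(1, K)
    return (1 - x) / (1 - x**K)

c = capacity_series(5, 4, 2)  # N = 5 databases, K = 4 files, r = 2
```

Using `Fraction` avoids floating-point noise, so the two forms can be compared with `==` for any valid $(N, K, r)$.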

Proof

Achievability

The achievable scheme generalizes Sun-Jafar's scheme to MDS-coded storage. Each query is constructed in $K$ rounds, and each round reuses interference (undesired-file) symbols obtained from other databases as side information. Because of the MDS structure, coded symbols from $r$ distinct databases are needed to decode each symbol of $W_\theta$. Privacy is ensured by drawing the requested linear combinations at random, symmetrically across all $K$ files.

Converse

The converse uses the same cut-set inductive approach as Sun-Jafar (Chapter 13 §13.3), adapted for the MDS structure: the entropy of any $r$ -subset of database contents is bounded by the file content. Symmetrization and induction over $K$ yield the formula. Full details: Tajeddine & El Rouayheb 2018, §III–IV.

Asymptotic limits

For fixed $N, r$ and $K \to \infty$ : $C_{\text{PIR-MDS}}(N, K, r) \to 1 - r/N$ . Compare with classical Sun-Jafar's limit $1 - 1/N$ — the rate gap is exactly the difference $r/N - 1/N = (r-1)/N$ , the cost of MDS coding.
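How quickly the capacity approaches $1 - r/N$ follows from the closed form: the gap equals $(1 - r/N)(r/N)^K / (1 - (r/N)^K)$, which decays geometrically in $K$. The sketch below checks that identity exactly; the function names are hypothetical.

```python
from fractions import Fraction

def capacity(N, K, r):
    """PIR capacity for (N, r)-MDS coded storage with K files (series form)."""
    x = Fraction(r, N)
    return 1 / sum(x**i for i in range(K))

def gap_to_limit(N, K, r):
    """Distance between the finite-K capacity and its K -> infinity limit."""
    return capacity(N, K, r) - (1 - Fraction(r, N))

def gap_formula(N, K, r):
    """Closed-form expression for the same gap (valid for r < N)."""
    x = Fraction(r, N)
    return (1 - x) * x**K / (1 - x**K)

gaps = [gap_to_limit(5, K, 3) for K in (2, 4, 8, 16)]
```

The gap is always positive, so the finite-$K$ capacity sits above the limit and decreases toward it as files are added.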

Example: Coded-Storage Capacity at $N = 5$ , $K = 4$ , varying $r$

Compute $C_{\text{PIR-MDS}}(5, 4, r)$ for $r = 1, 2, 3, 4$ . Compare with $1 - r/N$ (the asymptotic $K \to \infty$ rate).

Solution

$r = 1$ (classical)

$C_{\text{PIR-MDS}}(5, 4, 1) = (1 + 0.2 + 0.04 + 0.008)^{-1} = 1/1.248 \approx 0.801$ . Asymptotic limit: $1 - 1/5 = 0.8$ .

$r = 2$

$C_{\text{PIR-MDS}}(5, 4, 2) = (1 + 0.4 + 0.16 + 0.064)^{-1} = 1/1.624 \approx 0.616$ . Asymptotic limit: $1 - 2/5 = 0.6$ .

$r = 3$

$C_{\text{PIR-MDS}}(5, 4, 3) = (1 + 0.6 + 0.36 + 0.216)^{-1} = 1/2.176 \approx 0.460$ . Asymptotic limit: $1 - 3/5 = 0.4$ .

$r = 4$

$C_{\text{PIR-MDS}}(5, 4, 4) = (1 + 0.8 + 0.64 + 0.512)^{-1} = 1/2.952 \approx 0.339$ . Asymptotic limit: $1 - 4/5 = 0.2$ .

Operational

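In the $K \to \infty$ limit, each unit increase in $r$ (more storage savings) costs exactly $1/N = 0.2$ in PIR rate. At $K = 4$ the successive drops are smaller and shrink as $r$ grows: $\approx 0.185$, $0.156$, and $0.121$. The trade-off therefore depends on $K$. For large $K$, moving from replication to dimension $r$ saves a fraction $(r-1)/r$ of the aggregate storage and costs $(r-1)/N$ in rate.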
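The per-step rate cost in this example can be recomputed directly from the capacity formula. A minimal sketch with a hypothetical `capacity` helper:

```python
def capacity(N, K, r):
    """PIR capacity for (N, r)-MDS coded storage with K files (series form)."""
    x = r / N
    return 1.0 / sum(x**i for i in range(K))

N, K = 5, 4
rates = [capacity(N, K, r) for r in (1, 2, 3, 4)]
# Rate lost by each unit increase in r: C(r) - C(r + 1).
drops = [a - b for a, b in zip(rates, rates[1:])]
asymptotic_drop = 1 / N  # per-step cost in the K -> infinity limit
```

Comparing `drops` with `asymptotic_drop` shows how far a finite library is from the large-$K$ regime.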

Coded-Storage PIR: Storage vs. Rate Trade-off

For fixed $N$ databases and $K$ files, plot the PIR rate $C_{\text{PIR-MDS}}(N, K, r)$ as a function of MDS dimension $r$ . Each point on the curve corresponds to a different storage-rate operating point: small $r$ = high rate but high storage; large $r$ = low rate but small storage. The curve traces the Pareto-optimal frontier for coded-storage PIR.

Parameters: $N = 10$ databases, $K = 5$ files.

Storage-Rate Pareto Points at $N = 10, K = 5$

$r$	Aggregate Storage (file-units)	Storage Reduction	PIR Rate	Rate / Storage Index (rate per file-unit)
1 (replication)	$50$	$1\times$	$\approx 0.900$	$0.018$
2	$25$	$2\times$	$\approx 0.800$	$0.032$
3	$16.7$	$3\times$	$\approx 0.702$	$0.042$
5	$10$	$5\times$	$\approx 0.516$	$0.052$
9	$5.6$	$9\times$	$\approx 0.244$	$0.044$
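The rows of this table follow mechanically from $K \cdot N / r$ and the capacity formula, so they can be regenerated for any $(N, K)$. The sketch below uses hypothetical names and defines the index as rate divided by aggregate storage in file-units.

```python
def capacity(N, K, r):
    """PIR capacity for (N, r)-MDS coded storage with K files (series form)."""
    x = r / N
    return 1.0 / sum(x**i for i in range(K))

def pareto_row(N, K, r):
    """Return (r, aggregate storage, storage reduction, rate, rate/storage index)."""
    storage = K * N / r              # aggregate storage in file-units
    rate = capacity(N, K, r)
    return r, storage, r, rate, rate / storage

N, K = 10, 5
rows = [pareto_row(N, K, r) for r in (1, 2, 3, 5, 9)]
for r, storage, reduction, rate, index in rows:
    print(f"r={r}  storage={storage:.1f}  reduction={reduction}x  "
          f"rate={rate:.3f}  index={index:.3f}")

# Sweeping every r (not only the tabulated ones) locates the index maximum.
best_r = max(range(1, N), key=lambda r: pareto_row(N, K, r)[4])
```

Sweeping all values of $r$ is worthwhile because the tabulated rows skip several operating points.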

Reading the Pareto Frontier

The rate/storage index column above is the PIR rate per file-unit of aggregate storage, a simple efficiency proxy. It peaks in the middle of the range (among the tabulated rows, at $r = 5$) rather than at either extreme. Under this proxy, that is the operating point with the highest rate per unit of storage.

In production, the choice of $r$ depends on economics: storage cost per byte versus bandwidth cost per byte. When storage is cheap (e.g., archival), choose a small $r$, close to replication. When storage is expensive (e.g., a DRAM caching layer), choose a larger $r$, that is, heavier MDS coding. A cloud PIR deployment should benchmark both axes to find its optimum.

⚠️Engineering Note

Deploying Coded-Storage PIR

Practical guidelines for production coded-storage PIR deployments:

Reuse existing erasure-coded storage: cloud providers already use Reed-Solomon coded storage for redundancy. The PIR layer can be retrofitted on top.
Choose $(N, r)$ from existing cluster structure: if objects are already stored with a code in which any $3$ of the $N$ shares reconstruct the object, $r = 3$ is natural (plain $3\times$ replication corresponds to $r = 1$, $N = 3$). Modifying the storage layer for PIR is typically not worth it.
Field size: $\mathbb{F}_q$ with $q \geq N$ suffices for Reed-Solomon MDS codes. Use $q = 2^8$ for byte-aligned arithmetic, which supports up to $N = 256$ databases.
Latency budget: the $K$ rounds of the scheme are stages of a single non-interactive query, so each database is contacted once. What grows with $K$ is the size of the query and the answer, so for large $K$ transfer and computation time dominate latency.
Verifiability: coded-storage PIR is not Byzantine-robust unless extended (see Tajeddine & El Rouayheb 2018, §VI). Production deployments with untrusted databases require additional layers.

Practical Constraints

• Field size: $q \geq N$ for MDS codes
• Latency: one query and one answer per database; query and answer size grow with $K$
• Storage already coded: take $r$ from the existing code
• Byzantine robustness: requires extension layer
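To make the byte-aligned field-size recommendation concrete, here is a minimal sketch of $\mathbb{F}_{2^8}$ multiplication and inversion. The reduction polynomial `0x11D` ($x^8 + x^4 + x^3 + x^2 + 1$) is an assumption: it is a common choice in Reed-Solomon implementations, but the source does not specify one, and the function names are hypothetical.

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8): carry-less shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:          # degree reached 8: reduce modulo the polynomial
            a ^= poly
        b >>= 1
    return result

def gf256_inv(a):
    """Inverse of a nonzero byte, computed as a^254 (the group has order 255)."""
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        e >>= 1
    return result

# q >= N: with q = 2^8 there are 256 distinct evaluation points available.
MAX_DATABASES = 256
product = gf256_mul(3, 7)  # (x + 1)(x^2 + x + 1) = x^3 + 1
```

Every symbol fits in one byte, so shares can be processed without bit packing; real deployments would typically replace the bit loop with lookup tables or SIMD instructions.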

📋 Ref: Tajeddine-El Rouayheb 2018; AWS Redundancy Zone codes

Common Mistake: Replication vs. MDS — Storage Cost Confusion

Mistake:

Assume that an $(N, r)$ -MDS code with $r = 1$ means "one database stores everything" (and the others store nothing).

Correction:

An $(N, r)$-MDS code with $r = 1$ is replication: each of the $N$ databases stores a complete copy of every file. The MDS dimension $r = 1$ means "any 1 database reconstructs the library." For $r = N$, each database stores a unique $1/N$-th portion (no redundancy: all $N$ shares are needed). General $r \in [1, N]$ interpolates between full replication ($r = 1$) and zero redundancy ($r = N$). In standard $(n, k)$ coding-theory notation this is $n = N$, $k = r$, so $r$ is the code dimension and the code rate is $r/N$; readers used to $(n, k)$ Reed-Solomon notation should not read $r$ as a rate or a replication factor.

Key Takeaway

Coded-storage PIR generalizes Sun-Jafar to MDS-coded databases. The capacity formula $C_{\text{PIR-MDS}}(N, K, r) = (\sum_{i=0}^{K-1} (r/N)^i)^{-1}$ replaces classical $r = 1$ with arbitrary $r$ . Storage cost reduces by factor $r$ ; PIR rate decreases monotonically with $r$ . Production deployments should match $r$ to the existing storage redundancy.

Quick Check

For a coded-storage PIR system with $N = 6$ databases and $K = 3$ files, what is the rate cost of doubling the MDS dimension from $r = 1$ to $r = 2$ (relative to classical PIR)?

Rate decreases from $\approx 0.837$ to $\approx 0.692$ (a $\sim 17\%$ relative reduction).

Rate doubles, since storage halves.

Rate stays the same; only storage changes.

Rate becomes zero (impossible to do PIR with coded storage).