Correlated Files: A CommIT Extension

Real Libraries Are Correlated

The baseline MAN model assumes files are independent. In practice, file libraries exhibit massive structural correlation: the same movie in multiple resolutions, compressed and uncompressed versions, variants for different language tracks, HDR and SDR encodings of the same scene. If the server knows this correlation, can it exploit it to shrink the delivery rate?

The answer β€” yes, substantially β€” was developed by Wan, Tuninetti, Ji, and Caire in a line of work culminating in a 2020 paper on coded caching with correlated files. The key technique is an interference-alignment-style delivery that jointly encodes across files, exploiting the correlation structure during both placement and delivery.

Definition:

Correlated Files Model

In the correlated-files model, each file WnW_n is decomposed as Wnβ€…β€Š=β€…β€Š(Wcommon, Δn),W_n \;=\; (W_{\text{common}},\, \Delta_n), where WcommonW_{\text{common}} is a common component shared by all files (the parts identical across variants), and Ξ”n\Delta_n is a variant- specific component of size (1βˆ’Ο)F(1 - \rho) F bits, with ρ∈[0,1]\rho \in [0, 1] the correlation parameter. ρ=0\rho = 0 recovers the independent-files baseline; ρ=1\rho = 1 means all files are identical.

The model is simplified: in practice correlations are richer (multi- level hierarchies, partial overlaps). But this two-component decomposition captures the first-order effect and admits a clean rate analysis.

Theorem: Rate Formula for Correlated Files

For the correlated-files model with correlation ρ\rho, KK users, NN variants, and per-user cache MM, the achievable worst-case rate under the Wan-Tuninetti-Ji-Caire (WTJC) scheme is RWTJC(M,ρ)β€…β€Š=β€…β€ŠΟβ‹…Rcommon(Mc)+(1βˆ’Ο)β‹…RMAN(Mv),R_{\text{WTJC}}(M, \rho) \;=\; \rho \cdot R_{\text{common}}(M_{\text{c}}) + (1 - \rho) \cdot R_{\text{MAN}}(M_{\text{v}}), where the total cache MM is split into McM_{\text{c}} for the common part and MvM_{\text{v}} for the variants, and RcommonR_{\text{common}} is the common-part delivery rate.

Split the cache between common and variant content. The common content is identical across all files, so it acts as a shared library of effective size 1 file β€” delivery of the common part is nearly free (all users want the same thing). The variants behave as an independent-files problem of reduced "effective library" size; MAN applies there.

πŸŽ“CommIT Contribution(2020)

Coded Caching with Correlated Files

K. Wan, D. Tuninetti, M. Ji, G. Caire β€” IEEE Transactions on Information Theory, vol. 66, no. 2

This result shows how file correlation β€” ubiquitous in real libraries (multiple resolutions of the same movie, language dubs, HDR/SDR variants) β€” reduces the coded-caching delivery rate substantially. The key innovation is an interference-alignment-based delivery that jointly encodes across correlated files, achieving gains beyond what naive MAN applied to each correlated group provides.

Key insight. The uncoded "variant" portion of the library behaves like an independent-files MAN problem; the "common" portion is a single broadcast. Optimal cache allocation between the two portions depends on the correlation ρ\rho β€” more correlation means allocating more cache to variants. The gain over independent-files MAN can be substantial (e.g., 2Γ—2\times or more for video libraries with many resolutions per title).

The result illustrates the CommIT program's broader theme: extending the coded-caching machinery to richer library structures. Later work extends to heterogeneous caches, non-uniform popularity, and multi- resolution streaming.

coded-cachingcommitcorrelated-filesinterference-alignmentView Paper β†’

Correlated-Files Gain vs. Correlation ρ\rho

Normalized WTJC delivery rate R(ρ)/R(0)R(\rho)/R(0) as a function of the correlation coefficient ρ\rho. For ρ=0\rho = 0 (independent files), the rate equals MAN's baseline. As ρ\rho grows, the rate drops because the common-component cache allocation shrinks the effective variant-library size. By ρ=0.8\rho = 0.8 (typical for multi-resolution video libraries), the rate is roughly one-fifth of the independent- files rate β€” a substantial practical gain.

Parameters
10
20
0.2

Example: Streaming Video with Resolution Variants

A video library has 1000 titles, each encoded at 4 resolutions (480p, 720p, 1080p, 4K), so N=4000N = 4000 files. Users select a resolution based on their connection; the files for the same title share the lower- resolution encoding as a base layer, with correlation Οβ‰ˆ0.7\rho \approx 0.7. For K=100K = 100 users and per-user cache M=400M = 400 files (10%), compare MAN (independent assumption) with WTJC.

Common Mistake: Correlation Coefficient Is a Model, Not a Measurement

Mistake:

Reading "ρ=0.7\rho = 0.7" as a precise experimentally measured quantity.

Correction:

The WTJC model is a stylized decomposition β€” Wn=(Wcommon,Ξ”n)W_n = (W_{\text{common}}, \Delta_n) β€” with ρ\rho the fraction of common content. Real files do not have a single scalar "correlation" in this operational sense; the decomposition is a design choice that approximates the true file relationship. In practice, ρ\rho is a tuning parameter, and the resulting rate is a lower bound on what a scheme exploiting the full correlation structure could achieve.

For a multi-resolution video library, a better (but more complex) model is hierarchical: base + enhancement layers. The WTJC analysis generalizes but the closed-form rate loses its clean shape.