The Index Coding Problem
Why Index Coding?
After Chapters 2–3 we know what the MAN scheme does: combinatorial placement, XOR delivery, rate . But a natural follow-up question is: is this the best XOR delivery given the MAN placement? Or more ambitiously, given any placement, what is the minimum-length broadcast that satisfies all users?
This question has a name — the index coding problem — and it predates coded caching by a decade. In this chapter we formalize it, connect it rigorously to coded caching, and explain why the MAN scheme admits a surprisingly clean index-coding solution despite the general problem being NP-hard.
Definition: Index Coding Problem
Index Coding Problem
An index coding instance is a tuple where:
- is the number of source messages , each of size bits.
- is the number of receivers. Receiver has a side information set , meaning it already possesses .
- Receiver demands file with .
A code for consists of an encoder and decoders for every . The rate of the code is in file units.
The optimal rate is
The index coding problem was introduced by Birk and Kol (1998) to model satellite broadcast systems where receivers have heterogeneous prior-held content. The information-theoretic treatment (Bar-Yossef et al. 2011) connected it to graph theory — a connection we exploit in §4.2.
Example: The Butterfly Index Coding Instance
Two files , two receivers. Receiver 1 has , wants . Receiver 2 has , wants . What is the optimal rate ?
Uncoded lower bound
Each receiver is missing one file; without coding the server must send at least one file per receiver. Uncoded rate: .
Coded scheme
Send (one file's worth of bits). Receiver 1 computes using its side information . Receiver 2 computes . Rate: .
Optimality
because each receiver must obtain new bits. Hence . The XOR saves a factor of 2 — this tiny instance foreshadows the -factor coded caching gain.
Key Takeaway
The butterfly instance is the atomic unit of coded multicasting. Every XOR in the MAN delivery is a butterfly: two (or more) receivers, each with one file as side information and wanting the other. The MAN scheme exploits such butterflies simultaneously — one per -subset.
The Butterfly Index Coding Instance
Index coding
A network-coding problem where a single sender has messages to distribute to receivers via a broadcast link. Each receiver has a priori knowledge of a subset of messages (side information) and demands one specific message. The objective is to minimize the broadcast length.
Related: Coded multicasting, Conflict Graph
Conflict graph
A graph whose vertices correspond to the demand pairs and whose edges encode when two demand pairs conflict — i.e., when a single coded transmission cannot simultaneously satisfy both receivers. The chromatic number of upper bounds the optimal index coding rate.
Related: Index coding
Side information
Prior knowledge at a receiver, typically the set of messages already held. In coded caching, side information is the cache . In general index coding, side information can be any subset of the library.
Related: Index coding
Coded Caching Is an Index Coding Problem
Fix a placement scheme for coded caching. The delivery phase, with a given demand vector , is exactly an index coding instance:
- The source messages are the subfiles .
- Each user is a receiver with side information and demand (which requires the user to collect all missing subfiles for ).
- The broadcast rate is exactly the index coding rate.
The MAN scheme is thus a particular index coding scheme for a particular family of index coding instances — those induced by the combinatorial MAN placement. A general index coding theorem for arbitrary placements would immediately yield a general coded caching rate formula. This is why the two problems are intertwined.
Historical Note: From Satellite Broadcasts to Index Coding
1998–2014Birk and Kol introduced the index coding problem in 1998 to model "Informed-Source Coding On Demand" (ISCOD) in satellite broadcast systems. Their motivating example was a satellite distributing encrypted pay-per-view video to subscribers, where each subscriber's receiver had cached previous episodes (side information) and wanted the latest episode (demand).
The information-theoretic formalization came a decade later, when Bar-Yossef, Birk, Jayram, and Kol (2011) characterized the linear index coding rate as the minrank of the side-information graph — a matrix-theoretic invariant that is in general NP-hard to compute. Maleki, Cadambe, and Jafar (2014) connected index coding to interference alignment, showing that the asymptotic rate equals the fractional chromatic number for certain families.
Coded caching's arrival in 2014 brought a fresh perspective: one could design the side information (cache contents) to ensure that the resulting index coding instance is easy. This is the conceptual leap that lets coded caching achieve a clean rate formula where general index coding cannot.
General Index Coding Is Not Deployed — Coded Caching Is
A student reading the index coding literature might wonder why industry does not deploy it widely. The answer lies in the complexity of the problem:
- NP-hardness. Finding the optimal linear index code is equivalent to computing the minrank of a graph — NP-hard in general.
- Instance-specific design. Each demand vector is a different index coding instance. A system that re-solves the index coding problem on every peak-hour delivery round is infeasible.
- Coded caching sidesteps both. The MAN placement pre-designs the side information so that the resulting index coding instance is structurally simple — the chromatic number equals the MAN rate exactly. The delivery algorithm is a constant-time lookup: for every -subset, send one XOR.
In short: coded caching is a specialization of index coding that is simultaneously tractable and close to optimal. This is why it is a practical framework, not just a theoretical one.
- •
General index coding minrank is NP-hard (Peeters 1996 complexity)
- •
MAN instances have known optimal linear code of rate
- •
Structured (combinatorial) side information enables efficient delivery