The Index Coding Problem

Why Index Coding?

After Chapters 2–3 we know what the MAN scheme does: combinatorial placement, XOR delivery, rate R=K(1μ)/(1+Kμ)R = K(1 - \mu)/(1 + K\mu). But a natural follow-up question is: is this the best XOR delivery given the MAN placement? Or more ambitiously, given any placement, what is the minimum-length broadcast that satisfies all users?

This question has a name — the index coding problem — and it predates coded caching by a decade. In this chapter we formalize it, connect it rigorously to coded caching, and explain why the MAN scheme admits a surprisingly clean index-coding solution despite the general problem being NP-hard.

Definition:

Index Coding Problem

An index coding instance is a tuple I=(K,N,(Sk)k=1K,(dk)k=1K)\mathcal{I} = (K, N, (\mathcal{S}_k)_{k=1}^K, (d_k)_{k=1}^K) where:

  • NN is the number of source messages W1,,WNW_1, \ldots, W_N, each of size FF bits.
  • KK is the number of receivers. Receiver kk has a side information set Sk[N]\mathcal{S}_k \subseteq [N], meaning it already possesses {Wn:nSk}\{W_n : n \in \mathcal{S}_k\}.
  • Receiver kk demands file WdkW_{d_k} with dkSkd_k \notin \mathcal{S}_k.

A code for I\mathcal{I} consists of an encoder X:{W1,,WN}F2FX: \{W_1, \ldots, W_N\} \to \mathbb{F}_2^{\ell F} and decoders gk:(X,{Wn}nSk)Wdkg_k: (X, \{W_n\}_{n \in \mathcal{S}_k}) \to W_{d_k} for every kk. The rate of the code is R=R = \ell in file units.

The optimal rate is β(I)  =  inf{R:a valid code of rate R exists}.\beta(\mathcal{I}) \;=\; \inf\{R : \text{a valid code of rate } R \text{ exists}\}.

The index coding problem was introduced by Birk and Kol (1998) to model satellite broadcast systems where receivers have heterogeneous prior-held content. The information-theoretic treatment (Bar-Yossef et al. 2011) connected it to graph theory — a connection we exploit in §4.2.

,

Example: The Butterfly Index Coding Instance

Two files W1,W2W_1, W_2, two receivers. Receiver 1 has S1={2}\mathcal{S}_1 = \{2\}, wants W1W_1. Receiver 2 has S2={1}\mathcal{S}_2 = \{1\}, wants W2W_2. What is the optimal rate β(I)\beta(\mathcal{I})?

Key Takeaway

The butterfly instance is the atomic unit of coded multicasting. Every XOR in the MAN delivery is a butterfly: two (or more) receivers, each with one file as side information and wanting the other. The MAN scheme exploits (Kt+1)\binom{K}{t+1} such butterflies simultaneously — one per (t+1)(t+1)-subset.

The Butterfly Index Coding Instance

Two receivers, two files, each holds the other's demand as side information. A single XOR message X=W1W2X = W_1 \oplus W_2 simultaneously satisfies both. Rate: R=1R = 1 file (vs. uncoded R=2R = 2). This is the atomic unit of coded multicasting — every MAN XOR message is a generalized butterfly.

Index coding

A network-coding problem where a single sender has NN messages to distribute to KK receivers via a broadcast link. Each receiver has a priori knowledge of a subset of messages (side information) and demands one specific message. The objective is to minimize the broadcast length.

Related: Coded multicasting, Conflict Graph

Conflict graph

A graph GG whose vertices correspond to the KK demand pairs (dk,Sk)(d_k, \mathcal{S}_k) and whose edges encode when two demand pairs conflict — i.e., when a single coded transmission cannot simultaneously satisfy both receivers. The chromatic number of GG upper bounds the optimal index coding rate.

Related: Index coding

Side information

Prior knowledge at a receiver, typically the set of messages already held. In coded caching, side information is the cache Zk\mathcal{Z}_{k}. In general index coding, side information can be any subset of the library.

Related: Index coding

Coded Caching Is an Index Coding Problem

Fix a placement scheme for coded caching. The delivery phase, with a given demand vector d\mathbf{d}, is exactly an index coding instance:

  • The NN source messages are the subfiles {Wn,S}\{W_{n,\mathcal{S}}\}.
  • Each user kk is a receiver with side information Zk\mathcal{Z}_{k} and demand WdkW_{d_k} (which requires the user to collect all missing subfiles Wdk,TW_{d_k, \mathcal{T}} for T∌k\mathcal{T} \not\ni k).
  • The broadcast rate RR is exactly the index coding rate.

The MAN scheme is thus a particular index coding scheme for a particular family of index coding instances — those induced by the combinatorial MAN placement. A general index coding theorem for arbitrary placements would immediately yield a general coded caching rate formula. This is why the two problems are intertwined.

Historical Note: From Satellite Broadcasts to Index Coding

1998–2014

Birk and Kol introduced the index coding problem in 1998 to model "Informed-Source Coding On Demand" (ISCOD) in satellite broadcast systems. Their motivating example was a satellite distributing encrypted pay-per-view video to subscribers, where each subscriber's receiver had cached previous episodes (side information) and wanted the latest episode (demand).

The information-theoretic formalization came a decade later, when Bar-Yossef, Birk, Jayram, and Kol (2011) characterized the linear index coding rate as the minrank of the side-information graph — a matrix-theoretic invariant that is in general NP-hard to compute. Maleki, Cadambe, and Jafar (2014) connected index coding to interference alignment, showing that the asymptotic rate equals the fractional chromatic number for certain families.

Coded caching's arrival in 2014 brought a fresh perspective: one could design the side information (cache contents) to ensure that the resulting index coding instance is easy. This is the conceptual leap that lets coded caching achieve a clean rate formula where general index coding cannot.

⚠️Engineering Note

General Index Coding Is Not Deployed — Coded Caching Is

A student reading the index coding literature might wonder why industry does not deploy it widely. The answer lies in the complexity of the problem:

  1. NP-hardness. Finding the optimal linear index code is equivalent to computing the minrank of a graph — NP-hard in general.
  2. Instance-specific design. Each demand vector is a different index coding instance. A system that re-solves the index coding problem on every peak-hour delivery round is infeasible.
  3. Coded caching sidesteps both. The MAN placement pre-designs the side information so that the resulting index coding instance is structurally simple — the chromatic number equals the MAN rate exactly. The delivery algorithm is a constant-time lookup: for every (t+1)(t+1)-subset, send one XOR.

In short: coded caching is a specialization of index coding that is simultaneously tractable and close to optimal. This is why it is a practical framework, not just a theoretical one.

Practical Constraints
  • General index coding minrank is NP-hard (Peeters 1996 complexity)

  • MAN instances have known optimal linear code of rate (Kt)/(t+1)(K-t)/(t+1)

  • Structured (combinatorial) side information enables efficient delivery