Non-Causal State Information: The Gel'fand-Pinsker Theorem

What If the Encoder Knows the Future?

Suppose the encoder knows the entire state sequence $S^n$ before transmission begins (non-causal state information). This may seem unrealistic at first, but it arises naturally in several important settings:

  • Broadcast channels: the transmitter knows the messages for all users, so the signal intended for user 2 is "state" known non-causally when encoding user 1's message.
  • Precoding: in MIMO downlink, the base station generates the interference and therefore knows it perfectly in advance.
  • Digital watermarking: the host signal is known to the embedder.

The Gel'fand-Pinsker theorem gives the capacity for this setting, and its Gaussian specialization (Costa's theorem) reveals the astonishing result that known interference can be canceled for free.

Theorem: The Gel'fand-Pinsker Theorem

The capacity of a DMC with state $(P_{Y|X,S}, P_S)$, where the state $S^n$ is known non-causally at the encoder but not at the decoder, is

$$C = \max_{P_{U,X|S}} \bigl[I(U; Y) - I(U; S)\bigr],$$

where the maximization is over all joint distributions $P_{U,X|S}$ with $|\mathcal{U}| \leq |\mathcal{X}| \cdot |\mathcal{S}| + 1$, and the mutual informations are computed with $(U, X, S) \sim P_{U,X|S}\, P_S$ and $Y \sim P_{Y|X,S}$.

The formula $I(U; Y) - I(U; S)$ has a natural interpretation: $I(U; Y)$ is the total information the decoder receives, and $I(U; S)$ is the "cost" of correlating the auxiliary codeword $U$ with the state. The encoder must embed the message into a codeword $U^n$ that is jointly typical with $S^n$ (covering), and the decoder must recover $U^n$ from $Y^n$ (packing). The net rate is the packing capacity minus the covering cost.
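A minimal numerical check of the formula on a toy channel: for $Y = X \oplus S$ with $S \sim \mathrm{Bern}(1/2)$, choosing $U \sim \mathrm{Bern}(1/2)$ independent of $S$ and precoding $X = U \oplus S$ gives $I(U;Y) - I(U;S) = 1$ bit, the interference-free capacity. The sketch below (helper names are illustrative) evaluates both mutual informations from the joint pmfs:

```python
import numpy as np

def mi(pxy):
    """Mutual information I(X;Y) in bits from a joint pmf matrix pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log2(pxy[m] / (px * py)[m])).sum())

# Toy channel Y = X XOR S with S ~ Bern(1/2) known at the encoder.
# Choose U ~ Bern(1/2) independent of S and precode X = U XOR S.
p_uy = np.zeros((2, 2))   # joint pmf of (U, Y)
p_us = np.zeros((2, 2))   # joint pmf of (U, S)
for u in (0, 1):
    for s in (0, 1):
        x = u ^ s
        y = x ^ s          # = u: the state cancels out
        p_uy[u, y] += 0.25
        p_us[u, s] += 0.25

rate = mi(p_uy) - mi(p_us)   # I(U;Y) - I(U;S)
print(rate)                  # 1.0 bit: known interference costs nothing here
```

Here $I(U;S) = 0$ because $U$ was deliberately chosen independent of $S$; the covering cost only appears when correlating $U$ with the state actually buys something, as in Costa's Gaussian setting below.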


Random binning

A coding technique where codewords are randomly partitioned into bins. Used in the Gel'fand-Pinsker theorem (and Slepian-Wolf coding) to handle side information. The encoder selects a codeword from the correct bin that is compatible with the state.

Related: Channel state information (CSI)

Gel'fand-Pinsker Random Binning

The encoder partitions $2^{n(R+R')}$ codewords into $2^{nR}$ bins. Given message $m$ and state $S^n$, it searches bin $m$ for a codeword jointly typical with $S^n$. The covering cost $I(U;S)$ determines how many codewords per bin are needed for this search to succeed: the bin size $2^{nR'}$ must exceed $2^{nI(U;S)}$.
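The binning search can be sketched as a toy simulation, with a simple agreement-count test standing in for joint typicality (codebook sizes, seed, and the threshold are illustrative choices, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 16            # block length (toy-sized)
num_bins = 4      # 2^{nR} message bins
bin_size = 256    # ~2^{nR'} codewords per bin; need R' > I(U;S) for covering

# Random binary codebook, partitioned by bin index.
codebook = rng.integers(0, 2, size=(num_bins, bin_size, n))

def encode(m, s, min_agree=12):
    """Search bin m for a codeword 'jointly typical' with state s.

    Toy typicality test: the codeword must agree with s in at least
    min_agree of the n positions (targeting P(U = S) = 0.75 here).
    """
    for j in range(bin_size):
        u = codebook[m, j]
        if int((u == s).sum()) >= min_agree:
            return u
    return None  # covering failure: bin too small for this typicality target

s = rng.integers(0, 2, size=n)   # state known non-causally at the encoder
u = encode(2, s)
print(u is not None)             # with enough codewords per bin, succeeds w.h.p.
```

A single random codeword passes this test with probability roughly $2^{-nI(U;S)}$, which is why exponentially many codewords per bin are needed; that exponential search is exactly the complexity issue the Engineering Note below discusses.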

The Packing-Covering Duality

The Gel'fand-Pinsker theorem beautifully illustrates a duality that pervades information theory:

  • Packing (channel coding): fit as many non-overlapping codewords as possible. Rate limited by $I(U; Y)$.
  • Covering (source coding / binning): find a codeword jointly typical with the state. Rate cost $I(U; S)$.

The net rate $I(U; Y) - I(U; S)$ is packing minus covering, exactly the same structure as Wyner-Ziv (lossy source coding with side information at the decoder) but with the roles of encoder and decoder reversed.
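Written side by side, the duality is explicit (Wyner-Ziv in its standard form, with side information $Y$ at the decoder):

```latex
\begin{aligned}
\text{Gel'fand-Pinsker:}\quad & C = \max_{P_{U,X|S}} \bigl[\, I(U;Y) - I(U;S) \,\bigr]
  && \text{(state $S$ at the encoder)} \\
\text{Wyner-Ziv:}\quad & R(D) = \min_{P_{U|X},\, \hat{x}(U,Y)} \bigl[\, I(U;X) - I(U;Y) \,\bigr]
  && \text{(side info $Y$ at the decoder)}
\end{aligned}
```

In both cases the auxiliary $U$ pays for correlation with what one terminal knows and earns rate from what the other terminal observes; max versus min reflects channel coding versus source coding.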

Historical Note: Gel'fand and Pinsker's 1980 Result

Sergei Gel'fand and Mark Pinsker published their coding theorem for channels with non-causal state information in 1980, in a Russian journal. The result was not widely known in the Western literature until it was independently rediscovered and popularized by Heegard and El Gamal (1983) and Costa (1983). The Gaussian specialization by Costa brought the result to prominence and connected it to practical precoding problems.

It is one of those results where the information-theoretic insight preceded practical application by decades: dirty paper coding (DPC) became the theoretical foundation for MIMO broadcast channel capacity only in the early 2000s, with the work of Weingarten, Steinberg, and Shamai.

Common Mistake: The Auxiliary $U$ Is Not the Input $X$

Mistake:

Setting $U = X$ in the Gel'fand-Pinsker formula, which gives $I(X; Y) - I(X; S)$ and can be negative.

Correction:

The auxiliary $U$ is a design variable that can depend on both $X$ and $S$. In Costa's theorem, the optimal choice is $U = X + \alpha S$, which couples the codeword with the state. Setting $U = X$ ignores the state knowledge and is generally suboptimal. The optimization over $P_{U,X|S}$ is the heart of the Gel'fand-Pinsker theorem.

Quick Check

In the Gel'fand-Pinsker formula $C = \max\,[I(U;Y) - I(U;S)]$, what does the term $I(U;S)$ represent?

  • The rate needed for the encoder to learn the state
  • The rate cost of finding a codeword jointly typical with the state (covering)
  • The mutual information between the state and the output
  • The penalty for not knowing the state at the decoder

Answer: the rate cost of finding a codeword jointly typical with the state (covering).

⚠️ Engineering Note

Computational Complexity of Random Binning

The Gel'fand-Pinsker achievability proof relies on random binning: the encoder searches through $2^{nI(U;S)}$ codewords per bin to find one jointly typical with the state $S^n$. In the worst case, this is an exponential-time search.

Practical implementations avoid exhaustive search:

  • Nested lattice codes (Erez-Shamai-Zamir, 2005) replace random binning with algebraic structure, enabling polynomial-time encoding.
  • Tomlinson-Harashima precoding uses modulo arithmetic as a practical approximation with $O(n)$ encoding complexity.
  • LDPC-based DPC uses syndrome coding to approximate binning, with iterative encoding in $O(n \log n)$ per iteration.

The gap between the information-theoretic optimum (random binning) and practical structured codes is typically 1-2 dB for DPC implementations.
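The modulo idea behind Tomlinson-Harashima precoding can be sketched in a few lines. A noiseless toy (interval width, seed, and signal sizes are illustrative): the transmitter sends $x = (m - s) \bmod \Lambda$, so its power stays bounded no matter how large the known interference $s$ is, and the receiver's modulo recovers $m$:

```python
import numpy as np

def mod_centered(v, delta=2.0):
    """Centered modulo onto the interval [-delta/2, delta/2)."""
    return (v + delta / 2) % delta - delta / 2

rng = np.random.default_rng(1)
delta = 2.0
m = rng.uniform(-delta / 2, delta / 2, size=8)   # data symbols
s = rng.normal(0, 5.0, size=8)                   # known interference (large!)

x = mod_centered(m - s, delta)   # transmit: bounded regardless of s
y = x + s                        # channel adds the interference back
m_hat = mod_centered(y, delta)   # receiver modulo strips it off

print(np.allclose(m_hat, m))     # True (noiseless sketch)
```

With noise, the modulo at the receiver also folds the noise, which is one source of the 1-2 dB gap quoted above; lattice schemes shrink it by replacing the scalar interval with a high-dimensional lattice.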

Practical Constraints
  • Random binning: exponential search over $2^{nI(U;S)}$ codewords
  • Nested lattice codes: $O(n^3)$ for lattice reduction
  • THP/linear precoding: $O(K^2 n)$ for $K$ users, $n$ symbols