Mutual Information

The Central Quantity in Information Theory

If entropy measures the uncertainty of a single random variable, we now ask: how much does one random variable tell us about another? This quantity, mutual information, turns out to be the single most important concept in information theory. It determines the capacity of channels, the limits of compression with side information, and the fundamental tradeoffs in multiuser communication. Everything in this book ultimately revolves around computing, bounding, or optimizing mutual information.

Definition:

Mutual Information

The mutual information between discrete random variables $X$ and $Y$ is

$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}.$$

Equivalently:

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).$$

The first expression shows that $I(X;Y) = D(P_{XY} \| P_X \times P_Y)$: mutual information is the KL divergence between the joint distribution and the product of the marginals, so it measures how far $(X,Y)$ is from independence. The second set of expressions connects mutual information to entropy and conditional entropy.
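To make the equivalence concrete, here is a minimal Python sketch (not from the text) that computes $I(X;Y)$ for a small, arbitrary joint distribution both as the KL divergence $D(P_{XY} \| P_X \times P_Y)$ and as $H(X) + H(Y) - H(X,Y)$; the specific joint table is just an illustrative choice.

```python
import numpy as np

# Arbitrary illustrative 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

def H(p):
    """Entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# KL form D(P_XY || P_X x P_Y); assumes the joint has no zero entries.
kl_form = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

# Entropy form H(X) + H(Y) - H(X, Y).
entropy_form = H(p_x) + H(p_y) - H(p_xy)

print(kl_form, entropy_form)   # both ~0.278 bits for this joint
```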

Mutual information

The reduction in uncertainty about $X$ due to observing $Y$ (or equivalently, about $Y$ due to observing $X$): $I(X;Y) = H(X) - H(X|Y)$. Always non-negative. Equals zero iff $X$ and $Y$ are independent. Symmetric: $I(X;Y) = I(Y;X)$.

Related: Entropy, Conditional entropy, Kullback-Leibler divergence

Theorem: Properties of Mutual Information

For discrete random variables $X$ and $Y$:

  1. Symmetry: $I(X;Y) = I(Y;X)$.
  2. Non-negativity: $I(X;Y) \geq 0$.
  3. Self-information: $I(X;X) = H(X)$.
  4. Independence: $I(X;Y) = 0$ if and only if $X \perp Y$.
  5. Upper bounds: $I(X;Y) \leq \min\{H(X), H(Y)\}$.

Property 1 says information flow is symmetric: $Y$ tells us as much about $X$ as $X$ tells us about $Y$. Property 3 says a variable is maximally informative about itself. Property 5 says we cannot learn more about $X$ from $Y$ than the total uncertainty in $X$.
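As a quick numerical sanity check (an illustrative sketch, not part of the text), each property can be spot-checked on a small joint distribution; the joint table below is an arbitrary example.

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint table indexed [x, y]."""
    return H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy)

# Arbitrary illustrative joint distribution.
p_xy = np.array([[0.30, 0.20],
                 [0.05, 0.45]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

print(np.isclose(mutual_information(p_xy), mutual_information(p_xy.T)))  # 1. symmetry
print(mutual_information(p_xy) >= 0)                                     # 2. non-negativity
print(np.isclose(mutual_information(np.diag(p_x)), H(p_x)))              # 3. I(X;X) = H(X)
print(np.isclose(mutual_information(np.outer(p_x, p_y)), 0))             # 4. independence
print(mutual_information(p_xy) <= min(H(p_x), H(p_y)))                   # 5. upper bound
```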

Definition:

Conditional Mutual Information

The conditional mutual information of $X$ and $Y$ given $Z$ is

$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = \sum_{z} p(z)\, I(X;Y|Z=z).$$

It measures the average information that $Y$ provides about $X$ when $Z$ is already known.
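A minimal sketch of how this is computed in practice (illustrative, not from the text), using the entropy identity $I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)$, which follows from the definition above; the random joint distribution is an arbitrary example.

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary random joint p(x, y, z) on binary alphabets, indexed [x, y, z].
p_xyz = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)

p_xz = p_xyz.sum(axis=1)        # p(x, z)
p_yz = p_xyz.sum(axis=0)        # p(y, z)
p_z = p_xyz.sum(axis=(0, 1))    # p(z)

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
I_xy_given_z = H(p_xz) + H(p_yz) - H(p_z) - H(p_xyz)
print(I_xy_given_z)             # non-negative for any joint distribution
```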

Theorem: Chain Rule for Mutual Information

$$I(X_1, X_2, \ldots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y \mid X_{i-1}, \ldots, X_1).$$

The total information that the collection $(X_1, \ldots, X_n)$ provides about $Y$ equals the sum of the incremental contributions of each $X_i$, given the previous variables. This telescoping structure is the same pattern as the entropy chain rule, and it reappears in the converse proof of every channel coding theorem in this book.
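The identity can be checked numerically for $n = 2$, where it reads $I(X_1,X_2;Y) = I(X_1;Y) + I(X_2;Y|X_1)$. The sketch below (an illustration with an arbitrary random joint, not from the text) verifies this by expressing each term through joint entropies.

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary random joint p(x1, x2, y) on binary alphabets, indexed [x1, x2, y].
p = np.random.default_rng(1).dirichlet(np.ones(8)).reshape(2, 2, 2)

# Left side: I(X1,X2;Y) = H(X1,X2) + H(Y) - H(X1,X2,Y)
lhs = H(p.sum(axis=2)) + H(p.sum(axis=(0, 1))) - H(p)

# I(X1;Y) = H(X1) + H(Y) - H(X1,Y)
term1 = H(p.sum(axis=(1, 2))) + H(p.sum(axis=(0, 1))) - H(p.sum(axis=1))
# I(X2;Y|X1) = H(X1,X2) + H(X1,Y) - H(X1) - H(X1,X2,Y)
term2 = H(p.sum(axis=2)) + H(p.sum(axis=1)) - H(p.sum(axis=(1, 2))) - H(p)

print(np.isclose(lhs, term1 + term2))   # True: the chain rule holds for any joint
```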

Example: Mutual Information of the Binary Symmetric Channel

For the binary symmetric channel (BSC) with crossover probability $\epsilon$ and uniform input $X \sim \text{Bernoulli}(1/2)$, compute $I(X;Y)$.
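Sketch of the solution: each input bit is flipped with probability $\epsilon$, so $H(Y|X) = h_b(\epsilon)$; with uniform input, $Y$ is also uniform, so $H(Y) = 1$ and $I(X;Y) = 1 - h_b(\epsilon)$ bits. The small Python helper below (illustrative, not from the text) evaluates this, and also accepts an arbitrary input bias $P(X=1)$.

```python
import numpy as np

def h_b(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(eps, p1=0.5):
    """I(X;Y) = H(Y) - H(Y|X) for a BSC with crossover eps and P(X=1) = p1."""
    q = p1 * (1 - eps) + (1 - p1) * eps   # P(Y=1)
    return h_b(q) - h_b(eps)              # H(Y|X) = h_b(eps) for every input

print(bsc_mutual_information(0.1))        # 1 - h_b(0.1), roughly 0.531 bits
print(bsc_mutual_information(0.5))        # 0: the output is pure noise
```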

BSC Mutual Information

[Interactive figure: the mutual information $I(X;Y) = H(Y) - h_b(\epsilon)$ of a binary symmetric channel as a function of the input bias $p = P(X=1)$ and the crossover probability $\epsilon$. The capacity (the maximum over $p$) is achieved at $p = 1/2$.]

Why This Matters: Mutual Information and Channel Capacity

The channel capacity is $C = \max_{p_X} I(X;Y)$, the maximum mutual information over all input distributions. This is the fundamental theorem of channel coding (Chapter 9). For the AWGN channel, this yields $C = \frac{1}{2}\log(1 + \text{SNR})$, which is the starting point for all modern wireless system design. For MIMO channels, the capacity involves optimizing the input covariance matrix, leading to waterfilling across spatial modes (see Book telecom, Ch. 15).
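As an illustrative sketch (not from the text), the BSC capacity can be found by maximizing $I(X;Y)$ over the input bias on a grid, recovering $C = 1 - h_b(\epsilon)$ at $p = 1/2$, and the AWGN formula can be evaluated directly; the numerical values below are arbitrary examples.

```python
import numpy as np

def h_b(p):
    """Binary entropy in bits, safe at the endpoints."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# BSC: maximize I(X;Y) = h_b(p(1-eps) + (1-p)eps) - h_b(eps) over the input bias p.
eps = 0.1
p_grid = np.linspace(0.0, 1.0, 1001)
I_grid = h_b(p_grid * (1 - eps) + (1 - p_grid) * eps) - h_b(eps)
print(p_grid[np.argmax(I_grid)], I_grid.max())   # maximizer 0.5, C = 1 - h_b(0.1) ~ 0.531

# AWGN: C = (1/2) log2(1 + SNR) bits per channel use (SNR in linear scale).
snr = 10.0
print(0.5 * np.log2(1 + snr))                    # ~1.73 bits per channel use
```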

Quick Check

If $I(X;Y) = 0.7$ bits, what is $I(Y;X)$?

$0.7$ bits

Cannot be determined without knowing $H(X)$ and $H(Y)$

$1 - 0.7 = 0.3$ bits

$0$ bits

Conditional mutual information

The mutual information between $X$ and $Y$ when $Z$ is known: $I(X;Y|Z) = H(X|Z) - H(X|Y,Z)$. Can be larger or smaller than $I(X;Y)$: conditioning on a third variable can increase or decrease mutual information.

Related: Mutual information
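A standard example of conditioning increasing mutual information, sketched here in Python as an illustration (not from the text): if $X$ and $Y$ are independent fair bits and $Z = X \oplus Y$, then $I(X;Y) = 0$ but $I(X;Y|Z) = 1$ bit.

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# X, Y independent fair bits, Z = X XOR Y; joint p(x, y, z) indexed [x, y, z].
p = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p[x, y, x ^ y] = 0.25

# I(X;Y) = H(X) + H(Y) - H(X,Y): zero, since X and Y are independent.
p_xy = p.sum(axis=2)
I_xy = H(p_xy.sum(axis=1)) + H(p_xy.sum(axis=0)) - H(p_xy)

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z): one full bit.
I_xy_given_z = H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p.sum(axis=(0, 1))) - H(p)

print(I_xy, I_xy_given_z)   # 0.0 and 1.0
```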