Entropy, Mutual Information, and Channel Capacity

Why Information Theory?

Before Shannon's 1948 landmark paper, engineers believed that driving the error probability to zero required driving the data rate to zero as well. Shannon proved the opposite: for every noisy channel there exists a maximum rate, the channel capacity C, below which communication with arbitrarily small error probability is possible. This section develops the mathematical machinery to define and compute C.

Definition: Entropy

The entropy of a discrete random variable X with alphabet \mathcal{X} and probability mass function p(x) is

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \quad \text{bits}

with the convention 0 \log_2 0 = 0.

Entropy measures the average uncertainty (or information content) of X. It is maximised when X is uniformly distributed: H(X) \leq \log_2 |\mathcal{X}|, with equality iff p(x) = 1/|\mathcal{X}| for all x.

For a fair coin (p = 1/2), H = 1 bit. For a biased coin (p = 0.9), H \approx 0.47 bits: less uncertainty because the outcome is more predictable.
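
To make these numbers concrete, here is a minimal Python sketch (the entropy helper is our own illustration, not a library function) that reproduces both coin values:

    import math

    def entropy(pmf):
        """H(X) = -sum p(x) log2 p(x) in bits, with 0 log2 0 treated as 0."""
        return -sum(p * math.log2(p) for p in pmf if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
    print(entropy([0.9, 0.1]))   # biased coin -> ~0.469 bits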


Definition: Conditional Entropy

The conditional entropy of XX given YY is

H(X|Y) = -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log_2 p(x|y)

This measures the residual uncertainty about X after observing Y. The key property is

H(X|Y) \leq H(X)

with equality iff X and Y are independent. Conditioning reduces entropy (on average).
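
A quick numerical illustration (our own example): let X be a fair bit and let Y be X observed through a binary symmetric channel with crossover probability 0.1. For either observed value y, the conditional pmf p(x|y) places probability 0.9 on one symbol and 0.1 on the other, so

H(X|Y) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.47 \text{ bits} < H(X) = 1 \text{ bit}

Observing the noisy output removes roughly half a bit of uncertainty about X.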

Definition: Mutual Information

The mutual information between X and Y is

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Equivalently,

I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}

Mutual information measures the information that Y provides about X (and vice versa). Key properties (a numerical check follows the list):

  • I(X;Y) \geq 0, with equality iff X \perp Y (independence)
  • I(X;Y) = I(Y;X) (symmetric)
  • I(X;Y) \leq H(X) and I(X;Y) \leq H(Y)
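
As a quick numerical check, a short Python sketch (variable and function names are ours) that evaluates both forms of mutual information for the fair-bit-through-BSC(0.1) example used above and confirms they agree:

    import numpy as np

    # Joint pmf p(x, y) for a fair bit X observed through a BSC with crossover 0.1
    # (the same illustrative setup as in the conditional-entropy note above).
    p_xy = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def H(pmf):
        """Entropy in bits, with 0 log2 0 treated as 0."""
        pmf = pmf[pmf > 0]
        return float(-(pmf * np.log2(pmf)).sum())

    # I(X;Y) from the KL-divergence form
    I_xy = sum(p_xy[x, y] * np.log2(p_xy[x, y] / (p_x[x] * p_y[y]))
               for x in range(2) for y in range(2) if p_xy[x, y] > 0)

    # H(X|Y) = -sum p(x,y) log2 p(x|y)
    H_x_given_y = -sum(p_xy[x, y] * np.log2(p_xy[x, y] / p_y[y])
                       for x in range(2) for y in range(2) if p_xy[x, y] > 0)

    print(I_xy)                  # ~0.531 bits
    print(H(p_x) - H_x_given_y)  # same value: I(X;Y) = H(X) - H(X|Y)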

Definition: Channel Capacity

The capacity of a discrete memoryless channel (DMC) with transition probabilities p(y|x) is

C = \max_{p(x)} I(X;Y) \quad \text{bits/channel use}

The maximisation is over all possible input distributions p(x). Capacity is a property of the channel alone; it does not depend on any particular coding scheme.
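
For small alphabets the maximisation can be carried out numerically. The Python sketch below is illustrative only (the helpers mutual_information and capacity_binary_input are our own, not from a standard library): it brute-forces the optimal input distribution for any channel with two inputs, and applying it to the BSC and BEC examples that follow gives a numerical check on the closed-form answers.

    import numpy as np

    def mutual_information(p_x, channel):
        """I(X;Y) in bits, for input pmf p_x[x] and channel matrix channel[x, y] = p(y|x)."""
        p_xy = p_x[:, None] * channel      # joint pmf p(x, y)
        p_y = p_xy.sum(axis=0)             # output marginal p(y)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask])))

    def capacity_binary_input(channel, grid=2001):
        """Brute-force C = max over input pmfs [p0, 1 - p0] of I(X;Y)."""
        best_c, best_p0 = 0.0, 0.5
        for p0 in np.linspace(0.0, 1.0, grid):
            c = mutual_information(np.array([p0, 1.0 - p0]), channel)
            if c > best_c:
                best_c, best_p0 = c, p0
        return best_c, best_p0

    # Z-channel: a transmitted 0 is always received correctly,
    # a transmitted 1 is flipped to 0 with probability 0.1.
    z_channel = np.array([[1.0, 0.0],
                          [0.1, 0.9]])
    C, p0_opt = capacity_binary_input(z_channel)
    print(C, p0_opt)   # for an asymmetric channel the optimal input need not be uniform

A simple grid search suffices here because I(X;Y) is concave in the input distribution; for larger input alphabets the Blahut-Arimoto algorithm is the standard numerical tool.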


Theorem: Shannon's Channel Coding Theorem

For a discrete memoryless channel with capacity C:

  1. Achievability: For any rate R < C, there exist codes of sufficiently large block length n such that the maximum probability of error P_e^{(n)} \to 0 as n \to \infty.

  2. Converse: For any rate R > C, the error probability is bounded away from zero for every code: P_e^{(n)} \geq 1 - C/R - 1/(nR), which stays strictly positive for all sufficiently large n.

Together, these two statements identify C as the supremum of achievable rates for reliable communication.
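
Where the converse bound comes from (a sketch, assuming a code with 2^{nR} equiprobable messages W, channel input X^n, output Y^n, and decoded message \hat{W}): Fano's inequality gives H(W|\hat{W}) \leq 1 + P_e^{(n)} nR, and the data processing inequality together with memorylessness gives I(W;\hat{W}) \leq I(X^n;Y^n) \leq nC. Hence

nR = H(W) = I(W;\hat{W}) + H(W|\hat{W}) \leq nC + 1 + P_e^{(n)} nR

and rearranging yields P_e^{(n)} \geq 1 - C/R - 1/(nR), which is bounded away from zero whenever R > C.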

Shannon's theorem is an existence result: capacity-achieving codes exist, but the theorem does not tell us how to construct them. It took 45 years (turbo codes, 1993) and 48 years (LDPC codes, rediscovered 1996) to find practical codes that approach capacity within 1 dB.


Example: Binary Symmetric Channel Capacity

The binary symmetric channel (BSC) with crossover probability p has input and output alphabets \{0, 1\} and transition probabilities p(y \neq x) = p. Compute its capacity.

Example: Binary Erasure Channel Capacity

The binary erasure channel (BEC) with erasure probability \epsilon has output alphabet \{0, 1, e\}, where e denotes an erasure. The transition probabilities are p(y = x \mid x) = 1 - \epsilon and p(y = e \mid x) = \epsilon. Compute its capacity and the optimal input distribution.
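
For checking your answers to the two examples, the standard results are

C_{\text{BSC}} = 1 - H_b(p) \qquad \text{and} \qquad C_{\text{BEC}} = 1 - \epsilon

where H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the binary entropy function; the uniform input distribution achieves capacity in both cases.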

Quick Check

Which of the following statements about entropy is false?

  • H(X) \geq 0 for any discrete random variable X
  • H(X,Y) = H(X) + H(Y) always holds
  • H(X|Y) \leq H(X), i.e., conditioning reduces entropy
  • I(X;Y) = I(Y;X), i.e., mutual information is symmetric

Common Mistake: Capacity is a Channel Property, Not a Code Property

Mistake:

Saying "the capacity of LDPC codes" or "turbo codes achieve higher capacity than convolutional codes."

Correction:

Capacity C = \max_{p(x)} I(X;Y) depends only on the channel transition probabilities p(y|x). It does not change with the coding scheme. What changes is how close a code operates to C. The correct statement is "LDPC codes operate within 0.5 dB of capacity"; the capacity itself is fixed.

Historical Note: Shannon's 1948 Paper

1948

Claude Shannon's "A Mathematical Theory of Communication," published in the Bell System Technical Journal in July and October 1948, is widely regarded as the founding document of information theory. Shannon introduced entropy as a measure of information, defined channel capacity, and proved the channel coding theorem, all in one paper. The result was so surprising that many engineers initially doubted it: the idea that one could transmit at a positive rate with vanishing error probability over a noisy channel contradicted the prevailing intuition that noise fundamentally limits reliability at any positive rate.

Historical Note: The Discrete Memoryless Channel Model

1948-1968

The discrete memoryless channel (DMC) abstraction, in which each channel use is independent and identically distributed, was central to Shannon's original analysis. While real channels have memory (ISI, fading correlation), the DMC remains the foundation of coding theory. Extensions to channels with memory (e.g., the Gilbert-Elliott model, finite-state channels) were developed in the 1960s by Blackwell, Breiman, Thomasian, and Gallager.

Entropy

A measure of the average uncertainty or information content of a discrete random variable: H(X) = -\sum p(x) \log_2 p(x). Measured in bits when the logarithm base is 2.

Related: Conditional Entropy, Mutual Information

Mutual Information

The amount of information that one random variable provides about another: I(X;Y) = H(X) - H(X|Y). It quantifies the reduction in uncertainty about X due to knowledge of Y.

Related: Entropy, Channel Capacity, Conditional Entropy

Channel Capacity

The maximum rate at which information can be transmitted over a channel with arbitrarily low error probability: C = \max_{p(x)} I(X;Y). Measured in bits per channel use.

Related: Mutual Information, Shannon's Channel Coding Theorem

Discrete Memoryless Channel (DMC)

A channel model where the output depends only on the current input through a fixed transition probability p(y|x), independent of all previous inputs and outputs. The BSC and BEC are canonical examples.

Related: Channel Capacity, BSC, BEC

Why This Matters: Deeper Treatment in the Information Theory Book

This section provides a condensed introduction to entropy, mutual information, and channel capacity, enough to motivate and compute capacity for wireless channels. The Information Theory and Applications (ITA) book develops the full theory:

  • Strong typicality and joint AEP (rigorous capacity proof)
  • Source coding (Huffman, arithmetic, Lempel-Ziv)
  • Rate-distortion theory (lossy compression limits)
  • Multi-user information theory (MAC, broadcast channel, interference channel, relay channel)
  • Network information theory (Slepian-Wolf, Wyner-Ziv)

Readers seeking rigorous proofs or multi-user extensions should consult the ITA book.

Key Takeaway

Channel capacity C = \max_{p(x)} I(X;Y) is the maximum rate at which reliable communication is possible. Shannon's coding theorem guarantees that codes achieving rates arbitrarily close to C exist, though finding practical capacity-approaching codes remained an open challenge for nearly 50 years.