Entropy, Mutual Information, and Channel Capacity
Why Information Theory?
Before Shannon's 1948 landmark paper, engineers believed that achieving arbitrarily low error rates required reducing the data rate to zero. Shannon proved the opposite: for every noisy channel there exists a maximum rate, the channel capacity $C$, below which communication with arbitrarily small error probability is possible. This section develops the mathematical machinery to define and compute $C$.
Definition: Entropy
The entropy of a discrete random variable $X$ with alphabet $\mathcal{X}$ and probability mass function $p(x)$ is
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$
with the convention $0 \log 0 = 0$.
Entropy measures the average uncertainty (or information content) of $X$. It is maximised when $X$ is uniformly distributed: $H(X) \le \log_2 |\mathcal{X}|$, with equality iff $p(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$.
For a fair coin ($p = 1/2$), $H(X) = 1$ bit. For a biased coin (e.g. $p = 0.9$), $H(X) \approx 0.47$ bits: less uncertainty because the outcome is more predictable.
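These values are easy to verify numerically. Below is a minimal Python sketch (the `entropy` helper is our own, not from any library) that evaluates $H(X)$ for a few distributions:

```python
import math

def entropy(pmf):
    """Shannon entropy in bits, using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))    # biased coin: ~0.469 bits
print(entropy([0.25] * 4))    # uniform over 4 symbols: 2.0 bits = log2 |X|
```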
Definition: Conditional Entropy
The conditional entropy of $X$ given $Y$ is
$$H(X|Y) = -\sum_{x,y} p(x,y) \log_2 p(x|y).$$
This measures the residual uncertainty about $X$ after observing $Y$. The key property is
$$H(X|Y) \le H(X),$$
with equality iff $X$ and $Y$ are independent. Conditioning reduces entropy (on average).
Definition: Mutual Information
The mutual information between $X$ and $Y$ is
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).$$
Equivalently,
$$I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}.$$
Mutual information measures the information that $Y$ provides about $X$ (and vice versa). Key properties (checked numerically in the sketch after this list):
- $I(X;Y) \ge 0$, with equality iff $X$ and $Y$ are independent
- $I(X;Y) = I(Y;X)$ (symmetric)
- $I(X;Y) \le H(X)$ and $I(X;Y) \le H(Y)$
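To make these definitions concrete, here is a small Python sketch (the joint distribution is hypothetical, chosen only for illustration) that computes $H(X)$, $H(X|Y)$, and $I(X;Y)$ from a joint pmf and checks the properties above:

```python
import math

def H(pmf):
    """Entropy in bits of a probability mass function given as a list."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hypothetical joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = [sum(v for (x, _), v in joint.items() if x == i) for i in (0, 1)]
p_y = [sum(v for (_, y), v in joint.items() if y == j) for j in (0, 1)]

H_X, H_Y, H_XY = H(p_x), H(p_y), H(list(joint.values()))
H_X_given_Y = H_XY - H_Y      # chain rule: H(X, Y) = H(Y) + H(X|Y)
I_XY = H_X - H_X_given_Y      # I(X;Y) = H(X) - H(X|Y)

print(f"H(X) = {H_X:.3f}, H(X|Y) = {H_X_given_Y:.3f}, I(X;Y) = {I_XY:.3f}")
assert I_XY >= -1e-12                      # I(X;Y) >= 0
assert H_X_given_Y <= H_X + 1e-12          # conditioning reduces entropy
assert I_XY <= min(H_X, H_Y) + 1e-12       # I(X;Y) <= H(X), H(Y)
```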
Definition: Channel Capacity
The capacity of a discrete memoryless channel (DMC) with transition probabilities $p(y|x)$ is
$$C = \max_{p(x)} I(X;Y).$$
The maximisation is over all possible input distributions $p(x)$. Capacity is a property of the channel alone; it does not depend on any particular coding scheme.
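For small channels the maximisation can be carried out numerically. Below is a brute-force Python sketch, assuming a binary-input DMC given as a transition matrix `W[x][y]` (the function name and the grid-search approach are our choices for illustration; the standard general-purpose method is the Blahut-Arimoto algorithm):

```python
import math

def H(pmf):
    """Entropy in bits, with the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def capacity_binary_input(W, steps=10_000):
    """Approximate C = max over p(x) of I(X;Y) for a binary-input DMC.

    W[x][y] holds the transition probability p(y|x); each row sums to 1.
    A grid search over q = P(X=0) suffices because I(X;Y) is concave in q.
    """
    best = 0.0
    for i in range(steps + 1):
        q = i / steps
        p_y = [q * w0 + (1 - q) * w1 for w0, w1 in zip(W[0], W[1])]
        mi = H(p_y) - (q * H(W[0]) + (1 - q) * H(W[1]))  # I = H(Y) - H(Y|X)
        best = max(best, mi)
    return best

# BSC with crossover probability 0.1: expect 1 - H_b(0.1) ~ 0.531
print(capacity_binary_input([[0.9, 0.1], [0.1, 0.9]]))
```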
Theorem: Shannon's Channel Coding Theorem
For a discrete memoryless channel with capacity $C$:
- Achievability: For any rate $R < C$, there exist codes of sufficiently large block length $n$ such that the maximum probability of error $P_e^{(n)} \to 0$ as $n \to \infty$.
- Converse: For any rate $R > C$, the error probability is bounded away from zero for all codes and all block lengths: $P_e^{(n)} \ge 1 - C/R - 1/(nR) > 0$ for large $n$.
Together, these two statements show that $C$ is the supremum of achievable rates for reliable communication.
Shannon's theorem is an existence result: capacity-achieving codes exist, but the theorem does not tell us how to construct them. It took 45 years (turbo codes, 1993) and 48 years (LDPC codes, rediscovered 1996) to find practical codes that approach capacity within 1 dB.
Random coding argument (achievability sketch)
Generate $2^{nR}$ codewords by drawing each symbol i.i.d. from the capacity-achieving input distribution $p^*(x)$. The receiver uses joint typicality decoding: declare that message $\hat{w}$ was sent if $(x^n(\hat{w}), y^n)$ is jointly typical and no other codeword is jointly typical with $y^n$.
By the asymptotic equipartition property (AEP), the probability of a decoding error averaged over the random code ensemble vanishes as $n \to \infty$ provided $R < C$. Since this holds for the average code, at least one deterministic code achieves the same performance.
Converse (Fano's inequality)
For any code with $2^{nR}$ codewords and error probability $P_e^{(n)}$, Fano's inequality gives
$$H(W \mid Y^n) \le 1 + nR\, P_e^{(n)}.$$
Combined with the data processing inequality $I(W; Y^n) \le I(X^n; Y^n) \le nC$, this yields
$$nR = H(W) = I(W; Y^n) + H(W \mid Y^n) \le nC + 1 + nR\, P_e^{(n)}.$$
If $R > C$, then $P_e^{(n)} \ge 1 - C/R - 1/(nR)$, which is bounded away from zero for large $n$.
Example: Binary Symmetric Channel Capacity
The binary symmetric channel (BSC) with crossover probability $p$ has input and output alphabets $\{0, 1\}$ and transition probabilities $p(y|x) = 1 - p$ for $y = x$ and $p(y|x) = p$ for $y \ne x$. Compute its capacity.
Mutual information
For input distribution $P(X = 0) = q$, $P(X = 1) = 1 - q$:
$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - H_b(p),$$
where $H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p)$ is the binary entropy function. The term $H(Y|X) = H_b(p)$ does not depend on $q$ because each row of the transition matrix has the same entropy $H_b(p)$.
To maximise $I(X;Y)$, we maximise $H(Y)$. The maximum $H(Y) = 1$ bit is achieved when $Y$ is uniform, which occurs when $q = 1/2$ (uniform input).
Capacity
$$C_{\text{BSC}} = 1 - H_b(p).$$
Sanity checks: $p = 0 \Rightarrow C = 1$, $p = 1/2 \Rightarrow C = 0$, and $p = 1 \Rightarrow C = 1$ (deterministic bit-flipping is just as informative as a perfect channel). $\blacksquare$
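A quick numerical check of the closed form (a sketch; `Hb` and `bsc_capacity` are our own helper names):

```python
import math

def Hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    return 1.0 - Hb(p)

for p in (0.0, 0.1, 0.5, 1.0):
    print(f"p = {p}: C = {bsc_capacity(p):.3f}")
# p = 0.0: C = 1.000, p = 0.1: C = 0.531, p = 0.5: C = 0.000, p = 1.0: C = 1.000
```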
Example: Binary Erasure Channel Capacity
The binary erasure channel (BEC) with erasure probability $\epsilon$ has output alphabet $\{0, 1, e\}$, where $e$ denotes an erasure. The transition probabilities are $p(y = x \mid x) = 1 - \epsilon$ and $p(y = e \mid x) = \epsilon$. Compute its capacity and the optimal input distribution.
Compute $H(Y|X)$
Given $X = x$, the output is $x$ with probability $1 - \epsilon$ or $e$ with probability $\epsilon$:
$$H(Y|X) = H_b(\epsilon).$$
Compute $H(Y)$ and maximise
The output distribution has three symbols. With uniform input ($P(X=0) = P(X=1) = 1/2$): $P(Y = 0) = (1-\epsilon)/2$, $P(Y = 1) = (1-\epsilon)/2$, $P(Y = e) = \epsilon$.
$$H(Y) = (1 - \epsilon) + H_b(\epsilon), \quad\text{so}\quad I(X;Y) = H(Y) - H(Y|X) = 1 - \epsilon.$$
The capacity is simply $C_{\text{BEC}} = 1 - \epsilon$ bits per channel use, achieved by uniform input: a fraction $\epsilon$ of the symbols is erased, so at most a fraction $1 - \epsilon$ of the transmitted bits gets through. $\blacksquare$
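The result $C_{\text{BEC}} = 1 - \epsilon$ and the optimality of the uniform input can be confirmed by a direct search over input distributions (a sketch; `bec_mi` is a hypothetical helper):

```python
import math

def Hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bec_mi(eps, q):
    """I(X;Y) for a BEC with erasure probability eps and input P(X=0) = q."""
    p_y = [q * (1 - eps), (1 - q) * (1 - eps), eps]  # P(Y=0), P(Y=1), P(Y=e)
    H_Y = -sum(p * math.log2(p) for p in p_y if p > 0)
    return H_Y - Hb(eps)                             # H(Y|X) = H_b(eps)

eps = 0.3
best_q = max((i / 1000 for i in range(1001)), key=lambda q: bec_mi(eps, q))
print(best_q, bec_mi(eps, best_q))                   # 0.5 and 0.7 = 1 - eps
```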
Quick Check
Which of the following statements about entropy is false?
- $H(X) \ge 0$ for any discrete random variable
- $H(X, Y) = H(X) + H(Y)$ always holds
- $H(X|Y) \le H(X)$, i.e., conditioning reduces entropy
- $I(X;Y) = I(Y;X)$, i.e., mutual information is symmetric
The false statement is the second: the joint entropy satisfies $H(X,Y) = H(X) + H(Y|X) \le H(X) + H(Y)$, with equality only when $X$ and $Y$ are independent. In general, $H(X,Y) < H(X) + H(Y)$ whenever $X$ and $Y$ are dependent.
Common Mistake: Capacity is a Channel Property, Not a Code Property
Mistake:
Saying "the capacity of LDPC codes" or "turbo codes achieve higher capacity than convolutional codes."
Correction:
Capacity depends only on the channel transition probabilities $p(y|x)$. It does not change with the coding scheme. What changes is how close a code operates to $C$. The correct statement is "LDPC codes operate within 0.5 dB of capacity"; the capacity itself is fixed.
Historical Note: Shannon's 1948 Paper
1948: Claude Shannon's "A Mathematical Theory of Communication," published in the Bell System Technical Journal in July and October 1948, is widely regarded as the founding document of information theory. Shannon introduced entropy as a measure of information, defined channel capacity, and proved the channel coding theorem, all in one paper. The result was so surprising that many engineers initially doubted it: the idea that one could transmit at a positive rate with vanishing error probability over a noisy channel contradicted the prevailing intuition that noise fundamentally limits reliability at any positive rate.
Historical Note: The Discrete Memoryless Channel Model
1948-1968: The discrete memoryless channel (DMC) abstraction, in which the channel acts on each input symbol independently with the same transition law, was central to Shannon's original analysis. While real channels have memory (ISI, fading correlation), the DMC remains the foundation of coding theory. Extensions to channels with memory (e.g., the Gilbert-Elliott model and finite-state channels) were developed in the 1960s by Blackwell, Breiman, Thomasian, and Gallager.
Entropy
A measure of the average uncertainty or information content of a discrete random variable: $H(X) = -\sum_x p(x) \log_2 p(x)$. Measured in bits when the logarithm base is 2.
Related: Conditional Entropy, Mutual Information
Mutual Information
The amount of information that one random variable provides about another: $I(X;Y) = H(X) - H(X|Y)$. It quantifies the reduction in uncertainty about $X$ due to knowledge of $Y$.
Related: Entropy, Channel Capacity, Conditional Entropy
Channel Capacity
The maximum rate at which information can be transmitted over a channel with arbitrarily low error probability: $C = \max_{p(x)} I(X;Y)$. Measured in bits per channel use.
Related: Mutual Information, Shannon's Channel Coding Theorem
Discrete Memoryless Channel (DMC)
A channel model where the output depends only on the current input through a fixed transition probability $p(y|x)$, independent of all previous inputs and outputs. The BSC and BEC are canonical examples.
Related: Channel Capacity, BSC, BEC
Why This Matters: Deeper Treatment in the Information Theory Book
This section provides a condensed introduction to entropy, mutual information, and channel capacity, enough to motivate and compute capacity for wireless channels. The Information Theory and Applications (ITA) book develops the full theory:
- Strong typicality and joint AEP (rigorous capacity proof)
- Source coding (Huffman, arithmetic, Lempel-Ziv)
- Rate-distortion theory (lossy compression limits)
- Multi-user information theory (MAC, broadcast channel, interference channel, relay channel)
- Network information theory (Slepian-Wolf, Wyner-Ziv)
Readers seeking rigorous proofs or multi-user extensions should consult the ITA book.
Key Takeaway
Channel capacity $C$ is the maximum rate at which reliable communication is possible. Shannon's coding theorem guarantees that codes achieving rates arbitrarily close to $C$ exist, though finding practical capacity-approaching codes remained an open challenge for nearly 50 years.