Channel Capacity
The Channel Coding Theorem
We now state the central result of this chapter and, arguably, the most important theorem in information theory. The channel coding theorem connects the operational definition of capacity (the supremum of achievable rates) to an information-theoretic formula involving mutual information. The proof occupies the next two sections: achievability (Section 3) and converse (Section 4).
Theorem: The Channel Coding Theorem (Shannon, 1948)
The capacity of the DMC is:

$$C = \max_{p(x)} I(X;Y).$$
Specifically:
- (Achievability) For any $R < C$, there exists a sequence of $(2^{nR}, n)$-codes with $P_e^{(n)} \to 0$ as $n \to \infty$.
- (Converse) If a sequence of $(2^{nR}, n)$-codes has $P_e^{(n)} \to 0$, then $R \le C$.
Why is $\max_{p(x)} I(X;Y)$ the right formula? There are two ways to see this:
From the source coding side: The channel output $Y$ looks like a noisy version of the input $X$. The mutual information $I(X;Y)$ measures how many bits per symbol the channel can actually convey; the rest is lost to noise. Maximizing over $p(x)$ finds the input distribution that makes the most of the channel.
From the coding side: We need our codewords to be distinguishable at the decoder. The packing lemma tells us that about $2^{nI(X;Y)}$ codewords can coexist without confusion. So the maximum rate is $I(X;Y)$ bits per channel use, and we choose $p(x)$ to maximize this.
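A back-of-the-envelope version of this counting (made rigorous by the packing lemma in Section 3): there are roughly $2^{nH(Y)}$ typical output sequences, and the noise spreads each codeword over roughly $2^{nH(Y|X)}$ of them, so the number of codewords whose noise clouds fit without overlap is about

$$\frac{2^{nH(Y)}}{2^{nH(Y|X)}} = 2^{n(H(Y) - H(Y|X))} = 2^{nI(X;Y)}.$$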
Overview of proof strategy
The proof follows a pattern that recurs throughout information theory:
Achievability: Random coding + joint typicality decoding + packing lemma, giving all rates $R < I(X;Y)$. Choose $p(x)$ to maximize $I(X;Y)$.
Converse: Fano's inequality + chain rule + single-letter bound $I(X^n; Y^n) \le nC$ (sketched just after this overview).
This is the "proof pattern" β achievability via random coding, converse via Fano β that we will reuse for every channel model in the rest of the book. By the time we reach the MAC (Chapter 14), the reader should be able to construct both directions of the proof for new channel models.
Properties of Channel Capacity
Several important properties follow from the formula $C = \max_{p(x)} I(X;Y)$:
- $C \ge 0$: Since $I(X;Y) \ge 0$ for any $p(x)$ (with equality when $X$ and $Y$ are independent, e.g., a completely noisy channel).
- $C \le \log|\mathcal{X}|$: Since $I(X;Y) \le H(X) \le \log|\mathcal{X}|$, the input carries at most $\log|\mathcal{X}|$ bits.
- $C \le \log|\mathcal{Y}|$: Since $I(X;Y) \le H(Y) \le \log|\mathcal{Y}|$, the output can distinguish at most $|\mathcal{Y}|$ values.
- The maximum exists: $I(X;Y)$ is a continuous function of $p(x)$ over the compact probability simplex, so the maximum is achieved.
- The maximum is well-behaved: $I(X;Y)$ is concave in $p(x)$ (for fixed $p(y|x)$), so every local maximum is a global maximum; the maximizing $p(x)$ need not be unique, but the maximum value is. This is a convex optimization problem; we can compute capacity efficiently.
Definition: Capacity-Achieving Input Distribution
The capacity-achieving input distribution $p^*(x)$ is the probability distribution over $\mathcal{X}$ that maximizes $I(X;Y)$:

$$p^* = \arg\max_{p(x)} I(X;Y).$$
Since $H(Y|X) = \sum_x p(x)\,H(Y|X{=}x)$ is linear in $p(x)$, maximizing $I(X;Y) = H(Y) - H(Y|X)$ reduces to maximizing the concave term $H(Y)$ minus a linear term, subject to the constraint that the output distribution $p(y) = \sum_x p(x)\,p(y|x)$ is determined by $p(x)$ through the channel law.
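Written out, this is the decomposition that drives the concavity claim above:

$$I(X;Y) = H\Big(\sum_x p(x)\,p(\cdot \mid x)\Big) - \sum_x p(x)\,H(Y \mid X = x).$$

The first term is a concave function ($H$) of a linear map of $p(x)$, hence concave in $p(x)$; the second is linear in $p(x)$; their difference is therefore concave.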
For symmetric channels, $p^*(x)$ is the uniform distribution. For general channels, $p^*(x)$ can be found via the Blahut-Arimoto algorithm (Section 6) or the KKT conditions.
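Section 6 develops Blahut-Arimoto properly; as a taste, here is a minimal NumPy sketch of the iteration (the function name, tolerance, and stopping rule are illustrative choices, not a definitive implementation):

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    """Capacity (in bits) of a DMC with transition matrix W[x, y] = p(y|x)."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])   # start from the uniform input
    for _ in range(max_iter):
        q = p @ W                               # induced output distribution p(y)
        # d[x] = D( p(y|x) || p(y) ): how distinguishable input x looks at the output
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.sum(np.where(W > 0, W * np.log2(W / q), 0.0), axis=1)
        lower, upper = np.log2(p @ np.exp2(d)), d.max()   # bounds that sandwich C
        if upper - lower < tol:
            break
        p = p * np.exp2(d)                      # multiplicative reweighting of p(x)
        p /= p.sum()
    return lower, p

# BSC with crossover 0.1: expect C = 1 - H(0.1) ≈ 0.531 bits and a uniform p*
W_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
C, p_star = blahut_arimoto(W_bsc)
print(C, p_star)
```

Because the lower and upper bounds sandwich $C$ at every iteration, the stopping rule certifies the answer to within `tol`.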
Common Mistake: Capacity is Not Always $\log|\mathcal{X}|$
Mistake:
Assuming that the capacity of every DMC is $\log|\mathcal{X}|$ (the maximum input entropy), or that a uniform input distribution always achieves capacity.
Correction:
The capacity is $\max_{p(x)} I(X;Y)$, not $\max_{p(x)} H(X) = \log|\mathcal{X}|$. The noise reduces the information conveyed per channel use. For example, the BSC($p$) has capacity $1 - H(p) \le 1$ bit. The uniform input happens to be optimal for symmetric channels, but not in general.
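Worked out for crossover probability $p = 0.1$:

$$C = 1 - H(0.1) = 1 - \big(0.1 \log_2 \tfrac{1}{0.1} + 0.9 \log_2 \tfrac{1}{0.9}\big) \approx 1 - 0.469 = 0.531 \text{ bits},$$

well below the noiseless $\log_2|\mathcal{X}| = 1$ bit.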
Quick Check
Why is it important that $I(X;Y)$ is concave in $p(x)$ for fixed $p(y|x)$?
- It means every local maximum is the global maximum
- It means the capacity is always positive
- It means the capacity-achieving distribution is always uniform
Correct answer: every local maximum is the global maximum. Concavity ensures that any local maximum of $I(X;Y)$ over the simplex is also global, so iterative ascent methods (like Blahut-Arimoto) are guaranteed to converge to the capacity.
Key Takeaway
The channel capacity $C = \max_{p(x)} I(X;Y)$ is the single most important formula in information theory. It equates an operational quantity (the maximum reliable communication rate) with an information-theoretic quantity (the maximum mutual information). The concavity of $I(X;Y)$ in $p(x)$ makes the capacity computable as a convex optimization problem.