Channel Capacity
The Channel Coding Theorem
We now state the central result of this chapter and, arguably, the most important theorem in information theory. The channel coding theorem connects the operational definition of capacity (the supremum of achievable rates) to an information-theoretic formula involving mutual information. The proof occupies the next two sections: achievability (Section 3) and converse (Section 4).
Theorem: The Channel Coding Theorem (Shannon, 1948)
The capacity of the DMC is:

$$C = \max_{p(x)} I(X;Y).$$
Specifically:
- (Achievability) For any $R < C$, there exists a sequence of $(2^{nR}, n)$-codes with $P_e^{(n)} \to 0$ as $n \to \infty$.
- (Converse) If a sequence of $(2^{nR}, n)$-codes has $P_e^{(n)} \to 0$, then $R \le C$.
Why is $\max_{p(x)} I(X;Y)$ the right formula? There are two ways to see this:
From the source coding side: The channel output $Y$ looks like a noisy version of the input $X$. The mutual information $I(X;Y)$ measures how many bits per symbol the channel can actually convey; the rest is lost to noise. Maximizing over $p(x)$ finds the input distribution that makes the most of the channel.
From the coding side: We need our codewords to be distinguishable at the decoder. The packing lemma tells us that about $2^{nI(X;Y)}$ codewords can coexist without confusion. So the maximum rate is $I(X;Y)$ bits per channel use, and we choose $p(x)$ to maximize this.
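A back-of-the-envelope version of this counting (made rigorous by the packing lemma in Section 3): there are roughly $2^{nH(Y)}$ typical output sequences, and the noise spreads each codeword over roughly $2^{nH(Y|X)}$ of them, so the number of codewords whose noise clouds fit without overlap is about

$$\frac{2^{nH(Y)}}{2^{nH(Y|X)}} = 2^{n(H(Y) - H(Y|X))} = 2^{nI(X;Y)}.$$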
Overview of proof strategy
The proof follows a pattern that recurs throughout information theory:
Achievability: Random coding + joint typicality decoding + packing lemma, giving all rates $R < I(X;Y)$. Choose $p(x)$ to maximize $I(X;Y)$.
Converse: Fano's inequality + chain rule + single-letter bound $I(X^n; Y^n) \le nC$ (sketched just after this overview).
This is the "proof pattern" β achievability via random coding, converse via Fano β that we will reuse for every channel model in the rest of the book. By the time we reach the MAC (Chapter 14), the reader should be able to construct both directions of the proof for new channel models.
Properties of Channel Capacity
Several important properties follow from the formula $C = \max_{p(x)} I(X;Y)$:
- $C \ge 0$: Since $I(X;Y) \ge 0$ for any $p(x)$ (with equality when $X$ and $Y$ are independent, e.g., a completely noisy channel).
- $C \le \log|\mathcal{X}|$: Since $I(X;Y) \le H(X) \le \log|\mathcal{X}|$, the input carries at most $\log|\mathcal{X}|$ bits.
- $C \le \log|\mathcal{Y}|$: Since $I(X;Y) \le H(Y) \le \log|\mathcal{Y}|$, the output can distinguish at most $|\mathcal{Y}|$ values.
- The maximum exists: $I(X;Y)$ is a continuous function of $p(x)$ over the compact probability simplex, so the maximum is achieved.
- The maximum is well-behaved: $I(X;Y)$ is concave in $p(x)$ (for fixed $p(y|x)$), so every local maximum is a global maximum; the maximizing $p(x)$ need not be unique, but the maximum value is. This is a convex optimization problem; we can compute capacity efficiently.
Definition: Capacity-Achieving Input Distribution
The capacity-achieving input distribution $p^*(x)$ is the probability distribution over $\mathcal{X}$ that maximizes $I(X;Y)$:

$$p^* = \arg\max_{p(x)} I(X;Y).$$
Since $H(Y|X) = \sum_x p(x)\,H(Y|X{=}x)$ is linear in $p(x)$, maximizing $I(X;Y) = H(Y) - H(Y|X)$ reduces to maximizing the concave term $H(Y)$ minus a linear term, subject to the constraint that the output distribution $p(y) = \sum_x p(x)\,p(y|x)$ is determined by $p(x)$ through the channel law.
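Written out, this is the decomposition that drives the concavity claim above:

$$I(X;Y) = H\Big(\sum_x p(x)\,p(\cdot \mid x)\Big) - \sum_x p(x)\,H(Y \mid X = x).$$

The first term is a concave function ($H$) of a linear map of $p(x)$, hence concave in $p(x)$; the second is linear in $p(x)$; their difference is therefore concave.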
For symmetric channels, $p^*(x)$ is the uniform distribution. For general channels, $p^*(x)$ can be found via the Blahut-Arimoto algorithm (Section 6) or the KKT conditions.
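Section 6 develops Blahut-Arimoto properly; as a taste, here is a minimal NumPy sketch of the iteration (the function name, tolerance, and stopping rule are illustrative choices, not a definitive implementation):

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    """Capacity (in bits) of a DMC with transition matrix W[x, y] = p(y|x)."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])   # start from the uniform input
    for _ in range(max_iter):
        q = p @ W                               # induced output distribution p(y)
        # d[x] = D( p(y|x) || p(y) ): how distinguishable input x looks at the output
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.sum(np.where(W > 0, W * np.log2(W / q), 0.0), axis=1)
        lower, upper = np.log2(p @ np.exp2(d)), d.max()   # bounds that sandwich C
        if upper - lower < tol:
            break
        p = p * np.exp2(d)                      # multiplicative reweighting of p(x)
        p /= p.sum()
    return lower, p

# BSC with crossover 0.1: expect C = 1 - H(0.1) ≈ 0.531 bits and a uniform p*
W_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
C, p_star = blahut_arimoto(W_bsc)
print(C, p_star)
```

Because the lower and upper bounds sandwich $C$ at every iteration, the stopping rule certifies the answer to within `tol`.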
Common Mistake: Capacity is Not Always $\log|\mathcal{X}|$
Mistake:
Assuming that the capacity of every DMC is $\log|\mathcal{X}|$ (the maximum input entropy), or that a uniform input distribution always achieves capacity.
Correction:
The capacity is $\max_{p(x)} I(X;Y)$, not $\max_{p(x)} H(X) = \log|\mathcal{X}|$. The noise reduces the information conveyed per channel use. For example, the BSC($p$) has capacity $1 - H(p) \le 1$ bit. The uniform input happens to be optimal for symmetric channels, but not in general.
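Worked out for crossover probability $p = 0.1$:

$$C = 1 - H(0.1) = 1 - \big(0.1 \log_2 \tfrac{1}{0.1} + 0.9 \log_2 \tfrac{1}{0.9}\big) \approx 1 - 0.469 = 0.531 \text{ bits},$$

well below the noiseless $\log_2|\mathcal{X}| = 1$ bit.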
Quick Check
Why is it important that $I(X;Y)$ is concave in $p(x)$ for fixed $p(y|x)$?
- It means every local maximum is the global maximum
- It means the capacity is always positive
- It means the capacity-achieving distribution is always uniform
Correct answer: every local maximum is the global maximum. Concavity ensures that any local maximum of $I(X;Y)$ over the simplex is also global, so iterative ascent methods (like Blahut-Arimoto) are guaranteed to converge to the capacity.
Key Takeaway
The channel capacity $C = \max_{p(x)} I(X;Y)$ is the single most important formula in information theory. It equates an operational quantity (the maximum reliable communication rate) with an information-theoretic quantity (the maximum mutual information). The concavity of $I(X;Y)$ in $p(x)$ makes the capacity computable as a convex optimization problem.