Entropy, Mutual Information, and Channel Capacity

Why Information Theory?

Before Shannon's 1948 landmark paper, engineers believed that driving the error probability to zero required driving the data rate to zero as well. Shannon proved the opposite: for every noisy channel there exists a maximum rate, the channel capacity C, below which communication with arbitrarily small error probability is possible. This section develops the mathematical machinery to define and compute C.

Definition: Entropy

The entropy of a discrete random variable X with alphabet \mathcal{X} and probability mass function p(x) is

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \quad \text{bits}

with the convention 0 \log_2 0 = 0.

Entropy measures the average uncertainty (or information content) of X. It is maximised when X is uniformly distributed: H(X) \leq \log_2 |\mathcal{X}|, with equality iff p(x) = 1/|\mathcal{X}| for all x.

For a fair coin (p = 1/2), H = 1 bit. For a biased coin (p = 0.9), H \approx 0.47 bits: less uncertainty because the outcome is more predictable.
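
To make these numbers concrete, here is a minimal Python sketch (the entropy helper is our own illustration, not a library function) that reproduces both coin values:

    import math

    def entropy(pmf):
        """H(X) = -sum p(x) log2 p(x) in bits, with 0 log2 0 treated as 0."""
        return -sum(p * math.log2(p) for p in pmf if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin   -> 1.0 bit
    print(entropy([0.9, 0.1]))   # biased coin -> ~0.469 bits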


Definition: Conditional Entropy

The conditional entropy of XX given YY is

H(X|Y) = -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log_2 p(x|y)

This measures the residual uncertainty about X after observing Y. The key property is

H(X|Y) \leq H(X)

with equality iff X and Y are independent. Conditioning reduces entropy (on average).
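
A quick numerical illustration (our own example): let X be a fair bit and let Y be X observed through a binary symmetric channel with crossover probability 0.1. For either observed value y, the conditional pmf p(x|y) places probability 0.9 on one symbol and 0.1 on the other, so

H(X|Y) = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.47 \text{ bits} < H(X) = 1 \text{ bit}

Observing the noisy output removes roughly half a bit of uncertainty about X.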

Definition: Mutual Information

The mutual information between X and Y is

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Equivalently,

I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}

Mutual information measures the information that Y provides about X (and vice versa). Key properties (a numerical check follows the list):

  • I(X;Y) \geq 0, with equality iff X \perp Y (independence)
  • I(X;Y) = I(Y;X) (symmetric)
  • I(X;Y) \leq H(X) and I(X;Y) \leq H(Y)
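
As a quick numerical check, a short Python sketch (variable and function names are ours) that evaluates both forms of mutual information for the fair-bit-through-BSC(0.1) example used above and confirms they agree:

    import numpy as np

    # Joint pmf p(x, y) for a fair bit X observed through a BSC with crossover 0.1
    # (the same illustrative setup as in the conditional-entropy note above).
    p_xy = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def H(pmf):
        """Entropy in bits, with 0 log2 0 treated as 0."""
        pmf = pmf[pmf > 0]
        return float(-(pmf * np.log2(pmf)).sum())

    # I(X;Y) from the KL-divergence form
    I_xy = sum(p_xy[x, y] * np.log2(p_xy[x, y] / (p_x[x] * p_y[y]))
               for x in range(2) for y in range(2) if p_xy[x, y] > 0)

    # H(X|Y) = -sum p(x,y) log2 p(x|y)
    H_x_given_y = -sum(p_xy[x, y] * np.log2(p_xy[x, y] / p_y[y])
                       for x in range(2) for y in range(2) if p_xy[x, y] > 0)

    print(I_xy)                  # ~0.531 bits
    print(H(p_x) - H_x_given_y)  # same value: I(X;Y) = H(X) - H(X|Y)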

Definition: Channel Capacity

The capacity of a discrete memoryless channel (DMC) with transition probabilities p(y|x) is

C = \max_{p(x)} I(X;Y) \quad \text{bits/channel use}

The maximisation is over all possible input distributions p(x). Capacity is a property of the channel alone; it does not depend on any particular coding scheme.
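
For small alphabets the maximisation can be carried out numerically. The Python sketch below is illustrative only (the helpers mutual_information and capacity_binary_input are our own, not from a standard library): it brute-forces the optimal input distribution for any channel with two inputs, and applying it to the BSC and BEC examples that follow gives a numerical check on the closed-form answers.

    import numpy as np

    def mutual_information(p_x, channel):
        """I(X;Y) in bits, for input pmf p_x[x] and channel matrix channel[x, y] = p(y|x)."""
        p_xy = p_x[:, None] * channel      # joint pmf p(x, y)
        p_y = p_xy.sum(axis=0)             # output marginal p(y)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask])))

    def capacity_binary_input(channel, grid=2001):
        """Brute-force C = max over input pmfs [p0, 1 - p0] of I(X;Y)."""
        best_c, best_p0 = 0.0, 0.5
        for p0 in np.linspace(0.0, 1.0, grid):
            c = mutual_information(np.array([p0, 1.0 - p0]), channel)
            if c > best_c:
                best_c, best_p0 = c, p0
        return best_c, best_p0

    # Z-channel: a transmitted 0 is always received correctly,
    # a transmitted 1 is flipped to 0 with probability 0.1.
    z_channel = np.array([[1.0, 0.0],
                          [0.1, 0.9]])
    C, p0_opt = capacity_binary_input(z_channel)
    print(C, p0_opt)   # for an asymmetric channel the optimal input need not be uniform

A simple grid search suffices here because I(X;Y) is concave in the input distribution; for larger input alphabets the Blahut-Arimoto algorithm is the standard numerical tool.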


Theorem: Shannon's Channel Coding Theorem

For a discrete memoryless channel with capacity C:

  1. Achievability: For any rate R < C, there exist codes of sufficiently large block length n such that the maximum probability of error P_e^{(n)} \to 0 as n \to \infty.

  2. Converse: For any rate R > C, the error probability is bounded away from zero for every code: P_e^{(n)} \geq 1 - C/R - 1/(nR), which stays strictly positive for all sufficiently large n.

Together, these two statements identify C as the supremum of achievable rates for reliable communication.
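
Where the converse bound comes from (a sketch, assuming a code with 2^{nR} equiprobable messages W, channel input X^n, output Y^n, and decoded message \hat{W}): Fano's inequality gives H(W|\hat{W}) \leq 1 + P_e^{(n)} nR, and the data processing inequality together with memorylessness gives I(W;\hat{W}) \leq I(X^n;Y^n) \leq nC. Hence

nR = H(W) = I(W;\hat{W}) + H(W|\hat{W}) \leq nC + 1 + P_e^{(n)} nR

and rearranging yields P_e^{(n)} \geq 1 - C/R - 1/(nR), which is bounded away from zero whenever R > C.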

Shannon's theorem is an existence result: capacity-achieving codes exist, but the theorem does not tell us how to construct them. It took 45 years (turbo codes, 1993) and 48 years (LDPC codes, rediscovered 1996) to find practical codes that approach capacity within 1 dB.


Example: Binary Symmetric Channel Capacity

The binary symmetric channel (BSC) with crossover probability p has input and output alphabets \{0, 1\} and transition probabilities p(y \neq x) = p. Compute its capacity.

Example: Binary Erasure Channel Capacity

The binary erasure channel (BEC) with erasure probability \epsilon has output alphabet \{0, 1, e\}, where e denotes an erasure. The transition probabilities are p(y = x \mid x) = 1 - \epsilon and p(y = e \mid x) = \epsilon. Compute its capacity and the optimal input distribution.
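
For checking your answers to the two examples, the standard results are

C_{\text{BSC}} = 1 - H_b(p) \qquad \text{and} \qquad C_{\text{BEC}} = 1 - \epsilon

where H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p) is the binary entropy function; the uniform input distribution achieves capacity in both cases.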

Quick Check

Which of the following statements about entropy is false?

  • H(X) \geq 0 for any discrete random variable X
  • H(X,Y) = H(X) + H(Y) always holds
  • H(X|Y) \leq H(X), i.e., conditioning reduces entropy
  • I(X;Y) = I(Y;X), i.e., mutual information is symmetric

Common Mistake: Capacity is a Channel Property, Not a Code Property

Mistake:

Saying "the capacity of LDPC codes" or "turbo codes achieve higher capacity than convolutional codes."

Correction:

Capacity C = \max_{p(x)} I(X;Y) depends only on the channel transition probabilities p(y|x). It does not change with the coding scheme. What changes is how close a code operates to C. The correct statement is "LDPC codes operate within 0.5 dB of capacity"; the capacity itself is fixed.

Historical Note: Shannon's 1948 Paper

1948

Claude Shannon's "A Mathematical Theory of Communication," published in the Bell System Technical Journal in July and October 1948, is widely regarded as the founding document of information theory. Shannon introduced entropy as a measure of information, defined channel capacity, and proved the channel coding theorem, all in one paper. The result was so surprising that many engineers initially doubted it: the idea that one could transmit at a positive rate with vanishing error probability over a noisy channel contradicted the prevailing intuition that noise fundamentally limits reliability at any positive rate.

Historical Note: The Discrete Memoryless Channel Model

1948-1968

The discrete memoryless channel (DMC) abstraction, in which each channel use is independent and identically distributed, was central to Shannon's original analysis. While real channels have memory (ISI, fading correlation), the DMC remains the foundation of coding theory. Extensions to channels with memory (e.g., the Gilbert-Elliott model, finite-state channels) were developed in the 1960s by Blackwell, Breiman, Thomasian, and Gallager.

Entropy

A measure of the average uncertainty or information content of a discrete random variable: H(X) = -\sum p(x) \log_2 p(x). Measured in bits when the logarithm base is 2.

Related: Conditional Entropy, Mutual Information

Mutual Information

The amount of information that one random variable provides about another: I(X;Y) = H(X) - H(X|Y). It quantifies the reduction in uncertainty about X due to knowledge of Y.

Related: Entropy, Channel Capacity, Conditional Entropy

Channel Capacity

The maximum rate at which information can be transmitted over a channel with arbitrarily low error probability: C = \max_{p(x)} I(X;Y). Measured in bits per channel use.

Related: Mutual Information, Shannon's Channel Coding Theorem

Discrete Memoryless Channel (DMC)

A channel model where the output depends only on the current input through a fixed transition probability p(y|x), independent of all previous inputs and outputs. The BSC and BEC are canonical examples.

Related: Channel Capacity, BSC, BEC

Why This Matters: Deeper Treatment in the Information Theory Book

This section provides a condensed introduction to entropy, mutual information, and channel capacity, enough to motivate and compute capacity for wireless channels. The Information Theory and Applications (ITA) book develops the full theory:

  • Strong typicality and joint AEP (rigorous capacity proof)
  • Source coding (Huffman, arithmetic, Lempel-Ziv)
  • Rate-distortion theory (lossy compression limits)
  • Multi-user information theory (MAC, broadcast channel, interference channel, relay channel)
  • Network information theory (Slepian-Wolf, Wyner-Ziv)

Readers seeking rigorous proofs or multi-user extensions should consult the ITA book.

Key Takeaway

Channel capacity C = \max_{p(x)} I(X;Y) is the maximum rate at which reliable communication is possible. Shannon's coding theorem guarantees that codes achieving rates arbitrarily close to C exist, though finding practical capacity-approaching codes remained an open challenge for nearly 50 years.