Entropy of Discrete Random Variables

Measuring Uncertainty

We have characterized discrete random variables by their PMF, mean, and variance. But there is another fundamental quantity β€” one that measures how uncertain or surprising a random variable is, independent of the numerical values it takes. Shannon entropy captures the average "information content" of a random variable and is the cornerstone of information theory. This section provides a preview; the full development appears in Book ITA, Chapter 1.

Definition:

Shannon Entropy of a Discrete Random Variable

The Shannon entropy of a discrete random variable $X$ with PMF $p$ and support $\mathcal{X}$ is

$$H(X) \triangleq -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$

measured in bits. By convention, $0 \log_2 0 = 0$ (justified by $\lim_{t \to 0^+} t \log t = 0$).

When the logarithm base is $e$ (natural), the unit is nats; when the base is 10, the unit is hartleys. In this book (and throughout information theory), $\log$ without a subscript means $\log_2$.
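
The definition translates directly into a few lines of code. A minimal sketch in Python (the helper name `entropy_bits` is my own, not from the text):

```python
import math

def entropy_bits(pmf):
    """Shannon entropy in bits of a PMF given as a list of probabilities.
    Zero-probability terms are skipped, matching the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0 -- a fair coin carries exactly 1 bit
print(entropy_bits([0.9, 0.1]))   # ~0.469 -- a biased coin is much less uncertain
```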


Definition:

Surprise (Self-Information)

The surprise (or self-information) of observing $X = x$ is

$$\iota(x) \triangleq -\log_2 p(x) = \log_2 \frac{1}{p(x)}.$$

An event with probability $1$ carries zero surprise; a rare event carries high surprise. The entropy is the expected surprise: $H(X) = \mathbb{E}[\iota(X)]$.
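
For example (a short worked illustration), for $X \sim \text{Bernoulli}(1/4)$ the two possible surprises and their probability-weighted average are

$$\iota(1) = \log_2 4 = 2 \text{ bits}, \qquad \iota(0) = \log_2 \tfrac{4}{3} \approx 0.415 \text{ bits}, \qquad H(X) = \tfrac{1}{4}(2) + \tfrac{3}{4}(0.415) \approx 0.811 \text{ bits}.$$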

Theorem: Entropy Bounds

For a discrete random variable $X$ with $|\mathcal{X}|$ possible values:

$$0 \leq H(X) \leq \log_2 |\mathcal{X}|.$$

  • $H(X) = 0$ if and only if $X$ is deterministic (one value has probability 1).
  • $H(X) = \log_2 |\mathcal{X}|$ if and only if $X$ is uniformly distributed.

A deterministic variable has no uncertainty, so entropy is zero. The uniform distribution is the "most uncertain" among all distributions on a given support β€” any other distribution concentrates mass somewhere, reducing uncertainty.
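
A quick numeric check of both endpoints on a four-element support (a self-contained sketch; names are illustrative):

```python
import math

def entropy_bits(pmf):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0.0    -> deterministic, no uncertainty
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0    -> log2(4), the maximum on 4 outcomes
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))      # ~1.357 -> strictly between the two bounds
```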

Example: Entropy of the Bernoulli Distribution

Compute $H(X)$ for $X \sim \text{Bernoulli}(p)$.

Binary Entropy Function $h_b(p)$

The binary entropy function $h_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$ measures the uncertainty of a Bernoulli($p$) random variable. It achieves its maximum of 1 bit at $p = 1/2$.
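
Evaluating $h_b(p)$ at a few points shows the symmetry about $p = 1/2$ and the peak of 1 bit there. A minimal sketch (the helper name `binary_entropy` is my own):

```python
import math

def binary_entropy(p):
    """h_b(p) = -p*log2(p) - (1-p)*log2(1-p), with h_b(0) = h_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"h_b({p}) = {binary_entropy(p):.4f} bits")
# Output is symmetric about p = 1/2, where h_b peaks at exactly 1 bit.
```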

Example: Entropy of a Fair Die

Compute the entropy of a fair six-sided die.
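
A short worked solution: each face has probability $1/6$, so

$$H(X) = -\sum_{k=1}^{6} \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits},$$

which matches the upper bound $\log_2 |\mathcal{X}|$ for a uniform distribution on six outcomes.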

Theorem: Uniform Distribution Maximizes Entropy

Among all discrete distributions on a finite set $\mathcal{X}$, the uniform distribution $p(x) = 1/|\mathcal{X}|$ uniquely maximizes the entropy:

$$H(X) \leq \log_2 |\mathcal{X}|,$$

with equality if and only if $X$ is uniform on $\mathcal{X}$.
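
One standard way to see this is Jensen's inequality applied to the concave function $\log_2$:

$$H(X) = \mathbb{E}\left[\log_2 \frac{1}{p(X)}\right] \leq \log_2 \mathbb{E}\left[\frac{1}{p(X)}\right] = \log_2 \sum_{x:\, p(x) > 0} p(x)\cdot\frac{1}{p(x)} \leq \log_2 |\mathcal{X}|,$$

with equality throughout only when $1/p(X)$ is constant over all of $\mathcal{X}$, i.e., when $X$ is uniform.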

Entropy vs. Variance

Entropy and variance both measure "spread," but they capture different aspects. Variance depends on the numerical values of the random variable and measures deviation from the mean. Entropy depends only on the probabilities and measures uncertainty about the outcome. Two random variables can have the same entropy but vastly different variances (e.g., $X \sim \text{Uniform}\{0, 1\}$ and $Y \sim \text{Uniform}\{0, 1000\}$ both have $H = 1$ bit, but $\text{Var}(Y) \gg \text{Var}(X)$).
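
A small numeric comparison of the two distributions mentioned above (a sketch; names are illustrative):

```python
import math

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def variance(values, pmf):
    mean = sum(v * p for v, p in zip(values, pmf))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, pmf))

pmf = [0.5, 0.5]
print(entropy_bits(pmf), variance([0, 1], pmf))     # 1.0 bit, Var = 0.25
print(entropy_bits(pmf), variance([0, 1000], pmf))  # 1.0 bit, Var = 250000.0
```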

Common Mistake: Discrete Entropy Is Never Negative

Mistake:

Confusing discrete entropy (always $\geq 0$) with differential entropy (which can be negative).

Correction:

For a discrete random variable, $H(X) \geq 0$ always. The differential entropy $h(X) = -\int f(x) \log f(x)\,dx$ of a continuous random variable can be negative; for example, a Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ has $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$, which is negative when $\sigma^2 < 1/(2\pi e)$.
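
A quick sanity check on the Gaussian case, expressing the differential entropy in bits (a sketch; the function name is my own):

```python
import math

def gaussian_diff_entropy_bits(sigma2):
    # h(X) = 0.5 * log2(2*pi*e*sigma^2) for X ~ N(mu, sigma^2), in bits
    return 0.5 * math.log2(2 * math.pi * math.e * sigma2)

print(gaussian_diff_entropy_bits(1.0))   # ~2.05 bits (positive)
print(gaussian_diff_entropy_bits(0.01))  # ~-1.27 bits (negative, since 0.01 < 1/(2*pi*e))
```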

πŸŽ“ CommIT Contribution (2003)

Entropy-Based Resource Allocation in Wireless Networks

G. Caire, S. Shamai (Shitz) β€” IEEE Transactions on Information Theory

The entropy and mutual information concepts introduced in this section are the building blocks of capacity analysis. Caire and Shamai's landmark paper on the MIMO broadcast channel uses entropy-based arguments to characterize the capacity region β€” showing that dirty paper coding achieves the sum capacity. The information measures from this chapter are the language in which all capacity results are expressed.


Why This Matters: Entropy as the Fundamental Limit of Data Compression

Shannon's source coding theorem states that a discrete memoryless source with entropy $H(X)$ bits per symbol can be losslessly compressed to an average of $H(X)$ bits per symbol (in the limit of long block lengths), but no fewer. This is the direct engineering consequence of entropy: it sets the minimum bit rate for representing data. In a communication system, source coding (compression) before channel coding reduces the required transmission rate, and entropy tells us exactly how much compression is possible.
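
A concrete illustration with assumed numbers (my own example, not from the text): a four-symbol source with probabilities $(1/2, 1/4, 1/8, 1/8)$ has entropy 1.75 bits per symbol, and the prefix code $\{0, 10, 110, 111\}$ achieves that average length exactly, beating the 2 bits per symbol of a fixed-length code.

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]   # assumed source distribution
code_lengths = [1, 2, 3, 3]         # lengths of the prefix code 0, 10, 110, 111

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, code_lengths))

print(entropy)  # 1.75 bits/symbol -- the compression limit for this source
print(avg_len)  # 1.75 bits/symbol -- achieved by this code; fixed-length needs 2
```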

Quick Check

Which distribution on $\{1, 2, 3, 4\}$ has the highest entropy?

$p = (1/2, 1/4, 1/8, 1/8)$

$p = (1/4, 1/4, 1/4, 1/4)$

$p = (1, 0, 0, 0)$

$p = (1/2, 1/2, 0, 0)$

Historical Note: Shannon, von Neumann, and the Name 'Entropy'

1948

Claude Shannon introduced the entropy function in his 1948 paper "A Mathematical Theory of Communication." The story goes that John von Neumann suggested the name "entropy" because (a) the formula is identical to Boltzmann's thermodynamic entropy, and (b) "nobody knows what entropy really is, so in a debate you will always have the advantage." Whether apocryphal or not, the name stuck, and the connection between information-theoretic and thermodynamic entropy has proved to be far deeper than a mere analogy.


Shannon Entropy

$H(X) = -\sum_x p(x) \log_2 p(x)$. Measures the average information content (in bits) of a discrete random variable.

Related: Probability Mass Function (PMF)

Key Takeaway

Shannon entropy $H(X) = -\sum_x p(x) \log_2 p(x)$ is the average surprise of a discrete random variable. It is bounded by $0 \leq H(X) \leq \log_2 |\mathcal{X}|$, with the upper bound achieved by the uniform distribution. Entropy depends only on the probabilities, not on the numerical values of $X$.