Entropy of Discrete Random Variables

Measuring Uncertainty

We have characterized discrete random variables by their PMF, mean, and variance. But there is another fundamental quantity β€” one that measures how uncertain or surprising a random variable is, independent of the numerical values it takes. Shannon entropy captures the average "information content" of a random variable and is the cornerstone of information theory. This section provides a preview; the full development appears in Book ITA, Chapter 1.

Definition:

Shannon Entropy of a Discrete Random Variable

The Shannon entropy of a discrete random variable $X$ with PMF $p$ and support $\mathcal{X}$ is

$$H(X) \triangleq -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),$$

measured in bits. By convention, $0 \log_2 0 = 0$ (justified by $\lim_{t \to 0^+} t \log t = 0$).

When the logarithm base is $e$ (natural), the unit is nats; when the base is 10, the unit is hartleys. In this book (and throughout information theory), $\log$ without a subscript means $\log_2$.
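
The definition translates directly into a few lines of code. A minimal sketch in Python (the helper name `entropy_bits` is my own, not from the text):

```python
import math

def entropy_bits(pmf):
    """Shannon entropy in bits of a PMF given as a list of probabilities.
    Zero-probability terms are skipped, matching the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0 -- a fair coin carries exactly 1 bit
print(entropy_bits([0.9, 0.1]))   # ~0.469 -- a biased coin is much less uncertain
```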


Definition:

Surprise (Self-Information)

The surprise (or self-information) of observing $X = x$ is

$$\iota(x) \triangleq -\log_2 p(x) = \log_2 \frac{1}{p(x)}.$$

An event with probability $1$ carries zero surprise; a rare event carries high surprise. The entropy is the expected surprise: $H(X) = \mathbb{E}[\iota(X)]$.
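
For example (a short worked illustration), for $X \sim \text{Bernoulli}(1/4)$ the two possible surprises and their probability-weighted average are

$$\iota(1) = \log_2 4 = 2 \text{ bits}, \qquad \iota(0) = \log_2 \tfrac{4}{3} \approx 0.415 \text{ bits}, \qquad H(X) = \tfrac{1}{4}(2) + \tfrac{3}{4}(0.415) \approx 0.811 \text{ bits}.$$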

Theorem: Entropy Bounds

For a discrete random variable $X$ with $|\mathcal{X}|$ possible values:

$$0 \leq H(X) \leq \log_2 |\mathcal{X}|.$$

  • $H(X) = 0$ if and only if $X$ is deterministic (one value has probability 1).
  • $H(X) = \log_2 |\mathcal{X}|$ if and only if $X$ is uniformly distributed.

A deterministic variable has no uncertainty, so entropy is zero. The uniform distribution is the "most uncertain" among all distributions on a given support β€” any other distribution concentrates mass somewhere, reducing uncertainty.
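
A quick numeric check of both endpoints on a four-element support (a self-contained sketch; names are illustrative):

```python
import math

def entropy_bits(pmf):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy_bits([1.0, 0.0, 0.0, 0.0]))      # 0.0    -> deterministic, no uncertainty
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0    -> log2(4), the maximum on 4 outcomes
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))      # ~1.357 -> strictly between the two bounds
```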

Example: Entropy of the Bernoulli Distribution

Compute $H(X)$ for $X \sim \text{Bernoulli}(p)$.

Binary Entropy Function $h_b(p)$

The binary entropy function $h_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$ measures the uncertainty of a Bernoulli($p$) random variable. It achieves its maximum of 1 bit at $p = 1/2$.
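
Evaluating $h_b(p)$ at a few points shows the symmetry about $p = 1/2$ and the peak of 1 bit there. A minimal sketch (the helper name `binary_entropy` is my own):

```python
import math

def binary_entropy(p):
    """h_b(p) = -p*log2(p) - (1-p)*log2(1-p), with h_b(0) = h_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"h_b({p}) = {binary_entropy(p):.4f} bits")
# Output is symmetric about p = 1/2, where h_b peaks at exactly 1 bit.
```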

Example: Entropy of a Fair Die

Compute the entropy of a fair six-sided die.
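
A short worked solution: each face has probability $1/6$, so

$$H(X) = -\sum_{k=1}^{6} \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits},$$

which matches the upper bound $\log_2 |\mathcal{X}|$ for a uniform distribution on six outcomes.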

Theorem: Uniform Distribution Maximizes Entropy

Among all discrete distributions on a finite set $\mathcal{X}$, the uniform distribution $p(x) = 1/|\mathcal{X}|$ uniquely maximizes the entropy:

$$H(X) \leq \log_2 |\mathcal{X}|,$$

with equality if and only if $X$ is uniform on $\mathcal{X}$.
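
One standard way to see this is Jensen's inequality applied to the concave function $\log_2$:

$$H(X) = \mathbb{E}\left[\log_2 \frac{1}{p(X)}\right] \leq \log_2 \mathbb{E}\left[\frac{1}{p(X)}\right] = \log_2 \sum_{x:\, p(x) > 0} p(x)\cdot\frac{1}{p(x)} \leq \log_2 |\mathcal{X}|,$$

with equality throughout only when $1/p(X)$ is constant over all of $\mathcal{X}$, i.e., when $X$ is uniform.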

Entropy vs. Variance

Entropy and variance both measure "spread," but they capture different aspects. Variance depends on the numerical values of the random variable and measures deviation from the mean. Entropy depends only on the probabilities and measures uncertainty about the outcome. Two random variables can have the same entropy but vastly different variances (e.g., $X \sim \text{Uniform}\{0, 1\}$ and $Y \sim \text{Uniform}\{0, 1000\}$ both have $H = 1$ bit, but $\text{Var}(Y) \gg \text{Var}(X)$).
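
A small numeric comparison of the two distributions mentioned above (a sketch; names are illustrative):

```python
import math

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def variance(values, pmf):
    mean = sum(v * p for v, p in zip(values, pmf))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, pmf))

pmf = [0.5, 0.5]
print(entropy_bits(pmf), variance([0, 1], pmf))     # 1.0 bit, Var = 0.25
print(entropy_bits(pmf), variance([0, 1000], pmf))  # 1.0 bit, Var = 250000.0
```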

Common Mistake: Discrete Entropy Is Never Negative

Mistake:

Confusing discrete entropy (always $\geq 0$) with differential entropy (which can be negative).

Correction:

For a discrete random variable, $H(X) \geq 0$ always. The differential entropy $h(X) = -\int f(x) \log f(x)\,dx$ of a continuous random variable can be negative; for example, a Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ has $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$, which is negative when $\sigma^2 < 1/(2\pi e)$.
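
A quick sanity check on the Gaussian case, expressing the differential entropy in bits (a sketch; the function name is my own):

```python
import math

def gaussian_diff_entropy_bits(sigma2):
    # h(X) = 0.5 * log2(2*pi*e*sigma^2) for X ~ N(mu, sigma^2), in bits
    return 0.5 * math.log2(2 * math.pi * math.e * sigma2)

print(gaussian_diff_entropy_bits(1.0))   # ~2.05 bits (positive)
print(gaussian_diff_entropy_bits(0.01))  # ~-1.27 bits (negative, since 0.01 < 1/(2*pi*e))
```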

πŸŽ“ CommIT Contribution (2003)

Entropy-Based Resource Allocation in Wireless Networks

G. Caire, S. Shamai (Shitz) β€” IEEE Transactions on Information Theory

The entropy and mutual information concepts introduced in this section are the building blocks of capacity analysis. Caire and Shamai's landmark paper on the MIMO broadcast channel uses entropy-based arguments to characterize the capacity region β€” showing that dirty paper coding achieves the sum capacity. The information measures from this chapter are the language in which all capacity results are expressed.


Why This Matters: Entropy as the Fundamental Limit of Data Compression

Shannon's source coding theorem states that a discrete memoryless source with entropy $H(X)$ bits per symbol can be losslessly compressed to an average of $H(X)$ bits per symbol (in the limit of long block lengths), but no fewer. This is the direct engineering consequence of entropy: it sets the minimum bit rate for representing data. In a communication system, source coding (compression) before channel coding reduces the required transmission rate, and entropy tells us exactly how much compression is possible.
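
A concrete illustration with assumed numbers (my own example, not from the text): a four-symbol source with probabilities $(1/2, 1/4, 1/8, 1/8)$ has entropy 1.75 bits per symbol, and the prefix code $\{0, 10, 110, 111\}$ achieves that average length exactly, beating the 2 bits per symbol of a fixed-length code.

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]   # assumed source distribution
code_lengths = [1, 2, 3, 3]         # lengths of the prefix code 0, 10, 110, 111

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, code_lengths))

print(entropy)  # 1.75 bits/symbol -- the compression limit for this source
print(avg_len)  # 1.75 bits/symbol -- achieved by this code; fixed-length needs 2
```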

Quick Check

Which distribution on $\{1, 2, 3, 4\}$ has the highest entropy?

$p = (1/2, 1/4, 1/8, 1/8)$

$p = (1/4, 1/4, 1/4, 1/4)$

$p = (1, 0, 0, 0)$

$p = (1/2, 1/2, 0, 0)$

Historical Note: Shannon, von Neumann, and the Name 'Entropy'

1948

Claude Shannon introduced the entropy function in his 1948 paper "A Mathematical Theory of Communication." The story goes that John von Neumann suggested the name "entropy" because (a) the formula is identical to Boltzmann's thermodynamic entropy, and (b) "nobody knows what entropy really is, so in a debate you will always have the advantage." Whether apocryphal or not, the name stuck, and the connection between information-theoretic and thermodynamic entropy has proved to be far deeper than a mere analogy.


Shannon Entropy

$H(X) = -\sum_x p(x) \log_2 p(x)$. Measures the average information content (in bits) of a discrete random variable.

Related: Probability Mass Function (PMF)

Key Takeaway

Shannon entropy $H(X) = -\sum_x p(x) \log_2 p(x)$ is the average surprise of a discrete random variable. It is bounded by $0 \leq H(X) \leq \log_2 |\mathcal{X}|$, with the upper bound achieved by the uniform distribution. Entropy depends only on the probabilities, not on the numerical values of $X$.