Entropy of Discrete Random Variables
Measuring Uncertainty
We have characterized discrete random variables by their PMF, mean, and variance. But there is another fundamental quantity, one that measures how uncertain or surprising a random variable is, independent of the numerical values it takes. Shannon entropy captures the average "information content" of a random variable and is the cornerstone of information theory. This section provides a preview; the full development appears in Book ITA, Chapter 1.
Definition: Shannon Entropy of a Discrete Random Variable
The Shannon entropy of a discrete random variable $X$ with PMF $p_X$ and support $\mathcal{X}$ is
$$H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log_2 p_X(x),$$
measured in bits. By convention, $0 \log 0 = 0$ (justified by $\lim_{p \to 0^+} p \log_2 p = 0$).
When the logarithm base is $e$ (natural), the unit is nats; when base 10, hartleys. In this book (and throughout information theory), $\log$ without a subscript means $\log_2$.
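For example, since $\log_2 x = \ln x / \ln 2$, an entropy computed in nats converts to bits by dividing by $\ln 2$:
$$H_{\text{bits}}(X) = \frac{H_{\text{nats}}(X)}{\ln 2} \approx 1.443 \, H_{\text{nats}}(X).$$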
Definition: Surprise (Self-Information)
The surprise (or self-information) of observing $X = x$ is
$$I(x) = \log_2 \frac{1}{p_X(x)} = -\log_2 p_X(x).$$
An event with probability $1$ carries zero surprise; a rare event carries high surprise. The entropy is the expected surprise: $H(X) = \mathbb{E}[I(X)]$.
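As a quick numerical illustration (a minimal sketch; the PMF values are invented for the example), the following Python snippet computes the surprise of each outcome and verifies that the entropy is the probability-weighted average of the surprises:

```python
import math

def surprise(p: float) -> float:
    """Self-information -log2(p), in bits, of an outcome with probability p."""
    return -math.log2(p)

def entropy(pmf: dict) -> float:
    """Shannon entropy H(X) = sum_x p(x) * surprise(x), skipping zero-probability outcomes."""
    return sum(p * surprise(p) for p in pmf.values() if p > 0)

# Illustrative PMF for a biased three-outcome experiment.
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
for x, p in pmf.items():
    print(f"P({x}) = {p:.2f}, surprise = {surprise(p):.3f} bits")
print(f"H(X) = {entropy(pmf):.3f} bits")  # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits
```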
Theorem: Entropy Bounds
For a discrete random variable $X$ with $n$ possible values:
$$0 \le H(X) \le \log_2 n.$$
- $H(X) = 0$ if and only if $X$ is deterministic (one value has probability 1).
- $H(X) = \log_2 n$ if and only if $X$ is uniformly distributed.
A deterministic variable has no uncertainty, so entropy is zero. The uniform distribution is the "most uncertain" among all distributions on a given support: any other distribution concentrates mass somewhere, reducing uncertainty.
Lower bound
Each term $-p_X(x) \log_2 p_X(x) \ge 0$, since $0 < p_X(x) \le 1$ implies $\log_2 p_X(x) \le 0$. So $H(X) \ge 0$. Equality holds iff every nonzero probability satisfies $p_X(x) = 1$, which forces $X$ to be deterministic.
Upper bound via Jensen's inequality
Since $\log_2$ is concave, Jensen's inequality gives:
$$H(X) = \mathbb{E}\!\left[\log_2 \frac{1}{p_X(X)}\right] \le \log_2 \mathbb{E}\!\left[\frac{1}{p_X(X)}\right] = \log_2 \sum_{x:\, p_X(x) > 0} p_X(x) \cdot \frac{1}{p_X(x)} \le \log_2 n.$$
Equality in Jensen holds iff $1/p_X(X)$ is constant a.s., i.e., $p_X(x) = 1/n$ for all $x$.
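The bounds are easy to check numerically. The sketch below (illustrative only) evaluates the entropy of a degenerate, a uniform, and a few randomly generated distributions on $n = 6$ outcomes and confirms $0 \le H(X) \le \log_2 n$:

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits, using the 0 * log(0) = 0 convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 6
log2n = math.log2(n)

degenerate = [1.0] + [0.0] * (n - 1)   # one value has probability 1
uniform = [1.0 / n] * n
print(entropy(degenerate))   # 0.0 -- lower bound attained
print(entropy(uniform))      # 2.585... = log2(6) -- upper bound attained

for _ in range(3):
    weights = [random.random() for _ in range(n)]
    p = [w / sum(weights) for w in weights]
    h = entropy(p)
    assert 0.0 <= h <= log2n + 1e-12   # the bounds from the theorem
    print(f"H = {h:.3f} bits  <=  log2({n}) = {log2n:.3f}")
```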
Example: Entropy of the Bernoulli Distribution
Compute $H(X)$ for $X \sim \text{Bernoulli}(p)$.
Apply the definition
$$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p) \equiv h_b(p).$$
$h_b(p)$ is called the binary entropy function.
Properties
- $h_b(0) = h_b(1) = 0$ (deterministic outcomes).
- $h_b(1/2) = 1$ bit (maximum uncertainty: a fair coin).
- $h_b(p) = h_b(1 - p)$ (symmetric about $p = 1/2$).
- $h_b$ is concave on $[0, 1]$.
Binary Entropy Function
The binary entropy function $h_b(p)$ measures the uncertainty of a Bernoulli($p$) random variable. It achieves its maximum of 1 bit at $p = 1/2$.
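A short Python sketch (illustrative, not part of the example) tabulates $h_b(p)$ and makes the symmetry and the maximum at $p = 1/2$ visible:

```python
import math

def h_b(p: float) -> float:
    """Binary entropy h_b(p) = -p*log2(p) - (1-p)*log2(1-p), with h_b(0) = h_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"h_b({p:.2f}) = {h_b(p):.4f} bits")
# Maximum of 1 bit at p = 0.5; note the symmetry h_b(p) = h_b(1 - p).
```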
Example: Entropy of a Fair Die
Compute the entropy of a fair six-sided die.
Apply the uniform formula
For a uniform distribution on $n = 6$ outcomes:
$$H(X) = \log_2 6 \approx 2.585 \text{ bits}.$$
Interpretation
On average, we need about 2.585 bits to describe the outcome of a die roll. Since we cannot send a fractional bit, a fixed-length code needs at least 3 bits per roll, but with clever coding (Huffman, arithmetic), we can approach $\log_2 6 \approx 2.585$ bits per symbol on average.
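To see the "approach $\log_2 6$" claim concretely, the following illustrative calculation encodes blocks of $k$ rolls with a fixed-length code; the per-roll rate $\lceil k \log_2 6 \rceil / k$ falls toward $\log_2 6$ as $k$ grows:

```python
import math

# A block of k fair-die rolls has 6**k equally likely outcomes, so a fixed-length
# code needs ceil(k * log2(6)) bits; the per-roll rate approaches log2(6) ≈ 2.585 bits.
target = math.log2(6)
for k in (1, 2, 5, 10, 100):
    bits_per_block = math.ceil(k * target)
    print(f"k = {k:>3}: {bits_per_block / k:.4f} bits per roll (entropy = {target:.4f})")
```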
Theorem: Uniform Distribution Maximizes Entropy
Among all discrete distributions on a finite set $\mathcal{X}$, the uniform distribution uniquely maximizes the entropy:
$$H(X) \le \log_2 |\mathcal{X}|,$$
with equality if and only if $X$ is uniform on $\mathcal{X}$.
Already proved
This is the upper bound from the Entropy Bounds theorem above, proved via Jensen's inequality.
Entropy vs. Variance
Entropy and variance both measure "spread," but they capture different aspects. Variance depends on the numerical values of the random variable and measures deviation from the mean. Entropy depends only on the probabilities and measures uncertainty about the outcome. Two random variables can have the same entropy but vastly different variances (e.g., a fair coin on $\{0, 1\}$ and a fair coin on $\{0, 1000\}$ both have $H = 1$ bit, but very different variances).
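The point is easy to verify directly; the sketch below (with made-up values) computes entropy and variance for the two fair coins mentioned above:

```python
import math

def entropy(pmf):
    """Entropy in bits from a dict mapping values to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def variance(pmf):
    """Variance of a discrete RV from its value -> probability dict."""
    mean = sum(x * p for x, p in pmf.items())
    return sum(p * (x - mean) ** 2 for x, p in pmf.items())

X = {0: 0.5, 1: 0.5}        # fair coin on {0, 1}
Y = {0: 0.5, 1000: 0.5}     # fair coin on {0, 1000}
print(entropy(X), entropy(Y))      # both 1.0 bit: same probabilities, same entropy
print(variance(X), variance(Y))    # 0.25 vs 250000.0: variance depends on the values
```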
Common Mistake: Discrete Entropy Is Never Negative
Mistake:
Confusing discrete entropy (always $\ge 0$) with differential entropy (which can be negative).
Correction:
For a discrete random variable, $H(X) \ge 0$ always. The differential entropy $h(X)$ for continuous RVs can be negative; for example, $h(X) = \log_2 a$ for $X \sim \text{Uniform}(0, a)$, which is negative when $a < 1$.
Entropy-Based Resource Allocation in Wireless Networks
The entropy and mutual information concepts introduced in this chapter are the building blocks of capacity analysis. Caire and Shamai's landmark paper on the MIMO broadcast channel uses entropy-based arguments to analyze the capacity region, showing that dirty paper coding achieves the sum capacity. The information measures from this chapter are the language in which all capacity results are expressed.
Why This Matters: Entropy as the Fundamental Limit of Data Compression
Shannon's source coding theorem states that a discrete memoryless source with entropy $H$ bits per symbol can be losslessly compressed to an average arbitrarily close to $H$ bits per symbol, but no fewer. This is the direct engineering consequence of entropy: it sets the minimum bit rate for representing data. In a communication system, source coding (compression) before channel coding reduces the required transmission rate, and entropy tells us exactly how much compression is possible.
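As a small sanity check of this limit (an illustrative sketch, not a production encoder), the program below builds a Huffman code for a dyadic source and compares its average codeword length with the entropy:

```python
import heapq
import math

def huffman_lengths(pmf):
    """Return {symbol: codeword length} for a Huffman code built from a value -> prob dict."""
    # Heap entries: (probability, tie-breaker, list of symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in pmf}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:      # every merge adds one bit to each symbol in the subtree
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, len(lengths) + len(heap), syms1 + syms2))
    return lengths

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # illustrative dyadic source
H = -sum(p * math.log2(p) for p in pmf.values())
lengths = huffman_lengths(pmf)
avg_len = sum(pmf[s] * lengths[s] for s in pmf)
print(f"entropy H = {H:.3f} bits, Huffman average length = {avg_len:.3f} bits")
# For a dyadic source the two coincide; in general H <= avg_len < H + 1.
```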
Quick Check
Which distribution on a finite set of $n$ values has the highest entropy?
The uniform distribution maximizes entropy: $H = \log_2 n$ bits.
Historical Note: Shannon, von Neumann, and the Name 'Entropy'
Claude Shannon introduced the entropy function in his 1948 paper "A Mathematical Theory of Communication." The story goes that John von Neumann suggested the name "entropy" because (a) the formula is identical to Boltzmann's thermodynamic entropy, and (b) "nobody knows what entropy really is, so in a debate you will always have the advantage." Whether apocryphal or not, the name stuck, and the connection between information-theoretic and thermodynamic entropy has proved to be far deeper than a mere analogy.
Shannon Entropy
$H(X) = -\sum_{x} p_X(x) \log_2 p_X(x)$. Measures the average information content (in bits) of a discrete random variable.
Related: Probability Mass Function (PMF)
Key Takeaway
Shannon entropy is the average surprise of a discrete random variable. It is bounded by $0 \le H(X) \le \log_2 n$, with the upper bound achieved by the uniform distribution. Entropy depends only on the probabilities, not on the numerical values of $X$.