Entropy and Its Operational Meaning
Why Entropy?
Suppose you are designing a communication system. A source produces symbols from an alphabet $\mathcal{X}$ according to a known distribution $p(x)$. You want to represent each source output as a binary string and transmit it over a noiseless channel. The fundamental question is:
What is the minimum average number of bits per symbol needed to represent the source without any loss?
Shannon's extraordinary answer is that this minimum is a single number determined entirely by the source distribution — the entropy $H(X)$. This section defines entropy, builds intuition for why it measures "information," and establishes the basic properties that make it the cornerstone of the entire theory.
Definition: Shannon Entropy
Shannon Entropy
Let $X$ be a discrete random variable taking values in a finite alphabet $\mathcal{X}$ with probability mass function $p(x) = \Pr\{X = x\}$. The entropy of $X$ is
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$
where $\log$ denotes the base-2 logarithm and we adopt the convention $0 \log 0 = 0$ (justified by continuity, since $t \log t \to 0$ as $t \to 0^+$).
When we use the natural logarithm, the unit is nats; when we use $\log_2$, the unit is bits. One nat $= \log_2 e \approx 1.44$ bits.
The notation $H(X)$ can be misleading: entropy depends on the distribution of $X$, not on the particular values that $X$ takes. More precisely, $H$ is a functional of the PMF $p$, and we may write $H(p)$. We write $H(X)$ when the PMF is understood from context.
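To make the definition concrete, here is a minimal Python sketch (the helper name `entropy` is our own, not a standard library function) that evaluates $H$ from a list of probabilities, handling the $0 \log 0 = 0$ convention by skipping zero-probability terms.

```python
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a discrete PMF given as a list of probabilities.

    Terms with p == 0 contribute nothing, matching the convention 0 log 0 = 0.
    """
    assert abs(sum(pmf) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(-p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))    # 0.0 bits: a deterministic variable
print(entropy([0.25] * 4))    # 2.0 bits: uniform over 4 outcomes
```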
Entropy
A measure of the average uncertainty or information content of a random variable. For a discrete random variable $X$ with PMF $p(x)$, $H(X) = -\sum_x p(x) \log p(x)$.
Related: Conditional entropy, Mutual information
Bit (unit of information)
The entropy of a fair coin flip. One bit is the amount of information gained (or uncertainty resolved) by observing the outcome of a single fair binary experiment. Formally, $H(X) = 1$ bit for $X \sim \mathrm{Bernoulli}(1/2)$.
Related: Entropy
Three Ways to Think About Entropy
Entropy admits several equivalent interpretations, each illuminating a different facet of the concept:
- Average surprise. Define the surprise or self-information of outcome $x$ as $i(x) = \log \frac{1}{p(x)}$. Then $H(X) = \mathbb{E}[i(X)]$ — entropy is the average surprise. Rare events carry more surprise; entropy averages this over the distribution.
- Minimum expected description length. Shannon's source coding theorem (Chapter 5) proves that $H(X)$ is the minimum average number of bits per symbol needed to losslessly encode an i.i.d. source with distribution $p$. We cannot do better, and we can get arbitrarily close (a small numerical illustration follows this list).
- Average number of binary questions. Imagine playing "twenty questions" to identify $X$. The optimal strategy requires approximately $H(X)$ yes/no questions on average (between $H(X)$ and $H(X)+1$). Each question is worth at most one bit.
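The description-length interpretation can be checked on a toy example. The sketch below is our own illustration (not the source coding construction of Chapter 5): it builds Huffman codeword lengths for a small PMF and compares the average length with the entropy; for dyadic probabilities the two coincide exactly.

```python
import heapq
from math import log2

def huffman_code_lengths(pmf):
    """Return {symbol: codeword length} for a Huffman code over the given PMF.

    Only the lengths are needed to compare average codeword length with entropy,
    so no actual codewords are produced.
    """
    # Each heap entry: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in pmf}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Every symbol under a merged node gets one more bit in its codeword.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf.values() if p > 0)

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # dyadic probabilities
lengths = huffman_code_lengths(pmf)
avg_len = sum(pmf[s] * lengths[s] for s in pmf)
print(f"entropy  = {entropy(pmf):.3f} bits")     # 1.750
print(f"avg code = {avg_len:.3f} bits/symbol")   # 1.750 for this dyadic source
```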
Historical Note: Shannon, von Neumann, and the Name 'Entropy'
When Claude Shannon developed his theory in the late 1940s, he needed a name for the quantity $-\sum_i p_i \log p_i$. The story, as told by Myron Tribus, is that Shannon asked John von Neumann for advice. Von Neumann reportedly said: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage."
Whether or not the story is apocryphal, the connection is real. Boltzmann's entropy $S = k_B \ln W$ and Shannon's entropy are related: for a uniform distribution over $W$ microstates, $S = (k_B \ln 2)\, H$. The physical and information-theoretic concepts are deeply intertwined — a connection that would later inspire Landauer's principle and the thermodynamics of computation.
Theorem: Bounds on Entropy
Let $X$ be a discrete random variable with finite alphabet $\mathcal{X}$, $|\mathcal{X}| < \infty$. Then:
$$0 \le H(X) \le \log |\mathcal{X}|.$$
The lower bound holds with equality if and only if $X$ is deterministic (i.e., $p(x) = 1$ for some $x$). The upper bound holds with equality if and only if $X$ is uniformly distributed over $\mathcal{X}$.
A deterministic variable carries no surprise — we always know its value. A uniform variable is maximally unpredictable — every outcome is equally likely. Entropy quantifies this spectrum from certainty to maximum uncertainty.
Non-negativity
Since $0 \le p(x) \le 1$ for all $x$, we have $-\log p(x) \ge 0$, hence $-p(x) \log p(x) \ge 0$. Summing over $x$ gives $H(X) \ge 0$. Equality holds iff $p(x) \log p(x) = 0$ for all $x$, which (by our convention $0 \log 0 = 0$) means $p(x) \in \{0, 1\}$ for all $x$.
Upper bound via Jensen's inequality
We write
$$H(X) = \sum_{x} p(x) \log \frac{1}{p(x)} = \mathbb{E}\left[\log \frac{1}{p(X)}\right].$$
Since $\log$ is concave, Jensen's inequality gives
$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \le \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log \sum_{x:\, p(x) > 0} p(x) \cdot \frac{1}{p(x)} \le \log |\mathcal{X}|.$$
Equality in Jensen's inequality holds iff $\frac{1}{p(X)}$ is constant a.s., i.e., $p(x) = \frac{1}{|\mathcal{X}|}$ for all $x$.
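A quick numerical sanity check of both bounds, using an `entropy` helper like the one sketched earlier (the random-PMF generator is our own illustration):

```python
import random
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf if p > 0)

def random_pmf(n):
    # Draw an arbitrary valid point of the probability simplex.
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

n = 5
for _ in range(1000):
    h = entropy(random_pmf(n))
    assert 0.0 <= h <= log2(n) + 1e-12   # 0 <= H(X) <= log |X| always holds

print(entropy([1, 0, 0, 0, 0]))   # 0.0            -> lower bound, deterministic
print(entropy([0.2] * 5))         # log2(5) = 2.32 -> upper bound, uniform
```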
Example: Binary Entropy Function
Let $X \sim \mathrm{Bernoulli}(p)$, i.e., $X$ takes value $1$ with probability $p$ and value $0$ with probability $1-p$, where $0 \le p \le 1$. Compute $H(X)$ and find the value of $p$ that maximizes it.
Direct computation
$$H(X) = -p \log p - (1-p)\log(1-p) \equiv h_b(p).$$
The function $h_b(p)$ is called the binary entropy function.
Maximization
Taking the derivative:
$$h_b'(p) = -\log p - \frac{1}{\ln 2} + \log(1-p) + \frac{1}{\ln 2} = \log \frac{1-p}{p}.$$
Setting $h_b'(p) = 0$ gives $\frac{1-p}{p} = 1$, so $p = \frac{1}{2}$. Since $h_b''(p) = -\frac{1}{p(1-p)\ln 2} < 0$, the function is strictly concave, confirming $p = \frac{1}{2}$ is a global maximum with $h_b(1/2) = 1$ bit.
Boundary values
$h_b(0) = h_b(1) = 0$. A deterministic coin carries no information. The maximum $h_b(1/2) = 1$ bit corresponds to a fair coin — the most uncertain binary experiment.
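A short sketch (the helper `h_b` mirrors the notation above and is our own) that evaluates the binary entropy function on a grid, confirming the maximizer $p = 1/2$ and the boundary values:

```python
from math import log2

def h_b(p):
    """Binary entropy function in bits, with h_b(0) = h_b(1) = 0 by convention."""
    if p in (0.0, 1.0):   # membership in the tuple (0.0, 1.0), i.e. p == 0 or p == 1
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

grid = [i / 1000 for i in range(1001)]
p_star = max(grid, key=h_b)          # numerical maximizer on the grid
print(p_star, h_b(p_star))           # 0.5 1.0 -> fair coin, one bit
print(h_b(0.0), h_b(1.0))            # 0.0 0.0 -> deterministic coin, no information
print(round(h_b(0.11), 4))           # 0.4999  -> a heavily biased coin carries about half a bit
```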
Example: Entropy of the Uniform Distribution
Let $X$ be uniformly distributed over $\{1, 2, \ldots, n\}$. Compute $H(X)$.
Computation
Since $p(x) = \frac{1}{n}$ for all $x$:
$$H(X) = -\sum_{x=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n.$$
Interpretation
To identify one of $n$ equally likely outcomes, we need $\log n$ bits. For $n = 2^k$, this is exactly $k$ bits — matching our intuition that $k$ binary questions suffice to identify one of $2^k$ items.
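The "binary questions" picture in code form: a toy bisection sketch of ours (the names `guess` and `secret` are purely illustrative) that identifies one of $2^{10}$ equally likely items using exactly $10$ yes/no questions.

```python
def guess(secret, n):
    """Identify `secret` in {0, ..., n-1} with yes/no questions via bisection.

    Returns the number of questions asked; for n = 2**k this is exactly k.
    """
    lo, hi = 0, n          # current candidate range [lo, hi)
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if secret >= mid:  # question: "is the item in the upper half of the range?"
            lo = mid
        else:
            hi = mid
    return questions

n = 1024  # 2**10 equally likely items
print(max(guess(s, n) for s in range(n)))   # 10 questions = log2(1024) bits
```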
Example: Entropy of a Loaded Die
A die has faces $\{1, \ldots, 6\}$ with probabilities $\left(\tfrac{1}{2}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{10}\right)$. Compute $H(X)$.
Direct calculation
$$H(X) = -\frac{1}{2}\log\frac{1}{2} - 5 \cdot \frac{1}{10}\log\frac{1}{10} = \frac{1}{2} + \frac{1}{2}\log 10 \approx 2.16 \text{ bits}.$$
Comparison with fair die
A fair die has entropy $\log_2 6 \approx 2.585$ bits. The loaded die has lower entropy because it is more predictable — the outcome $1$ occurs half the time. The point is that entropy penalizes predictability: the more concentrated the distribution, the lower the entropy.
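A quick check of the comparison, assuming the loaded-die probabilities used in this example (one face with probability $1/2$, the remaining five equally likely):

```python
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf if p > 0)

loaded = [0.5] + [0.1] * 5   # the loaded die from the example above
fair = [1 / 6] * 6

print(f"loaded die: {entropy(loaded):.3f} bits")   # 2.161
print(f"fair die:   {entropy(fair):.3f} bits")     # log2(6) = 2.585
```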
Binary Entropy Function
Explore the binary entropy function and see how entropy varies with the bias of a Bernoulli source. The maximum occurs at $p = 1/2$ (fair coin), and the function is symmetric about this point.
Parameters: the bias $p = \Pr\{X = 1\}$.
Entropy of Discrete Distributions
Adjust the probabilities of a discrete distribution over $n$ outcomes and observe how the entropy changes. Notice that the entropy is maximized when the distribution is uniform and minimized when all mass is concentrated on a single outcome.
Parameters: the number of outcomes $n$, and a concentration slider ($0$ = uniform, $1$ = all mass on one symbol).
Theorem: Grouping Property of Entropy
Let $X$ take values in $\mathcal{X} = \{x_1, \ldots, x_n\}$ with probabilities $p_1, \ldots, p_n$. Partition $\mathcal{X}$ into disjoint groups $G_1, \ldots, G_k$, and let $q_j = \sum_{x_i \in G_j} p_i$ be the probability of group $G_j$. Then
$$H(p_1, \ldots, p_n) = H(q_1, \ldots, q_k) + \sum_{j=1}^{k} q_j \, H\!\left(\left(\frac{p_i}{q_j}\right)_{x_i \in G_j}\right).$$
First identify which group $X$ belongs to (costing $H(q_1, \ldots, q_k)$ bits), then identify the specific outcome within that group. The total information is the sum. This is a recursive structure — it tells us that entropy "decomposes" naturally along any partition.
Expand the definition
Write $H(X) = -\sum_{j=1}^{k} \sum_{x_i \in G_j} p_i \log p_i$. Substitute $p_i = q_j \cdot \frac{p_i}{q_j}$:
$$H(X) = -\sum_{j=1}^{k} \sum_{x_i \in G_j} p_i \left( \log q_j + \log \frac{p_i}{q_j} \right) = -\sum_{j=1}^{k} q_j \log q_j - \sum_{j=1}^{k} q_j \sum_{x_i \in G_j} \frac{p_i}{q_j} \log \frac{p_i}{q_j}.$$
The first term is $H(q_1, \ldots, q_k)$, and the inner sum in the second term (together with the minus sign) is exactly the entropy of the normalized distribution within group $G_j$, giving $\sum_{j} q_j \, H\!\left(\left(\frac{p_i}{q_j}\right)_{x_i \in G_j}\right)$.
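A numerical check of the grouping identity; the PMF and partition below are arbitrary choices of ours, and both sides of the identity agree:

```python
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf if p > 0)

p = [0.4, 0.1, 0.25, 0.15, 0.1]   # PMF over 5 outcomes
groups = [[0, 1], [2, 3, 4]]      # partition of the outcome indices

q = [sum(p[i] for i in g) for g in groups]                   # group probabilities
within = [entropy([p[i] / qj for i in g]) for g, qj in zip(groups, q)]

direct = entropy(p)
grouped = entropy(q) + sum(qj * hj for qj, hj in zip(q, within))
print(direct, grouped)   # both are about 2.104 bits: the grouping property
```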
Quick Check
A random variable $X$ takes four values with probabilities $\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\right)$. What is $H(X)$?
$1.75$ bits
$2$ bits
$1.5$ bits
$1$ bit
$H(X) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{8}\log 8 + \frac{1}{8}\log 8 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{3}{8} = 1.75$ bits.
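A one-line verification of the arithmetic, assuming the PMF $(1/2, 1/4, 1/8, 1/8)$ from the question above:

```python
from math import log2
print(sum(-p * log2(p) for p in [0.5, 0.25, 0.125, 0.125]))   # 1.75
```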
Quick Check
Under what condition is $H(X) = 0$?
$X$ is uniformly distributed
$X$ is deterministic (takes one value with probability 1)
$X$ has exactly two equally likely outcomes
$X$ has an infinite alphabet
When $X$ is deterministic, there is no uncertainty, so $H(X) = 0$. This is the only case where entropy vanishes.
Theorem: Concavity of Entropy
The entropy $H(p)$ is a concave function of the probability vector $p = (p_1, \ldots, p_n)$ over the probability simplex.
That is, for any two PMFs $p_1, p_2$ and any $\lambda \in [0, 1]$:
$$H(\lambda p_1 + (1 - \lambda) p_2) \ge \lambda H(p_1) + (1 - \lambda) H(p_2).$$
Mixing two distributions can only increase (or maintain) uncertainty. The point is that concavity of entropy is not just a mathematical nicety — it is the reason that the capacity optimization is a convex problem, which is why we can actually compute capacity.
Apply concavity of $-t \log t$
The function $f(t) = -t \log t$ is concave on $[0, 1]$ (its second derivative is $f''(t) = -\frac{1}{t \ln 2} < 0$ for $t > 0$).
Since $H(p) = \sum_i f(p_i)$ is a sum of concave functions evaluated at different components, and the mixing operates component-wise:
$$H(\lambda p_1 + (1-\lambda) p_2) = \sum_i f\big(\lambda p_{1,i} + (1-\lambda) p_{2,i}\big) \ge \sum_i \big[\lambda f(p_{1,i}) + (1-\lambda) f(p_{2,i})\big] = \lambda H(p_1) + (1-\lambda) H(p_2).$$
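A small numerical illustration of the inequality (the two PMFs and the mixing weight $\lambda$ below are arbitrary choices of ours):

```python
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf if p > 0)

p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.1, 0.8]
lam = 0.3
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

lhs = entropy(mix)                                  # entropy of the mixture
rhs = lam * entropy(p1) + (1 - lam) * entropy(p2)   # mixture of the entropies
print(lhs >= rhs, lhs, rhs)                         # True: mixing never lowers entropy
```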
Common Mistake: Entropy Is Not Variance
Mistake:
Confusing entropy with variance or treating them interchangeably as measures of "spread." For example, assuming that a distribution with higher variance necessarily has higher entropy.
Correction:
Entropy measures uncertainty about the identity of the outcome, not its numerical spread. A distribution concentrated on two extreme values (say $\pm 1000$ with equal probability) has entropy $1$ bit and variance $10^6$. A uniform distribution over $\{1, 2, \ldots, 8\}$ has entropy $3$ bits but variance only $5.25$. The two quantities capture fundamentally different aspects of randomness.
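A concrete check of this correction, using the two distributions described above (helpers `entropy` and `variance` are our own):

```python
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf.values() if p > 0)

def variance(pmf):
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

two_point = {-1000: 0.5, 1000: 0.5}          # two extreme values, equally likely
uniform_8 = {x: 1 / 8 for x in range(1, 9)}  # uniform over {1, ..., 8}

print(entropy(two_point), variance(two_point))   # 1.0 bit,  1000000.0
print(entropy(uniform_8), variance(uniform_8))   # 3.0 bits, 5.25
```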
Self-information (surprisal)
The self-information of an outcome $x$ with probability $p(x)$ is $i(x) = -\log p(x) = \log \frac{1}{p(x)}$. It quantifies the "surprise" of observing $x$. Rare events have high self-information. Entropy is the expected self-information: $H(X) = \mathbb{E}[i(X)]$.
Related: Entropy