Entropy and Its Operational Meaning

Why Entropy?

Suppose you are designing a communication system. A source produces symbols from an alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ according to a known distribution $p$. You want to represent each source output as a binary string and transmit it over a noiseless channel. The fundamental question is:

What is the minimum average number of bits per symbol needed to represent the source without any loss?

Shannon's extraordinary answer is that this minimum is a single number determined entirely by the source distribution: the entropy $H(X)$. This section defines entropy, builds intuition for why it measures "information," and establishes the basic properties that make it the cornerstone of the entire theory.

Definition: Shannon Entropy

Let $X$ be a discrete random variable taking values in a finite alphabet $\mathcal{X}$ with probability mass function $p(x) = \Pr(X = x)$. The entropy of $X$ is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$

where $\log$ denotes the base-2 logarithm and we adopt the convention $0 \log 0 = 0$ (justified by continuity, since $\lim_{t \to 0^+} t \log t = 0$).

When we use the natural logarithm, the unit is nats; when we use $\log_2$, the unit is bits. One nat $= \frac{1}{\ln 2}$ bits $\approx 1.443$ bits.

Despite the notation, $H(X)$ depends only on the distribution of $X$, not on the particular values that $X$ takes. More precisely, $H(X) = H(p)$ is a functional of the PMF $p$; we write $H(X)$ when the PMF is understood from context.
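To make the definition concrete, here is a minimal Python sketch of the entropy computation; the helper name `entropy` is ours, not from the text, and the examples simply exercise the convention $0 \log 0 = 0$.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a PMF given as a sequence of probabilities.

    Zero-probability outcomes are skipped, implementing the convention 0 log 0 = 0.
    """
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: fair coin
print(entropy([1.0, 0.0]))   # 0.0 bits: deterministic variable
print(entropy([0.25] * 4))   # 2.0 bits: uniform over four outcomes
```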

Entropy

A measure of the average uncertainty or information content of a random variable. For a discrete random variable $X$ with PMF $p$, $H(X) = -\sum_x p(x) \log p(x)$.

Related: Conditional entropy, Mutual information

Bit (unit of information)

The entropy of a fair coin flip. One bit is the amount of information gained (or uncertainty resolved) by observing the outcome of a single fair binary experiment. Formally, $1 \text{ bit} = \log_2 2 = 1$.

Related: Entropy

Three Ways to Think About Entropy

Entropy admits several equivalent interpretations, each illuminating a different facet of the concept:

  1. Average surprise. Define the surprise or self-information of outcome $x$ as $\iota(x) = -\log p(x) = \log \frac{1}{p(x)}$. Then $H(X) = \mathbb{E}[\iota(X)]$: entropy is the average surprise. Rare events carry more surprise; entropy averages this over the distribution (see the sketch after this list).

  2. Minimum expected description length. Shannon's source coding theorem (Chapter 5) proves that $H(X)$ is the minimum average number of bits per symbol needed to losslessly encode an i.i.d. source with distribution $p$. We cannot do better, and we can get arbitrarily close.

  3. Average number of binary questions. Imagine playing "twenty questions" to identify $X$. The optimal strategy requires between $H(X)$ and $H(X) + 1$ yes/no questions on average. Each question is worth at most one bit.
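As a quick illustration of interpretation 1, the following sketch (our construction, not from the text) samples from a source and checks that the empirical average surprisal approaches $H(X)$:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.125, 0.125]

# Draw many samples and average the surprisal -log2 p(x) over them.
n = 100_000
samples = random.choices(range(len(pmf)), weights=pmf, k=n)
avg_surprise = sum(-math.log2(pmf[x]) for x in samples) / n

print(f"empirical average surprise: {avg_surprise:.4f} bits")
print(f"entropy H(X):               {entropy(pmf):.4f} bits")  # 1.75 bits
```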

Historical Note: Shannon, von Neumann, and the Name 'Entropy' (1948)

When Claude Shannon developed his theory in the late 1940s, he needed a name for the quantity $-\sum p \log p$. The story, as told by Myron Tribus, is that Shannon asked John von Neumann for advice. Von Neumann reportedly said: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage."

Whether or not the story is apocryphal, the connection is real. Boltzmann's entropy $S = k_B \ln W$ and Shannon's entropy $H = -\sum p \log p$ are related: for a uniform distribution over $W$ microstates, $H = \log W$. The physical and information-theoretic concepts are deeply intertwined, a connection that would later inspire Landauer's principle and the thermodynamics of computation.

Theorem: Bounds on Entropy

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$, $|\mathcal{X}| = M$. Then:

$$0 \leq H(X) \leq \log M.$$

The lower bound $H(X) = 0$ holds if and only if $X$ is deterministic (i.e., $p(x) = 1$ for some $x$). The upper bound $H(X) = \log M$ holds if and only if $X$ is uniformly distributed over $\mathcal{X}$.

A deterministic variable carries no surprise — we always know its value. A uniform variable is maximally unpredictable — every outcome is equally likely. Entropy quantifies this spectrum from certainty to maximum uncertainty.
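A small numerical spot-check of the bounds, reusing the `entropy` helper from the earlier sketches; random points in the probability simplex always land between the two extremes:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

M = 6
for _ in range(5):
    # A random PMF: normalize positive weights so they sum to 1.
    w = [random.random() for _ in range(M)]
    pmf = [x / sum(w) for x in w]
    H = entropy(pmf)
    assert 0 <= H <= math.log2(M) + 1e-12
    print(f"H = {H:.4f} bits  (log2 M = {math.log2(M):.4f})")
```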

Example: Binary Entropy Function

Let $X \sim \text{Bernoulli}(p)$, i.e., $X$ takes value $1$ with probability $p$ and value $0$ with probability $1-p$, where $0 \leq p \leq 1$. Compute $H(X)$ and find the value of $p$ that maximizes it.

Example: Entropy of the Uniform Distribution

Let $X$ be uniformly distributed over $\mathcal{X} = \{1, 2, \ldots, M\}$. Compute $H(X)$.

Example: Entropy of a Loaded Die

A die has faces $\{1, 2, 3, 4, 5, 6\}$ with probabilities $(1/2, 1/4, 1/8, 1/16, 1/32, 1/32)$. Compute $H(X)$.
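The three examples above can be checked numerically; a sketch, where the values in the comments follow directly from the formulas (the loaded die gives $31/16 = 1.9375$ bits):

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Binary entropy function h_b(p), maximized at p = 1/2.
def h_b(p):
    return entropy([p, 1 - p])

print(h_b(0.5))              # 1.0 bit (the maximum, at the fair coin)

# Uniform distribution over M outcomes: H = log2 M.
M = 8
print(entropy([1 / M] * M))  # 3.0 bits

# Loaded die: H = 1/2*1 + 1/4*2 + 1/8*3 + 1/16*4 + 2*(1/32*5) = 1.9375 bits.
die = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
print(entropy(die))
```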

Binary Entropy Function

Explore the binary entropy function $h_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$ and see how entropy varies with the bias $p$ of a Bernoulli source. The maximum occurs at $p = 1/2$ (fair coin), and the function is symmetric about this point.
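For readers without the interactive demo, a static sketch (assuming `matplotlib` is available) that traces $h_b(p)$ over $[0, 1]$:

```python
import math
import matplotlib.pyplot as plt

def h_b(p):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ps = [i / 1000 for i in range(1001)]
plt.plot(ps, [h_b(p) for p in ps])
plt.xlabel("p = Pr(X = 1)")
plt.ylabel("h_b(p) [bits]")
plt.title("Binary entropy function")
plt.show()
```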


Entropy of Discrete Distributions

Adjust the probabilities of a discrete distribution over $M$ outcomes and observe how the entropy changes. Notice that the entropy is maximized when the distribution is uniform and minimized when all mass is concentrated on a single outcome.
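A non-interactive version of the same experiment: interpolate between the uniform PMF and a point mass (the parameter `t` below plays the role of the demo's slider; the naming is ours) and watch the entropy fall from $\log_2 M$ to $0$:

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

M = 4
uniform = [1 / M] * M
point = [1.0] + [0.0] * (M - 1)

# t = 0 is the uniform PMF (entropy log2 M = 2 bits); t = 1 puts all mass on one symbol.
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pmf = [(1 - t) * u + t * q for u, q in zip(uniform, point)]
    print(f"t = {t:.2f}:  H = {entropy(pmf):.4f} bits")
```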


Theorem: Grouping Property of Entropy

Let $X$ take values in $\mathcal{X} = \{a_1, \ldots, a_M\}$ with probabilities $(p_1, \ldots, p_M)$. Partition $\mathcal{X}$ into groups $\mathcal{G}_1, \ldots, \mathcal{G}_K$, and let $g_k = \sum_{a_i \in \mathcal{G}_k} p_i$ be the probability of group $k$. Then

$$H(X) = H(g_1, \ldots, g_K) + \sum_{k=1}^{K} g_k \, H\!\left(\frac{p_i}{g_k} : a_i \in \mathcal{G}_k\right).$$

First identify which group $X$ belongs to (costing $H(g_1, \ldots, g_K)$ bits), then identify the specific outcome within that group. The total information is the sum. This recursive structure tells us that entropy "decomposes" naturally along any partition.
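The decomposition is easy to verify numerically; a sketch using the loaded die from the earlier example, partitioned into the (arbitrarily chosen) groups $\{1, 2\}$ and $\{3, 4, 5, 6\}$:

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Loaded die, partitioned into groups {1, 2} and {3, 4, 5, 6} (0-based indices).
p = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
groups = [[0, 1], [2, 3, 4, 5]]

g = [sum(p[i] for i in grp) for grp in groups]   # group probabilities g_k
within = sum(
    gk * entropy([p[i] / gk for i in grp])       # entropy within group k, weighted by g_k
    for gk, grp in zip(g, groups)
)

print(entropy(p))           # direct computation: 1.9375 bits
print(entropy(g) + within)  # grouping decomposition: the same value
```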

Quick Check

A random variable $X$ takes four values with probabilities $(1/2, 1/4, 1/8, 1/8)$. What is $H(X)$?

$2$ bits

$7/4$ bits

$3/2$ bits

$1$ bit

Quick Check

Under what condition is $H(X) = 0$?

$X$ is uniformly distributed

$X$ is deterministic (takes one value with probability 1)

$X$ has exactly two equally likely outcomes

$X$ has an infinite alphabet

Theorem: Concavity of Entropy

The entropy $H(p) = -\sum_{x} p(x) \log p(x)$ is a concave function of the probability vector $\mathbf{p} = (p(a_1), \ldots, p(a_M))$ over the probability simplex.

That is, for any two PMFs $\mathbf{p}_1, \mathbf{p}_2$ and any $\lambda \in [0,1]$:

$$H(\lambda \mathbf{p}_1 + (1-\lambda)\mathbf{p}_2) \geq \lambda H(\mathbf{p}_1) + (1-\lambda) H(\mathbf{p}_2).$$

Mixing two distributions yields entropy at least as large as the corresponding mixture of their entropies. Concavity is not just a mathematical nicety: it is the reason the capacity optimization $C = \max_{p_X} I(X;Y)$ is a convex problem, which is why we can actually compute capacity.
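The inequality can be spot-checked on random instances; a minimal sketch, assuming the same `entropy` helper as before:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def random_pmf(M):
    w = [random.random() for _ in range(M)]
    return [x / sum(w) for x in w]

# Check H(lam*p1 + (1-lam)*p2) >= lam*H(p1) + (1-lam)*H(p2) on random trials.
for _ in range(1000):
    p1, p2 = random_pmf(5), random_pmf(5)
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
    assert entropy(mix) >= lam * entropy(p1) + (1 - lam) * entropy(p2) - 1e-9

print("concavity inequality held on all random trials")
```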

Common Mistake: Entropy Is Not Variance

Mistake:

Confusing entropy with variance or treating them interchangeably as measures of "spread." For example, assuming that a distribution with higher variance necessarily has higher entropy.

Correction:

Entropy measures uncertainty about the identity of the outcome, not its numerical spread. A distribution concentrated on two extreme values (say $\{0, 100\}$ with equal probability) has entropy $1$ bit and variance $2500$. A uniform distribution over $\{0, 1, 2, 3\}$ has entropy $2$ bits but variance $5/4$. The two quantities capture fundamentally different aspects of randomness.
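The two numerical claims are quick to reproduce; a sketch with a small `variance` helper of our own:

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def variance(values, pmf):
    mean = sum(v * p for v, p in zip(values, pmf))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, pmf))

# Two-point distribution on {0, 100}: 1 bit of entropy, variance 2500.
print(entropy([0.5, 0.5]), variance([0, 100], [0.5, 0.5]))

# Uniform on {0, 1, 2, 3}: 2 bits of entropy, variance only 5/4.
print(entropy([0.25] * 4), variance([0, 1, 2, 3], [0.25] * 4))
```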

Self-information (surprisal)

The self-information of an outcome $x$ with probability $p(x)$ is $\iota(x) = -\log p(x) = \log(1/p(x))$. It quantifies the "surprise" of observing $x$. Rare events have high self-information. Entropy is the expected self-information: $H(X) = \mathbb{E}[\iota(X)]$.

Related: Entropy