Entropy and Its Operational Meaning

Why Entropy?

Suppose you are designing a communication system. A source produces symbols from an alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ according to a known distribution $p$. You want to represent each source output as a binary string and transmit it over a noiseless channel. The fundamental question is:

What is the minimum average number of bits per symbol needed to represent the source without any loss?

Shannon's extraordinary answer is that this minimum is a single number determined entirely by the source distribution: the entropy $H(X)$. This section defines entropy, builds intuition for why it measures "information," and establishes the basic properties that make it the cornerstone of the entire theory.

Definition: Shannon Entropy

Let $X$ be a discrete random variable taking values in a finite alphabet $\mathcal{X}$ with probability mass function $p(x) = \Pr(X = x)$. The entropy of $X$ is

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$

where $\log$ denotes the base-2 logarithm and we adopt the convention $0 \log 0 = 0$ (justified by continuity, since $\lim_{t \to 0^+} t \log t = 0$).

When we use the natural logarithm, the unit is nats; when we use $\log_2$, the unit is bits. One nat $= \frac{1}{\ln 2}$ bits $\approx 1.443$ bits.

Despite the notation, $H(X)$ depends only on the distribution of $X$, not on the particular values that $X$ takes. More precisely, $H(X) = H(p)$ is a functional of the PMF $p$; we write $H(X)$ when the PMF is understood from context.
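To make the definition concrete, here is a minimal Python sketch of the entropy computation; the helper name `entropy` is ours, not from the text, and the examples simply exercise the convention $0 \log 0 = 0$.

```python
import math

def entropy(pmf):
    """Shannon entropy in bits of a PMF given as a sequence of probabilities.

    Zero-probability outcomes are skipped, implementing the convention 0 log 0 = 0.
    """
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: fair coin
print(entropy([1.0, 0.0]))   # 0.0 bits: deterministic variable
print(entropy([0.25] * 4))   # 2.0 bits: uniform over four outcomes
```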

Entropy

A measure of the average uncertainty or information content of a random variable. For a discrete random variable $X$ with PMF $p$, $H(X) = -\sum_x p(x) \log p(x)$.

Related: Conditional entropy, Mutual information

Bit (unit of information)

The entropy of a fair coin flip. One bit is the amount of information gained (or uncertainty resolved) by observing the outcome of a single fair binary experiment. Formally, $1 \text{ bit} = \log_2 2 = 1$.

Related: Entropy

Three Ways to Think About Entropy

Entropy admits several equivalent interpretations, each illuminating a different facet of the concept:

  1. Average surprise. Define the surprise or self-information of outcome $x$ as $\iota(x) = -\log p(x) = \log \frac{1}{p(x)}$. Then $H(X) = \mathbb{E}[\iota(X)]$: entropy is the average surprise. Rare events carry more surprise; entropy averages this over the distribution (see the sketch after this list).

  2. Minimum expected description length. Shannon's source coding theorem (Chapter 5) proves that $H(X)$ is the minimum average number of bits per symbol needed to losslessly encode an i.i.d. source with distribution $p$. We cannot do better, and we can get arbitrarily close.

  3. Average number of binary questions. Imagine playing "twenty questions" to identify $X$. The optimal strategy requires between $H(X)$ and $H(X) + 1$ yes/no questions on average. Each question is worth at most one bit.
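As a quick illustration of interpretation 1, the following sketch (our construction, not from the text) samples from a source and checks that the empirical average surprisal approaches $H(X)$:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.125, 0.125]

# Draw many samples and average the surprisal -log2 p(x) over them.
n = 100_000
samples = random.choices(range(len(pmf)), weights=pmf, k=n)
avg_surprise = sum(-math.log2(pmf[x]) for x in samples) / n

print(f"empirical average surprise: {avg_surprise:.4f} bits")
print(f"entropy H(X):               {entropy(pmf):.4f} bits")  # 1.75 bits
```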

Historical Note: Shannon, von Neumann, and the Name 'Entropy' (1948)

When Claude Shannon developed his theory in the late 1940s, he needed a name for the quantity $-\sum p \log p$. The story, as told by Myron Tribus, is that Shannon asked John von Neumann for advice. Von Neumann reportedly said: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage."

Whether or not the story is apocryphal, the connection is real. Boltzmann's entropy $S = k_B \ln W$ and Shannon's entropy $H = -\sum p \log p$ are related: for a uniform distribution over $W$ microstates, $H = \log W$. The physical and information-theoretic concepts are deeply intertwined, a connection that would later inspire Landauer's principle and the thermodynamics of computation.

Theorem: Bounds on Entropy

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$, $|\mathcal{X}| = M$. Then:

$$0 \leq H(X) \leq \log M.$$

The lower bound $H(X) = 0$ holds if and only if $X$ is deterministic (i.e., $p(x) = 1$ for some $x$). The upper bound $H(X) = \log M$ holds if and only if $X$ is uniformly distributed over $\mathcal{X}$.

A deterministic variable carries no surprise — we always know its value. A uniform variable is maximally unpredictable — every outcome is equally likely. Entropy quantifies this spectrum from certainty to maximum uncertainty.
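A small numerical spot-check of the bounds, reusing the `entropy` helper from the earlier sketches; random points in the probability simplex always land between the two extremes:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

M = 6
for _ in range(5):
    # A random PMF: normalize positive weights so they sum to 1.
    w = [random.random() for _ in range(M)]
    pmf = [x / sum(w) for x in w]
    H = entropy(pmf)
    assert 0 <= H <= math.log2(M) + 1e-12
    print(f"H = {H:.4f} bits  (log2 M = {math.log2(M):.4f})")
```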

Example: Binary Entropy Function

Let $X \sim \text{Bernoulli}(p)$, i.e., $X$ takes value $1$ with probability $p$ and value $0$ with probability $1-p$, where $0 \leq p \leq 1$. Compute $H(X)$ and find the value of $p$ that maximizes it.

Example: Entropy of the Uniform Distribution

Let $X$ be uniformly distributed over $\mathcal{X} = \{1, 2, \ldots, M\}$. Compute $H(X)$.

Example: Entropy of a Loaded Die

A die has faces $\{1, 2, 3, 4, 5, 6\}$ with probabilities $(1/2, 1/4, 1/8, 1/16, 1/32, 1/32)$. Compute $H(X)$.
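The three examples above can be checked numerically; a sketch, where the values in the comments follow directly from the formulas (the loaded die gives $31/16 = 1.9375$ bits):

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Binary entropy function h_b(p), maximized at p = 1/2.
def h_b(p):
    return entropy([p, 1 - p])

print(h_b(0.5))              # 1.0 bit (the maximum, at the fair coin)

# Uniform distribution over M outcomes: H = log2 M.
M = 8
print(entropy([1 / M] * M))  # 3.0 bits

# Loaded die: H = 1/2*1 + 1/4*2 + 1/8*3 + 1/16*4 + 2*(1/32*5) = 1.9375 bits.
die = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
print(entropy(die))
```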

Binary Entropy Function

Explore the binary entropy function $h_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$ and see how entropy varies with the bias $p$ of a Bernoulli source. The maximum occurs at $p = 1/2$ (fair coin), and the function is symmetric about this point.
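For readers without the interactive demo, a static sketch (assuming `matplotlib` is available) that traces $h_b(p)$ over $[0, 1]$:

```python
import math
import matplotlib.pyplot as plt

def h_b(p):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

ps = [i / 1000 for i in range(1001)]
plt.plot(ps, [h_b(p) for p in ps])
plt.xlabel("p = Pr(X = 1)")
plt.ylabel("h_b(p) [bits]")
plt.title("Binary entropy function")
plt.show()
```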


Entropy of Discrete Distributions

Adjust the probabilities of a discrete distribution over $M$ outcomes and observe how the entropy changes. Notice that the entropy is maximized when the distribution is uniform and minimized when all mass is concentrated on a single outcome.
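A non-interactive version of the same experiment: interpolate between the uniform PMF and a point mass (the parameter `t` below plays the role of the demo's slider; the naming is ours) and watch the entropy fall from $\log_2 M$ to $0$:

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

M = 4
uniform = [1 / M] * M
point = [1.0] + [0.0] * (M - 1)

# t = 0 is the uniform PMF (entropy log2 M = 2 bits); t = 1 puts all mass on one symbol.
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pmf = [(1 - t) * u + t * q for u, q in zip(uniform, point)]
    print(f"t = {t:.2f}:  H = {entropy(pmf):.4f} bits")
```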


Theorem: Grouping Property of Entropy

Let $X$ take values in $\mathcal{X} = \{a_1, \ldots, a_M\}$ with probabilities $(p_1, \ldots, p_M)$. Partition $\mathcal{X}$ into groups $\mathcal{G}_1, \ldots, \mathcal{G}_K$, and let $g_k = \sum_{a_i \in \mathcal{G}_k} p_i$ be the probability of group $k$. Then

$$H(X) = H(g_1, \ldots, g_K) + \sum_{k=1}^{K} g_k \, H\!\left(\frac{p_i}{g_k} : a_i \in \mathcal{G}_k\right).$$

First identify which group $X$ belongs to (costing $H(g_1, \ldots, g_K)$ bits), then identify the specific outcome within that group. The total information is the sum. This recursive structure tells us that entropy "decomposes" naturally along any partition.
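The decomposition is easy to verify numerically; a sketch using the loaded die from the earlier example, partitioned into the (arbitrarily chosen) groups $\{1, 2\}$ and $\{3, 4, 5, 6\}$:

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Loaded die, partitioned into groups {1, 2} and {3, 4, 5, 6} (0-based indices).
p = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
groups = [[0, 1], [2, 3, 4, 5]]

g = [sum(p[i] for i in grp) for grp in groups]   # group probabilities g_k
within = sum(
    gk * entropy([p[i] / gk for i in grp])       # entropy within group k, weighted by g_k
    for gk, grp in zip(g, groups)
)

print(entropy(p))           # direct computation: 1.9375 bits
print(entropy(g) + within)  # grouping decomposition: the same value
```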

Quick Check

A random variable $X$ takes four values with probabilities $(1/2, 1/4, 1/8, 1/8)$. What is $H(X)$?

$2$ bits

$7/4$ bits

$3/2$ bits

$1$ bit

Quick Check

Under what condition is $H(X) = 0$?

$X$ is uniformly distributed

$X$ is deterministic (takes one value with probability 1)

$X$ has exactly two equally likely outcomes

$X$ has an infinite alphabet

Theorem: Concavity of Entropy

The entropy $H(p) = -\sum_{x} p(x) \log p(x)$ is a concave function of the probability vector $\mathbf{p} = (p(a_1), \ldots, p(a_M))$ over the probability simplex.

That is, for any two PMFs $\mathbf{p}_1, \mathbf{p}_2$ and any $\lambda \in [0,1]$:

$$H(\lambda \mathbf{p}_1 + (1-\lambda)\mathbf{p}_2) \geq \lambda H(\mathbf{p}_1) + (1-\lambda) H(\mathbf{p}_2).$$

Mixing two distributions yields entropy at least as large as the corresponding mixture of their entropies. Concavity is not just a mathematical nicety: it is the reason the capacity optimization $C = \max_{p_X} I(X;Y)$ is a convex problem, which is why we can actually compute capacity.
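The inequality can be spot-checked on random instances; a minimal sketch, assuming the same `entropy` helper as before:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def random_pmf(M):
    w = [random.random() for _ in range(M)]
    return [x / sum(w) for x in w]

# Check H(lam*p1 + (1-lam)*p2) >= lam*H(p1) + (1-lam)*H(p2) on random trials.
for _ in range(1000):
    p1, p2 = random_pmf(5), random_pmf(5)
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
    assert entropy(mix) >= lam * entropy(p1) + (1 - lam) * entropy(p2) - 1e-9

print("concavity inequality held on all random trials")
```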

Common Mistake: Entropy Is Not Variance

Mistake:

Confusing entropy with variance or treating them interchangeably as measures of "spread." For example, assuming that a distribution with higher variance necessarily has higher entropy.

Correction:

Entropy measures uncertainty about the identity of the outcome, not its numerical spread. A distribution concentrated on two extreme values (say $\{0, 100\}$ with equal probability) has entropy $1$ bit and variance $2500$. A uniform distribution over $\{0, 1, 2, 3\}$ has entropy $2$ bits but variance $5/4$. The two quantities capture fundamentally different aspects of randomness.
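The two numerical claims are quick to reproduce; a sketch with a small `variance` helper of our own:

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def variance(values, pmf):
    mean = sum(v * p for v, p in zip(values, pmf))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, pmf))

# Two-point distribution on {0, 100}: 1 bit of entropy, variance 2500.
print(entropy([0.5, 0.5]), variance([0, 100], [0.5, 0.5]))

# Uniform on {0, 1, 2, 3}: 2 bits of entropy, variance only 5/4.
print(entropy([0.25] * 4), variance([0, 1, 2, 3], [0.25] * 4))
```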

Self-information (surprisal)

The self-information of an outcome $x$ with probability $p(x)$ is $\iota(x) = -\log p(x) = \log(1/p(x))$. It quantifies the "surprise" of observing $x$. Rare events have high self-information. Entropy is the expected self-information: $H(X) = \mathbb{E}[\iota(X)]$.

Related: Entropy