Joint Entropy and Conditional Entropy
From One Variable to Two
A single random variable in isolation is rarely interesting in communications. What matters is the relationship between the source message $X$ and the channel output $Y$. How much does observing $Y$ tell us about $X$? How much residual uncertainty about $X$ remains after seeing $Y$? To answer these questions we need to extend entropy to pairs of random variables, and that leads us to joint and conditional entropy.
Definition: Joint Entropy
Joint Entropy
Let $(X, Y)$ be a pair of discrete random variables with joint PMF $p(x, y)$ over alphabet $\mathcal{X} \times \mathcal{Y}$. The joint entropy is
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y).$$
More generally, for $n$ random variables $X_1, \dots, X_n$:
$$H(X_1, \dots, X_n) = -\sum_{x_1, \dots, x_n} p(x_1, \dots, x_n) \log p(x_1, \dots, x_n).$$
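To make the double sum concrete, here is a minimal Python sketch (the function name `joint_entropy` and the example PMF are illustrative choices, not from the text) that evaluates $H(X, Y)$ directly from a joint PMF stored as a dictionary.

```python
import math

def joint_entropy(pmf):
    """H(X, Y) = -sum over (x, y) of p(x, y) * log2 p(x, y).

    `pmf` maps outcome pairs (x, y) to probabilities; zero-probability
    pairs may simply be omitted (0 log 0 is taken as 0).
    """
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Example: X and Y independent fair bits -> H(X, Y) = 2 bits.
pmf = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(joint_entropy(pmf))  # 2.0
```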
Definition: Conditional Entropy
Conditional Entropy
The conditional entropy of $Y$ given $X$ is
$$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x, y} p(x, y) \log p(y \mid x).$$
This is the average residual uncertainty about $Y$ after observing $X$. It is not the entropy of $Y$ given a specific value $X = x$; that would be $H(Y \mid X = x)$, which is a function of $x$. Conditional entropy averages this quantity over the distribution of $X$.
The distinction between $H(Y \mid X)$ (a number) and $H(Y \mid X = x)$ (a function of $x$) is a common source of confusion. The conditional entropy is an expectation over $X$, not a conditional quantity for a fixed realization.
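To make the averaging over $X$ explicit, here is a small sketch (illustrative only; the helper name `conditional_entropy` is an assumption) that computes $H(Y \mid X)$ by weighting each term $-\log p(y \mid x)$ by the joint probability $p(x, y)$, which is equivalent to averaging $H(Y \mid X = x)$ over $p(x)$.

```python
import math
from collections import defaultdict

def conditional_entropy(pmf):
    """H(Y | X) = sum_x p(x) * H(Y | X = x), with pmf[(x, y)] = p(x, y)."""
    # Marginal p(x).
    px = defaultdict(float)
    for (x, _), p in pmf.items():
        px[x] += p
    # Average -log2 p(y | x) over the joint distribution.
    h = 0.0
    for (x, y), p in pmf.items():
        if p > 0:
            h -= p * math.log2(p / px[x])
    return h

# Y = X (deterministic): H(Y | X) should be 0.
pmf = {(0, 0): 0.5, (1, 1): 0.5}
print(conditional_entropy(pmf))  # 0.0
```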
Conditional entropy
The average uncertainty remaining about $Y$ after observing $X$: $H(Y \mid X) = -\sum_{x, y} p(x, y) \log p(y \mid x)$. Always non-negative. Equals zero iff $Y$ is a deterministic function of $X$.
Related: Entropy, Mutual information
Theorem: Chain Rule for Entropy
For any pair of discrete random variables $(X, Y)$:
$$H(X, Y) = H(X) + H(Y \mid X).$$
More generally, for $n$ random variables:
$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}).$$
To describe the pair $(X, Y)$, first describe $X$ (costing $H(X)$ bits on average), then describe $Y$ given the value of $X$ (costing $H(Y \mid X)$ additional bits on average). The total cost is the joint entropy. The chain rule as a telescoping sum is a pattern that reappears in the converse proof of the channel coding theorem and throughout multiuser information theory.
Expand the joint entropy
$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y) = -\sum_{x, y} p(x, y) \log \big( p(x)\, p(y \mid x) \big) = -\sum_{x, y} p(x, y) \log p(x) \;-\; \sum_{x, y} p(x, y) \log p(y \mid x).$$
Identify the two terms
The first sum: $-\sum_{x, y} p(x, y) \log p(x) = -\sum_{x} p(x) \log p(x) = H(X)$.
The second sum: $-\sum_{x, y} p(x, y) \log p(y \mid x) = H(Y \mid X)$ by definition.
Therefore $H(X, Y) = H(X) + H(Y \mid X)$.
General case by induction
Apply the two-variable chain rule repeatedly:
$H(X_1, \dots, X_n) = H(X_1, \dots, X_{n-1}) + H(X_n \mid X_1, \dots, X_{n-1})$.
Unrolling the recursion yields the telescoping sum.
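The two-variable chain rule is also easy to verify numerically. The following sketch (the joint PMF is an arbitrary example of my choosing) checks that $H(X, Y) = H(X) + H(Y \mid X)$ to within floating-point error.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# An arbitrary joint PMF p(x, y), used only to check the identity.
pmf = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(x).
px = defaultdict(float)
for (x, _), p in pmf.items():
    px[x] += p

H_XY = entropy(pmf)  # H(X, Y)
H_X = entropy(px)    # H(X)
H_Y_given_X = -sum(p * math.log2(p / px[x])
                   for (x, _), p in pmf.items() if p > 0)  # H(Y | X)

# Chain rule: H(X, Y) = H(X) + H(Y | X).
print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)  # True
```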
Theorem: Conditioning Reduces Entropy
For any discrete random variables $X$ and $Y$:
$$H(Y \mid X) \le H(Y),$$
with equality if and only if $X$ and $Y$ are independent.
On average, knowing $X$ can only reduce (or maintain) our uncertainty about $Y$. Information never hurts. This is the "information is non-negative" principle in disguise: it follows directly from the non-negativity of mutual information, which we define in the next section.
From the chain rule
By the chain rule, $H(Y \mid X) = H(X, Y) - H(X)$.
Therefore $H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) = I(X; Y)$.
We will prove that $I(X; Y) \ge 0$ in Section 1.4 using the information inequality. Equality holds iff $I(X; Y) = 0$, i.e., iff $X$ and $Y$ are independent.
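As a quick numerical spot-check of the theorem, the sketch below (using an arbitrary correlated PMF chosen for illustration) compares $H(Y)$ with $H(Y \mid X)$; with an independent PMF the two values would coincide.

```python
import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A correlated joint PMF: X and Y tend to agree.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in pmf.items():
    px[x] += p
    py[y] += p

H_Y = entropy(py)
H_Y_given_X = -sum(p * math.log2(p / px[x])
                   for (x, _), p in pmf.items() if p > 0)

print(H_Y, H_Y_given_X)            # H(Y) = 1.0, H(Y | X) ~ 0.722
print(H_Y_given_X <= H_Y + 1e-12)  # True: conditioning reduces entropy on average
```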
Common Mistake: Conditioning on a Specific Value Can Increase Entropy
Mistake:
Believing that $H(Y \mid X = x) \le H(Y)$ for every specific value $x$. For example, concluding that learning any particular fact about $X$ always reduces uncertainty about $Y$.
Correction:
The inequality holds for the average over $X$, not pointwise. For specific values of $x$, it is possible that $H(Y \mid X = x) > H(Y)$. Consider a hidden coin $Y$ whose bias is indicated by $X$: $X = 1$ means it is fair (entropy 1 bit), $X = 2$ means it is biased toward heads (entropy below 1 bit). If $P(X = 2)$ is close to 1, then $H(Y)$ is low (close to the biased entropy), but $H(Y \mid X = 1) = 1$ bit, which is higher.
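A small numerical version of this counterexample (the probabilities 0.1 and 0.9 are illustrative values I chose, not necessarily those in the original example):

```python
import math

def h_bits(dist):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

p_fair = 0.1    # P(X = 1): coin is fair
p_biased = 0.9  # P(X = 2): coin lands heads with probability 0.9

# Conditional entropies of Y given each value of X.
H_Y_given_fair = h_bits([0.5, 0.5])    # 1 bit
H_Y_given_biased = h_bits([0.9, 0.1])  # ~0.469 bit

# Marginal distribution of Y (probability of heads).
p_heads = p_fair * 0.5 + p_biased * 0.9
H_Y = h_bits([p_heads, 1 - p_heads])

H_Y_given_X = p_fair * H_Y_given_fair + p_biased * H_Y_given_biased

print(H_Y)             # ~0.584 bit
print(H_Y_given_fair)  # 1 bit  > H(Y): conditioning on X = 1 increases entropy
print(H_Y_given_X)     # ~0.522 bit <= H(Y): the average still obeys the theorem
```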
Example: Joint and Conditional Entropy for a Binary Pair
Let $X \sim \mathrm{Bernoulli}(1/2)$ and let $Y = X \oplus Z$, where $Z \sim \mathrm{Bernoulli}(p)$ is independent of $X$ and $\oplus$ denotes XOR. This models a binary symmetric channel with crossover probability $p$. Compute $H(X)$, $H(Y)$, $H(Y \mid X)$, $H(X, Y)$, and $H(X \mid Y)$.
Marginal entropies
$H(X) = 1$ bit (fair coin).
Since $X$ is uniform and $Z$ is independent, $Y = X \oplus Z$ is also uniform: $P(Y = 0) = P(Y = 1) = 1/2$. So $H(Y) = 1$ bit.
Conditional entropy $H(Y \mid X)$
Given $X = x$, we have $Y = x \oplus Z$, so $Y$ has the same distribution as $Z$ (up to relabeling by the constant $x$):
$$H(Y \mid X = x) = H(Z) = H_b(p) \quad \text{for every } x, \qquad \text{so} \quad H(Y \mid X) = H_b(p),$$
where $H_b(p) = -p \log p - (1 - p) \log (1 - p)$ is the binary entropy function.
Joint entropy
By the chain rule:
$$H(X, Y) = H(X) + H(Y \mid X) = 1 + H_b(p) \text{ bits}.$$
Conditional entropy $H(X \mid Y)$
By symmetry of the chain rule:
$$H(X \mid Y) = H(X, Y) - H(Y) = 1 + H_b(p) - 1 = H_b(p) \text{ bits}.$$
The residual uncertainty about $X$ after observing $Y$ equals $H_b(p)$. When $p = 0$ (noiseless channel), $H(X \mid Y) = 0$: $Y$ determines $X$ perfectly. When $p = 1/2$ (useless channel), $H(X \mid Y) = 1$ bit: $Y$ tells us nothing about $X$.
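The example can be checked numerically; here is a sketch assuming a specific crossover probability $p = 0.1$ (any other value works the same way):

```python
import math

def h_b(p):
    """Binary entropy function H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1  # crossover probability (chosen arbitrarily for this check)

# Joint PMF of (X, Y) for X ~ Bernoulli(1/2), Y = X xor Z, Z ~ Bernoulli(p).
pmf = {(0, 0): 0.5 * (1 - p), (0, 1): 0.5 * p,
       (1, 0): 0.5 * p, (1, 1): 0.5 * (1 - p)}

H_XY = -sum(q * math.log2(q) for q in pmf.values() if q > 0)

print(H_XY)        # 1 + H_b(p) ~ 1.469 bits
print(1 + h_b(p))  # matches the chain-rule answer
print(H_XY - 1)    # H(X | Y) = H_b(p) ~ 0.469 bit, since H(Y) = 1 bit
```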
The Entropy Venn Diagram
The relationships between $H(X)$, $H(Y)$, $H(X, Y)$, $H(X \mid Y)$, $H(Y \mid X)$, and $I(X; Y)$ can be visualized as a Venn diagram:
- The left circle has area $H(X)$
- The right circle has area $H(Y)$
- The intersection (overlap) has area $I(X; Y)$
- The left crescent has area $H(X \mid Y)$
- The right crescent has area $H(Y \mid X)$
- The union has area $H(X, Y)$
This diagram is useful for intuition, but dangerous if taken too literally. For three or more variables, the "areas" can become negative (the interaction information can be negative), so the Venn diagram breaks down. Use it as a mnemonic for two variables, but rely on algebraic identities for rigorous arguments.
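For two variables the area bookkeeping is exact and can be verified numerically; the sketch below uses an arbitrary joint PMF of my choosing.

```python
import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

pmf = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.1, (1, 1): 0.4}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in pmf.items():
    px[x] += p
    py[y] += p

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(pmf)
# Crescents computed directly from conditional probabilities.
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, y), p in pmf.items() if p > 0)
H_X_given_Y = -sum(p * math.log2(p / py[y]) for (x, y), p in pmf.items() if p > 0)
I_XY = H_X + H_Y - H_XY  # the overlap

print(abs(H_X - (H_X_given_Y + I_XY)) < 1e-12)                  # left circle = crescent + overlap
print(abs(H_XY - (H_X_given_Y + I_XY + H_Y_given_X)) < 1e-12)   # union = all three pieces
```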
Joint entropy
The entropy of the pair $(X, Y)$: $H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)$. Satisfies $H(X, Y) \le H(X) + H(Y)$, with equality iff $X$ and $Y$ are independent.
Related: Entropy, Conditional entropy
Quick Check
If $H(X) = a$ bits and $H(Y \mid X) = b$ bits, what is $H(X, Y)$?
$a + b$ bits
$a - b$ bits
$ab$ bits
Cannot be determined
By the chain rule: $H(X, Y) = H(X) + H(Y \mid X) = a + b$ bits.