Joint Entropy and Conditional Entropy

From One Variable to Two

A single random variable in isolation is rarely interesting in communications. What matters is the relationship between the source message X and the channel output Y. How much does observing Y tell us about X? How much residual uncertainty about X remains after seeing Y? To answer these questions we need to extend entropy to pairs of random variables, which leads us to joint and conditional entropy.

Definition: Joint Entropy

Let (X, Y) be a pair of discrete random variables with joint PMF p(x, y) over the alphabet \mathcal{X} \times \mathcal{Y}. The joint entropy is

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y).

More generally, for n random variables X_1, \ldots, X_n:

H(X_1, \ldots, X_n) = -\sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n).
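As a concrete illustration, the sum above is straightforward to evaluate numerically. The following is a minimal sketch in Python using base-2 logarithms (so entropies are in bits); the function name joint_entropy and the example PMF are illustrative assumptions, not from the text.

```python
import numpy as np

def joint_entropy(p_xy):
    """H(X,Y) in bits for a joint PMF given as a 2-D array (rows = x, columns = y)."""
    p = np.asarray(p_xy, dtype=float).ravel()
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint PMF over a 2x2 alphabet
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(joint_entropy(p_xy))                # ~1.72 bits
```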

Definition: Conditional Entropy

The conditional entropy of Y given X is

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) \, H(Y|X=x) = -\sum_{x,y} p(x,y) \log p(y|x).

This is the average residual uncertainty about Y after observing X. It is not the entropy of Y given a specific value X = x; that quantity, H(Y|X=x) = -\sum_y p(y|x) \log p(y|x), is a function of x. Conditional entropy averages it over the distribution of X.

The distinction between H(Y|X) (a number) and H(Y|X=x) (a function of x) is a common source of confusion. The conditional entropy is an expectation over X, not a conditional quantity for a fixed realization.
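To make the averaging explicit, here is a small Python sketch that computes H(Y|X) the way the definition reads: first H(Y|X=x) for each x, then the expectation over p(x). The function name and the example PMF are assumptions for illustration.

```python
import numpy as np

def conditional_entropy(p_xy):
    """H(Y|X) in bits from a joint PMF (rows = x, columns = y)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                          # marginal p(x)
    h = 0.0
    for px, row in zip(p_x, p_xy):
        if px == 0:
            continue
        p_y_given_x = row / px                      # p(y | X = x)
        nz = p_y_given_x > 0
        # H(Y|X=x): a number for each value x ...
        h_x = -np.sum(p_y_given_x[nz] * np.log2(p_y_given_x[nz]))
        h += px * h_x                               # ... averaged over p(x)
    return float(h)

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(conditional_entropy(p_xy))                    # ~0.72 bits
```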

Conditional entropy

The average uncertainty remaining about Y after observing X: H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x). Always non-negative. Equals zero iff Y is a deterministic function of X.

Related: Entropy, Mutual information

Theorem: Chain Rule for Entropy

For any pair of discrete random variables (X, Y):

H(X, Y) = H(X) + H(Y|X).

More generally, for n random variables:

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1).

To describe the pair (X, Y), first describe X (costing H(X) bits on average), then describe Y given the value of X (costing H(Y|X) additional bits on average). The total cost is the joint entropy. The chain rule as a telescoping sum is a pattern that reappears in the converse proof of the channel coding theorem and throughout multiuser information theory.
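A quick numerical check of the chain rule, as a self-contained Python sketch on a hypothetical 2x2 joint PMF (all quantities in bits):

```python
import numpy as np

def H(p):
    """Entropy in bits of any array of probabilities, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint PMF (rows = x, columns = y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)

H_XY = H(p_xy)                                                       # joint entropy
H_Y_given_X = sum(px * H(row / px) for px, row in zip(p_x, p_xy) if px > 0)

print(H_XY, H(p_x) + H_Y_given_X)    # both ~1.72: H(X,Y) = H(X) + H(Y|X)
```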

Theorem: Conditioning Reduces Entropy

For any discrete random variables X and Y:

H(Y|X) \leq H(Y),

with equality if and only if XX and YY are independent.

On average, knowing X can only reduce (or maintain) our uncertainty about Y. Information never hurts. This is the "information is non-negative" principle in disguise: it follows directly from the non-negativity of mutual information, which we define in the next section.
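The inequality is easy to probe numerically. Below is a small sketch contrasting a dependent pair (strict inequality) with an independent pair built as p(x,y) = p(x)p(y) (equality); the specific PMFs are assumptions chosen only for illustration.

```python
import numpy as np

def H(p):
    """Entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def H_cond(p_xy):
    """H(Y|X) in bits from a joint PMF (rows = x, columns = y)."""
    p_x = p_xy.sum(axis=1)
    return sum(px * H(row / px) for px, row in zip(p_x, p_xy) if px > 0)

dep = np.array([[0.4, 0.1],                # X and Y dependent
                [0.1, 0.4]])
ind = np.outer([0.5, 0.5], [0.8, 0.2])     # p(x,y) = p(x) p(y): independent

print(H_cond(dep), H(dep.sum(axis=0)))     # ~0.72 < 1.0   (strict inequality)
print(H_cond(ind), H(ind.sum(axis=0)))     # ~0.72 = ~0.72 (equality under independence)
```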

Common Mistake: Conditioning on a Specific Value Can Increase Entropy

Mistake:

Believing that H(Y|X=x) \leq H(Y) for every specific value x. For example, concluding that learning any particular fact about X always reduces uncertainty about Y.

Correction:

The inequality H(Y|X) \leq H(Y) holds for the average over X, not pointwise. For specific values of x, it is possible that H(Y|X=x) > H(Y). Consider a hidden coin: X=0 means it is fair (entropy 1 bit), X=1 means it is biased toward heads (entropy < 1 bit). If \Pr(X=0) = 0.01, then H(Y) is low (close to the biased entropy), but H(Y|X=0) = 1 bit, which is higher.
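Putting rough numbers on the hidden-coin example (the bias 0.9 for the unfair coin is an assumed value, chosen only for illustration): H(Y) comes out to about 0.48 bits, H(Y|X=0) = 1 bit exceeds it, yet the average H(Y|X) is still below H(Y).

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

p_x0 = 0.01                     # Pr(X = 0): the coin is fair
p_fair, p_biased = 0.5, 0.9     # Pr(heads) for each coin (0.9 is an assumed bias)

p_heads = p_x0 * p_fair + (1 - p_x0) * p_biased
print(h2(p_heads))                                     # H(Y)      ~0.48 bits
print(h2(p_fair))                                      # H(Y|X=0)   1.00 bit  > H(Y)
print(p_x0 * h2(p_fair) + (1 - p_x0) * h2(p_biased))   # H(Y|X)    ~0.47 bits <= H(Y)
```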

Example: Joint and Conditional Entropy for a Binary Pair

Let X \sim \text{Bernoulli}(1/2) and let Y = X \oplus Z, where Z \sim \text{Bernoulli}(\epsilon) is independent of X and \oplus denotes XOR. This models a binary symmetric channel with crossover probability \epsilon. Compute H(X), H(Y), H(X,Y), H(Y|X), and H(X|Y).
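A sketch of the solution, writing H_b(\epsilon) = -\epsilon \log \epsilon - (1-\epsilon) \log (1-\epsilon) for the binary entropy function: since X is uniform, H(X) = 1 bit. Given X, the output Y = X \oplus Z is determined by Z alone, so H(Y|X) = H(Z) = H_b(\epsilon). Because X is uniform, Y is also uniform, giving H(Y) = 1 bit. The chain rule then yields H(X,Y) = H(X) + H(Y|X) = 1 + H_b(\epsilon), and H(X|Y) = H(X,Y) - H(Y) = H_b(\epsilon).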

The Entropy Venn Diagram

The relationships between H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y) can be visualized as a Venn diagram:

  • The left circle has area H(X)
  • The right circle has area H(Y)
  • The intersection (overlap) has area I(X;Y)
  • The left crescent has area H(X|Y)
  • The right crescent has area H(Y|X)
  • The union has area H(X,Y)

This diagram is useful for intuition, but dangerous if taken too literally. For three or more variables, the "areas" can become negative (the interaction information I(X;Y;Z) can be negative), so the Venn diagram breaks down. Use it as a mnemonic for two variables, but rely on algebraic identities for rigorous arguments.
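For reference, the algebraic identities that the two-variable diagram encodes (with the mutual information I(X;Y) defined in the next section) are:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y),

H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X).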

Joint entropy

The entropy of the pair (X,Y): H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y). Satisfies H(X,Y) \leq H(X) + H(Y), with equality iff X and Y are independent.

Related: Entropy, Conditional entropy

Quick Check

If H(X) = 3 bits and H(X,Y) = 5 bits, what is H(Y|X)?

2 bits

5 bits

8 bits

Cannot be determined

Entropy Venn Diagram Animation

Animated Venn diagram showing how H(X), H(Y), I(X;Y), H(X|Y), and H(Y|X) relate as overlapping circles. The overlap grows with dependence and vanishes for independent variables.