Joint Entropy and Conditional Entropy

From One Variable to Two

A single random variable in isolation is rarely interesting in communications. What matters is the relationship between the source message X and the channel output Y. How much does observing Y tell us about X? How much residual uncertainty about X remains after seeing Y? To answer these questions we need to extend entropy to pairs of random variables, which leads us to joint and conditional entropy.

Definition: Joint Entropy

Let (X, Y) be a pair of discrete random variables with joint PMF p(x, y) over the alphabet \mathcal{X} \times \mathcal{Y}. The joint entropy is

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y).

More generally, for n random variables X_1, \ldots, X_n:

H(X_1, \ldots, X_n) = -\sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n).
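As a concrete illustration, the sum above is straightforward to evaluate numerically. The following is a minimal sketch in Python using base-2 logarithms (so entropies are in bits); the function name joint_entropy and the example PMF are illustrative assumptions, not from the text.

```python
import numpy as np

def joint_entropy(p_xy):
    """H(X,Y) in bits for a joint PMF given as a 2-D array (rows = x, columns = y)."""
    p = np.asarray(p_xy, dtype=float).ravel()
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint PMF over a 2x2 alphabet
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(joint_entropy(p_xy))                # ~1.72 bits
```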

Definition: Conditional Entropy

The conditional entropy of Y given X is

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) \, H(Y|X=x) = -\sum_{x,y} p(x,y) \log p(y|x).

This is the average residual uncertainty about Y after observing X. It is not the entropy of Y given a specific value X = x; that quantity, H(Y|X=x) = -\sum_y p(y|x) \log p(y|x), is a function of x. Conditional entropy averages it over the distribution of X.

The distinction between H(Y|X) (a number) and H(Y|X=x) (a function of x) is a common source of confusion. The conditional entropy is an expectation over X, not a conditional quantity for a fixed realization.
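To make the averaging explicit, here is a small Python sketch that computes H(Y|X) the way the definition reads: first H(Y|X=x) for each x, then the expectation over p(x). The function name and the example PMF are assumptions for illustration.

```python
import numpy as np

def conditional_entropy(p_xy):
    """H(Y|X) in bits from a joint PMF (rows = x, columns = y)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                          # marginal p(x)
    h = 0.0
    for px, row in zip(p_x, p_xy):
        if px == 0:
            continue
        p_y_given_x = row / px                      # p(y | X = x)
        nz = p_y_given_x > 0
        # H(Y|X=x): a number for each value x ...
        h_x = -np.sum(p_y_given_x[nz] * np.log2(p_y_given_x[nz]))
        h += px * h_x                               # ... averaged over p(x)
    return float(h)

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(conditional_entropy(p_xy))                    # ~0.72 bits
```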

Conditional entropy

The average uncertainty remaining about Y after observing X: H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x). Always non-negative. Equals zero iff Y is a deterministic function of X.

Related: Entropy, Mutual information

Theorem: Chain Rule for Entropy

For any pair of discrete random variables (X, Y):

H(X, Y) = H(X) + H(Y|X).

More generally, for n random variables:

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1).

To describe the pair (X, Y), first describe X (costing H(X) bits on average), then describe Y given the value of X (costing H(Y|X) additional bits on average). The total cost is the joint entropy. The chain rule as a telescoping sum is a pattern that reappears in the converse proof of the channel coding theorem and throughout multiuser information theory.
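A quick numerical check of the chain rule, as a self-contained Python sketch on a hypothetical 2x2 joint PMF (all quantities in bits):

```python
import numpy as np

def H(p):
    """Entropy in bits of any array of probabilities, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint PMF (rows = x, columns = y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)

H_XY = H(p_xy)                                                       # joint entropy
H_Y_given_X = sum(px * H(row / px) for px, row in zip(p_x, p_xy) if px > 0)

print(H_XY, H(p_x) + H_Y_given_X)    # both ~1.72: H(X,Y) = H(X) + H(Y|X)
```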

Theorem: Conditioning Reduces Entropy

For any discrete random variables X and Y:

H(Y|X) \leq H(Y),

with equality if and only if XX and YY are independent.

On average, knowing X can only reduce (or maintain) our uncertainty about Y. Information never hurts. This is the "information is non-negative" principle in disguise: it follows directly from the non-negativity of mutual information, which we define in the next section.
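The inequality is easy to probe numerically. Below is a small sketch contrasting a dependent pair (strict inequality) with an independent pair built as p(x,y) = p(x)p(y) (equality); the specific PMFs are assumptions chosen only for illustration.

```python
import numpy as np

def H(p):
    """Entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def H_cond(p_xy):
    """H(Y|X) in bits from a joint PMF (rows = x, columns = y)."""
    p_x = p_xy.sum(axis=1)
    return sum(px * H(row / px) for px, row in zip(p_x, p_xy) if px > 0)

dep = np.array([[0.4, 0.1],                # X and Y dependent
                [0.1, 0.4]])
ind = np.outer([0.5, 0.5], [0.8, 0.2])     # p(x,y) = p(x) p(y): independent

print(H_cond(dep), H(dep.sum(axis=0)))     # ~0.72 < 1.0   (strict inequality)
print(H_cond(ind), H(ind.sum(axis=0)))     # ~0.72 = ~0.72 (equality under independence)
```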

Common Mistake: Conditioning on a Specific Value Can Increase Entropy

Mistake:

Believing that H(Y|X=x) \leq H(Y) for every specific value x. For example, concluding that learning any particular fact about X always reduces uncertainty about Y.

Correction:

The inequality H(Y|X) \leq H(Y) holds for the average over X, not pointwise. For specific values of x, it is possible that H(Y|X=x) > H(Y). Consider a hidden coin: X=0 means it is fair (entropy 1 bit), X=1 means it is biased toward heads (entropy < 1 bit). If \Pr(X=0) = 0.01, then H(Y) is low (close to the biased entropy), but H(Y|X=0) = 1 bit, which is higher.
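Putting rough numbers on the hidden-coin example (the bias 0.9 for the unfair coin is an assumed value, chosen only for illustration): H(Y) comes out to about 0.48 bits, H(Y|X=0) = 1 bit exceeds it, yet the average H(Y|X) is still below H(Y).

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

p_x0 = 0.01                     # Pr(X = 0): the coin is fair
p_fair, p_biased = 0.5, 0.9     # Pr(heads) for each coin (0.9 is an assumed bias)

p_heads = p_x0 * p_fair + (1 - p_x0) * p_biased
print(h2(p_heads))                                     # H(Y)      ~0.48 bits
print(h2(p_fair))                                      # H(Y|X=0)   1.00 bit  > H(Y)
print(p_x0 * h2(p_fair) + (1 - p_x0) * h2(p_biased))   # H(Y|X)    ~0.47 bits <= H(Y)
```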

Example: Joint and Conditional Entropy for a Binary Pair

Let X \sim \text{Bernoulli}(1/2) and let Y = X \oplus Z, where Z \sim \text{Bernoulli}(\epsilon) is independent of X and \oplus denotes XOR. This models a binary symmetric channel with crossover probability \epsilon. Compute H(X), H(Y), H(X,Y), H(Y|X), and H(X|Y).
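A sketch of the solution, writing H_b(\epsilon) = -\epsilon \log \epsilon - (1-\epsilon) \log (1-\epsilon) for the binary entropy function: since X is uniform, H(X) = 1 bit. Given X, the output Y = X \oplus Z is determined by Z alone, so H(Y|X) = H(Z) = H_b(\epsilon). Because X is uniform, Y is also uniform, giving H(Y) = 1 bit. The chain rule then yields H(X,Y) = H(X) + H(Y|X) = 1 + H_b(\epsilon), and H(X|Y) = H(X,Y) - H(Y) = H_b(\epsilon).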

The Entropy Venn Diagram

The relationships between H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y) can be visualized as a Venn diagram:

  • The left circle has area H(X)
  • The right circle has area H(Y)
  • The intersection (overlap) has area I(X;Y)
  • The left crescent has area H(X|Y)
  • The right crescent has area H(Y|X)
  • The union has area H(X,Y)

This diagram is useful for intuition, but dangerous if taken too literally. For three or more variables, the "areas" can become negative (the interaction information I(X;Y;Z) can be negative), so the Venn diagram breaks down. Use it as a mnemonic for two variables, but rely on algebraic identities for rigorous arguments.
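For reference, the algebraic identities that the two-variable diagram encodes (with the mutual information I(X;Y) defined in the next section) are:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y),

H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X).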

Joint entropy

The entropy of the pair (X,Y): H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y). Satisfies H(X,Y) \leq H(X) + H(Y), with equality iff X and Y are independent.

Related: Entropy, Conditional entropy

Quick Check

If H(X) = 3 bits and H(X,Y) = 5 bits, what is H(Y|X)?

2 bits

5 bits

8 bits

Cannot be determined

Entropy Venn Diagram Animation

Animated Venn diagram showing how H(X), H(Y), I(X;Y), H(X|Y), and H(Y|X) relate as overlapping circles. The overlap grows with dependence and vanishes for independent variables.