Relative Entropy (Kullback-Leibler Divergence)
The Mother of All Information Inequalities
We now introduce the quantity that, in a precise sense, underlies everything else in this chapter. The Kullback-Leibler divergence measures the "distance" between two distributions. Its non-negativity — the information inequality — is the single most fundamental result in information theory. From it, we derive: non-negativity of mutual information, the bound on entropy, the data processing inequality, and much more. Master this one inequality and the rest follows.
Definition: Kullback-Leibler Divergence (Relative Entropy)
Let $P$ and $Q$ be two probability mass functions on the same finite alphabet $\mathcal{X}$. The Kullback-Leibler divergence (or relative entropy) from $P$ to $Q$ is
$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$ for $p > 0$.
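The following is a minimal sketch (not from the text) of how the definition and its conventions translate into a computation; the helper name `kl_divergence` and the example distributions are illustrative choices.

```python
# Sketch: D(P || Q) in bits for finite PMFs given as aligned lists,
# using the conventions 0*log(0/q) = 0 and p*log(p/0) = +infinity for p > 0.
import math

def kl_divergence(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue                # convention: 0 * log(0/q) = 0
        if q == 0:
            return math.inf         # convention: p * log(p/0) = +infinity
        total += p * math.log2(p / q)
    return total

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive: the PMFs differ
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical PMFs
```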
The KL divergence is not a distance in the metric sense: it is asymmetric ($D(P \| Q) \neq D(Q \| P)$ in general) and does not satisfy the triangle inequality. Nevertheless, it behaves like a "directed distance" in many important ways and is the natural measure of distributional mismatch in information theory, statistics, and machine learning.
Kullback-Leibler divergence
A measure of the "distance" from distribution $P$ to distribution $Q$: $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. Always non-negative. Equals zero iff $P = Q$. Not symmetric.
Related: Mutual information, Entropy
Information inequality
The statement that $D(P \| Q) \ge 0$ for all distributions $P$ and $Q$, with equality iff $P = Q$. Proved via Jensen's inequality. The most fundamental inequality in information theory.
Related: Kullback-Leibler divergence
Theorem: The Information Inequality (Gibbs' Inequality)
For any two PMFs $P$ and $Q$ on the same alphabet $\mathcal{X}$:
$$D(P \| Q) \ge 0,$$
with equality if and only if $P = Q$.
You can think of $D(P \| Q)$ as the expected extra cost (in bits) of using a code designed for $Q$ when the true distribution is $P$. Using the wrong code is never cheaper than using the right one, and it costs strictly more unless the two distributions are identical.
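As a quick numerical illustration of this coding interpretation, the sketch below (with a made-up distribution $P$ and a mismatched $Q$) checks that the average length $-\sum_x P(x)\log_2 Q(x)$ exceeds the entropy $H(P)$ by exactly $D(P \| Q)$ bits.

```python
# Coding-cost illustration (the distributions are made-up examples):
# encoding X ~ P with idealized lengths -log2 Q(x) costs H(P) + D(P || Q)
# bits per symbol on average, i.e. D(P || Q) bits more than the optimum H(P).
import math

P = [0.7, 0.2, 0.1]   # true distribution (hypothetical)
Q = [1/3, 1/3, 1/3]   # distribution the code was designed for (hypothetical)

entropy_P = -sum(p * math.log2(p) for p in P if p > 0)
cross_PQ  = -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)
kl_PQ     =  sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

print(entropy_P)              # optimal average length H(P), about 1.157 bits
print(cross_PQ)               # average length with the wrong code, about 1.585 bits
print(cross_PQ - entropy_P)   # extra cost...
print(kl_PQ)                  # ...matches D(P || Q), about 0.428 bits
```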
Apply Jensen's inequality
Since $\log$ is strictly concave,
$$-D(P \| Q) = \sum_{x:\,P(x)>0} P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_{x:\,P(x)>0} P(x)\,\frac{Q(x)}{P(x)} = \log \sum_{x:\,P(x)>0} Q(x) \le \log 1 = 0.$$
Conclude non-negativity
Therefore $-D(P \| Q) \le 0$, which gives $D(P \| Q) \ge 0$.
Equality condition
Equality in Jensen's inequality for a strictly concave function holds iff $Q(x)/P(x)$ is constant $P$-a.s. Since both $P$ and $Q$ sum to $1$, this constant must be $1$, so $P = Q$.
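The information inequality is also easy to probe numerically. The sketch below (an illustration, not part of the proof) draws random PMF pairs and checks that the divergence is never negative and vanishes when the two arguments coincide.

```python
# Numerical sanity check of Gibbs' inequality on randomly generated PMFs.
import math, random

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def kl(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

random.seed(0)
for _ in range(10_000):
    P, Q = random_pmf(4), random_pmf(4)
    assert kl(P, Q) >= 0            # D(P || Q) >= 0
    assert abs(kl(P, P)) < 1e-12    # equality when the distributions coincide
print("all checks passed")
```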
Consequences of the Information Inequality
Almost every major inequality in this chapter is a corollary of the information inequality. For example:
- Non-negativity of MI: $I(X;Y) = D(P_{XY} \| P_X P_Y) \ge 0$.
- Conditioning reduces entropy: $H(X \mid Y) \le H(X)$ follows from $I(X;Y) = H(X) - H(X \mid Y) \ge 0$.
- Entropy upper bound: $H(X) \le \log |\mathcal{X}|$ follows from $D(P \| U) \ge 0$, where $U$ is the uniform distribution on $\mathcal{X}$ (checked numerically below).
- Log-sum inequality: a weighted version of the information inequality.
The point is that there is really just one fundamental inequality in discrete information theory. Everything else is a special case.
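As a concrete check of the entropy upper bound listed above, the sketch below uses a made-up PMF and verifies that $D(P \| U) = \log_2 |\mathcal{X}| - H(P) \ge 0$.

```python
# Entropy upper bound via the information inequality:
# with U uniform on n symbols, D(P || U) = log2(n) - H(P) >= 0, so H(P) <= log2(n).
import math

P = [0.5, 0.25, 0.125, 0.125]   # hypothetical PMF, n = 4
n = len(P)
U = [1 / n] * n

H_P  = -sum(p * math.log2(p) for p in P if p > 0)
D_PU =  sum(p * math.log2(p / u) for p, u in zip(P, U) if p > 0)

print(H_P)                  # 1.75 bits
print(math.log2(n) - H_P)   # 0.25 bits
print(D_PU)                 # 0.25 bits as well: D(P || U) = log2(n) - H(P)
```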
Theorem: Mutual Information as KL Divergence
$$I(X;Y) = D\big(P_{XY} \,\big\|\, P_X P_Y\big).$$
Mutual information measures how much the joint distribution $P_{XY}$ deviates from the product of the marginals $P_X P_Y$. When $X$ and $Y$ are independent, $P_{XY} = P_X P_Y$ and $I(X;Y) = 0$. The more dependent $X$ and $Y$ are, the larger the divergence.
Direct verification
$$D(P_{XY} \| P_X P_Y) = \sum_{x,y} P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)} = I(X;Y).$$
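The identity is easy to evaluate on a small table. The joint PMF below is a made-up example; the code forms the marginals and computes $I(X;Y)$ directly as the divergence between the joint and the product of marginals.

```python
# I(X;Y) computed as D(P_XY || P_X P_Y) from a (hypothetical) 2x2 joint PMF.
import math

P_XY = [[0.3, 0.1],
        [0.1, 0.5]]   # rows index X, columns index Y

P_X = [sum(row) for row in P_XY]
P_Y = [sum(P_XY[x][y] for x in range(2)) for y in range(2)]

I = sum(P_XY[x][y] * math.log2(P_XY[x][y] / (P_X[x] * P_Y[y]))
        for x in range(2) for y in range(2) if P_XY[x][y] > 0)
print(I)   # about 0.26 bits > 0: X and Y are dependent in this example
```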
Example: KL Divergence Is Not Symmetric
Let $P$ and $Q$ be two PMFs on the same finite alphabet. Compute $D(P \| Q)$ and $D(Q \| P)$ and compare the two values.
Compute $D(P \| Q)$
$$D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$
Compute $D(Q \| P)$
$$D(Q \| P) = \sum_{x} Q(x) \log \frac{Q(x)}{P(x)}.$$
Observation
In this particular example, the two divergences happen to be equal due to the symmetry of the distributions (they are permutations of each other). In general, $D(P \| Q) \neq D(Q \| P)$. A clearer asymmetry: take $Q$ with $Q(x) = 0$ at some $x$ where $P(x) > 0$. Then $D(P \| Q) = \infty$ while $D(Q \| P)$ remains finite.
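The sketch below makes the asymmetry concrete with made-up binary distributions: for one pair the two divergences take different finite values, and putting a zero in $Q$ where $P$ has mass sends $D(P \| Q)$ to infinity while leaving $D(Q \| P)$ finite.

```python
# Asymmetry of KL divergence on made-up binary PMFs.
import math

def kl(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue
        if q == 0:
            return math.inf
        total += p * math.log2(p / q)
    return total

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kl(P, Q))   # about 0.737 bits
print(kl(Q, P))   # about 0.531 bits: a different value

Q0 = [1.0, 0.0]
print(kl(P, Q0))  # inf: Q0 has zero mass where P has mass
print(kl(Q0, P))  # 1.0 bit: finite in the other direction
```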
Theorem: Log-Sum Inequality
For non-negative numbers $a_1, \dots, a_n$ and $b_1, \dots, b_n$:
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$
with equality iff $a_i / b_i$ is constant for all $i$.
This is a weighted generalization of Jensen's inequality for the convex function $f(t) = t \log t$. It provides a convenient way to prove many information-theoretic inequalities without going back to Jensen's inequality each time.
Normalize and apply Jensen
Let $A = \sum_i a_i$ and $B = \sum_i b_i$, and define $p_i = b_i / B$ (a probability distribution). With $f(t) = t \log t$ strictly convex and $t_i = a_i / b_i$, the left-hand side is $B \sum_i p_i f(t_i)$. Therefore, by Jensen's inequality,
$$\sum_i a_i \log \frac{a_i}{b_i} = B \sum_i p_i f(t_i) \ge B\, f\!\left(\sum_i p_i t_i\right) = B \cdot \frac{A}{B} \log \frac{A}{B} = A \log \frac{A}{B},$$
with equality iff $t_i = a_i / b_i$ is constant across $i$.
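A quick numerical spot-check of the log-sum inequality on made-up non-negative numbers (not part of the proof):

```python
# Log-sum inequality: sum_i a_i log(a_i/b_i) >= (sum_i a_i) log(sum_i a_i / sum_i b_i).
import math

a = [3.0, 1.0, 2.0]
b = [1.0, 4.0, 2.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
A, B = sum(a), sum(b)
rhs = A * math.log2(A / B)

print(lhs, rhs, lhs >= rhs)   # True; equality would need a_i / b_i constant
```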
Historical Note: Kullback and Leibler's Contribution
Solomon Kullback and Richard Leibler introduced their divergence measure in a 1951 paper on "information and sufficiency." Kullback was a cryptanalyst at the NSA; his interest in divergence came from hypothesis testing in signals intelligence. The quantity arises naturally as the expected log-likelihood ratio in the Neyman-Pearson lemma, and it governs the exponential rate at which hypothesis testing errors decrease with sample size.
The connection between information theory and statistics runs deep: Fisher information, sufficient statistics, and the Cramér-Rao bound all have natural information-theoretic interpretations that we will explore in Book FSI.
KL Divergence Between Two Distributions
Compare two distributions $P$ and $Q$ over a finite alphabet and visualize the KL divergence $D(P \| Q)$. Adjust $Q$ to see how the divergence changes. Notice that $D(P \| Q) = \infty$ when $Q(x) = 0$ for any $x$ where $P(x) > 0$.
Parameters
Probability Q assigns to outcome 1 (P is fixed at 0.5)
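A rough sketch of the computation behind this demo, assuming a binary alphabet with $P$ fixed at $(0.5, 0.5)$ and $Q = (q, 1-q)$ as $q$ is adjusted (the sweep values below are arbitrary):

```python
# D(P || Q) for P = (0.5, 0.5) and Q = (q, 1-q); the divergence grows without
# bound as q approaches 0 (or 1), since P has mass at both outcomes.
import math

P = [0.5, 0.5]
for q in [0.5, 0.3, 0.1, 0.01, 0.001]:
    Q = [q, 1 - q]
    d = sum(p * math.log2(p / qi) for p, qi in zip(P, Q))
    print(q, round(d, 4))
```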
Common Mistake: KL Divergence Is Not a Distance
Mistake:
Using $D(P \| Q)$ as if it were a symmetric distance metric. For example, writing "$P$ and $Q$ are close because $D(P \| Q)$ is small" without checking $D(Q \| P)$.
Correction:
KL divergence is asymmetric: $D(P \| Q) \neq D(Q \| P)$ in general. It also violates the triangle inequality. The choice of which distribution goes first matters and depends on the application: $D(P \| Q)$ penalizes $Q$ for placing too little mass where $P$ has mass (mode-covering when minimized over $Q$), while $D(Q \| P)$ penalizes $Q$ for placing mass where $P$ has none (mode-seeking). In hypothesis testing, $D(P \| Q)$ governs the type-II error exponent (Stein's lemma).
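A toy illustration of this forward/reverse distinction, with made-up distributions: a "narrow" $Q$ that covers only one mode of $P$ is punished by $D(P \| Q)$ but not by $D(Q \| P)$, and a "broad" $Q$ that spreads mass off the support of $P$ shows the opposite pattern.

```python
# Forward vs reverse KL on made-up three-outcome PMFs.
import math

def kl(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue
        if q == 0:
            return math.inf
        total += p * math.log2(p / q)
    return total

P        = [0.5, 0.5, 0.0]     # "bimodal" target: mass on outcomes 1 and 2 only
Q_narrow = [1.0, 0.0, 0.0]     # covers a single mode of P
Q_broad  = [1/3, 1/3, 1/3]     # spreads mass, including where P has none

print(kl(P, Q_narrow), kl(Q_narrow, P))  # inf, 1.0   : D(P||Q) punishes missing P's mass
print(kl(P, Q_broad),  kl(Q_broad, P))   # ~0.585, inf: D(Q||P) punishes mass off P's support
```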
Log-sum inequality
$\sum_i a_i \log \frac{a_i}{b_i} \ge \left(\sum_i a_i\right) \log \frac{\sum_i a_i}{\sum_i b_i}$ for non-negative $a_i$, $b_i$. A generalization of Jensen's inequality used to prove convexity of KL divergence and many other information inequalities.
Related: Kullback-Leibler divergence