Relative Entropy (Kullback-Leibler Divergence)

The Mother of All Information Inequalities

We now introduce the quantity that, in a precise sense, underlies everything else in this chapter. The Kullback-Leibler divergence D(P \| Q) measures the "distance" between two distributions. Its non-negativity, known as the information inequality, is the single most fundamental result in information theory. From it we derive the non-negativity of mutual information, the upper bound on entropy, the data processing inequality, and much more. Master this one inequality and the rest follows.

Definition:

Kullback-Leibler Divergence (Relative Entropy)

Let P and Q be two probability mass functions on the same finite alphabet \mathcal{X}. The Kullback-Leibler divergence (or relative entropy) from P to Q is

D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},

with the conventions 0 \log \frac{0}{q} = 0 and p \log \frac{p}{0} = +\infty when p > 0.
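As a concrete reference, here is a minimal Python sketch of this definition, including both conventions; the function name kl_divergence and the sample distributions are ours, not part of the text.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D(P || Q) for two PMFs given as equal-length sequences of probabilities.

    Uses the conventions above: 0 * log(0/q) = 0, and p * log(p/0) = +inf for p > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue              # 0 log(0/q) = 0
        if qi == 0:
            return math.inf       # p log(p/0) = +inf when p > 0
        total += pi * math.log(pi / qi, base)
    return total

# D(P || Q) in bits for a small pair of PMFs
print(kl_divergence([0.5, 0.25, 0.25], [0.25, 0.5, 0.25]))  # 0.25
```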

The KL divergence is not a distance in the metric sense: it is asymmetric (D(P \| Q) \neq D(Q \| P) in general) and does not satisfy the triangle inequality. Nevertheless, it behaves like a "directed distance" in many important ways and is the natural measure of distributional mismatch in information theory, statistics, and machine learning.

Kullback-Leibler divergence

A measure of the "distance" from distribution P to distribution Q: D(P \| Q) = \sum_x P(x) \log(P(x)/Q(x)). Always non-negative. Equals zero iff P = Q. Not symmetric.

Related: Mutual information, Entropy

Information inequality

The statement that D(P \| Q) \geq 0 for all distributions P, Q, with equality iff P = Q. Proved via Jensen's inequality. The most fundamental inequality in information theory.

Related: Kullback-Leibler divergence

Theorem: The Information Inequality (Gibbs' Inequality)

For any two PMFs P and Q on the same alphabet \mathcal{X}:

D(P \| Q) \geq 0,

with equality if and only if P = Q.

You can think of D(P \| Q) as the expected extra cost (in bits) of using a code designed for Q when the true distribution is P. Using the wrong code is never cheaper than using the right one, and it costs strictly more unless the two distributions are identical.
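A minimal numerical sketch of this coding interpretation, assuming an idealized code that assigns length -\log_2 Q(x) to symbol x and ignoring the integer-length constraint of real codes; the helper names are ours.

```python
import math

def entropy(p):
    """H(P) in bits: the ideal average code length when the code matches P."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average length when the true distribution is P but each symbol x
    is coded with length -log2 Q(x), i.e. with the ideal code for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.5, 0.25]

# Expected extra cost of the mismatched code equals D(P || Q)
print(cross_entropy(P, Q) - entropy(P))   # 0.25 bits per symbol
```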

Consequences of the Information Inequality

Almost every major inequality in this chapter is a corollary of the information inequality. For example:

  • Non-negativity of MI: I(X;Y) = D(P_{XY} \| P_X \times P_Y) \geq 0.
  • Conditioning reduces entropy: follows from I(X;Y) \geq 0.
  • Entropy upper bound: H(X) \leq \log|\mathcal{X}| follows from D(P \| U) \geq 0, where U is uniform (checked numerically below).
  • Log-sum inequality: a weighted version of the information inequality.

The point is that there is really just one fundamental inequality in discrete information theory. Everything else is a special case.
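As a quick numerical check of the entropy upper bound in the third bullet, here is a small sketch; the distribution P below is arbitrary and the helper names are ours.

```python
import math

def kl(p, q):
    """D(P || Q) in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

P = [0.7, 0.2, 0.05, 0.05]
U = [1 / len(P)] * len(P)                 # uniform distribution on the same alphabet

# D(P || U) = log|X| - H(P), so D(P || U) >= 0 is exactly H(P) <= log|X|.
print(kl(P, U))                           # ~0.743
print(math.log2(len(P)) - entropy(P))     # same value
```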

Theorem: Mutual Information as KL Divergence

I(X;Y) = D(P_{XY} \| P_X \times P_Y).

Mutual information measures how much the joint distribution P_{XY} deviates from the product of the marginals P_X \times P_Y. When X and Y are independent, P_{XY} = P_X \times P_Y and I(X;Y) = 0. The more dependent X and Y are, the larger the divergence.
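A small sketch of this identity for a made-up joint PMF over two binary variables; the numbers are purely illustrative.

```python
import math

# Joint PMF P_XY with rows indexed by x and columns by y (illustrative values)
P_XY = [[0.3, 0.2],
        [0.1, 0.4]]

P_X = [sum(row) for row in P_XY]                              # marginal of X
P_Y = [sum(P_XY[x][y] for x in range(2)) for y in range(2)]   # marginal of Y

# I(X;Y) = D(P_XY || P_X x P_Y), summing only over cells with positive mass
mi = sum(
    P_XY[x][y] * math.log2(P_XY[x][y] / (P_X[x] * P_Y[y]))
    for x in range(2) for y in range(2)
    if P_XY[x][y] > 0
)
print(mi)   # ~0.125 bits: positive because X and Y are dependent here
```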

Example: KL Divergence Is Not Symmetric

Let P = (1/2, 1/4, 1/4) and Q = (1/8, 3/8, 1/2) be PMFs on \mathcal{X} = \{a, b, c\}. Compute D(P \| Q) and D(Q \| P) and verify that they differ: D(P \| Q) = 1 + \frac{1}{4}\log\frac{2}{3} - \frac{1}{4} \approx 0.604 bits, while D(Q \| P) = -\frac{1}{4} + \frac{3}{8}\log\frac{3}{2} + \frac{1}{2} \approx 0.469 bits. (A Q obtained from P by merely swapping two symbols, such as (1/4, 1/2, 1/4), gives the same value in both directions and so does not exhibit the asymmetry.)
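A quick numerical check of this example; the helper name kl is ours.

```python
import math

def kl(p, q):
    """D(P || Q) in bits (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [1/2, 1/4, 1/4]
Q = [1/8, 3/8, 1/2]

print(kl(P, Q))   # ~0.604 bits
print(kl(Q, P))   # ~0.469 bits: different, so KL divergence is not symmetric
```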

Theorem: Log-Sum Inequality

For non-negative numbers a_1, \ldots, a_n and b_1, \ldots, b_n:

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i},

with equality iff a_i / b_i is constant for all i.

This is a weighted generalization of Jensen's inequality for the convex function t \log t. It provides a convenient way to prove many information-theoretic inequalities without going back to Jensen's inequality each time.
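For instance, taking a_i = P(x) and b_i = Q(x) and summing over x \in \mathcal{X} recovers the information inequality in one line, since both sums equal 1:

\sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \geq \left(\sum_{x} P(x)\right) \log \frac{\sum_{x} P(x)}{\sum_{x} Q(x)} = 1 \cdot \log 1 = 0,

with equality iff P(x)/Q(x) is constant in x, which for two PMFs forces P = Q.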

Historical Note: Kullback and Leibler's Contribution

1951

Solomon Kullback and Richard Leibler introduced their divergence measure in a 1951 paper on "information and sufficiency." Kullback was a cryptanalyst at the NSA; his interest in divergence came from hypothesis testing in signals intelligence. The quantity D(P \| Q) arises naturally as the expected log-likelihood ratio in the Neyman-Pearson lemma, and it governs the exponential rate at which hypothesis testing errors decrease with sample size.

The connection between information theory and statistics runs deep: Fisher information, sufficient statistics, and the Cramér-Rao bound all have natural information-theoretic interpretations that we will explore in Book FSI.

KL Divergence Between Two Distributions

Compare two distributions P and Q over a finite alphabet and visualize the KL divergence D(P \| Q). Adjust Q to see how the divergence changes. Notice that D(P \| Q) \to \infty when Q(x) \to 0 for any x where P(x) > 0.

Parameter: the probability Q assigns to outcome 1 (default 0.5; P is fixed at 0.5).
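A non-interactive stand-in for this demo, sweeping Q(1) toward zero while P stays fixed at (0.5, 0.5); the function name is ours.

```python
import math

def kl_binary(p, q):
    """D(P || Q) in bits for binary PMFs P = (p, 1-p) and Q = (q, 1-q)."""
    d = 0.0
    for pi, qi in ((p, q), (1 - p, 1 - q)):
        if pi == 0:
            continue            # 0 log(0/q) = 0
        if qi == 0:
            return math.inf     # p log(p/0) = +inf when p > 0
        d += pi * math.log2(pi / qi)
    return d

p = 0.5   # P is fixed at (0.5, 0.5), matching the demo above
for q in [0.5, 0.3, 0.1, 0.01, 0.001, 0.0]:
    print(f"Q(1) = {q:>5}: D(P||Q) = {kl_binary(p, q):.3f} bits")
# The divergence grows without bound as Q(1) -> 0 while P(1) = 0.5 > 0,
# and is infinite once Q(1) = 0 exactly.
```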

Common Mistake: KL Divergence Is Not a Distance

Mistake:

Using D(P \| Q) as if it were a symmetric distance metric. For example, writing "P and Q are close because D(P \| Q) is small" without checking D(Q \| P).

Correction:

KL divergence is asymmetric: D(P \| Q) \neq D(Q \| P) in general. It also violates the triangle inequality. The choice of which distribution goes first matters and depends on the application. When fitting an approximation Q to a target P, minimizing D(P \| Q) penalizes Q for placing too little mass where P has mass, so the fit tends to cover all of P's modes (mode-covering), while minimizing D(Q \| P) penalizes Q for placing mass where P has little or none, so the fit tends to lock onto a single mode (mode-seeking). In hypothesis testing, D(P_0 \| P_1), with P_0 the null and P_1 the alternative, governs the type-II error exponent (Stein's lemma).
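A small discrete illustration of this asymmetry; the "two-mode" target P and the single-mode approximation Q below are made-up numbers.

```python
import math

def kl(p, q):
    """D(P || Q) in bits, with the 0 log 0 = 0 and p log(p/0) = +inf conventions."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        d += pi * math.log2(pi / qi)
    return d

# Target P spreads its mass over two "modes"; Q concentrates on the first one.
P = [0.45, 0.05, 0.45, 0.05]
Q = [0.90, 0.05, 0.025, 0.025]

print(kl(P, Q))   # ~1.48 bits: large because Q nearly ignores P's second mode
print(kl(Q, P))   # ~0.77 bits: smaller because Q only puts mass where P has some
```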

Log-sum inequality

\sum a_i \log(a_i/b_i) \geq (\sum a_i) \log(\sum a_i / \sum b_i) for non-negative a_i, b_i. A generalization of Jensen's inequality used to prove convexity of KL divergence and many other information inequalities.

Related: Kullback-Leibler divergence

The Information Inequality: D(P \| Q) \geq 0

Animates two distributions P and Q as Q morphs toward P, showing D(P \| Q) decreasing to zero. Demonstrates that KL divergence is zero if and only if P = Q.