Relative Entropy (Kullback-Leibler Divergence)
The Mother of All Information Inequalities
We now introduce the quantity that, in a precise sense, underlies everything else in this chapter. The Kullback-Leibler divergence measures the "distance" between two distributions. Its non-negativity — the information inequality — is the single most fundamental result in information theory. From it, we derive: non-negativity of mutual information, the bound on entropy, the data processing inequality, and much more. Master this one inequality and the rest follows.
Definition: Kullback-Leibler Divergence (Relative Entropy)
Let $P$ and $Q$ be two probability mass functions on the same finite alphabet $\mathcal{X}$. The Kullback-Leibler divergence (or relative entropy) from $P$ to $Q$ is
$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$ for $p > 0$.
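The following is a minimal sketch (not from the text) of how the definition and its conventions translate into a computation; the helper name `kl_divergence` and the example distributions are illustrative choices.

```python
# Sketch: D(P || Q) in bits for finite PMFs given as aligned lists,
# using the conventions 0*log(0/q) = 0 and p*log(p/0) = +infinity for p > 0.
import math

def kl_divergence(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue                # convention: 0 * log(0/q) = 0
        if q == 0:
            return math.inf         # convention: p * log(p/0) = +infinity
        total += p * math.log2(p / q)
    return total

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive: the PMFs differ
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical PMFs
```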
The KL divergence is not a distance in the metric sense: it is asymmetric ($D(P \| Q) \neq D(Q \| P)$ in general) and does not satisfy the triangle inequality. Nevertheless, it behaves like a "directed distance" in many important ways and is the natural measure of distributional mismatch in information theory, statistics, and machine learning.
Kullback-Leibler divergence
A measure of the "distance" from distribution $P$ to distribution $Q$: $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. Always non-negative. Equals zero iff $P = Q$. Not symmetric.
Related: Mutual information, Entropy
Information inequality
The statement that $D(P \| Q) \ge 0$ for all distributions $P$ and $Q$, with equality iff $P = Q$. Proved via Jensen's inequality. The most fundamental inequality in information theory.
Related: Kullback-Leibler divergence
Theorem: The Information Inequality (Gibbs' Inequality)
For any two PMFs $P$ and $Q$ on the same alphabet $\mathcal{X}$:
$$D(P \| Q) \ge 0,$$
with equality if and only if $P = Q$.
You can think of $D(P \| Q)$ as the expected extra cost (in bits) of using a code designed for $Q$ when the true distribution is $P$. Using the wrong code is never cheaper than using the right one, and it costs strictly more unless the two distributions are identical.
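As a quick numerical illustration of this coding interpretation, the sketch below (with a made-up distribution $P$ and a mismatched $Q$) checks that the average length $-\sum_x P(x)\log_2 Q(x)$ exceeds the entropy $H(P)$ by exactly $D(P \| Q)$ bits.

```python
# Coding-cost illustration (the distributions are made-up examples):
# encoding X ~ P with idealized lengths -log2 Q(x) costs H(P) + D(P || Q)
# bits per symbol on average, i.e. D(P || Q) bits more than the optimum H(P).
import math

P = [0.7, 0.2, 0.1]   # true distribution (hypothetical)
Q = [1/3, 1/3, 1/3]   # distribution the code was designed for (hypothetical)

entropy_P = -sum(p * math.log2(p) for p in P if p > 0)
cross_PQ  = -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)
kl_PQ     =  sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

print(entropy_P)              # optimal average length H(P), about 1.157 bits
print(cross_PQ)               # average length with the wrong code, about 1.585 bits
print(cross_PQ - entropy_P)   # extra cost...
print(kl_PQ)                  # ...matches D(P || Q), about 0.428 bits
```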
Apply Jensen's inequality
Since $\log$ is strictly concave,
$$-D(P \| Q) = \sum_{x:\,P(x)>0} P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_{x:\,P(x)>0} P(x)\,\frac{Q(x)}{P(x)} = \log \sum_{x:\,P(x)>0} Q(x) \le \log 1 = 0.$$
Conclude non-negativity
Therefore $-D(P \| Q) \le 0$, which gives $D(P \| Q) \ge 0$.
Equality condition
Equality in Jensen's inequality for a strictly concave function holds iff $Q(x)/P(x)$ is constant $P$-a.s. Since both $P$ and $Q$ sum to $1$, this constant must be $1$, so $P = Q$.
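The information inequality is also easy to probe numerically. The sketch below (an illustration, not part of the proof) draws random PMF pairs and checks that the divergence is never negative and vanishes when the two arguments coincide.

```python
# Numerical sanity check of Gibbs' inequality on randomly generated PMFs.
import math, random

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def kl(P, Q):
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

random.seed(0)
for _ in range(10_000):
    P, Q = random_pmf(4), random_pmf(4)
    assert kl(P, Q) >= 0            # D(P || Q) >= 0
    assert abs(kl(P, P)) < 1e-12    # equality when the distributions coincide
print("all checks passed")
```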
Consequences of the Information Inequality
Almost every major inequality in this chapter is a corollary of the information inequality. For example:
- Non-negativity of MI: $I(X;Y) = D(P_{XY} \| P_X P_Y) \ge 0$.
- Conditioning reduces entropy: $H(X \mid Y) \le H(X)$ follows from $I(X;Y) = H(X) - H(X \mid Y) \ge 0$.
- Entropy upper bound: $H(X) \le \log |\mathcal{X}|$ follows from $D(P \| U) \ge 0$, where $U$ is the uniform distribution on $\mathcal{X}$ (checked numerically below).
- Log-sum inequality: a weighted version of the information inequality.
The point is that there is really just one fundamental inequality in discrete information theory. Everything else is a special case.
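As a concrete check of the entropy upper bound listed above, the sketch below uses a made-up PMF and verifies that $D(P \| U) = \log_2 |\mathcal{X}| - H(P) \ge 0$.

```python
# Entropy upper bound via the information inequality:
# with U uniform on n symbols, D(P || U) = log2(n) - H(P) >= 0, so H(P) <= log2(n).
import math

P = [0.5, 0.25, 0.125, 0.125]   # hypothetical PMF, n = 4
n = len(P)
U = [1 / n] * n

H_P  = -sum(p * math.log2(p) for p in P if p > 0)
D_PU =  sum(p * math.log2(p / u) for p, u in zip(P, U) if p > 0)

print(H_P)                  # 1.75 bits
print(math.log2(n) - H_P)   # 0.25 bits
print(D_PU)                 # 0.25 bits as well: D(P || U) = log2(n) - H(P)
```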
Theorem: Mutual Information as KL Divergence
$$I(X;Y) = D\big(P_{XY} \,\big\|\, P_X P_Y\big).$$
Mutual information measures how much the joint distribution $P_{XY}$ deviates from the product of the marginals $P_X P_Y$. When $X$ and $Y$ are independent, $P_{XY} = P_X P_Y$ and $I(X;Y) = 0$. The more dependent $X$ and $Y$ are, the larger the divergence.
Direct verification
$$D(P_{XY} \| P_X P_Y) = \sum_{x,y} P_{XY}(x,y) \log \frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)} = I(X;Y).$$
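The identity is easy to evaluate on a small table. The joint PMF below is a made-up example; the code forms the marginals and computes $I(X;Y)$ directly as the divergence between the joint and the product of marginals.

```python
# I(X;Y) computed as D(P_XY || P_X P_Y) from a (hypothetical) 2x2 joint PMF.
import math

P_XY = [[0.3, 0.1],
        [0.1, 0.5]]   # rows index X, columns index Y

P_X = [sum(row) for row in P_XY]
P_Y = [sum(P_XY[x][y] for x in range(2)) for y in range(2)]

I = sum(P_XY[x][y] * math.log2(P_XY[x][y] / (P_X[x] * P_Y[y]))
        for x in range(2) for y in range(2) if P_XY[x][y] > 0)
print(I)   # about 0.26 bits > 0: X and Y are dependent in this example
```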
Example: KL Divergence Is Not Symmetric
Let $P$ and $Q$ be two PMFs on the same finite alphabet. Compute $D(P \| Q)$ and $D(Q \| P)$ and compare the two values.
Compute $D(P \| Q)$
$$D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$
Compute $D(Q \| P)$
$$D(Q \| P) = \sum_{x} Q(x) \log \frac{Q(x)}{P(x)}.$$
Observation
In this particular example, the two divergences happen to be equal due to the symmetry of the distributions (they are permutations of each other). In general, $D(P \| Q) \neq D(Q \| P)$. A clearer asymmetry: take $Q$ with $Q(x) = 0$ at some $x$ where $P(x) > 0$. Then $D(P \| Q) = \infty$ while $D(Q \| P)$ remains finite.
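The sketch below makes the asymmetry concrete with made-up binary distributions: for one pair the two divergences take different finite values, and putting a zero in $Q$ where $P$ has mass sends $D(P \| Q)$ to infinity while leaving $D(Q \| P)$ finite.

```python
# Asymmetry of KL divergence on made-up binary PMFs.
import math

def kl(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue
        if q == 0:
            return math.inf
        total += p * math.log2(p / q)
    return total

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kl(P, Q))   # about 0.737 bits
print(kl(Q, P))   # about 0.531 bits: a different value

Q0 = [1.0, 0.0]
print(kl(P, Q0))  # inf: Q0 has zero mass where P has mass
print(kl(Q0, P))  # 1.0 bit: finite in the other direction
```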
Theorem: Log-Sum Inequality
For non-negative numbers $a_1, \dots, a_n$ and $b_1, \dots, b_n$:
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},$$
with equality iff $a_i / b_i$ is constant for all $i$.
This is a weighted generalization of Jensen's inequality for the convex function $f(t) = t \log t$. It provides a convenient way to prove many information-theoretic inequalities without going back to Jensen's inequality each time.
Normalize and apply Jensen
Let $A = \sum_i a_i$ and $B = \sum_i b_i$, and define $p_i = b_i / B$ (a probability distribution). With $f(t) = t \log t$ strictly convex and $t_i = a_i / b_i$, the left-hand side is $B \sum_i p_i f(t_i)$. Therefore, by Jensen's inequality,
$$\sum_i a_i \log \frac{a_i}{b_i} = B \sum_i p_i f(t_i) \ge B\, f\!\left(\sum_i p_i t_i\right) = B \cdot \frac{A}{B} \log \frac{A}{B} = A \log \frac{A}{B},$$
with equality iff $t_i = a_i / b_i$ is constant across $i$.
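A quick numerical spot-check of the log-sum inequality on made-up non-negative numbers (not part of the proof):

```python
# Log-sum inequality: sum_i a_i log(a_i/b_i) >= (sum_i a_i) log(sum_i a_i / sum_i b_i).
import math

a = [3.0, 1.0, 2.0]
b = [1.0, 4.0, 2.0]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
A, B = sum(a), sum(b)
rhs = A * math.log2(A / B)

print(lhs, rhs, lhs >= rhs)   # True; equality would need a_i / b_i constant
```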
Historical Note: Kullback and Leibler's Contribution
Solomon Kullback and Richard Leibler introduced their divergence measure in a 1951 paper on "information and sufficiency." Kullback was a cryptanalyst at the NSA; his interest in divergence came from hypothesis testing in signals intelligence. The quantity arises naturally as the expected log-likelihood ratio in the Neyman-Pearson lemma, and it governs the exponential rate at which hypothesis testing errors decrease with sample size.
The connection between information theory and statistics runs deep: Fisher information, sufficient statistics, and the Cramér-Rao bound all have natural information-theoretic interpretations that we will explore in Book FSI.
KL Divergence Between Two Distributions
Compare two distributions $P$ and $Q$ over a finite alphabet and visualize the KL divergence $D(P \| Q)$. Adjust $Q$ to see how the divergence changes. Notice that $D(P \| Q) = \infty$ when $Q(x) = 0$ for any $x$ where $P(x) > 0$.
Parameters
Probability Q assigns to outcome 1 (P is fixed at 0.5)
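A rough sketch of the computation behind this demo, assuming a binary alphabet with $P$ fixed at $(0.5, 0.5)$ and $Q = (q, 1-q)$ as $q$ is adjusted (the sweep values below are arbitrary):

```python
# D(P || Q) for P = (0.5, 0.5) and Q = (q, 1-q); the divergence grows without
# bound as q approaches 0 (or 1), since P has mass at both outcomes.
import math

P = [0.5, 0.5]
for q in [0.5, 0.3, 0.1, 0.01, 0.001]:
    Q = [q, 1 - q]
    d = sum(p * math.log2(p / qi) for p, qi in zip(P, Q))
    print(q, round(d, 4))
```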
Common Mistake: KL Divergence Is Not a Distance
Mistake:
Using $D(P \| Q)$ as if it were a symmetric distance metric. For example, writing "$P$ and $Q$ are close because $D(P \| Q)$ is small" without checking $D(Q \| P)$.
Correction:
KL divergence is asymmetric: $D(P \| Q) \neq D(Q \| P)$ in general. It also violates the triangle inequality. The choice of which distribution goes first matters and depends on the application: $D(P \| Q)$ penalizes $Q$ for placing too little mass where $P$ has mass (mode-covering when minimized over $Q$), while $D(Q \| P)$ penalizes $Q$ for placing mass where $P$ has none (mode-seeking). In hypothesis testing, $D(P \| Q)$ governs the type-II error exponent (Stein's lemma).
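A toy illustration of this forward/reverse distinction, with made-up distributions: a "narrow" $Q$ that covers only one mode of $P$ is punished by $D(P \| Q)$ but not by $D(Q \| P)$, and a "broad" $Q$ that spreads mass off the support of $P$ shows the opposite pattern.

```python
# Forward vs reverse KL on made-up three-outcome PMFs.
import math

def kl(P, Q):
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue
        if q == 0:
            return math.inf
        total += p * math.log2(p / q)
    return total

P        = [0.5, 0.5, 0.0]     # "bimodal" target: mass on outcomes 1 and 2 only
Q_narrow = [1.0, 0.0, 0.0]     # covers a single mode of P
Q_broad  = [1/3, 1/3, 1/3]     # spreads mass, including where P has none

print(kl(P, Q_narrow), kl(Q_narrow, P))  # inf, 1.0   : D(P||Q) punishes missing P's mass
print(kl(P, Q_broad),  kl(Q_broad, P))   # ~0.585, inf: D(Q||P) punishes mass off P's support
```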
Log-sum inequality
$\sum_i a_i \log \frac{a_i}{b_i} \ge \left(\sum_i a_i\right) \log \frac{\sum_i a_i}{\sum_i b_i}$ for non-negative $a_i$, $b_i$. A generalization of Jensen's inequality used to prove convexity of KL divergence and many other information inequalities.
Related: Kullback-Leibler divergence