Convexity Properties
Why Convexity Matters
In information theory, convexity is not a mathematical luxury — it is what separates problems we can solve from problems we cannot. When the capacity optimization turns out to be a concave maximization (equivalently, a convex optimization problem), we know that any local maximum is a global maximum, KKT conditions are sufficient, and efficient algorithms exist. This section establishes the convexity and concavity properties of entropy, divergence, and mutual information that make the entire theory computationally tractable.
Theorem: Convexity of KL Divergence
The KL divergence $D(p \Vert q)$ is convex in the pair $(p, q)$. That is, for PMFs $(p_1, q_1)$ and $(p_2, q_2)$ and any $\lambda \in [0, 1]$:

$$D\big(\lambda p_1 + (1-\lambda) p_2 \,\Vert\, \lambda q_1 + (1-\lambda) q_2\big) \le \lambda D(p_1 \Vert q_1) + (1-\lambda) D(p_2 \Vert q_2)$$
Mixing pairs of distributions brings their divergence down. This joint convexity is a direct consequence of the log-sum inequality.
Apply the log-sum inequality
Let $p_\lambda = \lambda p_1 + (1-\lambda) p_2$ and $q_\lambda = \lambda q_1 + (1-\lambda) q_2$. For each $x$:

$$p_\lambda(x) \log \frac{p_\lambda(x)}{q_\lambda(x)} \le \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}$$

by the log-sum inequality applied with $a_1 = \lambda p_1(x)$, $a_2 = (1-\lambda) p_2(x)$, $b_1 = \lambda q_1(x)$, $b_2 = (1-\lambda) q_2(x)$. Summing over $x$ gives the result.
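To make the inequality concrete, here is a quick numerical spot-check (a minimal NumPy sketch; the helper names `kl` and `random_pmf` are ours, not from the text). It samples random PMF pairs and verifies that the divergence of the mixtures never exceeds the mixture of the divergences.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(0)

def random_pmf(n):
    """Sample a random PMF on n outcomes."""
    x = rng.random(n)
    return x / x.sum()

# Spot-check joint convexity:
# D(mix of p's || mix of q's) <= lam*D(p1||q1) + (1-lam)*D(p2||q2)
for _ in range(1000):
    p1, q1, p2, q2 = (random_pmf(4) for _ in range(4))
    lam = rng.random()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    assert lhs <= rhs + 1e-12, "joint convexity violated"
print("joint convexity held on 1000 random instances")
```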
Theorem: Concavity and Convexity of Mutual Information
Let $(X, Y)$ be random variables where $Y$ is the output of a channel $P_{Y|X}$ with input $X$ distributed according to $P_X$.
- $I(X; Y)$ is a concave function of $P_X$ for fixed $P_{Y|X}$.
- $I(X; Y)$ is a convex function of $P_{Y|X}$ for fixed $P_X$.
Part 1 is the reason we can compute capacity. Since $C = \max_{P_X} I(X; Y)$ and $I(X; Y)$ is concave in $P_X$, the capacity problem is a concave maximization over the probability simplex, i.e., a convex optimization problem. This means: (i) any local maximum is global, (ii) the KKT conditions are necessary and sufficient, and (iii) algorithms like the Blahut-Arimoto algorithm converge to the global optimum.
Part 2 says that mixing channels cannot increase mutual information beyond the average: a channel built by randomly switching between two channels is never more informative than the corresponding mixture of their individual mutual informations. Together with Part 1, this gives $I(X; Y)$ a concave-convex saddle structure, which is what minimax results and game-theoretic formulations of channel coding exploit.
Concavity in $P_X$ (Part 1)
Write $I(X; Y) = H(Y) - H(Y|X)$.
The output distribution $P_Y(y) = \sum_x P_X(x) P_{Y|X}(y|x)$ is a linear function of $P_X$. Since entropy is concave (Theorem 1.1.5), $H(Y)$ is concave in $P_X$.
The conditional entropy $H(Y|X) = \sum_x P_X(x) H(Y|X = x)$ is linear in $P_X$ (for a fixed channel, each $H(Y|X = x)$ is a constant).
Therefore $I(X; Y) = H(Y) - H(Y|X)$ is concave in $P_X$: a concave function minus a linear function is concave.
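As a concrete illustration of Part 1, the sketch below (assuming a binary symmetric channel with crossover 0.1; the helper name `mutual_information` is our own) evaluates $I(X; Y)$ on a grid of input distributions and checks that its second differences are nonpositive, as concavity requires.

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input PMF p_x and channel matrix W[x, y] = P(y|x)."""
    p_y = p_x @ W                      # output distribution, linear in p_x
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    logW = np.log2(W, where=W > 0, out=np.zeros_like(W))
    h_y_given_x = -np.sum(p_x[:, None] * W * logW)
    return h_y - h_y_given_x

# Binary symmetric channel with crossover 0.1
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# I as a function of P(X=1) should be concave: second differences <= 0
pis = np.linspace(0.01, 0.99, 99)
vals = np.array([mutual_information(np.array([1 - pi, pi]), W) for pi in pis])
second_diff = vals[2:] - 2 * vals[1:-1] + vals[:-2]
print("max second difference:", second_diff.max())  # <= 0 up to rounding
```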
Convexity in $P_{Y|X}$ (Part 2)
Write $I(X; Y) = H(Y) - H(Y|X)$.
For fixed $P_X$, $H(Y|X) = \sum_x P_X(x) H(P_{Y|X}(\cdot|x))$ is concave in $P_{Y|X}$ (the entropy of each row is concave, and a positively weighted sum of concave functions is concave).
The output entropy $H(Y)$ is a concave function of $P_Y$, which is linear in $P_{Y|X}$, so $H(Y)$ is also concave in $P_{Y|X}$.
However, a difference of two concave functions is not automatically convex, so this decomposition does not settle Part 2 on its own. Instead, write

$$I(X; Y) = \sum_x P_X(x)\, D\big(P_{Y|X}(\cdot|x) \,\Vert\, P_Y\big)$$

Since $P_Y$ is linear in $P_{Y|X}$ and $D(p \Vert q)$ is jointly convex in the pair $(p, q)$, each summand is convex in $P_{Y|X}$, and a positively weighted sum of convex functions is convex. Hence $I(X; Y)$ is convex in $P_{Y|X}$ for fixed $P_X$.
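The same kind of numerical spot-check works for Part 2. The sketch below (helper names are ours) fixes an input distribution, samples random pairs of channel matrices, and verifies that the mixed channel's mutual information never exceeds the corresponding mixture of mutual informations.

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input PMF p_x and channel matrix W[x, y] = P(y|x)."""
    p_y = p_x @ W
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    logW = np.log2(W, where=W > 0, out=np.zeros_like(W))
    return h_y + np.sum(p_x[:, None] * W * logW)

rng = np.random.default_rng(1)

def random_channel(nx, ny):
    """Random stochastic matrix: each row is a PMF over outputs."""
    W = rng.random((nx, ny))
    return W / W.sum(axis=1, keepdims=True)

p_x = np.array([0.3, 0.7])             # fixed input distribution
for _ in range(1000):
    W1, W2 = random_channel(2, 3), random_channel(2, 3)
    lam = rng.random()
    lhs = mutual_information(p_x, lam * W1 + (1 - lam) * W2)
    rhs = lam * mutual_information(p_x, W1) + (1 - lam) * mutual_information(p_x, W2)
    assert lhs <= rhs + 1e-12, "convexity in the channel violated"
print("convexity in P(Y|X) held on 1000 random instances")
```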
Computing Capacity: The Blahut-Arimoto Algorithm
The concavity of $I(X; Y)$ in $P_X$ guarantees that every local maximum of the capacity optimization is global. The Blahut-Arimoto algorithm (1972) exploits this structure: it alternates between optimizing the input distribution and the "tilted" output distribution, converging monotonically to the capacity. Each iteration increases a lower bound on mutual information.
For a DMC with $|\mathcal{X}|$ inputs and $|\mathcal{Y}|$ outputs, each iteration costs $O(|\mathcal{X}||\mathcal{Y}|)$, and convergence is typically fast (10-50 iterations for 6 digits of accuracy). This makes capacity computation routine for any discrete channel, subject to a few caveats (a minimal implementation sketch follows the list below):
- Requires finite input and output alphabets
- Convergence rate depends on the channel structure
- For continuous channels, discretization or parametric optimization is needed
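Here is a minimal sketch of the alternating updates described above (plain NumPy; the function name `blahut_arimoto` and the stopping rule via the standard sandwich bounds are our choices, not prescribed by the text). For the binary symmetric channel with crossover 0.1 it should return approximately $1 - h_2(0.1) \approx 0.531$ bits.

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    """Capacity (bits/use) of a DMC with channel matrix W[x, y] = P(y|x).

    Alternates the two closed-form updates and returns the capacity
    estimate together with a capacity-achieving input distribution.
    """
    nx, _ = W.shape
    p = np.full(nx, 1.0 / nx)          # start from the uniform input
    logW = np.log2(W, where=W > 0, out=np.zeros_like(W))
    for _ in range(max_iter):
        p_y = p @ W                    # current output distribution
        log_py = np.log2(p_y, where=p_y > 0, out=np.zeros_like(p_y))
        # D[x] = D( W(.|x) || p_y ), the divergence of each row from the output
        D = np.sum(W * (logW - log_py), axis=1)
        # Standard sandwich bounds: lower <= C <= upper at every iteration
        lower = np.log2(np.sum(p * 2.0 ** D))
        upper = D.max()
        if upper - lower < tol:
            break
        p = p * 2.0 ** D               # multiplicative update ...
        p /= p.sum()                   # ... then renormalize to the simplex
    return lower, p

# Binary symmetric channel, crossover 0.1
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
C, p_star = blahut_arimoto(W)
print(f"capacity ~ {C:.6f} bits, optimal input = {p_star}")
```

The gap between the upper and lower bounds gives a certified stopping criterion: the true capacity always lies between them, so the returned value is accurate to within `tol`.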
Convexity Properties of Information Measures
| Quantity | Concave in | Convex in | Key consequence |
|---|---|---|---|
| $H(p)$ | $p$ | — | Maximum entropy distributions exist and are unique |
| $D(p \Vert q)$ | — | $(p, q)$ jointly | Projection onto convex sets of distributions |
| $I(X; Y)$ | $P_X$ (fixed channel) | $P_{Y \mid X}$ (fixed input) | Capacity is a convex optimization problem |
Key Takeaway
Convexity is the reason information theory is computable. The concavity of mutual information in the input distribution ensures that channel capacity is a convex optimization problem in which every local optimum is global. Without this property, computing capacity would be intractable for all but the simplest channels.
Quick Check
Why does the concavity of $I(X; Y)$ in $P_X$ matter for computing channel capacity?
It guarantees any local maximum is a global maximum
It makes $I(X; Y)$ easier to compute
It proves that capacity exists
It implies $I(X; Y)$ is always positive
For concave functions, every local maximum is also a global maximum. This means gradient-based or alternating optimization algorithms (like Blahut-Arimoto) are guaranteed to find the capacity-achieving input distribution.