Convexity Properties

Why Convexity Matters

In information theory, convexity is not a mathematical luxury: it is what separates problems we can solve from problems we cannot. When the capacity optimization $\max_{P_X} I(X;Y)$ turns out to be a concave maximization (equivalently, a convex optimization problem), we know that any local maximum is a global maximum, KKT conditions are sufficient, and efficient algorithms exist. This section establishes the convexity and concavity properties of entropy, divergence, and mutual information that make the entire theory computationally tractable.

Theorem: Convexity of KL Divergence

$D(P \| Q)$ is jointly convex in the pair $(P, Q)$. That is, for PMFs $P_1, P_2, Q_1, Q_2$ and $\lambda \in [0,1]$:

$$D(\lambda P_1 + (1-\lambda)P_2 \,\|\, \lambda Q_1 + (1-\lambda)Q_2) \leq \lambda D(P_1 \| Q_1) + (1-\lambda) D(P_2 \| Q_2).$$

Mixing pairs of distributions brings their divergence down. This joint convexity is a direct consequence of the log-sum inequality.
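
The inequality is easy to verify numerically. Below is a minimal sketch (assuming numpy; the `kl` helper and the random test distributions are ours) that draws random PMF pairs and checks that the divergence of the mixed pair never exceeds the mixture of the divergences:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits, using the convention 0 log(0/q) = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

rng = np.random.default_rng(0)
lam = 0.3
# four random PMFs on a 4-letter alphabet (Dirichlet samples are strictly positive)
p1, p2, q1, q2 = (rng.dirichlet(np.ones(4)) for _ in range(4))
lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
assert lhs <= rhs + 1e-12   # joint convexity: mixing never increases divergence
print(f"D(mixed pair) = {lhs:.4f} <= {rhs:.4f} = mixture of divergences")
```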

Theorem: Concavity and Convexity of Mutual Information

Let $(X, Y)$ be random variables where $Y$ is the output of a channel $P_{Y|X}$ with input $X$ distributed according to $P_X$.

  1. $I(X;Y)$ is a concave function of $P_X$ for fixed $P_{Y|X}$.
  2. $I(X;Y)$ is a convex function of $P_{Y|X}$ for fixed $P_X$.

Part 1 is the reason we can compute capacity. Since $C = \max_{P_X} I(X;Y)$ and $I$ is concave in $P_X$, the capacity problem is a concave maximization over the probability simplex, i.e., a convex optimization problem. This means: (i) any local maximum is global, (ii) the KKT conditions are necessary and sufficient, and (iii) algorithms like the Blahut-Arimoto algorithm converge to the global optimum.
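
To make the concavity concrete, here is a small numerical check (a sketch assuming numpy; the `mutual_information` helper and the BSC example are ours) that a mixture of two input distributions on a fixed channel carries at least as much information as the corresponding mixture of mutual informations:

```python
import numpy as np

def mutual_information(p, W):
    """I(X;Y) in bits for input PMF p and channel matrix W[x, y] = P(y|x)."""
    q = p @ W                                # output distribution q(y)
    ratio = np.where(W > 0, W / q, 1.0)      # log(1) = 0 kills the W = 0 terms
    return float(np.sum(p[:, None] * W * np.log2(ratio)))

W = np.array([[0.9, 0.1],                    # binary symmetric channel,
              [0.1, 0.9]])                   # crossover probability 0.1
p1, p2, lam = np.array([0.8, 0.2]), np.array([0.3, 0.7]), 0.5
mix = mutual_information(lam * p1 + (1 - lam) * p2, W)
avg = lam * mutual_information(p1, W) + (1 - lam) * mutual_information(p2, W)
assert mix >= avg - 1e-12   # concavity: mixing inputs never loses information
print(f"I(mixed input) = {mix:.4f} >= {avg:.4f}")
```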

Part 2 says that mixing channels cannot help: the mutual information of a mixture of channels is at most the corresponding mixture of the individual mutual informations. Together with Part 1, this concave-convex structure produces a saddle point, which is what drives minimax results and game-theoretic formulations of channel coding.
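
The extreme case makes the convexity vivid: the identity channel and the bit-flipper each carry one full bit under a uniform input, yet their even mixture is pure noise. A short check (same assumptions as the previous sketch, with the helper repeated so the snippet runs on its own):

```python
import numpy as np

def mutual_information(p, W):
    """I(X;Y) in bits for input PMF p and channel matrix W[x, y] = P(y|x)."""
    q = p @ W
    ratio = np.where(W > 0, W / q, 1.0)
    return float(np.sum(p[:, None] * W * np.log2(ratio)))

W_id, W_flip = np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])
p, lam = np.array([0.5, 0.5]), 0.5
mix = mutual_information(p, lam * W_id + (1 - lam) * W_flip)   # = 0 bits
avg = lam * mutual_information(p, W_id) \
      + (1 - lam) * mutual_information(p, W_flip)              # = 1 bit
assert mix <= avg + 1e-12   # convexity: mixing channels can only hurt
print(f"I(mixed channel) = {mix:.4f} <= {avg:.4f}")
```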

🔧Engineering Note

Computing Capacity: The Blahut-Arimoto Algorithm

The concavity of $I(X;Y)$ in $P_X$ guarantees that every local maximum of the capacity optimization is global. The Blahut-Arimoto algorithm (1972) exploits this structure: it alternates between updating the input distribution and an auxiliary posterior distribution, increasing a lower bound on the mutual information at each step and converging monotonically to the capacity.

For a DMC with $|\mathcal{X}| = M$ inputs and $|\mathcal{Y}| = N$ outputs, each iteration costs $O(MN)$, and convergence is typically fast (10-50 iterations for 6 digits of accuracy). This makes capacity computation routine for any discrete channel; a sketch of the iteration follows the constraints list below.

Practical Constraints
  • Requires finite input and output alphabets
  • Convergence rate depends on the channel structure
  • For continuous channels, discretization or parametric optimization is needed
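
Here is a compact sketch of the algorithm described above (assuming numpy; the function name, tolerances, and stopping rule are ours). It implements the classical multiplicative update and stops when the standard lower and upper capacity bounds meet:

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    """Capacity (bits) of a DMC with transition matrix W[x, y] = P(y|x).

    Returns (capacity, optimal input distribution).
    """
    M, _ = W.shape
    p = np.full(M, 1.0 / M)                      # start from the uniform input
    for _ in range(max_iter):
        q = p @ W                                # current output distribution q(y)
        with np.errstate(divide="ignore", invalid="ignore"):
            logratio = np.where(W > 0, np.log(W / q), 0.0)   # 0 log 0 := 0
        d = np.exp((W * logratio).sum(axis=1))   # exp of D(W(.|x) || q) per input
        lower = np.log(p @ d)                    # lower bound on C (nats)
        upper = np.log(d.max())                  # upper bound on C (nats)
        p = p * d / (p @ d)                      # multiplicative update of p(x)
        if upper - lower < tol:
            break
    return lower / np.log(2), p                  # convert nats to bits

# BSC with crossover 0.1: capacity should be 1 - H2(0.1) ~ 0.531004 bits
C, p_opt = blahut_arimoto(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(f"capacity = {C:.6f} bits, optimal input = {p_opt}")
```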

Convexity Properties of Information Measures

| Quantity | Concave in | Convex in | Key consequence |
|---|---|---|---|
| $H(X)$ | $P_X$ | – | Maximum entropy distributions exist and are unique |
| $D(P \Vert Q)$ | – | $(P, Q)$ jointly | Projection onto convex sets of distributions |
| $I(X;Y)$ | $P_X$ (fixed channel) | $P_{Y\mid X}$ (fixed input) | Capacity is a convex optimization problem |

Key Takeaway

Convexity is the reason information theory is computable. The concavity of mutual information in the input distribution ensures that channel capacity $C = \max_{P_X} I(X;Y)$ is a convex optimization problem in which every local optimum is global. Without this property, computing capacity would be intractable for all but the simplest channels.

Quick Check

Why does the concavity of $I(X;Y)$ in $P_X$ matter for computing channel capacity?

It guarantees any local maximum is a global maximum

It makes $I(X;Y)$ easier to compute

It proves that capacity exists

It implies $I(X;Y)$ is always positive