Convexity Properties
Why Convexity Matters
In information theory, convexity is not a mathematical luxury — it is what separates problems we can solve from problems we cannot. When the capacity optimization turns out to be a concave maximization (equivalently, a convex optimization problem), we know that any local maximum is a global maximum, KKT conditions are sufficient, and efficient algorithms exist. This section establishes the convexity and concavity properties of entropy, divergence, and mutual information that make the entire theory computationally tractable.
Theorem: Convexity of KL Divergence
The KL divergence $D(p \Vert q)$ is convex in the pair $(p, q)$. That is, for PMFs $(p_1, q_1)$ and $(p_2, q_2)$ and any $\lambda \in [0, 1]$:

$$D\big(\lambda p_1 + (1-\lambda) p_2 \,\Vert\, \lambda q_1 + (1-\lambda) q_2\big) \le \lambda D(p_1 \Vert q_1) + (1-\lambda) D(p_2 \Vert q_2)$$
Mixing pairs of distributions brings their divergence down. This joint convexity is a direct consequence of the log-sum inequality.
Apply the log-sum inequality
Let $p_\lambda = \lambda p_1 + (1-\lambda) p_2$ and $q_\lambda = \lambda q_1 + (1-\lambda) q_2$. For each $x$:

$$p_\lambda(x) \log \frac{p_\lambda(x)}{q_\lambda(x)} \le \lambda p_1(x) \log \frac{p_1(x)}{q_1(x)} + (1-\lambda) p_2(x) \log \frac{p_2(x)}{q_2(x)}$$

by the log-sum inequality applied with $a_1 = \lambda p_1(x)$, $a_2 = (1-\lambda) p_2(x)$, $b_1 = \lambda q_1(x)$, $b_2 = (1-\lambda) q_2(x)$. Summing over $x$ gives the result.
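To make the inequality concrete, here is a quick numerical spot-check (a minimal NumPy sketch; the helper names `kl` and `random_pmf` are ours, not from the text). It samples random PMF pairs and verifies that the divergence of the mixtures never exceeds the mixture of the divergences.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

rng = np.random.default_rng(0)

def random_pmf(n):
    """Sample a random PMF on n outcomes."""
    x = rng.random(n)
    return x / x.sum()

# Spot-check joint convexity:
# D(mix of p's || mix of q's) <= lam*D(p1||q1) + (1-lam)*D(p2||q2)
for _ in range(1000):
    p1, q1, p2, q2 = (random_pmf(4) for _ in range(4))
    lam = rng.random()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    assert lhs <= rhs + 1e-12, "joint convexity violated"
print("joint convexity held on 1000 random instances")
```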
Theorem: Concavity and Convexity of Mutual Information
Let $(X, Y)$ be random variables where $Y$ is the output of a channel $P_{Y|X}$ with input $X$ distributed according to $P_X$.
- $I(X; Y)$ is a concave function of $P_X$ for fixed $P_{Y|X}$.
- $I(X; Y)$ is a convex function of $P_{Y|X}$ for fixed $P_X$.
Part 1 is the reason we can compute capacity. Since $C = \max_{P_X} I(X; Y)$ and $I(X; Y)$ is concave in $P_X$, the capacity problem is a concave maximization over the probability simplex, i.e., a convex optimization problem. This means: (i) any local maximum is global, (ii) the KKT conditions are necessary and sufficient, and (iii) algorithms like the Blahut-Arimoto algorithm converge to the global optimum.
Part 2 says that mixing channels cannot increase mutual information beyond the average: a channel built by randomly switching between two channels is never more informative than the corresponding mixture of their individual mutual informations. Together with Part 1, this gives $I(X; Y)$ a concave-convex saddle structure, which is what minimax results and game-theoretic formulations of channel coding exploit.
Concavity in $P_X$ (Part 1)
Write $I(X; Y) = H(Y) - H(Y|X)$.
The output distribution $P_Y(y) = \sum_x P_X(x) P_{Y|X}(y|x)$ is a linear function of $P_X$. Since entropy is concave (Theorem 1.1.5), $H(Y)$ is concave in $P_X$.
The conditional entropy $H(Y|X) = \sum_x P_X(x) H(Y|X = x)$ is linear in $P_X$ (for a fixed channel, each $H(Y|X = x)$ is a constant).
Therefore $I(X; Y) = H(Y) - H(Y|X)$ is concave in $P_X$: a concave function minus a linear function is concave.
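As a concrete illustration of Part 1, the sketch below (assuming a binary symmetric channel with crossover 0.1; the helper name `mutual_information` is our own) evaluates $I(X; Y)$ on a grid of input distributions and checks that its second differences are nonpositive, as concavity requires.

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input PMF p_x and channel matrix W[x, y] = P(y|x)."""
    p_y = p_x @ W                      # output distribution, linear in p_x
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    logW = np.log2(W, where=W > 0, out=np.zeros_like(W))
    h_y_given_x = -np.sum(p_x[:, None] * W * logW)
    return h_y - h_y_given_x

# Binary symmetric channel with crossover 0.1
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# I as a function of P(X=1) should be concave: second differences <= 0
pis = np.linspace(0.01, 0.99, 99)
vals = np.array([mutual_information(np.array([1 - pi, pi]), W) for pi in pis])
second_diff = vals[2:] - 2 * vals[1:-1] + vals[:-2]
print("max second difference:", second_diff.max())  # <= 0 up to rounding
```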
Convexity in $P_{Y|X}$ (Part 2)
Write $I(X; Y) = H(Y) - H(Y|X)$.
For fixed $P_X$, $H(Y|X) = \sum_x P_X(x) H(P_{Y|X}(\cdot|x))$ is concave in $P_{Y|X}$ (the entropy of each row is concave, and a positively weighted sum of concave functions is concave).
The output entropy $H(Y)$ is a concave function of $P_Y$, which is linear in $P_{Y|X}$, so $H(Y)$ is also concave in $P_{Y|X}$.
However, a difference of two concave functions is not automatically convex, so this decomposition does not settle Part 2 on its own. Instead, write

$$I(X; Y) = \sum_x P_X(x)\, D\big(P_{Y|X}(\cdot|x) \,\Vert\, P_Y\big)$$

Since $P_Y$ is linear in $P_{Y|X}$ and $D(p \Vert q)$ is jointly convex in the pair $(p, q)$, each summand is convex in $P_{Y|X}$, and a positively weighted sum of convex functions is convex. Hence $I(X; Y)$ is convex in $P_{Y|X}$ for fixed $P_X$.
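The same kind of numerical spot-check works for Part 2. The sketch below (helper names are ours) fixes an input distribution, samples random pairs of channel matrices, and verifies that the mixed channel's mutual information never exceeds the corresponding mixture of mutual informations.

```python
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input PMF p_x and channel matrix W[x, y] = P(y|x)."""
    p_y = p_x @ W
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    logW = np.log2(W, where=W > 0, out=np.zeros_like(W))
    return h_y + np.sum(p_x[:, None] * W * logW)

rng = np.random.default_rng(1)

def random_channel(nx, ny):
    """Random stochastic matrix: each row is a PMF over outputs."""
    W = rng.random((nx, ny))
    return W / W.sum(axis=1, keepdims=True)

p_x = np.array([0.3, 0.7])             # fixed input distribution
for _ in range(1000):
    W1, W2 = random_channel(2, 3), random_channel(2, 3)
    lam = rng.random()
    lhs = mutual_information(p_x, lam * W1 + (1 - lam) * W2)
    rhs = lam * mutual_information(p_x, W1) + (1 - lam) * mutual_information(p_x, W2)
    assert lhs <= rhs + 1e-12, "convexity in the channel violated"
print("convexity in P(Y|X) held on 1000 random instances")
```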
Computing Capacity: The Blahut-Arimoto Algorithm
The concavity of $I(X; Y)$ in $P_X$ guarantees that every local maximum of the capacity optimization is global. The Blahut-Arimoto algorithm (1972) exploits this structure: it alternates between optimizing the input distribution and the "tilted" output distribution, converging monotonically to the capacity. Each iteration increases a lower bound on mutual information.
For a DMC with $|\mathcal{X}|$ inputs and $|\mathcal{Y}|$ outputs, each iteration costs $O(|\mathcal{X}||\mathcal{Y}|)$, and convergence is typically fast (10-50 iterations for 6 digits of accuracy). This makes capacity computation routine for any discrete channel, subject to a few caveats (a minimal implementation sketch follows the list below):
- Requires finite input and output alphabets
- Convergence rate depends on the channel structure
- For continuous channels, discretization or parametric optimization is needed
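Here is a minimal sketch of the alternating updates described above (plain NumPy; the function name `blahut_arimoto` and the stopping rule via the standard sandwich bounds are our choices, not prescribed by the text). For the binary symmetric channel with crossover 0.1 it should return approximately $1 - h_2(0.1) \approx 0.531$ bits.

```python
import numpy as np

def blahut_arimoto(W, tol=1e-9, max_iter=10_000):
    """Capacity (bits/use) of a DMC with channel matrix W[x, y] = P(y|x).

    Alternates the two closed-form updates and returns the capacity
    estimate together with a capacity-achieving input distribution.
    """
    nx, _ = W.shape
    p = np.full(nx, 1.0 / nx)          # start from the uniform input
    logW = np.log2(W, where=W > 0, out=np.zeros_like(W))
    for _ in range(max_iter):
        p_y = p @ W                    # current output distribution
        log_py = np.log2(p_y, where=p_y > 0, out=np.zeros_like(p_y))
        # D[x] = D( W(.|x) || p_y ), the divergence of each row from the output
        D = np.sum(W * (logW - log_py), axis=1)
        # Standard sandwich bounds: lower <= C <= upper at every iteration
        lower = np.log2(np.sum(p * 2.0 ** D))
        upper = D.max()
        if upper - lower < tol:
            break
        p = p * 2.0 ** D               # multiplicative update ...
        p /= p.sum()                   # ... then renormalize to the simplex
    return lower, p

# Binary symmetric channel, crossover 0.1
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
C, p_star = blahut_arimoto(W)
print(f"capacity ~ {C:.6f} bits, optimal input = {p_star}")
```

The gap between the upper and lower bounds gives a certified stopping criterion: the true capacity always lies between them, so the returned value is accurate to within `tol`.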
Convexity Properties of Information Measures
| Quantity | Concave in | Convex in | Key consequence |
|---|---|---|---|
| $H(p)$ | $p$ | — | Maximum entropy distributions exist and are unique |
| $D(p \Vert q)$ | — | $(p, q)$ jointly | Projection onto convex sets of distributions |
| $I(X; Y)$ | $P_X$ (fixed channel) | $P_{Y \mid X}$ (fixed input) | Capacity is a convex optimization problem |
Key Takeaway
Convexity is the reason information theory is computable. The concavity of mutual information in the input distribution ensures that channel capacity is a convex optimization problem in which every local optimum is global. Without this property, computing capacity would be intractable for all but the simplest channels.
Quick Check
Why does the concavity of $I(X; Y)$ in $P_X$ matter for computing channel capacity?
It guarantees any local maximum is a global maximum
It makes $I(X; Y)$ easier to compute
It proves that capacity exists
It implies $I(X; Y)$ is always positive
For concave functions, every local maximum is also a global maximum. This means gradient-based or alternating optimization algorithms (like Blahut-Arimoto) are guaranteed to find the capacity-achieving input distribution.