Maximum Entropy Distributions
From Entropy to Optimization
Which distribution over a given alphabet has the most entropy? If we constrain the mean or the variance, what distribution maximizes uncertainty? These questions are not merely academic. Maximum entropy distributions arise as:
- The worst-case noise in channel capacity problems (the noise that makes communication hardest).
- The least informative prior in Bayesian inference (the prior that assumes the least beyond the given constraints).
- The solution to the capacity problem for many channels (e.g., the uniform distribution achieves capacity for symmetric channels).
The technique — Lagrange multipliers on entropy — is the same machinery that produces waterfilling for Gaussian channels. Master it here in the discrete case, and the continuous generalization (Chapter 2) will follow naturally.
Theorem: Uniform Distribution Maximizes Entropy
Among all distributions $p$ on a finite alphabet $\mathcal{X}$ with $|\mathcal{X}| = m$, the entropy is uniquely maximized by the uniform distribution $u(x) = 1/m$:
$$H(p) \le \log m,$$
with equality iff $p = u$ (i.e., $p(x) = 1/m$ for all $x \in \mathcal{X}$).
The uniform distribution is the "most uncertain" distribution — it assumes nothing beyond the alphabet. Any deviation from uniformity concentrates probability on some outcomes, reducing entropy.
Via KL divergence
Let $u(x) = 1/m$ be the uniform distribution on $\mathcal{X}$. Then:
$$D(p \,\|\, u) = \sum_{x} p(x) \log \frac{p(x)}{1/m} = \log m - H(p).$$
Since $D(p \,\|\, u) \ge 0$: $H(p) \le \log m$. Equality iff $p = u$.
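This identity is easy to check numerically. A minimal sketch (assuming NumPy, with entropy measured in nats):

```python
# Check the identity D(p||u) = log m - H(p) for a random distribution p
# on an alphabet of size m; natural logs, so entropy is in nats.
import numpy as np

rng = np.random.default_rng(0)
m = 8
p = rng.random(m)
p /= p.sum()                       # random distribution on m symbols

H = -np.sum(p * np.log(p))         # entropy H(p) in nats
D = np.sum(p * np.log(p * m))      # KL divergence from uniform u = 1/m

assert np.isclose(D, np.log(m) - H)    # the identity from the proof
assert H <= np.log(m) + 1e-12          # hence H(p) <= log m
print(f"H(p) = {H:.4f} nats, log m = {np.log(m):.4f}, D(p||u) = {D:.4f}")
```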
Theorem: Maximum Entropy with a Mean Constraint
Among all distributions $p$ on the nonnegative integers $\{0, 1, 2, \ldots\}$ with mean $\mu$, the entropy is maximized by the geometric distribution:
$$p^*(k) = \frac{1}{\mu + 1} \left( \frac{\mu}{\mu + 1} \right)^{k}, \qquad k = 0, 1, 2, \ldots$$
The maximum entropy is:
$$H(p^*) = (\mu + 1) \log(\mu + 1) - \mu \log \mu.$$
With only a mean constraint, the geometric distribution is the "least informative" distribution. This is the discrete analogue of the exponential distribution maximizing continuous entropy under a mean constraint.
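As a quick numerical sanity check of the closed form, a sketch (assuming NumPy; entropy in nats, with the infinite support truncated at a large cutoff):

```python
# Verify that the geometric distribution with mean mu attains
# H = (mu+1)*log(mu+1) - mu*log(mu) in nats. The tail beyond K is negligible.
import numpy as np

mu = 3.0
K = 2000                                     # truncation point
k = np.arange(K)
p = (1 / (mu + 1)) * (mu / (mu + 1)) ** k    # geometric p*(k)

mean = np.sum(k * p)                         # should be ~ mu
H = -np.sum(p * np.log(p))                   # entropy in nats
H_formula = (mu + 1) * np.log(mu + 1) - mu * np.log(mu)

print(f"mean = {mean:.6f}  H = {H:.6f}  formula = {H_formula:.6f}")
```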
Lagrangian formulation
We maximize $H(p) = -\sum_{k \ge 0} p(k) \log p(k)$ subject to $\sum_{k \ge 0} p(k) = 1$ and $\sum_{k \ge 0} k\, p(k) = \mu$. The Lagrangian is:
$$\mathcal{L}(p, \lambda_0, \lambda_1) = -\sum_{k \ge 0} p(k) \log p(k) + \lambda_0 \Big( \sum_{k \ge 0} p(k) - 1 \Big) + \lambda_1 \Big( \sum_{k \ge 0} k\, p(k) - \mu \Big).$$
Solve the KKT conditions
Setting $\partial \mathcal{L} / \partial p(k) = 0$:
$$-\log p(k) - 1 + \lambda_0 + \lambda_1 k = 0 \quad \Longrightarrow \quad p(k) = c\, a^k,$$
where $c = e^{\lambda_0 - 1}$ and $a = e^{\lambda_1}$ are constants determined by the normalization and mean constraints. This is a geometric distribution with parameter $a = \mu/(\mu + 1)$ (and hence $c = 1/(\mu + 1)$).
Verify optimality
Since entropy is concave and the constraints are linear, the KKT conditions are sufficient for a global maximum. The geometric distribution is the unique maximizer.
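One way to see this concretely is to perturb the geometric maximizer in a direction that preserves both constraints and watch the entropy drop. A minimal sketch (assuming NumPy; the perturbation direction $(1, -2, 1)$ on $k = 0, 1, 2$ preserves both normalization and mean):

```python
# Perturb the geometric distribution while keeping sum(q) = 1 and
# sum(k*q) = mu fixed; entropy decreases away from the maximizer.
import numpy as np

mu, K = 3.0, 2000
k = np.arange(K)
p = (1 / (mu + 1)) * (mu / (mu + 1)) ** k    # the maximizer p*

def entropy(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

for eps in [0.0, 0.01, 0.02, 0.05]:
    q = p.copy()
    q[:3] += eps * np.array([1.0, -2.0, 1.0])  # sums to 0; sum(k*v) = 0
    print(f"eps = {eps:.2f}: mean = {np.sum(k * q):.4f}, H = {entropy(q):.6f}")
```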
Example: Maximum Entropy for Binary Variables
Find the distribution on $\{0, 1\}$ with mean $\mu$ that maximizes entropy.
Direct solution
For a binary variable with $P(X = 1) = p$: $H(X) = -p \log p - (1 - p) \log(1 - p) = H_b(p)$.
Since $H_b$ is concave and there is only one free parameter (the mean constraint fixes $p = \mu$), the maximum over all binary distributions with mean $\mu$ is simply $H_b(\mu)$.
Without the mean constraint, the maximum is $1$ bit, achieved by the uniform (Bernoulli(1/2)) distribution.
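A short sketch of the binary entropy function (assuming NumPy; base-2 logs, so entropy is in bits and the unconstrained maximum of 1 bit appears at $\mu = 1/2$):

```python
# Binary entropy H_b(mu) in bits; the constrained maximum is H_b(mu) itself.
import numpy as np

def binary_entropy(mu):
    """H_b(mu) = -mu*log2(mu) - (1-mu)*log2(1-mu), in bits."""
    if mu in (0.0, 1.0):
        return 0.0
    return -mu * np.log2(mu) - (1 - mu) * np.log2(1 - mu)

for mu in [0.1, 0.25, 0.5, 0.75]:
    print(f"mu = {mu:.2f}: H_b = {binary_entropy(mu):.4f} bits")
```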
Example: Maximum Entropy Under Mean and Variance Constraints
Among all distributions on $\mathbb{Z}$ (the integers) with mean $\mu$ and variance $\sigma^2$, which distribution maximizes entropy? What does this preview for the continuous case?
Lagrangian with two constraints
Adding a variance constraint leads to a distribution of the form:
$$p(k) \propto e^{-\lambda_1 k - \lambda_2 k^2}, \qquad k \in \mathbb{Z},$$
for some $\lambda_1, \lambda_2$ determined by the mean and variance constraints. This is a discrete Gaussian distribution.
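The multipliers $\lambda_1, \lambda_2$ rarely have a closed form; they can be found numerically. A sketch (assuming NumPy and SciPy, with $\mathbb{Z}$ truncated to a finite window; the window size and starting guess are arbitrary choices for illustration):

```python
# Solve for (lam1, lam2) so that p(k) ~ exp(-lam1*k - lam2*k**2) matches
# a target mean and variance, using a two-equation root finder.
import numpy as np
from scipy.optimize import root

mu, sigma2 = 1.0, 4.0
k = np.arange(-50, 51)                    # truncated window standing in for Z

def dist(lams):
    lam1, lam2 = lams
    logw = -lam1 * k - lam2 * k**2
    logw -= logw.max()                    # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

def constraints(lams):
    p = dist(lams)
    m = np.sum(k * p)
    return [m - mu, np.sum((k - m) ** 2 * p) - sigma2]

sol = root(constraints, x0=[0.0, 0.1])    # illustrative starting guess
p = dist(sol.x)
m = np.sum(k * p)
pe = p[p > 0]
print("lam1, lam2 =", sol.x)
print(f"mean = {m:.4f}, var = {np.sum((k - m)**2 * p):.4f}")
print(f"H = {-np.sum(pe * np.log(pe)):.4f} nats "
      f"(continuous Gaussian value: {0.5 * np.log(2 * np.pi * np.e * sigma2):.4f})")
```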
Preview of the continuous case
In the continuous case (Chapter 2), the analogous result is that the Gaussian uniquely maximizes differential entropy among all distributions with mean $\mu$ and variance $\sigma^2$, achieving $h = \frac{1}{2} \log(2\pi e \sigma^2)$. This is perhaps the single most important fact in Gaussian channel analysis: it is the reason that Gaussian noise is the "worst-case" additive noise.
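A numerical illustration of the continuous claim (a sketch assuming NumPy; differential entropies in nats, approximated on a fine grid): a Laplace density with the same variance has strictly smaller differential entropy than the Gaussian.

```python
# Compare differential entropies (nats) of a Gaussian and a same-variance
# Laplace density via grid approximation of -integral f log f.
import numpy as np

sigma2 = 4.0
x = np.linspace(-40, 40, 400001)
dx = x[1] - x[0]

gauss = np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
b = np.sqrt(sigma2 / 2)                  # Laplace scale giving variance sigma2
laplace = np.exp(-np.abs(x) / b) / (2 * b)

def h(f):
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dx

print(f"h(Gaussian) = {h(gauss):.4f}  (formula {0.5*np.log(2*np.pi*np.e*sigma2):.4f})")
print(f"h(Laplace)  = {h(laplace):.4f}  (smaller, as the theorem predicts)")
```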
Historical Note: Jaynes and the Maximum Entropy Principle
In 1957, E. T. Jaynes proposed the maximum entropy principle as a general method for statistical inference: given partial information about a distribution (in the form of constraints on expectations), the "least biased" distribution consistent with the constraints is the one that maximizes entropy. Jaynes argued this is the natural generalization of Laplace's principle of indifference.
The maximum entropy principle has found applications far beyond information theory: statistical mechanics (where it reproduces the Boltzmann and Gibbs distributions), natural language processing (MaxEnt classifiers), image reconstruction, and spectral analysis. The connection to exponential families in statistics is not a coincidence — every maximum entropy distribution under moment constraints belongs to an exponential family.
[Interactive visualization: maximum entropy distributions under different constraints. The unconstrained maximum entropy distribution is uniform; adding a mean constraint yields a geometric distribution; adding a variance constraint yields a discrete Gaussian. Adjustable parameters: which constraints to impose, size of the support, mean constraint.]
Maximum Entropy Input Distributions for MIMO Channels
The maximum entropy principle plays a key role in determining capacity-achieving input distributions for channels with state information. Caire and Shamai showed that for certain channel models with state known at the transmitter, the optimal input distribution can be characterized via a maximum entropy argument combined with the waterfilling structure. This work bridges the classical maximum entropy framework of this chapter with the modern multiuser and state-dependent channel theory of Chapters 9-12.
Common Mistake: Maximum Entropy Is Not Always the Right Prior
Mistake:
Blindly applying the maximum entropy principle as a universal prior, especially when the chosen constraints do not capture the relevant structure of the problem.
Correction:
The maximum entropy distribution is the least informative distribution consistent with the given constraints. The quality of the result depends critically on the choice of constraints. If the constraints are poorly chosen (e.g., constraining the mean when the relevant structure is in the higher moments), the maximum entropy distribution may be misleading. In information theory, the constraints are usually natural (power constraints, bandwidth constraints), but in other applications, domain knowledge should guide the choice.
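A small illustration of this pitfall (a sketch assuming NumPy and SciPy; the bimodal "true" distribution is invented for the example): fitting only the mean of a bimodal distribution produces a smooth geometric-shaped maximizer that misses both modes entirely.

```python
# A mean-only max-entropy fit to a bimodal truth spreads mass smoothly
# and assigns little probability to either mode.
import numpy as np
from scipy.optimize import brentq

k = np.arange(21)                        # support {0, ..., 20}
p_true = np.zeros(21)
p_true[2], p_true[14] = 0.5, 0.5         # bimodal truth, mean = 8
mu = np.sum(k * p_true)

def mean_gap(lam):
    w = np.exp(-lam * k)
    return np.sum(k * w) / np.sum(w) - mu

lam = brentq(mean_gap, -5, 5)            # solve the single mean constraint
w = np.exp(-lam * k)
p_maxent = w / w.sum()

print(f"maxent mass near the modes: p(2) = {p_maxent[2]:.3f}, "
      f"p(14) = {p_maxent[14]:.3f} (true: 0.5 each)")
```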