Maximum Entropy Distributions
From Entropy to Optimization
Which distribution over a given alphabet has the most entropy? If we constrain the mean or the variance, what distribution maximizes uncertainty? These questions are not merely academic. Maximum entropy distributions arise as:
- The worst-case noise in channel capacity problems (the noise that makes communication hardest).
- The least informative prior in Bayesian inference (the prior that assumes the least beyond the given constraints).
- The solution to the capacity problem for many channels (e.g., the uniform distribution achieves capacity for symmetric channels).
The technique — Lagrange multipliers on entropy — is the same machinery that produces waterfilling for Gaussian channels. Master it here in the discrete case, and the continuous generalization (Chapter 2) will follow naturally.
Theorem: Uniform Distribution Maximizes Entropy
Among all distributions $p$ on a finite alphabet $\mathcal{X}$ with $|\mathcal{X}| = m$, the entropy is uniquely maximized by the uniform distribution $u(x) = 1/m$:
$$H(p) \le \log m,$$
with equality iff $p = u$ (i.e., $p(x) = 1/m$ for all $x \in \mathcal{X}$).
The uniform distribution is the "most uncertain" distribution — it assumes nothing beyond the alphabet. Any deviation from uniformity concentrates probability on some outcomes, reducing entropy.
Via KL divergence
Let $u(x) = 1/m$ be the uniform distribution on $\mathcal{X}$. Then:
$$D(p \,\|\, u) = \sum_{x} p(x) \log \frac{p(x)}{1/m} = \log m - H(p).$$
Since $D(p \,\|\, u) \ge 0$: $H(p) \le \log m$. Equality iff $p = u$.
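This identity is easy to check numerically. A minimal sketch (assuming NumPy, with entropy measured in nats):

```python
# Check the identity D(p||u) = log m - H(p) for a random distribution p
# on an alphabet of size m; natural logs, so entropy is in nats.
import numpy as np

rng = np.random.default_rng(0)
m = 8
p = rng.random(m)
p /= p.sum()                       # random distribution on m symbols

H = -np.sum(p * np.log(p))         # entropy H(p) in nats
D = np.sum(p * np.log(p * m))      # KL divergence from uniform u = 1/m

assert np.isclose(D, np.log(m) - H)    # the identity from the proof
assert H <= np.log(m) + 1e-12          # hence H(p) <= log m
print(f"H(p) = {H:.4f} nats, log m = {np.log(m):.4f}, D(p||u) = {D:.4f}")
```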
Theorem: Maximum Entropy with a Mean Constraint
Among all distributions $p$ on the nonnegative integers $\{0, 1, 2, \ldots\}$ with mean $\mu$, the entropy is maximized by the geometric distribution:
$$p^*(k) = \frac{1}{\mu + 1} \left( \frac{\mu}{\mu + 1} \right)^{k}, \qquad k = 0, 1, 2, \ldots$$
The maximum entropy is:
$$H(p^*) = (\mu + 1) \log(\mu + 1) - \mu \log \mu.$$
With only a mean constraint, the geometric distribution is the "least informative" distribution. This is the discrete analogue of the exponential distribution maximizing continuous entropy under a mean constraint.
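As a quick numerical sanity check of the closed form, a sketch (assuming NumPy; entropy in nats, with the infinite support truncated at a large cutoff):

```python
# Verify that the geometric distribution with mean mu attains
# H = (mu+1)*log(mu+1) - mu*log(mu) in nats. The tail beyond K is negligible.
import numpy as np

mu = 3.0
K = 2000                                     # truncation point
k = np.arange(K)
p = (1 / (mu + 1)) * (mu / (mu + 1)) ** k    # geometric p*(k)

mean = np.sum(k * p)                         # should be ~ mu
H = -np.sum(p * np.log(p))                   # entropy in nats
H_formula = (mu + 1) * np.log(mu + 1) - mu * np.log(mu)

print(f"mean = {mean:.6f}  H = {H:.6f}  formula = {H_formula:.6f}")
```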
Lagrangian formulation
We maximize $H(p) = -\sum_{k \ge 0} p(k) \log p(k)$ subject to $\sum_{k \ge 0} p(k) = 1$ and $\sum_{k \ge 0} k\, p(k) = \mu$. The Lagrangian is:
$$\mathcal{L}(p, \lambda_0, \lambda_1) = -\sum_{k \ge 0} p(k) \log p(k) + \lambda_0 \Big( \sum_{k \ge 0} p(k) - 1 \Big) + \lambda_1 \Big( \sum_{k \ge 0} k\, p(k) - \mu \Big).$$
Solve the KKT conditions
Setting $\partial \mathcal{L} / \partial p(k) = 0$:
$$-\log p(k) - 1 + \lambda_0 + \lambda_1 k = 0 \quad \Longrightarrow \quad p(k) = c\, a^k,$$
where $c = e^{\lambda_0 - 1}$ and $a = e^{\lambda_1}$ are constants determined by the normalization and mean constraints. This is a geometric distribution with parameter $a = \mu/(\mu + 1)$ (and hence $c = 1/(\mu + 1)$).
Verify optimality
Since entropy is concave and the constraints are linear, the KKT conditions are sufficient for a global maximum. The geometric distribution is the unique maximizer.
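One way to see this concretely is to perturb the geometric maximizer in a direction that preserves both constraints and watch the entropy drop. A minimal sketch (assuming NumPy; the perturbation direction $(1, -2, 1)$ on $k = 0, 1, 2$ preserves both normalization and mean):

```python
# Perturb the geometric distribution while keeping sum(q) = 1 and
# sum(k*q) = mu fixed; entropy decreases away from the maximizer.
import numpy as np

mu, K = 3.0, 2000
k = np.arange(K)
p = (1 / (mu + 1)) * (mu / (mu + 1)) ** k    # the maximizer p*

def entropy(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

for eps in [0.0, 0.01, 0.02, 0.05]:
    q = p.copy()
    q[:3] += eps * np.array([1.0, -2.0, 1.0])  # sums to 0; sum(k*v) = 0
    print(f"eps = {eps:.2f}: mean = {np.sum(k * q):.4f}, H = {entropy(q):.6f}")
```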
Example: Maximum Entropy for Binary Variables
Find the distribution on $\{0, 1\}$ with mean $\mu$ that maximizes entropy.
Direct solution
For a binary variable with $P(X = 1) = p$: $H(X) = -p \log p - (1 - p) \log(1 - p) = H_b(p)$.
Since $H_b$ is concave and there is only one free parameter (the mean constraint fixes $p = \mu$), the maximum over all binary distributions with mean $\mu$ is simply $H_b(\mu)$.
Without the mean constraint, the maximum is $1$ bit, achieved by the uniform (Bernoulli(1/2)) distribution.
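A short sketch of the binary entropy function (assuming NumPy; base-2 logs, so entropy is in bits and the unconstrained maximum of 1 bit appears at $\mu = 1/2$):

```python
# Binary entropy H_b(mu) in bits; the constrained maximum is H_b(mu) itself.
import numpy as np

def binary_entropy(mu):
    """H_b(mu) = -mu*log2(mu) - (1-mu)*log2(1-mu), in bits."""
    if mu in (0.0, 1.0):
        return 0.0
    return -mu * np.log2(mu) - (1 - mu) * np.log2(1 - mu)

for mu in [0.1, 0.25, 0.5, 0.75]:
    print(f"mu = {mu:.2f}: H_b = {binary_entropy(mu):.4f} bits")
```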
Example: Maximum Entropy Under Mean and Variance Constraints
Among all distributions on $\mathbb{Z}$ (the integers) with mean $\mu$ and variance $\sigma^2$, which distribution maximizes entropy? What does this preview for the continuous case?
Lagrangian with two constraints
Adding a variance constraint leads to a distribution of the form:
$$p(k) \propto e^{-\lambda_1 k - \lambda_2 k^2}, \qquad k \in \mathbb{Z},$$
for some $\lambda_1, \lambda_2$ determined by the mean and variance constraints. This is a discrete Gaussian distribution.
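The multipliers $\lambda_1, \lambda_2$ rarely have a closed form; they can be found numerically. A sketch (assuming NumPy and SciPy, with $\mathbb{Z}$ truncated to a finite window; the window size and starting guess are arbitrary choices for illustration):

```python
# Solve for (lam1, lam2) so that p(k) ~ exp(-lam1*k - lam2*k**2) matches
# a target mean and variance, using a two-equation root finder.
import numpy as np
from scipy.optimize import root

mu, sigma2 = 1.0, 4.0
k = np.arange(-50, 51)                    # truncated window standing in for Z

def dist(lams):
    lam1, lam2 = lams
    logw = -lam1 * k - lam2 * k**2
    logw -= logw.max()                    # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()

def constraints(lams):
    p = dist(lams)
    m = np.sum(k * p)
    return [m - mu, np.sum((k - m) ** 2 * p) - sigma2]

sol = root(constraints, x0=[0.0, 0.1])    # illustrative starting guess
p = dist(sol.x)
m = np.sum(k * p)
pe = p[p > 0]
print("lam1, lam2 =", sol.x)
print(f"mean = {m:.4f}, var = {np.sum((k - m)**2 * p):.4f}")
print(f"H = {-np.sum(pe * np.log(pe)):.4f} nats "
      f"(continuous Gaussian value: {0.5 * np.log(2 * np.pi * np.e * sigma2):.4f})")
```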
Preview of the continuous case
In the continuous case (Chapter 2), the analogous result is that the Gaussian uniquely maximizes differential entropy among all distributions with mean $\mu$ and variance $\sigma^2$, achieving $h = \frac{1}{2} \log(2\pi e \sigma^2)$. This is perhaps the single most important fact in Gaussian channel analysis: it is the reason that Gaussian noise is the "worst-case" additive noise.
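A numerical illustration of the continuous claim (a sketch assuming NumPy; differential entropies in nats, approximated on a fine grid): a Laplace density with the same variance has strictly smaller differential entropy than the Gaussian.

```python
# Compare differential entropies (nats) of a Gaussian and a same-variance
# Laplace density via grid approximation of -integral f log f.
import numpy as np

sigma2 = 4.0
x = np.linspace(-40, 40, 400001)
dx = x[1] - x[0]

gauss = np.exp(-x**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
b = np.sqrt(sigma2 / 2)                  # Laplace scale giving variance sigma2
laplace = np.exp(-np.abs(x) / b) / (2 * b)

def h(f):
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dx

print(f"h(Gaussian) = {h(gauss):.4f}  (formula {0.5*np.log(2*np.pi*np.e*sigma2):.4f})")
print(f"h(Laplace)  = {h(laplace):.4f}  (smaller, as the theorem predicts)")
```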
Historical Note: Jaynes and the Maximum Entropy Principle
In 1957, E. T. Jaynes proposed the maximum entropy principle as a general method for statistical inference: given partial information about a distribution (in the form of constraints on expectations), the "least biased" distribution consistent with the constraints is the one that maximizes entropy. Jaynes argued this is the natural generalization of Laplace's principle of indifference.
The maximum entropy principle has found applications far beyond information theory: statistical mechanics (where it reproduces the Boltzmann and Gibbs distributions), natural language processing (MaxEnt classifiers), image reconstruction, and spectral analysis. The connection to exponential families in statistics is not a coincidence — every maximum entropy distribution under moment constraints belongs to an exponential family.
[Interactive visualization: maximum entropy distributions under different constraints. The unconstrained maximum entropy distribution is uniform; adding a mean constraint yields a geometric distribution; adding a variance constraint yields a discrete Gaussian. Adjustable parameters: which constraints to impose, size of the support, mean constraint.]
Maximum Entropy Input Distributions for MIMO Channels
The maximum entropy principle plays a key role in determining capacity-achieving input distributions for channels with state information. Caire and Shamai showed that for certain channel models with state known at the transmitter, the optimal input distribution can be characterized via a maximum entropy argument combined with the waterfilling structure. This work bridges the classical maximum entropy framework of this chapter with the modern multiuser and state-dependent channel theory of Chapters 9-12.
Common Mistake: Maximum Entropy Is Not Always the Right Prior
Mistake:
Blindly applying the maximum entropy principle as a universal prior, especially when the chosen constraints do not capture the relevant structure of the problem.
Correction:
The maximum entropy distribution is the least informative distribution consistent with the given constraints. The quality of the result depends critically on the choice of constraints. If the constraints are poorly chosen (e.g., constraining the mean when the relevant structure is in the higher moments), the maximum entropy distribution may be misleading. In information theory, the constraints are usually natural (power constraints, bandwidth constraints), but in other applications, domain knowledge should guide the choice.
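A small illustration of this pitfall (a sketch assuming NumPy and SciPy; the bimodal "true" distribution is invented for the example): fitting only the mean of a bimodal distribution produces a smooth geometric-shaped maximizer that misses both modes entirely.

```python
# A mean-only max-entropy fit to a bimodal truth spreads mass smoothly
# and assigns little probability to either mode.
import numpy as np
from scipy.optimize import brentq

k = np.arange(21)                        # support {0, ..., 20}
p_true = np.zeros(21)
p_true[2], p_true[14] = 0.5, 0.5         # bimodal truth, mean = 8
mu = np.sum(k * p_true)

def mean_gap(lam):
    w = np.exp(-lam * k)
    return np.sum(k * w) / np.sum(w) - mu

lam = brentq(mean_gap, -5, 5)            # solve the single mean constraint
w = np.exp(-lam * k)
p_maxent = w / w.sum()

print(f"maxent mass near the modes: p(2) = {p_maxent[2]:.3f}, "
      f"p(14) = {p_maxent[14]:.3f} (true: 0.5 each)")
```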