Differential Entropy
From Discrete to Continuous
In Chapter 1 we defined entropy for discrete random variables. But most quantities in communications (signal amplitudes, channel gains, noise samples) are continuous. We need an analogue of entropy for continuous distributions. The natural attempt is to replace sums with integrals in Shannon's formula. This works, but with a crucial caveat: the resulting quantity, called differential entropy, behaves differently from discrete entropy in several important ways. It can be negative, it is not invariant under changes of variable, and it does not have a direct operational meaning as a description length. Despite these differences, differential entropy is indispensable because mutual information for continuous variables is well-behaved: it inherits all the good properties from the discrete case.
Definition: Differential Entropy
Let $X$ be a continuous random variable with probability density function $f(x)$. The differential entropy of $X$ is
$$h(X) = -\int_{-\infty}^{\infty} f(x)\,\log f(x)\,dx,$$
provided the integral exists (possibly as $+\infty$ or $-\infty$).
We use the lowercase notation $h$ to distinguish differential entropy from discrete entropy $H$ (uppercase). Some authors use $H$ for both; context makes the meaning clear.
Differential entropy
The continuous analogue of Shannon entropy: $h(X) = -\int f(x)\log f(x)\,dx$. Unlike discrete entropy, differential entropy can be negative, is not invariant under invertible transformations, and does not directly represent a compression limit.
Related: Entropy, Entropy power
Theorem: Properties of Differential Entropy
Let $X$ be a continuous random variable with PDF $f$, and let $a, c$ be real constants with $a \neq 0$. Then:
- Translation invariance: $h(X + c) = h(X)$.
- Scaling: $h(aX) = h(X) + \log|a|$.
- Can be negative: There exist distributions with $h(X) < 0$.
- Conditioning reduces differential entropy (on average): $h(X \mid Y) \le h(X)$, with equality iff $X$ and $Y$ are independent.
- Mutual information is well-defined: $I(X;Y) = h(X) - h(X \mid Y) \ge 0$, exactly as in the discrete case.
Property 2 shows that differential entropy is not invariant under coordinate changes, unlike discrete entropy, which depends only on the probabilities, not the labels. Squeezing a distribution (multiplying by $a$ with $|a| < 1$) makes it more "concentrated" and decreases differential entropy.
Property 3 is surprising at first: a random variable with "less than zero" uncertainty? The resolution is that differential entropy measures uncertainty relative to the Lebesgue measure, not in absolute terms. The important quantities (mutual information, KL divergence) are always non-negative.
Translation invariance
The PDF of $Y = X + c$ is $f_Y(y) = f_X(y - c)$, so:
$$h(X + c) = -\int f_X(y - c)\log f_X(y - c)\,dy = -\int f_X(u)\log f_X(u)\,du = h(X)$$
by the substitution $u = y - c$.
Scaling
The PDF of $Y = aX$ is $f_Y(y) = \frac{1}{|a|} f_X(y/a)$, so:
$$h(aX) = -\int \frac{1}{|a|} f_X(y/a)\,\log\!\left(\frac{1}{|a|} f_X(y/a)\right) dy = -\int f_X(u)\bigl(\log f_X(u) - \log|a|\bigr)\,du = h(X) + \log|a|.$$
Negative differential entropy example
Let $X \sim \mathrm{Uniform}(0, \tfrac{1}{2})$. Then $f(x) = 2$ on $(0, \tfrac{1}{2})$:
$$h(X) = -\int_0^{1/2} 2\,\log_2 2\,dx = -1 \text{ bit}.$$
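These identities are easy to confirm numerically. Below is a minimal sketch using SciPy quadrature; the standard normal base density, the shift $c = 3.7$, the scale $a = 1/4$, and the helper name diff_entropy_bits are illustrative choices made here, not part of the text.

```python
import numpy as np
from scipy import integrate, stats

def diff_entropy_bits(pdf, lo, hi):
    """Differential entropy in bits: quadrature of -f(x) * log2 f(x) over [lo, hi]."""
    def integrand(x):
        fx = pdf(x)
        return -fx * np.log2(fx) if fx > 0 else 0.0
    value, _ = integrate.quad(integrand, lo, hi)
    return value

# Base density: a standard normal (an arbitrary illustrative choice).
h_X = diff_entropy_bits(stats.norm(0.0, 1.0).pdf, -10, 10)

# Translation invariance: h(X + c) = h(X).
c = 3.7
h_shifted = diff_entropy_bits(stats.norm(c, 1.0).pdf, c - 10, c + 10)
print(f"h(X) = {h_X:.4f} bits, h(X + c) = {h_shifted:.4f} bits")

# Scaling: h(aX) = h(X) + log2|a|.
a = 0.25
h_scaled = diff_entropy_bits(stats.norm(0.0, abs(a)).pdf, -5, 5)
print(f"h(aX) = {h_scaled:.4f} bits, h(X) + log2|a| = {h_X + np.log2(abs(a)):.4f} bits")

# Negative differential entropy: Uniform(0, 1/2).
h_u = diff_entropy_bits(stats.uniform(loc=0.0, scale=0.5).pdf, 0.0, 0.5)
print(f"h(Uniform(0, 1/2)) = {h_u:.4f} bits")
```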
Example: Differential Entropy of the Uniform Distribution
Compute $h(X)$ for $X \sim \mathrm{Uniform}(a, b)$.
Computation
$f(x) = \frac{1}{b-a}$ on $(a, b)$, so:
$$h(X) = -\int_a^b \frac{1}{b-a}\log\frac{1}{b-a}\,dx = \log(b-a).$$
For $b - a = 1$: $h(X) = 0$. For $b - a > 1$: $h(X) > 0$. For $b - a < 1$: $h(X) < 0$.
Example: Differential Entropy of the Gaussian Distribution
Compute $h(X)$ for $X \sim \mathcal{N}(\mu, \sigma^2)$.
Write the PDF
$f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}$.
Compute
$$h(X) = -\mathbb{E}\bigl[\log f(X)\bigr] = \frac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}\bigl[(X-\mu)^2\bigr]}{2\sigma^2}\log e = \frac{1}{2}\log\bigl(2\pi e\sigma^2\bigr).$$
This is one of the most important formulas in information theory. Note that $h(X)$ increases logarithmically with the variance: a more spread-out Gaussian has higher differential entropy.
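As a quick check of this formula, one can compare it against direct numerical integration; a minimal sketch (the mean and $\sigma$ below are arbitrary illustrative values):

```python
import numpy as np
from scipy import integrate, stats

mu, sigma = 1.0, 2.5   # arbitrary illustrative values
f = stats.norm(loc=mu, scale=sigma).pdf

# Quadrature of -f(x) log2 f(x) over a range wide enough to capture the mass.
h_numeric, _ = integrate.quad(lambda x: -f(x) * np.log2(f(x)), mu - 30, mu + 30)
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(f"numerical: {h_numeric:.6f} bits, closed form: {h_closed:.6f} bits")
```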
Example: Differential Entropy of the Exponential Distribution
Compute $h(X)$ for $X \sim \mathrm{Exp}(\lambda)$ with PDF $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$.
Compute
$$h(X) = -\int_0^\infty \lambda e^{-\lambda x}\log\bigl(\lambda e^{-\lambda x}\bigr)\,dx = -\log\lambda + \lambda\,\mathbb{E}[X]\log e = \log\frac{e}{\lambda}.$$
The exponential distribution maximizes entropy among all non-negative distributions with mean $1/\lambda$, just as the geometric distribution maximizes discrete entropy under a mean constraint (Chapter 1, §1.7).
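One way to see this claim in action is to compare the exponential against some other non-negative distribution with the same mean; the half-normal below is an arbitrary competitor chosen here for illustration (SciPy's entropy() returns nats, hence the conversion to bits):

```python
import numpy as np
from scipy import stats

mean = 2.0                              # arbitrary common mean, so lambda = 1/mean
h_exp = stats.expon(scale=mean).entropy() / np.log(2)

# A competing non-negative distribution with the same mean: a half-normal.
s = mean * np.sqrt(np.pi / 2.0)         # halfnorm(scale=s) has mean s * sqrt(2/pi)
h_half = stats.halfnorm(scale=s).entropy() / np.log(2)

print(f"exponential: {h_exp:.4f} bits >= half-normal: {h_half:.4f} bits (same mean)")
```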
Differential Entropy of Common Distributions
Compare the differential entropy of the Gaussian, uniform, and exponential distributions as you vary their parameters. The Gaussian always has the highest differential entropy for a given variance.
Parameters
Controls the spread of the distributions
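In place of the interactive plot, a small script can reproduce the comparison. Here the uniform width and exponential rate are matched (an assumption made for the comparison, not stated in the text) so that all three distributions share the variance $\sigma^2$, and the closed forms derived above are used directly:

```python
import numpy as np

def h_gaussian(sigma):     # 0.5 * log2(2*pi*e*sigma^2)
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def h_uniform(width):      # log2(b - a)
    return np.log2(width)

def h_exponential(lam):    # log2(e / lambda)
    return np.log2(np.e / lam)

for sigma in [0.5, 1.0, 2.0, 4.0]:
    width = sigma * np.sqrt(12.0)   # Uniform(0, width) has variance sigma^2
    lam = 1.0 / sigma               # Exp(lam) has variance sigma^2
    print(f"sigma={sigma:4.1f}  gaussian={h_gaussian(sigma):6.3f}  "
          f"uniform={h_uniform(width):6.3f}  exponential={h_exponential(lam):6.3f} bits")
```

For every row the Gaussian entry is the largest, as the text asserts for a fixed variance.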
Common Mistake: Differential Entropy Is Not the Limit of Discrete Entropy
Mistake:
Assuming that $h(X) = \lim_{\Delta \to 0} H(X^\Delta)$, where $X^\Delta$ is the quantized version of $X$ with bin width $\Delta$.
Correction:
The correct relationship is $H(X^\Delta) + \log\Delta \to h(X)$ as $\Delta \to 0$. The discrete entropy $H(X^\Delta) \to \infty$ as $\Delta \to 0$ (finer quantization requires more bits), but the excess over $\log(1/\Delta)$ converges to $h(X)$. This is why differential entropy can be negative: it is "entropy minus an infinite constant." We make this precise in Section 2.5.
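A short numerical illustration of this relationship, assuming $X$ is a standard Gaussian so that $h(X) \approx 2.05$ bits (the bin widths below are arbitrary): $H(X^\Delta)$ grows without bound as $\Delta$ shrinks, while $H(X^\Delta) + \log_2\Delta$ settles near $h(X)$.

```python
import numpy as np
from scipy import stats

h_true = 0.5 * np.log2(2 * np.pi * np.e)  # h(X) for X ~ N(0, 1), about 2.047 bits

for delta in [1.0, 0.1, 0.01, 0.001]:
    # Bin probabilities p_i = P(i*delta <= X < (i+1)*delta) over a wide range.
    edges = np.arange(-12.0, 12.0 + delta, delta)
    p = np.diff(stats.norm.cdf(edges))
    p = p[p > 0]
    H_quantized = -np.sum(p * np.log2(p))
    print(f"delta={delta:7.3f}  H(X^d)={H_quantized:8.3f}  "
          f"H(X^d)+log2(delta)={H_quantized + np.log2(delta):6.3f}  h(X)={h_true:.3f}")
```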
Common Mistake: Differential Entropy Depends on the Coordinate System
Mistake:
Treating differential entropy as coordinate-invariant. For example, claiming that $h(X, Y)$ computed in Cartesian coordinates equals $h(R, \Theta)$ computed in polar coordinates.
Correction:
Under a change of variables $Y = g(X)$ where $g$ is a diffeomorphism:
$$h(Y) = h(X) + \mathbb{E}\bigl[\log\left|\det J_g(X)\right|\bigr].$$
Differential entropy changes under coordinate transformations. Mutual information $I(X;Y)$, by contrast, is invariant: the Jacobian terms cancel. This is why mutual information, not differential entropy, is the fundamental quantity for continuous variables.
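As a one-dimensional sanity check (the lognormal example here is an illustrative choice, not taken from the text): with $Y = e^X$ and $X \sim \mathcal{N}(\mu, \sigma^2)$, the formula predicts $h(Y) = h(X) + \mathbb{E}[X]\log_2 e$, which can be compared against direct integration of the density of $Y$.

```python
import numpy as np
from scipy import integrate, stats

mu, sigma = 0.7, 1.2   # arbitrary illustrative parameters
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Change-of-variables prediction for Y = exp(X): log2|g'(X)| = X * log2(e).
h_Y_predicted = h_X + mu * np.log2(np.e)

# Direct numerical integration: the density of Y is lognormal.
f_Y = stats.lognorm(s=sigma, scale=np.exp(mu)).pdf

def integrand(y):
    fy = f_Y(y)
    return -fy * np.log2(fy) if fy > 0 else 0.0

h_Y_numeric, _ = integrate.quad(integrand, 0.0, 500.0, points=[0.5, 2.0, 10.0], limit=200)
print(f"predicted h(Y) = {h_Y_predicted:.3f} bits, numerical h(Y) = {h_Y_numeric:.3f} bits")
```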
Quick Check
What is the differential entropy of in bits?
bits
bits
bits
bits
bits.
Historical Note: Shannon's Treatment of Continuous Entropy
In his 1948 paper, Shannon introduced differential entropy with full awareness of its limitations. He noted that it is "in some ways a dubious quantity": negative, coordinate-dependent, and without the clean operational interpretation of discrete entropy. Yet he recognized that the differences of differential entropies (i.e., mutual information) are well-behaved, and this is all that matters for channel capacity.
The modern perspective is that differential entropy is best understood through the lens of relative entropy: $D(f\|g)$ is always well-defined and non-negative for densities $f$ and $g$, and differential entropy is what you get when you take $g$ to be the Lebesgue measure (which is not a proper probability measure, explaining the anomalies).
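To illustrate the contrast numerically, a small sketch (two arbitrary narrow Gaussians, chosen here so that both differential entropies are negative) confirming that $D(f\|g)$ nevertheless stays non-negative:

```python
import numpy as np
from scipy import integrate, stats

# Two narrow Gaussians; both have negative differential entropy (arbitrary parameters).
f = stats.norm(loc=0.0, scale=0.1)
g = stats.norm(loc=0.3, scale=0.2)

def h_bits(dist, lo, hi):
    """Differential entropy of dist in bits, by quadrature over [lo, hi]."""
    return integrate.quad(lambda x: -dist.pdf(x) * np.log2(dist.pdf(x)), lo, hi)[0]

# D(f || g) = integral of f * log2(f / g), over the effective support of f.
D_fg = integrate.quad(lambda x: f.pdf(x) * np.log2(f.pdf(x) / g.pdf(x)), -1.0, 1.0)[0]

print(f"h(f) = {h_bits(f, -1, 1):.3f} bits, h(g) = {h_bits(g, -1.5, 2.0):.3f} bits, "
      f"D(f || g) = {D_fg:.3f} bits")
```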
Numerical Computation of Differential Entropy
Computing $h(X)$ numerically requires care. Common pitfalls include:
- Tail truncation: For heavy-tailed distributions (e.g., Cauchy), the integral converges slowly. Use importance sampling or analytical tail bounds.
- Density estimation: When $f$ is unknown and must be estimated from samples, kernel density estimation (KDE) introduces bias. The k-nearest neighbor (kNN) estimator of Kozachenko-Leonenko is often more reliable; a sketch appears after this list.
- High dimensions: In $\mathbb{R}^d$ with $d$ large, histogram-based entropy estimation fails due to the curse of dimensionality. Use parametric models (e.g., a Gaussian fit) or manifold-based estimators.
- Numerical precision: The integrand $-f\log f$ is well-behaved near $f = 0$ (since $t\log t \to 0$ as $t \to 0^+$), but roundoff errors in log evaluation for very small $f$ can cause issues.
- kNN estimators converge at a rate that becomes exponentially slow as the dimension grows
- A Gaussian assumption gives closed-form entropy but may be poor for multimodal distributions
- For MIMO systems, sample covariance matrix estimation suffices for Gaussian entropy
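Below is a minimal sketch of the Kozachenko-Leonenko estimator in bits; the function name kl_entropy_bits, the sample size, the choice $k = 3$, and the Gaussian test distribution are all illustrative assumptions made here.

```python
import numpy as np
from scipy.special import digamma, gamma
from scipy.spatial import cKDTree

def kl_entropy_bits(samples, k=3):
    """Kozachenko-Leonenko kNN estimate of differential entropy, in bits.

    samples: array of shape (n,) or (n, d).
    """
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    tree = cKDTree(x)
    # Distance to the k-th nearest neighbor (column 0 is the point itself).
    eps = tree.query(x, k=k + 1)[0][:, k]
    c_d = np.pi ** (d / 2) / gamma(d / 2 + 1)      # volume of the unit d-ball
    h_nats = digamma(n) - digamma(k) + np.log(c_d) + d * np.mean(np.log(eps))
    return h_nats / np.log(2)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=2.0, size=5000)   # X ~ N(0, 4)
h_true = 0.5 * np.log2(2 * np.pi * np.e * 4.0)        # closed form, about 3.05 bits
print(f"kNN estimate: {kl_entropy_bits(samples):.3f} bits, true: {h_true:.3f} bits")
```

In higher dimensions the same code runs unchanged on an (n, d) sample array, but, as noted above, the estimate degrades as d grows.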
Probability density function (PDF)
A function $f$ such that $f(x) \ge 0$ for all $x$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Unlike PMFs, PDFs can exceed 1.
Related: Differential entropy