Differential Entropy

From Discrete to Continuous

In Chapter 1 we defined entropy for discrete random variables. But most quantities in communications (signal amplitudes, channel gains, noise samples) are continuous. We need an analogue of entropy for continuous distributions. The natural attempt is to replace sums with integrals in Shannon's formula. This works, but with a crucial caveat: the resulting quantity, called differential entropy, behaves differently from discrete entropy in several important ways. It can be negative, it is not invariant under changes of variable, and it does not have a direct operational meaning as a description length. Despite these differences, differential entropy is indispensable because mutual information for continuous variables is well-behaved: it inherits all the good properties from the discrete case.

Definition:

Differential Entropy

Let X be a continuous random variable with probability density function f_X(x). The differential entropy of X is

h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx,

provided the integral exists (possibly as +\infty or -\infty).

We use the notation h (lowercase) to distinguish differential entropy from discrete entropy H (uppercase). Some authors use h for both; context makes the meaning clear.

Differential entropy

The continuous analogue of Shannon entropy: h(X) = -\int f(x) \log f(x)\,dx. Unlike discrete entropy, differential entropy can be negative, is not invariant under invertible transformations, and does not directly represent a compression limit.

Related: Entropy, Entropy power

Theorem: Properties of Differential Entropy

Let X be a continuous random variable with PDF f_X and let a, b be real constants with a \neq 0. Then:

  1. Translation invariance: h(X + b) = h(X).
  2. Scaling: h(aX) = h(X) + \log|a|.
  3. Can be negative: There exist distributions with h(X) < 0.
  4. Conditioning reduces differential entropy (on average): h(X|Y) \leq h(X), with equality iff X \perp Y.
  5. Mutual information is well-defined: I(X;Y) = h(X) - h(X|Y) \geq 0, exactly as in the discrete case.

Property 2 shows that differential entropy is not invariant under coordinate changes, unlike discrete entropy, which depends only on the probabilities, not the labels. Squeezing a distribution (multiplying by |a| < 1) makes it more "concentrated" and decreases differential entropy, since \log|a| < 0.
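To see where the \log|a| term comes from, here is the short change-of-variables argument for scalar X and Y = aX, whose density is f_Y(y) = \frac{1}{|a|} f_X(y/a):

h(aX) = -\int \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \log\!\left[\frac{1}{|a|} f_X\!\left(\frac{y}{a}\right)\right] dy = -\int f_X(x)\,[\log f_X(x) - \log|a|]\, dx = h(X) + \log|a|,

where the middle step substitutes x = y/a (so dy = |a|\,dx).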

Property 3 is surprising at first: a random variable with "less than zero" uncertainty? The resolution is that differential entropy measures uncertainty relative to the Lebesgue measure, not in absolute terms. A uniform distribution on an interval of length 1/2, for instance, has h(X) = \log(1/2) = -1 bit. The important quantities, mutual information and KL divergence, are always non-negative.

Example: Differential Entropy of the Uniform Distribution

Compute h(X) for X \sim \text{Uniform}(a, b).
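A worked sketch of the standard computation: the density is f(x) = \frac{1}{b-a} on (a, b) and zero elsewhere, so

h(X) = -\int_a^b \frac{1}{b-a} \log\frac{1}{b-a}\, dx = \log(b - a).

Note that h(X) < 0 whenever b - a < 1, a first instance of Property 3.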

Example: Differential Entropy of the Gaussian Distribution

Compute h(X) for X \sim \mathcal{N}(0, \sigma^2).
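A worked sketch: writing -\log f(x) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{x^2}{2\sigma^2}\log e and taking expectations,

h(X) = \mathbb{E}[-\log f(X)] = \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}\log e = \frac{1}{2}\log(2\pi e\sigma^2),

which in bits is \frac{1}{2}\log_2(2\pi e\sigma^2), the expression used in the Quick Check below.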

Example: Differential Entropy of the Exponential Distribution

Compute h(X) for X \sim \text{Exp}(\lambda) with PDF f(x) = \lambda e^{-\lambda x} for x \geq 0.
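A worked sketch: since -\log f(x) = -\log\lambda + \lambda x \log e and \mathbb{E}[X] = 1/\lambda,

h(X) = \mathbb{E}[-\log f(X)] = -\log\lambda + \log e = \log\frac{e}{\lambda},

which equals 1 - \ln\lambda in nats and becomes negative once \lambda > e.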

Differential Entropy of Common Distributions

Compare the differential entropy of the Gaussian, uniform, and exponential distributions as you vary their parameters. The Gaussian always has the highest differential entropy for a given variance.


Common Mistake: Differential Entropy Is Not the Limit of Discrete Entropy

Mistake:

Assuming that h(X) = \lim_{\Delta \to 0} H(X^\Delta), where X^\Delta is the quantized version of X with bin width \Delta.

Correction:

The correct relationship is h(X) = \lim_{\Delta \to 0} \left[ H(X^\Delta) + \log \Delta \right]. The discrete entropy H(X^\Delta) \to \infty as \Delta \to 0 (finer quantization requires more bits), but the excess over -\log \Delta converges to h(X). This is why differential entropy can be negative: it is "entropy minus an infinite constant." We make this precise in Section 2.5.
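This limit is easy to verify numerically. The short Python sketch below (assuming NumPy and SciPy are installed) quantizes a standard Gaussian with bin width \Delta and compares H(X^\Delta) + \log\Delta, in nats, with the closed-form value \frac{1}{2}\ln(2\pi e) \approx 1.419:

import numpy as np
from scipy.stats import norm

# Quantize a standard Gaussian with bin width Delta and compare
# H(X^Delta) + log(Delta) against h(X) = 0.5 * ln(2*pi*e) nats.
h_true = 0.5 * np.log(2 * np.pi * np.e)
for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-12.0, 12.0 + delta, delta)   # covers essentially all the mass
    p = np.diff(norm.cdf(edges))                    # bin probabilities
    p = p[p > 0]
    H_disc = -np.sum(p * np.log(p))                 # entropy of the quantized variable (nats)
    print(f"Delta={delta:5.2f}  H+log(Delta)={H_disc + np.log(delta):.4f}  h={h_true:.4f}")

As \Delta shrinks, H(X^\Delta) itself grows without bound while H(X^\Delta) + \log\Delta settles near 1.419.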

Common Mistake: Differential Entropy Depends on the Coordinate System

Mistake:

Treating differential entropy as coordinate-invariant. For example, claiming that h(X) in Cartesian coordinates equals h(X) in polar coordinates.

Correction:

Under a change of variables Y = g(X), where g is a diffeomorphism:

h(Y) = h(X) + \mathbb{E}[\log|g'(X)|].

Differential entropy changes under coordinate transformations. Mutual information I(X;Y), by contrast, is invariant: the Jacobian terms cancel. This is why mutual information, not differential entropy, is the fundamental quantity for continuous variables.
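As a concrete instance of the formula (a standard example), take X \sim \mathcal{N}(\mu, \sigma^2) and g(x) = e^x, so that Y = e^X is lognormal and \log|g'(X)| = X \log e. Then

h(Y) = h(X) + \mathbb{E}[X]\log e = \frac{1}{2}\log(2\pi e\sigma^2) + \mu\log e,

which in nats is \mu + \frac{1}{2}\ln(2\pi e\sigma^2), the familiar lognormal differential entropy.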

Quick Check

What is the differential entropy of X \sim \mathcal{N}(0, 4) in bits?

\frac{1}{2}\log(8\pi e) \approx 2.047 bits

\frac{1}{2}\log(8\pi e) \approx 3.047 bits

2 bits

\frac{1}{2}\log(2\pi e) \approx 2.047 bits

Historical Note: Shannon's Treatment of Continuous Entropy

1948

In his 1948 paper, Shannon introduced differential entropy with full awareness of its limitations. He noted that it is "in some ways a dubious quantity": negative, coordinate-dependent, and without the clean operational interpretation of discrete entropy. Yet he recognized that the differences of differential entropies (i.e., mutual information) are well-behaved, and this is all that matters for channel capacity.

The modern perspective is that differential entropy is best understood through the lens of relative entropy: D(f \| g) is always well-defined and non-negative for densities f, g, and differential entropy is what you get (up to sign) when you take g to be the Lebesgue measure (which is not a proper probability measure, explaining the anomalies).
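For reference, the relative entropy between densities is

D(f \| g) = \int f(x) \log\frac{f(x)}{g(x)}\, dx \geq 0,

and formally setting g \equiv 1 (the Lebesgue "density", which integrates to infinity) gives D(f \| 1) = -h(X); since the non-negativity of D requires g to be a proper density, h(X) is free to take either sign.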

🔧 Engineering Note

Numerical Computation of Differential Entropy

Computing h(X) = -\int f(x) \log f(x)\,dx numerically requires care. Common pitfalls include:

  • Tail truncation: For heavy-tailed distributions (e.g., Cauchy), the integral converges slowly. Use importance sampling or analytical tail bounds.
  • Density estimation: When f is unknown and must be estimated from samples, kernel density estimation (KDE) introduces bias. The k-nearest neighbor (kNN) estimator of Kozachenko-Leonenko is often more reliable; a sketch appears after this note.
  • High dimensions: In \mathbb{R}^n with n > 10, histogram-based entropy estimation fails due to the curse of dimensionality. Use parametric models (e.g., Gaussian fit) or manifold-based estimators.
  • Numerical precision: The f \log f integrand is well-behaved near f = 0 (since \lim_{f \to 0} f \log f = 0), but roundoff errors in evaluating \log f for very small f can cause issues.
Practical Constraints
  • kNN estimators converge as O(n^{-1/d}), which is exponentially slow in the dimension d
  • Gaussian assumption gives closed-form entropy but may be poor for multimodal distributions
  • For MIMO systems, sample covariance matrix estimation suffices for Gaussian entropy
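For completeness, here is a minimal Python sketch of the Kozachenko-Leonenko kNN estimator mentioned above, assuming NumPy and SciPy are available; it is an illustration rather than a production estimator (no handling of duplicate samples or boundary effects):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko kNN estimate of differential entropy, in nats."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a 1-D array as n samples in R^1
    n, d = x.shape
    tree = cKDTree(x)
    # Distance to the k-th nearest neighbour; the first hit is the point itself.
    dist, _ = tree.query(x, k=k + 1)
    eps = dist[:, -1]
    # Log-volume of the unit ball in R^d.
    log_cd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(eps))

# Sanity check against the Gaussian closed form 0.5 * ln(2*pi*e*sigma^2).
rng = np.random.default_rng(0)
sigma = 2.0
print(kl_entropy(rng.normal(0.0, sigma, size=20000)),
      0.5 * np.log(2 * np.pi * np.e * sigma**2))

With a few thousand samples in one dimension the estimate lands close to the closed-form value; in higher dimensions the slow O(n^{-1/d}) convergence noted above becomes apparent.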

Probability density function (PDF)

A function f_X(x) \geq 0 such that \Pr(a \leq X \leq b) = \int_a^b f_X(x)\,dx and \int_{-\infty}^{\infty} f_X(x)\,dx = 1. Unlike PMFs, PDFs can exceed 1.

Related: Differential entropy