Differential Entropy

From Discrete to Continuous

In Chapter 1 we defined entropy for discrete random variables. But most quantities in communications (signal amplitudes, channel gains, noise samples) are continuous. We need an analogue of entropy for continuous distributions. The natural attempt is to replace sums with integrals in Shannon's formula. This works, but with a crucial caveat: the resulting quantity, called differential entropy, behaves differently from discrete entropy in several important ways. It can be negative, it is not invariant under changes of variable, and it does not have a direct operational meaning as a description length. Despite these differences, differential entropy is indispensable because mutual information for continuous variables is well-behaved: it inherits all the good properties from the discrete case.

Definition:

Differential Entropy

Let X be a continuous random variable with probability density function f_X(x). The differential entropy of X is

h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx,

provided the integral exists (possibly as +\infty or -\infty).

We use the notation h (lowercase) to distinguish differential entropy from discrete entropy H (uppercase). Some authors use h for both; context makes the meaning clear.

Differential entropy

The continuous analogue of Shannon entropy: h(X) = -\int f(x) \log f(x)\,dx. Unlike discrete entropy, differential entropy can be negative, is not invariant under invertible transformations, and does not directly represent a compression limit.

Related: Entropy, Entropy power

Theorem: Properties of Differential Entropy

Let X be a continuous random variable with PDF f_X and let a, b be real constants with a \neq 0. Then:

  1. Translation invariance: h(X + b) = h(X).
  2. Scaling: h(aX) = h(X) + \log|a|.
  3. Can be negative: There exist distributions with h(X) < 0.
  4. Conditioning reduces differential entropy (on average): h(X|Y) \leq h(X), with equality iff X \perp Y.
  5. Mutual information is well-defined: I(X;Y) = h(X) - h(X|Y) \geq 0, exactly as in the discrete case.

Property 2 shows that differential entropy is not invariant under coordinate changes, unlike discrete entropy, which depends only on the probabilities, not the labels. Squeezing a distribution (multiplying by |a| < 1) makes it more "concentrated" and decreases differential entropy, since \log|a| < 0.
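To see where the \log|a| term comes from, here is the short change-of-variables argument for scalar X and Y = aX, whose density is f_Y(y) = \frac{1}{|a|} f_X(y/a):

h(aX) = -\int \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \log\!\left[\frac{1}{|a|} f_X\!\left(\frac{y}{a}\right)\right] dy = -\int f_X(x)\,[\log f_X(x) - \log|a|]\, dx = h(X) + \log|a|,

where the middle step substitutes x = y/a (so dy = |a|\,dx).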

Property 3 is surprising at first: a random variable with "less than zero" uncertainty? The resolution is that differential entropy measures uncertainty relative to the Lebesgue measure, not in absolute terms. A uniform distribution on an interval of length 1/2, for instance, has h(X) = \log(1/2) = -1 bit. The important quantities, mutual information and KL divergence, are always non-negative.

Example: Differential Entropy of the Uniform Distribution

Compute h(X) for X \sim \text{Uniform}(a, b).
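A worked sketch of the standard computation: the density is f(x) = \frac{1}{b-a} on (a, b) and zero elsewhere, so

h(X) = -\int_a^b \frac{1}{b-a} \log\frac{1}{b-a}\, dx = \log(b - a).

Note that h(X) < 0 whenever b - a < 1, a first instance of Property 3.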

Example: Differential Entropy of the Gaussian Distribution

Compute h(X) for X \sim \mathcal{N}(0, \sigma^2).
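A worked sketch: writing -\log f(x) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{x^2}{2\sigma^2}\log e and taking expectations,

h(X) = \mathbb{E}[-\log f(X)] = \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}\log e = \frac{1}{2}\log(2\pi e\sigma^2),

which in bits is \frac{1}{2}\log_2(2\pi e\sigma^2), the expression used in the Quick Check below.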

Example: Differential Entropy of the Exponential Distribution

Compute h(X) for X \sim \text{Exp}(\lambda) with PDF f(x) = \lambda e^{-\lambda x} for x \geq 0.
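A worked sketch: since -\log f(x) = -\log\lambda + \lambda x \log e and \mathbb{E}[X] = 1/\lambda,

h(X) = \mathbb{E}[-\log f(X)] = -\log\lambda + \log e = \log\frac{e}{\lambda},

which equals 1 - \ln\lambda in nats and becomes negative once \lambda > e.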

Differential Entropy of Common Distributions

Compare the differential entropy of the Gaussian, uniform, and exponential distributions as you vary their parameters. The Gaussian always has the highest differential entropy for a given variance.


Common Mistake: Differential Entropy Is Not the Limit of Discrete Entropy

Mistake:

Assuming that h(X) = \lim_{\Delta \to 0} H(X^\Delta), where X^\Delta is the quantized version of X with bin width \Delta.

Correction:

The correct relationship is h(X) = \lim_{\Delta \to 0} \left[ H(X^\Delta) + \log \Delta \right]. The discrete entropy H(X^\Delta) \to \infty as \Delta \to 0 (finer quantization requires more bits), but the excess over -\log \Delta converges to h(X). This is why differential entropy can be negative: it is "entropy minus an infinite constant." We make this precise in Section 2.5.
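This limit is easy to verify numerically. The short Python sketch below (assuming NumPy and SciPy are installed) quantizes a standard Gaussian with bin width \Delta and compares H(X^\Delta) + \log\Delta, in nats, with the closed-form value \frac{1}{2}\ln(2\pi e) \approx 1.419:

import numpy as np
from scipy.stats import norm

# Quantize a standard Gaussian with bin width Delta and compare
# H(X^Delta) + log(Delta) against h(X) = 0.5 * ln(2*pi*e) nats.
h_true = 0.5 * np.log(2 * np.pi * np.e)
for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-12.0, 12.0 + delta, delta)   # covers essentially all the mass
    p = np.diff(norm.cdf(edges))                    # bin probabilities
    p = p[p > 0]
    H_disc = -np.sum(p * np.log(p))                 # entropy of the quantized variable (nats)
    print(f"Delta={delta:5.2f}  H+log(Delta)={H_disc + np.log(delta):.4f}  h={h_true:.4f}")

As \Delta shrinks, H(X^\Delta) itself grows without bound while H(X^\Delta) + \log\Delta settles near 1.419.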

Common Mistake: Differential Entropy Depends on the Coordinate System

Mistake:

Treating differential entropy as coordinate-invariant. For example, claiming that h(X) in Cartesian coordinates equals h(X) in polar coordinates.

Correction:

Under a change of variables Y = g(X), where g is a diffeomorphism:

h(Y) = h(X) + \mathbb{E}[\log|g'(X)|].

Differential entropy changes under coordinate transformations. Mutual information I(X;Y), by contrast, is invariant: the Jacobian terms cancel. This is why mutual information, not differential entropy, is the fundamental quantity for continuous variables.
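As a concrete instance of the formula (a standard example), take X \sim \mathcal{N}(\mu, \sigma^2) and g(x) = e^x, so that Y = e^X is lognormal and \log|g'(X)| = X \log e. Then

h(Y) = h(X) + \mathbb{E}[X]\log e = \frac{1}{2}\log(2\pi e\sigma^2) + \mu\log e,

which in nats is \mu + \frac{1}{2}\ln(2\pi e\sigma^2), the familiar lognormal differential entropy.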

Quick Check

What is the differential entropy of X \sim \mathcal{N}(0, 4) in bits?

\frac{1}{2}\log(8\pi e) \approx 2.047 bits

\frac{1}{2}\log(8\pi e) \approx 3.047 bits

2 bits

\frac{1}{2}\log(2\pi e) \approx 2.047 bits

Historical Note: Shannon's Treatment of Continuous Entropy

1948

In his 1948 paper, Shannon introduced differential entropy with full awareness of its limitations. He noted that it is "in some ways a dubious quantity": negative, coordinate-dependent, and without the clean operational interpretation of discrete entropy. Yet he recognized that the differences of differential entropies (i.e., mutual information) are well-behaved, and this is all that matters for channel capacity.

The modern perspective is that differential entropy is best understood through the lens of relative entropy: D(f \| g) is always well-defined and non-negative for densities f, g, and differential entropy is what you get (up to sign) when you take g to be the Lebesgue measure (which is not a proper probability measure, explaining the anomalies).
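For reference, the relative entropy between densities is

D(f \| g) = \int f(x) \log\frac{f(x)}{g(x)}\, dx \geq 0,

and formally setting g \equiv 1 (the Lebesgue "density", which integrates to infinity) gives D(f \| 1) = -h(X); since the non-negativity of D requires g to be a proper density, h(X) is free to take either sign.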

🔧 Engineering Note

Numerical Computation of Differential Entropy

Computing h(X) = -\int f(x) \log f(x)\,dx numerically requires care. Common pitfalls include:

  • Tail truncation: For heavy-tailed distributions (e.g., Cauchy), the integral converges slowly. Use importance sampling or analytical tail bounds.
  • Density estimation: When f is unknown and must be estimated from samples, kernel density estimation (KDE) introduces bias. The k-nearest neighbor (kNN) estimator of Kozachenko-Leonenko is often more reliable; a sketch appears after this note.
  • High dimensions: In \mathbb{R}^n with n > 10, histogram-based entropy estimation fails due to the curse of dimensionality. Use parametric models (e.g., Gaussian fit) or manifold-based estimators.
  • Numerical precision: The f \log f integrand is well-behaved near f = 0 (since \lim_{f \to 0} f \log f = 0), but roundoff errors in evaluating \log f for very small f can cause issues.
Practical Constraints
  • kNN estimators converge as O(n^{-1/d}), which is exponentially slow in the dimension d
  • Gaussian assumption gives closed-form entropy but may be poor for multimodal distributions
  • For MIMO systems, sample covariance matrix estimation suffices for Gaussian entropy
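For completeness, here is a minimal Python sketch of the Kozachenko-Leonenko kNN estimator mentioned above, assuming NumPy and SciPy are available; it is an illustration rather than a production estimator (no handling of duplicate samples or boundary effects):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko kNN estimate of differential entropy, in nats."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a 1-D array as n samples in R^1
    n, d = x.shape
    tree = cKDTree(x)
    # Distance to the k-th nearest neighbour; the first hit is the point itself.
    dist, _ = tree.query(x, k=k + 1)
    eps = dist[:, -1]
    # Log-volume of the unit ball in R^d.
    log_cd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(eps))

# Sanity check against the Gaussian closed form 0.5 * ln(2*pi*e*sigma^2).
rng = np.random.default_rng(0)
sigma = 2.0
print(kl_entropy(rng.normal(0.0, sigma, size=20000)),
      0.5 * np.log(2 * np.pi * np.e * sigma**2))

With a few thousand samples in one dimension the estimate lands close to the closed-form value; in higher dimensions the slow O(n^{-1/d}) convergence noted above becomes apparent.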

Probability density function (PDF)

A function f_X(x) \geq 0 such that \Pr(a \leq X \leq b) = \int_a^b f_X(x)\,dx and \int_{-\infty}^{\infty} f_X(x)\,dx = 1. Unlike PMFs, PDFs can exceed 1.

Related: Differential entropy