Multivariate Differential Entropy

From Scalars to Vectors

In MIMO systems, we transmit and receive vectors: the channel input is $\mathbf{X} \in \mathbb{R}^n$ (or $\mathbb{C}^n$), the noise is a random vector, and the capacity depends on the covariance structure. Understanding differential entropy for random vectors is essential for the Gaussian vector channel (Chapter 10), MIMO capacity (Book telecom, Ch. 15), and the entropy power inequality (Section 2.4).

Definition: Multivariate Differential Entropy

Let $\mathbf{X} = (X_1, \ldots, X_n)^T$ be a continuous random vector with joint PDF $f_{\mathbf{X}}(\mathbf{x})$. The joint differential entropy is

$$h(\mathbf{X}) = -\int_{\mathbb{R}^n} f_{\mathbf{X}}(\mathbf{x}) \log f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}.$$

The conditional differential entropy is $h(\mathbf{X} \mid \mathbf{Y}) = -\int f_{\mathbf{X},\mathbf{Y}}(\mathbf{x},\mathbf{y}) \log f_{\mathbf{X}\mid\mathbf{Y}}(\mathbf{x}\mid\mathbf{y}) \, d\mathbf{x}\,d\mathbf{y}$.

Theorem: Gaussian Vector Maximizes Entropy Under Covariance Constraint

Let $\mathbf{X} \in \mathbb{R}^n$ be a random vector with covariance matrix $\mathbf{K}_X = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$. Then:

$$h(\mathbf{X}) \leq \frac{1}{2}\log\bigl((2\pi e)^n \det(\mathbf{K}_X)\bigr),$$

with equality if and only if $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}_X)$.

The determinant of the covariance matrix measures the "volume" of the uncertainty ellipsoid. The Gaussian spreads its probability as uniformly as possible over this ellipsoid, maximizing entropy. The factor $(2\pi e)^n$ is the "volume efficiency" of the Gaussian in $n$ dimensions.
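
A minimal numerical sketch of this bound at the Gaussian equality point (assuming NumPy and SciPy are available; the covariance matrix and sample size are arbitrary choices): estimate $h(\mathbf{X}) = -\mathbb{E}[\log f_{\mathbf{X}}(\mathbf{X})]$ by Monte Carlo for a Gaussian vector and compare it with the closed form above, using natural logarithms (nats).

```python
# Sketch: Monte Carlo check of h(X) = (1/2) log((2*pi*e)^n det(K)) for a
# Gaussian vector (natural log => nats). Covariance and sample size are arbitrary.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# An arbitrary 3x3 positive definite covariance matrix.
A = rng.standard_normal((3, 3))
K = A @ A.T + 3 * np.eye(3)
n = K.shape[0]

dist = multivariate_normal(mean=np.zeros(n), cov=K)

# Closed form from the theorem above.
h_closed = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))

# Monte Carlo: h(X) = -E[log f(X)], averaged over samples drawn from f itself.
samples = dist.rvs(size=200_000, random_state=0)
h_mc = -np.mean(dist.logpdf(samples))

print(f"closed form : {h_closed:.4f} nats")
print(f"Monte Carlo : {h_mc:.4f} nats")  # should agree to ~2 decimals
```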

Example: Entropy of a Bivariate Gaussian

Let $\mathbf{X} = (X_1, X_2)^T \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$ with $\mathbf{K} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ where $|\rho| < 1$. Compute $h(\mathbf{X})$, $h(X_1)$, $h(X_2 \mid X_1)$, and $I(X_1; X_2)$.
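
A worked solution, combining the maximum-entropy formula (attained with equality by the Gaussian) with the chain rule $h(X_1, X_2) = h(X_1) + h(X_2 \mid X_1)$ and $\det\mathbf{K} = 1 - \rho^2$:

$$h(\mathbf{X}) = \frac{1}{2}\log\bigl((2\pi e)^2 \det\mathbf{K}\bigr) = \frac{1}{2}\log\bigl((2\pi e)^2 (1 - \rho^2)\bigr), \qquad h(X_1) = h(X_2) = \frac{1}{2}\log(2\pi e),$$

$$h(X_2 \mid X_1) = h(\mathbf{X}) - h(X_1) = \frac{1}{2}\log\bigl(2\pi e\,(1 - \rho^2)\bigr), \qquad I(X_1; X_2) = h(X_2) - h(X_2 \mid X_1) = -\frac{1}{2}\log(1 - \rho^2).$$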

Mutual Information of Correlated Gaussians

Visualize how the mutual information $I(X_1; X_2) = -\frac{1}{2}\log(1 - \rho^2)$ grows as the correlation $\rho$ increases. The joint density contours flatten toward a line as $|\rho| \to 1$, making $X_2$ increasingly predictable from $X_1$.

[Interactive plot. Parameter: correlation coefficient $\rho$ between $X_1$ and $X_2$ (default 0.5).]
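
A small numerical check of the same formula (a sketch assuming NumPy and SciPy; the sample size and $\rho$ values are arbitrary): estimate $I(X_1; X_2) = \mathbb{E}\bigl[\log f_{X_1 X_2}(X_1, X_2) - \log f_{X_1}(X_1) - \log f_{X_2}(X_2)\bigr]$ by Monte Carlo and compare it with $-\frac{1}{2}\log(1 - \rho^2)$ (natural logarithms, so nats).

```python
# Sketch: Monte Carlo estimate of I(X1; X2) for the bivariate Gaussian,
# compared with the closed form -0.5 * log(1 - rho^2) (natural log => nats).
import numpy as np
from scipy.stats import multivariate_normal, norm

for rho in (0.0, 0.5, 0.9, 0.99):
    K = np.array([[1.0, rho], [rho, 1.0]])
    joint = multivariate_normal(mean=[0.0, 0.0], cov=K)

    xy = joint.rvs(size=100_000, random_state=1)
    # I(X1;X2) = E[log f(x1,x2) - log f(x1) - log f(x2)], samples from the joint.
    # The marginals are standard normal because both variances equal 1.
    mi_mc = np.mean(joint.logpdf(xy) - norm.logpdf(xy[:, 0]) - norm.logpdf(xy[:, 1]))
    mi_closed = -0.5 * np.log(1.0 - rho**2)

    print(f"rho={rho:4.2f}   closed={mi_closed:.4f}   Monte Carlo={mi_mc:.4f}")
```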

Theorem: Hadamard's Inequality via Entropy

For any random vector $\mathbf{X} = (X_1, \ldots, X_n)^T$ with covariance matrix $\mathbf{K}_X$:

$$h(\mathbf{X}) \leq \sum_{i=1}^{n} h(X_i),$$

with equality iff $X_1, \ldots, X_n$ are independent. For Gaussian vectors, this implies Hadamard's inequality:

$$\det(\mathbf{K}_X) \leq \prod_{i=1}^{n} K_{ii}.$$

Independence maximizes joint entropy for given marginals. For Gaussian vectors, this translates into the determinant inequality: the product of diagonal entries (variances) upper-bounds the determinant. Equality holds when the covariance matrix is diagonal.
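
A quick numerical illustration (a sketch assuming NumPy; the matrix is randomly generated): build a positive semidefinite matrix and check that its determinant does not exceed the product of its diagonal entries.

```python
# Sketch: numerical check of Hadamard's inequality det(K) <= prod(K_ii)
# for a randomly generated positive semidefinite matrix.
import numpy as np

rng = np.random.default_rng(2)

A = rng.standard_normal((4, 4))
K = A @ A.T                       # positive semidefinite by construction

print("det(K)       =", np.linalg.det(K))
print("prod of K_ii =", np.prod(np.diag(K)))   # always >= det(K)

# Equality case: a diagonal covariance (independent Gaussian components).
D = np.diag(rng.uniform(0.5, 2.0, size=4))
print(np.isclose(np.linalg.det(D), np.prod(np.diag(D))))  # True
```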

Covariance matrix

For a random vector $\mathbf{X}$ with mean $\boldsymbol{\mu}$: $\boldsymbol{\Sigma}_X = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$. Always positive semidefinite. For Gaussian vectors, the covariance matrix (together with the mean) fully determines the distribution.

Related: Differential entropy

Entropy power

For a random vector $X$ in $\mathbb{R}^n$: $N(X) = \frac{1}{2\pi e}\, 2^{2h(X)/n}$. The entropy power of a Gaussian is its variance. The entropy power inequality states that $N(X+Y) \geq N(X) + N(Y)$ for independent $X, Y$.

Related: Differential entropy
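
A quick numerical check of this convention (a sketch assuming NumPy; the variances are arbitrary): for independent scalar Gaussians the entropy power equals the variance, and the entropy power inequality holds with equality.

```python
# Sketch: with the convention N(X) = 2^(2 h(X)/n) / (2*pi*e) and h in bits,
# the entropy power of a Gaussian equals its variance, and the EPI is tight
# for independent Gaussians: N(X + Y) = N(X) + N(Y).
import numpy as np

def gaussian_h_bits(var):
    """Differential entropy of N(0, var) in bits (n = 1)."""
    return 0.5 * np.log2(2 * np.pi * np.e * var)

def entropy_power(h_bits, n=1):
    return 2 ** (2 * h_bits / n) / (2 * np.pi * np.e)

v1, v2 = 1.5, 0.7
N1 = entropy_power(gaussian_h_bits(v1))          # -> 1.5
N2 = entropy_power(gaussian_h_bits(v2))          # -> 0.7
N_sum = entropy_power(gaussian_h_bits(v1 + v2))  # X + Y ~ N(0, v1 + v2)

print(N1, N2, N_sum)   # N_sum equals N1 + N2
```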

Discrete vs Continuous Information Measures

| Property | Discrete ($H$) | Continuous ($h$) |
|---|---|---|
| Non-negativity | $H(X) \geq 0$ always | $h(X)$ can be negative |
| Maximum entropy | Uniform on $\mathcal{X}$: $\log\lvert\mathcal{X}\rvert$ | Gaussian with variance $\sigma^2$: $\frac{1}{2}\log(2\pi e\sigma^2)$ |
| Coordinate invariance | Yes (depends only on PMF) | No (changes under coordinate transforms) |
| MI well-defined? | Yes, $I(X;Y) \geq 0$ | Yes, $I(X;Y) = h(X) - h(X\mid Y) \geq 0$ |
| Operational meaning | Minimum avg. description length | No direct operational meaning alone |
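
A tiny illustration of the first row (a sketch assuming NumPy; the interval widths and PMF are arbitrary): the differential entropy of a uniform density on $[0, a]$ is $\log a$, which is negative for $a < 1$, while discrete entropy is always nonnegative.

```python
# Sketch: differential entropy can be negative (first table row).
# Uniform on [0, a] has h = log(a) nats, which is negative for a < 1;
# a discrete distribution always has H >= 0.
import numpy as np

for a in (2.0, 1.0, 0.25):
    print(f"Uniform[0, {a}]: h = {np.log(a):+.4f} nats")

p = np.array([0.7, 0.2, 0.1])                  # an arbitrary PMF
print("H =", -(p * np.log2(p)).sum(), "bits")  # nonnegative
```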

Quick Check

For a Gaussian vector $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$, what happens to $h(\mathbf{X})$ when we double $\mathbf{K}$ (i.e., replace $\mathbf{K}$ by $2\mathbf{K}$)?

$h$ increases by $\frac{n}{2}\log 2 = \frac{n}{2}$ bits

$h$ doubles

$h$ increases by $\frac{1}{2}\log 2$ bit

$h$ stays the same