Jensen's Inequality

Convexity Meets Expectation

Jensen's inequality is arguably the single most important inequality in information theory. Its statement is geometric: if you have a convex function and take its expectation, the result is at least as large as the function evaluated at the mean. This simple observation has profound consequences: it implies the non-negativity of Kullback-Leibler divergence ($D(P \| Q) \geq 0$), the concavity of entropy, and the fact that fading hurts average capacity in wireless channels. Let us make this precise.

Definition: Convex and Concave Functions

A function $g: \mathbb{R} \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}$ and $\lambda \in [0, 1]$:
$$g(\lambda x + (1-\lambda) y) \leq \lambda g(x) + (1-\lambda) g(y).$$
It is strictly convex if equality holds only when $\lambda \in \{0, 1\}$ or $x = y$. A function is (strictly) concave if $-g$ is (strictly) convex. Equivalently, for twice-differentiable $g$: $g$ is convex iff $g''(x) \geq 0$ for all $x$.

Key examples: $e^x$ is convex; $\log x$ is concave on $(0, \infty)$; $x^2$ is convex; $\sqrt{x}$ is concave on $[0, \infty)$.
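As a quick sanity check, the chord condition can be tested numerically. The sketch below (assuming numpy; the sample intervals are arbitrary choices) verifies it for $e^x$ and $\log x$:

```python
import numpy as np

def chord_gap(g, x, y, lam):
    """Chord value minus function value at the interpolated point.
    Non-negative for every lam in [0, 1]  <=>  g is convex on [x, y]."""
    z = lam * x + (1 - lam) * y
    return lam * g(x) + (1 - lam) * g(y) - g(z)

lam = np.linspace(0, 1, 101)
# e^x is convex: the chord lies on or above the curve, so the gap is >= 0.
assert np.all(chord_gap(np.exp, 0.0, 2.0, lam) >= 0)
# log x is concave on (0, inf): the gap is <= 0.
assert np.all(chord_gap(np.log, 0.5, 4.0, lam) <= 0)
print("chord checks passed")
```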

Theorem: Jensen's Inequality

Let $X$ be a random variable with finite mean and $g$ a convex function. Then:
$$g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)],$$
provided $\mathbb{E}[g(X)]$ exists. If $g$ is concave, the inequality reverses: $g(\mathbb{E}[X]) \geq \mathbb{E}[g(X)]$. If $g$ is strictly convex (or strictly concave), equality holds if and only if $X$ is constant almost surely.

A convex function "bows upward," so the average of the function values is at least as large as the function evaluated at the average input. Geometrically, the chord connecting two points on a convex curve lies on or above the curve, and taking a weighted average (expectation) over many points preserves this relationship.
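A small Monte Carlo sketch makes this concrete (numpy, with $X \sim \text{Exp}(1)$ and $g(x) = x^2$ chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # X ~ Exp(1), so E[X] = 1

# Note: g(x) = e^x would violate the "provided E[g(X)] exists" caveat here,
# since E[e^X] diverges for Exp(1). Use the convex g(x) = x**2 instead.
print(np.mean(x) ** 2)   # g(E[X])  ~ 1.0
print(np.mean(x ** 2))   # E[g(X)]  ~ 2.0, at least as large as g(E[X])
```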


Example: Jensen Implies Non-Negativity of KL Divergence

Use Jensen's inequality to prove the information inequality: $D(P \| Q) \geq 0$ for any two probability distributions $P, Q$ on the same alphabet, with equality iff $P = Q$.
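For reference, one standard route (a sketch; the sum runs over the support of $P$) writes the divergence as an expectation under $P$ and applies Jensen to the strictly concave $\log$:
$$-D(P \| Q) = \mathbb{E}_P\!\left[\log \frac{Q(X)}{P(X)}\right] \leq \log \mathbb{E}_P\!\left[\frac{Q(X)}{P(X)}\right] = \log \sum_{x:\, P(x) > 0} Q(x) \leq \log 1 = 0.$$
Equality in the Jensen step forces $Q(X)/P(X)$ to be constant almost surely, which together with normalization gives $P = Q$.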

Example: Fading Hurts Average Capacity

Consider a fading channel with instantaneous SNR $\gamma > 0$ (random) and instantaneous capacity $\log(1 + \gamma)$ nats. Show that the average capacity under fading is strictly less than the capacity at the average SNR whenever $\gamma$ is not deterministic.
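A quick simulation sketch (assuming Rayleigh fading, so that $\gamma$ is exponentially distributed; the mean SNR of 10 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
avg_snr = 10.0                                           # mean SNR, linear scale
gamma = rng.exponential(scale=avg_snr, size=1_000_000)   # Rayleigh fading SNRs

ergodic = np.mean(np.log(1 + gamma))   # E[log(1 + gamma)], nats
awgn = np.log(1 + avg_snr)             # log(1 + E[gamma]), nats
print(f"ergodic capacity: {ergodic:.3f} nats")
print(f"AWGN capacity:    {awgn:.3f} nats")
print(f"Jensen gap:       {awgn - ergodic:.3f} nats")
```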

Jensen's Gap for Fading Capacity

Compare the ergodic capacity $\mathbb{E}[\log(1 + \gamma)]$ with the AWGN capacity $\log(1 + \mathbb{E}[\gamma])$ as the fading severity varies. The gap quantifies the penalty of fading.
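One way to tabulate this gap numerically (a sketch assuming Nakagami-$m$ fading, under which $\gamma$ is Gamma-distributed with shape $m$; larger $m$ means milder fading):

```python
import numpy as np

rng = np.random.default_rng(2)
avg_snr = 10.0
awgn = np.log(1 + avg_snr)

for m in [0.5, 1, 2, 4, 8, 16]:
    # Nakagami-m fading: gamma ~ Gamma(shape=m, scale=avg_snr/m), mean avg_snr.
    g = rng.gamma(shape=m, scale=avg_snr / m, size=500_000)
    ergodic = np.mean(np.log(1 + g))
    print(f"m = {m:4}: Jensen gap = {awgn - ergodic:.3f} nats")
```

As $m$ grows, $\gamma$ concentrates around its mean and the gap shrinks toward zero, consistent with the equality condition of Jensen's inequality.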


Geometric Proof of Jensen's Inequality

Visual demonstration of Jensen's inequality: the chord of a convex function lies above the function, and the tangent line lies below. Expectation as a weighted average preserves these relationships.
The blue curve is $g(x) = e^x$ (convex). The red dot at $(\mathbb{E}[X], g(\mathbb{E}[X]))$ lies below the green dot at $(\mathbb{E}[X], \mathbb{E}[g(X)])$.
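The figure can be reproduced with a short matplotlib sketch (using a two-point distribution on $\{0, 2\}$, chosen here for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2, p = 0.0, 2.0, 0.5               # two-point distribution, P(X = x1) = p
mean_x = p * x1 + (1 - p) * x2          # E[X]
xs = np.linspace(-0.5, 2.5, 200)

plt.plot(xs, np.exp(xs), "b", label=r"$g(x) = e^x$")
plt.plot([x1, x2], [np.exp(x1), np.exp(x2)], "k--", label="chord")
plt.plot(mean_x, np.exp(mean_x), "ro", label=r"$g(\mathbb{E}[X])$")
plt.plot(mean_x, p * np.exp(x1) + (1 - p) * np.exp(x2), "go",
         label=r"$\mathbb{E}[g(X)]$")
plt.legend()
plt.show()
```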

Theorem: Jensen's Inequality (Multivariate)

If $g: \mathbb{R}^n \to \mathbb{R}$ is convex and $\mathbf{X} \in \mathbb{R}^n$ is a random vector with finite mean:
$$g(\mathbb{E}[\mathbf{X}]) \leq \mathbb{E}[g(\mathbf{X})].$$
The proof follows the same supporting-hyperplane argument, using a subgradient $\mathbf{v} \in \partial g(\mathbb{E}[\mathbf{X}])$.
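Spelled out (a sketch, writing $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$): convexity guarantees a supporting hyperplane at $\boldsymbol{\mu}$,
$$g(\mathbf{x}) \geq g(\boldsymbol{\mu}) + \mathbf{v}^\top (\mathbf{x} - \boldsymbol{\mu}) \quad \text{for all } \mathbf{x},$$
and substituting $\mathbf{X}$ and taking expectations kills the linear term: $\mathbb{E}[g(\mathbf{X})] \geq g(\boldsymbol{\mu}) + \mathbf{v}^\top(\mathbb{E}[\mathbf{X}] - \boldsymbol{\mu}) = g(\mathbb{E}[\mathbf{X}])$.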

Jensen and the AM-GM Inequality

The classical AM-GM inequality $\frac{1}{n}\sum_i x_i \geq (\prod_i x_i)^{1/n}$ is a special case of Jensen applied to $g(x) = -\log x$ (convex) with the uniform distribution on $\{x_1, \ldots, x_n\}$. More generally, for any concave $\phi$: $\phi(\mathbb{E}[X]) \geq \mathbb{E}[\phi(X)]$.
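Concretely (a sketch with $x_i > 0$): take $X$ uniform on $\{x_1, \ldots, x_n\}$ and apply Jensen to the convex $g(x) = -\log x$:
$$-\log\!\left(\frac{1}{n}\sum_{i=1}^n x_i\right) \leq \frac{1}{n}\sum_{i=1}^n \left(-\log x_i\right) = -\log\!\left(\prod_{i=1}^n x_i\right)^{1/n},$$
and negating and exponentiating both sides recovers AM-GM.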

Common Mistake: Getting the Direction Wrong

Mistake:

Writing $\mathbb{E}[\log(1 + \gamma)] \geq \log(1 + \mathbb{E}[\gamma])$ and concluding that fading helps.

Correction:

$\log$ is concave, so Jensen gives $\mathbb{E}[\log(1+\gamma)] \leq \log(1 + \mathbb{E}[\gamma])$. The inequality reverses relative to the convex case. Always check the direction: convex $g$ gives $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$; concave $g$ gives $\mathbb{E}[g(X)] \leq g(\mathbb{E}[X])$.

🎓 CommIT Contribution (1999)

Fading Channel Capacity with CSI

G. Caire and S. Shamai (Shitz), IEEE Trans. Inf. Theory, vol. 45, no. 6

Caire and Shamai characterized the capacity of fading channels with various degrees of channel state information at transmitter and receiver. Jensen's inequality is central to their analysis: without CSI at the transmitter, the capacity loss due to fading is exactly the Jensen gap $\log(1 + \mathbb{E}[\gamma]) - \mathbb{E}[\log(1 + \gamma)]$. With transmitter CSI, water-filling partially recovers this loss.

Tags: fading, capacity, CSI
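The water-filling allocation mentioned above can be sketched numerically. The snippet below is a minimal illustration of the classic allocation $P(\gamma) = \max(0, \mu - 1/\gamma)$ with a bisection on the water level $\mu$, not the paper's exact development:

```python
import numpy as np

def waterfill(gammas, p_avg):
    """Water-filling power over equiprobable fading states, with mean(P) = p_avg."""
    lo, hi = 1e-9, 1e9            # bracket for the water level mu
    for _ in range(100):          # bisection: mean power is increasing in mu
        mu = (lo + hi) / 2
        if np.mean(np.maximum(0.0, mu - 1.0 / gammas)) > p_avg:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, (lo + hi) / 2 - 1.0 / gammas)

rng = np.random.default_rng(3)
gammas = rng.exponential(scale=10.0, size=100_000)   # SNRs at unit transmit power
p = waterfill(gammas, p_avg=1.0)
no_csit = np.mean(np.log(1 + gammas))                # constant power P = 1
csit = np.mean(np.log(1 + p * gammas))               # water-filling with CSIT
print(f"no-CSIT ergodic capacity: {no_csit:.3f} nats")
print(f"water-filling capacity:   {csit:.3f} nats")
```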

Key Takeaway

Jensen's inequality connects the geometry of convex/concave functions with expectation. Its two most important applications for us are: (1) the information inequality $D(P \| Q) \geq 0$, which is the foundation of information theory, and (2) the fading penalty $\mathbb{E}[\log(1+\gamma)] \leq \log(1+\mathbb{E}[\gamma])$, which quantifies how randomness in the channel hurts throughput.

Historical Note: Johan Jensen and the 1906 Paper

1906

Johan Ludwig William Valdemar Jensen (1859–1925) was a Danish mathematician and engineer who worked at the Copenhagen Telephone Company. He published the inequality that bears his name in 1906 in Acta Mathematica. Special cases (such as the AM-GM inequality, due to Cauchy) were known much earlier, but Jensen gave the first general formulation for convex functions. His engineering background is fitting: the inequality has become one of the most-used tools in communications theory.

Convex function

A function $g$ satisfying $g(\lambda x + (1-\lambda)y) \leq \lambda g(x) + (1-\lambda) g(y)$ for all $x, y$ and $\lambda \in [0,1]$. Equivalently, $g''(x) \geq 0$ if twice differentiable.

Jensen gap

The difference $\mathbb{E}[g(X)] - g(\mathbb{E}[X]) \geq 0$ for convex $g$. Measures how much randomness in $X$ inflates the expected value of $g$.

Related: Convex function

Comparison of Probability Inequalities

| Inequality | Requirement | Bound on $\mathbb{P}(X \geq a)$ | Tail decay | Tightness |
|---|---|---|---|---|
| Markov | $X \geq 0$, finite $\mathbb{E}[X]$ | $\mathbb{E}[X]/a$ | $O(1/a)$ | Loosest; tight for a two-point mass |
| Chebyshev | Finite $\text{Var}(X)$ | $\sigma^2/(a-\mu)^2$ | $O(1/a^2)$ | Tight for a two-point mass |
| Chernoff | Finite MGF in a neighborhood of 0 | $\inf_{t>0} e^{-ta} M_X(t)$ | Exponential | Exponentially tight (correct rate) |
| Hoeffding | Independent, bounded $X_i \in [a_i, b_i]$ | $\exp(-2t^2/\sum_i (b_i-a_i)^2)$ | Exponential in $n$ | Good for bounded sums; ignores variance |
| Jensen | Convex $g$ | N/A (bounds expectations) | N/A | Tight iff $X$ is constant |
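To make the comparison concrete, here is a small numeric sketch (Markov vs. Chebyshev vs. Chernoff for $X \sim \text{Exp}(1)$, a hypothetical choice; the Chernoff infimum is evaluated in closed form):

```python
import numpy as np

# X ~ Exp(1): E[X] = 1, Var(X) = 1, MGF M(t) = 1/(1 - t) for t < 1.
a = np.array([2.0, 5.0, 10.0])

true_tail = np.exp(-a)              # exact P(X >= a)
markov = 1.0 / a                    # E[X] / a
chebyshev = 1.0 / (a - 1.0) ** 2    # sigma^2 / (a - mu)^2
# Chernoff: inf over 0 < t < 1 of e^{-ta}/(1 - t), attained at t = 1 - 1/a.
chernoff = a * np.exp(-(a - 1))

for ai, tt, m, c, ch in zip(a, true_tail, markov, chebyshev, chernoff):
    print(f"a = {ai:4}: exact {tt:.2e}  Markov {m:.2e}  "
          f"Chebyshev {c:.2e}  Chernoff {ch:.2e}")
```

Chernoff's exponential decay dominates for large $a$, even though Markov or Chebyshev can be tighter at moderate thresholds.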

Quick Check

You know only that $X \geq 0$ and $\mathbb{E}[X] = 5$. Which inequality gives a bound on $\mathbb{P}(X \geq 100)$?

Chebyshev

Markov

Chernoff

Hoeffding