Jensen's Inequality and the Evidence Lower Bound

From a Stubborn $\log$ to a Tractable Bound

The obstacle in the previous section was precisely the logarithm of a sum. A classical trick in convex analysis, Jensen's inequality, transforms that $\log$-sum into a sum of $\log$s, at the cost of a lower bound. The bound is tight when we choose the mixing distribution correctly, and this choice recovers exactly the posterior over latent variables. The resulting lower bound is called the Evidence Lower Bound, or ELBO, and it is the object that EM maximizes.

Theorem: Jensen's Inequality (for the Logarithm)

Let $q(\mathbf{z})$ be any probability density over the latent space such that $q(\mathbf{z}) > 0$ wherever $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta}) > 0$. Then
$$\log f(\mathbf{y};\boldsymbol{\theta}) \;\geq\; \int q(\mathbf{z}) \log \frac{p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})}{q(\mathbf{z})}\,d\mathbf{z} \;=:\; \mathcal{F}(q,\boldsymbol{\theta}),$$
with equality if and only if $q(\mathbf{z}) = p(\mathbf{z}\mid\mathbf{y};\boldsymbol{\theta})$ almost everywhere.

To prove this, rewrite the marginal as an expectation, $f(\mathbf{y};\boldsymbol{\theta}) = \int q(\mathbf{z})\,\frac{p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})}{q(\mathbf{z})}\,d\mathbf{z} = \mathbb{E}_{\mathbf{Z}\sim q}\big[\frac{p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})}{q(\mathbf{Z})}\big]$, then pull the log inside the expectation. Because $\log$ is concave, the move costs us, but we can tune $q$ to make the gap zero.
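The bound can be checked numerically. Below is a minimal sketch with a single binary latent state and hypothetical joint probabilities: every choice of $q$ lands at or below the log-evidence, and the posterior closes the gap exactly.

```python
import math

# Toy discrete model (hypothetical numbers): joint p(y, z) over z in {0, 1}
# for one fixed observation y.
joint = [0.12, 0.28]                 # p(y, z=0), p(y, z=1)
log_evidence = math.log(sum(joint))  # log f(y)

def elbo(q):
    """F(q) = sum_z q(z) log( p(y,z) / q(z) ); terms with q(z) = 0 contribute 0."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

posterior = [p / sum(joint) for p in joint]

# Jensen: any q gives a lower bound; the posterior makes it tight.
for q0 in (0.1, 0.3, 0.5, 0.9):
    assert elbo([q0, 1 - q0]) <= log_evidence + 1e-12
assert abs(elbo(posterior) - log_evidence) < 1e-9
```

The guard `if qz > 0` implements the usual convention $0 \log 0 = 0$, so degenerate $q$'s are handled without special cases.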

Definition: Evidence Lower Bound (ELBO) / Free Energy

For a latent-variable model $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$ and any density $q(\mathbf{z})$, the evidence lower bound is
$$\mathcal{F}(q,\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{Z}\sim q}\!\big[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big] \;+\; \mathcal{H}(q),$$
where $\mathcal{H}(q) = -\int q(\mathbf{z})\log q(\mathbf{z})\,d\mathbf{z}$ is the differential entropy of $q$. Equivalently,
$$\mathcal{F}(q,\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta}) - D\!\big(q \,\Vert\, p(\cdot\mid\mathbf{y};\boldsymbol{\theta})\big).$$

The decomposition into an energy term and an entropy term gives $\mathcal{F}$ its alternative name, (negative) free energy, borrowed from statistical physics. Both viewpoints, the Jensen bound on a log-marginal and the free energy of a distribution, will be useful.
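The two forms of $\mathcal{F}$ in the definition can be checked against each other on the same kind of discrete toy model (the numbers below are hypothetical): energy plus entropy agrees with log-evidence minus KL for any $q$.

```python
import math

# Hypothetical two-state model: joint p(y, z) for z in {0, 1}.
joint = [0.12, 0.28]
evidence = sum(joint)
posterior = [p / evidence for p in joint]

q = [0.45, 0.55]  # an arbitrary variational distribution

# Form 1: expected complete-data log-likelihood ("energy") plus entropy H(q).
energy = sum(qz * math.log(pz) for qz, pz in zip(q, joint))
entropy = -sum(qz * math.log(qz) for qz in q)
elbo_energy_entropy = energy + entropy

# Form 2: log-evidence minus KL(q || posterior).
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))
elbo_evidence_kl = math.log(evidence) - kl

assert abs(elbo_energy_entropy - elbo_evidence_kl) < 1e-9
```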

Key Takeaway

The ELBO is a lower bound on $\log f(\mathbf{y};\boldsymbol{\theta})$ that depends on a free distribution $q$. Maximizing over $q$ at fixed $\boldsymbol{\theta}$ recovers the posterior and makes the bound tight; this is the E-step. Maximizing over $\boldsymbol{\theta}$ at fixed $q$ is the M-step. EM is coordinate ascent on $\mathcal{F}$.
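The coordinate-ascent picture can be made concrete. The toy EM below uses hypothetical data and fixes the two Gaussian components at $\pm 2$ with unit variance, so only the mixing weight $\pi$ is learned: each E-step makes the bound tight, each M-step raises it, and the log-likelihood never decreases.

```python
import math

# Hypothetical data: mixture pi * N(-2, 1) + (1 - pi) * N(2, 1), unknown pi.
data = [-2.3, -1.9, -2.5, 1.8, 2.2, 2.4, 2.1]

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_lik(pi):
    return sum(math.log(pi * norm_pdf(x, -2) + (1 - pi) * norm_pdf(x, 2))
               for x in data)

pi, lls = 0.5, []
for _ in range(20):
    # E-step: q(z_i = 1) = posterior responsibility (closes the Jensen gap).
    r = [pi * norm_pdf(x, -2) /
         (pi * norm_pdf(x, -2) + (1 - pi) * norm_pdf(x, 2)) for x in data]
    # M-step: maximize F over pi with q held fixed.
    pi = sum(r) / len(data)
    lls.append(log_lik(pi))

# Coordinate ascent on F implies a monotone log-likelihood.
assert all(b >= a - 1e-12 for a, b in zip(lls, lls[1:]))
```

With the components this well separated, $\pi$ settles near $3/7$, the fraction of points drawn from the left component.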

The ELBO Gap: $\log f(\mathbf{y};\boldsymbol{\theta}) - \mathcal{F}(q,\boldsymbol{\theta})$

Visualize the lower-bound relationship in a simple 1-D toy problem. A single parameter $\mu$ controls the location of a 2-component Gaussian mixture; the slider varies the variational mean used for $q$. The tight bound is achieved when $q$ matches the true posterior.
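A non-interactive version of the same picture, under assumed values for $y$ and $\mu$: sweeping the variational parameter traces out the gap $\log f(\mathbf{y};\boldsymbol{\theta}) - \mathcal{F}(q,\boldsymbol{\theta}) = D(q\Vert p(\cdot\mid\mathbf{y};\boldsymbol{\theta}))$, which is nonnegative everywhere and vanishes exactly at the posterior.

```python
import math

# Hypothetical setup: one observation y from 0.5*N(-mu, 1) + 0.5*N(mu, 1);
# the variational distribution is q(z = 1) = rho.
y, mu = 0.8, 1.0

def norm_pdf(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

w1 = 0.5 * norm_pdf(y, mu)
w0 = 0.5 * norm_pdf(y, -mu)
post = w1 / (w0 + w1)  # exact posterior q(z = 1 | y)

def gap(rho):
    """KL(q || posterior) = log f(y) - F(q): zero iff q is the posterior."""
    return (rho * math.log(rho / post)
            + (1 - rho) * math.log((1 - rho) / (1 - post)))

assert all(gap(r / 20) >= 0 for r in range(1, 20))
assert gap(post) < 1e-12
```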


The ELBO and the Bayes Modeler's Dream

Reading $\mathcal{F}(q,\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta}) - D(q\Vert p(\cdot\mid\mathbf{y};\boldsymbol{\theta}))$ the other way around: if we could maximize the ELBO jointly in $q$ and $\boldsymbol{\theta}$, we would simultaneously drive $q$ toward the posterior (making the KL term small) and drive $\boldsymbol{\theta}$ toward the MLE (making the first term large). This is the foundation of variational inference: when computing the exact posterior is intractable, restrict $q$ to a tractable family and optimize within it. Classical EM is the special case where the posterior is tractable, so the restriction is absent.
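A small sketch of what the restriction can cost, using a hypothetical correlated posterior over two binary latents: no fully factorized (mean-field) $q$ drives the KL term to zero, so the best ELBO within that family sits strictly below the evidence.

```python
import math
from itertools import product

# Hypothetical posterior over two binary latents with strong correlation,
# so it is not a product of its marginals.
post = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def kl_to_post(a, b):
    """KL( q1 x q2 || post ) for factorized Bernoulli marginals a, b."""
    kl = 0.0
    for z1, z2 in product((0, 1), repeat=2):
        q = (a if z1 else 1 - a) * (b if z2 else 1 - b)
        if q > 0:
            kl += q * math.log(q / post[(z1, z2)])
    return kl

# Grid search over the factorized family: the smallest achievable KL is
# still bounded away from zero, so the ELBO cannot reach log f(y).
best = min(kl_to_post(a / 100, b / 100)
           for a in range(1, 100) for b in range(1, 100))
assert best > 0.01
```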

Common Mistake: Support Condition on $q$

Mistake:

Applying the ELBO decomposition with a $q$ that puts zero mass on a region where the posterior has positive mass, i.e., "forgetting" a latent state.

Correction:

Such a $q$ still yields a valid lower bound (the regions where $q(\mathbf{z}) = 0$ simply contribute nothing), but the bound can never be tight: $D(q\Vert p(\cdot\mid\mathbf{y}))$ is strictly positive for every $q$ that misses part of the posterior's support, so the ELBO stays strictly below the evidence no matter how $\boldsymbol{\theta}$ is tuned. In practice: initialize $q$ with full support (e.g., every responsibility strictly positive). Exact zeros that arise mid-iteration are absorbing: a component that loses all its mass will never recover.
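The absorbing-zero behavior is easy to reproduce. In the sketch below (hypothetical data, components fixed at $\pm 2$), a component whose mixing weight has hit exactly zero receives zero responsibility from every point, so the M-step returns zero forever.

```python
import math

# Hypothetical 1-D data drawn near both components.
data = [-2.0, -1.5, 1.5, 2.0]

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

pi = 0.0  # weight of the component at mu = -2 has collapsed to zero
for _ in range(5):
    # E-step: responsibilities for the collapsed component are all zero ...
    r = [pi * norm_pdf(x, -2) /
         (pi * norm_pdf(x, -2) + (1 - pi) * norm_pdf(x, 2)) for x in data]
    # ... so the M-step returns zero again: the zero is absorbing.
    pi = sum(r) / len(data)

assert pi == 0.0  # the component never recovers
```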

Quick Check

The ELBO $\mathcal{F}(q,\boldsymbol{\theta})$ equals $\log f(\mathbf{y};\boldsymbol{\theta})$ if and only if $q$ equals which distribution?

The prior $p(\mathbf{z};\boldsymbol{\theta})$

The joint $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$

The posterior $p(\mathbf{z}\mid\mathbf{y};\boldsymbol{\theta})$

The marginal $f(\mathbf{y};\boldsymbol{\theta})$