Jensen's Inequality and the Evidence Lower Bound
From a Stubborn $\log$-Sum to a Tractable Bound
The obstacle in the previous section was precisely the logarithm of a sum. A classical trick in convex analysis — Jensen's inequality — transforms that $\log$-sum into a sum of $\log$'s, at the cost of turning the equality into a lower bound. The bound is tight when we choose the mixing distribution correctly, and this choice recovers exactly the posterior over latent variables. The resulting lower bound is called the Evidence Lower Bound, or ELBO, and it is the object that EM maximizes.
Theorem: Jensen's Inequality (for the Logarithm)
Let $q(z)$ be any probability density over the latent space such that $q(z) > 0$ wherever $p_\theta(x, z) > 0$. Then
$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right],$$
with equality if and only if $q(z) = p_\theta(z \mid x)$ almost everywhere.
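Before the proof, a quick numerical illustration may help. This is a minimal sketch assuming a made-up model with three latent states and arbitrarily chosen joint probabilities (none of the numbers come from the text): the bound holds for every full-support $q$ and becomes an equality exactly at the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint p_theta(x, z) over three latent states, for one fixed observation x.
joint = np.array([0.05, 0.12, 0.03])        # p(x, z) for z = 0, 1, 2 (illustrative values)
log_evidence = np.log(joint.sum())          # log p(x)
posterior = joint / joint.sum()             # p(z | x)

def elbo(q):
    """E_q[log p(x, z) / q(z)] for a distribution q with full support."""
    return np.sum(q * (np.log(joint) - np.log(q)))

for _ in range(5):
    q = rng.dirichlet(np.ones(3))           # a random full-support distribution over z
    assert elbo(q) <= log_evidence + 1e-12  # Jensen: the ELBO never exceeds the evidence

print(np.isclose(elbo(posterior), log_evidence))  # True: the bound is tight at the posterior
```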
We rewrite the marginal as an expectation, then pull the (concave) log inside the expectation. Because $\log$ is concave, the move costs us — but we can tune $q$ to make the gap zero.
Introduce $q$ into the marginal via $p_\theta(x) = \int q(z)\,\frac{p_\theta(x, z)}{q(z)}\,dz$.
Recognize the result as an expectation, $p_\theta(x) = \mathbb{E}_{q(z)}\!\left[\frac{p_\theta(x, z)}{q(z)}\right]$, and apply concavity of $\log$.
For the equality case, rearrange the gap as a KL divergence.
Insert $q$
For any $q$ with $q(z) > 0$ wherever $p_\theta(x, z) > 0$,
$$\log p_\theta(x) \;=\; \log \int p_\theta(x, z)\,dz \;=\; \log \int q(z)\,\frac{p_\theta(x, z)}{q(z)}\,dz \;=\; \log \mathbb{E}_{q(z)}\!\left[\frac{p_\theta(x, z)}{q(z)}\right].$$
Apply Jensen
Since $\log$ is concave, Jensen's inequality gives
$$\log \mathbb{E}_{q(z)}\!\left[\frac{p_\theta(x, z)}{q(z)}\right] \;\ge\; \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] \;=:\; \mathcal{L}(q, \theta).$$
Identify the gap
The gap between the log-likelihood and the ELBO is
$$\log p_\theta(x) - \mathcal{L}(q, \theta) \;=\; \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right).$$
(Use $p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)$ and simplify; the full algebra is spelled out just after the proof.)
Conclude
The KL divergence is non-negative and vanishes exactly when $q(z) = p_\theta(z \mid x)$ almost everywhere. This both proves the inequality and pinpoints the unique $q$ that makes it tight.
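For completeness, the algebra behind the gap identity in the third step is a one-line rearrangement; it uses only the factorization $p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)$ and the fact that $\log p_\theta(x)$ is a constant under $\mathbb{E}_{q}$:
$$
\begin{aligned}
\log p_\theta(x) - \mathcal{L}(q, \theta)
&= \mathbb{E}_{q(z)}\!\left[\log p_\theta(x) - \log \frac{p_\theta(x, z)}{q(z)}\right]
 = \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)\, p_\theta(x)}{p_\theta(x, z)}\right] \\
&= \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p_\theta(z \mid x)}\right]
 = \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right).
\end{aligned}
$$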
Definition: Evidence Lower Bound (ELBO) / Free Energy
For a latent-variable model $p_\theta(x, z)$ and any density $q(z)$, the evidence lower bound is
$$\mathcal{L}(q, \theta) \;=\; \mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] \;=\; \mathbb{E}_{q(z)}\!\left[\log p_\theta(x, z)\right] + H[q],$$
where $H[q] = -\mathbb{E}_{q(z)}[\log q(z)]$ is the differential entropy of $q$. Equivalently,
$$\mathcal{L}(q, \theta) \;=\; \log p_\theta(x) - \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right).$$
The decomposition into an energy term $\mathbb{E}_{q}[\log p_\theta(x, z)]$ and an entropy term $H[q]$ gives $\mathcal{L}$ its alternative name, (negative) free energy, borrowed from statistical physics. Both viewpoints — Jensen bound on a log-marginal and free energy of a distribution — will be useful.
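As a sanity check on the definition, here is a minimal numerical sketch, assuming an illustrative two-component 1-D Gaussian mixture and a single observation (all names and numbers below are assumptions for illustration): the Jensen form, the energy-plus-entropy form, and the evidence-minus-KL form of $\mathcal{L}$ agree, and each sits below $\log p_\theta(x)$.

```python
import numpy as np

# Illustrative two-component 1-D mixture and a single observation x (numbers are made up).
weights, means, stds = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([0.8, 1.2])
x = 0.5

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

joint = weights * gauss(x, means, stds)   # p_theta(x, z) for z = 0, 1
evidence = joint.sum()                    # p_theta(x)
posterior = joint / evidence              # p_theta(z | x)

q = np.array([0.9, 0.1])                  # an arbitrary q with full support

elbo_jensen = np.sum(q * np.log(joint / q))                            # E_q[log p(x,z)/q(z)]
elbo_free_energy = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))   # energy + entropy H[q]
elbo_via_kl = np.log(evidence) - np.sum(q * np.log(q / posterior))     # log p(x) - KL(q || post.)

print(elbo_jensen, elbo_free_energy, elbo_via_kl)  # all three coincide
print(np.log(evidence))                            # and the evidence sits above them
```

In practice the energy-plus-entropy form is often the one implemented, since $\log p_\theta(x, z)$ is usually cheap to evaluate while the posterior needed for the KL form may not be available.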
Key Takeaway
The ELBO is a lower bound on $\log p_\theta(x)$ that depends on a free distribution $q$. Maximizing over $q$ at fixed $\theta$ recovers the posterior $p_\theta(z \mid x)$ and makes the bound tight; this is the E-step. Maximizing over $\theta$ at fixed $q$ is the M-step. EM is coordinate ascent on $\mathcal{L}(q, \theta)$.
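The takeaway can be checked end to end. Below is a minimal sketch of EM as coordinate ascent on $\mathcal{L}$ for a two-component 1-D Gaussian mixture; the synthetic data, initialization, and variable names are illustrative assumptions rather than anything specified above. After each E-step the ELBO equals the log-likelihood (the bound is tight), and the printed log-likelihood never decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two Gaussians (illustrative values only).
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 100)])

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x[:, None] - mu) / sigma) ** 2

# Initial theta = (weights, means, stds); assumes responsibilities stay strictly positive.
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def elbo(r, w, mu, sigma):
    """ELBO with q factorizing over points; r[n, k] = q_n(z_n = k)."""
    log_joint = np.log(w) + log_gauss(x, mu, sigma)          # log p(x_n, z_n = k)
    return np.sum(r * (log_joint - np.log(r)))

for it in range(10):
    # E-step: maximize the ELBO over q -> posterior responsibilities; the bound becomes tight.
    log_joint = np.log(w) + log_gauss(x, mu, sigma)
    log_marg = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)   # log p(x_n)
    r = np.exp(log_joint - log_marg)
    assert np.isclose(elbo(r, w, mu, sigma), log_marg.sum())           # tight after the E-step
    # M-step: maximize the ELBO over theta at fixed q (closed form for a Gaussian mixture).
    Nk = r.sum(axis=0)
    w, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    print(it, log_marg.sum())   # log-likelihood: never decreases across iterations
```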
The ELBO Gap: $\log p_\theta(x) - \mathcal{L}(q, \theta) = \mathrm{KL}\!\left(q \,\|\, p_\theta(z \mid x)\right)$
Visualize the lower-bound relationship in a simple 1-D toy problem. A single parameter $\theta$ controls the location of a 2-component Gaussian mixture; the slider varies the variational mean used for $q$. The tight bound is achieved when $q$ matches the true posterior.
The ELBO and the Bayes Modeler's Dream
Reading the decomposition $\mathcal{L}(q, \theta) = \log p_\theta(x) - \mathrm{KL}\!\left(q \,\|\, p_\theta(z \mid x)\right)$ the other way around: if we could maximize the ELBO jointly in $q$ and $\theta$, we would simultaneously drive $q$ toward the posterior (making the KL term small) and drive $\theta$ toward the MLE (making the first term large). This is the foundation of variational inference — when computing the exact posterior is intractable, restrict $q$ to a tractable family and optimize within it. Classical EM is the special case where the posterior is tractable, so the restriction is absent.
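Here is a tiny sketch of that restriction in its crudest form, under assumed illustrative numbers: fix a two-component mixture, and force every observation to share a single responsibility parameter $\varphi$ (the family, the data, and all names below are assumptions for illustration, not a real variational-inference recipe). The best member of the family still leaves a strictly positive gap, because no shared $\varphi$ can match every per-point posterior.

```python
import numpy as np

# Fixed two-component mixture and a handful of observations (all values illustrative).
w, mu, sigma = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([1.0, 1.0])
x = np.array([-2.5, -1.0, 0.3, 1.8, 2.6])

log_joint = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma**2)
             - 0.5 * ((x[:, None] - mu) / sigma) ** 2)        # log p(x_n, z_n = k)
log_evidence = np.logaddexp.reduce(log_joint, axis=1).sum()   # sum_n log p(x_n)

def elbo(phi):
    """Deliberately crude family: every point shares the same q(z_n = 1) = phi."""
    q = np.array([1.0 - phi, phi])
    return np.sum(q * (log_joint - np.log(q)))

best = max(elbo(phi) for phi in np.linspace(0.01, 0.99, 99))
print(log_evidence, best, log_evidence - best)   # the gap never closes within this family
```

Replacing the grid search with gradient ascent on $\varphi$, and the shared-parameter family with a richer one, is exactly the move variational inference makes.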
Common Mistake: Support Condition on $q$
Mistake:
Applying the ELBO decomposition with a $q$ that puts zero mass on a region where the posterior has positive mass — "forgetting" a latent state.
Correction:
Jensen's inequality requires $q$ to dominate the posterior's support; otherwise $\mathcal{L}(q, \theta)$ can be finite while the bound is never tight — the gap $\mathrm{KL}\!\left(q \,\|\, p_\theta(z \mid x)\right)$ stays bounded away from zero. In practice: initialize $q$ with full support (e.g., every responsibility strictly positive). Exact zeros that arise mid-iteration are absorbing — a component that loses all its mass will never recover.
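To see why exact zeros are absorbing, a minimal sketch under assumed illustrative numbers: once a mixture weight is exactly zero, the E-step assigns that component zero responsibility everywhere, and the M-step then returns a zero weight again.

```python
import numpy as np

x = np.array([-2.0, -0.5, 1.0, 2.5])                  # illustrative data
w = np.array([0.0, 1.0])                               # component 0 has lost all its mass
mu, sigma = np.array([0.0, 0.5]), np.array([1.0, 1.0])

for it in range(3):
    # E-step: responsibilities proportional to w_k * N(x_n | mu_k, sigma_k).
    lik = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = w * lik
    r /= r.sum(axis=1, keepdims=True)
    # M-step for the weights (mean/std updates omitted: they would be 0/0 for the dead component).
    w = r.mean(axis=0)
    print(it, w)   # stays [0., 1.] at every iteration: the zero is absorbing
```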
Quick Check
The ELBO $\mathcal{L}(q, \theta)$ equals $\log p_\theta(x)$ if and only if $q$ equals which distribution?
The prior
The joint
The posterior
The marginal
The gap equals $\mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right)$, which vanishes iff $q$ is the posterior (almost everywhere).