Jensen's Inequality and the Evidence Lower Bound

From a Stubborn $\log$ to a Tractable Bound

The obstacle in the previous section was precisely the logarithm of a sum. A classical trick in convex analysis, Jensen's inequality, transforms that $\log$-sum into a sum of $\log$s, at the cost of a lower bound. The bound is tight when we choose the mixing distribution correctly, and this choice recovers exactly the posterior over latent variables. The resulting lower bound is called the Evidence Lower Bound, or ELBO, and it is the object that EM maximizes.

Theorem: Jensen's Inequality (for the Logarithm)

Let $q(\mathbf{z})$ be any probability density over the latent space such that $q(\mathbf{z}) > 0$ wherever $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta}) > 0$. Then
$$\log f(\mathbf{y};\boldsymbol{\theta}) \;\geq\; \int q(\mathbf{z}) \log \frac{p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})}{q(\mathbf{z})}\,d\mathbf{z} \;=:\; \mathcal{F}(q,\boldsymbol{\theta}),$$
with equality if and only if $q(\mathbf{z}) = p(\mathbf{z}\mid\mathbf{y};\boldsymbol{\theta})$ almost everywhere.

To prove this, rewrite the marginal as an expectation, $f(\mathbf{y};\boldsymbol{\theta}) = \int q(\mathbf{z})\,\frac{p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})}{q(\mathbf{z})}\,d\mathbf{z} = \mathbb{E}_{\mathbf{Z}\sim q}\big[\frac{p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})}{q(\mathbf{Z})}\big]$, then pull the log inside the expectation. Because $\log$ is concave, the move costs us, but we can tune $q$ to make the gap zero.
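The bound can be checked numerically. Below is a minimal sketch with a single binary latent state and hypothetical joint probabilities: every choice of $q$ lands at or below the log-evidence, and the posterior closes the gap exactly.

```python
import math

# Toy discrete model (hypothetical numbers): joint p(y, z) over z in {0, 1}
# for one fixed observation y.
joint = [0.12, 0.28]                 # p(y, z=0), p(y, z=1)
log_evidence = math.log(sum(joint))  # log f(y)

def elbo(q):
    """F(q) = sum_z q(z) log( p(y,z) / q(z) ); terms with q(z) = 0 contribute 0."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

posterior = [p / sum(joint) for p in joint]

# Jensen: any q gives a lower bound; the posterior makes it tight.
for q0 in (0.1, 0.3, 0.5, 0.9):
    assert elbo([q0, 1 - q0]) <= log_evidence + 1e-12
assert abs(elbo(posterior) - log_evidence) < 1e-9
```

The guard `if qz > 0` implements the usual convention $0 \log 0 = 0$, so degenerate $q$'s are handled without special cases.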

Definition: Evidence Lower Bound (ELBO) / Free Energy

For a latent-variable model $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$ and any density $q(\mathbf{z})$, the evidence lower bound is
$$\mathcal{F}(q,\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{Z}\sim q}\!\big[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big] \;+\; \mathcal{H}(q),$$
where $\mathcal{H}(q) = -\int q(\mathbf{z})\log q(\mathbf{z})\,d\mathbf{z}$ is the differential entropy of $q$. Equivalently,
$$\mathcal{F}(q,\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta}) - D\!\big(q \,\Vert\, p(\cdot\mid\mathbf{y};\boldsymbol{\theta})\big).$$

The decomposition into an energy term and an entropy term gives $\mathcal{F}$ its alternative name, (negative) free energy, borrowed from statistical physics. Both viewpoints, the Jensen bound on a log-marginal and the free energy of a distribution, will be useful.
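The two forms of $\mathcal{F}$ in the definition can be checked against each other on the same kind of discrete toy model (the numbers below are hypothetical): energy plus entropy agrees with log-evidence minus KL for any $q$.

```python
import math

# Hypothetical two-state model: joint p(y, z) for z in {0, 1}.
joint = [0.12, 0.28]
evidence = sum(joint)
posterior = [p / evidence for p in joint]

q = [0.45, 0.55]  # an arbitrary variational distribution

# Form 1: expected complete-data log-likelihood ("energy") plus entropy H(q).
energy = sum(qz * math.log(pz) for qz, pz in zip(q, joint))
entropy = -sum(qz * math.log(qz) for qz in q)
elbo_energy_entropy = energy + entropy

# Form 2: log-evidence minus KL(q || posterior).
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))
elbo_evidence_kl = math.log(evidence) - kl

assert abs(elbo_energy_entropy - elbo_evidence_kl) < 1e-9
```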

Key Takeaway

The ELBO is a lower bound on $\log f(\mathbf{y};\boldsymbol{\theta})$ that depends on a free distribution $q$. Maximizing over $q$ at fixed $\boldsymbol{\theta}$ recovers the posterior and makes the bound tight; this is the E-step. Maximizing over $\boldsymbol{\theta}$ at fixed $q$ is the M-step. EM is coordinate ascent on $\mathcal{F}$.
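The coordinate-ascent picture can be made concrete. The toy EM below uses hypothetical data and fixes the two Gaussian components at $\pm 2$ with unit variance, so only the mixing weight $\pi$ is learned: each E-step makes the bound tight, each M-step raises it, and the log-likelihood never decreases.

```python
import math

# Hypothetical data: mixture pi * N(-2, 1) + (1 - pi) * N(2, 1), unknown pi.
data = [-2.3, -1.9, -2.5, 1.8, 2.2, 2.4, 2.1]

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_lik(pi):
    return sum(math.log(pi * norm_pdf(x, -2) + (1 - pi) * norm_pdf(x, 2))
               for x in data)

pi, lls = 0.5, []
for _ in range(20):
    # E-step: q(z_i = 1) = posterior responsibility (closes the Jensen gap).
    r = [pi * norm_pdf(x, -2) /
         (pi * norm_pdf(x, -2) + (1 - pi) * norm_pdf(x, 2)) for x in data]
    # M-step: maximize F over pi with q held fixed.
    pi = sum(r) / len(data)
    lls.append(log_lik(pi))

# Coordinate ascent on F implies a monotone log-likelihood.
assert all(b >= a - 1e-12 for a, b in zip(lls, lls[1:]))
```

With the components this well separated, $\pi$ settles near $3/7$, the fraction of points drawn from the left component.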

The ELBO Gap: $\log f(\mathbf{y};\boldsymbol{\theta}) - \mathcal{F}(q,\boldsymbol{\theta})$

Visualize the lower-bound relationship in a simple 1-D toy problem. A single parameter $\mu$ controls the location of a 2-component Gaussian mixture; the slider varies the variational mean used for $q$. The tight bound is achieved when $q$ matches the true posterior.
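A non-interactive version of the same picture, under assumed values for $y$ and $\mu$: sweeping the variational parameter traces out the gap $\log f(\mathbf{y};\boldsymbol{\theta}) - \mathcal{F}(q,\boldsymbol{\theta}) = D(q\Vert p(\cdot\mid\mathbf{y};\boldsymbol{\theta}))$, which is nonnegative everywhere and vanishes exactly at the posterior.

```python
import math

# Hypothetical setup: one observation y from 0.5*N(-mu, 1) + 0.5*N(mu, 1);
# the variational distribution is q(z = 1) = rho.
y, mu = 0.8, 1.0

def norm_pdf(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

w1 = 0.5 * norm_pdf(y, mu)
w0 = 0.5 * norm_pdf(y, -mu)
post = w1 / (w0 + w1)  # exact posterior q(z = 1 | y)

def gap(rho):
    """KL(q || posterior) = log f(y) - F(q): zero iff q is the posterior."""
    return (rho * math.log(rho / post)
            + (1 - rho) * math.log((1 - rho) / (1 - post)))

assert all(gap(r / 20) >= 0 for r in range(1, 20))
assert gap(post) < 1e-12
```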


The ELBO and the Bayes Modeler's Dream

Reading $\mathcal{F}(q,\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta}) - D(q\Vert p(\cdot\mid\mathbf{y};\boldsymbol{\theta}))$ the other way around: if we could maximize the ELBO jointly in $q$ and $\boldsymbol{\theta}$, we would simultaneously drive $q$ toward the posterior (making the KL term small) and drive $\boldsymbol{\theta}$ toward the MLE (making the first term large). This is the foundation of variational inference: when computing the exact posterior is intractable, restrict $q$ to a tractable family and optimize within it. Classical EM is the special case where the posterior is tractable, so the restriction is absent.
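A small sketch of what the restriction can cost, using a hypothetical correlated posterior over two binary latents: no fully factorized (mean-field) $q$ drives the KL term to zero, so the best ELBO within that family sits strictly below the evidence.

```python
import math
from itertools import product

# Hypothetical posterior over two binary latents with strong correlation,
# so it is not a product of its marginals.
post = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def kl_to_post(a, b):
    """KL( q1 x q2 || post ) for factorized Bernoulli marginals a, b."""
    kl = 0.0
    for z1, z2 in product((0, 1), repeat=2):
        q = (a if z1 else 1 - a) * (b if z2 else 1 - b)
        if q > 0:
            kl += q * math.log(q / post[(z1, z2)])
    return kl

# Grid search over the factorized family: the smallest achievable KL is
# still bounded away from zero, so the ELBO cannot reach log f(y).
best = min(kl_to_post(a / 100, b / 100)
           for a in range(1, 100) for b in range(1, 100))
assert best > 0.01
```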

Common Mistake: Support Condition on $q$

Mistake:

Applying the ELBO decomposition with a $q$ that puts zero mass on a region where the posterior has positive mass, i.e., "forgetting" a latent state.

Correction:

Such a $q$ still yields a valid lower bound (the regions where $q(\mathbf{z}) = 0$ simply contribute nothing), but the bound can never be tight: $D(q\Vert p(\cdot\mid\mathbf{y}))$ is strictly positive for every $q$ that misses part of the posterior's support, so the ELBO stays strictly below the evidence no matter how $\boldsymbol{\theta}$ is tuned. In practice: initialize $q$ with full support (e.g., every responsibility strictly positive). Exact zeros that arise mid-iteration are absorbing: a component that loses all its mass will never recover.
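The absorbing-zero behavior is easy to reproduce. In the sketch below (hypothetical data, components fixed at $\pm 2$), a component whose mixing weight has hit exactly zero receives zero responsibility from every point, so the M-step returns zero forever.

```python
import math

# Hypothetical 1-D data drawn near both components.
data = [-2.0, -1.5, 1.5, 2.0]

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

pi = 0.0  # weight of the component at mu = -2 has collapsed to zero
for _ in range(5):
    # E-step: responsibilities for the collapsed component are all zero ...
    r = [pi * norm_pdf(x, -2) /
         (pi * norm_pdf(x, -2) + (1 - pi) * norm_pdf(x, 2)) for x in data]
    # ... so the M-step returns zero again: the zero is absorbing.
    pi = sum(r) / len(data)

assert pi == 0.0  # the component never recovers
```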

Quick Check

The ELBO $\mathcal{F}(q,\boldsymbol{\theta})$ equals $\log f(\mathbf{y};\boldsymbol{\theta})$ if and only if $q$ equals which distribution?

The prior $p(\mathbf{z};\boldsymbol{\theta})$

The joint $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$

The posterior $p(\mathbf{z}\mid\mathbf{y};\boldsymbol{\theta})$

The marginal $f(\mathbf{y};\boldsymbol{\theta})$