The EM Algorithm and Monotonicity

Coordinate Ascent on $\mathcal{F}$

The ELBO $\mathcal{F}(q,\boldsymbol{\theta})$ has two arguments. Coordinate ascent is the natural strategy: fix one and maximize over the other, then swap. Fixing $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$ and optimizing $q$ gives the E-step; the optimum is the posterior. Fixing $q$ and optimizing $\boldsymbol{\theta}$ gives the M-step; because the entropy term does not depend on $\boldsymbol{\theta}$, this reduces to maximizing the expected complete-data log-likelihood. Iterating produces a sequence $\boldsymbol{\theta}^{(0)},\boldsymbol{\theta}^{(1)},\ldots$ whose incomplete-data log-likelihood is monotonically non-decreasing.
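Both facts follow from the two ways of writing the bound (a sketch, assuming the ELBO is defined as usual as the expected complete-data log-likelihood plus entropy):

$$
\mathcal{F}(q,\boldsymbol{\theta})
\;=\; \mathbb{E}_{q}\big[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big] + H(q)
\;=\; \log f(\mathbf{y};\boldsymbol{\theta}) \;-\; D\big(q \,\Vert\, p(\cdot\mid\mathbf{y};\boldsymbol{\theta})\big).
$$

The first form makes the M-step transparent, since $H(q)$ is constant in $\boldsymbol{\theta}$; the second makes the E-step transparent, since the KL term is driven to zero exactly when $q$ is the posterior.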

EM Algorithm

Complexity: per iteration, one posterior inference (E-step) plus one constrained optimization (M-step). In exponential-family models both steps are available in closed form.
Input: observations $\mathbf{y}$, model $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$, initialization $\boldsymbol{\theta}^{(0)}$, tolerance $\varepsilon$
Output: (local) maximizer $\hat{\boldsymbol{\theta}}$
1. $t \leftarrow 0$
2. repeat
3. $\quad$ E-step: compute the posterior $q^{(t+1)}(\mathbf{z}) \leftarrow p(\mathbf{z}\mid\mathbf{y};\boldsymbol{\theta}^{(t)})$ and form
4. $\quad\quad\quad Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)}) \leftarrow \mathbb{E}_{\mathbf{Z}\mid\mathbf{y},\boldsymbol{\theta}^{(t)}}\!\big[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big]$
5. $\quad$ M-step: $\boldsymbol{\theta}^{(t+1)} \leftarrow \arg\max_{\boldsymbol{\theta}\in\Lambda} Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)})$
6. $\quad t \leftarrow t+1$
7. until $\big|\log f(\mathbf{y};\boldsymbol{\theta}^{(t)}) - \log f(\mathbf{y};\boldsymbol{\theta}^{(t-1)})\big| < \varepsilon$
8. return $\boldsymbol{\theta}^{(t)}$

In practice one monitors the incomplete-data log-likelihood directly; it is cheap to evaluate given the posterior computed in the E-step.
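As a concrete instance, here is a minimal sketch of this loop for a two-component 1-D Gaussian mixture, where both steps are closed-form. The function name em_gmm1d, its defaults, and the initialization scheme are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def em_gmm1d(y, n_iter=200, tol=1e-6, seed=0):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch).

    E-step: responsibilities gamma[i, k] = p(z_i = k | y_i; theta^(t)).
    M-step: closed-form weighted MLE updates of weights, means, variances.
    Returns the parameters and the incomplete-data log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Asymmetric initialization: two distinct data points as starting means.
    w = np.array([0.5, 0.5])
    mu = rng.choice(y, size=2, replace=False)
    var = np.full(2, y.var())

    ll_trace = []
    for _ in range(n_iter):
        # E-step: log p(y_i, z_i = k; theta) for each component, then normalize.
        log_joint = (np.log(w)
                     - 0.5 * np.log(2.0 * np.pi * var)
                     - 0.5 * (y[:, None] - mu) ** 2 / var)          # shape (n, 2)
        log_marg = np.logaddexp(log_joint[:, 0], log_joint[:, 1])   # log f(y_i; theta)
        gamma = np.exp(log_joint - log_marg[:, None])               # responsibilities

        # Incomplete-data log-likelihood: cheap, reuses the E-step quantities.
        ll_trace.append(log_marg.sum())

        # M-step: weighted maximum-likelihood updates.
        nk = gamma.sum(axis=0)                                      # effective counts
        w = nk / n
        mu = (gamma * y[:, None]).sum(axis=0) / nk
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / nk

        # Stop on a small relative change in the log-likelihood.
        if len(ll_trace) > 1 and abs(ll_trace[-1] - ll_trace[-2]) < tol * abs(ll_trace[-2]):
            break

    return (w, mu, var), np.array(ll_trace)
```

Working in log space with logaddexp keeps the E-step numerically stable when the components are far apart.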

Theorem: Monotonic Improvement of the Log-Likelihood

Let $\boldsymbol{\theta}^{(t+1)}$ be produced from $\boldsymbol{\theta}^{(t)}$ by one iteration of EM. Then $\log f(\mathbf{y};\boldsymbol{\theta}^{(t+1)}) \;\geq\; \log f(\mathbf{y};\boldsymbol{\theta}^{(t)})$, with equality only if $\boldsymbol{\theta}^{(t)}$ already maximizes $Q(\,\cdot\,\mid\boldsymbol{\theta}^{(t)})$.

The E-step closes the Jensen gap at $\boldsymbol{\theta}^{(t)}$, making $\mathcal{F}$ tangent to $\ell$ from below. The M-step then walks uphill on $\mathcal{F}$, which necessarily means uphill on $\ell$.
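Written out, one iteration chains three elementary facts (with $q^{(t+1)}$ the E-step posterior):

$$
\log f(\mathbf{y};\boldsymbol{\theta}^{(t+1)})
\;\geq\; \mathcal{F}\big(q^{(t+1)},\boldsymbol{\theta}^{(t+1)}\big)
\;\geq\; \mathcal{F}\big(q^{(t+1)},\boldsymbol{\theta}^{(t)}\big)
\;=\; \log f(\mathbf{y};\boldsymbol{\theta}^{(t)}).
$$

The first inequality is the bound property of the ELBO, the second is the M-step's maximization over $\boldsymbol{\theta}$, and the final equality is the E-step closing the Jensen gap.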

Theorem: Stationary Points of EM

If $\boldsymbol{\theta}^{\star}$ is a fixed point of the EM update (i.e. $\boldsymbol{\theta}^{\star} \in \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{\star})$ and the arg-max is interior to $\Lambda$), then $\boldsymbol{\theta}^{\star}$ is a stationary point of $\ell(\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta})$: $\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{\star}) = \mathbf{0}$.
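The reason is that the bound and the objective share a gradient where they touch: under the usual regularity conditions allowing differentiation under the expectation,

$$
\nabla_{\boldsymbol{\theta}}\,\ell(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{\star}}
\;=\;
\nabla_{\boldsymbol{\theta}}\,Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{\star})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{\star}},
$$

so an interior maximizer of $Q(\cdot\mid\boldsymbol{\theta}^{\star})$ at $\boldsymbol{\theta}^{\star}$ makes the right-hand side zero, and hence the left-hand side as well.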

Local Maxima, Saddle Points, and Plateaus

Monotonicity, combined with the log-likelihood being bounded above on $\Lambda$, guarantees that $\ell(\boldsymbol{\theta}^{(t)})$ converges. But convergence of the iterates $\boldsymbol{\theta}^{(t)}$ is not automatic: the sequence can wander on a plateau, and even when it converges it may land on a local maximum, a saddle point, or (pathologically) drift to the boundary of $\Lambda$. Classical results (Wu 1983) establish convergence to the set of stationary points under regularity conditions. The practical remedy, as for any non-convex optimization, is to run EM from multiple random initializations and keep the best local maximum.

Monotonic Log-Likelihood During EM Iterations

Run EM on a synthetic 2-component 1-D mixture and track the incomplete-data log-likelihood against iteration number. Vary the noise level and initial guess to see convergence speed and local-maximum behaviour.

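A short script in the spirit of this demo, reusing the hypothetical em_gmm1d sketch from above; the mixture parameters, sample size, and seeds are arbitrary choices, and varying them changes the convergence speed and which local maximum is found.

```python
import numpy as np

# Synthetic two-component 1-D mixture: 60% N(-2, 0.5^2), 40% N(1, 0.8^2).
rng = np.random.default_rng(1)
n = 500
z = rng.random(n) < 0.6
y = np.where(z, rng.normal(-2.0, 0.5, n), rng.normal(1.0, 0.8, n))

params, ll = em_gmm1d(y, n_iter=100, tol=1e-8, seed=3)

# Monotonicity check: each EM iteration can only improve the incomplete-data
# log-likelihood (up to ~1e-12 of numerical roundoff).
print(np.all(np.diff(ll) >= -1e-9))                    # expect True
print(f"{ll.size} iterations, final log-likelihood {ll[-1]:.3f}")
```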
⚠️ Engineering Note

Stopping Criteria in Practice

A relative change in log-likelihood is the standard stopping rule, but it can fire prematurely on plateaus near saddle points. Production implementations commonly use a hybrid criterion: stop when the relative log-likelihood change AND the parameter change (in some norm) both fall below tolerance, subject to a maximum iteration cap; a sketch follows the list below. For mixture models, also monitor the minimum responsibility mass per component: a component with $\sum_i \gamma_{ik} < 1$ carries less than one effective observation and is about to collapse.

Practical Constraints
  • Typical relative tolerance: $10^{-6}$ for double precision

  • Typical maximum iterations: 100-500 for well-conditioned problems

  • Log-likelihood can decrease by numerical roundoff ($\sim 10^{-12}$); do not alarm on this
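As one concrete reading of the hybrid rule described above, a helper along these lines could gate the loop; the function name and the relative normalizations are illustrative assumptions, and the default thresholds mirror the typical values listed.

```python
import numpy as np

def should_stop(ll_new, ll_old, theta_new, theta_old, it,
                rel_tol=1e-6, param_tol=1e-6, max_iter=500):
    """Hybrid stopping rule: require BOTH a small relative log-likelihood
    change and a small relative parameter change, with a hard iteration cap."""
    if it >= max_iter:
        return True
    rel_ll = abs(ll_new - ll_old) / max(abs(ll_old), 1.0)
    rel_theta = (np.linalg.norm(theta_new - theta_old)
                 / max(np.linalg.norm(theta_old), 1.0))
    return rel_ll < rel_tol and rel_theta < param_tol
```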

Common Mistake: Initialization Matters — A Lot

Mistake:

Initializing EM at the all-equal parameter vector (e.g., all means at the global sample mean) and expecting it to separate components automatically.

Correction:

Symmetric initializations are fixed points of the EM update: the iterations will leave them unchanged. Break symmetry deliberately, with random initialization within the data range or $k$-means++ seeding for GMMs. Run several restarts and keep the one with the highest log-likelihood (see the sketch below).
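One way to follow this advice, sketched around the hypothetical em_gmm1d function above (plain random restarts; a k-means++ seeding routine would slot into the initialization in the same way):

```python
import numpy as np

def em_with_restarts(y, n_restarts=10, **em_kwargs):
    """Run EM from several random initializations; keep the best local maximum."""
    best_params, best_ll = None, -np.inf
    for seed in range(n_restarts):
        params, ll = em_gmm1d(y, seed=seed, **em_kwargs)  # seed controls the init
        if ll[-1] > best_ll:                              # compare final log-likelihoods
            best_params, best_ll = params, ll[-1]
    return best_params, best_ll
```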

Quick Check

Which statement about EM convergence is true?

EM always converges to the global maximum of the likelihood.

The log-likelihood is non-decreasing at every EM iteration.

EM converges in a finite number of iterations.

EM updates always decrease the KL divergence $D(q\Vert p(\cdot\mid\mathbf{y};\boldsymbol{\theta}))$.

EM as Iterative Lower-Bound Maximization

Each EM iteration rebuilds the ELBO tangent to $\ell(\boldsymbol{\theta})$ at the current iterate (E-step) and then climbs the bound (M-step). The log-likelihood $\log f(\mathbf{y};\boldsymbol{\theta})$ never decreases.