The EM Algorithm and Monotonicity
Coordinate Ascent on $\mathcal{F}(q, \boldsymbol{\theta})$
The ELBO $\mathcal{F}(q, \boldsymbol{\theta})$ has two arguments. Coordinate ascent is the natural strategy: fix one and maximize over the other, then swap. Fixing $\boldsymbol{\theta}$ and optimizing over $q$ gives the E-step; the optimum is the posterior $p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})$. Fixing $q$ and optimizing over $\boldsymbol{\theta}$ gives the M-step; because the entropy term does not depend on $\boldsymbol{\theta}$, this reduces to maximizing the expected complete-data log-likelihood $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$. Iterating produces a sequence $\boldsymbol{\theta}^{(0)}, \boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)}, \dots$ whose incomplete-data log-likelihood $\ell(\boldsymbol{\theta}^{(t)})$ is monotonically non-decreasing.
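Written out in the section's notation, one EM sweep is the following pair of coordinate updates (assuming, as above, that the posterior can be computed exactly):

$$
\begin{aligned}
\textbf{E-step:}\quad q^{(t)} &= \arg\max_{q}\; \mathcal{F}\bigl(q, \boldsymbol{\theta}^{(t)}\bigr) = p\bigl(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)}\bigr),\\
\textbf{M-step:}\quad \boldsymbol{\theta}^{(t+1)} &= \arg\max_{\boldsymbol{\theta}}\; \mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}\bigr) = \arg\max_{\boldsymbol{\theta}}\; \mathbb{E}_{q^{(t)}}\bigl[\log p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})\bigr].
\end{aligned}
$$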
EM Algorithm
Complexity: each iteration costs one posterior inference (E-step) plus one constrained optimization over $\boldsymbol{\theta}$ (M-step). In exponential-family models both steps are available in closed form. In practice one monitors the incomplete-data log-likelihood directly; it is cheap to evaluate given the posterior computed in the E-step.
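To make the two steps and the cheap log-likelihood monitoring concrete, here is a minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture (the same model as the demonstration later in this section). The function name `em_gmm_1d` and its interface are illustrative choices, not from the text.

```python
import numpy as np

def em_gmm_1d(x, mu_init, sigma_init, pi_init, n_iters=200):
    """EM for a 2-component 1-D Gaussian mixture (illustrative sketch).

    Returns fitted parameters and the incomplete-data log-likelihood
    trace, one value per iteration.
    """
    mu = np.array(mu_init, float)
    sigma = np.array(sigma_init, float)
    pi = np.array(pi_init, float)
    loglik_trace = []
    for _ in range(n_iters):
        # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta)
        log_comp = (np.log(pi)
                    - 0.5 * np.log(2 * np.pi * sigma ** 2)
                    - 0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        log_norm = np.logaddexp(log_comp[:, 0], log_comp[:, 1])  # log p(x_n; theta)
        r = np.exp(log_comp - log_norm[:, None])

        # Incomplete-data log-likelihood: cheap once log_norm is available
        loglik_trace.append(log_norm.sum())

        # M-step: closed-form maximization of Q(theta | theta^(t))
        N_k = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / N_k
        sigma = np.sqrt((r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / N_k)
        pi = N_k / len(x)
    return mu, sigma, pi, np.array(loglik_trace)
```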
Theorem: Monotonic Improvement of the Log-Likelihood
Let $\boldsymbol{\theta}^{(t+1)}$ be produced from $\boldsymbol{\theta}^{(t)}$ by one iteration of EM. Then $\ell(\boldsymbol{\theta}^{(t+1)}) \ge \ell(\boldsymbol{\theta}^{(t)})$, with equality if and only if $\boldsymbol{\theta}^{(t)}$ is already a stationary point of $\ell$.
The E-step closes the Jensen gap at $\boldsymbol{\theta}^{(t)}$, making $\mathcal{F}(q^{(t)}, \cdot)$ tangent to $\ell$ from below. The M-step then walks uphill on $\mathcal{F}(q^{(t)}, \cdot)$, which necessarily means uphill on $\ell$.
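In symbols, the picture is

$$
\mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}\bigr) \le \ell(\boldsymbol{\theta}) \quad \text{for all } \boldsymbol{\theta}, \qquad \mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t)}\bigr) = \ell\bigl(\boldsymbol{\theta}^{(t)}\bigr),
$$

so any $\boldsymbol{\theta}$ that raises $\mathcal{F}(q^{(t)}, \cdot)$ above its value at $\boldsymbol{\theta}^{(t)}$ must also raise $\ell$.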
Use the ELBO identity $\ell(\boldsymbol{\theta}) = \mathcal{F}(q, \boldsymbol{\theta}) + \mathrm{KL}\bigl(q \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\bigr)$.
Pick $q^{(t)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})$ (the E-step distribution).
Relate $\mathcal{F}(q^{(t)}, \boldsymbol{\theta})$ to $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ plus a constant in $\boldsymbol{\theta}$.
ELBO identity
For any $\boldsymbol{\theta}$ and any valid distribution $q(\mathbf{z})$,
$$
\ell(\boldsymbol{\theta}) = \mathcal{F}(q, \boldsymbol{\theta}) + \mathrm{KL}\bigl(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\bigr).
$$
The KL term is non-negative, and hence $\mathcal{F}(q, \boldsymbol{\theta}) \le \ell(\boldsymbol{\theta})$.
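For completeness, the identity follows in one line, using $\mathcal{F}(q, \boldsymbol{\theta}) = \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\mathbf{z})]$ (the expected complete-data log-likelihood plus the entropy term referred to above):

$$
\mathcal{F}(q, \boldsymbol{\theta}) + \mathrm{KL}\bigl(q \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\bigr)
= \mathbb{E}_q\!\left[\log \frac{p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})}{q(\mathbf{z})}\right]
+ \mathbb{E}_q\!\left[\log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\right]
= \mathbb{E}_q\!\left[\log p(\mathbf{x}; \boldsymbol{\theta})\right]
= \ell(\boldsymbol{\theta}).
$$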
E-step: tight bound at $\boldsymbol{\theta}^{(t)}$
Choose $q^{(t)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})$. The KL term vanishes at $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$, so
$$
\mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t)}\bigr) = \ell\bigl(\boldsymbol{\theta}^{(t)}\bigr).
$$
Moreover, because the entropy of $q^{(t)}$ does not depend on $\boldsymbol{\theta}$, maximizing $\mathcal{F}(q^{(t)}, \boldsymbol{\theta})$ over $\boldsymbol{\theta}$ is the same as maximizing $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{q^{(t)}}\bigl[\log p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})\bigr]$.
M-step: ascent on $\mathcal{F}$
The M-step produces $\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$, hence
$$
\mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t+1)}\bigr) \ge \mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t)}\bigr) = \ell\bigl(\boldsymbol{\theta}^{(t)}\bigr).
$$
Chain the inequalities
Using the ELBO identity again at $\boldsymbol{\theta}^{(t+1)}$,
$$
\ell\bigl(\boldsymbol{\theta}^{(t+1)}\bigr)
= \mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t+1)}\bigr) + \mathrm{KL}\bigl(q^{(t)} \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t+1)})\bigr)
\ge \mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t+1)}\bigr)
\ge \mathcal{F}\bigl(q^{(t)}, \boldsymbol{\theta}^{(t)}\bigr)
= \ell\bigl(\boldsymbol{\theta}^{(t)}\bigr).
$$
Equality throughout requires both that the M-step makes no progress (so $\boldsymbol{\theta}^{(t)}$ is a stationary point of $Q(\cdot \mid \boldsymbol{\theta}^{(t)})$) and that the KL term vanishes at $\boldsymbol{\theta}^{(t+1)}$.
Theorem: Stationary Points of EM
If $\boldsymbol{\theta}^\star$ is a fixed point of the EM update (i.e. $\boldsymbol{\theta}^\star = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^\star)$ and the arg-max is interior to the parameter space $\Theta$), then $\boldsymbol{\theta}^\star$ is a stationary point of $\ell$: $\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^\star) = \mathbf{0}$.
Interchange expectation and differentiation; assume regularity.
Differentiate the marginal likelihood to obtain Fisher's identity $\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}) = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\bigl[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})\bigr]$.
Fisher's identity
Write the marginal as $p(\mathbf{x}; \boldsymbol{\theta}) = \int p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})\, d\mathbf{z}$ and differentiate under the integral sign:
$$
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta})
= \frac{\nabla_{\boldsymbol{\theta}}\, p(\mathbf{x}; \boldsymbol{\theta})}{p(\mathbf{x}; \boldsymbol{\theta})}
= \int \frac{p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})}{p(\mathbf{x}; \boldsymbol{\theta})}\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})\, d\mathbf{z}
= \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\bigl[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta})\bigr].
$$
This is Fisher's identity: the score of the incomplete data equals the posterior-averaged score of the complete data.
Connect to $Q$
Evaluated at $\boldsymbol{\theta} = \boldsymbol{\theta}^\star$, the right-hand side is $\nabla_{\boldsymbol{\theta}}\, Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^\star)\big|_{\boldsymbol{\theta} = \boldsymbol{\theta}^\star}$. By hypothesis, $\boldsymbol{\theta}^\star$ is an interior maximizer of $Q(\cdot \mid \boldsymbol{\theta}^\star)$, so this gradient is zero. Therefore $\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}^\star) = \mathbf{0}$.
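Fisher's identity is easy to sanity-check numerically. The sketch below compares a finite-difference gradient of $\ell$ with the posterior-averaged complete-data score on a toy two-component mixture with fixed weights and variances and unknown means; the data, parameter values, and variable names are illustrative only.

```python
import numpy as np

# Toy check of Fisher's identity: 2-component 1-D Gaussian mixture with
# fixed weights w and standard deviations sigma; unknown means mu.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)
w = np.array([0.4, 0.6])
sigma = np.array([1.0, 1.5])

def log_joint(mu):
    """log p(x_n, z_n = k; mu) for every n and k -> shape (N, 2)."""
    return (np.log(w)
            - 0.5 * np.log(2 * np.pi * sigma ** 2)
            - 0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)

def loglik(mu):
    """Incomplete-data log-likelihood ell(mu) = sum_n log p(x_n; mu)."""
    lj = log_joint(mu)
    return np.logaddexp(lj[:, 0], lj[:, 1]).sum()

mu = np.array([-0.5, 1.0])

# Left-hand side: finite-difference gradient of ell at mu.
eps = 1e-6
fd_grad = np.array([(loglik(mu + eps * np.eye(2)[k])
                     - loglik(mu - eps * np.eye(2)[k])) / (2 * eps)
                    for k in range(2)])

# Right-hand side: E_{p(z|x,mu)}[ d/dmu log p(x, z; mu) ].
lj = log_joint(mu)
resp = np.exp(lj - np.logaddexp(lj[:, 0], lj[:, 1])[:, None])  # responsibilities
score = (x[:, None] - mu[None, :]) / sigma ** 2                # complete-data score w.r.t. mu_k
fisher_grad = (resp * score).sum(axis=0)

print(fd_grad, fisher_grad)  # the two gradients should agree to ~1e-6
```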
Local Maxima, Saddle Points, and Plateaus
Monotonicity plus boundedness of the likelihood (when the log-likelihood is bounded above on $\Theta$) guarantees that the sequence $\ell(\boldsymbol{\theta}^{(t)})$ converges. But convergence of the iterates $\boldsymbol{\theta}^{(t)}$ is not automatic: the sequence can wander on a plateau, and even when it converges it may land on a local maximum, a saddle point, or (pathologically) drift to the boundary of $\Theta$. Classical results (Wu, 1983) establish convergence to the set of stationary points under regularity conditions. The practical remedy, as for any non-convex optimization, is to run EM from multiple random initializations and keep the best local maximum.
Monotonic Log-Likelihood During EM Iterations
Run EM on a synthetic 2-component 1-D mixture and track the incomplete-data log-likelihood against iteration number. Vary the noise level and initial guess to see convergence speed and local-maximum behaviour.
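A non-interactive version of this experiment can be scripted in a few lines, reusing the `em_gmm_1d` sketch from earlier in the section; the initial guesses and random seed below are arbitrary illustrations.

```python
import numpy as np

# Synthetic 2-component 1-D mixture data.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.7, 300), rng.normal(1.5, 1.0, 700)])

# One reasonable and one deliberately poor initial guess for the means.
for mu0 in ([-1.0, 1.0], [4.0, 4.5]):
    mu, sigma, pi, trace = em_gmm_1d(x, mu0, [1.0, 1.0], [0.5, 0.5], n_iters=100)
    assert np.all(np.diff(trace) >= -1e-9)  # non-decreasing up to roundoff
    print(f"init {mu0}: final log-likelihood {trace[-1]:.2f}")
```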
Stopping Criteria in Practice
A relative change in log-likelihood is the standard stopping rule, but it can fire prematurely on plateaus near saddle points. Production implementations commonly use a hybrid criterion: stop when the relative log-likelihood change AND the parameter change (in some norm) both fall below tolerance, subject to a maximum iteration cap. For mixture models, also monitor the minimum responsibility mass per component: a component whose total responsibility mass (summed over data points) approaches zero is about to collapse. A minimal sketch of the hybrid check follows the list below.
- Typical relative tolerance: on the order of $10^{-6}$ for double precision
- Typical maximum iterations: 100-500 for well-conditioned problems
- The log-likelihood can decrease slightly due to numerical roundoff (on the order of machine precision); do not treat this as an error
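As referenced above, here is a minimal sketch of such a hybrid check. The tolerances, the names `converged` and `em_step`, and the surrounding loop structure are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def converged(loglik_prev, loglik_new, theta_prev, theta_new,
              rel_tol=1e-6, param_tol=1e-6):
    """Hybrid stopping test: relative log-likelihood change AND
    parameter change must both fall below tolerance."""
    rel_change = abs(loglik_new - loglik_prev) / (abs(loglik_prev) + 1e-12)
    param_change = np.linalg.norm(theta_new - theta_prev)
    return rel_change < rel_tol and param_change < param_tol

# Usage inside an EM loop (pseudostructure; em_step is a placeholder
# for one E-step + M-step of whatever model is being fitted):
#   for t in range(max_iters):                        # hard iteration cap
#       theta_new, loglik_new = em_step(x, theta)
#       if loglik_new < loglik - 1e-8 * abs(loglik):  # larger than roundoff: likely a bug
#           raise RuntimeError("log-likelihood decreased")
#       if converged(loglik, loglik_new, theta, theta_new):
#           break
#       theta, loglik = theta_new, loglik_new
```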
Common Mistake: Initialization Matters — A Lot
Mistake:
Initializing EM at the all-equal parameter vector (e.g., all means at the global sample mean) and expecting it to separate components automatically.
Correction:
Symmetric initializations are fixed points of the EM update: the updates will leave them unchanged. Break symmetry deliberately: random initialization within the data range, or $k$-means++ seeding for GMMs. Run several restarts and keep the one with the highest log-likelihood.
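A minimal restart loop along these lines, again reusing the `em_gmm_1d` sketch from earlier in the section (the restart count and initialization scheme are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_best_of(x, n_restarts=10):
    """Run EM from several random initializations; keep the best fit."""
    best = None
    for _ in range(n_restarts):
        mu0 = rng.uniform(x.min(), x.max(), size=2)     # symmetry-breaking init
        fit = em_gmm_1d(x, mu0, [x.std(), x.std()], [0.5, 0.5], n_iters=200)
        if best is None or fit[-1][-1] > best[-1][-1]:  # compare final log-likelihoods
            best = fit
    return best  # (mu, sigma, pi, loglik_trace) of the best restart
```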
Quick Check
Which statement about EM convergence is true?
EM always converges to the global maximum of the likelihood.
The log-likelihood is non-decreasing at every EM iteration.
EM converges in a finite number of iterations.
EM updates always decrease the KL divergence $\mathrm{KL}\bigl(q \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\bigr)$.
This is the monotonicity theorem — the defining feature of EM.