The EM Algorithm and Monotonicity

Coordinate Ascent on $\mathcal{F}$

The ELBO $\mathcal{F}(q,\boldsymbol{\theta})$ has two arguments. Coordinate ascent is the natural strategy: fix one and maximize over the other, then swap. Fixing $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$ and optimizing $q$ gives the E-step; the optimum is the posterior. Fixing $q$ and optimizing $\boldsymbol{\theta}$ gives the M-step; because the entropy term does not depend on $\boldsymbol{\theta}$, this reduces to maximizing the expected complete-data log-likelihood. Iterating produces a sequence $\boldsymbol{\theta}^{(0)},\boldsymbol{\theta}^{(1)},\ldots$ whose incomplete-data log-likelihood is monotonically non-decreasing.
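Both facts follow from the two ways of writing the bound (a sketch, assuming the ELBO is defined as usual as the expected complete-data log-likelihood plus entropy):

$$
\mathcal{F}(q,\boldsymbol{\theta})
\;=\; \mathbb{E}_{q}\big[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big] + H(q)
\;=\; \log f(\mathbf{y};\boldsymbol{\theta}) \;-\; D\big(q \,\Vert\, p(\cdot\mid\mathbf{y};\boldsymbol{\theta})\big).
$$

The first form makes the M-step transparent, since $H(q)$ is constant in $\boldsymbol{\theta}$; the second makes the E-step transparent, since the KL term is driven to zero exactly when $q$ is the posterior.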

EM Algorithm

Complexity: per iteration, one posterior inference (E-step) plus one constrained optimization (M-step). In exponential-family models both steps are available in closed form.
Input: observations $\mathbf{y}$, model $p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$, initialization $\boldsymbol{\theta}^{(0)}$, tolerance $\varepsilon$
Output: (local) maximizer $\hat{\boldsymbol{\theta}}$
1. $t \leftarrow 0$
2. repeat
3. $\quad$ E-step: compute the posterior $q^{(t+1)}(\mathbf{z}) \leftarrow p(\mathbf{z}\mid\mathbf{y};\boldsymbol{\theta}^{(t)})$ and form
4. $\quad\quad\quad Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)}) \leftarrow \mathbb{E}_{\mathbf{Z}\mid\mathbf{y},\boldsymbol{\theta}^{(t)}}\!\big[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big]$
5. $\quad$ M-step: $\boldsymbol{\theta}^{(t+1)} \leftarrow \arg\max_{\boldsymbol{\theta}\in\Lambda} Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)})$
6. $\quad t \leftarrow t+1$
7. until $\big|\log f(\mathbf{y};\boldsymbol{\theta}^{(t)}) - \log f(\mathbf{y};\boldsymbol{\theta}^{(t-1)})\big| < \varepsilon$
8. return $\boldsymbol{\theta}^{(t)}$

In practice one monitors the incomplete-data log-likelihood directly; it is cheap to evaluate given the posterior computed in the E-step.
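As a concrete instance, here is a minimal sketch of this loop for a two-component 1-D Gaussian mixture, where both steps are closed-form. The function name em_gmm1d, its defaults, and the initialization scheme are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def em_gmm1d(y, n_iter=200, tol=1e-6, seed=0):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch).

    E-step: responsibilities gamma[i, k] = p(z_i = k | y_i; theta^(t)).
    M-step: closed-form weighted MLE updates of weights, means, variances.
    Returns the parameters and the incomplete-data log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Asymmetric initialization: two distinct data points as starting means.
    w = np.array([0.5, 0.5])
    mu = rng.choice(y, size=2, replace=False)
    var = np.full(2, y.var())

    ll_trace = []
    for _ in range(n_iter):
        # E-step: log p(y_i, z_i = k; theta) for each component, then normalize.
        log_joint = (np.log(w)
                     - 0.5 * np.log(2.0 * np.pi * var)
                     - 0.5 * (y[:, None] - mu) ** 2 / var)          # shape (n, 2)
        log_marg = np.logaddexp(log_joint[:, 0], log_joint[:, 1])   # log f(y_i; theta)
        gamma = np.exp(log_joint - log_marg[:, None])               # responsibilities

        # Incomplete-data log-likelihood: cheap, reuses the E-step quantities.
        ll_trace.append(log_marg.sum())

        # M-step: weighted maximum-likelihood updates.
        nk = gamma.sum(axis=0)                                      # effective counts
        w = nk / n
        mu = (gamma * y[:, None]).sum(axis=0) / nk
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / nk

        # Stop on a small relative change in the log-likelihood.
        if len(ll_trace) > 1 and abs(ll_trace[-1] - ll_trace[-2]) < tol * abs(ll_trace[-2]):
            break

    return (w, mu, var), np.array(ll_trace)
```

Working in log space with logaddexp keeps the E-step numerically stable when the components are far apart.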

Theorem: Monotonic Improvement of the Log-Likelihood

Let $\boldsymbol{\theta}^{(t+1)}$ be produced from $\boldsymbol{\theta}^{(t)}$ by one iteration of EM. Then $\log f(\mathbf{y};\boldsymbol{\theta}^{(t+1)}) \;\geq\; \log f(\mathbf{y};\boldsymbol{\theta}^{(t)})$, with equality only if $\boldsymbol{\theta}^{(t)}$ already maximizes $Q(\,\cdot\,\mid\boldsymbol{\theta}^{(t)})$.

The E-step closes the Jensen gap at $\boldsymbol{\theta}^{(t)}$, making $\mathcal{F}$ tangent to $\ell$ from below. The M-step then walks uphill on $\mathcal{F}$, which necessarily means uphill on $\ell$.
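Written out, one iteration chains three elementary facts (with $q^{(t+1)}$ the E-step posterior):

$$
\log f(\mathbf{y};\boldsymbol{\theta}^{(t+1)})
\;\geq\; \mathcal{F}\big(q^{(t+1)},\boldsymbol{\theta}^{(t+1)}\big)
\;\geq\; \mathcal{F}\big(q^{(t+1)},\boldsymbol{\theta}^{(t)}\big)
\;=\; \log f(\mathbf{y};\boldsymbol{\theta}^{(t)}).
$$

The first inequality is the bound property of the ELBO, the second is the M-step's maximization over $\boldsymbol{\theta}$, and the final equality is the E-step closing the Jensen gap.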

Theorem: Stationary Points of EM

If $\boldsymbol{\theta}^{\star}$ is a fixed point of the EM update (i.e. $\boldsymbol{\theta}^{\star} \in \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{\star})$ and the arg-max is interior to $\Lambda$), then $\boldsymbol{\theta}^{\star}$ is a stationary point of $\ell(\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta})$: $\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}^{\star}) = \mathbf{0}$.
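The reason is that the bound and the objective share a gradient where they touch: under the usual regularity conditions allowing differentiation under the expectation,

$$
\nabla_{\boldsymbol{\theta}}\,\ell(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{\star}}
\;=\;
\nabla_{\boldsymbol{\theta}}\,Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{\star})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{\star}},
$$

so an interior maximizer of $Q(\cdot\mid\boldsymbol{\theta}^{\star})$ at $\boldsymbol{\theta}^{\star}$ makes the right-hand side zero, and hence the left-hand side as well.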

Local Maxima, Saddle Points, and Plateaus

Monotonicity, combined with the log-likelihood being bounded above on $\Lambda$, guarantees that $\ell(\boldsymbol{\theta}^{(t)})$ converges. But convergence of the iterates $\boldsymbol{\theta}^{(t)}$ is not automatic: the sequence can wander on a plateau, and even when it converges it may land on a local maximum, a saddle point, or (pathologically) drift to the boundary of $\Lambda$. Classical results (Wu 1983) establish convergence to the set of stationary points under regularity conditions. The practical remedy, as for any non-convex optimization, is to run EM from multiple random initializations and keep the best local maximum.

Monotonic Log-Likelihood During EM Iterations

Run EM on a synthetic 2-component 1-D mixture and track the incomplete-data log-likelihood against iteration number. Vary the noise level and initial guess to see convergence speed and local-maximum behaviour.

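A short script in the spirit of this demo, reusing the hypothetical em_gmm1d sketch from above; the mixture parameters, sample size, and seeds are arbitrary choices, and varying them changes the convergence speed and which local maximum is found.

```python
import numpy as np

# Synthetic two-component 1-D mixture: 60% N(-2, 0.5^2), 40% N(1, 0.8^2).
rng = np.random.default_rng(1)
n = 500
z = rng.random(n) < 0.6
y = np.where(z, rng.normal(-2.0, 0.5, n), rng.normal(1.0, 0.8, n))

params, ll = em_gmm1d(y, n_iter=100, tol=1e-8, seed=3)

# Monotonicity check: each EM iteration can only improve the incomplete-data
# log-likelihood (up to ~1e-12 of numerical roundoff).
print(np.all(np.diff(ll) >= -1e-9))                    # expect True
print(f"{ll.size} iterations, final log-likelihood {ll[-1]:.3f}")
```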
⚠️ Engineering Note

Stopping Criteria in Practice

A relative change in log-likelihood is the standard stopping rule, but it can fire prematurely on plateaus near saddle points. Production implementations commonly use a hybrid criterion: stop when the relative log-likelihood change AND the parameter change (in some norm) both fall below tolerance, subject to a maximum iteration cap; a sketch follows the list below. For mixture models, also monitor the minimum responsibility mass per component: a component with $\sum_i \gamma_{ik} < 1$ carries less than one effective observation and is about to collapse.

Practical Constraints
  • Typical relative tolerance: $10^{-6}$ for double precision

  • Typical maximum iterations: 100-500 for well-conditioned problems

  • Log-likelihood can decrease by numerical roundoff ($\sim 10^{-12}$); do not alarm on this
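As one concrete reading of the hybrid rule described above, a helper along these lines could gate the loop; the function name and the relative normalizations are illustrative assumptions, and the default thresholds mirror the typical values listed.

```python
import numpy as np

def should_stop(ll_new, ll_old, theta_new, theta_old, it,
                rel_tol=1e-6, param_tol=1e-6, max_iter=500):
    """Hybrid stopping rule: require BOTH a small relative log-likelihood
    change and a small relative parameter change, with a hard iteration cap."""
    if it >= max_iter:
        return True
    rel_ll = abs(ll_new - ll_old) / max(abs(ll_old), 1.0)
    rel_theta = (np.linalg.norm(theta_new - theta_old)
                 / max(np.linalg.norm(theta_old), 1.0))
    return rel_ll < rel_tol and rel_theta < param_tol
```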

Common Mistake: Initialization Matters — A Lot

Mistake:

Initializing EM at the all-equal parameter vector (e.g., all means at the global sample mean) and expecting it to separate components automatically.

Correction:

Symmetric initializations are fixed points of the EM update: the iterations will leave them unchanged. Break symmetry deliberately, with random initialization within the data range or $k$-means++ seeding for GMMs. Run several restarts and keep the one with the highest log-likelihood (see the sketch below).
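One way to follow this advice, sketched around the hypothetical em_gmm1d function above (plain random restarts; a k-means++ seeding routine would slot into the initialization in the same way):

```python
import numpy as np

def em_with_restarts(y, n_restarts=10, **em_kwargs):
    """Run EM from several random initializations; keep the best local maximum."""
    best_params, best_ll = None, -np.inf
    for seed in range(n_restarts):
        params, ll = em_gmm1d(y, seed=seed, **em_kwargs)  # seed controls the init
        if ll[-1] > best_ll:                              # compare final log-likelihoods
            best_params, best_ll = params, ll[-1]
    return best_params, best_ll
```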

Quick Check

Which statement about EM convergence is true?

EM always converges to the global maximum of the likelihood.

The log-likelihood is non-decreasing at every EM iteration.

EM converges in a finite number of iterations.

EM updates always decrease the KL divergence $D(q\Vert p(\cdot\mid\mathbf{y};\boldsymbol{\theta}))$.

EM as Iterative Lower-Bound Maximization

Each EM iteration rebuilds the ELBO tangent to $\ell(\boldsymbol{\theta})$ at the current iterate (E-step) and then climbs the bound (M-step). The log-likelihood $\log f(\mathbf{y};\boldsymbol{\theta})$ never decreases.