Chapter Summary

Key Points

1. Latent-variable models make direct MLE hard. When the observed-data likelihood $f(\mathbf{y};\boldsymbol{\theta}) = \int p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})\,d\mathbf{z}$ involves a marginalization, its logarithm becomes a log-sum-of-exponentials with no closed-form maximizer (the mixture example after this list makes this concrete). EM turns this into a sequence of tractable optimizations.

2. Jensen's inequality yields the ELBO. For any distribution $q(\mathbf{z})$, $\log f(\mathbf{y};\boldsymbol{\theta}) \geq \mathcal{F}(q,\boldsymbol{\theta}) = \mathbb{E}_q[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})] + \mathcal{H}(q)$, with equality exactly when $q = p(\cdot\mid\mathbf{y};\boldsymbol{\theta})$. The ELBO is the single expression that unifies EM, variational inference, and free-energy minimization (the one-line Jensen derivation is written out after this list).

3. EM is coordinate ascent on the ELBO. The E-step sets $q^{(t+1)} = p(\cdot\mid\mathbf{y};\boldsymbol{\theta}^{(t)})$ (closing the Jensen gap); the M-step maximizes $Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)}) = \mathbb{E}_{q^{(t+1)}}[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})]$. The incomplete-data log-likelihood is monotonically non-decreasing (the two-inequality argument appears after this list), and fixed points of EM are stationary points of the likelihood.

4. The GMM is the prototypical EM application. The E-step computes responsibilities $\gamma_{ik} \propto \pi_k\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$; the M-step performs a responsibility-weighted Gaussian MLE (a NumPy sketch of both steps follows this list). K-means is the $\sigma^2 \to 0$ hard-assignment limit, Baum-Welch is the HMM counterpart, and SBL is the linear-Gaussian latent-coefficient version with automatic relevance determination.

5. Practical EM requires care. Local maxima and saddle points are the rule rather than the exception, so always run multiple restarts. Unconstrained GMM covariances can collapse onto data points, making the likelihood unbounded; regularization is mandatory (a restart-and-regularize driver follows the GMM sketch below). Label switching is an artifact of permutation symmetry, not a convergence failure.

6. EM is the ancestor of modern variational inference. Variational EM restricts $q$ to a tractable family; gradient-based variational EM (VAEs) replaces the per-sample E-step by an amortized encoder. All of these methods are instances of lower-bound maximization on the same ELBO.

Looking Ahead

Chapter 9 introduces the discrete-time Wiener filter and its time-domain realization via the Yule-Walker equations. We will see the Levinson-Durbin recursion turn a Toeplitz linear system into an $O(p^2)$ recursive procedure: a counterpart, in linear estimation, of the computational parsimony that EM brings to non-linear estimation. Chapter 10 then takes the state-space leap: the Kalman filter, which combines Wiener-style linear MMSE with the recursive structure of a Markov model. Readers interested in how EM itself extends to dynamical models should note that Baum-Welch (from this chapter) and Kalman smoothing (next chapter) combine to give the EM algorithm for state-space parameter estimation.