Chapter Summary

Key Points

1. Latent-variable models make direct MLE hard. When the observed-data likelihood $f(\mathbf{y};\boldsymbol{\theta}) = \int p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})\,d\mathbf{z}$ involves a marginalization, its logarithm becomes a log-sum-of-exponentials with no closed-form maximizer (the mixture example after this list makes this concrete). EM turns this into a sequence of tractable optimizations.

2. Jensen's inequality yields the ELBO. For any distribution $q(\mathbf{z})$, $\log f(\mathbf{y};\boldsymbol{\theta}) \geq \mathcal{F}(q,\boldsymbol{\theta}) = \mathbb{E}_q[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})] + \mathcal{H}(q)$, with equality exactly when $q = p(\cdot\mid\mathbf{y};\boldsymbol{\theta})$. The ELBO is the single expression that unifies EM, variational inference, and free-energy minimization (the one-line Jensen derivation is written out after this list).

3. EM is coordinate ascent on the ELBO. The E-step sets $q^{(t+1)} = p(\cdot\mid\mathbf{y};\boldsymbol{\theta}^{(t)})$ (closing the Jensen gap); the M-step maximizes $Q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{(t)}) = \mathbb{E}_{q^{(t+1)}}[\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})]$. The incomplete-data log-likelihood is monotonically non-decreasing (the two-inequality argument appears after this list), and fixed points of EM are stationary points of the likelihood.

4. The GMM is the prototypical EM application. The E-step computes responsibilities $\gamma_{ik} \propto \pi_k\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$; the M-step performs a responsibility-weighted Gaussian MLE (a NumPy sketch of both steps follows this list). K-means is the $\sigma^2 \to 0$ hard-assignment limit, Baum-Welch is the HMM counterpart, and SBL is the linear-Gaussian latent-coefficient version with automatic relevance determination.

5. Practical EM requires care. Local maxima and saddle points are the rule rather than the exception, so always run multiple restarts. Unconstrained GMM covariances can collapse onto data points, making the likelihood unbounded; regularization is mandatory (a restart-and-regularize driver follows the GMM sketch below). Label switching is an artifact of permutation symmetry, not a convergence failure.

6. EM is the ancestor of modern variational inference. Variational EM restricts $q$ to a tractable family; gradient-based variational EM (VAEs) replaces the per-sample E-step by an amortized encoder. All of these methods are instances of lower-bound maximization on the same ELBO.

Looking Ahead

Chapter 9 introduces the discrete-time Wiener filter and its time-domain realization via the Yule-Walker equations. We will see the Levinson-Durbin recursion turn a Toeplitz linear system into an $O(p^2)$ recursive procedure: a counterpart, in linear estimation, of the computational parsimony that EM brings to non-linear estimation. Chapter 10 then takes the state-space leap: the Kalman filter, which combines Wiener-style linear MMSE with the recursive structure of a Markov model. Readers interested in how EM itself extends to dynamical models should note that Baum-Welch (from this chapter) and Kalman smoothing (next chapter) combine to give the EM algorithm for state-space parameter estimation.