Motivation: Missing Data and Latent Variables

Why We Need EM

In Chapter 7 we studied maximum-likelihood estimation and found, with some relief, that many canonical models admit closed-form MLEs: the Gaussian mean and variance, the exponential rate, the linear-Gaussian regression. The computation reduced, in each case, to solving the score equation $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{y};\boldsymbol{\theta}) = \mathbf{0}$.

But many important inference problems — clustering, channel estimation with unknown interference, hidden Markov models, sparse recovery with unknown sparsity pattern — share a common feature that breaks this closed-form machinery: if only we knew some hidden quantity $\mathbf{Z}$, the MLE would be trivial. The catch is that we do not observe $\mathbf{Z}$, and integrating it out makes the likelihood a sum (or integral) of exponentials whose logarithm has no closed-form maximizer.

The Expectation-Maximization (EM) algorithm is the systematic answer to this frustration. It alternates between guessing the hidden variables (using the current parameter estimate) and re-fitting the parameters (as if those guesses were truth), and it does so in a way that guarantees the likelihood never decreases from one iteration to the next.
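To make the alternation concrete, here is a minimal sketch of the generic EM loop in Python. The callables `e_step` and `m_step`, and the driver `em` itself, are hypothetical names standing in for the model-specific computations developed in the rest of this chapter.

```python
import numpy as np

def em(y, theta0, e_step, m_step, max_iter=100, tol=1e-8):
    """Generic EM skeleton (sketch): alternate between inferring the
    hidden variables (E) and re-fitting the parameters (M)."""
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        stats, ll = e_step(y, theta)   # expected statistics of Z, plus log-likelihood
        theta = m_step(y, stats)       # refit theta as if those statistics were truth
        if ll - prev_ll < tol:         # monotonicity: ll never decreases
            break
        prev_ll = ll
    return theta
```

The stopping rule leans on the monotonicity guarantee above: because the log-likelihood cannot decrease, a vanishing increment signals convergence, at least to a stationary point.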

Definition: Complete and Incomplete Data

Let $\mathbf{Y}$ denote the observed data (the incomplete observation) and $\mathbf{Z}$ the latent variables (never observed). Together they form the complete data $(\mathbf{Y},\mathbf{Z})$. A latent-variable model specifies a joint distribution

$$p(\mathbf{y},\mathbf{z};\boldsymbol{\theta}) = p(\mathbf{z};\boldsymbol{\theta})\,p(\mathbf{y}\mid\mathbf{z};\boldsymbol{\theta}),$$

parameterized by $\boldsymbol{\theta}\in\Lambda$. The observed-data likelihood is the marginal

$$f(\mathbf{y};\boldsymbol{\theta}) = \int p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})\,d\mathbf{z} \qquad (\text{or } \textstyle\sum_{\mathbf{z}} \text{ in the discrete case}).$$

The incomplete-data log-likelihood is $\ell(\boldsymbol{\theta}) = \log f(\mathbf{y};\boldsymbol{\theta})$.

The word "incomplete" here is historical: it reflects the viewpoint of Dempster, Laird, and Rubin (1977), who framed EM as a generalization of maximum likelihood for problems with missing values. The same machinery applies equally to latent-variable models where Z\mathbf{Z} was never meant to be observed in the first place (e.g., cluster assignments).

Latent Variable

An unobserved random variable whose presence in a probabilistic model is useful for either explaining structure in the observed data (as in mixture models) or representing genuine missing information (as in censored data).

Related: Complete Data, Responsibility

Complete Data

The pair $(\mathbf{Y},\mathbf{Z})$ of observations together with the latent variables. The complete-data log-likelihood $\log p(\mathbf{y},\mathbf{z};\boldsymbol{\theta})$ is typically much simpler than the incomplete-data version.

Related: Latent Variable

Example: Why the MLE Is Hard (Two Gaussians)

Suppose $n$ i.i.d. samples $y_1,\ldots,y_n \in \mathbb{R}$ come from a mixture of two known-variance Gaussians:

$$p(y;\boldsymbol{\theta}) = \pi\,\mathcal{N}(y;\mu_1,1) + (1-\pi)\,\mathcal{N}(y;\mu_2,1),$$

with parameters $\boldsymbol{\theta} = (\mu_1,\mu_2,\pi)$. Write down the log-likelihood and explain why direct maximization is difficult.
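One way to see the difficulty (a sketch of the intended answer): the observed-data log-likelihood is

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log\left[\pi\,\mathcal{N}(y_i;\mu_1,1) + (1-\pi)\,\mathcal{N}(y_i;\mu_2,1)\right],$$

and the logarithm of a sum does not decompose: the score equations for $\mu_1$, $\mu_2$, and $\pi$ are coupled through data-dependent posterior weights, so no closed-form maximizer exists. If instead the labels $z_i \in \{1,2\}$ were observed, the complete-data log-likelihood would split into per-component sums,

$$\log p(\mathbf{y},\mathbf{z};\boldsymbol{\theta}) = \sum_{i:\,z_i=1} \left[\log\pi + \log\mathcal{N}(y_i;\mu_1,1)\right] + \sum_{i:\,z_i=2} \left[\log(1-\pi) + \log\mathcal{N}(y_i;\mu_2,1)\right],$$

which is maximized in closed form by the sample proportion and per-group sample means. Anticipating the algorithm derived in the next section, here is a minimal Python sketch of EM for this mixture; the name `em_two_gaussians` and the initialization are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, mu1, mu2, pi, n_iter=50):
    """EM for a two-component, unit-variance Gaussian mixture (sketch)."""
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each sample, i.e. the
        # posterior probability that z_i = 1 under the current parameters
        w1 = pi * norm.pdf(y, loc=mu1, scale=1.0)
        w2 = (1.0 - pi) * norm.pdf(y, loc=mu2, scale=1.0)
        r = w1 / (w1 + w2)
        # M-step: weighted MLEs, treating the soft assignments r as truth
        pi = r.mean()
        mu1 = (r * y).sum() / r.sum()
        mu2 = ((1.0 - r) * y).sum() / (1.0 - r).sum()
    return mu1, mu2, pi
```

Each iteration fills in the missing labels softly (the E-step) and then solves the now-trivial complete-data MLE (the M-step), which is exactly the alternation described above.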

Graphical Structure of a Latent-Variable Model

Figure. Generative model: parameters $\boldsymbol{\theta}$ govern the latent variable $\mathbf{Z}$, which together with $\boldsymbol{\theta}$ generates the observation $\mathbf{Y}$. EM inverts the arrows at inference time.

Why This Matters: Interference as a Latent Variable

In a wireless uplink, a receiver observes $\mathbf{y} = \mathbf{H}\,\mathbf{s} + \mathbf{i} + \mathbf{w}$, where $\mathbf{i}$ is interference from other users whose structure is only partially known. Treating the interferer's data symbols as latent variables, EM provides a principled way to jointly estimate the channel $\mathbf{H}$ and the interference statistics. This viewpoint underlies turbo equalization and iterative interference cancellation — both special cases of the EM algorithm applied to communication receivers.

See full treatment in Applications: K-means, HMMs, and Sparse Bayesian Learning

Historical Note: The 1977 Paper

Although special cases of the E-M alternation appear as early as McKendrick (1926) and Baum et al. (1970, for HMMs), the algorithm was crystallized in its full generality by Arthur Dempster, Nan Laird, and Donald Rubin in their 1977 paper in the Journal of the Royal Statistical Society (Series B). That paper is one of the most-cited papers in all of statistics: it unified dozens of scattered iterative schemes under a single banner and provided the monotonicity proof that made the algorithm's convergence behavior transparent.