Motivation: Missing Data and Latent Variables
Why We Need EM
In Chapter 7 we studied maximum-likelihood estimation and found, with some relief, that many canonical models admit closed-form MLEs: the Gaussian mean and variance, the exponential rate, the linear-Gaussian regression. The computation reduced, in each case, to solving the score equation $\nabla_\theta \ell(\theta) = 0$.
But many important inference problems — clustering, channel estimation with unknown interference, hidden Markov models, sparse recovery with unknown sparsity pattern — share a common feature that breaks this closed-form machinery: if only we knew some hidden quantity $z$, the MLE would be trivial. The catch is that we do not observe $z$, and integrating it out makes the likelihood a sum (or integral) of exponentials whose logarithm has no closed-form maximizer.
The Expectation-Maximization (EM) algorithm is the systematic answer to this frustration. It alternates between guessing the hidden variables (using the current parameter estimate) and re-fitting the parameters (as if those guesses were truth), and it does so in a way that is guaranteed never to decrease the likelihood from one iteration to the next.
Definition: Complete and Incomplete Data
Let $x$ denote the observed data (the incomplete observation) and $z$ the latent variables (never observed). Together they form the complete data $(x, z)$. A latent-variable model specifies a joint distribution $p(x, z; \theta)$ parameterized by $\theta$. The observed-data likelihood is the marginal $p(x; \theta) = \int p(x, z; \theta)\, dz$ (a sum when $z$ is discrete). The incomplete-data log-likelihood is $\ell(\theta) = \log p(x; \theta)$.
The word "incomplete" here is historical: it reflects the viewpoint of Dempster, Laird, and Rubin (1977), who framed EM as a generalization of maximum likelihood for problems with missing values. The same machinery applies equally to latent-variable models where $z$ was never meant to be observed in the first place (e.g., cluster assignments).
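As a concrete sketch of the definition (a hypothetical two-component model with made-up function names, not from the text), the marginal likelihood of a discrete latent variable is just a sum of complete-data likelihoods over its values:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def complete_likelihood(x, z, theta):
    """p(x, z; theta): simple, because z selects a single component."""
    prior, mu0, mu1, sigma = theta
    pz = prior if z == 0 else 1 - prior
    mu = mu0 if z == 0 else mu1
    return pz * gauss_pdf(x, mu, sigma)

def incomplete_likelihood(x, theta):
    """p(x; theta) = sum over z of p(x, z; theta): the marginal the MLE must use."""
    return sum(complete_likelihood(x, z, theta) for z in (0, 1))
```

The complete-data term factors cleanly; only the marginalization over `z` produces the log-of-sum that blocks a closed-form maximizer.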
Latent Variable
An unobserved random variable whose presence in a probabilistic model is useful for either explaining structure in the observed data (as in mixture models) or representing genuine missing information (as in censored data).
Related: Complete Data, Responsibility
Complete Data
The pair $(x, z)$ of observations together with the latent variables. The complete-data log-likelihood is typically much simpler than the incomplete-data version.
Related: Latent Variable
Example: Why the MLE Is Hard: Two Gaussians
Suppose $n$ i.i.d. samples $x_1, \dots, x_n$ come from a mixture of two known-variance Gaussians, $p(x; \theta) = \pi\, \mathcal{N}(x; \mu_1, \sigma^2) + (1 - \pi)\, \mathcal{N}(x; \mu_2, \sigma^2)$, with parameters $\theta = (\pi, \mu_1, \mu_2)$. Write down the log-likelihood and explain why direct maximization is difficult.
Log-likelihood
The log-likelihood is $\ell(\theta) = \sum_{i=1}^{n} \log\big[\pi\, \mathcal{N}(x_i; \mu_1, \sigma^2) + (1 - \pi)\, \mathcal{N}(x_i; \mu_2, \sigma^2)\big]$. Each term is the log of a sum of two exponentials — the logarithm does not distribute over the sum.
Why closed form fails
Setting $\partial \ell / \partial \mu_1 = 0$ gives $\mu_1 = \frac{\sum_i \gamma_i(\theta)\, x_i}{\sum_i \gamma_i(\theta)}$, where $\gamma_i(\theta) = \frac{\pi\, \mathcal{N}(x_i; \mu_1, \sigma^2)}{\pi\, \mathcal{N}(x_i; \mu_1, \sigma^2) + (1 - \pi)\, \mathcal{N}(x_i; \mu_2, \sigma^2)}$. The weight $\gamma_i(\theta)$ depends on all parameters, so the score equation is not linear in $\mu_1$. Symmetric equations hold for $\mu_2$ and $\pi$.
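The circularity is easy to see numerically. The sketch below (hypothetical samples and trial parameters, unit variance assumed for brevity) evaluates the weights $\gamma_i$ at a trial $\mu_1$ and then the value of $\mu_1$ that the stationarity condition implies; the two do not agree, so the equation is a fixed-point condition rather than a closed-form solution:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibility(x, pi, mu1, mu2):
    """gamma_i: posterior weight of component 1 -- note it needs mu1 itself."""
    a = pi * gauss_pdf(x, mu1)
    b = (1 - pi) * gauss_pdf(x, mu2)
    return a / (a + b)

xs = [-2.1, -1.9, 2.0, 2.2]      # hypothetical samples
pi, mu1, mu2 = 0.5, -1.0, 1.0    # a trial parameter value

# Right-hand side of the stationarity condition, evaluated at the trial mu1:
g = [responsibility(x, pi, mu1, mu2) for x in xs]
mu1_implied = sum(gi * xi for gi, xi in zip(g, xs)) / sum(g)
# mu1_implied != mu1 in general: the weights on the right already depend on mu1
```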
The EM insight
If we knew the component label $z_i \in \{1, 2\}$ for each sample, we would simply compute the sample mean of the samples assigned to each component — two separate one-Gaussian MLEs. EM exploits this observation: in the E-step it computes the soft assignment $\gamma_i = \Pr(z_i = 1 \mid x_i; \theta)$, and in the M-step it performs weighted sample-mean updates as if the assignments were known.
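The alternation can be sketched in a few lines. This is a minimal illustration for the two-component, known-variance case, assuming unit variance and hypothetical data; the function name and the data are illustrative, not from the text:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, pi, mu1, mu2, iters=50):
    """EM for a two-component Gaussian mixture with known unit variance."""
    for _ in range(iters):
        # E-step: soft assignments gamma_i = P(z_i = 1 | x_i; current parameters)
        g = []
        for x in xs:
            a = pi * gauss_pdf(x, mu1)
            b = (1 - pi) * gauss_pdf(x, mu2)
            g.append(a / (a + b))
        # M-step: weighted one-Gaussian MLEs, as if the assignments were known
        s = sum(g)
        mu1 = sum(gi * xi for gi, xi in zip(g, xs)) / s
        mu2 = sum((1 - gi) * xi for gi, xi in zip(g, xs)) / (len(xs) - s)
        pi = s / len(xs)
    return pi, mu1, mu2

# Hypothetical data: two well-separated clusters near -2 and +2
xs = [-2.2, -1.8, -2.0, 1.9, 2.1, 2.0]
pi, mu1, mu2 = em_two_gaussians(xs, 0.5, -1.0, 1.0)
```

With well-separated clusters the soft assignments quickly become nearly hard, and the weighted means converge to the cluster sample means.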
Graphical Structure of a Latent-Variable Model
Why This Matters: Interference as a Latent Variable
In a wireless uplink, a receiver observes $y = h s + v + n$, where $v$ is interference from other users whose structure is only partially known. Treating the interferer's data symbols as latent variables, EM provides a principled way to jointly estimate the channel $h$ and the interference statistics. This viewpoint underlies turbo equalization and iterative interference cancellation — both special cases of the EM algorithm applied to communication receivers.
See full treatment in Applications: K-means, HMMs, and Sparse Bayesian Learning
Historical Note: The 1977 Paper
Although special cases of the EM alternation appear as early as McKendrick (1926) and Baum et al. (1970, for HMMs), the algorithm was crystallized in its full generality by Arthur Dempster, Nan Laird, and Donald Rubin in their 1977 paper in the Journal of the Royal Statistical Society. That paper is one of the most-cited in all of statistics: it unified dozens of scattered iterative schemes under a single banner and provided the monotonicity proof that made the algorithm's convergence behavior transparent.