References & Further Reading
References
- A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 1977
The foundational paper that crystallized EM as a general algorithm and proved monotonicity.
- C. F. J. Wu, On the Convergence Properties of the EM Algorithm, Annals of Statistics, 1983
The definitive convergence analysis of EM, establishing convergence to stationary points under regularity conditions.
- G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley, 2nd ed., 2008
Comprehensive monograph covering variants, acceleration, and applications.
- G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley, 2000
Standard reference for GMMs and related mixture-model fitting.
- C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Chapter 9 gives a clear ELBO-based derivation of EM; Chapter 10 covers variational EM.
- K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012
Chapter 11 treats EM for mixtures, HMMs, and factor analyzers.
- R. M. Neal and G. E. Hinton, A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, in M. I. Jordan (ed.), Learning in Graphical Models, 1998
The free-energy / coordinate-ascent reformulation that made variational EM possible; the underlying identity is sketched after this list.
- L. E. Baum, T. Petrie, G. Soules, and N. Weiss, A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, Annals of Mathematical Statistics, 1970
The Baum-Welch algorithm: EM for HMMs, predating Dempster, Laird, and Rubin.
- L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 1989
The classic engineering tutorial on HMMs and Baum-Welch training.
- M. E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research, 2001
The original SBL paper: EM on a hierarchical Gaussian prior with automatic relevance determination.
- D. P. Wipf and B. D. Rao, Sparse Bayesian Learning for Basis Selection, IEEE Transactions on Signal Processing, 2004
Connects SBL to $\ell_0$-penalized problems and establishes global-optimum conditions.
- M. Ke, Z. Gao, Y. Wu, X. Gao, and R. Schober, Compressive Sensing-Based Adaptive Active User Detection and Channel Estimation: Massive Access Meets Massive MIMO, IEEE Transactions on Signal Processing, 2020
SBL/EM applied to massive random access with massive MIMO receivers.
- S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, 1993
Chapter 7 covers EM for frequency estimation and other signal-processing problems.
- K. Pearson, Contributions to the Mathematical Theory of Evolution, Philosophical Transactions of the Royal Society of London A, 1894
The historically first Gaussian-mixture fit, via the method of moments, long before EM existed.
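A one-line sketch of the identity behind the Bishop and Neal-Hinton entries above, in the usual notation ($\mathbf{x}$ observed data, $\mathbf{z}$ latent variables, $q$ any distribution over $\mathbf{z}$); the decomposition is standard, though the symbols here are our own:

$$
\log p(\mathbf{x} \mid \theta)
= \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p(\mathbf{x}, \mathbf{z} \mid \theta)}{q(\mathbf{z})}\right]}_{\mathcal{F}(q,\,\theta)\;\text{(free energy / ELBO)}}
+ \mathrm{KL}\!\left(q(\mathbf{z}) \,\big\|\, p(\mathbf{z} \mid \mathbf{x}, \theta)\right).
$$

The E-step maximizes $\mathcal{F}$ over $q$ with $\theta$ fixed (optimum $q = p(\mathbf{z} \mid \mathbf{x}, \theta)$, which zeroes the KL term); the M-step maximizes $\mathcal{F}$ over $\theta$ with $q$ fixed. Restricting $q$ to a tractable family turns the same coordinate ascent into variational EM.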
Further Reading
Directions for readers who want to go deeper into convergence theory, variational extensions, and signal-processing applications.
Convergence rates and acceleration of EM
Meng & Rubin, 'On the Global and Componentwise Rates of Convergence of the EM Algorithm,' Linear Algebra Appl., 1994; Varadhan & Roland, 'Simple and Globally Convergent Methods for Accelerating the Convergence of Any EM Algorithm,' Scand. J. Stat., 2008
EM often converges linearly, with rate governed by the missing information ratio; simple fix-ups can restore superlinear behavior.
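To make both claims concrete: near a fixed point $\theta^{\ast}$, the EM map behaves like a linear iteration whose rate matrix is the fraction of missing information (writing $\mathcal{I}_c$ and $\mathcal{I}_m$ for the complete-data and missing information matrices, notation borrowed from Meng & Rubin):

$$
\theta^{(t+1)} - \theta^{\ast} \approx \mathbf{J}\,\big(\theta^{(t)} - \theta^{\ast}\big),
\qquad \mathbf{J} = \mathcal{I}_c(\theta^{\ast})^{-1}\,\mathcal{I}_m(\theta^{\ast}),
$$

so the linear rate is the largest eigenvalue of $\mathbf{J}$: the more information the latent variables carry, the slower EM crawls. SQUAREM treats EM as a fixed-point map $F$ and extrapolates; one of Varadhan & Roland's step-length schemes (their S3) is

$$
r = F(\theta) - \theta, \quad v = F(F(\theta)) - F(\theta) - r, \quad
\alpha = -\frac{\lVert r \rVert}{\lVert v \rVert}, \quad
\theta' = \theta - 2\alpha\, r + \alpha^{2} v,
$$

with a fall-back to a plain EM step whenever the extrapolated point lowers the likelihood.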
Variational inference and VAEs
Blei, Kucukelbir & McAuliffe, 'Variational Inference: A Review for Statisticians,' JASA, 2017; Kingma & Welling, 'Auto-Encoding Variational Bayes,' ICLR, 2014
The modern descendants of EM: amortized, gradient-based, and scalable to deep models.
EM for signal-processing problems in communications
Feder & Weinstein, 'Parameter Estimation of Superimposed Signals Using the EM Algorithm,' IEEE Trans. ASSP, 1988
A classic application of EM to source separation and parameter estimation in radar and wireless.
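The decoupling trick at the heart of Feder & Weinstein, sketched here from memory (the noise-splitting weights $\beta_k$ are theirs): model the observation as $\mathbf{y} = \sum_{k=1}^{K} \mathbf{s}_k(\theta_k) + \mathbf{n}$ and take as complete data the per-signal observations $\mathbf{x}_k = \mathbf{s}_k(\theta_k) + \mathbf{n}_k$, with the noise split as $\mathbf{n} = \sum_k \mathbf{n}_k$. The E-step then reassigns the current residual:

$$
\hat{\mathbf{x}}_k = \mathbf{s}_k\big(\hat{\theta}_k\big)
+ \beta_k \Big(\mathbf{y} - \sum_{j=1}^{K} \mathbf{s}_j\big(\hat{\theta}_j\big)\Big),
\qquad \beta_k \ge 0, \quad \sum_{k=1}^{K} \beta_k = 1,
$$

and the M-step reduces to $K$ independent single-signal estimation problems, one per $\hat{\mathbf{x}}_k$. That decoupling is what makes the method attractive for multi-source radar and wireless problems.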
Information-geometric view of EM
Amari, 'Information Geometry of the EM and em Algorithms for Neural Networks,' Neural Networks, 1995
Interprets EM as alternating projection between two manifolds of distributions, an illuminating geometric picture.
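In symbols, Amari's picture (with $D$ the data manifold of distributions consistent with the observations and $M = \{ p_\theta \}$ the model manifold) is a pair of alternating Kullback-Leibler projections:

$$
q^{(t+1)} = \arg\min_{q \in D} \mathrm{KL}\big(q \,\big\|\, p_{\theta^{(t)}}\big) \;\;\text{(e-projection)},
\qquad
\theta^{(t+1)} = \arg\min_{\theta} \mathrm{KL}\big(q^{(t+1)} \,\big\|\, p_{\theta}\big) \;\;\text{(m-projection)}.
$$

This is the geometric reading of the same coordinate ascent as in the free-energy view above; Amari also discusses when the em algorithm and standard EM differ.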