References & Further Reading

References

  1. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 1977

    The foundational paper that crystallized EM as a general algorithm and proved that each iteration cannot decrease the likelihood (sketched just after this list).

  2. C. F. J. Wu, On the Convergence Properties of the EM Algorithm, The Annals of Statistics, 1983

    The definitive convergence analysis of EM: stationary-point convergence under regularity conditions.

  3. G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley, 2nd ed., 2008

    Comprehensive monograph covering variants, acceleration, and applications.

  4. G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley, 2000

    Standard reference for GMMs and related mixture-model fitting.

  5. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

    Chapter 9 gives a clear ELBO-based derivation of EM; Chapter 10 covers variational EM.

  6. K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012

    Chapter 11 treats EM for mixtures, HMMs, and factor analyzers.

  7. R. M. Neal and G. E. Hinton, A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, in M. I. Jordan (ed.), Learning in Graphical Models, 1998

    The free-energy / coordinate-ascent reformulation that made variational EM possible.

  8. L. E. Baum, T. Petrie, G. Soules, and N. Weiss, A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, The Annals of Mathematical Statistics, 1970

    The Baum-Welch algorithm: EM for HMMs, pre-dating Dempster-Laird-Rubin.

  9. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 1989

    The definitive engineering tutorial on HMMs and Baum-Welch training.

  10. M. E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research, 2001

    The original SBL paper: EM on a hierarchical Gaussian prior with automatic relevance determination (E- and M-step updates sketched after this list).

  11. D. P. Wipf and B. D. Rao, Sparse Bayesian Learning for Basis Selection, IEEE Transactions on Signal Processing, 2004

    Connects SBL to $\ell_0$-penalized problems and establishes global-optimum conditions.

  12. M. Ke, Z. Gao, Y. Wu, X. Gao, and R. Schober, Compressive Sensing-Based Adaptive Active User Detection and Channel Estimation: Massive Access Meets Massive MIMO, IEEE Transactions on Signal Processing, 2020

    SBL/EM applied to massive random access with massive MIMO receivers.

  13. S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, 1993

    Chapter 7 covers EM for frequency estimation and other signal-processing problems.

  14. K. Pearson, Contributions to the Mathematical Theory of Evolution, Philosophical Transactions of the Royal Society of London A, 1894

    The first Gaussian-mixture fit on record, via the method of moments, long before EM existed.
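
The decomposition behind entries 1, 5, and 7 is worth writing down once. A compact sketch in the usual notation, with observed data $x$, latent variables $z$, and any distribution $q$ over $z$:

$$
\log p(x \mid \theta) = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathcal{L}(q,\,\theta)\ \text{(ELBO)}} + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)
$$

The E-step sets $q(z) = p(z \mid x, \theta^{(t)})$, which zeroes the KL term so that $\mathcal{L}(q, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$; the M-step maximizes $\mathcal{L}(q, \theta)$ over $\theta$. Because the KL term is nonnegative, $\log p(x \mid \theta^{(t+1)}) \geq \mathcal{L}(q, \theta^{(t+1)}) \geq \mathcal{L}(q, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$: the monotonicity of entry 1, in the coordinate-ascent form of entry 7.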
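
Entries 10 and 11 also admit a short summary. A minimal sketch for the linear model $y = \Phi w + \epsilon$ with noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and hierarchical prior $w_i \sim \mathcal{N}(0, \alpha_i^{-1})$, treating the weights $w$ as the latent variable:

$$
\Sigma = \big(\sigma^{-2} \Phi^\top \Phi + \operatorname{diag}(\alpha)\big)^{-1}, \quad \mu = \sigma^{-2} \Sigma \Phi^\top y \quad \text{(E-step)}, \qquad \alpha_i^{\text{new}} = \frac{1}{\mu_i^2 + \Sigma_{ii}} \quad \text{(M-step)}
$$

As the iterations proceed, many $\alpha_i \to \infty$ and the corresponding weights are pruned; that is the automatic relevance determination of entry 10. Tipping also derives a faster fixed-point variant, $\alpha_i^{\text{new}} = \gamma_i / \mu_i^2$ with $\gamma_i = 1 - \alpha_i \Sigma_{ii}$.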

Further Reading

Directions for readers who want to go deeper into convergence theory, variational extensions, and signal-processing applications.

  • Convergence rates and acceleration of EM

    Meng & Rubin, 'On the Global and Componentwise Rates of Convergence of the EM Algorithm,' Linear Algebra Appl., 1994; Varadhan & Roland, 'Simple and Globally Convergent Methods for Accelerating the Convergence of Any EM Algorithm' (SQUAREM), Scand. J. Stat., 2008

    EM often converges only linearly, with the rate governed by the fraction of missing information (see the sketch after this list); simple extrapolation fix-ups such as SQUAREM can restore superlinear behavior.

  • Variational inference and VAEs

    Blei, Kucukelbir & McAuliffe, 'Variational Inference: A Review for Statisticians,' JASA, 2017; Kingma & Welling, 'Auto-Encoding Variational Bayes,' ICLR, 2014

    The modern descendants of EM: amortized, gradient-based, and scalable to deep models.

  • EM for signal-processing problems in communications

    Feder & Weinstein, 'Parameter Estimation of Superimposed Signals Using the EM Algorithm,' IEEE Trans. ASSP, 1988

    A classic application of EM to source separation and parameter estimation in radar and wireless.

  • Information-geometric view of EM

    Amari, 'Information Geometry of the EM and em Algorithms for Neural Networks,' Neural Networks, 1995

    Interprets EM as alternating projections between two manifolds of distributions; the geometry is sketched below.
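
To make the missing-information rate in the first bullet concrete: near a fixed point $\theta^*$, and under the regularity conditions of Meng & Rubin, the EM update is approximately linear with Jacobian equal to the fraction-of-missing-information matrix (notation as in McLachlan & Krishnan, entry 3 above):

$$
\theta^{(t+1)} - \theta^* \approx J \, \big(\theta^{(t)} - \theta^*\big), \qquad J = \mathcal{I}_{\mathrm{com}}^{-1} \, \mathcal{I}_{\mathrm{mis}} = I - \mathcal{I}_{\mathrm{com}}^{-1} \, \mathcal{I}_{\mathrm{obs}}
$$

Here $\mathcal{I}_{\mathrm{com}}$, $\mathcal{I}_{\mathrm{mis}} = \mathcal{I}_{\mathrm{com}} - \mathcal{I}_{\mathrm{obs}}$, and $\mathcal{I}_{\mathrm{obs}}$ are the complete, missing, and observed information matrices at $\theta^*$; the global rate is the largest eigenvalue of $J$, so the more information the latent variables carry, the closer that eigenvalue sits to 1 and the slower EM moves.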
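
And to unpack the alternating-projection picture in the last bullet: writing $D$ for the manifold of distributions consistent with the observed data and $M = \{p_\theta\}$ for the model manifold, one iteration of Amari's em algorithm is, roughly,

$$
q^{(t+1)} = \operatorname*{arg\,min}_{q \in D} \mathrm{KL}\big(q \,\|\, p_{\theta^{(t)}}\big) \ \text{(e-projection)}, \qquad \theta^{(t+1)} = \operatorname*{arg\,min}_{\theta} \mathrm{KL}\big(q^{(t+1)} \,\|\, p_\theta\big) \ \text{(m-projection)}
$$

Amari shows this coincides with classical EM in common cases, exponential-family models included, though the two can differ in general.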