Gaussian Mixture Models
The Prototypical EM Application
No model showcases EM better than the Gaussian mixture. The latent variables are discrete (which component generated each sample), the E-step reduces to Bayes' rule on a small simplex, and the M-step is a weighted version of the closed-form Gaussian MLE. Understanding GMM-EM in depth pays compound interest: every other EM application in this chapter — K-means, HMMs, SBL — is a variation on this one.
Definition: Gaussian Mixture Model
A $K$-component Gaussian mixture model for $x \in \mathbb{R}^d$ is
$$p(x \mid \theta) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$
with mixing weights $\pi_k \ge 0$, $\sum_{k=1}^{K} \pi_k = 1$, component means $\mu_k \in \mathbb{R}^d$, and component covariances $\Sigma_k \in \mathbb{R}^{d \times d}$ (symmetric positive definite). The parameter is $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.
The latent variable for each sample $x_n$ is the discrete component label $z_n \in \{1, \dots, K\}$ with prior $p(z_n = k) = \pi_k$. Conditionally, $x_n \mid (z_n = k) \sim \mathcal{N}(\mu_k, \Sigma_k)$.
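As a concrete (and entirely illustrative) sketch of this definition, the snippet below samples from a two-component planar mixture by first drawing the label $z$ and then the conditional Gaussian, and evaluates the mixture density; all parameter values are assumptions made up for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters of a 2-component GMM in the plane.
pis    = np.array([0.4, 0.6])                              # mixing weights, sum to 1
mus    = np.array([[0.0, 0.0], [3.0, 3.0]])                # component means
Sigmas = np.array([np.eye(2), [[1.0, 0.5], [0.5, 1.5]]])   # SPD component covariances

def sample_gmm(n):
    """Draw n samples: label z ~ Categorical(pi), then x | z = k ~ N(mu_k, Sigma_k)."""
    z = rng.choice(len(pis), size=n, p=pis)
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

def gmm_density(x):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mu, S).pdf(x)
               for pi, mu, S in zip(pis, mus, Sigmas))

X, z = sample_gmm(200)
print(gmm_density(X[:3]))   # mixture density evaluated at the first three samples
```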
Definition: Responsibility
For sample $x_n$ and component $k$, the responsibility is the posterior probability that component $k$ generated $x_n$ given the current parameters $\theta^{(t)}$:
$$\gamma_{nk} \;=\; p(z_n = k \mid x_n, \theta^{(t)}) \;=\; \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$$
By construction, $\gamma_{nk} \ge 0$ and $\sum_{k=1}^{K} \gamma_{nk} = 1$ for each $n$.
Responsibility
In a mixture model, the posterior probability that a specific observation was generated by a specific component. Responsibilities are the soft analogue of hard cluster assignments.
Related: Latent Variable
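A minimal sketch of the responsibility computation, working in log-space for numerical stability; the parameter names carry over from the sampling sketch above and are assumptions, not fixed notation from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = p(z_n = k | x_n, theta), computed via Bayes' rule in log-space."""
    # log pi_k + log N(x_n | mu_k, Sigma_k), arranged as an (N, K) array
    log_joint = np.stack(
        [np.log(pi) + multivariate_normal(mu, S).logpdf(X)
         for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    # subtract the log normalizer of each row so every row sums to 1
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```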
Theorem: EM Updates for the Gaussian Mixture Model
For i.i.d. samples $x_1, \dots, x_N \in \mathbb{R}^d$, one EM iteration for the $K$-component GMM consists of:
E-step. For every $n$ and $k$,
$$\gamma_{nk} \;=\; \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$$
Define the effective counts $N_k = \sum_{n=1}^{N} \gamma_{nk}$.
M-step. The updates are
$$\pi_k^{\text{new}} = \frac{N_k}{N}, \qquad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^\top.$$
The updates are exactly the sample estimators for $\pi_k$, $\mu_k$, $\Sigma_k$, weighted by the responsibilities. If the responsibilities were hard (0/1), we would recover the per-cluster Gaussian MLE.
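Before the derivation, here is a compact NumPy sketch of one such EM loop. It reuses the `responsibilities` helper sketched earlier, and the small ridge added to each covariance is an extra assumption (see the singular-collapse discussion later in this section).

```python
import numpy as np

def em_gmm(X, K, n_iter=100, ridge=1e-6, seed=0):
    """Vanilla EM for a K-component full-covariance GMM (a sketch, not production code)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # crude initialization: K random data points as means, global covariance, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + ridge * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        gamma = responsibilities(X, pis, mus, Sigmas)      # E-step: (N, K) posteriors
        Nk = gamma.sum(axis=0)                             # effective counts N_k
        pis = Nk / N                                       # M-step: mixing weights
        mus = (gamma.T @ X) / Nk[:, None]                  # M-step: weighted means
        for k in range(K):                                 # M-step: weighted covariances
            Xc = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + ridge * np.eye(d)
    return pis, mus, Sigmas, gamma
```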
Write the complete-data log-likelihood $\log p(X, Z \mid \theta)$.
Take the posterior expectation: $Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\!\left[\log p(X, Z \mid \theta)\right]$.
Maximize over $\pi$ with a Lagrangian for the simplex constraint, then over $\mu_k$ and $\Sigma_k$ componentwise.
Complete-data log-likelihood
With $z_{nk} = \mathbb{1}[z_n = k]$,
$$\log p(X, Z \mid \theta) \;=\; \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right].$$
Take the expectation (E-step)
Since $\mathbb{E}[z_{nk} \mid x_n, \theta^{(t)}] = \gamma_{nk}$,
$$Q(\theta, \theta^{(t)}) \;=\; \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left[\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right].$$
M-step: mixing weights
Maximize $\sum_{n}\sum_{k} \gamma_{nk} \log \pi_k$ subject to $\sum_{k} \pi_k = 1$. Lagrangian: $\mathcal{L} = \sum_{n}\sum_{k} \gamma_{nk} \log \pi_k + \lambda\left(\sum_{k} \pi_k - 1\right)$. Setting $\partial \mathcal{L} / \partial \pi_k = 0$ yields $\pi_k = -N_k/\lambda$; the constraint gives $\lambda = -N$, hence $\pi_k = N_k / N$.
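As a numerical sanity check of this Lagrangian argument (with made-up effective counts), maximizing $\sum_k N_k \log \pi_k$ over the simplex with a generic constrained optimizer should reproduce the closed form $\pi_k = N_k / N$:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative effective counts N_k from some E-step (made up for the check).
Nk = np.array([37.2, 110.5, 52.3])
N = Nk.sum()

def neg_Q_weights(pi):
    # negative of the pi-dependent part of Q: -sum_k N_k log(pi_k)
    return -np.sum(Nk * np.log(pi))

res = minimize(neg_Q_weights, x0=np.full(3, 1 / 3), method="SLSQP",
               bounds=[(1e-9, 1.0)] * 3,
               constraints=[{"type": "eq", "fun": lambda pi: pi.sum() - 1.0}])
print(res.x)     # numerical maximizer on the simplex
print(Nk / N)    # closed form N_k / N; the two should agree to optimizer tolerance
```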
M-step: means
For each $k$, $\nabla_{\mu_k} Q = \sum_{n} \gamma_{nk} \, \Sigma_k^{-1}(x_n - \mu_k) = 0$, giving $\mu_k = \frac{1}{N_k} \sum_{n} \gamma_{nk} \, x_n$, the responsibility-weighted sample mean.
M-step: covariances
Using the matrix identities $\frac{\partial}{\partial \Sigma} \log|\Sigma| = \Sigma^{-1}$ and $\frac{\partial}{\partial \Sigma} \operatorname{tr}(\Sigma^{-1} A) = -\Sigma^{-1} A \Sigma^{-1}$ (for symmetric $\Sigma$), setting the $\Sigma_k$-gradient of $Q$ to zero gives
$$\Sigma_k \;=\; \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top.$$
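A similar quick check, with arbitrary soft weights and all names illustrative, that the weighted mean and weighted scatter really do maximize the per-component objective $Q_k(\mu, \Sigma) = \sum_n \gamma_{nk} \log \mathcal{N}(x_n \mid \mu, \Sigma)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Arbitrary data and soft weights standing in for one column of responsibilities.
X = rng.normal(size=(300, 2))
gamma_k = rng.uniform(size=300)
Nk = gamma_k.sum()

# Closed-form M-step for this component: weighted mean and weighted scatter.
mu_hat = (gamma_k[:, None] * X).sum(axis=0) / Nk
Xc = X - mu_hat
Sigma_hat = (gamma_k[:, None] * Xc).T @ Xc / Nk

def Q_k(mu, Sigma):
    """Responsibility-weighted Gaussian log-likelihood for this single component."""
    return np.sum(gamma_k * multivariate_normal(mu, Sigma).logpdf(X))

best = Q_k(mu_hat, Sigma_hat)
for _ in range(5):
    # random perturbations of the closed-form solution should never score higher
    mu_p = mu_hat + 0.1 * rng.normal(size=2)
    Sigma_p = Sigma_hat + 0.05 * np.eye(2)
    assert Q_k(mu_p, Sigma_p) <= best
```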
Example: Two Gaussians in the Plane
Consider 200 samples drawn i.i.d. from a two-component planar mixture $\pi_1^\star\,\mathcal{N}(\mu_1^\star, \Sigma_1^\star) + \pi_2^\star\,\mathcal{N}(\mu_2^\star, \Sigma_2^\star)$ with well-separated true means $\mu_1^\star, \mu_2^\star$ and covariances $\Sigma_1^\star, \Sigma_2^\star$. Initialize the means at two random data points and run EM. What do we expect to see at convergence?
Initialization
Random initialization breaks symmetry and assigns each sample a responsibility according to its distance (Mahalanobis, weighted by $\Sigma_k^{-1}$) to the two starting means.
First iteration
After one E-step and M-step, the means move toward the empirical centers of their respective assigned clusters. The covariance updates deform the component ellipses to better fit the local data shape.
Convergence
After a modest number of iterations, the estimated parameters will be close (but not identical) to the true generative parameters, within the $O(1/\sqrt{N})$ statistical fluctuation. The log-likelihood trace is monotone; the responsibilities on the decision boundary between the two clusters remain genuinely soft (not 0/1). The final configuration depends on the initialization.
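A sketch of this experiment using scikit-learn's `GaussianMixture`; the ground-truth parameters below are illustrative assumptions (the exact values do not matter as long as the two components are well separated).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Assumed ground truth: two well-separated planar Gaussians with equal weights.
mu_true = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma_true = np.array([[[1.0, 0.3], [0.3, 1.0]],
                       [[1.5, -0.4], [-0.4, 0.8]]])

X = np.vstack([rng.multivariate_normal(mu_true[k], Sigma_true[k], size=100)
               for k in range(2)])   # 200 samples total

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      init_params="random", random_state=0).fit(X)

print(gmm.weights_)       # close to [0.5, 0.5], up to O(1/sqrt(N)) fluctuation
print(gmm.means_)         # close to mu_true, possibly with the two labels swapped
print(gmm.lower_bound_)   # EM's per-sample log-likelihood bound; it increases monotonically over iterations
```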
GMM-EM Trajectory on 2-D Data
Watch cluster ellipses evolve during EM iterations on a 2-D synthetic dataset. Each ellipse shows a constant-density contour of one component.
Responsibility Heatmap
Visualize the responsibility $\gamma_1(x)$ as a function of position $x$ in the plane for a two-component GMM. The colour shows how confidently each location is assigned to component 1 vs component 2.
Common Mistake: Singular Covariance Collapse
Mistake:
Running GMM-EM with unconstrained covariances and watching one component collapse onto a single data point, its covariance matrix shrinking toward singularity; the log-likelihood diverges to $+\infty$ and the numerics blow up.
Correction:
The GMM likelihood is unbounded above: placing one component's mean on a single data point and letting $\Sigma_k \to 0$ sends the likelihood to $+\infty$. Standard remedies: (a) add a small diagonal regularizer $\varepsilon I$ to each covariance at every M-step; (b) tie covariances across components; (c) place an inverse-Wishart prior on each $\Sigma_k$ and do MAP-EM. Modern GMM codes all use one of these safeguards.
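As a hedged illustration of remedy (a): in a hand-rolled M-step the ridge is one extra line, and in scikit-learn the analogous safeguard is the `reg_covar` argument (added to the diagonal of every covariance at each M-step).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def covariance_update(X, gamma_k, mu_k, eps=1e-6):
    """Responsibility-weighted covariance plus a small ridge eps*I (remedy (a))."""
    Xc = X - mu_k
    Sigma_k = (gamma_k[:, None] * Xc).T @ Xc / gamma_k.sum()
    return Sigma_k + eps * np.eye(X.shape[1])

# scikit-learn's equivalent safeguard: reg_covar is added to the diagonal at every M-step.
gmm = GaussianMixture(n_components=3, covariance_type="full", reg_covar=1e-6)
```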
Common Mistake: Label Switching
Mistake:
Comparing the estimated $(\pi_k, \mu_k, \Sigma_k)$ component-by-component to the true $(\pi_k^\star, \mu_k^\star, \Sigma_k^\star)$ and concluding EM "failed" because they are different.
Correction:
The GMM likelihood is invariant under permutation of the component labels. EM returns some labelling; compare via a matching (e.g., the Hungarian algorithm on pairwise mean distances) before assessing error. In Bayesian treatments this is handled by identifiability constraints (e.g., ordering the components by their means or weights).
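A minimal sketch of the matching step using SciPy's Hungarian-algorithm solver; the helper name and the choice of pairwise mean distances as the cost are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(mu_est, mu_true):
    """Find the permutation of estimated components that best matches the true ones.

    Uses the Hungarian algorithm on the matrix of pairwise mean distances;
    mu_est[perm[k]] then corresponds to mu_true[k].
    """
    cost = np.linalg.norm(mu_est[:, None, :] - mu_true[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)   # estimated row i matched to true column col[i]
    perm = np.empty_like(col)
    perm[col] = row
    return perm
```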
GMM Covariance Structures
| Structure | Parameters per component | When to use |
|---|---|---|
| Spherical: $\Sigma_k = \sigma_k^2 I$ | $1$ | Clusters are isotropic; few samples |
| Diagonal: $\Sigma_k = \operatorname{diag}(\sigma_{k1}^2, \dots, \sigma_{kd}^2)$ | $d$ | Features are on different scales but uncorrelated |
| Full: unrestricted SPD $\Sigma_k$ | $d(d+1)/2$ | Enough data; correlations matter |
| Tied: shared $\Sigma$ | $d(d+1)/2$ total | Equal-shape ellipses (QDA collapses to LDA) |
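The four structures in the table map directly onto the `covariance_type` options of scikit-learn's `GaussianMixture`:

```python
from sklearn.mixture import GaussianMixture

# One model per row of the table (K = 3 components, data dimension d):
spherical = GaussianMixture(n_components=3, covariance_type="spherical")  # Sigma_k = sigma_k^2 * I
diagonal  = GaussianMixture(n_components=3, covariance_type="diag")       # diag(sigma_k1^2, ..., sigma_kd^2)
full      = GaussianMixture(n_components=3, covariance_type="full")       # unrestricted SPD Sigma_k
tied      = GaussianMixture(n_components=3, covariance_type="tied")       # one shared Sigma for all components
```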
k-means++ Seeding for GMM
The standard initialization for GMM-EM in production software (scikit-learn, MATLAB) is to first run a few iterations of $K$-means with $k$-means++ seeding, then use the resulting centroids as $\mu_k^{(0)}$, the empirical cluster covariances as $\Sigma_k^{(0)}$, and the cluster proportions as $\pi_k^{(0)}$. Random initializations are kept as a diversity source: multi-restart with several seeds is routine. A minimal seeding sketch follows the list below.
- $k$-means++ seeding itself is $O(NKd)$ per pass, cheaper than one EM iteration
- Report the best log-likelihood across restarts, not the average
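A minimal sketch of the seeding recipe described above, using scikit-learn's `KMeans` for the $k$-means++ pass and handing its clusters to `GaussianMixture` through the `*_init` arguments; the short $K$-means run and the small ridge on the cluster covariances are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def kmeanspp_seeded_gmm(X, K, seed=0):
    """Short k-means++-seeded k-means pass, then hand its clusters to GMM-EM."""
    km = KMeans(n_clusters=K, init="k-means++", n_init=1, max_iter=10,
                random_state=seed).fit(X)
    labels, d = km.labels_, X.shape[1]
    weights = np.bincount(labels, minlength=K) / len(X)            # cluster proportions -> pi^(0)
    covs = np.array([np.cov(X[labels == k].T) + 1e-6 * np.eye(d)   # empirical covariances -> Sigma^(0)
                     for k in range(K)])
    gmm = GaussianMixture(n_components=K,
                          weights_init=weights,
                          means_init=km.cluster_centers_,          # centroids -> mu^(0)
                          precisions_init=np.linalg.inv(covs),     # sklearn takes precisions, not covariances
                          random_state=seed)
    return gmm.fit(X)
```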
Historical Note: McLachlan and Peel's Finite Mixture Models
1894-2000. While Dempster-Laird-Rubin introduced EM as a general tool, the application to finite mixtures has its own parallel history. Karl Pearson's 1894 Philosophical Transactions paper on asymmetric frequency curves fit a two-component Gaussian mixture (to crab measurements) using the method of moments, a computational heroism of the pre-computer era. The modern unification of GMM fitting with EM is crystallized in McLachlan and Peel's Finite Mixture Models (2000), which remains the standard reference.