Gaussian Mixture Models

The Prototypical EM Application

No model showcases EM better than the Gaussian mixture. The latent variables are discrete (which component generated each sample), the E-step reduces to Bayes' rule on a small simplex, and the M-step is a weighted version of the closed-form Gaussian MLE. Understanding GMM-EM in depth pays compound interest: every other EM application in this chapter — K-means, HMMs, SBL — is a variation on this one.

Definition:

Gaussian Mixture Model

A $K$-component Gaussian mixture model for $\mathbf{y}\in\mathbb{R}^d$ is
$$f(\mathbf{y};\boldsymbol{\theta}) = \sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{y};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),$$
with mixing weights $\pi_k \geq 0$, $\sum_k \pi_k = 1$, component means $\boldsymbol{\mu}_k\in\mathbb{R}^d$, and component covariances $\boldsymbol{\Sigma}_k\in\mathbb{R}^{d\times d}$ (symmetric positive definite). The parameter is $\boldsymbol{\theta} = \{\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\}_{k=1}^K$.

The latent variable for each sample is the discrete component label $Z_i \in \{1,\ldots,K\}$ with prior $\Pr(Z_i=k) = \pi_k$. Conditionally, $\mathbf{Y}_i \mid Z_i = k \sim \mathcal{N}(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$.
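
This two-stage description doubles as a sampling recipe: draw the label first, then the observation. Below is a minimal NumPy sketch of the generative process (the function name sample_gmm and the seed are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

def sample_gmm(n, pi, mu, Sigma):
    """Draw n samples from a K-component GMM.

    pi: (K,) mixing weights; mu: (K, d) means; Sigma: (K, d, d) covariances.
    """
    # Latent labels: Z_i ~ Categorical(pi).
    z = rng.choice(len(pi), size=n, p=pi)
    # Observations: Y_i | Z_i = k ~ N(mu_k, Sigma_k).
    y = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return y, z
```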

Definition:

Responsibility

For sample $i$ and component $k$, the responsibility is the posterior probability that component $k$ generated $\mathbf{y}_i$ given the current parameters $\boldsymbol{\theta}^{(t)}$:
$$\gamma_{ik}^{(t)} = \Pr(Z_i = k \mid \mathbf{y}_i;\boldsymbol{\theta}^{(t)}) = \frac{\pi_k^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_k^{(t)},\boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_j^{(t)},\boldsymbol{\Sigma}_j^{(t)})}.$$
By construction, $\gamma_{ik}^{(t)} \geq 0$ and $\sum_k \gamma_{ik}^{(t)} = 1$ for each $i$: responsibilities are the soft analogue of hard cluster assignments.
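
In finite precision, both the numerator and denominator of this ratio underflow for points far from every component, so implementations compute responsibilities in log space. A minimal sketch, assuming data Y of shape (n, d) and parameters stacked as arrays (responsibilities is an illustrative name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(Y, pi, mu, Sigma):
    # Log-weights log pi_k + log N(y_i; mu_k, Sigma_k), shape (n, K).
    log_w = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(Y, mu[k], Sigma[k])
         for k in range(len(pi))],
        axis=1,
    )
    # Log-sum-exp trick: subtracting the row-wise max before
    # exponentiating prevents the normalization from underflowing to 0/0.
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)  # rows sum to 1
```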


Theorem: EM Updates for the Gaussian Mixture Model

For $n$ i.i.d. samples $\mathbf{y}_1,\ldots,\mathbf{y}_n$, one EM iteration for the $K$-component GMM consists of:

E-step. For every $i,k$,
$$\gamma_{ik}^{(t)} = \frac{\pi_k^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_k^{(t)},\boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_j^{(t)},\boldsymbol{\Sigma}_j^{(t)})}.$$
Define the effective counts $N_k^{(t)} = \sum_{i=1}^n \gamma_{ik}^{(t)}$.

M-step. The updates are
$$\pi_k^{(t+1)} = \frac{N_k^{(t)}}{n},\qquad \boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^n \gamma_{ik}^{(t)}\mathbf{y}_i,$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^n \gamma_{ik}^{(t)}\,(\mathbf{y}_i-\boldsymbol{\mu}_k^{(t+1)})(\mathbf{y}_i-\boldsymbol{\mu}_k^{(t+1)})^\top.$$

The updates are exactly the sample estimators for $\pi_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, weighted by the responsibilities. If the responsibilities were hard (0/1), we would recover the per-cluster Gaussian MLE.
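
The whole iteration is a few lines of linear algebra. Here is a sketch of one EM step, reusing the hypothetical responsibilities helper from the definition above (em_step is an illustrative name, not a library function):

```python
import numpy as np

def em_step(Y, pi, mu, Sigma):
    """One EM iteration for a K-component GMM.

    Y: (n, d) data; pi: (K,); mu: (K, d); Sigma: (K, d, d).
    """
    n = len(Y)

    # E-step: responsibilities via the log-space helper above, shape (n, K).
    gamma = responsibilities(Y, pi, mu, Sigma)
    N = gamma.sum(axis=0)                     # effective counts N_k

    # M-step: responsibility-weighted sample estimators.
    pi_new = N / n
    mu_new = (gamma.T @ Y) / N[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(len(pi)):
        R = Y - mu_new[k]                     # centred data, (n, d)
        Sigma_new[k] = (gamma[:, k, None] * R).T @ R / N[k]
    return pi_new, mu_new, Sigma_new, gamma
```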

Example: Two Gaussians in the Plane

Consider 200 samples drawn i.i.d. from $0.4\,\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1) + 0.6\,\mathcal{N}(\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)$, with $\boldsymbol{\mu}_1=(-1,0)^\top$, $\boldsymbol{\mu}_2=(2,1)^\top$, $\boldsymbol{\Sigma}_1=\mathbf{I}$, $\boldsymbol{\Sigma}_2=\operatorname{diag}(0.5,2)$. Initialize $\boldsymbol{\mu}_k^{(0)}$ at two random data points and run EM. What do we expect to see at convergence?
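
A sketch of this experiment, chaining the hypothetical sample_gmm and em_step helpers from earlier (seeds and the fixed iteration budget are arbitrary choices):

```python
import numpy as np

# Ground-truth parameters from the example.
pi_true = np.array([0.4, 0.6])
mu_true = np.array([[-1.0, 0.0], [2.0, 1.0]])
Sigma_true = np.array([np.eye(2), np.diag([0.5, 2.0])])

Y, _ = sample_gmm(200, pi_true, mu_true, Sigma_true)

# Initialization: means at two random data points, uniform weights,
# identity covariances.
rng = np.random.default_rng(1)
mu = Y[rng.choice(len(Y), size=2, replace=False)].copy()
pi = np.array([0.5, 0.5])
Sigma = np.array([np.eye(2), np.eye(2)])

for _ in range(100):  # fixed budget in place of a convergence test
    pi, mu, Sigma, gamma = em_step(Y, pi, mu, Sigma)

print(pi)  # typically close to (0.4, 0.6), possibly with labels swapped
```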

Interactive figure: GMM-EM Trajectory on 2-D Data. Cluster ellipses evolve during EM iterations on a 2-D synthetic dataset; each ellipse shows the $2\sigma$ contour of a component.

Interactive figure: Responsibility Heatmap. $\gamma_{ik}$ as a function of position in the plane for a two-component GMM; the colour shows how confidently each location is assigned to component 1 versus component 2.

Common Mistake: Singular Covariance Collapse

Mistake:

Running GMM-EM with unconstrained covariances and watching one component's covariance matrix shrink toward a single data point: the log-likelihood diverges to $+\infty$ and the numerics blow up.

Correction:

The GMM likelihood is unbounded above: placing one component on a single data point with $\boldsymbol{\Sigma}_k \to 0$ sends $\mathcal{N}(\mathbf{y}_i;\mathbf{y}_i,\boldsymbol{\Sigma}_k) \to \infty$. Standard remedies: (a) add a small diagonal regularizer $\boldsymbol{\Sigma}_k \leftarrow \boldsymbol{\Sigma}_k + \epsilon\mathbf{I}$ at each M-step; (b) tie covariances across components; (c) place an inverse-Wishart prior on $\boldsymbol{\Sigma}_k$ and do MAP-EM. Modern GMM codes all use one of these safeguards.
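
Remedy (a) is a one-liner. A sketch that post-processes the M-step covariances (the default $\epsilon$ mirrors scikit-learn's reg_covar, which implements the same safeguard):

```python
import numpy as np

def regularize(Sigma, eps=1e-6):
    # Add eps to every diagonal entry so no component covariance can
    # collapse onto a single data point (remedy (a) in the text).
    return Sigma + eps * np.eye(Sigma.shape[-1])

# pi, mu, Sigma, gamma = em_step(Y, pi, mu, Sigma)
# Sigma = regularize(Sigma)
```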

Common Mistake: Label Switching

Mistake:

Comparing the estimated $\boldsymbol{\mu}_1^{(t)}$ directly to the true $\boldsymbol{\mu}_1$ and concluding that EM "failed" because they differ.

Correction:

The GMM likelihood is invariant under permutation of the $K$ component labels, so EM returns the components in an arbitrary order. Match estimated components to true ones (e.g., with the Hungarian algorithm) before assessing error. In Bayesian treatments this is handled by identifiability constraints (e.g., $\mu_1 \leq \mu_2 \leq \cdots$).
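
A sketch of the matching step using SciPy's Hungarian-algorithm solver (match_components is an illustrative name; squared distance between means is one of several reasonable costs):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(mu_est, mu_true):
    # cost[k, j] = squared distance between estimated mean k and true mean j.
    cost = ((mu_est[:, None, :] - mu_true[None, :, :]) ** 2).sum(axis=-1)
    _, perm = linear_sum_assignment(cost)  # Hungarian algorithm
    return perm                            # estimated k matches true perm[k]

# perm = match_components(mu, mu_true)
# errors = np.linalg.norm(mu - mu_true[perm], axis=1)
```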

GMM Covariance Structures

| Structure | Parameters per component | When to use |
| --- | --- | --- |
| Spherical: $\boldsymbol{\Sigma}_k = \sigma_k^2\mathbf{I}$ | $1$ | Clusters are isotropic; few samples |
| Diagonal: $\boldsymbol{\Sigma}_k = \operatorname{diag}(\sigma_{k,1}^2,\ldots,\sigma_{k,d}^2)$ | $d$ | Features are on different scales but uncorrelated |
| Full: $\boldsymbol{\Sigma}_k$ unrestricted SPD | $d(d+1)/2$ | Enough data; correlations matter |
| Tied: $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ shared | $d(d+1)/2$ total | Equal-shape ellipses (QDA collapses to LDA) |
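
These four structures map directly onto scikit-learn's covariance_type options. A quick comparison sketch, assuming data Y is already loaded and using BIC to trade fit against parameter count:

```python
from sklearn.mixture import GaussianMixture

for cov_type in ["spherical", "diag", "full", "tied"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov_type).fit(Y)
    print(f"{cov_type:9s} BIC = {gm.bic(Y):.1f}")  # lower BIC is better
```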
🔧Engineering Note

k-means++ Seeding for GMM

The standard initialization for GMM-EM in production software (scikit-learn, MATLAB) is to first run a few iterations of $k$-means with $k$-means++ seeding, then use the resulting centroids as $\boldsymbol{\mu}_k^{(0)}$, the empirical cluster covariances as $\boldsymbol{\Sigma}_k^{(0)}$, and the cluster proportions as $\pi_k^{(0)}$. Random initializations are kept as a diversity source: multi-restart with $\geq 5$ seeds is routine (see the sketch after the constraints below).

Practical Constraints
  • $k$-means++ itself is $O(nKd)$ per pass — cheaper than one EM iteration

  • Report the best log-likelihood across restarts, not the average
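
For reference, scikit-learn's GaussianMixture bundles all of this: k-means initialization, multiple restarts, best-run selection, and covariance regularization. A sketch (Y is assumed data; the hyperparameters are illustrative):

```python
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(
    n_components=3,         # illustrative choice
    covariance_type="full",
    init_params="kmeans",   # k-means (k-means++-seeded) initialization
    n_init=5,               # 5 restarts; the best run is kept automatically
    reg_covar=1e-6,         # diagonal jitter against covariance collapse
).fit(Y)
print(gm.lower_bound_)      # best run's per-sample log-likelihood bound
```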

Historical Note: McLachlan and Peel's Finite Mixture Models

1894-2000

While Dempster-Laird-Rubin introduced EM as a general tool, the application to finite mixtures has its own parallel history. Karl Pearson's 1894 Philosophical Transactions paper on asymmetric frequency curves fit a two-component Gaussian mixture (to crab measurements) using the method of moments — a computational heroism of the pre-computer era. The modern unification of GMM fitting with EM is crystallized in McLachlan and Peel's Finite Mixture Models (2000), which remains the standard reference.