Gaussian Mixture Models

The Prototypical EM Application

No model showcases EM better than the Gaussian mixture. The latent variables are discrete (which component generated each sample), the E-step reduces to Bayes' rule on a small simplex, and the M-step is a weighted version of the closed-form Gaussian MLE. Understanding GMM-EM in depth pays compound interest: every other EM application in this chapter — K-means, HMMs, SBL — is a variation on this one.

Definition:

Gaussian Mixture Model

A $K$-component Gaussian mixture model for $\mathbf{y}\in\mathbb{R}^d$ is
$$f(\mathbf{y};\boldsymbol{\theta}) = \sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{y};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),$$
with mixing weights $\pi_k \geq 0$, $\sum_k \pi_k = 1$, component means $\boldsymbol{\mu}_k\in\mathbb{R}^d$, and component covariances $\boldsymbol{\Sigma}_k\in\mathbb{R}^{d\times d}$ (symmetric positive definite). The parameter is $\boldsymbol{\theta} = \{\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\}_{k=1}^K$.

The latent variable for each sample is the discrete component label $Z_i \in \{1,\ldots,K\}$ with prior $\Pr(Z_i=k) = \pi_k$. Conditionally, $\mathbf{Y}_i \mid Z_i = k \sim \mathcal{N}(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$.
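
This two-stage description doubles as a sampling recipe: draw the label first, then the observation. Below is a minimal NumPy sketch of the generative process (the function name sample_gmm and the seed are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

def sample_gmm(n, pi, mu, Sigma):
    """Draw n samples from a K-component GMM.

    pi: (K,) mixing weights; mu: (K, d) means; Sigma: (K, d, d) covariances.
    """
    # Latent labels: Z_i ~ Categorical(pi).
    z = rng.choice(len(pi), size=n, p=pi)
    # Observations: Y_i | Z_i = k ~ N(mu_k, Sigma_k).
    y = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return y, z
```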

Definition:

Responsibility

For sample $i$ and component $k$, the responsibility is the posterior probability that component $k$ generated $\mathbf{y}_i$ given the current parameters $\boldsymbol{\theta}^{(t)}$:
$$\gamma_{ik}^{(t)} = \Pr(Z_i = k \mid \mathbf{y}_i;\boldsymbol{\theta}^{(t)}) = \frac{\pi_k^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_k^{(t)},\boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_j^{(t)},\boldsymbol{\Sigma}_j^{(t)})}.$$
By construction, $\gamma_{ik}^{(t)} \geq 0$ and $\sum_k \gamma_{ik}^{(t)} = 1$ for each $i$: responsibilities are the soft analogue of hard cluster assignments.
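
In finite precision, both the numerator and denominator of this ratio underflow for points far from every component, so implementations compute responsibilities in log space. A minimal sketch, assuming data Y of shape (n, d) and parameters stacked as arrays (responsibilities is an illustrative name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(Y, pi, mu, Sigma):
    # Log-weights log pi_k + log N(y_i; mu_k, Sigma_k), shape (n, K).
    log_w = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(Y, mu[k], Sigma[k])
         for k in range(len(pi))],
        axis=1,
    )
    # Log-sum-exp trick: subtracting the row-wise max before
    # exponentiating prevents the normalization from underflowing to 0/0.
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)  # rows sum to 1
```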


Theorem: EM Updates for the Gaussian Mixture Model

For $n$ i.i.d. samples $\mathbf{y}_1,\ldots,\mathbf{y}_n$, one EM iteration for the $K$-component GMM consists of:

E-step. For every $i,k$,
$$\gamma_{ik}^{(t)} = \frac{\pi_k^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_k^{(t)},\boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)}\,\mathcal{N}(\mathbf{y}_i;\boldsymbol{\mu}_j^{(t)},\boldsymbol{\Sigma}_j^{(t)})}.$$
Define the effective counts $N_k^{(t)} = \sum_{i=1}^n \gamma_{ik}^{(t)}$.

M-step. The updates are
$$\pi_k^{(t+1)} = \frac{N_k^{(t)}}{n},\qquad \boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^n \gamma_{ik}^{(t)}\mathbf{y}_i,$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k^{(t)}}\sum_{i=1}^n \gamma_{ik}^{(t)}\,(\mathbf{y}_i-\boldsymbol{\mu}_k^{(t+1)})(\mathbf{y}_i-\boldsymbol{\mu}_k^{(t+1)})^\top.$$

The updates are exactly the sample estimators for $\pi_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, weighted by the responsibilities. If the responsibilities were hard (0/1), we would recover the per-cluster Gaussian MLE.
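
The whole iteration is a few lines of linear algebra. Here is a sketch of one EM step, reusing the hypothetical responsibilities helper from the definition above (em_step is an illustrative name, not a library function):

```python
import numpy as np

def em_step(Y, pi, mu, Sigma):
    """One EM iteration for a K-component GMM.

    Y: (n, d) data; pi: (K,); mu: (K, d); Sigma: (K, d, d).
    """
    n = len(Y)

    # E-step: responsibilities via the log-space helper above, shape (n, K).
    gamma = responsibilities(Y, pi, mu, Sigma)
    N = gamma.sum(axis=0)                     # effective counts N_k

    # M-step: responsibility-weighted sample estimators.
    pi_new = N / n
    mu_new = (gamma.T @ Y) / N[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(len(pi)):
        R = Y - mu_new[k]                     # centred data, (n, d)
        Sigma_new[k] = (gamma[:, k, None] * R).T @ R / N[k]
    return pi_new, mu_new, Sigma_new, gamma
```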

Example: Two Gaussians in the Plane

Consider 200 samples drawn i.i.d. from $0.4\,\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1) + 0.6\,\mathcal{N}(\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_2)$, with $\boldsymbol{\mu}_1=(-1,0)^\top$, $\boldsymbol{\mu}_2=(2,1)^\top$, $\boldsymbol{\Sigma}_1=\mathbf{I}$, $\boldsymbol{\Sigma}_2=\operatorname{diag}(0.5,2)$. Initialize $\boldsymbol{\mu}_k^{(0)}$ at two random data points and run EM. What do we expect to see at convergence?
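
A sketch of this experiment, chaining the hypothetical sample_gmm and em_step helpers from earlier (seeds and the fixed iteration budget are arbitrary choices):

```python
import numpy as np

# Ground-truth parameters from the example.
pi_true = np.array([0.4, 0.6])
mu_true = np.array([[-1.0, 0.0], [2.0, 1.0]])
Sigma_true = np.array([np.eye(2), np.diag([0.5, 2.0])])

Y, _ = sample_gmm(200, pi_true, mu_true, Sigma_true)

# Initialization: means at two random data points, uniform weights,
# identity covariances.
rng = np.random.default_rng(1)
mu = Y[rng.choice(len(Y), size=2, replace=False)].copy()
pi = np.array([0.5, 0.5])
Sigma = np.array([np.eye(2), np.eye(2)])

for _ in range(100):  # fixed budget in place of a convergence test
    pi, mu, Sigma, gamma = em_step(Y, pi, mu, Sigma)

print(pi)  # typically close to (0.4, 0.6), possibly with labels swapped
```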

Interactive figure: GMM-EM Trajectory on 2-D Data. Cluster ellipses evolve during EM iterations on a 2-D synthetic dataset; each ellipse shows the $2\sigma$ contour of a component.

Interactive figure: Responsibility Heatmap. $\gamma_{ik}$ as a function of position in the plane for a two-component GMM; the colour shows how confidently each location is assigned to component 1 versus component 2.

Common Mistake: Singular Covariance Collapse

Mistake:

Running GMM-EM with unconstrained covariances and watching one component's covariance matrix shrink toward a single data point: the log-likelihood diverges to $+\infty$ and the numerics blow up.

Correction:

The GMM likelihood is unbounded above: placing one component on a single data point with $\boldsymbol{\Sigma}_k \to 0$ sends $\mathcal{N}(\mathbf{y}_i;\mathbf{y}_i,\boldsymbol{\Sigma}_k) \to \infty$. Standard remedies: (a) add a small diagonal regularizer $\boldsymbol{\Sigma}_k \leftarrow \boldsymbol{\Sigma}_k + \epsilon\mathbf{I}$ at each M-step; (b) tie covariances across components; (c) place an inverse-Wishart prior on $\boldsymbol{\Sigma}_k$ and do MAP-EM. Modern GMM codes all use one of these safeguards.
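
Remedy (a) is a one-liner. A sketch that post-processes the M-step covariances (the default $\epsilon$ mirrors scikit-learn's reg_covar, which implements the same safeguard):

```python
import numpy as np

def regularize(Sigma, eps=1e-6):
    # Add eps to every diagonal entry so no component covariance can
    # collapse onto a single data point (remedy (a) in the text).
    return Sigma + eps * np.eye(Sigma.shape[-1])

# pi, mu, Sigma, gamma = em_step(Y, pi, mu, Sigma)
# Sigma = regularize(Sigma)
```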

Common Mistake: Label Switching

Mistake:

Comparing the estimated $\boldsymbol{\mu}_1^{(t)}$ directly to the true $\boldsymbol{\mu}_1$ and concluding that EM "failed" because they differ.

Correction:

The GMM likelihood is invariant under permutation of the $K$ component labels, so EM returns the components in an arbitrary order. Match estimated components to true ones (e.g., with the Hungarian algorithm) before assessing error. In Bayesian treatments this is handled by identifiability constraints (e.g., $\mu_1 \leq \mu_2 \leq \cdots$).
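
A sketch of the matching step using SciPy's Hungarian-algorithm solver (match_components is an illustrative name; squared distance between means is one of several reasonable costs):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(mu_est, mu_true):
    # cost[k, j] = squared distance between estimated mean k and true mean j.
    cost = ((mu_est[:, None, :] - mu_true[None, :, :]) ** 2).sum(axis=-1)
    _, perm = linear_sum_assignment(cost)  # Hungarian algorithm
    return perm                            # estimated k matches true perm[k]

# perm = match_components(mu, mu_true)
# errors = np.linalg.norm(mu - mu_true[perm], axis=1)
```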

GMM Covariance Structures

| Structure | Parameters per component | When to use |
| --- | --- | --- |
| Spherical: $\boldsymbol{\Sigma}_k = \sigma_k^2\mathbf{I}$ | $1$ | Clusters are isotropic; few samples |
| Diagonal: $\boldsymbol{\Sigma}_k = \operatorname{diag}(\sigma_{k,1}^2,\ldots,\sigma_{k,d}^2)$ | $d$ | Features are on different scales but uncorrelated |
| Full: $\boldsymbol{\Sigma}_k$ unrestricted SPD | $d(d+1)/2$ | Enough data; correlations matter |
| Tied: $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ shared | $d(d+1)/2$ total | Equal-shape ellipses (QDA collapses to LDA) |
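
These four structures map directly onto scikit-learn's covariance_type options. A quick comparison sketch, assuming data Y is already loaded and using BIC to trade fit against parameter count:

```python
from sklearn.mixture import GaussianMixture

for cov_type in ["spherical", "diag", "full", "tied"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov_type).fit(Y)
    print(f"{cov_type:9s} BIC = {gm.bic(Y):.1f}")  # lower BIC is better
```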
🔧Engineering Note

k-means++ Seeding for GMM

The standard initialization for GMM-EM in production software (scikit-learn, MATLAB) is to first run a few iterations of $k$-means with $k$-means++ seeding, then use the resulting centroids as $\boldsymbol{\mu}_k^{(0)}$, the empirical cluster covariances as $\boldsymbol{\Sigma}_k^{(0)}$, and the cluster proportions as $\pi_k^{(0)}$. Random initializations are kept as a diversity source: multi-restart with $\geq 5$ seeds is routine (see the sketch after the constraints below).

Practical Constraints
  • $k$-means++ itself is $O(nKd)$ per pass — cheaper than one EM iteration

  • Report the best log-likelihood across restarts, not the average
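
For reference, scikit-learn's GaussianMixture bundles all of this: k-means initialization, multiple restarts, best-run selection, and covariance regularization. A sketch (Y is assumed data; the hyperparameters are illustrative):

```python
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(
    n_components=3,         # illustrative choice
    covariance_type="full",
    init_params="kmeans",   # k-means (k-means++-seeded) initialization
    n_init=5,               # 5 restarts; the best run is kept automatically
    reg_covar=1e-6,         # diagonal jitter against covariance collapse
).fit(Y)
print(gm.lower_bound_)      # best run's per-sample log-likelihood bound
```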

Historical Note: McLachlan and Peel's Finite Mixture Models

1894-2000

While Dempster-Laird-Rubin introduced EM as a general tool, the application to finite mixtures has its own parallel history. Karl Pearson's 1894 Philosophical Transactions paper on asymmetric frequency curves fit a two-component Gaussian mixture (to crab measurements) using the method of moments — a computational heroism of the pre-computer era. The modern unification of GMM fitting with EM is crystallized in McLachlan and Peel's Finite Mixture Models (2000), which remains the standard reference.