Applications: K-means, HMMs, and Sparse Bayesian Learning

EM Beyond the GMM

Once the GMM updates are understood, a whole family of algorithms reveals itself as special cases or close relatives of EM. We highlight three: K-means (a hard-assignment limit of GMM-EM), the Baum-Welch algorithm for hidden Markov models, and sparse Bayesian learning (EM applied to a Gaussian-variance latent-variable model). Each preserves the E-then-M structure but changes the latent-variable model.

Definition:

K-means Clustering

Given data $\mathbf{y}_1,\ldots,\mathbf{y}_n \in \mathbb{R}^d$ and a number of clusters $K$, K-means finds cluster centers $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_K$ and hard assignments $c_i \in \{1,\ldots,K\}$ minimizing $J(\boldsymbol{\mu},c) = \sum_{i=1}^n \|\mathbf{y}_i - \boldsymbol{\mu}_{c_i}\|^2$. The standard Lloyd iteration alternates (a minimal sketch follows the list):

  • Assignment: $c_i \leftarrow \arg\min_k \|\mathbf{y}_i - \boldsymbol{\mu}_k\|^2$.
  • Update: $\boldsymbol{\mu}_k \leftarrow \frac{1}{|\{i : c_i = k\}|} \sum_{i : c_i = k} \mathbf{y}_i$.
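A minimal NumPy sketch of the Lloyd iteration (the function name, random initialization, and stopping rule are illustrative choices, not part of the definition):

```python
import numpy as np

def kmeans(Y, K, n_iters=100, seed=0):
    """Lloyd's algorithm: Y is (n, d); returns centers (K, d) and labels (n,)."""
    rng = np.random.default_rng(seed)
    # Initialize centers as K distinct data points chosen at random.
    mu = Y[rng.choice(len(Y), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest center in squared Euclidean distance.
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        c = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_mu = np.array([Y[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, c
```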

Theorem: K-means as a Hard-Assignment Limit of GMM-EM

Consider a GMM with equal weights $\pi_k = 1/K$ and isotropic covariances $\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}$. As $\sigma^2 \to 0^+$, the EM responsibilities become one-hot: $\gamma_{ik} \to \mathbb{1}[k = \arg\min_j \|\mathbf{y}_i - \boldsymbol{\mu}_j\|^2]$, and the EM updates for $\{\boldsymbol{\mu}_k\}$ reduce to the Lloyd iteration for K-means.

Small $\sigma^2$ makes the posterior concentrate on a single component (whichever is closest). K-means is "EM at zero temperature." Soft EM (large $\sigma^2$) respects uncertainty in the assignments.
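A quick numerical check of this limit (the point and centers are arbitrary illustrative values):

```python
import numpy as np

def responsibilities(y, mus, sigma2):
    """Equal-weight GMM responsibilities under isotropic covariance sigma2 * I."""
    d2 = ((y - mus) ** 2).sum(axis=1)   # squared distance to each center
    logits = -d2 / (2 * sigma2)         # log-density up to a shared constant
    logits -= logits.max()              # stabilize before exponentiating
    g = np.exp(logits)
    return g / g.sum()

y = np.array([0.4, 0.0])
mus = np.array([[0.0, 0.0], [1.0, 0.0]])
for s2 in [1.0, 0.1, 0.01]:
    print(s2, responsibilities(y, mus, s2))
# As sigma2 shrinks, the responsibility vector approaches (1, 0):
# one-hot on the nearest center, exactly the K-means assignment.
```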

Baum-Welch: EM for Hidden Markov Models

A hidden Markov model (HMM) has discrete hidden states $S_1,\ldots,S_T \in \{1,\ldots,K\}$ evolving as a Markov chain with transition matrix $\mathbf{A}$ and emitting observations $\mathbf{Y}_t$ through an emission distribution $p(\mathbf{y} \mid s; \boldsymbol{\phi})$. The parameters are $\boldsymbol{\theta} = (\boldsymbol{\pi}_0, \mathbf{A}, \boldsymbol{\phi})$. Because the state sequence is latent, the maximum-likelihood estimate has no closed form.

The Baum-Welch algorithm is EM applied here:

  • E-step: compute the state marginals $\gamma_t(k) = \Pr(S_t = k \mid \mathbf{y}; \boldsymbol{\theta}^{(t)})$ and pairwise marginals $\xi_t(j,k) = \Pr(S_{t-1} = j, S_t = k \mid \mathbf{y}; \boldsymbol{\theta}^{(t)})$ via the forward-backward recursion.
  • M-step: update $\pi_{0,k} = \gamma_1(k)$, $A_{jk} = \frac{\sum_t \xi_t(j,k)}{\sum_t \gamma_{t-1}(j)}$, and $\boldsymbol{\phi}_k$ by maximizing $\sum_t \gamma_t(k) \log p(\mathbf{y}_t \mid k; \boldsymbol{\phi}_k)$.

Forward-backward computes the exact posterior for the HMM, so Baum-Welch is ordinary EM, not variational EM.
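A compact sketch of one Baum-Welch iteration for scalar Gaussian emissions $\mathcal{N}(\mu_k, \sigma_k^2)$, using the scaled forward-backward recursion for numerical stability (the function name and the per-state variances are illustrative choices):

```python
import numpy as np

def baum_welch_step(y, pi0, A, mu, var):
    """One EM iteration for an HMM with scalar Gaussian emissions N(mu[k], var[k])."""
    T, K = len(y), len(pi0)
    # Emission likelihoods B[t, k] = N(y[t]; mu[k], var[k]).
    B = np.exp(-(y[:, None] - mu[None, :]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    # E-step: scaled forward-backward recursion.
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi0 * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                        # gamma[t, k] = P(S_t = k | y)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None])  # pairwise marginals

    # M-step: closed-form updates from the responsibilities.
    pi0_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    w = gamma.sum(axis=0)
    mu_new = (gamma * y[:, None]).sum(axis=0) / w
    var_new = (gamma * (y[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / w
    loglik = np.log(c).sum()                    # log p(y; theta) from the scalings
    return pi0_new, A_new, mu_new, var_new, loglik
```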

Definition:

Sparse Bayesian Learning (SBL)

Consider the linear model $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. SBL places a zero-mean Gaussian prior with unknown variance on each coefficient: $x_j \sim \mathcal{N}(0, \alpha_j^{-1})$ independently. The hyperparameter vector $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_n)$ is estimated by type-II maximum likelihood (evidence maximization), and the latent variable is $\mathbf{x}$ itself. EM yields closed-form updates:

  • E-step (Gaussian posterior): $p(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\alpha}^{(t)}, \sigma^2) = \mathcal{N}(\boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)})$ with $\boldsymbol{\Sigma}^{(t)} = (\sigma^{-2}\mathbf{A}^\top\mathbf{A} + \operatorname{diag}(\boldsymbol{\alpha}^{(t)}))^{-1}$ and $\boldsymbol{\mu}^{(t)} = \sigma^{-2}\boldsymbol{\Sigma}^{(t)}\mathbf{A}^\top\mathbf{y}$.
  • M-step: $\alpha_j^{(t+1)} = 1 \big/ \big((\mu_j^{(t)})^2 + \Sigma_{jj}^{(t)}\big)$.

As iterations proceed, irrelevant coefficients see $\alpha_j \to \infty$, driving the corresponding $x_j \to 0$. This automatic relevance determination (ARD) yields sparse estimates with no hand-tuned regularization parameter.
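The updates above translate almost line-for-line into code. A minimal sketch (the iteration count and the precision cap are illustrative practical choices, not part of the model):

```python
import numpy as np

def sbl_em(A, y, sigma2, n_iters=200, alpha_max=1e8):
    """EM for sparse Bayesian learning: returns posterior mean mu and precisions alpha."""
    m, n = A.shape
    alpha = np.ones(n)
    AtA, Aty = A.T @ A, A.T @ y
    for _ in range(n_iters):
        # E-step: Gaussian posterior over x given the current precisions.
        Sigma = np.linalg.inv(AtA / sigma2 + np.diag(alpha))
        mu = Sigma @ Aty / sigma2
        # M-step: ARD update of each prior precision.
        alpha = 1.0 / (mu ** 2 + np.diag(Sigma))
        alpha = np.minimum(alpha, alpha_max)  # cap to avoid numerical overflow
    return mu, alpha
```

Coefficients whose $\alpha_j$ hits the cap are effectively pruned; a more careful implementation would delete those columns so the matrix inverse shrinks as the solution sparsifies.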

🎓 CommIT Contribution (2020)

EM-Based Sparse Bayesian Learning for Massive Random Access

M. Ke, G. Caire, IEEE Trans. Signal Processing, vol. 68

In grant-free massive random access, only a small subset of the $K$ users in a cell are active at any given time, and the base station must jointly detect activity and estimate channels. Modeling each user's channel as drawn from a Gaussian prior with unknown variance $\alpha_k^{-1}$ (activity $\Leftrightarrow$ finite $\alpha_k$), the problem becomes an SBL instance at massive-MIMO scale. The EM updates derived here, which exploit the Kronecker structure of the sensing matrix induced by the pilot-plus-antenna-array geometry, deliver orders-of-magnitude complexity savings over vanilla SBL while preserving the Bayes-optimal phase transition. This line of work is part of the CommIT group's programme on unsourced random access for 6G.


Turbo Equalization as EM

In turbo equalization, a soft-output equalizer and a soft-input/soft-output decoder exchange extrinsic log-likelihood ratios. Viewed probabilistically, the transmitted coded symbols play the role of latent variables: the decoder implements a form of E-step (posterior computation over symbols given LLRs), and the equalizer's reliance on symbol priors is the M-step analogue. The connection is not perfect (turbo exchanges extrinsic information precisely to avoid double-counting), but the E-then-M rhythm is exactly the same, and so is the monotonicity-under-approximation intuition.

Sensitivity to Initialization: Multi-Restart EM

EM converges only to a local maximum, so multiple random initializations, each run to convergence, generally produce different local maxima. The standard selection rule is to keep the run with the best final log-likelihood.
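A minimal restart wrapper, parameterized by a single-run routine `em_fit(Y, K, seed) -> (params, loglik)` (a hypothetical signature, not a fixed API):

```python
def multi_restart_em(em_fit, Y, K, n_restarts=8):
    """Run EM from several seeds and keep the run with the best log-likelihood."""
    runs = (em_fit(Y, K, seed=s) for s in range(n_restarts))
    return max(runs, key=lambda run: run[1])  # run = (params, loglik)
```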


EM and Its Relatives

| Algorithm | Latent variables | E-step | M-step |
|---|---|---|---|
| EM (generic) | Arbitrary $\mathbf{Z}$ | Exact posterior $p(\mathbf{z} \mid \mathbf{y}; \boldsymbol{\theta})$ | $\arg\max_{\boldsymbol{\theta}} Q$ |
| GMM-EM | Component labels $Z_i$ | Responsibilities $\gamma_{ik}$ | Weighted Gaussian MLE |
| K-means | Component labels $Z_i$ | Hard assignments (zero temperature) | Cluster means |
| Baum-Welch | State sequence $S_{1:T}$ | Forward-backward smoothing | Count-based transition + emission MLE |
| SBL / ARD | Coefficients $\mathbf{x}$ | Gaussian posterior mean + covariance | $\alpha_j \leftarrow 1/(\mu_j^2 + \Sigma_{jj})$ |
| Variational EM | Arbitrary $\mathbf{Z}$ | Restricted $q \in \mathcal{Q}$ (not the true posterior) | $\arg\max_{\boldsymbol{\theta}} \mathcal{F}(q, \boldsymbol{\theta})$ |

Why This Matters: Channel Estimation with Unknown Interference

Consider pilot-based channel estimation in a cell where neighbouring cells transmit asynchronously: the interference $\mathbf{i}$ has unknown statistics. A two-component Gaussian mixture (desired user + interferer) captures this neatly, with the interferer's symbol as a latent variable. EM alternates between LMMSE channel estimation (given current interference statistics) and covariance estimation (given current channel), a workhorse recipe in coordinated multi-point (CoMP) and cell-free massive MIMO receivers.
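As a toy version of this alternation, consider the simplified pilot model $\mathbf{y}_t = \mathbf{h} + \mathbf{n}_t$ with all-ones pilots, channel prior $\mathbf{h} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_h)$, and unknown interference-plus-noise covariance $\mathbf{R}$. The model and every name below are illustrative assumptions, not the CoMP receiver itself:

```python
import numpy as np

def em_channel_and_interference(Y, C_h, n_iters=20):
    """Alternate LMMSE channel estimation and interference-covariance estimation.

    Y: (T, d) pilot-slot observations y_t = h + n_t, n_t ~ N(0, R), all-ones pilots.
    C_h: (d, d) prior covariance of the channel h. Returns (h_hat, R_hat).
    """
    T, d = Y.shape
    R = np.eye(d)                     # initial interference-plus-noise covariance
    ybar = Y.mean(axis=0)
    for _ in range(n_iters):
        # E-step: Gaussian posterior of h given R (LMMSE mean + covariance).
        P = np.linalg.inv(np.linalg.inv(C_h) + T * np.linalg.inv(R))
        h_hat = P @ (T * np.linalg.solve(R, ybar))
        # M-step: re-estimate R from residuals, adding the posterior
        # uncertainty P in h, as the exact EM update requires.
        E = Y - h_hat
        R = E.T @ E / T + P
    return h_hat, R
```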

See full treatment in Greedy Algorithms: OMP, CoSaMP, IHT

🔧 Engineering Note

EM vs. Stochastic Gradient on Latent-Variable Models

For very large datasets, a full E-step over all samples becomes prohibitive. Two practical alternatives: (i) stochastic EM performs the E-step on a mini-batch and takes a damped M-step; (ii) gradient-based EM directly applies SGD to the ELBO using the reparameterization trick. In the deep-learning era, variational autoencoders (VAEs) are precisely gradient-based variational EM. The textbook EM algorithm is the ancestor that makes sense of all these variants.

Practical Constraints

  • Full-batch EM: exact M-step, can be slow for $n \gg 10^6$
  • Stochastic EM: $O(\text{batch}/n)$ speedup per iteration, requires step-size tuning (see the sketch after this list)
  • VAE / amortized EM: replace the per-sample E-step with a neural encoder
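A sketch of stochastic (stepwise) EM for the means of an equal-weight, isotropic GMM, assuming $\sigma^2$ is known; the polynomial step-size schedule is one common choice, not the only one:

```python
import numpy as np

def stochastic_em_gmm_means(Y, mu, sigma2, batch=256, n_steps=1000, kappa=0.6, seed=0):
    """Mini-batch EM for GMM means via damped sufficient-statistic updates."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    s0 = np.ones(K) / K                  # running responsibility mass per component
    s1 = mu * s0[:, None]                # running responsibility-weighted sums
    for t in range(1, n_steps + 1):
        Yb = Y[rng.choice(len(Y), size=batch, replace=False)]
        # Mini-batch E-step: responsibilities under the current means.
        d2 = ((Yb[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        g = np.exp(-d2 / (2 * sigma2))
        g /= g.sum(axis=1, keepdims=True)
        # Damped M-step: blend batch statistics into the running ones.
        rho = t ** (-kappa)              # step size; kappa in (0.5, 1] for convergence
        s0 = (1 - rho) * s0 + rho * g.mean(axis=0)
        s1 = (1 - rho) * s1 + rho * (g.T @ Yb) / batch
        mu = s1 / s0[:, None]
    return mu
```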

Quick Check

Which of the following is not true of K-means relative to GMM-EM?

  • K-means assigns each point to exactly one cluster.
  • K-means minimizes the sum of squared distances to cluster centers.
  • K-means estimates per-cluster covariance matrices.
  • K-means can be obtained as a $\sigma^2 \to 0$ limit of GMM-EM.

Quick Check

In sparse Bayesian learning, what happens to coefficients $x_j$ for which the estimated hyperparameter $\alpha_j$ grows very large?

  • They acquire large posterior variance.
  • They are pushed toward zero (pruned).
  • They dominate the posterior mean.
  • They cause the noise variance $\sigma^2$ to diverge.

Example: Two-State HMM via Baum-Welch

Consider a hidden Markov model with two states $\{1,2\}$, transition matrix $\mathbf{A}$ with entries $a_{ij} = P(Z_{n+1} = j \mid Z_n = i)$, initial distribution $\boldsymbol{\pi}$, and Gaussian emission densities $f(y_n \mid Z_n = i) = \mathcal{N}(y_n; \mu_i, \sigma^2)$. Given a sequence $y_{1:N}$, state the E-step and M-step updates for the parameters $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \mu_1, \mu_2, \sigma^2)$.
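A sketch of the solution, consistent with the Baum-Welch updates above: the E-step runs forward-backward under the current parameters to obtain $\gamma_n(i) = P(Z_n = i \mid y_{1:N})$ and $\xi_n(i,j) = P(Z_n = i, Z_{n+1} = j \mid y_{1:N})$. The M-step then sets $\pi_i = \gamma_1(i)$, $a_{ij} = \sum_{n=1}^{N-1} \xi_n(i,j) \big/ \sum_{n=1}^{N-1} \gamma_n(i)$, $\mu_i = \sum_n \gamma_n(i)\, y_n \big/ \sum_n \gamma_n(i)$, and, since the variance is shared across both states, $\sigma^2 = \frac{1}{N} \sum_{i=1}^{2} \sum_{n=1}^{N} \gamma_n(i)\,(y_n - \mu_i)^2$.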


Baum-Welch algorithm

The EM algorithm specialized to hidden Markov models. Its E-step is the forward-backward recursion, which computes posterior state marginals; its M-step updates the initial distribution, transition matrix, and emission parameters in closed form from the responsibilities. It predates general EM by nearly a decade and is the workhorse training algorithm for HMMs in speech, bioinformatics, and time-series modeling.

Automatic relevance determination (ARD)

A hierarchical-Bayes mechanism in which each coefficient $x_j$ carries its own prior precision hyperparameter $\alpha_j$, and the $\alpha_j$ are themselves estimated (typically by EM). Irrelevant features drive their $\alpha_j \to \infty$, shrinking the corresponding posterior onto zero and effectively pruning them. ARD underlies the relevance vector machine and is the canonical Bayesian route to sparsity without an explicit $\ell_1$ penalty.