Applications: K-means, HMMs, and Sparse Bayesian Learning
EM Beyond the GMM
Once the GMM updates are understood, a whole family of algorithms reveals itself as special cases or close relatives of EM. We highlight three: K-means (a hard-assignment limit of GMM-EM), the Baum-Welch algorithm for hidden Markov models, and sparse Bayesian learning (EM applied to a Gaussian-variance latent-variable model). Each preserves the E-then-M structure but changes the latent-variable model.
Definition: K-means Clustering
K-means Clustering
Given data $x_1, \dots, x_N \in \mathbb{R}^d$ and a number of clusters $K$, K-means finds cluster centers $\mu_1, \dots, \mu_K$ and hard assignments $z_n \in \{1, \dots, K\}$ minimizing $J = \sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2$. The standard Lloyd iteration alternates:
- Assignment: $z_n \leftarrow \arg\min_k \|x_n - \mu_k\|^2$.
- Update: $\mu_k \leftarrow \frac{1}{N_k} \sum_{n: z_n = k} x_n$ with $N_k = |\{n : z_n = k\}|$.
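A minimal numpy sketch of the Lloyd iteration as just defined; the toy data, the initialization at random data points, and the iteration cap are illustrative assumptions.

```python
import numpy as np

def kmeans_lloyd(X, K, n_iters=100, seed=0):
    """Minimal Lloyd iteration: alternate hard assignments and mean updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # init at K random points
    for _ in range(n_iters):
        # Assignment step: nearest center in squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        z = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned samples.
        new_centers = np.array([
            X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, z

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = kmeans_lloyd(X, K=2)
```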
Theorem: K-means as a Hard-Assignment Limit of GMM-EM
Consider a GMM with equal weights $\pi_k = 1/K$ and isotropic covariances $\Sigma_k = \epsilon I$. As $\epsilon \to 0$, the EM responsibilities become one-hot: $\gamma_{nk} \to \mathbb{1}\{k = \arg\min_j \|x_n - \mu_j\|^2\}$, and the EM updates for the means $\mu_k$ reduce to the Lloyd iteration for K-means.
Small $\epsilon$ makes the posterior concentrate on a single component (whichever is closest). K-means is "EM at zero temperature." Soft EM (larger $\epsilon$) respects uncertainty in the assignments.
Write $d_{nk} = \|x_n - \mu_k\|^2$ for the squared distance from sample $n$ to center $k$.
As $\epsilon \to 0$, the softmax over $-d_{nk}/(2\epsilon)$ becomes a hard-max (an argmin over distances).
Responsibility limit
With the stated restrictions, $\gamma_{nk} = \dfrac{\exp(-d_{nk}/(2\epsilon))}{\sum_{j} \exp(-d_{nj}/(2\epsilon))}$. As $\epsilon \to 0$, the term with the smallest $d_{nk}$ (the largest exponent) dominates the sum, so $\gamma_{nk} \to \mathbb{1}\{k = \arg\min_j d_{nj}\}$.
Mean update
Plugging the one-hot responsibilities into $\mu_k = \dfrac{\sum_n \gamma_{nk}\, x_n}{\sum_n \gamma_{nk}}$ gives the mean of the samples hard-assigned to cluster $k$: exactly the Lloyd update.
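As a quick numerical sanity check of this limit, the snippet below evaluates the responsibilities of one sample against three centers for shrinking $\epsilon$; the distance values are made-up toy numbers.

```python
import numpy as np

# Squared distances d_nk from one sample to K = 3 centers (toy values).
d = np.array([1.0, 1.5, 4.0])

for eps in [1.0, 0.1, 0.01]:
    # GMM responsibilities with equal weights and isotropic covariance eps * I.
    logits = -d / (2 * eps)
    gamma = np.exp(logits - logits.max())  # subtract the max for numerical stability
    gamma /= gamma.sum()
    print(eps, np.round(gamma, 4))

# As eps -> 0 the responsibilities approach the one-hot vector [1, 0, 0]:
# the hard assignment to the nearest center, i.e. the K-means rule.
```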
Baum-Welch: EM for Hidden Markov Models
A hidden Markov model (HMM) has discrete hidden states $s_1, \dots, s_T$ evolving as a Markov chain with transition matrix $A = [a_{ij}]$ and emitting observations $y_t$ through an emission distribution $p(y_t \mid s_t)$. The parameters are $\theta = (\pi, A, \text{emission parameters})$, where $\pi$ is the initial state distribution. Because the state sequence is latent, MLE is not closed-form.
The Baum-Welch algorithm is EM applied here:
- E-step: compute the marginals $\gamma_t(i) = p(s_t = i \mid y_{1:T})$ and pairwise marginals $\xi_t(i,j) = p(s_t = i, s_{t+1} = j \mid y_{1:T})$ via the forward-backward recursion.
- M-step: update $\pi$, $A$, and the emission parameters by maximizing $Q(\theta, \theta^{\text{old}})$.
Forward-backward is the exact posterior computation for the HMM, so Baum-Welch is ordinary EM, not variational EM.
Definition: Sparse Bayesian Learning (SBL)
Sparse Bayesian Learning (SBL)
Consider the linear model $y = \Phi w + n$ with $n \sim \mathcal{N}(0, \sigma^2 I)$. SBL places a zero-mean Gaussian prior with unknown variance on each coefficient: $w_i \sim \mathcal{N}(0, \alpha_i^{-1})$ independently, with precision hyperparameters $\alpha_i$. The hyperparameter vector $\alpha = (\alpha_1, \dots, \alpha_M)$ is estimated by type-II maximum likelihood (evidence maximization), and the latent variable is $w$ itself. EM yields closed-form updates:
- E-step (Gaussian posterior): $p(w \mid y, \alpha, \sigma^2) = \mathcal{N}(w; \mu, \Sigma)$ with $\Sigma = (\sigma^{-2} \Phi^\top \Phi + \operatorname{diag}(\alpha))^{-1}$, $\mu = \sigma^{-2} \Sigma \Phi^\top y$.
- M-step: $\alpha_i \leftarrow \dfrac{1}{\mu_i^2 + \Sigma_{ii}}$.
As iterations proceed, the precisions of irrelevant coefficients grow without bound, $\alpha_i \to \infty$, driving the corresponding posterior means $\mu_i \to 0$. This automatic relevance determination (ARD) yields sparse estimates with no hand-tuned regularization parameter.
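A compact numpy sketch of these E/M updates under the precision parameterization used above. The dictionary, noise variance, iteration count, and pruning threshold are illustrative assumptions, and the noise variance is held fixed rather than re-estimated.

```python
import numpy as np

def sbl_em(Phi, y, sigma2=1e-4, n_iters=200, prune_thresh=1e8):
    """EM for sparse Bayesian learning; the noise variance sigma2 is held fixed."""
    N, M = Phi.shape
    alpha = np.ones(M)                          # prior precisions (hyperparameters)
    for _ in range(n_iters):
        # E-step: Gaussian posterior N(mu, Sigma) over the coefficients w.
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
        mu = Sigma @ Phi.T @ y / sigma2
        # M-step: alpha_i = 1 / (mu_i^2 + Sigma_ii), capped for numerical safety.
        alpha = np.minimum(1.0 / (mu ** 2 + np.diag(Sigma)), prune_thresh)
    mu[alpha >= prune_thresh] = 0.0             # ARD: prune effectively irrelevant coefficients
    return mu, alpha

# Toy usage: recover a 3-sparse vector from 40 noisy measurements of dimension 100.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 100)) / np.sqrt(40)
w_true = np.zeros(100)
w_true[[5, 37, 80]] = [1.0, -2.0, 1.5]
y = Phi @ w_true + 0.01 * rng.standard_normal(40)
w_hat, alpha = sbl_em(Phi, y, sigma2=1e-4)
```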
EM-Based Sparse Bayesian Learning for Massive Random Access
In grant-free massive random access, only a small subset of the users in a cell are active at any given time, and the base station must jointly detect activity and estimate channels. Modeling each user's channel as drawn from a Gaussian prior with unknown variance (an active user corresponds to a finite prior precision), the problem becomes an SBL instance at massive-MIMO scale. The EM updates derived here, exploiting the Kronecker structure of the sensing matrix induced by the pilot-plus-antenna-array geometry, deliver orders-of-magnitude complexity savings over vanilla SBL while preserving the Bayes-optimal phase transition. This line of work is part of the CommIT group's programme on unsourced random access for 6G.
Turbo Equalization as EM
In turbo equalization, a soft-output equalizer and a soft-input/soft-output decoder exchange extrinsic log-likelihood ratios. Viewed probabilistically, the transmitted coded symbols play the role of latent variables: the decoder implements a form of E-step (posterior computation over symbols given LLRs), and the equalizer's reliance on symbol priors is the M-step analogue. The connection is not perfect (turbo uses extrinsic information to avoid double-counting), but the E-then-M rhythm is exactly the same, and so is the monotonicity-under-approximation intuition.
Sensitivity to Initialization: Multi-Restart EM
EM converges only to a local maximum of the likelihood, so multiple random initializations, each run to convergence, generally land on different local maxima. Selecting the restart with the highest final log-likelihood is the standard rule.
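A minimal multi-restart loop, here using scikit-learn's GaussianMixture as the EM workhorse; the three-component toy data, the 20 restarts, and the random initialization scheme are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(200, 2)) for m in (-3.0, 0.0, 3.0)])

best_model, best_ll = None, -np.inf
for seed in range(20):                             # 20 random restarts
    gm = GaussianMixture(n_components=3, init_params="random",
                         n_init=1, random_state=seed).fit(X)
    ll = gm.score(X)                               # average log-likelihood at convergence
    if ll > best_ll:
        best_model, best_ll = gm, ll

print(f"best average log-likelihood over restarts: {best_ll:.3f}")
```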
EM and Its Relatives
| Algorithm | Latent variables | E-step | M-step |
|---|---|---|---|
| EM (generic) | Arbitrary | Exact posterior | Maximize $Q(\theta, \theta^{\text{old}})$ |
| GMM-EM | Component labels | Responsibilities | Weighted Gaussian MLE |
| K-means | Component labels | Hard assignments (zero-temperature) | Cluster means |
| Baum-Welch | State sequence | Forward-backward smoothing | Count-based transition + emission MLE |
| SBL / ARD | Coefficients $w$ | Gaussian posterior mean + covariance | Closed-form hyperparameter update $\alpha_i \leftarrow 1/(\mu_i^2 + \Sigma_{ii})$ |
| Variational EM | Arbitrary | Restricted $q$ (not the true posterior) | Maximize the ELBO over $\theta$ |
Why This Matters: Channel Estimation with Unknown Interference
Consider pilot-based channel estimation in a cell where neighbouring cells transmit asynchronously: the interference has unknown statistics. A two-component Gaussian mixture (desired user plus interferer) captures this neatly, with the interferer's symbol as a latent variable. EM alternates between LMMSE channel estimation (given current interference statistics) and covariance estimation (given current channel), a workhorse recipe in coordinated multi-point (CoMP) and cell-free massive MIMO receivers.
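A toy numpy sketch of this alternation under simplifying assumptions (real-valued signals, a single-user pilot, a Gaussian prior $h \sim \mathcal{N}(0, I)$, and interference lumped into an unknown covariance $R$); it is a caricature of the recipe, not a CoMP or cell-free receiver.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T = 4, 64                                       # antennas, pilot length (toy sizes)
x = rng.choice([-1.0, 1.0], size=T)                # known pilot symbols
h_true = rng.standard_normal(M)                    # channel to be estimated
L = rng.standard_normal((M, 2))
R_true = L @ L.T + 0.1 * np.eye(M)                 # unknown interference-plus-noise covariance
Y = np.outer(h_true, x) + rng.multivariate_normal(np.zeros(M), R_true, size=T).T

R = np.eye(M)                                      # start from a white-noise assumption
for _ in range(10):
    # Channel step: LMMSE estimate of h (prior h ~ N(0, I)) given the current R.
    Rinv = np.linalg.inv(R)
    P = np.linalg.inv(np.eye(M) + np.sum(x ** 2) * Rinv)   # posterior covariance of h
    h_hat = P @ Rinv @ (Y @ x)                              # posterior mean
    # Covariance step: re-estimate the interference-plus-noise covariance from residuals.
    resid = Y - np.outer(h_hat, x)
    R = resid @ resid.T / T
```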
See full treatment in Greedy Algorithms: OMP, CoSaMP, IHT
EM vs. Stochastic Gradient on Latent-Variable Models
For very large datasets, a full E-step over all samples becomes prohibitive. Two practical alternatives: (i) stochastic EM performs the E-step on a mini-batch and takes a damped M-step; (ii) gradient-based EM directly applies SGD to the ELBO using the reparameterization trick. In the deep-learning era, variational autoencoders (VAEs) are precisely gradient-based variational EM. The textbook EM algorithm is the ancestor that makes sense of all these variants.
- Full-batch EM: exact M-step, can be slow on large datasets
- Stochastic EM: mini-batch speedup, requires step-size tuning
- VAE / amortized EM: replace the per-sample E-step with a neural encoder
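A minimal sketch of the stochastic-EM idea for a one-dimensional two-component GMM: the E-step runs on a mini-batch, and running sufficient statistics are updated with a damping (Robbins-Monro) step size before the closed-form M-step. Batch size, step-size schedule, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.7, 5000), rng.normal(3, 1.0, 5000)])
K, B = 2, 64                                       # mixture components, mini-batch size

# Parameters (crude initialization) and running per-sample sufficient statistics.
pi, mu, var = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)
S0, S1, S2 = pi.copy(), pi * mu, pi * (var + mu ** 2)

for t in range(1, 2001):
    batch = rng.choice(X, size=B, replace=False)
    # Mini-batch E-step: responsibilities under the current parameters.
    logp = (-0.5 * (batch[:, None] - mu) ** 2 / var
            - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
    g = np.exp(logp - logp.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    # Damped update of the sufficient statistics (Robbins-Monro step size).
    rho = 1.0 / (t + 10)
    S0 = (1 - rho) * S0 + rho * g.mean(axis=0)
    S1 = (1 - rho) * S1 + rho * (g * batch[:, None]).mean(axis=0)
    S2 = (1 - rho) * S2 + rho * (g * batch[:, None] ** 2).mean(axis=0)
    # M-step from the running statistics.
    pi, mu, var = S0, S1 / S0, np.maximum(S2 / S0 - (S1 / S0) ** 2, 1e-6)
```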
Quick Check
Which of the following is not true for K-means relative to GMM-EM?
- K-means assigns each point to exactly one cluster.
- K-means minimizes the sum of squared distances to cluster centers.
- K-means estimates per-cluster covariance matrices.
- K-means can be obtained as a limit of GMM-EM.
K-means uses only cluster means; covariance is implicitly isotropic and identical across clusters (and in the limit, zero).
Quick Check
In sparse Bayesian learning, what happens to coefficients for which the estimated hyperparameter grows very large?
- They acquire large posterior variance.
- They are pushed toward zero (pruned).
- They dominate the posterior mean.
- They cause the noise variance to diverge.
$\alpha_i \to \infty$ means the prior variance $\alpha_i^{-1} \to 0$, so the posterior concentrates at zero: automatic relevance determination.
Example: Two-State HMM via Baum-Welch
Consider a hidden Markov model with two states $s_t \in \{1, 2\}$, transition matrix $A$ with entries $a_{ij} = p(s_{t+1} = j \mid s_t = i)$, initial distribution $\pi$, and Gaussian emission densities $p(y_t \mid s_t = i) = \mathcal{N}(y_t; \mu_i, \sigma_i^2)$. Given a sequence $y_{1:T}$, state the E-step and M-step updates for the parameters $\theta = (\pi, A, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$.
E-step: forward-backward
Run the forward recursion for $\alpha_t(i) = p(y_{1:t}, s_t = i)$ and the backward recursion for $\beta_t(i) = p(y_{t+1:T} \mid s_t = i)$. Form the single-node and pair responsibilities $\gamma_t(i) = p(s_t = i \mid y_{1:T}) \propto \alpha_t(i)\,\beta_t(i)$ and $\xi_t(i,j) = p(s_t = i, s_{t+1} = j \mid y_{1:T}) \propto \alpha_t(i)\, a_{ij}\, p(y_{t+1} \mid s_{t+1} = j)\, \beta_{t+1}(j)$. These are the posterior marginals of the latent states under the current parameters.
M-step: closed-form updates
Setting gradients of the expected complete-data log-likelihood to zero gives $\pi_i = \gamma_1(i)$, $a_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, $\mu_i = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\, y_t}{\sum_{t=1}^{T} \gamma_t(i)}$, and $\sigma_i^2 = \dfrac{\sum_{t=1}^{T} \gamma_t(i)\,(y_t - \mu_i)^2}{\sum_{t=1}^{T} \gamma_t(i)}$.
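A self-contained numpy/scipy sketch of one Baum-Welch iteration for this two-state Gaussian HMM, using scaled forward-backward recursions for numerical stability; the synthetic sequence and initial parameters in the usage snippet are assumptions.

```python
import numpy as np
from scipy.stats import norm

def baum_welch_step(y, pi, A, mu, sigma2):
    """One EM (Baum-Welch) iteration for a 2-state HMM with Gaussian emissions."""
    T, S = len(y), 2
    B = np.stack([norm.pdf(y, mu[i], np.sqrt(sigma2[i])) for i in range(S)], axis=1)  # (T, S)

    # Scaled forward pass: alpha[t] sums to one, c[t] collects the scale factors.
    alpha = np.zeros((T, S)); c = np.zeros(T)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Scaled backward pass using the same scale factors.
    beta = np.zeros((T, S)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

    # E-step responsibilities: single-node gamma and pairwise xi.
    gamma = alpha * beta                                             # (T, S)
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None]

    # M-step: closed-form updates from the responsibilities.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    Nk = gamma.sum(axis=0)
    mu_new = gamma.T @ y / Nk
    sigma2_new = (gamma * (y[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    loglik = np.log(c).sum()                                         # log p(y | current params)
    return pi_new, A_new, mu_new, sigma2_new, loglik

# Toy usage: a synthetic sequence with two regimes; iterate the step until convergence.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
pi, A = np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.1, 0.9]])
mu, sigma2 = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    pi, A, mu, sigma2, ll = baum_welch_step(y, pi, A, mu, sigma2)
```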
Monotonic improvement
Because each update performs the exact M-maximization against the posterior computed in the E-step, the general EM monotonicity theorem guarantees $\log p(y_{1:T} \mid \theta^{(m+1)}) \ge \log p(y_{1:T} \mid \theta^{(m)})$. This is the Baum-Welch algorithm: EM specialized to HMMs.
Operational reading
Baum-Welch decouples neatly: the forward-backward recursion handles the temporal coupling between states, and the M-step is as simple as fitting independent Gaussians weighted by responsibilities. This decoupling is the reason EM remains the default HMM training procedure despite the availability of direct gradient methods.
Baum-Welch algorithm
The EM algorithm specialized to hidden Markov models. Its E-step is the forward-backward recursion, which computes posterior state marginals; its M-step updates the initial distribution, transition matrix, and emission parameters in closed form from the responsibilities. It predates general EM by nearly a decade and is the workhorse training algorithm for HMMs in speech, bioinformatics, and time-series modeling.
Automatic relevance determination (ARD)
A hierarchical-Bayes mechanism in which each coefficient carries its own prior precision hyperparameter $\alpha_i$, and the $\alpha_i$ are themselves estimated (typically by EM). Irrelevant features drive their $\alpha_i \to \infty$, shrinking the corresponding posterior onto zero and effectively pruning them. ARD underlies the relevance vector machine and is the canonical Bayesian route to sparsity without an explicit sparsity penalty.
References
- G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley, 2nd ed., 2008. Comprehensive monograph covering variants, acceleration, and applications.
- L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, 1989. The definitive engineering tutorial on HMMs and Baum-Welch training.
- M. E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, 2001. Original SBL paper: EM on a hierarchical Gaussian prior with automatic relevance determination.