Applications: K-means, HMMs, and Sparse Bayesian Learning

EM Beyond the GMM

Once the GMM updates are understood, a whole family of algorithms reveals itself as special cases or close relatives of EM. We highlight three: K-means (a hard-assignment limit of GMM-EM), the Baum-Welch algorithm for hidden Markov models, and sparse Bayesian learning (EM applied to a Gaussian-variance latent-variable model). Each preserves the E-then-M structure but changes the latent-variable model.

Definition:

K-means Clustering

Given data $\mathbf{y}_1,\ldots,\mathbf{y}_n \in \mathbb{R}^d$ and a number of clusters $K$, K-means finds cluster centers $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_K$ and hard assignments $c_i \in \{1,\ldots,K\}$ minimizing $J(\boldsymbol{\mu},c) = \sum_{i=1}^n \|\mathbf{y}_i - \boldsymbol{\mu}_{c_i}\|^2$. The standard Lloyd iteration alternates (a minimal sketch follows the list):

  • Assignment: $c_i \leftarrow \arg\min_k \|\mathbf{y}_i - \boldsymbol{\mu}_k\|^2$.
  • Update: $\boldsymbol{\mu}_k \leftarrow \frac{1}{|\{i : c_i = k\}|} \sum_{i : c_i = k} \mathbf{y}_i$.
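A minimal NumPy sketch of the Lloyd iteration (the function name, random initialization, and stopping rule are illustrative choices, not part of the definition):

```python
import numpy as np

def kmeans(Y, K, n_iters=100, seed=0):
    """Lloyd's algorithm: Y is (n, d); returns centers (K, d) and labels (n,)."""
    rng = np.random.default_rng(seed)
    # Initialize centers as K distinct data points chosen at random.
    mu = Y[rng.choice(len(Y), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest center in squared Euclidean distance.
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        c = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_mu = np.array([Y[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, c
```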

Theorem: K-means as a Hard-Assignment Limit of GMM-EM

Consider a GMM with equal weights $\pi_k = 1/K$ and isotropic covariances $\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}$. As $\sigma^2 \to 0^+$, the EM responsibilities become one-hot: $\gamma_{ik} \to \mathbb{1}[k = \arg\min_j \|\mathbf{y}_i - \boldsymbol{\mu}_j\|^2]$, and the EM updates for $\{\boldsymbol{\mu}_k\}$ reduce to the Lloyd iteration for K-means.

Small $\sigma^2$ makes the posterior concentrate on a single component (whichever is closest). K-means is "EM at zero temperature." Soft EM (large $\sigma^2$) respects uncertainty in the assignments.
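A quick numerical check of this limit (the point and centers are arbitrary illustrative values):

```python
import numpy as np

def responsibilities(y, mus, sigma2):
    """Equal-weight GMM responsibilities under isotropic covariance sigma2 * I."""
    d2 = ((y - mus) ** 2).sum(axis=1)   # squared distance to each center
    logits = -d2 / (2 * sigma2)         # log-density up to a shared constant
    logits -= logits.max()              # stabilize before exponentiating
    g = np.exp(logits)
    return g / g.sum()

y = np.array([0.4, 0.0])
mus = np.array([[0.0, 0.0], [1.0, 0.0]])
for s2 in [1.0, 0.1, 0.01]:
    print(s2, responsibilities(y, mus, s2))
# As sigma2 shrinks, the responsibility vector approaches (1, 0):
# one-hot on the nearest center, exactly the K-means assignment.
```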

Baum-Welch: EM for Hidden Markov Models

A hidden Markov model (HMM) has discrete hidden states $S_1,\ldots,S_T \in \{1,\ldots,K\}$ evolving as a Markov chain with transition matrix $\mathbf{A}$ and emitting observations $\mathbf{Y}_t$ through an emission distribution $p(\mathbf{y} \mid s; \boldsymbol{\phi})$. The parameters are $\boldsymbol{\theta} = (\boldsymbol{\pi}_0, \mathbf{A}, \boldsymbol{\phi})$. Because the state sequence is latent, the maximum-likelihood estimate has no closed form.

The Baum-Welch algorithm is EM applied here:

  • E-step: compute the state marginals $\gamma_t(k) = \Pr(S_t = k \mid \mathbf{y}; \boldsymbol{\theta}^{(t)})$ and pairwise marginals $\xi_t(j,k) = \Pr(S_{t-1} = j, S_t = k \mid \mathbf{y}; \boldsymbol{\theta}^{(t)})$ via the forward-backward recursion.
  • M-step: update $\pi_{0,k} = \gamma_1(k)$, $A_{jk} = \frac{\sum_t \xi_t(j,k)}{\sum_t \gamma_{t-1}(j)}$, and $\boldsymbol{\phi}_k$ by maximizing $\sum_t \gamma_t(k) \log p(\mathbf{y}_t \mid k; \boldsymbol{\phi}_k)$.

Forward-backward computes the exact posterior for the HMM, so Baum-Welch is ordinary EM, not variational EM.
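A compact sketch of one Baum-Welch iteration for scalar Gaussian emissions $\mathcal{N}(\mu_k, \sigma_k^2)$, using the scaled forward-backward recursion for numerical stability (the function name and the per-state variances are illustrative choices):

```python
import numpy as np

def baum_welch_step(y, pi0, A, mu, var):
    """One EM iteration for an HMM with scalar Gaussian emissions N(mu[k], var[k])."""
    T, K = len(y), len(pi0)
    # Emission likelihoods B[t, k] = N(y[t]; mu[k], var[k]).
    B = np.exp(-(y[:, None] - mu[None, :]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    # E-step: scaled forward-backward recursion.
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi0 * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                        # gamma[t, k] = P(S_t = k | y)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None])  # pairwise marginals

    # M-step: closed-form updates from the responsibilities.
    pi0_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    w = gamma.sum(axis=0)
    mu_new = (gamma * y[:, None]).sum(axis=0) / w
    var_new = (gamma * (y[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / w
    loglik = np.log(c).sum()                    # log p(y; theta) from the scalings
    return pi0_new, A_new, mu_new, var_new, loglik
```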

Definition:

Sparse Bayesian Learning (SBL)

Consider the linear model $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. SBL places a zero-mean Gaussian prior with unknown variance on each coefficient: $x_j \sim \mathcal{N}(0, \alpha_j^{-1})$ independently. The hyperparameter vector $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_n)$ is estimated by type-II maximum likelihood (evidence maximization), and the latent variable is $\mathbf{x}$ itself. EM yields closed-form updates:

  • E-step (Gaussian posterior): $p(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\alpha}^{(t)}, \sigma^2) = \mathcal{N}(\boldsymbol{\mu}^{(t)}, \boldsymbol{\Sigma}^{(t)})$ with $\boldsymbol{\Sigma}^{(t)} = (\sigma^{-2}\mathbf{A}^\top\mathbf{A} + \operatorname{diag}(\boldsymbol{\alpha}^{(t)}))^{-1}$ and $\boldsymbol{\mu}^{(t)} = \sigma^{-2}\boldsymbol{\Sigma}^{(t)}\mathbf{A}^\top\mathbf{y}$.
  • M-step: $\alpha_j^{(t+1)} = 1 \big/ \big((\mu_j^{(t)})^2 + \Sigma_{jj}^{(t)}\big)$.

As iterations proceed, irrelevant coefficients see $\alpha_j \to \infty$, driving the corresponding $x_j \to 0$. This automatic relevance determination (ARD) yields sparse estimates with no hand-tuned regularization parameter.
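The updates above translate almost line-for-line into code. A minimal sketch (the iteration count and the precision cap are illustrative practical choices, not part of the model):

```python
import numpy as np

def sbl_em(A, y, sigma2, n_iters=200, alpha_max=1e8):
    """EM for sparse Bayesian learning: returns posterior mean mu and precisions alpha."""
    m, n = A.shape
    alpha = np.ones(n)
    AtA, Aty = A.T @ A, A.T @ y
    for _ in range(n_iters):
        # E-step: Gaussian posterior over x given the current precisions.
        Sigma = np.linalg.inv(AtA / sigma2 + np.diag(alpha))
        mu = Sigma @ Aty / sigma2
        # M-step: ARD update of each prior precision.
        alpha = 1.0 / (mu ** 2 + np.diag(Sigma))
        alpha = np.minimum(alpha, alpha_max)  # cap to avoid numerical overflow
    return mu, alpha
```

Coefficients whose $\alpha_j$ hits the cap are effectively pruned; a more careful implementation would delete those columns so the matrix inverse shrinks as the solution sparsifies.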

🎓 CommIT Contribution (2020)

EM-Based Sparse Bayesian Learning for Massive Random Access

M. Ke, G. Caire, IEEE Trans. Signal Processing, vol. 68

In grant-free massive random access, only a small subset of the $K$ users in a cell are active at any given time, and the base station must jointly detect activity and estimate channels. Modeling each user's channel as drawn from a Gaussian prior with unknown variance $\alpha_k^{-1}$ (activity $\Leftrightarrow$ finite $\alpha_k$), the problem becomes an SBL instance at massive-MIMO scale. The EM updates derived here, which exploit the Kronecker structure of the sensing matrix induced by the pilot-plus-antenna-array geometry, deliver orders-of-magnitude complexity savings over vanilla SBL while preserving the Bayes-optimal phase transition. This line of work is part of the CommIT group's programme on unsourced random access for 6G.


Turbo Equalization as EM

In turbo equalization, a soft-output equalizer and a soft-input/soft-output decoder exchange extrinsic log-likelihood ratios. Viewed probabilistically, the transmitted coded symbols play the role of latent variables: the decoder implements a form of E-step (posterior computation over symbols given LLRs), and the equalizer's reliance on symbol priors is the M-step analogue. The connection is not perfect (turbo exchanges extrinsic information precisely to avoid double-counting), but the E-then-M rhythm is exactly the same, and so is the monotonicity-under-approximation intuition.

Sensitivity to Initialization: Multi-Restart EM

EM converges only to a local maximum, so multiple random initializations, each run to convergence, generally produce different local maxima. The standard selection rule is to keep the run with the best final log-likelihood.
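A minimal restart wrapper, parameterized by a single-run routine `em_fit(Y, K, seed) -> (params, loglik)` (a hypothetical signature, not a fixed API):

```python
def multi_restart_em(em_fit, Y, K, n_restarts=8):
    """Run EM from several seeds and keep the run with the best log-likelihood."""
    runs = (em_fit(Y, K, seed=s) for s in range(n_restarts))
    return max(runs, key=lambda run: run[1])  # run = (params, loglik)
```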


EM and Its Relatives

| Algorithm | Latent variables | E-step | M-step |
|---|---|---|---|
| EM (generic) | Arbitrary $\mathbf{Z}$ | Exact posterior $p(\mathbf{z} \mid \mathbf{y}; \boldsymbol{\theta})$ | $\arg\max_{\boldsymbol{\theta}} Q$ |
| GMM-EM | Component labels $Z_i$ | Responsibilities $\gamma_{ik}$ | Weighted Gaussian MLE |
| K-means | Component labels $Z_i$ | Hard assignments (zero temperature) | Cluster means |
| Baum-Welch | State sequence $S_{1:T}$ | Forward-backward smoothing | Count-based transition + emission MLE |
| SBL / ARD | Coefficients $\mathbf{x}$ | Gaussian posterior mean + covariance | $\alpha_j \leftarrow 1/(\mu_j^2 + \Sigma_{jj})$ |
| Variational EM | Arbitrary $\mathbf{Z}$ | Restricted $q \in \mathcal{Q}$ (not the true posterior) | $\arg\max_{\boldsymbol{\theta}} \mathcal{F}(q, \boldsymbol{\theta})$ |

Why This Matters: Channel Estimation with Unknown Interference

Consider pilot-based channel estimation in a cell where neighbouring cells transmit asynchronously: the interference $\mathbf{i}$ has unknown statistics. A two-component Gaussian mixture (desired user + interferer) captures this neatly, with the interferer's symbol as a latent variable. EM alternates between LMMSE channel estimation (given current interference statistics) and covariance estimation (given current channel), a workhorse recipe in coordinated multi-point (CoMP) and cell-free massive MIMO receivers.
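As a toy version of this alternation, consider the simplified pilot model $\mathbf{y}_t = \mathbf{h} + \mathbf{n}_t$ with all-ones pilots, channel prior $\mathbf{h} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_h)$, and unknown interference-plus-noise covariance $\mathbf{R}$. The model and every name below are illustrative assumptions, not the CoMP receiver itself:

```python
import numpy as np

def em_channel_and_interference(Y, C_h, n_iters=20):
    """Alternate LMMSE channel estimation and interference-covariance estimation.

    Y: (T, d) pilot-slot observations y_t = h + n_t, n_t ~ N(0, R), all-ones pilots.
    C_h: (d, d) prior covariance of the channel h. Returns (h_hat, R_hat).
    """
    T, d = Y.shape
    R = np.eye(d)                     # initial interference-plus-noise covariance
    ybar = Y.mean(axis=0)
    for _ in range(n_iters):
        # E-step: Gaussian posterior of h given R (LMMSE mean + covariance).
        P = np.linalg.inv(np.linalg.inv(C_h) + T * np.linalg.inv(R))
        h_hat = P @ (T * np.linalg.solve(R, ybar))
        # M-step: re-estimate R from residuals, adding the posterior
        # uncertainty P in h, as the exact EM update requires.
        E = Y - h_hat
        R = E.T @ E / T + P
    return h_hat, R
```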

See full treatment in Greedy Algorithms: OMP, CoSaMP, IHT

🔧 Engineering Note

EM vs. Stochastic Gradient on Latent-Variable Models

For very large datasets, a full E-step over all samples becomes prohibitive. Two practical alternatives: (i) stochastic EM performs the E-step on a mini-batch and takes a damped M-step; (ii) gradient-based EM directly applies SGD to the ELBO using the reparameterization trick. In the deep-learning era, variational autoencoders (VAEs) are precisely gradient-based variational EM. The textbook EM algorithm is the ancestor that makes sense of all these variants.

Practical Constraints

  • Full-batch EM: exact M-step, can be slow for $n \gg 10^6$
  • Stochastic EM: $O(\text{batch}/n)$ speedup per iteration, requires step-size tuning (see the sketch after this list)
  • VAE / amortized EM: replace the per-sample E-step with a neural encoder
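A sketch of stochastic (stepwise) EM for the means of an equal-weight, isotropic GMM, assuming $\sigma^2$ is known; the polynomial step-size schedule is one common choice, not the only one:

```python
import numpy as np

def stochastic_em_gmm_means(Y, mu, sigma2, batch=256, n_steps=1000, kappa=0.6, seed=0):
    """Mini-batch EM for GMM means via damped sufficient-statistic updates."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    s0 = np.ones(K) / K                  # running responsibility mass per component
    s1 = mu * s0[:, None]                # running responsibility-weighted sums
    for t in range(1, n_steps + 1):
        Yb = Y[rng.choice(len(Y), size=batch, replace=False)]
        # Mini-batch E-step: responsibilities under the current means.
        d2 = ((Yb[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        g = np.exp(-d2 / (2 * sigma2))
        g /= g.sum(axis=1, keepdims=True)
        # Damped M-step: blend batch statistics into the running ones.
        rho = t ** (-kappa)              # step size; kappa in (0.5, 1] for convergence
        s0 = (1 - rho) * s0 + rho * g.mean(axis=0)
        s1 = (1 - rho) * s1 + rho * (g.T @ Yb) / batch
        mu = s1 / s0[:, None]
    return mu
```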

Quick Check

Which of the following is not true of K-means relative to GMM-EM?

  • K-means assigns each point to exactly one cluster.
  • K-means minimizes the sum of squared distances to cluster centers.
  • K-means estimates per-cluster covariance matrices.
  • K-means can be obtained as a $\sigma^2 \to 0$ limit of GMM-EM.

Quick Check

In sparse Bayesian learning, what happens to coefficients $x_j$ for which the estimated hyperparameter $\alpha_j$ grows very large?

  • They acquire large posterior variance.
  • They are pushed toward zero (pruned).
  • They dominate the posterior mean.
  • They cause the noise variance $\sigma^2$ to diverge.

Example: Two-State HMM via Baum-Welch

Consider a hidden Markov model with two states $\{1,2\}$, transition matrix $\mathbf{A}$ with entries $a_{ij} = P(Z_{n+1} = j \mid Z_n = i)$, initial distribution $\boldsymbol{\pi}$, and Gaussian emission densities $f(y_n \mid Z_n = i) = \mathcal{N}(y_n; \mu_i, \sigma^2)$. Given a sequence $y_{1:N}$, state the E-step and M-step updates for the parameters $\boldsymbol{\theta} = (\boldsymbol{\pi}, \mathbf{A}, \mu_1, \mu_2, \sigma^2)$.
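A sketch of the solution, consistent with the Baum-Welch updates above: the E-step runs forward-backward under the current parameters to obtain $\gamma_n(i) = P(Z_n = i \mid y_{1:N})$ and $\xi_n(i,j) = P(Z_n = i, Z_{n+1} = j \mid y_{1:N})$. The M-step then sets $\pi_i = \gamma_1(i)$, $a_{ij} = \sum_{n=1}^{N-1} \xi_n(i,j) \big/ \sum_{n=1}^{N-1} \gamma_n(i)$, $\mu_i = \sum_n \gamma_n(i)\, y_n \big/ \sum_n \gamma_n(i)$, and, since the variance is shared across both states, $\sigma^2 = \frac{1}{N} \sum_{i=1}^{2} \sum_{n=1}^{N} \gamma_n(i)\,(y_n - \mu_i)^2$.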


Baum-Welch algorithm

The EM algorithm specialized to hidden Markov models. Its E-step is the forward-backward recursion, which computes posterior state marginals; its M-step updates the initial distribution, transition matrix, and emission parameters in closed form from the responsibilities. It predates general EM by nearly a decade and is the workhorse training algorithm for HMMs in speech, bioinformatics, and time-series modeling.

Automatic relevance determination (ARD)

A hierarchical-Bayes mechanism in which each coefficient $x_j$ carries its own prior precision hyperparameter $\alpha_j$, and the $\alpha_j$ are themselves estimated (typically by EM). Irrelevant features drive their $\alpha_j \to \infty$, shrinking the corresponding posterior onto zero and effectively pruning them. ARD underlies the relevance vector machine and is the canonical Bayesian route to sparsity without an explicit $\ell_1$ penalty.