Bayesian Sparse Recovery

Why a Probabilistic Model for Sparsity?

Sections 14.1-14.3 treated sparsity as a deterministic structural constraint. A richer viewpoint places a prior on $\mathbf{x}$ that expresses sparsity probabilistically. The point is that a prior gives us more than a single point estimate: it yields a posterior $p(\mathbf{x} \mid \mathbf{y})$ whose mean is the MMSE estimator, whose mode is the MAP estimator, and whose marginals quantify support uncertainty. In Bayesian language, LASSO is the MAP under a Laplace prior, $\ell_0$-minimization is the MAP under a spike prior, and sparse Bayesian learning supplies a tractable approximation to the otherwise combinatorial marginalization over supports.

Definition:

Spike-and-Slab Prior

The spike-and-slab prior on $x_i \in \mathbb{R}$ is the mixture

$$p(x_i) = (1-\rho)\,\delta_0(x_i) + \rho\,\mathcal{N}(x_i;\,0,\sigma_x^2),$$

where $\rho \in (0,1)$ is the activity probability and $\sigma_x^2$ the nonzero-component variance. The $x_i$ are assumed independent across $i$.

The Dirac mass at $0$ (the "spike") encodes exact sparsity; the Gaussian (the "slab") models nonzero magnitudes. Marginalizing over the latent activity indicator $s_i = \mathbb{1}\{x_i \neq 0\}$ recovers the spike-and-slab distribution.
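Concretely, summing over the two values of the indicator:

$$p(x_i) = \sum_{s_i \in \{0,1\}} p(s_i)\, p(x_i \mid s_i) = (1-\rho)\,\delta_0(x_i) + \rho\,\mathcal{N}(x_i;\,0,\sigma_x^2).$$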

Definition:

Bernoulli-Gaussian Model

The Bernoulli-Gaussian (BG) model is the hierarchical formulation of the spike-and-slab:

$$s_i \sim \text{Bernoulli}(\rho), \qquad x_i \mid s_i \sim s_i \cdot \mathcal{N}(0,\sigma_x^2),$$

with observation $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$, $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$.
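To make the model concrete, here is a minimal sketch of drawing from it (the helper name and the $1/\sqrt{M}$ column normalization of $\mathbf{A}$ are our choices, not from the text):

```python
import numpy as np

def sample_bernoulli_gaussian(M, N, rho, sigma_x2, sigma2, rng=None):
    """Draw (A, x, y) from the Bernoulli-Gaussian observation model."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((M, N)) / np.sqrt(M)   # i.i.d. Gaussian sensing matrix (our normalization)
    s = rng.random(N) < rho                        # activity indicators s_i ~ Bernoulli(rho)
    x = s * rng.normal(0.0, np.sqrt(sigma_x2), N)  # slab draw where s_i = 1, exact zero elsewhere
    w = rng.normal(0.0, np.sqrt(sigma2), M)        # w ~ N(0, sigma2 * I)
    return A, x, A @ x + w
```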

Theorem: MAP under Spike-and-Slab Recovers $\ell_0$-Regularized Estimation

For the Bernoulli-Gaussian model with observation noise $\sigma^2$, the MAP estimator $\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}}\, p(\mathbf{x} \mid \mathbf{y})$ solves

$$\min_{\mathbf{x}}\ \frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \frac{1}{2\sigma_x^2}\|\mathbf{x}\|_2^2 + \tau\,\|\mathbf{x}\|_0, \qquad \tau = \log\!\left(\frac{1-\rho}{\rho}\right) + \frac{1}{2}\log(2\pi\sigma_x^2).$$

In the limit $\sigma_x^2 \to \infty$, the quadratic term vanishes and the penalty reduces to $\tau\,\|\mathbf{x}\|_0$.
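To see where $\tau$ comes from, write the negative log-posterior for $\mathbf{x}$ with support $S = \{i : x_i \neq 0\}$ and group the terms that depend only on $|S| = \|\mathbf{x}\|_0$:

$$-\log p(\mathbf{x} \mid \mathbf{y}) = \frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \sum_{i \in S}\left[\frac{x_i^2}{2\sigma_x^2} + \frac{1}{2}\log(2\pi\sigma_x^2)\right] - |S|\log\rho - (N-|S|)\log(1-\rho) + \text{const}.$$

Collecting the $|S|$-dependent terms yields $|S|\left[\log\frac{1-\rho}{\rho} + \frac{1}{2}\log(2\pi\sigma_x^2)\right] = \tau\,\|\mathbf{x}\|_0$, plus the quadratic shrinkage $\|\mathbf{x}\|_2^2/(2\sigma_x^2)$ on the active entries.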

Sparser priors ($\rho \to 0$) increase $\tau$ and force more entries to zero. MAP under a Laplace prior gives LASSO; MAP under spike-and-slab gives the NP-hard $\ell_0$ problem. The choice of prior is what separates computationally tractable (Laplace) from intractable (spike) MAP.

Posterior Mean is the MMSE Estimator

Under the Bernoulli-Gaussian model, the posterior mean $\mathbb{E}[\mathbf{x} \mid \mathbf{y}]$ is the MMSE estimator: it minimizes the Bayesian MSE $\mathbb{E}\|\mathbf{x}-\hat{\mathbf{x}}\|_2^2$. Unlike MAP, the posterior mean averages over all $2^N$ support configurations weighted by posterior probability, producing a soft estimate that is generally not exactly sparse. For recovery tasks where the MSE is the operational metric (e.g., channel estimation), the posterior mean outperforms MAP; for support identification, MAP or a thresholded posterior mean is preferred.
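For intuition, consider the scalar channel $y = x + w$ ($\mathbf{A} = 1$), where the posterior mean has a closed form: a posterior inclusion probability times the usual Gaussian (Wiener) shrinkage. A sketch (the function name is ours):

```python
import numpy as np

def scalar_bg_posterior_mean(y, rho, sigma_x2, sigma2):
    """MMSE estimate of x from y = x + w with a spike-and-slab prior on x."""
    def gauss(v, var):                                # N(v; 0, var) density
        return np.exp(-v**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    # Posterior inclusion probability P(s = 1 | y): compare the two marginal
    # likelihoods y ~ N(0, sigma_x2 + sigma2) (active) vs y ~ N(0, sigma2) (inactive).
    num = rho * gauss(y, sigma_x2 + sigma2)
    pi_y = num / (num + (1 - rho) * gauss(y, sigma2))
    # Soft estimate: never exactly zero for y != 0, but strongly shrunk when pi_y is small.
    return pi_y * (sigma_x2 / (sigma_x2 + sigma2)) * y
```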

Definition:

Sparse Bayesian Learning (SBL)

SBL replaces the Bernoulli mixture with a continuous automatic-relevance-determination (ARD) prior: each $x_i$ has its own variance $\gamma_i \geq 0$, with

$$x_i \mid \gamma_i \sim \mathcal{N}(0,\gamma_i), \qquad \gamma_i \sim \text{Gamma hyper-prior}.$$

SBL estimates $\boldsymbol{\gamma} = (\gamma_1,\ldots,\gamma_N)$ by type-II maximum likelihood (marginal-likelihood maximization), then computes the posterior mean of $\mathbf{x}$ conditioned on $\hat{\boldsymbol{\gamma}}$. Components with $\hat{\gamma}_i \to 0$ are pruned from the model.


SBL via EM

Complexity: per-iteration cost $O(N^3)$ for the posterior covariance; reducible to $O(MN^2)$ via the Woodbury identity.

```python
import numpy as np

def sbl_em(A, y, sigma2, n_iter=500, tol=1e-8, gamma_floor=1e-12):
    """SBL via EM.

    Input:  sensing matrix A (M x N), observation y (M,), noise variance sigma2.
    Output: sparse estimate x_hat (posterior mean) and hyperparameters gamma.
    """
    M, N = A.shape
    gamma = np.ones(N)                       # initialize gamma_i = 1, i = 1..N
    AhA = A.conj().T @ A / sigma2            # precompute A^H A / sigma2
    Ahy = A.conj().T @ y / sigma2            # precompute A^H y / sigma2
    for _ in range(n_iter):
        # E-step: posterior covariance and mean of x given the current gamma
        Sigma_x = np.linalg.inv(AhA + np.diag(1.0 / np.maximum(gamma, gamma_floor)))
        mu_x = Sigma_x @ Ahy
        # M-step: gamma_i <- E[|x_i|^2 | y, gamma] = |mu_i|^2 + [Sigma_x]_{ii}
        gamma_new = np.abs(mu_x) ** 2 + np.real(np.diag(Sigma_x))
        if np.max(np.abs(gamma_new - gamma)) < tol:   # converged
            gamma = gamma_new
            break
        gamma = gamma_new
    return mu_x, gamma                       # x_hat = mu_x
```

The `gamma_floor` guard keeps the $\boldsymbol{\Gamma}^{-1}$ inverse well defined as components are pruned toward zero.
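A quick end-to-end check, reusing the illustrative `sample_bernoulli_gaussian` helper from above:

```python
A, x_true, y = sample_bernoulli_gaussian(M=40, N=100, rho=0.05, sigma_x2=1.0, sigma2=1e-2, rng=0)
x_hat, gamma = sbl_em(A, y, sigma2=1e-2)
print("NMSE:", np.linalg.norm(x_hat - x_true)**2 / np.linalg.norm(x_true)**2)
print("active components:", np.sum(gamma > 1e-6))   # most gammas are driven toward zero
```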

Theorem: SBL Fixed Points are Sparse

Every fixed point $\hat{\boldsymbol{\gamma}}$ of the SBL EM iteration has at most $M$ nonzero entries, where $M$ is the number of observations. Equivalently, the recovered $\hat{\mathbf{x}}$ is supported on at most $M$ indices.

The reader should verify this against the LASSO solution: LASSO too has at most $M$ nonzero entries (under mild genericity). SBL inherits the same "at most $M$ active components" structure because both reduce to a rank-$M$ subspace selection.

Example: SBL vs LASSO on a Small Problem

Consider $N=100$, $M=40$, true sparsity $s=5$, an i.i.d. Gaussian sensing matrix, and SNR $= 20$ dB. Compare support recovery and MSE of SBL versus LASSO (cross-validated $\lambda$) over 50 trials.
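A sketch of this experiment, using scikit-learn's `LassoCV` for the cross-validated $\lambda$ and the `sbl_em` sketch above (the support threshold of $10^{-3}$ and the SNR-to-noise-variance conversion are our choices):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def one_trial(seed, N=100, M=40, s=5, snr_db=20.0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, N)) / np.sqrt(M)
    supp = rng.choice(N, size=s, replace=False)       # random true support
    x = np.zeros(N)
    x[supp] = rng.standard_normal(s)
    y0 = A @ x
    sigma2 = np.mean(y0**2) * 10 ** (-snr_db / 10)    # noise variance from target SNR
    y = y0 + rng.normal(0.0, np.sqrt(sigma2), M)

    x_sbl, _ = sbl_em(A, y, sigma2)
    x_lasso = LassoCV(cv=5, fit_intercept=False).fit(A, y).coef_

    def support(v, thresh=1e-3):
        return set(np.flatnonzero(np.abs(v) > thresh))
    return (support(x_sbl) == set(supp), support(x_lasso) == set(supp),
            np.mean((x_sbl - x)**2), np.mean((x_lasso - x)**2))

# Average exact-support recovery and MSE over 50 trials:
results = np.array([one_trial(t) for t in range(50)], dtype=float)
print("P(exact support)  SBL: %.2f  LASSO: %.2f" % tuple(results[:, :2].mean(0)))
print("MSE               SBL: %.2e  LASSO: %.2e" % tuple(results[:, 2:].mean(0)))
```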

SBL vs LASSO: Support Recovery and MSE

Vary the sparsity level $s$, the undersampling ratio $M/N$, and the SNR. Observe the gap between SBL and LASSO in probability of exact support recovery and in MSE.

Default parameters: sparsity $s=5$, undersampling ratio $M/N=0.4$, SNR $=20$ dB.

Posterior Marginals for a Toy Bernoulli-Gaussian Model

Exhaustively compute the posterior over supports for $N=8$, $s \leq 3$, and display the posterior inclusion probability per index.

Default parameters: $\rho = 0.2$, SNR $= 15$ dB.
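A sketch of the exhaustive computation (the function name is ours; SciPy supplies the Gaussian log-density). Each support $S$ with $|S| \le 3$ is weighted by its prior $\rho^{|S|}(1-\rho)^{N-|S|}$ times the marginal likelihood obtained by integrating $\mathbf{x}_S$ out analytically:

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def posterior_inclusion_probs(A, y, rho, sigma_x2, sigma2, max_s=3):
    """Exact P(s_i = 1 | y) by enumerating all supports with |S| <= max_s."""
    M, N = A.shape
    log_w, supports = [], []
    for k in range(max_s + 1):
        for S in combinations(range(N), k):
            # p(y | S) is zero-mean Gaussian with covariance
            # sigma2 * I + sigma_x2 * A_S A_S^T  (x_S integrated out)
            cov = sigma2 * np.eye(M)
            if k > 0:
                AS = A[:, list(S)]
                cov += sigma_x2 * AS @ AS.T
            log_lik = multivariate_normal.logpdf(y, mean=np.zeros(M), cov=cov)
            log_prior = k * np.log(rho) + (N - k) * np.log(1 - rho)
            log_w.append(log_lik + log_prior)
            supports.append(frozenset(S))
    w = np.exp(np.array(log_w) - np.max(log_w))       # stable normalization
    w /= w.sum()
    return np.array([w[[i in S for S in supports]].sum() for i in range(N)])
```

For $N=8$ and $s \le 3$ there are $\sum_{k=0}^{3}\binom{8}{k} = 93$ candidate supports, so exhaustive enumeration is cheap.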
🔧 Engineering Note

When to Use SBL in Practice

SBL is most attractive when: (i) the sensing matrix is highly coherent (LASSO's guarantees degrade, SBL's do not), (ii) the signal has decaying-magnitude structure where shrinkage is harmful, or (iii) per-component uncertainty quantification is needed. It is less attractive when $N$ is very large (cubic per-iteration cost) or when a single convex solve suffices. For massive-MIMO channel estimation with coherent pilot sequences, SBL consistently beats LASSO; see Ch. 15.

Common Mistake: Confusing MAP with Posterior Mean

Mistake:

Reporting the MAP estimate as "the Bayesian answer" when the task requires minimum MSE.

Correction:

MAP minimizes the 0-1 loss over $\mathbf{x}$; the posterior mean minimizes the squared-error loss. They coincide for symmetric, unimodal posteriors (e.g., Gaussian). Under a spike-and-slab posterior, the two can differ substantially: MAP prefers a single support, while the mean averages over supports.
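A quick numeric illustration, reusing the illustrative `scalar_bg_posterior_mean` sketch from earlier (the specific parameter values are our choices):

```python
import numpy as np

# Scalar model y = x + w. Evaluate the l0-penalized MAP objective from the
# theorem above at its two candidate minimizers: x = 0 and the best active x.
y, rho, sigma_x2, sigma2 = 0.7, 0.1, 1.0, 0.1
tau = np.log((1 - rho) / rho) + 0.5 * np.log(2 * np.pi * sigma_x2)
x_act = y * sigma_x2 / (sigma_x2 + sigma2)           # minimizer on the active branch
cost_zero = y**2 / (2 * sigma2)
cost_act = (y - x_act)**2 / (2 * sigma2) + x_act**2 / (2 * sigma_x2) + tau
x_map = 0.0 if cost_zero <= cost_act else x_act      # MAP snaps to exactly zero here
x_mmse = scalar_bg_posterior_mean(y, rho, sigma_x2, sigma2)
print(x_map, x_mmse)                                  # 0.0 vs roughly 0.15
```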

Spike-and-slab prior

A two-component mixture with a point mass at zero (the spike) and a continuous component (the slab). The canonical prior for exact-sparsity Bayesian models.


Sparse Bayesian Learning (SBL)

A hierarchical Bayesian approach using per-component variance hyperparameters (ARD prior) estimated by type-II maximum likelihood. Tipping (2001); Wipf-Rao (2004).


Historical Note: From RVM to SBL to Deep Sparse Models

1996-2020

Tipping's relevance-vector machine (2001) introduced the ARD prior as a sparsity-inducing mechanism in Bayesian regression. Wipf and Rao (2004) showed that SBL outperforms LASSO in support recovery under adverse dictionary conditions, an observation that motivated two decades of follow-up work (T-MSBL for temporally correlated sources, PC-SBL for pattern-coupled sparsity, and their unrolling into deep-learning architectures). The method also reappears in massive-random-access detectors (Ch. 15).

Key Takeaway

The Bernoulli-Gaussian / spike-and-slab framework unifies sparse estimation under a probabilistic lens: MAP under a Laplace prior gives LASSO; MAP under spike-and-slab gives $\ell_0$; the posterior mean is the MMSE estimator. Sparse Bayesian Learning replaces the intractable support marginalization with an ARD hyperparameter optimization solved via EM. SBL delivers sparser, less biased, and typically more accurate estimates than LASSO under coherent dictionaries, at the price of cubic per-iteration cost.