Bayesian Sparse Recovery

Why a Probabilistic Model for Sparsity?

Sections 14.1-14.3 treated sparsity as a deterministic structural constraint. A richer viewpoint places a prior on $\mathbf{x}$ that expresses sparsity probabilistically. The point is that a prior gives us more than a single point estimate: it yields a posterior $p(\mathbf{x} \mid \mathbf{y})$ whose mean is the MMSE estimator, whose mode is the MAP estimator, and whose marginals quantify support uncertainty. In Bayesian language, LASSO is the MAP under a Laplace prior, $\ell_0$-minimization is the MAP under a spike prior, and sparse Bayesian learning supplies a tractable approximation to the otherwise combinatorial marginalization over supports.

Definition:

Spike-and-Slab Prior

The spike-and-slab prior on $x_i \in \mathbb{R}$ is the mixture

$$p(x_i) = (1-\rho)\,\delta_0(x_i) + \rho\,\mathcal{N}(x_i;\,0,\sigma_x^2),$$

where $\rho \in (0,1)$ is the activity probability and $\sigma_x^2$ the nonzero-component variance. The $x_i$ are assumed independent across $i$.

The Dirac mass at $0$ (the "spike") encodes exact sparsity; the Gaussian (the "slab") models nonzero magnitudes. Marginalizing over the latent activity indicator $s_i = \mathbb{1}\{x_i \neq 0\}$ recovers the spike-and-slab distribution.
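Concretely, summing over the two values of the indicator:

$$p(x_i) = \sum_{s_i \in \{0,1\}} p(s_i)\, p(x_i \mid s_i) = (1-\rho)\,\delta_0(x_i) + \rho\,\mathcal{N}(x_i;\,0,\sigma_x^2).$$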

Definition:

Bernoulli-Gaussian Model

The Bernoulli-Gaussian (BG) model is the hierarchical formulation of the spike-and-slab:

$$s_i \sim \text{Bernoulli}(\rho), \qquad x_i \mid s_i \sim s_i \cdot \mathcal{N}(0,\sigma_x^2),$$

with observation $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{w}$, $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$.
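To make the model concrete, here is a minimal sketch of drawing from it (the helper name and the $1/\sqrt{M}$ column normalization of $\mathbf{A}$ are our choices, not from the text):

```python
import numpy as np

def sample_bernoulli_gaussian(M, N, rho, sigma_x2, sigma2, rng=None):
    """Draw (A, x, y) from the Bernoulli-Gaussian observation model."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((M, N)) / np.sqrt(M)   # i.i.d. Gaussian sensing matrix (our normalization)
    s = rng.random(N) < rho                        # activity indicators s_i ~ Bernoulli(rho)
    x = s * rng.normal(0.0, np.sqrt(sigma_x2), N)  # slab draw where s_i = 1, exact zero elsewhere
    w = rng.normal(0.0, np.sqrt(sigma2), M)        # w ~ N(0, sigma2 * I)
    return A, x, A @ x + w
```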

Theorem: MAP under Spike-and-Slab Recovers $\ell_0$-Regularized Estimation

For the Bernoulli-Gaussian model with observation noise $\sigma^2$, the MAP estimator $\hat{\mathbf{x}}_{\text{MAP}} = \arg\max_{\mathbf{x}}\, p(\mathbf{x} \mid \mathbf{y})$ solves

$$\min_{\mathbf{x}}\ \frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \frac{1}{2\sigma_x^2}\|\mathbf{x}\|_2^2 + \tau\,\|\mathbf{x}\|_0, \qquad \tau = \log\!\left(\frac{1-\rho}{\rho}\right) + \frac{1}{2}\log(2\pi\sigma_x^2).$$

In the limit $\sigma_x^2 \to \infty$, the quadratic term vanishes and the penalty reduces to $\tau\,\|\mathbf{x}\|_0$.
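To see where $\tau$ comes from, write the negative log-posterior for $\mathbf{x}$ with support $S = \{i : x_i \neq 0\}$ and group the terms that depend only on $|S| = \|\mathbf{x}\|_0$:

$$-\log p(\mathbf{x} \mid \mathbf{y}) = \frac{1}{2\sigma^2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \sum_{i \in S}\left[\frac{x_i^2}{2\sigma_x^2} + \frac{1}{2}\log(2\pi\sigma_x^2)\right] - |S|\log\rho - (N-|S|)\log(1-\rho) + \text{const}.$$

Collecting the $|S|$-dependent terms yields $|S|\left[\log\frac{1-\rho}{\rho} + \frac{1}{2}\log(2\pi\sigma_x^2)\right] = \tau\,\|\mathbf{x}\|_0$, plus the quadratic shrinkage $\|\mathbf{x}\|_2^2/(2\sigma_x^2)$ on the active entries.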

Sparser priors ($\rho \to 0$) increase $\tau$ and force more entries to zero. MAP under a Laplace prior gives LASSO; MAP under spike-and-slab gives the NP-hard $\ell_0$ problem. The choice of prior is what separates computationally tractable (Laplace) from intractable (spike) MAP.

Posterior Mean is the MMSE Estimator

Under the Bernoulli-Gaussian model, the posterior mean $\mathbb{E}[\mathbf{x} \mid \mathbf{y}]$ is the MMSE estimator: it minimizes the Bayesian MSE $\mathbb{E}\|\mathbf{x}-\hat{\mathbf{x}}\|_2^2$. Unlike MAP, the posterior mean averages over all $2^N$ support configurations weighted by posterior probability, producing a soft estimate that is generally not exactly sparse. For recovery tasks where the MSE is the operational metric (e.g., channel estimation), the posterior mean outperforms MAP; for support identification, MAP or a thresholded posterior mean is preferred.
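For intuition, consider the scalar channel $y = x + w$ ($\mathbf{A} = 1$), where the posterior mean has a closed form: a posterior inclusion probability times the usual Gaussian (Wiener) shrinkage. A sketch (the function name is ours):

```python
import numpy as np

def scalar_bg_posterior_mean(y, rho, sigma_x2, sigma2):
    """MMSE estimate of x from y = x + w with a spike-and-slab prior on x."""
    def gauss(v, var):                                # N(v; 0, var) density
        return np.exp(-v**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    # Posterior inclusion probability P(s = 1 | y): compare the two marginal
    # likelihoods y ~ N(0, sigma_x2 + sigma2) (active) vs y ~ N(0, sigma2) (inactive).
    num = rho * gauss(y, sigma_x2 + sigma2)
    pi_y = num / (num + (1 - rho) * gauss(y, sigma2))
    # Soft estimate: never exactly zero for y != 0, but strongly shrunk when pi_y is small.
    return pi_y * (sigma_x2 / (sigma_x2 + sigma2)) * y
```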

Definition:

Sparse Bayesian Learning (SBL)

SBL replaces the Bernoulli mixture with a continuous automatic-relevance-determination (ARD) prior: each $x_i$ has its own variance $\gamma_i \geq 0$, with

$$x_i \mid \gamma_i \sim \mathcal{N}(0,\gamma_i), \qquad \gamma_i \sim \text{Gamma hyper-prior}.$$

SBL estimates $\boldsymbol{\gamma} = (\gamma_1,\ldots,\gamma_N)$ by type-II maximum likelihood (marginal-likelihood maximization), then computes the posterior mean of $\mathbf{x}$ conditioned on $\hat{\boldsymbol{\gamma}}$. Components with $\hat{\gamma}_i \to 0$ are pruned from the model.


SBL via EM

Complexity: per-iteration cost $O(N^3)$ for the posterior covariance; reducible to $O(MN^2)$ via the Woodbury identity.

```python
import numpy as np

def sbl_em(A, y, sigma2, n_iter=500, tol=1e-8, gamma_floor=1e-12):
    """SBL via EM.

    Input:  sensing matrix A (M x N), observation y (M,), noise variance sigma2.
    Output: sparse estimate x_hat (posterior mean) and hyperparameters gamma.
    """
    M, N = A.shape
    gamma = np.ones(N)                       # initialize gamma_i = 1, i = 1..N
    AhA = A.conj().T @ A / sigma2            # precompute A^H A / sigma2
    Ahy = A.conj().T @ y / sigma2            # precompute A^H y / sigma2
    for _ in range(n_iter):
        # E-step: posterior covariance and mean of x given the current gamma
        Sigma_x = np.linalg.inv(AhA + np.diag(1.0 / np.maximum(gamma, gamma_floor)))
        mu_x = Sigma_x @ Ahy
        # M-step: gamma_i <- E[|x_i|^2 | y, gamma] = |mu_i|^2 + [Sigma_x]_{ii}
        gamma_new = np.abs(mu_x) ** 2 + np.real(np.diag(Sigma_x))
        if np.max(np.abs(gamma_new - gamma)) < tol:   # converged
            gamma = gamma_new
            break
        gamma = gamma_new
    return mu_x, gamma                       # x_hat = mu_x
```

The `gamma_floor` guard keeps the $\boldsymbol{\Gamma}^{-1}$ inverse well defined as components are pruned toward zero.
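A quick end-to-end check, reusing the illustrative `sample_bernoulli_gaussian` helper from above:

```python
A, x_true, y = sample_bernoulli_gaussian(M=40, N=100, rho=0.05, sigma_x2=1.0, sigma2=1e-2, rng=0)
x_hat, gamma = sbl_em(A, y, sigma2=1e-2)
print("NMSE:", np.linalg.norm(x_hat - x_true)**2 / np.linalg.norm(x_true)**2)
print("active components:", np.sum(gamma > 1e-6))   # most gammas are driven toward zero
```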

Theorem: SBL Fixed Points are Sparse

Every fixed point $\hat{\boldsymbol{\gamma}}$ of the SBL EM iteration has at most $M$ nonzero entries, where $M$ is the number of observations. Equivalently, the recovered $\hat{\mathbf{x}}$ is supported on at most $M$ indices.

The reader should verify this against the LASSO solution: LASSO too has at most $M$ nonzero entries (under mild genericity). SBL inherits the same "at most $M$ active components" structure because both reduce to a rank-$M$ subspace selection.

Example: SBL vs LASSO on a Small Problem

Consider $N=100$, $M=40$, true sparsity $s=5$, an i.i.d. Gaussian sensing matrix, and SNR $= 20$ dB. Compare support recovery and MSE of SBL versus LASSO (cross-validated $\lambda$) over 50 trials.
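A sketch of this experiment, using scikit-learn's `LassoCV` for the cross-validated $\lambda$ and the `sbl_em` sketch above (the support threshold of $10^{-3}$ and the SNR-to-noise-variance conversion are our choices):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def one_trial(seed, N=100, M=40, s=5, snr_db=20.0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, N)) / np.sqrt(M)
    supp = rng.choice(N, size=s, replace=False)       # random true support
    x = np.zeros(N)
    x[supp] = rng.standard_normal(s)
    y0 = A @ x
    sigma2 = np.mean(y0**2) * 10 ** (-snr_db / 10)    # noise variance from target SNR
    y = y0 + rng.normal(0.0, np.sqrt(sigma2), M)

    x_sbl, _ = sbl_em(A, y, sigma2)
    x_lasso = LassoCV(cv=5, fit_intercept=False).fit(A, y).coef_

    def support(v, thresh=1e-3):
        return set(np.flatnonzero(np.abs(v) > thresh))
    return (support(x_sbl) == set(supp), support(x_lasso) == set(supp),
            np.mean((x_sbl - x)**2), np.mean((x_lasso - x)**2))

# Average exact-support recovery and MSE over 50 trials:
results = np.array([one_trial(t) for t in range(50)], dtype=float)
print("P(exact support)  SBL: %.2f  LASSO: %.2f" % tuple(results[:, :2].mean(0)))
print("MSE               SBL: %.2e  LASSO: %.2e" % tuple(results[:, 2:].mean(0)))
```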

SBL vs LASSO: Support Recovery and MSE

Vary the sparsity level $s$, the undersampling ratio $M/N$, and the SNR. Observe the gap between SBL and LASSO in probability of exact support recovery and in MSE.

Default parameters: sparsity $s=5$, undersampling ratio $M/N=0.4$, SNR $=20$ dB.

Posterior Marginals for a Toy Bernoulli-Gaussian Model

Exhaustively compute the posterior over supports for $N=8$, $s \leq 3$, and display the posterior inclusion probability per index.

Default parameters: $\rho = 0.2$, SNR $= 15$ dB.
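A sketch of the exhaustive computation (the function name is ours; SciPy supplies the Gaussian log-density). Each support $S$ with $|S| \le 3$ is weighted by its prior $\rho^{|S|}(1-\rho)^{N-|S|}$ times the marginal likelihood obtained by integrating $\mathbf{x}_S$ out analytically:

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def posterior_inclusion_probs(A, y, rho, sigma_x2, sigma2, max_s=3):
    """Exact P(s_i = 1 | y) by enumerating all supports with |S| <= max_s."""
    M, N = A.shape
    log_w, supports = [], []
    for k in range(max_s + 1):
        for S in combinations(range(N), k):
            # p(y | S) is zero-mean Gaussian with covariance
            # sigma2 * I + sigma_x2 * A_S A_S^T  (x_S integrated out)
            cov = sigma2 * np.eye(M)
            if k > 0:
                AS = A[:, list(S)]
                cov += sigma_x2 * AS @ AS.T
            log_lik = multivariate_normal.logpdf(y, mean=np.zeros(M), cov=cov)
            log_prior = k * np.log(rho) + (N - k) * np.log(1 - rho)
            log_w.append(log_lik + log_prior)
            supports.append(frozenset(S))
    w = np.exp(np.array(log_w) - np.max(log_w))       # stable normalization
    w /= w.sum()
    return np.array([w[[i in S for S in supports]].sum() for i in range(N)])
```

For $N=8$ and $s \le 3$ there are $\sum_{k=0}^{3}\binom{8}{k} = 93$ candidate supports, so exhaustive enumeration is cheap.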
🔧 Engineering Note

When to Use SBL in Practice

SBL is most attractive when: (i) the sensing matrix is highly coherent (LASSO's guarantees degrade, SBL's do not), (ii) the signal has decaying-magnitude structure where shrinkage is harmful, or (iii) per-component uncertainty quantification is needed. It is less attractive when $N$ is very large (cubic per-iteration cost) or when a single convex solve suffices. For massive-MIMO channel estimation with coherent pilot sequences, SBL consistently beats LASSO; see Ch. 15.

Common Mistake: Confusing MAP with Posterior Mean

Mistake:

Reporting the MAP estimate as "the Bayesian answer" when the task requires minimum MSE.

Correction:

MAP minimizes the 0-1 loss over $\mathbf{x}$; the posterior mean minimizes the squared-error loss. They coincide for symmetric, unimodal posteriors (e.g., Gaussian). Under a spike-and-slab posterior, the two can differ substantially: MAP prefers a single support, while the mean averages over supports.
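A quick numeric illustration, reusing the illustrative `scalar_bg_posterior_mean` sketch from earlier (the specific parameter values are our choices):

```python
import numpy as np

# Scalar model y = x + w. Evaluate the l0-penalized MAP objective from the
# theorem above at its two candidate minimizers: x = 0 and the best active x.
y, rho, sigma_x2, sigma2 = 0.7, 0.1, 1.0, 0.1
tau = np.log((1 - rho) / rho) + 0.5 * np.log(2 * np.pi * sigma_x2)
x_act = y * sigma_x2 / (sigma_x2 + sigma2)           # minimizer on the active branch
cost_zero = y**2 / (2 * sigma2)
cost_act = (y - x_act)**2 / (2 * sigma2) + x_act**2 / (2 * sigma_x2) + tau
x_map = 0.0 if cost_zero <= cost_act else x_act      # MAP snaps to exactly zero here
x_mmse = scalar_bg_posterior_mean(y, rho, sigma_x2, sigma2)
print(x_map, x_mmse)                                  # 0.0 vs roughly 0.15
```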

Spike-and-slab prior

A two-component mixture with a point mass at zero (the spike) and a continuous component (the slab). The canonical prior for exact-sparsity Bayesian models.


Sparse Bayesian Learning (SBL)

A hierarchical Bayesian approach using per-component variance hyperparameters (ARD prior) estimated by type-II maximum likelihood. Tipping (2001); Wipf-Rao (2004).


Historical Note: From RVM to SBL to Deep Sparse Models

1996-2020

Tipping's relevance-vector machine (2001) introduced the ARD prior as a sparsity-inducing mechanism in Bayesian regression. Wipf and Rao (2004) showed that SBL outperforms LASSO in support recovery under adverse dictionary conditions, an observation that motivated two decades of follow-up work (T-MSBL for temporally correlated sources, PC-SBL for pattern-coupled sparsity, and their unrolling into deep-learning architectures). The method also reappears in massive-random-access detectors (Ch. 15).

Key Takeaway

The Bernoulli-Gaussian / spike-and-slab framework unifies sparse estimation under a probabilistic lens: MAP under a Laplace prior gives LASSO; MAP under spike-and-slab gives $\ell_0$; the posterior mean is the MMSE estimator. Sparse Bayesian Learning replaces the intractable support marginalization with an ARD hyperparameter optimization solved via EM. SBL delivers sparser, less biased, and typically more accurate estimates than LASSO under coherent dictionaries, at the price of cubic per-iteration cost.