Hierarchical and Empirical Bayes

The Hyperparameter Problem

Every Bayesian model contains hyperparameters: the noise variance $\sigma^2$, the prior variance $\gamma^2$, the Laplace rate $\lambda$, the sparsity level $w$. In variational regularization, these are tuned by cross-validation or the discrepancy principle (§Parameter Choice Rules). Hierarchical Bayes places priors on the hyperparameters themselves, allowing the data to inform their values automatically.

In RF imaging, the sparsity level (how many targets are present) and target reflectivity variance are typically unknown. Hierarchical models avoid the need to guess these quantities beforehand — the posterior over hyperparameters reflects our uncertainty about the scene's statistical structure.

Definition: Hierarchical Bayesian Model

A hierarchical Bayesian model introduces latent hyperparameters $\boldsymbol{\alpha}$ with their own prior $\pi(\boldsymbol{\alpha})$:

$$\mathbf{y} \mid \boldsymbol{\gamma} \sim p(\mathbf{y} \mid \boldsymbol{\gamma}), \qquad \boldsymbol{\gamma} \mid \boldsymbol{\alpha} \sim \pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha}), \qquad \boldsymbol{\alpha} \sim \pi(\boldsymbol{\alpha}).$$

The joint posterior over $(\boldsymbol{\gamma}, \boldsymbol{\alpha})$ is

$$p(\boldsymbol{\gamma}, \boldsymbol{\alpha} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\gamma})\, \pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha})\,\pi(\boldsymbol{\alpha}).$$

The marginal posterior for $\boldsymbol{\gamma}$ alone integrates out $\boldsymbol{\alpha}$:

$$p(\boldsymbol{\gamma} \mid \mathbf{y}) = \int p(\boldsymbol{\gamma}, \boldsymbol{\alpha} \mid \mathbf{y})\, \mathrm{d}\boldsymbol{\alpha}.$$

This integration automatically accounts for uncertainty in the hyperparameters and avoids overfitting to a single $\boldsymbol{\alpha}$ value.
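To make the three-level structure concrete, the following minimal sketch draws one sample from such a hierarchy in NumPy. The Gamma hyperprior on $\alpha$, the forward operator, and all numerical values are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.standard_normal((m, n)) / np.sqrt(m)   # fixed forward operator (illustrative)
sigma = 0.1                                    # noise standard deviation (illustrative)

# Level 3 (hyperprior):  alpha ~ Gamma(2, 1)   -- an assumed choice of pi(alpha)
alpha = rng.gamma(shape=2.0, scale=1.0)
# Level 2 (prior):       gamma | alpha ~ N(0, alpha^{-1} I)
gamma = rng.normal(0.0, np.sqrt(1.0 / alpha), size=n)
# Level 1 (likelihood):  y | gamma ~ N(A gamma, sigma^2 I)
y = A @ gamma + sigma * rng.standard_normal(m)
```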

Theorem: Empirical Bayes and Evidence Maximization

In empirical Bayes (also called type-II maximum likelihood or evidence maximization), the hyperparameters are estimated by maximizing the marginal likelihood (evidence):

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \; \mathcal{Z}(\mathbf{y} \mid \boldsymbol{\alpha}) = \arg\max_{\boldsymbol{\alpha}} \int p(\mathbf{y} \mid \boldsymbol{\gamma})\, \pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha})\,\mathrm{d}\boldsymbol{\gamma}.$$

For the Gaussian model $\mathbf{y} = \mathbf{A}\boldsymbol{\gamma} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ and $\boldsymbol{\gamma} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I})$, the log-evidence is

$$\log\mathcal{Z}(\mathbf{y} \mid \alpha) = -\frac{1}{2}\Bigl[ \mathbf{y}^T \mathbf{C}_\alpha^{-1} \mathbf{y} + \log\det \mathbf{C}_\alpha + m\log(2\pi)\Bigr],$$

where $\mathbf{C}_\alpha = \sigma^2 \mathbf{I} + \alpha^{-1}\mathbf{A}\mathbf{A}^H$.
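The log-evidence can be evaluated and maximized numerically. Below is a minimal NumPy sketch for the real-valued case (so $\mathbf{A}^H$ reduces to $\mathbf{A}^T$); the function name log_evidence, the synthetic data, and the grid search over a single shared $\alpha$ are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def log_evidence(y, A, sigma2, alpha):
    """log Z(y | alpha) for y = A gamma + w, w ~ N(0, sigma2 I), gamma ~ N(0, alpha^{-1} I)."""
    m = len(y)
    C = sigma2 * np.eye(m) + (1.0 / alpha) * (A @ A.T)   # C_alpha (real-valued case)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)                     # y^T C_alpha^{-1} y
    return -0.5 * (quad + logdet + m * np.log(2.0 * np.pi))

# Empirical Bayes by grid search: pick the alpha that maximizes the evidence.
rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ rng.normal(0.0, 1.0, size=n) + 0.1 * rng.standard_normal(m)  # true prior variance 1

alphas = np.logspace(-3, 3, 61)
alpha_hat = alphas[np.argmax([log_evidence(y, A, 0.01, a) for a in alphas])]
print(f"evidence-maximizing alpha = {alpha_hat:.3g}")
```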

Definition: Automatic Relevance Determination (ARD)

Automatic relevance determination assigns a separate precision parameter $\alpha_i$ to each component of $\boldsymbol{\gamma}$:

$$\gamma_i \mid \alpha_i \sim \mathcal{N}(0, \alpha_i^{-1}), \qquad i = 1, \ldots, n.$$

The prior on $\boldsymbol{\gamma}$ becomes $\pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha}) = \prod_{i=1}^n \mathcal{N}(\gamma_i; 0, \alpha_i^{-1})$.

Maximizing the evidence over $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ drives most $\alpha_i \to \infty$, effectively pruning the corresponding components to zero. Only components that are supported by the data retain finite $\alpha_i$ and nonzero $\hat{\gamma}_i$. This provides Bayesian sparsification without explicitly specifying $\ell_1$ or $\ell_0$ penalties.

Definition: Sparse Bayesian Learning (SBL)

Sparse Bayesian Learning (Tipping, 2001) is the ARD framework applied to sparse signal recovery:

$$\mathbf{y} = \mathbf{A}\boldsymbol{\gamma} + \mathbf{w}, \qquad \gamma_i \mid \alpha_i \sim \mathcal{N}(0, \alpha_i^{-1}), \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}).$$

The posterior for $\boldsymbol{\gamma}$ given $\boldsymbol{\alpha}$ is Gaussian (Theorem: Gaussian Prior Yields Gaussian Posterior):

$$\boldsymbol{\gamma} \mid \mathbf{y}, \boldsymbol{\alpha} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\Sigma} = (\sigma^{-2}\mathbf{A}^H\mathbf{A} + \mathbf{D}_\alpha)^{-1}$, $\mathbf{D}_\alpha = \text{diag}(\alpha_1, \ldots, \alpha_n)$, and $\boldsymbol{\mu} = \sigma^{-2}\boldsymbol{\Sigma}\mathbf{A}^H\mathbf{y}$.

The hyperparameters $\boldsymbol{\alpha}$ and $\sigma^2$ are estimated by maximizing the evidence $\mathcal{Z}(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2)$. SBL is a bridge between MAP (a single point estimate) and full Bayesian inference (a full posterior): it provides posterior uncertainty for the estimated support.
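For fixed hyperparameters, this conditional posterior is two lines of linear algebra. The sketch below assumes a real-valued $\mathbf{A}$ (so $\mathbf{A}^H = \mathbf{A}^T$); the function name sbl_posterior is illustrative.

```python
import numpy as np

def sbl_posterior(y, A, sigma2, alpha):
    """Conditional posterior of gamma given y and fixed ARD precisions alpha:
    Sigma = (sigma^{-2} A^T A + diag(alpha))^{-1},  mu = sigma^{-2} Sigma A^T y."""
    Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(alpha))
    mu = Sigma @ A.T @ y / sigma2
    return mu, Sigma
```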

EM Algorithm for Sparse Bayesian Learning

Complexity: $O(n^2 m + n^3)$ per iteration (dominant cost: posterior covariance update)
Input: sensing matrix $\mathbf{A}$, measurements $\mathbf{y}$, noise variance $\sigma^2$,
max iterations $T$, convergence threshold $\varepsilon$
Output: posterior mean $\boldsymbol{\mu}$, posterior covariance $\boldsymbol{\Sigma}$,
hyperparameters $\boldsymbol{\alpha}$
Initialize: $\alpha_i = 1$ for all $i$, iteration $t = 0$
Repeat:
1. (E-step) Compute posterior given current $\boldsymbol{\alpha}$:
   - $\boldsymbol{\Sigma} \leftarrow (\sigma^{-2}\mathbf{A}^H\mathbf{A} + \text{diag}(\boldsymbol{\alpha}))^{-1}$
   - $\boldsymbol{\mu} \leftarrow \sigma^{-2}\boldsymbol{\Sigma}\mathbf{A}^H\mathbf{y}$
2. (M-step) Update hyperparameters by maximizing the evidence:
   - $\alpha_i^{\text{new}} \leftarrow \frac{1 - \alpha_i [\boldsymbol{\Sigma}]_{ii}}{\mu_i^2}$
3. Prune: set $\alpha_i \leftarrow \infty$ (equivalently $\mu_i \leftarrow 0$)
   for any $i$ where $\alpha_i^{\text{new}} > \alpha_{\max}$ (e.g., $\alpha_{\max} = 10^6$)
4. $t \leftarrow t + 1$
Until $\|\boldsymbol{\alpha}^{\text{new}} - \boldsymbol{\alpha}\|_\infty < \varepsilon$ or $t \geq T$

The M-step update has a fixed-point interpretation: $\alpha_i$ is driven to infinity whenever $\mu_i^2 \leq \alpha_i[\boldsymbol{\Sigma}]_{ii}$, i.e., when the mean squared value of component $i$ is dominated by its posterior variance, meaning the data provides no evidence for that component being nonzero.
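The following is a minimal NumPy sketch of the full EM loop, again assuming a real-valued $\mathbf{A}$; the function name sbl_em, the default values of $T$, $\varepsilon$, and $\alpha_{\max}$, and the synthetic demo scene are illustrative assumptions rather than prescribed choices.

```python
import numpy as np

def sbl_em(y, A, sigma2, T=200, eps=1e-4, alpha_max=1e6):
    """Sparse Bayesian Learning via the EM / fixed-point updates above."""
    n = A.shape[1]
    alpha = np.ones(n)                 # initialize all precisions to 1
    active = np.arange(n)              # components not yet pruned
    mu_full = np.zeros(n)
    for _ in range(T):
        if active.size == 0:
            break
        Aa = A[:, active]
        # E-step: posterior over the surviving components
        Sigma = np.linalg.inv(Aa.T @ Aa / sigma2 + np.diag(alpha[active]))
        mu = Sigma @ Aa.T @ y / sigma2
        # M-step (fixed point): alpha_i <- (1 - alpha_i * Sigma_ii) / mu_i^2
        alpha_new = (1.0 - alpha[active] * np.diag(Sigma)) / np.maximum(mu**2, 1e-32)
        delta = np.max(np.abs(alpha_new - alpha[active]))
        alpha[active] = alpha_new
        # Prune components whose precision has effectively diverged
        keep = alpha_new <= alpha_max
        mu_full[:] = 0.0
        mu_full[active[keep]] = mu[keep]
        active = active[keep]
        if delta < eps:
            break
    return mu_full, active, alpha

# Demo on a synthetic 3-sparse scene (all values illustrative)
rng = np.random.default_rng(1)
m, n = 60, 128
A = rng.standard_normal((m, n)) / np.sqrt(m)
gamma_true = np.zeros(n)
gamma_true[[10, 53, 90]] = [1.5, -2.0, 0.8]
y = A @ gamma_true + 0.05 * rng.standard_normal(m)
mu_hat, support, _ = sbl_em(y, A, sigma2=0.05**2)
print("recovered support:", np.sort(support))
```

Pruned indices are simply dropped from the active set, so the per-iteration cost shrinks as sparsification proceeds.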

SBL Convergence — ARD Weight Pruning

Watch how the SBL EM algorithm prunes irrelevant components over iterations.

Left panel: Log of precision parameters $\log\alpha_i$ vs EM iteration. Active components (true nonzeros) stabilize at moderate values; inactive components diverge to infinity (shown as pruned at the iteration of divergence).

Right panel: Reconstructed signal at each iteration — watch sparsification happen progressively as irrelevant components are pruned.


SBL vs LASSO — What Makes SBL Different?

Both SBL and LASSO promote sparsity, but they differ fundamentally:

| Property | LASSO (Laplace MAP) | SBL (ARD + type-II ML) |
|---|---|---|
| Sparsity mechanism | Explicit $\ell_1$ penalty | Evidence-driven precision divergence |
| Uncertainty | None (point estimate) | Full Gaussian posterior on active components |
| Bias on active components | Yes (shrinkage bias) | Smaller (posterior mean debiased) |
| Hyperparameters | $\lambda$ (requires tuning) | $\boldsymbol{\alpha}, \sigma^2$ (learned from data) |
| Computational cost | $O(nm\log(1/\varepsilon))$ via ISTA | $O(n^2m + n^3)$ per EM iteration |
| Super-resolution | Empirically good | Empirically excellent (promotes ultra-sparse solutions) |

SBL tends to find sparser solutions than LASSO for the same data, making it particularly effective for super-resolution radar imaging (§The Point-Spread Function (PSF)).
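For the LASSO column of the table, a minimal ISTA sketch is shown below; the step size (derived from the spectral norm), the value of $\lambda$, and the iteration count are illustrative assumptions that would need tuning in practice.

```python
import numpy as np

def ista_lasso(y, A, lam, n_iter=500):
    """ISTA for min_x 0.5 * ||y - A x||_2^2 + lam * ||x||_1.
    Each iteration costs O(nm): two matrix-vector products plus a soft threshold."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the quadratic term
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the data-fit term
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x
```

Consistent with the comparison above, the ISTA/LASSO solution typically retains more small nonzero coefficients than the SBL sketch, which prunes them exactly to zero.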

Common Mistake: SBL EM Can Converge to Non-Sparse Local Optima

Mistake:

The SBL EM algorithm is guaranteed to converge, so practitioners assume the converged solution is the globally sparse optimum.

Correction:

The evidence $\mathcal{Z}(\mathbf{y} \mid \boldsymbol{\alpha})$ is generally non-convex in $\boldsymbol{\alpha}$, and EM finds only a local maximum. Multiple initializations with different starting $\boldsymbol{\alpha}$ (e.g., all ones, random, or initialized from the LASSO support) are recommended. The global optimum is known to be maximally sparse (Wipf and Rao, 2004), but EM may get stuck in non-sparse local maxima, especially at low SNR or when the sensing matrix columns are nearly collinear.

Historical Note: From Relevance Vector Machines to SBL

2001-2010

Sparse Bayesian Learning emerged from Michael Tipping's 2001 paper on the Relevance Vector Machine (RVM) — a Bayesian alternative to the support vector machine (SVM). Tipping observed that by placing ARD priors on the kernel expansion weights, the EM algorithm would prune nearly all basis vectors, yielding a much sparser model than the SVM. The SBL framework was soon recognized as a powerful tool for compressed sensing, appearing in the signal processing literature from around 2005.

The connection to compressed sensing and RF imaging was made explicit by Wipf and Rao (2004), who showed that, under certain conditions on $\mathbf{A}$, the global maximum of the SBL evidence corresponds to the maximally sparse solution, yielding sparsity recovery guarantees that can be stronger than those of $\ell_1$ minimization. This theoretical insight motivated the deployment of SBL for radar and SAR imaging, where its strong sparsity recovery properties are particularly valuable.


Key Takeaway

  1. Hierarchical Bayes places priors on hyperparameters, avoiding the need for manual tuning of $\lambda$, $\sigma^2$, or sparsity levels.

  2. Empirical Bayes (evidence maximization) selects hyperparameters by maximizing the marginal likelihood — an automatic Occam's razor.

  3. ARD assigns per-component precisions; the EM algorithm drives most $\alpha_i \to \infty$, achieving automatic Bayesian sparsification.

  4. SBL combines ARD with the Gaussian forward model, providing sparse estimates with posterior uncertainty bounds and no need to pre-specify the sparsity level $k$.

  5. SBL tends to produce sparser solutions than LASSO but requires $O(n^3)$ linear algebra per iteration; this is tractable for $n \leq 10^4$ and expensive for larger scenes.

Quick Check

In the SBL EM algorithm, component $i$ is pruned (set to zero) when:

$\mu_i = 0$ and $[\boldsymbol{\Sigma}]_{ii} = 0$

$\alpha_i \to \infty$, which happens when the posterior variance dominates the posterior mean squared value

$|\mu_i| < \lambda$ for some threshold $\lambda$

The LASSO soft-threshold condition $|\mathbf{a}_i^H(\mathbf{y} - \mathbf{A}\boldsymbol{\mu})| < \sigma^2\lambda$