Hierarchical and Empirical Bayes
The Hyperparameter Problem
Every Bayesian model contains hyperparameters — the noise variance $\sigma^2$, prior variance $\tau^2$, Laplace rate $\lambda$, sparsity level $k$. In variational regularization, these are tuned by cross-validation or the discrepancy principle (§Parameter Choice Rules). Hierarchical Bayes places priors on the hyperparameters themselves, allowing the data to inform their values automatically.
In RF imaging, the sparsity level (how many targets are present) and target reflectivity variance are typically unknown. Hierarchical models avoid the need to guess these quantities beforehand — the posterior over hyperparameters reflects our uncertainty about the scene's statistical structure.
Definition: Hierarchical Bayesian Model
Hierarchical Bayesian Model
A hierarchical Bayesian model introduces latent hyperparameters $\theta$ with their own prior $p(\theta)$:
$$y \sim p(y \mid x), \qquad x \sim p(x \mid \theta), \qquad \theta \sim p(\theta).$$
The joint posterior over $(x, \theta)$ is
$$p(x, \theta \mid y) = \frac{p(y \mid x)\, p(x \mid \theta)\, p(\theta)}{p(y)}.$$
The marginal posterior for $x$ alone integrates out $\theta$:
$$p(x \mid y) = \int p(x, \theta \mid y)\, d\theta.$$
This integration automatically accounts for uncertainty in the hyperparameters and avoids overfitting to a single value.
Theorem: Empirical Bayes and Evidence Maximization
In empirical Bayes (also called type-II maximum likelihood or evidence maximization), the hyperparameters are estimated by maximizing the marginal likelihood (evidence):
$$\hat{\theta} = \arg\max_{\theta}\, p(y \mid \theta) = \arg\max_{\theta} \int p(y \mid x)\, p(x \mid \theta)\, dx.$$
For the Gaussian model with $p(y \mid x) = \mathcal{N}(Ax, \sigma^2 I)$ and $p(x) = \mathcal{N}(0, \Gamma^{-1})$ for a prior precision matrix $\Gamma$, the log-evidence is
$$\log p(y \mid \Gamma, \sigma^2) = -\tfrac{1}{2}\left(m \log 2\pi + \log\det C + y^{\top} C^{-1} y\right),$$
where $C = \sigma^2 I + A \Gamma^{-1} A^{\top}$.
Marginalizing the Gaussian model
Since $y = Ax + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and $x \sim \mathcal{N}(0, \Gamma^{-1})$ independently, the marginal is $y \sim \mathcal{N}(0, C)$ with $C = \sigma^2 I + A \Gamma^{-1} A^{\top}$. The Gaussian log-density of $y$ gives the stated expression.
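This marginalization can be checked numerically. The sketch below (assuming NumPy/SciPy; the problem sizes and precisions are illustrative) evaluates the closed-form log-evidence and compares it against SciPy's Gaussian log-density for $y \sim \mathcal{N}(0, C)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.standard_normal((m, n))
sigma2 = 0.5                       # noise variance sigma^2
gamma = rng.uniform(0.5, 2.0, n)   # illustrative per-component precisions
y = rng.standard_normal(m)

# Marginal covariance C = sigma^2 I + A Gamma^{-1} A^T
C = sigma2 * np.eye(m) + A @ np.diag(1.0 / gamma) @ A.T

# Closed-form log-evidence: -1/2 (m log 2pi + log det C + y^T C^{-1} y)
_, logdet = np.linalg.slogdet(C)
log_ev = -0.5 * (m * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Cross-check against the Gaussian log-density of y ~ N(0, C)
assert np.isclose(log_ev, multivariate_normal(np.zeros(m), C).logpdf(y))
```

The agreement confirms that integrating out $x$ leaves a zero-mean Gaussian in $y$ whose covariance combines noise and prior contributions.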
Definition: Automatic Relevance Determination (ARD)
Automatic Relevance Determination (ARD)
Automatic relevance determination assigns a separate precision parameter $\gamma_j$ to each component of $x$:
$$x_j \mid \gamma_j \sim \mathcal{N}(0, \gamma_j^{-1}), \qquad j = 1, \dots, n.$$
The prior on $x$ becomes $p(x \mid \gamma) = \mathcal{N}(0, \Gamma^{-1})$ with $\Gamma = \operatorname{diag}(\gamma_1, \dots, \gamma_n)$.
Maximizing the evidence over $\gamma$ drives most $\gamma_j \to \infty$, effectively pruning the corresponding components to zero. Only components that are supported by the data retain finite $\gamma_j$ and nonzero $x_j$. This provides Bayesian sparsification without explicitly specifying $\ell_0$ or $\ell_1$ penalties.
Definition: Sparse Bayesian Learning (SBL)
Sparse Bayesian Learning (SBL)
Sparse Bayesian Learning (Tipping, 2001) is the ARD framework applied to sparse signal recovery:
$$y = Ax + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad x_j \sim \mathcal{N}(0, \gamma_j^{-1}).$$
The posterior for $x$ given $(\gamma, \sigma^2)$ is Gaussian (Theorem: Gaussian Prior Yields Gaussian Posterior):
$$p(x \mid y, \gamma, \sigma^2) = \mathcal{N}(\mu, \Sigma),$$
where $\Sigma = \left(\sigma^{-2} A^{\top} A + \Gamma\right)^{-1}$, $\mu = \sigma^{-2} \Sigma A^{\top} y$, and $\Gamma = \operatorname{diag}(\gamma_1, \dots, \gamma_n)$.
The hyperparameters $\gamma$ and $\sigma^2$ are estimated by maximizing the evidence $p(y \mid \gamma, \sigma^2)$. SBL is a bridge between MAP (single point estimate) and full Bayesian inference (full posterior) — it provides posterior uncertainty for the estimated support.
EM Algorithm for Sparse Bayesian Learning
Complexity: $O(n^3)$ per iteration (dominant cost: posterior covariance update).

The M-step update $\gamma_j^{\text{new}} = \left(\mu_j^2 + \Sigma_{jj}\right)^{-1}$ has a fixed-point interpretation: it drives $\gamma_j$ to infinity whenever $\mu_j^2 \ll \Sigma_{jj}$, i.e., when the mean squared value of component $j$ is dominated by its posterior variance — meaning the data provides no evidence for that component being nonzero.
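The EM iteration can be sketched in a few lines of NumPy. This is a minimal implementation, not a production solver: the pruning cap, thresholds, and test problem (a 3-sparse signal on a random $30 \times 50$ matrix) are illustrative choices, and $\sigma^2$ is held fixed rather than learned.

```python
import numpy as np

def sbl_em(A, y, sigma2, n_iter=200, prune=1e6):
    """EM for Sparse Bayesian Learning: learn per-component ARD precisions gamma."""
    m, n = A.shape
    gamma = np.ones(n)                          # all-ones initialization
    mu = np.zeros(n)
    for _ in range(n_iter):
        # E-step: Gaussian posterior N(mu, Sigma) for x given (gamma, sigma2)
        Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(gamma))
        mu = Sigma @ A.T @ y / sigma2
        # M-step: gamma_j <- 1 / (mu_j^2 + Sigma_jj); gamma_j grows without
        # bound when the posterior variance dominates the squared mean
        gamma = np.minimum(1.0 / (mu ** 2 + np.diag(Sigma)), prune)
    mu = np.where(gamma >= prune, 0.0, mu)      # zero out pruned components
    return mu, gamma

# Tiny demo: recover a 3-sparse signal
rng = np.random.default_rng(1)
m, n = 30, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[4, 17, 33]] = [3.0, -2.0, 2.5]
y = A @ x_true + 0.01 * rng.standard_normal(m)
x_hat, gamma = sbl_em(A, y, sigma2=1e-4)
print(np.flatnonzero(np.abs(x_hat) > 0.5))  # indices EM kept active
```

The cap on $\gamma_j$ stands in for "divergence to infinity": capped components have prior variance $\approx 0$ and are reported as exact zeros.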
SBL Convergence — ARD Weight Pruning
Watch how the SBL EM algorithm prunes irrelevant components over iterations.
Left panel: log precision parameters $\log \gamma_j$ vs. EM iteration. Active components (true nonzeros) stabilize at moderate values; inactive components diverge to infinity (shown as pruned at the iteration of divergence).
Right panel: Reconstructed signal at each iteration — watch sparsification happen progressively as irrelevant components are pruned.
SBL vs LASSO — What Makes SBL Different?
Both SBL and LASSO promote sparsity, but they differ fundamentally:
| Property | LASSO (Laplace MAP) | SBL (ARD + type-II ML) |
|---|---|---|
| Sparsity mechanism | Explicit $\ell_1$ penalty | Evidence-driven precision divergence |
| Uncertainty | None (point estimate) | Full posterior Gaussian on active components |
| Bias on active components | Yes (shrinkage bias) | Smaller (posterior mean debiased) |
| Hyperparameter | $\lambda$ (requires tuning) | $\gamma, \sigma^2$ (learned from data) |
| Computational cost | $O(mn)$ per iteration via ISTA | $O(n^3)$ per EM iteration |
| Super-resolution | Empirically good | Empirically excellent (promotes ultra-sparse solutions) |
SBL tends to find sparser solutions than LASSO for the same data, making it particularly effective for super-resolution radar imaging (§The Point-Spread Function (PSF)).
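The cost and bias rows of the table can be made concrete with a minimal ISTA solver for the LASSO. This is a sketch: the step size, $\lambda$, and test problem are illustrative choices, not values from the text.

```python
import numpy as np

def ista(A, y, lam, n_iter=1000):
    """ISTA for the LASSO: each iteration costs two mat-vecs, i.e. O(mn)."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the quadratic term's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L  # gradient step on 0.5 ||Ax - y||^2
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

# Same style of test problem as above: a 3-sparse signal
rng = np.random.default_rng(1)
m, n = 30, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[4, 17, 33]] = [3.0, -2.0, 2.5]
y = A @ x_true + 0.01 * rng.standard_normal(m)
x_lasso = ista(A, y, lam=0.05)
# Shrinkage bias: active components are pulled toward zero relative to the truth
```

Comparing the magnitudes of `x_lasso` on the true support against `x_true` exhibits the shrinkage bias that the SBL posterior mean largely avoids.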
Common Mistake: SBL EM Can Converge to Non-Sparse Local Optima
Mistake:
The SBL EM algorithm is guaranteed to converge, so practitioners assume the converged solution is the globally sparse optimum.
Correction:
The evidence $p(y \mid \gamma)$ is generally non-convex in $\gamma$, and EM finds only a local maximum. Multiple initializations with different starting $\gamma$ (e.g., all-ones, random, initialized from the LASSO support) are recommended. The global optimum is known to be maximally sparse (Wipf and Rao, 2004), but EM may get stuck in non-sparse local maxima, especially at low SNR or when the sensing matrix columns are nearly collinear.
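A multi-start strategy along these lines can be sketched as follows (illustrative only: the EM loop, cap, initializations, and problem sizes are assumptions, and each run is scored by the closed-form log-evidence):

```python
import numpy as np

def sbl_em_run(A, y, sigma2, gamma0, n_iter=100, cap=1e8):
    """One SBL EM run from initialization gamma0; returns (gamma, log-evidence)."""
    m, n = A.shape
    gamma = gamma0.astype(float).copy()
    for _ in range(n_iter):
        Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(gamma))    # E-step
        mu = Sigma @ A.T @ y / sigma2
        gamma = np.minimum(1.0 / (mu ** 2 + np.diag(Sigma)), cap)   # M-step
    # Score the run by the evidence log p(y | gamma, sigma^2)
    C = sigma2 * np.eye(m) + A @ np.diag(1.0 / gamma) @ A.T
    _, logdet = np.linalg.slogdet(C)
    return gamma, -0.5 * (m * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(3)
m, n = 20, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 11, 29]] = [2.0, -3.0, 1.5]
y = A @ x_true + 0.01 * rng.standard_normal(m)

# Restart from all-ones and random initializations; keep the highest evidence
inits = [np.ones(n)] + [rng.uniform(0.1, 10.0, n) for _ in range(3)]
runs = [sbl_em_run(A, y, 1e-4, g0) for g0 in inits]
best_gamma, best_ev = max(runs, key=lambda r: r[1])
```

Selecting by evidence rather than residual keeps the comparison on the objective EM is actually (locally) maximizing.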
Historical Note: From Relevance Vector Machines to SBL
2001–2010: Sparse Bayesian Learning emerged from Michael Tipping's 2001 paper on the Relevance Vector Machine (RVM) — a Bayesian alternative to the support vector machine (SVM). Tipping observed that by placing ARD priors on the kernel expansion weights, the EM algorithm would prune nearly all basis vectors, yielding a much sparser model than the SVM. The SBL framework was soon recognized as a powerful tool for compressed sensing, appearing in the signal processing literature from around 2005.
The connection to compressed sensing and RF imaging was made explicit by Wipf and Rao (2004), who proved that SBL converges to the globally sparsest solution under certain conditions on $A$ — stronger sparsity recovery guarantees than those of $\ell_1$ minimization under comparable conditions. This theoretical insight motivated the deployment of SBL for radar and SAR imaging, where its near-minimax sparsity recovery properties are particularly valuable.
Key Takeaway
- Hierarchical Bayes places priors on hyperparameters, avoiding the need for manual tuning of $\sigma^2$, $\lambda$, or sparsity levels.
- Empirical Bayes (evidence maximization) selects hyperparameters by maximizing the marginal likelihood — an automatic Occam's razor.
- ARD assigns per-component precisions; the EM algorithm drives most $\gamma_j \to \infty$, achieving automatic Bayesian sparsification.
- SBL combines ARD with the Gaussian forward model, providing sparse estimates with posterior uncertainty bounds and no need to pre-specify the sparsity level $k$.
- SBL tends to produce sparser solutions than LASSO but requires $O(n^3)$ linear algebra per iteration — tractable for moderate grid sizes, expensive for larger scenes.
Quick Check
In the SBL EM algorithm, component $j$ is pruned (set to zero) when:
$\mu_j \to 0$ and $\Sigma_{jj} \to 0$
$\gamma_j \to \infty$, which happens when the posterior variance dominates the posterior mean squared value
$|\mu_j| < \tau$ for some threshold $\tau$
The LASSO soft-threshold condition
Correct. The M-step update $\gamma_j^{\text{new}} = (\mu_j^2 + \Sigma_{jj})^{-1}$ diverges when $\mu_j^2 \ll \Sigma_{jj}$, meaning the data provides no evidence for a nonzero component — the uncertainty exceeds the estimated signal.