Hierarchical and Empirical Bayes

The Hyperparameter Problem

Every Bayesian model contains hyperparameters: the noise variance $\sigma^2$, the prior variance $\gamma^2$, the Laplace rate $\lambda$, the sparsity level $w$. In variational regularization, these are tuned by cross-validation or the discrepancy principle (§Parameter Choice Rules). Hierarchical Bayes places priors on the hyperparameters themselves, allowing the data to inform their values automatically.

In RF imaging, the sparsity level (how many targets are present) and target reflectivity variance are typically unknown. Hierarchical models avoid the need to guess these quantities beforehand — the posterior over hyperparameters reflects our uncertainty about the scene's statistical structure.

Definition: Hierarchical Bayesian Model

A hierarchical Bayesian model introduces latent hyperparameters $\boldsymbol{\alpha}$ with their own prior $\pi(\boldsymbol{\alpha})$:

$$\mathbf{y} \mid \boldsymbol{\gamma} \sim p(\mathbf{y} \mid \boldsymbol{\gamma}), \qquad \boldsymbol{\gamma} \mid \boldsymbol{\alpha} \sim \pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha}), \qquad \boldsymbol{\alpha} \sim \pi(\boldsymbol{\alpha}).$$

The joint posterior over $(\boldsymbol{\gamma}, \boldsymbol{\alpha})$ is

$$p(\boldsymbol{\gamma}, \boldsymbol{\alpha} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\gamma})\, \pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha})\,\pi(\boldsymbol{\alpha}).$$

The marginal posterior for $\boldsymbol{\gamma}$ alone integrates out $\boldsymbol{\alpha}$:

$$p(\boldsymbol{\gamma} \mid \mathbf{y}) = \int p(\boldsymbol{\gamma}, \boldsymbol{\alpha} \mid \mathbf{y})\, \mathrm{d}\boldsymbol{\alpha}.$$

This integration automatically accounts for uncertainty in the hyperparameters and avoids overfitting to a single $\boldsymbol{\alpha}$ value.
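To make the three-level structure concrete, the following minimal sketch draws one sample from such a hierarchy in NumPy. The Gamma hyperprior on $\alpha$, the forward operator, and all numerical values are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.standard_normal((m, n)) / np.sqrt(m)   # fixed forward operator (illustrative)
sigma = 0.1                                    # noise standard deviation (illustrative)

# Level 3 (hyperprior):  alpha ~ Gamma(2, 1)   -- an assumed choice of pi(alpha)
alpha = rng.gamma(shape=2.0, scale=1.0)
# Level 2 (prior):       gamma | alpha ~ N(0, alpha^{-1} I)
gamma = rng.normal(0.0, np.sqrt(1.0 / alpha), size=n)
# Level 1 (likelihood):  y | gamma ~ N(A gamma, sigma^2 I)
y = A @ gamma + sigma * rng.standard_normal(m)
```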

Theorem: Empirical Bayes and Evidence Maximization

In empirical Bayes (also called type-II maximum likelihood or evidence maximization), the hyperparameters are estimated by maximizing the marginal likelihood (evidence):

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \; \mathcal{Z}(\mathbf{y} \mid \boldsymbol{\alpha}) = \arg\max_{\boldsymbol{\alpha}} \int p(\mathbf{y} \mid \boldsymbol{\gamma})\, \pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha})\,\mathrm{d}\boldsymbol{\gamma}.$$

For the Gaussian model $\mathbf{y} = \mathbf{A}\boldsymbol{\gamma} + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ and $\boldsymbol{\gamma} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I})$, the log-evidence is

$$\log\mathcal{Z}(\mathbf{y} \mid \alpha) = -\frac{1}{2}\Bigl[ \mathbf{y}^T \mathbf{C}_\alpha^{-1} \mathbf{y} + \log\det \mathbf{C}_\alpha + m\log(2\pi)\Bigr],$$

where $\mathbf{C}_\alpha = \sigma^2 \mathbf{I} + \alpha^{-1}\mathbf{A}\mathbf{A}^H$.
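The log-evidence can be evaluated and maximized numerically. Below is a minimal NumPy sketch for the real-valued case (so $\mathbf{A}^H$ reduces to $\mathbf{A}^T$); the function name log_evidence, the synthetic data, and the grid search over a single shared $\alpha$ are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def log_evidence(y, A, sigma2, alpha):
    """log Z(y | alpha) for y = A gamma + w, w ~ N(0, sigma2 I), gamma ~ N(0, alpha^{-1} I)."""
    m = len(y)
    C = sigma2 * np.eye(m) + (1.0 / alpha) * (A @ A.T)   # C_alpha (real-valued case)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)                     # y^T C_alpha^{-1} y
    return -0.5 * (quad + logdet + m * np.log(2.0 * np.pi))

# Empirical Bayes by grid search: pick the alpha that maximizes the evidence.
rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ rng.normal(0.0, 1.0, size=n) + 0.1 * rng.standard_normal(m)  # true prior variance 1

alphas = np.logspace(-3, 3, 61)
alpha_hat = alphas[np.argmax([log_evidence(y, A, 0.01, a) for a in alphas])]
print(f"evidence-maximizing alpha = {alpha_hat:.3g}")
```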

Definition: Automatic Relevance Determination (ARD)

Automatic relevance determination assigns a separate precision parameter $\alpha_i$ to each component of $\boldsymbol{\gamma}$:

$$\gamma_i \mid \alpha_i \sim \mathcal{N}(0, \alpha_i^{-1}), \qquad i = 1, \ldots, n.$$

The prior on $\boldsymbol{\gamma}$ becomes $\pi(\boldsymbol{\gamma} \mid \boldsymbol{\alpha}) = \prod_{i=1}^n \mathcal{N}(\gamma_i; 0, \alpha_i^{-1})$.

Maximizing the evidence over $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ drives most $\alpha_i \to \infty$, effectively pruning the corresponding components to zero. Only components that are supported by the data retain finite $\alpha_i$ and nonzero $\hat{\gamma}_i$. This provides Bayesian sparsification without explicitly specifying $\ell_1$ or $\ell_0$ penalties.

Definition: Sparse Bayesian Learning (SBL)

Sparse Bayesian Learning (Tipping, 2001) is the ARD framework applied to sparse signal recovery:

$$\mathbf{y} = \mathbf{A}\boldsymbol{\gamma} + \mathbf{w}, \qquad \gamma_i \mid \alpha_i \sim \mathcal{N}(0, \alpha_i^{-1}), \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}).$$

The posterior for $\boldsymbol{\gamma}$ given $\boldsymbol{\alpha}$ is Gaussian (Theorem: Gaussian Prior Yields Gaussian Posterior):

$$\boldsymbol{\gamma} \mid \mathbf{y}, \boldsymbol{\alpha} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\Sigma} = (\sigma^{-2}\mathbf{A}^H\mathbf{A} + \mathbf{D}_\alpha)^{-1}$, $\mathbf{D}_\alpha = \text{diag}(\alpha_1, \ldots, \alpha_n)$, and $\boldsymbol{\mu} = \sigma^{-2}\boldsymbol{\Sigma}\mathbf{A}^H\mathbf{y}$.

The hyperparameters $\boldsymbol{\alpha}$ and $\sigma^2$ are estimated by maximizing the evidence $\mathcal{Z}(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2)$. SBL is a bridge between MAP (a single point estimate) and full Bayesian inference (a full posterior): it provides posterior uncertainty for the estimated support.
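For fixed hyperparameters, this conditional posterior is two lines of linear algebra. The sketch below assumes a real-valued $\mathbf{A}$ (so $\mathbf{A}^H = \mathbf{A}^T$); the function name sbl_posterior is illustrative.

```python
import numpy as np

def sbl_posterior(y, A, sigma2, alpha):
    """Conditional posterior of gamma given y and fixed ARD precisions alpha:
    Sigma = (sigma^{-2} A^T A + diag(alpha))^{-1},  mu = sigma^{-2} Sigma A^T y."""
    Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(alpha))
    mu = Sigma @ A.T @ y / sigma2
    return mu, Sigma
```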

EM Algorithm for Sparse Bayesian Learning

Complexity: $O(n^2 m + n^3)$ per iteration (dominant cost: posterior covariance update)
Input: sensing matrix $\mathbf{A}$, measurements $\mathbf{y}$, noise variance $\sigma^2$,
max iterations $T$, convergence threshold $\varepsilon$
Output: posterior mean $\boldsymbol{\mu}$, posterior covariance $\boldsymbol{\Sigma}$,
hyperparameters $\boldsymbol{\alpha}$
Initialize: $\alpha_i = 1$ for all $i$, iteration $t = 0$
Repeat:
1. (E-step) Compute posterior given current $\boldsymbol{\alpha}$:
   - $\boldsymbol{\Sigma} \leftarrow (\sigma^{-2}\mathbf{A}^H\mathbf{A} + \text{diag}(\boldsymbol{\alpha}))^{-1}$
   - $\boldsymbol{\mu} \leftarrow \sigma^{-2}\boldsymbol{\Sigma}\mathbf{A}^H\mathbf{y}$
2. (M-step) Update hyperparameters by maximizing the evidence:
   - $\alpha_i^{\text{new}} \leftarrow \frac{1 - \alpha_i [\boldsymbol{\Sigma}]_{ii}}{\mu_i^2}$
3. Prune: set $\alpha_i \leftarrow \infty$ (equivalently $\mu_i \leftarrow 0$)
   for any $i$ where $\alpha_i^{\text{new}} > \alpha_{\max}$ (e.g., $\alpha_{\max} = 10^6$)
4. $t \leftarrow t + 1$
Until $\|\boldsymbol{\alpha}^{\text{new}} - \boldsymbol{\alpha}\|_\infty < \varepsilon$ or $t \geq T$

The M-step update has a fixed-point interpretation: $\alpha_i$ is driven to infinity whenever $\mu_i^2 \leq \alpha_i[\boldsymbol{\Sigma}]_{ii}$, i.e., when the mean squared value of component $i$ is dominated by its posterior variance, meaning the data provides no evidence for that component being nonzero.
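The following is a minimal NumPy sketch of the full EM loop, again assuming a real-valued $\mathbf{A}$; the function name sbl_em, the default values of $T$, $\varepsilon$, and $\alpha_{\max}$, and the synthetic demo scene are illustrative assumptions rather than prescribed choices.

```python
import numpy as np

def sbl_em(y, A, sigma2, T=200, eps=1e-4, alpha_max=1e6):
    """Sparse Bayesian Learning via the EM / fixed-point updates above."""
    n = A.shape[1]
    alpha = np.ones(n)                 # initialize all precisions to 1
    active = np.arange(n)              # components not yet pruned
    mu_full = np.zeros(n)
    for _ in range(T):
        if active.size == 0:
            break
        Aa = A[:, active]
        # E-step: posterior over the surviving components
        Sigma = np.linalg.inv(Aa.T @ Aa / sigma2 + np.diag(alpha[active]))
        mu = Sigma @ Aa.T @ y / sigma2
        # M-step (fixed point): alpha_i <- (1 - alpha_i * Sigma_ii) / mu_i^2
        alpha_new = (1.0 - alpha[active] * np.diag(Sigma)) / np.maximum(mu**2, 1e-32)
        delta = np.max(np.abs(alpha_new - alpha[active]))
        alpha[active] = alpha_new
        # Prune components whose precision has effectively diverged
        keep = alpha_new <= alpha_max
        mu_full[:] = 0.0
        mu_full[active[keep]] = mu[keep]
        active = active[keep]
        if delta < eps:
            break
    return mu_full, active, alpha

# Demo on a synthetic 3-sparse scene (all values illustrative)
rng = np.random.default_rng(1)
m, n = 60, 128
A = rng.standard_normal((m, n)) / np.sqrt(m)
gamma_true = np.zeros(n)
gamma_true[[10, 53, 90]] = [1.5, -2.0, 0.8]
y = A @ gamma_true + 0.05 * rng.standard_normal(m)
mu_hat, support, _ = sbl_em(y, A, sigma2=0.05**2)
print("recovered support:", np.sort(support))
```

Pruned indices are simply dropped from the active set, so the per-iteration cost shrinks as sparsification proceeds.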

SBL Convergence — ARD Weight Pruning

Watch how the SBL EM algorithm prunes irrelevant components over iterations.

Left panel: Log of precision parameters $\log\alpha_i$ vs EM iteration. Active components (true nonzeros) stabilize at moderate values; inactive components diverge to infinity (shown as pruned at the iteration of divergence).

Right panel: Reconstructed signal at each iteration — watch sparsification happen progressively as irrelevant components are pruned.


SBL vs LASSO — What Makes SBL Different?

Both SBL and LASSO promote sparsity, but they differ fundamentally:

| Property | LASSO (Laplace MAP) | SBL (ARD + type-II ML) |
|---|---|---|
| Sparsity mechanism | Explicit $\ell_1$ penalty | Evidence-driven precision divergence |
| Uncertainty | None (point estimate) | Full Gaussian posterior on active components |
| Bias on active components | Yes (shrinkage bias) | Smaller (posterior mean debiased) |
| Hyperparameters | $\lambda$ (requires tuning) | $\boldsymbol{\alpha}, \sigma^2$ (learned from data) |
| Computational cost | $O(nm\log(1/\varepsilon))$ via ISTA | $O(n^2m + n^3)$ per EM iteration |
| Super-resolution | Empirically good | Empirically excellent (promotes ultra-sparse solutions) |

SBL tends to find sparser solutions than LASSO for the same data, making it particularly effective for super-resolution radar imaging (§The Point-Spread Function (PSF)).
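For the LASSO column of the table, a minimal ISTA sketch is shown below; the step size (derived from the spectral norm), the value of $\lambda$, and the iteration count are illustrative assumptions that would need tuning in practice.

```python
import numpy as np

def ista_lasso(y, A, lam, n_iter=500):
    """ISTA for min_x 0.5 * ||y - A x||_2^2 + lam * ||x||_1.
    Each iteration costs O(nm): two matrix-vector products plus a soft threshold."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the quadratic term
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the data-fit term
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x
```

Consistent with the comparison above, the ISTA/LASSO solution typically retains more small nonzero coefficients than the SBL sketch, which prunes them exactly to zero.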

Common Mistake: SBL EM Can Converge to Non-Sparse Local Optima

Mistake:

The SBL EM algorithm is guaranteed to converge, so practitioners assume the converged solution is the globally sparse optimum.

Correction:

The evidence $\mathcal{Z}(\mathbf{y} \mid \boldsymbol{\alpha})$ is generally non-convex in $\boldsymbol{\alpha}$, and EM finds only a local maximum. Multiple initializations with different starting $\boldsymbol{\alpha}$ (e.g., all ones, random, or initialized from the LASSO support) are recommended. The global optimum is known to be maximally sparse (Wipf and Rao, 2004), but EM may get stuck in non-sparse local maxima, especially at low SNR or when the sensing matrix columns are nearly collinear.

Historical Note: From Relevance Vector Machines to SBL

2001-2010

Sparse Bayesian Learning emerged from Michael Tipping's 2001 paper on the Relevance Vector Machine (RVM) — a Bayesian alternative to the support vector machine (SVM). Tipping observed that by placing ARD priors on the kernel expansion weights, the EM algorithm would prune nearly all basis vectors, yielding a much sparser model than the SVM. The SBL framework was soon recognized as a powerful tool for compressed sensing, appearing in the signal processing literature from around 2005.

The connection to compressed sensing and RF imaging was made explicit by Wipf and Rao (2004), who showed that, under certain conditions on $\mathbf{A}$, the global maximum of the SBL evidence corresponds to the maximally sparse solution, yielding sparsity recovery guarantees that can be stronger than those of $\ell_1$ minimization. This theoretical insight motivated the deployment of SBL for radar and SAR imaging, where its strong sparsity recovery properties are particularly valuable.


Key Takeaway

  1. Hierarchical Bayes places priors on hyperparameters, avoiding the need for manual tuning of $\lambda$, $\sigma^2$, or sparsity levels.

  2. Empirical Bayes (evidence maximization) selects hyperparameters by maximizing the marginal likelihood — an automatic Occam's razor.

  3. ARD assigns per-component precisions; the EM algorithm drives most $\alpha_i \to \infty$, achieving automatic Bayesian sparsification.

  4. SBL combines ARD with the Gaussian forward model, providing sparse estimates with posterior uncertainty bounds and no need to pre-specify the sparsity level $k$.

  5. SBL tends to produce sparser solutions than LASSO but requires $O(n^3)$ linear algebra per iteration; this is tractable for $n \leq 10^4$ and expensive for larger scenes.

Quick Check

In the SBL EM algorithm, component $i$ is pruned (set to zero) when:

$\mu_i = 0$ and $[\boldsymbol{\Sigma}]_{ii} = 0$

$\alpha_i \to \infty$, which happens when the posterior variance dominates the posterior mean squared value

$|\mu_i| < \lambda$ for some threshold $\lambda$

The LASSO soft-threshold condition $|\mathbf{a}_i^H(\mathbf{y} - \mathbf{A}\boldsymbol{\mu})| < \sigma^2\lambda$