Parameter Choice Rules

The Central Practical Challenge: Choosing $\alpha$

The regularization theory of Sections 2.3–2.4 shows that good reconstructions are possible if $\alpha$ is chosen correctly. But how does one choose $\alpha$ in practice?

Two regimes exist. When the noise level $\delta$ is known (or reliably estimated), the Morozov discrepancy principle provides an order-optimal, principled choice. When $\delta$ is unknown, as in many experimental setups, the L-curve and generalized cross-validation (GCV) provide data-driven alternatives.

Stein's Unbiased Risk Estimator (SURE) provides yet another approach: it directly estimates the mean squared error without knowing the truth $x^\dagger$, using only the measured data.

Definition: Morozov's Discrepancy Principle

Morozov's discrepancy principle selects $\alpha > 0$ such that the residual matches the noise level:

$$\|\mathcal{A}x_\alpha^\delta - y^\delta\| = \tau\,\delta,$$

where $\tau > 1$ is a safety factor (typically $\tau \in [1, 2]$).

The rationale: we should fit the data only to the accuracy warranted by the noise; fitting more closely than $\delta$ means fitting noise.

Under source conditions, the discrepancy principle is provably order-optimal:

$$\|x_{\alpha(\delta)}^\delta - x^\dagger\| = O\bigl(\delta^{2\mu/(2\mu+1)}\bigr)$$

for $\mu \leq \mu_0$ (the method's qualification). In practice, $\delta$ may be estimated from the data (e.g., from a noise-only region of the measurement, or from repeated measurements).


Theorem: Existence and Monotonicity for the Discrepancy Principle

For Tikhonov regularization, the discrepancy function

$$\varphi(\alpha) = \|\mathcal{A}x_\alpha^\delta - y^\delta\|^2 = \sum_{k=1}^{n} \left(\frac{\alpha}{\sigma_k^2 + \alpha}\right)^2 |\langle \mathbf{y}^\delta, \mathbf{u}_k \rangle|^2$$

is monotonically increasing in $\alpha$, with $\varphi(0) = 0$ (if $y^\delta \in \mathcal{R}(\mathcal{A})$) and $\varphi(\infty) = \|y^\delta\|^2$.

Therefore the equation $\varphi(\alpha) = \tau^2\delta^2$ has a unique solution $\alpha^* > 0$ whenever $\tau\delta < \|y^\delta\|$, i.e., whenever the data contain signal above the noise level.
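
As a minimal sketch of how this root can be found in practice (assuming a thin SVD of the discrete forward matrix is available; the function name, defaults, and bracket $[10^{-12}, 10^{4}]$ are illustrative), bisection on $\log\alpha$ exploits the monotonicity of $\varphi$:

```python
import numpy as np

def discrepancy_alpha(U, s, y, delta, tau=1.5, lo=1e-12, hi=1e4, iters=100):
    """Solve ||A x_alpha - y||^2 = (tau*delta)^2 for Tikhonov regularization.

    U, s: left singular vectors and singular values of the forward matrix A.
    Uses bisection on log(alpha), valid because phi is monotonically increasing.
    """
    c = U.T @ y                                   # SVD coefficients <y, u_k>
    def phi(alpha):
        return np.sum((alpha / (s**2 + alpha))**2 * np.abs(c)**2)
    target = (tau * delta)**2
    for _ in range(iters):
        mid = np.sqrt(lo * hi)                    # geometric midpoint in log scale
        if phi(mid) < target:
            lo = mid                              # residual too small: increase alpha
        else:
            hi = mid                              # residual too large: decrease alpha
    return np.sqrt(lo * hi)
```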


Definition: The L-Curve Method

The L-curve is the parametric plot of

$$\bigl(\log\|\mathcal{A}x_\alpha^\delta - y^\delta\|,\; \log\|x_\alpha^\delta\|\bigr)$$

as $\alpha$ varies over $(0, \infty)$. This curve typically has an "L" shape:

  • Vertical branch (small $\alpha$): The residual is small but the solution norm is large (overfitting, noise amplification).
  • Horizontal branch (large $\alpha$): The solution norm is small but the residual is large (oversmoothing).
  • Corner (optimal $\alpha$): The best trade-off between fidelity and regularity.

The L-curve criterion selects $\alpha$ at the point of maximum curvature of the L-curve.

The L-curve is a heuristic without rigorous convergence guarantees in full generality, but it is extremely popular in practice because it requires no knowledge of the noise level $\delta$. It provides an intuitive visual diagnostic and often gives good results for moderately ill-posed problems. For severely ill-posed problems or very small noise, the corner can be ill-defined.
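
A minimal sketch of the corner search, assuming the SVD forms of the Tikhonov residual and solution norms and estimating the curvature numerically on a logarithmic grid of $\alpha$ values (the grid and function name are illustrative):

```python
import numpy as np

def l_curve_corner(U, s, y, alphas):
    """Return the alpha of maximum curvature of the L-curve (heuristic corner)."""
    c = U.T @ y
    # Closed-form Tikhonov norms: residual rho(alpha) and solution norm eta(alpha)
    rho = np.sqrt([np.sum((a / (s**2 + a))**2 * np.abs(c)**2) for a in alphas])
    eta = np.sqrt([np.sum((s / (s**2 + a))**2 * np.abs(c)**2) for a in alphas])
    xh, yh = np.log(rho), np.log(eta)             # log-log coordinates of the curve
    t = np.log(alphas)
    # Numerical derivatives with respect to log(alpha)
    dx, dy = np.gradient(xh, t), np.gradient(yh, t)
    d2x, d2y = np.gradient(dx, t), np.gradient(dy, t)
    kappa = (dx * d2y - d2x * dy) / (dx**2 + dy**2)**1.5
    return alphas[int(np.argmax(kappa))]

# usage: alphas = np.logspace(-10, 2, 200); alpha_L = l_curve_corner(U, s, y, alphas)
```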


Definition: Generalized Cross-Validation (GCV)

Generalized cross-validation selects $\alpha$ by minimising the GCV functional

$$V(\alpha) = \frac{\|\mathcal{A}x_\alpha^\delta - y^\delta\|^2}{\bigl(\operatorname{tr}(I - \mathcal{A}(\mathcal{A}^*\mathcal{A} + \alpha I)^{-1}\mathcal{A}^*)\bigr)^2}.$$

For the Tikhonov solution in the finite-dimensional case with SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$:

$$V(\alpha) = \frac{\sum_k (\alpha/(\sigma_k^2+\alpha))^2 |\langle y^\delta, u_k\rangle|^2}{\left(\sum_k \alpha/(\sigma_k^2+\alpha)\right)^2}.$$

GCV has a statistical interpretation: it estimates the expected prediction error when one data point is left out. Under standard conditions, minimising $V(\alpha)$ is asymptotically optimal in the same sense as the discrepancy principle, but without requiring knowledge of $\delta$. In practice, GCV can be sensitive to noise correlations and tends to underestimate $\alpha$ for severely ill-posed problems.
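
A minimal sketch of minimising the SVD form of $V(\alpha)$ on a logarithmic grid (a grid search is usually adequate because each evaluation is cheap; names and the grid are illustrative):

```python
import numpy as np

def gcv_alpha(U, s, y, alphas):
    """Return the alpha minimising the SVD form of the GCV functional V(alpha)."""
    c = U.T @ y
    V = []
    for a in alphas:
        w = a / (s**2 + a)                        # residual filter factors alpha/(sigma_k^2+alpha)
        V.append(np.sum(w**2 * np.abs(c)**2) / np.sum(w)**2)
    return alphas[int(np.argmin(V))]

# usage: alpha_gcv = gcv_alpha(U, s, y, np.logspace(-10, 2, 200))
```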


Definition: Stein's Unbiased Risk Estimator (SURE)

For a linear estimator $\hat{y} = H_\alpha y^\delta$ of the noise-free data $\mathcal{A}x^\dagger$, with noise $\eta \sim \mathcal{N}(0, \sigma_n^2 I)$, Stein's Unbiased Risk Estimator (SURE) provides an unbiased estimate of the predictive mean squared error $\mathbb{E}\|H_\alpha y^\delta - \mathcal{A}x^\dagger\|^2$:

$$\widehat{\mathrm{MSE}}(\alpha) = -m\sigma_n^2 + \|H_\alpha y^\delta - y^\delta\|^2 + 2\sigma_n^2\,\mathrm{tr}(H_\alpha),$$

where $m = \dim(\mathcal{Y})$. For Tikhonov regularization, $H_\alpha = \mathcal{A}(\mathcal{A}^*\mathcal{A}+\alpha I)^{-1}\mathcal{A}^*$ and $\mathrm{tr}(H_\alpha) = \sum_k \sigma_k^2/(\sigma_k^2+\alpha)$.

The SURE-optimal $\alpha$ minimises $\widehat{\mathrm{MSE}}(\alpha)$.

SURE is unbiased: $\mathbb{E}[\widehat{\mathrm{MSE}}(\alpha)] = \mathbb{E}\|H_\alpha y^\delta - \mathcal{A}x^\dagger\|^2$. This makes it particularly attractive for problems with known noise variance. It is the theoretically rigorous analogue of cross-validation for Gaussian noise, and it is closely related to GCV.
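
A minimal sketch of evaluating the SURE criterion for Tikhonov regularization via the SVD, assuming the noise variance $\sigma_n^2$ is known (names and the grid-search strategy are illustrative):

```python
import numpy as np

def sure_alpha(U, s, y, sigma_n2, alphas):
    """Return the alpha minimising Stein's unbiased risk estimate for Tikhonov."""
    m = y.size
    c = U.T @ y
    risks = []
    for a in alphas:
        h = s**2 / (s**2 + a)                     # diagonal of H_alpha in the SVD basis
        resid2 = np.sum(np.abs((1.0 - h) * c)**2) # ||H_alpha y - y||^2 on range(A);
                                                  # for m > n add ||(I - U U^T) y||^2 (constant in alpha)
        risks.append(-m * sigma_n2 + resid2 + 2.0 * sigma_n2 * np.sum(h))
    return alphas[int(np.argmin(risks))]
```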


Example: Computing the L-Curve for a Discrete Problem

For the discrete system $\mathbf{A}\mathbf{x} = \mathbf{y}^\delta$ with SVD $\mathbf{A} = \sum_{k=1}^n \sigma_k \mathbf{u}_k \mathbf{v}_k^T$, express the residual norm and solution norm as functions of $\alpha$ in closed form, and derive the curvature formula for the L-curve corner.
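
One way to set up the computation (a sketch, writing $c_k = \langle \mathbf{y}^\delta, \mathbf{u}_k\rangle$): the two norms have the closed forms

$$\rho(\alpha)^2 = \|\mathbf{A}\mathbf{x}_\alpha^\delta - \mathbf{y}^\delta\|^2 = \sum_{k=1}^{n}\left(\frac{\alpha}{\sigma_k^2+\alpha}\right)^{2} |c_k|^2, \qquad \eta(\alpha)^2 = \|\mathbf{x}_\alpha^\delta\|^2 = \sum_{k=1}^{n}\left(\frac{\sigma_k}{\sigma_k^2+\alpha}\right)^{2} |c_k|^2.$$

With $\hat\rho(\alpha) = \log\rho(\alpha)$ and $\hat\eta(\alpha) = \log\eta(\alpha)$, the curvature of the parametric curve $(\hat\rho, \hat\eta)$ is

$$\kappa(\alpha) = \frac{\hat\rho'\,\hat\eta'' - \hat\rho''\,\hat\eta'}{\bigl((\hat\rho')^2 + (\hat\eta')^2\bigr)^{3/2}},$$

where primes denote derivatives with respect to $\alpha$; the L-curve corner is the maximiser of $\kappa$.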


Parameter Choice Methods for Tikhonov Regularization

Compares Morozov's discrepancy principle, the L-curve, and GCV for selecting the Tikhonov parameter $\alpha$.

Left panel: The L-curve in log-log coordinates. The red dot marks the corner (maximum curvature point). The blue dot marks the discrepancy principle selection. The green dot marks the GCV minimum.

Right panel: The corresponding reconstruction at the selected α\alpha compared to the true solution.

Bottom panel: The GCV functional $V(\alpha)$ vs. $\alpha$ (log scale).

Try different noise levels. At low noise, all three methods agree. At high noise, the methods diverge: the discrepancy principle is most reliable when $\delta$ is known accurately; the L-curve is most interpretable visually; GCV has the best theoretical properties when the noise model is Gaussian.


Parameter Choice Methods Compared

| Method | Requires $\delta$? | Order-optimal? | Computationally cheap? | Best for |
|---|---|---|---|---|
| Discrepancy principle | Yes | Yes (provably) | Yes (one 1D root) | When $\delta$ is reliably known |
| L-curve | No | No (heuristic) | Moderate (needs curvature) | Visual diagnostics; moderate ill-posedness |
| GCV | No | Asymptotically | Yes (one 1D min) | Gaussian noise; independent observations |
| SURE | Requires $\sigma_n^2$ | Yes (under Gaussian model) | Yes | Gaussian noise with known variance |

Common Mistake: The L-Curve Can Fail for Severely Ill-Posed Problems

Mistake:

Relying solely on the L-curve to select $\alpha$ for severely ill-posed problems (e.g., Gaussian deblurring with $\sigma_k \sim e^{-ck^2}$) or at very small noise levels.

Correction:

For severely ill-posed problems, the L-curve often lacks a well-defined corner: the horizontal and vertical branches may blend into a smooth arc with no pronounced kink. In this regime:

  1. If $\delta$ is known, use the discrepancy principle.
  2. If $\delta$ is unknown, use GCV or SURE.
  3. Examine the L-curve visually across a broad range of $\alpha$ to check whether a corner exists before trusting the maximum-curvature selection.

The literature reports L-curve failures for severely ill-posed problems at noise levels below $10^{-5}$.

⚠️ Engineering Note

Estimating the Noise Level $\delta$ in Practice

The discrepancy principle requires knowledge of $\delta = \|\eta\|$. In RF imaging systems, the noise level can be estimated by:

  1. Noise-only measurements: Record $N_{\text{avg}}$ samples with the transmitter off (system noise) or pointing away from the scene, and estimate $\sigma_n^2 = \frac{1}{m}\|y_{\text{noise}}\|^2$ (see the sketch at the end of this note).

  2. Residual-based estimation: If an initial rough reconstruction $\hat{x}$ is available, estimate $\delta$ from $\|y - \mathcal{A}\hat{x}\|$.

  3. Cramér–Rao bound: For a calibrated radar with known transmit power and receiver noise figure, the noise variance is determined by the link budget equation.

A mismatch between the true $\delta$ and the estimated $\hat{\delta}$ shifts the discrepancy selection: overestimating $\delta$ leads to oversmoothing; underestimating leads to noise-contaminated reconstructions. The safety factor $\tau > 1$ provides robustness against underestimation.
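
A minimal sketch of option 1 above (noise-only measurements), converting an estimated per-sample variance into the $\delta$ used by the discrepancy principle; the variable names are illustrative, not tied to a specific system:

```python
import numpy as np

def estimate_delta(y_noise_samples, m):
    """Estimate delta = ||eta|| from noise-only recordings.

    y_noise_samples: array of shape (N_avg, m), recorded with the transmitter off
                     or pointed away from the scene.
    m: length of the actual measurement vector y^delta.
    """
    sigma_n2 = np.mean(np.abs(y_noise_samples)**2)   # per-sample noise variance
    return np.sqrt(m * sigma_n2)                     # since E||eta||^2 = m * sigma_n^2
```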

Key Takeaway

Four practical methods exist for choosing the regularization parameter: Morozov's discrepancy principle (order-optimal when $\delta$ is known), the L-curve (visual, no $\delta$ needed, heuristic), GCV (asymptotically optimal, no $\delta$ needed), and SURE (unbiased risk estimate under Gaussian noise). For the discrepancy principle, the residual equation $\varphi(\alpha) = \tau^2\delta^2$ has a unique solution because the Tikhonov residual is monotonically increasing in $\alpha$. In practical RF imaging, the discrepancy principle with $\tau \in [1.1, 1.5]$ is the standard choice when the noise level can be measured from calibration data.