Convergence Diagnostics in Practice

Knowing When to Stop

An iterative reconstruction algorithm without a principled stopping criterion is incomplete. Running too few iterations yields a poor reconstruction; running too many wastes computation and may even degrade quality through noise amplification (semi-convergence). This section develops practical diagnostics that:

  1. Certify solution quality: provide an upper bound on the distance to optimality.
  2. Detect pathological behavior: stagnation, oscillation, or divergence.
  3. Enable fair comparison of different algorithms and parameter settings.

These diagnostics apply to every reconstruction algorithm in Parts IV--VI of this book. The tools are algorithm-agnostic: we present each diagnostic once and explain which algorithms it applies to.

Definition: Primal Residual

For ADMM applied to the problem $\min_{\mathbf{c},\mathbf{z}} f(\mathbf{c}) + g(\mathbf{z})$ subject to $\mathbf{B}\mathbf{c} = \mathbf{z}$, the primal residual at iteration $t$ is

$$r^{(t)} = \mathbf{B}\mathbf{c}^{(t)} - \mathbf{z}^{(t)}.$$

It measures the violation of the equality constraint. At the solution, $r^{(\star)} = \mathbf{0}$.

For TV-regularized imaging with $f(\mathbf{c}) = \frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2$ and $g(\mathbf{z}) = \lambda\|\mathbf{z}\|_1$, the constraint is $\nabla \mathbf{c} = \mathbf{z}$ (discrete gradient equals auxiliary variable). The primal residual measures how well $\mathbf{z}$ agrees with the actual gradient of the current image.
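To make the TV primal residual concrete, here is a minimal sketch. The function name, the forward-difference scheme, and the zero-padded (Neumann) boundary handling are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

def tv_primal_residual(c_img, z):
    """Primal residual r = grad(c) - z for the TV splitting.

    c_img : 2-D image array; z : (2, H, W) auxiliary gradient field.
    Forward differences with Neumann boundary, so the last row/column
    of each difference image is zero."""
    gy = np.zeros_like(c_img)
    gx = np.zeros_like(c_img)
    gy[:-1, :] = c_img[1:, :] - c_img[:-1, :]   # vertical differences
    gx[:, :-1] = c_img[:, 1:] - c_img[:, :-1]   # horizontal differences
    return np.stack([gy, gx]) - z
```

At convergence this array should be numerically zero, matching $r^{(\star)} = \mathbf{0}$.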

Definition: Dual Residual

The dual residual at iteration $t$ is

$$s^{(t)} = \rho\,\mathbf{B}^H(\mathbf{z}^{(t)} - \mathbf{z}^{(t-1)}),$$

where $\rho > 0$ is the augmented Lagrangian penalty parameter. It measures the violation of the dual feasibility condition (stationarity of the Lagrangian with respect to $\mathbf{c}$).

The standard ADMM stopping criterion terminates when both residuals are small:

$$\|r^{(t)}\| \leq \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \max(\|\mathbf{B}\mathbf{c}^{(t)}\|, \|\mathbf{z}^{(t)}\|),$$

$$\|s^{(t)}\| \leq \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \|\rho\,\mathbf{B}^H \mathbf{u}^{(t)}\|,$$

where $\mathbf{u}^{(t)}$ is the scaled dual variable, with typical values $\varepsilon_{\text{abs}} = 10^{-4}$ and $\varepsilon_{\text{rel}} = 10^{-3}$.
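The two tests translate directly into code. A minimal sketch assuming dense NumPy arrays; the function name `admm_converged` is our choice:

```python
import numpy as np

def admm_converged(B, c, z, z_prev, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Standard ADMM stopping test: both residuals below their tolerances.

    B : constraint matrix (Bc = z); c, z : current primal iterates;
    z_prev : previous z iterate; u : scaled dual variable."""
    r = B @ c - z                              # primal residual r^(t)
    s = rho * (B.conj().T @ (z - z_prev))      # dual residual s^(t)
    eps_pri = eps_abs + eps_rel * max(np.linalg.norm(B @ c),
                                      np.linalg.norm(z))
    eps_dual = eps_abs + eps_rel * np.linalg.norm(rho * (B.conj().T @ u))
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```

The same relative/absolute structure keeps the test meaningful across problem scales: large images have large norms, so a purely absolute tolerance would never trigger.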

Definition: Fixed-Point Residual

For a proximal algorithm with update operator $\mathcal{T}$ (i.e., $\mathbf{c}^{(t+1)} = \mathcal{T}(\mathbf{c}^{(t)})$), the fixed-point residual at iteration $t$ is

$$\text{FPR}^{(t)} = \|\mathbf{c}^{(t+1)} - \mathbf{c}^{(t)}\|.$$

This is a universal diagnostic: it applies to ISTA, FISTA, ADMM, Chambolle--Pock, Douglas--Rachford, and any other fixed-point iteration. If $\mathcal{T}$ is nonexpansive, $\text{FPR}^{(t)} \to 0$ is necessary (though not sufficient) for convergence to a fixed point.

The fixed-point residual is cheap to compute (one vector difference and norm) and requires no knowledge of the algorithm's internal structure. It is the first quantity to monitor for any new algorithm.
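A generic driver that monitors the fixed-point residual might look like the following sketch; `iterate_with_fpr` and its default tolerances are our choices, and `T` can be any one-iteration update (an ISTA step, a full ADMM pass, etc.):

```python
import numpy as np

def iterate_with_fpr(T, c0, tol=1e-8, max_iter=1000):
    """Run c <- T(c) while recording the fixed-point residual per step.

    Returns the final iterate and the FPR history; stops once the
    residual drops below `tol`."""
    c, fpr_history = c0, []
    for _ in range(max_iter):
        c_next = T(c)
        fpr = np.linalg.norm(c_next - c)
        fpr_history.append(fpr)
        c = c_next
        if fpr <= tol:
            break
    return c, fpr_history
```

Because the driver only calls `T`, it needs no knowledge of the algorithm's internals, which is exactly what makes the FPR the first diagnostic to wire up.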

Theorem: Morozov's Discrepancy Principle

For the noisy inverse problem $\mathbf{y} = \mathbf{A}\mathbf{c}_{\text{true}} + \mathbf{w}$ with $\|\mathbf{w}\| \leq \delta$, Morozov's discrepancy principle selects the regularization parameter $\lambda$ (or the iteration count $t$) such that the data-fidelity residual matches the noise level:

$$\|\mathbf{A}\mathbf{c}^{(\lambda)} - \mathbf{y}\| \approx \delta.$$

Specifically, one chooses the largest $\lambda$ (or the earliest $t$) satisfying

$$\|\mathbf{A}\mathbf{c}^{(\lambda)} - \mathbf{y}\| \leq \tau\,\delta$$

for a safety factor $\tau > 1$ (typically $\tau = 1.1$--$1.5$).

We should fit the data only as well as the noise allows. Fitting below the noise level means fitting the noise itself, which is overfitting in the statistical sense. The discrepancy principle provides a principled way to stop without requiring ground truth.
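In code the discrepancy test is one line, and a parameter sweep can use it to pick $\lambda$. A sketch under the assumption that a user-supplied `solve(lam)` returns the reconstruction for a given regularization weight (both function names are ours):

```python
import numpy as np

def discrepancy_satisfied(A, c, y, delta, tau=1.2):
    """Morozov test: the data residual has reached the noise level."""
    return np.linalg.norm(A @ c - y) <= tau * delta

def select_lambda(solve, A, y, delta, lambdas, tau=1.2):
    """Return the largest lambda whose reconstruction passes the test."""
    for lam in sorted(lambdas, reverse=True):   # strongest smoothing first
        c = solve(lam)
        if discrepancy_satisfied(A, c, y, delta, tau):
            return lam, c
    lam = min(lambdas)                          # fall back to the weakest
    return lam, solve(lam)
```

Sweeping from the largest $\lambda$ downward implements "choose the largest $\lambda$ satisfying the bound" directly.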

Definition: Warm-Starting from the Matched-Filter Image

The matched-filter image (backpropagation image) is

$$\hat{\mathbf{c}}^{\text{BP}} = \mathbf{D}^{-1}\mathbf{A}^{H} \mathbf{y},$$

where $\mathbf{D} = \operatorname{diag}(\mathbf{A}^{H} \mathbf{A})$ holds the diagonal of the normal matrix. This is a single adjoint application followed by an elementwise normalization and costs $O(MQ)$.

Warm-starting uses $\hat{\mathbf{c}}^{\text{BP}}$ as the initial iterate $\mathbf{c}^{(0)}$ for any iterative algorithm, rather than the default $\mathbf{c}^{(0)} = \mathbf{0}$.

The matched-filter image is far from optimal: it suffers from sidelobes and lacks sparsity. But it is already correlated with the true scene, so starting from it rather than from zero typically saves 20--50% of the iterations needed to reach a given reconstruction quality.

In our RF imaging system, $\hat{\mathbf{c}}^{\text{BP}}$ is the standard baseline (Matched-Filter Imaging).
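A dense-matrix sketch of the warm-start baseline; the function name and the small `eps` guard against empty columns are our additions:

```python
import numpy as np

def matched_filter_image(A, y, eps=1e-12):
    """Backpropagation baseline c_BP = D^{-1} A^H y, D = diag(A^H A).

    One adjoint application plus a per-voxel normalization, O(MQ).
    A : (M, Q) measurement matrix; y : (M,) data vector."""
    d = np.sum(np.abs(A) ** 2, axis=0)   # diagonal of A^H A, shape (Q,)
    return (A.conj().T @ y) / (d + eps)  # eps guards all-zero columns
```

Passing the result as `c0` to any iterative solver implements the warm start.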

Example: Stopping Criterion for ADMM in TV-Regularized Imaging

Consider ADMM applied to the TV-regularized imaging problem

$$\min_{\mathbf{c}} \frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + \lambda\,\text{TV}(\mathbf{c})$$

with $Q = 128^2$ voxels, noise standard deviation $\sigma = 0.05$, and $\rho = 1.0$. Design a stopping criterion using primal and dual residuals.

Convergence Diagnostics Dashboard

Visualize the convergence behavior of iterative algorithms on a sparse recovery problem. Panel 1: objective value vs iteration. Panel 2: fixed-point residual vs iteration. Panel 3: data-fidelity residual with the discrepancy level $\tau\delta$ marked.

Compare warm-started (from matched filter) vs cold-started (from zero) initialization.

Parameters: noise standard deviation (default 0.05).

⚠️ Engineering Note

Adaptive Penalty Parameter Selection in ADMM

The convergence speed of ADMM depends strongly on the penalty parameter $\rho$. A practical adaptive scheme (Boyd et al., 2011) adjusts $\rho$ at each iteration based on the ratio of primal and dual residuals:

$$\rho^{(t+1)} = \begin{cases} \tau_{\text{inc}}\,\rho^{(t)} & \text{if } \|r^{(t)}\| > \mu\,\|s^{(t)}\|, \\ \rho^{(t)} / \tau_{\text{dec}} & \text{if } \|s^{(t)}\| > \mu\,\|r^{(t)}\|, \\ \rho^{(t)} & \text{otherwise}, \end{cases}$$

with typical values $\mu = 10$ and $\tau_{\text{inc}} = \tau_{\text{dec}} = 2$.

The intuition: if the primal residual is much larger than the dual residual, increase $\rho$ to penalize constraint violation more heavily; conversely, if the dual residual dominates, decrease $\rho$ to allow more flexibility in the primal update.
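A sketch of the update, with one practical detail worth encoding: in the scaled-dual formulation, `u` stores the dual variable divided by $\rho$, so it must be rescaled whenever $\rho$ changes (the function name is ours):

```python
def update_rho(rho, r_norm, s_norm, u, mu=10.0, tau_inc=2.0, tau_dec=2.0):
    """Residual-balancing penalty update for ADMM.

    Returns the new (rho, u) pair; u is the scaled dual variable and
    shrinks when rho grows so that rho * u stays unchanged."""
    if r_norm > mu * s_norm:        # primal residual dominates: tighten
        return rho * tau_inc, u / tau_inc
    if s_norm > mu * r_norm:        # dual residual dominates: relax
        return rho / tau_dec, u * tau_dec
    return rho, u                   # residuals balanced: leave rho alone
```

Forgetting the rescaling of `u` is a classic source of ADMM divergence after the first $\rho$ change.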

Stopping Criteria for Imaging Algorithms

| Criterion | Applicable To | Requires Ground Truth? | Provides Certificate? |
| --- | --- | --- | --- |
| Fixed-point residual | All iterative algorithms | No | Necessary but not sufficient |
| Primal + dual residual | ADMM and variants | No | Yes (primal-dual optimality) |
| Primal-dual gap | Chambolle--Pock | No | Yes (absolute $\varepsilon$-optimality) |
| Discrepancy principle | All (with known $\delta$) | No (needs noise level) | Yes (regularization theory) |
| PSNR vs iteration | Simulation only | Yes | Detects semi-convergence |
| Objective decrease | All with known objective | No | Sanity check only |

Common Mistake: Semi-Convergence in Unregularized Iterations

Mistake:

Running an iterative algorithm (such as Landweber iteration or conjugate gradient on the normal equations) until the fixed-point residual is very small, without explicit regularization.

Correction:

For unregularized iterative methods, the number of iterations itself acts as a regularization parameter. PSNR initially improves as the algorithm fits the signal component, then decreases as it starts fitting the noise. This is semi-convergence.

Detection: plot PSNR vs iteration (in simulation) or monitor the data residual. If $\|\mathbf{A}\mathbf{c}^{(t)} - \mathbf{y}\|$ drops significantly below the noise level $\delta$, the algorithm is overfitting.

Prevention: use the discrepancy principle to stop, or add explicit regularization (TV, $\ell_1$, etc.) so that the regularized objective has a unique minimum and the algorithm converges to it monotonically.
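A sketch of Landweber iteration with the discrepancy stop, assuming a dense `A`; the step size 1/||A||^2 is a safe choice within the classical 0 < step < 2/||A||^2 convergence condition (the function name is ours):

```python
import numpy as np

def landweber_discrepancy(A, y, delta, tau=1.2, max_iter=1000):
    """Landweber iteration stopped by Morozov's discrepancy principle.

    Without the early stop this method semi-converges on noisy data:
    it first fits the signal, then starts fitting the noise."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # spectral norm: safe step size
    c = np.zeros(A.shape[1])
    for t in range(max_iter):
        residual = A @ c - y
        if np.linalg.norm(residual) <= tau * delta:
            break                             # residual at the noise level
        c = c - step * (A.conj().T @ residual)
    return c, t
```

Here the iteration count returned alongside `c` is itself the effective regularization parameter.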

Quick Check

In Morozov's discrepancy principle with noise level $\delta$ and safety factor $\tau = 1.2$, the algorithm should stop when the data residual satisfies:

  1. $\|\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\| \leq 1.2\,\delta$
  2. $\|\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\| \leq \delta / 1.2$
  3. $\|\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\| = 0$
  4. $\|\hat{\mathbf{c}} - \mathbf{c}_{\text{true}}\| \leq \delta$

Historical Note: Vladimir Morozov and the Discrepancy Principle

1966

Vladimir Morozov proposed the discrepancy principle in 1966 while working on ill-posed problems at Moscow State University. The idea that one should trust the data only to the level of the noise was revolutionary in an era when many practitioners believed that more iterations (or smaller regularization) always meant better results.

Morozov's principle formalized the intuition that every inverse problem has an information limit set by the noise. Together with Tikhonov's regularization theory (developed at the same institution), it established the mathematical foundation for what we now call regularization theory, a foundation that every modern imaging algorithm builds upon.

Example: Warm-Starting FISTA for RF Image Reconstruction

Compare the convergence of FISTA for $\ell_1$-regularized RF image reconstruction when initialized with: (a) $\mathbf{c}^{(0)} = \mathbf{0}$ (cold start), and (b) $\mathbf{c}^{(0)} = \hat{\mathbf{c}}^{\text{BP}}$ (matched-filter warm start). The problem has $Q = 128^2$ voxels, $M = 2048$ measurements, and $\text{SNR} = 20$ dB.

Common Mistake: Relative Change Can Be Misleading

Mistake:

Using only the relative change $\|\mathbf{c}^{(t+1)} - \mathbf{c}^{(t)}\| / \|\mathbf{c}^{(t)}\|$ as a stopping criterion, without checking the data residual.

Correction:

Relative change can be small for two very different reasons:

  1. The algorithm has converged to a good solution.
  2. The algorithm is stuck at a bad solution (e.g., a local minimum in nonconvex problems, or a very flat objective landscape).

Always combine relative change with at least one of:

  • Data residual $\|\mathbf{A}\mathbf{c}^{(t)} - \mathbf{y}\|$ (compared to the noise level via the discrepancy principle).
  • Objective value (should be decreasing).
  • Algorithm-specific residuals (primal/dual for ADMM).
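A combined test along these lines; the function name and tolerances are illustrative:

```python
import numpy as np

def combined_stop(c, c_prev, A, y, delta, rel_tol=1e-5, tau=1.2):
    """Stop only when the iterate has stopped moving AND the data residual
    sits at the noise level; relative change alone can flag a stuck
    iterate (bad local minimum, flat landscape) as 'converged'."""
    rel_change = np.linalg.norm(c - c_prev) / max(np.linalg.norm(c_prev), 1e-15)
    at_noise_level = np.linalg.norm(A @ c - y) <= tau * delta
    return rel_change <= rel_tol and at_noise_level
```

For strongly regularized problems the converged residual may legitimately sit above $\tau\delta$; in that case replace the second condition with an algorithm-specific residual check.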
🔧 Engineering Note

Practical Monitoring Checklist

Always monitor (regardless of algorithm):

  1. Objective value $F(\mathbf{c}^{(t)})$: should decrease monotonically for true descent methods such as ISTA. FISTA may oscillate slightly due to momentum, and ADMM is not guaranteed to decrease the objective at every step.

  2. Fixed-point residual $\|\mathbf{c}^{(t+1)} - \mathbf{c}^{(t)}\|$: stagnation below $10^{-5}$ usually indicates convergence.

  3. Data residual $\|\mathbf{A}\mathbf{c}^{(t)} - \mathbf{y}\|$: compare against the noise level $\delta$.

Warning signs:

  • Objective increasing: step size too large, or a bug.
  • Data residual dropping far below $\delta$: overfitting (stop earlier or increase regularization).
  • Primal residual stagnating (ADMM): $\rho$ too small.
  • Dual residual stagnating (ADMM): $\rho$ too large.
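The checklist can be wrapped in a small logger. A sketch; the class name and the factor-of-two overfitting threshold are our illustrative choices:

```python
import numpy as np

class ConvergenceMonitor:
    """Record the three universal diagnostics each iteration and flag
    the warning signs from the checklist. `delta` is the noise level."""

    def __init__(self, A, y, delta):
        self.A, self.y, self.delta = A, y, delta
        self.obj, self.fpr, self.data_res = [], [], []

    def log(self, objective, c, c_prev):
        self.obj.append(objective)
        self.fpr.append(np.linalg.norm(c - c_prev))
        self.data_res.append(np.linalg.norm(self.A @ c - self.y))

    def warnings(self):
        w = []
        if len(self.obj) >= 2 and self.obj[-1] > self.obj[-2]:
            w.append("objective increased: step size too large or bug?")
        if self.data_res and self.data_res[-1] < 0.5 * self.delta:
            w.append("data residual far below noise level: overfitting?")
        return w
```

Calling `log` once per iteration and printing `warnings()` every few iterations catches most of the failure modes above before they waste a full run.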

Primal residual

In ADMM, the norm of the constraint violation $r^{(t)} = \mathbf{B}\mathbf{c}^{(t)} - \mathbf{z}^{(t)}$. Measures how well the auxiliary variable agrees with the constraint.

Warm starting

Initializing an iterative algorithm from an approximate solution (e.g., the matched-filter image) rather than from zero. Reduces the number of iterations to convergence.

Key Takeaway

Convergence diagnostics are not optional; they are integral to every reconstruction pipeline. The fixed-point residual is universal and cheap. Primal and dual residuals provide ADMM-specific optimality certificates. The discrepancy principle sets the noise-appropriate stopping level without requiring ground truth. Warm-starting from the matched-filter image typically saves 20--50% of iterations. Always monitor the objective and data residual; an unexpected increase signals a step-size error or implementation bug.