Convergence Diagnostics in Practice

Knowing When to Stop

An iterative reconstruction algorithm without a principled stopping criterion is incomplete. Running too few iterations yields a poor reconstruction; running too many wastes computation and may even degrade quality through noise amplification (semi-convergence). This section develops practical diagnostics that:

  1. Certify solution quality: provide an upper bound on the distance to optimality.
  2. Detect pathological behavior: stagnation, oscillation, or divergence.
  3. Enable fair comparison of different algorithms and parameter settings.

These diagnostics apply to every reconstruction algorithm in Parts IV--VI of this book. The tools are algorithm-agnostic: we present each diagnostic once and explain which algorithms it applies to.

Definition: Primal Residual

For ADMM applied to the problem $\min_{\mathbf{c},\mathbf{z}} f(\mathbf{c}) + g(\mathbf{z})$ subject to $\mathbf{B}\mathbf{c} = \mathbf{z}$, the primal residual at iteration $t$ is

$$r^{(t)} = \mathbf{B}\mathbf{c}^{(t)} - \mathbf{z}^{(t)}.$$

It measures the violation of the equality constraint. At the solution, $r^{(\star)} = \mathbf{0}$.

For TV-regularized imaging with $f(\mathbf{c}) = \frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2$ and $g(\mathbf{z}) = \lambda\|\mathbf{z}\|_1$, the constraint is $\nabla \mathbf{c} = \mathbf{z}$ (discrete gradient equals auxiliary variable). The primal residual measures how well $\mathbf{z}$ agrees with the actual gradient of the current image.
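To make the TV primal residual concrete, here is a minimal sketch. The function name, the forward-difference scheme, and the zero-padded (Neumann) boundary handling are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

def tv_primal_residual(c_img, z):
    """Primal residual r = grad(c) - z for the TV splitting.

    c_img : 2-D image array; z : (2, H, W) auxiliary gradient field.
    Forward differences with Neumann boundary, so the last row/column
    of each difference image is zero."""
    gy = np.zeros_like(c_img)
    gx = np.zeros_like(c_img)
    gy[:-1, :] = c_img[1:, :] - c_img[:-1, :]   # vertical differences
    gx[:, :-1] = c_img[:, 1:] - c_img[:, :-1]   # horizontal differences
    return np.stack([gy, gx]) - z
```

At convergence this array should be numerically zero, matching $r^{(\star)} = \mathbf{0}$.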

Definition: Dual Residual

The dual residual at iteration $t$ is

$$s^{(t)} = \rho\,\mathbf{B}^H(\mathbf{z}^{(t)} - \mathbf{z}^{(t-1)}),$$

where $\rho > 0$ is the augmented Lagrangian penalty parameter. It measures the violation of the dual feasibility condition (stationarity of the Lagrangian with respect to $\mathbf{c}$).

The standard ADMM stopping criterion terminates when both residuals are small:

$$\|r^{(t)}\| \leq \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \max(\|\mathbf{B}\mathbf{c}^{(t)}\|, \|\mathbf{z}^{(t)}\|),$$

$$\|s^{(t)}\| \leq \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \|\rho\,\mathbf{B}^H \mathbf{u}^{(t)}\|,$$

where $\mathbf{u}^{(t)}$ is the scaled dual variable, with typical values $\varepsilon_{\text{abs}} = 10^{-4}$ and $\varepsilon_{\text{rel}} = 10^{-3}$.
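The two tests translate directly into code. A minimal sketch assuming dense NumPy arrays; the function name `admm_converged` is our choice:

```python
import numpy as np

def admm_converged(B, c, z, z_prev, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Standard ADMM stopping test: both residuals below their tolerances.

    B : constraint matrix (Bc = z); c, z : current primal iterates;
    z_prev : previous z iterate; u : scaled dual variable."""
    r = B @ c - z                              # primal residual r^(t)
    s = rho * (B.conj().T @ (z - z_prev))      # dual residual s^(t)
    eps_pri = eps_abs + eps_rel * max(np.linalg.norm(B @ c),
                                      np.linalg.norm(z))
    eps_dual = eps_abs + eps_rel * np.linalg.norm(rho * (B.conj().T @ u))
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```

The same relative/absolute structure keeps the test meaningful across problem scales: large images have large norms, so a purely absolute tolerance would never trigger.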

Definition: Fixed-Point Residual

For a proximal algorithm with update operator $\mathcal{T}$ (i.e., $\mathbf{c}^{(t+1)} = \mathcal{T}(\mathbf{c}^{(t)})$), the fixed-point residual at iteration $t$ is

$$\text{FPR}^{(t)} = \|\mathbf{c}^{(t+1)} - \mathbf{c}^{(t)}\|.$$

This is a universal diagnostic: it applies to ISTA, FISTA, ADMM, Chambolle--Pock, Douglas--Rachford, and any other fixed-point iteration. If $\mathcal{T}$ is nonexpansive, $\text{FPR}^{(t)} \to 0$ is necessary (though not sufficient) for convergence to a fixed point.

The fixed-point residual is cheap to compute (one vector difference and norm) and requires no knowledge of the algorithm's internal structure. It is the first quantity to monitor for any new algorithm.
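A generic driver that monitors the fixed-point residual might look like the following sketch; `iterate_with_fpr` and its default tolerances are our choices, and `T` can be any one-iteration update (an ISTA step, a full ADMM pass, etc.):

```python
import numpy as np

def iterate_with_fpr(T, c0, tol=1e-8, max_iter=1000):
    """Run c <- T(c) while recording the fixed-point residual per step.

    Returns the final iterate and the FPR history; stops once the
    residual drops below `tol`."""
    c, fpr_history = c0, []
    for _ in range(max_iter):
        c_next = T(c)
        fpr = np.linalg.norm(c_next - c)
        fpr_history.append(fpr)
        c = c_next
        if fpr <= tol:
            break
    return c, fpr_history
```

Because the driver only calls `T`, it needs no knowledge of the algorithm's internals, which is exactly what makes the FPR the first diagnostic to wire up.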

Theorem: Morozov's Discrepancy Principle

For the noisy inverse problem $\mathbf{y} = \mathbf{A}\mathbf{c}_{\text{true}} + \mathbf{w}$ with $\|\mathbf{w}\| \leq \delta$, Morozov's discrepancy principle selects the regularization parameter $\lambda$ (or the iteration count $t$) such that the data-fidelity residual matches the noise level:

$$\|\mathbf{A}\mathbf{c}^{(\lambda)} - \mathbf{y}\| \approx \delta.$$

Specifically, one chooses the largest $\lambda$ (or the earliest $t$) satisfying

$$\|\mathbf{A}\mathbf{c}^{(\lambda)} - \mathbf{y}\| \leq \tau\,\delta$$

for a safety factor $\tau > 1$ (typically $\tau = 1.1$--$1.5$).

We should fit the data only as well as the noise allows. Fitting below the noise level means fitting the noise itself, which is overfitting in the statistical sense. The discrepancy principle provides a principled way to stop without requiring ground truth.
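In code the discrepancy test is one line, and a parameter sweep can use it to pick $\lambda$. A sketch under the assumption that a user-supplied `solve(lam)` returns the reconstruction for a given regularization weight (both function names are ours):

```python
import numpy as np

def discrepancy_satisfied(A, c, y, delta, tau=1.2):
    """Morozov test: the data residual has reached the noise level."""
    return np.linalg.norm(A @ c - y) <= tau * delta

def select_lambda(solve, A, y, delta, lambdas, tau=1.2):
    """Return the largest lambda whose reconstruction passes the test."""
    for lam in sorted(lambdas, reverse=True):   # strongest smoothing first
        c = solve(lam)
        if discrepancy_satisfied(A, c, y, delta, tau):
            return lam, c
    lam = min(lambdas)                          # fall back to the weakest
    return lam, solve(lam)
```

Sweeping from the largest $\lambda$ downward implements "choose the largest $\lambda$ satisfying the bound" directly.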

Definition: Warm-Starting from the Matched-Filter Image

The matched-filter image (backpropagation image) is

$$\hat{\mathbf{c}}^{\text{BP}} = \mathbf{D}^{-1}\mathbf{A}^{H} \mathbf{y},$$

where $\mathbf{D} = \operatorname{diag}(\mathbf{A}^{H} \mathbf{A})$ holds the diagonal of the normal matrix. This is a single adjoint application followed by an elementwise normalization and costs $O(MQ)$.

Warm-starting uses $\hat{\mathbf{c}}^{\text{BP}}$ as the initial iterate $\mathbf{c}^{(0)}$ for any iterative algorithm, rather than the default $\mathbf{c}^{(0)} = \mathbf{0}$.

The matched-filter image is far from optimal: it suffers from sidelobes and lacks sparsity. But it is already correlated with the true scene, so starting from it rather than from zero typically saves 20--50% of the iterations needed to reach a given reconstruction quality.

In our RF imaging system, $\hat{\mathbf{c}}^{\text{BP}}$ is the standard baseline (Matched-Filter Imaging).
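A dense-matrix sketch of the warm-start baseline; the function name and the small `eps` guard against empty columns are our additions:

```python
import numpy as np

def matched_filter_image(A, y, eps=1e-12):
    """Backpropagation baseline c_BP = D^{-1} A^H y, D = diag(A^H A).

    One adjoint application plus a per-voxel normalization, O(MQ).
    A : (M, Q) measurement matrix; y : (M,) data vector."""
    d = np.sum(np.abs(A) ** 2, axis=0)   # diagonal of A^H A, shape (Q,)
    return (A.conj().T @ y) / (d + eps)  # eps guards all-zero columns
```

Passing the result as `c0` to any iterative solver implements the warm start.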

Example: Stopping Criterion for ADMM in TV-Regularized Imaging

Consider ADMM applied to the TV-regularized imaging problem

$$\min_{\mathbf{c}} \frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + \lambda\,\text{TV}(\mathbf{c})$$

with $Q = 128^2$ voxels, noise standard deviation $\sigma = 0.05$, and $\rho = 1.0$. Design a stopping criterion using primal and dual residuals.

Convergence Diagnostics Dashboard

Visualize the convergence behavior of iterative algorithms on a sparse recovery problem. Panel 1: objective value vs iteration. Panel 2: fixed-point residual vs iteration. Panel 3: data-fidelity residual with the discrepancy level $\tau\delta$ marked.

Compare warm-started (from matched filter) vs cold-started (from zero) initialization.

Parameters: noise standard deviation (default 0.05).

⚠️ Engineering Note

Adaptive Penalty Parameter Selection in ADMM

The convergence speed of ADMM depends strongly on the penalty parameter $\rho$. A practical adaptive scheme (Boyd et al., 2011) adjusts $\rho$ at each iteration based on the ratio of primal and dual residuals:

$$\rho^{(t+1)} = \begin{cases} \tau_{\text{inc}}\,\rho^{(t)} & \text{if } \|r^{(t)}\| > \mu\,\|s^{(t)}\|, \\ \rho^{(t)} / \tau_{\text{dec}} & \text{if } \|s^{(t)}\| > \mu\,\|r^{(t)}\|, \\ \rho^{(t)} & \text{otherwise}, \end{cases}$$

with typical values $\mu = 10$ and $\tau_{\text{inc}} = \tau_{\text{dec}} = 2$.

The intuition: if the primal residual is much larger than the dual residual, increase $\rho$ to penalize constraint violation more heavily; conversely, if the dual residual dominates, decrease $\rho$ to allow more flexibility in the primal update.
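A sketch of the update, with one practical detail worth encoding: in the scaled-dual formulation, `u` stores the dual variable divided by $\rho$, so it must be rescaled whenever $\rho$ changes (the function name is ours):

```python
def update_rho(rho, r_norm, s_norm, u, mu=10.0, tau_inc=2.0, tau_dec=2.0):
    """Residual-balancing penalty update for ADMM.

    Returns the new (rho, u) pair; u is the scaled dual variable and
    shrinks when rho grows so that rho * u stays unchanged."""
    if r_norm > mu * s_norm:        # primal residual dominates: tighten
        return rho * tau_inc, u / tau_inc
    if s_norm > mu * r_norm:        # dual residual dominates: relax
        return rho / tau_dec, u * tau_dec
    return rho, u                   # residuals balanced: leave rho alone
```

Forgetting the rescaling of `u` is a classic source of ADMM divergence after the first $\rho$ change.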

Stopping Criteria for Imaging Algorithms

| Criterion | Applicable To | Requires Ground Truth? | Provides Certificate? |
| --- | --- | --- | --- |
| Fixed-point residual | All iterative algorithms | No | Necessary but not sufficient |
| Primal + dual residual | ADMM and variants | No | Yes (primal-dual optimality) |
| Primal-dual gap | Chambolle--Pock | No | Yes (absolute $\varepsilon$-optimality) |
| Discrepancy principle | All (with known $\delta$) | No (needs noise level) | Yes (regularization theory) |
| PSNR vs iteration | Simulation only | Yes | Detects semi-convergence |
| Objective decrease | All with known objective | No | Sanity check only |

Common Mistake: Semi-Convergence in Unregularized Iterations

Mistake:

Running an iterative algorithm (such as Landweber iteration or conjugate gradient on the normal equations) until the fixed-point residual is very small, without explicit regularization.

Correction:

For unregularized iterative methods, the number of iterations itself acts as a regularization parameter. PSNR initially improves as the algorithm fits the signal component, then decreases as it starts fitting the noise. This is semi-convergence.

Detection: plot PSNR vs iteration (in simulation) or monitor the data residual. If $\|\mathbf{A}\mathbf{c}^{(t)} - \mathbf{y}\|$ drops significantly below the noise level $\delta$, the algorithm is overfitting.

Prevention: use the discrepancy principle to stop, or add explicit regularization (TV, $\ell_1$, etc.) so that the regularized objective has a unique minimum and the algorithm converges to it monotonically.
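A sketch of Landweber iteration with the discrepancy stop, assuming a dense `A`; the step size 1/||A||^2 is a safe choice within the classical 0 < step < 2/||A||^2 convergence condition (the function name is ours):

```python
import numpy as np

def landweber_discrepancy(A, y, delta, tau=1.2, max_iter=1000):
    """Landweber iteration stopped by Morozov's discrepancy principle.

    Without the early stop this method semi-converges on noisy data:
    it first fits the signal, then starts fitting the noise."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # spectral norm: safe step size
    c = np.zeros(A.shape[1])
    for t in range(max_iter):
        residual = A @ c - y
        if np.linalg.norm(residual) <= tau * delta:
            break                             # residual at the noise level
        c = c - step * (A.conj().T @ residual)
    return c, t
```

Here the iteration count returned alongside `c` is itself the effective regularization parameter.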

Quick Check

In Morozov's discrepancy principle with noise level $\delta$ and safety factor $\tau = 1.2$, the algorithm should stop when the data residual satisfies:

  1. $\|\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\| \leq 1.2\,\delta$
  2. $\|\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\| \leq \delta / 1.2$
  3. $\|\mathbf{A}\hat{\mathbf{c}} - \mathbf{y}\| = 0$
  4. $\|\hat{\mathbf{c}} - \mathbf{c}_{\text{true}}\| \leq \delta$

Historical Note: Vladimir Morozov and the Discrepancy Principle

1966

Vladimir Morozov proposed the discrepancy principle in 1966 while working on ill-posed problems at Moscow State University. The idea that one should trust the data only to the level of the noise was revolutionary in an era when many practitioners believed that more iterations (or smaller regularization) always meant better results.

Morozov's principle formalized the intuition that every inverse problem has an information limit set by the noise. Together with Tikhonov's regularization theory (developed at the same institution), it established the mathematical foundation for what we now call regularization theory, a foundation that every modern imaging algorithm builds upon.

Example: Warm-Starting FISTA for RF Image Reconstruction

Compare the convergence of FISTA for $\ell_1$-regularized RF image reconstruction when initialized with: (a) $\mathbf{c}^{(0)} = \mathbf{0}$ (cold start), and (b) $\mathbf{c}^{(0)} = \hat{\mathbf{c}}^{\text{BP}}$ (matched-filter warm start). The problem has $Q = 128^2$ voxels, $M = 2048$ measurements, and $\text{SNR} = 20$ dB.

Common Mistake: Relative Change Can Be Misleading

Mistake:

Using only the relative change $\|\mathbf{c}^{(t+1)} - \mathbf{c}^{(t)}\| / \|\mathbf{c}^{(t)}\|$ as a stopping criterion, without checking the data residual.

Correction:

Relative change can be small for two very different reasons:

  1. The algorithm has converged to a good solution.
  2. The algorithm is stuck at a bad solution (e.g., a local minimum in nonconvex problems, or a very flat objective landscape).

Always combine relative change with at least one of:

  • Data residual $\|\mathbf{A}\mathbf{c}^{(t)} - \mathbf{y}\|$ (compared to the noise level via the discrepancy principle).
  • Objective value (should be decreasing).
  • Algorithm-specific residuals (primal/dual for ADMM).
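A combined test along these lines; the function name and tolerances are illustrative:

```python
import numpy as np

def combined_stop(c, c_prev, A, y, delta, rel_tol=1e-5, tau=1.2):
    """Stop only when the iterate has stopped moving AND the data residual
    sits at the noise level; relative change alone can flag a stuck
    iterate (bad local minimum, flat landscape) as 'converged'."""
    rel_change = np.linalg.norm(c - c_prev) / max(np.linalg.norm(c_prev), 1e-15)
    at_noise_level = np.linalg.norm(A @ c - y) <= tau * delta
    return rel_change <= rel_tol and at_noise_level
```

For strongly regularized problems the converged residual may legitimately sit above $\tau\delta$; in that case replace the second condition with an algorithm-specific residual check.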
🔧 Engineering Note

Practical Monitoring Checklist

Always monitor (regardless of algorithm):

  1. Objective value $F(\mathbf{c}^{(t)})$: should decrease monotonically for true descent methods such as ISTA. FISTA may oscillate slightly due to momentum, and ADMM is not guaranteed to decrease the objective at every step.

  2. Fixed-point residual $\|\mathbf{c}^{(t+1)} - \mathbf{c}^{(t)}\|$: stagnation below $10^{-5}$ usually indicates convergence.

  3. Data residual $\|\mathbf{A}\mathbf{c}^{(t)} - \mathbf{y}\|$: compare against the noise level $\delta$.

Warning signs:

  • Objective increasing: step size too large, or a bug.
  • Data residual dropping far below $\delta$: overfitting (stop earlier or increase regularization).
  • Primal residual stagnating (ADMM): $\rho$ too small.
  • Dual residual stagnating (ADMM): $\rho$ too large.
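The checklist can be wrapped in a small logger. A sketch; the class name and the factor-of-two overfitting threshold are our illustrative choices:

```python
import numpy as np

class ConvergenceMonitor:
    """Record the three universal diagnostics each iteration and flag
    the warning signs from the checklist. `delta` is the noise level."""

    def __init__(self, A, y, delta):
        self.A, self.y, self.delta = A, y, delta
        self.obj, self.fpr, self.data_res = [], [], []

    def log(self, objective, c, c_prev):
        self.obj.append(objective)
        self.fpr.append(np.linalg.norm(c - c_prev))
        self.data_res.append(np.linalg.norm(self.A @ c - self.y))

    def warnings(self):
        w = []
        if len(self.obj) >= 2 and self.obj[-1] > self.obj[-2]:
            w.append("objective increased: step size too large or bug?")
        if self.data_res and self.data_res[-1] < 0.5 * self.delta:
            w.append("data residual far below noise level: overfitting?")
        return w
```

Calling `log` once per iteration and printing `warnings()` every few iterations catches most of the failure modes above before they waste a full run.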

Primal residual

In ADMM, the norm of the constraint violation $r^{(t)} = \mathbf{B}\mathbf{c}^{(t)} - \mathbf{z}^{(t)}$. Measures how well the auxiliary variable agrees with the constraint.

Warm starting

Initializing an iterative algorithm from an approximate solution (e.g., the matched-filter image) rather than from zero. Reduces the number of iterations to convergence.

Key Takeaway

Convergence diagnostics are not optional; they are integral to every reconstruction pipeline. The fixed-point residual is universal and cheap. Primal and dual residuals provide ADMM-specific optimality certificates. The discrepancy principle sets the noise-appropriate stopping level without requiring ground truth. Warm-starting from the matched-filter image typically saves 20--50% of iterations. Always monitor the objective and data residual; an unexpected increase signals a step-size error or implementation bug.