Convergence Theory for PnP

The Convergence Challenge: When Is PnP Safe?

Standard ADMM and PGD convergence theory requires the proximal operator to be the exact proximal of a convex function. Deep denoisers violate this assumption in two ways:

  1. They may not be the proximal operator of any function.
  2. Even if they are, the implicit function may not be convex.

Without convergence guarantees, PnP iterations can oscillate, diverge, or converge to artefact-laden images. This section develops three lines of theory with increasing structural requirements:

  • Non-expansive denoisers (Ryu et al., 2019): weakest condition, convergence to a fixed point
  • Gradient-step denoisers (Hurault et al., 2022): denoiser is a gradient, convergence to an objective minimum
  • ICNN-based denoisers (Pesquet et al., 2021): strongest, global convergence guarantees

Definition:

Lipschitz Condition for Denoisers

A denoiser $\mathcal{D}_\sigma$ has Lipschitz constant $L$ if

$$\|\mathcal{D}_\sigma(\mathbf{a}) - \mathcal{D}_\sigma(\mathbf{b})\| \leq L\,\|\mathbf{a} - \mathbf{b}\| \qquad \forall\, \mathbf{a}, \mathbf{b}.$$

Key regimes:

| Condition | Meaning | PnP implication |
| --- | --- | --- |
| $L \leq 1$ | Non-expansive | PnP-PGD convergence (fixed point) |
| $L < 1$ | Contractive | PnP-ADMM convergence + uniqueness |
| $L > 1$ | Expansive | PnP may diverge |

For a $D$-layer network with 1-Lipschitz activations: $L \leq \prod_{\ell=1}^{D} \|\mathbf{W}_\ell\|$.

Most trained DnCNN variants have $L \approx 1.5$–$3$ without spectral normalisation. This is why unconstrained PnP can diverge even with conservative step sizes, particularly when the penalty parameter is weak.
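As a rough diagnostic, the local Lipschitz constant of a trained denoiser can be estimated by power iteration on its Jacobian using automatic differentiation. A minimal sketch, assuming a hypothetical PyTorch denoiser callable `denoiser` and an input image tensor `x` (both names are illustrative):

```python
import torch

def local_lipschitz_estimate(denoiser, x, n_iters=20):
    # Power iteration on J^T J, where J is the Jacobian of the denoiser at x;
    # the largest singular value of J is a local estimate of L.
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(n_iters):
        _, Jv = torch.autograd.functional.jvp(denoiser, x, v)     # J v
        _, JtJv = torch.autograd.functional.vjp(denoiser, x, Jv)  # J^T (J v)
        v = JtJv / JtJv.norm()
    _, Jv = torch.autograd.functional.jvp(denoiser, x, v)
    return Jv.norm().item()
```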

Theorem: PnP-PGD Convergence via Non-Expansiveness

If the denoiser $\mathcal{D}_\sigma$ is non-expansive (Lipschitz constant $L \leq 1$) and the gradient step $\mathcal{G}_\alpha(\mathbf{x}) = \mathbf{x} - \alpha\mathbf{A}^{H}(\mathbf{A}\mathbf{x} - \mathbf{y})$ is non-expansive ($\alpha \leq 2/\|\mathbf{A}\|^2$), then the PnP-PGD iterates $\mathbf{c}^{(k+1)} = \mathcal{D}_\sigma(\mathcal{G}_\alpha(\mathbf{c}^{(k)}))$ converge to a fixed point.

PnP-PGD is the composition of two non-expansive maps. When one of them is averaged (a strict convex combination of the identity with a non-expansive map, as the gradient step is for $\alpha < 2/\|\mathbf{A}\|^2$), the composition is also averaged, and the Krasnoselskii–Mann iteration theorem guarantees convergence to a fixed point, provided one exists.
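The composition is short to state in code. A minimal NumPy sketch of the PnP-PGD iteration, assuming a `denoiser` callable that maps a vectorised image to its denoised version (the names and the zero initialisation are illustrative):

```python
import numpy as np

def pnp_pgd(A, y, denoiser, n_iters=100, alpha=None):
    # Step size alpha <= 2 / ||A||^2 keeps the gradient step non-expansive;
    # alpha = 1 / ||A||^2 makes it an averaged operator.
    if alpha is None:
        alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    c = np.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(n_iters):
        grad_step = c - alpha * A.conj().T @ (A @ c - y)   # G_alpha(c)
        c = denoiser(grad_step)                            # D_sigma(G_alpha(c))
    return c
```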


Theorem: PnP-ADMM Convergence via Strong Monotonicity

PnP-ADMM converges to a unique fixed point if the denoiser $\mathcal{D}_\sigma$ satisfies:

  1. Boundedness: $\|\mathcal{D}_\sigma(\mathbf{v})\| \leq C(1 + \|\mathbf{v}\|)$
  2. Strong monotonicity of the residual: $\langle (\mathbf{I} - \mathcal{D}_\sigma)(\mathbf{a}) - (\mathbf{I} - \mathcal{D}_\sigma)(\mathbf{b}),\ \mathbf{a} - \mathbf{b}\rangle \geq \gamma\|\mathbf{a} - \mathbf{b}\|^2$ for some $\gamma > 0$
  3. Positive penalty: $\rho > 0$

Strong monotonicity of $\mathbf{I} - \mathcal{D}_\sigma$ is equivalent to $\mathcal{D}_\sigma$ being contractive in a certain sense. It prevents the denoiser from moving iterates "too far," stopping oscillation.
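For reference, a minimal PnP-ADMM sketch in scaled dual form, with the z-update replaced by the denoiser. The `denoiser` callable and the direct linear solve in the x-update are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

def pnp_admm(A, y, denoiser, rho=1.0, n_iters=100):
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA, Aty = A.conj().T @ A, A.conj().T @ y
    for _ in range(n_iters):
        # x-update: argmin_x 0.5*||y - A x||^2 + (rho/2)*||x - z + u||^2
        x = np.linalg.solve(AtA + rho * np.eye(n), Aty + rho * (z - u))
        # z-update: the denoiser stands in for the proximal step of the prior
        z = denoiser(x + u)
        # scaled dual update
        u = u + x - z
    return x
```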

Definition:

Gradient-Step Denoiser (Hurault et al., 2022)

A gradient-step denoiser is trained so that the denoising correction is the gradient of a learned scalar potential $g_\sigma$:

$$\mathcal{D}_\sigma(\mathbf{v}) = \mathbf{v} - \nabla g_\sigma(\mathbf{v}).$$

The potential $g_\sigma$ is parameterised by a neural network and trained to satisfy $\nabla g_\sigma(\mathbf{v}) \approx \mathbf{v} - \mathbb{E}[\mathbf{x} \mid \mathbf{v}]$ (the noise residual under the MMSE denoiser).

Consequence: PnP-PGD with a gradient-step denoiser provably minimises

$$\min_{\mathbf{x}} \; \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|^2 + g_\sigma(\mathbf{x}).$$

Gradient-step denoisers close the gap between PnP and variational methods: they achieve near-DRUNet denoising quality while providing the convergence guarantees of a proper proximal algorithm. The key training cost is computing $\nabla g_\sigma$ via automatic differentiation.
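One way to realise this structure in code is to wrap a scalar potential network and obtain the correction by automatic differentiation. A minimal PyTorch sketch, assuming a hypothetical `potential` module that maps an image batch to per-image scalar values:

```python
import torch

class GradientStepDenoiser(torch.nn.Module):
    """D_sigma(v) = v - grad g_sigma(v), with the gradient taken by autograd."""
    def __init__(self, potential):
        super().__init__()
        self.potential = potential            # hypothetical network: image -> scalar

    def forward(self, v):
        with torch.enable_grad():
            v = v.clone().requires_grad_(True)
            g = self.potential(v).sum()       # g_sigma(v), summed over the batch
            (grad_g,) = torch.autograd.grad(g, v, create_graph=self.training)
        return v - grad_g                     # denoised estimate
```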

Definition:

ICNN-Based Denoiser

An Input-Convex Neural Network (ICNN) $\Phi_\theta \colon \mathbb{R}^N \to \mathbb{R}$ is convex in its input by construction:

  1. Non-negative passthrough weights: $\mathbf{W}_\ell^{(z)} \geq 0$
  2. Convex, non-decreasing activations: ReLU or softplus
  3. Arbitrary skip connections from input to each layer

The ICNN-based denoiser is the proximal operator of $\Phi_\theta$:

$$\mathcal{D}_\text{ICNN}(\mathbf{v}) = \operatorname{prox}_{\sigma^2\Phi_\theta}(\mathbf{v}) = \arg\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|^2 + \sigma^2\Phi_\theta(\mathbf{x}).$$

By construction, $\Phi_\theta$ is convex, so $\mathcal{D}_\text{ICNN}$ is the exact proximal operator of a convex function.

ICNN denoisers guarantee global convergence of PnP-ADMM and PnP-PGD to a unique minimiser of $f + \sigma^2\Phi_\theta$. The price is an expressivity gap of approximately 1–1.5 dB PSNR compared to unconstrained networks.
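A minimal PyTorch sketch of such a construction: passthrough weights are clamped to be non-negative and softplus activations are convex and non-decreasing, so the output is convex in the input; the denoiser is then evaluated by numerically solving the strongly convex inner proximal problem. All layer sizes and the inner solver settings here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

class SimpleICNN(torch.nn.Module):
    """Phi_theta(x): convex in x because the z-to-z ("passthrough") weights are
    clamped non-negative and softplus is convex and non-decreasing."""
    def __init__(self, dim, hidden=128, n_layers=3):
        super().__init__()
        self.W_z = torch.nn.ParameterList(
            [torch.nn.Parameter(0.01 * torch.rand(hidden, hidden)) for _ in range(n_layers - 1)])
        self.W_x = torch.nn.ModuleList([torch.nn.Linear(dim, hidden) for _ in range(n_layers)])
        self.w_out = torch.nn.Parameter(0.01 * torch.rand(hidden))
        self.lin_out = torch.nn.Linear(dim, 1)              # affine-in-x output term

    def forward(self, x):
        z = F.softplus(self.W_x[0](x))                      # softplus of an affine map: convex
        for W_z, W_x in zip(self.W_z, self.W_x[1:]):
            z = F.softplus(z @ W_z.clamp(min=0).T + W_x(x))
        return z @ self.w_out.clamp(min=0) + self.lin_out(x).squeeze(-1)

def icnn_prox(phi, v, sigma2, n_steps=100, lr=0.05):
    """Evaluate D_ICNN(v) = prox_{sigma^2 Phi}(v) by gradient descent on the
    strongly convex inner problem 0.5*||x - v||^2 + sigma^2 * Phi(x)."""
    x = v.clone().requires_grad_(True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = 0.5 * ((x - v) ** 2).sum() + sigma2 * phi(x).sum()
        loss.backward()
        opt.step()
    return x.detach()
```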

Theorem: Global Convergence of PnP with ICNN Denoisers

If $\mathcal{D}_\sigma = \operatorname{prox}_{\sigma^2\Phi_\theta}$ where $\Phi_\theta$ is an ICNN, then PnP-ADMM and PnP-PGD converge to the unique global minimiser of

$$\min_{\mathbf{x}} \; \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|^2 + \sigma^2\Phi_\theta(\mathbf{x}).$$

The sum of two convex functions is convex, and ADMM and PGD with exact proximal operators of convex functions converge to the global minimum by standard theory (uniqueness additionally requires the sum to be strictly convex, e.g. $\mathbf{A}$ injective or $\Phi_\theta$ strongly convex). Since $\mathcal{D}_\text{ICNN}$ is exactly the proximal operator of $\sigma^2\Phi_\theta$, all standard guarantees apply.


Convergence Properties of PnP Variants

| Denoiser type | Convergence guarantee | Objective minimised | PSNR (relative) |
| --- | --- | --- | --- |
| BM3D / unconstrained DnCNN | None in general | None (fixed-point iteration) | Best |
| Non-expansive DnCNN ($L \leq 1$) | Fixed point (PnP-PGD) | None explicit | Good (≈1 dB below BM3D) |
| Gradient-step denoiser | Convergence to objective minimum | $f(\mathbf{x}) + g_\sigma(\mathbf{x})$ | Near-best |
| ICNN denoiser | Global minimum (unique) | $f(\mathbf{x}) + \sigma^2\Phi_\theta(\mathbf{x})$ | Moderate (≈1.5 dB below best) |

Example: Training a Lipschitz-Constrained DnCNN

Describe how to train a DnCNN with Lipschitz constant $L \leq 1$ using spectral normalisation.
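A hedged sketch of one way to do this in PyTorch: wrap every convolution in `torch.nn.utils.spectral_norm`, which reparameterises each layer so the spectral norm of its (reshaped) weight matrix stays near 1; with 1-Lipschitz ReLU activations, the layer-product bound then keeps the cascade's $L$ near 1. Note that spectral normalisation of the reshaped kernel only approximates the true convolution operator norm, and a residual ("noise-predicting") head changes the overall constant, so the exact architecture matters. The depth and channel counts below are illustrative:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def lipschitz_constrained_dncnn(depth=7, channels=64):
    # Each conv layer's (reshaped) weight matrix is constrained to spectral
    # norm ~1 by the spectral_norm reparameterisation; ReLU is 1-Lipschitz,
    # so the product bound keeps L near 1 for the cascade.
    layers = [spectral_norm(nn.Conv2d(1, channels, 3, padding=1)), nn.ReLU(inplace=True)]
    for _ in range(depth - 2):
        layers += [spectral_norm(nn.Conv2d(channels, channels, 3, padding=1)),
                   nn.ReLU(inplace=True)]
    layers.append(spectral_norm(nn.Conv2d(channels, 1, 3, padding=1)))
    return nn.Sequential(*layers)
```

Training would then proceed with an ordinary denoising loss; the Lipschitz constraint is maintained by the reparameterisation at every forward pass rather than by an explicit penalty.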

Common Mistake: Theoretical Convergence vs. Practical Performance

Mistake:

Strictly enforcing the Lipschitz constraint $L \leq 1$ during training, leading to a weakened denoiser and poor reconstruction.

Correction:

Enforcing $L \leq 1$ limits the denoiser's expressivity. In practice, the best PnP results use denoisers with $L$ slightly above 1 (typically 1.0–1.5). These may violate the theoretical convergence conditions but converge empirically for well-chosen $\rho$.

Practical approach: Train without strict Lipschitz constraints, then monitor the PnP primal residual $\|\mathbf{c}^{(k+1)} - \mathbf{z}^{(k+1)}\|$. If it diverges, increase $\rho$ (reducing the effective denoiser input variance) or apply early stopping.
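As an illustration, the earlier PnP-ADMM sketch can be extended with this residual check. The growth threshold and the factor by which $\rho$ is increased are arbitrary heuristics for the sketch, not part of any convergence theory:

```python
import numpy as np

def pnp_admm_monitored(A, y, denoiser, rho=1.0, n_iters=100, growth=2.0, blowup=5.0):
    n = A.shape[1]
    c = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    Aty = A.conj().T @ y
    prev_res = np.inf
    for _ in range(n_iters):
        c = np.linalg.solve(A.conj().T @ A + rho * np.eye(n), Aty + rho * (z - u))
        z = denoiser(c + u)
        u = u + c - z
        res = np.linalg.norm(c - z)        # primal residual ||c - z||
        if res > blowup * prev_res:        # residual blowing up: damp the denoiser
            rho *= growth                  # larger rho -> smaller effective denoiser step
        prev_res = res
    return c
```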

PnP-ADMM Convergence vs Noise Level and Penalty

Explore how the denoiser noise level $\sigma$ and ADMM penalty $\rho$ affect PnP-ADMM convergence. The plot shows the primal residual $\|\mathbf{c}^{(k+1)} - \mathbf{z}^{(k+1)}\|$ vs iteration number.

For $\sigma\sqrt{\rho} < \lambda$ (under-regularised), the algorithm may oscillate. For $\sigma\sqrt{\rho} \gg \lambda$ (over-regularised), convergence is fast but the reconstruction is blurry.

Key Takeaway

PnP convergence theory offers three levels: (1) non-expansive denoisers guarantee fixed-point convergence for PnP-PGD; (2) gradient-step denoisers guarantee minimisation of an explicit objective; (3) ICNN denoisers guarantee global minimum convergence at the cost of reduced expressivity. In practice, the best results often use unconstrained DRUNet with empirical convergence monitoring, accepting the gap between theory and performance.