Theoretical Guarantees for Unrolled Networks

Can We Prove Unrolled Networks Work?

Unrolled networks achieve impressive empirical results, but practitioners rightly ask: are there theoretical guarantees? Can we bound the reconstruction error? Can we prove convergence? How many training samples are needed?

This section surveys three categories of results: (1) convergence guarantees (does the unrolled iteration converge?), (2) generalisation bounds (does training performance transfer to test data?), and (3) robustness to model mismatch (what happens when the assumed model is wrong?).

Theorem: Unrolled Networks Converge Faster Than Fixed-Parameter Algorithms

Let $\mathcal{T}_\alpha$ be an iterative algorithm with fixed parameter $\alpha$ that converges linearly with rate $\rho(\alpha)$:

$$\|\hat{\mathbf{c}}^{(k)} - \mathbf{c}^*\| \leq \rho(\alpha)^k \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|.$$

An unrolled network with layer-wise parameters $\{\alpha_k\}_{k=1}^K$ achieves a convergence factor of

$$\|\hat{\mathbf{c}}^{(K)} - \mathbf{c}^*\| \leq \left(\prod_{k=1}^K \rho(\alpha_k)\right) \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|.$$

By optimising $\{\alpha_k\}$ jointly over all $K$ layers, rather than choosing a single worst-case step size, the realised $K$-step contraction factor can be made strictly smaller than $\rho(\alpha^*)^K$, where $\alpha^*$ minimises the per-iteration rate $\rho(\alpha)$; Chebyshev-style step-size schedules are the classical example of this effect.

With fixed parameters, every iteration uses the same step size, a compromise between the early iterations (where large steps are beneficial) and the late iterations (where small steps avoid oscillation). Unrolling allows aggressive early layers and conservative late layers, achieving faster overall convergence.
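As a concrete illustration (not part of the theorem above), the following minimal sketch compares $K$ steps of gradient descent on a small least-squares problem using a single worst-case-optimal step size against a layer-wise schedule; the Chebyshev schedule stands in for what a trained unrolled network would learn, and the problem sizes are illustrative assumptions.

    # Minimal sketch: fixed step size vs a layer-wise (Chebyshev) schedule.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, K = 60, 40, 12
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    c_star = rng.standard_normal(n)
    y = A @ c_star

    H = A.T @ A
    lam = np.linalg.eigvalsh(H)
    lam_min, lam_max = lam[0], lam[-1]

    def run(steps):
        c = np.zeros(n)
        for a in steps:
            c = c - a * (H @ c - A.T @ y)       # one gradient step = one "layer"
        return np.linalg.norm(c - c_star)

    # Fixed parameter: the single step size with the best worst-case linear rate.
    alpha_fixed = 2.0 / (lam_min + lam_max)
    err_fixed = run([alpha_fixed] * K)

    # Layer-wise parameters: Chebyshev step sizes, one per layer.
    ks = np.arange(1, K + 1)
    alphas = 2.0 / (lam_max + lam_min - (lam_max - lam_min) * np.cos(np.pi * (2 * ks - 1) / (2 * K)))
    err_layerwise = run(alphas)

    print(f"fixed step : error {err_fixed:.3e}")
    print(f"layer-wise : error {err_layerwise:.3e}")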

Theorem: Convergence Under the Restricted Isometry Property

Let $\mathbf{A}$ satisfy the RIP of order $2s$ with constant $\delta_{2s} < \sqrt{2} - 1$. Then a $K$-layer LISTA network with weight-tied matrices $\mathbf{W} = \mathbf{I} - \mathbf{A}^{H}(\mathbf{A}\mathbf{A}^{H} + \mu\mathbf{I})^{-1}\mathbf{A}$ and learned thresholds $\{\tau_k\}$ satisfies:

$$\|\hat{\mathbf{c}}^{(K)} - \mathbf{c}^*\|_2 \leq \rho^K \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|_2 + \frac{C}{\sqrt{s}}\|\mathbf{c}^* - \mathbf{c}^*_s\|_1 + D\sigma^2$$

where $\mathbf{c}^*_s$ is the best $s$-sparse approximation of $\mathbf{c}^*$, $\rho < 1$ depends on $\delta_{2s}$ and $\mu$, and $C, D$ are constants.

The RIP ensures that $\mathbf{A}$ preserves distances between sparse vectors, which in turn ensures the LMMSE-based weight matrix $\mathbf{W}$ is a contraction on the support of the signal. The bound has three terms: exponential decay of the iterative error, approximation error for non-exactly-sparse signals, and noise amplification.
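The following sketch (illustrative, assuming a Gaussian sensing matrix and $\mu = 0.1$; these values are not from the text) builds the weight-tied matrix $\mathbf{W}$ from the theorem and checks numerically that its restriction to a small support has spectral norm below one, i.e. acts as a contraction there.

    # Minimal sketch: W = I - A^H (A A^H + mu I)^{-1} A restricted to a support.
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, s, mu = 50, 100, 5, 0.1
    A = rng.standard_normal((m, n)) / np.sqrt(m)     # approximately RIP for small s

    W = np.eye(n) - A.conj().T @ np.linalg.solve(A @ A.conj().T + mu * np.eye(m), A)

    S = rng.choice(n, size=2 * s, replace=False)     # support of size 2s
    W_SS = W[np.ix_(S, S)]
    rho = np.linalg.norm(W_SS, 2)                    # largest singular value

    print(f"spectral norm of W restricted to the support: {rho:.3f} (< 1 means contraction)")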

Theorem: Generalisation Bound for Unrolled Networks

Let $f_\theta$ be a $K$-layer unrolled network with $P$ total learnable parameters, trained on $n$ i.i.d. samples. If each layer is $L_k$-Lipschitz and the loss $\mathcal{L}$ is $L_{\mathcal{L}}$-Lipschitz, then with probability at least $1 - \delta$:

$$\mathbb{E}[\mathcal{L}] \leq \hat{\mathcal{L}}_n + O\!\left(\frac{L_{\mathcal{L}} \prod_{k=1}^K L_k \cdot \sqrt{P \log(K/\delta)}}{\sqrt{n}}\right)$$

where $\hat{\mathcal{L}}_n$ is the empirical training loss.

The generalisation gap scales with:

  • $\sqrt{P/n}$: more parameters or fewer samples worsen generalisation.
  • $\prod_k L_k$: the product of per-layer Lipschitz constants (the "depth amplification" factor).

The inductive bias of unrolled networks keeps $P$ small relative to generic architectures, improving the bound. Weight tying further reduces $P$ to $P/K$.

Interactive figure: Generalisation Gap vs Network Depth. The plot shows how the generalisation gap (test loss minus train loss) scales with network depth $K$ and training set size $n$: the theoretical bound $\propto \sqrt{P/n} \cdot \prod_k L_k$ is drawn alongside empirical measurements from unrolled OAMP training. Increasing the number of layers illustrates the "depth amplification" effect; increasing the training set size tightens the bound.
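A minimal sketch of how such a bound curve can be tabulated; all constants below (per-layer parameter count, per-layer Lipschitz constant, $\delta$) are assumptions chosen purely for illustration, not values from the text.

    # Minimal sketch: scaling of the generalisation-gap bound with K and n.
    import numpy as np

    def gap_bound(K, n, p_per_layer=5500, L_layer=1.05, L_loss=1.0, delta=0.05, weight_tied=False):
        """Upper bound O(L_loss * prod_k L_k * sqrt(P log(K/delta) / n)) with illustrative constants."""
        P = p_per_layer if weight_tied else K * p_per_layer   # weight tying: P -> P/K
        depth_amp = L_layer ** K                               # product of per-layer Lipschitz constants
        return L_loss * depth_amp * np.sqrt(P * np.log(K / delta) / n)

    for K in (5, 10, 15):
        for n in (1_000, 5_000, 20_000):
            print(f"K={K:2d}  n={n:6d}  untied={gap_bound(K, n):8.3f}  tied={gap_bound(K, n, weight_tied=True):8.3f}")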

Definition:

Convergent Unrolling

A convergent unrolled network is an architecture guaranteed to converge to a fixed point as $K \to \infty$, regardless of the learned parameters. This requires:

  1. Each layer is a contractive mapping: $\|\mathcal{T}_{\theta_k}(\mathbf{a}) - \mathcal{T}_{\theta_k}(\mathbf{b})\| \leq \rho_k \|\mathbf{a} - \mathbf{b}\|$ with $\rho_k < 1$.
  2. The contraction rates satisfy $\prod_{k=1}^\infty \rho_k = 0$ (guaranteed if $\sup_k \rho_k < 1$).

Convergent unrolling is achieved by constraining the learned operators (e.g., spectral normalisation of weight matrices, firmly nonexpansive denoisers).
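As a sketch of the spectral-normalisation option, assuming PyTorch and a simple linear layer standing in for the learned operator: spectral normalisation rescales the weight to unit spectral norm, and a fixed scaling $\rho < 1$ then makes the layer a strict contraction.

    # Minimal sketch: enforcing a per-layer contraction via spectral normalisation.
    import torch
    import torch.nn as nn

    class ContractiveLayer(nn.Module):
        """One unrolled layer constrained to have Lipschitz constant <= rho < 1."""
        def __init__(self, dim: int, rho: float = 0.95):
            super().__init__()
            assert rho < 1.0
            self.rho = rho
            # spectral_norm rescales the weight to unit spectral norm via power iteration
            self.linear = nn.utils.spectral_norm(nn.Linear(dim, dim, bias=False))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.rho * self.linear(x)

    layer = ContractiveLayer(dim=64)
    with torch.no_grad():
        for _ in range(20):                          # let the power-iteration estimate settle
            layer(torch.randn(8, 64))
        W_eff = layer.rho * layer.linear.weight      # effective weight after normalisation
        print("estimated layer Lipschitz constant:", torch.linalg.matrix_norm(W_eff, ord=2).item())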

Convergent unrolling provides a safety net: even if training is imperfect, the network output cannot diverge. This is important for safety-critical applications (medical imaging, autonomous radar) where a divergent reconstruction could have serious consequences.

Theorem: Robustness to Model Mismatch

Let $f_\theta$ be an unrolled network trained with sensing matrix $\mathbf{A}$ and tested with a perturbed matrix $\tilde{\mathbf{A}} = \mathbf{A} + \Delta\mathbf{A}$. If $f_\theta$ is $L_f$-Lipschitz with respect to $\mathbf{A}$, then:

$$\|\hat{\mathbf{c}}_{\tilde{\mathbf{A}}} - \hat{\mathbf{c}}_{\mathbf{A}}\| \leq L_f \cdot \|\Delta\mathbf{A}\| \cdot \|\mathbf{c}\|.$$

For unrolled OAMP with $T$ layers, the Lipschitz constant scales as $L_f = O(T \cdot \|\mathbf{W}_{\text{LE}}\|)$, which is bounded when the LMMSE regularisation is positive.

Small perturbations in the sensing matrix cause proportionally small changes in the reconstruction. The bound degrades linearly with depth $T$, motivating the use of moderate depth (5-15 layers) in practice. Unrolled OAMP is more robust than generic networks because the physics-based LMMSE step provides a structured response to model perturbations.
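A simple way to probe this empirically is to perturb the sensing matrix and measure how much the output moves. The sketch below uses a single regularised LMMSE-style solve as a stand-in for the full unrolled network (an illustrative assumption, not the trained model), and the printed ratio is a crude empirical estimate of $L_f$.

    # Minimal sketch: robustness to model mismatch via random perturbations of A.
    import numpy as np

    rng = np.random.default_rng(2)
    m, n, mu = 50, 100, 0.1
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    c = rng.standard_normal(n)
    y = A @ c

    def reconstruct(y, A):
        # regularised least squares: A^H (A A^H + mu I)^{-1} y
        return A.conj().T @ np.linalg.solve(A @ A.conj().T + mu * np.eye(m), y)

    for eps in (1e-3, 1e-2, 1e-1):
        dA = eps * rng.standard_normal((m, n)) / np.sqrt(m)
        diff = np.linalg.norm(reconstruct(y, A + dA) - reconstruct(y, A))
        bound_scale = np.linalg.norm(dA, 2) * np.linalg.norm(c)
        print(f"||dA||={np.linalg.norm(dA, 2):.3e}  output change={diff:.3e}  ratio={diff / bound_scale:.3f}")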

The Over-Parameterised Regime and PL* Condition

When the unrolled network has many more parameters than training samples ($P \gg n$), classical generalisation theory predicts poor performance. However, recent results show that unrolled networks can interpolate training data while still generalising, provided:

  1. The loss landscape satisfies the Polyak-Łojasiewicz* (PL*) condition: $\|\nabla \mathcal{L}(\theta)\|^2 \geq \mu \cdot \mathcal{L}(\theta)$ for some $\mu > 0$.
  2. The initialisation is close to a good solution (e.g., ISTA/OAMP initialisation).

Under PL*, gradient descent converges to a global minimum at a linear rate, and the implicit regularisation of the algorithm structure selects solutions with good generalisation properties. This helps explain why unrolled OAMP generalises well even with relatively few training samples.
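The flavour of this argument can be seen in the simplest over-parameterised setting. The sketch below (plain least squares with $P \gg n$, purely illustrative and not the unrolled-network case) shows gradient descent driving the training loss to zero at a linear rate, and converging from zero initialisation to the minimum-norm interpolant, a toy stand-in for the implicit regularisation described above.

    # Minimal sketch: interpolation + implicit regularisation in an over-parameterised problem.
    import numpy as np

    rng = np.random.default_rng(3)
    n_samples, P = 20, 200                       # many more parameters than samples
    X = rng.standard_normal((n_samples, P))
    y = rng.standard_normal(n_samples)

    theta = np.zeros(P)
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for t in range(201):
        r = X @ theta - y
        theta -= lr * X.T @ r
        if t % 50 == 0:
            print(f"step {t:3d}  train loss {0.5 * np.dot(r, r):.3e}")

    theta_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # minimum-norm interpolating solution
    print("distance to min-norm solution:", np.linalg.norm(theta - theta_min_norm))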

Example: Sample Complexity --- Unrolled OAMP vs Generic Network

Compare the number of training samples needed for 10-layer unrolled OAMP (55K parameters) and a generic U-Net (1.2M parameters) to achieve NMSE $< -20$ dB on an RF imaging task with $128 \times 128$ images.

Common Mistake: Unrolled Networks Do Not Converge by Default

Mistake:

Assuming that because the underlying OAMP algorithm converges, the unrolled version also converges.

Correction:

Convergence of OAMP relies on specific conditions (divergence-free LE, denoiser properties). The learned ProxNet may violate these, and the truncation to $T$ layers means the network never runs to convergence.

To guarantee convergence:

  1. Constrain the ProxNet to be firmly nonexpansive (spectral normalisation).
  2. Add a convergence penalty: $\mathcal{L}_{\text{conv}} = \|\hat{\mathbf{c}}^{(T)} - \mathcal{T}_{\theta_T}(\hat{\mathbf{c}}^{(T)})\|^2$ (see the sketch after this list).
  3. Verify the state evolution monotonically decreases $\sigma_t^2$ across layers.
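A minimal sketch of item 2, assuming PyTorch and unrolled layers with the (hypothetical) signature layer(c, y); the DummyLayer below is only a stand-in so the snippet runs, not the actual ProxNet.

    # Minimal sketch: adding the fixed-point residual penalty L_conv to the training loss.
    import torch
    import torch.nn as nn

    class DummyLayer(nn.Module):
        """Placeholder unrolled layer: a learned linear update (illustrative stand-in only)."""
        def __init__(self, dim):
            super().__init__()
            self.W = nn.Linear(dim, dim)
        def forward(self, c, y):
            return c + 0.1 * (self.W(y) - c)

    def training_loss(layers, y, c_true, lam_conv=0.1):
        c_hat = torch.zeros_like(c_true)
        for layer in layers:
            c_hat = layer(c_hat, y)                              # T unrolled iterations
        recon = torch.mean((c_hat - c_true) ** 2)                # data-fit term
        residual = layers[-1](c_hat, y) - c_hat                  # c^(T) - T_{theta_T}(c^(T))
        return recon + lam_conv * torch.mean(residual ** 2)      # + L_conv

    layers = nn.ModuleList(DummyLayer(64) for _ in range(5))
    loss = training_loss(layers, torch.randn(8, 64), torch.randn(8, 64))
    loss.backward()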
⚠️ Engineering Note

Gradient Checkpointing for Deep Unrolled Networks

For $T > 10$ layers, storing all intermediate activations for backpropagation requires $O(T \cdot N)$ memory, which may exceed GPU capacity for large images.

Gradient checkpointing stores only every $k$-th activation and recomputes the others during the backward pass. This reduces memory from $O(T \cdot N)$ to $O((T/k + k) \cdot N)$ at the cost of one additional forward pass; choosing $k \approx \sqrt{T}$ gives the familiar $O(\sqrt{T} \cdot N)$ memory footprint.

For unrolled OAMP with $T = 15$, $N = 128^2$, and checkpointing every 3 layers, the memory reduction is $\sim 3\times$ with $\sim 33\%$ additional computation.
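A minimal sketch of how this could be wired up with torch.utils.checkpoint, assuming unrolled layers with the same hypothetical layer(c, y) signature as the earlier sketch and an illustrative segment size of k = 3.

    # Minimal sketch: checkpoint every k unrolled layers so only segment boundaries are stored.
    import torch
    from torch.utils.checkpoint import checkpoint

    def run_unrolled_checkpointed(layers, y, c0, k=3):
        def make_segment(segment):
            def segment_fn(c, y):
                for layer in segment:
                    c = layer(c, y)
                return c
            return segment_fn

        c = c0
        for i in range(0, len(layers), k):
            # activations inside each segment are recomputed during the backward pass
            c = checkpoint(make_segment(layers[i:i + k]), c, y, use_reentrant=False)
        return c

    # usage (illustrative): c_hat = run_unrolled_checkpointed(layers, y, torch.zeros_like(y), k=3)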

Practical Constraints

  • GPU memory for intermediate activations: $O(T \cdot N)$ without checkpointing.
  • Recomputation cost: one additional forward pass per checkpoint segment.

Quick Check

According to the generalisation bound, which modification most effectively reduces the generalisation gap of an unrolled network?

  • Adding more layers (increasing $K$)
  • Weight tying across layers (reducing $P$ to $P/K$)
  • Using a larger training set
  • Using a more expressive denoiser

Restricted Isometry Property (RIP)

A sensing matrix $\mathbf{A}$ satisfies the RIP of order $s$ with constant $\delta_s$ if $(1-\delta_s)\|\mathbf{x}\|^2 \leq \|\mathbf{A}\mathbf{x}\|^2 \leq (1+\delta_s)\|\mathbf{x}\|^2$ for all $s$-sparse vectors $\mathbf{x}$. RIP ensures that sparse recovery algorithms converge.

Convergent Unrolling

An unrolled network architecture that is guaranteed to converge to a fixed point regardless of the learned parameters, achieved by constraining each layer to be a contraction (e.g., via spectral normalisation).

Key Takeaway

Unrolled networks enjoy three types of theoretical guarantees: convergence under RIP (linear decay of error), generalisation bounds scaling as $\sqrt{P/n}$ (improved by weight tying and algorithmic inductive bias), and robustness to model mismatch proportional to the perturbation norm. Convergent unrolling via Lipschitz constraints provides safety guarantees for critical applications.