Theoretical Guarantees for Unrolled Networks

Can We Prove Unrolled Networks Work?

Unrolled networks achieve impressive empirical results, but practitioners rightly ask: are there theoretical guarantees? Can we bound the reconstruction error? Can we prove convergence? How many training samples are needed?

This section surveys three categories of results: (1) convergence guarantees (does the unrolled iteration converge?), (2) generalisation bounds (does training performance transfer to test data?), and (3) robustness to model mismatch (what happens when the assumed model is wrong?).

Theorem: Unrolled Networks Converge Faster Than Fixed-Parameter Algorithms

Let $\mathcal{T}_\alpha$ be an iterative algorithm with fixed parameter $\alpha$ that converges linearly with rate $\rho(\alpha)$:

$$\|\hat{\mathbf{c}}^{(k)} - \mathbf{c}^*\| \leq \rho(\alpha)^k \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|.$$

An unrolled network with layer-wise parameters $\{\alpha_k\}_{k=1}^K$ achieves a convergence factor of

$$\|\hat{\mathbf{c}}^{(K)} - \mathbf{c}^*\| \leq \left(\prod_{k=1}^K \rho(\alpha_k)\right) \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|.$$

By optimising $\{\alpha_k\}$ jointly over all $K$ layers, rather than choosing a single worst-case step size, the realised $K$-step contraction factor can be made strictly smaller than $\rho(\alpha^*)^K$, where $\alpha^*$ minimises the per-iteration rate $\rho(\alpha)$; Chebyshev-style step-size schedules are the classical example of this effect.

With fixed parameters, every iteration uses the same step size, a compromise between the early iterations (where large steps are beneficial) and the late iterations (where small steps avoid oscillation). Unrolling allows aggressive early layers and conservative late layers, achieving faster overall convergence.
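As a concrete illustration (not part of the theorem above), the following minimal sketch compares $K$ steps of gradient descent on a small least-squares problem using a single worst-case-optimal step size against a layer-wise schedule; the Chebyshev schedule stands in for what a trained unrolled network would learn, and the problem sizes are illustrative assumptions.

    # Minimal sketch: fixed step size vs a layer-wise (Chebyshev) schedule.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, K = 60, 40, 12
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    c_star = rng.standard_normal(n)
    y = A @ c_star

    H = A.T @ A
    lam = np.linalg.eigvalsh(H)
    lam_min, lam_max = lam[0], lam[-1]

    def run(steps):
        c = np.zeros(n)
        for a in steps:
            c = c - a * (H @ c - A.T @ y)       # one gradient step = one "layer"
        return np.linalg.norm(c - c_star)

    # Fixed parameter: the single step size with the best worst-case linear rate.
    alpha_fixed = 2.0 / (lam_min + lam_max)
    err_fixed = run([alpha_fixed] * K)

    # Layer-wise parameters: Chebyshev step sizes, one per layer.
    ks = np.arange(1, K + 1)
    alphas = 2.0 / (lam_max + lam_min - (lam_max - lam_min) * np.cos(np.pi * (2 * ks - 1) / (2 * K)))
    err_layerwise = run(alphas)

    print(f"fixed step : error {err_fixed:.3e}")
    print(f"layer-wise : error {err_layerwise:.3e}")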

Theorem: Convergence Under the Restricted Isometry Property

Let $\mathbf{A}$ satisfy the RIP of order $2s$ with constant $\delta_{2s} < \sqrt{2} - 1$. Then a $K$-layer LISTA network with weight-tied matrices $\mathbf{W} = \mathbf{I} - \mathbf{A}^{H}(\mathbf{A}\mathbf{A}^{H} + \mu\mathbf{I})^{-1}\mathbf{A}$ and learned thresholds $\{\tau_k\}$ satisfies:

$$\|\hat{\mathbf{c}}^{(K)} - \mathbf{c}^*\|_2 \leq \rho^K \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|_2 + \frac{C}{\sqrt{s}}\|\mathbf{c}^* - \mathbf{c}^*_s\|_1 + D\sigma^2$$

where $\mathbf{c}^*_s$ is the best $s$-sparse approximation of $\mathbf{c}^*$, $\rho < 1$ depends on $\delta_{2s}$ and $\mu$, and $C, D$ are constants.

The RIP ensures that $\mathbf{A}$ preserves distances between sparse vectors, which in turn ensures the LMMSE-based weight matrix $\mathbf{W}$ is a contraction on the support of the signal. The bound has three terms: exponential decay of the iterative error, approximation error for non-exactly-sparse signals, and noise amplification.
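The following sketch (illustrative, assuming a Gaussian sensing matrix and $\mu = 0.1$; these values are not from the text) builds the weight-tied matrix $\mathbf{W}$ from the theorem and checks numerically that its restriction to a small support has spectral norm below one, i.e. acts as a contraction there.

    # Minimal sketch: W = I - A^H (A A^H + mu I)^{-1} A restricted to a support.
    import numpy as np

    rng = np.random.default_rng(1)
    m, n, s, mu = 50, 100, 5, 0.1
    A = rng.standard_normal((m, n)) / np.sqrt(m)     # approximately RIP for small s

    W = np.eye(n) - A.conj().T @ np.linalg.solve(A @ A.conj().T + mu * np.eye(m), A)

    S = rng.choice(n, size=2 * s, replace=False)     # support of size 2s
    W_SS = W[np.ix_(S, S)]
    rho = np.linalg.norm(W_SS, 2)                    # largest singular value

    print(f"spectral norm of W restricted to the support: {rho:.3f} (< 1 means contraction)")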

Theorem: Generalisation Bound for Unrolled Networks

Let $f_\theta$ be a $K$-layer unrolled network with $P$ total learnable parameters, trained on $n$ i.i.d. samples. If each layer is $L_k$-Lipschitz and the loss $\mathcal{L}$ is $L_{\mathcal{L}}$-Lipschitz, then with probability at least $1 - \delta$:

$$\mathbb{E}[\mathcal{L}] \leq \hat{\mathcal{L}}_n + O\!\left(\frac{L_{\mathcal{L}} \prod_{k=1}^K L_k \cdot \sqrt{P \log(K/\delta)}}{\sqrt{n}}\right)$$

where $\hat{\mathcal{L}}_n$ is the empirical training loss.

The generalisation gap scales with:

  • $\sqrt{P/n}$: more parameters or fewer samples worsen generalisation.
  • $\prod_k L_k$: the product of per-layer Lipschitz constants (the "depth amplification" factor).

The inductive bias of unrolled networks keeps $P$ small relative to generic architectures, improving the bound. Weight tying further reduces $P$ to $P/K$.

Interactive figure: Generalisation Gap vs Network Depth. The plot shows how the generalisation gap (test loss minus train loss) scales with network depth $K$ and training set size $n$: the theoretical bound $\propto \sqrt{P/n} \cdot \prod_k L_k$ is drawn alongside empirical measurements from unrolled OAMP training. Increasing the number of layers illustrates the "depth amplification" effect; increasing the training set size tightens the bound.
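A minimal sketch of how such a bound curve can be tabulated; all constants below (per-layer parameter count, per-layer Lipschitz constant, $\delta$) are assumptions chosen purely for illustration, not values from the text.

    # Minimal sketch: scaling of the generalisation-gap bound with K and n.
    import numpy as np

    def gap_bound(K, n, p_per_layer=5500, L_layer=1.05, L_loss=1.0, delta=0.05, weight_tied=False):
        """Upper bound O(L_loss * prod_k L_k * sqrt(P log(K/delta) / n)) with illustrative constants."""
        P = p_per_layer if weight_tied else K * p_per_layer   # weight tying: P -> P/K
        depth_amp = L_layer ** K                               # product of per-layer Lipschitz constants
        return L_loss * depth_amp * np.sqrt(P * np.log(K / delta) / n)

    for K in (5, 10, 15):
        for n in (1_000, 5_000, 20_000):
            print(f"K={K:2d}  n={n:6d}  untied={gap_bound(K, n):8.3f}  tied={gap_bound(K, n, weight_tied=True):8.3f}")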

Definition:

Convergent Unrolling

A convergent unrolled network is an architecture guaranteed to converge to a fixed point as $K \to \infty$, regardless of the learned parameters. This requires:

  1. Each layer is a contractive mapping: $\|\mathcal{T}_{\theta_k}(\mathbf{a}) - \mathcal{T}_{\theta_k}(\mathbf{b})\| \leq \rho_k \|\mathbf{a} - \mathbf{b}\|$ with $\rho_k < 1$.
  2. The contraction rates satisfy $\prod_{k=1}^\infty \rho_k = 0$ (guaranteed if $\sup_k \rho_k < 1$).

Convergent unrolling is achieved by constraining the learned operators (e.g., spectral normalisation of weight matrices, firmly nonexpansive denoisers).
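As a sketch of the spectral-normalisation option, assuming PyTorch and a simple linear layer standing in for the learned operator: spectral normalisation rescales the weight to unit spectral norm, and a fixed scaling $\rho < 1$ then makes the layer a strict contraction.

    # Minimal sketch: enforcing a per-layer contraction via spectral normalisation.
    import torch
    import torch.nn as nn

    class ContractiveLayer(nn.Module):
        """One unrolled layer constrained to have Lipschitz constant <= rho < 1."""
        def __init__(self, dim: int, rho: float = 0.95):
            super().__init__()
            assert rho < 1.0
            self.rho = rho
            # spectral_norm rescales the weight to unit spectral norm via power iteration
            self.linear = nn.utils.spectral_norm(nn.Linear(dim, dim, bias=False))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.rho * self.linear(x)

    layer = ContractiveLayer(dim=64)
    with torch.no_grad():
        for _ in range(20):                          # let the power-iteration estimate settle
            layer(torch.randn(8, 64))
        W_eff = layer.rho * layer.linear.weight      # effective weight after normalisation
        print("estimated layer Lipschitz constant:", torch.linalg.matrix_norm(W_eff, ord=2).item())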

Convergent unrolling provides a safety net: even if training is imperfect, the network output cannot diverge. This is important for safety-critical applications (medical imaging, autonomous radar) where a divergent reconstruction could have serious consequences.

Theorem: Robustness to Model Mismatch

Let $f_\theta$ be an unrolled network trained with sensing matrix $\mathbf{A}$ and tested with a perturbed matrix $\tilde{\mathbf{A}} = \mathbf{A} + \Delta\mathbf{A}$. If $f_\theta$ is $L_f$-Lipschitz with respect to $\mathbf{A}$, then:

$$\|\hat{\mathbf{c}}_{\tilde{\mathbf{A}}} - \hat{\mathbf{c}}_{\mathbf{A}}\| \leq L_f \cdot \|\Delta\mathbf{A}\| \cdot \|\mathbf{c}\|.$$

For unrolled OAMP with $T$ layers, the Lipschitz constant scales as $L_f = O(T \cdot \|\mathbf{W}_{\text{LE}}\|)$, which is bounded when the LMMSE regularisation is positive.

Small perturbations in the sensing matrix cause proportionally small changes in the reconstruction. The bound degrades linearly with depth $T$, motivating the use of moderate depth (5-15 layers) in practice. Unrolled OAMP is more robust than generic networks because the physics-based LMMSE step provides a structured response to model perturbations.
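A simple way to probe this empirically is to perturb the sensing matrix and measure how much the output moves. The sketch below uses a single regularised LMMSE-style solve as a stand-in for the full unrolled network (an illustrative assumption, not the trained model), and the printed ratio is a crude empirical estimate of $L_f$.

    # Minimal sketch: robustness to model mismatch via random perturbations of A.
    import numpy as np

    rng = np.random.default_rng(2)
    m, n, mu = 50, 100, 0.1
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    c = rng.standard_normal(n)
    y = A @ c

    def reconstruct(y, A):
        # regularised least squares: A^H (A A^H + mu I)^{-1} y
        return A.conj().T @ np.linalg.solve(A @ A.conj().T + mu * np.eye(m), y)

    for eps in (1e-3, 1e-2, 1e-1):
        dA = eps * rng.standard_normal((m, n)) / np.sqrt(m)
        diff = np.linalg.norm(reconstruct(y, A + dA) - reconstruct(y, A))
        bound_scale = np.linalg.norm(dA, 2) * np.linalg.norm(c)
        print(f"||dA||={np.linalg.norm(dA, 2):.3e}  output change={diff:.3e}  ratio={diff / bound_scale:.3f}")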

The Over-Parameterised Regime and PL* Condition

When the unrolled network has many more parameters than training samples ($P \gg n$), classical generalisation theory predicts poor performance. However, recent results show that unrolled networks can interpolate training data while still generalising, provided:

  1. The loss landscape satisfies the Polyak-Łojasiewicz* (PL*) condition: $\|\nabla \mathcal{L}(\theta)\|^2 \geq \mu \cdot \mathcal{L}(\theta)$ for some $\mu > 0$.
  2. The initialisation is close to a good solution (e.g., ISTA/OAMP initialisation).

Under PL*, gradient descent converges to a global minimum at a linear rate, and the implicit regularisation of the algorithm structure selects solutions with good generalisation properties. This helps explain why unrolled OAMP generalises well even with relatively few training samples.
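The flavour of this argument can be seen in the simplest over-parameterised setting. The sketch below (plain least squares with $P \gg n$, purely illustrative and not the unrolled-network case) shows gradient descent driving the training loss to zero at a linear rate, and converging from zero initialisation to the minimum-norm interpolant, a toy stand-in for the implicit regularisation described above.

    # Minimal sketch: interpolation + implicit regularisation in an over-parameterised problem.
    import numpy as np

    rng = np.random.default_rng(3)
    n_samples, P = 20, 200                       # many more parameters than samples
    X = rng.standard_normal((n_samples, P))
    y = rng.standard_normal(n_samples)

    theta = np.zeros(P)
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for t in range(201):
        r = X @ theta - y
        theta -= lr * X.T @ r
        if t % 50 == 0:
            print(f"step {t:3d}  train loss {0.5 * np.dot(r, r):.3e}")

    theta_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # minimum-norm interpolating solution
    print("distance to min-norm solution:", np.linalg.norm(theta - theta_min_norm))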

Example: Sample Complexity --- Unrolled OAMP vs Generic Network

Compare the number of training samples needed for 10-layer unrolled OAMP (55K parameters) and a generic U-Net (1.2M parameters) to achieve NMSE $< -20$ dB on an RF imaging task with $128 \times 128$ images.

Common Mistake: Unrolled Networks Do Not Converge by Default

Mistake:

Assuming that because the underlying OAMP algorithm converges, the unrolled version also converges.

Correction:

Convergence of OAMP relies on specific conditions (divergence-free LE, denoiser properties). The learned ProxNet may violate these, and the truncation to $T$ layers means the network never runs to convergence.

To guarantee convergence:

  1. Constrain the ProxNet to be firmly nonexpansive (spectral normalisation).
  2. Add a convergence penalty: $\mathcal{L}_{\text{conv}} = \|\hat{\mathbf{c}}^{(T)} - \mathcal{T}_{\theta_T}(\hat{\mathbf{c}}^{(T)})\|^2$ (see the sketch after this list).
  3. Verify the state evolution monotonically decreases $\sigma_t^2$ across layers.
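A minimal sketch of item 2, assuming PyTorch and unrolled layers with the (hypothetical) signature layer(c, y); the DummyLayer below is only a stand-in so the snippet runs, not the actual ProxNet.

    # Minimal sketch: adding the fixed-point residual penalty L_conv to the training loss.
    import torch
    import torch.nn as nn

    class DummyLayer(nn.Module):
        """Placeholder unrolled layer: a learned linear update (illustrative stand-in only)."""
        def __init__(self, dim):
            super().__init__()
            self.W = nn.Linear(dim, dim)
        def forward(self, c, y):
            return c + 0.1 * (self.W(y) - c)

    def training_loss(layers, y, c_true, lam_conv=0.1):
        c_hat = torch.zeros_like(c_true)
        for layer in layers:
            c_hat = layer(c_hat, y)                              # T unrolled iterations
        recon = torch.mean((c_hat - c_true) ** 2)                # data-fit term
        residual = layers[-1](c_hat, y) - c_hat                  # c^(T) - T_{theta_T}(c^(T))
        return recon + lam_conv * torch.mean(residual ** 2)      # + L_conv

    layers = nn.ModuleList(DummyLayer(64) for _ in range(5))
    loss = training_loss(layers, torch.randn(8, 64), torch.randn(8, 64))
    loss.backward()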
⚠️ Engineering Note

Gradient Checkpointing for Deep Unrolled Networks

For $T > 10$ layers, storing all intermediate activations for backpropagation requires $O(T \cdot N)$ memory, which may exceed GPU capacity for large images.

Gradient checkpointing stores only every $k$-th activation and recomputes the others during the backward pass. This reduces memory from $O(T \cdot N)$ to $O((T/k + k) \cdot N)$ at the cost of one additional forward pass; choosing $k \approx \sqrt{T}$ gives the familiar $O(\sqrt{T} \cdot N)$ memory footprint.

For unrolled OAMP with $T = 15$, $N = 128^2$, and checkpointing every 3 layers, the memory reduction is $\sim 3\times$ with $\sim 33\%$ additional computation.
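A minimal sketch of how this could be wired up with torch.utils.checkpoint, assuming unrolled layers with the same hypothetical layer(c, y) signature as the earlier sketch and an illustrative segment size of k = 3.

    # Minimal sketch: checkpoint every k unrolled layers so only segment boundaries are stored.
    import torch
    from torch.utils.checkpoint import checkpoint

    def run_unrolled_checkpointed(layers, y, c0, k=3):
        def make_segment(segment):
            def segment_fn(c, y):
                for layer in segment:
                    c = layer(c, y)
                return c
            return segment_fn

        c = c0
        for i in range(0, len(layers), k):
            # activations inside each segment are recomputed during the backward pass
            c = checkpoint(make_segment(layers[i:i + k]), c, y, use_reentrant=False)
        return c

    # usage (illustrative): c_hat = run_unrolled_checkpointed(layers, y, torch.zeros_like(y), k=3)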

Practical Constraints

  • GPU memory for intermediate activations: $O(T \cdot N)$ without checkpointing.
  • Recomputation cost: one additional forward pass per checkpoint segment.

Quick Check

According to the generalisation bound, which modification most effectively reduces the generalisation gap of an unrolled network?

  • Adding more layers (increasing $K$)
  • Weight tying across layers (reducing $P$ to $P/K$)
  • Using a larger training set
  • Using a more expressive denoiser

Restricted Isometry Property (RIP)

A sensing matrix $\mathbf{A}$ satisfies the RIP of order $s$ with constant $\delta_s$ if $(1-\delta_s)\|\mathbf{x}\|^2 \leq \|\mathbf{A}\mathbf{x}\|^2 \leq (1+\delta_s)\|\mathbf{x}\|^2$ for all $s$-sparse vectors $\mathbf{x}$. RIP ensures that sparse recovery algorithms converge.

Convergent Unrolling

An unrolled network architecture that is guaranteed to converge to a fixed point regardless of the learned parameters, achieved by constraining each layer to be a contraction (e.g., via spectral normalisation).

Key Takeaway

Unrolled networks enjoy three types of theoretical guarantees: convergence under RIP (linear decay of error), generalisation bounds scaling as $\sqrt{P/n}$ (improved by weight tying and algorithmic inductive bias), and robustness to model mismatch proportional to the perturbation norm. Convergent unrolling via Lipschitz constraints provides safety guarantees for critical applications.