Connection to LISTA and Learned ADMM

LISTA --- The First Unrolled Network

Learned ISTA (LISTA), introduced by Gregor and LeCun (2010), is the foundational example of algorithm unrolling. It unrolls ISTA for sparse coding into a fixed-depth network with learnable matrices and thresholds. Despite its simplicity, LISTA reaches a given accuracy with roughly an order of magnitude fewer layers than ISTA needs iterations, establishing the power of the unrolling paradigm.

We study LISTA and Learned ADMM here not as competitors to unrolled OAMP, but to understand the design space: what structural choices matter, and why OAMP's orthogonalisation provides a superior inductive bias for the Kronecker-structured sensing matrices of RF imaging.

Definition:

ISTA Iteration (Review)

The Iterative Shrinkage-Thresholding Algorithm (ISTA) for the LASSO problem $\min_{\mathbf{c}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{c}\|^2 + \lambda\|\mathbf{c}\|_1$ iterates:

$$\hat{\mathbf{c}}^{(k+1)} = \mathcal{S}_{\lambda/L}\!\bigl(\hat{\mathbf{c}}^{(k)} + \tfrac{1}{L}\mathbf{A}^{H}(\mathbf{y} - \mathbf{A}\hat{\mathbf{c}}^{(k)})\bigr)$$

where $L = \|\mathbf{A}^{H}\mathbf{A}\|$ is the Lipschitz constant and $\mathcal{S}_\tau$ is the soft-thresholding operator: $[\mathcal{S}_\tau(\mathbf{z})]_i = \operatorname{sign}(z_i)\max(|z_i| - \tau, 0)$.

ISTA converges at rate $O(1/k)$ in objective value. For sparse recovery with random matrices, convergence of the iterates is eventually linear but slow: the contraction rate depends on the condition number of $\mathbf{A}^{H}\mathbf{A}$.
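A minimal NumPy sketch of this iteration (the sensing matrix, data, and iteration count are illustrative placeholders):

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding: [S_tau(z)]_i = sign(z_i) * max(|z_i| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, n_iters=100):
    """ISTA for min_c 0.5 * ||y - A c||^2 + lam * ||c||_1."""
    L = np.linalg.norm(A.conj().T @ A, 2)      # Lipschitz constant ||A^H A||
    c = np.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(n_iters):
        grad_step = c + A.conj().T @ (y - A @ c) / L   # gradient step
        c = soft_threshold(grad_step, lam / L)         # shrinkage step
    return c
```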

Definition:

Learned ISTA (LISTA)

LISTA replaces the fixed ISTA parameters with learnable ones. Each layer $k$ computes:

$$\hat{\mathbf{c}}^{(k+1)} = \mathcal{S}_{\tau_k}\!\bigl(\mathbf{W}_k \hat{\mathbf{c}}^{(k)} + \mathbf{S}_k \mathbf{y}\bigr)$$

where the learnable parameters at layer $k$ are:

  • $\mathbf{W}_k \in \mathbb{R}^{N \times N}$: replaces $\mathbf{I} - \frac{1}{L}\mathbf{A}^{H}\mathbf{A}$
  • $\mathbf{S}_k \in \mathbb{R}^{N \times M}$: replaces $\frac{1}{L}\mathbf{A}^{H}$
  • $\tau_k \in \mathbb{R}_+$: learnable threshold (replaces $\lambda/L$)

At initialisation, $\mathbf{W}_k$ and $\mathbf{S}_k$ are set to their ISTA values so the untrained network performs ISTA. Training refines these away from the ISTA initialisation.
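A PyTorch sketch of a LISTA network with this initialisation, assuming a real-valued $\mathbf{A}$; the class name and defaults are illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """K-layer LISTA with learnable W_k, S_k, tau_k, initialised to ISTA."""

    def __init__(self, A, lam, n_layers=10):
        super().__init__()
        M, N = A.shape
        L = torch.linalg.matrix_norm(A.T @ A, ord=2)     # Lipschitz constant
        W0 = torch.eye(N) - (A.T @ A) / L                # ISTA value of W_k
        S0 = A.T / L                                     # ISTA value of S_k
        self.W = nn.ParameterList([nn.Parameter(W0.clone()) for _ in range(n_layers)])
        self.S = nn.ParameterList([nn.Parameter(S0.clone()) for _ in range(n_layers)])
        self.tau = nn.Parameter(torch.full((n_layers,), float(lam / L)))

    def forward(self, y):                                # y: (batch, M)
        c = y.new_zeros(y.shape[0], self.W[0].shape[0])
        for W, S, tau in zip(self.W, self.S, self.tau):
            z = c @ W.T + y @ S.T
            c = torch.sign(z) * torch.relu(z.abs() - tau)   # soft-thresholding
        return c
```

Before training, this network reproduces ISTA exactly; backpropagation then unties the layers.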

Theorem: LISTA Convergence Acceleration

For sparse signals with sparsity $s$ and a random Gaussian sensing matrix $\mathbf{A} \in \mathbb{R}^{M \times N}$ with $M = O(s \log N)$, a $K$-layer LISTA network trained on the signal distribution achieves:

$$\|\hat{\mathbf{c}}^{(K)} - \mathbf{c}^*\|_2 \leq C \cdot \rho_{\text{LISTA}}^K \cdot \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|_2$$

where $\rho_{\text{LISTA}} < \rho_{\text{ISTA}}$ strictly. Empirically, LISTA with $K = 10$ layers matches the accuracy of ISTA with $K > 100$ iterations.

ISTA uses the same step size and threshold at every iteration. Early iterations benefit from aggressive thresholding (to quickly identify the support of the sparse signal), while later iterations benefit from gentle thresholding (to refine the nonzero values). LISTA learns this schedule automatically.

LISTA vs ISTA Convergence

Compare the convergence of LISTA and ISTA as a function of layers/iterations. The plot shows the normalised reconstruction error versus layer/iteration index.

Toggle "Learn Thresholds" to see the effect of keeping thresholds fixed at the ISTA value versus learning them. With learned thresholds, LISTA achieves a dramatically steeper convergence curve.


Definition:

Analytic LISTA (ALISTA)

ALISTA derives the optimal LISTA weight matrix in closed form:

$$\mathbf{W}^* = \mathbf{I} - \mathbf{A}^{H}(\mathbf{A}\mathbf{A}^{H} + \mu^*\mathbf{I})^{-1}\mathbf{A}$$

where $\mu^*$ is the unique positive solution of:

$$\frac{1}{N}\operatorname{tr}\bigl[(\mathbf{G} + \mu\mathbf{I})^{-2}\mathbf{G}\bigr] = \frac{1}{N - s}$$

with $\mathbf{G} = \mathbf{A}^{H}\mathbf{A}$ and $s$ the expected sparsity. This eliminates the need to learn $\mathbf{W}_k$ and $\mathbf{S}_k$: only the thresholds $\tau_k$ are trained, reducing the parameter count by orders of magnitude.

ALISTA shows that the essential quantity learned by LISTA is the threshold schedule --- the weight matrices converge to a regularised pseudo-inverse that can be computed analytically.
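A sketch of how the closed form above could be computed, assuming $\mu^*$ is bracketed by the bisection interval; the function name and tolerances are illustrative:

```python
import numpy as np

def alista_weight(A, s, mu_lo=1e-8, mu_hi=1e8, n_bisect=200):
    """Find mu* by bisection on (1/N) tr[(G + mu I)^{-2} G] = 1/(N - s),
    then form W* = I - A^H (A A^H + mu* I)^{-1} A."""
    M, N = A.shape
    g = np.clip(np.linalg.eigvalsh(A.conj().T @ A), 0.0, None)  # eigenvalues of G

    def f(mu):                                  # (1/N) tr[(G + mu I)^{-2} G]
        return np.sum(g / (g + mu) ** 2) / N

    target = 1.0 / (N - s)
    for _ in range(n_bisect):                   # f is decreasing in mu
        mu = 0.5 * (mu_lo + mu_hi)
        mu_lo, mu_hi = (mu, mu_hi) if f(mu) > target else (mu_lo, mu)
    mu_star = 0.5 * (mu_lo + mu_hi)

    AAH = A @ A.conj().T
    W = np.eye(N) - A.conj().T @ np.linalg.solve(AAH + mu_star * np.eye(M), A)
    return W, mu_star
```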

Definition:

Learned ADMM (L-ADMM)

Learned ADMM parameterises each ADMM iteration with learnable components. Layer $k$ computes:

$$\hat{\mathbf{c}}^{(k+1)} = \mathcal{F}_{\theta_k^x}\!\bigl(\mathbf{y},\; \mathbf{z}^{(k)},\; \mathbf{u}^{(k)}\bigr)$$

$$\mathbf{z}^{(k+1)} = \mathcal{G}_{\theta_k^z}\!\bigl(\hat{\mathbf{c}}^{(k+1)} + \mathbf{u}^{(k)}\bigr)$$

$$\mathbf{u}^{(k+1)} = \mathbf{u}^{(k)} + \eta_k\bigl(\hat{\mathbf{c}}^{(k+1)} - \mathbf{z}^{(k+1)}\bigr)$$

where $\mathcal{F}_{\theta_k^x}$ solves the data-fidelity subproblem (whose penalty weight is $\rho_k$), $\mathcal{G}_{\theta_k^z}$ replaces soft-thresholding with a learned proximal operator (e.g., a small CNN), and $\rho_k$, $\eta_k$ are learnable scalars.

The dual variable $\mathbf{u}^{(k)}$ accumulates constraint violations, providing cross-layer memory that LISTA lacks. The learned penalty schedule $\rho_k$ typically decreases across layers: large early (to enforce consensus), small later (to refine details).
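A structural sketch of one such layer, with `prox_net` standing in for the learned proximal operator $\mathcal{G}_{\theta_k^z}$ and the x-update written for the standard quadratic data-fidelity subproblem (an assumption about the form of $\mathcal{F}_{\theta_k^x}$):

```python
import numpy as np

def ladmm_layer(A, y, z, u, rho, eta, prox_net):
    """One Learned-ADMM layer (structural sketch).

    The x-update solves min_c 0.5||y - A c||^2 + (rho/2)||c - (z - u)||^2
    in closed form; rho and eta are the learnable scalars."""
    N = A.shape[1]
    rhs = A.conj().T @ y + rho * (z - u)
    c = np.linalg.solve(A.conj().T @ A + rho * np.eye(N), rhs)  # x-update
    z = prox_net(c + u)        # learned prox replaces soft-thresholding
    u = u + eta * (c - z)      # dual update: accumulates c - z violations
    return c, z, u
```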

Definition:

Learned Primal-Dual (LPD)

The Learned Primal-Dual network maintains primal iterates $\{\mathbf{f}^{(k)}\}$ (images) and dual iterates $\{\mathbf{h}^{(k)}\}$ (measurements) simultaneously. Each layer $k$ performs:

Dual update: $\mathbf{h}^{(k+1)} = \Gamma_{\theta_k^d}\!\bigl(\mathbf{h}^{(k)},\; \mathbf{A}\,\mathbf{f}_1^{(k)},\; \mathbf{y}\bigr)$

Primal update: $\mathbf{f}^{(k+1)} = \Lambda_{\theta_k^p}\!\bigl(\mathbf{f}^{(k)},\; \mathbf{A}^{H}\,\mathbf{h}_1^{(k+1)}\bigr)$

where $\Gamma_{\theta_k^d}$ and $\Lambda_{\theta_k^p}$ are small CNNs, and the subscript in $\mathbf{f}_1$, $\mathbf{h}_1$ selects the first channel of the multi-channel primal and dual iterates. The dual-space network reasons about measurement-domain structure (k-space artefacts, missing samples) invisible in the image domain.
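A structural sketch of one LPD layer, with `dual_net` and `primal_net` as placeholders for the CNNs $\Gamma_{\theta_k^d}$ and $\Lambda_{\theta_k^p}$:

```python
def lpd_layer(A, y, f, h, dual_net, primal_net):
    """One Learned Primal-Dual layer (structural sketch).

    f: stack of primal channels (image space); h: stack of dual channels
    (measurement space). Only the first channel is passed through the
    forward / adjoint operator."""
    h = dual_net(h, A @ f[0], y)            # dual update, measurement domain
    f = primal_net(f, A.conj().T @ h[0])    # primal update, image domain
    return f, h
```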

Comparison of Unrolled Architectures

| Property | LISTA | L-ADMM | LPD | Unrolled OAMP |
| --- | --- | --- | --- | --- |
| Base algorithm | ISTA | ADMM | Chambolle-Pock | OAMP |
| Cross-layer memory | None | Dual variable $\mathbf{u}$ | Multi-channel primal + dual | State evolution $\sigma_t^2$ |
| Works for structured $\mathbf{A}$ | Partially (dense $\mathbf{W}_k$) | Partially | Yes | Yes (designed for it) |
| Exploits Kronecker structure | No | No | No | Yes ($O(N\log N)$) |
| Convergence theory | ALISTA contraction rate | Firmly nonexpansive | Expressivity subsumption | State evolution (exact) |
| Typical parameter count (per layer) | $O(N^2)$ | $O(N^2)$ or CNN | CNN (primal + dual) | CNN only ($\sim 10^4$) |
| Best suited for | i.i.d. Gaussian $\mathbf{A}$ | General, splitting-friendly | Dual-domain artefacts | Structured/Kronecker $\mathbf{A}$ |

Theorem: Why OAMP Outperforms ISTA Unrolling for Structured Matrices

Let $\mathbf{A}$ be a right-unitarily invariant matrix with singular values $\{\sigma_i\}$ and condition number $\kappa = \sigma_{\max}/\sigma_{\min}$. Then:

  1. The ISTA contraction rate is $\rho_{\text{ISTA}} = 1 - 2/(\kappa + 1)$.
  2. The OAMP contraction rate (with optimal LMMSE) is $\rho_{\text{OAMP}} = 1 - 1/\text{mmse}(\sigma_t^2)$, which depends on the full spectrum of $\mathbf{A}$, not just $\sigma_{\max}$.

For highly structured matrices (partial DFT, Kronecker products) where $\kappa \gg 1$ but the bulk of singular values are well-conditioned, $\rho_{\text{OAMP}} \ll \rho_{\text{ISTA}}$.

ISTA's step size $1/L = 1/\sigma_{\max}^2$ is dictated by the largest singular value --- one bad eigenvalue slows down the entire algorithm. OAMP's LMMSE step adapts to the full spectrum: it inverts well-conditioned modes aggressively and regularises ill-conditioned modes. This spectral adaptivity is what makes OAMP natural for the structured operators arising in RF imaging.
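A small illustration of this spectral adaptivity, assuming a unit-variance signal prior so the LMMSE gain on mode $i$ is $\sigma_i/(\sigma_i^2 + \sigma_n^2)$; the numbers are made up:

```python
import numpy as np

def per_mode_gains(singular_values, noise_var):
    """Contrast ISTA's uniform step with the LMMSE per-mode gains."""
    s = np.asarray(singular_values, dtype=float)
    ista_gain = s / s.max() ** 2            # uniform 1/L step scales every mode
    lmmse_gain = s / (s ** 2 + noise_var)   # spectrum-adaptive LMMSE filter
    return ista_gain, lmmse_gain

# One bad singular value drags every ISTA gain down; LMMSE is unaffected.
s = np.array([10.0, 1.0, 1.0, 1.0])         # sigma_max = 10 >> bulk
print(per_mode_gains(s, noise_var=0.1))
```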

Example: LISTA Parameter Count vs Unrolled OAMP

Compare the parameter counts and per-layer costs of 10-layer LISTA and 10-layer unrolled OAMP for a problem with $M = 4096$ measurements and $N = 16384$ unknowns (a $128 \times 128$ image).
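A back-of-the-envelope sketch, taking the $\sim 10^4$ CNN parameter count per layer from the comparison table above:

```python
M, N, K = 4096, 16384, 10                   # measurements, unknowns, layers

lista_per_layer = N * N + N * M + 1         # W_k (N x N), S_k (N x M), tau_k
oamp_per_layer = 10_000                     # small ProxNet CNN (order of
                                            # magnitude from the table)

print(f"10-layer LISTA : {K * lista_per_layer:,} parameters")   # ~3.4 billion
print(f"10-layer OAMP  : {K * oamp_per_layer:,} parameters")    # ~100 thousand
```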

Historical Note: The Birth of Algorithm Unrolling: LISTA (2010)

2010s

Karol Gregor and Yann LeCun introduced LISTA at ICML 2010, motivated by the need for fast sparse coding in computer vision. At the time, sparse coding required hundreds of ISTA iterations per image patch. LISTA showed that 10 learned layers could match 100+ ISTA iterations, reducing inference time by an order of magnitude.

The paper's insight --- that iterative algorithms can be viewed as recurrent networks with tied weights, and untying them yields better performance --- launched the algorithm unrolling field. Within a decade, unrolled networks became the dominant approach for model-based deep learning in signal processing and imaging.

Historical Note: ADMM-Net and the Imaging Connection

2016--2018

Yang et al. (2016) applied ADMM unrolling to MRI reconstruction, demonstrating that the ADMM splitting structure (data-fidelity + regulariser + dual variable) provides a richer architecture than ISTA unrolling. Chang et al. (2017) extended this to general compressed sensing with ADMM-CSNet. Adler and Öktem (2018) introduced Learned Primal-Dual for CT, adding dual-domain processing.

These works established that the choice of base algorithm matters: different algorithms provide different inductive biases, and matching the algorithm to the problem structure is essential. For RF imaging with structured sensing matrices, OAMP provides the best match.


Why This Matters: Unrolled OAMP for MIMO Radar Imaging

MIMO radar naturally produces Kronecker-structured sensing matrices: $\mathbf{A} = \mathbf{A}_{\text{tx}} \otimes \mathbf{A}_{\text{rx}}$, where $\mathbf{A}_{\text{tx}}$ and $\mathbf{A}_{\text{rx}}$ are the transmit and receive steering matrices.

Unrolled OAMP-ProxNet with Kronecker structure exploits this separability (see the sketch after the list below) to achieve near-optimal reconstruction with computational cost scaling as $O(N\log N)$ per layer rather than $O(N^2)$. Key advantages over MF→U-Net for MIMO radar:

  1. The forward model is used at every layer, not just once.
  2. The state evolution provides a noise schedule for the ProxNet denoiser.
  3. Theoretical guarantees from OAMP state evolution transfer to the unrolled network (under the Gaussianity assumption).
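One standard way to realise the separability (function name illustrative): apply the Kronecker operator via the identity $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{C}) = \mathrm{vec}(\mathbf{B}\mathbf{C}\mathbf{A}^{T})$, so the cost is two small matrix products instead of one huge dense matvec; when the factors themselves admit fast (FFT-type) transforms, the cost drops further toward the $O(N\log N)$ figure quoted above:

```python
import numpy as np

def kron_matvec(A_tx, A_rx, c):
    """Apply (A_tx ⊗ A_rx) to c without forming the Kronecker product,
    using (A ⊗ B) vec(C) = vec(B C A^T) with column-major vec."""
    C = c.reshape(A_rx.shape[1], A_tx.shape[1], order="F")
    return (A_rx @ C @ A_tx.T).reshape(-1, order="F")

# Quick self-check against the explicit Kronecker product:
A_tx, A_rx = np.random.randn(4, 6), np.random.randn(5, 7)
c = np.random.randn(6 * 7)
assert np.allclose(kron_matvec(A_tx, A_rx, c), np.kron(A_tx, A_rx) @ c)
```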

Common Mistake: LISTA's Dense Matrices Are Prohibitive for Imaging

Mistake:

Applying LISTA with full dense matrices $\mathbf{W}_k \in \mathbb{R}^{N \times N}$ to high-resolution imaging problems ($N > 10^4$).

Correction:

For $N = 10^4$, each $\mathbf{W}_k$ has $10^8$ entries --- far too large for GPU memory. Alternatives:

  1. Structured LISTA: parameterise $\mathbf{W}_k$ as sparse or low-rank.
  2. Convolutional LISTA: replace dense matrices with convolutional operators.
  3. ALISTA: fix $\mathbf{W}_k$ analytically, learn only $\tau_k$.
  4. Use OAMP instead: the LMMSE step exploits the structure of $\mathbf{A}$ without any dense learned matrix.

Common Mistake: More Layers Is Not Always Better

Mistake:

Increasing the number of unrolled layers $K$ indefinitely, assuming that more layers always improve reconstruction quality.

Correction:

Beyond a certain $K$, additional layers provide diminishing returns and introduce two problems:

  1. Vanishing gradients: the backward pass through $K$ layers involves $K$ Jacobian multiplications that can shrink gradients.
  2. Overfitting: more parameters with limited training data.

In practice, $K = 5$--$20$ layers suffice. Weight tying and intermediate supervision help mitigate gradient issues; one common form of the latter is sketched below.
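A sketch of intermediate (deep) supervision, assuming the unrolled network exposes its per-layer outputs; the function and its weighting scheme are illustrative:

```python
import torch

def deep_supervision_loss(layer_outputs, target, decay=0.5):
    """Supervise every layer's output so each layer receives a direct
    gradient signal; early layers are down-weighted by `decay` per layer."""
    K = len(layer_outputs)
    return sum(decay ** (K - 1 - k) * torch.mean((c - target) ** 2)
               for k, c in enumerate(layer_outputs))
```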

Quick Check

How should the LISTA weight matrices $\mathbf{W}_k$ be initialised?

  • Random Gaussian initialisation
  • Identity matrix
  • The ISTA matrix $\mathbf{I} - \frac{1}{L}\mathbf{A}^{H}\mathbf{A}$
  • Zero matrix

Quick Check

What role does the dual variable $\mathbf{u}^{(k)}$ play in Learned ADMM?

  • It stores the previous reconstruction for momentum
  • It accumulates constraint violations, providing memory across layers
  • It normalises activations between layers
  • It controls the learning rate during training

LISTA (Learned ISTA)

An unrolled version of ISTA where the step-size matrix, measurement matrix, and threshold at each layer become learnable parameters, trained end-to-end via backpropagation.

Related: ALISTA (Analytic LISTA)

ALISTA (Analytic LISTA)

A variant of LISTA where the weight matrices are computed in closed form (as a regularised pseudo-inverse), and only the thresholds are learned, drastically reducing the parameter count.

Related: LISTA (Learned ISTA)

Key Takeaway

LISTA, L-ADMM, and LPD are important unrolled architectures, but for RF imaging with structured sensing matrices, unrolled OAMP is superior: it exploits Kronecker structure for $O(N\log N)$ per-layer cost, uses orders of magnitude fewer parameters than LISTA, and the LMMSE step adapts to the full spectrum of $\mathbf{A}$ rather than being limited by the condition number.