Connection to LISTA and Learned ADMM
LISTA --- The First Unrolled Network
Learned ISTA (LISTA), introduced by Gregor and LeCun (2010), is the foundational example of algorithm unrolling. It unrolls ISTA for sparse coding into a fixed-depth network with learnable matrices and thresholds. Despite its simplicity, a LISTA network with roughly ten layers matches the accuracy of 100 or more ISTA iterations, establishing the power of the unrolling paradigm.
We study LISTA and Learned ADMM here not as competitors to unrolled OAMP, but to understand the design space: what structural choices matter, and why OAMP's orthogonalisation provides a superior inductive bias for the Kronecker-structured sensing matrices of RF imaging.
Definition: ISTA Iteration (Review)
The Iterative Shrinkage-Thresholding Algorithm (ISTA) for the LASSO problem $\min_x \tfrac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$ iterates:
$$x^{t+1} = \eta_{\lambda/L}\!\left(x^t + \tfrac{1}{L} A^\top (y - A x^t)\right),$$
where $L = \|A\|_2^2$ is the Lipschitz constant of the data-fidelity gradient and $\eta_\theta$ is the soft-thresholding operator: $\eta_\theta(x) = \mathrm{sign}(x)\,\max(|x| - \theta, 0)$.
ISTA converges at rate $O(1/t)$ in objective value. For sparse recovery with random matrices, convergence is linear but slow: the contraction rate depends on the condition number of $A$.
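A minimal NumPy sketch of this iteration; the toy problem sizes and the regularisation weight `lam` are arbitrary choices for illustration, not values from the text.

```python
# Minimal ISTA sketch for the LASSO (illustrative only).
import numpy as np

def soft_threshold(x, theta):
    """Soft-thresholding: eta_theta(x) = sign(x) * max(|x| - theta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def ista(A, y, lam, n_iters=200):
    """Run ISTA with the fixed step size 1/L, where L = ||A||_2^2."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

# Toy usage: recover a sparse vector from noisy Gaussian measurements.
rng = np.random.default_rng(0)
M, N, k = 50, 100, 5
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(M)
x_hat = ista(A, y, lam=0.05)
```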
Definition: Learned ISTA (LISTA)
LISTA replaces the fixed ISTA parameters with learnable ones. Each layer $t$ computes:
$$x^{t+1} = \eta_{\theta_t}\!\left(W^1_t\, y + W^2_t\, x^t\right),$$
where the learnable parameters at layer $t$ are:
- $W^1_t \in \mathbb{R}^{N \times M}$: replaces $\tfrac{1}{L} A^\top$
- $W^2_t \in \mathbb{R}^{N \times N}$: replaces $I - \tfrac{1}{L} A^\top A$
- $\theta_t$: learnable threshold (replaces $\lambda/L$)
At initialisation, $W^1_t$ and $W^2_t$ are set to their ISTA values so the untrained network performs ISTA. Training refines these away from the ISTA initialisation.
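A minimal PyTorch sketch of this parameterisation, initialised at the ISTA operating point. The batched row-vector convention and the class name `LISTA` are implementation choices for this sketch.

```python
# Minimal LISTA sketch: x <- eta_theta(W1 y + W2 x), with W1, W2, theta
# initialised at their ISTA values so the untrained network reproduces ISTA.
import torch
import torch.nn as nn

class LISTA(nn.Module):
    def __init__(self, A: torch.Tensor, lam: float, n_layers: int = 10):
        super().__init__()
        M, N = A.shape
        L = torch.linalg.matrix_norm(A, ord=2).item() ** 2   # Lipschitz constant ||A||_2^2
        W1_init = A.T / L                                     # ISTA value of W1: (1/L) A^T
        W2_init = torch.eye(N) - A.T @ A / L                  # ISTA value of W2: I - (1/L) A^T A
        self.W1 = nn.ParameterList([nn.Parameter(W1_init.clone()) for _ in range(n_layers)])
        self.W2 = nn.ParameterList([nn.Parameter(W2_init.clone()) for _ in range(n_layers)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.tensor(lam / L)) for _ in range(n_layers)])

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, M) measurements; returns (batch, N) sparse estimates.
        x = y.new_zeros(y.shape[0], self.W1[0].shape[0])
        for W1, W2, theta in zip(self.W1, self.W2, self.theta):
            z = y @ W1.T + x @ W2.T                           # W1 y + W2 x (row-vector convention)
            x = torch.sign(z) * torch.relu(z.abs() - theta)   # soft-thresholding eta_theta
        return x
```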
Theorem: LISTA Convergence Acceleration
For $k$-sparse signals and a random Gaussian sensing matrix $A \in \mathbb{R}^{M \times N}$ (with sufficiently many measurements $M$), a $T$-layer LISTA network trained on the signal distribution achieves
$$\|x^T - x^\star\|_2 \le C\,\rho_{\mathrm{LISTA}}^{\,T}, \qquad \text{where } \rho_{\mathrm{LISTA}} < \rho_{\mathrm{ISTA}} \text{ strictly}.$$
Empirically, LISTA with around 10 layers matches the accuracy of ISTA with 100 or more iterations.
ISTA uses the same step size and threshold at every iteration. Early iterations benefit from aggressive thresholding (to quickly identify the support of the sparse signal), while later iterations benefit from gentle thresholding (to refine the nonzero values). LISTA learns this schedule automatically.
Support identification
In the first few layers, the learned thresholds are large, aggressively setting small entries to zero. This rapidly identifies the support of $x^\star$.
Value refinement
Once the support $S$ is approximately identified, the later layers use small thresholds and the weight matrices effectively solve a least-squares problem restricted to the estimated support: $\hat{x}_S = (A_S^\top A_S)^{-1} A_S^\top y$.
Contraction rate
The overall contraction rate is determined by the spectral properties of $W^2_t$ restricted to the support. Because $W^2_t$ is optimised (rather than fixed at $I - \tfrac{1}{L} A^\top A$), it achieves a smaller spectral radius.
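The effect of such a threshold schedule can be mimicked by hand, without any learning, by running ISTA with geometrically decaying thresholds. This is purely illustrative (the decay profile and `start_scale` are arbitrary), but it captures the support-then-refine behaviour described above.

```python
# Illustrative-only sketch: ISTA with a hand-crafted decreasing threshold schedule,
# mimicking what LISTA learns (large thresholds early for support identification,
# small thresholds later for value refinement).
import numpy as np

def ista_scheduled(A, y, lam, n_iters=20, start_scale=5.0):
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    # Geometrically decaying thresholds from start_scale * lam/L down to lam/L.
    scales = np.geomspace(start_scale, 1.0, n_iters)
    for s in scales:
        r = x + A.T @ (y - A @ x) / L                       # gradient step
        x = np.sign(r) * np.maximum(np.abs(r) - s * lam / L, 0.0)   # scheduled threshold
    return x
```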
Interactive figure: LISTA vs ISTA convergence --- normalised reconstruction error versus layer/iteration index. With learned thresholds, LISTA's convergence curve is dramatically steeper than with thresholds fixed at the ISTA value.
Definition: Analytic LISTA (ALISTA)
ALISTA derives the optimal LISTA weight matrix in closed form as a regularised pseudo-inverse of $A$, where the regularisation level is the unique positive solution of a scalar equation involving the expected sparsity. This eliminates the need to learn $W^1_t$ and $W^2_t$: only the thresholds $\theta_t$ are trained, reducing the parameter count by orders of magnitude.
ALISTA shows that the essential quantity learned by LISTA is the threshold schedule --- the weight matrices converge to a regularised pseudo-inverse that can be computed analytically.
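The essential structural point --- fix the weight matrix, learn only the thresholds --- can be sketched as follows. The fixed matrix below is an illustrative Tikhonov-regularised pseudo-inverse with a hand-picked `gamma`, not the ALISTA closed-form construction; the layer form and the class name `ThresholdOnlyLISTA` are assumptions for this sketch.

```python
# Sketch in the spirit of ALISTA: the weight matrix is fixed (here, illustratively,
# W = (A^T A + gamma I)^{-1} A^T, an assumption) and only per-layer thresholds are trained.
import torch
import torch.nn as nn

class ThresholdOnlyLISTA(nn.Module):
    def __init__(self, A: torch.Tensor, gamma: float = 0.1, n_layers: int = 10):
        super().__init__()
        M, N = A.shape
        W = torch.linalg.solve(A.T @ A + gamma * torch.eye(N), A.T)   # fixed, never learned
        self.register_buffer("W", W)                      # (N, M) analytic weight
        self.register_buffer("G", torch.eye(N) - W @ A)   # I - W A
        self.theta = nn.Parameter(0.1 * torch.ones(n_layers))         # only thresholds learned

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        x = y.new_zeros(y.shape[0], self.G.shape[0])
        for t in range(self.theta.shape[0]):
            z = y @ self.W.T + x @ self.G.T               # x + W (y - A x)
            x = torch.sign(z) * torch.relu(z.abs() - self.theta[t])
        return x
```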
Definition: Learned ADMM (L-ADMM)
Learned ADMM parameterises each ADMM iteration with learnable components. Layer $t$ computes:
$$x^{t+1} = \left(A^\top A + \rho_t I\right)^{-1}\!\left(A^\top y + \rho_t (z^t - u^t)\right),$$
$$z^{t+1} = \mathcal{D}_t\!\left(x^{t+1} + u^t\right),$$
$$u^{t+1} = u^t + \eta_t\left(x^{t+1} - z^{t+1}\right),$$
where the $x$-update solves the data-fidelity subproblem, $\mathcal{D}_t$ replaces soft-thresholding with a learned proximal operator (e.g., a small CNN), and $\rho_t$, $\eta_t$ are learnable scalars.
The dual variable $u^t$ accumulates constraint violations, providing cross-layer memory that LISTA lacks. The learned penalty schedules typically decrease: $\rho_t$ is large in early layers (to enforce consensus) and small in later layers (to refine details).
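A minimal sketch of one such layer, following the reconstruction above (regularised least-squares $x$-update, learned proximal $z$-update, scaled dual ascent). The class name `LADMMLayer`, the tiny 1-D CNN proximal operator, and the direct linear solve are illustrative choices, not a reference implementation.

```python
# One Learned-ADMM layer: x-update (data fidelity), z-update (learned prox), u-update (dual).
import torch
import torch.nn as nn

class LADMMLayer(nn.Module):
    def __init__(self, A: torch.Tensor):
        super().__init__()
        self.register_buffer("A", A)
        self.register_buffer("AtA", A.T @ A)
        self.log_rho = nn.Parameter(torch.zeros(()))     # learnable penalty, rho_t = exp(log_rho)
        self.eta = nn.Parameter(torch.ones(()))          # learnable dual step size
        self.prox = nn.Sequential(                       # learned proximal operator (small CNN)
            nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv1d(16, 1, 3, padding=1))

    def forward(self, x, z, u, y):
        # x, z, u: (batch, N); y: (batch, M).
        rho = self.log_rho.exp()
        N = self.AtA.shape[0]
        rhs = y @ self.A + rho * (z - u)                 # rows of A^T y + rho (z - u)
        # Direct solve of (A^T A + rho I) x = rhs; expensive, kept here for clarity.
        x = torch.linalg.solve(self.AtA + rho * torch.eye(N, device=self.AtA.device), rhs.T).T
        z = self.prox((x + u).unsqueeze(1)).squeeze(1)   # learned prox replaces soft-thresholding
        u = u + self.eta * (x - z)                       # dual accumulates constraint violation
        return x, z, u
```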
Definition: Learned Primal-Dual (LPD)
The Learned Primal-Dual network maintains primal iterates $x^t$ (images) and dual iterates $h^t$ (measurements) simultaneously. Each layer performs:
Dual update: $h^{t+1} = \Gamma_t\!\left(h^t,\, A x^t,\, y\right)$
Primal update: $x^{t+1} = \Lambda_t\!\left(x^t,\, A^\top h^{t+1}\right)$
where $\Gamma_t$ and $\Lambda_t$ are small CNNs. The dual-space network reasons about measurement-domain structure (k-space artefacts, missing samples) invisible in the image domain.
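A minimal sketch of one such layer. The residual form of the updates, the channel counts, and the class name `LPDLayer` are assumptions made for this sketch rather than the published architecture.

```python
# One Learned Primal-Dual layer: a dual-domain CNN sees (h, A x, y), a primal-domain
# CNN sees (x, A^T h); both act as learned residual updates.
import torch
import torch.nn as nn

def small_net(in_ch, out_ch):
    return nn.Sequential(nn.Conv1d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv1d(32, out_ch, 3, padding=1))

class LPDLayer(nn.Module):
    def __init__(self, A: torch.Tensor):
        super().__init__()
        self.register_buffer("A", A)
        self.dual_net = small_net(3, 1)     # Gamma_t: measurement-domain update
        self.primal_net = small_net(2, 1)   # Lambda_t: image-domain update

    def forward(self, x, h, y):
        # x: (batch, N) primal; h, y: (batch, M) dual and measurements.
        Ax = x @ self.A.T                                                    # forward model
        h = h + self.dual_net(torch.stack([h, Ax, y], dim=1)).squeeze(1)     # dual update
        Ath = h @ self.A                                                     # adjoint
        x = x + self.primal_net(torch.stack([x, Ath], dim=1)).squeeze(1)     # primal update
        return x, h
```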
Comparison of Unrolled Architectures
| Property | LISTA | L-ADMM | LPD | Unrolled OAMP |
|---|---|---|---|---|
| Base algorithm | ISTA | ADMM | Chambolle-Pock | OAMP |
| Cross-layer memory | None | Dual variable | Multi-channel primal + dual | State evolution |
| Works for structured $A$ | Partially (dense $W$) | Partially | Yes | Yes (designed for it) |
| Exploits Kronecker structure | No | No | No | Yes |
| Convergence theory | ALISTA contraction rate | Firmly nonexpansive | Expressivity subsumption | State evolution (exact) |
| Typical param count (per layer) | dense $W^1_t, W^2_t$ ($NM + N^2$) | scalars $\rho_t, \eta_t$ + prox CNN | CNN (primal + dual) | CNN only (denoiser) |
| Best suited for | i.i.d. Gaussian | General, splitting-friendly | Dual-domain artefacts | Structured/Kronecker |
Theorem: Why OAMP Outperforms ISTA Unrolling for Structured Matrices
Let $A \in \mathbb{R}^{M \times N}$ be a right-unitarily invariant matrix with singular values $\{\sigma_i\}$ and condition number $\kappa = \sigma_{\max}/\sigma_{\min}$. Then:
- The ISTA contraction rate is $\rho_{\mathrm{ISTA}} = 1 - 1/\kappa^2$.
- The OAMP contraction rate (with the optimal LMMSE linear estimator) is $\rho_{\mathrm{OAMP}}$, which depends on the full spectrum $\{\sigma_i\}$ of $A$, not just $\kappa$.
For highly structured matrices (partial DFT, Kronecker products) where $\kappa \gg 1$ but the bulk of singular values are well-conditioned, $\rho_{\mathrm{OAMP}} \ll \rho_{\mathrm{ISTA}}$.
ISTA's step size is dictated by the largest singular value --- one bad eigenvalue slows down the entire algorithm. OAMP's LMMSE step adapts to the full spectrum: it inverts well-conditioned modes aggressively and regularises ill-conditioned modes. This spectral adaptivity is what makes OAMP natural for the structured operators arising in RF imaging.
ISTA contraction
The ISTA iteration matrix is $I - \tfrac{1}{L} A^\top A$ with $L = \sigma_{\max}^2$, whose spectral radius is $1 - \sigma_{\min}^2/\sigma_{\max}^2 = 1 - 1/\kappa^2$, approaching $1$ for large $\kappa$.
OAMP spectral adaptivity
The OAMP linear estimator applies per-singular-value LMMSE weights $\sigma_i/(\sigma_i^2 + \sigma^2/\tau_t^2)$. The effective noise after the LE step involves an average over the entire spectrum $\{\sigma_i\}$, which is dominated by the bulk of the spectrum, not the extremes.
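A small numeric illustration of this argument, assuming the per-mode LMMSE weight $\sigma_i/(\sigma_i^2 + \sigma^2/\tau^2)$ above and a toy spectrum with a single ill-conditioned outlier; the noise and prior variances are arbitrary.

```python
# One bad singular value dictates the ISTA step size, but barely affects the
# per-mode LMMSE weights used by the OAMP linear estimator.
import numpy as np

sigmas = np.concatenate([np.ones(99), [10.0]])   # 99 well-conditioned modes + 1 outlier
kappa = sigmas.max() / sigmas.min()

# ISTA: step 1/L with L = sigma_max^2, so the bulk modes contract at 1 - 1/kappa^2.
ista_rate_bulk = 1 - (sigmas.min() / sigmas.max()) ** 2
print(f"kappa = {kappa:.0f}, ISTA contraction on bulk modes = {ista_rate_bulk:.4f}")

# OAMP/LMMSE: per-mode weights adapt to each sigma_i individually.
noise_var, tau2 = 0.01, 1.0
w = sigmas / (sigmas ** 2 + noise_var / tau2)
per_mode_residual = 1 - w * sigmas               # fraction of each mode left after the LE step
print("residual on bulk modes:   ", per_mode_residual[:99].mean())
print("residual on outlier mode: ", per_mode_residual[-1])
```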
Example: LISTA Parameter Count vs Unrolled OAMP
Compare the parameter counts and per-layer costs of 10-layer LISTA and 10-layer unrolled OAMP for a problem with $M$ measurements and $N$ unknowns (an image with $N$ pixels).
LISTA parameters
Per layer: $W^1_t$ has $NM$ parameters, $W^2_t$ has $N^2$, plus $\theta_t$. Total per layer: $NM + N^2 + 1$ --- far too large at imaging scale.
Even weight-tied LISTA has $NM + N^2$ shared parameters. These dense matrices do not fit in GPU memory for imaging-scale problems.
Unrolled OAMP parameters
The LMMSE step uses $A$ directly (no learnable matrix). Each ProxNet denoiser (5-layer DnCNN, 64 channels) has roughly $10^5$ parameters. Total for 10 layers: roughly $10^6$ parameters.
The per-layer LMMSE cost with Kronecker-FFT structure is $O(N \log N)$ operations, versus $O(NM + N^2)$ for LISTA's dense matrix multiply.
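These counts can be sanity-checked in a few lines. The DnCNN breakdown below ignores batch-norm parameters, and the example's specific $M$ and $N$ (not reproduced here) can be plugged into `lista_params_per_layer` to obtain concrete LISTA totals.

```python
# Back-of-the-envelope parameter counting for the comparison above.
def lista_params_per_layer(M, N):
    # W1 is N x M, W2 is N x N, plus one threshold.
    return N * M + N * N + 1

def dncnn_params(depth=5, channels=64, kernel=3):
    first = kernel * kernel * 1 * channels + channels                    # 1 -> channels
    middle = (depth - 2) * (kernel * kernel * channels * channels + channels)
    last = kernel * kernel * channels * 1 + 1                            # channels -> 1
    return first + middle + last

print("Parameters per ProxNet denoiser:      ", dncnn_params())          # ~1.1e5
print("Parameters for 10 unrolled OAMP layers:", 10 * dncnn_params())    # ~1.1e6
```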
Conclusion
Unrolled OAMP uses roughly 600x fewer parameters and 1000x fewer per-layer FLOPs than LISTA for this imaging-scale problem. This is the practical consequence of encoding the forward-model structure directly in the architecture.
Historical Note: The Birth of Algorithm Unrolling: LISTA (2010)
2010: Karol Gregor and Yann LeCun introduced LISTA at ICML 2010, motivated by the need for fast sparse coding in computer vision. At the time, sparse coding required hundreds of ISTA iterations per image patch. LISTA showed that 10 learned layers could match 100+ ISTA iterations, reducing inference time by an order of magnitude.
The paper's insight --- that iterative algorithms can be viewed as recurrent networks with tied weights, and untying them yields better performance --- launched the algorithm unrolling field. Within a decade, unrolled networks became the dominant approach for model-based deep learning in signal processing and imaging.
Historical Note: ADMM-Net and the Imaging Connection
2016--2018: Yang et al. (2016) applied ADMM unrolling to MRI reconstruction, demonstrating that the ADMM splitting structure (data fidelity + regulariser + dual variable) provides a richer architecture than ISTA unrolling. Chang et al. (2017) extended this to general compressed sensing with ADMM-CSNet. Adler and Öktem (2018) introduced Learned Primal-Dual for CT, adding dual-domain processing.
These works established that the choice of base algorithm matters: different algorithms provide different inductive biases, and matching the algorithm to the problem structure is essential. For RF imaging with structured sensing matrices, OAMP provides the best match.
Why This Matters: Unrolled OAMP for MIMO Radar Imaging
MIMO radar naturally produces Kronecker-structured sensing matrices $A = A_{\mathrm{tx}} \otimes A_{\mathrm{rx}}$, where $A_{\mathrm{tx}}$ and $A_{\mathrm{rx}}$ are the transmit and receive steering matrices.
Unrolled OAMP-ProxNet with Kronecker structure exploits this separability to achieve near-optimal reconstruction with per-layer computational cost of roughly $O(N \log N)$ rather than $O(N^2)$ (see the sketch after the list below). Key advantages over MFU-Net for MIMO radar:
- The forward model is used at every layer, not just once.
- The state evolution provides a noise schedule for the ProxNet denoiser.
- Theoretical guarantees from OAMP state evolution transfer to the unrolled network (under the Gaussianity assumption).
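A sketch of why the Kronecker structure is cheap to exploit, using the identity $(B \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(C X B^\top)$: the forward operator is applied with two small matrix products and the dense Kronecker matrix is never materialised. The array and grid sizes below are placeholders, not the dimensions of any particular system.

```python
# Applying A = A_tx (x) A_rx to vec(X) without forming the Kronecker product.
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx, n_az, n_el = 8, 16, 32, 32           # placeholder array / angle-grid sizes
A_tx = rng.standard_normal((n_tx, n_az))         # transmit steering matrix (assumed sizes)
A_rx = rng.standard_normal((n_rx, n_el))         # receive steering matrix
X = rng.standard_normal((n_el, n_az))            # scene reflectivity on the angle grid

# Dense application: builds A_tx (x) A_rx explicitly -- O(n_tx*n_rx*n_az*n_el) memory/work.
y_dense = np.kron(A_tx, A_rx) @ X.flatten(order="F")

# Structured application: two small products via (B (x) C) vec(X) = vec(C X B^T).
y_fast = (A_rx @ X @ A_tx.T).flatten(order="F")

assert np.allclose(y_dense, y_fast)
```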
Common Mistake: LISTA's Dense Matrices Are Prohibitive for Imaging
Mistake:
Applying LISTA with full dense matrices to high-resolution imaging problems (large $N$).
Correction:
For imaging-scale $N$, each $W^2_t$ has $N^2$ entries --- far too large for GPU memory. Alternatives:
- Structured LISTA: parameterise $W^1_t$, $W^2_t$ as sparse or low-rank.
- Convolutional LISTA: replace dense matrices with convolutional operators.
- ALISTA: fix the weight matrices analytically, learn only the thresholds $\theta_t$.
- Use OAMP instead: the LMMSE step exploits the structure of $A$ without any dense learned matrix.
Common Mistake: More Layers Is Not Always Better
Mistake:
Increasing the number of unrolled layers indefinitely, assuming that more layers always improve reconstruction quality.
Correction:
Beyond a certain depth $T$, additional layers provide diminishing returns and introduce two problems:
- Vanishing gradients: the backward pass through $T$ layers involves $T$ Jacobian multiplications that can shrink gradients.
- Overfitting: more parameters with limited training data.
In practice, a modest number of layers (around 10, as in the examples above) suffices. Weight tying and intermediate supervision help mitigate gradient issues.
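One common mitigation, intermediate (deep) supervision, can be sketched as a weighted per-layer loss. The assumption here is that `model(y)` returns the list of per-layer iterates; the geometric weighting `decay` is an illustrative choice.

```python
# Deep supervision for an unrolled network: penalise the reconstruction error at
# every layer, not just the last, so gradients reach early layers directly.
import torch

def deep_supervision_loss(model, y, x_true, decay=0.8):
    """Weighted sum of per-layer MSE losses; the final layer gets weight 1.0."""
    iterates = model(y)                                    # assumed: list of per-layer estimates
    T = len(iterates)
    weights = [decay ** (T - 1 - t) for t in range(T)]     # later layers weighted more heavily
    return sum(w * torch.mean((x_t - x_true) ** 2) for w, x_t in zip(weights, iterates))
```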
Quick Check
How should the LISTA weight matrices be initialised?
Random Gaussian initialisation
Identity matrix
The ISTA matrices $\tfrac{1}{L} A^\top$ and $I - \tfrac{1}{L} A^\top A$
Zero matrix
Initialising at the ISTA operating point ensures the untrained network already performs ISTA. Training then refines the parameters, so the learned network can only match or improve upon the ISTA baseline.
Quick Check
What role does the dual variable play in Learned ADMM?
It stores the previous reconstruction for momentum
It accumulates constraint violations, providing memory across layers
It normalises activations between layers
It controls the learning rate during training
The dual update accumulates the difference between the data-fidelity and regularised solutions. This acts as a running penalty driving the two toward agreement.
LISTA (Learned ISTA)
An unrolled version of ISTA where the step-size matrix, measurement matrix, and threshold at each layer become learnable parameters, trained end-to-end via backpropagation.
Related: ALISTA (Analytic LISTA)
ALISTA (Analytic LISTA)
A variant of LISTA where the weight matrices are computed in closed form (as a regularised pseudo-inverse), and only the thresholds are learned, drastically reducing the parameter count.
Related: LISTA (Learned ISTA)
Key Takeaway
LISTA, L-ADMM, and LPD are important unrolled architectures, but for RF imaging with structured sensing matrices, unrolled OAMP is superior: it exploits Kronecker structure for low per-layer cost (roughly $O(N \log N)$), uses orders of magnitude fewer parameters than LISTA, and its LMMSE step adapts to the full spectrum of $A$ rather than being limited by the condition number $\kappa$.