Connection to LISTA and Learned ADMM

LISTA --- The First Unrolled Network

Learned ISTA (LISTA), introduced by Gregor and LeCun (2010), is the foundational example of algorithm unrolling. It unrolls ISTA for sparse coding into a fixed-depth network with learnable matrices and thresholds. Despite its simplicity, LISTA reaches a given accuracy with roughly an order of magnitude fewer layers than ISTA needs iterations, establishing the power of the unrolling paradigm.

We study LISTA and Learned ADMM here not as competitors to unrolled OAMP, but to understand the design space: what structural choices matter, and why OAMP's orthogonalisation provides a superior inductive bias for the Kronecker-structured sensing matrices of RF imaging.

Definition:

ISTA Iteration (Review)

The Iterative Shrinkage-Thresholding Algorithm (ISTA) for the LASSO problem $\min_{\mathbf{c}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{c}\|^2 + \lambda\|\mathbf{c}\|_1$ iterates:

$$\hat{\mathbf{c}}^{(k+1)} = \mathcal{S}_{\lambda/L}\!\bigl(\hat{\mathbf{c}}^{(k)} + \tfrac{1}{L}\mathbf{A}^{H}(\mathbf{y} - \mathbf{A}\hat{\mathbf{c}}^{(k)})\bigr)$$

where $L = \|\mathbf{A}^{H}\mathbf{A}\|$ is the Lipschitz constant and $\mathcal{S}_\tau$ is the soft-thresholding operator: $[\mathcal{S}_\tau(\mathbf{z})]_i = \operatorname{sign}(z_i)\max(|z_i| - \tau, 0)$.

ISTA converges at rate $O(1/k)$ in objective value. For sparse recovery with random matrices, convergence of the iterates is eventually linear but slow: the contraction rate depends on the condition number of $\mathbf{A}^{H}\mathbf{A}$.
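A minimal NumPy sketch of this iteration (the sensing matrix, data, and iteration count are illustrative placeholders):

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding: [S_tau(z)]_i = sign(z_i) * max(|z_i| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, n_iters=100):
    """ISTA for min_c 0.5 * ||y - A c||^2 + lam * ||c||_1."""
    L = np.linalg.norm(A.conj().T @ A, 2)      # Lipschitz constant ||A^H A||
    c = np.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(n_iters):
        grad_step = c + A.conj().T @ (y - A @ c) / L   # gradient step
        c = soft_threshold(grad_step, lam / L)         # shrinkage step
    return c
```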

Definition:

Learned ISTA (LISTA)

LISTA replaces the fixed ISTA parameters with learnable ones. Each layer $k$ computes:

$$\hat{\mathbf{c}}^{(k+1)} = \mathcal{S}_{\tau_k}\!\bigl(\mathbf{W}_k \hat{\mathbf{c}}^{(k)} + \mathbf{S}_k \mathbf{y}\bigr)$$

where the learnable parameters at layer $k$ are:

  • $\mathbf{W}_k \in \mathbb{R}^{N \times N}$: replaces $\mathbf{I} - \frac{1}{L}\mathbf{A}^{H}\mathbf{A}$
  • $\mathbf{S}_k \in \mathbb{R}^{N \times M}$: replaces $\frac{1}{L}\mathbf{A}^{H}$
  • $\tau_k \in \mathbb{R}_+$: learnable threshold (replaces $\lambda/L$)

At initialisation, $\mathbf{W}_k$ and $\mathbf{S}_k$ are set to their ISTA values so the untrained network performs ISTA. Training refines these away from the ISTA initialisation.
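A PyTorch sketch of a LISTA network with this initialisation, assuming a real-valued $\mathbf{A}$; the class name and defaults are illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """K-layer LISTA with learnable W_k, S_k, tau_k, initialised to ISTA."""

    def __init__(self, A, lam, n_layers=10):
        super().__init__()
        M, N = A.shape
        L = torch.linalg.matrix_norm(A.T @ A, ord=2)     # Lipschitz constant
        W0 = torch.eye(N) - (A.T @ A) / L                # ISTA value of W_k
        S0 = A.T / L                                     # ISTA value of S_k
        self.W = nn.ParameterList([nn.Parameter(W0.clone()) for _ in range(n_layers)])
        self.S = nn.ParameterList([nn.Parameter(S0.clone()) for _ in range(n_layers)])
        self.tau = nn.Parameter(torch.full((n_layers,), float(lam / L)))

    def forward(self, y):                                # y: (batch, M)
        c = y.new_zeros(y.shape[0], self.W[0].shape[0])
        for W, S, tau in zip(self.W, self.S, self.tau):
            z = c @ W.T + y @ S.T
            c = torch.sign(z) * torch.relu(z.abs() - tau)   # soft-thresholding
        return c
```

Before training, this network reproduces ISTA exactly; backpropagation then unties the layers.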

Theorem: LISTA Convergence Acceleration

For sparse signals with sparsity $s$ and a random Gaussian sensing matrix $\mathbf{A} \in \mathbb{R}^{M \times N}$ with $M = O(s \log N)$, a $K$-layer LISTA network trained on the signal distribution achieves:

$$\|\hat{\mathbf{c}}^{(K)} - \mathbf{c}^*\|_2 \leq C \cdot \rho_{\text{LISTA}}^K \cdot \|\hat{\mathbf{c}}^{(0)} - \mathbf{c}^*\|_2$$

where $\rho_{\text{LISTA}} < \rho_{\text{ISTA}}$ strictly. Empirically, LISTA with $K = 10$ layers matches the accuracy of ISTA with $K > 100$ iterations.

ISTA uses the same step size and threshold at every iteration. Early iterations benefit from aggressive thresholding (to quickly identify the support of the sparse signal), while later iterations benefit from gentle thresholding (to refine the nonzero values). LISTA learns this schedule automatically.

LISTA vs ISTA Convergence

Compare the convergence of LISTA and ISTA as a function of layers/iterations. The plot shows the normalised reconstruction error versus layer/iteration index.

Toggle "Learn Thresholds" to see the effect of keeping thresholds fixed at the ISTA value versus learning them. With learned thresholds, LISTA achieves a dramatically steeper convergence curve.


Definition:

Analytic LISTA (ALISTA)

ALISTA derives the optimal LISTA weight matrix in closed form:

$$\mathbf{W}^* = \mathbf{I} - \mathbf{A}^{H}(\mathbf{A}\mathbf{A}^{H} + \mu^*\mathbf{I})^{-1}\mathbf{A}$$

where $\mu^*$ is the unique positive solution of:

$$\frac{1}{N}\operatorname{tr}\bigl[(\mathbf{G} + \mu\mathbf{I})^{-2}\mathbf{G}\bigr] = \frac{1}{N - s}$$

with $\mathbf{G} = \mathbf{A}^{H}\mathbf{A}$ and $s$ the expected sparsity. This eliminates the need to learn $\mathbf{W}_k$ and $\mathbf{S}_k$: only the thresholds $\tau_k$ are trained, reducing the parameter count by orders of magnitude.

ALISTA shows that the essential quantity learned by LISTA is the threshold schedule --- the weight matrices converge to a regularised pseudo-inverse that can be computed analytically.
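A sketch of how the closed form above could be computed, assuming $\mu^*$ is bracketed by the bisection interval; the function name and tolerances are illustrative:

```python
import numpy as np

def alista_weight(A, s, mu_lo=1e-8, mu_hi=1e8, n_bisect=200):
    """Find mu* by bisection on (1/N) tr[(G + mu I)^{-2} G] = 1/(N - s),
    then form W* = I - A^H (A A^H + mu* I)^{-1} A."""
    M, N = A.shape
    g = np.clip(np.linalg.eigvalsh(A.conj().T @ A), 0.0, None)  # eigenvalues of G

    def f(mu):                                  # (1/N) tr[(G + mu I)^{-2} G]
        return np.sum(g / (g + mu) ** 2) / N

    target = 1.0 / (N - s)
    for _ in range(n_bisect):                   # f is decreasing in mu
        mu = 0.5 * (mu_lo + mu_hi)
        mu_lo, mu_hi = (mu, mu_hi) if f(mu) > target else (mu_lo, mu)
    mu_star = 0.5 * (mu_lo + mu_hi)

    AAH = A @ A.conj().T
    W = np.eye(N) - A.conj().T @ np.linalg.solve(AAH + mu_star * np.eye(M), A)
    return W, mu_star
```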

Definition:

Learned ADMM (L-ADMM)

Learned ADMM parameterises each ADMM iteration with learnable components. Layer $k$ computes:

$$\hat{\mathbf{c}}^{(k+1)} = \mathcal{F}_{\theta_k^x}\!\bigl(\mathbf{y},\; \mathbf{z}^{(k)},\; \mathbf{u}^{(k)}\bigr)$$

$$\mathbf{z}^{(k+1)} = \mathcal{G}_{\theta_k^z}\!\bigl(\hat{\mathbf{c}}^{(k+1)} + \mathbf{u}^{(k)}\bigr)$$

$$\mathbf{u}^{(k+1)} = \mathbf{u}^{(k)} + \eta_k\bigl(\hat{\mathbf{c}}^{(k+1)} - \mathbf{z}^{(k+1)}\bigr)$$

where $\mathcal{F}_{\theta_k^x}$ solves the data-fidelity subproblem (whose penalty weight is $\rho_k$), $\mathcal{G}_{\theta_k^z}$ replaces soft-thresholding with a learned proximal operator (e.g., a small CNN), and $\rho_k$, $\eta_k$ are learnable scalars.

The dual variable $\mathbf{u}^{(k)}$ accumulates constraint violations, providing cross-layer memory that LISTA lacks. The learned penalty schedule $\rho_k$ typically decreases across layers: large early (to enforce consensus), small later (to refine details).
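A structural sketch of one such layer, with `prox_net` standing in for the learned proximal operator $\mathcal{G}_{\theta_k^z}$ and the x-update written for the standard quadratic data-fidelity subproblem (an assumption about the form of $\mathcal{F}_{\theta_k^x}$):

```python
import numpy as np

def ladmm_layer(A, y, z, u, rho, eta, prox_net):
    """One Learned-ADMM layer (structural sketch).

    The x-update solves min_c 0.5||y - A c||^2 + (rho/2)||c - (z - u)||^2
    in closed form; rho and eta are the learnable scalars."""
    N = A.shape[1]
    rhs = A.conj().T @ y + rho * (z - u)
    c = np.linalg.solve(A.conj().T @ A + rho * np.eye(N), rhs)  # x-update
    z = prox_net(c + u)        # learned prox replaces soft-thresholding
    u = u + eta * (c - z)      # dual update: accumulates c - z violations
    return c, z, u
```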

Definition:

Learned Primal-Dual (LPD)

The Learned Primal-Dual network maintains primal iterates $\{\mathbf{f}^{(k)}\}$ (images) and dual iterates $\{\mathbf{h}^{(k)}\}$ (measurements) simultaneously. Each layer $k$ performs:

Dual update: $\mathbf{h}^{(k+1)} = \Gamma_{\theta_k^d}\!\bigl(\mathbf{h}^{(k)},\; \mathbf{A}\,\mathbf{f}_1^{(k)},\; \mathbf{y}\bigr)$

Primal update: $\mathbf{f}^{(k+1)} = \Lambda_{\theta_k^p}\!\bigl(\mathbf{f}^{(k)},\; \mathbf{A}^{H}\,\mathbf{h}_1^{(k+1)}\bigr)$

where $\Gamma_{\theta_k^d}$ and $\Lambda_{\theta_k^p}$ are small CNNs, and the subscript in $\mathbf{f}_1$, $\mathbf{h}_1$ selects the first channel of the multi-channel primal and dual iterates. The dual-space network reasons about measurement-domain structure (k-space artefacts, missing samples) invisible in the image domain.
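A structural sketch of one LPD layer, with `dual_net` and `primal_net` as placeholders for the CNNs $\Gamma_{\theta_k^d}$ and $\Lambda_{\theta_k^p}$:

```python
def lpd_layer(A, y, f, h, dual_net, primal_net):
    """One Learned Primal-Dual layer (structural sketch).

    f: stack of primal channels (image space); h: stack of dual channels
    (measurement space). Only the first channel is passed through the
    forward / adjoint operator."""
    h = dual_net(h, A @ f[0], y)            # dual update, measurement domain
    f = primal_net(f, A.conj().T @ h[0])    # primal update, image domain
    return f, h
```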

Comparison of Unrolled Architectures

| Property | LISTA | L-ADMM | LPD | Unrolled OAMP |
| --- | --- | --- | --- | --- |
| Base algorithm | ISTA | ADMM | Chambolle-Pock | OAMP |
| Cross-layer memory | None | Dual variable $\mathbf{u}$ | Multi-channel primal + dual | State evolution $\sigma_t^2$ |
| Works for structured $\mathbf{A}$ | Partially (dense $\mathbf{W}_k$) | Partially | Yes | Yes (designed for it) |
| Exploits Kronecker structure | No | No | No | Yes ($O(N\log N)$) |
| Convergence theory | ALISTA contraction rate | Firmly nonexpansive | Expressivity subsumption | State evolution (exact) |
| Typical parameter count (per layer) | $O(N^2)$ | $O(N^2)$ or CNN | CNN (primal + dual) | CNN only ($\sim 10^4$) |
| Best suited for | i.i.d. Gaussian $\mathbf{A}$ | General, splitting-friendly | Dual-domain artefacts | Structured/Kronecker $\mathbf{A}$ |

Theorem: Why OAMP Outperforms ISTA Unrolling for Structured Matrices

Let $\mathbf{A}$ be a right-unitarily invariant matrix with singular values $\{\sigma_i\}$ and condition number $\kappa = \sigma_{\max}/\sigma_{\min}$. Then:

  1. The ISTA contraction rate is $\rho_{\text{ISTA}} = 1 - 2/(\kappa + 1)$.
  2. The OAMP contraction rate (with optimal LMMSE) is $\rho_{\text{OAMP}} = 1 - 1/\text{mmse}(\sigma_t^2)$, which depends on the full spectrum of $\mathbf{A}$, not just $\sigma_{\max}$.

For highly structured matrices (partial DFT, Kronecker products) where $\kappa \gg 1$ but the bulk of singular values are well-conditioned, $\rho_{\text{OAMP}} \ll \rho_{\text{ISTA}}$.

ISTA's step size $1/L = 1/\sigma_{\max}^2$ is dictated by the largest singular value --- one bad eigenvalue slows down the entire algorithm. OAMP's LMMSE step adapts to the full spectrum: it inverts well-conditioned modes aggressively and regularises ill-conditioned modes. This spectral adaptivity is what makes OAMP natural for the structured operators arising in RF imaging.
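A small illustration of this spectral adaptivity, assuming a unit-variance signal prior so the LMMSE gain on mode $i$ is $\sigma_i/(\sigma_i^2 + \sigma_n^2)$; the numbers are made up:

```python
import numpy as np

def per_mode_gains(singular_values, noise_var):
    """Contrast ISTA's uniform step with the LMMSE per-mode gains."""
    s = np.asarray(singular_values, dtype=float)
    ista_gain = s / s.max() ** 2            # uniform 1/L step scales every mode
    lmmse_gain = s / (s ** 2 + noise_var)   # spectrum-adaptive LMMSE filter
    return ista_gain, lmmse_gain

# One bad singular value drags every ISTA gain down; LMMSE is unaffected.
s = np.array([10.0, 1.0, 1.0, 1.0])         # sigma_max = 10 >> bulk
print(per_mode_gains(s, noise_var=0.1))
```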

Example: LISTA Parameter Count vs Unrolled OAMP

Compare the parameter counts and per-layer costs of 10-layer LISTA and 10-layer unrolled OAMP for a problem with $M = 4096$ measurements and $N = 16384$ unknowns (a $128 \times 128$ image).
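A back-of-the-envelope sketch, taking the $\sim 10^4$ CNN parameter count per layer from the comparison table above:

```python
M, N, K = 4096, 16384, 10                   # measurements, unknowns, layers

lista_per_layer = N * N + N * M + 1         # W_k (N x N), S_k (N x M), tau_k
oamp_per_layer = 10_000                     # small ProxNet CNN (order of
                                            # magnitude from the table)

print(f"10-layer LISTA : {K * lista_per_layer:,} parameters")   # ~3.4 billion
print(f"10-layer OAMP  : {K * oamp_per_layer:,} parameters")    # ~100 thousand
```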

Historical Note: The Birth of Algorithm Unrolling: LISTA (2010)

2010s

Karol Gregor and Yann LeCun introduced LISTA at ICML 2010, motivated by the need for fast sparse coding in computer vision. At the time, sparse coding required hundreds of ISTA iterations per image patch. LISTA showed that 10 learned layers could match 100+ ISTA iterations, reducing inference time by an order of magnitude.

The paper's insight --- that iterative algorithms can be viewed as recurrent networks with tied weights, and untying them yields better performance --- launched the algorithm unrolling field. Within a decade, unrolled networks became the dominant approach for model-based deep learning in signal processing and imaging.

Historical Note: ADMM-Net and the Imaging Connection

2016--2018

Yang et al. (2016) applied ADMM unrolling to MRI reconstruction, demonstrating that the ADMM splitting structure (data-fidelity + regulariser + dual variable) provides a richer architecture than ISTA unrolling. Chang et al. (2017) extended this to general compressed sensing with ADMM-CSNet. Adler and Öktem (2018) introduced Learned Primal-Dual for CT, adding dual-domain processing.

These works established that the choice of base algorithm matters: different algorithms provide different inductive biases, and matching the algorithm to the problem structure is essential. For RF imaging with structured sensing matrices, OAMP provides the best match.


Why This Matters: Unrolled OAMP for MIMO Radar Imaging

MIMO radar naturally produces Kronecker-structured sensing matrices: $\mathbf{A} = \mathbf{A}_{\text{tx}} \otimes \mathbf{A}_{\text{rx}}$, where $\mathbf{A}_{\text{tx}}$ and $\mathbf{A}_{\text{rx}}$ are the transmit and receive steering matrices.

Unrolled OAMP-ProxNet with Kronecker structure exploits this separability (see the sketch after the list below) to achieve near-optimal reconstruction with computational cost scaling as $O(N\log N)$ per layer rather than $O(N^2)$. Key advantages over MF→U-Net for MIMO radar:

  1. The forward model is used at every layer, not just once.
  2. The state evolution provides a noise schedule for the ProxNet denoiser.
  3. Theoretical guarantees from OAMP state evolution transfer to the unrolled network (under the Gaussianity assumption).
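One standard way to realise the separability (function name illustrative): apply the Kronecker operator via the identity $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{C}) = \mathrm{vec}(\mathbf{B}\mathbf{C}\mathbf{A}^{T})$, so the cost is two small matrix products instead of one huge dense matvec; when the factors themselves admit fast (FFT-type) transforms, the cost drops further toward the $O(N\log N)$ figure quoted above:

```python
import numpy as np

def kron_matvec(A_tx, A_rx, c):
    """Apply (A_tx ⊗ A_rx) to c without forming the Kronecker product,
    using (A ⊗ B) vec(C) = vec(B C A^T) with column-major vec."""
    C = c.reshape(A_rx.shape[1], A_tx.shape[1], order="F")
    return (A_rx @ C @ A_tx.T).reshape(-1, order="F")

# Quick self-check against the explicit Kronecker product:
A_tx, A_rx = np.random.randn(4, 6), np.random.randn(5, 7)
c = np.random.randn(6 * 7)
assert np.allclose(kron_matvec(A_tx, A_rx, c), np.kron(A_tx, A_rx) @ c)
```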

Common Mistake: LISTA's Dense Matrices Are Prohibitive for Imaging

Mistake:

Applying LISTA with full dense matrices $\mathbf{W}_k \in \mathbb{R}^{N \times N}$ to high-resolution imaging problems ($N > 10^4$).

Correction:

For $N = 10^4$, each $\mathbf{W}_k$ has $10^8$ entries --- far too large for GPU memory. Alternatives:

  1. Structured LISTA: parameterise $\mathbf{W}_k$ as sparse or low-rank.
  2. Convolutional LISTA: replace dense matrices with convolutional operators.
  3. ALISTA: fix $\mathbf{W}_k$ analytically, learn only $\tau_k$.
  4. Use OAMP instead: the LMMSE step exploits the structure of $\mathbf{A}$ without any dense learned matrix.

Common Mistake: More Layers Is Not Always Better

Mistake:

Increasing the number of unrolled layers $K$ indefinitely, assuming that more layers always improve reconstruction quality.

Correction:

Beyond a certain $K$, additional layers provide diminishing returns and introduce two problems:

  1. Vanishing gradients: the backward pass through $K$ layers involves $K$ Jacobian multiplications that can shrink gradients.
  2. Overfitting: more parameters with limited training data.

In practice, $K = 5$--$20$ layers suffice. Weight tying and intermediate supervision help mitigate gradient issues; one common form of the latter is sketched below.
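A sketch of intermediate (deep) supervision, assuming the unrolled network exposes its per-layer outputs; the function and its weighting scheme are illustrative:

```python
import torch

def deep_supervision_loss(layer_outputs, target, decay=0.5):
    """Supervise every layer's output so each layer receives a direct
    gradient signal; early layers are down-weighted by `decay` per layer."""
    K = len(layer_outputs)
    return sum(decay ** (K - 1 - k) * torch.mean((c - target) ** 2)
               for k, c in enumerate(layer_outputs))
```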

Quick Check

How should the LISTA weight matrices $\mathbf{W}_k$ be initialised?

  • Random Gaussian initialisation
  • Identity matrix
  • The ISTA matrix $\mathbf{I} - \frac{1}{L}\mathbf{A}^{H}\mathbf{A}$
  • Zero matrix

Quick Check

What role does the dual variable $\mathbf{u}^{(k)}$ play in Learned ADMM?

  • It stores the previous reconstruction for momentum
  • It accumulates constraint violations, providing memory across layers
  • It normalises activations between layers
  • It controls the learning rate during training

LISTA (Learned ISTA)

An unrolled version of ISTA where the step-size matrix, measurement matrix, and threshold at each layer become learnable parameters, trained end-to-end via backpropagation.

Related: ALISTA (Analytic LISTA)

ALISTA (Analytic LISTA)

A variant of LISTA where the weight matrices are computed in closed form (as a regularised pseudo-inverse), and only the thresholds are learned, drastically reducing the parameter count.

Related: LISTA (Learned ISTA)

Key Takeaway

LISTA, L-ADMM, and LPD are important unrolled architectures, but for RF imaging with structured sensing matrices, unrolled OAMP is superior: it exploits Kronecker structure for $O(N\log N)$ per-layer cost, uses orders of magnitude fewer parameters than LISTA, and the LMMSE step adapts to the full spectrum of $\mathbf{A}$ rather than being limited by the condition number.