Deep Unfolding and Algorithm Unrolling

From Iterative Algorithms to Trainable Networks

Many signal processing algorithms are iterative: they repeat a fixed update rule until convergence. Examples include ISTA/FISTA for sparse recovery, ADMM for convex optimisation, gradient descent for detection, and expectation propagation for decoding. These algorithms are well-understood theoretically but often require many iterations for convergence, and their hyper-parameters (step size, regularisation weight, damping factor) are set by hand or by conservative worst-case analysis.

Deep unfolding (also called algorithm unrolling) bridges the gap between model-based algorithms and data-driven learning by:

  1. Unrolling a fixed number L of iterations into L layers of a neural network,
  2. Making the per-iteration hyper-parameters learnable (step sizes, thresholds, weight matrices), and
  3. Training end-to-end via backpropagation on task-specific data.

The result is a network that inherits the structure and interpretability of the original algorithm while gaining adaptivity from data. Typically, L = 5--20 learned layers match or exceed the performance of 50--500 classical iterations.

Definition:

Iterative Shrinkage-Thresholding Algorithm (ISTA)

Consider the sparse recovery problem: given a measurement matrix \mathbf{A} \in \mathbb{R}^{M \times N} (with M < N) and noisy observations \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}, recover the s-sparse signal \mathbf{x} by solving the LASSO:

\min_{\mathbf{x}} \;\frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|^2 + \lambda \|\mathbf{x}\|_1

ISTA solves this via proximal gradient descent. At iteration t:

\mathbf{x}^{(t+1)} = \mathcal{S}_{\lambda/L}\!\left( \mathbf{x}^{(t)} - \frac{1}{L}\mathbf{A}^T (\mathbf{A}\mathbf{x}^{(t)} - \mathbf{y}) \right)

where L = \lambda_{\max}(\mathbf{A}^T\mathbf{A}) is the Lipschitz constant of the gradient and \mathcal{S}_\theta is the element-wise soft-thresholding operator:

[\mathcal{S}_\theta(\mathbf{z})]_i = \mathrm{sign}(z_i) \max(|z_i| - \theta, 0)

Substituting the gradient step, each ISTA iteration can be written as:

\mathbf{x}^{(t+1)} = \mathcal{S}_{\lambda/L}\!\left( \mathbf{W}\mathbf{x}^{(t)} + \mathbf{S}\mathbf{y} \right)

where \mathbf{W} = \mathbf{I} - \frac{1}{L}\mathbf{A}^T\mathbf{A} and \mathbf{S} = \frac{1}{L}\mathbf{A}^T.

ISTA converges at rate O(1/t) in objective value. FISTA (with Nesterov momentum) improves this to O(1/t^2), but both can require hundreds of iterations for high-accuracy recovery. The step size 1/L and threshold \lambda/L are fixed across all iterations --- a key limitation that deep unfolding addresses.
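The update above translates directly into code. Below is a minimal NumPy sketch of ISTA; the function names, the regularisation weight lam, and the iteration count are illustrative placeholders rather than anything prescribed by the algorithm.

```python
import numpy as np

def soft_threshold(z, theta):
    """Element-wise soft-thresholding: sign(z) * max(|z| - theta, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def ista(A, y, lam=0.1, num_iters=100):
    """Classical ISTA for the LASSO min_x 0.5*||y - A x||^2 + lam*||x||_1."""
    N = A.shape[1]
    L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of the gradient
    W = np.eye(N) - (A.T @ A) / L           # fixed state-transition matrix
    S = A.T / L                             # fixed input projection
    x = np.zeros(N)
    for _ in range(num_iters):
        x = soft_threshold(W @ x + S @ y, lam / L)
    return x
```

Note that W, S, and the threshold lam / L are computed once and reused at every iteration; these are exactly the quantities that LISTA makes learnable per layer.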

Definition:

Learned ISTA (LISTA)

LISTA (Gregor and LeCun, 2010) unfolds L ISTA iterations into L layers of a neural network, with per-layer learnable parameters:

\mathbf{x}^{(l+1)} = \mathcal{S}_{\theta^{(l)}}\!\left( \mathbf{W}^{(l)}\mathbf{x}^{(l)} + \mathbf{S}^{(l)}\mathbf{y} \right), \qquad l = 0, 1, \ldots, L-1

The learnable parameters at each layer are:

  • \mathbf{W}^{(l)} \in \mathbb{R}^{N \times N}: state transition matrix (initialised as \mathbf{I} - \frac{1}{L_{\text{lip}}} \mathbf{A}^T\mathbf{A}, where L_{\text{lip}} is the Lipschitz constant from the ISTA definition)
  • \mathbf{S}^{(l)} \in \mathbb{R}^{N \times M}: input projection (initialised as \frac{1}{L_{\text{lip}}}\mathbf{A}^T)
  • \theta^{(l)} \geq 0: soft-threshold (initialised as \lambda/L_{\text{lip}}; can be scalar or per-element)

The network is trained end-to-end by minimising the MSE between the output \mathbf{x}^{(L)} and the ground truth \mathbf{x}_{\mathrm{true}}:

\mathcal{L}(\{\mathbf{W}^{(l)}, \mathbf{S}^{(l)}, \theta^{(l)}\}_{l=0}^{L-1}) = \frac{1}{N_{\mathrm{train}}} \sum_{i=1}^{N_{\mathrm{train}}} \|\mathbf{x}_i^{(L)} - \mathbf{x}_{\mathrm{true},i}\|^2
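For concreteness, here is one way the unfolded network might be written in PyTorch. This is a hedged sketch: the class name LISTA, the default layer count, and the default lam are chosen purely for illustration. Each layer owns its own \mathbf{W}^{(l)}, \mathbf{S}^{(l)}, and \theta^{(l)}, initialised from \mathbf{A} exactly as listed above.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unrolled ISTA: num_layers layers, each with its own learnable W, S, theta."""

    def __init__(self, A, num_layers=8, lam=0.1):
        super().__init__()
        A = torch.as_tensor(A, dtype=torch.float32)
        N = A.shape[1]
        L_lip = torch.linalg.eigvalsh(A.T @ A).max()
        W0 = torch.eye(N) - A.T @ A / L_lip          # ISTA-style initialisation
        S0 = A.T / L_lip
        theta0 = float(lam / L_lip)
        self.W = nn.ParameterList([nn.Parameter(W0.clone()) for _ in range(num_layers)])
        self.S = nn.ParameterList([nn.Parameter(S0.clone()) for _ in range(num_layers)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.full((N,), theta0)) for _ in range(num_layers)])

    def forward(self, y):
        # y: (batch, M) measurements; returns x: (batch, N) estimates
        x = torch.zeros(y.shape[0], self.W[0].shape[0], device=y.device)
        for W, S, theta in zip(self.W, self.S, self.theta):
            z = x @ W.T + y @ S.T                            # W^(l) x^(l) + S^(l) y
            x = torch.sign(z) * torch.relu(z.abs() - theta)  # soft-thresholding
        return x
```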

Key advantages over ISTA:

  1. Convergence speed: L = 10 LISTA layers often outperform T = 100 ISTA iterations.
  2. Per-layer adaptation: Early layers can use larger thresholds (aggressive sparsification) while later layers use smaller thresholds (refinement).
  3. Learned matrices: \mathbf{W}^{(l)} and \mathbf{S}^{(l)} can deviate from the ISTA initialisation to accelerate convergence for the specific data distribution.

Deep Unfolding: From ISTA to LISTA

Animated visualisation of the deep unfolding process. ISTA iterations with fixed parameters are transformed into LISTA layers with per-layer learnable parameters, achieving linear convergence (\rho^L) vs ISTA's sublinear O(1/t). Each LISTA layer has its own threshold and weight matrix, enabling 10 layers to match 100+ ISTA iterations.

Deep Unfolding of ISTA into LISTA

Top: Classical ISTA repeats the same update rule with fixed parameters (\mathbf{W}, \mathbf{S}, \theta) for T iterations. Bottom: LISTA unfolds L iterations into a feed-forward network where each layer has its own learnable parameters (\mathbf{W}^{(l)}, \mathbf{S}^{(l)}, \theta^{(l)}). The network is trained end-to-end via backpropagation. Grey arrows show the forward pass; red arrows show gradient flow during training.

LISTA Training and Inference

Input: Measurement matrix \mathbf{A}, training data \{(\mathbf{y}_i, \mathbf{x}_i)\}_{i=1}^{N_{\text{train}}}, number of layers L, learning rate \eta
Initialisation:
1. L_{\text{lip}} \leftarrow \lambda_{\max}(\mathbf{A}^T\mathbf{A})
2. For l = 0, \ldots, L-1:
- \mathbf{W}^{(l)} \leftarrow \mathbf{I} - \frac{1}{L_{\text{lip}}}\mathbf{A}^T\mathbf{A}
- \mathbf{S}^{(l)} \leftarrow \frac{1}{L_{\text{lip}}}\mathbf{A}^T
- \theta^{(l)} \leftarrow 0.1 / L_{\text{lip}}
Training (for each epoch):
3. For each mini-batch \{(\mathbf{y}_j, \mathbf{x}_j)\}:
- \mathbf{x}^{(0)} \leftarrow \mathbf{0}
- For l = 0, \ldots, L-1:
\quad \mathbf{x}^{(l+1)} = \mathcal{S}_{\theta^{(l)}}(\mathbf{W}^{(l)}\mathbf{x}^{(l)} + \mathbf{S}^{(l)}\mathbf{y}_j)
- \mathcal{L} = \|\mathbf{x}^{(L)} - \mathbf{x}_j\|^2
- Compute \nabla \mathcal{L} w.r.t. all \{\mathbf{W}^{(l)}, \mathbf{S}^{(l)}, \theta^{(l)}\} via backpropagation
- Update: \theta^{(l)} \leftarrow \theta^{(l)} - \eta\,\nabla_{\theta^{(l)}} \mathcal{L} (and similarly for \mathbf{W}^{(l)}, \mathbf{S}^{(l)})
Inference:
4. Given a new measurement \mathbf{y}:
- \mathbf{x}^{(0)} \leftarrow \mathbf{0}
- For l = 0, \ldots, L-1:
\quad \mathbf{x}^{(l+1)} = \mathcal{S}_{\theta^{(l)}}(\mathbf{W}^{(l)}\mathbf{x}^{(l)} + \mathbf{S}^{(l)}\mathbf{y})
- Return \hat{\mathbf{x}} = \mathbf{x}^{(L)}
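The training stage of this pseudocode might look as follows in PyTorch, assuming the LISTA module sketched earlier and synthetically generated sparse training pairs; the batch size, optimiser, noise model, and loop counts are all illustrative assumptions.

```python
import torch

def train_lista(model, A, num_epochs=50, batches_per_epoch=100,
                batch_size=128, sparsity=5, snr_db=20.0, lr=1e-3):
    """Train a LISTA model on synthetic data y = A x + n with an MSE loss."""
    A = torch.as_tensor(A, dtype=torch.float32)
    M, N = A.shape
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            # Draw s-sparse ground-truth signals with Gaussian non-zero entries
            x_true = torch.zeros(batch_size, N)
            idx = torch.rand(batch_size, N).argsort(dim=1)[:, :sparsity]
            x_true.scatter_(1, idx, torch.randn(batch_size, sparsity))
            # Noisy measurements at (approximately) the requested SNR
            y_clean = x_true @ A.T
            noise_std = y_clean.norm(dim=1, keepdim=True) / M**0.5 * 10 ** (-snr_db / 20)
            y = y_clean + noise_std * torch.randn_like(y_clean)
            # Forward pass, MSE loss against the ground truth, and parameter update
            loss = torch.mean((model(y) - x_true) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

A typical usage, under the same assumptions, would be model = LISTA(A, num_layers=8) followed by train_lista(model, A); inference is then simply x_hat = model(y).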

Deep Unfolding Beyond Sparse Recovery

The unfolding principle applies to any iterative algorithm. Prominent wireless applications include:

Algorithm | Unfolded Version | Application
ISTA/FISTA | LISTA, LISTA-CP | Sparse channel estimation, CS recovery
ADMM | ADMM-Net | MIMO detection, resource allocation
Gradient descent | DetNet | MIMO detection
WMMSE | UWMMSE | Power control, beamforming
Expectation propagation | Learned EP | Turbo detection/decoding
Belief propagation | Neural BP | LDPC decoding

The general recipe is: (1) identify the iterative update, (2) make the step size, damping, and/or weight matrices per-layer learnable, (3) train end-to-end with task-specific loss (e.g., BER rather than MSE for detection). The structure of the original algorithm acts as a strong inductive bias, dramatically reducing the number of trainable parameters compared to a black-box network.
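To illustrate how the same recipe applies beyond ISTA, here is a deliberately stripped-down sketch (not the actual DetNet architecture) of unrolled gradient descent for a least-squares detection objective, where only a per-layer step size is made learnable; the class name and defaults are illustrative.

```python
import torch
import torch.nn as nn

class UnrolledGD(nn.Module):
    """Unrolled gradient descent for min_x ||y - H x||^2 with learnable per-layer step sizes.

    A minimal instance of the unfolding recipe: the iterative update is kept
    verbatim, and only the step size alpha^(l) of each layer is learned.
    """

    def __init__(self, num_layers=10, init_step=0.05):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((num_layers,), init_step))

    def forward(self, H, y):
        # H: (batch, M, N) channel matrices; y: (batch, M) received vectors
        x = torch.zeros(H.shape[0], H.shape[2], device=y.device)
        for a in self.alpha:
            residual = torch.bmm(H, x.unsqueeze(-1)).squeeze(-1) - y            # H x - y
            grad = torch.bmm(H.transpose(1, 2), residual.unsqueeze(-1)).squeeze(-1)
            x = x - a * grad                                                     # learned step size
        return x
```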


Example: LISTA Convergence Rate

A sparse recovery problem has N = 50, M = 25, sparsity s = 5, and SNR = 20 dB. Classical ISTA requires T = 50 iterations to achieve NMSE = -20 dB.

(a) If LISTA uses L = 8 layers, estimate the total number of multiply-accumulate (MAC) operations for a single inference, compared to T = 50 ISTA iterations.

(b) Explain qualitatively why LISTA converges faster.
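A back-of-the-envelope count for part (a), under the assumption that each ISTA iteration and each LISTA layer costs one N×N multiply (\mathbf{W}\mathbf{x}) plus one N×M multiply (\mathbf{S}\mathbf{y}); this counting convention, and ignoring the thresholding, is an assumption rather than part of the problem statement.

```python
# Rough MAC counts for the example: N = 50, M = 25, ISTA T = 50 vs LISTA L = 8.
N, M = 50, 25
per_step = N * N + N * M            # W x (N*N MACs) + S y (N*M MACs) = 3750 MACs
ista_macs = 50 * per_step           # ~187,500 MACs (~125,000 if S y is precomputed once)
lista_macs = 8 * per_step           # ~30,000 MACs
print(ista_macs, lista_macs, ista_macs / lista_macs)   # ratio 50/8 = 6.25x fewer for LISTA
```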

LISTA vs ISTA Convergence

Compare the convergence of LISTA (learned ISTA) and classical ISTA for sparse signal recovery. The x-axis shows iterations (ISTA) or layers (LISTA), and the y-axis shows the NMSE on a log scale. Observe that LISTA (with per-layer learnable thresholds) achieves much lower NMSE in far fewer layers than ISTA iterations. Increase the number of LISTA layers to see diminishing returns. Higher sparsity makes the problem harder for both algorithms. Note: The LISTA thresholds here are initialised (not fully trained) to illustrate the structural advantage; a fully trained LISTA would perform even better.


Theorem: Linear Convergence of LISTA

With suitably chosen (or learned) per-layer parameters, and provided the true signal is sufficiently sparse, the LISTA iterates converge linearly: \|\mathbf{x}^{(l)} - \mathbf{x}^*\| \leq C\rho^l for some constants C > 0 and \rho \in (0,1). This contrasts with ISTA's sublinear O(1/t) rate and explains why a handful of learned layers can match many classical iterations.

Common Mistake: Unfolded Networks May Not Generalise to New Measurement Matrices

Mistake:

Training LISTA with a specific measurement matrix \mathbf{A} and expecting it to work when \mathbf{A} changes (e.g., different antenna configuration or pilot pattern).

Correction:

LISTA initialises \mathbf{W}^{(l)} and \mathbf{S}^{(l)} from \mathbf{A}. If \mathbf{A} changes, these initialisations are stale and the learned parameters may not transfer. Solutions:

  • Re-initialise and fine-tune with the new \mathbf{A} (typically converges in 5--10 epochs)
  • Train with randomised \mathbf{A} to learn a robust model
  • Feed \mathbf{A} as input (hyper-network approach)

Classical ISTA, by contrast, works with any \mathbf{A} out of the box: it is slower but universally applicable. This robustness-speed trade-off is fundamental to deep unfolding.
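As a sketch of the first remedy (re-initialise and fine-tune), and assuming the hypothetical LISTA module and train_lista helper sketched earlier in this section, adaptation to a new measurement matrix might look like the following; the reset strategy and epoch count are illustrative.

```python
import torch

def adapt_to_new_matrix(model, A_new, num_epochs=10):
    """Re-initialise LISTA layer matrices from a new A, then fine-tune briefly."""
    A_new = torch.as_tensor(A_new, dtype=torch.float32)
    L_lip = torch.linalg.eigvalsh(A_new.T @ A_new).max()
    W0 = torch.eye(A_new.shape[1]) - A_new.T @ A_new / L_lip
    S0 = A_new.T / L_lip
    with torch.no_grad():
        for W, S in zip(model.W, model.S):
            W.copy_(W0)   # replace the now-stale initialisation from the old A
            S.copy_(S0)
    # Fine-tune for a few epochs on data generated with the new matrix
    return train_lista(model, A_new, num_epochs=num_epochs)
```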

Why This Matters: Sparse Recovery Theory in the FSI Book

The ISTA and LISTA algorithms in this section solve the LASSO problem for sparse signal recovery. The FSI (Foundations of Statistical Inference) book develops the full theoretical framework:

  • Compressed sensing: RIP, basis pursuit, recovery guarantees
  • LASSO path: relationship between \lambda and the support
  • AMP and OAMP: message-passing algorithms that achieve Bayes-optimal recovery (CommIT research area)
  • Factor graphs and belief propagation: the probabilistic framework that unifies iterative algorithms

Readers interested in the convergence theory of LISTA and its connections to proximal operators should consult the FSI book.

Deep Unfolding (Algorithm Unrolling)

A methodology that converts a fixed number of iterations of an iterative algorithm into layers of a neural network, making per-iteration hyper-parameters (step sizes, thresholds, weight matrices) learnable. Inherits the structure and interpretability of the original algorithm while gaining adaptivity from data.

Related: LISTA (Learned ISTA), Supervised Learning

LISTA (Learned ISTA)

An unfolded version of ISTA where each layer has per-layer learnable thresholds and weight matrices. Achieves linear convergence (\rho^L) compared to ISTA's sublinear O(1/t) rate, typically requiring 5--20 layers to match 50--500 ISTA iterations.

Related: Deep Unfolding (Algorithm Unrolling)

Quick Check

Which of the following is NOT an advantage of LISTA over classical ISTA?

Per-layer learnable thresholds that adapt to the data distribution

Guaranteed convergence to the global optimum of the LASSO for any input

Faster inference due to fewer layers than ISTA iterations

End-to-end training that optimises the final output quality