Deep Unfolding and Algorithm Unrolling

From Iterative Algorithms to Trainable Networks

Many signal processing algorithms are iterative: they repeat a fixed update rule until convergence. Examples include ISTA/FISTA for sparse recovery, ADMM for convex optimisation, gradient descent for detection, and expectation propagation for decoding. These algorithms are well-understood theoretically but often require many iterations for convergence, and their hyper-parameters (step size, regularisation weight, damping factor) are set by hand or by conservative worst-case analysis.

Deep unfolding (also called algorithm unrolling) bridges the gap between model-based algorithms and data-driven learning by:

  1. Unrolling a fixed number L of iterations into L layers of a neural network,
  2. Making the per-iteration hyper-parameters learnable (step sizes, thresholds, weight matrices), and
  3. Training end-to-end via backpropagation on task-specific data.

The result is a network that inherits the structure and interpretability of the original algorithm while gaining adaptivity from data. Typically, L = 5--20 learned layers match or exceed the performance of 50--500 classical iterations.

Definition:

Iterative Shrinkage-Thresholding Algorithm (ISTA)

Consider the sparse recovery problem: given a measurement matrix \mathbf{A} \in \mathbb{R}^{M \times N} (with M < N) and noisy observations \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n}, recover the s-sparse signal \mathbf{x} by solving the LASSO:

\min_{\mathbf{x}} \;\frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|^2 + \lambda \|\mathbf{x}\|_1

ISTA solves this via proximal gradient descent. At iteration t:

\mathbf{x}^{(t+1)} = \mathcal{S}_{\lambda/L}\!\left( \mathbf{x}^{(t)} - \frac{1}{L}\mathbf{A}^T (\mathbf{A}\mathbf{x}^{(t)} - \mathbf{y}) \right)

where L = \lambda_{\max}(\mathbf{A}^T\mathbf{A}) is the Lipschitz constant of the gradient and \mathcal{S}_\theta is the element-wise soft-thresholding operator:

[\mathcal{S}_\theta(\mathbf{z})]_i = \mathrm{sign}(z_i) \max(|z_i| - \theta, 0)

Substituting the gradient step, each ISTA iteration can be written as:

\mathbf{x}^{(t+1)} = \mathcal{S}_{\lambda/L}\!\left( \mathbf{W}\mathbf{x}^{(t)} + \mathbf{S}\mathbf{y} \right)

where \mathbf{W} = \mathbf{I} - \frac{1}{L}\mathbf{A}^T\mathbf{A} and \mathbf{S} = \frac{1}{L}\mathbf{A}^T.

ISTA converges at rate O(1/t) in objective value. FISTA (with Nesterov momentum) improves this to O(1/t^2), but both can require hundreds of iterations for high-accuracy recovery. The step size 1/L and threshold \lambda/L are fixed across all iterations --- a key limitation that deep unfolding addresses.
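The update above translates directly into code. Below is a minimal NumPy sketch of ISTA; the function names, the regularisation weight lam, and the iteration count are illustrative placeholders rather than anything prescribed by the algorithm.

```python
import numpy as np

def soft_threshold(z, theta):
    """Element-wise soft-thresholding: sign(z) * max(|z| - theta, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def ista(A, y, lam=0.1, num_iters=100):
    """Classical ISTA for the LASSO min_x 0.5*||y - A x||^2 + lam*||x||_1."""
    N = A.shape[1]
    L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of the gradient
    W = np.eye(N) - (A.T @ A) / L           # fixed state-transition matrix
    S = A.T / L                             # fixed input projection
    x = np.zeros(N)
    for _ in range(num_iters):
        x = soft_threshold(W @ x + S @ y, lam / L)
    return x
```

Note that W, S, and the threshold lam / L are computed once and reused at every iteration; these are exactly the quantities that LISTA makes learnable per layer.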

Definition:

Learned ISTA (LISTA)

LISTA (Gregor and LeCun, 2010) unfolds L ISTA iterations into L layers of a neural network, with per-layer learnable parameters:

\mathbf{x}^{(l+1)} = \mathcal{S}_{\theta^{(l)}}\!\left( \mathbf{W}^{(l)}\mathbf{x}^{(l)} + \mathbf{S}^{(l)}\mathbf{y} \right), \qquad l = 0, 1, \ldots, L-1

The learnable parameters at each layer are:

  • \mathbf{W}^{(l)} \in \mathbb{R}^{N \times N}: state transition matrix (initialised as \mathbf{I} - \frac{1}{L_{\text{lip}}} \mathbf{A}^T\mathbf{A}, where L_{\text{lip}} is the Lipschitz constant from the ISTA definition)
  • \mathbf{S}^{(l)} \in \mathbb{R}^{N \times M}: input projection (initialised as \frac{1}{L_{\text{lip}}}\mathbf{A}^T)
  • \theta^{(l)} \geq 0: soft-threshold (initialised as \lambda/L_{\text{lip}}; can be scalar or per-element)

The network is trained end-to-end by minimising the MSE between the output \mathbf{x}^{(L)} and the ground truth \mathbf{x}_{\mathrm{true}}:

\mathcal{L}(\{\mathbf{W}^{(l)}, \mathbf{S}^{(l)}, \theta^{(l)}\}_{l=0}^{L-1}) = \frac{1}{N_{\mathrm{train}}} \sum_{i=1}^{N_{\mathrm{train}}} \|\mathbf{x}_i^{(L)} - \mathbf{x}_{\mathrm{true},i}\|^2
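For concreteness, here is one way the unfolded network might be written in PyTorch. This is a hedged sketch: the class name LISTA, the default layer count, and the default lam are chosen purely for illustration. Each layer owns its own \mathbf{W}^{(l)}, \mathbf{S}^{(l)}, and \theta^{(l)}, initialised from \mathbf{A} exactly as listed above.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unrolled ISTA: num_layers layers, each with its own learnable W, S, theta."""

    def __init__(self, A, num_layers=8, lam=0.1):
        super().__init__()
        A = torch.as_tensor(A, dtype=torch.float32)
        N = A.shape[1]
        L_lip = torch.linalg.eigvalsh(A.T @ A).max()
        W0 = torch.eye(N) - A.T @ A / L_lip          # ISTA-style initialisation
        S0 = A.T / L_lip
        theta0 = float(lam / L_lip)
        self.W = nn.ParameterList([nn.Parameter(W0.clone()) for _ in range(num_layers)])
        self.S = nn.ParameterList([nn.Parameter(S0.clone()) for _ in range(num_layers)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.full((N,), theta0)) for _ in range(num_layers)])

    def forward(self, y):
        # y: (batch, M) measurements; returns x: (batch, N) estimates
        x = torch.zeros(y.shape[0], self.W[0].shape[0], device=y.device)
        for W, S, theta in zip(self.W, self.S, self.theta):
            z = x @ W.T + y @ S.T                            # W^(l) x^(l) + S^(l) y
            x = torch.sign(z) * torch.relu(z.abs() - theta)  # soft-thresholding
        return x
```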

Key advantages over ISTA:

  1. Convergence speed: L = 10 LISTA layers often outperform T = 100 ISTA iterations.
  2. Per-layer adaptation: Early layers can use larger thresholds (aggressive sparsification) while later layers use smaller thresholds (refinement).
  3. Learned matrices: \mathbf{W}^{(l)} and \mathbf{S}^{(l)} can deviate from the ISTA initialisation to accelerate convergence for the specific data distribution.

Deep Unfolding: From ISTA to LISTA

Animated visualisation of the deep unfolding process. ISTA iterations with fixed parameters are transformed into LISTA layers with per-layer learnable parameters, achieving linear convergence (\rho^L) vs ISTA's sublinear O(1/t). Each LISTA layer has its own threshold and weight matrix, enabling 10 layers to match 100+ ISTA iterations.

Deep Unfolding of ISTA into LISTA

Top: Classical ISTA repeats the same update rule with fixed parameters (\mathbf{W}, \mathbf{S}, \theta) for T iterations. Bottom: LISTA unfolds L iterations into a feed-forward network where each layer has its own learnable parameters (\mathbf{W}^{(l)}, \mathbf{S}^{(l)}, \theta^{(l)}). The network is trained end-to-end via backpropagation. Grey arrows show the forward pass; red arrows show gradient flow during training.

LISTA Training and Inference

Input: Measurement matrix \mathbf{A}, training data \{(\mathbf{y}_i, \mathbf{x}_i)\}_{i=1}^{N_{\text{train}}}, number of layers L, learning rate \eta
Initialisation:
1. L_{\text{lip}} \leftarrow \lambda_{\max}(\mathbf{A}^T\mathbf{A})
2. For l = 0, \ldots, L-1:
- \mathbf{W}^{(l)} \leftarrow \mathbf{I} - \frac{1}{L_{\text{lip}}}\mathbf{A}^T\mathbf{A}
- \mathbf{S}^{(l)} \leftarrow \frac{1}{L_{\text{lip}}}\mathbf{A}^T
- \theta^{(l)} \leftarrow 0.1 / L_{\text{lip}}
Training (for each epoch):
3. For each mini-batch \{(\mathbf{y}_j, \mathbf{x}_j)\}:
- \mathbf{x}^{(0)} \leftarrow \mathbf{0}
- For l = 0, \ldots, L-1:
\quad \mathbf{x}^{(l+1)} = \mathcal{S}_{\theta^{(l)}}(\mathbf{W}^{(l)}\mathbf{x}^{(l)} + \mathbf{S}^{(l)}\mathbf{y}_j)
- \mathcal{L} = \|\mathbf{x}^{(L)} - \mathbf{x}_j\|^2
- Compute \nabla \mathcal{L} w.r.t. all \{\mathbf{W}^{(l)}, \mathbf{S}^{(l)}, \theta^{(l)}\} via backpropagation
- Update: \theta^{(l)} \leftarrow \theta^{(l)} - \eta\,\nabla_{\theta^{(l)}} \mathcal{L} (and similarly for \mathbf{W}^{(l)}, \mathbf{S}^{(l)})
Inference:
4. Given a new measurement \mathbf{y}:
- \mathbf{x}^{(0)} \leftarrow \mathbf{0}
- For l = 0, \ldots, L-1:
\quad \mathbf{x}^{(l+1)} = \mathcal{S}_{\theta^{(l)}}(\mathbf{W}^{(l)}\mathbf{x}^{(l)} + \mathbf{S}^{(l)}\mathbf{y})
- Return \hat{\mathbf{x}} = \mathbf{x}^{(L)}
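The training stage of this pseudocode might look as follows in PyTorch, assuming the LISTA module sketched earlier and synthetically generated sparse training pairs; the batch size, optimiser, noise model, and loop counts are all illustrative assumptions.

```python
import torch

def train_lista(model, A, num_epochs=50, batches_per_epoch=100,
                batch_size=128, sparsity=5, snr_db=20.0, lr=1e-3):
    """Train a LISTA model on synthetic data y = A x + n with an MSE loss."""
    A = torch.as_tensor(A, dtype=torch.float32)
    M, N = A.shape
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            # Draw s-sparse ground-truth signals with Gaussian non-zero entries
            x_true = torch.zeros(batch_size, N)
            idx = torch.rand(batch_size, N).argsort(dim=1)[:, :sparsity]
            x_true.scatter_(1, idx, torch.randn(batch_size, sparsity))
            # Noisy measurements at (approximately) the requested SNR
            y_clean = x_true @ A.T
            noise_std = y_clean.norm(dim=1, keepdim=True) / M**0.5 * 10 ** (-snr_db / 20)
            y = y_clean + noise_std * torch.randn_like(y_clean)
            # Forward pass, MSE loss against the ground truth, and parameter update
            loss = torch.mean((model(y) - x_true) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

A typical usage, under the same assumptions, would be model = LISTA(A, num_layers=8) followed by train_lista(model, A); inference is then simply x_hat = model(y).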

Deep Unfolding Beyond Sparse Recovery

The unfolding principle applies to any iterative algorithm. Prominent wireless applications include:

Algorithm | Unfolded Version | Application
ISTA/FISTA | LISTA, LISTA-CP | Sparse channel estimation, CS recovery
ADMM | ADMM-Net | MIMO detection, resource allocation
Gradient descent | DetNet | MIMO detection
WMMSE | UWMMSE | Power control, beamforming
Expectation propagation | Learned EP | Turbo detection/decoding
Belief propagation | Neural BP | LDPC decoding

The general recipe is: (1) identify the iterative update, (2) make the step size, damping, and/or weight matrices per-layer learnable, (3) train end-to-end with task-specific loss (e.g., BER rather than MSE for detection). The structure of the original algorithm acts as a strong inductive bias, dramatically reducing the number of trainable parameters compared to a black-box network.
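To illustrate how the same recipe applies beyond ISTA, here is a deliberately stripped-down sketch (not the actual DetNet architecture) of unrolled gradient descent for a least-squares detection objective, where only a per-layer step size is made learnable; the class name and defaults are illustrative.

```python
import torch
import torch.nn as nn

class UnrolledGD(nn.Module):
    """Unrolled gradient descent for min_x ||y - H x||^2 with learnable per-layer step sizes.

    A minimal instance of the unfolding recipe: the iterative update is kept
    verbatim, and only the step size alpha^(l) of each layer is learned.
    """

    def __init__(self, num_layers=10, init_step=0.05):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((num_layers,), init_step))

    def forward(self, H, y):
        # H: (batch, M, N) channel matrices; y: (batch, M) received vectors
        x = torch.zeros(H.shape[0], H.shape[2], device=y.device)
        for a in self.alpha:
            residual = torch.bmm(H, x.unsqueeze(-1)).squeeze(-1) - y            # H x - y
            grad = torch.bmm(H.transpose(1, 2), residual.unsqueeze(-1)).squeeze(-1)
            x = x - a * grad                                                     # learned step size
        return x
```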


Example: LISTA Convergence Rate

A sparse recovery problem has N = 50, M = 25, sparsity s = 5, and SNR = 20 dB. Classical ISTA requires T = 50 iterations to achieve NMSE = -20 dB.

(a) If LISTA uses L = 8 layers, estimate the total number of multiply-accumulate (MAC) operations for a single inference, compared to T = 50 ISTA iterations.

(b) Explain qualitatively why LISTA converges faster.
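A back-of-the-envelope count for part (a), under the assumption that each ISTA iteration and each LISTA layer costs one N×N multiply (\mathbf{W}\mathbf{x}) plus one N×M multiply (\mathbf{S}\mathbf{y}); this counting convention, and ignoring the thresholding, is an assumption rather than part of the problem statement.

```python
# Rough MAC counts for the example: N = 50, M = 25, ISTA T = 50 vs LISTA L = 8.
N, M = 50, 25
per_step = N * N + N * M            # W x (N*N MACs) + S y (N*M MACs) = 3750 MACs
ista_macs = 50 * per_step           # ~187,500 MACs (~125,000 if S y is precomputed once)
lista_macs = 8 * per_step           # ~30,000 MACs
print(ista_macs, lista_macs, ista_macs / lista_macs)   # ratio 50/8 = 6.25x fewer for LISTA
```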

LISTA vs ISTA Convergence

Compare the convergence of LISTA (learned ISTA) and classical ISTA for sparse signal recovery. The x-axis shows iterations (ISTA) or layers (LISTA), and the y-axis shows the NMSE on a log scale. Observe that LISTA (with per-layer learnable thresholds) achieves much lower NMSE in far fewer layers than ISTA iterations. Increase the number of LISTA layers to see diminishing returns. Higher sparsity makes the problem harder for both algorithms. Note: The LISTA thresholds here are initialised (not fully trained) to illustrate the structural advantage; a fully trained LISTA would perform even better.


Theorem: Linear Convergence of LISTA

With suitably chosen (or learned) per-layer parameters, and provided the true signal is sufficiently sparse, the LISTA iterates converge linearly: \|\mathbf{x}^{(l)} - \mathbf{x}^*\| \leq C\rho^l for some constants C > 0 and \rho \in (0,1). This contrasts with ISTA's sublinear O(1/t) rate and explains why a handful of learned layers can match many classical iterations.

Common Mistake: Unfolded Networks May Not Generalise to New Measurement Matrices

Mistake:

Training LISTA with a specific measurement matrix \mathbf{A} and expecting it to work when \mathbf{A} changes (e.g., different antenna configuration or pilot pattern).

Correction:

LISTA initialises \mathbf{W}^{(l)} and \mathbf{S}^{(l)} from \mathbf{A}. If \mathbf{A} changes, these initialisations are stale and the learned parameters may not transfer. Solutions:

  • Re-initialise and fine-tune with the new \mathbf{A} (typically converges in 5--10 epochs)
  • Train with randomised \mathbf{A} to learn a robust model
  • Feed \mathbf{A} as input (hyper-network approach)

Classical ISTA, by contrast, works with any \mathbf{A} out of the box: it is slower but universally applicable. This robustness-speed trade-off is fundamental to deep unfolding.
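As a sketch of the first remedy (re-initialise and fine-tune), and assuming the hypothetical LISTA module and train_lista helper sketched earlier in this section, adaptation to a new measurement matrix might look like the following; the reset strategy and epoch count are illustrative.

```python
import torch

def adapt_to_new_matrix(model, A_new, num_epochs=10):
    """Re-initialise LISTA layer matrices from a new A, then fine-tune briefly."""
    A_new = torch.as_tensor(A_new, dtype=torch.float32)
    L_lip = torch.linalg.eigvalsh(A_new.T @ A_new).max()
    W0 = torch.eye(A_new.shape[1]) - A_new.T @ A_new / L_lip
    S0 = A_new.T / L_lip
    with torch.no_grad():
        for W, S in zip(model.W, model.S):
            W.copy_(W0)   # replace the now-stale initialisation from the old A
            S.copy_(S0)
    # Fine-tune for a few epochs on data generated with the new matrix
    return train_lista(model, A_new, num_epochs=num_epochs)
```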

Why This Matters: Sparse Recovery Theory in the FSI Book

The ISTA and LISTA algorithms in this section solve the LASSO problem for sparse signal recovery. The FSI (Foundations of Statistical Inference) book develops the full theoretical framework:

  • Compressed sensing: RIP, basis pursuit, recovery guarantees
  • LASSO path: relationship between \lambda and the support
  • AMP and OAMP: message-passing algorithms that achieve Bayes-optimal recovery (CommIT research area)
  • Factor graphs and belief propagation: the probabilistic framework that unifies iterative algorithms

Readers interested in the convergence theory of LISTA and its connections to proximal operators should consult the FSI book.

Deep Unfolding (Algorithm Unrolling)

A methodology that converts a fixed number of iterations of an iterative algorithm into layers of a neural network, making per-iteration hyper-parameters (step sizes, thresholds, weight matrices) learnable. Inherits the structure and interpretability of the original algorithm while gaining adaptivity from data.

Related: LISTA (Learned ISTA), Supervised Learning

LISTA (Learned ISTA)

An unfolded version of ISTA where each layer has per-layer learnable thresholds and weight matrices. Achieves linear convergence (\rho^L) compared to ISTA's sublinear O(1/t) rate, typically requiring 5--20 layers to match 50--500 ISTA iterations.

Related: Deep Unfolding (Algorithm Unrolling)

Quick Check

Which of the following is NOT an advantage of LISTA over classical ISTA?

Per-layer learnable thresholds that adapt to the data distribution

Guaranteed convergence to the global optimum of the LASSO for any input

Faster inference due to fewer layers than ISTA iterations

End-to-end training that optimises the final output quality