Learned Message Passing and Deep Unfolding

From Hand-Designed to Learned Message Passing

AMP, OAMP, VAMP, and GAMP all share a common template: a linear step, a nonlinear denoiser, and an Onsager-style correction. The parameters of each stage – thresholds, damping factors, denoiser shape, even the linear operator itself – are derived from a statistical model (sparsity prior, noise level, matrix spectrum). When the assumed model matches reality, performance is Bayes-optimal; when it does not, mismatch erodes the gains.

Deep unfolding turns this limitation into a feature. Each iteration of the hand-designed algorithm is reinterpreted as a layer in a neural network, the per-layer parameters are declared free, and the whole unrolled network is trained end-to-end on representative (\mathbf{x}, \ntn{obs}) pairs. The result is an algorithm that retains the interpretability of message passing but adapts its coefficients to the empirical signal and matrix distribution – escaping the restrictive assumptions of analytical state evolution.

Definition:

LISTA – Learned ISTA

Given T unrolled layers, LISTA replaces the fixed ISTA iteration \mathbf{x}^{t+1} = \eta_{\text{st}}\!\left(\mathbf{x}^t + \tfrac{1}{L}\mathbf{A}^{\mathsf{H}}(\ntn{obs}-\mathbf{A}\mathbf{x}^t);\ \lambda\right) with the learned recursion

\mathbf{x}^{t+1} = \eta_{\text{st}}\!\left(\mathbf{W}_t \ntn{obs} + \mathbf{S}_t \mathbf{x}^t;\ \lambda_t\right), \qquad t=0,\dots,T-1,

where \{\mathbf{W}_t \in \mathbb{C}^{N\times M},\ \mathbf{S}_t \in \mathbb{C}^{N\times N},\ \lambda_t \in \mathbb{R}_+\} are learnable parameters, typically initialized from the ISTA values \mathbf{W}_t = \mathbf{A}^{\mathsf{H}}/L, \mathbf{S}_t = \mathbf{I} - \mathbf{A}^{\mathsf{H}}\mathbf{A}/L, with L = \|\mathbf{A}\|_2^2 the Lipschitz constant of the gradient step. Training minimizes the reconstruction loss \mathbb{E}\|\mathbf{x}^{T} - \mathbf{x}\|^2 over a dataset of signal–measurement pairs.
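To make the recursion concrete, here is a minimal LISTA sketch in PyTorch for the real-valued case; the class name, default layer count, and threshold initialization are illustrative choices rather than a reference implementation.

```python
import torch
import torch.nn as nn

def soft_threshold(u, lam):
    # Elementwise soft-thresholding eta_st(u; lambda).
    return torch.sign(u) * torch.clamp(u.abs() - lam, min=0.0)

class LISTA(nn.Module):
    def __init__(self, A, T=16):
        super().__init__()
        M, N = A.shape
        L = torch.linalg.matrix_norm(A, ord=2) ** 2                    # Lipschitz constant ||A||_2^2
        self.W = nn.ParameterList([nn.Parameter(A.t() / L) for _ in range(T)])                     # init A^T / L
        self.S = nn.ParameterList([nn.Parameter(torch.eye(N) - A.t() @ A / L) for _ in range(T)])  # init I - A^T A / L
        self.lam = nn.ParameterList([nn.Parameter(torch.tensor(0.1)) for _ in range(T)])           # per-layer threshold (placeholder init)
        self.T, self.N = T, N

    def forward(self, y):
        # y: (batch, M) measurements; returns x_hat: (batch, N).
        x = torch.zeros(y.shape[0], self.N, device=y.device, dtype=y.dtype)
        for t in range(self.T):
            x = soft_threshold(y @ self.W[t].t() + x @ self.S[t].t(), self.lam[t])
        return x
```

Training then reduces to minimizing torch.mean((model(y) - x) ** 2) over mini-batches of signal–measurement pairs with any standard optimizer.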

Empirically, LISTA reaches ISTA's T=1000-iteration MSE in 10–20 learned layers – a 50× speed-up. Weight tying (\mathbf{W}_t\equiv\mathbf{W}, \mathbf{S}_t\equiv\mathbf{S}) gives a lighter model that still beats vanilla ISTA.

Definition:

LAMP – Learned AMP

LAMP unrolls the AMP iteration. Each layer reads (\mathbf{x}^t,\mathbf{r}^t) and produces

\begin{aligned} \mathbf{x}^{t+1} &= \eta_t\!\left(\mathbf{B}_t\mathbf{r}^t + \mathbf{x}^t;\ \boldsymbol{\theta}_t\right), \\ \mathbf{r}^{t+1} &= \ntn{obs} - \mathbf{A}\mathbf{x}^{t+1} + b_t \mathbf{r}^t, \end{aligned}

where \mathbf{B}_t is a learned feedback matrix (initialized at \mathbf{A}^{\mathsf{H}}), \eta_t is a parameterized denoiser (soft-threshold, scaled soft-threshold, or a small MLP), and b_t \in \mathbb{R} is the learned Onsager coefficient. All \{\mathbf{B}_t,\boldsymbol{\theta}_t,b_t\}_{t=0}^{T-1} are trained jointly by back-propagation.

LAMP preserves the Onsager-style correction b_t\mathbf{r}^t – but instead of computing it analytically from \delta^{-1}\langle\eta'\rangle, it learns the scalar b_t directly. This is what makes LAMP robust to mismatched matrix ensembles: the network finds the right correction for the actual operator at hand.
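Under the same real-valued assumptions, one LAMP layer might be organized as below; the soft-threshold denoiser and the initial values of theta and b are placeholders (in practice b_t is initialized from the analytical Onsager value \delta^{-1}\langle\eta'\rangle, as the engineering note below recommends).

```python
import torch
import torch.nn as nn

class LAMPLayer(nn.Module):
    def __init__(self, A):
        super().__init__()
        self.register_buffer("A", A)                   # fixed sensing operator
        self.B = nn.Parameter(A.t().clone())           # learned feedback matrix, init A^T
        self.theta = nn.Parameter(torch.tensor(1.0))   # denoiser threshold (placeholder init)
        self.b = nn.Parameter(torch.tensor(0.5))       # Onsager scalar (init from the analytical value in practice)

    def forward(self, x, r, y):
        # x: (batch, N), r: (batch, M), y: (batch, M).
        u = r @ self.B.t() + x                                                # B_t r^t + x^t
        x_new = torch.sign(u) * torch.clamp(u.abs() - self.theta, min=0.0)    # eta_t(.; theta_t)
        r_new = y - x_new @ self.A.t() + self.b * r                           # residual with learned Onsager term b_t r^t
        return x_new, r_new
```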

Definition:

LDVAMP – Learned Denoising VAMP

LDVAMP unrolls the VAMP recursion, replacing each scalar state-evolution update with a learned function and each denoiser with a neural network. Per layer:

\begin{aligned} \hat{\mathbf{x}}_1^t &= \mathbf{g}_1(\mathbf{r}_1^t;\ \boldsymbol{\phi}_t^{(1)}), \quad \alpha_1^t = \text{learned divergence}, \\ (\mathbf{r}_2^t,\gamma_2^t) &= \text{Onsager}(\hat{\mathbf{x}}_1^t,\mathbf{r}_1^t,\alpha_1^t), \\ \hat{\mathbf{x}}_2^t &= \mathbf{g}_2(\mathbf{r}_2^t;\ \boldsymbol{\phi}_t^{(2)}), \quad \alpha_2^t = \text{learned divergence}, \\ (\mathbf{r}_1^{t+1},\gamma_1^{t+1}) &= \text{Onsager}(\hat{\mathbf{x}}_2^t,\mathbf{r}_2^t,\alpha_2^t). \end{aligned}

Here \mathbf{g}_2 is the LMMSE step (its parameters \gamma_2 learned rather than matched to \mathbf{A}) and \mathbf{g}_1 is a learned prior denoiser (a CNN for images, a parameterized shrinkage for sparse signals).

LDVAMP inherits VAMP's robustness to ill-conditioned matrices while dropping the need to specify the signal prior or matrix spectrum analytically. It is the state-of-the-art unrolled network for structured compressed sensing.
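A stripped-down LDVAMP layer could look as follows; the tiny elementwise MLP standing in for \mathbf{g}_1, the scalar parameterization of the divergences \alpha_1, \alpha_2, and the single learned precision \gamma_2 are simplifying assumptions (a full implementation also propagates the precisions \gamma_1^t, \gamma_2^t between half-steps).

```python
import torch
import torch.nn as nn

class LDVAMPLayer(nn.Module):
    def __init__(self, A, hidden=32):
        super().__init__()
        self.register_buffer("A", A)
        # g1: learned prior denoiser, applied elementwise through a small MLP.
        self.g1 = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.alpha1 = nn.Parameter(torch.tensor(0.5))   # learned divergence of g1
        self.alpha2 = nn.Parameter(torch.tensor(0.5))   # learned divergence of g2
        self.gamma2 = nn.Parameter(torch.tensor(1.0))   # learned LMMSE regularization

    def forward(self, r1, y):
        # r1: (batch, N), y: (batch, M).
        x1 = self.g1(r1.unsqueeze(-1)).squeeze(-1)                      # prior denoising g1(r1)
        r2 = (x1 - self.alpha1 * r1) / (1.0 - self.alpha1)              # Onsager (extrinsic) step
        A, N = self.A, self.A.shape[1]
        G = torch.linalg.inv(A.t() @ A + self.gamma2 * torch.eye(N))    # g2: LMMSE with learned gamma2
        x2 = (y @ A + self.gamma2 * r2) @ G
        r1_next = (x2 - self.alpha2 * r2) / (1.0 - self.alpha2)         # Onsager step back to g1
        return x2, r1_next
```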

Theorem: Linear Convergence Rate of LISTA

Assume \mathbf{A} satisfies a restricted isometry property with constant \delta_{2s} < 1/3 and the signal is s-sparse. Then the optimal LISTA parameters \{\mathbf{W}_t^\star,\mathbf{S}_t^\star,\lambda_t^\star\} achieve the linear convergence rate

\|\mathbf{x}^{T} - \mathbf{x}\|_2 \le c_1 \cdot q^{T}\ \|\mathbf{x}\|_2 + c_2\ \ntn{noisestd},

with contraction factor q = q(\delta_{2s}) \in (0,1) strictly smaller than the ISTA contraction q_{\text{ISTA}} at the same regularization.

LISTA beats ISTA's linear rate by adapting each layer's operator to the current sparsity pattern. Early layers can be aggressive (large thresholds) to commit to the strongest components; later layers can be gentle to refine small components. ISTA uses the same operator for every iteration, sacrificing this adaptivity.
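As a back-of-the-envelope illustration of the bound, take hypothetical contraction factors q = 0.5 for LISTA and q_{\text{ISTA}} = 0.9: ignoring the constants and the noise floor, the transient term q^T drops below 10^{-4} after about 14 layers versus about 88 iterations.

```python
import math

def layers_needed(q, tol=1e-4):
    # Smallest T with q**T <= tol (constants and the noise floor ignored).
    return math.ceil(math.log(tol) / math.log(q))

print(layers_needed(0.5))   # 14  (hypothetical learned contraction)
print(layers_needed(0.9))   # 88  (hypothetical ISTA contraction)
```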

Example: LAMP for Bernoulli-Gaussian Recovery

Design a 10-layer LAMP network for recovering Bernoulli-Gaussian signals (\rho = 0.1, unit-variance active components) from measurements \ntn{obs} = \mathbf{A}\mathbf{x} + \mathbf{w} with \mathbf{A} \in \mathbb{R}^{M\times N}, M/N = 0.5, N = 500, SNR = 20 dB. Describe the trainable parameters, loss, and expected behaviour.
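One way to generate training data for this example is sketched below, assuming an i.i.d. Gaussian sensing matrix and using the stated parameters (\rho = 0.1, M/N = 0.5, N = 500, 20 dB SNR); the batch size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, rho, snr_db = 500, 250, 0.1, 20.0
A = rng.standard_normal((M, N)) / np.sqrt(M)     # i.i.d. Gaussian sensing matrix (assumed)

def sample_batch(batch_size):
    # Bernoulli-Gaussian signals: active with probability rho, unit variance when active.
    support = rng.random((batch_size, N)) < rho
    x = support * rng.standard_normal((batch_size, N))
    y_clean = x @ A.T
    # Scale the noise to the requested SNR using the empirical signal power.
    sigma = np.sqrt(np.mean(y_clean ** 2) / 10 ** (snr_db / 10))
    y = y_clean + sigma * rng.standard_normal(y_clean.shape)
    return x, y

x, y = sample_batch(64)
```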

⚠️Engineering Note

Training Tips for Unrolled Networks

Training unrolled message-passing networks is straightforward in principle but has a handful of recurring pitfalls:

  • Layer-wise greedy warm-start. Train layer 0, freeze it, then add and train layer 1, and so on (a sketch follows this list). This avoids the vanishing-gradient problem that plagues end-to-end training of deep unrolled networks.
  • Tied vs. untied weights. Weight tying (\mathbf{B}_t \equiv \mathbf{B}) cuts the parameter count by a factor of T and generalizes better when data is scarce; untied weights win on large datasets.
  • Matrix-specific vs. matrix-agnostic training. If \mathbf{A} is fixed (e.g., a physical MRI encoder), train on that single matrix. If \mathbf{A} varies per sample (e.g., random masks), train over the ensemble. The two regimes give different learned parameters.
  • Loss curriculum. Start with a soft loss (per-layer MSE averaged over t) and anneal to the final-layer loss; this stabilizes early training.
  • Initialisation from analytical parameters. Always initialize \mathbf{B}_0 to \mathbf{A}^{\mathsf{H}} and b_0 to the analytical Onsager coefficient. Random initialization often fails to converge even after training.
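A minimal sketch of the layer-wise greedy schedule from the first bullet, assuming per-layer modules with the (x, r, y) interface used in the LAMP sketch above; the optimizer, step count, and learning rate are placeholders.

```python
import torch

def train_greedy(layers, data_loader, steps_per_layer=1000, lr=1e-3):
    # Grow the unrolled network one layer at a time, optimizing only the newest layer.
    trained = torch.nn.ModuleList()
    for layer in layers:
        trained.append(layer)
        opt = torch.optim.Adam(layer.parameters(), lr=lr)     # earlier layers are not updated
        for (x_true, y), _ in zip(data_loader, range(steps_per_layer)):
            x = torch.zeros_like(x_true)                      # x^0 = 0
            r = y.clone()                                     # r^0 = y
            for l in trained:
                x, r = l(x, r, y)
            loss = torch.mean((x - x_true) ** 2)              # final-layer MSE
            opt.zero_grad()
            loss.backward()
            opt.step()
    return trained
```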

LISTA vs ISTA Convergence

Compare reconstruction MSE as a function of the number of (un)rolled iterations for ISTA with the optimal fixed step size and LISTA with learned per-layer parameters. The learned network reaches ISTA's asymptotic MSE in a small fraction of the layers.


LAMP: MSE vs Layer Count

Visualize the final MSE achieved by LAMP as the number of unrolled layers grows. Compare against fixed-parameter AMP at the same iteration count. Notice how LAMP saturates faster and to a lower floor, especially for structured sensing matrices where AMP struggles.


Common Mistake: Overfitting to a Single Sensing Matrix

Mistake:

Training an unrolled LAMP/LDVAMP network with a single realization of \mathbf{A} drawn from the distribution of interest, and then deploying it on different realizations from the same distribution. The learned weights encode the idiosyncrasies of the training matrix and collapse on novel ones.

Correction:

Decide upfront whether the sensing matrix is fixed (e.g., a calibrated imaging system, a trained sparse code) or random per sample (e.g., random masks, fresh pilot realizations). In the fixed case, training with the single matrix is correct. In the random case, resample \mathbf{A} every mini-batch during training so that the learned parameters generalize over the ensemble. Mismatch between training and deployment is a leading cause of disappointing unrolled-network results in practice.
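A sketch of ensemble training for the random-matrix regime, with a fresh \mathbf{A} drawn for every mini-batch; the dimensions, activity level, and noise level are illustrative, and the actual network update is elided.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, rho, batch = 500, 250, 0.1, 64
for step in range(1000):
    # Resample the sensing matrix every mini-batch so the network sees the ensemble.
    A = rng.standard_normal((M, N)) / np.sqrt(M)
    x = (rng.random((batch, N)) < rho) * rng.standard_normal((batch, N))
    y = x @ A.T + 0.01 * rng.standard_normal((batch, M))
    # ... forward pass with (y, A), loss against x, optimizer step ...
```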

LISTA

Learned ISTA – an unrolled neural network with the ISTA iteration as its layer template. Parameters (\mathbf{W}_t,\mathbf{S}_t,\lambda_t) are trained end-to-end to minimize reconstruction MSE. Achieves ISTA's asymptotic accuracy in far fewer layers than iterations.

Related: LAMP, Deep unfolding (algorithm unrolling)

LAMP

Learned AMP – an unrolled AMP iteration with learnable feedback matrix \mathbf{B}_t, denoiser parameters, and Onsager scalar b_t. Retains AMP's interpretability while adapting to empirical signal and matrix distributions.

Related: LISTA, Deep unfolding (algorithm unrolling)

Deep unfolding (algorithm unrolling)

A design paradigm that converts an iterative algorithm into a deep neural network by (i) unrolling a fixed number of iterations into layers, (ii) declaring per-layer parameters as trainable, and (iii) fitting them by end-to-end back-propagation. Combines the inductive bias of classical algorithms with the adaptivity of learned models.

Related: LISTA, LAMP

Quick Check

What is the principal advantage of LISTA over ISTA when both are run for T iterations / layers?

LISTA has a strictly convex loss function, while ISTA does not.

LISTA learns per-layer parameters by end-to-end training, so it reaches low MSE in far fewer layers.

LISTA does not require a sparsity assumption, while ISTA does.

LISTA provably recovers the exact LASSO solution, while ISTA only approximates it.

Why This Matters: Unrolled VAMP for Wireless Channel Estimation

Pilot-based channel estimation in OFDM and massive-MIMO uplinks often reduces to a structured compressed-sensing problem: a sparse delay-Doppler-angular channel observed through a partial DFT or Kronecker dictionary. LDVAMP is a natural fit – the LMMSE step uses the known dictionary, while the learned prior denoiser captures dataset-specific channel statistics (clustering of multipath components, angular selectivity, Doppler coherence) that an analytical prior would miss.

The CommIT group has explored unrolled-VAMP pipelines for RF imaging and joint channel-activity estimation in unsourced random access, where the mix of known structure (sensing operator) and unknown data-driven priors (channel clusters) is exactly where unrolled networks outperform both hand-designed message passing and generic deep learning.

See full treatment in Chapter 27, Section sec-lista-imaging.