AMP from First Principles

Why AMP?

Chapters 17 and 18 established ISTA (Iterative Soft-Thresholding Algorithm) as the canonical first-order method for LASSO-type sparse recovery. ISTA is simple and convergent, but in the proportional asymptotic regime ($M, N \to \infty$ with $\delta = M/N$ fixed) its per-iteration behaviour does not admit a sharp closed-form characterisation. Each iterate $\hat{\mathbf{x}}^t$ has a residual distribution that drifts away from Gaussianity because of feedback through $\mathbf{A}^{H}\mathbf{A}$.

Approximate Message Passing (AMP), introduced by Donoho, Maleki, and Montanari (2009), repairs this defect with a single additive term: the Onsager correction. This term is inherited from the TAP (Thouless-Anderson-Palmer) equations of spin-glass physics. With the correction in place, the effective noise at the denoiser input at every iteration is exactly Gaussian in the large-system limit, which makes the algorithm analytically tractable through a one-dimensional recursion called state evolution.

This section derives AMP as a careful approximation to loopy belief propagation on the dense factor graph of the compressed-sensing problem, explains the role of the Onsager term, and states the algorithm precisely.

Definition: Linear Observation Model

We observe
$$\mathbf{y} = \mathbf{A}\,\mathbf{x} + \mathbf{w}, \qquad \mathbf{A} \in \mathbb{R}^{M \times N}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_M),$$
with $\delta = M/N \in (0,1]$ fixed. The entries of $\mathbf{A}$ are i.i.d. $\mathcal{N}(0, 1/M)$. The signal $\mathbf{x}$ has i.i.d. entries from a prior $p_X$ (possibly sparse, e.g. Bernoulli-Gaussian). The goal is to recover $\mathbf{x}$ from $\mathbf{y}$.

The normalisation $\mathbb{E}[A_{ij}^2] = 1/M$ is the AMP convention: the columns of $\mathbf{A}$ have unit norm in expectation, so $\|\mathbf{A}\mathbf{x}\|^2 \to \|\mathbf{x}\|^2$ by the law of large numbers. Different references scale differently, so check the convention before comparing formulas.
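As a concrete reference for this convention, here is a minimal NumPy sketch that draws one problem instance under the $1/M$ normalisation; the Bernoulli-Gaussian prior and all numerical values are illustrative choices, not fixed by the text.

import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 500                  # delta = M/N = 0.5 (illustrative)
rho, sigma = 0.15, 0.01           # sparsity and noise level (illustrative)

# i.i.d. N(0, 1/M) entries: columns of A have unit norm in expectation
A = rng.normal(0.0, np.sqrt(1.0 / M), size=(M, N))

# Bernoulli-Gaussian signal: each entry is non-zero with probability rho
x = np.where(rng.random(N) < rho, rng.normal(0.0, 1.0, N), 0.0)

y = A @ x + rng.normal(0.0, sigma, M)

# sanity check on the normalisation: ||Ax||^2 should be close to ||x||^2
print(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)   # approx. 1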

Definition: Denoiser

A denoiser is a scalar function $\eta: \mathbb{R} \times \Theta \to \mathbb{R}$, $(u, \theta) \mapsto \eta(u; \theta)$, applied componentwise to a vector input. We require $\eta(\cdot; \theta)$ to be Lipschitz and piecewise differentiable in its first argument. The derivative $\eta'(u;\theta) = \partial \eta / \partial u$ is the sensitivity of the denoiser.

Common choices include soft-thresholding $\eta_{\mathrm{st}}(u;\lambda) = \mathrm{sign}(u)(|u|-\lambda)_+$, the posterior mean $\eta_{\mathrm{mmse}}(u;\tau^2) = \mathbb{E}[X \mid X+\tau Z=u]$, and, more recently, neural networks trained for denoising in AWGN.
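A minimal NumPy sketch of the soft-thresholding denoiser and its almost-everywhere derivative (the function names are illustrative):

import numpy as np

def soft_threshold(u, lam):
    """Soft-thresholding denoiser: sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def soft_threshold_deriv(u, lam):
    """Derivative with respect to u: 1 where |u| > lam, 0 inside the dead zone."""
    return (np.abs(u) > lam).astype(float)

u = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(u, 1.0))        # entries with |u| <= 1 are set to 0, others shrunk by 1
print(soft_threshold_deriv(u, 1.0))  # indicator of the surviving entries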

Onsager correction

A memory term $\frac{1}{\delta}\langle\eta'(\cdot)\rangle \mathbf{r}^{t-1}$ added to the AMP residual. It cancels the bias that loopy feedback through $\mathbf{A}^{H}\mathbf{A}$ would otherwise inject into the residual, thereby Gaussianising the denoiser input in the large-system limit.

Related: AMP iteration, State Evolution

AMP iteration

The two-line update $\hat{\mathbf{x}}^{t+1} = \eta(\mathbf{A}^{H}\mathbf{r}^t + \hat{\mathbf{x}}^t; \theta_t)$ and $\mathbf{r}^t = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}^t + \delta^{-1}\langle\eta'\rangle\mathbf{r}^{t-1}$ that jointly constitute approximate message passing.

Related: Onsager correction, State Evolution

Sketch: From Loopy BP to AMP

Write the posterior as $p(\mathbf{x}\mid\mathbf{y}) \propto \prod_i p_X(x_i) \prod_a \mathcal{N}(y_a; \mathbf{a}_a^T\mathbf{x}, \sigma^2)$ and run sum-product on the bipartite factor graph with $N$ variable nodes and $M$ factor nodes. Each factor-to-variable message is a convolution of $N-1$ independent distributions; by the CLT, in the large-$N$ regime it is approximately Gaussian and is characterised by its mean and variance alone.

Tracking $2MN$ Gaussian messages is still too expensive. The relaxed-BP simplification replaces the message $m_{a \to i}$ by its Taylor expansion around the marginal means, which produces an algorithm of complexity $O(MN)$ per iteration. However, relaxed BP retains a bias that grows with the effective correlation between iterations.

AMP is obtained by a further, careful cancellation: the Onsager term $\delta^{-1}\langle\eta'\rangle\mathbf{r}^{t-1}$ removes precisely the bias that relaxed BP would accumulate. The full derivation is algebraic and is spelled out in Bayati-Montanari (2011, App. A) and in Rangan's GAMP paper.

Approximate Message Passing (AMP)

Complexity: Per iteration: one $M \times N$ matrix-vector product $\mathbf{A}\hat{\mathbf{x}}^{t+1}$ and one $N \times M$ product $\mathbf{A}^{H}\mathbf{r}^t$. Total: $O(MN)$ flops, identical to ISTA.
Input: y in R^M, A in R^{M x N}, denoiser eta(.; theta), iterations T
Output: estimate x_hat_T
Initialize: x_hat_0 = 0 in R^N, r_0 = y
delta = M / N
for t = 0, 1, ..., T-1:
    # Denoising step
    u_t = A^T r_t + x_hat_t              # pseudo-data in R^N
    x_hat_{t+1} = eta(u_t; theta_t)      # apply denoiser
    # Residual update with Onsager correction
    b_t = (1/delta) * mean(eta'(u_t; theta_t))
    r_{t+1} = y - A x_hat_{t+1} + b_t * r_t
return x_hat_T

The only structural difference from ISTA is the term $b_t \mathbf{r}^t$ in the residual. Drop it and you recover ISTA with gradient step $\mathbf{A}^H\mathbf{r}$.
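To make the pseudocode concrete, here is a minimal NumPy sketch of the same loop (roughly the fifteen-line Python version the engineering note below alludes to). The fixed soft threshold `lam` and the optional NMSE bookkeeping via `x_true` are illustrative choices, not part of the algorithm definition.

import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def amp(y, A, lam, T=30, x_true=None):
    """AMP with a fixed soft threshold; returns the estimate and, optionally, the NMSE per iteration."""
    M, N = A.shape
    delta = M / N
    x_hat, r = np.zeros(N), y.copy()          # x_hat_0 = 0, hence r_0 = y
    nmse = []
    for _ in range(T):
        u = A.T @ r + x_hat                   # pseudo-data
        x_hat = soft_threshold(u, lam)        # denoising step
        b = (np.abs(u) > lam).mean() / delta  # Onsager coefficient (1/delta) * <eta'>
        r = y - A @ x_hat + b * r             # residual update with Onsager correction
        if x_true is not None:
            nmse.append(np.sum((x_hat - x_true) ** 2) / np.sum(x_true ** 2))
    return x_hat, nmse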


What the Onsager Term Actually Does

The key observation is that $\mathbf{A}^{H}\mathbf{r}^t$ is not white Gaussian noise added to $\mathbf{x}$: it is correlated with $\hat{\mathbf{x}}^t$ through the history of iterates. ISTA ignores this correlation, which is why its per-iteration distribution is not Gaussian and why sharp analytical predictions for ISTA are hard to obtain.

The Onsager term $b_t \mathbf{r}^t$ is precisely the first-order correction that removes this correlation. After its inclusion, one can prove (see Section 20.2) that
$$\mathbf{A}^{H}\mathbf{r}^t + \hat{\mathbf{x}}^t \stackrel{d}{=} \mathbf{x} + \tau_t \mathbf{Z}, \qquad \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N),$$
for a deterministic scalar $\tau_t$: the pseudo-data looks like the signal plus independent Gaussian noise. This is the property that justifies calling $\eta$ a denoiser.
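This claim is easy to probe numerically. The following self-contained sketch (all parameter values illustrative) runs a few AMP iterations and inspects the effective noise $\mathbf{u}^t - \mathbf{x}$ at the denoiser input; if the pseudo-data behaves like signal plus Gaussian noise, its empirical excess kurtosis should sit near zero.

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 2000, 1000, 0.5
x = np.where(rng.random(N) < 0.1, rng.normal(size=N), 0.0)
A = rng.normal(0.0, np.sqrt(1.0 / M), (M, N))
y = A @ x + rng.normal(0.0, 0.01, M)

x_hat, r, delta = np.zeros(N), y.copy(), M / N
for _ in range(10):
    u = A.T @ r + x_hat                                    # pseudo-data
    x_hat = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft threshold
    r = y - A @ x_hat + (np.abs(u) > lam).mean() / delta * r

z = u - x                                  # effective noise at the denoiser input
z_std = (z - z.mean()) / z.std()
print("empirical tau_t:", z.std())
print("excess kurtosis:", np.mean(z_std ** 4) - 3)   # close to 0 for Gaussian-like noise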

Common Mistake: Forgetting the Onsager Correction

Mistake:

A common mistake when implementing AMP from scratch is to drop the term $b_t \mathbf{r}^t$ because "it looks like ISTA anyway". The resulting algorithm is ISTA in disguise and does not exhibit the sharp phase-transition behaviour that makes AMP theoretically attractive.

Correction:

The Onsager term is the defining feature of AMP, not a bookkeeping detail. It costs one vector scale-and-add per iteration and changes the asymptotic behaviour from loose to tight. Always verify that $b_t$ is recomputed at every iteration as the empirical mean of $\eta'$ on the current pseudo-data.

Example: AMP with Soft-Thresholding (L1-AMP)

Specialise AMP to the soft-threshold denoiser $\eta(u;\lambda) = \mathrm{sign}(u)(|u|-\lambda)_+$ and write out the resulting iteration. Compute $\langle\eta'\rangle$ in closed form.
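As a hint for the closed form: since $\eta'_{\mathrm{st}}(u;\lambda) = \mathbf{1}\{|u|>\lambda\}$, the empirical average is just the fraction of pseudo-data entries that survive the threshold, which is the fraction of non-zeros in the new estimate,
$$\langle\eta'\rangle = \frac{1}{N}\,\#\{i : |u_i^t| > \lambda_t\} = \frac{\|\hat{\mathbf{x}}^{t+1}\|_0}{N}, \qquad \text{so} \qquad \frac{1}{\delta}\langle\eta'\rangle = \frac{\|\hat{\mathbf{x}}^{t+1}\|_0}{M}.$$
The L1-AMP Onsager coefficient is therefore the number of non-zeros in the current estimate divided by $M$.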

AMP Iteration Trajectory

Run AMP with a soft-threshold denoiser on a synthetic Bernoulli-Gaussian signal and plot the normalised MSE $\|\hat{\mathbf{x}}^t - \mathbf{x}\|^2 / \|\mathbf{x}\|^2$ versus iteration $t$. Compare against ISTA (same update, Onsager term removed). For $\delta$ large enough relative to the sparsity, AMP converges in a handful of iterations; ISTA needs hundreds.

Parameters: signal dimension 1000; undersampling ratio 0.5; sparsity (fraction of non-zeros) 0.15; signal-to-noise ratio 30.
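A minimal driver for the AMP half of this experiment, reusing the `amp` sketch given after the algorithm box (the parameter values mirror the defaults listed above; the SNR handling and the fixed threshold are illustrative choices). The ISTA baseline is obtained as described in the text by removing the Onsager correction.

import numpy as np

rng = np.random.default_rng(1)
N, M, rho, snr_db = 1000, 500, 0.15, 30
x = np.where(rng.random(N) < rho, rng.normal(size=N), 0.0)
A = rng.normal(0.0, np.sqrt(1.0 / M), (M, N))
sigma = np.sqrt(np.sum((A @ x) ** 2) / M / 10 ** (snr_db / 10))   # noise level for the target SNR
y = A @ x + rng.normal(0.0, sigma, M)

x_hat, nmse = amp(y, A, lam=0.4, T=30, x_true=x)   # assumes the amp() sketch above
for t, e in enumerate(nmse):
    print(f"iteration {t + 1}: NMSE = {e:.4f}")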

Soft-Thresholding Denoiser

Visualise $\eta_{\mathrm{st}}(u;\lambda) = \mathrm{sign}(u)(|u|-\lambda)_+$ together with its derivative $\eta'(u;\lambda) = \mathbf{1}\{|u|>\lambda\}$. Adjust $\lambda$ to see how the dead-zone widens and the Onsager coefficient $\langle\eta'\rangle$ changes accordingly.

Parameters: threshold 1.
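A static matplotlib sketch of the same picture (grid and threshold value are illustrative):

import numpy as np
import matplotlib.pyplot as plt

lam = 1.0
u = np.linspace(-3, 3, 601)
eta = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
eta_prime = (np.abs(u) > lam).astype(float)

plt.plot(u, eta, label="soft threshold")
plt.step(u, eta_prime, where="mid", label="derivative")
plt.axvspan(-lam, lam, alpha=0.1, label="dead zone")
plt.xlabel("u")
plt.legend()
plt.show()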

⚠️ Engineering Note

Implementation Notes for AMP

In practice, AMP is startlingly simple to implement — a clean MATLAB/Python version is about 15 lines. The subtle implementation points are:

  1. Matrix normalisation. Use $\mathbb{E}[A_{ij}^2]=1/M$. If you use a different normalisation, the Onsager coefficient and the state-evolution formulas change accordingly.
  2. Threshold schedule. Tuning $\lambda_t$ matters more than the exact denoiser form. Parameter-free AMP (Montanari 2012) sets $\lambda_t = \alpha \hat{\tau}_t$ with $\alpha \approx 1.1681\ldots$ (the minimax-optimal value for equal-power sparse vectors); a code sketch of this schedule, together with the divergence check from point 3, follows this list.
  3. Divergence monitoring. Track $\|\mathbf{r}^t\|^2/M$ iteration by iteration. If it stops monotonically decreasing, the matrix is not sufficiently i.i.d. and AMP is diverging; consider damping or switch to OAMP/VAMP (Chapter 21).
  4. Finite-sample effects. The Gaussianity of the pseudo-data is an asymptotic prediction. For $N \lesssim 200$ expect noticeable deviations; for $N \ge 2000$ the state-evolution prediction tracks empirical MSE to within 1-2%.
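A sketch of how points 2 and 3 might look in code. The residual-based estimate $\hat{\tau}_t = \|\mathbf{r}^t\|_2/\sqrt{M}$ is the usual choice in the AMP literature; the warning logic and the default constants here are illustrative.

import numpy as np

def amp_adaptive(y, A, alpha=1.1681, T=30):
    """AMP with the residual-based threshold schedule lambda_t = alpha * tau_hat_t
    and a simple divergence check."""
    M, N = A.shape
    delta = M / N
    x_hat, r = np.zeros(N), y.copy()
    prev_res = np.inf
    for t in range(T):
        tau_hat = np.linalg.norm(r) / np.sqrt(M)   # point 2: estimate tau_t from the residual
        lam = alpha * tau_hat
        u = A.T @ r + x_hat
        x_hat = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
        b = (np.abs(u) > lam).mean() / delta
        r = y - A @ x_hat + b * r
        res = np.sum(r ** 2) / M                    # point 3: monitor ||r^t||^2 / M
        if res > prev_res:
            print(f"warning: residual increased at iteration {t}; consider damping")
        prev_res = res
    return x_hat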
Practical Constraints
  • Dense $\mathbf{A}$ with i.i.d. (approximately) Gaussian entries

  • Per-iteration cost $O(MN)$

  • Requires a Lipschitz, piecewise differentiable denoiser

Historical Note: TAP Equations and Onsager in Spin Glasses

1936-2009

The Onsager correction term is named after Lars Onsager, who first identified a reaction-field correction in his 1936 work on polar liquids. Thouless, Anderson, and Palmer re-derived an analogous term in 1977 for the Sherrington-Kirkpatrick (SK) spin-glass model, where it is essential for getting the right mean-field equations on dense, random interaction graphs.

Donoho, Maleki, and Montanari (2009) observed that the compressed-sensing problem with i.i.d. Gaussian $\mathbf{A}$ has exactly the structure of an SK model (each observation couples to all unknowns with independent random strengths) and imported the TAP construction into signal processing. The name "Approximate Message Passing" comes from the alternative derivation as an approximation to loopy belief propagation on the dense factor graph; the two perspectives give the same algorithm.

Quick Check

What is the primary role of the Onsager term in AMP?

Accelerate convergence by Nesterov-style momentum

Ensure the denoiser input $\mathbf{A}^{H}\mathbf{r}^t+\hat{\mathbf{x}}^t$ is distributed as the true signal plus independent Gaussian noise

Regularise $\mathbf{x}$ to enforce sparsity

Guarantee global convergence even for ill-conditioned $\mathbf{A}$

Quick Check

What is the per-iteration computational complexity of AMP?

$O(MN)$, identical to ISTA

$O(MN^2)$ because of the Onsager correction

$O(N^3)$ because of a matrix inversion

$O(N\log N)$ using FFT

Key Takeaway

AMP is ISTA plus one extra term. That extra term, the Onsager correction $\delta^{-1}\langle\eta'\rangle\,\mathbf{r}^{t-1}$, is what turns a generic proximal-gradient iteration into an algorithm with a tight, one-dimensional, predictable trajectory in the large-system limit.