AMP from First Principles

Why AMP?

Chapters 17 and 18 established ISTA (Iterative Soft-Thresholding Algorithm) as the canonical first-order method for LASSO-type sparse recovery. ISTA is simple and convergent, but in the proportional asymptotic regime ($M, N \to \infty$ with $\delta = M/N$ fixed) its per-iteration behaviour does not admit a sharp closed-form characterisation. Each iterate $\hat{\mathbf{x}}^t$ has a residual distribution that drifts away from Gaussianity because of feedback through $\mathbf{A}^{H}\mathbf{A}$.

Approximate Message Passing (AMP), introduced by Donoho, Maleki, and Montanari (2009), repairs this defect with a single additive term: the Onsager correction. This term is inherited from the TAP (Thouless-Anderson-Palmer) equations of spin-glass physics. With the correction in place, the effective noise at the denoiser input at every iteration is exactly Gaussian in the large-system limit, which makes the algorithm analytically tractable through a one-dimensional recursion called state evolution.

This section derives AMP as a careful approximation to loopy belief propagation on the dense factor graph of the compressed-sensing problem, explains the role of the Onsager term, and states the algorithm precisely.

Definition: Linear Observation Model

We observe
$$\mathbf{y} = \mathbf{A}\,\mathbf{x} + \mathbf{w}, \qquad \mathbf{A} \in \mathbb{R}^{M \times N}, \quad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_M),$$
with $\delta = M/N \in (0,1]$ fixed. The entries of $\mathbf{A}$ are i.i.d. $\mathcal{N}(0, 1/M)$. The signal $\mathbf{x}$ has i.i.d. entries from a prior $p_X$ (possibly sparse, e.g. Bernoulli-Gaussian). The goal is to recover $\mathbf{x}$ from $\mathbf{y}$.

The normalisation $\mathbb{E}[A_{ij}^2] = 1/M$ is the AMP convention: the columns of $\mathbf{A}$ have unit norm in expectation, so $\|\mathbf{A}\mathbf{x}\|^2 \to \|\mathbf{x}\|^2$ by the law of large numbers. Different references scale differently, so check the convention before comparing formulas.
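As a concrete reference for this convention, here is a minimal NumPy sketch that draws one problem instance under the $1/M$ normalisation; the Bernoulli-Gaussian prior and all numerical values are illustrative choices, not fixed by the text.

import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 500                  # delta = M/N = 0.5 (illustrative)
rho, sigma = 0.15, 0.01           # sparsity and noise level (illustrative)

# i.i.d. N(0, 1/M) entries: columns of A have unit norm in expectation
A = rng.normal(0.0, np.sqrt(1.0 / M), size=(M, N))

# Bernoulli-Gaussian signal: each entry is non-zero with probability rho
x = np.where(rng.random(N) < rho, rng.normal(0.0, 1.0, N), 0.0)

y = A @ x + rng.normal(0.0, sigma, M)

# sanity check on the normalisation: ||Ax||^2 should be close to ||x||^2
print(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)   # approx. 1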

Definition: Denoiser

A denoiser is a scalar function $\eta: \mathbb{R} \times \Theta \to \mathbb{R}$, $(u, \theta) \mapsto \eta(u; \theta)$, applied componentwise to a vector input. We require $\eta(\cdot; \theta)$ to be Lipschitz and piecewise differentiable in its first argument. The derivative $\eta'(u;\theta) = \partial \eta / \partial u$ is the sensitivity of the denoiser.

Common choices include soft-thresholding $\eta_{\mathrm{st}}(u;\lambda) = \mathrm{sign}(u)(|u|-\lambda)_+$, the posterior mean $\eta_{\mathrm{mmse}}(u;\tau^2) = \mathbb{E}[X \mid X+\tau Z=u]$, and, more recently, neural networks trained for denoising in AWGN.
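A minimal NumPy sketch of the soft-thresholding denoiser and its almost-everywhere derivative (the function names are illustrative):

import numpy as np

def soft_threshold(u, lam):
    """Soft-thresholding denoiser: sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def soft_threshold_deriv(u, lam):
    """Derivative with respect to u: 1 where |u| > lam, 0 inside the dead zone."""
    return (np.abs(u) > lam).astype(float)

u = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(u, 1.0))        # entries with |u| <= 1 are set to 0, others shrunk by 1
print(soft_threshold_deriv(u, 1.0))  # indicator of the surviving entries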

Onsager correction

A memory term $\frac{1}{\delta}\langle\eta'(\cdot)\rangle \mathbf{r}^{t-1}$ added to the AMP residual. It cancels the bias that loopy feedback through $\mathbf{A}^{H}\mathbf{A}$ would otherwise inject into the residual, thereby Gaussianising the denoiser input in the large-system limit.

Related: AMP iteration, State Evolution

AMP iteration

The two-line update $\hat{\mathbf{x}}^{t+1} = \eta(\mathbf{A}^{H}\mathbf{r}^t + \hat{\mathbf{x}}^t; \theta_t)$ and $\mathbf{r}^t = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}^t + \delta^{-1}\langle\eta'\rangle\mathbf{r}^{t-1}$ that jointly constitute approximate message passing.

Related: Onsager correction, State Evolution

Sketch: From Loopy BP to AMP

Write the posterior as $p(\mathbf{x}\mid\mathbf{y}) \propto \prod_i p_X(x_i) \prod_a \mathcal{N}(y_a; \mathbf{a}_a^T\mathbf{x}, \sigma^2)$ and run sum-product on the bipartite factor graph with $N$ variable nodes and $M$ factor nodes. Each factor-to-variable message is a convolution of $N-1$ independent distributions; by the CLT, in the large-$N$ regime it is approximately Gaussian and is characterised by its mean and variance alone.

Tracking $2MN$ Gaussian messages is still too expensive. The relaxed-BP simplification replaces the message $m_{a \to i}$ by its Taylor expansion around the marginal means, which produces an algorithm of complexity $O(MN)$ per iteration. However, relaxed BP retains a bias that grows with the effective correlation between iterations.

AMP is obtained by a further, careful cancellation: the Onsager term $\delta^{-1}\langle\eta'\rangle\mathbf{r}^{t-1}$ removes precisely the bias that relaxed BP would accumulate. The full derivation is algebraic and is spelled out in Bayati-Montanari (2011, App. A) and in Rangan's GAMP paper.

Approximate Message Passing (AMP)

Complexity: Per iteration: one $M \times N$ matrix-vector product $\mathbf{A}\hat{\mathbf{x}}^{t+1}$ and one $N \times M$ product $\mathbf{A}^{H}\mathbf{r}^t$. Total: $O(MN)$ flops, identical to ISTA.
Input: y in R^M, A in R^{M x N}, denoiser eta(.; theta), iterations T
Output: estimate x_hat_T
Initialize: x_hat_0 = 0 in R^N, r_0 = y
delta = M / N
for t = 0, 1, ..., T-1:
    # Denoising step
    u_t = A^T r_t + x_hat_t              # pseudo-data in R^N
    x_hat_{t+1} = eta(u_t; theta_t)      # apply denoiser
    # Residual update with Onsager correction
    b_t = (1/delta) * mean(eta'(u_t; theta_t))
    r_{t+1} = y - A x_hat_{t+1} + b_t * r_t
return x_hat_T

The only structural difference from ISTA is the term $b_t \mathbf{r}^t$ in the residual. Drop it and you recover ISTA with gradient step $\mathbf{A}^H\mathbf{r}$.
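To make the pseudocode concrete, here is a minimal NumPy sketch of the same loop (roughly the fifteen-line Python version the engineering note below alludes to). The fixed soft threshold `lam` and the optional NMSE bookkeeping via `x_true` are illustrative choices, not part of the algorithm definition.

import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def amp(y, A, lam, T=30, x_true=None):
    """AMP with a fixed soft threshold; returns the estimate and, optionally, the NMSE per iteration."""
    M, N = A.shape
    delta = M / N
    x_hat, r = np.zeros(N), y.copy()          # x_hat_0 = 0, hence r_0 = y
    nmse = []
    for _ in range(T):
        u = A.T @ r + x_hat                   # pseudo-data
        x_hat = soft_threshold(u, lam)        # denoising step
        b = (np.abs(u) > lam).mean() / delta  # Onsager coefficient (1/delta) * <eta'>
        r = y - A @ x_hat + b * r             # residual update with Onsager correction
        if x_true is not None:
            nmse.append(np.sum((x_hat - x_true) ** 2) / np.sum(x_true ** 2))
    return x_hat, nmse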


What the Onsager Term Actually Does

The key observation is that $\mathbf{A}^{H}\mathbf{r}^t$ is not white Gaussian noise added to $\mathbf{x}$: it is correlated with $\hat{\mathbf{x}}^t$ through the history of iterates. ISTA ignores this correlation, which is why its per-iteration distribution is not Gaussian and why sharp analytical predictions for ISTA are hard to obtain.

The Onsager term $b_t \mathbf{r}^t$ is precisely the first-order correction that removes this correlation. After its inclusion, one can prove (see Section 20.2) that
$$\mathbf{A}^{H}\mathbf{r}^t + \hat{\mathbf{x}}^t \stackrel{d}{=} \mathbf{x} + \tau_t \mathbf{Z}, \qquad \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N),$$
for a deterministic scalar $\tau_t$: the pseudo-data looks like the signal plus independent Gaussian noise. This is the property that justifies calling $\eta$ a denoiser.
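This claim is easy to probe numerically. The following self-contained sketch (all parameter values illustrative) runs a few AMP iterations and inspects the effective noise $\mathbf{u}^t - \mathbf{x}$ at the denoiser input; if the pseudo-data behaves like signal plus Gaussian noise, its empirical excess kurtosis should sit near zero.

import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 2000, 1000, 0.5
x = np.where(rng.random(N) < 0.1, rng.normal(size=N), 0.0)
A = rng.normal(0.0, np.sqrt(1.0 / M), (M, N))
y = A @ x + rng.normal(0.0, 0.01, M)

x_hat, r, delta = np.zeros(N), y.copy(), M / N
for _ in range(10):
    u = A.T @ r + x_hat                                    # pseudo-data
    x_hat = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)  # soft threshold
    r = y - A @ x_hat + (np.abs(u) > lam).mean() / delta * r

z = u - x                                  # effective noise at the denoiser input
z_std = (z - z.mean()) / z.std()
print("empirical tau_t:", z.std())
print("excess kurtosis:", np.mean(z_std ** 4) - 3)   # close to 0 for Gaussian-like noise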

Common Mistake: Forgetting the Onsager Correction

Mistake:

A common mistake when implementing AMP from scratch is to drop the term $b_t \mathbf{r}^t$ because "it looks like ISTA anyway". The resulting algorithm is ISTA in disguise and does not exhibit the sharp phase-transition behaviour that makes AMP theoretically attractive.

Correction:

The Onsager term is the defining feature of AMP, not a bookkeeping detail. It costs one vector scale-and-add per iteration and changes the asymptotic behaviour from loose to tight. Always verify that $b_t$ is recomputed at every iteration as the empirical mean of $\eta'$ on the current pseudo-data.

Example: AMP with Soft-Thresholding (L1-AMP)

Specialise AMP to the soft-threshold denoiser $\eta(u;\lambda) = \mathrm{sign}(u)(|u|-\lambda)_+$ and write out the resulting iteration. Compute $\langle\eta'\rangle$ in closed form.
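As a hint for the closed form: since $\eta'_{\mathrm{st}}(u;\lambda) = \mathbf{1}\{|u|>\lambda\}$, the empirical average is just the fraction of pseudo-data entries that survive the threshold, which is the fraction of non-zeros in the new estimate,
$$\langle\eta'\rangle = \frac{1}{N}\,\#\{i : |u_i^t| > \lambda_t\} = \frac{\|\hat{\mathbf{x}}^{t+1}\|_0}{N}, \qquad \text{so} \qquad \frac{1}{\delta}\langle\eta'\rangle = \frac{\|\hat{\mathbf{x}}^{t+1}\|_0}{M}.$$
The L1-AMP Onsager coefficient is therefore the number of non-zeros in the current estimate divided by $M$.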

AMP Iteration Trajectory

Run AMP with a soft-threshold denoiser on a synthetic Bernoulli-Gaussian signal and plot the normalised MSE $\|\hat{\mathbf{x}}^t - \mathbf{x}\|^2 / \|\mathbf{x}\|^2$ versus iteration $t$. Compare against ISTA (same update, Onsager term removed). For $\delta$ large enough relative to the sparsity, AMP converges in a handful of iterations; ISTA needs hundreds.

Parameters: signal dimension 1000; undersampling ratio 0.5; sparsity (fraction of non-zeros) 0.15; signal-to-noise ratio 30.
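A minimal driver for the AMP half of this experiment, reusing the `amp` sketch given after the algorithm box (the parameter values mirror the defaults listed above; the SNR handling and the fixed threshold are illustrative choices). The ISTA baseline is obtained as described in the text by removing the Onsager correction.

import numpy as np

rng = np.random.default_rng(1)
N, M, rho, snr_db = 1000, 500, 0.15, 30
x = np.where(rng.random(N) < rho, rng.normal(size=N), 0.0)
A = rng.normal(0.0, np.sqrt(1.0 / M), (M, N))
sigma = np.sqrt(np.sum((A @ x) ** 2) / M / 10 ** (snr_db / 10))   # noise level for the target SNR
y = A @ x + rng.normal(0.0, sigma, M)

x_hat, nmse = amp(y, A, lam=0.4, T=30, x_true=x)   # assumes the amp() sketch above
for t, e in enumerate(nmse):
    print(f"iteration {t + 1}: NMSE = {e:.4f}")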

Soft-Thresholding Denoiser

Visualise $\eta_{\mathrm{st}}(u;\lambda) = \mathrm{sign}(u)(|u|-\lambda)_+$ together with its derivative $\eta'(u;\lambda) = \mathbf{1}\{|u|>\lambda\}$. Adjust $\lambda$ to see how the dead-zone widens and the Onsager coefficient $\langle\eta'\rangle$ changes accordingly.

Parameters: threshold 1.
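A static matplotlib sketch of the same picture (grid and threshold value are illustrative):

import numpy as np
import matplotlib.pyplot as plt

lam = 1.0
u = np.linspace(-3, 3, 601)
eta = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
eta_prime = (np.abs(u) > lam).astype(float)

plt.plot(u, eta, label="soft threshold")
plt.step(u, eta_prime, where="mid", label="derivative")
plt.axvspan(-lam, lam, alpha=0.1, label="dead zone")
plt.xlabel("u")
plt.legend()
plt.show()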

⚠️ Engineering Note

Implementation Notes for AMP

In practice, AMP is startlingly simple to implement — a clean MATLAB/Python version is about 15 lines. The subtle implementation points are:

  1. Matrix normalisation. Use $\mathbb{E}[A_{ij}^2]=1/M$. If you use a different normalisation, the Onsager coefficient and the state-evolution formulas change accordingly.
  2. Threshold schedule. Tuning $\lambda_t$ matters more than the exact denoiser form. Parameter-free AMP (Montanari 2012) sets $\lambda_t = \alpha \hat{\tau}_t$ with $\alpha \approx 1.1681\ldots$ (the minimax-optimal value for equal-power sparse vectors); a code sketch of this schedule, together with the divergence check from point 3, follows this list.
  3. Divergence monitoring. Track $\|\mathbf{r}^t\|^2/M$ iteration by iteration. If it stops monotonically decreasing, the matrix is not sufficiently i.i.d. and AMP is diverging; consider damping or switch to OAMP/VAMP (Chapter 21).
  4. Finite-sample effects. The Gaussianity of the pseudo-data is an asymptotic prediction. For $N \lesssim 200$ expect noticeable deviations; for $N \ge 2000$ the state-evolution prediction tracks empirical MSE to within 1-2%.
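A sketch of how points 2 and 3 might look in code. The residual-based estimate $\hat{\tau}_t = \|\mathbf{r}^t\|_2/\sqrt{M}$ is the usual choice in the AMP literature; the warning logic and the default constants here are illustrative.

import numpy as np

def amp_adaptive(y, A, alpha=1.1681, T=30):
    """AMP with the residual-based threshold schedule lambda_t = alpha * tau_hat_t
    and a simple divergence check."""
    M, N = A.shape
    delta = M / N
    x_hat, r = np.zeros(N), y.copy()
    prev_res = np.inf
    for t in range(T):
        tau_hat = np.linalg.norm(r) / np.sqrt(M)   # point 2: estimate tau_t from the residual
        lam = alpha * tau_hat
        u = A.T @ r + x_hat
        x_hat = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
        b = (np.abs(u) > lam).mean() / delta
        r = y - A @ x_hat + b * r
        res = np.sum(r ** 2) / M                    # point 3: monitor ||r^t||^2 / M
        if res > prev_res:
            print(f"warning: residual increased at iteration {t}; consider damping")
        prev_res = res
    return x_hat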
Practical Constraints
  • Dense $\mathbf{A}$ with i.i.d. (approximately) Gaussian entries

  • Per-iteration cost $O(MN)$

  • Requires a Lipschitz, piecewise differentiable denoiser

Historical Note: TAP Equations and Onsager in Spin Glasses

1936-2009

The Onsager correction term is named after Lars Onsager, who first identified a reaction-field correction in his 1936 work on polar liquids. Thouless, Anderson, and Palmer re-derived an analogous term in 1977 for the Sherrington-Kirkpatrick (SK) spin-glass model, where it is essential for getting the right mean-field equations on dense, random interaction graphs.

Donoho, Maleki, and Montanari (2009) observed that the compressed-sensing problem with i.i.d. Gaussian $\mathbf{A}$ has exactly the structure of an SK model (each observation couples to all unknowns with independent random strengths) and imported the TAP construction into signal processing. The name "Approximate Message Passing" comes from the alternative derivation as an approximation to loopy belief propagation on the dense factor graph; the two perspectives give the same algorithm.

Quick Check

What is the primary role of the Onsager term in AMP?

Accelerate convergence by Nesterov-style momentum

Ensure the denoiser input $\mathbf{A}^{H}\mathbf{r}^t+\hat{\mathbf{x}}^t$ is distributed as the true signal plus independent Gaussian noise

Regularise $\mathbf{x}$ to enforce sparsity

Guarantee global convergence even for ill-conditioned $\mathbf{A}$

Quick Check

What is the per-iteration computational complexity of AMP?

$O(MN)$, identical to ISTA

$O(MN^2)$ because of the Onsager correction

$O(N^3)$ because of a matrix inversion

$O(N\log N)$ using FFT

Key Takeaway

AMP is ISTA plus one extra term. That extra term, the Onsager correction $\delta^{-1}\langle\eta'\rangle\,\mathbf{r}^{t-1}$, is what turns a generic proximal-gradient iteration into an algorithm with a tight, one-dimensional, predictable trajectory in the large-system limit.