Automatic Differentiation for Inverse Problems

Why Automatic Differentiation Matters for Imaging

Three developments in modern computational imaging require efficient, exact gradient computation through complex computational graphs:

  1. Unrolled algorithms. An ISTA or ADMM solver unrolled for $T$ iterations becomes a $T$-layer neural network whose parameters (step sizes, regularization weights, denoiser parameters) are optimized end-to-end. Training requires backpropagation through all $T$ iterations.

  2. Differentiable rendering. The forward model $\mathbf{y} = \mathbf{A}\mathbf{c} + \mathbf{w}$ must be differentiable with respect to scene parameters (not just $\mathbf{c}$, but also geometry, material properties) for gradient-based optimization.

  3. PnP and RED convergence analysis. The convergence guarantees of Plug-and-Play and Regularization-by-Denoising depend on the spectral properties of the denoiser's Jacobian $\mathbf{J}_{\mathcal{D}}$. Computing this Jacobian for a neural denoiser requires AD.

Manual gradient derivation is error-prone and algorithm-specific. Automatic differentiation provides exact gradients for arbitrary compositions of differentiable operations.

Definition:

Automatic Differentiation

Automatic differentiation (AD) computes the derivative of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ implemented as a computer program by decomposing it into elementary operations (addition, multiplication, exp, sin, etc.) and applying the chain rule systematically.

Unlike symbolic differentiation (which manipulates mathematical expressions and can produce exponentially large formulas), AD operates on the numeric computation itself and always produces exact derivatives (up to floating-point precision).

Unlike numerical differentiation (finite differences), AD does not suffer from the tradeoff between truncation error (large step) and round-off error (small step).
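As a quick illustration, here is a minimal PyTorch sketch (the test function is arbitrary, chosen only for illustration) that contrasts the step-size sensitivity of central finite differences with the exactness of reverse-mode AD:

```python
import torch

# Arbitrary smooth test function with a known analytic derivative.
def f(x):
    return torch.sin(x) * torch.exp(x)

x = torch.tensor(1.3, dtype=torch.float64, requires_grad=True)
exact = ((torch.cos(x) + torch.sin(x)) * torch.exp(x)).detach()  # analytic f'(x)

# Reverse-mode AD: exact up to floating-point precision, no step size to tune.
(ad_grad,) = torch.autograd.grad(f(x), x)
print(f"AD    error = {abs(ad_grad - exact).item():.1e}")

# Central finite differences: truncation error for large h, round-off for small h.
x0 = x.detach()
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    fd = (f(x0 + h) - f(x0 - h)) / (2 * h)
    print(f"FD h={h:.0e} error = {abs(fd - exact).item():.1e}")
```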

AD is not a single algorithm but a family of strategies. The two primary modes, forward and reverse, differ in which direction they traverse the chain rule, and this difference has profound implications for computational cost.

Definition:

Jacobian, JVP, and VJP

For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian at point $\mathbf{x}$ is the $m \times n$ matrix

$$[\mathbf{J}_f(\mathbf{x})]_{ij} = \frac{\partial f_i}{\partial x_j}.$$

Two fundamental operations avoid forming the full Jacobian:

  1. Jacobian-vector product (JVP): $\mathbf{J}_f(\mathbf{x})\,\mathbf{v}$ for a given tangent vector $\mathbf{v} \in \mathbb{R}^n$. Cost: one forward pass through the computation.

  2. Vector-Jacobian product (VJP): $\mathbf{J}_f(\mathbf{x})^T\,\mathbf{u}$ for a given cotangent vector $\mathbf{u} \in \mathbb{R}^m$. Cost: one reverse pass (backpropagation).

The key asymmetry: a JVP maps one input perturbation to all outputs. A VJP maps one output perturbation to all inputs.
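Both products can be computed in PyTorch without ever materializing the Jacobian. A minimal sketch, using a tiny toy map as a stand-in for a forward model:

```python
import torch
from torch.autograd.functional import jvp, vjp, jacobian

# Toy map f : R^3 -> R^2, a stand-in for any differentiable forward model.
def f(x):
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

x = torch.randn(3)
v = torch.randn(3)  # tangent vector (input-space perturbation)
u = torch.randn(2)  # cotangent vector (output-space sensitivity)

_, Jv = jvp(f, x, v)    # J_f(x) v   -- one forward pass, Jacobian never formed
_, JTu = vjp(f, x, u)   # J_f(x)^T u -- one reverse pass

# Sanity check against the explicit 2x3 Jacobian (only feasible for tiny problems).
J = jacobian(f, x)
print(torch.allclose(Jv, J @ v), torch.allclose(JTu, J.T @ u))
```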


Theorem: Forward-Mode AD and the JVP

Let $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ be a composition of $L$ differentiable functions. Forward-mode AD computes the JVP $\mathbf{J}_f\,\mathbf{v}$ by propagating the tangent vector forward through the chain:

$$\dot{\mathbf{x}}_0 = \mathbf{v}, \qquad \dot{\mathbf{x}}_\ell = \mathbf{J}_{f_\ell}(\mathbf{x}_{\ell-1})\,\dot{\mathbf{x}}_{\ell-1}, \quad \ell = 1, \ldots, L.$$

The output is $\dot{\mathbf{x}}_L = \mathbf{J}_f\,\mathbf{v}$.

Cost: one JVP evaluation costs $O(1)$ times the cost of evaluating $f$ itself (typically $2$--$3\times$).

Forward-mode AD tracks how a small perturbation at the input propagates through each layer of the computation. At each step, it multiplies by the local Jacobian, which is never formed explicitly.

Theorem: Reverse-Mode AD and the VJP (Backpropagation)

Let $f = f_L \circ \cdots \circ f_1$ as above. Reverse-mode AD computes the VJP $\mathbf{J}_f^T\,\mathbf{u}$ by propagating the cotangent vector backward through the chain:

$$\bar{\mathbf{x}}_L = \mathbf{u}, \qquad \bar{\mathbf{x}}_{\ell-1} = \mathbf{J}_{f_\ell}(\mathbf{x}_{\ell-1})^T\,\bar{\mathbf{x}}_\ell, \quad \ell = L, \ldots, 1.$$

The output is $\bar{\mathbf{x}}_0 = \mathbf{J}_f^T\,\mathbf{u}$.

Cost: one VJP evaluation costs $O(1)$ times the cost of evaluating $f$, but requires storing all intermediate values $\{\mathbf{x}_\ell\}$ from the forward pass (or recomputing them).

Reverse-mode AD tracks how a small perturbation at the output propagates backward to each input. This is exactly backpropagation in neural networks. A single reverse pass gives the gradient with respect to all $n$ inputs simultaneously.
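The two recursions can be written down almost verbatim. A sketch (the three dense layers are illustrative) that propagates a tangent forward and a cotangent backward through the chain, then checks both against AD applied to the whole composition:

```python
import torch
from torch.autograd.functional import jvp, vjp

torch.manual_seed(0)
W1, W2, W3 = torch.randn(5, 4), torch.randn(5, 5), torch.randn(2, 5)
layers = [lambda z: torch.tanh(W1 @ z),   # f1
          lambda z: torch.tanh(W2 @ z),   # f2
          lambda z: W3 @ z]               # f3

x0 = torch.randn(4)
v = torch.randn(4)   # tangent at the input
u = torch.randn(2)   # cotangent at the output

# Forward sweep: store intermediate states and push the tangent through each local Jacobian.
xs, xdot = [x0], v
for fl in layers:
    out, xdot = jvp(fl, xs[-1], xdot)    # xdot <- J_{f_l}(x_{l-1}) xdot
    xs.append(out)

# Reverse sweep: pull the cotangent back through the transposed local Jacobians,
# consuming the stored intermediate states in reverse order.
xbar = u
for fl, xl in zip(reversed(layers), reversed(xs[:-1])):
    _, xbar = vjp(fl, xl, xbar)          # xbar <- J_{f_l}(x_{l-1})^T xbar

# Check against AD applied to the whole composition.
def f(z):
    for fl in layers:
        z = fl(z)
    return z

print(torch.allclose(xdot, jvp(f, x0, v)[1]),
      torch.allclose(xbar, vjp(f, x0, u)[1]))
```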

When to Use Forward vs Reverse Mode

The choice depends on the input-output dimensions of $f$:

  • Forward mode computes one column of $\mathbf{J}_f$ per pass. Efficient when $n \ll m$ (few inputs, many outputs). Cost to compute the full Jacobian: $O(n)$ forward passes.

  • Reverse mode computes one row of $\mathbf{J}_f$ per pass. Efficient when $m \ll n$ (many inputs, few outputs). Cost to compute the full gradient (when $m = 1$): one reverse pass.

For imaging: the loss function $\mathcal{L}(\theta) = \|\mathbf{A}(\theta)\mathbf{c} - \mathbf{y}\|^2$ maps $\theta \in \mathbb{R}^n$ (many parameters) to a scalar loss. Reverse mode (backpropagation) is overwhelmingly more efficient: one reverse pass gives the full gradient $\nabla_\theta \mathcal{L}$, whereas forward mode would require $n$ passes.

However, for computing the Jacobian of a denoiser $\mathcal{D} : \mathbb{R}^Q \to \mathbb{R}^Q$, both modes require $Q$ passes (the Jacobian is square). In this case, JVPs may be preferred because they avoid storing the full computation graph.

Example: Gradient of an Unrolled ISTA Loss

Consider $T$ iterations of ISTA applied to $\min_{\mathbf{c}} \frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + \lambda\|\mathbf{c}\|_1$ with learnable step size $\mu$ and regularization weight $\lambda$. Derive the gradient of the reconstruction loss $\mathcal{L} = \|\mathbf{c}^{(T)} - \mathbf{c}_{\text{true}}\|^2$ with respect to $\mu$ using reverse-mode AD.
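A minimal PyTorch sketch of this setup (problem dimensions, noise level, and initial parameter values are illustrative): the unrolled loop is built once, and a single backward pass returns the gradients with respect to $\mu$ and $\lambda$ through all $T$ iterations.

```python
import torch

torch.manual_seed(0)
M, N, T = 30, 50, 10
A = torch.randn(M, N) / M ** 0.5                 # toy sensing matrix
c_true = torch.zeros(N)
c_true[:5] = torch.randn(5)                      # sparse ground truth
y = A @ c_true + 0.01 * torch.randn(M)

mu = torch.tensor(0.1, requires_grad=True)       # learnable step size
lam = torch.tensor(0.05, requires_grad=True)     # learnable regularization weight

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

# T unrolled ISTA iterations: a T-layer network with shared parameters (mu, lam).
c = torch.zeros(N)
for _ in range(T):
    grad = A.T @ (A @ c - y)                     # gradient of the data-fidelity term
    c = soft_threshold(c - mu * grad, mu * lam)  # proximal (shrinkage) step

loss = torch.sum((c - c_true) ** 2)
loss.backward()                                  # backprop through all T iterations
print(mu.grad, lam.grad)                         # dL/dmu, dL/dlambda
```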

Definition:

Gradient Checkpointing

For an unrolled algorithm with $T$ iterations, reverse-mode AD requires storing all $T$ intermediate states $\{\mathbf{c}^{(0)}, \ldots, \mathbf{c}^{(T-1)}\}$ during the forward pass. For a $Q$-voxel image, this costs $O(TQ)$ memory.

Gradient checkpointing reduces memory to $O(\sqrt{T}\,Q)$ by storing only $\sqrt{T}$ evenly spaced checkpoints and recomputing the intermediate states during backpropagation. The tradeoff: memory decreases by a factor of $\sqrt{T}$, but computation increases by at most a factor of 2.

In PyTorch, this is implemented via torch.utils.checkpoint.checkpoint().

For a typical unrolled ADMM with $T = 20$ iterations on a $128^2$ image: standard backprop requires $20 \times 128^2 \times 16$ bytes $\approx 5$ MB. Checkpointing reduces this to $\sqrt{20} \times 128^2 \times 16$ bytes $\approx 1.2$ MB. The savings become critical for 3D problems ($Q = 128^3$) or deeper unrollings ($T = 100$+).
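A sketch of segment-wise checkpointing for an unrolled loop (toy operator and a nonnegativity projection as a stand-in prox): the $T$ iterations are split into $\sqrt{T}$ checkpointed segments of $\sqrt{T}$ iterations each, so only the segment-boundary states are stored during the forward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
Q, T, S = 4096, 25, 5                             # T = S * S iterations, S = sqrt(T) segments
A = torch.randn(256, Q) / 16.0
y = torch.randn(256)
mu = torch.tensor(0.1, requires_grad=True)        # learnable step size (shared across steps)

def ista_step(c):
    grad = A.T @ (A @ c - y)
    return torch.relu(c - mu * grad)              # nonnegativity projection as a stand-in prox

def segment(c):                                   # sqrt(T) iterations per checkpointed segment
    for _ in range(S):
        c = ista_step(c)
    return c

c = torch.zeros(Q, requires_grad=True)            # initial estimate (a differentiable leaf)
for _ in range(S):
    # Only the S segment boundaries are stored; the states inside a segment are
    # recomputed during backward, at the cost of roughly one extra forward pass.
    c = checkpoint(segment, c, use_reentrant=False)

loss = (c ** 2).sum()
loss.backward()
print(mu.grad)                                    # gradient w.r.t. the shared step size
```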

Quick Check

To compute the gradient of a scalar loss $\mathcal{L}(\theta)$ where $\theta \in \mathbb{R}^{10{,}000}$, how many passes does each AD mode require?

Forward: 10,000 passes; Reverse: 1 pass

Forward: 1 pass; Reverse: 10,000 passes

Both: 1 pass

Both: 10,000 passes

Common Mistake: AD Through Non-Smooth Operations

Mistake:

Blindly applying AD through non-differentiable operations such as soft-thresholding $\mathcal{S}_\tau(\cdot)$, hard-thresholding, or ReLU activations, expecting the gradient to always be well-defined.

Correction:

Soft-thresholding $\mathcal{S}_\tau(x) = \mathrm{sign}(x)(|x| - \tau)_+$ is differentiable everywhere except at $|x| = \tau$. At this point, the subdifferential is the interval $[0, 1]$.

In practice, AD frameworks (PyTorch, JAX) define the gradient at non-differentiable points using a convention (e.g., the right derivative, or zero). This works well for optimization but can cause issues for:

  • Convergence analysis that requires Lipschitz gradients.
  • Jacobian computation of denoisers near kinks.
  • Second-order methods that assume twice-differentiability.

When rigorous differentiability is needed, consider smooth approximations: the Moreau envelope replaces $\|\cdot\|_1$ with a differentiable function, and the softplus function $\log(1 + e^x)$ smoothly approximates ReLU.
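The following sketch makes the convention concrete: it queries PyTorch's gradient of soft-thresholding and ReLU exactly at their kinks (the values are printed rather than assumed, since the behavior at the kink is a framework convention), and shows softplus as a smooth surrogate.

```python
import torch
import torch.nn.functional as F

tau = 0.5

# Gradient of soft-thresholding exactly at the kink |x| = tau: AD returns one
# particular element of the subdifferential [0, 1], chosen by convention.
x = torch.tensor(tau, requires_grad=True)
F.softshrink(x, lambd=tau).backward()
print("softshrink grad at kink:", x.grad)

# Same question for ReLU at 0.
z = torch.tensor(0.0, requires_grad=True)
torch.relu(z).backward()
print("relu grad at 0:", z.grad)

# Smooth surrogate: softplus has a well-defined derivative everywhere;
# beta controls how sharply it approximates ReLU.
z2 = torch.tensor(0.0, requires_grad=True)
F.softplus(z2, beta=10.0).backward()
print("softplus grad at 0:", z2.grad)            # sigmoid(beta * 0) = 0.5
```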

Computing the Denoiser Jacobian for PnP Convergence

The convergence of Plug-and-Play (PnP) algorithms depends on the spectral radius of the denoiser's Jacobian $\mathbf{J}_{\mathcal{D}}(\mathbf{x})$. Specifically, for PnP-ISTA to converge, a sufficient condition is that the denoiser is nonexpansive, meaning $\|\mathbf{J}_{\mathcal{D}}\| \leq 1$.

For a neural denoiser with $Q$ input/output dimensions, the full Jacobian is a $Q \times Q$ matrix. Computing it via AD requires $Q$ JVPs (forward mode) or $Q$ VJPs (reverse mode). For $Q = 16{,}384$, this is expensive but feasible as a diagnostic (not during training).

Power iteration on the Jacobian provides a cheaper alternative: it approximates the spectral norm $\|\mathbf{J}_{\mathcal{D}}\|$ using only $\sim 20$ JVPs, each costing one forward pass through the denoiser.
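A sketch of one such diagnostic, assuming a small CNN as a stand-in denoiser: it power-iterates on $\mathbf{J}_{\mathcal{D}}^T\mathbf{J}_{\mathcal{D}}$ (one JVP and one VJP through the denoiser per iteration) to estimate the spectral norm at a given input.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jvp, vjp

torch.manual_seed(0)

# Stand-in denoiser: a small CNN (any differentiable denoiser works here).
denoiser = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
x = torch.rand(1, 1, 64, 64)                      # point at which the Jacobian is evaluated

def D(z):
    return denoiser(z)

# Power iteration on J^T J: one JVP and one VJP through the denoiser per step.
v = torch.randn_like(x)
v = v / v.norm()
for _ in range(20):
    _, Jv = jvp(D, x, v)                          # J v        (forward mode)
    _, JTJv = vjp(D, x, Jv)                       # J^T (J v)  (reverse mode)
    sigma = JTJv.norm().sqrt()                    # current estimate of ||J_D||
    v = JTJv / JTJv.norm()

print("estimated ||J_D|| ~", sigma.item())
# If this exceeds 1, the denoiser is not nonexpansive at x, and convergence
# guarantees based on nonexpansiveness do not apply.
```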

🎓 CommIT Contribution (2025)

Learned Algorithm Unrolling for RF Imaging

CommIT Group, G. Caire (CommIT NPJ)

The CommIT group's work on multi-sensor RF imaging fusion uses an unrolled OAMP algorithm where the denoiser at each iteration is a learned neural network. Training this architecture end-to-end requires backpropagation through both the OAMP update steps and the denoiser, a direct application of the reverse-mode AD framework developed in this section.

The key insight from their work: by constraining the learned denoiser to be nonexpansive (via spectral normalization of the network weights), convergence of the overall algorithm is guaranteed regardless of the training data distribution.


Why This Matters: AD in Differentiable Channel Estimation

Automatic differentiation is not unique to imaging; it plays an increasingly central role in communications system design. Differentiable channel estimation treats the entire signal processing chain (pilot design, channel estimation, equalization, detection) as a differentiable computation graph. End-to-end training optimizes all components jointly, including the pilot sequences themselves.

The AD framework of this section applies directly: the forward model through the channel is analogous to the imaging forward operator, and the loss function (e.g., symbol error rate or mutual information estimate) plays the role of the reconstruction loss.
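As a heavily simplified sketch of this idea (the dimensions, the linear pilot model, and the least-squares estimator are illustrative assumptions, not a specific system): the pilot matrix is a learnable parameter, and the gradient of the channel-estimation error flows back to it through the estimator.

```python
import torch

torch.manual_seed(0)
K, P = 8, 16                                      # channel taps / pilot symbols (toy sizes)
pilots = torch.randn(P, K, requires_grad=True)    # learnable pilot matrix (illustrative)

def estimate_channel(pilots, h, noise_std=0.05):
    y = pilots @ h + noise_std * torch.randn(P)   # pilots observed through a linear channel
    # Least-squares channel estimate; every step is differentiable w.r.t. the pilots.
    return torch.linalg.solve(pilots.T @ pilots, pilots.T @ y)

# End-to-end objective: channel-estimation MSE averaged over random channel draws.
loss = 0.0
for _ in range(64):
    h = torch.randn(K)                            # random channel realization
    loss = loss + ((estimate_channel(pilots, h) - h) ** 2).mean()
loss = loss / 64

loss.backward()                                   # one reverse pass: gradient w.r.t. all pilots
print(pilots.grad.shape)
```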

Automatic differentiation (AD)

A family of techniques for computing exact derivatives of functions implemented as computer programs, by systematic application of the chain rule to elementary operations.

Related: Jacobian-vector product (JVP), vector-Jacobian product (VJP)

Jacobian-vector product (JVP)

The product $\mathbf{J}_f \mathbf{v}$ computed by forward-mode AD. Gives the directional derivative of $f$ along $\mathbf{v}$.

Vector-Jacobian product (VJP)

The product $\mathbf{J}_f^T \mathbf{u}$ computed by reverse-mode AD (backpropagation). When the output is scalar ($m = 1$), the VJP with $u = 1$ gives the full gradient.

Key Takeaway

Automatic differentiation provides exact gradients through arbitrary compositions of differentiable operations. Reverse mode (backpropagation) is the workhorse for imaging: it computes the full gradient of a scalar loss in a single backward pass. Forward mode is efficient for JVPs (e.g., power iteration on the denoiser Jacobian). Gradient checkpointing trades computation for memory in deep unrollings. These tools are essential for training unrolled algorithms, differentiable rendering, and analyzing PnP convergence.