Automatic Differentiation for Inverse Problems
Why Automatic Differentiation Matters for Imaging
Three developments in modern computational imaging require efficient, exact gradient computation through complex computational graphs:
- Unrolled algorithms. An ISTA or ADMM solver unrolled for $K$ iterations becomes a $K$-layer neural network whose parameters (step sizes, regularization weights, denoiser parameters) are optimized end-to-end. Training requires backpropagation through all $K$ iterations.
- Differentiable rendering. The forward model must be differentiable with respect to the scene parameters (not only the unknown image, but also geometry and material properties) for gradient-based optimization.
- PnP and RED convergence analysis. The convergence guarantees of Plug-and-Play and Regularization-by-Denoising depend on the spectral properties of the denoiser's Jacobian $J_D(x)$. Computing this Jacobian for a neural denoiser requires AD.
Manual gradient derivation is error-prone and algorithm-specific. Automatic differentiation provides exact gradients for arbitrary compositions of differentiable operations.
Definition: Automatic Differentiation
Automatic differentiation (AD) computes the derivative of a function implemented as a computer program by decomposing it into elementary operations (addition, multiplication, exp, sin, etc.) and applying the chain rule systematically.
Unlike symbolic differentiation (which manipulates mathematical expressions and can produce exponentially large formulas), AD operates on the numeric computation itself and always produces exact derivatives (up to floating-point precision).
Unlike numerical differentiation (finite differences), AD does not suffer from the tradeoff between truncation error (large step) and round-off error (small step).
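A small sketch of this tradeoff in PyTorch (the framework used later in this section): a forward finite difference for $\sin$ at $x = 1$, with the step sizes h as illustrative choices, compared against the autograd gradient.

```python
import torch

# AD gradient of sin at x = 1 (exact up to floating-point precision)
x = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)
torch.sin(x).backward()
exact = x.grad.item()                                  # cos(1)

# Forward finite differences: error shrinks with h (truncation), then grows (round-off)
f = lambda t: torch.sin(torch.tensor(t, dtype=torch.float64)).item()
for h in (1e-2, 1e-8, 1e-14):
    fd = (f(1.0 + h) - f(1.0)) / h
    print(f"h={h:.0e}  |fd - exact| = {abs(fd - exact):.2e}")
```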
AD is not a single algorithm but a family of strategies. The two primary modes, forward and reverse, differ in which direction they traverse the chain rule, and this difference has profound implications for computational cost.
Definition: Jacobian, JVP, and VJP
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian at a point $x \in \mathbb{R}^n$ is the $m \times n$ matrix $J_f(x)$ with entries $[J_f(x)]_{ij} = \partial f_i(x) / \partial x_j$.
Two fundamental operations avoid forming the full Jacobian:
- Jacobian-vector product (JVP): $J_f(x)\,v$ for a given tangent vector $v \in \mathbb{R}^n$. Cost: one forward pass through the computation.
- Vector-Jacobian product (VJP): $J_f(x)^\top u$ for a given cotangent vector $u \in \mathbb{R}^m$. Cost: one reverse pass (backpropagation).
The key asymmetry: a JVP maps one input perturbation to all outputs. A VJP maps one output perturbation to all inputs.
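Both primitives are exposed directly in PyTorch; a minimal sketch follows, where the toy function $f$ and the chosen tangent and cotangent vectors are illustrative assumptions rather than anything from the text above.

```python
import torch

# Toy function f: R^3 -> R^2
def f(x):
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([0.1, 0.0, 0.0])   # tangent vector in input space (R^3)
u = torch.tensor([1.0, 0.0])        # cotangent vector in output space (R^2)

# JVP: J_f(x) v, one forward-mode pass
_, jvp_out = torch.autograd.functional.jvp(f, x, v)

# VJP: J_f(x)^T u, one reverse-mode pass (backpropagation)
_, vjp_out = torch.autograd.functional.vjp(f, x, u)

print(jvp_out)  # how all outputs respond to the input perturbation v
print(vjp_out)  # how u^T f(x) responds to all inputs
```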
Theorem: Forward-Mode AD and the JVP
Let $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ be a composition of differentiable functions. Forward-mode AD computes the JVP $J_f(x)\,v$ by propagating the tangent vector forward through the chain: starting from $x_0 = x$ and $v_0 = v$, set $x_k = f_k(x_{k-1})$ and $v_k = J_{f_k}(x_{k-1})\,v_{k-1}$ for $k = 1, \ldots, L$.
The output is $v_L = J_f(x)\,v$.
Cost: one JVP evaluation costs a small constant factor times the cost of evaluating $f$ itself (typically a factor of 2 to 3).
Forward-mode AD tracks how a small perturbation at the input propagates through each layer of the computation. At each step, it multiplies by the local Jacobian $J_{f_k}(x_{k-1})$, which is never formed explicitly.
Chain rule decomposition
By the chain rule, $J_f(x) = J_{f_L}(x_{L-1})\, J_{f_{L-1}}(x_{L-2}) \cdots J_{f_1}(x_0)$.
The JVP is computed by evaluating the matrix-vector products from right to left: first $v_1 = J_{f_1}(x_0)\,v$, then $v_2 = J_{f_2}(x_1)\,v_1$, and so on.
Cost analysis
Each tangent $v_k$ is computed alongside the forward evaluation of $x_k$ (dual-number arithmetic or tangent propagation). The overhead is a constant factor per elementary operation.
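To make the recursion concrete, here is a hand-rolled sketch of tangent propagation for a two-layer composition; the elementwise layers $f_1$, $f_2$ are assumptions chosen so the local JVPs are easy to write down, and the result is cross-checked against PyTorch's built-in JVP.

```python
import torch

def f1(x): return torch.sin(x)
def f1_jvp(x, v): return torch.cos(x) * v     # J_{f1}(x) v (diagonal Jacobian, never formed)

def f2(x): return x ** 2
def f2_jvp(x, v): return 2 * x * v            # J_{f2}(x) v

x0 = torch.tensor([0.5, 1.0, -2.0])
v0 = torch.tensor([1.0, 0.0, 0.0])            # input tangent vector

# Primal value and tangent advance together, layer by layer
x1, v1 = f1(x0), f1_jvp(x0, v0)
x2, v2 = f2(x1), f2_jvp(x1, v1)               # v2 = J_f(x0) v0

_, v2_ref = torch.autograd.functional.jvp(lambda x: f2(f1(x)), x0, v0)
assert torch.allclose(v2, v2_ref)
```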
Theorem: Reverse-Mode AD and the VJP (Backpropagation)
Let $f = f_L \circ \cdots \circ f_1$ as above. Reverse-mode AD computes the VJP $J_f(x)^\top u$ by propagating the cotangent vector backward through the chain: starting from $u_L = u$, set $u_{k-1} = J_{f_k}(x_{k-1})^\top u_k$ for $k = L, \ldots, 1$.
The output is $u_0 = J_f(x)^\top u$.
Cost: one VJP evaluation costs a small constant factor times the cost of evaluating $f$ (independent of the input dimension), but requires storing all intermediate values $x_0, \ldots, x_{L-1}$ from the forward pass (or recomputing them).
Reverse-mode AD tracks how a small perturbation at the output propagates backward to each input. This is exactly backpropagation in neural networks. A single reverse pass gives the gradient with respect to all inputs simultaneously.
Transposed chain rule
$J_f(x)^\top = J_{f_1}(x_0)^\top J_{f_2}(x_1)^\top \cdots J_{f_L}(x_{L-1})^\top$.
The VJP is computed by applying the transposed factors from right to left: first $u_{L-1} = J_{f_L}(x_{L-1})^\top u$, then $u_{L-2} = J_{f_{L-1}}(x_{L-2})^\top u_{L-1}$, and so on.
Memory requirement
Computing $u_{k-1}$ requires the intermediate value $x_{k-1}$ (to evaluate the local Jacobian $J_{f_k}(x_{k-1})$). These intermediates must either be stored during the forward pass ($O(L)$ memory) or recomputed (gradient checkpointing).
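The matching reverse sweep for the same assumed two-layer composition: the forward pass stores $x_0$ and $x_1$, and the backward pass applies the transposed local Jacobians from the last layer back to the first, reproducing the gradient that autograd's backward() computes.

```python
import torch

def f1(x): return torch.sin(x)
def f1_vjp(x, u): return torch.cos(x) * u     # J_{f1}(x)^T u
def f2(x): return x ** 2
def f2_vjp(x, u): return 2 * x * u            # J_{f2}(x)^T u

x0 = torch.tensor([0.5, 1.0, -2.0])

# Forward pass: intermediates are kept for the backward sweep (the O(L) memory cost)
x1 = f1(x0)
x2 = f2(x1)

u2 = torch.ones_like(x2)                      # cotangent at the output, e.g. d(sum)/dx2
u1 = f2_vjp(x1, u2)                           # needs the stored x1
u0 = f1_vjp(x0, u1)                           # needs the stored x0; u0 = J_f(x0)^T u2

# Cross-check against autograd on the scalar loss sum(f2(f1(x)))
x_req = x0.clone().requires_grad_(True)
f2(f1(x_req)).sum().backward()
assert torch.allclose(u0, x_req.grad)
```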
When to Use Forward vs Reverse Mode
The choice depends on the input-output dimensions of $f: \mathbb{R}^n \to \mathbb{R}^m$:
- Forward mode computes one column of $J_f$ per pass. Efficient when $n \ll m$ (few inputs, many outputs). Cost to compute the full Jacobian: $n$ forward passes.
- Reverse mode computes one row of $J_f$ per pass. Efficient when $m \ll n$ (many inputs, few outputs). Cost to compute the full gradient (when $m = 1$): one reverse pass.
For imaging: the loss function $\mathcal{L}(\theta)$ maps $\theta \in \mathbb{R}^n$ (many parameters) to a scalar loss ($m = 1$). Reverse mode (backpropagation) is overwhelmingly more efficient: one reverse pass gives the full gradient $\nabla_\theta \mathcal{L}$, whereas forward mode would require $n$ passes.
However, for computing the Jacobian of a denoiser $D: \mathbb{R}^n \to \mathbb{R}^n$, both modes require $n$ passes (the Jacobian is square). In this case, JVPs may be preferred because they avoid storing the full computation graph.
Example: Gradient of an Unrolled ISTA Loss
Consider $K$ iterations of ISTA applied to measurements $y = Ax + n$ with learnable step size $\eta$ and regularization weight $\lambda$. Derive the gradient of the reconstruction loss $\mathcal{L}(x_K)$ (e.g., $\|x_K - x^\star\|_2^2$ against a ground-truth image $x^\star$) with respect to the step size $\eta$ using reverse-mode AD.
Unrolled computation graph
Each ISTA iteration is $x_k = S_{\eta\lambda}\bigl(x_{k-1} - \eta\, A^\top (A x_{k-1} - y)\bigr)$ for $k = 1, \ldots, K$,
where $S_\tau(z) = \operatorname{sign}(z)\max(|z| - \tau, 0)$ (applied elementwise) is the soft-thresholding operator. This defines a composition $x_K = f_K \circ \cdots \circ f_1(x_0)$ where each $f_k$ depends on $\eta$ and $\lambda$.
Reverse-mode backpropagation
The gradient with respect to the shared parameter $\eta$ accumulates contributions from each of the $K$ iterations via the chain rule:
$\dfrac{\partial \mathcal{L}}{\partial \eta} = \sum_{k=1}^{K} \Bigl(\dfrac{\partial x_k}{\partial \eta}\Bigr)^{\!\top} \bar{x}_k$, with $x_{k-1}$ held fixed in each partial derivative,
where $\bar{x}_k = \partial \mathcal{L} / \partial x_k$ is the adjoint state computed by backpropagation.
Local derivative
The derivative of $x_k$ with respect to $\eta$ (holding $x_{k-1}$ fixed) involves the subgradient of the soft-thresholding operator and the gradient step: writing $z_k = x_{k-1} - \eta A^\top(Ax_{k-1} - y)$,
$\dfrac{\partial x_k}{\partial \eta} = \mathbf{1}\{|z_k| > \eta\lambda\} \odot \bigl(-A^\top(Ax_{k-1} - y) - \lambda\operatorname{sign}(z_k)\bigr)$,
where the indicator is the elementwise (sub)derivative of $S_{\eta\lambda}$, the first term comes from the gradient step, and the second from the threshold $\eta\lambda$.
In PyTorch, all of this is handled automatically by the autograd engine.
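A minimal sketch of this setup in PyTorch; the operator A, measurements y, target x_true, and unrolling depth K are illustrative assumptions. A single loss.backward() call backpropagates through all K iterations and fills in the gradients with respect to eta and lam.

```python
import torch

torch.manual_seed(0)
m, n, K = 64, 128, 10
A = torch.randn(m, n) / n ** 0.5                 # illustrative forward operator
x_true = torch.zeros(n); x_true[:8] = 1.0        # sparse ground truth
y = A @ x_true + 0.01 * torch.randn(m)

eta = torch.tensor(0.2, requires_grad=True)      # learnable step size
lam = torch.tensor(0.1, requires_grad=True)      # learnable regularization weight

def soft_threshold(z, tau):
    return torch.sign(z) * torch.clamp(z.abs() - tau, min=0.0)

# Unrolled ISTA: each iteration is one "layer" of the computation graph
x = torch.zeros(n)
for _ in range(K):
    z = x - eta * (A.T @ (A @ x - y))            # gradient step on the data term
    x = soft_threshold(z, eta * lam)             # proximal (soft-thresholding) step

loss = torch.sum((x - x_true) ** 2)              # reconstruction loss
loss.backward()                                  # reverse-mode AD through all K iterations
print(eta.grad, lam.grad)
```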
Definition: Gradient Checkpointing
For an unrolled algorithm with $K$ iterations, reverse-mode AD requires storing all $K$ intermediate states during the forward pass. For an $N$-voxel image, this costs $O(KN)$ memory.
Gradient checkpointing reduces memory to $O(\sqrt{K}\,N)$ by storing only $O(\sqrt{K})$ evenly spaced checkpoints and recomputing the intermediate states during backpropagation. The tradeoff: memory decreases by roughly a factor of $\sqrt{K}$, but computation increases by at most a factor of 2.
In PyTorch, this is implemented via torch.utils.checkpoint.checkpoint().
For a typical unrolled ADMM on a large 2D image, standard backprop must keep every intermediate iterate in memory; checkpointing cuts this requirement by roughly a factor of $\sqrt{K}$. The savings become critical for 3D volumes or deeper unrollings.
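A sketch of checkpointing in this setting, assuming one checkpoint segment per unrolled iteration for simplicity (in practice several iterations would be grouped per segment to approach the $O(\sqrt{K})$ memory regime); the iteration body and problem sizes are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

n, K = 4096, 100
A = torch.randn(256, n) / n ** 0.5
y = torch.randn(256)
x_target = torch.randn(n)                        # illustrative reconstruction target
eta = torch.tensor(0.2, requires_grad=True)
lam = torch.tensor(0.05, requires_grad=True)

def ista_step(x, eta, lam):
    z = x - eta * (A.T @ (A @ x - y))
    return torch.sign(z) * torch.clamp(z.abs() - eta * lam, min=0.0)

x = torch.zeros(n)
for _ in range(K):
    # Activations inside this call are discarded after the forward pass and
    # recomputed during backward, trading extra compute for lower memory.
    x = checkpoint(ista_step, x, eta, lam, use_reentrant=False)

loss = torch.sum((x - x_target) ** 2)
loss.backward()
print(eta.grad, lam.grad)
```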
Quick Check
To compute the gradient of a scalar loss $\mathcal{L}(\theta)$ where $\theta \in \mathbb{R}^{10{,}000}$, how many passes does each AD mode require?
Forward: 10,000 passes; Reverse: 1 pass
Forward: 1 pass; Reverse: 10,000 passes
Both: 1 pass
Both: 10,000 passes
Forward mode computes one directional derivative per pass (10,000 passes for the full gradient). Reverse mode computes the full gradient in a single backward pass.
Common Mistake: AD Through Non-Smooth Operations
Mistake:
Blindly applying AD through non-differentiable operations such as soft-thresholding $S_\tau$, hard-thresholding, or ReLU activations, expecting the gradient to always be well-defined.
Correction:
Soft-thresholding $S_\tau(z) = \operatorname{sign}(z)\max(|z| - \tau, 0)$ is differentiable everywhere except at $|z| = \tau$. At these points, the subdifferential is the interval $[0, 1]$.
In practice, AD frameworks (PyTorch, JAX) define the gradient at non-differentiable points using a convention (e.g., the right derivative, or zero). This works well for optimization but can cause issues for:
- Convergence analysis that requires Lipschitz gradients.
- Jacobian computation of denoisers near kinks.
- Second-order methods that assume twice-differentiability.
When rigorous differentiability is needed, consider smooth approximations: the Moreau envelope replaces a nonsmooth function such as $\|\cdot\|_1$ with a differentiable surrogate, and the softplus function $\log(1 + e^{z})$ smoothly approximates ReLU.
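A short sketch contrasting the autograd convention at a kink with the smooth surrogates named above; the smoothing parameter mu in the Huber-style Moreau envelope of the absolute value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

z = torch.zeros(1, requires_grad=True)

# ReLU: at the kink z = 0, PyTorch's convention assigns gradient 0
F.relu(z).sum().backward()
print(z.grad)                     # tensor([0.])

# Softplus: smooth surrogate with a well-defined gradient everywhere
z.grad = None
F.softplus(z).sum().backward()
print(z.grad)                     # tensor([0.5]), i.e. sigmoid(0)

# Moreau envelope of |.| (the Huber function): quadratic near 0, |z| - mu/2 far away
def huber(z, mu=0.1):
    return torch.where(z.abs() <= mu, 0.5 * z ** 2 / mu, z.abs() - 0.5 * mu)
```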
Computing the Denoiser Jacobian for PnP Convergence
The convergence of Plug-and-Play (PnP) algorithms depends on the spectral radius of the denoiser's Jacobian $J_D(x)$. Specifically, for PnP-ISTA to converge, a sufficient condition is that the denoiser $D$ is firmly nonexpansive, meaning $\|D(x) - D(y)\|_2^2 \le \langle D(x) - D(y),\, x - y\rangle$ for all $x, y$ (for a differentiable denoiser, this implies $\|J_D(x)\|_2 \le 1$).
For a neural denoiser $D: \mathbb{R}^n \to \mathbb{R}^n$, the full Jacobian is an $n \times n$ matrix. Computing it via AD requires $n$ JVPs (forward mode) or $n$ VJPs (reverse mode). For image-sized $n$, this is expensive but feasible as an occasional diagnostic (not during training).
Power iteration on the Jacobian provides a much cheaper alternative: it approximates the spectral norm $\|J_D(x)\|_2$ using only a handful of Jacobian products (JVPs, paired with VJPs since the Jacobian is generally not symmetric), each costing about one pass through the denoiser.
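A sketch of such a power iteration in PyTorch, assuming the denoiser is an ordinary callable on a single tensor; each iteration pairs a JVP with a VJP to apply $J_D^\top J_D$ without ever forming the matrix.

```python
import torch

def jacobian_spectral_norm(denoiser, x, n_iters=20):
    """Estimate ||J_D(x)||_2 by power iteration on J^T J using JVP/VJP calls."""
    u = torch.randn_like(x)
    u = u / u.norm()
    for _ in range(n_iters):
        _, Ju = torch.autograd.functional.jvp(denoiser, x, u)      # J u   (forward mode)
        _, JtJu = torch.autograd.functional.vjp(denoiser, x, Ju)   # J^T (J u)  (reverse mode)
        u = JtJu / (JtJu.norm() + 1e-12)
    _, Ju = torch.autograd.functional.jvp(denoiser, x, u)
    return Ju.norm()                                               # ~ largest singular value

# Illustrative usage with a toy contractive "denoiser"
denoiser = lambda x: 0.9 * torch.tanh(x)
x = torch.randn(64, 64)
print(jacobian_spectral_norm(denoiser, x))                         # stays below 0.9 here
```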
Learned Algorithm Unrolling for RF Imaging
The CommIT group's work on multi-sensor RF imaging fusion uses an unrolled OAMP algorithm where the denoiser at each iteration is a learned neural network. Training this architecture end-to-end requires backpropagation through both the OAMP update steps and the denoiser, a direct application of the reverse-mode AD framework developed in this section.
The key insight from their work: by constraining the learned denoiser to be nonexpansive (via spectral normalization of the network weights), convergence of the overall algorithm is guaranteed regardless of the training data distribution.
Why This Matters: AD in Differentiable Channel Estimation
Automatic differentiation is not unique to imaging: it plays an increasingly central role in communications system design. Differentiable channel estimation treats the entire signal processing chain (pilot design, channel estimation, equalization, detection) as a differentiable computation graph. End-to-end training optimizes all components jointly, including the pilot sequences themselves.
The AD framework of this section applies directly: the forward model through the channel is analogous to the imaging forward operator, and the loss function (e.g., symbol error rate or mutual information estimate) plays the role of the reconstruction loss.
Automatic differentiation (AD)
A family of techniques for computing exact derivatives of functions implemented as computer programs, by systematic application of the chain rule to elementary operations.
Related: {{Ref:Gloss Jvp}}, {{Ref:Gloss Vjp}}
Jacobian-vector product (JVP)
The product $J_f(x)\,v$ computed by forward-mode AD. Gives the directional derivative of $f$ along $v$.
Vector-Jacobian product (VJP)
The product $J_f(x)^\top u$ computed by reverse-mode AD (backpropagation). When the output is scalar ($m = 1$), the VJP with $u = 1$ gives the full gradient.
Key Takeaway
Automatic differentiation provides exact gradients through arbitrary compositions of differentiable operations. Reverse mode (backpropagation) is the workhorse for imaging: it computes the full gradient of a scalar loss in a single backward pass. Forward mode is efficient for JVPs (e.g., power iteration on the denoiser Jacobian). Gradient checkpointing trades computation for memory in deep unrollings. These tools are essential for training unrolled algorithms, differentiable rendering, and analyzing PnP convergence.