Automatic Differentiation for Inverse Problems

Why Automatic Differentiation Matters for Imaging

Three developments in modern computational imaging require efficient, exact gradient computation through complex computational graphs:

  1. Unrolled algorithms. An ISTA or ADMM solver unrolled for $T$ iterations becomes a $T$-layer neural network whose parameters (step sizes, regularization weights, denoiser parameters) are optimized end-to-end. Training requires backpropagation through all $T$ iterations.

  2. Differentiable rendering. The forward model $\mathbf{y} = \mathbf{A}\mathbf{c} + \mathbf{w}$ must be differentiable with respect to scene parameters (not just $\mathbf{c}$, but also geometry, material properties) for gradient-based optimization.

  3. PnP and RED convergence analysis. The convergence guarantees of Plug-and-Play and Regularization-by-Denoising depend on the spectral properties of the denoiser's Jacobian $\mathbf{J}_{\mathcal{D}}$. Computing this Jacobian for a neural denoiser requires AD.

Manual gradient derivation is error-prone and algorithm-specific. Automatic differentiation provides exact gradients for arbitrary compositions of differentiable operations.

Definition:

Automatic Differentiation

Automatic differentiation (AD) computes the derivative of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ implemented as a computer program by decomposing it into elementary operations (addition, multiplication, exp, sin, etc.) and applying the chain rule systematically.

Unlike symbolic differentiation (which manipulates mathematical expressions and can produce exponentially large formulas), AD operates on the numeric computation itself and always produces exact derivatives (up to floating-point precision).

Unlike numerical differentiation (finite differences), AD does not suffer from the tradeoff between truncation error (large step) and round-off error (small step).
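As a quick illustration, here is a minimal PyTorch sketch (the test function is arbitrary, chosen only for illustration) that contrasts the step-size sensitivity of central finite differences with the exactness of reverse-mode AD:

```python
import torch

# Arbitrary smooth test function with a known analytic derivative.
def f(x):
    return torch.sin(x) * torch.exp(x)

x = torch.tensor(1.3, dtype=torch.float64, requires_grad=True)
exact = ((torch.cos(x) + torch.sin(x)) * torch.exp(x)).detach()  # analytic f'(x)

# Reverse-mode AD: exact up to floating-point precision, no step size to tune.
(ad_grad,) = torch.autograd.grad(f(x), x)
print(f"AD    error = {abs(ad_grad - exact).item():.1e}")

# Central finite differences: truncation error for large h, round-off for small h.
x0 = x.detach()
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    fd = (f(x0 + h) - f(x0 - h)) / (2 * h)
    print(f"FD h={h:.0e} error = {abs(fd - exact).item():.1e}")
```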

AD is not a single algorithm but a family of strategies. The two primary modes, forward and reverse, differ in which direction they traverse the chain rule, and this difference has profound implications for computational cost.

Definition:

Jacobian, JVP, and VJP

For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian at point $\mathbf{x}$ is the $m \times n$ matrix

$$[\mathbf{J}_f(\mathbf{x})]_{ij} = \frac{\partial f_i}{\partial x_j}.$$

Two fundamental operations avoid forming the full Jacobian:

  1. Jacobian-vector product (JVP): $\mathbf{J}_f(\mathbf{x})\,\mathbf{v}$ for a given tangent vector $\mathbf{v} \in \mathbb{R}^n$. Cost: one forward pass through the computation.

  2. Vector-Jacobian product (VJP): $\mathbf{J}_f(\mathbf{x})^T\,\mathbf{u}$ for a given cotangent vector $\mathbf{u} \in \mathbb{R}^m$. Cost: one reverse pass (backpropagation).

The key asymmetry: a JVP maps one input perturbation to all outputs. A VJP maps one output perturbation to all inputs.
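Both products can be computed in PyTorch without ever materializing the Jacobian. A minimal sketch, using a tiny toy map as a stand-in for a forward model:

```python
import torch
from torch.autograd.functional import jvp, vjp, jacobian

# Toy map f : R^3 -> R^2, a stand-in for any differentiable forward model.
def f(x):
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

x = torch.randn(3)
v = torch.randn(3)  # tangent vector (input-space perturbation)
u = torch.randn(2)  # cotangent vector (output-space sensitivity)

_, Jv = jvp(f, x, v)    # J_f(x) v   -- one forward pass, Jacobian never formed
_, JTu = vjp(f, x, u)   # J_f(x)^T u -- one reverse pass

# Sanity check against the explicit 2x3 Jacobian (only feasible for tiny problems).
J = jacobian(f, x)
print(torch.allclose(Jv, J @ v), torch.allclose(JTu, J.T @ u))
```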


Theorem: Forward-Mode AD and the JVP

Let $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ be a composition of $L$ differentiable functions. Forward-mode AD computes the JVP $\mathbf{J}_f\,\mathbf{v}$ by propagating the tangent vector forward through the chain:

$$\dot{\mathbf{x}}_0 = \mathbf{v}, \qquad \dot{\mathbf{x}}_\ell = \mathbf{J}_{f_\ell}(\mathbf{x}_{\ell-1})\,\dot{\mathbf{x}}_{\ell-1}, \quad \ell = 1, \ldots, L.$$

The output is $\dot{\mathbf{x}}_L = \mathbf{J}_f\,\mathbf{v}$.

Cost: one JVP evaluation costs $O(1)$ times the cost of evaluating $f$ itself (typically $2$--$3\times$).

Forward-mode AD tracks how a small perturbation at the input propagates through each layer of the computation. At each step, it multiplies by the local Jacobian, which is never formed explicitly.

Theorem: Reverse-Mode AD and the VJP (Backpropagation)

Let $f = f_L \circ \cdots \circ f_1$ as above. Reverse-mode AD computes the VJP $\mathbf{J}_f^T\,\mathbf{u}$ by propagating the cotangent vector backward through the chain:

$$\bar{\mathbf{x}}_L = \mathbf{u}, \qquad \bar{\mathbf{x}}_{\ell-1} = \mathbf{J}_{f_\ell}(\mathbf{x}_{\ell-1})^T\,\bar{\mathbf{x}}_\ell, \quad \ell = L, \ldots, 1.$$

The output is $\bar{\mathbf{x}}_0 = \mathbf{J}_f^T\,\mathbf{u}$.

Cost: one VJP evaluation costs $O(1)$ times the cost of evaluating $f$, but requires storing all intermediate values $\{\mathbf{x}_\ell\}$ from the forward pass (or recomputing them).

Reverse-mode AD tracks how a small perturbation at the output propagates backward to each input. This is exactly backpropagation in neural networks. A single reverse pass gives the gradient with respect to all $n$ inputs simultaneously.
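The two recursions can be written down almost verbatim. A sketch (the three dense layers are illustrative) that propagates a tangent forward and a cotangent backward through the chain, then checks both against AD applied to the whole composition:

```python
import torch
from torch.autograd.functional import jvp, vjp

torch.manual_seed(0)
W1, W2, W3 = torch.randn(5, 4), torch.randn(5, 5), torch.randn(2, 5)
layers = [lambda z: torch.tanh(W1 @ z),   # f1
          lambda z: torch.tanh(W2 @ z),   # f2
          lambda z: W3 @ z]               # f3

x0 = torch.randn(4)
v = torch.randn(4)   # tangent at the input
u = torch.randn(2)   # cotangent at the output

# Forward sweep: store intermediate states and push the tangent through each local Jacobian.
xs, xdot = [x0], v
for fl in layers:
    out, xdot = jvp(fl, xs[-1], xdot)    # xdot <- J_{f_l}(x_{l-1}) xdot
    xs.append(out)

# Reverse sweep: pull the cotangent back through the transposed local Jacobians,
# consuming the stored intermediate states in reverse order.
xbar = u
for fl, xl in zip(reversed(layers), reversed(xs[:-1])):
    _, xbar = vjp(fl, xl, xbar)          # xbar <- J_{f_l}(x_{l-1})^T xbar

# Check against AD applied to the whole composition.
def f(z):
    for fl in layers:
        z = fl(z)
    return z

print(torch.allclose(xdot, jvp(f, x0, v)[1]),
      torch.allclose(xbar, vjp(f, x0, u)[1]))
```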

When to Use Forward vs Reverse Mode

The choice depends on the input-output dimensions of $f$:

  • Forward mode computes one column of $\mathbf{J}_f$ per pass. Efficient when $n \ll m$ (few inputs, many outputs). Cost to compute the full Jacobian: $O(n)$ forward passes.

  • Reverse mode computes one row of $\mathbf{J}_f$ per pass. Efficient when $m \ll n$ (many inputs, few outputs). Cost to compute the full gradient (when $m = 1$): one reverse pass.

For imaging: the loss function $\mathcal{L}(\theta) = \|\mathbf{A}(\theta)\mathbf{c} - \mathbf{y}\|^2$ maps $\theta \in \mathbb{R}^n$ (many parameters) to a scalar loss. Reverse mode (backpropagation) is overwhelmingly more efficient: one reverse pass gives the full gradient $\nabla_\theta \mathcal{L}$, whereas forward mode would require $n$ passes.

However, for computing the Jacobian of a denoiser $\mathcal{D} : \mathbb{R}^Q \to \mathbb{R}^Q$, both modes require $Q$ passes (the Jacobian is square). In this case, JVPs may be preferred because they avoid storing the full computation graph.

Example: Gradient of an Unrolled ISTA Loss

Consider $T$ iterations of ISTA applied to $\min_{\mathbf{c}} \frac{1}{2}\|\mathbf{A}\mathbf{c} - \mathbf{y}\|^2 + \lambda\|\mathbf{c}\|_1$ with learnable step size $\mu$ and regularization weight $\lambda$. Derive the gradient of the reconstruction loss $\mathcal{L} = \|\mathbf{c}^{(T)} - \mathbf{c}_{\text{true}}\|^2$ with respect to $\mu$ using reverse-mode AD.
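A minimal PyTorch sketch of this setup (problem dimensions, noise level, and initial parameter values are illustrative): the unrolled loop is built once, and a single backward pass returns the gradients with respect to $\mu$ and $\lambda$ through all $T$ iterations.

```python
import torch

torch.manual_seed(0)
M, N, T = 30, 50, 10
A = torch.randn(M, N) / M ** 0.5                 # toy sensing matrix
c_true = torch.zeros(N)
c_true[:5] = torch.randn(5)                      # sparse ground truth
y = A @ c_true + 0.01 * torch.randn(M)

mu = torch.tensor(0.1, requires_grad=True)       # learnable step size
lam = torch.tensor(0.05, requires_grad=True)     # learnable regularization weight

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

# T unrolled ISTA iterations: a T-layer network with shared parameters (mu, lam).
c = torch.zeros(N)
for _ in range(T):
    grad = A.T @ (A @ c - y)                     # gradient of the data-fidelity term
    c = soft_threshold(c - mu * grad, mu * lam)  # proximal (shrinkage) step

loss = torch.sum((c - c_true) ** 2)
loss.backward()                                  # backprop through all T iterations
print(mu.grad, lam.grad)                         # dL/dmu, dL/dlambda
```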

Definition:

Gradient Checkpointing

For an unrolled algorithm with $T$ iterations, reverse-mode AD requires storing all $T$ intermediate states $\{\mathbf{c}^{(0)}, \ldots, \mathbf{c}^{(T-1)}\}$ during the forward pass. For a $Q$-voxel image, this costs $O(TQ)$ memory.

Gradient checkpointing reduces memory to $O(\sqrt{T}\,Q)$ by storing only $\sqrt{T}$ evenly spaced checkpoints and recomputing the intermediate states during backpropagation. The tradeoff: memory decreases by a factor of $\sqrt{T}$, but computation increases by at most a factor of 2.

In PyTorch, this is implemented via torch.utils.checkpoint.checkpoint().

For a typical unrolled ADMM with $T = 20$ iterations on a $128^2$ image: standard backprop requires $20 \times 128^2 \times 16$ bytes $\approx 5$ MB. Checkpointing reduces this to $\sqrt{20} \times 128^2 \times 16$ bytes $\approx 1.2$ MB. The savings become critical for 3D problems ($Q = 128^3$) or deeper unrollings ($T = 100$+).
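A sketch of segment-wise checkpointing for an unrolled loop (toy operator and a nonnegativity projection as a stand-in prox): the $T$ iterations are split into $\sqrt{T}$ checkpointed segments of $\sqrt{T}$ iterations each, so only the segment-boundary states are stored during the forward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
Q, T, S = 4096, 25, 5                             # T = S * S iterations, S = sqrt(T) segments
A = torch.randn(256, Q) / 16.0
y = torch.randn(256)
mu = torch.tensor(0.1, requires_grad=True)        # learnable step size (shared across steps)

def ista_step(c):
    grad = A.T @ (A @ c - y)
    return torch.relu(c - mu * grad)              # nonnegativity projection as a stand-in prox

def segment(c):                                   # sqrt(T) iterations per checkpointed segment
    for _ in range(S):
        c = ista_step(c)
    return c

c = torch.zeros(Q, requires_grad=True)            # initial estimate (a differentiable leaf)
for _ in range(S):
    # Only the S segment boundaries are stored; the states inside a segment are
    # recomputed during backward, at the cost of roughly one extra forward pass.
    c = checkpoint(segment, c, use_reentrant=False)

loss = (c ** 2).sum()
loss.backward()
print(mu.grad)                                    # gradient w.r.t. the shared step size
```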

Quick Check

To compute the gradient of a scalar loss $\mathcal{L}(\theta)$ where $\theta \in \mathbb{R}^{10{,}000}$, how many passes does each AD mode require?

Forward: 10,000 passes; Reverse: 1 pass

Forward: 1 pass; Reverse: 10,000 passes

Both: 1 pass

Both: 10,000 passes

Common Mistake: AD Through Non-Smooth Operations

Mistake:

Blindly applying AD through non-differentiable operations such as soft-thresholding $\mathcal{S}_\tau(\cdot)$, hard-thresholding, or ReLU activations, expecting the gradient to always be well-defined.

Correction:

Soft-thresholding $\mathcal{S}_\tau(x) = \mathrm{sign}(x)(|x| - \tau)_+$ is differentiable everywhere except at $|x| = \tau$. At this point, the subdifferential is the interval $[0, 1]$.

In practice, AD frameworks (PyTorch, JAX) define the gradient at non-differentiable points using a convention (e.g., the right derivative, or zero). This works well for optimization but can cause issues for:

  • Convergence analysis that requires Lipschitz gradients.
  • Jacobian computation of denoisers near kinks.
  • Second-order methods that assume twice-differentiability.

When rigorous differentiability is needed, consider smooth approximations: the Moreau envelope replaces $\|\cdot\|_1$ with a differentiable function, and the softplus function $\log(1 + e^x)$ smoothly approximates ReLU.
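The following sketch makes the convention concrete: it queries PyTorch's gradient of soft-thresholding and ReLU exactly at their kinks (the values are printed rather than assumed, since the behavior at the kink is a framework convention), and shows softplus as a smooth surrogate.

```python
import torch
import torch.nn.functional as F

tau = 0.5

# Gradient of soft-thresholding exactly at the kink |x| = tau: AD returns one
# particular element of the subdifferential [0, 1], chosen by convention.
x = torch.tensor(tau, requires_grad=True)
F.softshrink(x, lambd=tau).backward()
print("softshrink grad at kink:", x.grad)

# Same question for ReLU at 0.
z = torch.tensor(0.0, requires_grad=True)
torch.relu(z).backward()
print("relu grad at 0:", z.grad)

# Smooth surrogate: softplus has a well-defined derivative everywhere;
# beta controls how sharply it approximates ReLU.
z2 = torch.tensor(0.0, requires_grad=True)
F.softplus(z2, beta=10.0).backward()
print("softplus grad at 0:", z2.grad)            # sigmoid(beta * 0) = 0.5
```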

Computing the Denoiser Jacobian for PnP Convergence

The convergence of Plug-and-Play (PnP) algorithms depends on the spectral radius of the denoiser's Jacobian $\mathbf{J}_{\mathcal{D}}(\mathbf{x})$. Specifically, for PnP-ISTA to converge, a sufficient condition is that the denoiser is nonexpansive, meaning $\|\mathbf{J}_{\mathcal{D}}\| \leq 1$.

For a neural denoiser with $Q$ input/output dimensions, the full Jacobian is a $Q \times Q$ matrix. Computing it via AD requires $Q$ JVPs (forward mode) or $Q$ VJPs (reverse mode). For $Q = 16{,}384$, this is expensive but feasible as a diagnostic (not during training).

Power iteration on the Jacobian provides a cheaper alternative: it approximates the spectral norm $\|\mathbf{J}_{\mathcal{D}}\|$ using only $\sim 20$ JVPs, each costing one forward pass through the denoiser.
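A sketch of one such diagnostic, assuming a small CNN as a stand-in denoiser: it power-iterates on $\mathbf{J}_{\mathcal{D}}^T\mathbf{J}_{\mathcal{D}}$ (one JVP and one VJP through the denoiser per iteration) to estimate the spectral norm at a given input.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jvp, vjp

torch.manual_seed(0)

# Stand-in denoiser: a small CNN (any differentiable denoiser works here).
denoiser = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
x = torch.rand(1, 1, 64, 64)                      # point at which the Jacobian is evaluated

def D(z):
    return denoiser(z)

# Power iteration on J^T J: one JVP and one VJP through the denoiser per step.
v = torch.randn_like(x)
v = v / v.norm()
for _ in range(20):
    _, Jv = jvp(D, x, v)                          # J v        (forward mode)
    _, JTJv = vjp(D, x, Jv)                       # J^T (J v)  (reverse mode)
    sigma = JTJv.norm().sqrt()                    # current estimate of ||J_D||
    v = JTJv / JTJv.norm()

print("estimated ||J_D|| ~", sigma.item())
# If this exceeds 1, the denoiser is not nonexpansive at x, and convergence
# guarantees based on nonexpansiveness do not apply.
```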

🎓 CommIT Contribution (2025)

Learned Algorithm Unrolling for RF Imaging

CommIT Group, G. Caire (CommIT NPJ)

The CommIT group's work on multi-sensor RF imaging fusion uses an unrolled OAMP algorithm where the denoiser at each iteration is a learned neural network. Training this architecture end-to-end requires backpropagation through both the OAMP update steps and the denoiser, a direct application of the reverse-mode AD framework developed in this section.

The key insight from their work: by constraining the learned denoiser to be nonexpansive (via spectral normalization of the network weights), convergence of the overall algorithm is guaranteed regardless of the training data distribution.


Why This Matters: AD in Differentiable Channel Estimation

Automatic differentiation is not unique to imaging; it plays an increasingly central role in communications system design. Differentiable channel estimation treats the entire signal processing chain (pilot design, channel estimation, equalization, detection) as a differentiable computation graph. End-to-end training optimizes all components jointly, including the pilot sequences themselves.

The AD framework of this section applies directly: the forward model through the channel is analogous to the imaging forward operator, and the loss function (e.g., symbol error rate or mutual information estimate) plays the role of the reconstruction loss.
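As a heavily simplified sketch of this idea (the dimensions, the linear pilot model, and the least-squares estimator are illustrative assumptions, not a specific system): the pilot matrix is a learnable parameter, and the gradient of the channel-estimation error flows back to it through the estimator.

```python
import torch

torch.manual_seed(0)
K, P = 8, 16                                      # channel taps / pilot symbols (toy sizes)
pilots = torch.randn(P, K, requires_grad=True)    # learnable pilot matrix (illustrative)

def estimate_channel(pilots, h, noise_std=0.05):
    y = pilots @ h + noise_std * torch.randn(P)   # pilots observed through a linear channel
    # Least-squares channel estimate; every step is differentiable w.r.t. the pilots.
    return torch.linalg.solve(pilots.T @ pilots, pilots.T @ y)

# End-to-end objective: channel-estimation MSE averaged over random channel draws.
loss = 0.0
for _ in range(64):
    h = torch.randn(K)                            # random channel realization
    loss = loss + ((estimate_channel(pilots, h) - h) ** 2).mean()
loss = loss / 64

loss.backward()                                   # one reverse pass: gradient w.r.t. all pilots
print(pilots.grad.shape)
```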

Automatic differentiation (AD)

A family of techniques for computing exact derivatives of functions implemented as computer programs, by systematic application of the chain rule to elementary operations.

Related: Jacobian-vector product (JVP), vector-Jacobian product (VJP)

Jacobian-vector product (JVP)

The product $\mathbf{J}_f \mathbf{v}$ computed by forward-mode AD. Gives the directional derivative of $f$ along $\mathbf{v}$.

Vector-Jacobian product (VJP)

The product $\mathbf{J}_f^T \mathbf{u}$ computed by reverse-mode AD (backpropagation). When the output is scalar ($m = 1$), the VJP with $u = 1$ gives the full gradient.

Key Takeaway

Automatic differentiation provides exact gradients through arbitrary compositions of differentiable operations. Reverse mode (backpropagation) is the workhorse for imaging: it computes the full gradient of a scalar loss in a single backward pass. Forward mode is efficient for JVPs (e.g., power iteration on the denoiser Jacobian). Gradient checkpointing trades computation for memory in deep unrollings. These tools are essential for training unrolled algorithms, differentiable rendering, and analyzing PnP convergence.