GPU Acceleration for Imaging

Why GPUs for Imaging?

Even with the Kronecker vec trick, realistic 3D imaging problems require billions of floating-point operations per reconstruction. An ADMM solver running 200 iterations on a $256^3$-voxel scene must evaluate the forward operator and its adjoint 400 times. On a CPU, this takes hours; on a GPU, minutes.

The key observation is that the matrix-vector products in imaging are embarrassingly parallel: each output element depends on a weighted sum over many input elements, and these sums are independent. This maps perfectly to the GPU's massively parallel architecture, where thousands of threads execute simultaneously.

Definition:

Matrix-Free Linear Operators

A matrix-free operator represents the linear map $\mathbf{c} \mapsto \mathbf{A}\mathbf{c}$ as a function rather than a stored matrix. The forward operation $\mathbf{y} = \mathbf{A}\mathbf{c}$ is implemented as a callable that computes the matrix-vector product on the fly, and the adjoint $\mathbf{A}^{H}\mathbf{y}$ as a second callable.

Interface:

$$\texttt{forward}(\mathbf{c}) \to \mathbf{y}, \qquad \texttt{adjoint}(\mathbf{y}) \to \hat{\mathbf{c}}.$$

The operator never stores the $M \times Q$ matrix explicitly. It stores only the data needed to compute the product: Kronecker factors, steering vectors, frequency phases, and path-loss coefficients.
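
In code, this interface amounts to a pair of callables plus the nominal shape. A minimal Python sketch follows; the class name and layout are illustrative, not a fixed API:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MatrixFreeOperator:
    """A linear map stored as two callables, never as an M x Q matrix."""
    shape: tuple[int, int]                        # (M, Q)
    forward: Callable[[np.ndarray], np.ndarray]   # c -> y = A @ c
    adjoint: Callable[[np.ndarray], np.ndarray]   # y -> A^H @ y
```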

For the RF imaging system with $M = 2048$ and $Q = 16{,}384$, the full sensing matrix occupies $M \times Q \times 16$ bytes $\approx 536$ MB in complex128. In complex64 (sufficient for imaging), it is $\approx 268$ MB per Tx-Rx pair. With 6 pairs, that is about 1.6 GB: manageable but wasteful. For 3D problems ($Q = 128^3$ or larger), the matrix would require hundreds of gigabytes to terabytes.

Example: Matrix-Free Kronecker Operator in NumPy

Implement the forward and adjoint operations for $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ as matrix-free callables, using only the stored factors $\mathbf{A}_{1} \in \mathbb{C}^{M_1 \times N_1}$ and $\mathbf{A}_{2} \in \mathbb{C}^{M_2 \times N_2}$.
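
A minimal NumPy sketch of one possible solution, using the column-major vec identity $(\mathbf{A}_1 \otimes \mathbf{A}_2)\,\text{vec}(\mathbf{C}) = \text{vec}(\mathbf{A}_2 \mathbf{C} \mathbf{A}_1^{T})$; function names are illustrative:

```python
import numpy as np

def kron_forward(A1, A2, c):
    """Compute (A1 kron A2) @ c without ever forming the Kronecker product."""
    M1, N1 = A1.shape
    M2, N2 = A2.shape
    C = c.reshape((N2, N1), order="F")        # column-major unvec
    Y = A2 @ C @ A1.T                         # (M2 x N2)(N2 x N1)(N1 x M1)
    return Y.reshape(M1 * M2, order="F")      # vec(Y)

def kron_adjoint(A1, A2, y):
    """Compute (A1 kron A2)^H @ y via (A1 kron A2)^H = A1^H kron A2^H."""
    M1, N1 = A1.shape
    M2, N2 = A2.shape
    Y = y.reshape((M2, M1), order="F")        # column-major unvec
    C = A2.conj().T @ Y @ A1.conj()           # vec(A2^H Y conj(A1))
    return C.reshape(N1 * N2, order="F")

# Sanity check against the dense operator (small sizes only).
rng = np.random.default_rng(0)
A1 = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
A2 = rng.standard_normal((5, 2)) + 1j * rng.standard_normal((5, 2))
c = rng.standard_normal(6) + 1j * rng.standard_normal(6)
assert np.allclose(kron_forward(A1, A2, c), np.kron(A1, A2) @ c)
assert np.allclose(kron_adjoint(A1, A2, kron_forward(A1, A2, c)),
                   np.kron(A1, A2).conj().T @ (np.kron(A1, A2) @ c))
```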

Definition:

GPU Computing Frameworks for Imaging

Two frameworks dominate GPU-accelerated scientific computing in Python:

CuPy -- A drop-in replacement for NumPy that executes array operations on NVIDIA GPUs. The API is nearly identical to NumPy: cupy.array, cupy.fft.fftn, cupy.linalg.norm. CuPy is ideal when the code is already written in NumPy and the goal is to accelerate it with minimal changes.

PyTorch -- A deep-learning framework with first-class support for automatic differentiation (Section 4.3) and GPU tensors. PyTorch's torch.Tensor objects track computation graphs for backpropagation. PyTorch is the right choice when the reconstruction algorithm involves learnable parameters (e.g., unrolled networks, learned regularizers).

Both frameworks support complex-valued arithmetic (critical for RF imaging), batched operations (for multi-frequency/multi-pair processing), and mixed-precision computation (float16/float32 for speed, float64 for accuracy-critical steps).
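
As an illustration of the two styles, the snippet below performs the same complex GPU matrix-vector product in each framework; sizes and names are placeholders:

```python
# --- CuPy: accelerate NumPy-style code by swapping the array module ---
import cupy as cp
A = (cp.random.randn(64, 128) + 1j * cp.random.randn(64, 128)).astype(cp.complex64)
x = (cp.random.randn(128) + 1j * cp.random.randn(128)).astype(cp.complex64)
y_cupy = A @ x                       # complex matmul, runs on the GPU

# --- PyTorch: tensors carry a device and can participate in autograd ---
import torch
A_t = torch.randn(64, 128, dtype=torch.complex64, device="cuda")
x_t = torch.randn(128, dtype=torch.complex64, device="cuda", requires_grad=True)
y_torch = A_t @ x_t                  # differentiable complex GPU matmul
```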

Theorem: Parallelism in Kronecker-Factored Operations

The factored computation $\text{vec}(\mathbf{A}_{2} \mathbf{C} \mathbf{A}_{1}^{T})$ consists of two matrix multiplications:

  1. $\mathbf{T} = \mathbf{A}_{2} \mathbf{C}$: an $M_2 \times N_2$ matrix times an $N_2 \times N_1$ matrix.
  2. $\mathbf{R} = \mathbf{T} \mathbf{A}_{1}^{T}$: an $M_2 \times N_1$ matrix times an $N_1 \times M_1$ matrix.

Each matrix multiplication has $O(MNK)$ independent multiply-accumulate operations that can execute in parallel. On a GPU with $P$ processors, the wall-clock time scales as $O(MNK/P)$ for sufficiently large matrices.

GPUs achieve their speed advantage not through faster individual operations, but through massive parallelism. A modern GPU has thousands of processing cores. Matrix multiplication is the ideal workload because it has a high arithmetic intensity (many FLOPs per byte of memory traffic), which keeps the cores busy.

Batched Multi-Frequency Forward Operator on GPU

Complexity: $O(F \cdot N_1(M_2 N_2 + M_1 M_2))$ work, executed in parallel across $F$ frequencies
Input: Kronecker factors $\{\mathbf{A}_{1,f}, \mathbf{A}_{2,f}\}_{f=1}^{F}$ on GPU; image $\mathbf{c}$ on GPU
Output: Stacked measurements $\mathbf{y} = [\mathbf{y}_{1}^{T}, \ldots, \mathbf{y}_{F}^{T}]^{T}$

  1. Reshape $\mathbf{c}$ into $\mathbf{C} \in \mathbb{C}^{N_2 \times N_1}$ (view, no copy)
  2. Stack $\{\mathbf{A}_{2,f}\}$ into batch tensor $\mathcal{A}_2 \in \mathbb{C}^{F \times M_2 \times N_2}$
  3. $\mathcal{T} \leftarrow \texttt{torch.bmm}(\mathcal{A}_2, \mathbf{C}.\texttt{expand}(F, N_2, N_1))$ (batched matmul, all frequencies in parallel)
  4. Stack $\{\mathbf{A}_{1,f}^{T}\}$ into $\mathcal{B}_1 \in \mathbb{C}^{F \times N_1 \times M_1}$
  5. $\mathcal{R} \leftarrow \texttt{torch.bmm}(\mathcal{T}, \mathcal{B}_1)$
  6. $\mathbf{y} \leftarrow \mathcal{R}.\texttt{reshape}(F \cdot M_2 \cdot M_1)$
The torch.bmm operation executes all $F$ matrix multiplications in a single GPU kernel launch, amortizing the kernel launch overhead across frequencies. For $F = 2$ this overhead is negligible, but for wideband systems with $F = 64$--$512$ subcarriers, batching provides a $5$--$10\times$ speedup over sequential per-frequency calls.
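
A possible PyTorch realization of the algorithm above; argument names follow the steps, and this is a sketch rather than a fixed API:

```python
import torch

def batched_forward(A2_batch, B1_batch, c, N1, N2):
    """Batched multi-frequency forward: y_f = (A1_f kron A2_f) c for all f.

    A2_batch: (F, M2, N2) complex64 tensor on GPU
    B1_batch: (F, N1, M1) complex64 tensor on GPU, with B1_batch[f] = A1_f.T
    c:        (N1*N2,)    complex64 image vector on GPU
    Returns y of shape (F*M2*M1,): stacked per-frequency measurements.
    """
    F, M2, _ = A2_batch.shape
    # Step 1: column-major unvec of c into C (N2 x N1) -- a view, no copy.
    C = c.reshape(N1, N2).T
    # Steps 2-3: one kernel launch multiplies all F frequency blocks.
    T = torch.bmm(A2_batch, C.expand(F, N2, N1))   # (F, M2, N1)
    # Steps 4-5: right-multiply by the stacked A1_f^T factors.
    R = torch.bmm(T, B1_batch)                     # (F, M2, M1)
    # Step 6: vec each R_f (column-major) and stack across frequencies.
    return R.transpose(1, 2).reshape(-1)
```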

⚠️ Engineering Note

GPU Memory Management for Large-Scale 3D Imaging

A 3D RF imaging problem with $Q = 256^3 \approx 1.7 \times 10^7$ voxels and 6 Tx-Rx pairs yields a dense sensing matrix that far exceeds GPU memory (typically 16--80 GB).

Strategies:

  1. Never store the full operator. Use matrix-free evaluation exclusively. The Kronecker factors for all pairs fit in $\sim 100$ MB.

  2. Process pairs sequentially. Each pair's forward/adjoint evaluation is independent. Keep one pair's data on GPU at a time; overlap host-device transfers with computation using CUDA streams (see the sketch after this list).

  3. Mixed precision. Use complex64 (float32 real and imaginary parts) for all operator evaluations. The imaging noise floor is typically $-30$ to $-60$ dB, far above the float32 precision of $\sim 10^{-7}$.

  4. Gradient checkpointing. For unrolled algorithms (Section 4.3), do not store all intermediate iterates. Recompute forward passes during backpropagation to trade computation for memory.
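
A minimal sketch combining strategies 1--3, assuming an iterable of per-pair CPU factor tensors (the data layout and helper name are hypothetical):

```python
import torch

def forward_all_pairs(pairs_cpu, c, device="cuda"):
    """Evaluate the forward model pair by pair, one pair on the GPU at a time.

    `pairs_cpu` is assumed to be a list of (A1, A2) CPU factor tensors,
    one per Tx-Rx pair. Only the small factors are ever transferred.
    """
    c_gpu = c.to(device=device, dtype=torch.complex64)
    outputs = []
    for A1_cpu, A2_cpu in pairs_cpu:
        # Strategy 3: complex64 everywhere. Strategy 2: transfer only this
        # pair's factors (true overlap needs pinned memory / CUDA streams).
        A1 = A1_cpu.to(device=device, dtype=torch.complex64, non_blocking=True)
        A2 = A2_cpu.to(device=device, dtype=torch.complex64, non_blocking=True)
        # Strategy 1: matrix-free Kronecker evaluation, never forming A.
        C = c_gpu.reshape(A1.shape[1], A2.shape[1]).T   # N2 x N1 view
        Y = A2 @ C @ A1.T                               # M2 x M1
        outputs.append(Y.T.reshape(-1))                 # vec(Y)
    return torch.cat(outputs)                           # still on the GPU
```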

Example: GPU vs CPU Timing for RF Imaging Forward Operator

Compare the wall-clock time for evaluating the forward operator $\mathbf{A}\mathbf{c}$ for a single Tx-Rx pair with $M = 2048$, $Q = 16{,}384$, $F = 2$ subcarriers on: (a) CPU with dense matrix-vector product, (b) CPU with Kronecker vec trick, (c) GPU with Kronecker vec trick using CuPy.
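
One way to set up the comparison is sketched below. The factor sizes are chosen so that $M_1 M_2 = 2048$ and $N_1 N_2 = 16{,}384$; a single subcarrier is timed for brevity, and absolute numbers depend entirely on hardware:

```python
import time
import numpy as np
import cupy as cp

# Sizes from the example: M = M1*M2 = 2048, Q = N1*N2 = 16,384.
M1, M2, N1, N2 = 32, 64, 128, 128
rng = np.random.default_rng(0)

def cplx(shape):
    return (rng.standard_normal(shape)
            + 1j * rng.standard_normal(shape)).astype(np.complex64)

A1, A2, c = cplx((M1, N1)), cplx((M2, N2)), cplx(N1 * N2)

# (a) CPU, dense matrix-vector product (268 MB matrix in complex64).
A_dense = np.kron(A1, A2)
t0 = time.perf_counter()
y_a = A_dense @ c
t_a = time.perf_counter() - t0

# (b) CPU, Kronecker vec trick: two small matmuls.
t0 = time.perf_counter()
y_b = (A2 @ c.reshape((N2, N1), order="F") @ A1.T).reshape(-1, order="F")
t_b = time.perf_counter() - t0

# (c) GPU, Kronecker vec trick with CuPy.
A1_g, A2_g, c_g = map(cp.asarray, (A1, A2, c))
_ = A2_g @ c_g.reshape((N2, N1), order="F")   # warm-up (kernel compilation)
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
y_c = (A2_g @ c_g.reshape((N2, N1), order="F") @ A1_g.T).reshape(-1, order="F")
cp.cuda.Stream.null.synchronize()             # kernels launch asynchronously
t_c = time.perf_counter() - t0

print(f"(a) dense CPU {t_a:.4f} s  (b) vec CPU {t_b:.4f} s  (c) vec GPU {t_c:.4f} s")
```

Note the explicit synchronize() calls: GPU kernels return control to Python immediately, so timing without synchronization would measure only the launch, not the computation.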

CuPy vs PyTorch for RF Imaging

| Feature | CuPy | PyTorch |
| --- | --- | --- |
| API compatibility | Nearly identical to NumPy | Distinct tensor API |
| Automatic differentiation | No native support | Built-in autograd |
| Complex number support | Full (mirrors NumPy) | Full (since v1.7) |
| FFT support | cupy.fft (mirrors numpy.fft) | torch.fft |
| Batched matrix multiply | cupy.matmul with batch dims | torch.bmm / torch.matmul |
| Migration effort | Minimal (change import) | Moderate (API differences) |
| Best use case | Accelerating existing NumPy code | Learned reconstruction / AD |

Common Mistake: Host-Device Transfer Overhead

Mistake:

Repeatedly transferring data between CPU and GPU within an iterative loop: for example, copying the iterate to CPU for a proximal step, then back to GPU for the gradient step.

Correction:

Host-device transfers over PCIe have bandwidth of $\sim 12$ GB/s with $\sim 10~\mu$s latency per transfer. In an iterative algorithm running hundreds of iterations, these transfers accumulate and can dominate the total runtime.

Solution: Keep all data on the GPU for the entire reconstruction. Implement proximal operators (soft thresholding, projection onto convex sets) directly on the GPU. Only transfer the final result back to CPU. If a specific operation lacks a GPU implementation, consider rewriting it rather than round-tripping through CPU.
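
As a concrete illustration of this rule, the sketch below keeps a schematic ISTA loop entirely GPU-resident, with complex soft thresholding implemented directly on device; the solver structure and parameters are illustrative:

```python
import torch

def soft_threshold(x, tau):
    """Complex soft thresholding, computed entirely on the GPU."""
    mag = x.abs()
    scale = torch.clamp(mag - tau, min=0.0) / mag.clamp_min(1e-20)
    return x * scale

def ista(forward, adjoint, y, tau, step, n_iter=200):
    """Schematic ISTA loop that never leaves the GPU.

    `forward`/`adjoint` are matrix-free GPU callables (e.g. the sketches
    above); `y` is the measurement tensor, already resident on the GPU.
    """
    c = torch.zeros_like(adjoint(y))
    for _ in range(n_iter):
        grad = adjoint(forward(c) - y)              # all GPU ops, no .cpu()
        c = soft_threshold(c - step * grad, tau * step)
    return c.cpu()                                  # single final transfer
```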

Quick Check

A complex64 matrix of size $M \times Q$ with $M = 2048$ and $Q = 16{,}384$ occupies approximately how much GPU memory?

(a) 268 MB
(b) 536 MB
(c) 134 MB
(d) 67 MB

Historical Note: From Graphics to General-Purpose Computing

2007--present

GPU computing for scientific applications began in earnest around 2007 with NVIDIA's release of CUDA. The idea of using graphics hardware for non-graphics computation had been explored since the early 2000s: researchers would encode scientific data as textures and use shader programs to perform computations. CUDA replaced this awkward paradigm with a C-like programming model.

The impact on computational imaging was immediate. CT reconstruction that previously required overnight batch processing could now run in real time during patient scanning. MRI reconstruction with compressed sensing, which had been impractical due to the iterative algorithms involved, became clinically feasible. RF imaging, with its similar computational structure, inherits these same benefits.

Matrix-free operator

A linear operator represented as a pair of functions (forward and adjoint) rather than an explicitly stored matrix. Essential when the matrix is too large to store in memory.

Related: Kronecker product.

Batched operation

A computation that applies the same operation to multiple independent data items simultaneously on a GPU, amortizing kernel launch overhead and maximizing hardware utilization.

Key Takeaway

Matrix-free operators, implementing $\mathbf{A}$ and $\mathbf{A}^{H}$ as functions rather than stored matrices, are essential for large-scale imaging. GPUs accelerate these operators through massive parallelism, and batched operations across frequencies or Tx-Rx pairs amortize overhead. CuPy provides the fastest migration path from NumPy code; PyTorch is preferred when automatic differentiation is needed. The critical rule: keep all data on the GPU throughout the reconstruction loop and avoid host-device transfers.