GPU Acceleration for Imaging
Why GPUs for Imaging?
Even with the Kronecker vec trick, realistic 3D imaging problems require billions of floating-point operations per reconstruction. An ADMM solver running 200 iterations on a scene with millions of voxels must evaluate the forward operator and its adjoint 400 times. On a CPU, this takes hours; on a GPU, minutes.
The key observation is that the matrix-vector products in imaging are embarrassingly parallel: each output element depends on a weighted sum over many input elements, and these sums are independent. This maps perfectly to the GPU's massively parallel architecture, where thousands of threads execute simultaneously.
Definition: Matrix-Free Linear Operators
A matrix-free operator represents a linear map as a pair of functions rather than a stored matrix. The forward operation is implemented as a callable that computes the matrix-vector product on the fly, and the adjoint as a second callable.
Interface:
The operator never stores the matrix explicitly. It stores only the data needed to compute the product: Kronecker factors, steering vectors, frequency phases, and path-loss coefficients.
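A minimal sketch of such an interface in NumPy, storing only the Kronecker factors; the class name `MatrixFreeKron` and its method names are illustrative, not taken from any particular library:

```python
import numpy as np

class MatrixFreeKron:
    """Matrix-free operator for A = A1 (kron) A2: stores only the factors."""

    def __init__(self, A1, A2):
        self.A1, self.A2 = A1, A2
        # Logical shape of the never-materialized matrix A
        self.shape = (A1.shape[0] * A2.shape[0], A1.shape[1] * A2.shape[1])

    def forward(self, c):
        # y = A c, computed as two small matmuls instead of one huge matvec
        N2, N1 = self.A2.shape[1], self.A1.shape[1]
        return (self.A2 @ c.reshape(N2, N1) @ self.A1.T).ravel()

    def adjoint(self, y):
        # c = A^H y, using the conjugate-transposed factors
        M2, M1 = self.A2.shape[0], self.A1.shape[0]
        return (self.A2.conj().T @ y.reshape(M2, M1) @ self.A1.conj()).ravel()
```

The same class works unchanged with CuPy arrays, since the `@`, `reshape`, `conj`, and `ravel` operations all have identical semantics there.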
For the RF imaging system considered here, the full sensing matrix for one Tx-Rx pair occupies roughly 536 MB in complex128. In complex64 (sufficient for imaging), it is roughly 268 MB per pair; with 6 pairs, that is about 1.6 GB -- manageable but wasteful. For 3D problems, the matrix would require terabytes.
Example: Matrix-Free Kronecker Operator in NumPy
Implement the forward and adjoint operations for the Kronecker-factored operator A = A1 ⊗ A2 as matrix-free callables, using only the stored factors A1 and A2.
Forward operation
```python
import numpy as np

def forward(c, A1, A2):
    N2, N1 = A2.shape[1], A1.shape[1]
    C = c.reshape(N2, N1)
    return (A2 @ C @ A1.T).ravel()
```
This computes A2 C A1^T in two small matrix multiplications instead of one large matrix-vector product. (With NumPy's row-major `ravel`, the result matches the dense product `np.kron(A2, A1) @ c`.)
Adjoint operation
```python
def adjoint(y, A1, A2):
    M2, M1 = A2.shape[0], A1.shape[0]
    Y = y.reshape(M2, M1)
    return (A2.conj().T @ Y @ A1.conj()).ravel()
```
This computes A2^H Y A1^*, whose ravel equals A^H y, again in two small matrix multiplications.
Verification
The adjoint relationship ⟨A c, y⟩ = ⟨c, A^H y⟩ should be verified numerically (with A1, A2 and the dimensions N1, N2, M1, M2 defined as above):

```python
c = np.random.randn(N1*N2) + 1j*np.random.randn(N1*N2)
y = np.random.randn(M1*M2) + 1j*np.random.randn(M1*M2)
lhs = np.vdot(forward(c, A1, A2), y)   # <A c, y>
rhs = np.vdot(c, adjoint(y, A1, A2))   # <c, A^H y>
assert np.abs(lhs - rhs) < 1e-8 * np.abs(lhs)  # relative, not absolute, tolerance
```
Definition: GPU Computing Frameworks for Imaging
Two frameworks dominate GPU-accelerated scientific computing in Python:
CuPy -- A drop-in replacement for NumPy that executes array operations on NVIDIA GPUs. The API is nearly identical to NumPy: cupy.array, cupy.fft.fftn, cupy.linalg.norm. CuPy is ideal when the code is already written in NumPy and the goal is to accelerate it with minimal changes.

PyTorch -- A deep-learning framework with first-class support for automatic differentiation (Section 4.3) and GPU tensors. PyTorch's torch.Tensor objects track computation graphs for backpropagation. PyTorch is the right choice when the reconstruction algorithm involves learnable parameters (e.g., unrolled networks, learned regularizers).
Both frameworks support complex-valued arithmetic (critical for RF imaging), batched operations (for multi-frequency/multi-pair processing), and mixed-precision computation (float16/float32 for speed, float64 for accuracy-critical steps).
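One consequence of CuPy's NumPy compatibility is that algorithm code can be written once and dispatched to whichever module owns the input array. A sketch of this idiom, built on CuPy's `cupy.get_array_module` helper with a NumPy fallback so it also runs CPU-only (the function names `get_xp` and `normalize` are illustrative):

```python
import numpy as np

def get_xp(a):
    """Return the array module (cupy or numpy) that owns array a.

    Uses cupy.get_array_module when CuPy is installed; otherwise
    falls back to numpy so the code still runs on CPU-only machines.
    """
    try:
        import cupy
        return cupy.get_array_module(a)
    except ImportError:
        return np

def normalize(v):
    # Runs unchanged on CPU (numpy) or GPU (cupy) arrays
    xp = get_xp(v)
    return v / xp.linalg.norm(v)
```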
Theorem: Parallelism in Kronecker-Factored Operations
The factored computation y = vec(A2 C A1^T) consists of two matrix multiplications:
- C A1^T: an N2 × N1 matrix times an N1 × M1 matrix.
- A2 (C A1^T): an M2 × N2 matrix times an N2 × M1 matrix.
Each matrix multiplication consists of independent output elements -- one inner product each -- that can be computed in parallel. On a GPU with p processors, the wall-clock time scales as (total work)/p for sufficiently large matrices.
GPUs achieve their speed advantage not through faster individual operations, but through massive parallelism. A modern GPU has thousands of processing cores. Matrix multiplication is the ideal workload because it has a high arithmetic intensity (many FLOPs per byte of memory traffic), which keeps the cores busy.
Work-depth analysis
For a matrix product of an m × k matrix with a k × n matrix:
- Total work: O(mnk) (sequential FLOPs).
- Depth: O(log k) (critical path with tree reduction).
- Parallelism: work/depth = O(mnk / log k).
For the matrix sizes arising in this problem, the available parallelism runs into the millions, comfortably exceeding the thousands of cores on modern GPUs.
Batched Multi-Frequency Forward Operator on GPU
Complexity: the same total work as K sequential per-frequency applications of the factored operator, but executed in parallel across frequencies. The torch.bmm operation executes all K matrix multiplications in a single GPU kernel launch, amortizing the kernel launch overhead across frequencies. For a handful of subcarriers this overhead is negligible, but for wideband systems with hundreds or thousands of subcarriers, batching provides a substantial speedup over sequential per-frequency calls.
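The batching pattern can be sketched in NumPy, whose batched-matmul semantics `cupy.matmul` mirrors on the GPU (`torch.bmm` is the PyTorch equivalent); the function name `forward_batched` and the dimensions below are illustrative:

```python
import numpy as np

def forward_batched(C, A1, A2):
    """Apply per-frequency Kronecker forward operators in one batched call.

    A1: (K, M1, N1), A2: (K, M2, N2), C: (K, N2, N1) -> y: (K, M2*M1).
    The @ operator broadcasts over the leading K (frequency) axis,
    so all K products run in a single batched operation.
    """
    Y = A2 @ C @ np.swapaxes(A1, -1, -2)
    return Y.reshape(Y.shape[0], -1)

# Synthetic example: K = 4 subcarriers, small per-frequency factors
rng = np.random.default_rng(0)
K, M1, N1, M2, N2 = 4, 8, 16, 8, 16
A1 = rng.standard_normal((K, M1, N1)) + 1j * rng.standard_normal((K, M1, N1))
A2 = rng.standard_normal((K, M2, N2)) + 1j * rng.standard_normal((K, M2, N2))
C = rng.standard_normal((K, N2, N1)) + 1j * rng.standard_normal((K, N2, N1))

y = forward_batched(C, A1, A2)   # shape (K, M2*M1)
```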
GPU Memory Management for Large-Scale 3D Imaging
A 3D RF imaging problem with millions of voxels and 6 Tx-Rx pairs implies a sensing matrix that far exceeds GPU memory (typically 16--80 GB).
Strategies:
- Never store the full operator. Use matrix-free evaluation exclusively. The Kronecker factors for all pairs fit in a few megabytes.
- Process pairs sequentially. Each pair's forward/adjoint evaluation is independent. Keep one pair's data on the GPU at a time; overlap host-device transfers with computation using CUDA streams.
- Mixed precision. Use complex64 (float32 real and imaginary parts) for all operator evaluations. The imaging noise floor sits far above the float32 relative precision of about 1e-7 (roughly -140 dB).
- Gradient checkpointing. For unrolled algorithms (Section 4.3), do not store all intermediate iterates. Recompute forward passes during backpropagation to trade computation for memory.
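To see why complex64 suffices for operator evaluations, one can compare a forward product computed in both precisions; a small NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 64, 256
A = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
c = rng.standard_normal(N) + 1j * rng.standard_normal(N)

y64 = A.astype(np.complex64) @ c.astype(np.complex64)   # single precision
y128 = A @ c                                            # double precision
rel_err = np.linalg.norm(y64 - y128) / np.linalg.norm(y128)
# rel_err lands near float32 machine precision (~1e-7 to 1e-6),
# i.e. roughly -120 dB, far below typical imaging noise floors
```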
Example: GPU vs CPU Timing for RF Imaging Forward Operator
Compare the wall-clock time for evaluating the forward operator for a single Tx-Rx pair on: (a) a CPU with a dense matrix-vector product, (b) a CPU with the Kronecker vec trick, (c) a GPU with the Kronecker vec trick using CuPy.
CPU dense product
The full matrix A has MN complex entries. A complex matrix-vector product costs about 8MN real FLOPs (each complex multiply-accumulate is 4 real multiplies and 4 real additions). At the few-GFLOP/s rates a CPU sustains on complex arithmetic, each evaluation takes on the order of tens of milliseconds or more at realistic imaging sizes.
CPU Kronecker vec trick
Two matrix multiplications, C A1^T and A2 (C A1^T), cost about 8 N2 M1 (N1 + M2) real FLOPs per subcarrier, versus 8 M1 M2 N1 N2 for the dense product -- a reduction by a factor of M2 N1 / (N1 + M2), typically one to two orders of magnitude, with a corresponding wall-clock speedup.
GPU Kronecker vec trick
The same FLOPs on a GPU with multi-TFLOP/s throughput would take only microseconds in theory. In practice, kernel launch overhead (microseconds per launch) dominates at this small problem size, so measured times fall well short of the theoretical peak. The GPU advantage emerges for larger problems, where the GPU vec trick outpaces the CPU vec trick by one or more orders of magnitude.
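The dense-versus-vec-trick equivalence behind comparisons (a) and (b) can be checked directly at a reduced, synthetic problem size. Note that with NumPy's row-major `ravel`, the dense operator corresponding to the `forward` code above is `np.kron(A2, A1)`:

```python
import numpy as np

rng = np.random.default_rng(1)
M1, N1, M2, N2 = 8, 32, 8, 32
A1 = rng.standard_normal((M1, N1)) + 1j * rng.standard_normal((M1, N1))
A2 = rng.standard_normal((M2, N2)) + 1j * rng.standard_normal((M2, N2))
c = rng.standard_normal(N1 * N2) + 1j * rng.standard_normal(N1 * N2)

# (a) dense product: one large matvec against the materialized matrix
y_dense = np.kron(A2, A1) @ c

# (b) Kronecker vec trick: two small matmuls, no materialized matrix
y_trick = (A2 @ c.reshape(N2, N1) @ A1.T).ravel()

assert np.allclose(y_dense, y_trick)
```

Wrapping each branch in `time.perf_counter()` calls at larger dimensions reproduces the CPU speedup; swapping `np` for `cupy` gives the GPU variant (c).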
CuPy vs PyTorch for RF Imaging
| Feature | CuPy | PyTorch |
|---|---|---|
| API compatibility | Nearly identical to NumPy | Distinct tensor API |
| Automatic differentiation | No native support | Built-in autograd |
| Complex number support | Full (mirrors NumPy) | Full (since v1.7) |
| FFT support | cupy.fft (mirrors scipy.fft) | torch.fft |
| Batched matrix multiply | cupy.matmul with batch dims | torch.bmm / torch.matmul |
| Migration effort | Minimal (change import) | Moderate (API differences) |
| Best use case | Accelerating existing NumPy code | Learned reconstruction / AD |
Common Mistake: Host-Device Transfer Overhead
Mistake:
Repeatedly transferring data between CPU and GPU within an iterative loop -- for example, copying the iterate to CPU for a proximal step, then back to GPU for the gradient step.
Correction:
Host-device transfers over PCIe offer bandwidth on the order of tens of GB/s with several microseconds of latency per transfer. In an iterative algorithm running hundreds of iterations, these transfers accumulate and can dominate the total runtime.
Solution: Keep all data on the GPU for the entire reconstruction. Implement proximal operators (soft thresholding, projection onto convex sets) directly on the GPU. Only transfer the final result back to CPU. If a specific operation lacks a GPU implementation, consider rewriting it rather than round-tripping through CPU.
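As an example of a proximal operator that needs no CPU round trip, complex soft thresholding (the prox of the l1 norm) uses only elementwise operations. Shown in NumPy; with CuPy arrays the identical expressions (with `np` replaced by `cupy`) execute entirely on the device. The function name `soft_threshold` is illustrative:

```python
import numpy as np

def soft_threshold(x, tau):
    """Complex soft-thresholding: shrink magnitudes by tau, keep phases.

    Elementwise only, so it runs fully on the GPU when given device
    arrays -- no host-device transfer inside the reconstruction loop.
    """
    mag = np.abs(x)
    # np.where guards against division by zero at exactly-zero entries
    return x * (np.maximum(mag - tau, 0.0) / np.where(mag > 0, mag, 1.0))
```

Projections onto convex sets (e.g., clipping to a box, renormalizing onto a ball) follow the same elementwise pattern and stay on the device the same way.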
Quick Check
For the RF imaging system above, the sensing matrix for a single Tx-Rx pair stored in complex64 occupies approximately how much GPU memory?
268 MB
536 MB
134 MB
67 MB
Answer: 268 MB. A complex64 entry occupies 8 bytes, and the roughly 1.6 GB quoted for all 6 pairs corresponds to about 1.6 GB / 6 ≈ 268 MB per pair.
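The arithmetic follows directly from the 1.6 GB figure quoted earlier for 6 Tx-Rx pairs:

```python
# A complex64 entry is 8 bytes (two float32 values). The ~1.6 GB total
# for 6 Tx-Rx pairs implies one pair's matrix is roughly:
per_pair_MB = 1.6e9 / 6 / 1e6
# about 267 MB, consistent with the 268 MB option
```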
Historical Note: From Graphics to General-Purpose Computing
GPU computing for scientific applications began in earnest around 2007 with NVIDIA's release of CUDA. The idea of using graphics hardware for non-graphics computation had been explored since the early 2000s -- researchers would encode scientific data as textures and use shader programs to perform computations. CUDA replaced this awkward paradigm with a C-like programming model.
The impact on computational imaging was immediate. CT reconstruction that previously required overnight batch processing could now run in real time during patient scanning. MRI reconstruction with compressed sensing, which had been impractical due to the iterative algorithms involved, became clinically feasible. RF imaging, with its similar computational structure, inherits these same benefits.
Matrix-free operator
A linear operator represented as a pair of functions (forward and adjoint) rather than an explicitly stored matrix. Essential when the matrix is too large to store in memory.
Related: {{Ref:Gloss Kronecker Product}}
Batched operation
A computation that applies the same operation to multiple independent data items simultaneously on a GPU, amortizing kernel launch overhead and maximizing hardware utilization.
Key Takeaway
Matrix-free operators -- implementing the forward operator and its adjoint as functions rather than stored matrices -- are essential for large-scale imaging. GPUs accelerate these operators through massive parallelism, and batched operations across frequencies or Tx-Rx pairs amortize overhead. CuPy provides the fastest migration path from NumPy code; PyTorch is preferred when automatic differentiation is needed. The critical rule: keep all data on the GPU throughout the reconstruction loop and avoid host-device transfers.