GPU Acceleration for Imaging

Why GPUs for Imaging?

Even with the Kronecker vec trick, realistic 3D imaging problems require billions of floating-point operations per reconstruction. An ADMM solver running 200 iterations on a $256^3$-voxel scene must evaluate the forward operator and its adjoint 400 times. On a CPU, this takes hours; on a GPU, minutes.

The key observation is that the matrix-vector products in imaging are embarrassingly parallel: each output element depends on a weighted sum over many input elements, and these sums are independent. This maps perfectly to the GPU's massively parallel architecture, where thousands of threads execute simultaneously.

Definition:

Matrix-Free Linear Operators

A matrix-free operator represents the linear map $\mathbf{c} \mapsto \mathbf{A}\mathbf{c}$ as a function rather than a stored matrix. The forward operation $\mathbf{y} = \mathbf{A}\mathbf{c}$ is implemented as a callable that computes the matrix-vector product on the fly, and the adjoint $\mathbf{A}^{H}\mathbf{y}$ as a second callable.

Interface:

$$\texttt{forward}(\mathbf{c}) \to \mathbf{y}, \qquad \texttt{adjoint}(\mathbf{y}) \to \hat{\mathbf{c}}.$$

The operator never stores the $M \times Q$ matrix explicitly. It stores only the data needed to compute the product: Kronecker factors, steering vectors, frequency phases, and path-loss coefficients.
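
In code, this interface amounts to a pair of callables plus the nominal shape. A minimal Python sketch follows; the class name and layout are illustrative, not a fixed API:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class MatrixFreeOperator:
    """A linear map stored as two callables, never as an M x Q matrix."""
    shape: tuple[int, int]                        # (M, Q)
    forward: Callable[[np.ndarray], np.ndarray]   # c -> y = A @ c
    adjoint: Callable[[np.ndarray], np.ndarray]   # y -> A^H @ y
```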

For the RF imaging system with $M = 2048$ and $Q = 16{,}384$, the full sensing matrix occupies $M \times Q \times 16$ bytes $\approx 536$ MB in complex128. In complex64 (sufficient for imaging), it is $\approx 268$ MB per Tx-Rx pair. With 6 pairs, that is about 1.6 GB: manageable but wasteful. For 3D problems ($Q = 128^3$ or larger), the matrix would require hundreds of gigabytes to terabytes.

Example: Matrix-Free Kronecker Operator in NumPy

Implement the forward and adjoint operations for $\mathbf{A} = \mathbf{A}_{1} \otimes \mathbf{A}_{2}$ as matrix-free callables, using only the stored factors $\mathbf{A}_{1} \in \mathbb{C}^{M_1 \times N_1}$ and $\mathbf{A}_{2} \in \mathbb{C}^{M_2 \times N_2}$.
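
A minimal NumPy sketch of one possible solution, using the column-major vec identity $(\mathbf{A}_1 \otimes \mathbf{A}_2)\,\text{vec}(\mathbf{C}) = \text{vec}(\mathbf{A}_2 \mathbf{C} \mathbf{A}_1^{T})$; function names are illustrative:

```python
import numpy as np

def kron_forward(A1, A2, c):
    """Compute (A1 kron A2) @ c without ever forming the Kronecker product."""
    M1, N1 = A1.shape
    M2, N2 = A2.shape
    C = c.reshape((N2, N1), order="F")        # column-major unvec
    Y = A2 @ C @ A1.T                         # (M2 x N2)(N2 x N1)(N1 x M1)
    return Y.reshape(M1 * M2, order="F")      # vec(Y)

def kron_adjoint(A1, A2, y):
    """Compute (A1 kron A2)^H @ y via (A1 kron A2)^H = A1^H kron A2^H."""
    M1, N1 = A1.shape
    M2, N2 = A2.shape
    Y = y.reshape((M2, M1), order="F")        # column-major unvec
    C = A2.conj().T @ Y @ A1.conj()           # vec(A2^H Y conj(A1))
    return C.reshape(N1 * N2, order="F")

# Sanity check against the dense operator (small sizes only).
rng = np.random.default_rng(0)
A1 = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
A2 = rng.standard_normal((5, 2)) + 1j * rng.standard_normal((5, 2))
c = rng.standard_normal(6) + 1j * rng.standard_normal(6)
assert np.allclose(kron_forward(A1, A2, c), np.kron(A1, A2) @ c)
assert np.allclose(kron_adjoint(A1, A2, kron_forward(A1, A2, c)),
                   np.kron(A1, A2).conj().T @ (np.kron(A1, A2) @ c))
```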

Definition:

GPU Computing Frameworks for Imaging

Two frameworks dominate GPU-accelerated scientific computing in Python:

CuPy -- A drop-in replacement for NumPy that executes array operations on NVIDIA GPUs. The API is nearly identical to NumPy: cupy.array, cupy.fft.fftn, cupy.linalg.norm. CuPy is ideal when the code is already written in NumPy and the goal is to accelerate it with minimal changes.

PyTorch -- A deep-learning framework with first-class support for automatic differentiation (Section 4.3) and GPU tensors. PyTorch's torch.Tensor objects track computation graphs for backpropagation. PyTorch is the right choice when the reconstruction algorithm involves learnable parameters (e.g., unrolled networks, learned regularizers).

Both frameworks support complex-valued arithmetic (critical for RF imaging), batched operations (for multi-frequency/multi-pair processing), and mixed-precision computation (float16/float32 for speed, float64 for accuracy-critical steps).
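
As an illustration of the two styles, the snippet below performs the same complex GPU matrix-vector product in each framework; sizes and names are placeholders:

```python
# --- CuPy: accelerate NumPy-style code by swapping the array module ---
import cupy as cp
A = (cp.random.randn(64, 128) + 1j * cp.random.randn(64, 128)).astype(cp.complex64)
x = (cp.random.randn(128) + 1j * cp.random.randn(128)).astype(cp.complex64)
y_cupy = A @ x                       # complex matmul, runs on the GPU

# --- PyTorch: tensors carry a device and can participate in autograd ---
import torch
A_t = torch.randn(64, 128, dtype=torch.complex64, device="cuda")
x_t = torch.randn(128, dtype=torch.complex64, device="cuda", requires_grad=True)
y_torch = A_t @ x_t                  # differentiable complex GPU matmul
```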

Theorem: Parallelism in Kronecker-Factored Operations

The factored computation $\text{vec}(\mathbf{A}_{2} \mathbf{C} \mathbf{A}_{1}^{T})$ consists of two matrix multiplications:

  1. $\mathbf{T} = \mathbf{A}_{2} \mathbf{C}$: an $M_2 \times N_2$ matrix times an $N_2 \times N_1$ matrix.
  2. $\mathbf{R} = \mathbf{T} \mathbf{A}_{1}^{T}$: an $M_2 \times N_1$ matrix times an $N_1 \times M_1$ matrix.

Each matrix multiplication has $O(MNK)$ independent multiply-accumulate operations that can execute in parallel. On a GPU with $P$ processors, the wall-clock time scales as $O(MNK/P)$ for sufficiently large matrices.

GPUs achieve their speed advantage not through faster individual operations, but through massive parallelism. A modern GPU has thousands of processing cores. Matrix multiplication is the ideal workload because it has a high arithmetic intensity (many FLOPs per byte of memory traffic), which keeps the cores busy.

Batched Multi-Frequency Forward Operator on GPU

Complexity: $O(F \cdot N_1(M_2 N_2 + M_1 M_2))$ work, executed in parallel across $F$ frequencies
Input: Kronecker factors $\{\mathbf{A}_{1,f}, \mathbf{A}_{2,f}\}_{f=1}^{F}$ on GPU; image $\mathbf{c}$ on GPU
Output: Stacked measurements $\mathbf{y} = [\mathbf{y}_{1}^{T}, \ldots, \mathbf{y}_{F}^{T}]^{T}$

  1. Reshape $\mathbf{c}$ into $\mathbf{C} \in \mathbb{C}^{N_2 \times N_1}$ (view, no copy)
  2. Stack $\{\mathbf{A}_{2,f}\}$ into batch tensor $\mathcal{A}_2 \in \mathbb{C}^{F \times M_2 \times N_2}$
  3. $\mathcal{T} \leftarrow \texttt{torch.bmm}(\mathcal{A}_2, \mathbf{C}.\texttt{expand}(F, N_2, N_1))$ (batched matmul, all frequencies in parallel)
  4. Stack $\{\mathbf{A}_{1,f}^{T}\}$ into $\mathcal{B}_1 \in \mathbb{C}^{F \times N_1 \times M_1}$
  5. $\mathcal{R} \leftarrow \texttt{torch.bmm}(\mathcal{T}, \mathcal{B}_1)$
  6. $\mathbf{y} \leftarrow \mathcal{R}.\texttt{reshape}(F \cdot M_2 \cdot M_1)$
The torch.bmm operation executes all $F$ matrix multiplications in a single GPU kernel launch, amortizing the kernel launch overhead across frequencies. For $F = 2$ this overhead is negligible, but for wideband systems with $F = 64$--$512$ subcarriers, batching provides a $5$--$10\times$ speedup over sequential per-frequency calls.
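
A possible PyTorch realization of the algorithm above; argument names follow the steps, and this is a sketch rather than a fixed API:

```python
import torch

def batched_forward(A2_batch, B1_batch, c, N1, N2):
    """Batched multi-frequency forward: y_f = (A1_f kron A2_f) c for all f.

    A2_batch: (F, M2, N2) complex64 tensor on GPU
    B1_batch: (F, N1, M1) complex64 tensor on GPU, with B1_batch[f] = A1_f.T
    c:        (N1*N2,)    complex64 image vector on GPU
    Returns y of shape (F*M2*M1,): stacked per-frequency measurements.
    """
    F, M2, _ = A2_batch.shape
    # Step 1: column-major unvec of c into C (N2 x N1) -- a view, no copy.
    C = c.reshape(N1, N2).T
    # Steps 2-3: one kernel launch multiplies all F frequency blocks.
    T = torch.bmm(A2_batch, C.expand(F, N2, N1))   # (F, M2, N1)
    # Steps 4-5: right-multiply by the stacked A1_f^T factors.
    R = torch.bmm(T, B1_batch)                     # (F, M2, M1)
    # Step 6: vec each R_f (column-major) and stack across frequencies.
    return R.transpose(1, 2).reshape(-1)
```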

⚠️ Engineering Note

GPU Memory Management for Large-Scale 3D Imaging

A 3D RF imaging problem with $Q = 256^3 \approx 1.7 \times 10^7$ voxels and 6 Tx-Rx pairs yields a dense sensing matrix that far exceeds GPU memory (typically 16--80 GB).

Strategies:

  1. Never store the full operator. Use matrix-free evaluation exclusively. The Kronecker factors for all pairs fit in $\sim 100$ MB.

  2. Process pairs sequentially. Each pair's forward/adjoint evaluation is independent. Keep one pair's data on GPU at a time; overlap host-device transfers with computation using CUDA streams (see the sketch after this list).

  3. Mixed precision. Use complex64 (float32 real and imaginary parts) for all operator evaluations. The imaging noise floor is typically $-30$ to $-60$ dB, far above the float32 precision of $\sim 10^{-7}$.

  4. Gradient checkpointing. For unrolled algorithms (Section 4.3), do not store all intermediate iterates. Recompute forward passes during backpropagation to trade computation for memory.
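
A minimal sketch combining strategies 1--3, assuming an iterable of per-pair CPU factor tensors (the data layout and helper name are hypothetical):

```python
import torch

def forward_all_pairs(pairs_cpu, c, device="cuda"):
    """Evaluate the forward model pair by pair, one pair on the GPU at a time.

    `pairs_cpu` is assumed to be a list of (A1, A2) CPU factor tensors,
    one per Tx-Rx pair. Only the small factors are ever transferred.
    """
    c_gpu = c.to(device=device, dtype=torch.complex64)
    outputs = []
    for A1_cpu, A2_cpu in pairs_cpu:
        # Strategy 3: complex64 everywhere. Strategy 2: transfer only this
        # pair's factors (true overlap needs pinned memory / CUDA streams).
        A1 = A1_cpu.to(device=device, dtype=torch.complex64, non_blocking=True)
        A2 = A2_cpu.to(device=device, dtype=torch.complex64, non_blocking=True)
        # Strategy 1: matrix-free Kronecker evaluation, never forming A.
        C = c_gpu.reshape(A1.shape[1], A2.shape[1]).T   # N2 x N1 view
        Y = A2 @ C @ A1.T                               # M2 x M1
        outputs.append(Y.T.reshape(-1))                 # vec(Y)
    return torch.cat(outputs)                           # still on the GPU
```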

Example: GPU vs CPU Timing for RF Imaging Forward Operator

Compare the wall-clock time for evaluating the forward operator $\mathbf{A}\mathbf{c}$ for a single Tx-Rx pair with $M = 2048$, $Q = 16{,}384$, $F = 2$ subcarriers on: (a) CPU with dense matrix-vector product, (b) CPU with Kronecker vec trick, (c) GPU with Kronecker vec trick using CuPy.
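
One way to set up the comparison is sketched below. The factor sizes are chosen so that $M_1 M_2 = 2048$ and $N_1 N_2 = 16{,}384$; a single subcarrier is timed for brevity, and absolute numbers depend entirely on hardware:

```python
import time
import numpy as np
import cupy as cp

# Sizes from the example: M = M1*M2 = 2048, Q = N1*N2 = 16,384.
M1, M2, N1, N2 = 32, 64, 128, 128
rng = np.random.default_rng(0)

def cplx(shape):
    return (rng.standard_normal(shape)
            + 1j * rng.standard_normal(shape)).astype(np.complex64)

A1, A2, c = cplx((M1, N1)), cplx((M2, N2)), cplx(N1 * N2)

# (a) CPU, dense matrix-vector product (268 MB matrix in complex64).
A_dense = np.kron(A1, A2)
t0 = time.perf_counter()
y_a = A_dense @ c
t_a = time.perf_counter() - t0

# (b) CPU, Kronecker vec trick: two small matmuls.
t0 = time.perf_counter()
y_b = (A2 @ c.reshape((N2, N1), order="F") @ A1.T).reshape(-1, order="F")
t_b = time.perf_counter() - t0

# (c) GPU, Kronecker vec trick with CuPy.
A1_g, A2_g, c_g = map(cp.asarray, (A1, A2, c))
_ = A2_g @ c_g.reshape((N2, N1), order="F")   # warm-up (kernel compilation)
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
y_c = (A2_g @ c_g.reshape((N2, N1), order="F") @ A1_g.T).reshape(-1, order="F")
cp.cuda.Stream.null.synchronize()             # kernels launch asynchronously
t_c = time.perf_counter() - t0

print(f"(a) dense CPU {t_a:.4f} s  (b) vec CPU {t_b:.4f} s  (c) vec GPU {t_c:.4f} s")
```

Note the explicit synchronize() calls: GPU kernels return control to Python immediately, so timing without synchronization would measure only the launch, not the computation.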

CuPy vs PyTorch for RF Imaging

| Feature | CuPy | PyTorch |
| --- | --- | --- |
| API compatibility | Nearly identical to NumPy | Distinct tensor API |
| Automatic differentiation | No native support | Built-in autograd |
| Complex number support | Full (mirrors NumPy) | Full (since v1.7) |
| FFT support | cupy.fft (mirrors numpy.fft) | torch.fft |
| Batched matrix multiply | cupy.matmul with batch dims | torch.bmm / torch.matmul |
| Migration effort | Minimal (change import) | Moderate (API differences) |
| Best use case | Accelerating existing NumPy code | Learned reconstruction / AD |

Common Mistake: Host-Device Transfer Overhead

Mistake:

Repeatedly transferring data between CPU and GPU within an iterative loop: for example, copying the iterate to CPU for a proximal step, then back to GPU for the gradient step.

Correction:

Host-device transfers over PCIe have bandwidth of $\sim 12$ GB/s with $\sim 10~\mu$s latency per transfer. In an iterative algorithm running hundreds of iterations, these transfers accumulate and can dominate the total runtime.

Solution: Keep all data on the GPU for the entire reconstruction. Implement proximal operators (soft thresholding, projection onto convex sets) directly on the GPU. Only transfer the final result back to CPU. If a specific operation lacks a GPU implementation, consider rewriting it rather than round-tripping through CPU.
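
As a concrete illustration of this rule, the sketch below keeps a schematic ISTA loop entirely GPU-resident, with complex soft thresholding implemented directly on device; the solver structure and parameters are illustrative:

```python
import torch

def soft_threshold(x, tau):
    """Complex soft thresholding, computed entirely on the GPU."""
    mag = x.abs()
    scale = torch.clamp(mag - tau, min=0.0) / mag.clamp_min(1e-20)
    return x * scale

def ista(forward, adjoint, y, tau, step, n_iter=200):
    """Schematic ISTA loop that never leaves the GPU.

    `forward`/`adjoint` are matrix-free GPU callables (e.g. the sketches
    above); `y` is the measurement tensor, already resident on the GPU.
    """
    c = torch.zeros_like(adjoint(y))
    for _ in range(n_iter):
        grad = adjoint(forward(c) - y)              # all GPU ops, no .cpu()
        c = soft_threshold(c - step * grad, tau * step)
    return c.cpu()                                  # single final transfer
```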

Quick Check

A complex64 matrix of size $M \times Q$ with $M = 2048$ and $Q = 16{,}384$ occupies approximately how much GPU memory?

(a) 268 MB
(b) 536 MB
(c) 134 MB
(d) 67 MB

Historical Note: From Graphics to General-Purpose Computing

2007--present

GPU computing for scientific applications began in earnest around 2007 with NVIDIA's release of CUDA. The idea of using graphics hardware for non-graphics computation had been explored since the early 2000s: researchers would encode scientific data as textures and use shader programs to perform computations. CUDA replaced this awkward paradigm with a C-like programming model.

The impact on computational imaging was immediate. CT reconstruction that previously required overnight batch processing could now run in real time during patient scanning. MRI reconstruction with compressed sensing, which had been impractical due to the iterative algorithms involved, became clinically feasible. RF imaging, with its similar computational structure, inherits these same benefits.

Matrix-free operator

A linear operator represented as a pair of functions (forward and adjoint) rather than an explicitly stored matrix. Essential when the matrix is too large to store in memory.

Related: Kronecker product.

Batched operation

A computation that applies the same operation to multiple independent data items simultaneously on a GPU, amortizing kernel launch overhead and maximizing hardware utilization.

Key Takeaway

Matrix-free operators, implementing $\mathbf{A}$ and $\mathbf{A}^{H}$ as functions rather than stored matrices, are essential for large-scale imaging. GPUs accelerate these operators through massive parallelism, and batched operations across frequencies or Tx-Rx pairs amortize overhead. CuPy provides the fastest migration path from NumPy code; PyTorch is preferred when automatic differentiation is needed. The critical rule: keep all data on the GPU throughout the reconstruction loop and avoid host-device transfers.