Chapter Summary

Key Points

  1. CuPy is a drop-in GPU replacement for NumPy. Replace import numpy as np with import cupy as cp and most code runs on the GPU unchanged. Use cp.get_array_module(x) to write functions that work on both CPU and GPU. The key constraint is data transfer: move data to the GPU once, compute everything there, and transfer results back once. PCIe latency (~10 µs) and bandwidth (12-32 GB/s) make round-tripping deadly for performance. (A transfer-once sketch follows this list.)

  2. GPU linear algebra shines for large matrices. CuPy dispatches to cuBLAS and cuSOLVER, the GPU-native equivalents of BLAS/LAPACK. The crossover point varies by operation: matmul wins above n ~ 128, solve above n ~ 256, SVD above n ~ 512. Batched operations (e.g., SVD of 1024 channel matrices) are the GPU's killer application, delivering 20-50x speedups by processing all matrices concurrently. (A batched-SVD sketch follows this list.)

  3. GPU FFTs are memory-bandwidth bound. cuFFT delivers 10-50x speedups over FFTW for large transforms. Since FFTs have low arithmetic intensity, use float32/complex64 to nearly double throughput by halving memory traffic. Cache FFT plans by warming up before timed loops. For OFDM and radar, batch all subcarrier/range-bin FFTs into a single call. (A warm-up-then-time sketch follows this list.)

  4. Sparse on GPU: build in COO, compute in CSR. CuPy's sparse module mirrors SciPy's API, backed by cuSPARSE. GPU sparse matvec wins over dense when matrix density is below ~1-5%. For structured sparse matrices (Laplacians, banded systems), the GPU sparse path can be 5-20x faster than SciPy on CPU. (A COO-to-CSR sketch follows this list.)

  5. Custom CUDA kernels extend CuPy beyond NumPy. ElementwiseKernel (per-element ops), ReductionKernel (aggregations), and RawKernel (full CUDA C) let you write custom GPU operations. ElementwiseKernel handles 90% of custom needs; RawKernel is for shared memory and complex thread cooperation. Always include bounds checks in RawKernels. (An ElementwiseKernel sketch follows this list.)

  6. Never form the full Kronecker product, especially on GPU. The vec trick $(\mathbf{A} \otimes \mathbf{B})\,\mathrm{vec}(\mathbf{X}) = \mathrm{vec}(\mathbf{B}\mathbf{X}\mathbf{A}^T)$ reduces $O(n^4)$ work to $O(n^3)$, and cuBLAS matrix multiply delivers another 10-50x on top. This combination of algorithmic optimization and hardware acceleration is the paradigm for high-performance scientific computing. (A vec-trick sketch follows this list.)
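
For key point 1, a minimal sketch of the transfer-once, CPU/GPU-agnostic pattern; the normalize function and array size are illustrative, not from the chapter.

```python
import numpy as np
import cupy as cp

def normalize(x):
    # xp is whichever module (NumPy or CuPy) owns x, so the same
    # function runs on CPU arrays and GPU arrays unchanged.
    xp = cp.get_array_module(x)
    return (x - xp.mean(x)) / xp.std(x)

x_cpu = np.random.rand(1_000_000).astype(np.float32)
x_gpu = cp.asarray(x_cpu)   # one host-to-device transfer
y_gpu = normalize(x_gpu)    # all arithmetic stays on the GPU
y_cpu = cp.asnumpy(y_gpu)   # one device-to-host transfer
```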
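
For key point 2, a sketch of batched SVD on a stack of channel matrices. It assumes a CuPy version whose cp.linalg.svd accepts stacked (batched) inputs; the 1024 x 64 x 64 shape is illustrative.

```python
import cupy as cp

# 1024 independent 64x64 channel matrices in one array;
# the leading axis is the batch dimension.
H = cp.random.standard_normal((1024, 64, 64), dtype=cp.float32)

# One call decomposes every matrix in the stack concurrently,
# with no Python-level loop.
U, s, Vh = cp.linalg.svd(H)
print(s.shape)   # (1024, 64): singular values for each matrix
```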
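
For key point 3, a sketch of the warm-up-then-time pattern for a batched complex64 FFT; the OFDM-style 1024 x 4096 shape is an assumption for illustration.

```python
import cupy as cp

# 1024 symbols x 4096 samples in complex64: half the memory traffic
# of complex128, which matters for a bandwidth-bound transform.
x = (cp.random.standard_normal((1024, 4096), dtype=cp.float32)
     + 1j * cp.random.standard_normal((1024, 4096), dtype=cp.float32)
     ).astype(cp.complex64)

# Warm-up call: builds and caches the cuFFT plan.
cp.fft.fft(x, axis=1)
cp.cuda.Device().synchronize()

# The timed region would start here; the cached plan is reused, and
# the whole batch of row FFTs goes through in a single call.
X = cp.fft.fft(x, axis=1)
```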
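
For key point 4, a sketch of build-in-COO, compute-in-CSR, using a 1-D Laplacian as the structured example; the size n is arbitrary.

```python
import cupy as cp
import cupyx.scipy.sparse as sp

n = 100_000
# Assemble a tridiagonal 1-D Laplacian in COO form, which is easy to
# build from (row, col, value) triplets ...
rows = cp.concatenate([cp.arange(n), cp.arange(n - 1), cp.arange(1, n)])
cols = cp.concatenate([cp.arange(n), cp.arange(1, n), cp.arange(n - 1)])
vals = cp.concatenate([cp.full(n, 2.0, dtype=cp.float32),
                       cp.full(n - 1, -1.0, dtype=cp.float32),
                       cp.full(n - 1, -1.0, dtype=cp.float32)])
A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n))

# ... then convert to CSR for the fast cuSPARSE matvec path.
y = A.tocsr() @ cp.ones(n, dtype=cp.float32)
```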
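
For key point 5, a minimal ElementwiseKernel; the magnitude-squared operation and the name mag2 are illustrative.

```python
import cupy as cp

# Per-element |z|^2 for complex input: the kind of custom op that
# ElementwiseKernel covers without writing a full RawKernel.
mag2 = cp.ElementwiseKernel(
    'complex64 z',    # input parameter
    'float32 out',    # output parameter
    'out = z.real() * z.real() + z.imag() * z.imag();',  # per-element body
    'mag2')

z = cp.array([1 + 1j, 3 + 4j], dtype=cp.complex64)
print(mag2(z))   # [ 2. 25.]
```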
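
For key point 6, a sketch of the vec trick with an illustrative n = 1024; vec(.) here stacks columns, hence the transpose before ravel.

```python
import cupy as cp

n = 1024
A = cp.random.standard_normal((n, n), dtype=cp.float32)
B = cp.random.standard_normal((n, n), dtype=cp.float32)
X = cp.random.standard_normal((n, n), dtype=cp.float32)

# (A kron B) vec(X) = vec(B X A^T): two n x n cuBLAS matmuls,
# O(n^3) work and O(n^2) memory, instead of materializing the
# n^2 x n^2 Kronecker product.
Y = B @ X @ A.T
y = Y.T.ravel()   # vec(Y): stack the columns of Y
```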

Looking Ahead

Chapter 12 introduces PyTorch tensors, which extend GPU arrays with automatic differentiation. Where CuPy mirrors NumPy for numerical computing, PyTorch adds the gradient tracking needed for machine learning. The GPU fundamentals from this chapter (transfer overhead, memory pools, batched operations, and the roofline model) apply directly to PyTorch workflows.