References & Further Reading
References
- R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations, NIPS Workshop on Machine Learning Systems, 2017.
The original CuPy paper from Preferred Networks. Describes the design philosophy of NumPy API compatibility and the memory pool architecture.
- NVIDIA Corporation, CUDA C++ Programming Guide, 2024. [Link]
The definitive reference for CUDA programming concepts: thread hierarchy, memory model, synchronization, and performance optimization. Essential for writing `RawKernel` code.
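To make the connection to CuPy concrete, here is a minimal sketch of the `RawKernel` pattern the CUDA guide's concepts feed into: a saxpy kernel with the usual global-thread-index computation and bounds guard. The kernel source and launch parameters are illustrative assumptions, not taken from the text, and the GPU path is guarded so the snippet degrades gracefully without CUDA hardware.

```python
import numpy as np

# CUDA C source for a hypothetical saxpy kernel (out = a*x + y).
saxpy_src = r'''
extern "C" __global__
void saxpy(const float a, const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = a * x[i] + y[i];            // guard against out-of-bounds threads
}
'''

try:
    import cupy as cp
    kernel = cp.RawKernel(saxpy_src, "saxpy")
    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    out = cp.empty_like(x)
    threads = 256                            # threads per block (a common choice)
    blocks = (n + threads - 1) // threads    # enough blocks to cover all n elements
    kernel((blocks,), (threads,), (np.float32(2.0), x, y, out, np.int32(n)))
except Exception:
    pass  # no CuPy/GPU available; the kernel source above still shows the pattern
```

The launch signature `kernel(grid, block, args)` takes the grid and block dimensions as tuples, mirroring CUDA's `<<<grid, block>>>` syntax.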
- CuPy Community, CuPy Documentation, 2024. [Link]
Official CuPy documentation covering all modules: ndarray, linalg, fft, sparse, and custom kernels. The user guide includes migration tips from NumPy.
- NVIDIA Corporation, cuBLAS Library Documentation, 2024. [Link]
Documentation for NVIDIA's GPU-accelerated BLAS library. CuPy's `linalg` module dispatches to cuBLAS for matrix operations and cuSOLVER for decompositions.
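Because the `linalg` APIs match NumPy's, the dispatch described above is invisible to user code. The sketch below runs on NumPy; substituting `cp` for `np` (after `import cupy as cp`) executes the same calls on the GPU, where CuPy forwards the matrix product to cuBLAS and the Cholesky factorization to cuSOLVER. The matrix sizes here are arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
spd = a @ a.T + 4 * np.eye(4)     # symmetric positive-definite matrix
c = np.linalg.cholesky(spd)       # lower-triangular L with L @ L.T == spd
err = np.abs(c @ c.T - spd).max() # reconstruction error of the factorization
```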
- NVIDIA Corporation, cuFFT Library Documentation, 2024. [Link]
Documentation for NVIDIA's GPU FFT library. Covers plan caching, real-to-complex transforms, and batched FFTs — all accessible through CuPy's fft module.
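The batched and real-to-complex features mentioned above map directly onto the NumPy-compatible API. This NumPy sketch runs unchanged with `cp` in place of `np`; on the GPU, CuPy builds (and caches) the corresponding cuFFT plan behind the scenes. The batch and signal sizes are illustrative.

```python
import numpy as np

sig = np.random.rand(8, 1024)               # batch of 8 real-valued signals
spec = np.fft.rfft(sig, axis=-1)            # real-to-complex transform, one per row
back = np.fft.irfft(spec, n=1024, axis=-1)  # inverse transform recovers the signals
```

Note that `rfft` returns only the non-redundant half of the spectrum: 1024 real samples yield 513 complex coefficients per row.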
- S. Williams, A. Waterman, and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 2009.
Introduces the roofline model for understanding whether a kernel is compute-bound or memory-bound. Essential for predicting the GPU performance of FFTs, matrix-vector products, and custom kernels.
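The roofline bound can be sketched in a few lines: attainable FLOP/s is the lesser of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers below are illustrative assumptions, not figures from the text; a float64 matrix-vector product serves as the worked example.

```python
# Roofline model: attainable FLOP/s = min(peak_flops, intensity * bandwidth).
peak_flops = 10e12   # assumed 10 TFLOP/s double-precision peak
bandwidth = 1.5e12   # assumed 1.5 TB/s memory bandwidth

n = 10_000
flops = 2 * n * n                # matvec: one multiply + one add per matrix element
bytes_moved = 8 * n * n          # reading the float64 matrix dominates traffic
intensity = flops / bytes_moved  # 0.25 FLOP/byte, independent of n

attainable = min(peak_flops, intensity * bandwidth)
memory_bound = attainable < peak_flops
```

At 0.25 FLOP/byte, the matvec sits far left of the ridge point on any modern GPU, so its performance is set by bandwidth, not peak compute.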
- C. F. Van Loan, The Ubiquitous Kronecker Product, Journal of Computational and Applied Mathematics, 2000.
Survey of Kronecker product properties and the vec trick. Covers applications in signal processing, statistics, and multidimensional problems.
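The vec trick referenced above is the identity $(B^\top \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(AXB)$, which lets one apply a Kronecker-structured operator without ever materializing it. A small NumPy check (sizes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
X = rng.standard_normal((3, 4))

vec = lambda M: M.flatten(order="F")   # column-stacking vectorization
lhs = np.kron(B.T, A) @ vec(X)         # explicit Kronecker product: O(n^4) memory
rhs = vec(A @ X @ B)                   # same result via two small matmuls
```

The right-hand side costs two dense matrix products instead of forming and applying a matrix whose side length is the product of the factor dimensions, which is what makes Kronecker structure exploitable at scale.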
Further Reading
CuPy vs other GPU array libraries
CuPy comparison guide (https://docs.cupy.dev/en/stable/overview.html)
Compares CuPy with JAX, PyTorch, and Numba for GPU computing. CuPy excels for NumPy-compatible scientific computing; JAX adds JIT and autodiff; PyTorch targets deep learning.
Advanced CUDA optimization
NVIDIA CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)
Deep dive into memory coalescing, occupancy optimization, and warp-level primitives. Relevant when writing performance-critical `RawKernel` code.
GPU-accelerated sparse solvers
cuSPARSE documentation (https://docs.nvidia.com/cuda/cusparse/)
Covers advanced sparse operations beyond CuPy's wrapper: incomplete factorization preconditioners, sparse triangular solve, and batched sparse operations.
Multi-GPU and distributed computing
CuPy multi-GPU guide and NCCL integration
For problems that exceed a single GPU's memory, CuPy supports multi-GPU arrays and inter-GPU communication via NCCL (NVIDIA Collective Communications Library).
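A minimal sketch of the single-process multi-GPU pattern: iterating over devices with the `cp.cuda.Device` context manager, doing a local reduction on each, and combining the per-device scalars on the host. This is an illustrative assumption about usage, not code from the text, and it does not show NCCL collectives; the GPU path is guarded so the snippet is safe to read (and run) without CUDA hardware.

```python
# Hypothetical sketch: one local reduction per visible GPU, combined on the host.
try:
    import cupy as cp
    parts = []
    for dev_id in range(cp.cuda.runtime.getDeviceCount()):
        with cp.cuda.Device(dev_id):       # allocations inside land on this GPU
            x = cp.arange(10)              # array in device dev_id's memory
            parts.append(float(x.sum()))   # reduce locally, copy one scalar to host
    total = sum(parts)
except Exception:
    total = None  # no CuPy/GPU available; the device-loop pattern is the point
```

For tightly coupled exchanges (all-reduce, broadcast) between the devices, NCCL replaces the host-side combination step.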