References & Further Reading
References
- R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations, NIPS Workshop on Machine Learning Systems, 2017.
The original CuPy paper from Preferred Networks. Describes the design philosophy of NumPy API compatibility and the memory pool architecture.
- NVIDIA Corporation, CUDA C++ Programming Guide, 2024. [Link]
The definitive reference for CUDA programming concepts: thread hierarchy, memory model, synchronization, and performance optimization. Essential for writing `RawKernel` code.
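To make the connection to CuPy concrete, here is a minimal sketch of the `RawKernel` pattern the CUDA guide's concepts feed into: a saxpy kernel with the usual global-thread-index computation and bounds guard. The kernel source and launch parameters are illustrative assumptions, not taken from the text, and the GPU path is guarded so the snippet degrades gracefully without CUDA hardware.

```python
import numpy as np

# CUDA C source for a hypothetical saxpy kernel (out = a*x + y).
saxpy_src = r'''
extern "C" __global__
void saxpy(const float a, const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = a * x[i] + y[i];            // guard against out-of-bounds threads
}
'''

try:
    import cupy as cp
    kernel = cp.RawKernel(saxpy_src, "saxpy")
    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    out = cp.empty_like(x)
    threads = 256                            # threads per block (a common choice)
    blocks = (n + threads - 1) // threads    # enough blocks to cover all n elements
    kernel((blocks,), (threads,), (np.float32(2.0), x, y, out, np.int32(n)))
except Exception:
    pass  # no CuPy/GPU available; the kernel source above still shows the pattern
```

The launch signature `kernel(grid, block, args)` takes the grid and block dimensions as tuples, mirroring CUDA's `<<<grid, block>>>` syntax.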
- CuPy Community, CuPy Documentation, 2024. [Link]
Official CuPy documentation covering all modules: ndarray, linalg, fft, sparse, and custom kernels. The user guide includes migration tips from NumPy.
- NVIDIA Corporation, cuBLAS Library Documentation, 2024. [Link]
Documentation for NVIDIA's GPU-accelerated BLAS library. CuPy's `linalg` module dispatches to cuBLAS for matrix operations and cuSOLVER for decompositions.
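Because the `linalg` APIs match NumPy's, the dispatch described above is invisible to user code. The sketch below runs on NumPy; substituting `cp` for `np` (after `import cupy as cp`) executes the same calls on the GPU, where CuPy forwards the matrix product to cuBLAS and the Cholesky factorization to cuSOLVER. The matrix sizes here are arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
spd = a @ a.T + 4 * np.eye(4)     # symmetric positive-definite matrix
c = np.linalg.cholesky(spd)       # lower-triangular L with L @ L.T == spd
err = np.abs(c @ c.T - spd).max() # reconstruction error of the factorization
```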
- NVIDIA Corporation, cuFFT Library Documentation, 2024. [Link]
Documentation for NVIDIA's GPU FFT library. Covers plan caching, real-to-complex transforms, and batched FFTs — all accessible through CuPy's fft module.
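The batched and real-to-complex features mentioned above map directly onto the NumPy-compatible API. This NumPy sketch runs unchanged with `cp` in place of `np`; on the GPU, CuPy builds (and caches) the corresponding cuFFT plan behind the scenes. The batch and signal sizes are illustrative.

```python
import numpy as np

sig = np.random.rand(8, 1024)               # batch of 8 real-valued signals
spec = np.fft.rfft(sig, axis=-1)            # real-to-complex transform, one per row
back = np.fft.irfft(spec, n=1024, axis=-1)  # inverse transform recovers the signals
```

Note that `rfft` returns only the non-redundant half of the spectrum: 1024 real samples yield 513 complex coefficients per row.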
- S. Williams, A. Waterman, and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 2009.
Introduces the roofline model for understanding whether a kernel is compute-bound or memory-bound. Essential for predicting the GPU performance of FFTs, matrix-vector products, and custom kernels.
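The roofline bound can be sketched in a few lines: attainable FLOP/s is the lesser of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers below are illustrative assumptions, not figures from the text; a float64 matrix-vector product serves as the worked example.

```python
# Roofline model: attainable FLOP/s = min(peak_flops, intensity * bandwidth).
peak_flops = 10e12   # assumed 10 TFLOP/s double-precision peak
bandwidth = 1.5e12   # assumed 1.5 TB/s memory bandwidth

n = 10_000
flops = 2 * n * n                # matvec: one multiply + one add per matrix element
bytes_moved = 8 * n * n          # reading the float64 matrix dominates traffic
intensity = flops / bytes_moved  # 0.25 FLOP/byte, independent of n

attainable = min(peak_flops, intensity * bandwidth)
memory_bound = attainable < peak_flops
```

At 0.25 FLOP/byte, the matvec sits far left of the ridge point on any modern GPU, so its performance is set by bandwidth, not peak compute.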
- C. F. Van Loan, The Ubiquitous Kronecker Product, Journal of Computational and Applied Mathematics, 2000.
Survey of Kronecker product properties and the vec trick. Covers applications in signal processing, statistics, and multidimensional problems.
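The vec trick referenced above is the identity $(B^\top \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(AXB)$, which lets one apply a Kronecker-structured operator without ever materializing it. A small NumPy check (sizes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
X = rng.standard_normal((3, 4))

vec = lambda M: M.flatten(order="F")   # column-stacking vectorization
lhs = np.kron(B.T, A) @ vec(X)         # explicit Kronecker product: O(n^4) memory
rhs = vec(A @ X @ B)                   # same result via two small matmuls
```

The right-hand side costs two dense matrix products instead of forming and applying a matrix whose side length is the product of the factor dimensions, which is what makes Kronecker structure exploitable at scale.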
Further Reading
CuPy vs other GPU array libraries
CuPy comparison guide (https://docs.cupy.dev/en/stable/overview.html)
Compares CuPy with JAX, PyTorch, and Numba for GPU computing. CuPy excels for NumPy-compatible scientific computing; JAX adds JIT and autodiff; PyTorch targets deep learning.
Advanced CUDA optimization
NVIDIA CUDA Best Practices Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)
Deep dive into memory coalescing, occupancy optimization, and warp-level primitives. Relevant when writing performance-critical `RawKernel` code.
GPU-accelerated sparse solvers
cuSPARSE documentation (https://docs.nvidia.com/cuda/cusparse/)
Covers advanced sparse operations beyond CuPy's wrapper: incomplete factorization preconditioners, sparse triangular solve, and batched sparse operations.
Multi-GPU and distributed computing
CuPy multi-GPU guide and NCCL integration
For problems that exceed a single GPU's memory, CuPy supports multi-GPU arrays and inter-GPU communication via NCCL (NVIDIA Collective Communications Library).
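A minimal sketch of the single-process multi-GPU pattern: iterating over devices with the `cp.cuda.Device` context manager, doing a local reduction on each, and combining the per-device scalars on the host. This is an illustrative assumption about usage, not code from the text, and it does not show NCCL collectives; the GPU path is guarded so the snippet is safe to read (and run) without CUDA hardware.

```python
# Hypothetical sketch: one local reduction per visible GPU, combined on the host.
try:
    import cupy as cp
    parts = []
    for dev_id in range(cp.cuda.runtime.getDeviceCount()):
        with cp.cuda.Device(dev_id):       # allocations inside land on this GPU
            x = cp.arange(10)              # array in device dev_id's memory
            parts.append(float(x.sum()))   # reduce locally, copy one scalar to host
    total = sum(parts)
except Exception:
    total = None  # no CuPy/GPU available; the device-loop pattern is the point
```

For tightly coupled exchanges (all-reduce, broadcast) between the devices, NCCL replaces the host-side combination step.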