References & Further Reading
References
- D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2016
The standard textbook on GPU computing and CUDA programming. Covers architecture, memory hierarchy, parallel patterns, and optimization strategies. Directly relevant to Sections 10.1 and 10.2.
- J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming, Wrox, 2014
Practical guide to CUDA programming with emphasis on performance optimization, memory management, and profiling. Covers advanced topics such as streams, events, and multi-GPU programming.
- NVIDIA Corporation, CUDA C++ Programming Guide, NVIDIA, 2024
The official CUDA programming guide. Essential reference for thread hierarchy, memory model, and hardware specifications. Available at https://docs.nvidia.com/cuda/cuda-c-programming-guide/.
- NVIDIA Corporation, CUDA C++ Best Practices Guide, NVIDIA, 2024
Companion to the programming guide focusing on optimization: memory coalescing, occupancy, instruction-level parallelism, and profiling methodology.
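The occupancy concept the Best Practices Guide covers can be sketched in a few lines: each streaming multiprocessor (SM) has fixed budgets of threads, registers, and shared memory, and the tightest limit determines how many blocks (and hence warps) can be resident. The hardware limits below are hypothetical round numbers for illustration; real values come from the device properties or NVIDIA's occupancy calculator, and real hardware also rounds register and shared-memory allocations to granularity boundaries, which this sketch ignores.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, max_regs=65536, max_smem=102400,
              max_blocks=32, warp_size=32):
    """Simplified occupancy estimate: the scarcest SM resource wins.
    Hardware limits are hypothetical defaults, not a specific GPU."""
    blocks_by_threads = max_threads // threads_per_block
    blocks_by_regs = max_regs // (regs_per_thread * threads_per_block)
    blocks_by_smem = max_smem // smem_per_block if smem_per_block else max_blocks
    resident_blocks = min(blocks_by_threads, blocks_by_regs,
                          blocks_by_smem, max_blocks)
    active_warps = resident_blocks * threads_per_block // warp_size
    return active_warps / (max_threads // warp_size)

# 256 threads/block, 64 registers/thread, 8 KiB shared memory per block:
# registers are the binding limit here, capping occupancy at 50%.
print(f"occupancy = {occupancy(256, 64, 8192):.0%}")
```

Lowering register usage per thread (e.g. via compiler flags) is often the first lever when a kernel's occupancy is register-limited, though higher occupancy does not always mean higher throughput.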
- S. Williams, A. Waterman, and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 52(4), 2009
The original roofline model paper. Introduces the compute/memory-bound classification used throughout this chapter. Essential reading for performance analysis.
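The roofline classification reduces to a small calculation: a kernel's arithmetic intensity (FLOPs per byte of memory traffic) is compared against the machine balance (peak FLOP/s divided by peak bandwidth), and attainable performance is the minimum of the compute roof and the bandwidth-scaled memory roof. A minimal sketch, using hypothetical peak numbers loosely in the range of a modern data-center GPU:

```python
def attainable_gflops(ai, peak_gflops, peak_gbps):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, ai * peak_gbps)

# Hypothetical GPU: 19500 GFLOP/s peak compute, 1555 GB/s memory bandwidth.
PEAK_GFLOPS = 19500.0
PEAK_GBPS = 1555.0

# SAXPY (y = a*x + y, float32): 2 FLOPs per element, 12 bytes of traffic
# (read x, read y, write y), so arithmetic intensity is 2/12 FLOP/byte.
ai_saxpy = 2 / 12
perf = attainable_gflops(ai_saxpy, PEAK_GFLOPS, PEAK_GBPS)
ridge = PEAK_GFLOPS / PEAK_GBPS  # intensity where the two roofs meet

print(f"SAXPY AI = {ai_saxpy:.3f} FLOP/B, attainable = {perf:.0f} GFLOP/s")
print("memory-bound" if ai_saxpy < ridge else "compute-bound")
```

With an intensity far below the ridge point, SAXPY is firmly memory-bound: it can reach only a small fraction of peak compute no matter how the kernel is tuned, which is exactly the diagnosis the roofline plot makes visual.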
- PyTorch Contributors, PyTorch CUDA Semantics, 2024
Official documentation on PyTorch's CUDA integration: asynchronous execution, streams, memory management, and profiling. Available at https://pytorch.org/docs/stable/notes/cuda.html.
Further Reading
CUDA programming for Python developers
Numba CUDA documentation (https://numba.readthedocs.io/en/stable/cuda/)
Numba allows writing CUDA kernels directly in Python without C/C++ code. Useful for custom GPU operations not available in CuPy or PyTorch.
AMD GPU computing
AMD ROCm Documentation (https://rocm.docs.amd.com/)
ROCm is AMD's open-source GPU computing platform. The HIP API is largely source-compatible with CUDA, enabling cross-vendor GPU code.
GPU computing for scientific simulation
A. Klöckner et al., PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation, Parallel Computing, 38(3), 2012
PyCUDA provides fine-grained GPU control from Python, including custom kernels, memory management, and automatic code generation. Useful when CuPy/PyTorch abstractions are too high-level.
Performance profiling deep dive
NVIDIA Nsight Systems documentation (https://docs.nvidia.com/nsight-systems/)
Nsight Systems provides timeline-based profiling showing CPU/GPU activity, memory transfers, and kernel launches. Essential for optimizing complex multi-GPU applications.