References & Further Reading
References
- D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2016
The standard textbook on GPU computing and CUDA programming. Covers architecture, memory hierarchy, parallel patterns, and optimization strategies. Directly relevant to Sections 10.1 and 10.2.
- J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming, Wrox, 2014
Practical guide to CUDA programming with emphasis on performance optimization, memory management, and profiling. Covers advanced topics such as streams, events, and multi-GPU programming.
- NVIDIA Corporation, CUDA C++ Programming Guide, NVIDIA, 2024
The official CUDA programming guide. Essential reference for thread hierarchy, memory model, and hardware specifications. Available at https://docs.nvidia.com/cuda/cuda-c-programming-guide/.
- NVIDIA Corporation, CUDA C++ Best Practices Guide, NVIDIA, 2024
Companion to the programming guide focusing on optimization: memory coalescing, occupancy, instruction-level parallelism, and profiling methodology.
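The occupancy concept the Best Practices Guide covers can be sketched in a few lines: each streaming multiprocessor (SM) has fixed budgets of threads, registers, and shared memory, and the tightest limit determines how many blocks (and hence warps) can be resident. The hardware limits below are hypothetical round numbers for illustration; real values come from the device properties or NVIDIA's occupancy calculator, and real hardware also rounds register and shared-memory allocations to granularity boundaries, which this sketch ignores.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, max_regs=65536, max_smem=102400,
              max_blocks=32, warp_size=32):
    """Simplified occupancy estimate: the scarcest SM resource wins.
    Hardware limits are hypothetical defaults, not a specific GPU."""
    blocks_by_threads = max_threads // threads_per_block
    blocks_by_regs = max_regs // (regs_per_thread * threads_per_block)
    blocks_by_smem = max_smem // smem_per_block if smem_per_block else max_blocks
    resident_blocks = min(blocks_by_threads, blocks_by_regs,
                          blocks_by_smem, max_blocks)
    active_warps = resident_blocks * threads_per_block // warp_size
    return active_warps / (max_threads // warp_size)

# 256 threads/block, 64 registers/thread, 8 KiB shared memory per block:
# registers are the binding limit here, capping occupancy at 50%.
print(f"occupancy = {occupancy(256, 64, 8192):.0%}")
```

Lowering register usage per thread (e.g. via compiler flags) is often the first lever when a kernel's occupancy is register-limited, though higher occupancy does not always mean higher throughput.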
- S. Williams, A. Waterman, and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 52(4), 2009
The original roofline model paper. Introduces the compute/memory-bound classification used throughout this chapter. Essential reading for performance analysis.
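The roofline classification reduces to a small calculation: a kernel's arithmetic intensity (FLOPs per byte of memory traffic) is compared against the machine balance (peak FLOP/s divided by peak bandwidth), and attainable performance is the minimum of the compute roof and the bandwidth-scaled memory roof. A minimal sketch, using hypothetical peak numbers loosely in the range of a modern data-center GPU:

```python
def attainable_gflops(ai, peak_gflops, peak_gbps):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, ai * peak_gbps)

# Hypothetical GPU: 19500 GFLOP/s peak compute, 1555 GB/s memory bandwidth.
PEAK_GFLOPS = 19500.0
PEAK_GBPS = 1555.0

# SAXPY (y = a*x + y, float32): 2 FLOPs per element, 12 bytes of traffic
# (read x, read y, write y), so arithmetic intensity is 2/12 FLOP/byte.
ai_saxpy = 2 / 12
perf = attainable_gflops(ai_saxpy, PEAK_GFLOPS, PEAK_GBPS)
ridge = PEAK_GFLOPS / PEAK_GBPS  # intensity where the two roofs meet

print(f"SAXPY AI = {ai_saxpy:.3f} FLOP/B, attainable = {perf:.0f} GFLOP/s")
print("memory-bound" if ai_saxpy < ridge else "compute-bound")
```

With an intensity far below the ridge point, SAXPY is firmly memory-bound: it can reach only a small fraction of peak compute no matter how the kernel is tuned, which is exactly the diagnosis the roofline plot makes visual.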
- PyTorch Contributors, PyTorch CUDA Semantics, 2024
Official documentation on PyTorch's CUDA integration: asynchronous execution, streams, memory management, and profiling. Available at https://pytorch.org/docs/stable/notes/cuda.html.
Further Reading
CUDA programming for Python developers
Numba CUDA documentation (https://numba.readthedocs.io/en/stable/cuda/)
Numba allows writing CUDA kernels directly in Python without C/C++ code. Useful for custom GPU operations not available in CuPy or PyTorch.
AMD GPU computing
AMD ROCm Documentation (https://rocm.docs.amd.com/)
ROCm is AMD's open-source GPU computing platform. The HIP API is largely source-compatible with CUDA, enabling cross-vendor GPU code.
GPU computing for scientific simulation
A. Klöckner et al., PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation, Parallel Computing, 38(3), 2012
PyCUDA provides fine-grained GPU control from Python, including custom kernels, memory management, and automatic code generation. Useful when CuPy/PyTorch abstractions are too high-level.
Performance profiling deep dive
NVIDIA Nsight Systems documentation (https://docs.nvidia.com/nsight-systems/)
Nsight Systems provides timeline-based profiling showing CPU/GPU activity, memory transfers, and kernel launches. Essential for optimizing complex multi-GPU applications.