References & Further Reading

References

  1. D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2016

    The standard textbook on GPU computing and CUDA programming. Covers architecture, memory hierarchy, parallel patterns, and optimization strategies. Directly relevant to Sections 10.1 and 10.2.

  2. J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming, Wrox, 2014

    Practical guide to CUDA programming with emphasis on performance optimization, memory management, and profiling. Covers advanced topics such as streams, events, and multi-GPU programming.

  3. NVIDIA Corporation, CUDA C++ Programming Guide, NVIDIA, 2024

    The official CUDA programming guide. Essential reference for thread hierarchy, memory model, and hardware specifications. Available at https://docs.nvidia.com/cuda/cuda-c-programming-guide/.

  4. NVIDIA Corporation, CUDA C++ Best Practices Guide, NVIDIA, 2024

    Companion to the programming guide focusing on optimization: memory coalescing, occupancy, instruction-level parallelism, and profiling methodology.

  5. S. Williams, A. Waterman, and D. Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 52(4), 2009

    The original roofline model paper. Introduces the compute/memory-bound classification used throughout this chapter. Essential reading for performance analysis.
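    The roofline classification can be sketched in a few lines: a kernel's arithmetic intensity (FLOPs per byte moved) determines whether it hits the memory-bandwidth slope or the compute ceiling. The hardware numbers below are illustrative assumptions, not specs for any particular GPU.

    ```python
    # Roofline model sketch: classify a kernel as compute- or memory-bound.
    # Peak numbers are assumed for illustration only.
    PEAK_FLOPS = 19.5e12   # peak FP32 throughput, FLOP/s (assumed)
    PEAK_BW = 1.55e12      # peak memory bandwidth, bytes/s (assumed)

    def roofline(flops, bytes_moved):
        """Return (attainable FLOP/s, bound type) for a kernel."""
        ai = flops / bytes_moved                    # arithmetic intensity, FLOP/byte
        attainable = min(PEAK_FLOPS, ai * PEAK_BW)  # the "roofline"
        bound = "compute" if ai * PEAK_BW >= PEAK_FLOPS else "memory"
        return attainable, bound

    # Example: element-wise add of n float32 values performs n FLOPs but
    # moves 3*n*4 bytes (two reads, one write), so AI = 1/12 FLOP/byte.
    n = 1_000_000
    attainable, bound = roofline(n, 3 * n * 4)
    ```

    With these assumed peaks the ridge point sits at PEAK_FLOPS / PEAK_BW ≈ 12.6 FLOP/byte; the vector add's intensity of 1/12 is far below it, so the model classifies it as memory-bound, matching the chapter's treatment of element-wise operations.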

  6. PyTorch Contributors, PyTorch CUDA Semantics, 2024

    Official documentation on PyTorch's CUDA integration: asynchronous execution, streams, memory management, and profiling. Available at https://pytorch.org/docs/stable/notes/cuda.html.

Further Reading

  • CUDA programming for Python developers

    Numba CUDA documentation (https://numba.readthedocs.io/en/stable/cuda/)

    Numba allows writing CUDA kernels directly in Python without C/C++ code. Useful for custom GPU operations not available in CuPy or PyTorch.
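    A minimal sketch of what a Numba CUDA kernel looks like: a vector add written entirely in Python. Setting NUMBA_ENABLE_CUDASIM=1 runs the kernel on Numba's CPU-based simulator, so the sketch works without a GPU; on real hardware you would drop that line (and typically move data explicitly with cuda.to_device).

    ```python
    import os
    os.environ["NUMBA_ENABLE_CUDASIM"] = "1"  # CPU simulator; remove on a real GPU

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        i = cuda.grid(1)          # global thread index
        if i < x.size:            # guard against out-of-range threads
            out[i] = x[i] + y[i]

    n = 32
    x = np.arange(n, dtype=np.float32)
    y = np.ones(n, dtype=np.float32)
    out = np.zeros(n, dtype=np.float32)

    threads_per_block = 16
    blocks = (n + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](x, y, out)
    ```

    The [blocks, threads] launch syntax mirrors CUDA C++'s <<<blocks, threads>>> configuration, which is what makes Numba useful for prototyping custom kernels before (or instead of) dropping down to C++.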

  • AMD GPU computing

    AMD ROCm Documentation (https://rocm.docs.amd.com/)

    ROCm is AMD's open-source GPU computing platform. The HIP API is largely source-compatible with CUDA, enabling cross-vendor GPU code.

  • GPU computing for scientific simulation

    A. Klöckner et al., PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation, Parallel Computing, 38(3), 2012

    PyCUDA provides fine-grained GPU control from Python, including custom kernels, memory management, and automatic code generation. Useful when CuPy/PyTorch abstractions are too high-level.

  • Performance profiling deep dive

    NVIDIA Nsight Systems documentation (https://docs.nvidia.com/nsight-systems/)

    Nsight Systems provides timeline-based profiling showing CPU/GPU activity, memory transfers, and kernel launches. Essential for optimizing complex multi-GPU applications.