Prerequisites & Notation

Before You Begin

This chapter introduces GPU computing concepts from the ground up, but assumes comfort with NumPy arrays, basic Python profiling, and a working understanding of computer memory (RAM, cache). If any of these feel unfamiliar, review the linked material first.

  • NumPy array creation, slicing, and vectorized operations (Chapter 5)

    Self-check: Can you explain why a + b on NumPy arrays is faster than a Python for-loop?

  • Linear algebra operations with NumPy and SciPy (Chapter 6)

    Self-check: Can you compute a matrix-vector product using the @ operator?

  • Basic understanding of computer memory: RAM, cache, bus

    Self-check: Do you know why accessing contiguous memory is faster than random access?

  • Python virtual environments and conda

    Self-check: Can you create a conda environment and install packages into it?
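The first self-check can be answered empirically. The sketch below (array size is an arbitrary choice of ours) computes the same sum two ways: the vectorized `a + b` dispatches a single call into NumPy's compiled inner loop, while the Python for-loop pays interpreter overhead on every element.

```python
import numpy as np

n = 100_000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Vectorized: one call into NumPy's compiled C loop.
c_vec = a + b

# Element-by-element Python loop: one interpreted iteration,
# plus per-element indexing overhead, for every element.
c_loop = np.empty(n)
for i in range(n):
    c_loop[i] = a[i] + b[i]

# Both produce identical results; only the execution path differs.
assert np.array_equal(c_vec, c_loop)
```

Timing the two paths with `timeit` typically shows the loop running orders of magnitude slower; the gap is the interpreter overhead that vectorization avoids.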
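For the linear-algebra self-check, a matrix-vector product with the `@` operator looks like this (the matrix and vector here are illustrative values of ours):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, 1.0])

# y[i] is the dot product of row i of A with x;
# A @ x is equivalent to np.matmul(A, x).
y = A @ x
```

Here `y` is `[3.0, 7.0]`: each entry is one row of `A` dotted with `x`.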
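The memory self-check can be made concrete with NumPy's `strides` attribute, which reports how many bytes separate consecutive elements along each axis. In a C-ordered (row-major) array, elements of a row sit next to each other in memory, so row-wise traversal is cache-friendly, while stepping down a column jumps a full row's worth of bytes each time. A minimal sketch, with an array shape of our choosing:

```python
import numpy as np

# 1000 x 1000 float64 array, C-ordered (row-major) by default.
M = np.zeros((1000, 1000), dtype=np.float64)

# strides: bytes to step to the next element along each axis.
axis0_stride, axis1_stride = M.strides
# Moving along a row (axis 1) advances 8 bytes: one float64,
# so consecutive row elements share cache lines.
# Moving down a column (axis 0) advances 8000 bytes,
# skipping an entire row and defeating the cache.
```

The same distinction reappears on the GPU, where neighboring threads reading neighboring addresses is the basis of coalesced memory access.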

Notation for This Chapter

The table below lists the symbols and conventions introduced in this chapter. GPU-specific terminology uses NVIDIA's CUDA nomenclature, which is the industry standard even when discussing vendor-neutral concepts.

| Symbol          | Meaning                                                          | Introduced |
| --------------- | ---------------------------------------------------------------- | ---------- |
| SM              | Streaming Multiprocessor — the basic compute unit on an NVIDIA GPU | s01        |
| CUDA core       | A single-precision floating-point execution unit within an SM    | s01        |
| N_threads       | Total number of threads in a kernel launch                       | s02        |
| B_x, B_y, B_z   | Block dimensions (threads per block in each direction)           | s02        |
| G_x, G_y, G_z   | Grid dimensions (blocks per grid in each direction)              | s02        |
| HBM             | High Bandwidth Memory — main GPU memory (global memory)          | s01        |
| T_xfer          | Host-to-device (or device-to-host) transfer time                 | s02        |
| T_kernel        | Kernel execution time on the GPU                                 | s04        |
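The grid and block dimensions determine the total thread count: N_threads is the product of the blocks per grid and the threads per block, N_threads = (G_x * G_y * G_z) * (B_x * B_y * B_z). A quick sketch with illustrative launch dimensions of our choosing:

```python
# Illustrative launch configuration (values are assumptions, not
# taken from the chapter): a 1-D grid of 1-D blocks.
Gx, Gy, Gz = 256, 1, 1    # blocks per grid in each direction
Bx, By, Bz = 128, 1, 1    # threads per block in each direction

# Total threads in the kernel launch.
n_threads = (Gx * Gy * Gz) * (Bx * By * Bz)
# 256 blocks x 128 threads/block = 32768 threads
```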