Prerequisites & Notation
Before You Begin
This chapter assumes familiarity with GPU computing concepts (Chapter 10), CuPy array programming (Chapter 11), and PyTorch tensor operations (Chapter 12). You should be comfortable moving data between CPU and GPU and writing basic GPU-accelerated code.
- GPU architecture: threads, blocks, warps, and the memory hierarchy (Chapter 10)
  Self-check: Can you explain why GPU global memory accesses should be coalesced?
- CuPy arrays and GPU memory management (Chapter 11)
  Self-check: Can you allocate a CuPy array and transfer it back to NumPy?
- PyTorch tensors, autograd, and neural network modules (Chapter 12)
  Self-check: Can you build a simple `nn.Module` and run a forward pass on GPU?
- Basic profiling and timing concepts (Chapter 4)
  Self-check: Have you used `time.perf_counter()` or `%%timeit` to measure execution time?
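As a quick refresher on the timing self-check, here is a minimal sketch using only the standard library; the workload function is a hypothetical stand-in, not code from earlier chapters:

```python
import time

def work(n: int) -> int:
    # Stand-in workload: sum of squares (replace with the code you want to time).
    return sum(i * i for i in range(n))

# time.perf_counter() is a high-resolution monotonic clock,
# suitable for measuring short elapsed intervals.
start = time.perf_counter()
result = work(100_000)
elapsed = time.perf_counter() - start

print(f"result={result}, elapsed={elapsed:.6f} s")
```

For GPU code, remember that kernel launches are asynchronous, so a synchronization point is needed before reading the clock; that pattern is revisited later in the chapter.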
Notation for This Chapter
The table below lists symbols and conventions introduced in this chapter. Memory sizes are in bytes unless stated otherwise; throughput is in FLOP/s or samples per second.
| Symbol | Meaning | Introduced |
|---|---|---|
| `M_alloc` | GPU memory allocated (bytes) | s01 |
| `M_reserved` | GPU memory reserved by the caching allocator (bytes) | s01 |
| `B` | Batch size (number of independent samples processed together) | s02 |
| `A_i` | The i-th matrix in a batched operation | s02 |
| `ε` | Machine epsilon for a given floating-point format | s03 |
| `fp32`, `fp16`, `bf16` | IEEE 754 single precision, half precision, and bfloat16 formats | s03 |
| `W` | Number of DataLoader worker processes | s04 |
| `N` | Number of GPUs in a distributed setup | s05 |
| `T_comp`, `T_comm` | Computation time and communication time per iteration | s05 |
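To make the machine-epsilon entry concrete, ε can be found empirically: it is the smallest spacing between 1.0 and the next representable value in a given format. This sketch computes it for Python's built-in `float` (IEEE 754 double precision); the same loop applies per format with NumPy or PyTorch dtypes:

```python
import sys

def machine_eps() -> float:
    # Halve eps until 1 + eps/2 rounds back to 1; the last eps for
    # which 1 + eps is still distinguishable is the machine epsilon.
    eps = 1.0
    while 1.0 + eps / 2.0 != 1.0:
        eps /= 2.0
    return eps

eps = machine_eps()
print(eps)                            # 2**-52 for float64
print(eps == sys.float_info.epsilon)  # the stdlib reports the same value
```

Lower-precision formats have far larger epsilons (about 1e-3 for fp16 and 1e-2 for bf16), which is why mixed-precision training, covered in s03, needs care.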