Prerequisites & Notation

Before You Begin

This chapter assumes familiarity with GPU computing concepts (Chapter 10), CuPy array programming (Chapter 11), and PyTorch tensor operations (Chapter 12). You should be comfortable moving data between CPU and GPU and writing basic GPU-accelerated code.

  • GPU architecture: threads, blocks, warps, memory hierarchy (Chapter 10)

    Self-check: Can you explain why GPU global memory access should be coalesced?

  • CuPy arrays and GPU memory management (Chapter 11)

    Self-check: Can you allocate a CuPy array and transfer it back to NumPy?

  • PyTorch tensors, autograd, and neural network modules (Chapter 12)

    Self-check: Can you build a simple nn.Module and run a forward pass on GPU?

  • Basic profiling and timing concepts (Chapter 4)

    Self-check: Have you used time.perf_counter() or %%timeit to measure execution time?
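As a quick warm-up for the last self-check, here is a minimal sketch of best-of-N timing with `time.perf_counter()`. The helper name `time_fn` is illustrative, not from the book; note that timing GPU work additionally requires synchronizing the device before reading the clock, which this CPU-only sketch does not show.

```python
import time

def time_fn(fn, *args, repeats=5):
    """Return the best-of-N wall-clock time for fn(*args), in seconds.

    Taking the minimum over several runs filters out one-off delays
    (cache warm-up, scheduler jitter) better than a single measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

elapsed = time_fn(sum, range(100_000))
print(f"best of 5: {elapsed * 1e3:.3f} ms")
```

If you cannot explain why the minimum (rather than the mean) is reported here, revisit Chapter 4 before proceeding.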

Notation for This Chapter

Symbols and conventions introduced in this chapter. Memory sizes are in bytes unless stated otherwise; throughput is in FLOPS or samples/second.

| Symbol | Meaning | Introduced |
| --- | --- | --- |
| $M_{\text{alloc}}$ | GPU memory allocated (bytes) | s01 |
| $M_{\text{reserved}}$ | GPU memory reserved by the caching allocator (bytes) | s01 |
| $B$ | Batch size (number of independent samples processed together) | s02 |
| $\mathbf{A}_i \in \mathbb{R}^{m \times k}$ | The $i$-th matrix in a batched operation | s02 |
| $\epsilon_{\text{mach}}$ | Machine epsilon for a given floating-point format | s03 |
| FP32, FP16, BF16 | IEEE 754 single precision, IEEE 754 half precision, and the bfloat16 format | s03 |
| $W$ | Number of DataLoader worker processes | s04 |
| $N_{\text{GPU}}$ | Number of GPUs in a distributed setup | s05 |
| $T_{\text{comp}}, T_{\text{comm}}$ | Computation time and communication time per iteration | s05 |
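To make $\epsilon_{\text{mach}}$ concrete before it appears later in the chapter: machine epsilon is the gap between 1.0 and the next representable value, which follows directly from each format's mantissa width. The sketch below hard-codes the standard mantissa widths (23 fraction bits for FP32, 10 for FP16, 7 for BF16); the dictionary layout is illustrative, not notation from the book.

```python
# eps_mach = 2 ** (-fraction_bits): the spacing between 1.0 and the
# next larger representable number in each format.
eps = {
    "FP32": 2.0 ** -23,  # 23 fraction bits
    "FP16": 2.0 ** -10,  # 10 fraction bits
    "BF16": 2.0 ** -7,   # 7 fraction bits (same exponent range as FP32)
}
for fmt, e in eps.items():
    print(f"{fmt}: eps_mach = {e:.3e}")
```

The three-orders-of-magnitude gap between FP32 and FP16/BF16 epsilon is why the mixed-precision discussion in this chapter pays close attention to where rounding error accumulates.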