Prerequisites & Notation

Before You Begin

This chapter assumes familiarity with GPU computing concepts (Chapter 10), CuPy array programming (Chapter 11), and PyTorch tensor operations (Chapter 12). You should be comfortable moving data between CPU and GPU and writing basic GPU-accelerated code.

  • GPU architecture: threads, blocks, warps, memory hierarchy (Chapter 10)

    Self-check: Can you explain why GPU global memory access should be coalesced?

  • CuPy arrays and GPU memory management (Chapter 11)

    Self-check: Can you allocate a CuPy array and transfer it back to NumPy?

  • PyTorch tensors, autograd, and neural network modules (Chapter 12)

    Self-check: Can you build a simple nn.Module and run a forward pass on GPU?

  • Basic profiling and timing concepts (Chapter 4)

    Self-check: Have you used time.perf_counter() or %%timeit to measure execution time?
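As a quick warm-up for the last self-check, here is a minimal sketch of best-of-N timing with `time.perf_counter()`. The helper name `time_fn` is illustrative, not from the book; note that timing GPU work additionally requires synchronizing the device before reading the clock, which this CPU-only sketch does not show.

```python
import time

def time_fn(fn, *args, repeats=5):
    """Return the best-of-N wall-clock time for fn(*args), in seconds.

    Taking the minimum over several runs filters out one-off delays
    (cache warm-up, scheduler jitter) better than a single measurement.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

elapsed = time_fn(sum, range(100_000))
print(f"best of 5: {elapsed * 1e3:.3f} ms")
```

If you cannot explain why the minimum (rather than the mean) is reported here, revisit Chapter 4 before proceeding.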

Notation for This Chapter

Symbols and conventions introduced in this chapter. Memory sizes are in bytes unless stated otherwise; throughput is in FLOPS or samples/second.

| Symbol | Meaning | Introduced |
| --- | --- | --- |
| $M_{\text{alloc}}$ | GPU memory allocated (bytes) | s01 |
| $M_{\text{reserved}}$ | GPU memory reserved by the caching allocator (bytes) | s01 |
| $B$ | Batch size (number of independent samples processed together) | s02 |
| $\mathbf{A}_i \in \mathbb{R}^{m \times k}$ | The $i$-th matrix in a batched operation | s02 |
| $\epsilon_{\text{mach}}$ | Machine epsilon for a given floating-point format | s03 |
| FP32, FP16, BF16 | IEEE 754 single precision, IEEE 754 half precision, and the bfloat16 format | s03 |
| $W$ | Number of DataLoader worker processes | s04 |
| $N_{\text{GPU}}$ | Number of GPUs in a distributed setup | s05 |
| $T_{\text{comp}}, T_{\text{comm}}$ | Computation time and communication time per iteration | s05 |
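To make $\epsilon_{\text{mach}}$ concrete before it appears later in the chapter: machine epsilon is the gap between 1.0 and the next representable value, which follows directly from each format's mantissa width. The sketch below hard-codes the standard mantissa widths (23 fraction bits for FP32, 10 for FP16, 7 for BF16); the dictionary layout is illustrative, not notation from the book.

```python
# eps_mach = 2 ** (-fraction_bits): the spacing between 1.0 and the
# next larger representable number in each format.
eps = {
    "FP32": 2.0 ** -23,  # 23 fraction bits
    "FP16": 2.0 ** -10,  # 10 fraction bits
    "BF16": 2.0 ** -7,   # 7 fraction bits (same exponent range as FP32)
}
for fmt, e in eps.items():
    print(f"{fmt}: eps_mach = {e:.3e}")
```

The three-orders-of-magnitude gap between FP32 and FP16/BF16 epsilon is why the mixed-precision discussion in this chapter pays close attention to where rounding error accumulates.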