Chapter Summary
Key Points
1. GPUs are throughput machines, not latency machines. A GPU hides memory latency by keeping thousands of threads in flight. It contains thousands of simple CUDA cores organized into Streaming Multiprocessors (SMs), each with shared memory, registers, and warp schedulers. Understanding this hierarchy — cores, warps (32 threads), blocks, grid — is essential for predicting GPU performance (see the device-query sketch after this list).
2. Most scientific workloads are memory-bound. The roofline model shows that element-wise operations, reductions, and matrix-vector products have low arithmetic intensity and are limited by memory bandwidth, not compute. Only operations with high arithmetic intensity, such as matrix-matrix multiplication (GEMM), can saturate GPU compute. Optimize memory access patterns (coalescing, shared memory reuse) before increasing parallelism (see the arithmetic-intensity sketch below).
3. Data transfers are the silent killer. PCIe bandwidth (16-64 GB/s) is 30-125x slower than GPU memory bandwidth. Upload data once, perform all operations on the GPU, and download results once. A pipeline with 5 operations and 5 round trips can be 5x slower than one with a single round trip, even with the same GPU kernels (see the transfer sketch below).
4. Correct GPU benchmarking requires synchronization and warmup. GPU operations are asynchronous: Python returns before the GPU finishes. Always call torch.cuda.synchronize() before and after timed regions. Always run 2-3 warmup iterations to exclude CUDA context initialization. Report the median of multiple runs, not the first measurement (see the timing sketch below).
5. Use conda for GPU environment setup. Install PyTorch with conda install pytorch pytorch-cuda=XX.X -c pytorch -c nvidia. Always verify with torch.cuda.is_available(). The #1 setup failure is a version mismatch between the driver (check with nvidia-smi), the CUDA toolkit, and the Python library builds. The driver is backward compatible: newer drivers support older CUDA runtimes (see the verification sketch below).
6. Profile before optimizing. Use torch.profiler for operation-level analysis, nvidia-smi/nvitop for system monitoring, and CUDA Events for precise kernel timing. GPU performance intuition is unreliable — profiling reveals whether you are limited by compute, memory bandwidth, data transfer, kernel launch overhead, or low occupancy (see the profiler sketch below).
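To make the hierarchy in point 1 concrete, here is a minimal sketch that queries the device properties PyTorch exposes; the SM count and memory size vary by GPU, while warps of 32 threads, blocks, and the grid are fixed by the CUDA programming model.

```python
import torch

# Query the device properties PyTorch exposes (values vary by GPU).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Device:             {props.name}")
    print(f"SM count:           {props.multi_processor_count}")
    print(f"Total memory (GB):  {props.total_memory / 1e9:.1f}")
    # Each SM schedules warps of 32 threads; blocks of threads are
    # assigned to SMs, and all blocks launched together form the grid.
```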
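For point 2, a back-of-the-envelope arithmetic-intensity calculation, assuming float32 operands and counting only compulsory memory traffic (each n x n matrix read or written once):

```python
# Arithmetic intensity = FLOPs / bytes moved (float32 = 4 bytes).
n = 4096

# Element-wise add c = a + b: n*n FLOPs, three n*n arrays moved.
ai_add = (n * n) / (3 * n * n * 4)

# GEMM c = a @ b: 2*n**3 FLOPs, the same three arrays moved.
ai_gemm = (2 * n**3) / (3 * n * n * 4)

print(f"element-wise add: {ai_add:.3f} FLOP/byte")  # ~0.08 -> memory-bound
print(f"GEMM:             {ai_gemm:.0f} FLOP/byte")  # ~680  -> compute-bound
```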
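Point 3 in code: the same chain of five operations written once with a PCIe round trip per step and once with a single upload and download. The operations are placeholders for any sequence of GPU kernels.

```python
import torch

x_cpu = torch.randn(10_000_000)

# Slow pattern: one PCIe round trip per operation (upload, kernel, download).
y = x_cpu
for _ in range(5):
    y = (y.cuda() * 2.0).cpu()

# Fast pattern: upload once, keep intermediates in GPU memory, download once.
z = x_cpu.cuda()
for _ in range(5):
    z = z * 2.0
z = z.cpu()
```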
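A minimal timing helper for the protocol in point 4 (warmup, synchronization around the timed region, median of several runs); the function name and iteration counts are illustrative choices, not a fixed convention.

```python
import statistics
import time

import torch

def time_gpu(fn, warmup=3, reps=10):
    """Return the median wall-clock time of fn(), with GPU synchronization."""
    for _ in range(warmup):          # exclude CUDA context init / first-call costs
        fn()
    times = []
    for _ in range(reps):
        torch.cuda.synchronize()     # make sure earlier GPU work has finished
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()     # wait until this run's GPU work is done
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

a = torch.randn(4096, 4096, device="cuda")
print(f"GEMM time: {time_gpu(lambda: a @ a) * 1e3:.2f} ms")
```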
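A short verification script for point 5: it reports the installed PyTorch version, the CUDA version the library was built against, and whether a GPU is visible. Compare torch.version.cuda against the CUDA version the driver reports in nvidia-smi.

```python
import torch

print("PyTorch version:    ", torch.__version__)
print("CUDA available:     ", torch.cuda.is_available())
print("CUDA build version: ", torch.version.cuda)   # None in CPU-only builds
if torch.cuda.is_available():
    print("GPU:                ", torch.cuda.get_device_name(0))
```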
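Finally, a minimal torch.profiler sketch for point 6: profile a few matrix multiplications and print the operators sorted by GPU time. For system-level monitoring, run nvidia-smi or nvitop in a separate terminal while the workload executes.

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(2048, 2048, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        a @ a
    torch.cuda.synchronize()

# Operator-level summary, sorted by time spent on the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```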
Looking Ahead
Chapter 11 introduces CuPy, which mirrors NumPy's API on the GPU. You will apply the architecture knowledge from this chapter to understand why some CuPy operations achieve 100x speedups while others barely break even. Chapter 12 covers PyTorch tensors for GPU-accelerated computation, building on the profiling techniques from Section 10.4.