Exercises

ex-sp-ch10-01

Easy

Compute the arithmetic intensity I = W/B (FLOP/byte) for the following operations on float32 arrays of length n:

  1. Vector copy: b = a.copy()
  2. Scalar multiply: b = 3.0 * a
  3. Dot product: s = np.dot(a, b)

Classify each as memory-bound or compute-bound on an A100 (I* ≈ 9.75 FLOP/byte).
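As a starting point, the byte counts can be sanity-checked with a small helper. The function name and the convention that every element is read or written exactly once (no caching effects) are assumptions of this sketch, not part of the exercise:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOP-per-byte ratio I = W/B; higher means more compute per byte."""
    return flops / bytes_moved

n = 1_000_000
elem = 4  # float32 is 4 bytes

# Vector copy: 0 FLOPs, read n + write n elements.
i_copy = arithmetic_intensity(0, 2 * n * elem)

# Scalar multiply: 1 FLOP per element, read n + write n elements.
i_scale = arithmetic_intensity(n, 2 * n * elem)

# Dot product: ~2n FLOPs (multiply + add), read 2n elements
# (the single scalar written back is negligible and ignored here).
i_dot = arithmetic_intensity(2 * n, 2 * n * elem)
```

Comparing each value against I* tells you which side of the roofline ridge the operation sits on.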

ex-sp-ch10-02

Easy

For an array of n = 100,000 float32 elements with block size B = 256:

  1. How many blocks are in the grid?
  2. How many total threads are launched?
  3. How many threads are "wasted" (don't process data)?
  4. How many warps are there in total?
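The bookkeeping here follows directly from ceil-division gridding; a small sketch for checking your arithmetic (the helper name is illustrative):

```python
import math

def launch_stats(n, block_size, warp_size=32):
    """Grid/thread bookkeeping for a 1-D launch with ceil-division gridding."""
    blocks = math.ceil(n / block_size)   # round up so every element is covered
    threads = blocks * block_size        # total threads launched
    wasted = threads - n                 # threads past the end of the array
    warps = threads // warp_size         # exact when block_size is a multiple of 32
    return blocks, threads, wasted, warps

blocks, threads, wasted, warps = launch_stats(100_000, 256)
```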

ex-sp-ch10-03

Easy

An NVIDIA A100 has 108 SMs, each supporting up to 2048 threads. A kernel launches 50,000 threads with a block size of 256.

  1. How many blocks are created?
  2. What is the maximum occupancy if each SM gets 8 blocks?
  3. What percentage of SMs are utilized?

ex-sp-ch10-04

Easy

Write a Python script that checks your GPU setup:

  1. Print the NVIDIA driver version and CUDA version from nvidia-smi
  2. Check if PyTorch can see the GPU
  3. Print GPU name, total memory, and CUDA version
  4. Run a simple GPU computation and verify the result
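A minimal skeleton to build on, written so each check degrades gracefully when a tool or the GPU is missing; `check_gpu_setup` and the returned dictionary keys are illustrative choices, not a required interface:

```python
import shutil
import subprocess

def check_gpu_setup():
    """Collect GPU setup information into a dict; every check is optional."""
    report = {}
    # 1. Driver / CUDA version from nvidia-smi, if it is on PATH.
    if shutil.which("nvidia-smi"):
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        report["nvidia_smi"] = out.stdout.splitlines()[:4]  # header lines
    else:
        report["nvidia_smi"] = None
    # 2-4. PyTorch visibility, device properties, and a sanity computation.
    try:
        import torch
        report["torch_available"] = True
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            props = torch.cuda.get_device_properties(0)
            report["gpu_name"] = props.name
            report["total_memory_gb"] = props.total_memory / 1e9
            report["cuda_version"] = torch.version.cuda
            x = torch.ones(1024, device="cuda")
            report["sanity_ok"] = (x + x).sum().item() == 2048.0
    except ImportError:
        report["torch_available"] = False
    return report
```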

ex-sp-ch10-05

Easy

Explain why each of these GPU timing measurements is incorrect and fix each one:

# Measurement A
t0 = time.time()
y = x_gpu @ x_gpu
print(time.time() - t0)

# Measurement B
torch.cuda.synchronize()
t0 = time.time()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(time.time() - t0)  # first run ever
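A generic harness that bakes in the two ingredients both fixes need, warmup and synchronization, might look like the sketch below; `time_op` is an illustrative name, and the sketch falls back to plain host timing when PyTorch or a GPU is absent:

```python
import time

def time_op(fn, warmup=3, iters=10):
    """Time fn(), averaging over iters runs after warmup."""
    try:
        import torch
        sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
    except ImportError:
        sync = lambda: None
    for _ in range(warmup):   # absorb one-time costs: kernel caching, autotuning
        fn()
    sync()                    # drain pending async GPU work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()                    # ensure the timed kernels have actually finished
    return (time.perf_counter() - t0) / iters
```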

ex-sp-ch10-06

Medium

Using the roofline model, determine the maximum achievable FLOP/s for the following operations on an A100 (19.5 TFLOPS, 2039 GB/s bandwidth):

  1. Element-wise exp(x) on float32 (read n, write n, ~8 FLOPs per element)
  2. Matrix-matrix multiply of two 1024 × 1024 float32 matrices
  3. Batched vector addition of 1000 vectors of length 10000

For each, state whether the operation is memory-bound or compute-bound.
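The roofline bound itself is one line; here is a sketch with the A100 numbers from the exercise baked in as defaults (the helper name is assumed):

```python
def attainable_gflops(intensity, peak_tflops=19.5, bandwidth_gbs=2039):
    """Roofline: achievable rate is min(peak compute, I x memory bandwidth)."""
    compute_roof = peak_tflops * 1e3          # GFLOP/s
    memory_roof = intensity * bandwidth_gbs   # GFLOP/s, since I is FLOP/byte
    return min(compute_roof, memory_roof)

# exp(x): ~8 FLOPs and 8 bytes moved per element, so I = 1 FLOP/byte.
p_exp = attainable_gflops(1.0)  # far below the 19,500 GFLOP/s compute roof
```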

ex-sp-ch10-07

Medium

A warp of 32 threads executes a loop where thread i accesses data[i * stride] for stride values of 1, 2, 4, 8, 16, 32. For each stride, calculate:

  1. The number of 128-byte cache lines touched by the warp
  2. The effective bandwidth utilization (useful bytes / total bytes fetched)

Assume float32 elements (4 bytes each).
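A brute-force way to check your counts is to enumerate the cache-line indices each thread touches. This sketch assumes accesses start at element 0 and that the array is 128-byte aligned:

```python
def warp_cache_lines(stride, n_threads=32, elem_bytes=4, line_bytes=128):
    """Distinct 128-byte lines touched when thread i reads data[i * stride]."""
    lines = {i * stride * elem_bytes // line_bytes for i in range(n_threads)}
    useful = n_threads * elem_bytes    # bytes the warp actually needs
    fetched = len(lines) * line_bytes  # bytes the memory system must move
    return len(lines), useful / fetched
```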

ex-sp-ch10-08

Medium

Write a Python script using PyTorch that benchmarks matrix multiplication on CPU vs GPU for sizes n = 128, 256, 512, 1024, 2048, 4096. Plot the speedup (CPU time / GPU time) vs matrix size. Include proper warmup and synchronization.

ex-sp-ch10-09

Medium

A training pipeline has the following profile:

  • Data loading: 200 ms (CPU)
  • Data augmentation: 50 ms (CPU)
  • Host-to-device transfer: 10 ms
  • Forward pass: 30 ms (GPU)
  • Backward pass: 50 ms (GPU)
  • Optimizer step: 5 ms (GPU)
  • Device-to-host (loss): 0.1 ms
  1. What is the total iteration time?
  2. What is the GPU utilization?
  3. Apply Amdahl's law: what speedup would a 2x faster GPU give?
  4. Suggest two optimizations with estimated impact.
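For parts 1-3, a model that assumes the stages run strictly serially (no overlap between CPU work, transfers, and GPU work) can be expressed directly; the names below are illustrative:

```python
def pipeline_stats(stages, gpu_stages, gpu_speedup=1.0):
    """Serial pipeline model: total time and GPU utilization.

    stages: dict of stage name -> time in ms; gpu_stages: names run on the GPU.
    Scaling only the GPU stages by gpu_speedup is Amdahl's law in action.
    """
    total = sum(t / gpu_speedup if name in gpu_stages else t
                for name, t in stages.items())
    gpu_time = sum(stages[name] / gpu_speedup for name in gpu_stages)
    return total, gpu_time / total

stages = {"load": 200, "augment": 50, "h2d": 10, "fwd": 30,
          "bwd": 50, "opt": 5, "d2h": 0.1}
gpu = {"fwd", "bwd", "opt"}
base_total, util = pipeline_stats(stages, gpu)
fast_total, _ = pipeline_stats(stages, gpu, gpu_speedup=2.0)
speedup = base_total / fast_total
```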

ex-sp-ch10-10

Medium

Use torch.profiler to profile 10 iterations of a matrix multiplication (2048 × 2048) and answer:

  1. What is the kernel name for matmul?
  2. What percentage of time is CUDA vs CPU?
  3. What is the CUDA memory allocated?

Print the profiler table sorted by CUDA time.

ex-sp-ch10-11

Hard

Build a "GPU calculator" that determines whether a given computation should run on CPU or GPU based on the roofline model:

def should_use_gpu(flops, bytes_moved, cpu_gflops, gpu_gflops,
                   gpu_bandwidth_gbs, pcie_bandwidth_gbs):
    """Return True if GPU is faster, accounting for transfer."""

Test it for: vector add (n=1000 to n=10M), GEMM (n=100 to n=4096), and batch FFT (batch=1 to batch=10000, size=1024).
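One deliberately crude cost model to start from: it treats the CPU as purely compute-bound and charges the GPU a one-way PCIe transfer for all the data. Both simplifications are assumptions you will likely want to refine:

```python
def should_use_gpu(flops, bytes_moved, cpu_gflops, gpu_gflops,
                   gpu_bandwidth_gbs, pcie_bandwidth_gbs):
    """Return True if GPU is faster, accounting for transfer."""
    cpu_time = flops / (cpu_gflops * 1e9)                 # CPU: compute-bound model
    gpu_compute = flops / (gpu_gflops * 1e9)              # GPU roofline: compute leg
    gpu_memory = bytes_moved / (gpu_bandwidth_gbs * 1e9)  # GPU roofline: memory leg
    transfer = bytes_moved / (pcie_bandwidth_gbs * 1e9)   # cost of moving the data
    return max(gpu_compute, gpu_memory) + transfer < cpu_time

# Vector add of n=1000 floats: far too small to amortize the PCIe transfer.
small = should_use_gpu(1_000, 12_000, 100, 19_500, 2039, 16)
# 4096 x 4096 GEMM: ~2n^3 FLOPs against only 3n^2 elements moved.
big = should_use_gpu(2 * 4096**3, 3 * 4096**2 * 4, 100, 19_500, 2039, 16)
```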

ex-sp-ch10-12

Hard

Write a benchmark that measures effective memory bandwidth for GPU memory operations at different data sizes (1 KB to 1 GB). Plot the achieved bandwidth vs data size and explain why small transfers achieve much lower bandwidth than large ones.
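A size-sweep harness can be written once and reused. It is shown here exercising host memory so it runs anywhere; for the actual exercise, pass in GPU tensors and a copy callable that ends with torch.cuda.synchronize():

```python
import time

def copy_bandwidth(make_buf, copy, sizes, iters=10):
    """Measure achieved copy bandwidth (GB/s) for each buffer size in bytes."""
    results = []
    for size in sizes:
        buf = make_buf(size)
        copy(buf)  # warmup: first touch, allocator effects
        t0 = time.perf_counter()
        for _ in range(iters):
            copy(buf)
        dt = (time.perf_counter() - t0) / iters
        results.append(size / dt / 1e9)
    return results

# Host-memory illustration: bytearray buffers, bytes() as the copy.
bw = copy_bandwidth(bytearray, bytes, [2**10, 2**16, 2**20])
```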

ex-sp-ch10-13

Hard

Build a comprehensive GPU diagnostic script that reports:

  1. Driver version, CUDA version, GPU name
  2. Total/free GPU memory
  3. Number of SMs and max threads per SM
  4. Measured memory bandwidth (copy benchmark)
  5. Measured compute throughput (GEMM benchmark)
  6. PCIe transfer bandwidth (host-device copy)

Format the output as a table.

ex-sp-ch10-14

Challenge

Implement a Python roofline analysis tool that:

  1. Auto-detects GPU specs (peak FLOPS, memory bandwidth)
  2. Benchmarks a user-provided function at multiple input sizes
  3. Measures both FLOP count and bytes moved
  4. Plots the function on the roofline diagram
  5. Automatically classifies the bottleneck and suggests optimizations

Test on: element-wise operations, GEMM, batched SVD, and FFT.

ex-sp-ch10-15

Challenge

Create a GPU memory transfer optimizer that:

  1. Traces all host-device transfers in a PyTorch program
  2. Identifies redundant transfers (data sent to GPU then immediately back)
  3. Detects small transfers that should be batched
  4. Suggests pinned memory for frequently transferred tensors
  5. Reports total transfer time and potential savings

Use torch.cuda.memory._record_memory_history() or manual hooks.