Exercises
ex-sp-ch10-01
Easy: Compute the arithmetic intensity (FLOP/byte) for the following operations on float32 arrays of length n:
- Vector copy: b = a.copy()
- Scalar multiply: b = 3.0 * a
- Dot product: s = np.dot(a, b)
Classify each as memory-bound or compute-bound on an A100 (ridge point ~9.6 FLOP/byte).
Count bytes read AND bytes written.
A dot product reads 2n floats and performs 2n FLOPs (one multiply and one add per element).
Analysis
- Copy: 0 FLOPs, 8n bytes (4n read + 4n write) -> 0 FLOP/byte. Memory-bound.
- Scalar multiply: n FLOPs, 8n bytes -> 0.125 FLOP/byte. Memory-bound.
- Dot product: 2n FLOPs, 8n bytes (two input arrays) -> 0.25 FLOP/byte. Memory-bound.
All three are far below the A100 ridge point of ~9.6 FLOP/byte, confirming they are memory-bound.
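A minimal sketch of the intensity bookkeeping (the ~9.6 FLOP/byte ridge point is derived from the A100 figures quoted in ex-sp-ch10-06):
# Arithmetic intensity = FLOPs / bytes moved (per element, float32).
ops = {
    "copy":            (0, 8),   # 0 FLOPs; read 4 B + write 4 B
    "scalar multiply": (1, 8),   # 1 FLOP;  read 4 B + write 4 B
    "dot product":     (2, 8),   # multiply + add; read 2 x 4 B
}
RIDGE = 19.5e12 / 2039e9         # ~9.6 FLOP/byte on an A100

for name, (flops, nbytes) in ops.items():
    ai = flops / nbytes
    kind = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{name:16s}: {ai:.3f} FLOP/byte -> {kind}")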
ex-sp-ch10-02
Easy: For an array of 100,000 float32 elements with block size 256:
- How many blocks are in the grid?
- How many total threads are launched?
- How many threads are "wasted" (don't process data)?
- How many warps are there in total?
Blocks = ceil(N / block_size).
Total threads = blocks x block_size.
Computation
- Blocks: ceil(100,000 / 256) = 391 blocks
- Total threads: 391 x 256 = 100,096
- Wasted: 100,096 - 100,000 = 96 threads (0.096%)
- Total warps: 100,096 / 32 = 3,128 warps
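A quick check of these numbers in Python (assuming the 100,000-element array and block size 256 above):
import math

N, BLOCK, WARP = 100_000, 256, 32

blocks = math.ceil(N / BLOCK)        # grid size: 391
total_threads = blocks * BLOCK       # 100,096 threads launched
wasted = total_threads - N           # 96 threads with no element to process
warps = total_threads // WARP        # 3,128 warps
print(blocks, total_threads, wasted, warps)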
ex-sp-ch10-03
Easy: An NVIDIA A100 has 108 SMs, each supporting up to 2048 threads. A kernel launches 50,000 threads with a block size of 256.
- How many blocks are created?
- What is the maximum occupancy if each SM gets 8 blocks?
- What percentage of SMs are utilized?
Blocks = ceil(total_threads / block_size).
Occupancy = active_threads / max_threads per SM.
Computation
- Blocks = ceil(50,000 / 256) = 196 blocks (rounding up)
- 8 blocks x 256 threads = 2048 threads per SM = 100% occupancy
- With 196 blocks and 8 blocks per SM, we need ceil(196 / 8) = 25 SMs. Utilization = 25/108 ~ 23%. Most SMs are idle. Full utilization would need 108 x 2048 = 221,184 threads.
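The same arithmetic as a short script:
import math

threads, block = 50_000, 256
sms, max_threads_per_sm, blocks_per_sm = 108, 2048, 8

blocks = math.ceil(threads / block)                       # 196
occupancy = blocks_per_sm * block / max_threads_per_sm    # 1.0 on an active SM
active_sms = math.ceil(blocks / blocks_per_sm)            # 25
utilization = active_sms / sms                            # ~0.23
print(blocks, occupancy, active_sms, round(utilization, 2), sms * max_threads_per_sm)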
ex-sp-ch10-04
Easy: Write a Python script that checks your GPU setup:
- Print the NVIDIA driver version and CUDA version from nvidia-smi
- Check if PyTorch can see the GPU
- Print GPU name, total memory, and CUDA version
- Run a simple GPU computation and verify the result
Use torch.cuda.get_device_properties(0) for device info.
Implementation
import subprocess
import torch
# 1. nvidia-smi
result = subprocess.run(['nvidia-smi', '--query-gpu=driver_version,name,memory.total',
                         '--format=csv,noheader'],
                        capture_output=True, text=True)
print(f"nvidia-smi: {result.stdout.strip()}")
# 2-3. PyTorch GPU info
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Memory: {props.total_memory / 1e9:.1f} GB")
    print(f"SMs: {props.multi_processor_count}")
    print(f"CUDA: {torch.version.cuda}")
# 4. Test computation
x = torch.randn(1000, 1000, device='cuda')
y = x @ x
torch.cuda.synchronize()
print(f"Test matmul: shape={y.shape}, OK")
ex-sp-ch10-05
Easy: Explain why each of these GPU timing measurements is incorrect, and fix each one:
# Measurement A
t0 = time.time()
y = x_gpu @ x_gpu
print(time.time() - t0)
# Measurement B
torch.cuda.synchronize()
t0 = time.time()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(time.time() - t0) # first run ever
Consider asynchronous execution and warmup.
Analysis
A: Missing synchronize() -> measures only the kernel launch time (~0.1 ms),
not the GPU compute time. Fix: add torch.cuda.synchronize() after the operation.
B: No warmup -> the first CUDA operation includes context initialization (100-500 ms extra). Fix: run 2-3 warmup iterations before timing.
Correct version:
# Warmup
for _ in range(3):
    _ = x_gpu @ x_gpu
torch.cuda.synchronize()
# Timed
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(f"{time.perf_counter() - t0:.4f}s")
ex-sp-ch10-06
Medium: Using the roofline model, determine the maximum achievable FLOP/s for the following operations on an A100 (19.5 TFLOPS, 2039 GB/s bandwidth):
- Element-wise exp(x) on float32 (read 4 bytes, write 4 bytes, ~8 FLOPs per element)
- Matrix-matrix multiply of two large n x n float32 matrices
- Batched vector addition of 1000 vectors of length 10000
For each, state whether the operation is memory-bound or compute-bound.
Compute the arithmetic intensity I = FLOPs / bytes, then attainable FLOP/s = min(peak FLOP/s, I x bandwidth).
Analysis
- exp(x): 8 FLOPs / 8 bytes = 1 FLOP/byte. min(19.5, 1 x 2.039) ~ 2.0 TFLOPS. Memory-bound.
- GEMM: 2n^3 FLOPs over ~3 x 4n^2 bytes -> intensity ~ n/6 FLOP/byte, above the ~9.6 FLOP/byte ridge point for n larger than about 60. Attainable: 19.5 TFLOPS. Compute-bound.
- Vector add: 1 FLOP per element over 12 bytes (two reads, one write) -> ~0.083 FLOP/byte. 0.083 x 2039 GB/s ~ 0.17 TFLOPS. Memory-bound.
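A short sketch of the min(peak, I x BW) bound used above; the per-element FLOP and byte counts are those stated in the exercise, and the GEMM size n = 4096 is an illustrative assumption:
PEAK_TFLOPS = 19.5
BW_GBS = 2039.0

def attainable_tflops(flops, nbytes):
    """Roofline bound: min(peak compute, intensity x bandwidth)."""
    intensity = flops / nbytes                        # FLOP/byte
    return min(PEAK_TFLOPS, intensity * BW_GBS / 1e3), intensity

n = 4096  # assumed GEMM size, for illustration only
cases = {
    "exp(x), per element":     (8, 8),
    f"GEMM {n}x{n}":           (2 * n**3, 3 * 4 * n**2),
    "vector add, per element": (1, 12),
}
for name, (f, b) in cases.items():
    bound, ai = attainable_tflops(f, b)
    print(f"{name:24s}: I = {ai:8.2f} FLOP/byte, bound = {bound:5.2f} TFLOPS")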
ex-sp-ch10-07
Medium: A warp of 32 threads executes a loop where thread i accesses
data[i * stride] for stride values of 1, 2, 4, 8, 16, 32.
For each stride, calculate:
- The number of 128-byte cache lines touched by the warp
- The effective bandwidth utilization (useful bytes / total bytes fetched)
Assume float32 elements (4 bytes each).
With stride s, the warp's addresses span 32 x s x 4 bytes.
Each cache line is 128 bytes = 32 float32 elements.
Analysis
| Stride | Range (bytes) | Cache Lines | Useful/Fetched |
|---|---|---|---|
| 1 | 128 | 1 | 128/128 = 100% |
| 2 | 256 | 2 | 128/256 = 50% |
| 4 | 512 | 4 | 128/512 = 25% |
| 8 | 1024 | 8 | 128/1024 = 12.5% |
| 16 | 2048 | 16 | 128/2048 = 6.25% |
| 32 | 4096 | 32 | 128/4096 = 3.125% |
Stride-1 access is perfectly coalesced (1 transaction). Stride-32 wastes 97% of memory bandwidth.
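A few lines of Python reproduce the table, assuming the simple model above (one float32 per thread, 128-byte lines, warp addresses starting at a line boundary):
LINE, ELEM, WARP = 128, 4, 32   # bytes per cache line, bytes per float32, threads per warp

for stride in (1, 2, 4, 8, 16, 32):
    span = WARP * stride * ELEM          # bytes spanned by the warp's addresses
    lines = max(1, span // LINE)         # 128-byte cache lines touched
    useful = WARP * ELEM                 # bytes the warp actually uses
    print(f"stride {stride:2d}: {lines:2d} lines, {useful / (lines * LINE):7.2%} useful")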
ex-sp-ch10-08
Medium: Write a Python script using PyTorch that benchmarks matrix multiplication on CPU vs GPU for sizes n = 128, 256, 512, 1024, 2048, 4096. Plot the speedup (CPU time / GPU time) vs matrix size. Include proper warmup and synchronization.
Use torch.cuda.synchronize() for GPU timing.
Use CUDA Events for precise GPU timing.
Implementation
import torch
import time
import numpy as np
sizes = [128, 256, 512, 1024, 2048, 4096]
cpu_times, gpu_times = [], []
for n in sizes:
    # CPU
    x_cpu = torch.randn(n, n)
    times = []
    for _ in range(5):
        t0 = time.perf_counter()
        _ = x_cpu @ x_cpu
        times.append(time.perf_counter() - t0)
    cpu_times.append(np.median(times))
    # GPU
    x_gpu = torch.randn(n, n, device='cuda')
    for _ in range(3):  # warmup
        _ = x_gpu @ x_gpu
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(10):
        start.record()
        _ = x_gpu @ x_gpu
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1000)
    gpu_times.append(np.median(times))
speedups = [c/g for c, g in zip(cpu_times, gpu_times)]
for n, s in zip(sizes, speedups):
    print(f"n={n:5d}: speedup = {s:.1f}x")
ex-sp-ch10-09
Medium: A training pipeline has the following profile:
- Data loading: 200 ms (CPU)
- Data augmentation: 50 ms (CPU)
- Host-to-device transfer: 10 ms
- Forward pass: 30 ms (GPU)
- Backward pass: 50 ms (GPU)
- Optimizer step: 5 ms (GPU)
- Device-to-host (loss): 0.1 ms
- What is the total iteration time?
- What is the GPU utilization?
- Apply Amdahl's law: what speedup would a 2x faster GPU give?
- Suggest two optimizations with estimated impact.
GPU utilization = GPU time / total time.
Consider pipelining data loading with GPU compute.
Analysis
- Total: 200 + 50 + 10 + 30 + 50 + 5 + 0.1 = 345.1 ms
- GPU util: (30 + 50 + 5) / 345.1 = 85 / 345.1 ~ 25%
- Amdahl's law: the GPU fraction is p ~ 0.25, so halving the GPU time gives speedup 1 / ((1 - p) + p/2) = 345.1 / 302.6 ~ 1.14. A 2x faster GPU gives only 14% overall speedup!
- Optimizations:
  - GPU data augmentation: saves 50 ms, total -> ~295 ms (~1.17x faster)
  - Prefetch/pipeline data loading: overlap the 200 ms load with GPU compute, total -> ~145 ms (~2.4x faster)
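The Amdahl arithmetic, using the stage times from the profile above:
stages_ms = {
    "load": 200, "augment": 50, "h2d": 10,
    "forward": 30, "backward": 50, "optimizer": 5, "d2h": 0.1,
}
gpu_stages = ("forward", "backward", "optimizer")

total = sum(stages_ms.values())                      # 345.1 ms
gpu_time = sum(stages_ms[s] for s in gpu_stages)     # 85 ms
util = gpu_time / total                              # ~0.25

# Amdahl's law: only the GPU fraction benefits from a 2x faster GPU.
speedup = total / ((total - gpu_time) + gpu_time / 2)
print(f"total = {total} ms, GPU util = {util:.0%}, 2x-GPU speedup = {speedup:.2f}x")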
ex-sp-ch10-10
Medium: Use torch.profiler to profile 10 iterations of a matrix
multiplication (2048 x 2048) and answer:
- What is the kernel name for matmul?
- What percentage of time is CUDA vs CPU?
- What is the CUDA memory allocated?
Print the profiler table sorted by CUDA time.
Use ProfilerActivity.CPU and ProfilerActivity.CUDA.
The kernel name is typically aten::mm or sm80_xmma_gemm.
Implementation
import torch
from torch.profiler import profile, ProfilerActivity
x = torch.randn(2048, 2048, device='cuda')
# Warmup
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        y = x @ x
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10))
ex-sp-ch10-11
HardBuild a "GPU calculator" that determines whether a given computation should run on CPU or GPU based on the roofline model:
def should_use_gpu(flops, bytes_moved, cpu_gflops, gpu_gflops,
                   gpu_bandwidth_gbs, pcie_bandwidth_gbs):
    """Return True if GPU is faster, accounting for transfer."""
Test it for: vector add (n=1000 to n=10M), GEMM (n=100 to n=4096), and batch FFT (batch=1 to batch=10000, size=1024).
GPU time = max(flops/gpu_gflops, bytes_moved/bandwidth) + 2*data_size/pcie_bw.
CPU time = flops/cpu_gflops.
Implementation
def should_use_gpu(flops, bytes_moved, data_bytes,
                   cpu_gflops=100, gpu_gflops=19500,
                   gpu_bw_gbs=2039, pcie_bw_gbs=25):
    cpu_time = flops / (cpu_gflops * 1e9)
    gpu_compute = flops / (gpu_gflops * 1e9)
    gpu_memtime = bytes_moved / (gpu_bw_gbs * 1e9)
    gpu_kernel = max(gpu_compute, gpu_memtime)
    gpu_transfer = 2 * data_bytes / (pcie_bw_gbs * 1e9)
    gpu_total = gpu_kernel + gpu_transfer
    return gpu_total < cpu_time, cpu_time, gpu_total
import numpy as np
for n in [1000, 10000, 100000, 1000000, 10000000]:
    flops = n
    bytes_moved = 12 * n
    data_bytes = 12 * n
    use, tc, tg = should_use_gpu(flops, bytes_moved, data_bytes)
    print(f"VecAdd n={n:>10,}: GPU={'yes' if use else 'no':3s} "
          f"CPU={tc*1e6:.1f}us GPU={tg*1e6:.1f}us")
ex-sp-ch10-12
Hard: Write a benchmark that measures effective memory bandwidth for GPU memory operations at different data sizes (1 KB to 1 GB). Plot the achieved bandwidth vs data size and explain why small transfers achieve much lower bandwidth than large ones.
Bandwidth = bytes / time.
Use torch.cuda.Event for precise timing.
Include both read and write bandwidth.
Implementation
import torch
import numpy as np
sizes = [2**k for k in range(10, 31)] # 1 KB to 1 GB
bandwidths = []
for nbytes in sizes:
    n = nbytes // 4  # float32
    x = torch.randn(n, device='cuda')
    y = torch.empty_like(x)
    # Warmup
    for _ in range(3):
        y.copy_(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    reps = max(1, 10_000_000 // n)
    start.record()
    for _ in range(reps):
        y.copy_(x)
    end.record()
    torch.cuda.synchronize()
    time_ms = start.elapsed_time(end) / reps
    bw = 2 * nbytes / (time_ms / 1000) / 1e9  # GB/s (read + write)
    bandwidths.append(bw)
    print(f"{nbytes/1024:>10.0f} KB: {bw:.0f} GB/s")
ex-sp-ch10-13
Hard: Build a comprehensive GPU diagnostic script that reports:
- Driver version, CUDA version, GPU name
- Total/free GPU memory
- Number of SMs and max threads per SM
- Measured memory bandwidth (copy benchmark)
- Measured compute throughput (GEMM benchmark)
- PCIe transfer bandwidth (host-device copy)
Format the output as a table.
Use torch.cuda.get_device_properties for hardware info.
Use torch.cuda.mem_get_info for memory.
Implementation
import torch
import time
def gpu_diagnostic():
    props = torch.cuda.get_device_properties(0)
    free, total = torch.cuda.mem_get_info(0)
    print("=" * 50)
    print("GPU DIAGNOSTIC REPORT")
    print("=" * 50)
    print(f"GPU: {props.name}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"SMs: {props.multi_processor_count}")
    print(f"Memory: {total/1e9:.1f} GB ({free/1e9:.1f} GB free)")
    # Memory BW benchmark
    n = 100_000_000
    x = torch.randn(n, device='cuda')
    y = torch.empty_like(x)
    for _ in range(3):
        y.copy_(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        y.copy_(x)
    end.record()
    torch.cuda.synchronize()
    bw = 20 * 2 * n * 4 / (start.elapsed_time(end) / 1000) / 1e9
    print(f"Memory BW: {bw:.0f} GB/s")
    # Compute benchmark
    m = 4096
    a = torch.randn(m, m, device='cuda')
    for _ in range(3):
        _ = a @ a
    torch.cuda.synchronize()
    start.record()
    for _ in range(10):
        _ = a @ a
    end.record()
    torch.cuda.synchronize()
    tflops = 10 * 2 * m**3 / (start.elapsed_time(end) / 1000) / 1e12
    print(f"Compute: {tflops:.1f} TFLOPS (FP32)")
gpu_diagnostic()
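The listing above skips the PCIe host-to-device measurement the exercise asks for; a minimal sketch, assuming a pinned host buffer so the copy is a true DMA transfer:
def pcie_bandwidth_gbs(nbytes=256 * 1024 * 1024, reps=10):
    """Measure host-to-device copy bandwidth from a pinned CPU buffer."""
    host = torch.empty(nbytes // 4, dtype=torch.float32, pin_memory=True)
    dev = torch.empty_like(host, device='cuda')
    dev.copy_(host)                                   # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(reps):
        dev.copy_(host, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return reps * nbytes / (start.elapsed_time(end) / 1000) / 1e9

print(f"PCIe H2D: {pcie_bandwidth_gbs():.1f} GB/s")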
ex-sp-ch10-14
Challenge: Implement a Python roofline analysis tool that:
- Auto-detects GPU specs (peak FLOPS, memory bandwidth)
- Benchmarks a user-provided function at multiple input sizes
- Measures both FLOP count and bytes moved
- Plots the function on the roofline diagram
- Automatically classifies the bottleneck and suggests optimizations
Test on: element-wise operations, GEMM, batched SVD, and FFT.
Estimate FLOPs from operation type and dimensions.
Measure achieved FLOP/s and compare to the roofline bound.
Design
class RooflineAnalyzer:
    def __init__(self):
        props = torch.cuda.get_device_properties(0)
        self.peak_tflops = self._measure_peak_compute()
        self.peak_bw = self._measure_peak_bandwidth()      # GB/s
        self.ridge_point = self.peak_tflops * 1e12 / (self.peak_bw * 1e9)  # FLOP/byte

    def analyze(self, func, sizes, flop_counter, byte_counter):
        results = []
        for size in sizes:
            args = self._make_args(size)
            flops = flop_counter(size)
            bytes_moved = byte_counter(size)
            time_s = self._benchmark(func, args)
            achieved = flops / time_s / 1e9                # GFLOP/s
            intensity = flops / bytes_moved                # FLOP/byte
            bound = min(self.peak_tflops * 1e3,
                        intensity * self.peak_bw)          # GFLOP/s roofline bound
            efficiency = achieved / bound * 100
            results.append({
                'size': size, 'intensity': intensity,
                'achieved_gflops': achieved,
                'bound_gflops': bound,
                'efficiency': efficiency,
                'bottleneck': 'compute' if intensity > self.ridge_point else 'memory'
            })
        return results
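The design leaves the helper methods (_measure_peak_compute, _measure_peak_bandwidth, _make_args, _benchmark) unspecified; a minimal sketch of a timing helper that _benchmark could delegate to, using CUDA events as in the earlier exercises:
import torch

def benchmark_seconds(func, args, reps=10, warmup=3):
    """Median wall time of func(*args) in seconds, timed with CUDA events."""
    for _ in range(warmup):
        func(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(reps):
        start.record()
        func(*args)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1000)   # ms -> s
    times.sort()
    return times[len(times) // 2]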
ex-sp-ch10-15
Challenge: Create a GPU memory transfer optimizer that:
- Traces all host-device transfers in a PyTorch program
- Identifies redundant transfers (data sent to GPU then immediately back)
- Detects small transfers that should be batched
- Suggests pinned memory for frequently transferred tensors
- Reports total transfer time and potential savings
Use torch.cuda.memory._record_memory_history() or manual hooks.
Override torch.Tensor.cuda() and .cpu() with logging wrappers.
Track tensor IDs to detect ping-pong transfers.
Design
import time
import torch

class TransferTracker:
    def __init__(self):
        self.transfers = []
        self.tensor_locations = {}

    def track_to_cuda(self, tensor, original_fn):
        size = tensor.nelement() * tensor.element_size()
        t0 = time.perf_counter()
        result = original_fn()
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
        tid = id(tensor)
        # A tensor already sent to the GPU being sent again is a redundant transfer.
        kind = 'redundant' if self.tensor_locations.get(tid) == 'cuda' else 'h2d'
        self.tensor_locations[tid] = 'cuda'
        self.transfers.append({'type': kind, 'size': size, 'time': elapsed})
        return result

    def report(self):
        total_bytes = sum(t['size'] for t in self.transfers)
        total_time = sum(t['time'] for t in self.transfers)
        redundant = [t for t in self.transfers if t['type'] == 'redundant']
        small = [t for t in self.transfers if t['size'] < 4096]
        print(f"Total transfers: {len(self.transfers)}")
        print(f"Total data: {total_bytes/1e6:.1f} MB")
        print(f"Total time: {total_time*1000:.1f} ms")
        print(f"Redundant: {len(redundant)}")
        print(f"Small (<4KB): {len(small)} (batch these!)")