Profiling GPU Code
If You Are Not Profiling, You Are Guessing
GPU code has so many potential bottlenecks (kernel launch overhead, memory transfers, low occupancy, uncoalesced access, warp divergence) that intuition alone is unreliable. Profiling reveals exactly where time is spent and which bottleneck to attack first.
This section covers profiling at three levels:
- Manual timing with torch.cuda.synchronize()
- Python-level profiling with torch.profiler and TensorBoard
- System-level monitoring with nvidia-smi and nvitop
Definition: Asynchronous GPU Execution
GPU operations in PyTorch and CuPy are asynchronous: the Python thread submits work to the GPU and returns immediately, without waiting for completion. This allows the CPU to prepare the next operation while the GPU is still computing.
import torch
x = torch.randn(10000, 10000, device='cuda')
y = x @ x # returns immediately!
# GPU may still be computing y at this point
This means naive Python timing (time.time()) measures only the
submission time, not the execution time.
Definition: torch.cuda.synchronize()
torch.cuda.synchronize() blocks the CPU thread until all
previously submitted GPU operations complete. This is the
correct way to time GPU code:
import torch
import time
x = torch.randn(10000, 10000, device='cuda')
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize() # wait for GPU to finish
t1 = time.perf_counter()
print(f"MatMul time: {t1 - t0:.4f} s")
Without synchronize: you measure ~0.0001s (submission only). With synchronize: you measure ~0.05s (actual GPU computation).
CuPy has the equivalent cp.cuda.Stream.null.synchronize() or
cp.cuda.Device().synchronize().
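The synchronize-before-and-after pattern is the same for any array library; only the synchronization call changes. A minimal sketch (the `timed` helper and its `sync` argument are illustrative, not a library API); with NumPy the sync step is a no-op, with CuPy you would pass the device synchronize:

```python
import time

import numpy as np

def timed(fn, sync=lambda: None):
    """Time fn(), synchronizing before and after so that queued
    device work is included in the measurement."""
    sync()                          # drain any pending device work
    t0 = time.perf_counter()
    result = fn()
    sync()                          # wait for fn's device work to finish
    return result, time.perf_counter() - t0

# CPU demonstration with NumPy (sync is a no-op on the CPU):
x = np.random.rand(500, 500)
y, seconds = timed(lambda: x @ x)

# With CuPy you would pass the device synchronize instead (not run here):
# import cupy as cp
# xg = cp.random.rand(500, 500)
# yg, seconds = timed(lambda: xg @ xg, sync=cp.cuda.Device().synchronize)
```

The same helper then works unchanged whether the arrays live on the CPU or the GPU.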
Definition: CUDA Events for Precise Timing
CUDA Events provide GPU-side timestamps with microsecond precision, without requiring full synchronization:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = x @ x
end.record()
torch.cuda.synchronize()
print(f"GPU time: {start.elapsed_time(end):.2f} ms")
CUDA Events measure wall-clock time on the GPU itself, avoiding
CPU-GPU synchronization overhead in the measurement. This is
more accurate than time.perf_counter() with synchronize().
Definition: torch.profiler for Detailed Analysis
torch.profiler captures detailed GPU activity including kernel
execution times, memory operations, and CPU-GPU synchronization:
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("my_matmul"):
        y = x @ x

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
The output shows each operation's CPU time, CUDA time, memory allocation, and call count. This reveals whether time is spent in compute, memory allocation, or synchronization.
Definition: nvidia-smi and nvitop for System Monitoring
nvidia-smi is NVIDIA's command-line tool for monitoring GPU status:
# Snapshot
nvidia-smi
# Continuous monitoring (every 1 second)
nvidia-smi dmon -s pucvmet -d 1
Key metrics:
- GPU Util: percentage of time at least one kernel was running
- Memory Util: percentage of time the memory controller was busy (distinct from the amount of memory in use)
- Temperature: GPU die temperature (most cards throttle above ~83 °C)
- Power: current power draw versus the TDP limit
nvitop is a modern, interactive replacement:
pip install nvitop
nvitop
It provides an htop-like interface showing per-process GPU
usage, memory, and real-time utilization graphs.
Theorem: GPU Warmup Is Required for Accurate Benchmarks
The first execution of a GPU operation is significantly slower than subsequent executions due to:
- CUDA context initialization (first kernel: 100-500 ms)
- Just-in-time (JIT) compilation of PTX to SASS
- Memory allocation and caching setup
A valid GPU benchmark must include at least one warmup iteration that is excluded from timing. The measured time should be the median of repetitions.
The CUDA driver lazily initializes resources. The first kernel launch triggers context creation, module loading, and memory pool initialization. These one-time costs inflate the first measurement by 10-100x.
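The warmup-then-median recipe can be packaged as a small device-agnostic helper. This is a sketch (the `bench` name and signature are invented here); on the CPU the `sync` argument is a no-op, and on a GPU you would pass `torch.cuda.synchronize`:

```python
import time
import statistics

def bench(fn, warmup=3, reps=10, sync=lambda: None):
    """Return median and stdev (in ms) of reps timed calls,
    after warmup untimed calls that absorb one-time costs."""
    for _ in range(warmup):         # absorbs context/JIT/cache setup
        fn()
    sync()
    samples = []
    for _ in range(reps):
        sync()                      # clean start for each sample
        t0 = time.perf_counter()
        fn()
        sync()                      # include all submitted device work
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples), statistics.stdev(samples)

# CPU demonstration; on a GPU: bench(lambda: x @ x, sync=torch.cuda.synchronize)
median_ms, std_ms = bench(lambda: sum(i * i for i in range(10_000)))
```

Reporting the median rather than the mean keeps a single slow outlier (e.g. a clock-frequency transition) from skewing the result.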
Example: Correct GPU Benchmarking Pattern
Write a robust GPU benchmark that correctly accounts for asynchronous execution, warmup, and statistical variation.
The wrong way
import time, torch
x = torch.randn(5000, 5000, device='cuda')
t0 = time.time()
y = x @ x
print(f"Time: {time.time() - t0:.4f}s") # WRONG: ~0.0002s
This measures only the kernel launch time, not computation.
The right way
import torch
import time
x = torch.randn(5000, 5000, device='cuda')
# Warmup (excluded from timing)
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()
# Timed runs
times = []
for _ in range(10):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    y = x @ x
    torch.cuda.synchronize()
    times.append(time.perf_counter() - t0)
import numpy as np
print(f"Median: {np.median(times)*1000:.2f} ms")
print(f"Std: {np.std(times)*1000:.2f} ms")
Even better: CUDA Events
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
times = []
for _ in range(10):
    start.record()
    y = x @ x
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))
print(f"Median: {np.median(times):.2f} ms")
Example: Using torch.profiler to Find Bottlenecks
Profile a neural network forward pass to identify whether the bottleneck is compute, memory allocation, or data transfer.
Set up profiling
import torch
from torch.profiler import profile, ProfilerActivity
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
).cuda()
x = torch.randn(256, 1024, device='cuda')
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        y = model(x)

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10))
Export to TensorBoard
prof.export_chrome_trace("trace.json")
# Or for TensorBoard:
# In profiler setup, add:
# on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
# Then: tensorboard --logdir=./log
Interpret the output
Key columns to examine:
- Self CUDA: time spent in this operation itself on the GPU
- CUDA Mem: GPU memory allocated/freed
- # of Calls: number of invocations
If aten::mm (matrix multiply) dominates: compute-bound.
If aten::copy_ dominates: transfer-bound.
If cudaMalloc appears: memory allocation overhead.
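As a toy illustration of those three rules, here is a hypothetical classifier over op-name/time pairs such as you might read off the profiler table (the function and its return strings are invented for this sketch):

```python
def classify_bottleneck(op_times):
    """op_times: dict mapping op name -> total CUDA time (ms).
    Returns a rough diagnosis based on the dominant operation."""
    if not op_times:
        return "no data"
    dominant = max(op_times, key=op_times.get)
    if "cudaMalloc" in op_times:
        # any allocator activity in steady state is a red flag
        return "memory allocation overhead (allocator not caching)"
    if dominant.startswith(("aten::mm", "aten::addmm")):
        return "compute-bound (matrix multiplies dominate)"
    if dominant.startswith("aten::copy_"):
        return "transfer-bound (copies dominate)"
    return f"dominated by {dominant}"

print(classify_bottleneck({"aten::mm": 41.2, "aten::relu": 3.1}))
```

In practice you would read these numbers from `prof.key_averages()` rather than hand-building the dict.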
Example: Real-Time GPU Monitoring During Training
Monitor GPU utilization, memory, and temperature during a long training run to ensure the GPU is fully utilized.
Basic monitoring
# In a separate terminal:
watch -n 1 nvidia-smi
Look for:
- GPU-Util > 90%: GPU is well-utilized
- GPU-Util < 50%: data pipeline bottleneck (CPU/IO)
- Memory Usage near 100%: reduce batch size
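Those rules of thumb can be encoded as a tiny helper for scripted monitoring (a hypothetical `diagnose` function; the thresholds are the ones quoted above):

```python
def diagnose(gpu_util_pct, mem_used_pct):
    """Map utilization readings to the rule-of-thumb diagnoses above."""
    notes = []
    if gpu_util_pct > 90:
        notes.append("GPU is well-utilized")
    elif gpu_util_pct < 50:
        notes.append("likely data pipeline bottleneck (CPU/IO)")
    if mem_used_pct >= 95:
        notes.append("memory nearly full: consider a smaller batch size")
    return notes or ["no obvious issue"]

print(diagnose(gpu_util_pct=35, mem_used_pct=60))
```

The readings themselves can be obtained by parsing `nvidia-smi` output on a timer in a monitoring script.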
Detailed metrics
nvidia-smi dmon -s pucvmet -d 1
Columns: power, SM utilization, memory utilization, encoder/decoder utilization, temperature, memory clock.
Interactive with nvitop
pip install nvitop
nvitop -m full
Shows per-process GPU memory, compute usage, and historical utilization graphs.
GPU Benchmark Results Explorer
An interactive explorer (not reproduced here) comparing CPU vs GPU execution times across operation types and data sizes, showing the crossover point where the GPU becomes faster.
Quick Check
Why does time.time() give incorrect GPU timing without torch.cuda.synchronize()?
The GPU clock runs at a different speed than the CPU clock
GPU operations are asynchronous; Python returns before the GPU finishes
time.time() cannot access GPU hardware timers
PyTorch caches the result and does not actually compute
CUDA kernels are queued and the Python thread continues immediately. synchronize() blocks until the GPU completes.
Common Mistake: Benchmarking Without Warmup
Mistake:
Timing the first GPU operation and reporting it as representative:
x = torch.randn(5000, 5000, device='cuda')
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
print(f"Time: {time.perf_counter() - t0:.4f}s") # 0.5s (inflated!)
The first run includes CUDA context initialization and JIT compilation.
Correction:
Always run 2-3 warmup iterations before timing:
# Warmup
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()
# Now time
Common Mistake: Excessive Synchronization Kills Throughput
Mistake:
Adding torch.cuda.synchronize() after every operation in
production code to "make sure it works":
for layer in model.layers:
    x = layer(x)
    torch.cuda.synchronize()  # blocks the CPU-GPU pipeline!
Correction:
Use synchronize() only for timing and debugging. In production,
let the CUDA stream pipeline operations freely:
for layer in model.layers:
    x = layer(x)
# synchronize only when you need the result on the CPU
torch.cuda.synchronize()
result = x.cpu()
Key Takeaway
Always use torch.cuda.synchronize() before and after timed
regions. Always warm up the GPU before benchmarking. Use
torch.profiler for operation-level analysis and nvidia-smi/nvitop
for system-level monitoring. GPU profiling is not optional: it is
the difference between thinking your code is fast and knowing it.
CUDA Synchronize
A blocking call (torch.cuda.synchronize()) that waits until all previously submitted GPU operations complete, required for accurate timing.
Related: Asynchronous Execution
Asynchronous Execution
The GPU execution model where the CPU submits work and returns immediately, allowing CPU and GPU to work concurrently on different tasks.
Related: CUDA Synchronize