Profiling GPU Code

If You Are Not Profiling, You Are Guessing

GPU code has so many potential bottlenecks (kernel launch overhead, memory transfers, low occupancy, uncoalesced access, warp divergence) that intuition alone is unreliable. Profiling reveals exactly where time is spent and which bottleneck to attack first.

This section covers profiling at three levels:

  1. Manual timing with torch.cuda.synchronize()
  2. Python-level with torch.profiler and TensorBoard
  3. System-level with nvidia-smi and nvitop

Definition:

Asynchronous GPU Execution

GPU operations in PyTorch and CuPy are asynchronous: the Python thread submits work to the GPU and returns immediately, without waiting for completion. This allows the CPU to prepare the next operation while the GPU is still computing.

import torch

x = torch.randn(10000, 10000, device='cuda')
y = x @ x  # returns immediately!
# GPU may still be computing y at this point

This means naive Python timing (time.time()) measures only the submission time, not the execution time.

T_{\text{measured}} = T_{\text{submission}} \ll T_{\text{actual}}
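A quick illustration of the pitfall; the printed value is illustrative and will vary by GPU:

import time
import torch

x = torch.randn(10000, 10000, device='cuda')

t0 = time.perf_counter()
y = x @ x                                # work is enqueued, not finished
t1 = time.perf_counter()
print(f"Naive timing: {t1 - t0:.6f} s")  # ~1e-4 s: submission only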

Definition:

torch.cuda.synchronize()

torch.cuda.synchronize() blocks the CPU thread until all previously submitted GPU operations complete. This is the correct way to time GPU code:

import torch
import time

x = torch.randn(10000, 10000, device='cuda')

torch.cuda.synchronize()
t0 = time.perf_counter()

y = x @ x

torch.cuda.synchronize()  # wait for GPU to finish
t1 = time.perf_counter()

print(f"MatMul time: {t1 - t0:.4f} s")

Without synchronize: you measure ~0.0001s (submission only). With synchronize: you measure ~0.05s (actual GPU computation).

CuPy provides the equivalent calls cp.cuda.Stream.null.synchronize() and cp.cuda.Device().synchronize().
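A minimal CuPy timing sketch following the same pattern (sizes are illustrative):

import time
import cupy as cp

x = cp.random.randn(10000, 10000, dtype=cp.float32)

cp.cuda.Device().synchronize()
t0 = time.perf_counter()
y = x @ x
cp.cuda.Device().synchronize()  # wait for the GPU to finish
t1 = time.perf_counter()
print(f"CuPy MatMul time: {t1 - t0:.4f} s")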

Definition:

CUDA Events for Precise Timing

CUDA Events provide GPU-side timestamps with microsecond precision, without requiring full synchronization:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x @ x
end.record()

torch.cuda.synchronize()
print(f"GPU time: {start.elapsed_time(end):.2f} ms")

CUDA Events measure wall-clock time on the GPU itself, avoiding CPU-GPU synchronization overhead in the measurement. This is more accurate than time.perf_counter() with synchronize().
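Because events are recorded into the CUDA stream, several segments can be timed in one pass with a single synchronization at the end. A sketch reusing x from above (the stages are illustrative):

e0 = torch.cuda.Event(enable_timing=True)
e1 = torch.cuda.Event(enable_timing=True)
e2 = torch.cuda.Event(enable_timing=True)

e0.record()
y = x @ x            # stage 1: matmul
e1.record()
z = torch.relu(y)    # stage 2: activation
e2.record()

torch.cuda.synchronize()  # one sync at the end, after all events
print(f"matmul: {e0.elapsed_time(e1):.2f} ms, relu: {e1.elapsed_time(e2):.2f} ms")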

Definition:

torch.profiler for Detailed Analysis

torch.profiler captures detailed GPU activity including kernel execution times, memory operations, and CPU-GPU synchronization:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("my_matmul"):
        y = x @ x

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

The output shows each operation's CPU time, CUDA time, memory allocation, and call count. This reveals whether time is spent in compute, memory allocation, or synchronization.
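The same profiler object can also export a timeline for visual inspection; the file name below is illustrative:

# View in chrome://tracing or https://ui.perfetto.dev
prof.export_chrome_trace("matmul_trace.json")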

Definition:

nvidia-smi and nvitop for System Monitoring

nvidia-smi is NVIDIA's command-line tool for monitoring GPU status:

# Snapshot
nvidia-smi

# Continuous monitoring (every 1 second)
nvidia-smi dmon -s pucvmet -d 1

Key metrics:

  • GPU Util: percentage of the sample period during which at least one kernel was running
  • Memory Util: percentage of the sample period during which the memory controller was busy (the Memory-Usage column shows allocated memory, a different quantity)
  • Temperature: GPU die temperature (most GPUs begin throttling around 83 °C)
  • Power: current power draw versus the TDP limit
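These metrics can also be sampled in machine-readable form with --query-gpu, which is useful for logging during long runs:

# CSV output, refreshed every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw \
           --format=csv -l 1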

nvitop is a modern, interactive replacement:

pip install nvitop
nvitop

It provides an htop-like interface showing per-process GPU usage, memory consumption, and real-time utilization graphs.

Theorem: GPU Warmup Is Required for Accurate Benchmarks

The first execution of a GPU operation is significantly slower than subsequent executions due to:

  1. CUDA context initialization (first kernel: 100-500 ms)
  2. Just-in-time (JIT) compilation of PTX to SASS
  3. Memory allocation and caching setup

A valid GPU benchmark must include at least one warmup iteration that is excluded from timing. The measured time should be the median of k ≥ 5 repetitions.

The CUDA driver lazily initializes resources. The first kernel launch triggers context creation, module loading, and memory pool initialization. These one-time costs inflate the first measurement by 10-100x.

Example: Correct GPU Benchmarking Pattern

Write a robust GPU benchmark that correctly accounts for asynchronous execution, warmup, and statistical variation.
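A minimal sketch of such a benchmark, combining warmup, CUDA events, and the median of repeated runs; the helper name and defaults are illustrative:

import statistics
import torch

def bench_gpu(fn, warmup=3, reps=10):
    """Median GPU time of fn() in milliseconds, after warmup iterations."""
    for _ in range(warmup):           # absorb context init and JIT costs
        fn()
    torch.cuda.synchronize()
    times = []
    for _ in range(reps):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()      # ensure 'end' has been reached
        times.append(start.elapsed_time(end))
    return statistics.median(times)

x = torch.randn(4096, 4096, device='cuda')
print(f"MatMul: {bench_gpu(lambda: x @ x):.2f} ms")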

Example: Using torch.profiler to Find Bottlenecks

Profile a neural network forward pass to identify whether the bottleneck is compute, memory allocation, or data transfer.
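A sketch of this workflow on a toy model (the architecture and sizes are illustrative):

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
x = torch.randn(256, 1024, device='cuda')

# Warm up before profiling, for the same reasons as before benchmarking
for _ in range(3):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(x)

# Sort by GPU time to see which operations dominate
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))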

Example: Real-Time GPU Monitoring During Training

Monitor GPU utilization, memory, and temperature during a long training run to ensure the GPU is fully utilized.
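One way to approach this: run nvitop (or the nvidia-smi query shown earlier) in a second terminal for utilization and temperature, and log PyTorch's allocator statistics from inside the loop. A sketch with a toy training loop (model, sizes, and logging interval are illustrative):

import torch

model = torch.nn.Linear(4096, 4096).cuda()   # toy workload
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(500):
    x = torch.randn(1024, 4096, device='cuda')
    loss = model(x).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        # Caching-allocator statistics, reported in GiB
        alloc = torch.cuda.memory_allocated() / 2**30
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"step {step}: {alloc:.2f} GiB allocated, {peak:.2f} GiB peak")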

GPU Benchmark Results Explorer

Compare CPU vs GPU execution times across different operation types and data sizes, showing the crossover point where GPU becomes faster.

Quick Check

Why does time.time() give incorrect GPU timing without torch.cuda.synchronize()?

  • The GPU clock runs at a different speed than the CPU clock
  • GPU operations are asynchronous: Python returns before the GPU finishes (correct)
  • time.time() cannot access GPU hardware timers
  • PyTorch caches the result and does not actually compute

Common Mistake: Benchmarking Without Warmup

Mistake:

Timing the first GPU operation and reporting it as representative:

x = torch.randn(5000, 5000, device='cuda')
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
print(f"Time: {time.perf_counter() - t0:.4f}s")  # 0.5s (inflated!)

The first run includes CUDA context initialization and JIT compilation.

Correction:

Always run 2-3 warmup iterations before timing:

# Warmup: absorb context initialization and JIT compilation
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()

# Now time the steady-state execution
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
print(f"Time: {time.perf_counter() - t0:.4f}s")  # steady-state, not inflated

Common Mistake: Excessive Synchronization Kills Throughput

Mistake:

Adding torch.cuda.synchronize() after every operation in production code to "make sure it works":

for layer in model.layers:
    x = layer(x)
    torch.cuda.synchronize()  # blocks CPU-GPU pipeline!

Correction:

Use synchronize() only for timing and debugging. In production, let the CUDA stream pipeline operations freely:

for layer in model.layers:
    x = layer(x)
# synchronize only when you need the result on CPU
torch.cuda.synchronize()
result = x.cpu()

Key Takeaway

Always use torch.cuda.synchronize() before and after timed regions. Always warm up the GPU before benchmarking. Use torch.profiler for operation-level analysis and nvidia-smi/nvitop for system-level monitoring. GPU profiling is not optional: it is the difference between thinking your code is fast and knowing it.

CUDA Synchronize

A blocking call (torch.cuda.synchronize()) that waits until all previously submitted GPU operations complete, required for accurate timing.

Related: Asynchronous Execution

Asynchronous Execution

The GPU execution model where the CPU submits work and returns immediately, allowing CPU and GPU to work concurrently on different tasks.

Related: CUDA Synchronize