Profiling GPU Code
If You Are Not Profiling, You Are Guessing
GPU code has so many potential bottlenecks (kernel launch overhead, memory transfers, low occupancy, uncoalesced access, warp divergence) that intuition alone is unreliable. Profiling reveals exactly where time is spent and which bottleneck to attack first.
This section covers profiling at three levels:
- Manual timing with torch.cuda.synchronize()
- Python-level profiling with torch.profiler and TensorBoard
- System-level monitoring with nvidia-smi and nvitop
Definition: Asynchronous GPU Execution
GPU operations in PyTorch and CuPy are asynchronous: the Python thread submits work to the GPU and returns immediately, without waiting for completion. This allows the CPU to prepare the next operation while the GPU is still computing.
import torch
x = torch.randn(10000, 10000, device='cuda')
y = x @ x # returns immediately!
# GPU may still be computing y at this point
This means naive Python timing (time.time()) measures only the
submission time, not the execution time.
Definition: torch.cuda.synchronize()
torch.cuda.synchronize() blocks the CPU thread until all
previously submitted GPU operations complete. This is the
correct way to time GPU code:
import torch
import time
x = torch.randn(10000, 10000, device='cuda')
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize() # wait for GPU to finish
t1 = time.perf_counter()
print(f"MatMul time: {t1 - t0:.4f} s")
Without synchronize: you measure ~0.0001s (submission only). With synchronize: you measure ~0.05s (actual GPU computation).
CuPy has the equivalent cp.cuda.Stream.null.synchronize() or
cp.cuda.Device().synchronize().
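The synchronize-before-and-after pattern is the same for any array library; only the synchronization call changes. A minimal sketch (the `timed` helper and its `sync` argument are illustrative, not a library API); with NumPy the sync step is a no-op, with CuPy you would pass the device synchronize:

```python
import time

import numpy as np

def timed(fn, sync=lambda: None):
    """Time fn(), synchronizing before and after so that queued
    device work is included in the measurement."""
    sync()                          # drain any pending device work
    t0 = time.perf_counter()
    result = fn()
    sync()                          # wait for fn's device work to finish
    return result, time.perf_counter() - t0

# CPU demonstration with NumPy (sync is a no-op on the CPU):
x = np.random.rand(500, 500)
y, seconds = timed(lambda: x @ x)

# With CuPy you would pass the device synchronize instead (not run here):
# import cupy as cp
# xg = cp.random.rand(500, 500)
# yg, seconds = timed(lambda: xg @ xg, sync=cp.cuda.Device().synchronize)
```

The same helper then works unchanged whether the arrays live on the CPU or the GPU.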
Definition: CUDA Events for Precise Timing
CUDA Events provide GPU-side timestamps with microsecond precision, without requiring full synchronization:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = x @ x
end.record()
torch.cuda.synchronize()
print(f"GPU time: {start.elapsed_time(end):.2f} ms")
CUDA Events measure wall-clock time on the GPU itself, avoiding
CPU-GPU synchronization overhead in the measurement. This is
more accurate than time.perf_counter() with synchronize().
Definition: torch.profiler for Detailed Analysis
torch.profiler captures detailed GPU activity including kernel
execution times, memory operations, and CPU-GPU synchronization:
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("my_matmul"):
        y = x @ x

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
The output shows each operation's CPU time, CUDA time, memory allocation, and call count. This reveals whether time is spent in compute, memory allocation, or synchronization.
Definition: nvidia-smi and nvitop for System Monitoring
nvidia-smi is NVIDIA's command-line tool for monitoring GPU status:
# Snapshot
nvidia-smi
# Continuous monitoring (every 1 second)
nvidia-smi dmon -s pucvmet -d 1
Key metrics:
- GPU Util: percentage of time at least one kernel was running
- Memory Util: percentage of time the memory controller was busy (distinct from the amount of memory in use)
- Temperature: GPU die temperature (most cards throttle above ~83 °C)
- Power: current power draw versus the TDP limit
nvitop is a modern, interactive replacement:
pip install nvitop
nvitop
It provides an htop-like interface showing per-process GPU
usage, memory, and real-time utilization graphs.
Theorem: GPU Warmup Is Required for Accurate Benchmarks
The first execution of a GPU operation is significantly slower than subsequent executions due to:
- CUDA context initialization (first kernel: 100-500 ms)
- Just-in-time (JIT) compilation of PTX to SASS
- Memory allocation and caching setup
A valid GPU benchmark must include at least one warmup iteration that is excluded from timing. The measured time should be the median of repetitions.
The CUDA driver lazily initializes resources. The first kernel launch triggers context creation, module loading, and memory pool initialization. These one-time costs inflate the first measurement by 10-100x.
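The warmup-then-median recipe can be packaged as a small device-agnostic helper. This is a sketch (the `bench` name and signature are invented here); on the CPU the `sync` argument is a no-op, and on a GPU you would pass `torch.cuda.synchronize`:

```python
import time
import statistics

def bench(fn, warmup=3, reps=10, sync=lambda: None):
    """Return median and stdev (in ms) of reps timed calls,
    after warmup untimed calls that absorb one-time costs."""
    for _ in range(warmup):         # absorbs context/JIT/cache setup
        fn()
    sync()
    samples = []
    for _ in range(reps):
        sync()                      # clean start for each sample
        t0 = time.perf_counter()
        fn()
        sync()                      # include all submitted device work
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples), statistics.stdev(samples)

# CPU demonstration; on a GPU: bench(lambda: x @ x, sync=torch.cuda.synchronize)
median_ms, std_ms = bench(lambda: sum(i * i for i in range(10_000)))
```

Reporting the median rather than the mean keeps a single slow outlier (e.g. a clock-frequency transition) from skewing the result.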
Example: Correct GPU Benchmarking Pattern
Write a robust GPU benchmark that correctly accounts for asynchronous execution, warmup, and statistical variation.
The wrong way
import time, torch
x = torch.randn(5000, 5000, device='cuda')
t0 = time.time()
y = x @ x
print(f"Time: {time.time() - t0:.4f}s") # WRONG: ~0.0002s
This measures only the kernel launch time, not computation.
The right way
import torch
import time
x = torch.randn(5000, 5000, device='cuda')
# Warmup (excluded from timing)
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()
# Timed runs
times = []
for _ in range(10):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    y = x @ x
    torch.cuda.synchronize()
    times.append(time.perf_counter() - t0)
import numpy as np
print(f"Median: {np.median(times)*1000:.2f} ms")
print(f"Std: {np.std(times)*1000:.2f} ms")
Even better: CUDA Events
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
times = []
for _ in range(10):
    start.record()
    y = x @ x
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))
print(f"Median: {np.median(times):.2f} ms")
Example: Using torch.profiler to Find Bottlenecks
Profile a neural network forward pass to identify whether the bottleneck is compute, memory allocation, or data transfer.
Set up profiling
import torch
from torch.profiler import profile, ProfilerActivity
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
).cuda()
x = torch.randn(256, 1024, device='cuda')
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        y = model(x)

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10))
Export to TensorBoard
prof.export_chrome_trace("trace.json")
# Or for TensorBoard:
# In profiler setup, add:
# on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
# Then: tensorboard --logdir=./log
Interpret the output
Key columns to examine:
- Self CUDA: time spent in this operation itself on the GPU
- CUDA Mem: GPU memory allocated/freed
- # of Calls: number of invocations
If aten::mm (matrix multiply) dominates: compute-bound.
If aten::copy_ dominates: transfer-bound.
If cudaMalloc appears: memory allocation overhead.
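As a toy illustration of those three rules, here is a hypothetical classifier over op-name/time pairs such as you might read off the profiler table (the function and its return strings are invented for this sketch):

```python
def classify_bottleneck(op_times):
    """op_times: dict mapping op name -> total CUDA time (ms).
    Returns a rough diagnosis based on the dominant operation."""
    if not op_times:
        return "no data"
    dominant = max(op_times, key=op_times.get)
    if "cudaMalloc" in op_times:
        # any allocator activity in steady state is a red flag
        return "memory allocation overhead (allocator not caching)"
    if dominant.startswith(("aten::mm", "aten::addmm")):
        return "compute-bound (matrix multiplies dominate)"
    if dominant.startswith("aten::copy_"):
        return "transfer-bound (copies dominate)"
    return f"dominated by {dominant}"

print(classify_bottleneck({"aten::mm": 41.2, "aten::relu": 3.1}))
```

In practice you would read these numbers from `prof.key_averages()` rather than hand-building the dict.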
Example: Real-Time GPU Monitoring During Training
Monitor GPU utilization, memory, and temperature during a long training run to ensure the GPU is fully utilized.
Basic monitoring
# In a separate terminal:
watch -n 1 nvidia-smi
Look for:
- GPU-Util > 90%: GPU is well-utilized
- GPU-Util < 50%: data pipeline bottleneck (CPU/IO)
- Memory Usage near 100%: reduce batch size
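Those rules of thumb can be encoded as a tiny helper for scripted monitoring (a hypothetical `diagnose` function; the thresholds are the ones quoted above):

```python
def diagnose(gpu_util_pct, mem_used_pct):
    """Map utilization readings to the rule-of-thumb diagnoses above."""
    notes = []
    if gpu_util_pct > 90:
        notes.append("GPU is well-utilized")
    elif gpu_util_pct < 50:
        notes.append("likely data pipeline bottleneck (CPU/IO)")
    if mem_used_pct >= 95:
        notes.append("memory nearly full: consider a smaller batch size")
    return notes or ["no obvious issue"]

print(diagnose(gpu_util_pct=35, mem_used_pct=60))
```

The readings themselves can be obtained by parsing `nvidia-smi` output on a timer in a monitoring script.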
Detailed metrics
nvidia-smi dmon -s pucvmet -d 1
Columns: power, SM utilization, memory utilization, encoder/decoder utilization, temperature, memory clock.
Interactive with nvitop
pip install nvitop
nvitop -m full
Shows per-process GPU memory, compute usage, and historical utilization graphs.
GPU Benchmark Results Explorer
An interactive explorer (not reproduced here) comparing CPU vs GPU execution times across operation types and data sizes, showing the crossover point where the GPU becomes faster.
Quick Check
Why does time.time() give incorrect GPU timing without torch.cuda.synchronize()?
The GPU clock runs at a different speed than the CPU clock
GPU operations are asynchronous; Python returns before the GPU finishes
time.time() cannot access GPU hardware timers
PyTorch caches the result and does not actually compute
CUDA kernels are queued and the Python thread continues immediately. synchronize() blocks until the GPU completes.
Common Mistake: Benchmarking Without Warmup
Mistake:
Timing the first GPU operation and reporting it as representative:
x = torch.randn(5000, 5000, device='cuda')
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
print(f"Time: {time.perf_counter() - t0:.4f}s") # 0.5s (inflated!)
The first run includes CUDA context initialization and JIT compilation.
Correction:
Always run 2-3 warmup iterations before timing:
# Warmup
for _ in range(3):
    _ = x @ x
torch.cuda.synchronize()
# Now time
Common Mistake: Excessive Synchronization Kills Throughput
Mistake:
Adding torch.cuda.synchronize() after every operation in
production code to "make sure it works":
for layer in model.layers:
    x = layer(x)
    torch.cuda.synchronize()  # blocks the CPU-GPU pipeline!
Correction:
Use synchronize() only for timing and debugging. In production,
let the CUDA stream pipeline operations freely:
for layer in model.layers:
    x = layer(x)
# synchronize only when you need the result on the CPU
torch.cuda.synchronize()
result = x.cpu()
Key Takeaway
Always use torch.cuda.synchronize() before and after timed
regions. Always warm up the GPU before benchmarking. Use
torch.profiler for operation-level analysis and nvidia-smi/nvitop
for system-level monitoring. GPU profiling is not optional: it is
the difference between thinking your code is fast and knowing it.
CUDA Synchronize
A blocking call (torch.cuda.synchronize()) that waits until all previously submitted GPU operations complete, required for accurate timing.
Related: Asynchronous Execution
Asynchronous Execution
The GPU execution model where the CPU submits work and returns immediately, allowing CPU and GPU to work concurrently on different tasks.
Related: CUDA Synchronize