Data Loading and Streaming
The Data Loading Bottleneck
A fast GPU can process a batch in 10 ms, but loading that batch from
disk can take 50-200 ms. Without proper data loading, the GPU sits
idle 80-95% of the time. PyTorch's DataLoader solves this by
overlapping data loading with GPU computation using multiple worker
processes, prefetching, and pinned memory transfers.
This section covers the Dataset/DataLoader API, performance
tuning, and streaming patterns for datasets too large to fit in memory.
Definition: PyTorch Dataset Interface
A torch.utils.data.Dataset implements two methods:
class MyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return N  # total number of samples

    def __getitem__(self, idx):
        # Load and return the sample at index idx
        return x, y  # input tensor, label
Map-style datasets support random access via __getitem__.
Iterable-style datasets (IterableDataset) support only
sequential access, useful for streaming data from the network or
generating samples on-the-fly.
Keep __getitem__ lightweight: load from memory-mapped files
(e.g., HDF5, NumPy .npy) rather than parsing text or CSV on
each call.
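For the iterable-style case, a minimal sketch looks like the following (the file path and comma-separated text format are illustrative assumptions, not part of the API):

import torch
from torch.utils.data import IterableDataset

class StreamingDataset(IterableDataset):
    """Streams (x, y) pairs sequentially from a text file; no random access."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                values = [float(v) for v in line.strip().split(',')]
                x = torch.tensor(values[:-1])      # features
                y = torch.tensor(int(values[-1]))  # label
                yield x, y

Note that with num_workers > 0 each worker process receives its own copy of the iterator, so the stream must be sharded per worker (for example using torch.utils.data.get_worker_info()) to avoid duplicate samples.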
Definition: DataLoader: Parallel Data Loading Pipeline
DataLoader wraps a Dataset with batching, shuffling, and
parallel loading:
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # parallel loading processes
    pin_memory=True,          # page-locked CPU memory for faster transfer
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # keep workers alive between epochs
)

for batch_x, batch_y in loader:
    batch_x = batch_x.to('cuda', non_blocking=True)
    # GPU computation overlaps with next batch loading
The optimal num_workers depends on your system. A good starting
point is the number of CPU cores divided by the number of GPUs.
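In code, that starting point might look like the snippet below (a heuristic only, not an exact formula; torch.cuda.device_count() is used here just to count GPUs):

import os
import torch

# Heuristic starting point: CPU cores per GPU.
# Tune up or down from here based on measured throughput.
num_gpus = max(torch.cuda.device_count(), 1)
num_workers = (os.cpu_count() or 1) // num_gpus
print(f"starting num_workers: {num_workers}")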
Definition: Pinned (Page-Locked) Memory
Pinned memory is CPU memory that is locked in physical RAM (cannot be swapped to disk by the OS). Benefits:
- CPU-to-GPU transfers via DMA are ~2x faster (12 GB/s vs 6 GB/s)
- Enables asynchronous transfers with non_blocking=True
- Required for overlapping data transfer with computation
# DataLoader pins output automatically with pin_memory=True
loader = DataLoader(dataset, pin_memory=True)
# Manual pinning
x_pinned = x.pin_memory()
x_gpu = x_pinned.to('cuda', non_blocking=True)
Do not pin too much memory: pinned memory reduces available RAM for other processes and cannot be swapped, potentially causing system-wide memory pressure.
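To make the overlap concrete, here is a minimal sketch of a manual GPU prefetcher. This is not part of the DataLoader API; the class name and structure are illustrative, and it assumes a CUDA device and a loader created with pin_memory=True:

import torch

class CUDAPrefetcher:
    """Copies the next batch to the GPU on a side stream while the
    current batch is being processed on the default stream."""
    def __init__(self, loader, device='cuda'):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            x, y = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # Asynchronous copies from pinned CPU memory
            self.next_batch = (x.to(self.device, non_blocking=True),
                               y.to(self.device, non_blocking=True))

    def __iter__(self):
        while self.next_batch is not None:
            # Default stream waits until the side-stream copy has finished
            torch.cuda.current_stream().wait_stream(self.stream)
            batch = self.next_batch
            for t in batch:
                # Tell the allocator this tensor is now used on the default stream
                t.record_stream(torch.cuda.current_stream())
            self._preload()  # immediately start copying the next batch
            yield batch

DataLoader's own prefetch_factor hides the disk-to-CPU side of the pipeline; a stream-based prefetcher like this additionally hides the CPU-to-GPU copy (iterate with for x, y in CUDAPrefetcher(loader)).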
Definition: Memory-Mapped File Access
For large datasets that do not fit in RAM, memory-mapped files provide random access without loading the entire file:
import numpy as np
# Save dataset to disk
np.save('data.npy', large_array)
# Memory-map: only loads pages when accessed
data = np.load('data.npy', mmap_mode='r')
sample = data[42] # reads only this row from disk
This is essential for datasets in the 10-100 GB range where loading everything into RAM is impractical.
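Combining this with the Dataset interface from above, a memory-mapped dataset might be sketched as follows (the file names data.npy and labels.npy are illustrative):

import numpy as np
import torch
from torch.utils.data import Dataset

class MmapDataset(Dataset):
    """Memory-mapped dataset: only the accessed rows are read from disk."""
    def __init__(self, data_path, label_path):
        # mmap_mode='r' maps the files without loading them into RAM
        self.x = np.load(data_path, mmap_mode='r')
        self.y = np.load(label_path, mmap_mode='r')

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # np.asarray copies just this row out of the memory map
        x = torch.from_numpy(np.asarray(self.x[idx], dtype=np.float32))
        y = int(self.y[idx])
        return x, y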
Theorem: Data Pipeline Throughput with Prefetching
Let $t_{\text{load}}$ be the time to load one batch, $t_{\text{compute}}$ the GPU computation time, and $W$ the number of DataLoader workers. The effective time per batch is:
$$t_{\text{batch}} = \max\left(t_{\text{compute}},\; \frac{t_{\text{load}}}{W}\right)$$
The pipeline is compute-bound when $t_{\text{load}}/W \le t_{\text{compute}}$ (GPU is the bottleneck) and I/O-bound otherwise.
With $W$ workers loading in parallel, the effective loading time is divided by $W$. Prefetching ensures the next batch is ready before the GPU finishes the current batch. The system runs at the speed of the slower stage.
Without prefetching
Each iteration takes $t_{\text{load}} + t_{\text{compute}}$ (sequential). GPU utilization: $t_{\text{compute}} / (t_{\text{load}} + t_{\text{compute}})$.
With $W$ workers and prefetching
Loading is pipelined: while the GPU processes batch $i$, workers prepare batch $i+1$. Effective loading time per batch is $t_{\text{load}}/W$. Total: $\max(t_{\text{compute}},\; t_{\text{load}}/W)$.
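As a quick illustration, the snippet below evaluates this model for assumed values of 40 ms load time and 5 ms compute time (the same numbers used in the Quick Check later in this section):

# Illustrative values: 40 ms to load a batch, 5 ms of GPU compute.
t_load, t_compute = 0.040, 0.005

for W in [1, 2, 4, 8, 16]:
    t_batch = max(t_compute, t_load / W)  # effective time per batch
    util = t_compute / t_batch            # fraction of time the GPU is busy
    print(f"W={W:2d}: {1.0 / t_batch:7.1f} batches/s, GPU utilization {util:.0%}")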
Example: Tuning DataLoader for Maximum Throughput
Find the optimal num_workers and measure the impact of pin_memory
and persistent_workers on data loading throughput.
Define dataset and benchmark
import torch
from torch.utils.data import Dataset, DataLoader
import time

class SyntheticDataset(Dataset):
    def __init__(self, n, dim):
        self.x = torch.randn(n, dim)
        self.y = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

dataset = SyntheticDataset(10000, 1024)
Sweep num_workers
for nw in [0, 1, 2, 4, 8]:
    loader = DataLoader(
        dataset, batch_size=64, num_workers=nw,
        pin_memory=True, persistent_workers=(nw > 0),
    )
    t0 = time.perf_counter()
    for x, y in loader:
        x = x.to('cuda', non_blocking=True)
    torch.cuda.synchronize()  # wait for async transfers before stopping the timer
    elapsed = time.perf_counter() - t0
    print(f"workers={nw}: {elapsed:.3f}s")
Measure pin_memory effect
for pm in [False, True]:
    loader = DataLoader(
        dataset, batch_size=64, num_workers=4,
        pin_memory=pm, persistent_workers=True,
    )
    t0 = time.perf_counter()
    for x, y in loader:
        x = x.to('cuda', non_blocking=True)
    torch.cuda.synchronize()  # ensure transfers complete before timing
    elapsed = time.perf_counter() - t0
    print(f"pin_memory={pm}: {elapsed:.3f}s")
DataLoader Pipeline Throughput
Simulate data loading pipeline throughput as a function of num_workers, batch size, and the ratio of I/O time to compute time. Observe when the system transitions from I/O-bound to compute-bound.
(Interactive simulation; full code in ch13/python/data_loading.py.)
Quick Check
Your GPU trains each batch in 5 ms, but loading a batch takes 40 ms. How many DataLoader workers do you need to keep the GPU fully utilized?
2 workers
4 workers
8 workers
16 workers
With 8 workers, effective load time is 40/8 = 5 ms = compute time. The pipeline is balanced; adding more workers provides diminishing returns.
Common Mistake: Too Many DataLoader Workers
Mistake:
Setting num_workers=64 on a machine with 16 CPU cores. This causes
excessive context switching, memory overhead (each worker holds a copy
of the dataset), and can actually slow down loading.
Correction:
Start with num_workers = os.cpu_count() // num_gpus and tune from
there. Monitor CPU utilization with htop. Use persistent_workers=True
to avoid worker process creation overhead between epochs.
Key Takeaway
Use pin_memory=True and non_blocking=True together for
asynchronous CPU-to-GPU data transfer. Set num_workers to at least $t_{\text{load}} / t_{\text{compute}}$ to balance
the pipeline. Use persistent_workers=True to avoid process
creation overhead.
Prefetching
Loading the next batch of data while the GPU processes the current batch, hiding I/O latency behind computation.
Pinned Memory
CPU memory pages locked in physical RAM, enabling faster DMA transfers to GPU and asynchronous data movement.