Data Loading and Streaming

The Data Loading Bottleneck

A fast GPU can process a batch in 10 ms, but loading that batch from disk can take 50-200 ms. Without proper data loading, the GPU sits idle 80-95% of the time. PyTorch's DataLoader solves this by overlapping data loading with GPU computation using multiple worker processes, prefetching, and pinned memory transfers.

This section covers the Dataset/DataLoader API, performance tuning, and streaming patterns for datasets too large to fit in memory.

Definition:

PyTorch Dataset Interface

A torch.utils.data.Dataset implements two methods:

import torch

class MyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return N           # total number of samples

    def __getitem__(self, idx):
        # Load and return the sample at index idx
        return x, y        # input tensor, label

Map-style datasets support random access via __getitem__. Iterable-style datasets (IterableDataset) support only sequential access, useful for streaming data from network or generating samples on-the-fly.
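As a minimal sketch of the iterable style (the URL is a hypothetical placeholder, and the synthetic generator stands in for a real network or file stream):

import torch
from torch.utils.data import DataLoader, IterableDataset

class StreamDataset(IterableDataset):
    """Sequential-access dataset: no __len__ or random indexing required."""

    def __init__(self, source_url):
        self.source_url = source_url  # placeholder for a real stream source

    def __iter__(self):
        # A real implementation would read records from the stream. With
        # num_workers > 0, shard the stream per worker here (via
        # torch.utils.data.get_worker_info()) to avoid duplicate samples.
        for i in range(1000):
            yield torch.randn(16), i % 2

loader = DataLoader(StreamDataset("s3://bucket/shard-00"), batch_size=32)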

Keep __getitem__ lightweight: load from memory-mapped files (e.g., HDF5, NumPy .npy) rather than parsing text or CSV on each call.

Definition:

DataLoader: Parallel Data Loading Pipeline

DataLoader wraps a Dataset with batching, shuffling, and parallel loading:

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,       # parallel loading processes
    pin_memory=True,     # page-locked CPU memory for faster transfer
    prefetch_factor=2,   # batches prefetched per worker
    persistent_workers=True,  # keep workers alive between epochs
)

for batch_x, batch_y in loader:
    batch_x = batch_x.to('cuda', non_blocking=True)
    # GPU computation overlaps with next batch loading

The optimal num_workers depends on your system. A good starting point is the number of CPU cores divided by the number of GPUs.
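As a concrete sketch of that heuristic (a starting point to tune from, not a rule):

import os
import torch

# Share CPU cores evenly across GPUs, then adjust while watching
# GPU utilization: idle gaps suggest more workers, CPU thrash fewer.
num_gpus = max(1, torch.cuda.device_count())
num_workers = (os.cpu_count() or 1) // num_gpus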

Definition:

Pinned (Page-Locked) Memory

Pinned memory is CPU memory that is locked in physical RAM (cannot be swapped to disk by the OS). Benefits:

  • CPU-to-GPU transfers via DMA are ~2x faster (e.g., ~12 GB/s vs ~6 GB/s over a PCIe 3.0 x16 link)
  • Enables asynchronous transfers with non_blocking=True
  • Required for overlapping data transfer with computation

# DataLoader pins its output batches automatically with pin_memory=True
loader = DataLoader(dataset, pin_memory=True)

# Manual pinning of an existing CPU tensor
x_pinned = x.pin_memory()
x_gpu = x_pinned.to('cuda', non_blocking=True)

Do not pin too much memory: pinned memory reduces available RAM for other processes and cannot be swapped, potentially causing system-wide memory pressure.

Definition:

Memory-Mapped File Access

For large datasets that do not fit in RAM, memory-mapped files provide random access without loading the entire file:

import numpy as np

# Save dataset to disk
np.save('data.npy', large_array)

# Memory-map: only loads pages when accessed
data = np.load('data.npy', mmap_mode='r')
sample = data[42]   # reads only this row from disk

This is essential for datasets in the 10-100 GB range where loading everything into RAM is impractical.
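Combining this with the Dataset interface keeps __getitem__ lightweight: index directly into the memory map and copy out only the requested row. A minimal sketch (the file names, and the separate labels file, are assumptions for illustration):

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    """Map-style dataset backed by memory-mapped .npy files."""

    def __init__(self, features_path='data.npy', labels_path='labels.npy'):
        # mmap_mode='r' keeps the arrays on disk; pages load lazily on access
        self.features = np.load(features_path, mmap_mode='r')
        self.labels = np.load(labels_path, mmap_mode='r')

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # Copy the row out of the read-only mmap before wrapping in a tensor
        x = torch.from_numpy(np.array(self.features[idx]))
        y = int(self.labels[idx])
        return x, y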

Theorem: Data Pipeline Throughput with Prefetching

Let $T_{\text{load}}$ be the time to load one batch, $T_{\text{compute}}$ the GPU computation time, and $W$ the number of DataLoader workers. The effective time per batch is:

$$T_{\text{batch}} = \max\!\left(\frac{T_{\text{load}}}{W},\; T_{\text{compute}}\right)$$

The pipeline is compute-bound when $T_{\text{load}}/W < T_{\text{compute}}$ (the GPU is the bottleneck) and I/O-bound otherwise.

With WW workers loading in parallel, the effective loading time is divided by WW. Prefetching ensures the next batch is ready before the GPU finishes the current batch. The system runs at the speed of the slower stage.

Example: Tuning DataLoader for Maximum Throughput

Find the optimal num_workers and measure the impact of pin_memory and persistent_workers on data loading throughput.
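A minimal benchmarking sketch along these lines (my_dataset is a placeholder; real numbers depend on your storage, CPU, and batch size):

import time
from torch.utils.data import DataLoader

def measure_throughput(dataset, num_workers, pin_memory=True, batches=100):
    """Return batches/second for one DataLoader configuration."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=num_workers, pin_memory=pin_memory,
                        persistent_workers=num_workers > 0)
    it = iter(loader)
    next(it)  # warm-up: spawn workers and fill prefetch queues
    start = time.perf_counter()
    for _ in range(batches):  # assumes the dataset yields > batches+1 batches
        next(it)
    return batches / (time.perf_counter() - start)

for w in (0, 2, 4, 8):
    print(f"num_workers={w}: {measure_throughput(my_dataset, w):.1f} batches/s")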

DataLoader Pipeline Throughput

Simulate data loading pipeline throughput as a function of num_workers, batch size, and the ratio of I/O time to compute time. Observe when the system transitions from I/O-bound to compute-bound.
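The model from the theorem above can be simulated in a few lines (the 60 ms / 10 ms timings are illustrative assumptions):

# T_batch = max(T_load / W, T_compute): the slower stage sets the pace
t_load, t_compute = 60.0, 10.0   # ms per batch (illustrative)

for w in (1, 2, 4, 8, 16):
    t_batch = max(t_load / w, t_compute)
    regime = "I/O-bound" if t_load / w > t_compute else "compute-bound"
    print(f"W={w:2d}: {t_batch:5.1f} ms/batch  ({regime})")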

The full runnable code for this section (Dataset implementation, DataLoader tuning, pin_memory, and memory-mapped data) is in ch13/python/data_loading.py.

Quick Check

Your GPU trains each batch in 5 ms, but loading a batch takes 40 ms. How many DataLoader workers do you need to keep the GPU fully utilized?

  • 2 workers
  • 4 workers
  • 8 workers
  • 16 workers

Common Mistake: Too Many DataLoader Workers

Mistake:

Setting num_workers=64 on a machine with 16 CPU cores. This causes excessive context switching, memory overhead (each worker holds a copy of the dataset), and can actually slow down loading.

Correction:

Start with num_workers = os.cpu_count() // num_gpus and tune from there. Monitor CPU utilization with htop. Use persistent_workers=True to avoid worker process creation overhead between epochs.

Key Takeaway

Use pin_memory=True and non_blocking=True together for asynchronous CPU-to-GPU data transfer. Set num_workers to $\lceil T_{\text{load}} / T_{\text{compute}} \rceil$ to balance the pipeline. Use persistent_workers=True to avoid process creation overhead.

Prefetching

Loading the next batch of data while the GPU processes the current batch, hiding I/O latency behind computation.

Pinned Memory

CPU memory pages locked in physical RAM, enabling faster DMA transfers to GPU and asynchronous data movement.
