Data Loading and Streaming
The Data Loading Bottleneck
A fast GPU can process a batch in 10 ms, but loading that batch from
disk can take 50-200 ms. Without proper data loading, the GPU sits
idle 80-95% of the time. PyTorch's DataLoader solves this by
overlapping data loading with GPU computation using multiple worker
processes, prefetching, and pinned memory transfers.
This section covers the Dataset/DataLoader API, performance
tuning, and streaming patterns for datasets too large to fit in memory.
Definition: PyTorch Dataset Interface
A torch.utils.data.Dataset implements two methods:
class MyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return N  # total number of samples

    def __getitem__(self, idx):
        # Load and return the sample at index idx
        return x, y  # input tensor, label
Map-style datasets support random access via __getitem__.
Iterable-style datasets (IterableDataset) support only
sequential access, useful for streaming data from the network or
generating samples on-the-fly.
Keep __getitem__ lightweight: load from memory-mapped files
(e.g., HDF5, NumPy .npy) rather than parsing text or CSV on
each call.
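For the iterable-style case, a minimal sketch looks like the following (the file path and comma-separated text format are illustrative assumptions, not part of the API):

import torch
from torch.utils.data import IterableDataset

class StreamingDataset(IterableDataset):
    """Streams (x, y) pairs sequentially from a text file; no random access."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                values = [float(v) for v in line.strip().split(',')]
                x = torch.tensor(values[:-1])      # features
                y = torch.tensor(int(values[-1]))  # label
                yield x, y

Note that with num_workers > 0 each worker process receives its own copy of the iterator, so the stream must be sharded per worker (for example using torch.utils.data.get_worker_info()) to avoid duplicate samples.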
Definition: DataLoader: Parallel Data Loading Pipeline
DataLoader wraps a Dataset with batching, shuffling, and
parallel loading:
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # parallel loading processes
    pin_memory=True,          # page-locked CPU memory for faster transfer
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # keep workers alive between epochs
)

for batch_x, batch_y in loader:
    batch_x = batch_x.to('cuda', non_blocking=True)
    # GPU computation overlaps with next batch loading
The optimal num_workers depends on your system. A good starting
point is the number of CPU cores divided by the number of GPUs.
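In code, that starting point might look like the snippet below (a heuristic only, not an exact formula; torch.cuda.device_count() is used here just to count GPUs):

import os
import torch

# Heuristic starting point: CPU cores per GPU.
# Tune up or down from here based on measured throughput.
num_gpus = max(torch.cuda.device_count(), 1)
num_workers = (os.cpu_count() or 1) // num_gpus
print(f"starting num_workers: {num_workers}")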
Definition: Pinned (Page-Locked) Memory
Pinned memory is CPU memory that is locked in physical RAM (cannot be swapped to disk by the OS). Benefits:
- CPU-to-GPU transfers via DMA are ~2x faster (12 GB/s vs 6 GB/s)
- Enables asynchronous transfers with non_blocking=True
- Required for overlapping data transfer with computation
# DataLoader pins output automatically with pin_memory=True
loader = DataLoader(dataset, pin_memory=True)
# Manual pinning
x_pinned = x.pin_memory()
x_gpu = x_pinned.to('cuda', non_blocking=True)
Do not pin too much memory: pinned memory reduces available RAM for other processes and cannot be swapped, potentially causing system-wide memory pressure.
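To make the overlap concrete, here is a minimal sketch of a manual GPU prefetcher. This is not part of the DataLoader API; the class name and structure are illustrative, and it assumes a CUDA device and a loader created with pin_memory=True:

import torch

class CUDAPrefetcher:
    """Copies the next batch to the GPU on a side stream while the
    current batch is being processed on the default stream."""
    def __init__(self, loader, device='cuda'):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            x, y = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # Asynchronous copies from pinned CPU memory
            self.next_batch = (x.to(self.device, non_blocking=True),
                               y.to(self.device, non_blocking=True))

    def __iter__(self):
        while self.next_batch is not None:
            # Default stream waits until the side-stream copy has finished
            torch.cuda.current_stream().wait_stream(self.stream)
            batch = self.next_batch
            for t in batch:
                # Tell the allocator this tensor is now used on the default stream
                t.record_stream(torch.cuda.current_stream())
            self._preload()  # immediately start copying the next batch
            yield batch

DataLoader's own prefetch_factor hides the disk-to-CPU side of the pipeline; a stream-based prefetcher like this additionally hides the CPU-to-GPU copy (iterate with for x, y in CUDAPrefetcher(loader)).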
Definition: Memory-Mapped File Access
For large datasets that do not fit in RAM, memory-mapped files provide random access without loading the entire file:
import numpy as np
# Save dataset to disk
np.save('data.npy', large_array)
# Memory-map: only loads pages when accessed
data = np.load('data.npy', mmap_mode='r')
sample = data[42] # reads only this row from disk
This is essential for datasets in the 10-100 GB range where loading everything into RAM is impractical.
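Combining this with the Dataset interface from above, a memory-mapped dataset might be sketched as follows (the file names data.npy and labels.npy are illustrative):

import numpy as np
import torch
from torch.utils.data import Dataset

class MmapDataset(Dataset):
    """Memory-mapped dataset: only the accessed rows are read from disk."""
    def __init__(self, data_path, label_path):
        # mmap_mode='r' maps the files without loading them into RAM
        self.x = np.load(data_path, mmap_mode='r')
        self.y = np.load(label_path, mmap_mode='r')

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # np.asarray copies just this row out of the memory map
        x = torch.from_numpy(np.asarray(self.x[idx], dtype=np.float32))
        y = int(self.y[idx])
        return x, y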
Theorem: Data Pipeline Throughput with Prefetching
Let $t_{\text{load}}$ be the time to load one batch, $t_{\text{compute}}$ the GPU computation time, and $W$ the number of DataLoader workers. The effective time per batch is:
$$t_{\text{batch}} = \max\left(t_{\text{compute}},\; \frac{t_{\text{load}}}{W}\right)$$
The pipeline is compute-bound when $t_{\text{load}}/W \le t_{\text{compute}}$ (GPU is the bottleneck) and I/O-bound otherwise.
With $W$ workers loading in parallel, the effective loading time is divided by $W$. Prefetching ensures the next batch is ready before the GPU finishes the current batch. The system runs at the speed of the slower stage.
Without prefetching
Each iteration takes $t_{\text{load}} + t_{\text{compute}}$ (sequential). GPU utilization: $t_{\text{compute}} / (t_{\text{load}} + t_{\text{compute}})$.
With $W$ workers and prefetching
Loading is pipelined: while the GPU processes batch $i$, workers prepare batch $i+1$. Effective loading time per batch is $t_{\text{load}}/W$. Total: $\max(t_{\text{compute}},\; t_{\text{load}}/W)$.
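As a quick illustration, the snippet below evaluates this model for assumed values of 40 ms load time and 5 ms compute time (the same numbers used in the Quick Check later in this section):

# Illustrative values: 40 ms to load a batch, 5 ms of GPU compute.
t_load, t_compute = 0.040, 0.005

for W in [1, 2, 4, 8, 16]:
    t_batch = max(t_compute, t_load / W)  # effective time per batch
    util = t_compute / t_batch            # fraction of time the GPU is busy
    print(f"W={W:2d}: {1.0 / t_batch:7.1f} batches/s, GPU utilization {util:.0%}")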
Example: Tuning DataLoader for Maximum Throughput
Find the optimal num_workers and measure the impact of pin_memory
and persistent_workers on data loading throughput.
Define dataset and benchmark
import torch
from torch.utils.data import Dataset, DataLoader
import time

class SyntheticDataset(Dataset):
    def __init__(self, n, dim):
        self.x = torch.randn(n, dim)
        self.y = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

dataset = SyntheticDataset(10000, 1024)
Sweep num_workers
for nw in [0, 1, 2, 4, 8]:
    loader = DataLoader(
        dataset, batch_size=64, num_workers=nw,
        pin_memory=True, persistent_workers=(nw > 0),
    )
    t0 = time.perf_counter()
    for x, y in loader:
        x = x.to('cuda', non_blocking=True)
    torch.cuda.synchronize()  # wait for async transfers before stopping the timer
    elapsed = time.perf_counter() - t0
    print(f"workers={nw}: {elapsed:.3f}s")
Measure pin_memory effect
for pm in [False, True]:
    loader = DataLoader(
        dataset, batch_size=64, num_workers=4,
        pin_memory=pm, persistent_workers=True,
    )
    t0 = time.perf_counter()
    for x, y in loader:
        x = x.to('cuda', non_blocking=True)
    torch.cuda.synchronize()  # ensure transfers complete before timing
    elapsed = time.perf_counter() - t0
    print(f"pin_memory={pm}: {elapsed:.3f}s")
DataLoader Pipeline Throughput
Simulate data loading pipeline throughput as a function of num_workers, batch size, and the ratio of I/O time to compute time. Observe when the system transitions from I/O-bound to compute-bound.
(Interactive simulation; full code in ch13/python/data_loading.py.)
Quick Check
Your GPU trains each batch in 5 ms, but loading a batch takes 40 ms. How many DataLoader workers do you need to keep the GPU fully utilized?
2 workers
4 workers
8 workers
16 workers
With 8 workers, effective load time is 40/8 = 5 ms = compute time. The pipeline is balanced; adding more workers provides diminishing returns.
Common Mistake: Too Many DataLoader Workers
Mistake:
Setting num_workers=64 on a machine with 16 CPU cores. This causes
excessive context switching, memory overhead (each worker holds a copy
of the dataset), and can actually slow down loading.
Correction:
Start with num_workers = os.cpu_count() // num_gpus and tune from
there. Monitor CPU utilization with htop. Use persistent_workers=True
to avoid worker process creation overhead between epochs.
Key Takeaway
Use pin_memory=True and non_blocking=True together for
asynchronous CPU-to-GPU data transfer. Set num_workers to at least $t_{\text{load}} / t_{\text{compute}}$ to balance
the pipeline. Use persistent_workers=True to avoid process
creation overhead.
Prefetching
Loading the next batch of data while the GPU processes the current batch, hiding I/O latency behind computation.
Pinned Memory
CPU memory pages locked in physical RAM, enabling faster DMA transfers to GPU and asynchronous data movement.