Training Infrastructure

Scaling from Toy to Production

A training loop that works on a laptop with 1000 samples must also work on a GPU cluster with millions. This section covers the infrastructure: DataLoader, learning rate schedulers, checkpointing, and mixed-precision training.


Dataset and DataLoader

A map-style torch.utils.data.Dataset implements __len__ (the number of samples) and __getitem__ (fetch one sample by index). DataLoader wraps a dataset with batching, shuffling, and parallel loading:

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = MyDataset(X, y)  # X, y: tensors or arrays of equal length
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

Set pin_memory=True when using GPU — it enables faster CPU-to-GPU transfer via page-locked memory.
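A sketch of the matching transfer on the training side (device choice and tensor shapes here are illustrative): non_blocking=True only overlaps the copy with compute when the source tensor lives in pinned memory.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for one batch from a DataLoader with pin_memory=True.
X_batch, y_batch = torch.randn(64, 16), torch.zeros(64, dtype=torch.long)

# With a pinned source, non_blocking=True lets this host-to-device
# copy overlap with GPU compute instead of blocking the Python loop.
X_batch = X_batch.to(device, non_blocking=True)
y_batch = y_batch.to(device, non_blocking=True)
```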


Learning Rate Schedulers

Schedulers adjust the learning rate $\eta$ during training:

Step decay: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/T_{\text{step}} \rfloor}$

Cosine annealing: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$
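Both formulas can be checked directly in plain Python; the parameter values below are illustrative, not defaults from any library.

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, t_step=30):
    # eta_t = eta0 * gamma^floor(t / t_step)
    return eta0 * gamma ** (t // t_step)

def cosine_annealing(t, eta0=0.1, eta_min=1e-6, T=100):
    # eta_t = eta_min + 0.5 * (eta0 - eta_min) * (1 + cos(pi * t / T))
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

print(step_decay(0), step_decay(30))               # halves every t_step epochs
print(cosine_annealing(0), cosine_annealing(100))  # decays from eta0 to eta_min
```

Step decay drops the rate in discrete jumps, while cosine annealing decays it smoothly from $\eta_0$ to $\eta_{\min}$ over $T$ epochs.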

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train_one_epoch()
    scheduler.step()

Example: Saving and Loading Checkpoints

Save training state (model weights, optimizer state, epoch) so training can be resumed after interruption.
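A minimal sketch of save and resume; the dictionary keys and filename are illustrative choices, not a fixed convention.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def save_checkpoint(path, model, optimizer, epoch):
    # Bundle everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # epoch to resume from

save_checkpoint("checkpoint.pt", model, optimizer, epoch=5)
start_epoch = load_checkpoint("checkpoint.pt", model, optimizer)
```

Saving after every epoch means an interruption costs at most one epoch of work; remember to save the optimizer state too, since optimizers like Adam carry per-parameter statistics.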

Learning Rate Schedule Explorer

Visualise different LR schedules over training.


Quick Check

What does pin_memory=True in DataLoader do?

Keeps data on GPU permanently

Allocates page-locked CPU memory for faster GPU transfer

Prevents garbage collection of loaded batches

Common Mistake: Too Many DataLoader Workers

Mistake:

Setting num_workers very high (e.g., 32) on a machine with limited RAM, causing OOM errors or system slowdown.

Correction:

Start with num_workers=4 and increase gradually while watching memory use: each worker process holds its own copy of the dataset object, so RAM usage scales with the worker count. On macOS, where the default start method is spawn, use num_workers=0 unless you opt into multiprocessing.set_start_method('fork').
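One way to find a good worker count is to time a full pass over the loader for a few candidates. This is a rough sketch with a tiny in-memory dataset; real datasets with heavier __getitem__ work (decoding images, augmentation) show much larger differences.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2048, 32), torch.randint(0, 2, (2048,)))

def time_loader(num_workers):
    # Iterate one full epoch and measure wall-clock time.
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start

timings = {w: time_loader(w) for w in (0, 2, 4)}
for w, t in timings.items():
    print(f"num_workers={w}: {t:.3f}s")
```

For trivially cheap datasets like this one, num_workers=0 often wins because of worker startup overhead; the crossover point depends on how expensive each sample is to produce.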

Key Takeaway

Always save checkpoints after each epoch. Use cosine annealing, or warmup followed by cosine, as the default LR schedule. Set pin_memory=True for GPU training, and profile num_workers to find the sweet spot.

DataLoader

PyTorch utility that wraps a Dataset with batching, shuffling, and parallel data loading.

Checkpoint

A saved snapshot of model weights, optimizer state, and training progress for resumption.