Training Infrastructure

Scaling from Toy to Production

A training loop that works on a laptop with 1000 samples must also work on a GPU cluster with millions. This section covers the infrastructure: DataLoader, learning rate schedulers, checkpointing, and mixed-precision training.


Dataset and DataLoader

A map-style torch.utils.data.Dataset implements __len__ (the number of samples) and __getitem__ (fetch one sample by index). DataLoader wraps a dataset with batching, shuffling, and parallel loading:

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = MyDataset(X, y)  # X, y: tensors or arrays of equal length
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

Set pin_memory=True when using GPU — it enables faster CPU-to-GPU transfer via page-locked memory.
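A sketch of the matching transfer on the training side (device choice and tensor shapes here are illustrative): non_blocking=True only overlaps the copy with compute when the source tensor lives in pinned memory.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for one batch from a DataLoader with pin_memory=True.
X_batch, y_batch = torch.randn(64, 16), torch.zeros(64, dtype=torch.long)

# With a pinned source, non_blocking=True lets this host-to-device
# copy overlap with GPU compute instead of blocking the Python loop.
X_batch = X_batch.to(device, non_blocking=True)
y_batch = y_batch.to(device, non_blocking=True)
```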


Learning Rate Schedulers

Schedulers adjust the learning rate $\eta$ during training:

Step decay: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/T_{\text{step}} \rfloor}$

Cosine annealing: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$
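Both formulas can be checked directly in plain Python; the parameter values below are illustrative, not defaults from any library.

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, t_step=30):
    # eta_t = eta0 * gamma^floor(t / t_step)
    return eta0 * gamma ** (t // t_step)

def cosine_annealing(t, eta0=0.1, eta_min=1e-6, T=100):
    # eta_t = eta_min + 0.5 * (eta0 - eta_min) * (1 + cos(pi * t / T))
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

print(step_decay(0), step_decay(30))               # halves every t_step epochs
print(cosine_annealing(0), cosine_annealing(100))  # decays from eta0 to eta_min
```

Step decay drops the rate in discrete jumps, while cosine annealing decays it smoothly from $\eta_0$ to $\eta_{\min}$ over $T$ epochs.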

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train_one_epoch()
    scheduler.step()

Example: Saving and Loading Checkpoints

Save training state (model weights, optimizer state, epoch) so training can be resumed after interruption.
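A minimal sketch of save and resume; the dictionary keys and filename are illustrative choices, not a fixed convention.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def save_checkpoint(path, model, optimizer, epoch):
    # Bundle everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # epoch to resume from

save_checkpoint("checkpoint.pt", model, optimizer, epoch=5)
start_epoch = load_checkpoint("checkpoint.pt", model, optimizer)
```

Saving after every epoch means an interruption costs at most one epoch of work; remember to save the optimizer state too, since optimizers like Adam carry per-parameter statistics.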

Learning Rate Schedule Explorer

Visualise different LR schedules over training.


Quick Check

What does pin_memory=True in DataLoader do?

Keeps data on GPU permanently

Allocates page-locked CPU memory for faster GPU transfer

Prevents garbage collection of loaded batches

Common Mistake: Too Many DataLoader Workers

Mistake:

Setting num_workers very high (e.g., 32) on a machine with limited RAM, causing OOM errors or system slowdown.

Correction:

Start with num_workers=4 and increase gradually while watching memory use: each worker process holds its own copy of the dataset object, so RAM usage scales with the worker count. On macOS, where the default start method is spawn, use num_workers=0 unless you opt into multiprocessing.set_start_method('fork').
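One way to find a good worker count is to time a full pass over the loader for a few candidates. This is a rough sketch with a tiny in-memory dataset; real datasets with heavier __getitem__ work (decoding images, augmentation) show much larger differences.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2048, 32), torch.randint(0, 2, (2048,)))

def time_loader(num_workers):
    # Iterate one full epoch and measure wall-clock time.
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass
    return time.perf_counter() - start

timings = {w: time_loader(w) for w in (0, 2, 4)}
for w, t in timings.items():
    print(f"num_workers={w}: {t:.3f}s")
```

For trivially cheap datasets like this one, num_workers=0 often wins because of worker startup overhead; the crossover point depends on how expensive each sample is to produce.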

Key Takeaway

Always save checkpoints after each epoch. Use cosine annealing, or warmup followed by cosine, as the default LR schedule. Set pin_memory=True for GPU training, and profile num_workers to find the sweet spot.

DataLoader

PyTorch utility that wraps a Dataset with batching, shuffling, and parallel data loading.

Checkpoint

A saved snapshot of model weights, optimizer state, and training progress for resumption.