Training Infrastructure
Scaling from Toy to Production
A training loop that works on a laptop with 1,000 samples must also work on a GPU cluster with millions. This section covers the infrastructure that makes that jump: the DataLoader, learning rate schedulers, and checkpointing.
Definition: Dataset and DataLoader
A torch.utils.data.Dataset defines two methods: __len__ and __getitem__. A DataLoader wraps a dataset with batching, shuffling, and parallel loading:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = MyDataset(X, y)  # X, y: pre-loaded feature and label tensors
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)
Set pin_memory=True when training on a GPU: it allocates page-locked host memory, which enables faster CPU-to-GPU transfer.
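Pinned memory pays off when the matching copies are issued asynchronously. A minimal sketch of the GPU-side transfer (the training step itself is elided; assumes a CUDA device is available):

device = torch.device('cuda')
for X_batch, y_batch in loader:
    # With pinned source memory, non_blocking=True lets the copy
    # overlap with computation already queued on the GPU.
    X_batch = X_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)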
Definition: Learning Rate Schedulers
Schedulers adjust the learning rate during training:
Step decay: multiplies the learning rate by a factor $\gamma$ every $s$ epochs, $\eta_t = \eta_0 \, \gamma^{\lfloor t/s \rfloor}$.
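As a sketch in code (the step_size and gamma values are illustrative, and optimizer is assumed to exist already):

# Multiply the LR by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)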
Cosine annealing: decays the learning rate smoothly from $\eta_0$ to $\eta_{\min}$ over $T_{\max}$ epochs, following $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})\bigl(1 + \cos(\pi t / T_{\max})\bigr)$:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train_one_epoch()
    scheduler.step()
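The "warmup + cosine" schedule recommended in the Key Takeaway below can be composed from built-in schedulers. A sketch assuming a 5-epoch linear warmup (the epoch counts are illustrative):

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=5)   # ramp LR up over 5 epochs
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=95, eta_min=1e-6)            # then decay to eta_min
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5])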
Example: Saving and Loading Checkpoints
Save training state (model weights, optimizer state, epoch) so training can be resumed after interruption.
# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.pt')
# Load checkpoint (weights_only=False runs the full pickle loader;
# only use it on checkpoint files you created yourself)
ckpt = torch.load('checkpoint.pt', weights_only=False)
model.load_state_dict(ckpt['model_state_dict'])
optimizer.load_state_dict(ckpt['optimizer_state_dict'])
start_epoch = ckpt['epoch'] + 1
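Putting the two halves together, a sketch of a resumable loop that checkpoints after every epoch (train_one_epoch and the 100-epoch budget are placeholders from the scheduler example; a fuller version would also save scheduler.state_dict()):

import os

start_epoch = 0
if os.path.exists('checkpoint.pt'):       # resume if a previous run was cut off
    ckpt = torch.load('checkpoint.pt', weights_only=False)
    model.load_state_dict(ckpt['model_state_dict'])
    optimizer.load_state_dict(ckpt['optimizer_state_dict'])
    start_epoch = ckpt['epoch'] + 1

for epoch in range(start_epoch, 100):
    loss = train_one_epoch()
    scheduler.step()
    torch.save({                          # checkpoint after every epoch
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, 'checkpoint.pt')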
Interactive: the Learning Rate Schedule Explorer visualises different LR schedules over training.
Quick Check
What does pin_memory=True in DataLoader do?
(a) Keeps data on GPU permanently
(b) Allocates page-locked CPU memory for faster GPU transfer
(c) Prevents garbage collection of loaded batches
Answer: (b). Pinned memory enables asynchronous CPU-to-GPU copies, overlapping data transfer with computation.
Common Mistake: Too Many DataLoader Workers
Mistake: Setting num_workers very high (e.g., 32) on a machine with limited RAM, causing OOM errors or system slowdown.
Correction: Start with num_workers=4 and increase gradually; each worker process holds its own copy of the dataset in memory. On macOS, use num_workers=0 unless you call multiprocessing.set_start_method('fork').
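One way to find the sweet spot is to time a pure data-loading pass at a few candidate worker counts; a rough sketch (the counts are arbitrary):

import time

for workers in (0, 2, 4, 8):
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=workers, pin_memory=True)
    start = time.time()
    for batch in loader:                  # iterate once, discarding batches
        pass
    print(f'num_workers={workers}: {time.time() - start:.1f}s')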
Key Takeaway
Save a checkpoint after every epoch. Use cosine annealing, or warmup followed by cosine, as the default LR schedule. Set pin_memory=True for GPU training, and profile num_workers to find the sweet spot.
Glossary
DataLoader: PyTorch utility that wraps a Dataset with batching, shuffling, and parallel data loading.
Checkpoint: A saved snapshot of model weights, optimizer state, and training progress for resumption.