Chapter Summary

Key Points

  1. nn.Module is the building block. Every PyTorch model is an nn.Module tree. Compose with nn.Sequential for chains, nn.ModuleList for indexed collections, and nn.ModuleDict for named lookups. Always call super().__init__() and invoke the model as model(x), not model.forward(x) (see the composition sketch after this list).

  2. The training loop has five steps: (1) optimizer.zero_grad(), (2) forward pass, (3) loss computation, (4) loss.backward(), (5) optimizer.step(). Call model.eval() and wrap validation in torch.no_grad(). This explicit loop gives full control over every aspect of training (see the loop sketch after this list).

  3. Choose losses that match your task: MSE for regression (Gaussian noise assumption), CrossEntropyLoss for classification (categorical distribution), BCEWithLogitsLoss for multi-label problems. Never apply softmax before CrossEntropyLoss, and make sure custom losses are differentiable (see the loss sketch after this list).

  4. Infrastructure scales training. Use DataLoader with pin_memory=True and an appropriate num_workers so the GPU stays fed, schedule the learning rate with cosine annealing, default to AdamW as the optimizer, and checkpoint model and optimizer state after each epoch (see the infrastructure sketch after this list).

  5. Debug systematically. Always verify the pipeline with the overfit-one-batch test, monitor gradient norms to detect vanishing or exploding gradients, use gradient clipping for stability, and check for NaN in gradients and activations (see the debugging sketch after this list).

Looking Ahead

Chapter 27 introduces convolutional neural networks (CNNs), which replace fully connected layers with parameter-sharing convolutions for spatial data. The nn.Module patterns, training loop, and debugging strategies from this chapter apply directly to all architectures in Part VI.