Chapter Summary

Key Points

  1. Manage GPU memory proactively. Activations dominate training memory (10-100x more than parameters). Use torch.cuda.memory_summary() to profile, gradient checkpointing to reduce activation memory from $O(L)$ to $O(\sqrt{L})$, and loss.item() to prevent graph retention. For the Adam optimizer, budget $16P$ bytes for parameters, gradients, and optimizer states, plus $B \cdot L \cdot A$ bytes for activations.
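The budget above can be sketched as a small helper. This is illustrative only (the function name and example sizes are made up); it assumes the $16P$ Adam figure, i.e. 4-byte FP32 parameters, gradients, and two 4-byte moment buffers, plus 4-byte activations:

```python
def training_memory_bytes(num_params: int, batch: int, layers: int,
                          acts_per_layer: int, bytes_per_act: int = 4) -> int:
    """Estimate Adam training memory: 16 bytes per parameter (FP32 params,
    grads, and two moment buffers) plus B * L * A activation bytes."""
    model_state = 16 * num_params
    activations = batch * layers * acts_per_layer * bytes_per_act
    return model_state + activations

# e.g. a 1M-parameter model, batch 32, 12 layers, 4096 activations per layer
total = training_memory_bytes(1_000_000, 32, 12, 4096)
print(f"{total / 2**20:.1f} MiB")
```

Note how quickly the activation term grows with batch size: it scales linearly in $B$ while the model-state term is fixed, which is why gradient checkpointing targets activations first.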

  2. Batch everything on the GPU. Replace Python loops over independent operations with batched tensor operations. torch.bmm, torch.einsum, and batched torch.linalg functions (SVD, Cholesky, solve) eliminate kernel launch overhead and achieve 10-100x speedup for small matrix operations. This is essential for MIMO processing across subcarriers.
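A minimal, CPU-runnable illustration of the batching idea (the shapes are made up for the example, standing in for per-subcarrier MIMO channels):

```python
import torch

B, M, K, N = 64, 8, 8, 8           # e.g. 64 subcarriers of 8x8 channel matrices
torch.manual_seed(0)
H = torch.randn(B, M, K)
x = torch.randn(B, K, N)

# Slow: one matmul (and one kernel launch on GPU) per subcarrier
y_loop = torch.stack([H[i] @ x[i] for i in range(B)])

# Fast: a single batched kernel over the leading dimension
y_bmm = torch.bmm(H, x)
# Equivalent einsum form: torch.einsum('bmk,bkn->bmn', H, x)

assert torch.allclose(y_loop, y_bmm, atol=1e-5)
```

The same pattern applies to batched torch.linalg calls, e.g. torch.linalg.solve(A, b) where A has shape (B, K, K), replacing a per-subcarrier solve loop.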

  3. Use mixed precision by default. torch.autocast('cuda', dtype=torch.bfloat16) provides 2x memory savings and 2-3x speedup with minimal accuracy loss. BF16 is preferred over FP16 on Ampere+ GPUs because it matches FP32's exponent range, eliminating overflow and the need for GradScaler. Pad dimensions to multiples of 8 for Tensor Core utilization.
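A sketch of a BF16 training step. To stay runnable without a GPU, this uses the CPU autocast backend; on an Ampere+ GPU the context would be torch.autocast('cuda', dtype=torch.bfloat16) and the rest is unchanged. The tiny model and shapes are illustrative:

```python
import torch
from torch import nn

model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, target = torch.randn(32, 16), torch.randn(32, 4)

# BF16 autocast: no GradScaler needed, because BF16 shares FP32's
# exponent range so gradients cannot overflow the format.
with torch.autocast('cpu', dtype=torch.bfloat16):
    out = model(x)                            # matmul runs in bfloat16
    loss = nn.functional.mse_loss(out.float(), target)

loss.backward()        # master weights and gradients stay in FP32
opt.step()
opt.zero_grad()
```

Note that only the autocast-eligible ops (matmuls, convolutions) run in BF16; the loss is computed in FP32 and the optimizer updates FP32 parameters, which is what keeps accuracy loss minimal.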

  4. Optimize the data pipeline. Use DataLoader with num_workers, pin_memory=True, and persistent_workers=True. With $W$ workers, the steady-state time per batch is $\max(T_{\text{load}}/W,\, T_{\text{compute}})$: set $W \ge T_{\text{load}}/T_{\text{compute}}$ to keep the GPU fully utilized. Use memory-mapped files for datasets exceeding RAM.
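The worker-count rule and the DataLoader flags can be sketched as follows (min_workers is a hypothetical helper, and the 40 ms / 10 ms timings are made-up measurements):

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

def min_workers(t_load_ms: float, t_compute_ms: float) -> int:
    """Smallest W with T_load / W <= T_compute, so loading never stalls the GPU."""
    return max(1, math.ceil(t_load_ms / t_compute_ms))

# e.g. loading one batch takes 40 ms but the GPU step takes only 10 ms
workers = min_workers(40.0, 10.0)

dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=workers,
    pin_memory=True,          # page-locked host memory for faster H2D copies
    persistent_workers=True,  # keep workers alive across epochs
)
```

persistent_workers avoids paying worker startup cost at every epoch boundary; it requires num_workers > 0.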

  5. Scale with DistributedDataParallel. Always use DDP over DataParallel for multi-GPU training. DDP uses one process per GPU, overlaps gradient all-reduce with backward computation, and scales to multiple machines. Remember DistributedSampler with set_epoch() for correct data shuffling, and apply the linear scaling rule for learning rate.

Looking Ahead

Chapter 14 applies these performance patterns to real-world wireless simulations: channel estimation, MIMO detection, and end-to-end learning systems. The batched operations from Section 13.2 enable efficient per-subcarrier processing, mixed precision from Section 13.3 doubles throughput for inference, and DDP from Section 13.5 scales training across GPU clusters.