Chapter Summary
Key Points
1. Manage GPU memory proactively. Activations dominate training memory (10-100x more than parameters). Use `torch.cuda.memory_summary()` to profile, gradient checkpointing to reduce activation memory from O(L) to O(√L) for L layers, and `loss.item()` to prevent graph retention. For the Adam optimizer, budget 16 bytes per parameter (FP32 parameters + gradients + the two optimizer states), plus activation memory, which scales with batch size and network depth.
2. Batch everything on GPU. Replace Python loops over independent operations with batched tensor operations. `torch.bmm`, `torch.einsum`, and batched `torch.linalg` functions (SVD, Cholesky, solve) eliminate kernel launch overhead and achieve 10-100x speedups for small matrix operations. This is essential for MIMO processing across subcarriers.
3. Use mixed precision by default. `torch.autocast('cuda', dtype=torch.bfloat16)` provides 2x memory savings and 2-3x speedup with minimal accuracy loss. BF16 is preferred over FP16 on Ampere and newer GPUs because it matches FP32's exponent range, eliminating overflow and the need for `GradScaler`. Pad dimensions to multiples of 8 for Tensor Core utilization.
4. Optimize the data pipeline. Use `DataLoader` with `num_workers > 0`, `pin_memory=True`, and `persistent_workers=True`. Pipeline throughput is the minimum of the data-loading rate and the GPU compute rate: set `num_workers` high enough that loading keeps the GPU fully utilized. Use memory-mapped files for datasets exceeding RAM.
5. Scale with DistributedDataParallel. Always use DDP over DataParallel for multi-GPU training. DDP uses one process per GPU, overlaps gradient all-reduce with backward computation, and scales to multiple machines. Remember `DistributedSampler` with `set_epoch()` for correct data shuffling, and apply the linear scaling rule for the learning rate.
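The Adam budget from point 1 can be sketched as back-of-envelope arithmetic. This is a rough FP32 estimate (the function name and the 10M-parameter example are illustrative, not from a specific model in the book), and it deliberately excludes activations, which usually dominate:

```python
def adam_training_memory_bytes(n_params: int) -> int:
    """Rough FP32 training budget for Adam: 4 B parameters + 4 B gradients
    + 4 B first moment (m) + 4 B second moment (v) = 16 B per parameter.
    Activation memory is NOT included and typically dominates."""
    return 16 * n_params

# Example: a 10M-parameter model needs ~160 MB before activations.
print(adam_training_memory_bytes(10_000_000))  # 160000000
```

Gradient checkpointing and `loss.item()` then attack the activation side of the budget, which this estimate leaves out.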
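Point 2, batching small matrix operations, can be illustrated with a per-subcarrier MIMO product. The shapes (64 subcarriers, 4x4 channels) are illustrative assumptions; the pattern is the point:

```python
import torch

# Illustrative shapes: 64 subcarriers, each with a 4x4 MIMO channel.
H = torch.randn(64, 4, 4)   # per-subcarrier channel matrices
x = torch.randn(64, 4, 1)   # per-subcarrier transmit vectors

# Looped version: one tiny matmul (one kernel launch) per subcarrier.
y_loop = torch.stack([H[k] @ x[k] for k in range(64)])

# Batched versions: a single kernel over all subcarriers.
y_batch = torch.bmm(H, x)                      # (64, 4, 1)
y_einsum = torch.einsum('kij,kjl->kil', H, x)  # same result, named indices

print(torch.allclose(y_loop, y_batch, atol=1e-5))  # True
```

The same leading-batch-dimension convention carries over to `torch.linalg.svd`, `torch.linalg.cholesky`, and `torch.linalg.solve`, so whole detector pipelines can stay batched end to end.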
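A minimal autocast sketch for point 3. To stay runnable without a GPU it falls back to CPU autocast (also supported for BF16); on an Ampere+ GPU the `device_type` would be `'cuda'` and Tensor Cores handle the cast matmuls:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(16, 16).to(device)
x = torch.randn(8, 16, device=device)

# BF16 autocast: no GradScaler needed, because BF16 keeps FP32's
# 8-bit exponent range and so cannot overflow where FP32 would not.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```

Note the matmul inputs here are already multiples of 8 (16 features, batch 8), which is what Tensor Core utilization wants.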
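A sketch of the `DataLoader` configuration from point 4. The dataset and worker count are placeholders; in practice `num_workers` is tuned upward until loading keeps pace with GPU compute:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset; real pipelines would wrap files here,
# memory-mapped if they exceed RAM.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2,            # parallel loading processes (tune for your pipeline)
    pin_memory=True,          # page-locked host memory for faster host-to-device copies
    persistent_workers=True,  # keep workers alive across epochs (needs num_workers > 0)
)

xb, yb = next(iter(loader))
print(xb.shape)  # torch.Size([32, 16])
```

Without CUDA, `pin_memory=True` is simply a no-op with a warning, so the same code runs on CPU-only machines.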
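Point 5 can be condensed into a minimal DDP training skeleton. To be runnable anywhere, this sketch uses a single CPU process with the `gloo` backend and a tiny placeholder model; a real job would launch one process per GPU via `torchrun` (which sets `MASTER_ADDR`/`MASTER_PORT` and the rank itself) and use the `nccl` backend:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train(rank: int = 0, world_size: int = 1) -> float:
    # torchrun normally provides these; set defaults for a single-process demo.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(8, 2))  # placeholder model
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 2))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    # Linear scaling rule: base LR (assumed 1e-3 here) times number of workers.
    opt = torch.optim.SGD(model.parameters(), lr=1e-3 * world_size)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()  # DDP overlaps all-reduce with this pass
            opt.step()

    dist.destroy_process_group()
    return loss.item()

final_loss = train()
```

Forgetting `sampler.set_epoch(epoch)` silently gives every epoch the same shuffling order, which is one of the most common DDP bugs.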
Looking Ahead
Chapter 14 applies these performance patterns to real-world wireless simulations: channel estimation, MIMO detection, and end-to-end learning systems. The batched operations from Section 13.2 enable efficient per-subcarrier processing, mixed precision from Section 13.3 doubles throughput for inference, and DDP from Section 13.5 scales training across GPU clusters.