Chapter Summary

Key Points

  1. Manage GPU memory proactively. Activations dominate training memory (10-100x more than parameters). Use torch.cuda.memory_summary() to profile, gradient checkpointing to reduce activation memory from $O(L)$ to $O(\sqrt{L})$, and loss.item() to prevent graph retention. For the Adam optimizer, budget $16P$ bytes for parameters, gradients, and optimizer states, plus $B \cdot L \cdot A$ bytes for activations.
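The budget above can be sketched as a small helper. This is illustrative only (the function name and example sizes are made up); it assumes the $16P$ Adam figure, i.e. 4-byte FP32 parameters, gradients, and two 4-byte moment buffers, plus 4-byte activations:

```python
def training_memory_bytes(num_params: int, batch: int, layers: int,
                          acts_per_layer: int, bytes_per_act: int = 4) -> int:
    """Estimate Adam training memory: 16 bytes per parameter (FP32 params,
    grads, and two moment buffers) plus B * L * A activation bytes."""
    model_state = 16 * num_params
    activations = batch * layers * acts_per_layer * bytes_per_act
    return model_state + activations

# e.g. a 1M-parameter model, batch 32, 12 layers, 4096 activations per layer
total = training_memory_bytes(1_000_000, 32, 12, 4096)
print(f"{total / 2**20:.1f} MiB")
```

Note how quickly the activation term grows with batch size: it scales linearly in $B$ while the model-state term is fixed, which is why gradient checkpointing targets activations first.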

  2. Batch everything on the GPU. Replace Python loops over independent operations with batched tensor operations. torch.bmm, torch.einsum, and batched torch.linalg functions (SVD, Cholesky, solve) eliminate kernel launch overhead and achieve 10-100x speedup for small matrix operations. This is essential for MIMO processing across subcarriers.
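A minimal, CPU-runnable illustration of the batching idea (the shapes are made up for the example, standing in for per-subcarrier MIMO channels):

```python
import torch

B, M, K, N = 64, 8, 8, 8           # e.g. 64 subcarriers of 8x8 channel matrices
torch.manual_seed(0)
H = torch.randn(B, M, K)
x = torch.randn(B, K, N)

# Slow: one matmul (and one kernel launch on GPU) per subcarrier
y_loop = torch.stack([H[i] @ x[i] for i in range(B)])

# Fast: a single batched kernel over the leading dimension
y_bmm = torch.bmm(H, x)
# Equivalent einsum form: torch.einsum('bmk,bkn->bmn', H, x)

assert torch.allclose(y_loop, y_bmm, atol=1e-5)
```

The same pattern applies to batched torch.linalg calls, e.g. torch.linalg.solve(A, b) where A has shape (B, K, K), replacing a per-subcarrier solve loop.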

  3. Use mixed precision by default. torch.autocast('cuda', dtype=torch.bfloat16) provides 2x memory savings and 2-3x speedup with minimal accuracy loss. BF16 is preferred over FP16 on Ampere+ GPUs because it matches FP32's exponent range, eliminating overflow and the need for GradScaler. Pad dimensions to multiples of 8 for Tensor Core utilization.
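A sketch of a BF16 training step. To stay runnable without a GPU, this uses the CPU autocast backend; on an Ampere+ GPU the context would be torch.autocast('cuda', dtype=torch.bfloat16) and the rest is unchanged. The tiny model and shapes are illustrative:

```python
import torch
from torch import nn

model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, target = torch.randn(32, 16), torch.randn(32, 4)

# BF16 autocast: no GradScaler needed, because BF16 shares FP32's
# exponent range so gradients cannot overflow the format.
with torch.autocast('cpu', dtype=torch.bfloat16):
    out = model(x)                            # matmul runs in bfloat16
    loss = nn.functional.mse_loss(out.float(), target)

loss.backward()        # master weights and gradients stay in FP32
opt.step()
opt.zero_grad()
```

Note that only the autocast-eligible ops (matmuls, convolutions) run in BF16; the loss is computed in FP32 and the optimizer updates FP32 parameters, which is what keeps accuracy loss minimal.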

  4. Optimize the data pipeline. Use DataLoader with num_workers, pin_memory=True, and persistent_workers=True. With $W$ workers, the steady-state time per batch is $\max(T_{\text{load}}/W,\, T_{\text{compute}})$: set $W \ge T_{\text{load}}/T_{\text{compute}}$ to keep the GPU fully utilized. Use memory-mapped files for datasets exceeding RAM.
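The worker-count rule and the DataLoader flags can be sketched as follows (min_workers is a hypothetical helper, and the 40 ms / 10 ms timings are made-up measurements):

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

def min_workers(t_load_ms: float, t_compute_ms: float) -> int:
    """Smallest W with T_load / W <= T_compute, so loading never stalls the GPU."""
    return max(1, math.ceil(t_load_ms / t_compute_ms))

# e.g. loading one batch takes 40 ms but the GPU step takes only 10 ms
workers = min_workers(40.0, 10.0)

dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=workers,
    pin_memory=True,          # page-locked host memory for faster H2D copies
    persistent_workers=True,  # keep workers alive across epochs
)
```

persistent_workers avoids paying worker startup cost at every epoch boundary; it requires num_workers > 0.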

  5. Scale with DistributedDataParallel. Always use DDP over DataParallel for multi-GPU training. DDP uses one process per GPU, overlaps gradient all-reduce with backward computation, and scales to multiple machines. Remember DistributedSampler with set_epoch() for correct data shuffling, and apply the linear scaling rule for learning rate.

Looking Ahead

Chapter 14 applies these performance patterns to real-world wireless simulations: channel estimation, MIMO detection, and end-to-end learning systems. The batched operations from Section 13.2 enable efficient per-subcarrier processing, mixed precision from Section 13.3 doubles throughput for inference, and DDP from Section 13.5 scales training across GPU clusters.