References & Further Reading
References
- P. Micikevicius, S. Narang, J. Alben, et al., Mixed Precision Training, ICLR, 2018
The foundational paper on training with FP16 weights, activations, and gradients while keeping an FP32 master copy of the weights. Introduces loss scaling and demonstrates no accuracy loss across a wide range of architectures; a minimal PyTorch sketch of the recipe appears after this list.
- T. Chen, B. Xu, C. Zhang, and C. Guestrin, Training Deep Nets with Sublinear Memory Cost, arXiv:1604.06174, 2016
Introduces gradient checkpointing (trading compute for memory). Shows how to reduce activation memory from O(L) to O(sqrt(L)) for a sequential network of L layers, at the cost of roughly one extra forward pass; see the checkpointing sketch after this list.
- S. Li, Y. Zhao, R. Varma, et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training, VLDB, 2020
The official PyTorch DDP paper. Describes the bucketed all-reduce strategy, gradient computation/communication overlap, and scaling benchmarks up to 256 GPUs.
- P. Goyal, P. Dollár, R. Girshick, et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, arXiv:1706.02677, 2017
Establishes the linear scaling rule (scale the learning rate in proportion to the global batch size) and the gradual warmup strategy; both are sketched after this list. Trained ResNet-50 on ImageNet in 1 hour on 256 GPUs.
- PyTorch Contributors, Automatic Mixed Precision (torch.amp), 2024
Official PyTorch documentation for AMP, autocast, and GradScaler, including the per-operation casting rules (which ops autocast runs in FP16 and which it keeps in FP32). The training-step sketch after this list uses this API.
- NVIDIA, NVIDIA A100 Tensor Core GPU Architecture, NVIDIA Whitepaper, 2020
Technical description of the A100 GPU architecture, including Tensor Core specifications, memory hierarchy, and TF32 mode.
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, SC20, 2020
Introduces ZeRO (Zero Redundancy Optimizer), which shards optimizer states, gradients, and parameters across data-parallel workers. The basis for PyTorch FSDP; a minimal FSDP sketch appears after this list.
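The mixed precision recipe from Micikevicius et al. maps directly onto PyTorch's torch.amp API. Below is a minimal sketch of one training step; the toy model, random data, and hyperparameters are illustrative stand-ins for a real workload, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

# Illustrative only: a toy model and random data stand in for a real workload.
device = "cuda"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()   # tracks the dynamic loss scale

for step in range(10):
    inputs = torch.randn(64, 512, device=device)
    targets = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()  # scale the loss so small FP16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients; skips the step if any are inf/NaN
    scaler.update()                # grows or shrinks the scale based on overflow history
```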
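Gradient checkpointing in the spirit of Chen et al. is exposed in PyTorch through torch.utils.checkpoint. A minimal sketch, assuming a deep sequential stack where activation memory dominates; the segment count of 6 is roughly the square root of the 32 layers:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative only: a deep sequential stack whose activations dominate memory.
layers = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(32)]
model = nn.Sequential(*layers).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

# Only the activations at the 6 segment boundaries are kept; each segment is
# re-run during the backward pass, trading compute for O(sqrt(L)) activation memory.
out = checkpoint_sequential(model, 6, x, use_reentrant=False)
out.sum().backward()
```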
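The linear scaling rule and gradual warmup from Goyal et al. reduce to a few lines. A sketch, assuming their reference setting of base learning rate 0.1 at batch size 256; the warmup step count is an arbitrary example:

```python
def scaled_lr(base_lr, base_batch, global_batch):
    # Linear scaling rule: multiply the learning rate by the batch-size ratio.
    return base_lr * global_batch / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    # Gradual warmup: ramp linearly from a small LR up to the target.
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=8192)  # -> 3.2
print(warmup_lr(target, step=0, warmup_steps=500), target)
```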
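ZeRO-style sharding is available in PyTorch through FSDP. A minimal sketch, assuming a single-node launch via torchrun so the process-group environment variables are set; the toy model and optimizer settings are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Each rank keeps only a shard of parameters, gradients, and optimizer state,
# in the spirit of ZeRO stage 3; full parameters are gathered on demand.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)                     # shards parameters across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()                         # gradients are reduced to their owning shards
optimizer.step()
```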
Further Reading
PyTorch memory management internals
PyTorch blog: Understanding GPU Memory (https://pytorch.org/blog/understanding-gpu-memory-1/)
Deep dive into the caching allocator, memory snapshots, and the memory visualization tools introduced in PyTorch 2.1.
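The snapshot workflow described in that post can be tried in a few lines. A sketch using the underscore-prefixed (and therefore subject-to-change) memory-history API available in recent PyTorch releases; the model and file name are illustrative:

```python
import torch

# Start recording allocation events (cap the history so it stays bounded).
torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(4096, 4096).cuda()
out = model(torch.randn(512, 4096, device="cuda"))
out.sum().backward()

# Dump a pickle that can be loaded into the memory visualizer at
# https://pytorch.org/memory_viz, then stop recording.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```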
Advanced mixed precision techniques
NVIDIA Deep Learning Performance Guide (https://docs.nvidia.com/deeplearning/performance/)
Practical guidelines for maximizing Tensor Core utilization, including dimension padding, memory alignment, and kernel fusion.
Scaling distributed training
DeepSpeed documentation (https://www.deepspeed.ai/)
Microsoft's library for efficient distributed training with ZeRO optimizer, pipeline parallelism, and 3D parallelism strategies for trillion-parameter models.
GPU profiling with Nsight
NVIDIA Nsight Systems User Guide (https://docs.nvidia.com/nsight-systems/)
The definitive tool for GPU profiling, showing kernel timelines, memory transfers, and CPU-GPU synchronization points.
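One way to make a PyTorch training loop readable on the Nsight Systems timeline is to wrap its phases in NVTX ranges. A sketch; the `nsys profile -o trace python train.py` invocation and script name are illustrative:

```python
import torch

model = torch.nn.Linear(2048, 2048).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(256, 2048, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push(f"step {step}")   # named region on the timeline

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).sum()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()
```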