References & Further Reading
References
- P. Micikevicius, S. Narang, J. Alben, et al., Mixed Precision Training, ICLR, 2018
The foundational paper on training with FP16 weights, activations, and gradients while keeping an FP32 master copy of the weights. Introduces loss scaling and demonstrates no accuracy loss across a wide range of architectures; a minimal PyTorch sketch of the recipe appears after this list.
- T. Chen, B. Xu, C. Zhang, and C. Guestrin, Training Deep Nets with Sublinear Memory Cost, arXiv:1604.06174, 2016
Introduces gradient checkpointing (trading compute for memory). Shows how to reduce activation memory from O(L) to O(sqrt(L)) for a sequential network of L layers, at the cost of roughly one extra forward pass; see the checkpointing sketch after this list.
- S. Li, Y. Zhao, R. Varma, et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training, VLDB, 2020
The official PyTorch DDP paper. Describes the bucketed all-reduce strategy, gradient computation/communication overlap, and scaling benchmarks up to 256 GPUs.
- P. Goyal, P. Dollár, R. Girshick, et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, arXiv:1706.02677, 2017
Establishes the linear scaling rule (scale the learning rate in proportion to the global batch size) and the gradual warmup strategy; both are sketched after this list. Trained ResNet-50 on ImageNet in 1 hour on 256 GPUs.
- PyTorch Contributors, Automatic Mixed Precision (torch.amp), 2024
Official PyTorch documentation for AMP, autocast, and GradScaler, including the per-operation casting rules (which ops autocast runs in FP16 and which it keeps in FP32). The training-step sketch after this list uses this API.
- NVIDIA, NVIDIA A100 Tensor Core GPU Architecture, NVIDIA Whitepaper, 2020
Technical description of the A100 GPU architecture, including Tensor Core specifications, memory hierarchy, and TF32 mode.
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, SC20, 2020
Introduces ZeRO (Zero Redundancy Optimizer), which shards optimizer states, gradients, and parameters across data-parallel workers. The basis for PyTorch FSDP; a minimal FSDP sketch appears after this list.
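The mixed precision recipe from Micikevicius et al. maps directly onto PyTorch's torch.amp API. Below is a minimal sketch of one training step; the toy model, random data, and hyperparameters are illustrative stand-ins for a real workload, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

# Illustrative only: a toy model and random data stand in for a real workload.
device = "cuda"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()   # tracks the dynamic loss scale

for step in range(10):
    inputs = torch.randn(64, 512, device=device)
    targets = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()  # scale the loss so small FP16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients; skips the step if any are inf/NaN
    scaler.update()                # grows or shrinks the scale based on overflow history
```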
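Gradient checkpointing in the spirit of Chen et al. is exposed in PyTorch through torch.utils.checkpoint. A minimal sketch, assuming a deep sequential stack where activation memory dominates; the segment count of 6 is roughly the square root of the 32 layers:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative only: a deep sequential stack whose activations dominate memory.
layers = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(32)]
model = nn.Sequential(*layers).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

# Only the activations at the 6 segment boundaries are kept; each segment is
# re-run during the backward pass, trading compute for O(sqrt(L)) activation memory.
out = checkpoint_sequential(model, 6, x, use_reentrant=False)
out.sum().backward()
```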
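The linear scaling rule and gradual warmup from Goyal et al. reduce to a few lines. A sketch, assuming their reference setting of base learning rate 0.1 at batch size 256; the warmup step count is an arbitrary example:

```python
def scaled_lr(base_lr, base_batch, global_batch):
    # Linear scaling rule: multiply the learning rate by the batch-size ratio.
    return base_lr * global_batch / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    # Gradual warmup: ramp linearly from a small LR up to the target.
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=8192)  # -> 3.2
print(warmup_lr(target, step=0, warmup_steps=500), target)
```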
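ZeRO-style sharding is available in PyTorch through FSDP. A minimal sketch, assuming a single-node launch via torchrun so the process-group environment variables are set; the toy model and optimizer settings are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Each rank keeps only a shard of parameters, gradients, and optimizer state,
# in the spirit of ZeRO stage 3; full parameters are gathered on demand.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)                     # shards parameters across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()                         # gradients are reduced to their owning shards
optimizer.step()
```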
Further Reading
PyTorch memory management internals
PyTorch blog: Understanding GPU Memory (https://pytorch.org/blog/understanding-gpu-memory-1/)
Deep dive into the caching allocator, memory snapshots, and the memory visualization tools introduced in PyTorch 2.1.
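The snapshot workflow described in that post can be tried in a few lines. A sketch using the underscore-prefixed (and therefore subject-to-change) memory-history API available in recent PyTorch releases; the model and file name are illustrative:

```python
import torch

# Start recording allocation events (cap the history so it stays bounded).
torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(4096, 4096).cuda()
out = model(torch.randn(512, 4096, device="cuda"))
out.sum().backward()

# Dump a pickle that can be loaded into the memory visualizer at
# https://pytorch.org/memory_viz, then stop recording.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```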
Advanced mixed precision techniques
NVIDIA Deep Learning Performance Guide (https://docs.nvidia.com/deeplearning/performance/)
Practical guidelines for maximizing Tensor Core utilization, including dimension padding, memory alignment, and kernel fusion.
Scaling distributed training
DeepSpeed documentation (https://www.deepspeed.ai/)
Microsoft's library for efficient distributed training with ZeRO optimizer, pipeline parallelism, and 3D parallelism strategies for trillion-parameter models.
GPU profiling with Nsight
NVIDIA Nsight Systems User Guide (https://docs.nvidia.com/nsight-systems/)
The definitive tool for GPU profiling, showing kernel timelines, memory transfers, and CPU-GPU synchronization points.
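One way to make a PyTorch training loop readable on the Nsight Systems timeline is to wrap its phases in NVTX ranges. A sketch; the `nsys profile -o trace python train.py` invocation and script name are illustrative:

```python
import torch

model = torch.nn.Linear(2048, 2048).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(256, 2048, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push(f"step {step}")   # named region on the timeline

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).sum()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()
```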