References & Further Reading

References

  1. P. Micikevicius, S. Narang, J. Alben, et al., Mixed Precision Training, ICLR, 2018

    The foundational paper on training neural networks with FP16 and FP32 accumulation. Introduces loss scaling and demonstrates no accuracy loss across a wide range of architectures.
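    The loss-scaling idea can be sketched in a few lines of plain Python: multiply the loss by a large factor so small FP16 gradients do not flush to zero, unscale before the optimizer step, and skip the step when an overflow is detected. This is an illustrative toy (the function name and the grow/back-off constants are hypothetical; real implementations such as GradScaler grow the scale only after a run of successful steps):

    ```python
    import math

    def dynamic_loss_scale_step(grads, scale, growth=2.0, backoff=0.5):
        """Unscale gradients; skip the step and back off on overflow.

        `grads` are gradients computed from a loss already multiplied by
        `scale`, so dividing recovers the true gradient values.
        """
        unscaled = [g / scale for g in grads]
        if any(math.isinf(g) or math.isnan(g) for g in unscaled):
            # Overflow: discard this step and halve the scale.
            return None, scale * backoff
        # Success: use the unscaled gradients; grow the scale
        # (in practice only after many consecutive successful steps).
        return unscaled, scale * growth

    grads, scale = dynamic_loss_scale_step([1024.0, 2048.0], scale=1024.0)
    print(grads, scale)  # [1.0, 2.0] 2048.0
    ```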

  2. T. Chen, B. Xu, C. Zhang, and C. Guestrin, Training Deep Nets with Sublinear Memory Cost, arXiv:1604.06174, 2016

    Introduces gradient checkpointing (trading compute for memory). Shows how to reduce memory from O(L) to O(sqrt(L)) for sequential networks with L layers.
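    The O(sqrt(L)) result follows from a simple trade-off: split L layers into k segments, store only the k segment boundaries, and recompute at most L/k activations inside one segment during backprop. Minimizing k + L/k gives k = sqrt(L). A back-of-envelope check (the function is illustrative, not from the paper):

    ```python
    import math

    def checkpointed_activation_count(L: int) -> int:
        """Activations held in memory with sqrt(L) checkpoint segments."""
        k = max(1, round(math.sqrt(L)))
        return k + math.ceil(L / k)

    # 100 layers: store ~20 activations instead of 100.
    print(checkpointed_activation_count(100))  # 20
    ```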

  3. S. Li, Y. Zhao, R. Varma, et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training, VLDB, 2020

    The official PyTorch DDP paper. Describes the bucketed all-reduce strategy, gradient computation/communication overlap, and scaling benchmarks up to 256 GPUs.
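    The bucketing strategy can be sketched without any distributed machinery: gradients are greedily grouped into fixed-capacity buckets (25 MB by default in PyTorch DDP) so that each all-reduce moves one large contiguous chunk and can be launched while backprop is still producing later gradients. The sizes below are illustrative:

    ```python
    def bucket_gradients(param_sizes_bytes, bucket_cap=25 * 1024 * 1024):
        """Greedily group gradient indices into buckets under a byte cap."""
        buckets, current, current_size = [], [], 0
        for i, size in enumerate(param_sizes_bytes):
            if current and current_size + size > bucket_cap:
                buckets.append(current)       # bucket full: would trigger one all-reduce
                current, current_size = [], 0
            current.append(i)
            current_size += size
        if current:
            buckets.append(current)
        return buckets

    # Two 10 MB gradients share a bucket; the 20 MB gradient gets its own.
    print(bucket_gradients([10 * 2**20, 10 * 2**20, 20 * 2**20]))  # [[0, 1], [2]]
    ```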

  4. P. Goyal, P. Dollár, R. Girshick, et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, arXiv:1706.02677, 2017

    Establishes the linear scaling rule for distributed training and the gradual warmup strategy. Trained ResNet-50 on ImageNet in 1 hour using 256 GPUs.
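    The recipe amounts to two rules: scale the learning rate linearly with the global batch size relative to the 256-image baseline, and ramp up to that rate over the first few epochs rather than applying it immediately. A minimal sketch (the function and its linear-interpolation warmup are an assumption matching the paper's description, with base_lr=0.1 as in its ResNet-50 setup):

    ```python
    def scaled_lr(base_lr, batch_size, epoch, base_batch=256, warmup_epochs=5):
        """Linear scaling rule with gradual warmup over the first epochs."""
        target = base_lr * batch_size / base_batch
        if epoch < warmup_epochs:
            # Gradual warmup: interpolate from base_lr up to the target rate.
            return base_lr + (target - base_lr) * epoch / warmup_epochs
        return target

    # 8192-image batches: 32x the baseline batch -> 32x the learning rate.
    print(scaled_lr(0.1, 8192, epoch=5))  # 3.2
    ```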

  5. PyTorch Contributors, Automatic Mixed Precision (torch.amp), 2024

    Official PyTorch documentation for AMP, autocast, and GradScaler. Includes the eligibility list of operations that autocast handles.

  6. NVIDIA, NVIDIA A100 Tensor Core GPU Architecture, NVIDIA Whitepaper, 2020

    Technical description of the A100 GPU architecture, including Tensor Core specifications, memory hierarchy, and TF32 mode.

  7. S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, SC20, 2020

    Introduces ZeRO (Zero Redundancy Optimizer) for sharding optimizer states, gradients, and parameters across GPUs. The basis for PyTorch FSDP.
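    The paper's memory model makes the payoff concrete: mixed-precision Adam holds roughly 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for the FP32 master weights, momentum, and variance), and ZeRO stage 3 shards all of it across the data-parallel group. A rough estimate, assuming these byte counts (the function name is illustrative):

    ```python
    def zero_stage3_bytes_per_gpu(num_params, num_gpus):
        """Per-GPU memory for model state under full ZeRO (stage 3) sharding."""
        fp16_weights, fp16_grads, fp32_optimizer_state = 2, 2, 12  # bytes/param
        return (fp16_weights + fp16_grads + fp32_optimizer_state) * num_params / num_gpus

    params = 7.5e9  # a 7.5B-parameter model
    # Unsharded: ~112 GiB of model state; across 64 GPUs: ~1.75 GiB each.
    print(zero_stage3_bytes_per_gpu(params, 64) / 2**30)
    ```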

Further Reading

  • PyTorch memory management internals

    PyTorch blog: Understanding CUDA Memory Usage (https://pytorch.org/blog/understanding-gpu-memory-1/)

    Deep dive into the caching allocator, memory snapshots, and the memory visualization tools introduced in PyTorch 2.1.

  • Advanced mixed precision techniques

    NVIDIA Deep Learning Performance Guide (https://docs.nvidia.com/deeplearning/performance/)

    Practical guidelines for maximizing Tensor Core utilization, including dimension padding, memory alignment, and kernel fusion.

  • Scaling distributed training

    DeepSpeed documentation (https://www.deepspeed.ai/)

    Microsoft's library for efficient distributed training with ZeRO optimizer, pipeline parallelism, and 3D parallelism strategies for trillion-parameter models.

  • GPU profiling with Nsight

    NVIDIA Nsight Systems User Guide (https://docs.nvidia.com/nsight-systems/)

    User guide for NVIDIA's system-wide profiler, which captures kernel timelines, memory transfers, and CPU-GPU synchronization points.