References & Further Reading

References

  1. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016

    The comprehensive deep learning textbook. Chapters 6-8 cover feed-forward networks, regularization, and optimization in depth.

  2. A. Paszke et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library, NeurIPS, 2019

    The original PyTorch paper describing the design philosophy, autograd engine, and performance characteristics.

  3. K. He, X. Zhang, S. Ren, and J. Sun, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015

    Introduces Kaiming initialization for ReLU networks, showing that proper initialization is critical for training very deep networks; a short PyTorch sketch after this list shows it in use.

  4. D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, ICLR, 2015

    The Adam optimizer paper, which combines adaptive per-parameter learning rates with momentum.

  5. I. Loshchilov and F. Hutter, Decoupled Weight Decay Regularization, ICLR, 2019

    Introduces AdamW, which fixes the interaction between weight decay and adaptive learning rates in Adam by decoupling the decay from the gradient-based update; see the sketch after this list.

  6. T. O'Shea and J. Hoydis, An Introduction to Deep Learning for the Physical Layer, IEEE Trans. Cognitive Communications and Networking, 2017

    Pioneering work on end-to-end learning of communication systems using neural network autoencoders; a toy sketch of the idea also follows this list.
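
  As a brief illustration of how references 3 through 5 show up in everyday PyTorch code, the sketch below applies Kaiming (He) initialization to the linear layers of a small ReLU network and optimizes it with AdamW's decoupled weight decay. The layer sizes and hyperparameters are illustrative placeholders, not values taken from the papers.

      import torch
      import torch.nn as nn

      # A small ReLU MLP; the layer sizes are arbitrary placeholders.
      model = nn.Sequential(
          nn.Linear(128, 256), nn.ReLU(),
          nn.Linear(256, 256), nn.ReLU(),
          nn.Linear(256, 10),
      )

      # Kaiming (He) initialization for the ReLU layers (ref. 3).
      for module in model.modules():
          if isinstance(module, nn.Linear):
              nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
              nn.init.zeros_(module.bias)

      # AdamW: Adam-style adaptive steps with decoupled weight decay (refs. 4 and 5).
      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)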
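
  Reference 6 treats the whole transmit-channel-receive chain as an autoencoder trained end to end. The toy sketch below is a minimal illustration of that idea, not the architecture from the paper: the message size, layer widths, and noise level are placeholders, and the channel is a simple AWGN model.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      K, N = 4, 8        # placeholder: 2**K messages sent over N real channel uses
      M = 2 ** K

      class CommAutoencoder(nn.Module):
          def __init__(self):
              super().__init__()
              self.encoder = nn.Sequential(nn.Linear(M, 64), nn.ReLU(), nn.Linear(64, N))
              self.decoder = nn.Sequential(nn.Linear(N, 64), nn.ReLU(), nn.Linear(64, M))

          def forward(self, msg_onehot, noise_std=0.1):
              x = self.encoder(msg_onehot)
              # Normalize to unit average power before "transmission".
              x = x / x.pow(2).mean(dim=1, keepdim=True).sqrt()
              y = x + noise_std * torch.randn_like(x)   # AWGN channel
              return self.decoder(y)                    # logits over the M messages

      model = CommAutoencoder()
      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
      msgs = torch.randint(0, M, (256,))                # random message indices
      logits = model(F.one_hot(msgs, M).float())
      loss = F.cross_entropy(logits, msgs)              # learn to recover the sent message
      loss.backward()
      optimizer.step()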

Further Reading

  • PyTorch official tutorials

    https://pytorch.org/tutorials/

    Hands-on tutorials covering all aspects of PyTorch.

  • Andrej Karpathy — A Recipe for Training Neural Networks

    https://karpathy.github.io/2019/04/25/recipe/

    Practical debugging and training advice from an expert practitioner.

  • torch.compile and PyTorch 2.0

    https://pytorch.org/get-started/pytorch-2.0/

    Graph-mode compilation for 2x+ speedups with a one-line change; a minimal usage sketch follows.
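
  A minimal sketch of that one-line change, assuming PyTorch 2.x is installed; the tiny model and input shapes here are just placeholders.

      import torch

      model = torch.nn.Linear(128, 10)           # any nn.Module or plain function works
      compiled = torch.compile(model)            # compiles on first call (PyTorch 2.x)
      out = compiled(torch.randn(32, 128))       # later calls reuse the compiled graph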