References & Further Reading

References

  1. A. Vaswani et al., Attention Is All You Need, NeurIPS, 2017

    Introduced the transformer architecture; the foundation of modern NLP and beyond.

  2. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR, 2021

    Vision Transformer (ViT): applying transformers to image classification.

  3. D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2015

    Introduced the attention mechanism for sequence-to-sequence models.

  4. T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022

    IO-aware exact attention algorithm that reduces memory usage from quadratic to linear in sequence length.

  5. J. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv preprint, 2021

    Introduced rotary position embeddings (RoPE), now widely used in modern LLMs.

Further Reading

  • The Illustrated Transformer

    Jay Alammar's blog (jalammar.github.io)

    A widely recommended visual explanation of the transformer architecture.

  • Efficient Transformers survey

    Y. Tay et al., Efficient Transformers: A Survey, ACM Computing Surveys, 2022

    Comprehensive overview of linear attention and sparse attention variants.