References & Further Reading
References
- A. Vaswani et al., Attention Is All You Need, NeurIPS, 2017
The original transformer paper; the foundation of modern NLP and much of deep learning beyond it (a minimal sketch of its attention mechanism follows this list).
- A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR, 2021
Vision Transformer (ViT): applying transformers to image classification.
- D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2015
Introduced the attention mechanism for sequence-to-sequence models.
- T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022
IO-aware exact attention algorithm; by tiling and never materializing the full attention matrix, it cuts memory from quadratic to linear in sequence length.
- J. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv, 2021
Introduced rotary position embeddings (RoPE), widely used in modern LLMs.
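To make the mechanism at the heart of these papers concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined by Vaswani et al. The function name and toy shapes are illustrative, not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity scores, scaled so softmax gradients stay well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax; subtracting the row max adds numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy self-attention over a length-4 sequence of 8-dim vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```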
Further Reading
- The Illustrated Transformer, Jay Alammar (jalammar.github.io)
A widely recommended visual explanation of the transformer architecture.
- Y. Tay et al., Efficient Transformers: A Survey, ACM Computing Surveys, 2022
Comprehensive overview of linear-attention and sparse-attention variants (a toy sketch of linear attention follows).
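As a taste of the "linear attention" family the survey covers, here is a toy kernelized-attention sketch in the style of Katharopoulos et al. (2020): replacing the softmax with a positive feature map phi lets K^T V be computed once and reused for every query, dropping cost from O(n^2 d) to O(n d^2). The elu(x)+1 feature map and all names below are illustrative assumptions, not code from the survey.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized ('linear') attention: the softmax is replaced by a
    positive feature map phi, so K^T V is computed once and shared
    across queries, giving O(n * d^2) cost instead of O(n^2 * d)."""
    # elu(x) + 1, written branch-free; keeps features strictly positive.
    phi = lambda t: np.maximum(t, 0.0) + np.exp(np.minimum(t, 0.0))
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                   # (d, d_v), shared by all queries
    z = Qp @ Kp.sum(axis=0) + eps   # per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(linear_attention(x, x, x).shape)  # (4, 8)
```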