References & Further Reading

References

  1. A. Vaswani et al., Attention Is All You Need, NeurIPS, 2017

    Introduced the transformer architecture; the foundation of modern NLP and beyond.

  2. A. Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR, 2021

    Vision Transformer (ViT): applying transformers to image classification.

  3. D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2015

    Introduced the attention mechanism for sequence-to-sequence models.

  4. T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022

    IO-aware exact attention algorithm that reduces memory usage from quadratic to linear in sequence length.

  5. J. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv preprint, 2021

    Introduced rotary position embeddings (RoPE), now widely used in modern LLMs.

Further Reading

  • The Illustrated Transformer

    Jay Alammar's blog (jalammar.github.io)

    A widely recommended visual explanation of the transformer architecture.

  • Efficient Transformers survey

    Y. Tay et al., Efficient Transformers: A Survey, ACM Computing Surveys, 2022

    Comprehensive overview of linear attention and sparse attention variants.