References & Further Reading

References

  1. T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781, 2013

    The original Word2Vec paper, introducing the Skip-gram and CBOW models. Demonstrates that simple neural networks trained on large corpora learn embeddings with linear analogy properties, sketched below.
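
    As a toy illustration of that property, a numpy sketch with hypothetical 3-d vectors standing in for learned embeddings (real Word2Vec vectors typically have 100-300 dimensions):

      import numpy as np

      # Hypothetical toy embeddings; real vectors come from training on a corpus.
      emb = {
          "king":  np.array([0.8, 0.7, 0.1]),
          "man":   np.array([0.6, 0.1, 0.1]),
          "woman": np.array([0.6, 0.1, 0.9]),
          "queen": np.array([0.8, 0.7, 0.9]),
      }

      def cosine(a, b):
          return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

      # The celebrated analogy: king - man + woman is closest to queen.
      # (In practice the three query words are excluded from the candidates.)
      target = emb["king"] - emb["man"] + emb["woman"]
      print(max(emb, key=lambda w: cosine(emb[w], target)))  # -> queen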

  2. J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014

    Introduces GloVe embeddings that combine global co-occurrence statistics with local context window training.
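
    The paper's objective is a weighted least-squares fit of word-vector dot products to log co-occurrence counts:

    $$J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}$$

    where $X_{ij}$ counts how often word $j$ appears in the context of word $i$, and the weighting function $f$ damps rare pairs and caps very frequent ones.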

  3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need, NeurIPS, 2017

    The foundational Transformer paper. Introduces multi-head self-attention, positional encoding, and the original encoder-decoder architecture; modern LLMs are built on variants of this design, most of them decoder-only.
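
    A numpy sketch of the paper's scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, for a single head; multi-head attention runs several of these in parallel over learned projections of the input:

      import numpy as np

      def scaled_dot_product_attention(Q, K, V):
          # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
          d_k = Q.shape[-1]
          scores = Q @ K.T / np.sqrt(d_k)                 # similarity logits
          scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
          weights = np.exp(scores)
          weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
          return weights @ V                              # (T_q, d_v)

      rng = np.random.default_rng(0)
      Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
      print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)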

  4. R. Sennrich, B. Haddow, and A. Birch, Neural Machine Translation of Rare Words with Subword Units, ACL, 2016

    Introduces BPE for neural machine translation, which became the standard tokenization approach for transformer models.
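
    A minimal sketch of the BPE learning loop in the spirit of the paper: start from characters and repeatedly merge the most frequent adjacent symbol pair (the example word counts are hypothetical; ties break by first occurrence):

      from collections import Counter

      def learn_bpe(word_counts, num_merges):
          """Learn merge rules from a {word: count} dict."""
          # Represent each word as a tuple of symbols plus an end-of-word marker.
          vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
          merges = []
          for _ in range(num_merges):
              pairs = Counter()
              for symbols, count in vocab.items():
                  for pair in zip(symbols, symbols[1:]):
                      pairs[pair] += count
              if not pairs:
                  break
              best = max(pairs, key=pairs.get)  # most frequent adjacent pair
              merges.append(best)
              new_vocab = {}
              for symbols, count in vocab.items():
                  out, i = [], 0
                  while i < len(symbols):
                      if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                          out.append(symbols[i] + symbols[i + 1])
                          i += 2
                      else:
                          out.append(symbols[i])
                          i += 1
                  new_vocab[tuple(out)] = count
              vocab = new_vocab
          return merges

      # Word counts from a toy corpus (hypothetical figures).
      print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
      # -> [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]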

  5. D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd edition draft, 2024

    The standard NLP textbook covering n-gram models, embeddings, transformers, and modern language models.

  6. T. Kudo and J. Richardson, SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing, EMNLP, 2018

    SentencePiece implements BPE and unigram language model tokenization in a language-agnostic way.
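
    A short usage sketch, assuming the sentencepiece Python package and a hypothetical one-sentence-per-line training file corpus.txt:

      import sentencepiece as spm

      # Train a unigram-LM tokenizer directly on raw text (no pre-tokenization);
      # model_type="bpe" selects BPE instead. corpus.txt is a hypothetical file.
      spm.SentencePieceTrainer.train(
          input="corpus.txt", model_prefix="sp", vocab_size=8000, model_type="unigram"
      )

      sp = spm.SentencePieceProcessor(model_file="sp.model")
      print(sp.encode("Language-independent subword tokenization.", out_type=str))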

Further Reading

  • Contextual embeddings

    J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, 2019

    BERT introduced contextual embeddings, where each word's representation depends on its surrounding sentence, addressing the polysemy problem of static embeddings (one fixed vector per word type).
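
    A sketch of that contrast, assuming the HuggingFace transformers package and PyTorch (model and sentences are illustrative): the same surface word "bank" receives different vectors in different contexts.

      import torch
      from transformers import AutoModel, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModel.from_pretrained("bert-base-uncased")

      def embed_word(sentence, word):
          # Return the contextual vector of `word` within `sentence`.
          enc = tok(sentence, return_tensors="pt")
          with torch.no_grad():
              hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
          idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
          return hidden[idx]

      a = embed_word("She sat on the river bank.", "bank")
      b = embed_word("He deposited cash at the bank.", "bank")
      print(torch.cosine_similarity(a, b, dim=0))  # noticeably below 1.0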

  • Efficient attention

    T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022

    FlashAttention reduces the memory footprint of exact attention from $O(T^2)$ to $O(T)$ in the sequence length $T$ through kernel fusion and tiling, enabling much longer sequences in practice.
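
    FlashAttention itself is a fused GPU kernel, but the enabling trick, an online softmax that rescales running statistics as each key/value block streams through, can be sketched in numpy for a single query row (block size here is arbitrary):

      import numpy as np

      def blockwise_attention(q, K, V, block=64):
          """Exact attention for one query, never holding all T logits at once."""
          d = q.shape[0]
          m, l = -np.inf, 0.0                  # running max and softmax denominator
          acc = np.zeros(V.shape[1])           # running weighted sum of values
          for s0 in range(0, K.shape[0], block):
              s = K[s0:s0 + block] @ q / np.sqrt(d)        # one block of logits
              m_new = max(m, s.max())
              scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
              p = np.exp(s - m_new)
              l = l * scale + p.sum()
              acc = acc * scale + p @ V[s0:s0 + block]     # rescale, then accumulate
              m = m_new
          return acc / l

      rng = np.random.default_rng(0)
      q, K, V = rng.normal(size=16), rng.normal(size=(256, 16)), rng.normal(size=(256, 16))
      logits = K @ q / np.sqrt(q.size)
      w = np.exp(logits - logits.max()); w /= w.sum()
      assert np.allclose(blockwise_attention(q, K, V), w @ V)  # matches plain attention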

  • Embeddings for scientific text

    I. Beltagy et al., SciBERT: A Pretrained Language Model for Scientific Text, EMNLP, 2019

    Demonstrates that domain-specific pre-training on scientific text improves performance on downstream NLP tasks in scientific domains.

  • Tokenization best practices

    HuggingFace Tokenizers Library Documentation (https://huggingface.co/docs/tokenizers)

    Practical guide to training and using BPE, WordPiece, and Unigram tokenizers with the modern HuggingFace ecosystem.
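
    A minimal training sketch following the library's quicktour, with a hypothetical corpus file corpus.txt:

      from tokenizers import Tokenizer
      from tokenizers.models import BPE
      from tokenizers.pre_tokenizers import Whitespace
      from tokenizers.trainers import BpeTrainer

      # Byte-pair encoding model with whitespace pre-tokenization.
      tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
      tokenizer.pre_tokenizer = Whitespace()

      trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])
      tokenizer.train(["corpus.txt"], trainer)   # hypothetical training file

      enc = tokenizer.encode("Tokenization determines what a model can see.")
      print(enc.tokens)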