References & Further Reading

References

  1. T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781, 2013

    The original Word2Vec paper, introducing the Skip-gram and CBOW models. Demonstrates that simple neural networks trained on large corpora learn embeddings with linear analogy properties, sketched below.
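
    As a toy illustration of that property, a numpy sketch with hypothetical 3-d vectors standing in for learned embeddings (real Word2Vec vectors typically have 100-300 dimensions):

      import numpy as np

      # Hypothetical toy embeddings; real vectors come from training on a corpus.
      emb = {
          "king":  np.array([0.8, 0.7, 0.1]),
          "man":   np.array([0.6, 0.1, 0.1]),
          "woman": np.array([0.6, 0.1, 0.9]),
          "queen": np.array([0.8, 0.7, 0.9]),
      }

      def cosine(a, b):
          return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

      # The celebrated analogy: king - man + woman is closest to queen.
      # (In practice the three query words are excluded from the candidates.)
      target = emb["king"] - emb["man"] + emb["woman"]
      print(max(emb, key=lambda w: cosine(emb[w], target)))  # -> queen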

  2. J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014

    Introduces GloVe embeddings that combine global co-occurrence statistics with local context window training.
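
    The paper's objective is a weighted least-squares fit of word-vector dot products to log co-occurrence counts:

    $$J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}$$

    where $X_{ij}$ counts how often word $j$ appears in the context of word $i$, and the weighting function $f$ damps rare pairs and caps very frequent ones.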

  3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need, NeurIPS, 2017

    The foundational Transformer paper. Introduces multi-head self-attention, positional encoding, and the original encoder-decoder architecture; modern LLMs are built on variants of this design, most of them decoder-only.
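
    A numpy sketch of the paper's scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, for a single head; multi-head attention runs several of these in parallel over learned projections of the input:

      import numpy as np

      def scaled_dot_product_attention(Q, K, V):
          # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
          d_k = Q.shape[-1]
          scores = Q @ K.T / np.sqrt(d_k)                 # similarity logits
          scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
          weights = np.exp(scores)
          weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
          return weights @ V                              # (T_q, d_v)

      rng = np.random.default_rng(0)
      Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
      print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)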

  4. R. Sennrich, B. Haddow, and A. Birch, Neural Machine Translation of Rare Words with Subword Units, ACL, 2016

    Introduces BPE for neural machine translation, which became the standard tokenization approach for transformer models.
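
    A minimal sketch of the BPE learning loop in the spirit of the paper: start from characters and repeatedly merge the most frequent adjacent symbol pair (the example word counts are hypothetical; ties break by first occurrence):

      from collections import Counter

      def learn_bpe(word_counts, num_merges):
          """Learn merge rules from a {word: count} dict."""
          # Represent each word as a tuple of symbols plus an end-of-word marker.
          vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
          merges = []
          for _ in range(num_merges):
              pairs = Counter()
              for symbols, count in vocab.items():
                  for pair in zip(symbols, symbols[1:]):
                      pairs[pair] += count
              if not pairs:
                  break
              best = max(pairs, key=pairs.get)  # most frequent adjacent pair
              merges.append(best)
              new_vocab = {}
              for symbols, count in vocab.items():
                  out, i = [], 0
                  while i < len(symbols):
                      if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                          out.append(symbols[i] + symbols[i + 1])
                          i += 2
                      else:
                          out.append(symbols[i])
                          i += 1
                  new_vocab[tuple(out)] = count
              vocab = new_vocab
          return merges

      # Word counts from a toy corpus (hypothetical figures).
      print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 5))
      # -> [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]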

  5. D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd edition draft, 2024

    The standard NLP textbook covering n-gram models, embeddings, transformers, and modern language models.

  6. T. Kudo and J. Richardson, SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing, EMNLP, 2018

    SentencePiece implements BPE and unigram language model tokenization in a language-agnostic way.
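
    A short usage sketch, assuming the sentencepiece Python package and a hypothetical one-sentence-per-line training file corpus.txt:

      import sentencepiece as spm

      # Train a unigram-LM tokenizer directly on raw text (no pre-tokenization);
      # model_type="bpe" selects BPE instead. corpus.txt is a hypothetical file.
      spm.SentencePieceTrainer.train(
          input="corpus.txt", model_prefix="sp", vocab_size=8000, model_type="unigram"
      )

      sp = spm.SentencePieceProcessor(model_file="sp.model")
      print(sp.encode("Language-independent subword tokenization.", out_type=str))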

Further Reading

  • Contextual embeddings

    J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL, 2019

    BERT introduced contextual embeddings, where each word's representation depends on its surrounding sentence, addressing the polysemy problem of static embeddings (one fixed vector per word type).
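
    A sketch of that contrast, assuming the HuggingFace transformers package and PyTorch (model and sentences are illustrative): the same surface word "bank" receives different vectors in different contexts.

      import torch
      from transformers import AutoModel, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("bert-base-uncased")
      model = AutoModel.from_pretrained("bert-base-uncased")

      def embed_word(sentence, word):
          # Return the contextual vector of `word` within `sentence`.
          enc = tok(sentence, return_tensors="pt")
          with torch.no_grad():
              hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
          idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
          return hidden[idx]

      a = embed_word("She sat on the river bank.", "bank")
      b = embed_word("He deposited cash at the bank.", "bank")
      print(torch.cosine_similarity(a, b, dim=0))  # noticeably below 1.0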

  • Efficient attention

    T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022

    FlashAttention reduces the memory footprint of exact attention from $O(T^2)$ to $O(T)$ in the sequence length $T$ through kernel fusion and tiling, enabling much longer sequences in practice.
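
    FlashAttention itself is a fused GPU kernel, but the enabling trick, an online softmax that rescales running statistics as each key/value block streams through, can be sketched in numpy for a single query row (block size here is arbitrary):

      import numpy as np

      def blockwise_attention(q, K, V, block=64):
          """Exact attention for one query, never holding all T logits at once."""
          d = q.shape[0]
          m, l = -np.inf, 0.0                  # running max and softmax denominator
          acc = np.zeros(V.shape[1])           # running weighted sum of values
          for s0 in range(0, K.shape[0], block):
              s = K[s0:s0 + block] @ q / np.sqrt(d)        # one block of logits
              m_new = max(m, s.max())
              scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
              p = np.exp(s - m_new)
              l = l * scale + p.sum()
              acc = acc * scale + p @ V[s0:s0 + block]     # rescale, then accumulate
              m = m_new
          return acc / l

      rng = np.random.default_rng(0)
      q, K, V = rng.normal(size=16), rng.normal(size=(256, 16)), rng.normal(size=(256, 16))
      logits = K @ q / np.sqrt(q.size)
      w = np.exp(logits - logits.max()); w /= w.sum()
      assert np.allclose(blockwise_attention(q, K, V), w @ V)  # matches plain attention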

  • Embeddings for scientific text

    I. Beltagy et al., SciBERT: A Pretrained Language Model for Scientific Text, EMNLP, 2019

    Demonstrates that domain-specific pre-training on scientific text improves performance on downstream NLP tasks in scientific domains.

  • Tokenization best practices

    HuggingFace Tokenizers Library Documentation (https://huggingface.co/docs/tokenizers)

    Practical guide to training and using BPE, WordPiece, and Unigram tokenizers with the modern HuggingFace ecosystem.
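
    A minimal training sketch following the library's quicktour, with a hypothetical corpus file corpus.txt:

      from tokenizers import Tokenizer
      from tokenizers.models import BPE
      from tokenizers.pre_tokenizers import Whitespace
      from tokenizers.trainers import BpeTrainer

      # Byte-pair encoding model with whitespace pre-tokenization.
      tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
      tokenizer.pre_tokenizer = Whitespace()

      trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])
      tokenizer.train(["corpus.txt"], trainer)   # hypothetical training file

      enc = tokenizer.encode("Tokenization determines what a model can see.")
      print(enc.tokens)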