Chapter Summary
Key Points
1. Tokenization is the first design choice. Subword tokenization (BPE, SentencePiece) balances vocabulary size against sequence length and handles out-of-vocabulary (OOV) words gracefully. GPT-family models use byte-level BPE with a vocabulary of roughly 50K tokens (50,257 in GPT-2). Always use the pre-trained tokenizer matched to your model (a toy BPE merge is sketched after this list).
2. Dense embeddings capture semantic relationships. Word2Vec, GloVe, and FastText map tokens to continuous vectors in which semantic similarity corresponds to geometric proximity (see the cosine-similarity sketch after this list). Embedding dimensions on the order of 100-300 provide a good trade-off between expressiveness and compute.
3. Language models predict the next token. The autoregressive factorization $p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$ is the foundation of modern LLMs. Perplexity, the exponential of the average negative log-likelihood per token, is the standard evaluation metric (computed in a sketch after this list).
4. Transformers dominate modern NLP. Self-attention computes pairwise token affinities in $\mathcal{O}(n^2)$ time for a length-$n$ sequence, enabling parallel training and long-range dependency capture (a minimal causal-attention sketch follows this list). The causal mask enforces the autoregressive property during training.
5. The NLP pipeline connects to wireless. Embeddings enable semantic communication, where meaning (not bits) is transmitted. Attention mechanisms inspire MIMO detectors. Language models drive code generation and literature review for telecom research.
Looking Ahead
Chapter 35 scales these foundations to large language models: GPT architecture details, pre-training at scale, RLHF alignment, and the zoo of model families (BERT, GPT, T5, LLaMA).