Chapter Summary
Key Points
1. Tokenization is the first design choice. Subword tokenization (BPE, SentencePiece) balances vocabulary size against sequence length and handles out-of-vocabulary (OOV) words gracefully. GPT-family models use byte-level BPE with a vocabulary of roughly 50K tokens (50,257 in GPT-2). Always use the pre-trained tokenizer matched to your model (a toy BPE merge is sketched after this list).
2. Dense embeddings capture semantic relationships. Word2Vec, GloVe, and FastText map tokens to continuous vectors in which semantic similarity corresponds to geometric proximity (see the cosine-similarity sketch after this list). Embedding dimensions on the order of 100-300 provide a good trade-off between expressiveness and compute.
3. Language models predict the next token. The autoregressive factorization $p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$ is the foundation of modern LLMs. Perplexity, the exponential of the average negative log-likelihood per token, is the standard evaluation metric (computed in a sketch after this list).
4. Transformers dominate modern NLP. Self-attention computes pairwise token affinities in $\mathcal{O}(n^2)$ time for a length-$n$ sequence, enabling parallel training and long-range dependency capture (a minimal causal-attention sketch follows this list). The causal mask enforces the autoregressive property during training.
5. The NLP pipeline connects to wireless. Embeddings enable semantic communication, where meaning (not bits) is transmitted. Attention mechanisms inspire MIMO detectors. Language models drive code generation and literature review for telecom research.
Looking Ahead
Chapter 35 scales these foundations to large language models: GPT architecture details, pre-training at scale, RLHF alignment, and the zoo of model families (BERT, GPT, T5, LLaMA).