Chapter Summary

Key Points

  1. GPT is a stack of decoder-only transformer blocks. Each block applies causal multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization. The parameter count scales as N ≈ 12Ld² + Vd, where L is the number of layers, d the hidden dimension, and V the vocabulary size. Modern variants use RoPE positional encoding, grouped-query attention (GQA), and SwiGLU activations.
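
The parameter-count approximation above can be sketched as a short helper. The breakdown into attention and feed-forward terms follows the standard decoder-only layout; the specific shape values below are illustrative, not tied to any particular checkpoint.

```python
def gpt_param_count(L, d, V):
    """Rough decoder-only transformer parameter count: N ≈ 12*L*d^2 + V*d.

    Each block contributes ~12*d^2 parameters: 4*d^2 from the attention
    projections (Q, K, V, output) and 8*d^2 from a feed-forward network
    with hidden width 4*d (up- and down-projections). The V*d term is
    the token embedding matrix (often tied with the output head).
    Biases and layer-norm parameters are ignored as lower-order terms.
    """
    return 12 * L * d**2 + V * d

# GPT-2-small-like shape: 12 layers, d=768, ~50k vocabulary
n = gpt_param_count(L=12, d=768, V=50257)
print(f"{n / 1e6:.0f}M parameters")  # prints "124M parameters"
```

The estimate lands within a percent of GPT-2 small's reported 124M parameters, which is why the leading-order formula is good enough for back-of-envelope sizing.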

  2. Scaling laws are predictive. Loss falls as a power law in model size, data, and compute. Chinchilla-optimal training uses D* ≈ 20N training tokens, i.e. roughly 20 tokens per parameter. Many modern models are deliberately trained well past this point, because inference cost dominates total lifetime cost.
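
The compute-optimal sizing rule can be turned into a small calculator. This sketch assumes the common approximations C ≈ 6ND for training FLOPs and D ≈ 20N for the Chinchilla-optimal token count; the 1e23-FLOP budget below is just an illustrative input.

```python
import math

def chinchilla_optimal(C):
    """Given a training-compute budget C (in FLOPs), return the
    Chinchilla-optimal parameter count N and token count D.

    Combines C ≈ 6*N*D with D ≈ 20*N, so C ≈ 120*N^2 and
    N = sqrt(C / 120), D = 20*N.
    """
    N = math.sqrt(C / 120)
    D = 20 * N
    return N, D

N, D = chinchilla_optimal(1e23)
print(f"N ≈ {N / 1e9:.1f}B params, D ≈ {D / 1e9:.0f}B tokens")
# prints "N ≈ 28.9B params, D ≈ 577B tokens"
```

An "over-trained" model in the sense above is one where D is pushed far beyond 20N at fixed N, trading extra training compute for a smaller, cheaper-to-serve model.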

  3. RLHF and DPO align models with human preferences. The three-stage pipeline (supervised fine-tuning, reward modeling, PPO) or the simpler DPO alternative turns a raw language model into a helpful, harmless assistant. A KL penalty against the reference model prevents reward hacking.
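
The DPO objective mentioned above is compact enough to write out directly. This is a minimal scalar sketch for a single preference pair, assuming summed log-probabilities of whole responses are already available; `beta=0.1` is a typical but arbitrary choice.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l are the policy's summed log-probs of the chosen
    and rejected responses; ref_logp_* are the frozen reference
    model's. beta scales the implicit KL penalty that keeps the
    policy close to the reference.
    """
    # Margin: how much more the policy prefers the chosen response,
    # relative to the reference model's own preference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; increasing the margin on the chosen response drives the loss toward zero, which is the gradient signal that replaces the explicit reward model in RLHF.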

  4. Three architecture families serve different purposes: encoders (BERT) for understanding, decoders (GPT) for generation, and encoder-decoders (T5) for sequence-to-sequence tasks. The field is converging on decoder-only models for nearly all tasks.

  5. Open-weight models have democratized LLM research. LLaMA, Mistral, and others provide competitive alternatives to proprietary models, enabling domain-specific fine-tuning for telecom and imaging research.

Looking Ahead

Chapter 36 shows how to use these models programmatically via APIs, with practical techniques for prompt engineering, tool use, and retrieval-augmented generation.