Chapter Summary
Key Points
1. GPT is a stack of decoder-only transformer blocks. Each block applies causal multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization. The parameter count scales as roughly 12 · n_layers · d_model² (ignoring embeddings). Modern variants use RoPE, GQA, and SwiGLU activations.
2. Scaling laws are predictive. Loss follows power laws in model size, data, and compute. Chinchilla-optimal training uses roughly 20 tokens per parameter. Many modern models over-train past this optimum because inference cost dominates total cost.
3. RLHF/DPO aligns models with human preferences. The three-stage pipeline (SFT, reward model, PPO) or the simpler DPO alternative transforms a raw language model into a helpful, harmless assistant. The KL penalty against the reference model prevents reward hacking.
4. Three architecture families serve different purposes. Encoders (BERT) for understanding, decoders (GPT) for generation, and encoder-decoders (T5) for sequence-to-sequence tasks. The field is converging on decoder-only models for all tasks.
5. Open-weight models have democratized LLM research. LLaMA, Mistral, and others provide competitive alternatives to proprietary models, enabling domain-specific fine-tuning for telecom and imaging research.
Looking Ahead
Chapter 36 shows how to use these models programmatically via APIs, with practical techniques for prompt engineering, tool use, and retrieval-augmented generation.