Chapter Summary

Key Points

  1. GPT is a stack of decoder-only transformer blocks. Each block applies causal multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization. The parameter count scales as N ≈ 12Ld² + Vd, where L is the number of layers, d the hidden dimension, and V the vocabulary size. Modern variants use RoPE positional encoding, grouped-query attention (GQA), and SwiGLU activations.
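
The parameter-count approximation above can be sketched as a short helper. The breakdown into attention and feed-forward terms follows the standard decoder-only layout; the specific shape values below are illustrative, not tied to any particular checkpoint.

```python
def gpt_param_count(L, d, V):
    """Rough decoder-only transformer parameter count: N ≈ 12*L*d^2 + V*d.

    Each block contributes ~12*d^2 parameters: 4*d^2 from the attention
    projections (Q, K, V, output) and 8*d^2 from a feed-forward network
    with hidden width 4*d (up- and down-projections). The V*d term is
    the token embedding matrix (often tied with the output head).
    Biases and layer-norm parameters are ignored as lower-order terms.
    """
    return 12 * L * d**2 + V * d

# GPT-2-small-like shape: 12 layers, d=768, ~50k vocabulary
n = gpt_param_count(L=12, d=768, V=50257)
print(f"{n / 1e6:.0f}M parameters")  # prints "124M parameters"
```

The estimate lands within a percent of GPT-2 small's reported 124M parameters, which is why the leading-order formula is good enough for back-of-envelope sizing.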

  2. Scaling laws are predictive. Loss falls as a power law in model size, data, and compute. Chinchilla-optimal training uses D* ≈ 20N training tokens, i.e. roughly 20 tokens per parameter. Many modern models are deliberately trained well past this point, because inference cost dominates total lifetime cost.
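
The compute-optimal sizing rule can be turned into a small calculator. This sketch assumes the common approximations C ≈ 6ND for training FLOPs and D ≈ 20N for the Chinchilla-optimal token count; the 1e23-FLOP budget below is just an illustrative input.

```python
import math

def chinchilla_optimal(C):
    """Given a training-compute budget C (in FLOPs), return the
    Chinchilla-optimal parameter count N and token count D.

    Combines C ≈ 6*N*D with D ≈ 20*N, so C ≈ 120*N^2 and
    N = sqrt(C / 120), D = 20*N.
    """
    N = math.sqrt(C / 120)
    D = 20 * N
    return N, D

N, D = chinchilla_optimal(1e23)
print(f"N ≈ {N / 1e9:.1f}B params, D ≈ {D / 1e9:.0f}B tokens")
# prints "N ≈ 28.9B params, D ≈ 577B tokens"
```

An "over-trained" model in the sense above is one where D is pushed far beyond 20N at fixed N, trading extra training compute for a smaller, cheaper-to-serve model.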

  3. RLHF and DPO align models with human preferences. The three-stage pipeline (supervised fine-tuning, reward modeling, PPO) or the simpler DPO alternative turns a raw language model into a helpful, harmless assistant. A KL penalty against the reference model prevents reward hacking.
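
The DPO objective mentioned above is compact enough to write out directly. This is a minimal scalar sketch for a single preference pair, assuming summed log-probabilities of whole responses are already available; `beta=0.1` is a typical but arbitrary choice.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l are the policy's summed log-probs of the chosen
    and rejected responses; ref_logp_* are the frozen reference
    model's. beta scales the implicit KL penalty that keeps the
    policy close to the reference.
    """
    # Margin: how much more the policy prefers the chosen response,
    # relative to the reference model's own preference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; increasing the margin on the chosen response drives the loss toward zero, which is the gradient signal that replaces the explicit reward model in RLHF.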

  4. Three architecture families serve different purposes: encoders (BERT) for understanding, decoders (GPT) for generation, and encoder-decoders (T5) for sequence-to-sequence tasks. The field is converging on decoder-only models for nearly all tasks.

  5. Open-weight models have democratized LLM research. LLaMA, Mistral, and others provide competitive alternatives to proprietary models, enabling domain-specific fine-tuning for telecom and imaging research.

Looking Ahead

Chapter 36 shows how to use these models programmatically via APIs, with practical techniques for prompt engineering, tool use, and retrieval-augmented generation.