Chapter Summary

Key Points

  1. Attention computes weighted value retrieval. Scaled dot-product attention is the core operation: softmax(QK^T / sqrt(d_k)) V. Multi-head attention learns diverse attention patterns in parallel. (A minimal sketch follows this list.)

  2. The transformer is attention + FFN + residual + LayerNorm. This simple stack scales to billions of parameters. Pre-norm (LayerNorm before each sublayer) is more stable than post-norm. (See the pre-norm block sketch below.)

  3. Positional encoding injects order. Without it, transformers cannot distinguish token positions. Sinusoidal encodings generalise to unseen lengths; learned embeddings often work better for fixed lengths. (See the sinusoidal encoding sketch below.)

  4. ViT shows that attention works for images. Split an image into patches and treat them as tokens. With enough data, ViT matches or exceeds CNNs. Useful for scientific imaging and other grid-structured data. (See the patch-embedding sketch below.)

  5. Quadratic cost is the main limitation. Standard attention costs O(n^2) in memory and compute because it materialises an n x n score matrix. Flash Attention reduces the memory cost without changing the result; linear attention reduces the asymptotic complexity but may sacrifice quality. (See the linear-attention sketch below.)
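
The sketches below are illustrative only and assume PyTorch; the class names (MultiHeadAttention, PreNormBlock, PatchEmbedding) and hyperparameters (d_model=512, n_heads=8, d_ff=2048, patch_size=16) are placeholders rather than the chapter's code. First, scaled dot-product and multi-head attention (key point 1):

    import math
    import torch
    import torch.nn as nn

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, heads, seq_len, d_k)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, heads, n, n)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                 # attention weights
        return weights @ v                                      # weighted value retrieval

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_k = n_heads, d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x, mask=None):
            b, n, _ = x.shape
            # Project, then split the model dimension into (heads, d_k) so each
            # head attends independently over the sequence.
            def split(t):
                return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
            q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
            out = scaled_dot_product_attention(q, k, v, mask)
            out = out.transpose(1, 2).reshape(b, n, -1)         # concatenate heads
            return self.out_proj(out)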
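
A pre-norm transformer block (key point 2), reusing the MultiHeadAttention sketch above: LayerNorm is applied before each sublayer and the residual connection adds around it.

    class PreNormBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = MultiHeadAttention(d_model, n_heads)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

        def forward(self, x, mask=None):
            x = x + self.attn(self.norm1(x), mask)   # normalise, attend, add residual
            x = x + self.ffn(self.norm2(x))          # normalise, FFN, add residual
            return x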
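
The sinusoidal positional encoding of key point 3, written as a standalone function (assumes an even d_model). Each position receives a fixed vector of sines and cosines at geometrically spaced frequencies, which is added to the token embeddings.

    def sinusoidal_positions(seq_len, d_model):
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
        freq = torch.exp(-math.log(10000.0) * i / d_model)              # 1 / 10000^(2i/d_model)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * freq)
        pe[:, 1::2] = torch.cos(pos * freq)
        return pe   # (seq_len, d_model), added to the token embeddings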
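
A ViT-style patch embedding (key point 4): a strided convolution splits the image into non-overlapping patches and linearly projects each patch to a token vector; image size, patch size, and width are illustrative.

    class PatchEmbedding(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=512):
            super().__init__()
            self.n_patches = (img_size // patch_size) ** 2
            # One kernel application per patch = one linear projection per patch.
            self.proj = nn.Conv2d(in_chans, d_model,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                      # x: (batch, channels, H, W)
            x = self.proj(x)                       # (batch, d_model, H/p, W/p)
            return x.flatten(2).transpose(1, 2)    # (batch, n_patches, d_model)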
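
For key point 5: Flash Attention keeps the exact softmax result and saves memory by tiling, so it has no short closed form to show here; the sketch below instead illustrates one common linear-attention variant (the elu(x) + 1 feature map of Katharopoulos et al., 2020), which never materialises the n x n score matrix but can trade away quality.

    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        # q, k, v: (batch, heads, n, d_k)
        q = F.elu(q) + 1.0                          # non-negative feature map
        k = F.elu(k) + 1.0
        kv = k.transpose(-2, -1) @ v                # (d_k, d_k) summary, independent of n
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # normaliser, (n, 1)
        return (q @ kv) / z                         # O(n * d_k^2) rather than O(n^2 * d_k)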

Looking Ahead

Chapter 31 uses transformers as components in generative models. Chapter 33 covers fine-tuning pre-trained transformers for domain tasks.