Chapter Summary
Key Points
1. Attention computes weighted value retrieval. Scaled dot-product attention is the core operation: softmax(QK^T / sqrt(d_k)) V. Multi-head attention learns diverse patterns in parallel (sketched in code after this list).
2. The transformer block is attention + FFN + residual connections + LayerNorm. This simple stack scales to billions of parameters. Pre-norm (LayerNorm before each sublayer) trains more stably than post-norm (second sketch below).
3. Positional encoding injects order. Without it, transformers cannot distinguish positions. Sinusoidal encodings generalise to unseen lengths; learned embeddings often work better at fixed lengths (third sketch below).
4. ViT proves attention works for images. Split an image into patches and treat the patches as tokens. With enough data, ViT matches or exceeds CNNs, which makes it useful for scientific imaging and other grid data (fourth sketch below).
5. Quadratic cost is the main limitation. Standard attention costs O(n^2) in memory and compute. Flash Attention reduces the memory cost; linear attention reduces the asymptotic complexity but may sacrifice quality (final sketch below).
Looking Ahead
Chapter 31 uses transformers as components in generative models. Chapter 33 covers fine-tuning pre-trained transformers for domain tasks.