Chapter Summary

Key Points

  1. Attention computes weighted value retrieval. Scaled dot-product attention is the core operation: softmax(QK^T / sqrt(d_k)) V. Multi-head attention learns diverse attention patterns in parallel. (A minimal sketch follows this list.)

  2. The transformer is attention + FFN + residual + LayerNorm. This simple stack scales to billions of parameters. Pre-norm (LayerNorm before each sublayer) is more stable than post-norm. (See the pre-norm block sketch below.)

  3. Positional encoding injects order. Without it, transformers cannot distinguish token positions. Sinusoidal encodings generalise to unseen lengths; learned embeddings often work better for fixed lengths. (See the sinusoidal encoding sketch below.)

  4. ViT shows that attention works for images. Split an image into patches and treat them as tokens. With enough data, ViT matches or exceeds CNNs. Useful for scientific imaging and other grid-structured data. (See the patch-embedding sketch below.)

  5. Quadratic cost is the main limitation. Standard attention costs O(n^2) in memory and compute because it materialises an n x n score matrix. Flash Attention reduces the memory cost without changing the result; linear attention reduces the asymptotic complexity but may sacrifice quality. (See the linear-attention sketch below.)
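
The sketches below are illustrative only and assume PyTorch; the class names (MultiHeadAttention, PreNormBlock, PatchEmbedding) and hyperparameters (d_model=512, n_heads=8, d_ff=2048, patch_size=16) are placeholders rather than the chapter's code. First, scaled dot-product and multi-head attention (key point 1):

    import math
    import torch
    import torch.nn as nn

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, heads, seq_len, d_k)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, heads, n, n)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                 # attention weights
        return weights @ v                                      # weighted value retrieval

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_k = n_heads, d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.out_proj = nn.Linear(d_model, d_model)

        def forward(self, x, mask=None):
            b, n, _ = x.shape
            # Project, then split the model dimension into (heads, d_k) so each
            # head attends independently over the sequence.
            def split(t):
                return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
            q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
            out = scaled_dot_product_attention(q, k, v, mask)
            out = out.transpose(1, 2).reshape(b, n, -1)         # concatenate heads
            return self.out_proj(out)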
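
A pre-norm transformer block (key point 2), reusing the MultiHeadAttention sketch above: LayerNorm is applied before each sublayer and the residual connection adds around it.

    class PreNormBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = MultiHeadAttention(d_model, n_heads)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

        def forward(self, x, mask=None):
            x = x + self.attn(self.norm1(x), mask)   # normalise, attend, add residual
            x = x + self.ffn(self.norm2(x))          # normalise, FFN, add residual
            return x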
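
The sinusoidal positional encoding of key point 3, written as a standalone function (assumes an even d_model). Each position receives a fixed vector of sines and cosines at geometrically spaced frequencies, which is added to the token embeddings.

    def sinusoidal_positions(seq_len, d_model):
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
        i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
        freq = torch.exp(-math.log(10000.0) * i / d_model)              # 1 / 10000^(2i/d_model)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * freq)
        pe[:, 1::2] = torch.cos(pos * freq)
        return pe   # (seq_len, d_model), added to the token embeddings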
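
A ViT-style patch embedding (key point 4): a strided convolution splits the image into non-overlapping patches and linearly projects each patch to a token vector; image size, patch size, and width are illustrative.

    class PatchEmbedding(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=512):
            super().__init__()
            self.n_patches = (img_size // patch_size) ** 2
            # One kernel application per patch = one linear projection per patch.
            self.proj = nn.Conv2d(in_chans, d_model,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                      # x: (batch, channels, H, W)
            x = self.proj(x)                       # (batch, d_model, H/p, W/p)
            return x.flatten(2).transpose(1, 2)    # (batch, n_patches, d_model)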
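
For key point 5: Flash Attention keeps the exact softmax result and saves memory by tiling, so it has no short closed form to show here; the sketch below instead illustrates one common linear-attention variant (the elu(x) + 1 feature map of Katharopoulos et al., 2020), which never materialises the n x n score matrix but can trade away quality.

    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        # q, k, v: (batch, heads, n, d_k)
        q = F.elu(q) + 1.0                          # non-negative feature map
        k = F.elu(k) + 1.0
        kv = k.transpose(-2, -1) @ v                # (d_k, d_k) summary, independent of n
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # normaliser, (n, 1)
        return (q @ kv) / z                         # O(n * d_k^2) rather than O(n^2 * d_k)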

Looking Ahead

Chapter 31 uses transformers as components in generative models. Chapter 33 covers fine-tuning pre-trained transformers for domain tasks.