Chapter Summary

Key Points

  1. The API paradigm is message-based. Send structured messages (system, user, assistant); get text back. Token-based pricing means prompt design directly affects cost. Use streaming for interactive applications and structured output for reliable parsing.
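A minimal sketch of the message format and token-based pricing; the per-million-token prices below are illustrative placeholders, not any provider's actual rates:

```python
# Message-based request body, in the common chat-completions shape.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain cosine similarity in one sentence."},
]

def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  price_in: float = 0.15, price_out: float = 0.60) -> float:
    """Token-based pricing: prices in dollars per million tokens (illustrative)."""
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1e6

# A 1,200-token prompt with a 300-token reply:
print(f"${estimate_cost(1200, 300):.5f}")
```

Because input and output tokens are priced separately, trimming a verbose system prompt pays off on every single call.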

  2. Prompt engineering is systematic. Use system messages for role and constraints, few-shot examples for task specification, and chain-of-thought prompting for complex reasoning. Most gains come from the first 3-5 examples.
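A sketch of how few-shot examples slot into the message list as alternating user/assistant turns; the labels and review texts are made up for illustration:

```python
# Few-shot task specification: each example is a user/assistant turn pair.
examples = [
    ("I love this!", "positive"),
    ("Total waste of money.", "negative"),
    ("It arrived on Tuesday.", "neutral"),
]

messages = [{"role": "system",
             "content": "Classify sentiment as positive, negative, or neutral."}]
for text, label in examples:
    messages.append({"role": "user", "content": text})
    messages.append({"role": "assistant", "content": label})

# The real query goes last; the model continues the established pattern.
messages.append({"role": "user", "content": "The battery died after a week."})
```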

  3. Tool use enables LLM agents. LLMs can call external functions for computation, data retrieval, and actions. Each tool call carries a reliability cost, so verify results at every step.
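One way to sketch the dispatch-and-verify step: the model emits a JSON tool call, and the application executes it against a registry and checks the result before feeding it back. The tool name, registry, and checks here are hypothetical:

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would query a weather API."""
    return {"city": city, "temp_c": 18}

TOOLS = {"get_weather": get_weather}

def run_tool(call_json: str) -> dict:
    """Execute one model-issued tool call, verifying the result before it
    re-enters the conversation."""
    call = json.loads(call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    result = fn(**call["arguments"])
    if not isinstance(result, dict):  # verify at every step
        raise TypeError("tool must return structured data")
    return result
```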

  4. RAG grounds LLMs in domain knowledge. Chunk documents into 200-500 tokens, embed with sentence transformers, and retrieve the top-k chunks by cosine similarity. RAG is simpler and more flexible than fine-tuning for adding factual knowledge.
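The chunk-embed-retrieve pipeline can be sketched in plain Python; here a whitespace word count stands in for a real tokenizer, and toy 2-d vectors stand in for sentence-transformer embeddings:

```python
import math

def chunk(text: str, max_tokens: int = 300) -> list[str]:
    """Split a document into chunks of roughly max_tokens words
    (a real pipeline would count model tokens, not words)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunk_vecs, k: int = 2) -> list[int]:
    """Indices of the k chunks most similar to the query."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```

The retrieved chunks are then pasted into the prompt as context, which is why chunk size matters: too small and chunks lack context, too large and they crowd out the rest of the prompt.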

  5. Local models provide privacy and zero marginal cost. Quantized 7-8B models run on consumer GPUs. Use Ollama for quick setup, vLLM for production serving, and HuggingFace for full control.
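A back-of-the-envelope check on why quantized 7-8B models fit on consumer GPUs; the 20% overhead factor for KV cache and activations is a rough assumption, not a measured figure:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Weights take params * bits/8 bytes; add ~20% for KV cache and
    activations (rule of thumb, assumed here)."""
    return params_billions * bits_per_weight / 8 * overhead

# 8B parameters at 4-bit quantization: roughly 4.8 GB, comfortably inside
# a 12 GB consumer GPU. The same model at fp16 needs roughly 19 GB.
print(weight_memory_gb(8, 4), weight_memory_gb(8, 16))
```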

Looking Ahead

Chapter 37 covers fine-tuning and training LLMs when prompting alone is insufficient — LoRA, nanoGPT, instruction tuning, and multimodal models.