References & Further Reading
References
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language Models are Unsupervised Multitask Learners, 2019
The GPT-2 paper, showing that a language model trained on web text can perform many downstream tasks zero-shot, without task-specific fine-tuning.
- T. Brown et al., Language Models are Few-Shot Learners, NeurIPS, 2020
GPT-3 (175B parameters), demonstrating few-shot in-context learning at scale.
- J. Hoffmann et al., Training Compute-Optimal Large Language Models, 2022
The Chinchilla paper, establishing how a fixed compute budget should be split between model size and training tokens (a worked sketch follows this list).
- L. Ouyang et al., Training Language Models to Follow Instructions with Human Feedback, NeurIPS, 2022
The InstructGPT/RLHF paper; labelers preferred its 1.3B instruction-tuned model over the 175B GPT-3, showing that alignment can beat raw scale.
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS, 2023
DPO removes the separate reward model and RL loop from preference alignment, training directly on preference pairs with a classification-style loss (a short sketch follows this list).
- H. Touvron et al., LLaMA: Open and Efficient Foundation Language Models, 2023
Meta's open-weight LLM family that catalyzed open-source development.
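For the Chinchilla entry above, a minimal sketch of the relationship the paper fits: loss is modeled as a function of parameter count N and training tokens D, then minimized at a fixed compute budget C ≈ 6ND. Here E, A, B, α, β stand for the paper's fitted constants (exact values omitted); the headline consequence is that N and D should be scaled together, on the order of 20 training tokens per parameter.

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6ND,
\qquad (N_{\mathrm{opt}},\, D_{\mathrm{opt}}) = \arg\min_{6ND \,=\, C} L(N, D).
```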
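And for the DPO entry, a NumPy sketch of the paper's loss. It assumes you already have summed token log-probabilities of each chosen/rejected response under the trainable policy and the frozen reference model; the function and argument names are illustrative, not from the paper's code.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective for a batch of preference pairs.

    Each argument is an array of summed token log-probs of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward of a response: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Loss is -log sigmoid(margin), computed stably as log(1 + exp(-margin))
    margin = chosen_reward - rejected_reward
    return np.logaddexp(0.0, -margin).mean()
```

Minimizing this pushes the policy to raise the likelihood of preferred responses, relative to the reference model, more than it does for rejected ones, which is why no explicit reward model or RL loop is needed.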
Further Reading
Mixture of Experts
W. Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, JMLR, 2022
Introduces simplified top-1 (single-expert) MoE routing for transformers.
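A rough NumPy sketch of the top-1 routing idea, under stated assumptions: `experts` is a list of per-expert feed-forward callables, and tokens beyond an expert's capacity simply pass through unchanged, following the paper's capacity-factor scheme; names and shapes here are illustrative.

```python
import numpy as np

def switch_route(x, router_weights, experts, capacity_factor=1.25):
    """Top-1 (Switch) routing: each token is processed by exactly one expert.

    x:              (n_tokens, d_model) token representations
    router_weights: (d_model, n_experts) router projection
    experts:        list of callables mapping (m, d_model) -> (m, d_model)
    """
    n_tokens = x.shape[0]
    n_experts = router_weights.shape[1]

    # Router: softmax over experts, keep only the single best expert per token.
    logits = x @ router_weights
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    expert_idx = probs.argmax(axis=1)              # chosen expert per token
    gate = probs[np.arange(n_tokens), expert_idx]  # gate value scales the output

    # Fixed per-expert capacity; tokens beyond it skip the expert entirely.
    capacity = int(capacity_factor * n_tokens / n_experts)
    out = x.copy()                                 # dropped tokens pass through
    for e in range(n_experts):
        sel = np.where(expert_idx == e)[0][:capacity]
        if sel.size:
            out[sel] = gate[sel, None] * experts[e](x[sel])
    return out
```

Because only the selected expert runs for each token, total parameter count grows with the number of experts while per-token compute stays roughly constant.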
Flash Attention
T. Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, 2023
The IO-aware, tiled attention algorithm now used widely in LLM training and inference.
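The core numerical trick, separate from the fused IO-aware kernel itself, is tiled attention with an online softmax: scan K/V in blocks while carrying a running max, normalizer, and output accumulator, so the full attention matrix is never materialized. A minimal single-query NumPy sketch with illustrative names:

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Attention output for one query vector, scanning K/V one block at a time."""
    scale = 1.0 / np.sqrt(q.shape[0])
    m = -np.inf                                   # running max of scores
    denom = 0.0                                   # running softmax normalizer
    acc = np.zeros(V.shape[1])                    # running weighted sum of V rows

    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) * scale  # scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale previous accumulator
        p = np.exp(s - m_new)                     # unnormalized weights, this block
        acc = acc * correction + p @ V[start:start + block]
        denom = denom * correction + p.sum()
        m = m_new

    return acc / denom                            # equals softmax(K @ q * scale) @ V
```

FlashAttention's contribution is doing this tiling inside one fused GPU kernel, so each block stays in fast on-chip memory instead of round-tripping through HBM.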
Transformer Implementation
A. Karpathy, nanoGPT (https://github.com/karpathy/nanoGPT)
A minimal, readable GPT training implementation, well suited to learning.