References & Further Reading

References

  1. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language Models are Unsupervised Multitask Learners, 2019

    The GPT-2 paper, demonstrating zero-shot task transfer from unsupervised pretraining.

  2. T. Brown et al., Language Models are Few-Shot Learners, NeurIPS, 2020

    The GPT-3 paper (175B parameters), demonstrating in-context (few-shot) learning at scale.

  3. J. Hoffmann et al., Training Compute-Optimal Large Language Models, 2022

    The Chinchilla paper, establishing that model size and training tokens should be scaled in roughly equal proportion for compute-optimal training.

  4. L. Ouyang et al., Training Language Models to Follow Instructions with Human Feedback, NeurIPS, 2022

    The InstructGPT/RLHF paper, showing that aligned models can be preferred by humans over much larger unaligned ones.

  5. R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS, 2023

    DPO replaces the explicit reward model and RL loop with a simple classification-style loss on preference pairs.

  6. H. Touvron et al., LLaMA: Open and Efficient Foundation Language Models, 2023

    Meta's open-weight LLM family, which catalyzed open-model research and development.

Further Reading

  • Mixture of Experts

    W. Fedus, B. Zoph, and N. Shazeer, Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, JMLR, 2022

    Introduces simplified top-1 expert routing for transformer MoE layers.

  • Flash Attention

    T. Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, 2023

    The IO-aware exact attention algorithm widely used in modern LLM training and inference.

  • Transformer implementation

    A. Karpathy, nanoGPT (https://github.com/karpathy/nanoGPT)

    A minimal, readable GPT implementation, well suited for learning.