Training a Small Language Model from Scratch

Definition:

Training a language model "from scratch" means initializing all weights randomly and learning them directly from a text corpus, rather than fine-tuning an existing pretrained model. A small LM in this context is one in the range of a few million to a few hundred million parameters, small enough to train on a single GPU.

nanoGPT Training Pipeline

Training a small LM from scratch follows four steps:

  1. Prepare data: tokenize corpus, create train/val splits (sketch after this list)
  2. Define model: GPT architecture with configurable size
  3. Train: AdamW optimizer, cosine LR schedule, gradient clipping (sketch below)
  4. Generate: sample from the trained model (sketch in the example section)
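
A minimal sketch of step 1, assuming a character-level tokenizer over a plain-text input.txt (the setup nanoGPT uses for its Shakespeare example); the file names and the 90/10 split are illustrative choices:

import numpy as np

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Character-level vocabulary built from the corpus itself.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Encode the corpus as token IDs and split 90/10 into train and val.
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
n = int(0.9 * len(data))
data[:n].tofile("train.bin")
data[n:].tofile("val.bin")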

# nanoGPT-style model and optimizer setup. GPT is the decoder-only
# Transformer from step 2 (assumed defined); vocab_size=50257 matches the
# GPT-2 BPE tokenizer, while a char-level setup would use len(chars).
import torch

model = GPT(vocab_size=50257, d_model=384, n_heads=6, n_layers=6)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
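
A condensed sketch of the step-3 training loop, assuming the model's forward pass returns logits and that a get_batch helper (hypothetical here) yields input and next-token target batches; the iteration counts and learning-rate bounds are illustrative:

import math
import torch
import torch.nn.functional as F

max_iters, warmup_iters = 5000, 100
max_lr, min_lr = 3e-4, 3e-5

def get_lr(it):
    # Linear warmup to max_lr, then cosine decay toward min_lr.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = get_lr(it)
    x, y = get_batch("train")   # (batch, seq) inputs and next-token targets
    logits = model(x)           # (batch, seq, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradient norm
    optimizer.step()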

Example: Training nanoGPT on Shakespeare

Train a roughly 10M-parameter GPT (the 6-layer, 6-head, 384-dimensional configuration above with a character-level vocabulary) on the Shakespeare corpus, then sample from it with the generate helper sketched below.
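
A sampling sketch for the generation step, assuming the same logits-returning forward pass and the character-level stoi/itos tables from the data-preparation sketch; the temperature, top-k, and context-length values are illustrative:

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=0.8, top_k=50):
    # idx: (batch, seq_len) tensor of token IDs used as the prompt.
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits = model(idx_cond)[:, -1, :]           # logits at the last position
        logits = logits / temperature                # soften/sharpen the distribution
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = -float("inf")  # mask everything below top-k
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# Usage: start from a newline character and decode with the char-level tables.
start = torch.tensor([[stoi["\n"]]], dtype=torch.long)
out = generate(model, start, max_new_tokens=500)
print("".join(itos[int(i)] for i in out[0]))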