Training a Small Language Model from Scratch

Definition:

Training a language model "from scratch" means initializing all weights randomly and learning them directly from a text corpus, rather than fine-tuning an existing pretrained model. A small LM in this context is one in the range of a few million to a few hundred million parameters, small enough to train on a single GPU.

nanoGPT Training Pipeline

Training a small LM from scratch follows four steps:

  1. Prepare data: tokenize corpus, create train/val splits (sketch after this list)
  2. Define model: GPT architecture with configurable size
  3. Train: AdamW optimizer, cosine LR schedule, gradient clipping (sketch below)
  4. Generate: sample from the trained model (sketch in the example section)
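
A minimal sketch of step 1, assuming a character-level tokenizer over a plain-text input.txt (the setup nanoGPT uses for its Shakespeare example); the file names and the 90/10 split are illustrative choices:

import numpy as np

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Character-level vocabulary built from the corpus itself.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Encode the corpus as token IDs and split 90/10 into train and val.
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)
n = int(0.9 * len(data))
data[:n].tofile("train.bin")
data[n:].tofile("val.bin")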

# nanoGPT-style model and optimizer setup. GPT is the decoder-only
# Transformer from step 2 (assumed defined); vocab_size=50257 matches the
# GPT-2 BPE tokenizer, while a char-level setup would use len(chars).
import torch

model = GPT(vocab_size=50257, d_model=384, n_heads=6, n_layers=6)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
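
A condensed sketch of the step-3 training loop, assuming the model's forward pass returns logits and that a get_batch helper (hypothetical here) yields input and next-token target batches; the iteration counts and learning-rate bounds are illustrative:

import math
import torch
import torch.nn.functional as F

max_iters, warmup_iters = 5000, 100
max_lr, min_lr = 3e-4, 3e-5

def get_lr(it):
    # Linear warmup to max_lr, then cosine decay toward min_lr.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = get_lr(it)
    x, y = get_batch("train")   # (batch, seq) inputs and next-token targets
    logits = model(x)           # (batch, seq, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradient norm
    optimizer.step()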

Example: Training nanoGPT on Shakespeare

Train a roughly 10M-parameter GPT (the 6-layer, 6-head, 384-dimensional configuration above with a character-level vocabulary) on the Shakespeare corpus, then sample from it with the generate helper sketched below.
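
A sampling sketch for the generation step, assuming the same logits-returning forward pass and the character-level stoi/itos tables from the data-preparation sketch; the temperature, top-k, and context-length values are illustrative:

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=0.8, top_k=50):
    # idx: (batch, seq_len) tensor of token IDs used as the prompt.
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits = model(idx_cond)[:, -1, :]           # logits at the last position
        logits = logits / temperature                # soften/sharpen the distribution
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = -float("inf")  # mask everything below top-k
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# Usage: start from a newline character and decode with the char-level tables.
start = torch.tensor([[stoi["\n"]]], dtype=torch.long)
out = generate(model, start, max_new_tokens=500)
print("".join(itos[int(i)] for i in out[0]))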