Pre-Training at Scale
Definition: Pre-Training Objective
LLMs are pre-trained on massive text corpora using the causal language modeling (CLM) objective, i.e., next-token prediction at every position in the sequence:
$$\mathcal{L}_{\text{CLM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
Training involves:
- Data: trillions of tokens from web, books, code
- Optimizer: AdamW with cosine learning rate schedule
- Hardware: thousands of GPUs with tensor/pipeline parallelism
- Duration: weeks to months of continuous training
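To make the CLM objective concrete, here is a minimal, self-contained sketch (not taken from any library; the shapes, names, and toy data are illustrative assumptions). It computes the average negative log-likelihood of observed tokens from a matrix of next-token logits using plain NumPy.
import numpy as np

# Toy CLM loss: average negative log-likelihood of each token given its
# preceding context. logits[t] holds the model's scores for the token at
# position t, conditioned on positions < t.
def clm_loss(logits, token_ids):
    """logits: (T, V) next-token scores; token_ids: (T,) observed tokens."""
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # select log p(x_t | x_<t) for the tokens that actually occurred
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return nll.mean()

rng = np.random.default_rng(0)
T, V = 8, 100                        # sequence length, vocabulary size
logits = rng.normal(size=(T, V))     # stand-in for a model's output logits
tokens = rng.integers(0, V, size=T)  # stand-in for the observed next tokens
print(f"CLM loss: {clm_loss(logits, tokens):.3f}")  # roughly log(V) ~ 4.6 for random logits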
Theorem: Neural Scaling Laws (Chinchilla)
The test loss of a language model follows a power law in model size $N$, dataset size $D$, and compute $C$:
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
where $E$ is the irreducible loss of natural text and $A$, $B$, $\alpha$, $\beta$ are empirically fitted constants.
The Chinchilla-optimal training rule states that model size and data should scale proportionally with compute:
$$N^{*} \propto C^{1/2}, \qquad D^{*} \propto C^{1/2}, \qquad D^{*} \approx 20\,N^{*}$$
A 7B model should be trained on ~140B tokens. Under-training (too few tokens) wastes parameters; over-training is compute-inefficient.
Scaling laws tell you how to allocate a fixed compute budget between model size and training data for optimal performance.
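As a rough illustration (an assumption, not part of the course material), the allocation can be sketched with the common approximations that training compute is $C \approx 6ND$ FLOPs and that the optimal token count is $D^{*} \approx 20N^{*}$; the helper name chinchilla_allocation is hypothetical.
# Sketch of compute-optimal allocation under two approximations:
#   training FLOPs  C ~ 6 * N * D   and   Chinchilla rule  D* ~ 20 * N*
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Return (optimal parameter count N, optimal token count D) for a FLOP budget."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  C = 6 * tokens_per_param * N**2
    N = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

# Example: roughly the compute of a 7B model trained on 140B tokens
C = 6 * 7e9 * 140e9                      # ~5.9e21 FLOPs
N_opt, D_opt = chinchilla_allocation(C)
print(f"N* = {N_opt/1e9:.1f}B params, D* = {D_opt/1e9:.0f}B tokens")
# -> N* = 7.0B params, D* = 140B tokens, recovering the 20 tokens/parameter rule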
Theorem: Emergent Abilities and Scale
Certain capabilities (chain-of-thought reasoning, few-shot learning, code generation) appear only above a critical model size. Below the threshold, performance is near-random; above it, performance jumps sharply.
While the exact mechanism is debated, empirically:
- Few-shot learning emerges around 10B parameters
- Chain-of-thought reasoning emerges around 100B parameters
- Complex reasoning continues to improve with scale
Larger models can represent more complex functions and store more world knowledge, enabling qualitatively new capabilities.
Example: Predicting Loss from Scaling Laws
Using Chinchilla scaling laws, predict the test loss for a 7B model trained on 1T tokens vs. 140B tokens.
Calculation
# Approximate Chinchilla-style constants
alpha_N = 0.076
alpha_D = 0.095
N_c = 1e13
D_c = 1e13
L_inf = 1.69  # irreducible loss (approximate entropy of natural text)

def predict_loss(N, D):
    """Predicted test loss for N parameters trained on D tokens."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

N = 7e9
loss_140B = predict_loss(N, 140e9)  # Chinchilla-optimal token count for 7B
loss_1T = predict_loss(N, 1e12)     # ~7x more tokens than optimal
print(f"7B, 140B tokens: L = {loss_140B:.3f}")
print(f"7B, 1T tokens:   L = {loss_1T:.3f}")
print(f"Improvement: {loss_140B - loss_1T:.3f}")
# 1T tokens is over-training for 7B, but models like
# LLaMA do this because inference cost > training cost
Scaling Laws Explorer
[Interactive widget: explore how loss scales with model size, data, and compute by adjusting the parameters.]
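As a static stand-in for the interactive explorer (this sketch is an assumption and simply reuses the approximate constants from the calculation above), the loop below sweeps model size at a fixed compute budget, derives the affordable token count from $C \approx 6ND$, and reports the predicted loss for each allocation.
# Static stand-in for the explorer: for a fixed FLOP budget C ~ 6*N*D,
# sweep model size N, derive D = C / (6*N), and report the predicted loss
# using the same approximate constants as the calculation above.
alpha_N, alpha_D = 0.076, 0.095
N_c = D_c = 1e13
L_inf = 1.69

def predict_loss(N, D):
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

C = 1e22                                   # illustrative FLOP budget
for N in [1e9, 3e9, 10e9, 30e9, 100e9]:
    D = C / (6 * N)                        # tokens affordable at this model size
    print(f"N = {N/1e9:5.0f}B   D = {D/1e9:7.0f}B   L = {predict_loss(N, D):.3f}")
# The minimum of this curve is the compute-optimal split for this budget.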
Quick Check
According to Chinchilla scaling laws, a 70B parameter model should optimally be trained on approximately how many tokens?
- 70 billion
- 1.4 trillion
- 7 trillion
Answer: 1.4 trillion. Chinchilla-optimal is D* = 20N, so 20 × 70B = 1.4T tokens.
Key Takeaway
Scaling laws show that LLM performance follows predictable power laws in model size, data, and compute. The Chinchilla-optimal ratio is approximately 20 tokens per parameter, but many modern models train on far more data because inference cost dominates.
Scaling Laws
Empirical power-law relationships between model performance and training resources (parameters, data, compute), enabling prediction of capabilities before training.
Related: Chinchilla Scaling
Chinchilla Scaling
The finding by Hoffmann et al. (2022) that compute-optimal training scales model size and training data proportionally, with D* ~ 20N.
Related: Scaling Laws