Pre-Training at Scale

Definition: Pre-Training Objective

LLMs are pre-trained on massive text corpora using the causal language modeling (CLM) objective:

\mathcal{L}_\text{CLM}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right]
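
To make the objective concrete, here is a minimal sketch of the CLM loss for one batch, assuming a PyTorch-style model that returns next-token logits; the function name and tensor shapes are illustrative, not tied to any particular framework API beyond standard PyTorch.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Average negative log-likelihood of each token given its prefix.

    logits:    (batch, seq_len, vocab_size) next-token predictions
    input_ids: (batch, seq_len) token sequence x_1 .. x_T
    """
    # Predict token t+1 from positions <= t: shift logits left, targets right.
    shift_logits = logits[:, :-1, :]   # predictions for x_2 .. x_T
    shift_labels = input_ids[:, 1:]    # targets x_2 .. x_T
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```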

Training involves:

  1. Data: trillions of tokens from web, books, code
  2. Optimizer: AdamW with cosine learning rate schedule (see the sketch after this list)
  3. Hardware: thousands of GPUs with tensor/pipeline parallelism
  4. Duration: weeks to months of continuous training
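
A minimal sketch of the optimizer setup from item 2 above: AdamW with linear warmup followed by cosine decay. The specific hyperparameter values (peak learning rate, warmup steps, betas, weight decay) are illustrative defaults, not the settings of any particular published model.

```python
import math
import torch

def make_optimizer_and_schedule(model, max_steps, warmup_steps=2000,
                                peak_lr=3e-4, min_lr=3e-5):
    """AdamW with linear warmup then cosine decay, a common LLM pre-training recipe."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup from 0 to peak_lr
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        # Decay from peak_lr down to min_lr; LambdaLR multiplies the base lr by this factor.
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```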

Theorem: Neural Scaling Laws (Chinchilla)

The test loss of a language model follows a power law in model size N, dataset size D, and compute C:

L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
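
For concreteness, this is often written in the additive fitted form used by Hoffmann et al. (2022); the constants below are approximately their published fit and should be read as illustrative rather than exact:

L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28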

The Chinchilla-optimal training rule states that model size and data should scale proportionally:

D^* \approx 20 \cdot N

A 7B model should be trained on ~140B tokens. Under-training (too few tokens) wastes parameters; over-training is compute-inefficient.

Scaling laws tell you how to allocate a fixed compute budget between model size and training data for optimal performance.
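
As a quick sketch of how the rule is applied: the code below uses the 20-tokens-per-parameter heuristic and splits a fixed FLOP budget using the common rule of thumb that training compute is roughly C ≈ 6ND FLOPs. The 6ND approximation and the example budget are assumptions for illustration, not results from this section.

```python
import math

def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens under the ~20 tokens/parameter rule."""
    return 20.0 * n_params

def allocate_compute(flops_budget: float) -> tuple[float, float]:
    """Split a FLOP budget between model size N and tokens D.

    Uses C ~= 6 * N * D together with D ~= 20 * N, so N = sqrt(C / 120).
    """
    n_opt = math.sqrt(flops_budget / (6.0 * 20.0))
    return n_opt, chinchilla_tokens(n_opt)

print(f"7B model -> {chinchilla_tokens(7e9) / 1e9:.0f}B tokens")  # ~140B
n, d = allocate_compute(1e23)  # hypothetical 1e23 FLOP budget
print(f"1e23 FLOPs -> N ~= {n / 1e9:.1f}B params, D ~= {d / 1e9:.0f}B tokens")
```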

Theorem: Emergent Abilities and Scale

Certain capabilities (chain-of-thought reasoning, few-shot learning, code generation) appear only above a critical model size. Below the threshold, performance is near-random; above it, performance jumps sharply.

While the exact mechanism is debated, empirically:

  • Few-shot learning emerges around 10B parameters
  • Chain-of-thought reasoning emerges around 100B parameters
  • Complex reasoning continues to improve with scale

Larger models can represent more complex functions and store more world knowledge, enabling qualitatively new capabilities.

Example: Predicting Loss from Scaling Laws

Using Chinchilla scaling laws, predict the test loss for a 7B model trained on 1T tokens vs. 140B tokens.
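
A minimal sketch of this calculation, using the additive fitted form and the approximate Hoffmann et al. (2022) constants noted above; the exact numbers depend on the fit and should be treated as illustrative.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted test loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

for tokens in (140e9, 1e12):
    loss = chinchilla_loss(7e9, tokens)
    print(f"7B params, {tokens / 1e9:.0f}B tokens -> predicted loss ~ {loss:.2f}")
```

With these illustrative constants, moving a 7B model from 140B to 1T tokens lowers the predicted loss by only about 0.1, showing the diminishing returns of extra data once model size is fixed.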

Scaling Laws Explorer (interactive): explore how loss scales with model size, data, and compute.

Quick Check

According to Chinchilla scaling laws, a 70B parameter model should optimally be trained on approximately how many tokens?

  • 70 billion

  • 1.4 trillion

  • 7 trillion

Key Takeaway

Scaling laws show that LLM performance follows predictable power laws in model size, data, and compute. The Chinchilla-optimal ratio is approximately 20 tokens per parameter, but many modern models are trained on far more tokens per parameter because inference cost dominates total cost: a smaller model trained past the compute-optimal point is cheaper to serve.

Scaling Laws

Empirical power-law relationships between model performance and training resources (parameters, data, compute), enabling prediction of capabilities before training.

Related: Chinchilla Scaling

Chinchilla Scaling

The finding by Hoffmann et al. (2022) that compute-optimal training scales model size and training data proportionally, with D* ≈ 20N.

Related: Scaling Laws