Pre-Training at Scale

Definition: Pre-Training Objective

LLMs are pre-trained on massive text corpora using the causal language modeling (CLM) objective:

\mathcal{L}_\text{CLM}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right]
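
To make the objective concrete, here is a minimal sketch of the CLM loss for one batch, assuming a PyTorch-style model that returns next-token logits; the function name and tensor shapes are illustrative, not tied to any particular framework API beyond standard PyTorch.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Average negative log-likelihood of each token given its prefix.

    logits:    (batch, seq_len, vocab_size) next-token predictions
    input_ids: (batch, seq_len) token sequence x_1 .. x_T
    """
    # Predict token t+1 from positions <= t: shift logits left, targets right.
    shift_logits = logits[:, :-1, :]   # predictions for x_2 .. x_T
    shift_labels = input_ids[:, 1:]    # targets x_2 .. x_T
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```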

Training involves:

  1. Data: trillions of tokens from web, books, code
  2. Optimizer: AdamW with cosine learning rate schedule (see the sketch after this list)
  3. Hardware: thousands of GPUs with tensor/pipeline parallelism
  4. Duration: weeks to months of continuous training
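
A minimal sketch of the optimizer setup from item 2 above: AdamW with linear warmup followed by cosine decay. The specific hyperparameter values (peak learning rate, warmup steps, betas, weight decay) are illustrative defaults, not the settings of any particular published model.

```python
import math
import torch

def make_optimizer_and_schedule(model, max_steps, warmup_steps=2000,
                                peak_lr=3e-4, min_lr=3e-5):
    """AdamW with linear warmup then cosine decay, a common LLM pre-training recipe."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup from 0 to peak_lr
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        # Decay from peak_lr down to min_lr; LambdaLR multiplies the base lr by this factor.
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```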

Theorem: Neural Scaling Laws (Chinchilla)

The test loss of a language model follows a power law in model size N, dataset size D, and compute C:

L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty
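
For concreteness, this is often written in the additive fitted form used by Hoffmann et al. (2022); the constants below are approximately their published fit and should be read as illustrative rather than exact:

L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28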

The Chinchilla-optimal training rule states that model size and data should scale proportionally:

D^* \approx 20 \cdot N

A 7B model should be trained on ~140B tokens. Under-training (too few tokens) wastes parameters; over-training is compute-inefficient.

Scaling laws tell you how to allocate a fixed compute budget between model size and training data for optimal performance.
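
As a quick sketch of how the rule is applied: the code below uses the 20-tokens-per-parameter heuristic and splits a fixed FLOP budget using the common rule of thumb that training compute is roughly C ≈ 6ND FLOPs. The 6ND approximation and the example budget are assumptions for illustration, not results from this section.

```python
import math

def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens under the ~20 tokens/parameter rule."""
    return 20.0 * n_params

def allocate_compute(flops_budget: float) -> tuple[float, float]:
    """Split a FLOP budget between model size N and tokens D.

    Uses C ~= 6 * N * D together with D ~= 20 * N, so N = sqrt(C / 120).
    """
    n_opt = math.sqrt(flops_budget / (6.0 * 20.0))
    return n_opt, chinchilla_tokens(n_opt)

print(f"7B model -> {chinchilla_tokens(7e9) / 1e9:.0f}B tokens")  # ~140B
n, d = allocate_compute(1e23)  # hypothetical 1e23 FLOP budget
print(f"1e23 FLOPs -> N ~= {n / 1e9:.1f}B params, D ~= {d / 1e9:.0f}B tokens")
```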

Theorem: Emergent Abilities and Scale

Certain capabilities (chain-of-thought reasoning, few-shot learning, code generation) appear only above a critical model size. Below the threshold, performance is near-random; above it, performance jumps sharply.

While the exact mechanism is debated, empirically:

  • Few-shot learning emerges around 10B parameters
  • Chain-of-thought reasoning emerges around 100B parameters
  • Complex reasoning continues to improve with scale

Larger models can represent more complex functions and store more world knowledge, enabling qualitatively new capabilities.

Example: Predicting Loss from Scaling Laws

Using Chinchilla scaling laws, predict the test loss for a 7B model trained on 1T tokens vs. 140B tokens.
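
A minimal sketch of this calculation, using the additive fitted form and the approximate Hoffmann et al. (2022) constants noted above; the exact numbers depend on the fit and should be treated as illustrative.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted test loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

for tokens in (140e9, 1e12):
    loss = chinchilla_loss(7e9, tokens)
    print(f"7B params, {tokens / 1e9:.0f}B tokens -> predicted loss ~ {loss:.2f}")
```

With these illustrative constants, moving a 7B model from 140B to 1T tokens lowers the predicted loss by only about 0.1, showing the diminishing returns of extra data once model size is fixed.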

Scaling Laws Explorer (interactive): explore how loss scales with model size, data, and compute.

Quick Check

According to Chinchilla scaling laws, a 70B parameter model should optimally be trained on approximately how many tokens?

  • 70 billion

  • 1.4 trillion

  • 7 trillion

Key Takeaway

Scaling laws show that LLM performance follows predictable power laws in model size, data, and compute. The Chinchilla-optimal ratio is approximately 20 tokens per parameter, but many modern models are trained on far more tokens per parameter because inference cost dominates total cost: a smaller model trained past the compute-optimal point is cheaper to serve.

Scaling Laws

Empirical power-law relationships between model performance and training resources (parameters, data, compute), enabling prediction of capabilities before training.

Related: Chinchilla Scaling

Chinchilla Scaling

The finding by Hoffmann et al. (2022) that compute-optimal training scales model size and training data proportionally, with D* ≈ 20N.

Related: Scaling Laws