The GPT Architecture

Definition: GPT Architecture

The Generative Pre-trained Transformer (GPT) is a stack of $L$ decoder-only transformer blocks. Each block contains:

  1. Causal (masked) multi-head self-attention
  2. Feed-forward network (two linear layers with GELU activation)
  3. Residual connections and layer normalization

The forward pass for block $l$:

$$\mathbf{h}'_l = \mathbf{h}_{l-1} + \text{MHA}(\text{LN}(\mathbf{h}_{l-1})), \qquad \mathbf{h}_l = \mathbf{h}'_l + \text{FFN}(\text{LN}(\mathbf{h}'_l))$$

where $\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \, \text{GELU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$.

Modern GPTs use pre-norm (LN before attention/FFN) rather than post-norm, which stabilizes training for deep networks.
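
To make this concrete, here is a minimal sketch of one pre-norm block in PyTorch. The class name `PreNormBlock` is our own, and `nn.MultiheadAttention` stands in for hand-rolled attention; assume `d_model` is divisible by `n_head`.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(            # FFN(x) = W2 · GELU(W1 x + b1) + b2
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, T, d_model)
        T = h.size(1)
        # Causal mask: True above the diagonal = position t cannot see t+1, t+2, ...
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        x = self.ln1(h)                                   # pre-norm: LN before the sub-layer
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        h = h + attn_out                                  # h'_l = h_{l-1} + MHA(LN(h_{l-1}))
        h = h + self.ffn(self.ln2(h))                     # h_l  = h'_l + FFN(LN(h'_l))
        return h
```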

Definition: Positional Encoding Strategies

Since self-attention is permutation-equivariant, position must be injected explicitly:

  1. Learned absolute: a learned vector $\mathbf{p}_t \in \mathbb{R}^{d_\text{model}}$ added to the token embedding at position $t$ (GPT-1/2)
  2. Rotary (RoPE): rotates query/key vectors by position-dependent angles, encoding relative position in the dot product: $\langle \text{RoPE}(\mathbf{q}, m), \text{RoPE}(\mathbf{k}, n) \rangle = g(\mathbf{q}, \mathbf{k}, m-n)$
  3. ALiBi: adds a distance-proportional penalty $-s\,|m - n|$ to attention scores, with a fixed per-head slope $s$

RoPE is used in LLaMA, Mistral, and most modern open LLMs.
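
Below is a minimal sketch of RoPE applied to a query or key tensor, assuming an even head dimension; the function name and the `base = 10000` default are conventional choices here, not a specific library's API.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq, d) queries or keys, d even; positions: (seq,) integer positions.
    d = x.size(-1)
    # One angular frequency per channel pair: theta_i = base^(-2i/d).
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = positions[:, None].float() * inv_freq[None, :]    # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin    # rotate each (x1, x2) channel pair
    out[..., 1::2] = x1 * sin + x2 * cos    # by its position-dependent angle
    return out

# The defining property: scores depend only on the offset m - n.
q, k, pos = torch.randn(16, 64), torch.randn(16, 64), torch.arange(16)
s1 = (rope(q, pos)[10] * rope(k, pos)[7]).sum()              # positions (10, 7)
s2 = (rope(q, pos + 100)[10] * rope(k, pos + 100)[7]).sum()  # positions (110, 107)
print(torch.allclose(s1, s2, atol=1e-4))                     # True: same offset, same score
```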

Definition: KV-Cache for Efficient Inference

During autoregressive generation, the key and value projections for all previous tokens are cached:

$$\mathbf{K}_t = [\mathbf{k}_1, \ldots, \mathbf{k}_t], \quad \mathbf{V}_t = [\mathbf{v}_1, \ldots, \mathbf{v}_t]$$

At step $t+1$, only the new token's projections $\mathbf{q}_{t+1}$, $\mathbf{k}_{t+1}$, $\mathbf{v}_{t+1}$ are computed; the new key and value are appended to the cache, and attention is $\text{softmax}(\mathbf{q}_{t+1}^\top \mathbf{K}_{t+1} / \sqrt{d_k})\, \mathbf{V}_{t+1}$.

Memory cost: $O(L \cdot n_h \cdot T \cdot d_k)$ per sequence ($L$ layers, $n_h$ KV heads, $T$ tokens, $d_k$ dims per head).

For a 70B-class model with $T = 8192$, the KV-cache requires roughly 16 GB per sequence in fp16; summed over a batch, the cache can rival or even exceed the memory used by the model weights.
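
The cost formula turns into a one-line estimator. The configuration below is an assumed full-MHA 70B-class shape (80 layers, 64 heads, head dimension 128); the exact figure depends on the dimensions and on whether GQA (defined later in this section) shrinks the number of KV heads.

```python
# KV-cache bytes: 2 tensors (K and V) x L layers x n_kv heads x T positions
# x d_k dims per head x bytes per element (2 for fp16).
def kv_cache_bytes(L: int, n_kv: int, T: int, d_k: int, bytes_per: int = 2) -> int:
    return 2 * L * n_kv * T * d_k * bytes_per

print(kv_cache_bytes(L=80, n_kv=64, T=8192, d_k=128) / 1e9)  # ~21.5 GB per sequence
```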

Definition: Parameter Count Formula

For a GPT with $L$ layers, model dimension $d$, and vocabulary size $V$:

$$N \approx 12 L d^2 + V d$$

The $12Ld^2$ comes from the per-layer weights: 4 attention projection matrices ($\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V, \mathbf{W}_O$, each $d \times d$, giving $4d^2$) plus 2 FFN matrices ($d \times 4d$ and $4d \times d$, giving $8d^2$), totaling $12d^2$ per layer.
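
As a quick sanity check, plugging in GPT-2 small's well-known configuration ($L = 12$, $d = 768$, $V = 50257$) lands close to its reported 124M parameters:

```python
def approx_params(L: int, d: int, V: int) -> int:
    # 12 L d^2 per-layer weights + V d token embedding (shared with the LM head)
    return 12 * L * d**2 + V * d

print(approx_params(L=12, d=768, V=50257) / 1e6)  # ~123.5 (million) vs ~124M reported
```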

Definition: Grouped Query Attention (GQA)

GQA reduces KV-cache size by sharing key/value heads across multiple query heads. With $n_h$ query heads and $n_g$ KV groups:

  • Multi-Head Attention (MHA): $n_g = n_h$ (no sharing)
  • Multi-Query Attention (MQA): $n_g = 1$ (all queries share one KV head)
  • GQA: $1 < n_g < n_h$ (each group of $n_h / n_g$ query heads shares one KV head)

LLaMA 2 70B uses GQA with $n_g = 8$ (against $n_h = 64$ query heads), reducing the KV-cache by $8\times$ relative to MHA.
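
A minimal sketch of the sharing trick: each KV head is repeated so that a whole group of query heads attends against the same keys (shapes and names here are our own; production kernels fuse this rather than materializing the repeat).

```python
import torch

def gqa_scores(q: torch.Tensor, k: torch.Tensor, n_g: int) -> torch.Tensor:
    # q: (n_h, T, d_k) query heads; k: (n_g, T, d_k) shared KV heads.
    n_h, T, d_k = q.shape
    k = k.repeat_interleave(n_h // n_g, dim=0)     # one KV copy per query head: (n_h, T, d_k)
    return (q @ k.transpose(-2, -1)) / d_k ** 0.5  # (n_h, T, T) attention scores

q = torch.randn(8, 16, 64)            # 8 query heads
k = torch.randn(2, 16, 64)            # 2 KV groups -> 4 query heads per group
print(gqa_scores(q, k, n_g=2).shape)  # torch.Size([8, 16, 16])
```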

Theorem: FLOPs per Token in Forward Pass

For a GPT with $N$ parameters, the approximate FLOP count for a single forward pass on one token is:

$$\text{FLOPs} \approx 2N$$

For training (forward + backward), the total is approximately $6N$ FLOPs per token. Training a model on $D$ tokens therefore costs:

$$C \approx 6ND \text{ FLOPs}$$

Each parameter participates in one multiply-add (2 FLOPs) during the forward pass, giving $2N$; the backward pass costs roughly twice the forward pass, for $2N + 4N = 6N$ per training token.
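
A worked instance of $C \approx 6ND$, with model size, token count, and hardware throughput chosen purely for illustration:

```python
N = 7e9        # assumed 7B-parameter model
D = 2e12       # assumed 2T training tokens
C = 6 * N * D  # ~8.4e22 FLOPs of training compute

# At an assumed 40% utilization of 1e15 FLOP/s (roughly one modern accelerator):
days = C / (0.4 * 1e15) / 86400
print(f"{C:.1e} FLOPs ~ {days:.0f} accelerator-days")  # 8.4e+22 FLOPs ~ 2431 accelerator-days
```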

Example: Building a Minimal GPT in PyTorch

Implement a GPT model with configurable depth and width. Count parameters and verify against the formula.
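
One possible sketch, reusing the `PreNormBlock` defined earlier; the class name, the learned-position table of size `max_T`, and the embedding/LM-head weight tying are our choices rather than a canonical implementation.

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, V: int, L: int, d: int, n_head: int, max_T: int = 1024):
        super().__init__()
        self.tok_emb = nn.Embedding(V, d)          # token embeddings: V x d
        self.pos_emb = nn.Embedding(max_T, d)      # learned absolute positions
        self.blocks = nn.ModuleList(PreNormBlock(d, n_head) for _ in range(L))
        self.ln_f = nn.LayerNorm(d)
        self.lm_head = nn.Linear(d, V, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying: count V x d only once

    def forward(self, idx: torch.Tensor) -> torch.Tensor:  # idx: (batch, T) token ids
        T = idx.size(1)
        h = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for block in self.blocks:
            h = block(h)
        return self.lm_head(self.ln_f(h))          # logits over the vocabulary

model = MiniGPT(V=50257, L=12, d=768, n_head=12)
n = sum(p.numel() for p in model.parameters())
# ~124.4M actual vs ~123.5M from 12Ld^2 + Vd (positions and biases explain the gap).
print(n / 1e6, (12 * 12 * 768**2 + 50257 * 768) / 1e6)
```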

Example: KV-Cache Implementation

Implement KV-caching for efficient autoregressive generation.
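
One way this might look for a single attention head, as a minimal sketch (real caches are per-layer and per-head, and are usually preallocated rather than grown with `torch.cat`):

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Grows K and V along the time axis, one token per decode step."""
    def __init__(self):
        self.K, self.V = None, None

    def append(self, k: torch.Tensor, v: torch.Tensor):  # k, v: (1, d_k) for the new token
        self.K = k if self.K is None else torch.cat([self.K, k], dim=0)
        self.V = v if self.V is None else torch.cat([self.V, v], dim=0)

def decode_step(cache: KVCache, q, k, v):
    # Append the new token's k, v, then attend with its q against the full cache.
    cache.append(k, v)
    scores = (q @ cache.K.T) / q.size(-1) ** 0.5  # (1, t): new query vs all cached keys
    return F.softmax(scores, dim=-1) @ cache.V    # (1, d_k) attention output

cache = KVCache()
for t in range(5):  # stand-in projections for 5 generated tokens
    out = decode_step(cache, torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64))
print(cache.K.shape)  # torch.Size([5, 64])
```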

GPT Parameter Calculator

Calculate parameter count for different GPT configurations.

KV-Cache Memory Calculator

Estimate KV-cache memory for different model sizes and sequence lengths.

GPT Architecture Diagram

The GPT decoder-only transformer: token embedding, positional encoding, $L$ transformer blocks, and language model head.

Quick Check

Why do modern GPTs use pre-norm (LayerNorm before attention) instead of post-norm?

  • It reduces parameter count
  • It stabilizes training for very deep networks
  • It improves inference speed

Common Mistake: Underestimating KV-Cache Memory

Mistake: Planning the GPU memory budget based only on model weights.

Correction: For long sequences, the KV-cache can exceed model weight memory. A 7B model at 4096 tokens uses ~2 GB for KV-cache in fp16; at 32K tokens, this grows to ~16 GB. Always account for the KV-cache when planning batch sizes.

Historical Note: The Evolution of GPT

2018-2023

GPT-1 (2018, 117M params) showed that pre-training a transformer on unlabeled text, then fine-tuning on downstream tasks, outperformed task-specific architectures. GPT-2 (2019, 1.5B) demonstrated surprisingly strong zero-shot task transfer. GPT-3 (2020, 175B) made in-context (few-shot) learning practical. GPT-4's (2023) architecture remains largely undisclosed, but it represents a major leap in capability.

GPT (Generative Pre-trained Transformer)

A family of decoder-only transformer language models trained with next-token prediction, starting with GPT-1 (2018) at OpenAI.

Related: KV-Cache

KV-Cache

A cache storing previously computed key and value tensors during autoregressive generation, avoiding redundant computation at the cost of memory.

Related: GPT (Generative Pre-trained Transformer)