The GPT Architecture
Definition: GPT Architecture
The Generative Pre-trained Transformer (GPT) is a stack of decoder-only transformer blocks. Each block contains:
- Causal (masked) multi-head self-attention
- Feed-forward network (two linear layers with GELU activation)
- Residual connections and layer normalization
The forward pass for block $\ell$ (pre-norm form):

$$h_\ell = x_\ell + \mathrm{MHA}(\mathrm{LN}_1(x_\ell)), \qquad x_{\ell+1} = h_\ell + \mathrm{FFN}(\mathrm{LN}_2(h_\ell))$$

where $\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x)$.
Modern GPTs use pre-norm (LN before attention/FFN) rather than post-norm, which stabilizes training for deep networks.
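A schematic sketch of the two orderings (here `attn`, `ffn`, `ln1`, and `ln2` stand in for the sub-layers of one block):

```python
def post_norm_block(x, attn, ffn, ln1, ln2):
    # Original Transformer ordering: LayerNorm applied after each residual add.
    x = ln1(x + attn(x))
    return ln2(x + ffn(x))

def pre_norm_block(x, attn, ffn, ln1, ln2):
    # GPT-2-style ordering: LayerNorm inside each residual branch, so the
    # residual path stays an identity and gradients flow cleanly through depth.
    x = x + attn(ln1(x))
    return x + ffn(ln2(x))
```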
Definition: Positional Encoding Strategies
Since self-attention is permutation-equivariant, position must be injected explicitly:
- Learned absolute: added to token embeddings (GPT-1/2)
- Rotary (RoPE): rotates query/key vectors by position-dependent angles, encoding relative position in the dot product: $\langle R_m q, R_n k \rangle = \langle q, R_{n-m} k \rangle$, so attention scores depend only on the offset $n - m$
- ALiBi: adds a linear bias to attention scores, proportional to query-key distance
RoPE is used in LLaMA, Mistral, and most modern open LLMs; a minimal sketch of applying it follows.
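Below is a minimal sketch of the "rotate-half" formulation of RoPE, assuming inputs of shape `(batch, seq, n_heads, d_head)`; production code precomputes and caches the angles:

```python
import torch

def rope(x, base=10000.0):
    # Rotate channel pairs (i, i + d_head/2) by angle pos * base**(-2i/d_head).
    # Applied to both q and k before the dot product, this makes q.k depend
    # only on the relative position of the two tokens.
    B, T, H, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rope(torch.randn(1, 8, 4, 64))  # apply to queries (and likewise to keys)
```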
Definition: KV-Cache for Efficient Inference
During autoregressive generation, the key and value projections for all previous tokens are cached:

$$K_t = [k_1; \dots; k_t], \qquad V_t = [v_1; \dots; v_t]$$

At step $t$, only the new token's $k_t, v_t$ is computed, and attention is $\mathrm{softmax}\!\big(q_t K_t^\top / \sqrt{d_k}\big)\,V_t$.

Memory cost: $2 \cdot L \cdot T \cdot d_{\text{model}}$ cached values per sequence (the factor 2 covers K and V), i.e. $4\,L\,T\,d_{\text{model}}$ bytes in fp16.

For a 70B model ($L = 80$, $d_{\text{model}} = 8192$), that is roughly 2.6 MB per token in fp16, so the KV-cache reaches ~16 GB by about 6K tokens; across a serving batch, the aggregate cache often exceeds the memory occupied by the model weights.
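The arithmetic is easy to verify in a few lines (a sketch; the layer counts and widths below are the commonly cited LLaMA-style configurations):

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value=2, batch=1):
    # Two tensors (K and V) per layer, each seq_len x d_model values.
    return 2 * n_layers * d_model * seq_len * bytes_per_value * batch

print(kv_cache_bytes(80, 8192, 6000) / 1e9)  # 70B-class, fp16: ~15.7 GB
print(kv_cache_bytes(32, 4096, 4096) / 1e9)  # 7B-class at 4K tokens: ~2.1 GB
```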
Definition: Parameter Count Formula
For a GPT with $L$ layers, dimension $d$, and vocabulary $V$:

$$N \approx 12\,L\,d^2 + V\,d$$

The $12d^2$ comes from: 4 attention projection matrices ($W_Q$, $W_K$, $W_V$, $W_O$, each $d \times d$) plus 2 FFN matrices ($d \times 4d$ and $4d \times d$), totaling $4d^2 + 8d^2 = 12d^2$ per layer. The $Vd$ term counts a tied input-embedding/output matrix once; biases, LayerNorms, and positional embeddings are ignored.
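Plugging in GPT-2 small's configuration ($L = 12$, $d = 768$, $V = 50257$) recovers the familiar ~124M figure:

```python
def gpt_params(n_layers, d_model, vocab_size):
    # 12 d^2 per block (4 attention projections + 2 FFN matrices)
    # plus one tied embedding/output matrix of vocab_size x d_model.
    return 12 * n_layers * d_model**2 + vocab_size * d_model

print(f"{gpt_params(12, 768, 50257):,}")  # 123,532,032 (~124M)
```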
Definition: Grouped Query Attention (GQA)
GQA reduces KV-cache size by sharing key/value heads across multiple query heads. With $H$ query heads and $G$ KV groups:
- Multi-Head Attention (MHA): $G = H$ (no sharing)
- Multi-Query Attention (MQA): $G = 1$ (all queries share one KV head)
- GQA: $1 < G < H$ (each group of $H/G$ query heads shares one KV head)
LLaMA 2 70B uses GQA with $H = 64$ and $G = 8$, reducing the KV-cache by $8\times$ relative to MHA.
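A shape-level sketch of the sharing (the dimensions are illustrative; the causal mask is omitted since only the head bookkeeping is of interest here):

```python
import torch

B, T, d_head = 2, 16, 128
H, G = 64, 8                       # query heads, KV groups (LLaMA 2 70B-style)

q = torch.randn(B, H, T, d_head)   # one set of queries per head
k = torch.randn(B, G, T, d_head)   # only G key/value heads are stored and cached
v = torch.randn(B, G, T, d_head)

# Expand each KV head to serve its group of H // G query heads.
k = k.repeat_interleave(H // G, dim=1)   # (B, H, T, d_head)
v = v.repeat_interleave(H // G, dim=1)

scores = (q @ k.transpose(-2, -1)) / d_head**0.5
out = scores.softmax(-1) @ v
print(out.shape)   # torch.Size([2, 64, 16, 128])
```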
Theorem: FLOPs per Token in Forward Pass
For a GPT with $N$ parameters, the approximate FLOPs for a single forward pass on one token is:

$$\text{FLOPs}_{\text{forward}} \approx 2N$$

For training (forward + backward), the total is approximately $6N$ FLOPs per token. Training a model on $D$ tokens costs:

$$C \approx 6\,N\,D \ \text{FLOPs}$$

Each parameter participates in one multiply-add (2 FLOPs) during the forward pass; the backward pass costs roughly 2x the forward, giving $2N + 4N = 6N$ per token.
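As a worked example with Chinchilla-scale numbers (illustrative, not a reproduction of any reported budget):

```python
N = 70e9    # parameters
D = 1.4e12  # training tokens
print(f"{6 * N * D:.2e}")  # ~5.88e+23 FLOPs for one training run
```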
Example: Building a Minimal GPT in PyTorch
Implement a GPT model with configurable depth and width. Count parameters and verify against the formula.
Implementation
```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    """Pre-norm decoder block: LN -> attention -> residual, LN -> FFN -> residual."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, mask=None):
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + h
        x = x + self.ffn(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned absolute positions
        self.blocks = nn.ModuleList(
            [GPTBlock(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying, as the formula assumes

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos
        # Boolean causal mask: True above the diagonal = "may not attend".
        mask = torch.triu(torch.ones(T, T, device=idx.device), diagonal=1).bool()
        for block in self.blocks:
            x = block(x, mask)
        return self.head(self.ln_f(x))

model = GPT(50257, d_model=768, n_heads=12, n_layers=12, max_len=1024)
n_params = sum(p.numel() for p in model.parameters())
formula = 12 * 12 * 768**2 + 50257 * 768
print(f"Actual:  {n_params:,}")   # ~124.4M: also counts pos. embeddings, biases, LayerNorms
print(f"Formula: {formula:,}")    # 123,532,032
```
Example: KV-Cache Implementation
Implement KV-caching for efficient autoregressive generation.
Key Idea
```python
import torch
import torch.nn as nn

class CachedAttention(nn.Module):
    """Self-attention that reuses cached K/V from previous decoding steps."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, cache=None):
        B, T, d = x.shape
        qkv = self.W_qkv(x).reshape(B, T, 3, self.n_heads, self.d_k)
        q, k, v = qkv.unbind(2)                 # each (B, T, n_heads, d_k)
        if cache is not None:
            k_prev, v_prev = cache
            k = torch.cat([k_prev, k], dim=1)   # append along the time axis
            v = torch.cat([v_prev, v], dim=1)
        new_cache = (k, v)
        # Attention over the full (cached + new) K, V.
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (B, n_heads, ., d_k)
        scores = q @ k.transpose(-2, -1) / self.d_k**0.5   # (B, n_heads, T, S)
        if T > 1:
            # Prefill: apply a causal mask. During one-token decoding (T == 1)
            # every cached position is in the past, so no mask is needed.
            S = k.shape[2]
            causal = torch.triu(torch.ones(T, S, device=x.device, dtype=torch.bool),
                                diagonal=S - T + 1)
            scores = scores.masked_fill(causal, float("-inf"))
        attn = scores.softmax(-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, d)
        return self.W_o(out), new_cache
```
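Used in a generation loop, the prompt is processed once (prefill) and each later step feeds only the newest position (a sketch; `CachedAttention` stands in for a full block, and the inputs are random stand-ins for embeddings):

```python
attn = CachedAttention(d_model=64, n_heads=4)
prompt = torch.randn(1, 10, 64)      # embedded prompt, 10 positions

out, cache = attn(prompt)            # prefill: attends over all 10 positions
for _ in range(5):                   # decode: one position per step
    new_tok = out[:, -1:, :]         # stand-in for the next token's embedding
    out, cache = attn(new_tok, cache)
print(cache[0].shape)                # torch.Size([1, 15, 4, 16]): 10 + 5 cached keys
```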
[Interactive: GPT Parameter Calculator — calculate parameter count for different GPT configurations]
[Interactive: KV-Cache Memory Calculator — estimate KV-cache memory for different model sizes and sequence lengths]
[Figure: GPT Architecture Diagram]
Quick Check
Why do modern GPTs use pre-norm (LayerNorm before attention) instead of post-norm?
- It reduces parameter count
- It stabilizes training for very deep networks (correct)
- It improves inference speed
Explanation: pre-norm prevents gradient explosion/vanishing in deep transformers by normalizing the inputs to each sub-layer.
Common Mistake: Underestimating KV-Cache Memory
Mistake:
Planning GPU memory budget based only on model weights.
Correction:
For long sequences, KV-cache can exceed model weight memory. A 7B model at 4096 tokens uses ~2 GB for KV-cache in fp16. At 32K tokens, this grows to ~16 GB. Always account for KV-cache when planning batch sizes.
Historical Note: The Evolution of GPT
2018-2023. GPT-1 (2018, 117M params) showed that pre-training a transformer on unlabeled text, then fine-tuning on tasks, outperformed task-specific architectures. GPT-2 (2019, 1.5B) demonstrated emergent few-shot abilities. GPT-3 (2020, 175B) made in-context learning practical. GPT-4 (2023) remains largely undocumented but represents a massive leap in capability.
GPT (Generative Pre-trained Transformer)
A family of decoder-only transformer language models trained with next-token prediction, starting with GPT-1 (2018) at OpenAI.
Related: KV-Cache
KV-Cache
A cache storing previously computed key and value tensors during autoregressive generation, avoiding redundant computation at the cost of memory.