Exercises

ex-sp-ch35-01

Easy

Compute the parameter count for GPT-2 Small ($L=12$, $d=768$, $V=50257$) using the formula $N \approx 12Ld^2 + Vd$. Compare with the reported 124M parameters.
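
As a quick sanity check of the arithmetic, here is a minimal sketch of the computation the formula implies. It assumes the $12Ld^2$ term covers attention and FFN weights per block and ignores biases, layer norms, and positional embeddings, so the result is only approximate.

```python
# Rough parameter count for GPT-2 Small using N ≈ 12 L d^2 + V d.
# Biases, layer norms, and positional embeddings are ignored (assumption),
# so the total is approximate.
L, d, V = 12, 768, 50257

transformer_params = 12 * L * d**2   # attention (~4d^2) + FFN (~8d^2) per layer
embedding_params = V * d             # token embedding (tied with the output head)
total = transformer_params + embedding_params

print(f"{transformer_params / 1e6:.1f}M + {embedding_params / 1e6:.1f}M "
      f"= {total / 1e6:.1f}M (reported: 124M)")
```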

ex-sp-ch35-02

Easy

Calculate the FLOPs required to pre-train a 7B model on 1T tokens using the $C \approx 6ND$ formula. Express in GPU-hours assuming an A100 at 312 TFLOPS.
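
A back-of-the-envelope sketch of the conversion follows. It assumes 100% utilization of the A100's 312 TFLOPS peak; real training runs typically reach 30-50% MFU, so the true GPU-hour count would be correspondingly larger.

```python
# Training FLOPs via C ≈ 6 N D, converted to A100-hours at peak throughput.
N = 7e9              # parameters
D = 1e12             # training tokens
peak_flops = 312e12  # A100 tensor-core peak, FLOP/s (assumption: ideal utilization)

C = 6 * N * D
gpu_hours = C / peak_flops / 3600

print(f"C = {C:.2e} FLOPs  ->  {gpu_hours:,.0f} A100-hours at peak throughput")
```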

ex-sp-ch35-03

Easy

Implement sinusoidal positional encoding and verify that the dot product between positions decays smoothly with distance.
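
A starter sketch, assuming the original sin/cos formulation from "Attention Is All You Need"; variable names and the chosen dimensions are illustrative.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Standard sinusoidal positional encoding, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(512, 128)
# Dot product of position 0 with increasing positions: decays with distance.
sims = pe @ pe[0]
print(sims[:8].round(2))
```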

ex-sp-ch35-04

Easy

Calculate the KV-cache memory for a 7B model ($L=32$, $d=4096$, $n_h=32$) at sequence length 4096 in fp16.
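
A minimal sketch of the arithmetic, assuming standard MHA (all heads cached), batch size 1, and 2 bytes per fp16 element.

```python
# KV-cache size per sequence: 2 (K and V) * L * seq_len * d * bytes/elem.
L, d, seq_len, bytes_per_elem = 32, 4096, 4096, 2  # fp16

kv_bytes = 2 * L * seq_len * d * bytes_per_elem
print(f"{kv_bytes / 2**30:.2f} GiB per sequence")
```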

ex-sp-ch35-05

Medium

Implement a minimal GPT (2 layers, $d=128$) and train it on a small text corpus. Plot the training loss and perplexity.

ex-sp-ch35-06

Medium

Implement GQA with 8 query heads and 2 KV groups. Compare the memory usage with standard MHA.
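
A shape-only sketch of the idea may help before coding: K and V are projected to the number of KV groups and then repeated across the query heads, so only 2 of the 8 heads need to be cached, roughly a 4x KV-cache reduction relative to MHA. Dimensions below are illustrative and the attention computation itself is left out.

```python
import torch
import torch.nn as nn

d, n_heads, n_kv_groups = 512, 8, 2   # illustrative sizes
head_dim = d // n_heads

q_proj = nn.Linear(d, n_heads * head_dim)
kv_proj = nn.Linear(d, 2 * n_kv_groups * head_dim)  # K and V together

x = torch.randn(1, 16, d)
q = q_proj(x).view(1, 16, n_heads, head_dim)
k, v = kv_proj(x).chunk(2, dim=-1)                  # only these get cached
k = k.view(1, 16, n_kv_groups, head_dim).repeat_interleave(
    n_heads // n_kv_groups, dim=2)                  # broadcast to query heads
print(q.shape, k.shape)                             # both (1, 16, 8, 64) after expansion
```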

ex-sp-ch35-07

Medium

Implement the DPO loss function and train on synthetic preference data (random preferred/rejected pairs). Plot the loss convergence.
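
A minimal sketch of the loss term itself, assuming the per-response log-probabilities have already been summed over tokens; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023): push the policy's log-ratio on the
    preferred response above its log-ratio on the rejected one."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```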

ex-sp-ch35-08

Medium

Implement KV-caching and measure the speedup compared to recomputing the full sequence at each step. Plot latency vs. sequence length.

ex-sp-ch35-09

Medium

Build a scaling law plot: train language models with 1M, 5M, 10M, and 50M parameters on 100M tokens and fit the power law $L \propto N^{-\alpha}$.
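
The fit itself is a straight line in log-log space. A sketch follows; the loss values are placeholders purely to illustrate the fitting step, not measured results.

```python
import numpy as np

N = np.array([1e6, 5e6, 1e7, 5e7])       # parameter counts
loss = np.array([4.8, 4.2, 4.0, 3.6])    # placeholder losses (replace with yours)

slope, _ = np.polyfit(np.log(N), np.log(loss), 1)  # log L = -alpha * log N + c
alpha = -slope
print(f"alpha ≈ {alpha:.3f}")
```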

ex-sp-ch35-10

Hard

Implement a simple Mixture of Experts layer with 4 experts and top-1 routing. Include load balancing loss. Train on a text corpus and compare with a dense model of equivalent active parameters.

ex-sp-ch35-11

Hard

Implement RoPE (Rotary Position Embedding) from scratch and show that the dot product between a query and a key depends only on their relative position.
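
A quick numerical check of the property on a single 2-D pair may clarify what to verify: rotating q by $m\theta$ and k by $n\theta$ leaves their dot product a function of $n - m$ only, and the same argument applies to each rotated dimension pair in full RoPE.

```python
import numpy as np

def rotate(v, angle):
    """2-D rotation, the building block of RoPE."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.1

for m, n in [(3, 7), (10, 14), (100, 104)]:  # same offset n - m = 4
    print(np.dot(rotate(q, m * theta), rotate(k, n * theta)))  # identical values
```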

ex-sp-ch35-12

Hard

Implement the RLHF pipeline: run SFT on a small dataset, train a reward model on the preference data, then use PPO to optimize the policy against the reward model (use a toy 1M-parameter model).

ex-sp-ch35-13

Hard

Profile the forward pass of a GPT model and break down compute time by component: attention, FFN, layer norm, embedding.

ex-sp-ch35-14

Hard

Use BERT to extract contextual embeddings for the word "channel" in different contexts (TV channel, water channel, communication channel). Show that the embeddings differ.
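
A starter sketch, assuming the Hugging Face transformers library with bert-base-uncased and that "channel" maps to a single token in its vocabulary; the example sentences are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def channel_vector(sentence):
    """Return the contextual hidden state at the position of 'channel'."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("channel"))
    return hidden[idx]

a = channel_vector("She switched the TV to another channel.")
b = channel_vector("The boat sailed through the narrow channel.")
print(torch.cosine_similarity(a, b, dim=0).item())  # < 1.0: the embeddings differ
```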

ex-sp-ch35-15

Challenge

Implement the FlashAttention-2 tiling algorithm in pure PyTorch. Benchmark memory and speed against standard attention for sequence lengths 1024, 4096, and 16384.

ex-sp-ch35-16

Challenge

Train a Chinchilla-optimal set of models at 3 different compute budgets (varying N and D proportionally). Verify the power-law relationship between compute and test loss.