Exercises

ex-sp-ch35-01

Easy

Compute the parameter count for GPT-2 Small ($L=12$, $d=768$, $V=50257$) using the formula $N \approx 12Ld^2 + Vd$. Compare with the reported 124M parameters.
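
As a quick sanity check of the arithmetic, here is a minimal sketch of the computation the formula implies. It assumes the $12Ld^2$ term covers attention and FFN weights per block and ignores biases, layer norms, and positional embeddings, so the result is only approximate.

```python
# Rough parameter count for GPT-2 Small using N ≈ 12 L d^2 + V d.
# Biases, layer norms, and positional embeddings are ignored (assumption),
# so the total is approximate.
L, d, V = 12, 768, 50257

transformer_params = 12 * L * d**2   # attention (~4d^2) + FFN (~8d^2) per layer
embedding_params = V * d             # token embedding (tied with the output head)
total = transformer_params + embedding_params

print(f"{transformer_params / 1e6:.1f}M + {embedding_params / 1e6:.1f}M "
      f"= {total / 1e6:.1f}M (reported: 124M)")
```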

ex-sp-ch35-02

Easy

Calculate the FLOPs required to pre-train a 7B model on 1T tokens using the $C \approx 6ND$ formula. Express in GPU-hours assuming an A100 at 312 TFLOPS.
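
A back-of-the-envelope sketch of the conversion follows. It assumes 100% utilization of the A100's 312 TFLOPS peak; real training runs typically reach 30-50% MFU, so the true GPU-hour count would be correspondingly larger.

```python
# Training FLOPs via C ≈ 6 N D, converted to A100-hours at peak throughput.
N = 7e9              # parameters
D = 1e12             # training tokens
peak_flops = 312e12  # A100 tensor-core peak, FLOP/s (assumption: ideal utilization)

C = 6 * N * D
gpu_hours = C / peak_flops / 3600

print(f"C = {C:.2e} FLOPs  ->  {gpu_hours:,.0f} A100-hours at peak throughput")
```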

ex-sp-ch35-03

Easy

Implement sinusoidal positional encoding and verify that the dot product between positions decays smoothly with distance.
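
A starter sketch, assuming the original sin/cos formulation from "Attention Is All You Need"; variable names and the chosen dimensions are illustrative.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Standard sinusoidal positional encoding, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(512, 128)
# Dot product of position 0 with increasing positions: decays with distance.
sims = pe @ pe[0]
print(sims[:8].round(2))
```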

ex-sp-ch35-04

Easy

Calculate the KV-cache memory for a 7B model ($L=32$, $d=4096$, $n_h=32$) at sequence length 4096 in fp16.
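
A minimal sketch of the arithmetic, assuming standard MHA (all heads cached), batch size 1, and 2 bytes per fp16 element.

```python
# KV-cache size per sequence: 2 (K and V) * L * seq_len * d * bytes/elem.
L, d, seq_len, bytes_per_elem = 32, 4096, 4096, 2  # fp16

kv_bytes = 2 * L * seq_len * d * bytes_per_elem
print(f"{kv_bytes / 2**30:.2f} GiB per sequence")
```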

ex-sp-ch35-05

Medium

Implement a minimal GPT (2 layers, $d=128$) and train it on a small text corpus. Plot the training loss and perplexity.

ex-sp-ch35-06

Medium

Implement GQA with 8 query heads and 2 KV groups. Compare the memory usage with standard MHA.
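
A shape-only sketch of the idea may help before coding: K and V are projected to the number of KV groups and then repeated across the query heads, so only 2 of the 8 heads need to be cached, roughly a 4x KV-cache reduction relative to MHA. Dimensions below are illustrative and the attention computation itself is left out.

```python
import torch
import torch.nn as nn

d, n_heads, n_kv_groups = 512, 8, 2   # illustrative sizes
head_dim = d // n_heads

q_proj = nn.Linear(d, n_heads * head_dim)
kv_proj = nn.Linear(d, 2 * n_kv_groups * head_dim)  # K and V together

x = torch.randn(1, 16, d)
q = q_proj(x).view(1, 16, n_heads, head_dim)
k, v = kv_proj(x).chunk(2, dim=-1)                  # only these get cached
k = k.view(1, 16, n_kv_groups, head_dim).repeat_interleave(
    n_heads // n_kv_groups, dim=2)                  # broadcast to query heads
print(q.shape, k.shape)                             # both (1, 16, 8, 64) after expansion
```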

ex-sp-ch35-07

Medium

Implement the DPO loss function and train on synthetic preference data (random preferred/rejected pairs). Plot the loss convergence.
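
A minimal sketch of the loss term itself, assuming the per-response log-probabilities have already been summed over tokens; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023): push the policy's log-ratio on the
    preferred response above its log-ratio on the rejected one."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```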

ex-sp-ch35-08

Medium

Implement KV-caching and measure the speedup compared to recomputing the full sequence at each step. Plot latency vs. sequence length.

ex-sp-ch35-09

Medium

Build a scaling law plot: train language models with 1M, 5M, 10M, and 50M parameters on 100M tokens and fit the power law $L \propto N^{-\alpha}$.
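
The fit itself is a straight line in log-log space. A sketch follows; the loss values are placeholders purely to illustrate the fitting step, not measured results.

```python
import numpy as np

N = np.array([1e6, 5e6, 1e7, 5e7])       # parameter counts
loss = np.array([4.8, 4.2, 4.0, 3.6])    # placeholder losses (replace with yours)

slope, _ = np.polyfit(np.log(N), np.log(loss), 1)  # log L = -alpha * log N + c
alpha = -slope
print(f"alpha ≈ {alpha:.3f}")
```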

ex-sp-ch35-10

Hard

Implement a simple Mixture of Experts layer with 4 experts and top-1 routing. Include load balancing loss. Train on a text corpus and compare with a dense model of equivalent active parameters.

ex-sp-ch35-11

Hard

Implement RoPE (Rotary Position Embedding) from scratch and show that the dot product between a query and a key depends only on their relative position.
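
A quick numerical check of the property on a single 2-D pair may clarify what to verify: rotating q by $m\theta$ and k by $n\theta$ leaves their dot product a function of $n - m$ only, and the same argument applies to each rotated dimension pair in full RoPE.

```python
import numpy as np

def rotate(v, angle):
    """2-D rotation, the building block of RoPE."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.1

for m, n in [(3, 7), (10, 14), (100, 104)]:  # same offset n - m = 4
    print(np.dot(rotate(q, m * theta), rotate(k, n * theta)))  # identical values
```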

ex-sp-ch35-12

Hard

Implement the RLHF pipeline: run SFT on a small dataset, train a reward model on the preference data, then use PPO to optimize the policy against the reward model (use a toy 1M-parameter model).

ex-sp-ch35-13

Hard

Profile the forward pass of a GPT model and break down compute time by component: attention, FFN, layer norm, embedding.

ex-sp-ch35-14

Hard

Use BERT to extract contextual embeddings for the word "channel" in different contexts (TV channel, water channel, communication channel). Show that the embeddings differ.
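
A starter sketch, assuming the Hugging Face transformers library with bert-base-uncased and that "channel" maps to a single token in its vocabulary; the example sentences are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def channel_vector(sentence):
    """Return the contextual hidden state at the position of 'channel'."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("channel"))
    return hidden[idx]

a = channel_vector("She switched the TV to another channel.")
b = channel_vector("The boat sailed through the narrow channel.")
print(torch.cosine_similarity(a, b, dim=0).item())  # < 1.0: the embeddings differ
```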

ex-sp-ch35-15

Challenge

Implement the FlashAttention-2 tiling algorithm in pure PyTorch. Benchmark memory and speed against standard attention for sequence lengths 1024, 4096, and 16384.

ex-sp-ch35-16

Challenge

Train a Chinchilla-optimal set of models at 3 different compute budgets (varying N and D proportionally). Verify the power-law relationship between compute and test loss.