Exercises
ex-sp-ch35-01
Easy: Compute the parameter count for GPT-2 Small ($L = 12$, $d_{\text{model}} = 768$, $V = 50257$) using the formula $N \approx 12 L d_{\text{model}}^2 + V d_{\text{model}}$. Compare with the reported 124M parameters.
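A quick back-of-envelope check of your answer (the layer count, width, and vocabulary size above are the published GPT-2 Small values):

```python
# Parameter count under the approximation
# N ≈ 12 L d^2 (transformer blocks) + V d (tied token embedding).
L, d, V = 12, 768, 50257
blocks = 12 * L * d**2          # attention + MLP weights: ~12 d^2 per layer
embed = V * d                   # token embedding matrix
print(f"{(blocks + embed) / 1e6:.1f}M")   # ~123.5M, close to the reported 124M
```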
ex-sp-ch35-02
Easy: Calculate the FLOPs required to pre-train a 7B model on 1T tokens using the formula $C \approx 6ND$. Express the result in GPU-hours assuming an A100 at 312 TFLOPS.
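A sketch to check the order of magnitude (utilization is assumed to be 100%, which real training never reaches):

```python
# Training compute via C ≈ 6 N D, converted to single-A100 hours.
N, D = 7e9, 1e12                  # parameters, tokens
C = 6 * N * D                     # ≈ 4.2e22 FLOPs
peak = 312e12                     # A100 peak FLOPS, as stated in the exercise
hours = C / peak / 3600           # assumes 100% utilization (optimistic)
print(f"{C:.2e} FLOPs ≈ {hours:,.0f} A100-hours")
```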
ex-sp-ch35-03
Easy: Implement sinusoidal positional encoding and verify that the dot product between position encodings decays smoothly with distance.
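A minimal starter sketch, assuming the usual $10000^{-2i/d}$ frequency schedule:

```python
import torch

def sinusoidal_pe(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / 10000 ** (i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dims: sine
    pe[:, 1::2] = torch.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_pe(512, 128)
sims = pe[0] @ pe.T    # similarity of position 0 to every other position
print(sims[:10])       # should fall off as the offset grows
```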
ex-sp-ch35-04
Easy: Calculate the KV-cache memory for a 7B model ($L = 32$, $n_{\text{head}} = 32$, $d_{\text{model}} = 4096$) at sequence length 4096 in fp16.
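A worked check, assuming the LLaMA-7B-style shapes above (per token, each layer stores one $d_{\text{model}}$-sized key and one value):

```python
# KV-cache bytes = 2 (K and V) x layers x seq_len x d_model x bytes per value.
L, d_model, seq_len, fp16_bytes = 32, 4096, 4096, 2
cache = 2 * L * seq_len * d_model * fp16_bytes
print(f"{cache / 2**30:.1f} GiB")   # 2.0 GiB per sequence
```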
ex-sp-ch35-05
Medium: Implement a minimal GPT (2 layers, a small $d_{\text{model}}$) and train it on a small text corpus. Plot the training loss and perplexity.
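A possible model skeleton using PyTorch's built-in encoder layers with a causal mask; all sizes are placeholders, and the training loop is left to you:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab=256, d=128, n_layers=2, n_heads=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, x):  # x: (batch, seq) token ids
        T = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(T, device=x.device))
        # -inf strictly above the diagonal blocks attention to future tokens
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), 1)
        return self.head(self.blocks(h, mask=causal))
```

Perplexity is simply the exponential of the mean cross-entropy loss.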
ex-sp-ch35-06
Medium: Implement GQA with 8 query heads and 2 KV groups. Compare its memory usage with standard MHA.
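One way to realize the sharing is to store only 2 KV heads and expand them to match the 8 query heads; a sketch with hypothetical sizes:

```python
import torch
import torch.nn.functional as F

B, T, d, n_q, n_kv = 1, 1024, 512, 8, 2
hd = d // n_q                                # per-head dimension
q = torch.randn(B, n_q, T, hd)
k = torch.randn(B, n_kv, T, hd)              # KV cache is n_kv/n_q = 1/4 of MHA
v = torch.randn(B, n_kv, T, hd)
k_rep = k.repeat_interleave(n_q // n_kv, dim=1)   # broadcast groups to queries
v_rep = v.repeat_interleave(n_q // n_kv, dim=1)
out = F.scaled_dot_product_attention(q, k_rep, v_rep)
print(out.shape, "KV memory vs MHA:", n_kv / n_q)
```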
ex-sp-ch35-07
Medium: Implement the DPO loss function and train on synthetic preference data (random preferred/rejected pairs). Plot the loss convergence.
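A sketch of the loss itself, operating on per-sequence log-probabilities (summed token log-probs) under the policy and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_w / logp_l: policy log-probs of preferred / rejected responses;
    # ref_*: the same quantities under the frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```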
ex-sp-ch35-08
Medium: Implement KV-caching and measure the speedup compared to recomputing attention over the full sequence at each step. Plot latency vs. sequence length.
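The core of the cache is appending each new key/value instead of recomputing the history; a single-layer toy version to build the benchmark around:

```python
import torch
import torch.nn.functional as F

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
cache_k, cache_v = [], []
for x in torch.randn(16, d):                # tokens arriving one at a time
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    cache_k.append(k); cache_v.append(v)    # O(1) projection work per step
    K, V = torch.stack(cache_k), torch.stack(cache_v)
    out = F.softmax(q @ K.T / d**0.5, dim=-1) @ V   # attend over cached history
```

Without the cache, step $t$ recomputes keys and values for all $t$ previous tokens instead of reusing them.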
ex-sp-ch35-09
Medium: Build a scaling-law plot: train language models with 1M, 5M, 10M, and 50M parameters on 100M tokens and fit the power law $L(N) = a N^{-\alpha} + c$.
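A fitting sketch with SciPy; the loss values below are placeholders, so substitute your measured validation losses:

```python
import numpy as np
from scipy.optimize import curve_fit

N = np.array([1e6, 5e6, 1e7, 5e7])          # model sizes
loss = np.array([5.1, 4.6, 4.4, 4.0])       # placeholder: your measured losses
f = lambda n, a, alpha, c: a * n**(-alpha) + c
(a, alpha, c), _ = curve_fit(f, N, loss, p0=[30.0, 0.2, 3.0])
print(f"alpha ≈ {alpha:.2f}, irreducible loss ≈ {c:.2f}")
```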
ex-sp-ch35-10
Hard: Implement a simple Mixture-of-Experts layer with 4 experts and top-1 routing, including a load-balancing loss. Train it on a text corpus and compare with a dense model of equivalent active parameter count.
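A minimal Switch-style layer to start from; the auxiliary loss follows the usual fraction-times-probability form, and sizes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d=128, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (tokens, d)
        probs = F.softmax(self.gate(x), dim=-1)  # (tokens, E)
        top_p, top_i = probs.max(dim=-1)         # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e
            if sel.any():
                out[sel] = top_p[sel, None] * expert(x[sel])
        # load balancing: E * sum_e (fraction routed to e) * (mean gate prob of e)
        frac = F.one_hot(top_i, len(self.experts)).float().mean(0)
        aux = len(self.experts) * (frac * probs.mean(0)).sum()
        return out, aux
```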
ex-sp-ch35-11
Hard: Implement RoPE (rotary position embedding) from scratch and show that the dot product between a rotated query and key depends only on their relative position.
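A starter sketch; the final check should print True, since rotating $q$ and $k$ by position-dependent angles makes their dot product a function of the offset alone:

```python
import torch

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos * inv_freq                   # one angle per feature pair
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)
s1 = rope(q, torch.tensor(5.0)) @ rope(k, torch.tensor(2.0))      # offset 3
s2 = rope(q, torch.tensor(105.0)) @ rope(k, torch.tensor(102.0))  # offset 3
print(torch.allclose(s1, s2, atol=1e-4))  # True: only m - n matters
```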
ex-sp-ch35-12
Hard: Implement the RLHF pipeline: SFT on a small dataset, then train a reward model on preference pairs, then run PPO against the reward model (use a toy 1M-parameter model).
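For the reward-model stage, the standard choice is the pairwise Bradley-Terry loss; a sketch:

```python
import torch.nn.functional as F

def reward_loss(r_preferred, r_rejected):
    # r_*: scalar rewards the model assigns to each response in a pair
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```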
ex-sp-ch35-13
Hard: Profile the forward pass of a GPT model and break down compute time by component: attention, FFN, layer norm, and embedding.
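One approach is torch.profiler, wrapping each component in record_function so the table groups by your labels; a sketch with a stand-in module in place of your GPT:

```python
import torch
from torch import nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
batch = torch.randn(64, 512)
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("ffn"):          # label regions: "attn", "ffn", "ln", ...
        model(batch)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```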
ex-sp-ch35-14
Hard: Use BERT to extract contextual embeddings for the word "channel" in different contexts (TV channel, water channel, communication channel). Show that the embeddings differ.
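A sketch with Hugging Face transformers; it assumes "channel" tokenizes to a single piece in bert-base-uncased, which it does:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
sents = ["I watched a TV channel.",
         "The water channel overflowed.",
         "This communication channel is noisy."]
embs = []
for s in sents:
    enc = tok(s, return_tensors="pt")
    # locate the token position of "channel" in this sentence
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("channel"))
    with torch.no_grad():
        embs.append(model(**enc).last_hidden_state[0, idx])
cos = torch.nn.functional.cosine_similarity
print(cos(embs[0], embs[1], dim=0), cos(embs[0], embs[2], dim=0))
```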
ex-sp-ch35-15
Challenge: Implement the FlashAttention-2 tiling algorithm in pure PyTorch. Benchmark memory and speed against standard attention for sequence lengths 1024, 4096, and 16384.
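The heart of the algorithm is the online softmax: stream over K/V blocks while carrying a running max, normalizer, and output accumulator. A non-causal, single-head sketch you can extend with query tiling and a causal mask:

```python
import torch

def tiled_attention(q, k, v, block=256):     # q: (Tq, d), k/v: (T, d)
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0],), float("-inf"))   # running row max
    l = torch.zeros(q.shape[0])                    # running normalizer
    acc = torch.zeros_like(q)                      # running output
    for s in range(0, k.shape[0], block):
        scores = (q @ k[s:s+block].T) * scale
        m_new = torch.maximum(m, scores.max(dim=-1).values)
        p = (scores - m_new[:, None]).exp()
        alpha = (m - m_new).exp()                  # rescale old statistics
        l = alpha * l + p.sum(dim=-1)
        acc = alpha[:, None] * acc + p @ v[s:s+block]
        m = m_new
    return acc / l[:, None]

q, k, v = torch.randn(8, 64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax(q @ k.T * 64**-0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5))  # True
```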
ex-sp-ch35-16
Challenge: Train a Chinchilla-optimal set of models at three different compute budgets (scaling $N$ and $D$ proportionally). Verify the power-law relationship between compute and test loss.
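Under the Chinchilla rule of thumb of roughly 20 tokens per parameter, both $N$ and $D$ grow as $\sqrt{C}$; a sketch for picking the budgets (the FLOP values are toy placeholders):

```python
# Compute-optimal sizing under C = 6 N D with D = 20 N (Chinchilla heuristic).
for C in [1e15, 1e16, 1e17]:        # toy compute budgets in FLOPs
    N = (C / (6 * 20)) ** 0.5       # solve 6 * N * (20 N) = C
    D = 20 * N
    print(f"C={C:.0e}: N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```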