Exercises

ex-sp-ch30-01

Easy

Implement scaled dot-product attention from scratch. Verify on Q, K, V of shape (1, 4, 8).
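
A minimal sketch to start from, assuming PyTorch; the function name and the optional mask argument are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Scores: (batch, seq, seq); scale by sqrt(d_k) to keep logits well-behaved.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

q, k, v = (torch.randn(1, 4, 8) for _ in range(3))
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([1, 4, 8])
print(w.sum(dim=-1))  # each row of attention weights sums to 1
```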

ex-sp-ch30-02

Easy

Compute sinusoidal positional encodings for d_model=64, max_len=100. Plot as a heatmap.
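
One possible implementation, following the sine/cosine interleaving from the original Transformer paper:

```python
import math
import torch
import matplotlib.pyplot as plt

def sinusoidal_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()
    # Frequencies decay geometrically across dimension pairs (base 10000).
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=100, d_model=64)
plt.imshow(pe, aspect="auto", cmap="RdBu")
plt.xlabel("dimension"); plt.ylabel("position"); plt.colorbar()
plt.show()
```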

ex-sp-ch30-03

Easy

Create a causal mask for sequence length 8 and verify masked positions get zero attention.
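
A sketch of the check, assuming PyTorch; masked logits are set to negative infinity so softmax assigns them exactly zero weight:

```python
import torch

seq_len = 8
# Lower-triangular mask: position i may attend only to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)

# Upper-triangular (future) positions should receive exactly zero attention.
assert torch.all(weights.masked_select(~mask) == 0)
print(weights)
```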

ex-sp-ch30-04

Easy

Use torch.nn.MultiheadAttention and compare its output to your from-scratch implementation.
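
One way to make the comparison exact is to reuse the module's own projection weights in your manual computation, as in this sketch (assumes bias=False so only the weight matrices matter):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq = 8, 2, 4
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True, bias=False)
x = torch.randn(1, seq, d_model)

# in_proj_weight stacks the q, k, v projections; split it back apart.
w_q, w_k, w_v = mha.in_proj_weight.chunk(3)

def split_heads(t):  # (1, seq, d_model) -> (1, n_heads, seq, d_head)
    return t.view(1, seq, n_heads, d_model // n_heads).transpose(1, 2)

q, k, v = split_heads(x @ w_q.T), split_heads(x @ w_k.T), split_heads(x @ w_v.T)
scores = q @ k.transpose(-2, -1) / (d_model // n_heads) ** 0.5
out = torch.softmax(scores, dim=-1) @ v
out = out.transpose(1, 2).reshape(1, seq, d_model) @ mha.out_proj.weight.T

ref, _ = mha(x, x, x)
print(torch.allclose(out, ref, atol=1e-5))  # should print True
```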

ex-sp-ch30-05

Easy

Count parameters in a transformer block with d_model=512, n_heads=8, d_ff=2048.
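
As a check on your hand count: with biases, the attention sublayer contributes 4(d_model² + d_model), the FFN contributes 2·d_model·d_ff + d_ff + d_model, and two LayerNorms add 4·d_model, for a total of 3,152,384 at these sizes. PyTorch's nn.TransformerEncoderLayer matches this block structure, so it can serve as a reference:

```python
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
total = sum(p.numel() for p in block.parameters())
print(f"{total:,}")  # 3,152,384 (attention + FFN + two LayerNorms, with biases)
```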

ex-sp-ch30-06

Medium

Implement multi-head attention from scratch with h=4 heads and d_model=64.
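
A skeleton to build on, assuming PyTorch; the class and attribute names are illustrative:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, h=4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, n, _ = x.shape
        # Project, then reshape to (batch, heads, seq, d_head).
        def split(t):
            return t.view(b, n, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads and apply the output projection.
        return self.w_o(out.transpose(1, 2).reshape(b, n, self.h * self.d_head))

x = torch.randn(2, 10, 64)
print(MultiHeadAttention()(x).shape)  # torch.Size([2, 10, 64])
```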

ex-sp-ch30-07

Medium

Build a 4-layer transformer encoder and train it for sequence classification.
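
A possible model skeleton using PyTorch's built-in encoder stack; vocabulary size, widths, and mean-pooling are illustrative choices, and positional encoding plus the training loop are left to you:

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    def __init__(self, vocab_size, n_classes, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.head(h.mean(dim=1))  # mean-pool over the sequence

model = EncoderClassifier(vocab_size=1000, n_classes=2)
logits = model(torch.randint(0, 1000, (8, 32)))
print(logits.shape)  # torch.Size([8, 2])
```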

ex-sp-ch30-08

Medium

Implement a Vision Transformer (ViT) for CIFAR-10 (32×32 images, patch_size=4). Train for 20 epochs.
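
The patch-embedding step is the only ViT-specific piece; here is one common way to do it, using a strided convolution (equivalent to unfold-then-linear). The width of 192 is an illustrative choice; the class token, positional embeddings, encoder stack, and training loop are the rest of the exercise:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a 32x32 image into 4x4 patches and project each to d_model."""
    def __init__(self, d_model=192, patch=4, in_ch=3):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                    # (B, 3, 32, 32)
        x = self.proj(x)                     # (B, d_model, 8, 8)
        return x.flatten(2).transpose(1, 2)  # (B, 64, d_model): 64 patch tokens

tokens = PatchEmbed()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 192])
```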

ex-sp-ch30-09

Medium

Visualise attention weights for each head and interpret what patterns emerge.
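
A sketch of extracting per-head weights, assuming a recent PyTorch (the average_attn_weights flag was added in 1.11); with untrained random weights the maps will be near-uniform, so run this on your trained model:

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 10, 64)
# average_attn_weights=False keeps per-head maps: (1, n_heads, seq, seq).
_, attn = mha(x, x, x, need_weights=True, average_attn_weights=False)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for h, ax in enumerate(axes):
    ax.imshow(attn[0, h].detach(), cmap="viridis")
    ax.set_title(f"head {h}")
plt.show()
```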

ex-sp-ch30-10

Medium

Compare transformer vs LSTM on a sequence classification task. Report accuracy and speed.
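
For the speed half of the comparison, a simple forward-pass timing harness like this sketch is a reasonable start (sizes are illustrative; for GPU timing you would also need torch.cuda.synchronize):

```python
import time
import torch
import torch.nn as nn

d_model, seq, batch = 128, 64, 32
x = torch.randn(batch, seq, d_model)

lstm = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
tf = nn.TransformerEncoder(enc_layer, num_layers=2)

def time_forward(model, n_iter=50):
    with torch.no_grad():
        t0 = time.perf_counter()
        for _ in range(n_iter):
            model(x)
    return (time.perf_counter() - t0) / n_iter

print(f"LSTM:        {time_forward(lstm) * 1e3:.2f} ms/forward")
print(f"Transformer: {time_forward(tf) * 1e3:.2f} ms/forward")
```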

ex-sp-ch30-11

Hard

Implement a transformer decoder with causal masking and cross-attention.
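
A sketch of one decoder block built from nn.MultiheadAttention (post-norm residuals; in PyTorch's boolean attn_mask convention, True marks positions that may NOT be attended):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory):
        # Causal self-attention: each target position sees only the past.
        n = tgt.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        x = self.n1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=causal)[0])
        # Cross-attention: queries from the decoder, keys/values from the encoder.
        x = self.n2(x + self.cross_attn(x, memory, memory)[0])
        return self.n3(x + self.ff(x))

tgt, mem = torch.randn(2, 6, 64), torch.randn(2, 10, 64)
print(DecoderBlock()(tgt, mem).shape)  # torch.Size([2, 6, 64])
```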

ex-sp-ch30-12

Hard

Implement a KV cache for efficient autoregressive generation. Measure the speedup.
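
The core idea in a sketch: keys and values for already-generated positions are stored and reused, so each step attends over the full prefix while computing projections only for the new token (projections are omitted here for brevity):

```python
import math
import torch

def attend(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

d = 64
k_cache = torch.empty(1, 0, d)
v_cache = torch.empty(1, 0, d)

for step in range(16):
    x = torch.randn(1, 1, d)  # stand-in for the new token's q, k, v
    q, k, v = x, x, x
    # Append this step's k/v instead of recomputing the whole prefix.
    k_cache = torch.cat([k_cache, k], dim=1)
    v_cache = torch.cat([v_cache, v], dim=1)
    out = attend(q, k_cache, v_cache)  # attends over all cached positions

print(k_cache.shape)  # torch.Size([1, 16, 64])
```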

ex-sp-ch30-13

Hard

Implement learned positional embeddings and compare them to sinusoidal encodings on a task of your choice.
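
The learned variant is a one-liner around nn.Embedding; this sketch is a drop-in replacement for the sinusoidal table from ex-sp-ch30-02:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len=512, d_model=64):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)  # one trainable vector per position

    def forward(self, x):  # x: (batch, seq, d_model)
        idx = torch.arange(x.size(1), device=x.device)
        return x + self.pos(idx)  # broadcast over the batch dimension

x = torch.randn(2, 10, 64)
print(LearnedPositionalEmbedding()(x).shape)  # torch.Size([2, 10, 64])
```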

ex-sp-ch30-14

Hard

Implement rotary position embeddings (RoPE) and integrate with multi-head attention.
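
A sketch of the "rotate-half" RoPE convention (dimension i is paired with dimension i + d/2 and rotated by a position-dependent angle); in the full exercise, apply this to queries and keys just before the score computation:

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq, d)."""
    b, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half).float() / half)         # theta_i
    angles = torch.arange(n).float()[:, None] * freqs[None, :]   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) pair by its angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 64)
print(rope(q).shape)  # torch.Size([1, 8, 64])
```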

ex-sp-ch30-15

Challenge

Apply a transformer to OFDM channel estimation: tokens = subcarriers, features = pilot measurements.
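
A shapes-only sketch of the setup, heavily simplified: the two input features per subcarrier are assumed to be the real and imaginary parts of the pilot measurement, and the head predicts the complex channel response per subcarrier. The sizes and synthetic data are placeholders, not a standard configuration:

```python
import torch
import torch.nn as nn

n_subcarriers, d_model = 64, 128
embed = nn.Linear(2, d_model)          # (real, imag) of pilot per subcarrier token
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)
head = nn.Linear(d_model, 2)           # predict (real, imag) channel response

pilots = torch.randn(8, n_subcarriers, 2)  # synthetic batch of pilot measurements
h_hat = head(encoder(embed(pilots)))
print(h_hat.shape)  # torch.Size([8, 64, 2])
```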

ex-sp-ch30-16

Challenge

Implement FlashAttention-2 (tiled, IO-aware attention) and compare its memory usage to standard attention.
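
The real FlashAttention-2 is a fused GPU kernel, but the core tiling and online-softmax idea can be emulated in plain PyTorch to build intuition before tackling the kernel. This sketch processes keys/values in blocks, maintaining a running max and denominator so the full n×n score matrix is never materialised:

```python
import math
import torch

def tiled_attention(q, k, v, block=64):
    """Online-softmax attention over key/value tiles (FlashAttention's core
    idea, emulated in PyTorch; the real kernel fuses this in on-chip SRAM)."""
    b, n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)
    m = torch.full((b, n, 1), float("-inf"))  # running row max
    l = torch.zeros(b, n, 1)                  # running softmax denominator
    for s in range(0, n, block):
        kj, vj = k[:, s:s + block], v[:, s:s + block]
        scores = q @ kj.transpose(-2, -1) * scale             # (b, n, block)
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)
        alpha = torch.exp(m - m_new)          # rescale previously accumulated stats
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        out = out * alpha + p @ vj
        m = m_new
    return out / l

q, k, v = (torch.randn(1, 256, 64) for _ in range(3))
ref = torch.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5))  # True
```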