Exercises

ex-sp-ch13-01

Easy

Write a function that prints the current GPU memory allocated, reserved, and peak allocated. Then create a 1000x1000 float32 tensor on GPU, check the memory again, delete the tensor, and check once more. Explain why reserved memory may not decrease.
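A minimal starting sketch, assuming a CUDA device is available; the helper name report_gpu_memory is illustrative:

```python
import torch

def report_gpu_memory(tag=""):
    # All three counters are reported in MiB for readability.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"{tag:>12}: allocated={alloc:8.2f} MiB  "
          f"reserved={reserved:8.2f} MiB  peak={peak:8.2f} MiB")

report_gpu_memory("baseline")
x = torch.zeros(1000, 1000, dtype=torch.float32, device="cuda")  # ~3.8 MiB
report_gpu_memory("after alloc")
del x                           # returns the block to PyTorch's caching allocator
report_gpu_memory("after del")  # reserved typically stays up: the allocator keeps the pool
```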

ex-sp-ch13-02

Easy

Use torch.bmm to compute the product of 200 pairs of $8 \times 8$ complex matrices. Compare the result with a Python loop.
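A sketch of the comparison, with sizes taken from the exercise; torch.bmm accepts complex dtypes directly:

```python
import torch

B, N = 200, 8
A = torch.randn(B, N, N, dtype=torch.complex64)
C = torch.randn(B, N, N, dtype=torch.complex64)

batched = torch.bmm(A, C)                               # single fused batched matmul
looped = torch.stack([A[i] @ C[i] for i in range(B)])   # reference Python loop

print(torch.allclose(batched, looped, atol=1e-5))       # True up to rounding
```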

ex-sp-ch13-03

Easy

Create a tensor of value 300.0 in FP16, FP32, and BF16. Square each and report which format overflows. What is the maximum representable value in each format?
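A compact way to run the check; torch.finfo reports the largest representable value of each dtype (65504 for FP16, roughly 3.4e38 for FP32 and BF16), so only FP16 should overflow:

```python
import torch

for dtype in (torch.float16, torch.float32, torch.bfloat16):
    x = torch.tensor(300.0, dtype=dtype)
    sq = x * x                                  # 90000: above the FP16 max, becomes inf in FP16
    print(dtype, "square =", sq.item(), "max =", torch.finfo(dtype).max)
```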

ex-sp-ch13-04

Easy

Create a DataLoader with num_workers=0 and num_workers=4 for a synthetic dataset of 10000 samples. Time iterating through the full dataset and report the speedup.
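A timing harness sketch; the feature dimension (1024) and batch size (64) are arbitrary choices. Note that for a small, fully in-memory dataset the worker overhead can outweigh the benefit, so the measured speedup may even be below 1.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset of 10,000 samples held in memory.
data = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))

for workers in (0, 4):
    loader = DataLoader(data, batch_size=64, num_workers=workers)
    start = time.perf_counter()
    for xb, yb in loader:
        pass                                    # iterate only; no model work
    print(f"num_workers={workers}: {time.perf_counter() - start:.3f} s")
```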

ex-sp-ch13-05

Easy

Write a training loop that accumulates gradients over 4 mini-batches before calling optimizer.step(). This simulates a 4x larger effective batch size using the same GPU memory.
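One possible shape for the loop; the model, synthetic data, and batch size are placeholders:

```python
import torch

model = torch.nn.Linear(128, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4

# Synthetic stream of mini-batches (batch size 32 is arbitrary).
batches = [(torch.randn(32, 128), torch.randn(32, 1)) for _ in range(16)]

opt.zero_grad()
for step, (xb, yb) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    (loss / accum_steps).backward()   # scale so the accumulated grads average over 4 batches
    if (step + 1) % accum_steps == 0:
        opt.step()                    # one update per 4 mini-batches: 4x effective batch size
        opt.zero_grad()
```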

ex-sp-ch13-06

Medium

Implement gradient checkpointing for a 20-layer residual network. Measure peak GPU memory with and without checkpointing at batch size 128 and hidden dimension 512. Report the memory savings and the time overhead.
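A skeleton of the memory measurement, assuming a CUDA device (the timing comparison follows the same pattern); torch.utils.checkpoint.checkpoint recomputes each block's activations during the backward pass instead of storing them:

```python
import torch
from torch.utils.checkpoint import checkpoint

class ResBlock(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                       torch.nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.net(x)

def peak_mem(use_ckpt, depth=20, dim=512, batch=128):
    torch.cuda.reset_peak_memory_stats()
    blocks = torch.nn.ModuleList(ResBlock(dim) for _ in range(depth)).cuda()
    h = torch.randn(batch, dim, device="cuda", requires_grad=True)
    for blk in blocks:
        h = checkpoint(blk, h, use_reentrant=False) if use_ckpt else blk(h)
    h.sum().backward()
    return torch.cuda.max_memory_allocated() / 2**20

print("no checkpointing:", peak_mem(False), "MiB peak")
print("checkpointing   :", peak_mem(True), "MiB peak")
```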

ex-sp-ch13-07

Medium

Use torch.einsum to implement batched MIMO detection: given $B$ channel matrices $\mathbf{H} \in \mathbb{C}^{B \times N_r \times N_t}$ and received vectors $\mathbf{y} \in \mathbb{C}^{B \times N_r}$, compute the matched filter output $\hat{\mathbf{x}} = \mathbf{H}^H \mathbf{y}$.
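A sketch of the einsum form; the batch and antenna sizes below are assumptions, and the explicit batched matmul serves as a cross-check:

```python
import torch

B, Nr, Nt = 1000, 16, 4
H = torch.randn(B, Nr, Nt, dtype=torch.complex64)
y = torch.randn(B, Nr, dtype=torch.complex64)

# Matched filter x_hat[b] = H[b]^H @ y[b]: conjugate H and contract over the receive dimension.
x_hat = torch.einsum("brt,br->bt", H.conj(), y)

# Cross-check against an explicit batched matmul.
ref = (H.conj().transpose(-2, -1) @ y.unsqueeze(-1)).squeeze(-1)
print(torch.allclose(x_hat, ref, atol=1e-5))
```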

ex-sp-ch13-08

Medium

Implement a complete AMP training loop with BFloat16 for a 3-layer MLP. Compare training loss convergence (100 steps) between FP32 and BF16 to verify they match.
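A minimal sketch of the comparison, assuming a CUDA device; the MLP sizes and synthetic regression task are placeholders. With BF16 autocast no GradScaler is needed:

```python
import torch

def train(use_bf16, steps=100, seed=0):
    torch.manual_seed(seed)                       # same init and data stream for both runs
    model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(),
                                torch.nn.Linear(128, 128), torch.nn.ReLU(),
                                torch.nn.Linear(128, 1)).cuda()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    losses = []
    for _ in range(steps):
        x = torch.randn(256, 64, device="cuda")
        y = x.sum(dim=1, keepdim=True)            # simple regression target
        with torch.autocast("cuda", dtype=torch.bfloat16, enabled=use_bf16):
            loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                           # no GradScaler required for BF16
        opt.step()
        losses.append(loss.item())
    return losses

fp32, bf16 = train(False), train(True)
print("final losses:", fp32[-1], bf16[-1])
```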

ex-sp-ch13-09

Medium

Implement a custom IterableDataset that generates batches of random channel matrices on-the-fly (streaming). Show that it uses constant memory regardless of total samples generated.
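A possible skeleton; the antenna dimensions are assumptions. Because each sample is generated inside __iter__ and discarded after use, memory stays constant regardless of num_samples:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ChannelStream(IterableDataset):
    """Yields random complex channel matrices on the fly; nothing is stored."""
    def __init__(self, num_samples, nr=16, nt=4):
        self.num_samples, self.nr, self.nt = num_samples, nr, nt

    def __iter__(self):
        for _ in range(self.num_samples):
            # CN(0, 1) entries built from two real Gaussians.
            yield torch.complex(torch.randn(self.nr, self.nt),
                                torch.randn(self.nr, self.nt)) / 2 ** 0.5

loader = DataLoader(ChannelStream(1_000_000), batch_size=256)
for i, batch in enumerate(loader):
    if i == 10:                                   # memory stays flat however long this runs
        break
    print(batch.shape)
```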

ex-sp-ch13-10

Medium

Write a memory-efficient attention computation using gradient checkpointing. Compare memory usage for sequence length 1024 with and without checkpointing.
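One way to set up the measurement, assuming a CUDA device; the naive attention below materializes the full L×L score matrix, which is exactly what checkpointing avoids keeping for the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

def attention(q, k, v):
    # Naive attention: the (L x L) score matrix is the memory hot spot.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def peak_mem(use_ckpt, L=1024, d=64, batch=8):
    torch.cuda.reset_peak_memory_stats()
    q, k, v = (torch.randn(batch, L, d, device="cuda", requires_grad=True)
               for _ in range(3))
    out = checkpoint(attention, q, k, v, use_reentrant=False) if use_ckpt \
          else attention(q, k, v)
    out.sum().backward()
    return torch.cuda.max_memory_allocated() / 2**20

print("no checkpointing:", peak_mem(False), "MiB")
print("checkpointing   :", peak_mem(True), "MiB")
```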

ex-sp-ch13-11

Hard

Implement a custom CUDA memory allocator that logs every allocation and free event. Use it to identify memory leaks in a training loop where tensors are accidentally kept alive.

ex-sp-ch13-12

Hard

Implement a batched MMSE MIMO detector: $\hat{\mathbf{x}} = (\mathbf{H}^H\mathbf{H} + \sigma^2\mathbf{I})^{-1}\mathbf{H}^H\mathbf{y}$ for $B = 1000$ channel realizations with $N_r = 16$, $N_t = 4$. Use batched Cholesky solve and measure throughput in channels/second.
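A sketch using torch.linalg.cholesky and torch.cholesky_solve; the noise variance of 0.1 is an arbitrary choice, and moving the tensors to the GPU is the obvious next step:

```python
import time
import torch

B, Nr, Nt, sigma2 = 1000, 16, 4, 0.1
H = torch.complex(torch.randn(B, Nr, Nt), torch.randn(B, Nr, Nt)) / 2 ** 0.5
y = torch.complex(torch.randn(B, Nr), torch.randn(B, Nr)).unsqueeze(-1)

start = time.perf_counter()
Hh = H.conj().transpose(-2, -1)                     # (B, Nt, Nr)
A = Hh @ H + sigma2 * torch.eye(Nt, dtype=H.dtype)  # (B, Nt, Nt), Hermitian positive definite
L = torch.linalg.cholesky(A)                        # batched Cholesky factor
x_hat = torch.cholesky_solve(Hh @ y, L)             # solves A x = H^H y for every realization
elapsed = time.perf_counter() - start
print(f"{B / elapsed:.0f} channels/s, x_hat shape = {tuple(x_hat.shape)}")
```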

ex-sp-ch13-13

Hard

Write a DDP training script that trains a simple model on 2 GPUs (simulated with torch.multiprocessing.spawn). Verify that gradients are identical across GPUs after each step.
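A CPU-only sketch using the gloo backend so it runs without two physical GPUs (an assumption; swap in nccl and device_ids for real GPUs). The port 29500 is arbitrary:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    torch.manual_seed(0)                          # identical model init on every rank
    model = DDP(torch.nn.Linear(8, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(32, 8) + rank                 # different data per rank
    loss = model(x).pow(2).mean()
    loss.backward()                               # DDP all-reduces (averages) gradients here
    grad = model.module.weight.grad.clone()
    gathered = [torch.zeros_like(grad) for _ in range(world_size)]
    dist.all_gather(gathered, grad)               # collect every rank's gradient
    if rank == 0:
        print("grads identical:", all(torch.allclose(g, gathered[0]) for g in gathered))
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```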

ex-sp-ch13-14

Challenge

Implement a complete mixed-precision training pipeline for a MIMO autoencoder: the encoder maps $(N_t, 1)$ symbols to $(N_t, 1)$ transmitted signals, the channel applies $\mathbf{H}\mathbf{x} + \mathbf{n}$, and the decoder estimates the original symbols. Use BF16 autocast, gradient checkpointing on the decoder, and batched channel application. Benchmark FP32 vs. mixed precision in throughput and final loss.
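A condensed sketch of the mixed-precision path only (the FP32 baseline and the throughput timing follow the same pattern with autocast disabled); all sizes, the noise level, and the real-valued stacking of complex symbols are assumptions:

```python
import torch
from torch.utils.checkpoint import checkpoint

B, Nt, Nr = 256, 4, 16
enc = torch.nn.Sequential(torch.nn.Linear(2 * Nt, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 2 * Nt)).cuda()
dec = torch.nn.Sequential(torch.nn.Linear(2 * Nr, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 2 * Nt)).cuda()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def to_complex(t):
    # First half of the last dim holds real parts, second half imaginary parts.
    re, im = t.chunk(2, dim=-1)
    return torch.complex(re.float(), im.float())

for step in range(100):
    s = torch.randn(B, 2 * Nt, device="cuda")                  # real view of Nt complex symbols
    H = torch.complex(torch.randn(B, Nr, Nt, device="cuda"),
                      torch.randn(B, Nr, Nt, device="cuda")) / 2 ** 0.5
    n = 0.05 * torch.randn(B, Nr, 1, dtype=torch.complex64, device="cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):
        x = to_complex(enc(s)).unsqueeze(-1)                   # (B, Nt, 1) transmitted signal
        y = H @ x + n                                          # batched channel application
        y_real = torch.cat([y.real, y.imag], dim=1).squeeze(-1)
        s_hat = checkpoint(dec, y_real, use_reentrant=False)   # checkpointed decoder
        loss = torch.nn.functional.mse_loss(s_hat.float(), s)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```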

ex-sp-ch13-15

Challenge

Build a performance profiling dashboard that measures and visualizes: (a) GPU memory timeline, (b) kernel execution time breakdown, (c) data loading vs compute ratio, and (d) arithmetic intensity. Apply it to a batched MIMO simulation pipeline and identify the bottleneck.
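torch.profiler covers the kernel-time breakdown and, via the exported Chrome trace, the memory and execution timeline; the MIMO workload below is a placeholder, and the data-loading ratio and arithmetic-intensity panels would be built on top of the same trace data:

```python
import torch
from torch.profiler import profile, ProfilerActivity

B, Nr, Nt = 4096, 16, 4
H = (torch.complex(torch.randn(B, Nr, Nt), torch.randn(B, Nr, Nt)) / 2 ** 0.5).cuda()
y = torch.complex(torch.randn(B, Nr, 1), torch.randn(B, Nr, 1)).cuda()
eye = torch.eye(Nt, dtype=H.dtype, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    for _ in range(20):
        Hh = H.conj().transpose(-2, -1)
        x_hat = torch.linalg.solve(Hh @ H + 0.1 * eye, Hh @ y)   # batched MMSE-style solve
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("mimo_profile.json")   # open in chrome://tracing or Perfetto
```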