Exercises
ex-sp-ch26-01
Easy: Create an nn.Module that implements a single linear layer
without using nn.Linear. Use nn.Parameter for the weight W and the bias b.
Initialize W with shape (out_features, in_features) and b with shape (out_features,).
Implementation
class MyLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        return x @ self.W.T + self.b
ex-sp-ch26-02
Easy: Count the total number of parameters in an nn.Sequential model
with layers [Linear(100, 256), ReLU, Linear(256, 128), ReLU, Linear(128, 10)].
Verify with sum(p.numel() for p in model.parameters()).
Each Linear(m, n) has m*n + n parameters.
Calculation
Layer 1: 100*256 + 256 = 25,856
Layer 2: 256*128 + 128 = 32,896
Layer 3: 128*10 + 10 = 1,290
Total: 60,042
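A quick check of the hand count against PyTorch itself (a minimal sketch using the layer sizes given in the exercise):
import torch.nn as nn
model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                      nn.Linear(256, 128), nn.ReLU(),
                      nn.Linear(128, 10))
print(sum(p.numel() for p in model.parameters()))  # 60042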
ex-sp-ch26-03
Easy: Write a training loop for a linear regression model on synthetic data where y = 3x + 2 + noise. Train for 200 steps and verify that the learned weight is close to 3 and the bias close to 2.
Use nn.Linear(1, 1), MSELoss, and SGD optimizer.
Implementation
torch.manual_seed(42)
x = torch.randn(200, 1)
y = 3 * x + 2 + 0.1 * torch.randn(200, 1)
model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    nn.MSELoss()(model(x), y).backward()
    opt.step()
print(model.weight.item(), model.bias.item())  # should be close to 3.0 and 2.0
ex-sp-ch26-04
Easy: Implement the overfit-one-batch test for a 3-layer MLP on random classification data with 5 classes. Verify the training loss reaches near zero within 500 steps.
Generate random data: x = torch.randn(32, 20), y = torch.randint(0, 5, (32,))
Implementation
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU(),
nn.Linear(64, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
for step in range(500):
    opt.zero_grad()
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    opt.step()
print(f"Final loss: {loss.item():.6f}") # should be ~0
ex-sp-ch26-05
Easy: Use a forward hook to record the output of the first hidden layer of an MLP during a forward pass. Print the mean and std of the recorded activations.
Use model.net[0].register_forward_hook(hook_fn)
Implementation
recorded = {}
def hook(module, inp, out):
    recorded['act'] = out.detach()

# assumes an MLP whose first hidden layer is model.net[0], e.g. nn.Linear(784, 256)
model.net[0].register_forward_hook(hook)
model(torch.randn(16, 784))
print(recorded['act'].mean().item(), recorded['act'].std().item())
ex-sp-ch26-06
Medium: Implement a Dataset and DataLoader for the function y = sin(2πx) + noise
with 10000 samples, batch size 64,
and an 80/20 train/validation split using random_split.
Use torch.utils.data.random_split
Implementation
from torch.utils.data import Dataset, DataLoader, random_split
class SineDataset(Dataset):
    def __init__(self, n=10000):
        self.x = torch.rand(n, 1)
        self.y = torch.sin(2 * torch.pi * self.x) + 0.1 * torch.randn(n, 1)
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]
ds = SineDataset()
train_ds, val_ds = random_split(ds, [8000, 2000])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256)
ex-sp-ch26-07
Medium: Compare SGD (with momentum 0.9), Adam, and AdamW on a 4-layer MLP for the sine regression task. Plot training loss curves for all three.
Run 3 separate training loops with the same initial model weights.
Key insight
Use copy.deepcopy(model) to ensure each optimizer starts from
identical weights. Run each for 100 epochs and record losses.
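One possible sketch (assuming the train_loader from the SineDataset exercise; the network width and learning rates are illustrative choices, not prescribed by the exercise):
import copy
import torch, torch.nn as nn

base = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(),
                     nn.Linear(64, 1))

def make_opt(name, params):
    if name == 'SGD':
        return torch.optim.SGD(params, lr=1e-2, momentum=0.9)
    if name == 'Adam':
        return torch.optim.Adam(params, lr=1e-3)
    return torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)

histories = {}
for name in ['SGD', 'Adam', 'AdamW']:
    model = copy.deepcopy(base)              # identical starting weights for every optimizer
    opt = make_opt(name, model.parameters())
    losses = []
    for epoch in range(100):
        for x, y in train_loader:            # train_loader from the SineDataset exercise
            opt.zero_grad()
            loss = nn.MSELoss()(model(x), y)
            loss.backward()
            opt.step()
        losses.append(loss.item())
    histories[name] = losses                 # plot histories['SGD'] etc. as loss curves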
ex-sp-ch26-08
Medium: Implement cosine annealing with warmup: linearly increase the LR from 0
to lr_max over the first 10 epochs, then cosine decay to lr_min
over the remaining epochs.
Use torch.optim.lr_scheduler.LambdaLR with a custom lambda.
Implementation
import math
def warmup_cosine(epoch, warmup=10, total=100, lr_min=1e-6, lr_max=1e-3):
    if epoch < warmup:
        return epoch / warmup  # linear warmup: the LR factor rises from 0 to 1
    progress = (epoch - warmup) / (total - warmup)
    # cosine decay of the factor from 1 down to lr_min/lr_max
    return lr_min/lr_max + (1 - lr_min/lr_max) * 0.5 * (1 + math.cos(math.pi * progress))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
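Usage sketch (LambdaLR multiplies the optimizer's base lr by the returned factor, so the base lr plays the role of lr_max; model stands for whatever module is being trained):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # base lr acts as lr_max
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
for epoch in range(100):
    # ... one epoch of training ...
    scheduler.step()                   # advance the schedule once per epoch
    print(epoch, scheduler.get_last_lr()[0])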
ex-sp-ch26-09
Medium: Implement a training loop with gradient clipping and gradient norm logging. Print a warning if any gradient norm exceeds 10.0.
Use torch.nn.utils.clip_grad_norm_ and check returned total norm.
Implementation
for x, y in loader:
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # clip to max norm 1.0; the returned value is the total norm *before* clipping
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    if total_norm > 10.0:
        print(f"WARNING: grad norm = {total_norm:.2f}")
    opt.step()
ex-sp-ch26-10
Medium: Implement early stopping: stop training when validation loss has not
improved for patience=10 consecutive epochs. Save the best model.
Track best_val_loss and epochs_without_improvement.
Implementation
best_loss = float('inf')
patience_counter = 0
for epoch in range(500):
    train_one_epoch()
    val_loss = evaluate()
    if val_loss < best_loss:
        best_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= 10:
            print(f"Early stopping at epoch {epoch}")
            break
model.load_state_dict(torch.load('best_model.pt'))
ex-sp-ch26-11
Hard: Implement mixed-precision training using torch.amp (automatic
mixed precision). Compare training speed and memory usage with
and without AMP on a 10-layer MLP with width 1024.
Use torch.amp.autocast and torch.amp.GradScaler.
Implementation
scaler = torch.amp.GradScaler('cuda')
for x, y in loader:
    opt.zero_grad()
    with torch.amp.autocast('cuda'):
        loss = loss_fn(model(x.cuda()), y.cuda())
    scaler.scale(loss).backward()   # scale the loss so fp16 gradients do not underflow
    scaler.step(opt)                # unscales gradients, skips the step if inf/nan appears
    scaler.update()
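To compare the two configurations, one option is wall-clock timing plus torch.cuda.max_memory_allocated (a sketch; run it once with AMP and once in plain fp32):
import time
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
# ... run one full training epoch here (AMP or fp32) ...
torch.cuda.synchronize()
print(f"time: {time.time() - start:.2f} s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")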
ex-sp-ch26-12
Hard: Build a modular training framework with a Trainer class that
accepts a model, optimizer, loss function, and callbacks (e.g.,
logging, checkpointing, early stopping) as constructor arguments.
Define a Callback base class with on_epoch_end, on_batch_end methods.
Key structure
class Callback:
    def on_epoch_end(self, trainer, metrics): pass
    def on_batch_end(self, trainer, loss): pass

class Trainer:
    def __init__(self, model, opt, loss_fn, callbacks=None):
        self.model, self.opt, self.loss_fn = model, opt, loss_fn
        self.callbacks = callbacks or []
    def fit(self, train_loader, val_loader, epochs):
        for epoch in range(epochs):
            self.model.train()
            for x, y in train_loader:
                self.opt.zero_grad()
                loss = self.loss_fn(self.model(x), y)
                loss.backward()
                self.opt.step()
                for cb in self.callbacks:
                    cb.on_batch_end(self, loss)
            metrics = self.validate(val_loader)  # validation pass (not shown)
            for cb in self.callbacks:
                cb.on_epoch_end(self, metrics)
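A hypothetical logging callback and one way the Trainer might be invoked (the names and hyperparameters here are illustrative, not part of the exercise):
class PrintLossCallback(Callback):
    def on_epoch_end(self, trainer, metrics):
        print(f"epoch done, val metrics: {metrics}")

trainer = Trainer(model, torch.optim.Adam(model.parameters(), lr=1e-3),
                  nn.MSELoss(), callbacks=[PrintLossCallback()])
trainer.fit(train_loader, val_loader, epochs=50)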
ex-sp-ch26-13
Hard: Implement gradient accumulation to simulate a batch size of 512 using actual batches of 32 (accumulate over 16 steps before updating).
Call optimizer.step() every accumulation_steps batches.
Implementation
accum_steps = 16
opt.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
ex-sp-ch26-14
Hard: Implement a custom autograd function for a "straight-through estimator" that applies hard thresholding in the forward pass but passes gradients through as if it were the identity in the backward pass.
Subclass torch.autograd.Function and implement forward/backward.
Implementation
class StraightThrough(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, threshold=0.5):
        return (x > threshold).float()  # hard threshold in the forward pass
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None        # identity gradient for x, no gradient for threshold
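Usage sketch (custom autograd Functions are invoked through .apply):
x = torch.randn(5, requires_grad=True)
y = StraightThrough.apply(x, 0.5)
y.sum().backward()
print(x.grad)  # all ones: the gradient passes straight through the hard threshold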
ex-sp-ch26-15
Challenge: Implement a neural network that learns the BPSK BER curve BER(Eb/N0) = Q(√(2·Eb/N0)) from simulated data. Generate 10000 (Eb/N0, BER) pairs via Monte Carlo, train an MLP, and compare the learned function to the analytical curve on a log scale.
Use log-scale inputs and outputs for better numerical conditioning.
Approach
Map Eb/N0 (dB) to log10(BER). Train with MSE loss on the log-domain targets. This gives a smooth function that an MLP can learn well.
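A possible sketch (the SNR range, bits per Monte Carlo point, network size, and BER clamp are illustrative assumptions):
import torch, torch.nn as nn

def simulate_ber(ebn0_db, n_bits=20_000):
    # Monte Carlo BPSK over AWGN: symbols are +/-1, noise std derived from Eb/N0
    ebn0 = 10 ** (ebn0_db / 10)
    bits = torch.randint(0, 2, (n_bits,))
    symbols = 2.0 * bits - 1.0
    noise = torch.randn(n_bits) * (1.0 / (2 * ebn0)) ** 0.5
    decisions = (symbols + noise > 0).long()
    return (decisions != bits).float().mean().item()

# dataset: Eb/N0 in dB -> log10(BER); clamp so log10 never sees zero
ebn0_db = torch.rand(10_000, 1) * 10                      # 0 to 10 dB (illustrative range)
ber = torch.tensor([[max(simulate_ber(v.item()), 1e-6)] for v in ebn0_db])
x, y = ebn0_db, torch.log10(ber)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.MSELoss()(model(x), y)
    loss.backward()
    opt.step()
# compare 10**model(x) to the analytical Q(sqrt(2*Eb/N0)) curve on a semilog plot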
ex-sp-ch26-16
Challenge: Implement data-parallel training using torch.nn.DataParallel or
torch.nn.parallel.DistributedDataParallel. Measure the speedup
(or slowdown) with 1 vs 2 GPUs on a large MLP.
For DDP, launch with torchrun (torch.distributed.launch is deprecated).
DataParallel approach
model = nn.DataParallel(model) # wraps model for multi-GPU
# Training loop remains the same
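For the DDP path, a minimal single-node sketch (assumed to be saved as train_ddp.py and launched with torchrun --nproc_per_node=2 train_ddp.py; the layer and batch sizes are illustrative):
import os
import torch, torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")              # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                          nn.Linear(4096, 4096), nn.ReLU(),
                          nn.Linear(4096, 10)).to(device)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(256, 1024, device=device)
        y = torch.randint(0, 10, (256,), device=device)
        opt.zero_grad()
        loss = nn.CrossEntropyLoss()(model(x), y)
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()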