Exercises
ex-sp-ch26-01
Easy: Create an nn.Module that implements a single linear layer
without using nn.Linear. Use nn.Parameter for the weight W and the bias b.
Initialize W with shape (out_features, in_features) and b with shape (out_features,).
Implementation
class MyLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        return x @ self.W.T + self.b
ex-sp-ch26-02
Easy: Count the total number of parameters in an nn.Sequential model
with layers [Linear(100, 256), ReLU, Linear(256, 128), ReLU, Linear(128, 10)].
Verify with sum(p.numel() for p in model.parameters()).
Each Linear(m, n) has m*n + n parameters.
Calculation
Layer 1: 100*256 + 256 = 25,856
Layer 2: 256*128 + 128 = 32,896
Layer 3: 128*10 + 10 = 1,290
Total: 60,042
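A quick check of the hand count against PyTorch itself (a minimal sketch using the layer sizes given in the exercise):
import torch.nn as nn
model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                      nn.Linear(256, 128), nn.ReLU(),
                      nn.Linear(128, 10))
print(sum(p.numel() for p in model.parameters()))  # 60042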
ex-sp-ch26-03
Easy: Write a training loop for a linear regression model on synthetic data where y = 3x + 2 + noise. Train for 200 steps and verify that the learned weight is close to 3 and the bias close to 2.
Use nn.Linear(1, 1), MSELoss, and SGD optimizer.
Implementation
torch.manual_seed(42)
x = torch.randn(200, 1)
y = 3 * x + 2 + 0.1 * torch.randn(200, 1)
model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    nn.MSELoss()(model(x), y).backward()
    opt.step()
print(model.weight.item(), model.bias.item())  # should be close to 3.0 and 2.0
ex-sp-ch26-04
Easy: Implement the overfit-one-batch test for a 3-layer MLP on random classification data with 5 classes. Verify the training loss reaches near zero within 500 steps.
Generate random data: x = torch.randn(32, 20), y = torch.randint(0, 5, (32,))
Implementation
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU(),
nn.Linear(64, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
for step in range(500):
    opt.zero_grad()
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    opt.step()
print(f"Final loss: {loss.item():.6f}") # should be ~0
ex-sp-ch26-05
Easy: Use a forward hook to record the output of the first hidden layer of an MLP during a forward pass. Print the mean and std of the recorded activations.
Use model.net[0].register_forward_hook(hook_fn)
Implementation
recorded = {}
def hook(module, inp, out):
    recorded['act'] = out.detach()

# assumes an MLP whose first hidden layer is model.net[0], e.g. nn.Linear(784, 256)
model.net[0].register_forward_hook(hook)
model(torch.randn(16, 784))
print(recorded['act'].mean().item(), recorded['act'].std().item())
ex-sp-ch26-06
Medium: Implement a Dataset and DataLoader for the function y = sin(2πx) + noise
with 10000 samples, batch size 64,
and an 80/20 train/validation split using random_split.
Use torch.utils.data.random_split
Implementation
from torch.utils.data import Dataset, DataLoader, random_split
class SineDataset(Dataset):
    def __init__(self, n=10000):
        self.x = torch.rand(n, 1)
        self.y = torch.sin(2 * torch.pi * self.x) + 0.1 * torch.randn(n, 1)
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]
ds = SineDataset()
train_ds, val_ds = random_split(ds, [8000, 2000])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256)
ex-sp-ch26-07
Medium: Compare SGD (with momentum 0.9), Adam, and AdamW on a 4-layer MLP for the sine regression task. Plot training loss curves for all three.
Run 3 separate training loops with the same initial model weights.
Key insight
Use copy.deepcopy(model) to ensure each optimizer starts from
identical weights. Run each for 100 epochs and record losses.
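One possible sketch (assuming the train_loader from the SineDataset exercise; the network width and learning rates are illustrative choices, not prescribed by the exercise):
import copy
import torch, torch.nn as nn

base = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(),
                     nn.Linear(64, 64), nn.ReLU(),
                     nn.Linear(64, 1))

def make_opt(name, params):
    if name == 'SGD':
        return torch.optim.SGD(params, lr=1e-2, momentum=0.9)
    if name == 'Adam':
        return torch.optim.Adam(params, lr=1e-3)
    return torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)

histories = {}
for name in ['SGD', 'Adam', 'AdamW']:
    model = copy.deepcopy(base)              # identical starting weights for every optimizer
    opt = make_opt(name, model.parameters())
    losses = []
    for epoch in range(100):
        for x, y in train_loader:            # train_loader from the SineDataset exercise
            opt.zero_grad()
            loss = nn.MSELoss()(model(x), y)
            loss.backward()
            opt.step()
        losses.append(loss.item())
    histories[name] = losses                 # plot histories['SGD'] etc. as loss curves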
ex-sp-ch26-08
Medium: Implement cosine annealing with warmup: linearly increase the LR from 0
to lr_max over the first 10 epochs, then cosine decay to lr_min
over the remaining epochs.
Use torch.optim.lr_scheduler.LambdaLR with a custom lambda.
Implementation
import math
def warmup_cosine(epoch, warmup=10, total=100, lr_min=1e-6, lr_max=1e-3):
    if epoch < warmup:
        return epoch / warmup  # linear warmup: the LR factor rises from 0 to 1
    progress = (epoch - warmup) / (total - warmup)
    # cosine decay of the factor from 1 down to lr_min/lr_max
    return lr_min/lr_max + (1 - lr_min/lr_max) * 0.5 * (1 + math.cos(math.pi * progress))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
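Usage sketch (LambdaLR multiplies the optimizer's base lr by the returned factor, so the base lr plays the role of lr_max; model stands for whatever module is being trained):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # base lr acts as lr_max
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
for epoch in range(100):
    # ... one epoch of training ...
    scheduler.step()                   # advance the schedule once per epoch
    print(epoch, scheduler.get_last_lr()[0])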
ex-sp-ch26-09
Medium: Implement a training loop with gradient clipping and gradient norm logging. Print a warning if any gradient norm exceeds 10.0.
Use torch.nn.utils.clip_grad_norm_ and check returned total norm.
Implementation
for x, y in loader:
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # clip to max norm 1.0; the returned value is the total norm *before* clipping
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    if total_norm > 10.0:
        print(f"WARNING: grad norm = {total_norm:.2f}")
    opt.step()
ex-sp-ch26-10
Medium: Implement early stopping: stop training when validation loss has not
improved for patience=10 consecutive epochs. Save the best model.
Track best_val_loss and epochs_without_improvement.
Implementation
best_loss = float('inf')
patience_counter = 0
for epoch in range(500):
    train_one_epoch()
    val_loss = evaluate()
    if val_loss < best_loss:
        best_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        patience_counter += 1
        if patience_counter >= 10:
            print(f"Early stopping at epoch {epoch}")
            break
model.load_state_dict(torch.load('best_model.pt'))
ex-sp-ch26-11
Hard: Implement mixed-precision training using torch.amp (automatic
mixed precision). Compare training speed and memory usage with
and without AMP on a 10-layer MLP with width 1024.
Use torch.amp.autocast and torch.amp.GradScaler.
Implementation
scaler = torch.amp.GradScaler('cuda')
for x, y in loader:
    opt.zero_grad()
    with torch.amp.autocast('cuda'):
        loss = loss_fn(model(x.cuda()), y.cuda())
    scaler.scale(loss).backward()   # scale the loss so fp16 gradients do not underflow
    scaler.step(opt)                # unscales gradients, skips the step if inf/nan appears
    scaler.update()
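To compare the two configurations, one option is wall-clock timing plus torch.cuda.max_memory_allocated (a sketch; run it once with AMP and once in plain fp32):
import time
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
# ... run one full training epoch here (AMP or fp32) ...
torch.cuda.synchronize()
print(f"time: {time.time() - start:.2f} s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")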
ex-sp-ch26-12
Hard: Build a modular training framework with a Trainer class that
accepts a model, optimizer, loss function, and callbacks (e.g.,
logging, checkpointing, early stopping) as constructor arguments.
Define a Callback base class with on_epoch_end, on_batch_end methods.
Key structure
class Callback:
    def on_epoch_end(self, trainer, metrics): pass
    def on_batch_end(self, trainer, loss): pass

class Trainer:
    def __init__(self, model, opt, loss_fn, callbacks=None):
        self.model, self.opt, self.loss_fn = model, opt, loss_fn
        self.callbacks = callbacks or []
    def fit(self, train_loader, val_loader, epochs):
        for epoch in range(epochs):
            self.model.train()
            for x, y in train_loader:
                self.opt.zero_grad()
                loss = self.loss_fn(self.model(x), y)
                loss.backward()
                self.opt.step()
                for cb in self.callbacks:
                    cb.on_batch_end(self, loss)
            metrics = self.validate(val_loader)  # validation pass (not shown)
            for cb in self.callbacks:
                cb.on_epoch_end(self, metrics)
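A hypothetical logging callback and one way the Trainer might be invoked (the names and hyperparameters here are illustrative, not part of the exercise):
class PrintLossCallback(Callback):
    def on_epoch_end(self, trainer, metrics):
        print(f"epoch done, val metrics: {metrics}")

trainer = Trainer(model, torch.optim.Adam(model.parameters(), lr=1e-3),
                  nn.MSELoss(), callbacks=[PrintLossCallback()])
trainer.fit(train_loader, val_loader, epochs=50)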
ex-sp-ch26-13
Hard: Implement gradient accumulation to simulate a batch size of 512 using actual batches of 32 (accumulate over 16 steps before updating).
Call optimizer.step() every accumulation_steps batches.
Implementation
accum_steps = 16
opt.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
ex-sp-ch26-14
Hard: Implement a custom autograd function for a "straight-through estimator" that applies hard thresholding in the forward pass but passes gradients through as if it were the identity in the backward pass.
Subclass torch.autograd.Function and implement forward/backward.
Implementation
class StraightThrough(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, threshold=0.5):
        return (x > threshold).float()  # hard threshold in the forward pass
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None        # identity gradient for x, no gradient for threshold
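Usage sketch (custom autograd Functions are invoked through .apply):
x = torch.randn(5, requires_grad=True)
y = StraightThrough.apply(x, 0.5)
y.sum().backward()
print(x.grad)  # all ones: the gradient passes straight through the hard threshold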
ex-sp-ch26-15
Challenge: Implement a neural network that learns the BPSK BER curve BER(Eb/N0) = Q(√(2·Eb/N0)) from simulated data. Generate 10000 (Eb/N0, BER) pairs via Monte Carlo, train an MLP, and compare the learned function to the analytical curve on a log scale.
Use log-scale inputs and outputs for better numerical conditioning.
Approach
Map Eb/N0 (dB) to log10(BER). Train with MSE loss on the log-domain targets. This gives a smooth function that an MLP can learn well.
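A possible sketch (the SNR range, bits per Monte Carlo point, network size, and BER clamp are illustrative assumptions):
import torch, torch.nn as nn

def simulate_ber(ebn0_db, n_bits=20_000):
    # Monte Carlo BPSK over AWGN: symbols are +/-1, noise std derived from Eb/N0
    ebn0 = 10 ** (ebn0_db / 10)
    bits = torch.randint(0, 2, (n_bits,))
    symbols = 2.0 * bits - 1.0
    noise = torch.randn(n_bits) * (1.0 / (2 * ebn0)) ** 0.5
    decisions = (symbols + noise > 0).long()
    return (decisions != bits).float().mean().item()

# dataset: Eb/N0 in dB -> log10(BER); clamp so log10 never sees zero
ebn0_db = torch.rand(10_000, 1) * 10                      # 0 to 10 dB (illustrative range)
ber = torch.tensor([[max(simulate_ber(v.item()), 1e-6)] for v in ebn0_db])
x, y = ebn0_db, torch.log10(ber)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.MSELoss()(model(x), y)
    loss.backward()
    opt.step()
# compare 10**model(x) to the analytical Q(sqrt(2*Eb/N0)) curve on a semilog plot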
ex-sp-ch26-16
Challenge: Implement data-parallel training using torch.nn.DataParallel or
torch.nn.parallel.DistributedDataParallel. Measure the speedup
(or slowdown) with 1 vs 2 GPUs on a large MLP.
For DDP, launch with torchrun (torch.distributed.launch is deprecated).
DataParallel approach
model = nn.DataParallel(model) # wraps model for multi-GPU
# Training loop remains the same
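For the DDP path, a minimal single-node sketch (assumed to be saved as train_ddp.py and launched with torchrun --nproc_per_node=2 train_ddp.py; the layer and batch sizes are illustrative):
import os
import torch, torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")              # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                          nn.Linear(4096, 4096), nn.ReLU(),
                          nn.Linear(4096, 10)).to(device)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(256, 1024, device=device)
        y = torch.randint(0, 10, (256,), device=device)
        opt.zero_grad()
        loss = nn.CrossEntropyLoss()(model(x), y)
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()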