Common Debugging Patterns
When Training Goes Wrong
Neural network debugging is notoriously difficult because bugs often manifest as "the model just doesn't learn" rather than a crash. This section collects the most common failure modes and systematic debugging strategies.
Definition: Gradient Clipping
When gradients explode, clip them to a maximum norm:
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Call this between loss.backward() and optimizer.step()
```
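For context, a minimal sketch of where the call sits in a single training step (model, optimizer, loss_fn, x_batch, and y_batch are assumed to be defined, as in the snippets below):

```python
# Sketch of one training step with clipping; the surrounding objects
# (model, optimizer, loss_fn, x_batch, y_batch) are assumed to exist.
optimizer.zero_grad()
loss = loss_fn(model(x_batch), y_batch)
loss.backward()                                                    # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip them
optimizer.step()                                                    # then apply the update
```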
Example: The Overfit-One-Batch Test
Before training on the full dataset, verify the model can memorise a single batch perfectly.
Implementation
```python
x_batch, y_batch = next(iter(train_loader))
for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}: loss = {loss.item():.6f}")
# Loss should reach ~0. If not, there's a bug.
```
What to check if it fails
- Wrong loss function for the task
- Model output shape doesn't match target shape
- Learning rate too high or too low
- Accidentally detached gradients (e.g., .detach() or .data); see the sketch after this list
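The last point is easy to check in isolation. Below is a minimal, hypothetical sketch (BuggyModel is invented for illustration): a stray .detach() in forward() silently cuts fc1 out of the autograd graph, and any parameter whose .grad is still None after backward() is a strong hint that this has happened.

```python
import torch
import torch.nn as nn

class BuggyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = h.detach()            # BUG: gradients stop flowing here
        return self.fc2(h)

model = BuggyModel()
x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Parameters that never received a gradient point at the detached subgraph
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"No gradient reached {name}")   # reports fc1.weight and fc1.bias
```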
Example: Monitoring Gradient Statistics
Detect vanishing or exploding gradients during training.
Print gradient norms per layer
```python
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(f"{name}: grad_norm = {grad_norm:.6f}")
```
Check for NaN
```python
for name, param in model.named_parameters():
    if param.grad is not None and torch.isnan(param.grad).any():
        print(f"NaN gradient in {name}!")
```
Gradient Flow Visualisation
See how gradients flow (or vanish) through layers.
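A minimal sketch of one common way to do this, assuming matplotlib is available: after a backward pass, plot the mean absolute gradient of each layer. Bars near zero in early layers suggest vanishing gradients; very tall bars suggest exploding ones.

```python
import matplotlib.pyplot as plt

def plot_grad_flow(named_parameters):
    # Collect one summary statistic per parameter tensor
    names, means = [], []
    for name, param in named_parameters:
        if param.requires_grad and param.grad is not None:
            names.append(name)
            means.append(param.grad.abs().mean().item())
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(means)), means)
    plt.xticks(range(len(names)), names, rotation=90)
    plt.ylabel("mean |gradient|")
    plt.title("Gradient flow per layer")
    plt.tight_layout()
    plt.show()

# Usage: call between loss.backward() and optimizer.step()
# plot_grad_flow(model.named_parameters())
```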
Quick Check
Your training loss stays flat after 100 epochs. What is the FIRST thing to check?
- Try a larger model
- Verify the model can overfit a single batch (correct)
- Switch to a different optimizer

If the model can't memorise one batch, the bug is in the model, loss, or data pipeline, not in model capacity.
Debugging Checklist
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss is NaN | Learning rate too high or numerical overflow | Lower LR, use gradient clipping, check for log(0) |
| Loss doesn't decrease | Bug in data pipeline or loss | Overfit one batch first |
| Train loss drops, val loss flat | Overfitting | Add dropout, weight decay, or data augmentation |
| Loss oscillates wildly | LR too high | Reduce LR by 10x |
| All predictions identical | Dead neurons or wrong activation | Check init, use LeakyReLU |
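For the first row in particular, a cheap guard makes a NaN loss fail loudly instead of silently corrupting the weights. A sketch, reusing model, loss_fn, and the batch from the overfit-one-batch example above:

```python
import math

loss = loss_fn(model(x_batch), y_batch)
# Stop immediately on a non-finite loss; a NaN that reaches optimizer.step()
# poisons the weights and is much harder to diagnose afterwards.
if not math.isfinite(loss.item()):
    raise RuntimeError(f"Non-finite loss: {loss.item()}")
loss.backward()
```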