Common Debugging Patterns
When Training Goes Wrong
Neural network debugging is notoriously difficult because bugs often manifest as "the model just doesn't learn" rather than a crash. This section collects the most common failure modes and systematic debugging strategies.
Definition: Gradient Clipping
When gradients explode, clip them to a maximum norm:
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Call this between loss.backward() and optimizer.step()
```
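For context, a minimal sketch of where the call sits in a single training step (model, optimizer, loss_fn, x_batch, and y_batch are assumed to be defined, as in the snippets below):

```python
# Sketch of one training step with clipping; the surrounding objects
# (model, optimizer, loss_fn, x_batch, y_batch) are assumed to exist.
optimizer.zero_grad()
loss = loss_fn(model(x_batch), y_batch)
loss.backward()                                                    # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip them
optimizer.step()                                                    # then apply the update
```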
Example: The Overfit-One-Batch Test
Before training on the full dataset, verify the model can memorise a single batch perfectly.
Implementation
```python
x_batch, y_batch = next(iter(train_loader))
for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}: loss = {loss.item():.6f}")
# Loss should reach ~0. If not, there's a bug.
```
What to check if it fails
- Wrong loss function for the task
- Model output shape doesn't match target shape
- Learning rate too high or too low
- Accidentally detached gradients (e.g., .detach() or .data); see the sketch after this list
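The last point is easy to check in isolation. Below is a minimal, hypothetical sketch (BuggyModel is invented for illustration): a stray .detach() in forward() silently cuts fc1 out of the autograd graph, and any parameter whose .grad is still None after backward() is a strong hint that this has happened.

```python
import torch
import torch.nn as nn

class BuggyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = h.detach()            # BUG: gradients stop flowing here
        return self.fc2(h)

model = BuggyModel()
x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Parameters that never received a gradient point at the detached subgraph
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"No gradient reached {name}")   # reports fc1.weight and fc1.bias
```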
Example: Monitoring Gradient Statistics
Detect vanishing or exploding gradients during training.
Print gradient norms per layer
```python
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(f"{name}: grad_norm = {grad_norm:.6f}")
```
Check for NaN
```python
for name, param in model.named_parameters():
    if param.grad is not None and torch.isnan(param.grad).any():
        print(f"NaN gradient in {name}!")
```
Gradient Flow Visualisation
See how gradients flow (or vanish) through layers.
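A minimal sketch of one common way to do this, assuming matplotlib is available: after a backward pass, plot the mean absolute gradient of each layer. Bars near zero in early layers suggest vanishing gradients; very tall bars suggest exploding ones.

```python
import matplotlib.pyplot as plt

def plot_grad_flow(named_parameters):
    # Collect one summary statistic per parameter tensor
    names, means = [], []
    for name, param in named_parameters:
        if param.requires_grad and param.grad is not None:
            names.append(name)
            means.append(param.grad.abs().mean().item())
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(means)), means)
    plt.xticks(range(len(names)), names, rotation=90)
    plt.ylabel("mean |gradient|")
    plt.title("Gradient flow per layer")
    plt.tight_layout()
    plt.show()

# Usage: call between loss.backward() and optimizer.step()
# plot_grad_flow(model.named_parameters())
```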
Quick Check
Your training loss stays flat after 100 epochs. What is the FIRST thing to check?
- Try a larger model
- Verify the model can overfit a single batch (correct)
- Switch to a different optimizer

If the model can't memorise one batch, the bug is in the model, loss, or data pipeline, not in model capacity.
Debugging Checklist
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss is NaN | Learning rate too high or numerical overflow | Lower LR, use gradient clipping, check for log(0) |
| Loss doesn't decrease | Bug in data pipeline or loss | Overfit one batch first |
| Train loss drops, val loss flat | Overfitting | Add dropout, weight decay, or data augmentation |
| Loss oscillates wildly | LR too high | Reduce LR by 10x |
| All predictions identical | Dead neurons or wrong activation | Check init, use LeakyReLU |
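For the first row in particular, a cheap guard makes a NaN loss fail loudly instead of silently corrupting the weights. A sketch, reusing model, loss_fn, and the batch from the overfit-one-batch example above:

```python
import math

loss = loss_fn(model(x_batch), y_batch)
# Stop immediately on a non-finite loss; a NaN that reaches optimizer.step()
# poisons the weights and is much harder to diagnose afterwards.
if not math.isfinite(loss.item()):
    raise RuntimeError(f"Non-finite loss: {loss.item()}")
loss.backward()
```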