Common Debugging Patterns

When Training Goes Wrong

Neural network debugging is notoriously difficult because bugs often manifest as "the model just doesn't learn" rather than a crash. This section collects the most common failure modes and systematic debugging strategies.

Definition: Gradient Clipping

When gradients explode, clip them so their norm never exceeds a threshold τ:

\hat{\mathbf{g}} = \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \le \tau \\ \tau \cdot \mathbf{g} / \|\mathbf{g}\| & \text{otherwise} \end{cases}

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Call between .backward() and optimizer.step()
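
To make the placement concrete, here is a minimal training-step sketch; model, loss_fn, optimizer, inputs, and targets are hypothetical stand-ins for your own objects:

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()  # gradients are now populated
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if norm > 1.0
optimizer.step()  # update using the clipped gradients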

Example: The Overfit-One-Batch Test

Before training on the full dataset, verify the model can memorise a single batch perfectly. If the loss will not fall to near zero even on one batch, the problem is a bug in the data pipeline, the loss, or the model wiring, not a lack of capacity.
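
A minimal sketch of the test, assuming a generic PyTorch model and a single (inputs, targets) batch; overfit_one_batch and loss_fn are hypothetical names, and Adam with lr=1e-3 is just a reasonable default:

import torch

def overfit_one_batch(model, inputs, targets, loss_fn, steps=500, lr=1e-3):
    # Sanity check: the loss should fall close to zero on one memorised batch.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step:4d}: loss {loss.item():.6f}")
    return loss.item()

If the loss plateaus well above zero, hunt for bugs before touching hyperparameters.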

Example: Monitoring Gradient Statistics

Detect vanishing or exploding gradients during training.
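
One simple approach is to log per-parameter gradient norms after each backward pass. The sketch below assumes a PyTorch model; the thresholds 1e-7 and 1e2 are illustrative, not universal constants:

def log_grad_stats(model):
    # Call after loss.backward() and before optimizer.step().
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        norm = p.grad.norm().item()
        total_sq += norm ** 2
        if norm < 1e-7:
            print(f"possibly vanishing: {name} (norm {norm:.2e})")
        elif norm > 1e2:
            print(f"possibly exploding: {name} (norm {norm:.2e})")
    return total_sq ** 0.5  # global gradient norm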

Gradient Flow Visualisation

See how gradients flow (or vanish) through layers.
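
A common way to do this is a per-layer bar chart of mean absolute gradients. The sketch below uses matplotlib, skips bias terms for readability, and plot_grad_flow is a hypothetical helper name:

import matplotlib.pyplot as plt

def plot_grad_flow(model):
    # Call after loss.backward(); on the log scale, bars near the
    # bottom flag layers whose gradients are vanishing.
    names, means = [], []
    for name, p in model.named_parameters():
        if p.grad is not None and "bias" not in name:
            names.append(name)
            means.append(p.grad.abs().mean().item())
    plt.bar(range(len(means)), means)
    plt.xticks(range(len(names)), names, rotation=90, fontsize=6)
    plt.yscale("log")
    plt.ylabel("mean |gradient|")
    plt.title("Gradient flow by layer")
    plt.tight_layout()
    plt.show()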

Quick Check

Your training loss stays flat after 100 epochs. What is the FIRST thing to check?

a) Try a larger model
b) Verify the model can overfit a single batch
c) Switch to a different optimizer

Answer: b. If the model cannot memorise even one batch, the problem is a bug in the data, loss, or model wiring, and neither more capacity nor a different optimizer will fix it.

Debugging Checklist

| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss is NaN | Learning rate too high or numerical overflow | Lower LR, use gradient clipping, check for log(0) |
| Loss doesn't decrease | Bug in data pipeline or loss | Overfit one batch first |
| Train loss drops, val loss flat | Overfitting | Add dropout, weight decay, or data augmentation |
| Loss oscillates wildly | LR too high | Reduce LR by 10x |
| All predictions identical | Dead neurons or wrong activation | Check init, use LeakyReLU |