Loss Functions
Choosing the Right Loss Function
The loss function encodes what you want the network to learn. With the wrong loss, the optimiser will faithfully minimise the wrong objective. This section covers the essential losses for regression, classification, and structured prediction tasks.
Definition: Mean Squared Error (MSE) Loss
Mean Squared Error (MSE) Loss
For regression tasks:
loss_fn = nn.MSELoss()
loss = loss_fn(predictions, targets) # both shape (B, d_out)
MSE is equivalent to maximum likelihood estimation under Gaussian noise: minimising $\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} \|\hat{y}_i - y_i\|^2$ is the same as maximising the likelihood of the model $y = f_\theta(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Definition: Cross-Entropy Loss
Cross-Entropy Loss
For $C$-class classification with logits $z \in \mathbb{R}^C$ and true class $y \in \{1, \dots, C\}$:
$\mathcal{L}_{\text{CE}} = -\log \frac{e^{z_y}}{\sum_{j=1}^{C} e^{z_j}} = -z_y + \log \sum_{j=1}^{C} e^{z_j}$
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels) # logits: (B, C), labels: (B,)
nn.CrossEntropyLoss combines log_softmax and nll_loss for
numerical stability. Do NOT apply softmax before this loss.
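To make the fused behaviour concrete, here is a small sketch (batch size and class count are made up) checking that the built-in loss matches an explicit log_softmax followed by picking out the negative log-probability of the true class:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up batch of 4 examples over 3 classes
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 2])

# Built-in: fused log-softmax + negative log-likelihood
loss_builtin = nn.CrossEntropyLoss()(logits, labels)

# Manual: log-softmax, then gather the log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), labels].mean()

assert torch.allclose(loss_builtin, loss_manual)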
Definition: Binary Cross-Entropy Loss
Binary Cross-Entropy Loss
For binary classification or multi-label problems:
loss_fn = nn.BCEWithLogitsLoss() # includes sigmoid
loss = loss_fn(logits, targets.float())
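A minimal multi-label usage sketch (the shapes and 0/1 labels are illustrative): each of the C outputs is an independent binary decision, so targets are per-label floats rather than a single class index:

import torch
import torch.nn as nn

# Hypothetical: 4 examples, 3 independent binary labels each
logits = torch.randn(4, 3)               # raw scores, no sigmoid applied
targets = torch.tensor([[1, 0, 1],
                        [0, 0, 1],
                        [1, 1, 0],
                        [0, 1, 0]])

loss_fn = nn.BCEWithLogitsLoss()         # applies sigmoid internally
loss = loss_fn(logits, targets.float())  # targets must be float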
Theorem: Losses as Maximum Likelihood Estimation
Minimising MSE loss is equivalent to MLE under Gaussian noise. Minimising cross-entropy loss is equivalent to MLE under a categorical distribution. In general, for a likelihood model $p_\theta(y \mid x)$, minimising the negative log-likelihood loss $\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$ is equivalent to maximum likelihood estimation.
This unifying view explains why cross-entropy and MSE are natural choices for their respective tasks.
The loss function implicitly defines the noise model. MSE assumes Gaussian noise; cross-entropy assumes categorical outcomes.
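As a quick numerical check of the Gaussian case (a sketch; the fixed $\sigma = 1$ and the random tensors are arbitrary choices, not part of the theorem), the Gaussian negative log-likelihood differs from half the MSE only by an additive constant, so both have the same minimiser:

import math
import torch
from torch.distributions import Normal

# Made-up predictions and targets for a 2-dimensional regression
preds = torch.randn(5, 2)
targets = torch.randn(5, 2)

# Negative log-likelihood under y ~ N(pred, 1)
nll = -Normal(preds, 1.0).log_prob(targets).sum(dim=1).mean()

# Half-MSE plus the Gaussian normalising constant gives the same value
half_mse = (0.5 * (preds - targets) ** 2).sum(dim=1).mean()
const = 0.5 * preds.shape[1] * math.log(2 * math.pi)
assert torch.allclose(nll, half_mse + const)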
Example: Implementing a Custom Huber Loss
Implement the Huber loss, which is MSE for small errors and L1 for large errors, making it robust to outliers.
Formula and implementation
$\mathcal{L}_\delta(a) = \begin{cases} \frac{1}{2}a^2 & \text{if } |a| \le \delta \\ \delta\left(|a| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$ where $a = \hat{y} - y$.
import torch

def huber_loss(pred, target, delta=1.0):
    diff = pred - target
    abs_diff = diff.abs()
    # Quadratic part is capped at delta; anything beyond it is the linear part
    quadratic = torch.clamp(abs_diff, max=delta)
    linear = abs_diff - quadratic
    return (0.5 * quadratic**2 + delta * linear).mean()
PyTorch built-in
loss_fn = nn.HuberLoss(delta=1.0)
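A quick sanity check (arbitrary random tensors, assuming the huber_loss definition above) that the custom implementation matches the built-in; both use mean reduction by default:

import torch
import torch.nn as nn

pred = torch.randn(16, 3)
target = torch.randn(16, 3)

assert torch.allclose(huber_loss(pred, target, delta=1.0),
                      nn.HuberLoss(delta=1.0)(pred, target))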
Loss Function Comparison
Compare MSE, MAE, Huber, and Cross-Entropy loss shapes.
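In place of the interactive figure, here is a minimal matplotlib sketch of the same comparison (the residual range and $\delta = 1$ are arbitrary choices; cross-entropy is plotted against the predicted probability of the true class, since it is not a function of a residual):

import torch
import matplotlib.pyplot as plt

a = torch.linspace(-3, 3, 200)          # residual: pred - target
delta = 1.0
mse = a ** 2
mae = a.abs()
huber = torch.where(a.abs() <= delta,
                    0.5 * a ** 2,
                    delta * (a.abs() - 0.5 * delta))

p = torch.linspace(0.01, 1.0, 200)      # predicted probability of the true class
ce = -torch.log(p)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(a, mse, label="MSE")
ax1.plot(a, mae, label="MAE")
ax1.plot(a, huber, label="Huber (delta=1)")
ax1.set_xlabel("residual")
ax1.set_ylabel("loss")
ax1.legend()
ax2.plot(p, ce, label="Cross-Entropy")
ax2.set_xlabel("probability of true class")
ax2.legend()
plt.show()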
Quick Check
Should you apply softmax to logits before passing them to nn.CrossEntropyLoss?
A. Yes, CrossEntropyLoss expects probabilities
B. No, CrossEntropyLoss applies log-softmax internally
C. Only if using mixed precision training
Answer: B. Applying softmax twice causes numerical issues and incorrect gradients.
Common Mistake: Shape Mismatch in Loss Functions
Mistake:
Passing targets of shape (B, 1) to nn.CrossEntropyLoss which
expects (B,), or passing integer labels to nn.MSELoss.
Correction:
For CrossEntropyLoss: logits shape (B, C), labels shape (B,) as torch.long.
For MSELoss: both tensors must have the same shape and be float.
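A short sketch of both fixes (tensor sizes are made up for illustration):

import torch
import torch.nn as nn

logits = torch.randn(8, 10)                    # (B, C)
labels = torch.randint(0, 10, (8, 1))          # (B, 1): wrong shape
loss = nn.CrossEntropyLoss()(logits, labels.squeeze(1))  # (B,) torch.long: correct

preds = torch.randn(8, 1)
int_targets = torch.randint(0, 5, (8, 1))      # integer targets: wrong dtype for MSE
loss = nn.MSELoss()(preds, int_targets.float())  # same shape, float: correct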
Logits
Raw (unnormalised) scores output by the network before softmax. For $C$-class classification, logits are in $\mathbb{R}^C$.
Softmax
Converts logits to probabilities: $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$.
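For example (with arbitrary logits), torch.softmax normalises along the chosen dimension so each row sums to one:

import torch

logits = torch.tensor([[2.0, 1.0, 0.1]])
probs = torch.softmax(logits, dim=-1)   # approximately [[0.659, 0.242, 0.099]]
assert torch.allclose(probs.sum(dim=-1), torch.tensor(1.0))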