Loss Functions

Choosing the Right Loss Function

The loss function encodes what you want the network to learn: with the wrong loss, the optimiser will faithfully solve the wrong problem. This section covers the essential losses for regression, classification, and structured prediction tasks.

Definition:

Mean Squared Error (MSE) Loss

For regression tasks:

$$L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \|\hat{\mathbf{y}}_i - \mathbf{y}_i\|^2$$

loss_fn = nn.MSELoss()
loss = loss_fn(predictions, targets)  # both shape (B, d_out)

MSE is equivalent to maximum likelihood estimation under Gaussian noise: $p(y \mid x) = \mathcal{N}(f_\theta(x), \sigma^2 I)$.
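
To see why, write out the negative log-likelihood of a single target under that Gaussian, with fixed $\sigma$ and output dimension $d$:

$$-\log p(y \mid x) = \frac{1}{2\sigma^2}\,\|y - f_\theta(x)\|^2 + \frac{d}{2}\log(2\pi\sigma^2)$$

The second term does not depend on $\theta$, so minimising the negative log-likelihood is the same as minimising the squared error.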

Definition:

Cross-Entropy Loss

For $C$-class classification with logits $\mathbf{z} \in \mathbb{R}^C$ and true class $y \in \{0, \ldots, C-1\}$:

$$L_{\text{CE}} = -\log \frac{e^{z_y}}{\sum_{c=0}^{C-1} e^{z_c}} = -z_y + \log \sum_{c=0}^{C-1} e^{z_c}$$

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)  # logits: (B, C), labels: (B,)

nn.CrossEntropyLoss combines log_softmax and nll_loss for numerical stability. Do NOT apply softmax before this loss.
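
Since the fused loss is just log-softmax followed by the negative log-likelihood, a quick check (with made-up shapes) confirms that the two formulations agree:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # (B, C) raw scores
labels = torch.randint(0, 10, (4,))   # (B,) integer class indices

fused = nn.CrossEntropyLoss()(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(fused, manual))  # True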

Definition:

Binary Cross-Entropy Loss

For binary classification or multi-label problems:

$$L_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr]$$

loss_fn = nn.BCEWithLogitsLoss()  # includes sigmoid
loss = loss_fn(logits, targets.float())
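
For a multi-label problem each class is an independent yes/no decision, so the targets are floats of shape (B, C). A minimal sketch with made-up shapes, which also shows why the fused version is preferred over an explicit sigmoid:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 5)                     # (B, C): one score per label
targets = torch.randint(0, 2, (4, 5)).float()  # each label independently 0 or 1

fused = nn.BCEWithLogitsLoss()(logits, targets)
# Same value, but an explicit sigmoid can overflow/underflow for large logits
explicit = F.binary_cross_entropy(torch.sigmoid(logits), targets)
print(torch.allclose(fused, explicit, atol=1e-6))  # True for moderate logits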

Theorem: Losses as Maximum Likelihood Estimation

Minimising MSE loss is equivalent to MLE under Gaussian noise. Minimising cross-entropy loss is equivalent to MLE under a categorical distribution. In general, for a likelihood model $p(y \mid x, \theta)$:

$$\hat{\theta}_{\text{MLE}} = \arg\min_\theta \, -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$

This unifying view explains why cross-entropy and MSE are natural choices for their respective tasks.
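
As a numerical sanity check (a minimal sketch with made-up tensors and a fixed sigma of 1), the Gaussian negative log-likelihood from torch.distributions matches the scaled squared error plus a theta-independent constant:

import math
import torch
from torch.distributions import Normal

pred, target = torch.randn(8, 3), torch.randn(8, 3)
sigma = 1.0

# Negative Gaussian log-likelihood, summed over output dims, averaged over the batch
nll = -Normal(pred, sigma).log_prob(target).sum(dim=1).mean()

# Scaled squared error plus the constant (d/2) * log(2 * pi * sigma^2)
sq_err = ((pred - target) ** 2).sum(dim=1).mean() / (2 * sigma ** 2)
const = 0.5 * target.shape[1] * math.log(2 * math.pi * sigma ** 2)
print(torch.allclose(nll, sq_err + const))  # True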

The loss function implicitly defines the noise model. MSE assumes Gaussian noise; cross-entropy assumes categorical outcomes.

Example: Implementing a Custom Huber Loss

Implement the Huber loss, which is quadratic (MSE-like) for small errors and linear (L1-like) for large errors, making it robust to outliers.
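
One possible implementation, written directly from the piecewise definition (PyTorch also ships nn.HuberLoss / F.huber_loss, which this mirrors for the default delta):

import torch

def huber_loss(pred: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    error = pred - target
    abs_error = error.abs()
    quadratic = 0.5 * error ** 2                 # MSE-like branch for |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)   # L1-like branch for |error| > delta
    return torch.where(abs_error <= delta, quadratic, linear).mean()

# Matches the built-in version
pred, target = torch.randn(8, 3), torch.randn(8, 3)
print(torch.allclose(huber_loss(pred, target), torch.nn.functional.huber_loss(pred, target)))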

Loss Function Comparison

Compare MSE, MAE, Huber, and Cross-Entropy loss shapes.
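
The same comparison can be done numerically. A sketch for the regression losses, evaluated element-wise on a grid of residuals (cross-entropy operates on probabilities rather than residuals, so it is omitted here):

import torch
import torch.nn.functional as F

residuals = torch.linspace(-3, 3, steps=121)
zeros = torch.zeros_like(residuals)

mse = F.mse_loss(residuals, zeros, reduction="none")                 # quadratic everywhere
mae = F.l1_loss(residuals, zeros, reduction="none")                  # linear everywhere
huber = F.huber_loss(residuals, zeros, reduction="none", delta=1.0)  # quadratic near 0, linear in the tails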

Quick Check

Should you apply softmax to logits before passing them to nn.CrossEntropyLoss?

Yes, CrossEntropyLoss expects probabilities

No, CrossEntropyLoss applies log-softmax internally

Only if using mixed precision training

Common Mistake: Shape Mismatch in Loss Functions

Mistake:

Passing targets of shape (B, 1) to nn.CrossEntropyLoss which expects (B,), or passing integer labels to nn.MSELoss.

Correction:

For CrossEntropyLoss: logits shape (B, C), labels shape (B,) as torch.long. For MSELoss: both tensors must have the same shape and be float.
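
A sketch of the fix, assuming the labels arrive as a float column vector of shape (B, 1):

import torch
import torch.nn as nn

logits = torch.randn(4, 10)                    # (B, C)
labels = torch.randint(0, 10, (4, 1)).float()  # wrong: shape (B, 1) and float

loss = nn.CrossEntropyLoss()(logits, labels.squeeze(1).long())  # right: shape (B,) and torch.long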

Logits

Raw (unnormalised) scores output by the network before softmax. For classification, logits are in $\mathbb{R}^C$.

Softmax

Converts logits to probabilities: $\text{softmax}(z_c) = e^{z_c} / \sum_k e^{z_k}$.