Loss Functions

Choosing the Right Loss Function

The loss function encodes what you want the network to learn: with the wrong loss, the optimiser will faithfully solve the wrong problem. This section covers the essential losses for regression, classification, and structured prediction tasks.

Definition:

Mean Squared Error (MSE) Loss

For regression tasks:

$$L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \|\hat{\mathbf{y}}_i - \mathbf{y}_i\|^2$$

loss_fn = nn.MSELoss()
loss = loss_fn(predictions, targets)  # both shape (B, d_out)

MSE is equivalent to maximum likelihood estimation under Gaussian noise: $p(y \mid x) = \mathcal{N}(f_\theta(x), \sigma^2 I)$.
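
To see why, write out the negative log-likelihood of a single target under that Gaussian, with fixed $\sigma$ and output dimension $d$:

$$-\log p(y \mid x) = \frac{1}{2\sigma^2}\,\|y - f_\theta(x)\|^2 + \frac{d}{2}\log(2\pi\sigma^2)$$

The second term does not depend on $\theta$, so minimising the negative log-likelihood is the same as minimising the squared error.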

Definition:

Cross-Entropy Loss

For $C$-class classification with logits $\mathbf{z} \in \mathbb{R}^C$ and true class $y \in \{0, \ldots, C-1\}$:

$$L_{\text{CE}} = -\log \frac{e^{z_y}}{\sum_{c=0}^{C-1} e^{z_c}} = -z_y + \log \sum_{c=0}^{C-1} e^{z_c}$$

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)  # logits: (B, C), labels: (B,)

nn.CrossEntropyLoss combines log_softmax and nll_loss for numerical stability. Do NOT apply softmax before this loss.
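
Since the fused loss is just log-softmax followed by the negative log-likelihood, a quick check (with made-up shapes) confirms that the two formulations agree:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # (B, C) raw scores
labels = torch.randint(0, 10, (4,))   # (B,) integer class indices

fused = nn.CrossEntropyLoss()(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(fused, manual))  # True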

Definition:

Binary Cross-Entropy Loss

For binary classification or multi-label problems:

$$L_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N} \bigl[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr]$$

loss_fn = nn.BCEWithLogitsLoss()  # includes sigmoid
loss = loss_fn(logits, targets.float())
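
For a multi-label problem each class is an independent yes/no decision, so the targets are floats of shape (B, C). A minimal sketch with made-up shapes, which also shows why the fused version is preferred over an explicit sigmoid:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 5)                     # (B, C): one score per label
targets = torch.randint(0, 2, (4, 5)).float()  # each label independently 0 or 1

fused = nn.BCEWithLogitsLoss()(logits, targets)
# Same value, but an explicit sigmoid can overflow/underflow for large logits
explicit = F.binary_cross_entropy(torch.sigmoid(logits), targets)
print(torch.allclose(fused, explicit, atol=1e-6))  # True for moderate logits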

Theorem: Losses as Maximum Likelihood Estimation

Minimising MSE loss is equivalent to MLE under Gaussian noise. Minimising cross-entropy loss is equivalent to MLE under a categorical distribution. In general, for a likelihood model $p(y \mid x, \theta)$:

$$\hat{\theta}_{\text{MLE}} = \arg\min_\theta \, -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$

This unifying view explains why cross-entropy and MSE are natural choices for their respective tasks.
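
As a numerical sanity check (a minimal sketch with made-up tensors and a fixed sigma of 1), the Gaussian negative log-likelihood from torch.distributions matches the scaled squared error plus a theta-independent constant:

import math
import torch
from torch.distributions import Normal

pred, target = torch.randn(8, 3), torch.randn(8, 3)
sigma = 1.0

# Negative Gaussian log-likelihood, summed over output dims, averaged over the batch
nll = -Normal(pred, sigma).log_prob(target).sum(dim=1).mean()

# Scaled squared error plus the constant (d/2) * log(2 * pi * sigma^2)
sq_err = ((pred - target) ** 2).sum(dim=1).mean() / (2 * sigma ** 2)
const = 0.5 * target.shape[1] * math.log(2 * math.pi * sigma ** 2)
print(torch.allclose(nll, sq_err + const))  # True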

The loss function implicitly defines the noise model. MSE assumes Gaussian noise; cross-entropy assumes categorical outcomes.

Example: Implementing a Custom Huber Loss

Implement the Huber loss, which is quadratic (MSE-like) for small errors and linear (L1-like) for large errors, making it robust to outliers.
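
One possible implementation, written directly from the piecewise definition (PyTorch also ships nn.HuberLoss / F.huber_loss, which this mirrors for the default delta):

import torch

def huber_loss(pred: torch.Tensor, target: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    error = pred - target
    abs_error = error.abs()
    quadratic = 0.5 * error ** 2                 # MSE-like branch for |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)   # L1-like branch for |error| > delta
    return torch.where(abs_error <= delta, quadratic, linear).mean()

# Matches the built-in version
pred, target = torch.randn(8, 3), torch.randn(8, 3)
print(torch.allclose(huber_loss(pred, target), torch.nn.functional.huber_loss(pred, target)))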

Loss Function Comparison

Compare MSE, MAE, Huber, and Cross-Entropy loss shapes.
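
The same comparison can be done numerically. A sketch for the regression losses, evaluated element-wise on a grid of residuals (cross-entropy operates on probabilities rather than residuals, so it is omitted here):

import torch
import torch.nn.functional as F

residuals = torch.linspace(-3, 3, steps=121)
zeros = torch.zeros_like(residuals)

mse = F.mse_loss(residuals, zeros, reduction="none")                 # quadratic everywhere
mae = F.l1_loss(residuals, zeros, reduction="none")                  # linear everywhere
huber = F.huber_loss(residuals, zeros, reduction="none", delta=1.0)  # quadratic near 0, linear in the tails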

Quick Check

Should you apply softmax to logits before passing them to nn.CrossEntropyLoss?

Yes, CrossEntropyLoss expects probabilities

No, CrossEntropyLoss applies log-softmax internally

Only if using mixed precision training

Common Mistake: Shape Mismatch in Loss Functions

Mistake:

Passing targets of shape (B, 1) to nn.CrossEntropyLoss which expects (B,), or passing integer labels to nn.MSELoss.

Correction:

For CrossEntropyLoss: logits shape (B, C), labels shape (B,) as torch.long. For MSELoss: both tensors must have the same shape and be float.
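
A sketch of the fix, assuming the labels arrive as a float column vector of shape (B, 1):

import torch
import torch.nn as nn

logits = torch.randn(4, 10)                    # (B, C)
labels = torch.randint(0, 10, (4, 1)).float()  # wrong: shape (B, 1) and float

loss = nn.CrossEntropyLoss()(logits, labels.squeeze(1).long())  # right: shape (B,) and torch.long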

Logits

Raw (unnormalised) scores output by the network before softmax. For classification, logits are in $\mathbb{R}^C$.

Softmax

Converts logits to probabilities: $\text{softmax}(z_c) = e^{z_c} / \sum_k e^{z_k}$.