The Training Loop

The Training Loop Is the Heart of Deep Learning

Unlike scikit-learn's .fit(), PyTorch requires you to write the training loop explicitly. This gives you complete control over every aspect: data loading, loss computation, gradient accumulation, mixed precision, and logging.

Definition:

Stochastic Gradient Descent (SGD)

SGD updates parameters using the gradient of the loss on a mini-batch:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L(\hat{y}_i, y_i)$$

With momentum $\beta$:

$$v_{t+1} = \beta v_t + \nabla_\theta L, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}$$

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
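
To make the update rule concrete, here is a minimal sketch of what a single SGD-with-momentum step does under the hood. The model, data, and hyperparameters are illustrative placeholders; in practice optimizer.step() performs this update for you.

```python
import torch

# Illustrative setup: any model and loss would do.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()  # populate p.grad for every parameter

lr, beta = 0.01, 0.9
velocity = {p: torch.zeros_like(p) for p in model.parameters()}
with torch.no_grad():
    for p in model.parameters():
        v = velocity[p]
        v.mul_(beta).add_(p.grad)   # v <- beta * v + grad
        p.add_(v, alpha=-lr)        # theta <- theta - eta * v
        p.grad.zero_()              # clear the gradient for the next step
```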

Definition:

Adam Optimiser

Adam combines momentum with per-parameter adaptive learning rates:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad \theta_t = \theta_{t-1} - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

AdamW decouples weight decay from the gradient update, giving better generalisation. Prefer AdamW over Adam for most tasks.
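
A rough sketch of the distinction, with all names illustrative: plain Adam with an L2 penalty folds the decay term into the gradient, so it gets rescaled by the adaptive denominator, whereas AdamW shrinks the weights directly, outside the adaptive statistics.

```python
import torch

def adam_l2_gradient(param, grad, weight_decay=0.01):
    """Adam + L2 penalty: decay is added to the gradient, so it later
    flows through m_t, v_t and the sqrt(v_hat) + eps rescaling."""
    return grad + weight_decay * param

def adamw_update(param, m_hat, v_hat, lr=1e-3, weight_decay=0.01, eps=1e-8):
    """AdamW: decay is applied directly to the weights, decoupled from
    the adaptive statistics (names here are illustrative)."""
    param = param * (1 - lr * weight_decay)            # decoupled weight decay
    return param - lr * m_hat / (v_hat.sqrt() + eps)   # standard Adam step
```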

Theorem: SGD Convergence for Convex Objectives

For an $L$-smooth convex function $f$ with bounded gradient variance $\sigma^2$, SGD with learning rate $\eta_t = \eta_0 / \sqrt{t}$ satisfies:

$$\mathbb{E}\big[f(\bar{\theta}_T)\big] - f(\theta^*) \le \frac{\|\theta_0 - \theta^*\|^2}{2\eta_0\sqrt{T}} + \frac{\eta_0 \sigma^2}{2\sqrt{T}}$$

This gives an $O(1/\sqrt{T})$ convergence rate.

SGD converges more slowly than full-batch gradient descent ($O(1/T)$) because of gradient noise, but each step is $N/B$ times cheaper.
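
As a rough numerical illustration of the $1/\sqrt{T}$ scaling (the constants are arbitrary), plugging $\|\theta_0 - \theta^*\|^2 = 1$, $\eta_0 = 0.1$, $\sigma^2 = 1$ into the bound shows the guarantee shrinking by a factor of 10 for every 100-fold increase in $T$:

```python
import math

# Arbitrary illustrative constants plugged into the bound above.
dist_sq, eta0, sigma_sq = 1.0, 0.1, 1.0
for T in (100, 10_000, 1_000_000):
    bound = dist_sq / (2 * eta0 * math.sqrt(T)) + eta0 * sigma_sq / (2 * math.sqrt(T))
    print(f"T = {T:>9,}   bound = {bound:.5f}")
```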

Example: The Standard PyTorch Training Loop

Write a complete training loop for an MLP on synthetic regression data.
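
One possible solution, sketched below: synthetic linear-plus-noise data, a small MLP, AdamW, and a validation pass each epoch. All sizes and hyperparameters are illustrative.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic regression data: y = Wx + noise (illustrative).
torch.manual_seed(0)
X = torch.randn(1024, 8)
y = X @ torch.randn(8, 1) + 0.1 * torch.randn(1024, 1)
train_ds, val_ds = TensorDataset(X[:800], y[:800]), TensorDataset(X[800:], y[800:])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn = nn.MSELoss()

for epoch in range(20):
    # Training pass
    model.train()
    train_loss = 0.0
    for xb, yb in train_loader:
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)    # forward pass + loss
        loss.backward()                  # backward pass: populate .grad
        optimizer.step()                 # parameter update
        train_loss += loss.item() * xb.size(0)
    train_loss /= len(train_ds)

    # Validation pass: eval mode, no gradient tracking
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for xb, yb in val_loader:
            val_loss += loss_fn(model(xb), yb).item() * xb.size(0)
    val_loss /= len(val_ds)

    print(f"epoch {epoch + 1:2d}  train {train_loss:.4f}  val {val_loss:.4f}")
```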

Training Dynamics: Loss vs Epoch

Interactive plot: watch how learning rate and optimizer choice affect convergence.

Quick Check

What happens if you forget optimizer.zero_grad() in the training loop?

Gradients from successive batches accumulate (add up)

The model does not learn at all

PyTorch raises an error
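
A small demonstration of the underlying behaviour (tensor names are illustrative): .backward() adds into .grad rather than overwriting it, so without zero_grad() the gradients from successive batches pile up.

```python
import torch

w = torch.ones(3, requires_grad=True)
for step in range(3):
    loss = (2 * w).sum()           # d(loss)/dw = 2 for every element
    loss.backward()                # accumulates into w.grad
    print(step, w.grad.tolist())   # [2,2,2], then [4,4,4], then [6,6,6]
```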

Common Mistake: Forgetting model.eval() During Validation

Mistake:

Running validation without calling model.eval(), causing BatchNorm and Dropout to behave as in training mode.

Correction:

Always call model.eval() before validation and model.train() before the next training epoch. Also wrap validation in torch.no_grad().
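
A compact validation helper along these lines (names illustrative) keeps both fixes in one place:

```python
import torch

def validate(model, loader, loss_fn):
    """Sketch of a validation pass: eval mode switches BatchNorm/Dropout to
    inference behaviour, no_grad skips autograd tracking."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            total += loss_fn(model(xb), yb).item() * xb.size(0)
            n += xb.size(0)
    model.train()   # restore training mode before the next epoch
    return total / n
```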

Why This Matters: End-to-End Learning of Communication Systems

The training loop framework applies directly to learning communication systems end-to-end: the transmitter and receiver are neural networks, the channel is a differentiable layer, and the loss is the bit error rate or mutual information. This autoencoder approach (O'Shea & Hoydis, 2017) can discover novel modulation schemes that outperform hand-designed ones.

See full treatment in Chapter 28

Historical Note: Adam: Adaptive Moment Estimation

2014

Kingma and Ba introduced Adam in 2014, combining ideas from AdaGrad (per-parameter rates) and RMSProp (exponential moving average of squared gradients). Despite theoretical concerns about non-convergence in some cases (Reddi et al., 2018), Adam and its variant AdamW remain the most widely used optimizers in deep learning.

Epoch

One complete pass through the entire training dataset.

Mini-batch

A subset of training examples used to compute one gradient update. Typical sizes: 32-512.