nn.Module and Model Definition

Why nn.Module Is the Foundation

Every neural network in PyTorch is an nn.Module. Understanding this base class β€” how it registers parameters, composes sub-modules, and manages device placement β€” is essential before writing any training code. This section covers the mechanics that the rest of Part VI builds upon.

Definition:

nn.Module

torch.nn.Module is the base class for all neural network components. A module encapsulates:

  1. Parameters β€” learnable tensors registered via nn.Parameter
  2. Sub-modules β€” child nn.Module instances (set as attributes)
  3. Forward logic β€” the forward() method defining the computation

import torch
import torch.nn as nn

class LinearModel(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.linear(x)

Calling model(x) invokes model.forward(x) through __call__, which also runs registered hooks.

Never call model.forward(x) directly; always use model(x) so registered hooks and the rest of __call__'s bookkeeping run correctly.
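A minimal usage sketch of the class above (the feature sizes are illustrative):

model = LinearModel(in_features=8, out_features=2)
x = torch.randn(4, 8)     # a batch of 4 samples
y = model(x)              # goes through __call__, which dispatches to forward()
print(y.shape)            # torch.Size([4, 2])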

Definition:

nn.Parameter

nn.Parameter is a Tensor subclass that, when assigned as a module attribute, is automatically added to the list of parameters:

\theta = \{\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2, \ldots\}

class ManualLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in))
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        return x @ self.W.T + self.b

Use model.parameters() to iterate over all parameters and model.named_parameters() for (name, param) pairs.
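For example, with the ManualLinear module above (sizes illustrative), iterating over named parameters confirms that W and b were registered automatically:

layer = ManualLinear(d_in=4, d_out=3)
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# W (3, 4) True
# b (3,) True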

Definition:

nn.Sequential

nn.Sequential chains modules so each output feeds the next:

\hat{\mathbf{y}} = f_L \circ f_{L-1} \circ \cdots \circ f_1(\mathbf{x})

mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

For named layers, use OrderedDict:

from collections import OrderedDict
mlp = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(784, 256)),
    ("act", nn.ReLU()),
    ("fc2", nn.Linear(256, 10)),
]))
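Either form is used like any other module; with the named variant, children are reachable by attribute as well as by index (a quick sketch, input sizes illustrative):

x = torch.randn(32, 784)
logits = mlp(x)           # each module's output feeds the next; shape (32, 10)
first = mlp[0]            # integer indexing works for both variants
fc1 = mlp.fc1             # attribute access works with the named variant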

Definition:

Common Activation Functions

Activation functions introduce non-linearity:

\text{ReLU}(x) = \max(0, x), \qquad \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

GELU (used in transformers): \text{GELU}(x) = x \cdot \Phi(x)

where \Phi is the standard Gaussian CDF.

In PyTorch: nn.ReLU(), nn.Sigmoid(), nn.GELU(), or their functional forms F.relu(x), torch.sigmoid(x), etc.

ReLU is the default choice for hidden layers. Use GELU for transformer-based architectures and sigmoid/softmax for output layers.
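The module and functional forms compute the same values; modules slot into nn.Sequential, while functional calls are common inside a custom forward(). A small sketch:

import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
act = nn.GELU()
print(torch.allclose(act(x), F.gelu(x)))   # True: same computation, two interfaces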

Definition:

Weight Initialisation Strategies

Proper initialisation prevents vanishing/exploding activations. For a layer with fan-in n_{\text{in}} and fan-out n_{\text{out}}:

Kaiming (He) for ReLU networks: W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)

Xavier (Glorot) for tanh/sigmoid: W_{ij} \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)

nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)
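These initialisers are typically applied to every linear layer at once via Module.apply; a sketch assuming the mlp defined earlier:

def init_weights(m):
    # initialise only Linear layers; activations have no parameters
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

mlp.apply(init_weights)   # recursively applies the function to every sub-module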

Definition:

Forward and Backward Hooks

Hooks let you inspect or modify data flowing through modules:

  • Forward hook: module.register_forward_hook(fn) β€” called after forward(), receives (module, input, output)
  • Backward hook: module.register_full_backward_hook(fn) β€” called during backward pass, receives (module, grad_input, grad_output)

activations = {}
def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model.fc1.register_forward_hook(save_activation('fc1'))

Hooks are invaluable for debugging (checking for NaN gradients), feature extraction, and gradient visualization.
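As a sketch of the NaN-checking use case (assuming, as above, that the model has an fc1 sub-module):

def check_nan_grads(module, grad_input, grad_output):
    # grad_output is a tuple of gradients w.r.t. the module's outputs
    for g in grad_output:
        if g is not None and torch.isnan(g).any():
            print(f"NaN gradient in {module.__class__.__name__}")

handle = model.fc1.register_full_backward_hook(check_nan_grads)
# ... run forward and backward passes ...
handle.remove()           # detach the hook when it is no longer needed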

Theorem: Universal Approximation Theorem

A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function f: \mathbb{R}^n \to \mathbb{R}^m on a compact subset of \mathbb{R}^n to arbitrary precision, provided the activation function is non-constant, bounded, and continuous (Cybenko, 1989).

More generally, for ReLU networks (Hanin, 2019): a network with width n+1 and sufficient depth can approximate any continuous function on [0,1]^n.

A single hidden layer can approximate any continuous function, but may require exponentially many neurons. Depth provides exponentially more efficient representations.

Theorem: Parameter Count for Fully Connected Networks

An MLP with layer widths n_0, n_1, \ldots, n_L has total parameter count:

|\theta| = \sum_{l=1}^{L} (n_{l-1} \cdot n_l + n_l) = \sum_{l=1}^{L} n_l(n_{l-1} + 1)

The first term counts weights \mathbf{W}_l \in \mathbb{R}^{n_l \times n_{l-1}} and the second counts biases \mathbf{b}_l \in \mathbb{R}^{n_l}.

Each neuron connects to every neuron in the previous layer (weights) plus one bias. This quadratic scaling motivates architectures like CNNs and transformers that share parameters.
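As a concrete check, the widths 784, 256, 128, 10 give 235,146 parameters; a quick sketch comparing the formula against PyTorch's own count:

widths = [784, 256, 128, 10]
formula = sum(widths[l] * (widths[l - 1] + 1) for l in range(1, len(widths)))

net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
counted = sum(p.numel() for p in net.parameters())
print(formula, counted)   # 235146 235146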

Theorem: Backpropagation via the Chain Rule

For a composition L = \ell(f_L(f_{L-1}(\cdots f_1(\mathbf{x})))), the gradient with respect to the parameters \mathbf{W}_k of layer k is:

\frac{\partial L}{\partial \mathbf{W}_k} = \frac{\partial L}{\partial \mathbf{z}_L} \cdot \prod_{l=k+1}^{L} \frac{\partial \mathbf{z}_l}{\partial \mathbf{z}_{l-1}} \cdot \frac{\partial \mathbf{z}_k}{\partial \mathbf{W}_k}

PyTorch's autograd computes this automatically using a dynamic computational graph built during the forward pass.

Backpropagation is just the chain rule applied layer by layer from output to input. PyTorch records operations on tensors with requires_grad=True and replays them in reverse during .backward().
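A tiny sketch of autograd applying the chain rule (shapes are illustrative):

x = torch.randn(5, 3)
w = torch.randn(3, 1, requires_grad=True)
loss = (x @ w).relu().sum()   # forward pass records the graph
loss.backward()               # reverse traversal applies the chain rule
print(w.grad.shape)           # torch.Size([3, 1]), i.e. the gradient of loss w.r.t. w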

Example: Building a 3-Layer MLP for Classification

Build a 3-layer MLP that maps 784-dimensional input (flattened 28x28 image) to 10 class logits, with hidden layers of size 256 and 128.
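One possible implementation; the flatten step assumes inputs shaped (B, 1, 28, 28):

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 10),            # raw logits, no softmax
        )

    def forward(self, x):
        return self.net(x.flatten(start_dim=1))   # (B, 1, 28, 28) -> (B, 784)

model = MLP()
logits = model(torch.randn(32, 1, 28, 28))
print(logits.shape)           # torch.Size([32, 10])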

Example: Custom Module with Residual (Skip) Connection

Implement a module where the output is \mathbf{y} = f(\mathbf{x}) + \mathbf{x}, i.e., a residual block that adds the input to the transformed output.
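One possible implementation, using a small two-layer transform as f (the width 64 is illustrative):

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.f(x) + x      # skip connection: y = f(x) + x

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)   # torch.Size([8, 64])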

Example: Inspecting Parameters and Module Tree

Given a model, enumerate all sub-modules and their parameter shapes.
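A sketch, assuming model is the MLP built above (any nn.Module works the same way):

for name, module in model.named_modules():
    if name:                      # skip the unnamed root module
        print(name, module.__class__.__name__)

for name, param in model.named_parameters():
    print(f"{name:25s} {tuple(param.shape)}")

print("total parameters:", sum(p.numel() for p in model.parameters()))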

Example: Moving Models and Data to GPU

Move a model and its input tensors to GPU for accelerated computation.
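A minimal sketch, again assuming the MLP model from above:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)          # moves all parameters and buffers in place
x = torch.randn(32, 1, 28, 28, device=device)   # create the batch directly on the device
# or, for an existing tensor: x = x.to(device)

logits = model(x)                 # model and inputs must live on the same device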

Activation Function Explorer

Compare ReLU, LeakyReLU, GELU, Sigmoid, Tanh, and Swish.

MLP Parameter Count Calculator

See how width and depth affect parameter count.

Weight Initialisation Comparison

Compare activation distributions through layers with different init strategies.

Forward Pass Through an MLP

Watch activations propagate layer by layer through a network.

nn.Module Composition Tree

Hierarchical structure of an nn.Module: parameters live in leaf modules, and named_modules() traverses the tree depth-first.

Autograd Computational Graph

PyTorch builds a directed acyclic graph of operations during the forward pass. Calling .backward() traverses this graph in reverse to compute gradients.

Quick Check

What happens if you assign a torch.Tensor (not nn.Parameter) as a module attribute?

  • It is automatically registered as a parameter
  • It is ignored by model.parameters() and the optimizer
  • PyTorch raises an error

Quick Check

Which initialisation is most appropriate for a network using ReLU activations?

  • Xavier (Glorot) initialisation
  • Kaiming (He) initialisation
  • All-zeros initialisation

Common Mistake: Forgetting super().__init__()

Mistake:

Defining an nn.Module subclass without calling super().__init__() in __init__.

Correction:

Always start with super().__init__(). Without it, PyTorch cannot register parameters or sub-modules, and .parameters() returns nothing.

Common Mistake: Using Python List Instead of nn.ModuleList

Mistake:

Storing sub-modules in a plain Python list: self.layers = [nn.Linear(64, 64) for _ in range(3)]

Correction:

Use nn.ModuleList: self.layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(3)]). Plain lists are invisible to .parameters(), .to(device), and .state_dict().

Common Mistake: Calling .forward() Directly

Mistake:

Writing output = model.forward(x) instead of output = model(x).

Correction:

Always use model(x). The __call__ method runs registered forward and backward hooks and handles other internal bookkeeping that a bare forward() call skips.

Key Takeaway

nn.Module is a tree: compose complex architectures by nesting modules. Use nn.Sequential for simple chains, nn.ModuleList for indexed access, and nn.ModuleDict for named dynamic architectures.

Key Takeaway

Weight initialisation determines whether gradients flow through deep networks. Use Kaiming for ReLU, Xavier for tanh/sigmoid, and always initialise biases to zero.

Why This Matters: Neural Networks for Channel Estimation

In 5G NR, neural networks can learn the mapping from received pilot signals to channel estimates. An MLP with input dimension equal to the number of pilot subcarriers and output equal to the full channel dimension replaces traditional LS/MMSE estimators, learning non-linear dependencies in the channel structure.

See full treatment in Chapter 33

Historical Note: From Perceptrons to Deep Learning

1958-2012

Rosenblatt's Perceptron (1958) was a single-layer linear classifier. Minsky and Papert (1969) showed it could not learn XOR, triggering the first AI winter. Backpropagation (Rumelhart, Hinton, Williams, 1986) enabled training multi-layer networks, but deep networks only became practical with GPU computing (Krizhevsky et al., 2012).

Historical Note: PyTorch: From Lua Torch to Python

2017-present

PyTorch emerged from the Lua-based Torch framework. Released by Facebook AI Research in 2017, its define-by-run (eager) execution model and Pythonic API quickly made it the dominant research framework. PyTorch 2.0 (2023) introduced torch.compile for graph-mode optimization without sacrificing the eager programming model.

nn.Module

Base class for all neural network components in PyTorch.

Related: nn.Parameter, nn.Sequential

nn.Parameter

A Tensor subclass automatically registered as a learnable parameter when assigned to a Module.

Related: nn.Module

nn.Sequential

A container that chains modules sequentially, passing each output as input to the next.

Related: nn.Module

Autograd

PyTorch's automatic differentiation engine that records operations on tensors and computes gradients via reverse-mode AD.

ReLU

Rectified Linear Unit activation: \text{ReLU}(x) = \max(0, x). The default activation for hidden layers.

Kaiming Initialisation

Weight initialisation that accounts for ReLU activations: W \sim \mathcal{N}(0, 2/n_{\text{in}}).

Activation Function Comparison

| Activation | Formula | Range | Gradient at 0 | Best For |
| --- | --- | --- | --- | --- |
| ReLU | \max(0, x) | [0, \infty) | undefined (0.5 subgradient) | Default hidden layers |
| LeakyReLU | \max(\alpha x, x) | (-\infty, \infty) | 1 | Avoiding dead neurons |
| GELU | x \Phi(x) | (-0.17, \infty) | 0.5 | Transformers |
| Sigmoid | 1/(1+e^{-x}) | (0, 1) | 0.25 | Binary output / gates |
| Tanh | (e^x - e^{-x})/(e^x + e^{-x}) | (-1, 1) | 1 | Bounded output / RNNs |