Conv2d, BatchNorm, Activation

Why Convolutions?

Fully connected layers ignore spatial structure: every output neuron connects to every input, so a 256x256 image means 65,536 inputs per neuron. Convolutions exploit two key priors: translation equivariance (the same feature can appear anywhere) and locality (nearby pixels are more related than distant ones). This reduces parameters by orders of magnitude.

Definition:

2D Convolution Layer

nn.Conv2d computes a 2D cross-correlation:

y[c_{\text{out}}, h, w] = b[c_{\text{out}}] + \sum_{c=0}^{C_{\text{in}}-1} \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} K[c_{\text{out}}, c, i, j] \cdot x[c, h \cdot s + i, w \cdot s + j]

Parameters: C_{\text{out}} \times C_{\text{in}} \times k \times k + C_{\text{out}}.

conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=3, stride=1, padding=1)
# Input: (B, 3, H, W) → Output: (B, 64, H, W)

With stride=1 and padding=k//2 (for odd k), the spatial dimensions are preserved. This is the most common configuration in modern architectures.

Definition:

Convolution Output Size

For input size H_{\text{in}}, kernel size k, padding p, stride s, and dilation d:

H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - d(k-1) - 1}{s} + 1 \right\rfloor

For padding='same' with stride=1: H_{\text{out}} = H_{\text{in}}.
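The output-size formula is easy to check with a small helper (a sketch; `conv_out_size` is our own name, not a PyTorch API):

```python
import math

def conv_out_size(h_in, k, p=0, s=1, d=1):
    """Output spatial size of a convolution: floor((h + 2p - d(k-1) - 1)/s + 1)."""
    return math.floor((h_in + 2 * p - d * (k - 1) - 1) / s + 1)

print(conv_out_size(32, k=3, p=1, s=1))   # 32  ('same' padding, stride 1)
print(conv_out_size(32, k=3, p=1, s=2))   # 16  (stride 2 halves the size)
print(conv_out_size(224, k=7, p=3, s=2))  # 112
```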

Definition:

Batch Normalisation

BatchNorm normalises activations across the batch dimension:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

where \mu_B and \sigma_B^2 are the per-channel batch mean and variance, and \gamma, \beta are learnable scale and shift parameters.

bn = nn.BatchNorm2d(num_features=64)
# During eval: uses running mean/variance

BatchNorm behaves differently in train vs eval mode. Always call model.train() and model.eval() appropriately.
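The per-channel computation can be sketched in plain Python (`batchnorm_channel` is an illustrative helper, not a PyTorch API):

```python
def batchnorm_channel(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel's activations over the batch, then scale/shift."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # biased variance, as in training
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in xs]

out = batchnorm_channel([1.0, 2.0, 3.0, 4.0])
# out has (approximately) zero mean and unit variance before gamma/beta
```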

Definition:

Pooling Layers

Pooling reduces spatial dimensions:

  • Max pooling: y[h,w] = \max_{i,j \in \text{window}} x[h \cdot s + i, w \cdot s + j]
  • Average pooling: y[h,w] = \frac{1}{k^2} \sum_{i,j} x[h \cdot s + i, w \cdot s + j]
  • Adaptive pooling: nn.AdaptiveAvgPool2d((1,1)) → global average

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves spatial dims
gap = nn.AdaptiveAvgPool2d((1, 1))  # → (B, C, 1, 1)
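Max pooling is simple enough to sketch in plain Python on a list-of-lists input (`maxpool2d` is our own helper name):

```python
def maxpool2d(x, k=2, s=2):
    """Max pooling over a 2D list-of-lists; takes the max in each k x k window."""
    h, w = len(x), len(x[0])
    return [[max(x[i * s + di][j * s + dj] for di in range(k) for dj in range(k))
             for j in range((w - k) // s + 1)]
            for i in range((h - k) // s + 1)]

x = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [9, 1, 2, 3],
     [5, 6, 4, 0]]
print(maxpool2d(x))  # [[4, 8], [9, 4]]  (4x4 halved to 2x2)
```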

Definition:

Depthwise Separable Convolution

Splits a standard convolution into:

  1. Depthwise: one k \times k filter per channel (groups=C_{\text{in}})
  2. Pointwise: 1 \times 1 convolution to mix channels

Parameter reduction: \frac{C_{\text{in}} \cdot k^2 + C_{\text{in}} \cdot C_{\text{out}}}{C_{\text{in}} \cdot C_{\text{out}} \cdot k^2} \approx \frac{1}{C_{\text{out}}} + \frac{1}{k^2}

depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64)
pointwise = nn.Conv2d(64, 128, 1)
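The saving is easy to check numerically (helper names are ours):

```python
def standard_params(c_in, c_out, k):
    """Parameters of a standard convolution (ignoring bias)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise (c_in * k^2) plus pointwise (c_in * c_out)."""
    return c_in * k * k + c_in * c_out

std = standard_params(64, 128, 3)   # 73,728
sep = separable_params(64, 128, 3)  # 8,768
# sep / std equals 1/128 + 1/9 (about 0.119), matching the reduction formula
```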

Theorem: Translation Equivariance of Convolution

Let T_{\mathbf{d}} denote spatial translation by \mathbf{d}. For a convolution operator \mathcal{C}:

\mathcal{C}(T_{\mathbf{d}}[\mathbf{x}]) = T_{\mathbf{d}}[\mathcal{C}(\mathbf{x})]

Convolution commutes with translation. An object detected at one location is detected identically at another.

The same filter slides across the entire image, so it responds to the same pattern regardless of position. This is why CNNs need far fewer parameters than fully connected networks for image tasks.
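Equivariance can be verified numerically; here is a sketch with a 1D circular convolution (our own helpers `circ_conv` and `shift`), where the identity holds exactly:

```python
def circ_conv(x, k):
    """1D circular cross-correlation of signal x with kernel k."""
    n = len(x)
    return [sum(k[j] * x[(i + j) % n] for j in range(len(k))) for i in range(n)]

def shift(x, d):
    """Circular translation of x by d positions."""
    return x[-d:] + x[:-d]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
k = [0.5, 1.0, 0.5]
# Convolving a shifted input equals shifting the convolved output
assert circ_conv(shift(x, 2), k) == shift(circ_conv(x, k), 2)
```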

Theorem: Receptive Field Growth

For L layers of k \times k convolutions with stride 1:

r_L = 1 + L(k - 1)

With stride s_l at layer l:

r_L = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i

Dilated convolutions with rate d have an effective kernel size of k + (k-1)(d-1), growing the receptive field without adding parameters.

Each layer sees a larger region of the input. Deeper networks with more layers (or dilated convolutions) can capture longer-range spatial dependencies.
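The growth formula can be checked with a small helper (`receptive_field` is our own name), tracking the cumulative stride ("jump") layer by layer:

```python
def receptive_field(layers):
    """Receptive field for a stack of conv layers given as (kernel, stride)
    tuples, first layer first; jump is the product of preceding strides."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([(3, 1)] * 4))              # 1 + 4*(3-1) = 9
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # 1 + 6 + 4 + 8 = 19
```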

Theorem: CNN vs FC Parameter Count

For an image of size H \times W \times C_{\text{in}} mapped to C_{\text{out}} feature maps:

  • FC: H \cdot W \cdot C_{\text{in}} \cdot H \cdot W \cdot C_{\text{out}} parameters
  • Conv: C_{\text{in}} \cdot C_{\text{out}} \cdot k^2 parameters

For a 256x256 RGB image with 64 output channels and a 3 \times 3 kernel: FC \approx 8.2 \times 10^{11} (about 825 billion), Conv = 1,728. Ratio: roughly 4.8 \times 10^8.

Weight sharing across spatial positions gives CNNs their massive parameter efficiency for structured (grid) data.
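The counts follow directly from the two bullet formulas; with C_out = 64 output maps the dense layer's count is astronomically larger:

```python
H, W, C_in, C_out, k = 256, 256, 3, 64, 3

fc_params = (H * W * C_in) * (H * W * C_out)  # every input to every output unit
conv_params = C_in * C_out * k * k            # one shared filter bank

print(conv_params)               # 1728
print(fc_params)                 # 824633720832 (~8.2e11)
print(fc_params // conv_params)  # ~4.8e8
```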

Example: The Conv-BN-ReLU Building Block

Implement the standard Conv-BN-ReLU block used in almost every CNN.
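One way to package it (a sketch assuming PyTorch; `ConvBNReLU` is our own class name):

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    """Conv → BatchNorm → ReLU; bias=False because BN's shift makes it redundant."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

block = ConvBNReLU(3, 64)
y = block(torch.randn(2, 3, 32, 32))  # → (2, 64, 32, 32)
```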

Example: A Simple CNN for Image Classification

Build a CNN that takes 32x32 RGB images and outputs 10 class logits.
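A minimal sketch of such a network (our own illustrative `SimpleCNN`, not a canonical architecture):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """32x32 RGB input → 10 class logits, via two Conv-BN-ReLU stages."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1, bias=False), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),               # 32 → 16
            nn.Conv2d(32, 64, 3, padding=1, bias=False), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),               # 16 → 8
            nn.AdaptiveAvgPool2d((1, 1)),  # global average → (B, 64, 1, 1)
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SimpleCNN()(torch.randn(4, 3, 32, 32))  # → (4, 10)
```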

Example: Computing Feature Map Dimensions

Calculate the output spatial size after Conv2d(3, 64, 7, stride=2, padding=3) followed by MaxPool2d(3, stride=2, padding=1) on a 224x224 input.
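Worked through with the output-size formula (integer floor division; `out_size` is our own helper):

```python
def out_size(h, k, p, s):
    """floor((h + 2p - k) / s) + 1, the size formula with dilation d = 1."""
    return (h + 2 * p - k) // s + 1

h = out_size(224, k=7, p=3, s=2)  # conv:    (224 + 6 - 7) // 2 + 1 = 112
h = out_size(h, k=3, p=1, s=2)    # maxpool: (112 + 2 - 3) // 2 + 1 = 56
print(h)  # 56
```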

2D Convolution Explorer

Visualise how kernel size, stride, padding, and dilation affect the output.

CNN Feature Map Visualiser

See how different filters transform input images.

Receptive Field Calculator

Compute receptive field size for different architectures.

Convolution Sliding Window

Watch a kernel slide across a 2D input computing the output.

Conv2d Operation Diagram

A 3x3 kernel slides across the input feature map, computing a weighted sum at each position to produce one pixel in the output feature map.

CNN Architecture Timeline

Evolution from LeNet (1998) through AlexNet, VGG, ResNet, to modern EfficientNet and ConvNeXt architectures.

Quick Check

What is the output spatial size of Conv2d(64, 128, 3, stride=2, padding=1) on a 32x32 input?

32x32

16x16

15x15

Quick Check

Why set bias=False in Conv2d when followed by BatchNorm?

To reduce memory usage

BatchNorm's learnable shift parameter makes the conv bias redundant

Bias causes training instability

Quick Check

How many parameters does Conv2d(64, 128, 3, bias=False) have?

73,728

73,856

1,152

Common Mistake: Channel Ordering: NCHW vs NHWC

Mistake:

Passing images in (B, H, W, C) format (e.g., from PIL/OpenCV) to PyTorch Conv2d which expects (B, C, H, W).

Correction:

Always permute: x = x.permute(0, 3, 1, 2) or use torchvision.transforms.ToTensor() which handles the conversion.
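For instance (a minimal sketch assuming PyTorch):

```python
import torch

x_nhwc = torch.randn(8, 224, 224, 3)  # channels-last, e.g. loaded via NumPy/PIL
x_nchw = x_nhwc.permute(0, 3, 1, 2)   # → (8, 3, 224, 224), as Conv2d expects
```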

Common Mistake: BatchNorm in Training vs Eval

Mistake:

Not switching BatchNorm to eval mode during inference, causing predictions to depend on batch composition.

Correction:

Always use model.eval() for inference and model.train() for training. BatchNorm uses running statistics in eval mode.

Common Mistake: BatchNorm with Very Small Batches

Mistake:

Using BatchNorm with batch size 1 or 2, where batch statistics are extremely noisy.

Correction:

Use GroupNorm or InstanceNorm for small batches. GroupNorm with num_groups=32 is a robust alternative.
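A drop-in sketch (assuming PyTorch):

```python
import torch
import torch.nn as nn

# GroupNorm normalises over channel groups, so it is independent of batch size
norm = nn.GroupNorm(num_groups=32, num_channels=64)
y = norm(torch.randn(1, 64, 8, 8))  # works even with batch size 1
```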

Key Takeaway

The Conv-BN-ReLU block is the fundamental building unit of CNNs. Use kernel_size=3, padding=1 to preserve spatial dimensions. Set bias=False when using BatchNorm.

Key Takeaway

Convolutions are translation equivariant and share weights across spatial positions, reducing parameters by a factor of O(H \cdot W) compared to fully connected layers.

Why This Matters: CNNs for Channel Estimation on OFDM Grids

The OFDM resource grid (subcarriers x OFDM symbols) is a 2D structure, making it a natural fit for CNNs. Networks like ChannelNet apply Conv2d layers to interpolate channel estimates from sparse pilot positions across the full time-frequency grid.

See full treatment in Chapter 33

Historical Note: LeNet-5: The First CNN

1998

Yann LeCun's LeNet-5 (1998) applied two convolutional layers followed by pooling and fully connected layers to handwritten digit recognition. This architecture remained the blueprint for CNNs for over a decade.

Historical Note: AlexNet and the Deep Learning Revolution

2012

AlexNet (Krizhevsky et al., 2012) won ImageNet by a massive margin using GPUs to train a deeper CNN with ReLU activations and dropout. This result single-handedly triggered the modern deep learning era.

Conv2d

2D convolution layer that applies learned filters across spatial dimensions of the input.

Related: Batch Normalisation

Batch Normalisation

Normalises activations across the batch, stabilising training and allowing higher learning rates.

Receptive Field

The region of the input that influences a particular output neuron.

Feature Map

The 2D output of a convolutional layer. Each channel represents a different learned feature.

Stride

The step size of the convolution kernel as it slides across the input. Stride > 1 downsamples.

Normalisation Layer Comparison

Layer        | Normalises Over    | Batch Dependence | Best For
-------------|--------------------|------------------|-----------------------------------
BatchNorm    | Batch + spatial    | Yes              | Large-batch training (batch >= 16)
LayerNorm    | Channel + spatial  | No               | Transformers, NLP
InstanceNorm | Spatial only       | No               | Style transfer, small batch
GroupNorm    | Groups of channels | No               | Small batch, detection