Conv2d, BatchNorm, Activation
Why Convolutions?
Fully connected layers treat each input dimension independently: for a 256x256 grayscale image, that is 65,536 inputs per neuron. Convolutions exploit two key priors: translation equivariance (the same feature can appear anywhere) and locality (nearby pixels are more related than distant ones). This reduces parameters by orders of magnitude.
Definition: 2D Convolution Layer
nn.Conv2d computes a 2D cross-correlation:
$$\mathrm{out}(N_i, C^{out}_j) = \mathrm{bias}(C^{out}_j) + \sum_{k=0}^{C_{in}-1} \mathrm{weight}(C^{out}_j, k) \star \mathrm{input}(N_i, k)$$
Parameters: $C_{out} \times C_{in} \times K_h \times K_w$ weights, plus $C_{out}$ biases when bias=True.
conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=3, stride=1, padding=1)
# Input: (B, 3, H, W) -> Output: (B, 64, H, W)
With padding=k//2 and stride=1, the spatial dimensions are preserved.
This is the most common configuration in modern architectures.
Definition: Convolution Output Size
For input size $H_{in}$, kernel size $k$, padding $p$, stride $s$, and dilation $d$:
$$H_{out} = \left\lfloor \frac{H_{in} + 2p - d(k - 1) - 1}{s} + 1 \right\rfloor$$
For padding='same' with stride=1: $H_{out} = H_{in}$.
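The formula above translates directly into a small helper (a sketch; `conv_out_size` is an illustrative name, not a PyTorch function):

```python
import math

def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output spatial size of a conv/pool layer, per the formula above."""
    return math.floor((size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)

# 3x3 conv, stride 1, padding 1 preserves size
print(conv_out_size(224, kernel=3, stride=1, padding=1))  # 224
# 7x7 conv, stride 2, padding 3 halves it (ResNet stem)
print(conv_out_size(224, kernel=7, stride=2, padding=3))  # 112
```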
Definition: Batch Normalisation
BatchNorm normalises activations across the batch dimension:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
where $\mu_B$ and $\sigma_B^2$ are the per-channel batch mean and variance, and $\gamma$, $\beta$ are learnable scale/shift parameters.
bn = nn.BatchNorm2d(num_features=64)
# During eval: uses running mean/variance
BatchNorm behaves differently in train vs eval mode. Always call
model.train() and model.eval() appropriately.
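A minimal NumPy sketch of BatchNorm's training-mode forward pass on an (B, C, H, W) batch (illustrative only; the real nn.BatchNorm2d also tracks running statistics for use in eval mode):

```python
import numpy as np

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    """Normalise per channel over batch + spatial dims, then scale/shift."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel batch mean
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 4, 16, 16)
y = batchnorm2d_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=(0, 2, 3)))  # ~0 per channel
print(y.var(axis=(0, 2, 3)))   # ~1 per channel
```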
Definition: Pooling Layers
Pooling reduces spatial dimensions:
- Max pooling: $y_{i,j} = \max_{(m,n) \in \mathcal{W}_{i,j}} x_{m,n}$ keeps the strongest activation in each window
- Average pooling: $y_{i,j} = \frac{1}{|\mathcal{W}_{i,j}|} \sum_{(m,n) \in \mathcal{W}_{i,j}} x_{m,n}$ averages each window
- Adaptive pooling: the output size is fixed regardless of input size; nn.AdaptiveAvgPool2d((1, 1)) → global average
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves spatial dims
gap = nn.AdaptiveAvgPool2d((1, 1))            # -> (B, C, 1, 1)
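The 2x2 max pool can be sketched in NumPy with a reshape trick (a sketch assuming spatial dims divisible by 2; not the library implementation):

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling with stride 2 on a (B, C, H, W) array."""
    b, c, h, w = x.shape
    return x.reshape(b, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
print(maxpool2x2(x)[0, 0])
# [[ 5.  7.]
#  [13. 15.]]
```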
Definition: Depthwise Separable Convolution
Splits a standard convolution into:
- Depthwise: one $k \times k$ filter per channel (groups=C_in)
- Pointwise: $1 \times 1$ convolution to mix channels

Parameter reduction:
$$\frac{k^2 C_{in} + C_{in} C_{out}}{k^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{k^2}$$
depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64)
pointwise = nn.Conv2d(64, 128, 1)
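Counting weights directly (bias-free, illustrative arithmetic) shows the reduction for the 64 → 128 example above:

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def separable_params(c_in, c_out, k):
    """Depthwise (k x k per channel) + pointwise (1 x 1) weights."""
    return k * k * c_in + c_in * c_out

std = conv_params(64, 128, 3)       # 73,728
sep = separable_params(64, 128, 3)  # 576 + 8,192 = 8,768
print(std, sep, std / sep)          # ~8.4x fewer parameters
```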
Theorem: Translation Equivariance of Convolution
Let $T_\delta$ denote spatial translation by $\delta$. For a convolution operator $\mathcal{C}$:
$$\mathcal{C}(T_\delta x) = T_\delta \, \mathcal{C}(x)$$
Convolution commutes with translation. An object detected at one location is detected identically at another.
The same filter slides across the entire image, so it responds to the same pattern regardless of position. This is why CNNs need far fewer parameters than fully connected networks for image tasks.
Theorem: Receptive Field Growth
For $L$ layers of $k \times k$ convolutions with stride 1:
$$RF_L = 1 + L(k - 1)$$
With stride $s_l$ at layer $l$:
$$RF_L = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i$$
Dilated convolutions with rate $d$ effectively use kernel size $d(k - 1) + 1$, growing the receptive field without adding parameters.
Each layer sees a larger region of the input. Deeper networks with more layers (or dilated convolutions) can capture longer-range spatial dependencies.
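The recurrence above translates directly into code (a sketch; layers are given as (kernel, stride) pairs):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride). Returns RF at the network output."""
    rf, jump = 1, 1  # jump = product of strides of earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three stride-1 3x3 convs: RF = 1 + 3*(3-1) = 7
print(receptive_field([(3, 1)] * 3))      # 7
# ResNet stem: 7x7 stride-2 conv, then 3x3 stride-2 pool
print(receptive_field([(7, 2), (3, 2)]))  # 7 + 2*2 = 11
```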
Theorem: CNN vs FC Parameter Count
For an image of size $H \times W \times C_{in}$ mapped to same-size feature maps with $C_{out}$ channels:
- FC: $(H W C_{in}) \times (H W C_{out})$ parameters
- Conv: $k^2 C_{in} C_{out}$ parameters
For a 256x256 RGB image with a $3 \times 3$ kernel: an FC layer mapping to even a single 256x256 output map needs $196{,}608 \times 65{,}536 \approx 12.9$ billion weights, while a conv producing all 64 output channels needs $3 \cdot 3 \cdot 3 \cdot 64 = 1{,}728$. Ratio: $\approx 7.5 \times 10^6$.
Weight sharing across spatial positions gives CNNs their massive parameter efficiency for structured (grid) data.
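The arithmetic behind the 12.9-billion figure, checked directly (the FC count here maps the flattened image to a single 256x256 output map; the conv count covers all 64 output channels):

```python
H, W, C_in, C_out, k = 256, 256, 3, 64, 3

fc_weights = (H * W * C_in) * (H * W)  # dense map to one 256x256 output
conv_weights = k * k * C_in * C_out    # 3x3 kernels for all 64 channels

print(f"FC:   {fc_weights:,}")    # 12,884,901,888
print(f"Conv: {conv_weights:,}")  # 1,728
print(f"Ratio: {fc_weights / conv_weights:,.0f}x")
```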
Example: The Conv-BN-ReLU Building Block
Implement the standard Conv-BN-ReLU block used in almost every CNN.
Implementation
def conv_bn_relu(in_ch, out_ch, kernel_size=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  stride=stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
Why bias=False?
BatchNorm has its own learnable shift ($\beta$), making the convolution bias redundant.
Example: A Simple CNN for Image Classification
Build a CNN that takes 32x32 RGB images and outputs 10 class logits.
Implementation
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(3, 32),
            nn.MaxPool2d(2),          # 32 -> 16
            conv_bn_relu(32, 64),
            nn.MaxPool2d(2),          # 16 -> 8
            conv_bn_relu(64, 128),
            nn.AdaptiveAvgPool2d(1),  # 8 -> 1
        )
        self.classifier = nn.Linear(128, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)
Example: Computing Feature Map Dimensions
Calculate the output spatial size after Conv2d(3, 64, 7, stride=2, padding=3) followed by MaxPool2d(3, stride=2, padding=1) on a 224x224 input.
Step-by-step
After Conv: $\lfloor (224 + 2 \cdot 3 - 7)/2 + 1 \rfloor = 112$
After MaxPool: $\lfloor (112 + 2 \cdot 1 - 3)/2 + 1 \rfloor = 56$
Final: $56 \times 56 \times 64$ — exactly what ResNet's stem produces.
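The same arithmetic as a quick script (the helper name is illustrative):

```python
import math

def out_size(size, k, s, p):
    """floor((size + 2p - k)/s + 1) for a non-dilated conv or pool."""
    return math.floor((size + 2 * p - k) / s + 1)

x = 224
x = out_size(x, k=7, s=2, p=3)  # stem conv: 224 -> 112
x = out_size(x, k=3, s=2, p=1)  # max pool:  112 -> 56
print(x)  # 56
```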
2D Convolution Explorer
Visualise how kernel size, stride, padding, and dilation affect the output.
CNN Feature Map Visualiser
See how different filters transform input images.
Receptive Field Calculator
Compute receptive field size for different architectures.
Convolution Sliding Window
Watch a kernel slide across a 2D input computing the output.
Conv2d Operation Diagram
CNN Architecture Timeline
Quick Check
What is the output spatial size of Conv2d(64, 128, 3, stride=2, padding=1) on a 32x32 input?
32x32
16x16
15x15
floor((32 + 2*1 - 3)/2 + 1) = floor(31/2 + 1) = 16
Quick Check
Why set bias=False in Conv2d when followed by BatchNorm?
To reduce memory usage
BatchNorm's learnable shift parameter makes the conv bias redundant
Bias causes training instability
BN subtracts the mean then adds its own bias, so the conv bias is absorbed.
Quick Check
How many parameters does Conv2d(64, 128, 3, bias=False) have?
73,728
73,856
1,152
64 * 128 * 3 * 3 = 73,728
Common Mistake: Channel Ordering: NCHW vs NHWC
Mistake:
Passing images in (B, H, W, C) format (e.g., from PIL/OpenCV) to PyTorch Conv2d which expects (B, C, H, W).
Correction:
Always permute: x = x.permute(0, 3, 1, 2) or use
torchvision.transforms.ToTensor() which handles the conversion.
Common Mistake: BatchNorm in Training vs Eval
Mistake:
Not switching BatchNorm to eval mode during inference, causing predictions to depend on batch composition.
Correction:
Always use model.eval() for inference and model.train() for
training. BatchNorm uses running statistics in eval mode.
Common Mistake: BatchNorm with Very Small Batches
Mistake:
Using BatchNorm with batch size 1 or 2, where batch statistics are extremely noisy.
Correction:
Use GroupNorm or InstanceNorm for small batches. GroupNorm with
num_groups=32 is a robust alternative.
Key Takeaway
The Conv-BN-ReLU block is the fundamental building unit of CNNs. Use kernel_size=3, padding=1 to preserve spatial dimensions. Set bias=False when using BatchNorm.
Key Takeaway
Convolutions are translation equivariant and share weights across spatial positions, reducing parameters by orders of magnitude compared to fully connected layers.
Why This Matters: CNNs for Channel Estimation on OFDM Grids
The OFDM resource grid (subcarriers x OFDM symbols) is a 2D structure, making it a natural fit for CNNs. Networks like ChannelNet apply Conv2d layers to interpolate channel estimates from sparse pilot positions across the full time-frequency grid.
See full treatment in Chapter 33
Historical Note: LeNet-5: The First CNN
1998: Yann LeCun's LeNet-5 applied two convolutional layers followed by pooling and fully connected layers to handwritten digit recognition. This architecture remained the blueprint for CNNs for over a decade.
Historical Note: AlexNet and the Deep Learning Revolution
2012: AlexNet (Krizhevsky et al., 2012) won ImageNet by a massive margin, using GPUs to train a deeper CNN with ReLU activations and dropout. This result single-handedly triggered the modern deep learning era.
Conv2d
2D convolution layer that applies learned filters across spatial dimensions of the input.
Related: Batch Normalisation
Batch Normalisation
Normalises activations across the batch, stabilising training and allowing higher learning rates.
Receptive Field
The region of the input that influences a particular output neuron.
Feature Map
The 2D output of a convolutional layer. Each channel represents a different learned feature.
Stride
The step size of the convolution kernel as it slides across the input. Stride > 1 downsamples.
Normalisation Layer Comparison
| Layer | Normalises Over | Batch Dependence | Best For |
|---|---|---|---|
| BatchNorm | Batch + Spatial | Yes | Large-batch training (batch >= 16) |
| LayerNorm | Channel + Spatial | No | Transformers, NLP |
| InstanceNorm | Spatial only | No | Style transfer, small batch |
| GroupNorm | Groups of channels | No | Small batch, detection |