Conv2d, BatchNorm, Activation

Why Convolutions?

Fully connected layers ignore spatial structure: every output neuron connects to every input, so a 256x256 image means 65,536 inputs per neuron. Convolutions exploit two key priors: translation equivariance (the same feature can appear anywhere) and locality (nearby pixels are more related than distant ones). This reduces parameters by orders of magnitude.

Definition:

2D Convolution Layer

nn.Conv2d computes a 2D cross-correlation:

y[c_{\text{out}}, h, w] = b[c_{\text{out}}] + \sum_{c=0}^{C_{\text{in}}-1} \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} K[c_{\text{out}}, c, i, j] \cdot x[c, h \cdot s + i, w \cdot s + j]

Parameters: C_{\text{out}} \times C_{\text{in}} \times k \times k + C_{\text{out}}.

conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=3, stride=1, padding=1)
# Input: (B, 3, H, W) → Output: (B, 64, H, W)

With stride=1 and padding=k//2 (for odd k), the spatial dimensions are preserved. This is the most common configuration in modern architectures.

Definition:

Convolution Output Size

For input size H_{\text{in}}, kernel size k, padding p, stride s, and dilation d:

H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - d(k-1) - 1}{s} + 1 \right\rfloor

For padding='same' with stride=1: H_{\text{out}} = H_{\text{in}}.
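The output-size formula is easy to check with a small helper (a sketch; `conv_out_size` is our own name, not a PyTorch API):

```python
import math

def conv_out_size(h_in, k, p=0, s=1, d=1):
    """Output spatial size of a convolution: floor((h + 2p - d(k-1) - 1)/s + 1)."""
    return math.floor((h_in + 2 * p - d * (k - 1) - 1) / s + 1)

print(conv_out_size(32, k=3, p=1, s=1))   # 32  ('same' padding, stride 1)
print(conv_out_size(32, k=3, p=1, s=2))   # 16  (stride 2 halves the size)
print(conv_out_size(224, k=7, p=3, s=2))  # 112
```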

Definition:

Batch Normalisation

BatchNorm normalises activations across the batch dimension:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

where \mu_B and \sigma_B^2 are the per-channel batch mean and variance, and \gamma, \beta are learnable scale and shift parameters.

bn = nn.BatchNorm2d(num_features=64)
# During eval: uses running mean/variance

BatchNorm behaves differently in train vs eval mode. Always call model.train() and model.eval() appropriately.
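The per-channel computation can be sketched in plain Python (`batchnorm_channel` is an illustrative helper, not a PyTorch API):

```python
def batchnorm_channel(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel's activations over the batch, then scale/shift."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # biased variance, as in training
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in xs]

out = batchnorm_channel([1.0, 2.0, 3.0, 4.0])
# out has (approximately) zero mean and unit variance before gamma/beta
```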

Definition:

Pooling Layers

Pooling reduces spatial dimensions:

  • Max pooling: y[h,w] = \max_{i,j \in \text{window}} x[h \cdot s + i, w \cdot s + j]
  • Average pooling: y[h,w] = \frac{1}{k^2} \sum_{i,j} x[h \cdot s + i, w \cdot s + j]
  • Adaptive pooling: nn.AdaptiveAvgPool2d((1,1)) → global average

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves spatial dims
gap = nn.AdaptiveAvgPool2d((1, 1))  # → (B, C, 1, 1)
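Max pooling is simple enough to sketch in plain Python on a list-of-lists input (`maxpool2d` is our own helper name):

```python
def maxpool2d(x, k=2, s=2):
    """Max pooling over a 2D list-of-lists; takes the max in each k x k window."""
    h, w = len(x), len(x[0])
    return [[max(x[i * s + di][j * s + dj] for di in range(k) for dj in range(k))
             for j in range((w - k) // s + 1)]
            for i in range((h - k) // s + 1)]

x = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [9, 1, 2, 3],
     [5, 6, 4, 0]]
print(maxpool2d(x))  # [[4, 8], [9, 4]]  (4x4 halved to 2x2)
```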

Definition:

Depthwise Separable Convolution

Splits a standard convolution into:

  1. Depthwise: one k \times k filter per channel (groups=C_{\text{in}})
  2. Pointwise: 1 \times 1 convolution to mix channels

Parameter reduction: \frac{C_{\text{in}} \cdot k^2 + C_{\text{in}} \cdot C_{\text{out}}}{C_{\text{in}} \cdot C_{\text{out}} \cdot k^2} \approx \frac{1}{C_{\text{out}}} + \frac{1}{k^2}

depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64)
pointwise = nn.Conv2d(64, 128, 1)
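The saving is easy to check numerically (helper names are ours):

```python
def standard_params(c_in, c_out, k):
    """Parameters of a standard convolution (ignoring bias)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise (c_in * k^2) plus pointwise (c_in * c_out)."""
    return c_in * k * k + c_in * c_out

std = standard_params(64, 128, 3)   # 73,728
sep = separable_params(64, 128, 3)  # 8,768
# sep / std equals 1/128 + 1/9 (about 0.119), matching the reduction formula
```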

Theorem: Translation Equivariance of Convolution

Let T_{\mathbf{d}} denote spatial translation by \mathbf{d}. For a convolution operator \mathcal{C}:

\mathcal{C}(T_{\mathbf{d}}[\mathbf{x}]) = T_{\mathbf{d}}[\mathcal{C}(\mathbf{x})]

Convolution commutes with translation. An object detected at one location is detected identically at another.

The same filter slides across the entire image, so it responds to the same pattern regardless of position. This is why CNNs need far fewer parameters than fully connected networks for image tasks.
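Equivariance can be verified numerically; here is a sketch with a 1D circular convolution (our own helpers `circ_conv` and `shift`), where the identity holds exactly:

```python
def circ_conv(x, k):
    """1D circular cross-correlation of signal x with kernel k."""
    n = len(x)
    return [sum(k[j] * x[(i + j) % n] for j in range(len(k))) for i in range(n)]

def shift(x, d):
    """Circular translation of x by d positions."""
    return x[-d:] + x[:-d]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
k = [0.5, 1.0, 0.5]
# Convolving a shifted input equals shifting the convolved output
assert circ_conv(shift(x, 2), k) == shift(circ_conv(x, k), 2)
```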

Theorem: Receptive Field Growth

For L layers of k \times k convolutions with stride 1:

r_L = 1 + L(k - 1)

With stride s_l at layer l:

r_L = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i

Dilated convolutions with rate d have an effective kernel size of k + (k-1)(d-1), growing the receptive field without adding parameters.

Each layer sees a larger region of the input. Deeper networks with more layers (or dilated convolutions) can capture longer-range spatial dependencies.
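The growth formula can be checked with a small helper (`receptive_field` is our own name), tracking the cumulative stride ("jump") layer by layer:

```python
def receptive_field(layers):
    """Receptive field for a stack of conv layers given as (kernel, stride)
    tuples, first layer first; jump is the product of preceding strides."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([(3, 1)] * 4))              # 1 + 4*(3-1) = 9
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # 1 + 6 + 4 + 8 = 19
```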

Theorem: CNN vs FC Parameter Count

For an image of size H \times W \times C_{\text{in}} mapped to C_{\text{out}} feature maps:

  • FC: H \cdot W \cdot C_{\text{in}} \cdot H \cdot W \cdot C_{\text{out}} parameters
  • Conv: C_{\text{in}} \cdot C_{\text{out}} \cdot k^2 parameters

For a 256x256 RGB image with 64 output channels and a 3 \times 3 kernel: FC \approx 8.2 \times 10^{11} (about 825 billion), Conv = 1,728. Ratio: roughly 4.8 \times 10^8.

Weight sharing across spatial positions gives CNNs their massive parameter efficiency for structured (grid) data.
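The counts follow directly from the two bullet formulas; with C_out = 64 output maps the dense layer's count is astronomically larger:

```python
H, W, C_in, C_out, k = 256, 256, 3, 64, 3

fc_params = (H * W * C_in) * (H * W * C_out)  # every input to every output unit
conv_params = C_in * C_out * k * k            # one shared filter bank

print(conv_params)               # 1728
print(fc_params)                 # 824633720832 (~8.2e11)
print(fc_params // conv_params)  # ~4.8e8
```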

Example: The Conv-BN-ReLU Building Block

Implement the standard Conv-BN-ReLU block used in almost every CNN.
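One way to package it (a sketch assuming PyTorch; `ConvBNReLU` is our own class name):

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    """Conv → BatchNorm → ReLU; bias=False because BN's shift makes it redundant."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

block = ConvBNReLU(3, 64)
y = block(torch.randn(2, 3, 32, 32))  # → (2, 64, 32, 32)
```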

Example: A Simple CNN for Image Classification

Build a CNN that takes 32x32 RGB images and outputs 10 class logits.
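A minimal sketch of such a network (our own illustrative `SimpleCNN`, not a canonical architecture):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """32x32 RGB input → 10 class logits, via two Conv-BN-ReLU stages."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1, bias=False), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),               # 32 → 16
            nn.Conv2d(32, 64, 3, padding=1, bias=False), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),               # 16 → 8
            nn.AdaptiveAvgPool2d((1, 1)),  # global average → (B, 64, 1, 1)
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SimpleCNN()(torch.randn(4, 3, 32, 32))  # → (4, 10)
```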

Example: Computing Feature Map Dimensions

Calculate the output spatial size after Conv2d(3, 64, 7, stride=2, padding=3) followed by MaxPool2d(3, stride=2, padding=1) on a 224x224 input.
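Worked through with the output-size formula (integer floor division; `out_size` is our own helper):

```python
def out_size(h, k, p, s):
    """floor((h + 2p - k) / s) + 1, the size formula with dilation d = 1."""
    return (h + 2 * p - k) // s + 1

h = out_size(224, k=7, p=3, s=2)  # conv:    (224 + 6 - 7) // 2 + 1 = 112
h = out_size(h, k=3, p=1, s=2)    # maxpool: (112 + 2 - 3) // 2 + 1 = 56
print(h)  # 56
```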

2D Convolution Explorer

Visualise how kernel size, stride, padding, and dilation affect the output.

CNN Feature Map Visualiser

See how different filters transform input images.

Receptive Field Calculator

Compute receptive field size for different architectures.

Convolution Sliding Window

Watch a kernel slide across a 2D input computing the output.

Conv2d Operation Diagram

A 3x3 kernel slides across the input feature map, computing a weighted sum at each position to produce one pixel in the output feature map.

CNN Architecture Timeline

Evolution from LeNet (1998) through AlexNet, VGG, ResNet, to modern EfficientNet and ConvNeXt architectures.

Quick Check

What is the output spatial size of Conv2d(64, 128, 3, stride=2, padding=1) on a 32x32 input?

32x32

16x16

15x15

Quick Check

Why set bias=False in Conv2d when followed by BatchNorm?

To reduce memory usage

BatchNorm's learnable shift parameter makes the conv bias redundant

Bias causes training instability

Quick Check

How many parameters does Conv2d(64, 128, 3, bias=False) have?

73,728

73,856

1,152

Common Mistake: Channel Ordering: NCHW vs NHWC

Mistake:

Passing images in (B, H, W, C) format (e.g., from PIL/OpenCV) to PyTorch Conv2d which expects (B, C, H, W).

Correction:

Always permute: x = x.permute(0, 3, 1, 2) or use torchvision.transforms.ToTensor() which handles the conversion.
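For instance (a minimal sketch assuming PyTorch):

```python
import torch

x_nhwc = torch.randn(8, 224, 224, 3)  # channels-last, e.g. loaded via NumPy/PIL
x_nchw = x_nhwc.permute(0, 3, 1, 2)   # → (8, 3, 224, 224), as Conv2d expects
```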

Common Mistake: BatchNorm in Training vs Eval

Mistake:

Not switching BatchNorm to eval mode during inference, causing predictions to depend on batch composition.

Correction:

Always use model.eval() for inference and model.train() for training. BatchNorm uses running statistics in eval mode.

Common Mistake: BatchNorm with Very Small Batches

Mistake:

Using BatchNorm with batch size 1 or 2, where batch statistics are extremely noisy.

Correction:

Use GroupNorm or InstanceNorm for small batches. GroupNorm with num_groups=32 is a robust alternative.
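A drop-in sketch (assuming PyTorch):

```python
import torch
import torch.nn as nn

# GroupNorm normalises over channel groups, so it is independent of batch size
norm = nn.GroupNorm(num_groups=32, num_channels=64)
y = norm(torch.randn(1, 64, 8, 8))  # works even with batch size 1
```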

Key Takeaway

The Conv-BN-ReLU block is the fundamental building unit of CNNs. Use kernel_size=3, padding=1 to preserve spatial dimensions. Set bias=False when using BatchNorm.

Key Takeaway

Convolutions are translation equivariant and share weights across spatial positions, reducing parameters by a factor of O(H \cdot W) compared to fully connected layers.

Why This Matters: CNNs for Channel Estimation on OFDM Grids

The OFDM resource grid (subcarriers x OFDM symbols) is a 2D structure, making it a natural fit for CNNs. Networks like ChannelNet apply Conv2d layers to interpolate channel estimates from sparse pilot positions across the full time-frequency grid.

See full treatment in Chapter 33

Historical Note: LeNet-5: The First CNN

1998

Yann LeCun's LeNet-5 (1998) applied two convolutional layers followed by pooling and fully connected layers to handwritten digit recognition. This architecture remained the blueprint for CNNs for over a decade.

Historical Note: AlexNet and the Deep Learning Revolution

2012

AlexNet (Krizhevsky et al., 2012) won ImageNet by a massive margin using GPUs to train a deeper CNN with ReLU activations and dropout. This result single-handedly triggered the modern deep learning era.

Conv2d

2D convolution layer that applies learned filters across spatial dimensions of the input.

Related: Batch Normalisation

Batch Normalisation

Normalises activations across the batch, stabilising training and allowing higher learning rates.

Receptive Field

The region of the input that influences a particular output neuron.

Feature Map

The 2D output of a convolutional layer. Each channel represents a different learned feature.

Stride

The step size of the convolution kernel as it slides across the input. Stride > 1 downsamples.

Normalisation Layer Comparison

Layer        | Normalises Over    | Batch Dependence | Best For
-------------|--------------------|------------------|-----------------------------------
BatchNorm    | Batch + spatial    | Yes              | Large-batch training (batch >= 16)
LayerNorm    | Channel + spatial  | No               | Transformers, NLP
InstanceNorm | Spatial only       | No               | Style transfer, small batch
GroupNorm    | Groups of channels | No               | Small batch, detection