Residual Blocks

Definition:

Residual Block (ResNet)

A residual block adds the input to the output of a nonlinear path:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

where $\mathcal{F}$ is typically two Conv-BN layers with a ReLU in between; the final ReLU is applied after the addition.

import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 Conv-BN layers plus an identity skip."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x                             # identity skip path
        out = F.relu(self.bn1(self.conv1(x)))    # first Conv-BN-ReLU
        out = self.bn2(self.conv2(out))          # second Conv-BN (no ReLU yet)
        return F.relu(out + residual)            # add skip, then final ReLU
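
A quick usage check (a minimal sketch; the batch size and spatial size are arbitrary) confirms that the block preserves the shape of its input:

import torch

block = ResidualBlock(64)
x = torch.randn(2, 64, 32, 32)   # batch of 2, 64 channels, 32x32 feature maps
y = block(x)
print(y.shape)                   # torch.Size([2, 64, 32, 32]), same shape as x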

When dimensions change (stride > 1 or channel change), use a 1x1 convolution as the skip connection.
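
As a minimal sketch of that case (the class name DownsampleResidualBlock is illustrative, not part of the original code above), the projection shortcut can look like this:

class DownsampleResidualBlock(nn.Module):
    """Residual block whose first conv changes stride/channels; the skip
    path uses a 1x1 conv so shapes match before the addition."""

    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection so the skip path matches the new spatial size and channels
        self.skip = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.skip(x))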

Definition:

Bottleneck Block

The bottleneck reduces computation by using 1x1 convolutions:

$$\text{1x1 Conv}(C \to C/4) \to \text{3x3 Conv}(C/4 \to C/4) \to \text{1x1 Conv}(C/4 \to C)$$

This reduces the parameter count from $2 \cdot 9C^2$ for the basic block (two 3x3 convs) to $C^2/4 + 9(C/4)^2 + C^2/4 = C^2/2 + 9C^2/16 \approx 1.06\,C^2$, roughly a 17x reduction at the same channel width $C$.

class Bottleneck(nn.Module):
    """Bottleneck block: 1x1 reduce -> 3x3 -> 1x1 restore, plus identity skip."""

    def __init__(self, channels, expansion=4):
        super().__init__()
        mid = channels // expansion                                  # reduced width for the 3x3 conv
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)         # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)   # 3x3 at reduced width
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)         # 1x1 restore
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)                                       # identity skip, then final ReLU
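
To check the parameter arithmetic above, one can count the learnable weights of both blocks directly (a quick sketch; the channel width 256 is an arbitrary choice):

basic = ResidualBlock(256)
bottleneck = Bottleneck(256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(basic))        # ~1.18M, dominated by the two 9 * 256^2 conv weights
print(count(bottleneck))   # ~0.07M, roughly 17x fewer (plus small BatchNorm terms)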

Theorem: Gradient Flow in Residual Networks

For a residual network with $L$ blocks, unrolling the recursion $\mathbf{x}_{i+1} = \mathbf{x}_i + \mathcal{F}_i(\mathbf{x}_i)$ gives $\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}_i(\mathbf{x}_i)$, so the gradient of the loss with respect to the input of block $l$ decomposes as:

$$\frac{\partial L}{\partial \mathbf{x}_l} = \frac{\partial L}{\partial \mathbf{x}_L} \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}_i(\mathbf{x}_i)\right)$$

The "1" term provides a direct gradient path that does not vanish regardless of depth. This is why ResNets can be trained with 100+ layers while plain networks cannot.

The skip connection acts as a "gradient highway": even if the learned transformation $\mathcal{F}$ has vanishing gradients, the identity path always carries signal.

Example: Building a ResNet-18 from Scratch

Implement ResNet-18: 4 stages with [2, 2, 2, 2] residual blocks, channels [64, 128, 256, 512].
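
A minimal sketch of that assembly, assuming the ResidualBlock and DownsampleResidualBlock classes defined above (the stem uses the standard 7x7 conv and max-pool, and the stages follow the [2, 2, 2, 2] plan):

class ResNet18(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem: 7x7 conv, stride 2, then 3x3 max-pool, stride 2
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # 4 stages of [2, 2, 2, 2] blocks with channels [64, 128, 256, 512];
        # each stage after the first starts with a stride-2 downsampling block
        self.stage1 = nn.Sequential(ResidualBlock(64), ResidualBlock(64))
        self.stage2 = nn.Sequential(DownsampleResidualBlock(64, 128), ResidualBlock(128))
        self.stage3 = nn.Sequential(DownsampleResidualBlock(128, 256), ResidualBlock(256))
        self.stage4 = nn.Sequential(DownsampleResidualBlock(256, 512), ResidualBlock(512))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.stage4(self.stage3(self.stage2(self.stage1(x))))
        return self.head(x)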

Skip Connection Effect on Gradient Flow

Compare gradient magnitudes with and without skip connections.
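
One way to run that comparison (a minimal sketch; depth, width, and input shape are arbitrary, and BatchNorm is deliberately omitted to isolate the effect of the skip connection) is to backprop a dummy loss through the same conv stack with and without the skip and inspect the gradient norm at the input:

import torch
import torch.nn as nn

def input_grad_norm(use_skip, depth=20, channels=64):
    """Gradient norm at the input after backprop through `depth` conv layers."""
    torch.manual_seed(0)
    convs = nn.ModuleList(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False) for _ in range(depth)
    )
    x = torch.randn(4, channels, 16, 16, requires_grad=True)
    out = x
    for conv in convs:
        out = torch.relu(conv(out) + out) if use_skip else torch.relu(conv(out))
    out.mean().backward()
    return x.grad.norm().item()

print("with skip   :", input_grad_norm(True))    # typically much larger
print("without skip:", input_grad_norm(False))   # shrinks rapidly as depth grows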


Quick Check

What does a residual block learn when the optimal transformation is the identity?

It learns W = I (identity matrix)

It learns F(x) = 0, so the output is x + 0 = x

It cannot represent the identity

Common Mistake: Dimension Mismatch in Skip Connections

Mistake:

Adding the unmodified input x to the block output when the spatial size or channel count changes between input and output; the two tensors then have mismatched shapes and the addition fails.

Correction:

Use a 1x1 conv with appropriate stride for the skip path: self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
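
A tensor-level sketch of the mismatch and the fix (assuming torch and nn are imported as above; the shapes correspond to a stride-2, channel-doubling block):

x = torch.randn(1, 64, 32, 32)
out = torch.randn(1, 128, 16, 16)                 # stand-in for the block output
skip = nn.Conv2d(64, 128, 1, stride=2, bias=False)
print((out + skip(x)).shape)                      # works: torch.Size([1, 128, 16, 16])
# out + x would raise a shape-mismatch RuntimeError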