Residual Blocks

Definition:

Residual Block (ResNet)

A residual block adds the input to the output of a nonlinear path:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

where $\mathcal{F}$ is typically two Conv-BN layers with a ReLU in between; the final ReLU is applied after the addition.

import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 Conv-BN layers plus an identity skip."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x                             # identity skip path
        out = F.relu(self.bn1(self.conv1(x)))    # first Conv-BN-ReLU
        out = self.bn2(self.conv2(out))          # second Conv-BN (no ReLU yet)
        return F.relu(out + residual)            # add skip, then final ReLU
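
A quick usage check (a minimal sketch; the batch size and spatial size are arbitrary) confirms that the block preserves the shape of its input:

import torch

block = ResidualBlock(64)
x = torch.randn(2, 64, 32, 32)   # batch of 2, 64 channels, 32x32 feature maps
y = block(x)
print(y.shape)                   # torch.Size([2, 64, 32, 32]), same shape as x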

When dimensions change (stride > 1 or channel change), use a 1x1 convolution as the skip connection.
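
As a minimal sketch of that case (the class name DownsampleResidualBlock is illustrative, not part of the original code above), the projection shortcut can look like this:

class DownsampleResidualBlock(nn.Module):
    """Residual block whose first conv changes stride/channels; the skip
    path uses a 1x1 conv so shapes match before the addition."""

    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection so the skip path matches the new spatial size and channels
        self.skip = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.skip(x))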

Definition:

Bottleneck Block

The bottleneck reduces computation by using 1x1 convolutions:

$$\text{1x1 Conv}(C \to C/4) \to \text{3x3 Conv}(C/4 \to C/4) \to \text{1x1 Conv}(C/4 \to C)$$

This reduces the parameter count from $2 \cdot 9C^2$ for the basic block (two 3x3 convs) to $C^2/4 + 9(C/4)^2 + C^2/4 = C^2/2 + 9C^2/16 \approx 1.06\,C^2$, roughly a 17x reduction at the same channel width $C$.

class Bottleneck(nn.Module):
    """Bottleneck block: 1x1 reduce -> 3x3 -> 1x1 restore, plus identity skip."""

    def __init__(self, channels, expansion=4):
        super().__init__()
        mid = channels // expansion                                  # reduced width for the 3x3 conv
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)         # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)   # 3x3 at reduced width
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)         # 1x1 restore
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)                                       # identity skip, then final ReLU
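
To check the parameter arithmetic above, one can count the learnable weights of both blocks directly (a quick sketch; the channel width 256 is an arbitrary choice):

basic = ResidualBlock(256)
bottleneck = Bottleneck(256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(basic))        # ~1.18M, dominated by the two 9 * 256^2 conv weights
print(count(bottleneck))   # ~0.07M, roughly 17x fewer (plus small BatchNorm terms)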

Theorem: Gradient Flow in Residual Networks

For a residual network with $L$ blocks, unrolling the recursion $\mathbf{x}_{i+1} = \mathbf{x}_i + \mathcal{F}_i(\mathbf{x}_i)$ gives $\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}_i(\mathbf{x}_i)$, so the gradient of the loss with respect to the input of block $l$ decomposes as:

$$\frac{\partial L}{\partial \mathbf{x}_l} = \frac{\partial L}{\partial \mathbf{x}_L} \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}_i(\mathbf{x}_i)\right)$$

The "1" term provides a direct gradient path that does not vanish regardless of depth. This is why ResNets can be trained with 100+ layers while plain networks cannot.

The skip connection acts as a "gradient highway": even if the learned transformation $\mathcal{F}$ has vanishing gradients, the identity path always carries signal.

Example: Building a ResNet-18 from Scratch

Implement ResNet-18: 4 stages with [2, 2, 2, 2] residual blocks, channels [64, 128, 256, 512].
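
A minimal sketch of that assembly, assuming the ResidualBlock and DownsampleResidualBlock classes defined above (the stem uses the standard 7x7 conv and max-pool, and the stages follow the [2, 2, 2, 2] plan):

class ResNet18(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem: 7x7 conv, stride 2, then 3x3 max-pool, stride 2
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # 4 stages of [2, 2, 2, 2] blocks with channels [64, 128, 256, 512];
        # each stage after the first starts with a stride-2 downsampling block
        self.stage1 = nn.Sequential(ResidualBlock(64), ResidualBlock(64))
        self.stage2 = nn.Sequential(DownsampleResidualBlock(64, 128), ResidualBlock(128))
        self.stage3 = nn.Sequential(DownsampleResidualBlock(128, 256), ResidualBlock(256))
        self.stage4 = nn.Sequential(DownsampleResidualBlock(256, 512), ResidualBlock(512))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.stage4(self.stage3(self.stage2(self.stage1(x))))
        return self.head(x)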

Skip Connection Effect on Gradient Flow

Compare gradient magnitudes with and without skip connections.
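
One way to run that comparison (a minimal sketch; depth, width, and input shape are arbitrary, and BatchNorm is deliberately omitted to isolate the effect of the skip connection) is to backprop a dummy loss through the same conv stack with and without the skip and inspect the gradient norm at the input:

import torch
import torch.nn as nn

def input_grad_norm(use_skip, depth=20, channels=64):
    """Gradient norm at the input after backprop through `depth` conv layers."""
    torch.manual_seed(0)
    convs = nn.ModuleList(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False) for _ in range(depth)
    )
    x = torch.randn(4, channels, 16, 16, requires_grad=True)
    out = x
    for conv in convs:
        out = torch.relu(conv(out) + out) if use_skip else torch.relu(conv(out))
    out.mean().backward()
    return x.grad.norm().item()

print("with skip   :", input_grad_norm(True))    # typically much larger
print("without skip:", input_grad_norm(False))   # shrinks rapidly as depth grows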


Quick Check

What does a residual block learn when the optimal transformation is the identity?

It learns W = I (identity matrix)

It learns F(x) = 0, so the output is x + 0 = x

It cannot represent the identity

Common Mistake: Dimension Mismatch in Skip Connections

Mistake:

Adding the unmodified input x to the block output when the spatial size or channel count changes between input and output; the two tensors then have mismatched shapes and the addition fails.

Correction:

Use a 1x1 conv with appropriate stride for the skip path: self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
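
A tensor-level sketch of the mismatch and the fix (assuming torch and nn are imported as above; the shapes correspond to a stride-2, channel-doubling block):

x = torch.randn(1, 64, 32, 32)
out = torch.randn(1, 128, 16, 16)                 # stand-in for the block output
skip = nn.Conv2d(64, 128, 1, stride=2, bias=False)
print((out + skip(x)).shape)                      # works: torch.Size([1, 128, 16, 16])
# out + x would raise a shape-mismatch RuntimeError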