Residual Blocks
Definition: Residual Block (ResNet)
A residual block adds the input to the output of a nonlinear path:
$$y = x + \mathcal{F}(x)$$
where $\mathcal{F}$ is typically two Conv-BN-ReLU layers.
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions that preserve the spatial size and channel count.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x                             # identity skip path
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))          # no ReLU before the addition
        return F.relu(out + residual)            # add, then apply the final ReLU
When dimensions change (stride > 1 or channel change), use a 1x1 convolution as the skip connection.
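A minimal sketch of such a block, assuming a constructor signature of (in_ch, out_ch, stride=1) and a 1x1 projection (followed by BatchNorm) on the skip path whenever the shape changes; this generalizes the block above and is the form the ResNet-18 example later in this section relies on:

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut: 1x1 conv matches the channel count and stride.
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.skip(x))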
Definition: Bottleneck Block
The bottleneck reduces computation by sandwiching a 3x3 convolution between two 1x1 convolutions:
$$\mathcal{F}(x) = W_3^{1\times 1}\,\sigma\!\big(W_2^{3\times 3}\,\sigma(W_1^{1\times 1} x)\big)$$
where the first 1x1 convolution reduces the channel count from $C$ to $C/4$ and the last restores it (BatchNorm omitted for clarity). For $C$ channels this reduces the parameter count from $2 \cdot 3^2 C^2 = 18C^2$ (basic block, two 3x3 convolutions) to $\tfrac{C^2}{4} + \tfrac{9C^2}{16} + \tfrac{C^2}{4} = \tfrac{17}{16}C^2$, roughly a 17x reduction.
class Bottleneck(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        mid = channels // expansion                                   # reduced width for the 3x3 conv
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)          # 1x1: reduce channels
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)    # 3x3 at reduced width
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)          # 1x1: restore channels
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)                   # identity skip: channel count is unchanged
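To make the savings concrete, here is a small sketch that counts trainable parameters in one basic block and one bottleneck at the same width (the width of 256 channels is an arbitrary choice; ResidualBlock refers to the projection-capable variant sketched earlier):

def n_params(module):
    return sum(p.numel() for p in module.parameters())

basic = ResidualBlock(256, 256)   # two 3x3 convolutions at full width
bottle = Bottleneck(256)          # 1x1 reduce -> 3x3 -> 1x1 expand
print(n_params(basic))   # ~1.18M parameters
print(n_params(bottle))  # ~0.07M parameters, roughly 17x fewer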
Theorem: Gradient Flow in Residual Networks
For a residual network with $L$ blocks $x_{i+1} = x_i + \mathcal{F}(x_i)$, unrolling the forward pass from block $l$ gives $x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i)$, so the gradient of the loss $\mathcal{L}$ with respect to the input of block $l$ decomposes as:
$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}(x_i)\right)$$
The "1" term provides a direct gradient path that does not vanish regardless of depth. This is why ResNets with 100+ layers train readily, while plain networks of comparable depth become much harder to optimize.
The skip connection acts as a "gradient highway": even if the learned transformation $\mathcal{F}$ has vanishing gradients, the identity path always carries signal.
Example: Building a ResNet-18 from Scratch
Implement ResNet-18: 4 stages with [2, 2, 2, 2] residual blocks, channels [64, 128, 256, 512].
Stage definition
def make_stage(in_ch, out_ch, n_blocks, stride=1):
    # Assumes ResidualBlock accepts (in_ch, out_ch, stride) and uses a 1x1
    # projection skip when the shape changes (see the sketch above).
    layers = [ResidualBlock(in_ch, out_ch, stride)]    # first block may downsample
    for _ in range(1, n_blocks):
        layers.append(ResidualBlock(out_ch, out_ch))   # remaining blocks keep the shape
    return nn.Sequential(*layers)
class ResNet18(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Stem: 7x7 conv with stride 2, then 3x3 max pool (4x downsampling overall).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1))
        # Four stages of [2, 2, 2, 2] blocks; each stage after the first halves
        # the spatial size and doubles the channel count.
        self.stage1 = make_stage(64, 64, 2)
        self.stage2 = make_stage(64, 128, 2, stride=2)
        self.stage3 = make_stage(128, 256, 2, stride=2)
        self.stage4 = make_stage(256, 512, 2, stride=2)
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage4(self.stage3(self.stage2(self.stage1(x))))
        x = self.gap(x).flatten(1)               # (N, 512, 1, 1) -> (N, 512)
        return self.fc(x)
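As a quick shape check (a minimal sketch; the batch size of 2 and the 224x224 input are arbitrary choices):

import torch

model = ResNet18(num_classes=10)
x = torch.randn(2, 3, 224, 224)   # batch of two RGB images
print(model(x).shape)             # expected: torch.Size([2, 10])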
Skip Connection Effect on Gradient Flow
Compare gradient magnitudes with and without skip connections.
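A minimal sketch of such a comparison, using small Linear+Tanh blocks as stand-ins for the convolutional blocks (the depth of 50 and width of 64 are arbitrary choices); with default initialization the gradient reaching the input of the plain stack typically shrinks toward zero, while the residual stack keeps it at a healthy magnitude:

import torch
import torch.nn as nn

depth, dim = 50, 64

def block():
    return nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

class Residual(nn.Module):
    # Wraps a block f as x + f(x).
    def __init__(self, f):
        super().__init__()
        self.f = f
    def forward(self, x):
        return x + self.f(x)

plain = nn.Sequential(*[block() for _ in range(depth)])
residual = nn.Sequential(*[Residual(block()) for _ in range(depth)])

for name, net in [("plain", plain), ("residual", residual)]:
    x = torch.randn(8, dim, requires_grad=True)
    net(x).sum().backward()
    print(f"{name:8s} ||dL/dx|| = {x.grad.norm().item():.3e}")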
Quick Check
What does a residual block learn when the optimal transformation is the identity?
It learns W = I (identity matrix)
It learns F(x) = 0, so the output is x + 0 = x
It cannot represent the identity
It is easier to push F toward zero than to learn the identity mapping explicitly.
Common Mistake: Dimension Mismatch in Skip Connections
Mistake:
Adding the input directly through an identity skip when the spatial size or channel count changes between input and output, so the two tensors cannot be added.
Correction:
Use a 1x1 conv with appropriate stride for the skip path:
self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
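The projected input is then what gets added in the block's forward pass, e.g. return F.relu(out + self.skip(x)). Reference implementations (such as torchvision's ResNet) also follow this 1x1 convolution with a BatchNorm on the skip path.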