The Transformer Architecture
Definition: Transformer Encoder Block
Each encoder block applies:
- Multi-head self-attention + residual + LayerNorm
- Feed-forward network + residual + LayerNorm
where FFN is two linear layers with a GELU nonlinearity in between: $\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1) W_2 + b_2$.
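Written out, the two sublayers in the post-norm arrangement described above (with MHA denoting multi-head self-attention) are:
$$z = \mathrm{LayerNorm}\big(x + \mathrm{MHA}(x)\big), \qquad y = \mathrm{LayerNorm}\big(z + \mathrm{FFN}(z)\big)$$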
Definition: Feed-Forward Network (FFN)
The FFN in each transformer block expands then contracts: $\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1) W_2 + b_2$, where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$.
Typical expansion ratio: $d_{\text{ff}} = 4\, d_{\text{model}}$.
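A minimal sketch of this expand-then-contract structure in PyTorch (make_ffn is a hypothetical helper name; it assumes the typical 4x expansion by default):

import torch.nn as nn

def make_ffn(d_model, expansion=4, dropout=0.1):
    # Expand to d_ff = expansion * d_model, apply GELU, then project back to d_model.
    d_ff = expansion * d_model
    return nn.Sequential(
        nn.Linear(d_model, d_ff),
        nn.GELU(),
        nn.Dropout(dropout),
        nn.Linear(d_ff, d_model),
    )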
Example: Implementing a Transformer Encoder Block
Build a complete transformer encoder block.
Solution
The block below combines multi-head self-attention, the FFN, residual connections, and layer normalization. Note that it uses the pre-norm variant: LayerNorm is applied before each sublayer rather than after, a widely used arrangement in modern transformer stacks.
Implementation
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        # Multi-head self-attention module (assumed defined earlier in the text).
        self.attn = MultiHeadAttention(d_model, n_heads)
        # Position-wise FFN: expand to d_ff, GELU, contract back to d_model.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm attention sublayer: normalize once, self-attend, add residual.
        h = self.ln1(x)
        x = x + self.drop(self.attn(h, h, h, mask))
        # Pre-norm FFN sublayer with its own residual connection.
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x
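A quick usage check, as a sketch: it assumes the MultiHeadAttention module from the earlier attention section is in scope and accepts (query, key, value, mask). The block preserves the input shape, so it can be stacked.

import torch

block = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = block(x)                # mask=None: unmasked self-attention
print(out.shape)              # torch.Size([2, 16, 512]) -- same shape as the input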