The Transformer Architecture

Definition: Transformer Encoder Block

Each encoder block applies:

  1. Multi-head self-attention + residual + LayerNorm
  2. Feed-forward network + residual + LayerNorm

$$\mathbf{x}' = \text{LN}(\mathbf{x} + \text{MHA}(\mathbf{x}, \mathbf{x}, \mathbf{x})), \qquad \mathbf{x}'' = \text{LN}(\mathbf{x}' + \text{FFN}(\mathbf{x}'))$$

where the FFN is two linear layers with a GELU activation: $\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$.
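
A minimal sketch of these two sublayer updates in PyTorch. The width `d_model=512`, `nhead=8`, and the use of `nn.MultiheadAttention` are illustrative assumptions, not fixed by the text:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions, not from the definition above).
d_model, nhead = 512, 8
mha = nn.MultiheadAttention(d_model, nhead, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

x = torch.randn(2, 16, d_model)          # (batch, seq_len, d_model)
attn_out, _ = mha(x, x, x)               # self-attention: Q = K = V = x
x_prime = ln1(x + attn_out)              # x'  = LN(x + MHA(x, x, x))
x_double = ln2(x_prime + ffn(x_prime))   # x'' = LN(x' + FFN(x'))
```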

Definition: Feed-Forward Network (FFN)

The FFN in each transformer block expands then contracts:

$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \cdot \text{GELU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typical expansion ratio: $d_{\text{ff}} = 4 \cdot d_{\text{model}}$.
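
A minimal sketch of this expand-then-contract module, assuming PyTorch; defaulting `d_ff` to the typical $4 \cdot d_{\text{model}}$ ratio stated above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model        # typical ratio: d_ff = 4 * d_model
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # contract: d_ff -> d_model
        self.act = nn.GELU()

    def forward(self, x):
        # FFN(x) = W2 . GELU(W1 x + b1) + b2 (biases live in the Linear layers)
        return self.w2(self.act(self.w1(x)))
```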

Example: Implementing a Transformer Encoder Block

Build a complete transformer encoder block that combines both sublayers defined above: multi-head self-attention and the FFN, each followed by a residual connection and LayerNorm.
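
One way to assemble the block, as a sketch in PyTorch. The hyperparameters (`d_model=512`, `nhead=8`, `dropout=0.1`) and the optional `key_padding_mask` argument are illustrative assumptions beyond the definitions above:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, nhead,
                                         dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand to d_ff = 4 * d_model
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # contract back to d_model
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)    # dropout is an added assumption

    def forward(self, x, key_padding_mask=None):
        # Sublayer 1: multi-head self-attention + residual + LayerNorm
        attn_out, _ = self.mha(x, x, x, key_padding_mask=key_padding_mask)
        x = self.ln1(x + self.dropout(attn_out))
        # Sublayer 2: feed-forward network + residual + LayerNorm
        x = self.ln2(x + self.dropout(self.ffn(x)))
        return x


# Usage: a batch of 2 sequences of length 16 with model width 512.
block = TransformerEncoderBlock(d_model=512, nhead=8)
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Note that this follows the post-LN arrangement given in the equations above (LayerNorm applied after each residual sum); many modern implementations instead use pre-LN, which normalizes the sublayer input.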