The Transformer Architecture

Definition: Transformer Encoder Block

Each encoder block applies:

  1. Multi-head self-attention + residual + LayerNorm
  2. Feed-forward network + residual + LayerNorm

$$\mathbf{x}' = \text{LN}(\mathbf{x} + \text{MHA}(\mathbf{x}, \mathbf{x}, \mathbf{x})), \qquad \mathbf{x}'' = \text{LN}(\mathbf{x}' + \text{FFN}(\mathbf{x}'))$$

where the FFN is two linear layers with a GELU activation: $\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$.
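
A minimal sketch of these two sublayer updates in PyTorch. The width `d_model=512`, `nhead=8`, and the use of `nn.MultiheadAttention` are illustrative assumptions, not fixed by the text:

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions, not from the definition above).
d_model, nhead = 512, 8
mha = nn.MultiheadAttention(d_model, nhead, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

x = torch.randn(2, 16, d_model)          # (batch, seq_len, d_model)
attn_out, _ = mha(x, x, x)               # self-attention: Q = K = V = x
x_prime = ln1(x + attn_out)              # x'  = LN(x + MHA(x, x, x))
x_double = ln2(x_prime + ffn(x_prime))   # x'' = LN(x' + FFN(x'))
```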

Definition: Feed-Forward Network (FFN)

The FFN in each transformer block expands then contracts:

$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \cdot \text{GELU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typical expansion ratio: $d_{\text{ff}} = 4 \cdot d_{\text{model}}$.
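
A minimal sketch of this expand-then-contract module, assuming PyTorch; defaulting `d_ff` to the typical $4 \cdot d_{\text{model}}$ ratio stated above:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model        # typical ratio: d_ff = 4 * d_model
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # contract: d_ff -> d_model
        self.act = nn.GELU()

    def forward(self, x):
        # FFN(x) = W2 . GELU(W1 x + b1) + b2 (biases live in the Linear layers)
        return self.w2(self.act(self.w1(x)))
```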

Example: Implementing a Transformer Encoder Block

Build a complete transformer encoder block that combines both sublayers defined above: multi-head self-attention and the FFN, each followed by a residual connection and LayerNorm.
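
One way to assemble the block, as a sketch in PyTorch. The hyperparameters (`d_model=512`, `nhead=8`, `dropout=0.1`) and the optional `key_padding_mask` argument are illustrative assumptions beyond the definitions above:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, nhead,
                                         dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand to d_ff = 4 * d_model
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # contract back to d_model
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)    # dropout is an added assumption

    def forward(self, x, key_padding_mask=None):
        # Sublayer 1: multi-head self-attention + residual + LayerNorm
        attn_out, _ = self.mha(x, x, x, key_padding_mask=key_padding_mask)
        x = self.ln1(x + self.dropout(attn_out))
        # Sublayer 2: feed-forward network + residual + LayerNorm
        x = self.ln2(x + self.dropout(self.ffn(x)))
        return x


# Usage: a batch of 2 sequences of length 16 with model width 512.
block = TransformerEncoderBlock(d_model=512, nhead=8)
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Note that this follows the post-LN arrangement given in the equations above (LayerNorm applied after each residual sum); many modern implementations instead use pre-LN, which normalizes the sublayer input.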