Encoder Models: BERT and Variants

Definition:

BERT: Bidirectional Encoder Representations from Transformers

BERT uses a transformer encoder (no causal mask) with two pre-training objectives:

  1. Masked Language Modeling (MLM): Randomly mask 15% of tokens, predict the masked tokens from context: $\mathcal{L}_\text{MLM} = -\sum_{i \in \text{masked}} \log P(w_i \mid w_{\backslash i})$

  2. Next Sentence Prediction (NSP): Binary classification of whether sentence B follows sentence A (later shown to be unnecessary: RoBERTa drops it without hurting downstream performance).

BERT produces contextual embeddings: each token's representation depends on the entire input, not just left context.
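To see the MLM objective in action, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the example sentence is illustrative): BERT fills in a [MASK] token using context from both sides.

```python
# Minimal sketch of BERT's masked-language-modeling objective.
# Assumes the Hugging Face `transformers` library is installed;
# the model name and example sentence are illustrative.
from transformers import pipeline

# Load a pre-trained BERT together with its MLM head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from bidirectional context.
for pred in unmasker("The receiver decodes the [MASK] signal."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```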

Example: Using BERT as a Feature Extractor

Extract sentence embeddings from BERT and compute similarity between wireless paper titles.
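A minimal sketch of this workflow, assuming the Hugging Face transformers library and PyTorch; the bert-base-uncased checkpoint, mean pooling over non-padding tokens, and the example titles are illustrative choices rather than the only reasonable ones.

```python
# Sketch: BERT as a feature extractor for sentence similarity.
# Assumes `transformers` and `torch` are installed; model name,
# pooling strategy, and titles are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

titles = [
    "Deep learning for massive MIMO channel estimation",
    "Neural network based channel estimation in massive MIMO",
    "A survey of reinforcement learning for robotics",
]

# Tokenize with padding so the titles form one batch.
batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)

# Mean-pool over real tokens (mask out padding) to get one vector per title.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the first title and the others.
emb = torch.nn.functional.normalize(emb, dim=-1)
sims = emb @ emb.T
print(sims[0])  # highest similarity expected with the paraphrased MIMO title
```

Mean pooling is used here because the raw [CLS] vector of vanilla BERT is often a weak sentence representation without task-specific fine-tuning.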

Transformer Architecture Families

| Family | Architecture | Pre-training | Best For | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional attention | MLM | Classification, NER, retrieval | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal attention | Next-token prediction | Generation, few-shot learning | GPT-2/3/4, LLaMA, Mistral |
| Encoder-decoder | Cross-attention | Span corruption | Translation, summarization | T5, BART, Flan-T5 |
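In the Hugging Face transformers library, the three families map onto different Auto* classes; a quick sketch (checkpoint names are illustrative):

```python
# Sketch: loading one checkpoint per architecture family.
# Assumes `transformers` is installed; checkpoints are illustrative.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder = AutoModel.from_pretrained("bert-base-uncased")     # encoder-only
decoder = AutoModelForCausalLM.from_pretrained("gpt2")       # decoder-only
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # encoder-decoder
```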

Common Mistake: Using BERT for Text Generation

Mistake:

Trying to use BERT for autoregressive text generation.

Correction:

BERT is an encoder model trained with MLM — it sees the full input bidirectionally and cannot generate text autoregressively. For generation, use decoder-only (GPT) or encoder-decoder (T5) models.
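For contrast, a minimal generation sketch with a decoder-only model, assuming the transformers library and the gpt2 checkpoint (the prompt is illustrative):

```python
# Sketch: autoregressive generation with a decoder-only model (GPT-2),
# which BERT's bidirectional MLM training does not support.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Beamforming in millimeter-wave systems", max_new_tokens=30)
print(out[0]["generated_text"])
```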

Key Takeaway

The three transformer families serve different purposes: encoders (BERT) for understanding/classification, decoders (GPT) for generation, and encoder-decoders (T5) for sequence-to-sequence tasks. Modern practice increasingly favors decoder-only models even for tasks traditionally handled by the other families.