Encoder Models: BERT and Variants

Definition:

BERT: Bidirectional Encoder Representations from Transformers

BERT uses a transformer encoder (no causal mask) with two pre-training objectives:

  1. Masked Language Modeling (MLM): Randomly mask 15% of tokens, predict the masked tokens from context: $\mathcal{L}_\text{MLM} = -\sum_{i \in \text{masked}} \log P(w_i \mid w_{\backslash i})$

  2. Next Sentence Prediction (NSP): Binary classification of whether sentence B follows sentence A (later shown to be unnecessary: RoBERTa drops it without hurting downstream performance).

BERT produces contextual embeddings: each token's representation depends on the entire input, not just left context.
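To see the MLM objective in action, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the example sentence is illustrative): BERT fills in a [MASK] token using context from both sides.

```python
# Minimal sketch of BERT's masked-language-modeling objective.
# Assumes the Hugging Face `transformers` library is installed;
# the model name and example sentence are illustrative.
from transformers import pipeline

# Load a pre-trained BERT together with its MLM head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from bidirectional context.
for pred in unmasker("The receiver decodes the [MASK] signal."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```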

Example: Using BERT as a Feature Extractor

Extract sentence embeddings from BERT and compute similarity between wireless paper titles.
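A minimal sketch of this workflow, assuming the Hugging Face transformers library and PyTorch; the bert-base-uncased checkpoint, mean pooling over non-padding tokens, and the example titles are illustrative choices rather than the only reasonable ones.

```python
# Sketch: BERT as a feature extractor for sentence similarity.
# Assumes `transformers` and `torch` are installed; model name,
# pooling strategy, and titles are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

titles = [
    "Deep learning for massive MIMO channel estimation",
    "Neural network based channel estimation in massive MIMO",
    "A survey of reinforcement learning for robotics",
]

# Tokenize with padding so the titles form one batch.
batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)

# Mean-pool over real tokens (mask out padding) to get one vector per title.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the first title and the others.
emb = torch.nn.functional.normalize(emb, dim=-1)
sims = emb @ emb.T
print(sims[0])  # highest similarity expected with the paraphrased MIMO title
```

Mean pooling is used here because the raw [CLS] vector of vanilla BERT is often a weak sentence representation without task-specific fine-tuning.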

Transformer Architecture Families

| Family | Architecture | Pre-training | Best For | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional attention | MLM | Classification, NER, retrieval | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal attention | Next-token prediction | Generation, few-shot learning | GPT-2/3/4, LLaMA, Mistral |
| Encoder-decoder | Cross-attention | Span corruption | Translation, summarization | T5, BART, Flan-T5 |
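In the Hugging Face transformers library, the three families map onto different Auto* classes; a quick sketch (checkpoint names are illustrative):

```python
# Sketch: loading one checkpoint per architecture family.
# Assumes `transformers` is installed; checkpoints are illustrative.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder = AutoModel.from_pretrained("bert-base-uncased")     # encoder-only
decoder = AutoModelForCausalLM.from_pretrained("gpt2")       # decoder-only
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # encoder-decoder
```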

Common Mistake: Using BERT for Text Generation

Mistake:

Trying to use BERT for autoregressive text generation.

Correction:

BERT is an encoder model trained with MLM — it sees the full input bidirectionally and cannot generate text autoregressively. For generation, use decoder-only (GPT) or encoder-decoder (T5) models.
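For contrast, a minimal generation sketch with a decoder-only model, assuming the transformers library and the gpt2 checkpoint (the prompt is illustrative):

```python
# Sketch: autoregressive generation with a decoder-only model (GPT-2),
# which BERT's bidirectional MLM training does not support.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Beamforming in millimeter-wave systems", max_new_tokens=30)
print(out[0]["generated_text"])
```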

Key Takeaway

The three transformer families serve different purposes: encoders (BERT) for understanding/classification, decoders (GPT) for generation, and encoder-decoders (T5) for sequence-to-sequence tasks. Modern practice increasingly favors decoder-only models even for tasks traditionally handled by the other families.