Encoder Models: BERT and Variants
Definition: BERT (Bidirectional Encoder Representations from Transformers)
BERT uses a transformer encoder (no causal mask) with two pre-training objectives:
- Masked Language Modeling (MLM): Randomly select 15% of the input tokens for masking and predict them from the surrounding bidirectional context (a standard form of the objective is sketched after this list).
- Next Sentence Prediction (NSP): Binary classification of whether sentence B follows sentence A (later shown to be unnecessary).
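In standard notation (a sketch; the symbols below are not taken from the original text), the MLM objective for an input sequence $x$ with masked position set $M$ and corrupted input $\tilde{x}$ is

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right),$$

where $\tilde{x}$ replaces each selected token with [MASK] 80% of the time, a random token 10% of the time, and the original token the remaining 10% of the time.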
BERT produces contextual embeddings: each token's representation depends on the entire input, not just left context.
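To make the contextual-embedding point concrete, here is a minimal sketch (not part of the original example; the sentences and helper name are illustrative assumptions) that compares BERT's hidden state for the word "channel" in a wireless context versus a broadcast-media context:

```python
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence, word):
    """Return BERT's contextual hidden state for the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

# Same surface word, different contexts -> different embeddings
v_wireless = token_vector("the fading channel was estimated from pilots", "channel")
v_media = token_vector("the news channel aired the interview", "channel")
print(torch.cosine_similarity(v_wireless, v_media, dim=0).item())
```

With a contextual encoder the two vectors differ, whereas a static embedding (e.g., word2vec) would assign the same vector to both occurrences of "channel".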
Example: Using BERT as a Feature Extractor
Extract sentence embeddings from BERT and compute similarity between wireless paper titles.
Implementation
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "MIMO channel estimation using deep learning",
    "Deep neural network for channel prediction",
    "Image reconstruction via compressed sensing",
]

with torch.no_grad():
    inputs = tokenizer(sentences, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model(**inputs)

# Use the [CLS] token (position 0) as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0, :]

# Pairwise cosine similarity between sentence embeddings
sim = torch.cosine_similarity(
    embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=2
)
print("Similarity matrix:")
print(sim.numpy().round(3))
```
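The [CLS] vector of a plain bert-base-uncased checkpoint is a common but fairly crude sentence representation; without fine-tuning, mean pooling over the token states (excluding padding) is a frequently used alternative. A minimal sketch, reusing `inputs` and `outputs` from the block above:

```python
# Mean pooling over token states, ignoring padded positions
mask = inputs["attention_mask"].unsqueeze(-1).float()            # (batch, seq, 1)
mean_emb = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
mean_sim = torch.cosine_similarity(
    mean_emb.unsqueeze(1), mean_emb.unsqueeze(0), dim=2
)
print(mean_sim.numpy().round(3))
```

For similarity-focused applications, models fine-tuned for sentence embeddings (e.g., sentence-transformers) typically work better than either pooling strategy applied to a vanilla BERT.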
Transformer Architecture Families
| Family | Architecture | Pre-training | Best For | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional attention | MLM | Classification, NER, retrieval | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal attention | Next-token prediction | Generation, few-shot | GPT-2/3/4, LLaMA, Mistral |
| Encoder-Decoder | Cross-attention | Span corruption | Translation, summarization | T5, BART, Flan-T5 |
Common Mistake: Using BERT for Text Generation
Mistake: Trying to use BERT for autoregressive text generation.
Correction: BERT is an encoder model trained with MLM; it sees the full input bidirectionally and cannot generate text autoregressively. For generation, use decoder-only (GPT) or encoder-decoder (T5) models.
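For contrast, here is a minimal sketch (not from the original text) of the decoder-only route using GPT-2 from the same transformers library; the prompt is an illustrative assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "MIMO channel estimation can be improved by"
input_ids = gen_tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    output_ids = gen_model.generate(
        input_ids,
        max_new_tokens=30,
        do_sample=False,                          # greedy decoding for reproducibility
        pad_token_id=gen_tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
print(gen_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```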
Key Takeaway
The three transformer families serve different purposes: encoders (BERT) for understanding/classification, decoders (GPT) for generation, and encoder-decoders (T5) for sequence-to-sequence tasks. Modern practice increasingly favors decoder-only models for all tasks.