Language Modeling
Language Models as Foundation
A language model assigns probabilities to sequences of tokens. This seemingly simple task, predicting the next word, turns out to require deep understanding of syntax, semantics, and world knowledge. Modern LLMs are scaled-up language models.
Definition: Autoregressive Language Model
Autoregressive Language Model
An autoregressive language model factorizes the joint probability of a sequence using the chain rule:

$$P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$

The model is trained to minimize the negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1})$$

Perplexity measures model quality:

$$\text{PPL} = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1})\right)$$

Lower perplexity means the model is less "surprised" by the data. A perplexity of $|V|$ corresponds to uniform guessing over a vocabulary of size $|V|$.
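As a sketch (the tensor shapes are illustrative), perplexity can be computed directly from a model's mean cross-entropy, since cross-entropy is exactly the average negative log-likelihood per token:

import math
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) of token ids
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1))   # mean negative log-likelihood
    return math.exp(nll.item())                  # PPL = exp(mean NLL)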
Definition: N-gram Language Model
N-gram Language Model
An $n$-gram model approximates the full history with the last $n-1$ tokens:

$$P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid x_{t-n+1}, \dots, x_{t-1})$$

The probabilities are estimated from corpus counts:

$$P(x_t \mid x_{t-n+1}, \dots, x_{t-1}) = \frac{\operatorname{count}(x_{t-n+1}, \dots, x_{t-1}, x_t)}{\operatorname{count}(x_{t-n+1}, \dots, x_{t-1})}$$

Smoothing (Laplace, Kneser-Ney) handles zero-count n-grams.
N-gram models are simple and interpretable but cannot capture long-range dependencies beyond their window of $n-1$ context words.
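A minimal sketch of count-based estimation with add-one (Laplace) smoothing; the toy sentence and helper names below are purely illustrative:

from collections import Counter

def train_bigram(tokens, alpha=1.0):
    # Bigram probabilities with add-alpha (Laplace) smoothing.
    vocab = sorted(set(tokens))
    context_counts = Counter(tokens[:-1])             # count(w_prev)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # count(w_prev, w)
    def prob(w_prev, w):
        return (bigram_counts[(w_prev, w)] + alpha) / \
               (context_counts[w_prev] + alpha * len(vocab))
    return prob

tokens = "the cat sat on the mat".split()
p = train_bigram(tokens)
print(p("the", "cat"))   # smoothed estimate of P(cat | the)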
Definition: Scaled Dot-Product Self-Attention
Scaled Dot-Product Self-Attention
Given an input $X \in \mathbb{R}^{n \times d}$ with queries $Q = XW_Q$, keys $K = XW_K$, and values $V = XW_V$, self-attention computes:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

For autoregressive models, a causal mask sets future positions to $-\infty$ before the softmax, ensuring position $t$ attends only to positions $\le t$.
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Pairwise similarity scores, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        # Positions where mask == 0 are excluded from attention.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Normalize over keys, then take the weighted sum of values.
    return F.softmax(scores, dim=-1) @ V
The $1/\sqrt{d_k}$ scaling prevents the dot products from growing too large, which would push the softmax into regions with tiny gradients.
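As a usage sketch of the `attention` function above, a causal mask for autoregressive decoding can be built with `torch.tril` (the sizes are arbitrary):

n, d_k = 5, 16
Q = K = V = torch.randn(1, n, d_k)
causal = torch.tril(torch.ones(n, n))    # 1 on and below the diagonal, 0 above
out = attention(Q, K, V, mask=causal)    # position t attends only to positions <= t
print(out.shape)                         # torch.Size([1, 5, 16])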
Theorem: Self-Attention Computational Complexity
For a sequence of length $n$ with embedding dimension $d$:
- Time complexity: $O(n^2 d)$ (computing all pairwise attention scores)
- Space complexity: $O(n^2)$ (storing the attention matrix)
This quadratic scaling in $n$ is the primary bottleneck for long sequences and motivates efficient attention variants (linear attention, sparse attention, FlashAttention).
Each token must compute a similarity score with every other token in the sequence, yielding $n^2$ scores. For a sequence of roughly 11,000 tokens, the attention matrix alone requires ~500 MB in fp32.
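The quadratic memory growth is easy to check directly (a back-of-the-envelope sketch; the sequence lengths are arbitrary):

for n in (1_024, 4_096, 11_000):
    mb = n * n * 4 / 1e6   # one n x n attention matrix, 4 bytes per fp32 entry
    print(f"n = {n:>6}: {mb:7.1f} MB")
# n =   1024:     4.2 MB
# n =   4096:    67.1 MB
# n =  11000:   484.0 MB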
Example: RNN Language Model in PyTorch
Build and train a simple RNN language model on a small text corpus. Generate text by sampling from the model.
Model Definition
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        e = self.embed(x)                  # (batch, seq_len, embed_dim)
        out, hidden = self.rnn(e, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.head(out)            # (batch, seq_len, vocab_size)
        return logits, hidden

    @torch.no_grad()
    def generate(self, start_ids, max_len=50, temperature=0.8):
        self.eval()
        ids = start_ids.unsqueeze(0)       # (1, prefix_len)
        logits, hidden = self(ids)         # warm up the state on the full prefix
        for _ in range(max_len):
            # Sample the next token from the temperature-scaled distribution.
            probs = (logits[:, -1] / temperature).softmax(-1)
            next_id = torch.multinomial(probs, 1)
            ids = torch.cat([ids, next_id], dim=1)
            # Feed only the new token; the hidden state carries the context.
            logits, hidden = self(next_id, hidden)
        return ids[0]
Training Loop
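The loop below expects a `vocab` list and a `train_ids` tensor of token ids; a minimal stand-in corpus (character-level, purely illustrative) could look like this:

# Hypothetical toy data so the training loop runs end to end.
text = "hello world hello model"
vocab = sorted(set(text))                               # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}            # char -> integer id
train_ids = torch.tensor([[stoi[ch] for ch in text]])   # shape (1, seq_len)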
model = RNNLM(vocab_size=len(vocab), embed_dim=64, hidden_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    # Predict token t+1 from tokens 1..t (teacher forcing).
    logits, _ = model(train_ids[:, :-1])
    loss = criterion(logits.reshape(-1, len(vocab)),
                     train_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
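After training, text can be sampled with the `generate` method (this assumes the toy character vocabulary sketched above; `itos` simply inverts `stoi`):

itos = {i: ch for ch, i in stoi.items()}
start = torch.tensor([stoi["h"]])
sample_ids = model.generate(start, max_len=40, temperature=0.8)
print("".join(itos[i.item()] for i in sample_ids))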
Example: Building a Transformer Block
Implement a single transformer decoder block with self-attention, layer normalization, and a feed-forward network.
Implementation
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Pre-norm architecture
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + h
        x = x + self.ff(self.ln2(x))
        return x
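A quick smoke test (dimensions chosen arbitrarily) shows how to drive the block with a causal mask; `nn.MultiheadAttention` interprets `True` entries of a boolean `attn_mask` as positions that may not be attended to:

block = TransformerBlock(d_model=64, n_heads=4, d_ff=256)
x = torch.randn(2, 10, 64)                    # (batch, seq_len, d_model)
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
y = block(x, mask=causal)                     # output has the same shape as x
print(y.shape)                                # torch.Size([2, 10, 64])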
Interactive figure: Self-Attention Pattern Visualization, showing attention weights between tokens in a sentence.
Quick Check
A language model with perplexity 50 on a vocabulary of 10,000 words is equivalent to:
Choosing uniformly among 50 words at each step
Making errors on 50% of predictions
Having 50 parameters per word
Perplexity is the effective number of equally-likely choices the model faces at each prediction step.
Quick Check
Why does an autoregressive transformer use a causal mask?
To reduce memory usage
To prevent information leakage from future tokens during training
To speed up inference
The causal mask ensures position t can only attend to positions 1..t, maintaining the autoregressive property.
Common Mistake: Ignoring Temperature During Generation
Mistake:
Always using temperature=1.0 or greedy decoding (temperature=0).
Correction:
Temperature controls the sharpness of the output distribution. Use temperature < 1.0 for more deterministic output (e.g., code generation) and temperature > 1.0 for more creative output. Top-k and nucleus (top-p) sampling provide additional control.
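A minimal sampling sketch (assuming a 1-D tensor of next-token logits) showing how temperature and top-k filtering reshape the distribution before sampling:

import torch

def sample_next(logits, temperature=1.0, top_k=None):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        # Keep only the k highest-scoring tokens and mask out the rest.
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()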
Common Mistake: Forgetting Positional Encoding
Mistake:
Using a transformer without positional encoding.
Correction:
Self-attention is permutation-equivariant: it treats its input as a set, not a sequence. Without positional encoding, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the". Use sinusoidal, learned, or rotary positional encodings.
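A sketch of the sinusoidal encoding from the original transformer paper (assumes an even `d_model`); the result is simply added to the token embeddings:

import torch

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # (d_model / 2,)
    angles = pos / (10000 ** (i / d_model))                       # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe   # usage: x = token_embeddings + sinusoidal_encoding(seq_len, d_model)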
Key Takeaway
Language modeling is the task of predicting the next token given previous tokens. The transformer's self-attention mechanism enables parallel training and captures long-range dependencies, making it the dominant architecture for modern NLP. Perplexity remains the standard intrinsic measure of language-model quality.
Key Takeaway
The NLP pipeline of tokenization, embedding, and modeling transforms raw text into structured predictions. Each stage has design choices (subword vs. word tokens, static vs. contextual embeddings, RNN vs. transformer) that trade off simplicity, speed, and capability.
Why This Matters: Attention and MIMO Detection
The attention mechanism's computation of pairwise affinities is structurally similar to MIMO detection algorithms that compute interference between spatial streams. Transformer-based MIMO detectors (e.g., DeepMIMO) use self-attention to learn inter-stream dependencies, replacing hand-crafted interference cancellation.
See full treatment in Chapter 49
Historical Note: Attention Is All You Need
2017: The transformer architecture was introduced by Vaswani et al. (2017) at Google. The paper title "Attention Is All You Need" was deliberately provocative: it replaced the dominant RNN/LSTM architecture entirely with self-attention. The paper has accumulated over 130,000 citations, making it one of the most cited papers in computer science history.
Language Model
A probabilistic model that assigns probabilities to sequences of tokens, typically factorized as a product of conditional next-token probabilities.
Related: Perplexity
Perplexity
The exponential of the average negative log-likelihood per token; it measures how well a language model predicts held-out text. Lower is better.
Related: Language Model
Self-Attention
A mechanism that computes weighted sums of all positions in a sequence, where weights are determined by pairwise query-key similarity scores.
Related: Transformer
Transformer
A neural network architecture based entirely on self-attention mechanisms, without recurrence or convolution, enabling parallel training on sequences.
Related: Self-Attention
Language Model Architecture Comparison
| Architecture | Complexity per Step | Parallelism | Long-Range Dependencies |
|---|---|---|---|
| N-gram | $O(1)$ (count lookup) | Fully parallel | Limited to $n-1$ words |
| RNN/LSTM | $O(d^2)$ | Sequential | Moderate (vanishing gradients) |
| Transformer | $O(n \cdot d)$ | Fully parallel | Full sequence length |
| Linear Attention | $O(d^2)$ | Fully parallel | Full (approximate) |