Language Modeling

Language Models as a Foundation

A language model assigns probabilities to sequences of tokens. This seemingly simple task of predicting the next word turns out to require deep understanding of syntax, semantics, and world knowledge. Modern LLMs are scaled-up language models.

Definition: Autoregressive Language Model

An autoregressive language model factorizes the joint probability of a sequence using the chain rule:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

The model is trained to minimize the negative log-likelihood:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{<t})$$

Perplexity measures model quality:

$$\text{PPL} = \exp(\mathcal{L}) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_{<t})\right)$$

Lower perplexity means the model is less "surprised" by the data. A perplexity equal to the vocabulary size $V$ corresponds to uniform guessing over the vocabulary.
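
As a quick sketch, perplexity can be computed from per-token logits with PyTorch's cross-entropy loss; the `perplexity` helper and toy shapes below are illustrative:

import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    # logits: (T, V) next-token scores; targets: (T,) actual next-token ids
    nll = F.cross_entropy(logits, targets)   # average negative log-likelihood, i.e. L
    return torch.exp(nll)                    # PPL = exp(L)

# Sanity check: uniform scores over a vocabulary of size V give PPL = V
V, T = 10_000, 128
uniform_logits = torch.zeros(T, V)
targets = torch.randint(0, V, (T,))
print(perplexity(uniform_logits, targets))   # ~10000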

Definition: N-gram Language Model

An $n$-gram model approximates the full history with the last $n-1$ tokens:

$$P(w_t \mid w_{<t}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

The probabilities are estimated from corpus counts:

$$\hat{P}(w_t \mid w_{t-n+1:t-1}) = \frac{\text{count}(w_{t-n+1}, \ldots, w_t)}{\text{count}(w_{t-n+1}, \ldots, w_{t-1})}$$

Smoothing (Laplace, Kneser-Ney) handles zero-count n-grams.

N-gram models are simple and interpretable but cannot capture long-range dependencies beyond the window of $n-1$ words.
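
A minimal sketch of a count-based bigram model ($n = 2$) with Laplace smoothing; the toy corpus and the `bigram_prob` helper are illustrative:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
V = len(vocab)

# Corpus counts for bigrams and their one-word contexts
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev, word, alpha=1.0):
    # P(word | prev) with Laplace (add-alpha) smoothing
    return (bigram_counts[(prev, word)] + alpha) / (context_counts[prev] + alpha * V)

print(bigram_prob("the", "cat"))   # seen bigram
print(bigram_prob("cat", "dog"))   # unseen bigram: nonzero only because of smoothing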

Definition: Scaled Dot-Product Self-Attention

Given input $\mathbf{X} \in \mathbb{R}^{T \times d}$, self-attention computes:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V$$

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

For autoregressive models, a causal mask sets future positions to $-\infty$ before the softmax, ensuring $w_t$ only attends to $w_{\le t}$.

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., T, T) pairwise scores
    if mask is not None:
        # Masked positions (mask == 0) become -inf and get zero weight after softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ V            # weighted sum of value vectors

The $\sqrt{d_k}$ scaling prevents the dot products from becoming too large, which would push softmax into regions with tiny gradients.
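
For example, a causal mask can be built with `torch.tril` and passed to the `attention` function above; the tensor shapes here are illustrative:

T, d_k = 5, 16
Q, K, V = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)

# Lower-triangular mask: position t may attend only to positions <= t
causal_mask = torch.tril(torch.ones(T, T))

out = attention(Q, K, V, mask=causal_mask)
print(out.shape)   # torch.Size([5, 16])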

Theorem: Self-Attention Computational Complexity

For a sequence of length $T$ with embedding dimension $d$:

  • Time complexity: $O(T^2 d)$ (computing all pairwise attention scores)
  • Space complexity: $O(T^2 + Td)$ (storing the attention matrix)

This quadratic scaling in $T$ is the primary bottleneck for long sequences and motivates efficient attention variants (linear attention, sparse attention, Flash Attention).

Each token must compute a similarity score with every other token in the sequence, yielding $T^2$ scores. For $T = 8192$ and $d = 4096$, computing all scores requires roughly $T^2 d \approx 2.7 \times 10^{11}$ multiply-accumulates, and the $T \times T$ attention matrix alone occupies about 268 MB (256 MiB) in fp32 per attention head.
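
The arithmetic behind that memory figure, as a quick check (assuming 4 bytes per fp32 element, one head, one batch element):

T = 8192
bytes_per_float = 4                              # fp32
attn_matrix_bytes = T * T * bytes_per_float      # one T x T score matrix
print(attn_matrix_bytes / 2**20, "MiB")          # 256.0 MiB, ~268 MB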

Example: RNN Language Model in PyTorch

Build and train a simple RNN language model on a small text corpus. Generate text by sampling from the model.
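
A minimal sketch of such a model, assuming integer-encoded tokens; the `RNNLanguageModel` class, hyperparameters, and toy data are illustrative, not a prescribed solution:

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        x = self.embed(tokens)              # (batch, T, embed_dim)
        out, hidden = self.rnn(x, hidden)   # (batch, T, hidden_dim)
        return self.head(out), hidden       # next-token logits at every position

vocab_size = 1000
model = RNNLanguageModel(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: inputs are tokens w_1..w_{T-1}, targets are w_2..w_T
batch = torch.randint(0, vocab_size, (8, 33))   # toy batch of token ids
logits, _ = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()

# Generation: sample a token, feed it back in, repeat
token, hidden = torch.zeros(1, 1, dtype=torch.long), None
for _ in range(20):
    logits, hidden = model(token, hidden)
    probs = torch.softmax(logits[:, -1], dim=-1)
    token = torch.multinomial(probs, 1)         # (1, 1) sampled next token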

Example: Building a Transformer Block

Implement a single transformer decoder block with self-attention, layer normalization, and a feed-forward network.
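
One possible sketch using PyTorch's built-in `nn.MultiheadAttention`; the pre-norm layout and dimensions are illustrative choices, not the only valid ones:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Boolean causal mask: True marks future positions that must not be attended to
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)

        h = self.norm1(x)                                    # pre-norm variant
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)   # masked self-attention
        x = x + attn_out                                     # residual connection

        return x + self.ff(self.norm2(x))                    # residual + feed-forward

block = DecoderBlock()
x = torch.randn(2, 10, 256)    # (batch, T, d_model)
print(block(x).shape)          # torch.Size([2, 10, 256])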

Self-Attention Pattern Visualization

Visualize attention weights between tokens in a sentence


Language Model Architecture Comparison

N-gram, RNN, and Transformer approaches to language modeling.

Quick Check

A language model with perplexity 50 on a vocabulary of 10,000 words is equivalent to:

Choosing uniformly among 50 words at each step

Making errors on 50% of predictions

Having 50 parameters per word

Quick Check

Why does an autoregressive transformer use a causal mask?

To reduce memory usage

To prevent information leakage from future tokens during training

To speed up inference

Common Mistake: Ignoring Temperature During Generation

Mistake:

Always using temperature=1.0 or greedy decoding (temperature=0).

Correction:

Temperature controls the sharpness of the output distribution. Use temperature < 1.0 for more deterministic output (e.g., code generation) and temperature > 1.0 for more creative output. Top-k and nucleus (top-p) sampling provide additional control.
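
A sketch of temperature plus nucleus (top-p) sampling from a vector of next-token logits; the `sample_next_token` helper and default values are illustrative:

import torch

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus (top-p): keep the smallest set of tokens whose cumulative mass reaches top_p
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs >= top_p] = 0.0   # drop tokens outside the nucleus
    sorted_probs /= sorted_probs.sum()                       # renormalize

    return sorted_idx[torch.multinomial(sorted_probs, 1)]

logits = torch.randn(10_000)            # next-token scores over a 10k-token vocabulary
print(sample_next_token(logits))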

Common Mistake: Forgetting Positional Encoding

Mistake:

Using a transformer without positional encoding.

Correction:

Self-attention is permutation-equivariant: it treats input as a set, not a sequence. Without positional encoding, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the". Use sinusoidal, learned, or rotary positional encodings.
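
A sketch of the sinusoidal variant, added to the token embeddings before the first attention layer; the helper name and toy shapes are illustrative:

import math
import torch

def sinusoidal_positional_encoding(T, d_model):
    position = torch.arange(T).unsqueeze(1)                                  # (T, 1)
    # Geometrically spaced frequencies across the embedding dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Added to the token embeddings so attention can distinguish positions
embeddings = torch.randn(16, 128)                  # (T, d_model)
x = embeddings + sinusoidal_positional_encoding(16, 128)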

Key Takeaway

Language modeling is the task of predicting the next token given previous tokens. The transformer's self-attention mechanism enables parallel training and captures long-range dependencies, making it the dominant architecture for modern NLP. Perplexity, the exponentiated per-token loss, quantifies how well a model predicts held-out text.

Key Takeaway

The NLP pipeline of tokenization, embedding, and modeling transforms raw text into structured predictions. Each stage has design choices (subword vs. word tokens, static vs. contextual embeddings, RNN vs. transformer) that trade off simplicity, speed, and capability.

Why This Matters: Attention and MIMO Detection

The attention mechanism's computation of pairwise affinities is structurally similar to MIMO detection algorithms that compute interference between spatial streams. Transformer-based MIMO detectors (e.g., DeepMIMO) use self-attention to learn inter-stream dependencies, replacing hand-crafted interference cancellation.

See full treatment in Chapter 49

Historical Note: Attention Is All You Need

2017

The transformer architecture was introduced by Vaswani et al. (2017) at Google. The paper title "Attention Is All You Need" was deliberately provocative β€” it replaced the dominant RNN/LSTM architecture entirely with self-attention. The paper has accumulated over 130,000 citations, making it one of the most cited papers in computer science history.

Language Model

A probabilistic model that assigns probabilities to sequences of tokens, typically factorized as a product of conditional next-token probabilities.

Related: Perplexity

Perplexity

The exponential of the average negative log-likelihood per token; it measures how well a language model predicts held-out text. Lower is better.

Related: Language Model

Self-Attention

A mechanism that computes weighted sums of all positions in a sequence, where weights are determined by pairwise query-key similarity scores.

Related: Transformer

Transformer

A neural network architecture based entirely on self-attention mechanisms, without recurrence or convolution, enabling parallel training on sequences.

Related: Self-Attention

Language Model Architecture Comparison

| Architecture | Complexity per Step | Parallelism | Long-Range Dependencies |
| --- | --- | --- | --- |
| N-gram | $O(1)$ | Fully parallel | Limited to $n-1$ words |
| RNN/LSTM | $O(d^2)$ | Sequential | Moderate (vanishing gradients) |
| Transformer | $O(T^2 d)$ | Fully parallel | Full sequence length |
| Linear Attention | $O(T d^2)$ | Fully parallel | Full (approximate) |