Language Modeling

Language Models as a Foundation

A language model assigns probabilities to sequences of tokens. This seemingly simple task of predicting the next word turns out to require deep understanding of syntax, semantics, and world knowledge. Modern LLMs are scaled-up language models.

Definition: Autoregressive Language Model

An autoregressive language model factorizes the joint probability of a sequence using the chain rule:

$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

The model is trained to minimize the negative log-likelihood:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{<t})$$

Perplexity measures model quality:

$$\text{PPL} = \exp(\mathcal{L}) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_{<t})\right)$$

Lower perplexity means the model is less "surprised" by the data. A perplexity equal to the vocabulary size $V$ corresponds to uniform guessing over the vocabulary.
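
As a quick sketch, perplexity can be computed from per-token logits with PyTorch's cross-entropy loss; the `perplexity` helper and toy shapes below are illustrative:

import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    # logits: (T, V) next-token scores; targets: (T,) actual next-token ids
    nll = F.cross_entropy(logits, targets)   # average negative log-likelihood, i.e. L
    return torch.exp(nll)                    # PPL = exp(L)

# Sanity check: uniform scores over a vocabulary of size V give PPL = V
V, T = 10_000, 128
uniform_logits = torch.zeros(T, V)
targets = torch.randint(0, V, (T,))
print(perplexity(uniform_logits, targets))   # ~10000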

Definition: N-gram Language Model

An $n$-gram model approximates the full history with the last $n-1$ tokens:

$$P(w_t \mid w_{<t}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

The probabilities are estimated from corpus counts:

$$\hat{P}(w_t \mid w_{t-n+1:t-1}) = \frac{\text{count}(w_{t-n+1}, \ldots, w_t)}{\text{count}(w_{t-n+1}, \ldots, w_{t-1})}$$

Smoothing (Laplace, Kneser-Ney) handles zero-count n-grams.

N-gram models are simple and interpretable but cannot capture long-range dependencies beyond the window of $n-1$ words.
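
A minimal sketch of a count-based bigram model ($n = 2$) with Laplace smoothing; the toy corpus and the `bigram_prob` helper are illustrative:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
V = len(vocab)

# Corpus counts for bigrams and their one-word contexts
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev, word, alpha=1.0):
    # P(word | prev) with Laplace (add-alpha) smoothing
    return (bigram_counts[(prev, word)] + alpha) / (context_counts[prev] + alpha * V)

print(bigram_prob("the", "cat"))   # seen bigram
print(bigram_prob("cat", "dog"))   # unseen bigram: nonzero only because of smoothing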

Definition: Scaled Dot-Product Self-Attention

Given input $\mathbf{X} \in \mathbb{R}^{T \times d}$, self-attention computes:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V$$

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

For autoregressive models, a causal mask sets future positions to $-\infty$ before the softmax, ensuring $w_t$ only attends to $w_{\le t}$.

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., T, T) pairwise scores
    if mask is not None:
        # Masked positions (mask == 0) become -inf and get zero weight after softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ V            # weighted sum of value vectors

The $\sqrt{d_k}$ scaling prevents the dot products from becoming too large, which would push softmax into regions with tiny gradients.
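
For example, a causal mask can be built with `torch.tril` and passed to the `attention` function above; the tensor shapes here are illustrative:

T, d_k = 5, 16
Q, K, V = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)

# Lower-triangular mask: position t may attend only to positions <= t
causal_mask = torch.tril(torch.ones(T, T))

out = attention(Q, K, V, mask=causal_mask)
print(out.shape)   # torch.Size([5, 16])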

Theorem: Self-Attention Computational Complexity

For a sequence of length $T$ with embedding dimension $d$:

  • Time complexity: $O(T^2 d)$ (computing all pairwise attention scores)
  • Space complexity: $O(T^2 + Td)$ (storing the attention matrix)

This quadratic scaling in $T$ is the primary bottleneck for long sequences and motivates efficient attention variants (linear attention, sparse attention, Flash Attention).

Each token must compute a similarity score with every other token in the sequence, yielding $T^2$ scores. For $T = 8192$ and $d = 4096$, computing all scores requires roughly $T^2 d \approx 2.7 \times 10^{11}$ multiply-accumulates, and the $T \times T$ attention matrix alone occupies about 268 MB (256 MiB) in fp32 per attention head.
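
The arithmetic behind that memory figure, as a quick check (assuming 4 bytes per fp32 element, one head, one batch element):

T = 8192
bytes_per_float = 4                              # fp32
attn_matrix_bytes = T * T * bytes_per_float      # one T x T score matrix
print(attn_matrix_bytes / 2**20, "MiB")          # 256.0 MiB, ~268 MB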

Example: RNN Language Model in PyTorch

Build and train a simple RNN language model on a small text corpus. Generate text by sampling from the model.
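
A minimal sketch of such a model, assuming integer-encoded tokens; the `RNNLanguageModel` class, hyperparameters, and toy data are illustrative, not a prescribed solution:

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        x = self.embed(tokens)              # (batch, T, embed_dim)
        out, hidden = self.rnn(x, hidden)   # (batch, T, hidden_dim)
        return self.head(out), hidden       # next-token logits at every position

vocab_size = 1000
model = RNNLanguageModel(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: inputs are tokens w_1..w_{T-1}, targets are w_2..w_T
batch = torch.randint(0, vocab_size, (8, 33))   # toy batch of token ids
logits, _ = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()

# Generation: sample a token, feed it back in, repeat
token, hidden = torch.zeros(1, 1, dtype=torch.long), None
for _ in range(20):
    logits, hidden = model(token, hidden)
    probs = torch.softmax(logits[:, -1], dim=-1)
    token = torch.multinomial(probs, 1)         # (1, 1) sampled next token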

Example: Building a Transformer Block

Implement a single transformer decoder block with self-attention, layer normalization, and a feed-forward network.
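
One possible sketch using PyTorch's built-in `nn.MultiheadAttention`; the pre-norm layout and dimensions are illustrative choices, not the only valid ones:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Boolean causal mask: True marks future positions that must not be attended to
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)

        h = self.norm1(x)                                    # pre-norm variant
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)   # masked self-attention
        x = x + attn_out                                     # residual connection

        return x + self.ff(self.norm2(x))                    # residual + feed-forward

block = DecoderBlock()
x = torch.randn(2, 10, 256)    # (batch, T, d_model)
print(block(x).shape)          # torch.Size([2, 10, 256])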

Self-Attention Pattern Visualization

Visualize attention weights between tokens in a sentence


Language Model Architecture Comparison

N-gram, RNN, and Transformer approaches to language modeling.

Quick Check

A language model with perplexity 50 on a vocabulary of 10,000 words is equivalent to:

Choosing uniformly among 50 words at each step

Making errors on 50% of predictions

Having 50 parameters per word

Quick Check

Why does an autoregressive transformer use a causal mask?

To reduce memory usage

To prevent information leakage from future tokens during training

To speed up inference

Common Mistake: Ignoring Temperature During Generation

Mistake:

Always using temperature=1.0 or greedy decoding (temperature=0).

Correction:

Temperature controls the sharpness of the output distribution. Use temperature < 1.0 for more deterministic output (e.g., code generation) and temperature > 1.0 for more creative output. Top-k and nucleus (top-p) sampling provide additional control.
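
A sketch of temperature plus nucleus (top-p) sampling from a vector of next-token logits; the `sample_next_token` helper and default values are illustrative:

import torch

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    probs = torch.softmax(logits / temperature, dim=-1)

    # Nucleus (top-p): keep the smallest set of tokens whose cumulative mass reaches top_p
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs >= top_p] = 0.0   # drop tokens outside the nucleus
    sorted_probs /= sorted_probs.sum()                       # renormalize

    return sorted_idx[torch.multinomial(sorted_probs, 1)]

logits = torch.randn(10_000)            # next-token scores over a 10k-token vocabulary
print(sample_next_token(logits))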

Common Mistake: Forgetting Positional Encoding

Mistake:

Using a transformer without positional encoding.

Correction:

Self-attention is permutation-equivariant: it treats input as a set, not a sequence. Without positional encoding, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the". Use sinusoidal, learned, or rotary positional encodings.
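
A sketch of the sinusoidal variant, added to the token embeddings before the first attention layer; the helper name and toy shapes are illustrative:

import math
import torch

def sinusoidal_positional_encoding(T, d_model):
    position = torch.arange(T).unsqueeze(1)                                  # (T, 1)
    # Geometrically spaced frequencies across the embedding dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Added to the token embeddings so attention can distinguish positions
embeddings = torch.randn(16, 128)                  # (T, d_model)
x = embeddings + sinusoidal_positional_encoding(16, 128)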

Key Takeaway

Language modeling is the task of predicting the next token given previous tokens. The transformer's self-attention mechanism enables parallel training and captures long-range dependencies, making it the dominant architecture for modern NLP. Perplexity, the exponentiated per-token loss, quantifies how well a model predicts held-out text.

Key Takeaway

The NLP pipeline of tokenization, embedding, and modeling transforms raw text into structured predictions. Each stage has design choices (subword vs. word tokens, static vs. contextual embeddings, RNN vs. transformer) that trade off simplicity, speed, and capability.

Why This Matters: Attention and MIMO Detection

The attention mechanism's computation of pairwise affinities is structurally similar to MIMO detection algorithms that compute interference between spatial streams. Transformer-based MIMO detectors (e.g., DeepMIMO) use self-attention to learn inter-stream dependencies, replacing hand-crafted interference cancellation.

See full treatment in Chapter 49

Historical Note: Attention Is All You Need

2017

The transformer architecture was introduced by Vaswani et al. (2017) at Google. The paper title "Attention Is All You Need" was deliberately provocative β€” it replaced the dominant RNN/LSTM architecture entirely with self-attention. The paper has accumulated over 130,000 citations, making it one of the most cited papers in computer science history.

Language Model

A probabilistic model that assigns probabilities to sequences of tokens, typically factorized as a product of conditional next-token probabilities.

Related: Perplexity

Perplexity

The exponential of the average negative log-likelihood per token; it measures how well a language model predicts held-out text. Lower is better.

Related: Language Model

Self-Attention

A mechanism that computes weighted sums of all positions in a sequence, where weights are determined by pairwise query-key similarity scores.

Related: Transformer

Transformer

A neural network architecture based entirely on self-attention mechanisms, without recurrence or convolution, enabling parallel training on sequences.

Related: Self-Attention

Language Model Architecture Comparison

| Architecture | Complexity per Step | Parallelism | Long-Range Dependencies |
| --- | --- | --- | --- |
| N-gram | $O(1)$ | Fully parallel | Limited to $n-1$ words |
| RNN/LSTM | $O(d^2)$ | Sequential | Moderate (vanishing gradients) |
| Transformer | $O(T^2 d)$ | Fully parallel | Full sequence length |
| Linear Attention | $O(T d^2)$ | Fully parallel | Full (approximate) |