Text Representation

From Text to Numbers

Neural networks operate on numbers, not strings. The fundamental challenge of NLP is converting text into numerical representations that preserve meaning. This section covers the pipeline: raw text -> tokens -> integer IDs -> dense vectors.

Definition:

Tokenization

Tokenization splits raw text into a sequence of discrete units called tokens. Common strategies:

  1. Word-level: split on whitespace/punctuation. \text{"the cat sat"} \to [\text{the}, \text{cat}, \text{sat}]
  2. Character-level: each character is a token. \text{"cat"} \to [\text{c}, \text{a}, \text{t}]
  3. Subword (BPE, WordPiece, SentencePiece): merge frequent character pairs iteratively. \text{"unbelievable"} \to [\text{un}, \text{believ}, \text{able}]
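
For example, the GPT-2 tokenizer from the Hugging Face transformers library shows subword splitting in action:
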
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Hello, world!")       # text -> integer IDs
print(tok.convert_ids_to_tokens(ids))
# ['Hello', ',', 'Ġworld', '!']  (Ġ marks a preceding space in byte-level BPE)

Subword tokenization is the dominant modern approach. It handles out-of-vocabulary words gracefully by decomposing them into known subunits.
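
To see this, reuse the GPT-2 tokenizer loaded above on a rare word; the exact pieces depend on the learned merge rules:

print(tok.tokenize("unbelievableness"))
# splits into known subword pieces rather than an <UNK> token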

Definition:

Byte Pair Encoding (BPE)

BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols. Starting from characters:

  1. Initialize vocabulary with all unique characters
  2. Count all adjacent symbol pairs in the corpus
  3. Merge the most frequent pair into a new symbol
  4. Repeat steps 2-3 until the desired vocabulary size V is reached

The merge rules are saved and applied deterministically at inference. GPT-2 and GPT-3 use byte-level BPE with V \approx 50{,}000; GPT-4's tokenizer is also byte-level BPE with a larger vocabulary (V \approx 100{,}000).

# Simplified BPE training: each word is handled as a list of symbols,
# starting from single characters.
def train_bpe(corpus: list[str], num_merges: int) -> list[str]:
    words = [list(word) for word in corpus]
    vocab = sorted(set(c for word in words for c in word))
    for _ in range(num_merges):
        pairs = count_pairs(words)        # adjacent-pair frequencies
        if not pairs:                     # nothing left to merge
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)   # replace that pair everywhere
        vocab.append(best[0] + best[1])   # the merge becomes a new symbol
    return vocab
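
The helpers count_pairs and merge_pair are left abstract above; a minimal sketch, assuming each word is a list of symbols as in train_bpe:

from collections import Counter

def count_pairs(words: list[list[str]]) -> Counter:
    # count every adjacent symbol pair across all words
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs

def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    # replace each occurrence of `pair` with the single merged symbol
    merged, out = pair[0] + pair[1], []
    for w in words:
        new, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(w[i])
                i += 1
        out.append(new)
    return out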

Definition:

One-Hot Encoding

A token with index i in vocabulary \mathcal{V} is represented as:

\mathbf{x}_i = [0, \ldots, 0, \underbrace{1}_{i\text{-th}}, 0, \ldots, 0]^T \in \{0,1\}^V

Properties: \mathbf{x}_i^T \mathbf{x}_j = \delta_{ij} (all distinct tokens are orthogonal, hence equidistant), and the dimension equals the vocabulary size (V \sim 50{,}000).

One-hot vectors are extremely sparse and encode no semantic similarity. This motivates dense embedding representations.
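
A minimal NumPy sketch illustrating both properties:

import numpy as np

V = 5                # toy vocabulary size
X = np.eye(V)        # row i is the one-hot vector x_i
print(X[2] @ X[2])   # 1.0: unit norm
print(X[2] @ X[3])   # 0.0: orthogonal, so no similarity signal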

Definition:

Bag of Words (BoW) and TF-IDF

The bag-of-words representation counts token occurrences:

\mathbf{b}(d) = [\text{count}(w_1, d), \ldots, \text{count}(w_V, d)]^T

TF-IDF (Term Frequency - Inverse Document Frequency) reweights these counts, downweighting terms that appear in many documents:

\text{TF-IDF}(w, d) = \text{tf}(w, d) \cdot \log\frac{N}{\text{df}(w)}

where \text{tf}(w, d) is the term frequency in document d, \text{df}(w) is the number of documents containing w, and N is the total number of documents.

TF-IDF is still useful for baseline retrieval and feature engineering. Modern systems use learned embeddings, but TF-IDF provides a strong non-neural baseline.
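
A baseline sketch with scikit-learn; note that TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its weights differ slightly from the formula above:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # (3, V) sparse TF-IDF matrix
print(vec.get_feature_names_out())    # learned vocabulary
print(X.toarray().round(2))           # one weighted row per document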

Definition:

Vocabulary and Token-to-Index Mapping

A vocabulary \mathcal{V} is a finite set of tokens with a bijective mapping to integer indices:

\text{encode}: \mathcal{V} \to \{0, 1, \ldots, V-1\}, \qquad \text{decode}: \{0, 1, \ldots, V-1\} \to \mathcal{V}

Special tokens include <PAD> (padding), <UNK> (unknown), <BOS> (beginning of sequence), and <EOS> (end of sequence).

vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4}

def encode(text: str) -> list[int]:
    # unseen words fall back to the <UNK> index
    return [vocab.get(w, vocab["<UNK>"]) for w in text.split()]
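
The inverse mapping can be built from the same dictionary; a minimal sketch:

inv_vocab = {i: w for w, i in vocab.items()}

def decode(ids: list[int]) -> str:
    # inverse of encode: map indices back to tokens
    return " ".join(inv_vocab[i] for i in ids)

print(decode(encode("the cat sat")))  # 'the cat sat'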

Theorem: BPE and Compression

BPE with k merges applied to a corpus C produces an encoding where the total number of tokens satisfies:

|C_{\text{BPE}}| \le |C_{\text{char}}| - k

Each merge reduces the corpus length by at least 1 (when the merged pair appears at least once). BPE approximates the byte-level compression achieved by dictionary coding.

BPE greedily finds the most common pair and replaces two tokens with one, shrinking the representation. Frequent words get short encodings (like Huffman coding for substrings).
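
An empirical check of the bound, reusing the count_pairs and merge_pair sketches from above:

words = [list(w) for w in ["low", "low", "lower", "newest", "widest"]]
char_len = sum(len(w) for w in words)   # |C_char| = 23
k = 3
for _ in range(k):
    pairs = count_pairs(words)
    words = merge_pair(words, max(pairs, key=pairs.get))
bpe_len = sum(len(w) for w in words)
assert bpe_len <= char_len - k          # each merge removes at least one token
print(char_len, bpe_len)                # e.g. 23 -> 15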

Example: Comparing Tokenizers on the Same Text

Tokenize "The telecommunications engineer analyzed 5G NR throughput" using word, character, and BPE tokenizers. Compare token counts.

Example: TF-IDF for Retrieving Relevant Papers

Given 3 paper abstracts, use TF-IDF to find which is most relevant to the query "MIMO channel estimation".
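
A sketch of the retrieval setup; the abstracts below are hypothetical placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "deep learning for image classification",        # hypothetical abstract 0
    "MIMO channel estimation with pilot sequences",  # hypothetical abstract 1
    "graph algorithms for social network analysis",  # hypothetical abstract 2
]
vec = TfidfVectorizer()
X = vec.fit_transform(abstracts)
query = vec.transform(["MIMO channel estimation"])
scores = cosine_similarity(query, X)[0]
print(scores.argmax())  # index of the most relevant abstract (here: 1)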

Example: Building a Vocabulary from a Corpus

Build a vocabulary from a list of sentences, with minimum frequency filtering and special tokens.
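
A minimal sketch, assuming whitespace tokenization and the special tokens defined earlier:

from collections import Counter

def build_vocab(sentences: list[str], min_freq: int = 2) -> dict[str, int]:
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
    for word, freq in counts.most_common():
        if freq >= min_freq and word not in vocab:
            vocab[word] = len(vocab)
    return vocab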

Tokenizer Explorer

Compare word, character, and BPE tokenization on custom text


TF-IDF Weight Visualization

Visualize TF-IDF weights across documents and terms


Quick Check

Which tokenization strategy best handles out-of-vocabulary words?

Word-level

Character-level

Subword (BPE)

Common Mistake: Collapsing Unseen Words to <UNK>

Mistake:

Using a fixed word-level vocabulary and mapping all unseen words to <UNK>.

Correction:

Use subword tokenization (BPE or SentencePiece) which decomposes any word into known subunits, eliminating OOV issues.

Historical Note: From Bag of Words to Embeddings

1950s-2013

The bag-of-words model dates to the 1950s (Luhn, 1957). TF-IDF was formalized by Spärck Jones (1972). These sparse representations dominated information retrieval for decades until dense embeddings (Word2Vec, 2013) showed that learned representations capture semantic relationships that hand-crafted features miss entirely.

Token

The atomic unit of text processing — a word, subword, or character that maps to an integer index in the vocabulary.

Related: Vocabulary, Byte Pair Encoding (BPE)

Vocabulary

The complete set of tokens recognized by a model, with a bijective mapping between tokens and integer indices.

Related: Token

Byte Pair Encoding (BPE)

A subword tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols to build a vocabulary of variable-length tokens.

Related: Token