Text Representation
From Text to Numbers
Neural networks operate on numbers, not strings. The fundamental challenge of NLP is converting text into numerical representations that preserve meaning. This section covers the pipeline: raw text -> tokenization -> integer IDs -> dense vectors.
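As a minimal sketch of this pipeline, the snippet below uses the Hugging Face GPT-2 tokenizer (introduced later in this section) and a randomly initialized torch.nn.Embedding table standing in for a model's learned embedding matrix; the embedding width of 16 is an arbitrary choice for illustration.
import torch
from transformers import AutoTokenizer

# raw text -> tokens -> integer IDs
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Hello, world!", return_tensors="pt")["input_ids"]   # shape: (1, seq_len)

# integer IDs -> dense vectors via an embedding lookup
embedding = torch.nn.Embedding(tok.vocab_size, 16)             # |V| x 16 table (random init)
vectors = embedding(ids)                                        # shape: (1, seq_len, 16)
print(ids.shape, vectors.shape)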
Definition: Tokenization
Tokenization splits raw text into a sequence of discrete units called tokens. Common strategies:
- Word-level: split on whitespace/punctuation.
- Character-level: each character is a token.
- Subword (BPE, WordPiece, SentencePiece): merge frequent character pairs iteratively.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Hello, world!")
print(tok.convert_ids_to_tokens(ids))
# ['Hello', ',', 'Ġworld', '!']  (Ġ marks a preceding space in GPT-2's byte-level BPE)
Subword tokenization is the dominant modern approach. It handles out-of-vocabulary words gracefully by decomposing them into known subunits.
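To see this decomposition in action, the short check below feeds a rare word to the GPT-2 tokenizer; the exact pieces it produces depend on the learned merge rules, so no particular split is asserted here.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# A rare word is broken into known subword pieces instead of being mapped to <UNK>
print(tok.convert_ids_to_tokens(tok.encode("electroencephalography")))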
Definition: Byte Pair Encoding (BPE)
BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols. Starting from characters:
1. Initialize the vocabulary with all unique characters.
2. Count all adjacent symbol pairs in the corpus.
3. Merge the most frequent pair into a new symbol.
4. Repeat steps 2-3 until the desired vocabulary size is reached.
The merge rules are saved and applied deterministically at inference. GPT-2/3/4 use byte-level BPE; GPT-2's vocabulary, for example, contains 50,257 tokens.
# Simplified BPE training (count_pairs and merge_pair are defined below)
def train_bpe(corpus: list[str], num_merges: int) -> list[str]:
    # Represent each word as a list of symbols, starting from single characters
    words = [list(word) for word in corpus]
    vocab = sorted(set(c for w in words for c in w))
    for _ in range(num_merges):
        pairs = count_pairs(words)        # frequencies of adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        words = merge_pair(words, best)   # apply the merge across the corpus
        vocab.append(best[0] + best[1])   # the merged pair becomes a new symbol
    return vocab
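The sketch above leaves count_pairs and merge_pair undefined; one possible implementation, assuming each word is kept as a list of symbols as in train_bpe, is shown below with a small usage example.
from collections import Counter

def count_pairs(words: list[list[str]]) -> Counter:
    # Count adjacent symbol pairs across all words in the corpus
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs

def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    # Replace every occurrence of `pair` with the concatenated new symbol
    merged, out = pair[0] + pair[1], []
    for w in words:
        new_w, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                new_w.append(merged)
                i += 2
            else:
                new_w.append(w[i])
                i += 1
        out.append(new_w)
    return out

print(train_bpe(["low", "lower", "lowest"], num_merges=3))
# ['e', 'l', 'o', 'r', 's', 't', 'w', 'lo', 'low', 'lowe']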
Definition: One-Hot Encoding
A token with index $i$ in a vocabulary $V$ is represented as the one-hot vector
$$\mathbf{e}_i \in \{0, 1\}^{|V|}, \qquad (\mathbf{e}_i)_j = \begin{cases} 1 & j = i \\ 0 & \text{otherwise.} \end{cases}$$
Properties: $\mathbf{e}_i^\top \mathbf{e}_j = 0$ and $\|\mathbf{e}_i - \mathbf{e}_j\|_2 = \sqrt{2}$ for $i \neq j$ (all words are equidistant), and the dimension equals the vocabulary size $|V|$.
One-hot vectors are extremely sparse and encode no semantic similarity. This motivates dense embedding representations.
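A minimal NumPy check of these properties:
import numpy as np

vocab_size = 5
e = np.eye(vocab_size)              # row i is the one-hot vector for token index i
print(e[3])                          # [0. 0. 0. 1. 0.]
print(e[2] @ e[3])                   # 0.0 -> orthogonal: no similarity encoded
print(np.linalg.norm(e[2] - e[3]))   # 1.414... = sqrt(2): all pairs equidistant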
Definition: Bag of Words (BoW) and TF-IDF
The bag-of-words representation counts token occurrences: a document $d$ is mapped to a vector $\mathbf{x} \in \mathbb{N}^{|V|}$ with $x_w = \mathrm{tf}(w, d)$, the number of times token $w$ appears in $d$.
TF-IDF (Term Frequency - Inverse Document Frequency) reweights:
$$\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{N}{\mathrm{df}(w)},$$
where $\mathrm{tf}(w, d)$ is the term frequency of $w$ in document $d$, $\mathrm{df}(w)$ is the number of documents containing $w$, and $N$ is the total number of documents.
TF-IDF is still useful for baseline retrieval and feature engineering. Modern systems use learned embeddings, but TF-IDF provides a strong non-neural baseline.
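As a sketch of the raw formula on a toy corpus (the documents and query terms here are made up for illustration): note that scikit-learn's TfidfVectorizer, used in the retrieval example below, applies a smoothed idf and L2 normalization, so its numbers differ slightly.
import math

# Toy corpus of N = 3 "documents" (already tokenized by whitespace)
docs = [
    "mimo channel estimation with deep learning".split(),
    "ofdm modulation for broadband wireless".split(),
    "mimo precoding for massive antenna arrays".split(),
]
N = len(docs)

def tfidf(word, doc):
    tf = doc.count(word)                    # term frequency in this document
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    return tf * math.log(N / df)

print(tfidf("mimo", docs[0]))        # in 2 of 3 docs -> idf = log(3/2), about 0.41
print(tfidf("estimation", docs[0]))  # in 1 of 3 docs -> idf = log(3), about 1.10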
Definition: Vocabulary and Token-to-Index Mapping
A vocabulary is a finite set of tokens $V = \{t_0, t_1, \dots, t_{|V|-1}\}$ with a bijective mapping to integer indices:
$$\mathrm{id}: V \to \{0, 1, \dots, |V| - 1\}.$$
Special tokens include <PAD> (padding), <UNK> (unknown),
<BOS> (beginning of sequence), and <EOS> (end of sequence).
vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4}
def encode(text):
    # Map each whitespace-split token to its index, falling back to <UNK>
    return [vocab.get(w, vocab["<UNK>"]) for w in text.split()]
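Usage with the toy vocabulary above; any out-of-vocabulary word (the input "dog" here is just a hypothetical example) falls back to the <UNK> index.
print(encode("the cat sat"))   # [2, 3, 4]
print(encode("the dog sat"))   # [2, 1, 4]  ("dog" is OOV -> <UNK>)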
Theorem: BPE and Compression
BPE with $k$ merges applied to a corpus of $N_0$ initial character tokens produces an encoding whose total number of tokens $N_k$ satisfies:
$$N_k \le N_0 - k.$$
Each merge reduces the corpus length by at least 1 (when the merged pair appears at least once). BPE approximates the byte-level compression achieved by dictionary coding.
BPE greedily finds the most common pair and replaces two tokens with one, shrinking the representation. Frequent words get short encodings (like Huffman coding for substrings).
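For instance, the toy corpus abababab has $N_0 = 8$ character tokens; after the single merge $(a, b) \to ab$ it is encoded in 4 tokens, a reduction of 4, consistent with $N_1 \le N_0 - 1$.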
Example: Comparing Tokenizers on the Same Text
Tokenize "The telecommunications engineer analyzed 5G NR throughput" using word, character, and BPE tokenizers. Compare token counts.
Implementation
from transformers import AutoTokenizer
text = "The telecommunications engineer analyzed 5G NR throughput"
# Word tokenizer (whitespace)
word_tokens = text.split()
print(f"Word tokens ({len(word_tokens)}): {word_tokens}")
# Character tokenizer
char_tokens = list(text)
print(f"Char tokens ({len(char_tokens)}): {char_tokens[:10]}...")
# GPT-2 BPE
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bpe_ids = gpt2_tok.encode(text)
bpe_tokens = gpt2_tok.convert_ids_to_tokens(bpe_ids)
print(f"BPE tokens ({len(bpe_tokens)}): {bpe_tokens}")
Analysis
Word: 7 tokens (cannot handle subwords or OOV). Character: 57 tokens (too fine-grained, long sequences). BPE: ~9 tokens (good balance: "telecommunications" splits into known subwords, numbers handled naturally).
Example: TF-IDF for Retrieving Relevant Papers
Given 3 paper abstracts, use TF-IDF to find which is most relevant to the query "MIMO channel estimation".
Implementation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "We propose a deep learning approach for MIMO channel estimation.",
    "This paper studies OFDM modulation for broadband wireless.",
    "Compressed sensing enables sparse signal recovery in imaging.",
]
query = "MIMO channel estimation"

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs + [query])      # fit on documents plus the query
sims = cosine_similarity(tfidf[-1:], tfidf[:-1])[0]   # query vs. each document

for i, (doc, sim) in enumerate(zip(docs, sims)):
    print(f"Doc {i}: sim={sim:.3f} {doc[:50]}...")
Result
Doc 0 has the highest cosine similarity because it shares "MIMO", "channel", and "estimation" with the query.
Example: Building a Vocabulary from a Corpus
Build a vocabulary from a list of sentences, with minimum frequency filtering and special tokens.
Implementation
from collections import Counter

corpus = [
    "the channel is fading",
    "the signal is strong",
    "fading channel estimation",
]

# Count all tokens
counter = Counter()
for sent in corpus:
    counter.update(sent.lower().split())

# Filter by minimum frequency
min_freq = 1
special = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]
vocab = {tok: i for i, tok in enumerate(special)}
for word, count in counter.most_common():
    if count >= min_freq:
        vocab[word] = len(vocab)

print(f"Vocabulary size: {len(vocab)}")
print(vocab)
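With min_freq = 1 nothing is filtered here, so the vocabulary contains all 7 distinct words plus the 4 special tokens (size 11); raising min_freq to 2 would drop the singletons signal, strong, and estimation, leaving a vocabulary of size 8.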
Tokenizer Explorer (interactive): compare word, character, and BPE tokenization on custom text.
TF-IDF Weight Visualization (interactive): visualize TF-IDF weights across documents and terms.
Quick Check
Which tokenization strategy best handles out-of-vocabulary words?
- Word-level
- Character-level
- Subword (BPE)
BPE decomposes unknown words into known subword units, preserving partial meaning while keeping sequence length manageable.
Common Mistake: Too Many <UNK> Tokens
Mistake:
Using a fixed word-level vocabulary and mapping all unseen words to <UNK>, which discards whatever information those words carried.
Correction:
Use subword tokenization (BPE or SentencePiece) which decomposes any word into known subunits, eliminating OOV issues.
Historical Note: From Bag of Words to Embeddings
1950s-2013: The bag-of-words model dates to the 1950s (Luhn, 1957). TF-IDF was formalized by Sparck Jones (1972). These sparse representations dominated information retrieval for decades until dense embeddings (Word2Vec, 2013) showed that learned representations capture semantic relationships that hand-crafted features miss entirely.
Token
The atomic unit of text processing — a word, subword, or character that maps to an integer index in the vocabulary.
Related: Vocabulary, Byte Pair Encoding (BPE)
Vocabulary
The complete set of tokens recognized by a model, with a bijective mapping between tokens and integer indices.
Related: Token
Byte Pair Encoding (BPE)
A subword tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols to build a vocabulary of variable-length tokens.
Related: Token