Text Representation

From Text to Numbers

Neural networks operate on numbers, not strings. The fundamental challenge of NLP is converting text into numerical representations that preserve meaning. This section covers the pipeline: raw text -> tokens -> integer IDs -> dense vectors.

Definition:

Tokenization

Tokenization splits raw text into a sequence of discrete units called tokens. Common strategies:

  1. Word-level: split on whitespace/punctuation. \text{"the cat sat"} \to [\text{the}, \text{cat}, \text{sat}]
  2. Character-level: each character is a token. \text{"cat"} \to [\text{c}, \text{a}, \text{t}]
  3. Subword (BPE, WordPiece, SentencePiece): merge frequent character pairs iteratively. \text{"unbelievable"} \to [\text{un}, \text{believ}, \text{able}]
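
For example, the GPT-2 tokenizer from the Hugging Face transformers library shows subword splitting in action:
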
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Hello, world!")       # text -> integer IDs
print(tok.convert_ids_to_tokens(ids))
# ['Hello', ',', 'Ġworld', '!']  (Ġ marks a preceding space in byte-level BPE)

Subword tokenization is the dominant modern approach. It handles out-of-vocabulary words gracefully by decomposing them into known subunits.
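
To see this, reuse the GPT-2 tokenizer loaded above on a rare word; the exact pieces depend on the learned merge rules:

print(tok.tokenize("unbelievableness"))
# splits into known subword pieces rather than an <UNK> token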

Definition:

Byte Pair Encoding (BPE)

BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols. Starting from characters:

  1. Initialize vocabulary with all unique characters
  2. Count all adjacent symbol pairs in the corpus
  3. Merge the most frequent pair into a new symbol
  4. Repeat steps 2-3 until the desired vocabulary size V is reached

The merge rules are saved and applied deterministically at inference. GPT-2 and GPT-3 use byte-level BPE with V \approx 50{,}000; GPT-4's tokenizer is also byte-level BPE with a larger vocabulary (V \approx 100{,}000).

# Simplified BPE training: each word is handled as a list of symbols,
# starting from single characters.
def train_bpe(corpus: list[str], num_merges: int) -> list[str]:
    words = [list(word) for word in corpus]
    vocab = sorted(set(c for word in words for c in word))
    for _ in range(num_merges):
        pairs = count_pairs(words)        # adjacent-pair frequencies
        if not pairs:                     # nothing left to merge
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)   # replace that pair everywhere
        vocab.append(best[0] + best[1])   # the merge becomes a new symbol
    return vocab
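
The helpers count_pairs and merge_pair are left abstract above; a minimal sketch, assuming each word is a list of symbols as in train_bpe:

from collections import Counter

def count_pairs(words: list[list[str]]) -> Counter:
    # count every adjacent symbol pair across all words
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs

def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    # replace each occurrence of `pair` with the single merged symbol
    merged, out = pair[0] + pair[1], []
    for w in words:
        new, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(w[i])
                i += 1
        out.append(new)
    return out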

Definition:

One-Hot Encoding

A token with index i in vocabulary \mathcal{V} is represented as:

\mathbf{x}_i = [0, \ldots, 0, \underbrace{1}_{i\text{-th}}, 0, \ldots, 0]^T \in \{0,1\}^V

Properties: \mathbf{x}_i^T \mathbf{x}_j = \delta_{ij} (all distinct tokens are orthogonal, hence equidistant), and the dimension equals the vocabulary size (V \sim 50{,}000).

One-hot vectors are extremely sparse and encode no semantic similarity. This motivates dense embedding representations.
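
A minimal NumPy sketch illustrating both properties:

import numpy as np

V = 5                # toy vocabulary size
X = np.eye(V)        # row i is the one-hot vector x_i
print(X[2] @ X[2])   # 1.0: unit norm
print(X[2] @ X[3])   # 0.0: orthogonal, so no similarity signal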

Definition:

Bag of Words (BoW) and TF-IDF

The bag-of-words representation counts token occurrences:

\mathbf{b}(d) = [\text{count}(w_1, d), \ldots, \text{count}(w_V, d)]^T

TF-IDF (Term Frequency - Inverse Document Frequency) reweights these counts, downweighting terms that appear in many documents:

\text{TF-IDF}(w, d) = \text{tf}(w, d) \cdot \log\frac{N}{\text{df}(w)}

where \text{tf}(w, d) is the term frequency in document d, \text{df}(w) is the number of documents containing w, and N is the total number of documents.

TF-IDF is still useful for baseline retrieval and feature engineering. Modern systems use learned embeddings, but TF-IDF provides a strong non-neural baseline.
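
A baseline sketch with scikit-learn; note that TfidfVectorizer uses a smoothed IDF and L2 normalization by default, so its weights differ slightly from the formula above:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # (3, V) sparse TF-IDF matrix
print(vec.get_feature_names_out())    # learned vocabulary
print(X.toarray().round(2))           # one weighted row per document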

Definition:

Vocabulary and Token-to-Index Mapping

A vocabulary \mathcal{V} is a finite set of tokens with a bijective mapping to integer indices:

\text{encode}: \mathcal{V} \to \{0, 1, \ldots, V-1\}, \qquad \text{decode}: \{0, 1, \ldots, V-1\} \to \mathcal{V}

Special tokens include <PAD> (padding), <UNK> (unknown), <BOS> (beginning of sequence), and <EOS> (end of sequence).

vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "cat": 3, "sat": 4}

def encode(text: str) -> list[int]:
    # unseen words fall back to the <UNK> index
    return [vocab.get(w, vocab["<UNK>"]) for w in text.split()]
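
The inverse mapping can be built from the same dictionary; a minimal sketch:

inv_vocab = {i: w for w, i in vocab.items()}

def decode(ids: list[int]) -> str:
    # inverse of encode: map indices back to tokens
    return " ".join(inv_vocab[i] for i in ids)

print(decode(encode("the cat sat")))  # 'the cat sat'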

Theorem: BPE and Compression

BPE with k merges applied to a corpus C produces an encoding where the total number of tokens satisfies:

|C_{\text{BPE}}| \le |C_{\text{char}}| - k

Each merge reduces the corpus length by at least 1 (when the merged pair appears at least once). BPE approximates the byte-level compression achieved by dictionary coding.

BPE greedily finds the most common pair and replaces two tokens with one, shrinking the representation. Frequent words get short encodings (like Huffman coding for substrings).
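
An empirical check of the bound, reusing the count_pairs and merge_pair sketches from above:

words = [list(w) for w in ["low", "low", "lower", "newest", "widest"]]
char_len = sum(len(w) for w in words)   # |C_char| = 23
k = 3
for _ in range(k):
    pairs = count_pairs(words)
    words = merge_pair(words, max(pairs, key=pairs.get))
bpe_len = sum(len(w) for w in words)
assert bpe_len <= char_len - k          # each merge removes at least one token
print(char_len, bpe_len)                # e.g. 23 -> 15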

Example: Comparing Tokenizers on the Same Text

Tokenize "The telecommunications engineer analyzed 5G NR throughput" using word, character, and BPE tokenizers. Compare token counts.

Example: TF-IDF for Retrieving Relevant Papers

Given 3 paper abstracts, use TF-IDF to find which is most relevant to the query "MIMO channel estimation".
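
A sketch of the retrieval setup; the abstracts below are hypothetical placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "deep learning for image classification",        # hypothetical abstract 0
    "MIMO channel estimation with pilot sequences",  # hypothetical abstract 1
    "graph algorithms for social network analysis",  # hypothetical abstract 2
]
vec = TfidfVectorizer()
X = vec.fit_transform(abstracts)
query = vec.transform(["MIMO channel estimation"])
scores = cosine_similarity(query, X)[0]
print(scores.argmax())  # index of the most relevant abstract (here: 1)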

Example: Building a Vocabulary from a Corpus

Build a vocabulary from a list of sentences, with minimum frequency filtering and special tokens.
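
A minimal sketch, assuming whitespace tokenization and the special tokens defined earlier:

from collections import Counter

def build_vocab(sentences: list[str], min_freq: int = 2) -> dict[str, int]:
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
    for word, freq in counts.most_common():
        if freq >= min_freq and word not in vocab:
            vocab[word] = len(vocab)
    return vocab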

Tokenizer Explorer

Compare word, character, and BPE tokenization on custom text


TF-IDF Weight Visualization

Visualize TF-IDF weights across documents and terms


Quick Check

Which tokenization strategy best handles out-of-vocabulary words?

Word-level

Character-level

Subword (BPE)

Common Mistake: Collapsing Unseen Words to <UNK>

Mistake:

Using a fixed word-level vocabulary and mapping all unseen words to <UNK>.

Correction:

Use subword tokenization (BPE or SentencePiece) which decomposes any word into known subunits, eliminating OOV issues.

Historical Note: From Bag of Words to Embeddings

1950s-2013

The bag-of-words model dates to the 1950s (Luhn, 1957). TF-IDF was formalized by Spärck Jones (1972). These sparse representations dominated information retrieval for decades until dense embeddings (Word2Vec, 2013) showed that learned representations capture semantic relationships that hand-crafted features miss entirely.

Token

The atomic unit of text processing — a word, subword, or character that maps to an integer index in the vocabulary.

Related: Vocabulary, Byte Pair Encoding (BPE)

Vocabulary

The complete set of tokens recognized by a model, with a bijective mapping between tokens and integer indices.

Related: Token

Byte Pair Encoding (BPE)

A subword tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols to build a vocabulary of variable-length tokens.

Related: Token