Word Embeddings

Why Dense Embeddings?

One-hot vectors treat every word as equally different. Dense embeddings map words into a continuous vector space where semantic similarity corresponds to geometric proximity: \text{sim}(\mathbf{e}_\text{king}, \mathbf{e}_\text{queen}) > \text{sim}(\mathbf{e}_\text{king}, \mathbf{e}_\text{banana}).

Definition:

Word Embedding

A word embedding is a learned mapping f: \mathcal{V} \to \mathbb{R}^d that assigns each token a dense vector. The embedding matrix \mathbf{E} \in \mathbb{R}^{V \times d} stores all vectors:

\mathbf{e}_w = \mathbf{E}[w, :] = \mathbf{E}^T \mathbf{x}_w

where \mathbf{x}_w is the one-hot vector for word w.

import torch
import torch.nn as nn

# Lookup table: 50,000 vocabulary entries, each mapped to a 256-dim vector
embed = nn.Embedding(num_embeddings=50000, embedding_dim=256)
print(embed(torch.tensor([42])).shape)  # torch.Size([1, 256])

The embedding lookup is equivalent to a matrix multiplication with the one-hot vector, but implemented as an index lookup for efficiency.
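A minimal sketch of that equivalence (vocabulary size and dimension are arbitrary toy values): multiplying the one-hot vector by the embedding matrix returns the same row as a direct index lookup.

import torch
import torch.nn as nn

V, d = 10, 4                       # toy vocabulary size and embedding dimension
embed = nn.Embedding(V, d)         # weight matrix E has shape (V, d)

w = 3                              # word index
x_w = torch.zeros(V)
x_w[w] = 1.0                       # one-hot vector for word w

via_matmul = x_w @ embed.weight    # E^T x_w, written as a row-vector product
via_lookup = embed(torch.tensor(w))

print(torch.allclose(via_matmul, via_lookup))  # True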

Definition:

Word2Vec: Skip-gram Model

Skip-gram predicts context words w_c given a center word w_t:

P(w_c \mid w_t) = \frac{\exp(\mathbf{u}_{w_c}^T \mathbf{v}_{w_t})}{\sum_{w \in \mathcal{V}} \exp(\mathbf{u}_w^T \mathbf{v}_{w_t})}

where \mathbf{v}_w and \mathbf{u}_w are the input and output embeddings. The loss over the corpus is:

\mathcal{L} = -\sum_{(w_t, w_c)} \log P(w_c \mid w_t)
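A minimal sketch of this objective for a single (center, context) pair, using randomly initialized tensors as stand-ins for the input and output embedding tables (vocabulary size and dimension are illustrative):

import torch
import torch.nn.functional as F

V, d = 1000, 64
v = torch.randn(V, d)              # input (center-word) embeddings v_w
u = torch.randn(V, d)              # output (context-word) embeddings u_w

w_t, w_c = 17, 42                  # center and context word indices
scores = u @ v[w_t]                # u_w^T v_{w_t} for every word in the vocabulary
log_p = F.log_softmax(scores, dim=0)
loss = -log_p[w_c]                 # -log P(w_c | w_t)
print(loss.item())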

Negative sampling approximates the expensive softmax:

\mathcal{L}_{\text{NS}} = -\log \sigma(\mathbf{u}_{w_c}^T \mathbf{v}_{w_t}) - \sum_{k=1}^{K} \log \sigma(-\mathbf{u}_{w_k}^T \mathbf{v}_{w_t})

where w_k are K randomly sampled negative words.
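A hedged sketch of the negative-sampling loss with the same kind of toy embedding tables; negatives are drawn uniformly here, whereas Word2Vec samples them from a unigram distribution raised to the 3/4 power:

import torch
import torch.nn.functional as F

V, d, K = 1000, 64, 5
v = torch.randn(V, d)                        # center-word embeddings
u = torch.randn(V, d)                        # context-word embeddings

w_t, w_c = 17, 42
negatives = torch.randint(0, V, (K,))        # K negative samples (uniform here)

pos_score = u[w_c] @ v[w_t]                  # u_{w_c}^T v_{w_t}
neg_scores = u[negatives] @ v[w_t]           # u_{w_k}^T v_{w_t}, k = 1..K

loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()
print(loss.item())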

Definition:

GloVe: Global Vectors

GloVe learns embeddings by factorizing the log co-occurrence matrix. Given co-occurrence counts X_{ij}:

\mathcal{L} = \sum_{i,j} f(X_{ij}) \left(\mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2

where f(x) = \min\!\left((x/x_{\max})^\alpha, 1\right) is a weighting function that caps the influence of very frequent pairs.

The key insight: \mathbf{w}_i^T \tilde{\mathbf{w}}_j \approx \log P(j \mid i) (up to the bias terms), so dot products between word vectors capture both local and global co-occurrence statistics.

GloVe combines the benefits of matrix factorization (global statistics) with the efficiency of local context window methods (Word2Vec).
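A minimal sketch of the weighted least-squares objective on a toy co-occurrence matrix (vocabulary size is illustrative; x_max = 100 and alpha = 0.75 are the defaults from the GloVe paper):

import torch

V, d = 50, 16
X = torch.randint(0, 20, (V, V)).float()         # toy co-occurrence counts X_ij

w = torch.randn(V, d, requires_grad=True)        # word vectors w_i
w_tilde = torch.randn(V, d, requires_grad=True)  # context vectors w~_j
b = torch.zeros(V, requires_grad=True)           # biases b_i
b_tilde = torch.zeros(V, requires_grad=True)     # biases b~_j

x_max, alpha = 100.0, 0.75
f = torch.clamp((X / x_max) ** alpha, max=1.0)   # weighting f(X_ij)
mask = X > 0                                     # only observed pairs contribute

pred = w @ w_tilde.T + b[:, None] + b_tilde[None, :]
loss = (f * (pred - torch.log(X.clamp(min=1e-10))) ** 2)[mask].sum()
loss.backward()                                  # gradients flow to all four parameter sets
print(loss.item())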

Theorem: Linear Analogy Property of Embeddings

Well-trained word embeddings exhibit approximate linear analogies:

\mathbf{e}_\text{king} - \mathbf{e}_\text{man} + \mathbf{e}_\text{woman} \approx \mathbf{e}_\text{queen}

More precisely, the answer to "A is to B as C is to ?" is:

\arg\max_{w \in \mathcal{V}} \cos(\mathbf{e}_w, \mathbf{e}_B - \mathbf{e}_A + \mathbf{e}_C)

The embedding space encodes semantic relationships as approximately linear directions. The "gender" direction \mathbf{e}_\text{woman} - \mathbf{e}_\text{man} is roughly parallel to \mathbf{e}_\text{queen} - \mathbf{e}_\text{king}.
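A minimal sketch of the analogy query. A random matrix and toy word list stand in for trained vectors, so the answer is meaningless here; with trained embeddings, and with the query words excluded from the argmax as is standard practice, analogy("man", "king", "woman") would return "queen".

import torch
import torch.nn.functional as F

words = ["king", "queen", "man", "woman", "banana"]   # toy vocabulary
idx = {w: i for i, w in enumerate(words)}
E = torch.randn(len(words), 64)                       # stand-in for trained embeddings

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via cosine similarity."""
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    sims = F.cosine_similarity(E, target.unsqueeze(0), dim=1)
    sims[[idx[a], idx[b], idx[c]]] = -float("inf")    # exclude the query words
    return words[int(sims.argmax())]

print(analogy("man", "king", "woman"))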

Theorem: Embedding Dimension and Information Capacity

For a vocabulary of size V and embedding dimension d, the embedding matrix \mathbf{E} \in \mathbb{R}^{V \times d} has V \cdot d parameters. The effective capacity scales as:

\text{capacity} \propto d \cdot \log V

Empirically, downstream task performance improves with d up to a saturation point d^* \approx 200–400 for most NLP tasks.

Too few dimensions cannot capture the richness of language; too many lead to overfitting on finite corpora.
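The V \cdot d parameter count is easy to check directly; a quick sketch for a few illustrative vocabulary sizes and dimensions (the capacity scaling above is an empirical claim and is not derived here):

# Number of parameters in the embedding matrix E (shape V x d) is V * d
for V, d in [(50_000, 100), (50_000, 300), (400_000, 300)]:
    print(f"V={V:>9,}  d={d:>3}  parameters={V * d:>13,}")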

Example: Training Word2Vec from Scratch

Train a simple Word2Vec skip-gram model on a small telecom corpus and find nearest neighbors for "channel".
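A self-contained sketch of one way to do this. The tiny "telecom" corpus, window size, and hyperparameters below are placeholders, and the full softmax is only feasible because the vocabulary is tiny; a real run needs far more text, more epochs, and negative sampling.

import torch
import torch.nn as nn
import torch.nn.functional as F

corpus = [
    "the channel capacity depends on bandwidth and noise",
    "fading channel models describe the wireless link",
    "the receiver estimates the channel before decoding",
    "noise and interference degrade the channel quality",
]

# Build vocabulary and (center, context) pairs with a window of 2.
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 2
pairs = [(idx[sent[t]], idx[sent[c]])
         for sent in tokens
         for t in range(len(sent))
         for c in range(max(0, t - window), min(len(sent), t + window + 1))
         if c != t]

V, d = len(vocab), 32
v_emb = nn.Embedding(V, d)   # center-word (input) embeddings
u_emb = nn.Embedding(V, d)   # context-word (output) embeddings
opt = torch.optim.Adam(list(v_emb.parameters()) + list(u_emb.parameters()), lr=0.05)

centers = torch.tensor([p[0] for p in pairs])
contexts = torch.tensor([p[1] for p in pairs])
for epoch in range(200):
    scores = v_emb(centers) @ u_emb.weight.T         # full softmax over the tiny vocab
    loss = F.cross_entropy(scores, contexts)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Nearest neighbors of "channel" by cosine similarity of the input embeddings.
E = v_emb.weight.detach()
q = E[idx["channel"]]
sims = F.cosine_similarity(E, q.unsqueeze(0), dim=1)
top = sims.argsort(descending=True)[1:6]             # skip "channel" itself
print([vocab[int(i)] for i in top])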

Word Embedding Visualization
Explore word embeddings projected to 2D via t-SNE or PCA.

Embedding Analogy Explorer
Perform vector arithmetic on word embeddings: A - B + C = ?

Word2Vec Training Dynamics
Watch how word embeddings evolve during training.

Text-to-Embedding Pipeline
From raw text through tokenization, indexing, and embedding lookup to dense vectors.
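A minimal sketch of that pipeline; whitespace tokenization and a hand-built toy vocabulary stand in for a real subword tokenizer such as BPE.

import torch
import torch.nn as nn

text = "the channel capacity depends on bandwidth"

# 1. Tokenize (naive whitespace split; real systems use subword tokenizers).
tokens = text.lower().split()

# 2. Index: map each token to an integer id, with 0 reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "channel": 2, "capacity": 3,
         "depends": 4, "on": 5, "bandwidth": 6}
ids = torch.tensor([vocab.get(t, 0) for t in tokens])

# 3. Embedding lookup: ids -> dense vectors.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embed(ids)
print(ids.tolist())        # [1, 2, 3, 4, 5, 6]
print(vectors.shape)       # torch.Size([6, 8])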

Quick Check

What does Word2Vec skip-gram predict?

The center word given context words

Context words given the center word

The next word given all previous words

Common Mistake: Freezing Pre-Trained Embeddings

Mistake:

Always freezing pre-trained embeddings during fine-tuning.

Correction:

For small domain-specific datasets (e.g., telecom), fine-tuning embeddings often helps. Start with frozen embeddings, then unfreeze if validation loss plateaus. Use a smaller learning rate for the embedding layer.
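A hedged sketch of that recipe in PyTorch; the model structure and learning rates are illustrative, not a prescribed setup.

import torch
import torch.nn as nn

# Toy classifier: pre-trained-style embedding layer plus a task head.
embedding = nn.Embedding(10_000, 128)   # would be loaded from pre-trained vectors
head = nn.Linear(128, 2)

# Phase 1: freeze the embeddings; train only the head.
embedding.weight.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Phase 2 (e.g. when validation loss plateaus): unfreeze the embeddings
# and give them a smaller learning rate than the rest of the model.
embedding.weight.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": embedding.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-3},
])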

Why This Matters: Embeddings in Semantic Communication

In semantic communication, the transmitter encodes the meaning of a message (via embeddings) rather than the raw bits. The receiver reconstructs the meaning from the received signal. Word and sentence embeddings serve as the semantic space where distortion is measured in terms of meaning preservation rather than bit error rate.

See full treatment in Chapter 38

Historical Note: The Word2Vec Revolution

2013

Mikolov et al. published Word2Vec in 2013 at Google, demonstrating that simple neural networks trained on large text corpora learn embeddings with remarkable algebraic properties. The "king - man + woman = queen" example captured the imagination of the field and launched the modern era of representation learning in NLP.

Word Embedding

A dense, low-dimensional vector representation of a word learned from text data, where semantic similarity corresponds to geometric proximity.

Related: Word2Vec, GloVe

Word2Vec

A family of models (Skip-gram and CBOW) that learn word embeddings by predicting co-occurrence patterns in text, using shallow neural networks trained with negative sampling.

Related: Word Embedding

GloVe

Global Vectors for Word Representation β€” an embedding method that factorizes the log co-occurrence matrix to learn word vectors capturing both local and global corpus statistics.

Related: Word Embedding, Word2Vec

Word Embedding Methods Comparison

| Method | Training Signal | Strengths | Weaknesses |
|---|---|---|---|
| Word2Vec (SG) | Local context window | Fast training, good analogies | No global statistics |
| Word2Vec (CBOW) | Predict center from context | Faster than skip-gram | Weaker on rare words |
| GloVe | Global co-occurrence matrix | Captures global statistics | Large memory for co-occurrence matrix |
| FastText | Character n-grams | Handles OOV words | Larger model size |
| Contextual (BERT) | Full sentence context | Polysemy handling | Much slower, GPU required |

Key Takeaway

Word embeddings transform discrete tokens into continuous vectors where semantic similarity maps to geometric proximity. Word2Vec and GloVe produce static embeddings (one vector per word), while contextual models (BERT, GPT) produce dynamic embeddings that depend on the surrounding sentence.