Word Embeddings
Why Dense Embeddings?
One-hot vectors treat every word as equally different. Dense embeddings map words into a continuous vector space where semantic similarity corresponds to geometric proximity:

$$\text{sim}(w_1, w_2) = \cos(e_{w_1}, e_{w_2}) = \frac{e_{w_1}^\top e_{w_2}}{\lVert e_{w_1} \rVert \, \lVert e_{w_2} \rVert}.$$
Definition: Word Embedding
A word embedding is a learned mapping that assigns each token $w \in \mathcal{V}$ a dense vector $e_w \in \mathbb{R}^d$. The embedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$ stores all vectors:

$$e_w = E^\top x_w,$$

where $x_w \in \{0, 1\}^{|\mathcal{V}|}$ is the one-hot vector for word $w$.
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=50000, embedding_dim=256)
# embed(torch.tensor([42])) -> shape (1, 256)
The embedding lookup is equivalent to a matrix multiplication with the one-hot vector, but implemented as an index lookup for efficiency.
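To make the equivalence concrete, here is a quick sanity check on a toy 10-token vocabulary (the sizes are illustrative): the one-hot matrix product and the index lookup return the same row of the weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)

idx = torch.tensor([3])          # index of the token
one_hot = torch.zeros(1, 10)     # its one-hot representation
one_hot[0, 3] = 1.0

via_lookup = embed(idx)               # shape (1, 4), row 3 of the weight matrix
via_matmul = one_hot @ embed.weight   # same row, via explicit matrix product
print(torch.allclose(via_lookup, via_matmul))  # True
```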
Definition: Word2Vec: Skip-gram Model
Skip-gram predicts context words $w_o$ given a center word $w_c$:

$$P(w_o \mid w_c) = \frac{\exp\!\left(u_{w_o}^\top v_{w_c}\right)}{\sum_{w \in \mathcal{V}} \exp\!\left(u_{w}^\top v_{w_c}\right)},$$

where $v$ and $u$ are the input and output embeddings. The loss over the corpus is:

$$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t).$$

Negative sampling approximates the expensive softmax:

$$J_{\text{neg}} = -\log \sigma\!\left(u_{w_o}^\top v_{w_c}\right) - \sum_{k=1}^{K} \log \sigma\!\left(-u_{w_k}^\top v_{w_c}\right),$$

where $w_1, \dots, w_K \sim P_n(w)$ are randomly sampled negative words.
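As a concrete reference, here is a minimal sketch of the negative-sampling loss for a single (center, context) pair; the function name and toy dimensions are illustrative, not taken from the text above.

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(v_c, u_o, u_neg):
    """Skip-gram negative-sampling loss for one training pair.

    v_c:   (d,)   input embedding of the center word
    u_o:   (d,)   output embedding of the observed context word
    u_neg: (K, d) output embeddings of K sampled negative words
    """
    pos = F.logsigmoid(v_c @ u_o)              # log sigma(u_o . v_c)
    neg = F.logsigmoid(-(u_neg @ v_c)).sum()   # sum_k log sigma(-u_k . v_c)
    return -(pos + neg)

# Toy usage with random vectors (d = 8, K = 5 negatives)
loss = neg_sampling_loss(torch.randn(8), torch.randn(8), torch.randn(5, 8))
print(loss.item())
```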
Definition: GloVe: Global Vectors
GloVe learns embeddings by factorizing the log co-occurrence matrix. Given co-occurrence counts $X_{ij}$:

$$J = \sum_{i,j=1}^{|\mathcal{V}|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $f(X_{ij})$ is a weighting function that caps the influence of very frequent pairs.
The key insight: the learned product $w_i^\top \tilde{w}_j \approx \log X_{ij}$ captures both local and global statistics.
GloVe combines the benefits of matrix factorization (global statistics) with the efficiency of local context window methods (Word2Vec).
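For reference, a minimal sketch of the per-pair GloVe objective; the x_max = 100 and alpha = 0.75 defaults follow the original GloVe paper, while the helper names here are illustrative.

```python
import torch

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): down-weights rare pairs, caps frequent ones at 1."""
    return torch.where(x < x_max, (x / x_max) ** alpha, torch.ones_like(x))

def glove_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the bilinear score and log co-occurrence."""
    pred = (w_i * w_j).sum(-1) + b_i + b_j
    return (glove_weight(x_ij) * (pred - torch.log(x_ij)) ** 2).sum()

# Toy usage: a batch of 4 co-occurring pairs with 16-dimensional vectors
d = 16
loss = glove_loss(torch.randn(4, d), torch.randn(4, d),
                  torch.randn(4), torch.randn(4),
                  torch.tensor([1.0, 5.0, 50.0, 500.0]))
print(loss.item())
```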
Theorem: Linear Analogy Property of Embeddings
Well-trained word embeddings exhibit approximate linear analogies:

$$e_{\text{king}} - e_{\text{man}} + e_{\text{woman}} \approx e_{\text{queen}}.$$

More precisely, the answer to "A is to B as C is to ?" is:

$$d^{*} = \arg\max_{d \in \mathcal{V} \setminus \{a, b, c\}} \cos\!\left(e_d,\; e_b - e_a + e_c\right).$$

The embedding space encodes semantic relationships as approximately linear directions. The "gender" direction is roughly parallel to $e_{\text{woman}} - e_{\text{man}}$ (and to $e_{\text{queen}} - e_{\text{king}}$).
Empirical Validation
Mikolov et al. (2013) showed that on a test set of 19,544 analogy questions, Word2Vec achieves ~75% accuracy using the above vector arithmetic. GloVe achieves similar results.
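The argmax-cosine rule above is easy to implement directly. Here is a minimal NumPy sketch with a hand-made two-dimensional toy vocabulary (real embeddings would be learned, not chosen by hand).

```python
import numpy as np

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cosine similarity
    to e_b - e_a + e_c, excluding the three query words."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hand-crafted toy vectors arranged so that king - man + woman ~ queen
emb = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
}
print(analogy(emb, "man", "king", "woman"))  # -> "queen"
```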
Theorem: Embedding Dimension and Information Capacity
For a vocabulary of size $|\mathcal{V}|$ and embedding dimension $d$, the embedding matrix has $|\mathcal{V}| \times d$ parameters. The effective capacity scales with this product:

$$\text{capacity} = O(|\mathcal{V}| \cdot d).$$
Empirically, downstream task performance improves with $d$ up to a saturation point, typically around $d \approx 200$–$300$ for most NLP tasks.
Too few dimensions cannot capture the richness of language; too many lead to overfitting on finite corpora.
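For a sense of scale, a quick back-of-the-envelope check of the $|\mathcal{V}| \times d$ parameter count (vocabulary size and dimensions here are illustrative):

```python
# Embedding matrix has |V| * d parameters; memory grows linearly in both factors
V = 50_000
for d in (64, 256, 300, 1024):
    params = V * d
    print(f"d={d:4d}: {params:>12,} parameters (~{params * 4 / 1e6:.0f} MB in float32)")
```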
Example: Training Word2Vec from Scratch
Train a simple Word2Vec skip-gram model on a small telecom corpus and find nearest neighbors for "channel".
Implementation
import torch
import torch.nn as nn
import torch.optim as optim
# Mini corpus
corpus = """
the wireless channel exhibits fading
channel estimation is critical for MIMO
the fading channel model uses Rayleigh distribution
MIMO systems use multiple antennas for spatial diversity
""".strip().lower().split('\n')
# Build vocabulary
words = set(w for s in corpus for w in s.split())
w2i = {w: i for i, w in enumerate(sorted(words))}
V, d = len(w2i), 32
# Skip-gram pairs (window=2)
pairs = []
for sent in corpus:
    tokens = sent.split()
    for i, center in enumerate(tokens):
        # context window of size 2 on each side of the center word
        for j in range(max(0, i - 2), min(len(tokens), i + 3)):
            if i != j:
                pairs.append((w2i[center], w2i[tokens[j]]))
# Model
class SkipGram(nn.Module):
    def __init__(self, V, d):
        super().__init__()
        self.center = nn.Embedding(V, d)    # input (center-word) embeddings
        self.context = nn.Embedding(V, d)   # output (context-word) embeddings
    def forward(self, c, ctx):
        # unnormalized score: dot product of center and context vectors
        return (self.center(c) * self.context(ctx)).sum(-1)
model = SkipGram(V, d)
opt = optim.Adam(model.parameters(), lr=0.01)
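The block above only sets up the data, model, and optimizer. A possible continuation (assumed, not part of the original listing; it reuses pairs, model, opt, and w2i from above, and uses a full-softmax cross-entropy instead of negative sampling, which is affordable for a vocabulary this small) trains the embeddings and then queries the nearest neighbors of "channel":

```python
import torch.nn.functional as F

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([x for _, x in pairs])

for epoch in range(200):
    # Score every vocabulary word as a context candidate for each center word
    logits = model.center(centers) @ model.context.weight.T   # (N, V)
    loss = F.cross_entropy(logits, contexts)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Nearest neighbors of "channel" by cosine similarity of center embeddings
emb = F.normalize(model.center.weight.detach(), dim=1)
sims = emb @ emb[w2i["channel"]]
i2w = {i: w for w, i in w2i.items()}
top = sims.argsort(descending=True)[1:4]   # skip the word itself at rank 0
print([i2w[i.item()] for i in top])
```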
[Interactive] Word Embedding Visualization: explore word embeddings projected to 2D via t-SNE or PCA.
[Interactive] Embedding Analogy Explorer: perform vector arithmetic on word embeddings, A - B + C = ?
[Interactive] Word2Vec Training Dynamics: watch how word embeddings evolve during training.
[Interactive] Text-to-Embedding Pipeline.
Quick Check
What does Word2Vec skip-gram predict?
- The center word given context words
- Context words given the center word ✓
- The next word given all previous words
Skip-gram maximizes P(context | center), learning embeddings that capture co-occurrence patterns.
Common Mistake: Freezing Pre-Trained Embeddings
Mistake:
Always freezing pre-trained embeddings during fine-tuning.
Correction:
For small domain-specific datasets (e.g., telecom), fine-tuning embeddings often helps. Start with frozen embeddings, then unfreeze if validation loss plateaus. Use a smaller learning rate for the embedding layer.
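One way to put this into practice, as a minimal sketch (the model, layer names, and learning rates are illustrative, not prescribed): freeze the embedding layer at first, give it its own smaller learning rate in a separate parameter group, and flip requires_grad once validation loss stalls.

```python
import torch.nn as nn
import torch.optim as optim

# Toy sentence classifier: mean-pooled embeddings + linear head.
# The embedding weights stand in for pre-trained vectors (random here).
class Classifier(nn.Module):
    def __init__(self, vocab_size=50_000, dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        return self.head(self.embed(token_ids).mean(dim=1))

model = Classifier()
model.embed.weight.requires_grad_(False)      # phase 1: embeddings frozen

opt = optim.Adam([
    {"params": model.head.parameters(), "lr": 1e-3},
    {"params": model.embed.parameters(), "lr": 1e-4},   # smaller LR for embeddings
])

# ... later, if validation loss plateaus, unfreeze and keep training:
model.embed.weight.requires_grad_(True)
```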
Why This Matters: Embeddings in Semantic Communication
In semantic communication, the transmitter encodes the meaning of a message (via embeddings) rather than the raw bits. The receiver reconstructs the meaning from the received signal. Word and sentence embeddings serve as the semantic space where distortion is measured in terms of meaning preservation rather than bit error rate.
See full treatment in Chapter 38
Historical Note: The Word2Vec Revolution
2013: Mikolov et al. published Word2Vec at Google, demonstrating that simple neural networks trained on large text corpora learn embeddings with remarkable algebraic properties. The "king - man + woman = queen" example captured the imagination of the field and launched the modern era of representation learning in NLP.
Word2Vec
A family of models (Skip-gram and CBOW) that learn word embeddings by predicting co-occurrence patterns in text, using shallow neural networks trained with negative sampling.
Related: Word Embedding
GloVe
Global Vectors for Word Representation: an embedding method that factorizes the log co-occurrence matrix to learn word vectors capturing both local and global corpus statistics.
Related: Word Embedding, Word2Vec
Word Embedding Methods Comparison
| Method | Training Signal | Strengths | Weaknesses |
|---|---|---|---|
| Word2Vec (SG) | Local context window | Fast training, good analogies | No global statistics |
| Word2Vec (CBOW) | Predict center from context | Faster than skip-gram | Weaker on rare words |
| GloVe | Global co-occurrence matrix | Captures global statistics | Large memory for co-occurrence matrix |
| FastText | Character n-grams | Handles OOV words | Larger model size |
| Contextual (BERT) | Full sentence context | Polysemy handling | Much slower, GPU required |
Key Takeaway
Word embeddings transform discrete tokens into continuous vectors where semantic similarity maps to geometric proximity. Word2Vec and GloVe produce static embeddings (one vector per word), while contextual models (BERT, GPT) produce dynamic embeddings that depend on the surrounding sentence.