Word Embeddings

Why Dense Embeddings?

One-hot vectors treat every word as equally different. Dense embeddings map words into a continuous vector space where semantic similarity corresponds to geometric proximity: \text{sim}(\mathbf{e}_\text{king}, \mathbf{e}_\text{queen}) > \text{sim}(\mathbf{e}_\text{king}, \mathbf{e}_\text{banana}).

Definition:

Word Embedding

A word embedding is a learned mapping f: \mathcal{V} \to \mathbb{R}^d that assigns each token a dense vector. The embedding matrix \mathbf{E} \in \mathbb{R}^{V \times d} stores all vectors:

\mathbf{e}_w = \mathbf{E}[w, :] = \mathbf{E}^T \mathbf{x}_w

where \mathbf{x}_w is the one-hot vector for word w.

import torch
import torch.nn as nn

# Lookup table: 50,000 vocabulary entries, each mapped to a 256-dim vector
embed = nn.Embedding(num_embeddings=50000, embedding_dim=256)
print(embed(torch.tensor([42])).shape)  # torch.Size([1, 256])

The embedding lookup is equivalent to a matrix multiplication with the one-hot vector, but implemented as an index lookup for efficiency.
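A minimal sketch of that equivalence (vocabulary size and dimension are arbitrary toy values): multiplying the one-hot vector by the embedding matrix returns the same row as a direct index lookup.

import torch
import torch.nn as nn

V, d = 10, 4                       # toy vocabulary size and embedding dimension
embed = nn.Embedding(V, d)         # weight matrix E has shape (V, d)

w = 3                              # word index
x_w = torch.zeros(V)
x_w[w] = 1.0                       # one-hot vector for word w

via_matmul = x_w @ embed.weight    # E^T x_w, written as a row-vector product
via_lookup = embed(torch.tensor(w))

print(torch.allclose(via_matmul, via_lookup))  # True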

Definition:

Word2Vec: Skip-gram Model

Skip-gram predicts context words w_c given a center word w_t:

P(w_c \mid w_t) = \frac{\exp(\mathbf{u}_{w_c}^T \mathbf{v}_{w_t})}{\sum_{w \in \mathcal{V}} \exp(\mathbf{u}_w^T \mathbf{v}_{w_t})}

where \mathbf{v}_w and \mathbf{u}_w are the input and output embeddings. The loss over the corpus is:

\mathcal{L} = -\sum_{(w_t, w_c)} \log P(w_c \mid w_t)
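A minimal sketch of this objective for a single (center, context) pair, using randomly initialized tensors as stand-ins for the input and output embedding tables (vocabulary size and dimension are illustrative):

import torch
import torch.nn.functional as F

V, d = 1000, 64
v = torch.randn(V, d)              # input (center-word) embeddings v_w
u = torch.randn(V, d)              # output (context-word) embeddings u_w

w_t, w_c = 17, 42                  # center and context word indices
scores = u @ v[w_t]                # u_w^T v_{w_t} for every word in the vocabulary
log_p = F.log_softmax(scores, dim=0)
loss = -log_p[w_c]                 # -log P(w_c | w_t)
print(loss.item())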

Negative sampling approximates the expensive softmax:

\mathcal{L}_{\text{NS}} = -\log \sigma(\mathbf{u}_{w_c}^T \mathbf{v}_{w_t}) - \sum_{k=1}^{K} \log \sigma(-\mathbf{u}_{w_k}^T \mathbf{v}_{w_t})

where w_k are K randomly sampled negative words.
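A hedged sketch of the negative-sampling loss with the same kind of toy embedding tables; negatives are drawn uniformly here, whereas Word2Vec samples them from a unigram distribution raised to the 3/4 power:

import torch
import torch.nn.functional as F

V, d, K = 1000, 64, 5
v = torch.randn(V, d)                        # center-word embeddings
u = torch.randn(V, d)                        # context-word embeddings

w_t, w_c = 17, 42
negatives = torch.randint(0, V, (K,))        # K negative samples (uniform here)

pos_score = u[w_c] @ v[w_t]                  # u_{w_c}^T v_{w_t}
neg_scores = u[negatives] @ v[w_t]           # u_{w_k}^T v_{w_t}, k = 1..K

loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()
print(loss.item())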

Definition:

GloVe: Global Vectors

GloVe learns embeddings by factorizing the log co-occurrence matrix. Given co-occurrence counts X_{ij}:

\mathcal{L} = \sum_{i,j} f(X_{ij}) \left(\mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2

where f(x) = \min\!\left((x/x_{\max})^\alpha, 1\right) is a weighting function that caps the influence of very frequent pairs.

The key insight: \mathbf{w}_i^T \tilde{\mathbf{w}}_j \approx \log P(j \mid i) (up to the bias terms), so dot products between word vectors capture both local and global co-occurrence statistics.

GloVe combines the benefits of matrix factorization (global statistics) with the efficiency of local context window methods (Word2Vec).
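A minimal sketch of the weighted least-squares objective on a toy co-occurrence matrix (vocabulary size is illustrative; x_max = 100 and alpha = 0.75 are the defaults from the GloVe paper):

import torch

V, d = 50, 16
X = torch.randint(0, 20, (V, V)).float()         # toy co-occurrence counts X_ij

w = torch.randn(V, d, requires_grad=True)        # word vectors w_i
w_tilde = torch.randn(V, d, requires_grad=True)  # context vectors w~_j
b = torch.zeros(V, requires_grad=True)           # biases b_i
b_tilde = torch.zeros(V, requires_grad=True)     # biases b~_j

x_max, alpha = 100.0, 0.75
f = torch.clamp((X / x_max) ** alpha, max=1.0)   # weighting f(X_ij)
mask = X > 0                                     # only observed pairs contribute

pred = w @ w_tilde.T + b[:, None] + b_tilde[None, :]
loss = (f * (pred - torch.log(X.clamp(min=1e-10))) ** 2)[mask].sum()
loss.backward()                                  # gradients flow to all four parameter sets
print(loss.item())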

Theorem: Linear Analogy Property of Embeddings

Well-trained word embeddings exhibit approximate linear analogies:

\mathbf{e}_\text{king} - \mathbf{e}_\text{man} + \mathbf{e}_\text{woman} \approx \mathbf{e}_\text{queen}

More precisely, the answer to "A is to B as C is to ?" is:

\arg\max_{w \in \mathcal{V}} \cos(\mathbf{e}_w, \mathbf{e}_B - \mathbf{e}_A + \mathbf{e}_C)

The embedding space encodes semantic relationships as approximately linear directions. The "gender" direction \mathbf{e}_\text{woman} - \mathbf{e}_\text{man} is roughly parallel to \mathbf{e}_\text{queen} - \mathbf{e}_\text{king}.
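A minimal sketch of the analogy query. A random matrix and toy word list stand in for trained vectors, so the answer is meaningless here; with trained embeddings, and with the query words excluded from the argmax as is standard practice, analogy("man", "king", "woman") would return "queen".

import torch
import torch.nn.functional as F

words = ["king", "queen", "man", "woman", "banana"]   # toy vocabulary
idx = {w: i for i, w in enumerate(words)}
E = torch.randn(len(words), 64)                       # stand-in for trained embeddings

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via cosine similarity."""
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    sims = F.cosine_similarity(E, target.unsqueeze(0), dim=1)
    sims[[idx[a], idx[b], idx[c]]] = -float("inf")    # exclude the query words
    return words[int(sims.argmax())]

print(analogy("man", "king", "woman"))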

Theorem: Embedding Dimension and Information Capacity

For a vocabulary of size V and embedding dimension d, the embedding matrix \mathbf{E} \in \mathbb{R}^{V \times d} has V \cdot d parameters. The effective capacity scales as:

\text{capacity} \propto d \cdot \log V

Empirically, downstream task performance improves with d up to a saturation point d^* \approx 200–400 for most NLP tasks.

Too few dimensions cannot capture the richness of language; too many lead to overfitting on finite corpora.
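The V \cdot d parameter count is easy to check directly; a quick sketch for a few illustrative vocabulary sizes and dimensions (the capacity scaling above is an empirical claim and is not derived here):

# Number of parameters in the embedding matrix E (shape V x d) is V * d
for V, d in [(50_000, 100), (50_000, 300), (400_000, 300)]:
    print(f"V={V:>9,}  d={d:>3}  parameters={V * d:>13,}")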

Example: Training Word2Vec from Scratch

Train a simple Word2Vec skip-gram model on a small telecom corpus and find nearest neighbors for "channel".
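A self-contained sketch of one way to do this. The tiny "telecom" corpus, window size, and hyperparameters below are placeholders, and the full softmax is only feasible because the vocabulary is tiny; a real run needs far more text, more epochs, and negative sampling.

import torch
import torch.nn as nn
import torch.nn.functional as F

corpus = [
    "the channel capacity depends on bandwidth and noise",
    "fading channel models describe the wireless link",
    "the receiver estimates the channel before decoding",
    "noise and interference degrade the channel quality",
]

# Build vocabulary and (center, context) pairs with a window of 2.
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 2
pairs = [(idx[sent[t]], idx[sent[c]])
         for sent in tokens
         for t in range(len(sent))
         for c in range(max(0, t - window), min(len(sent), t + window + 1))
         if c != t]

V, d = len(vocab), 32
v_emb = nn.Embedding(V, d)   # center-word (input) embeddings
u_emb = nn.Embedding(V, d)   # context-word (output) embeddings
opt = torch.optim.Adam(list(v_emb.parameters()) + list(u_emb.parameters()), lr=0.05)

centers = torch.tensor([p[0] for p in pairs])
contexts = torch.tensor([p[1] for p in pairs])
for epoch in range(200):
    scores = v_emb(centers) @ u_emb.weight.T         # full softmax over the tiny vocab
    loss = F.cross_entropy(scores, contexts)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Nearest neighbors of "channel" by cosine similarity of the input embeddings.
E = v_emb.weight.detach()
q = E[idx["channel"]]
sims = F.cosine_similarity(E, q.unsqueeze(0), dim=1)
top = sims.argsort(descending=True)[1:6]             # skip "channel" itself
print([vocab[int(i)] for i in top])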

Word Embedding Visualization
Explore word embeddings projected to 2D via t-SNE or PCA.

Embedding Analogy Explorer
Perform vector arithmetic on word embeddings: A - B + C = ?

Word2Vec Training Dynamics
Watch how word embeddings evolve during training.

Text-to-Embedding Pipeline
From raw text through tokenization, indexing, and embedding lookup to dense vectors.
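A minimal sketch of that pipeline; whitespace tokenization and a hand-built toy vocabulary stand in for a real subword tokenizer such as BPE.

import torch
import torch.nn as nn

text = "the channel capacity depends on bandwidth"

# 1. Tokenize (naive whitespace split; real systems use subword tokenizers).
tokens = text.lower().split()

# 2. Index: map each token to an integer id, with 0 reserved for unknown words.
vocab = {"<unk>": 0, "the": 1, "channel": 2, "capacity": 3,
         "depends": 4, "on": 5, "bandwidth": 6}
ids = torch.tensor([vocab.get(t, 0) for t in tokens])

# 3. Embedding lookup: ids -> dense vectors.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embed(ids)
print(ids.tolist())        # [1, 2, 3, 4, 5, 6]
print(vectors.shape)       # torch.Size([6, 8])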

Quick Check

What does Word2Vec skip-gram predict?

The center word given context words

Context words given the center word

The next word given all previous words

Common Mistake: Freezing Pre-Trained Embeddings

Mistake:

Always freezing pre-trained embeddings during fine-tuning.

Correction:

For small domain-specific datasets (e.g., telecom), fine-tuning embeddings often helps. Start with frozen embeddings, then unfreeze if validation loss plateaus. Use a smaller learning rate for the embedding layer.
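A hedged sketch of that recipe in PyTorch; the model structure and learning rates are illustrative, not a prescribed setup.

import torch
import torch.nn as nn

# Toy classifier: pre-trained-style embedding layer plus a task head.
embedding = nn.Embedding(10_000, 128)   # would be loaded from pre-trained vectors
head = nn.Linear(128, 2)

# Phase 1: freeze the embeddings; train only the head.
embedding.weight.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Phase 2 (e.g. when validation loss plateaus): unfreeze the embeddings
# and give them a smaller learning rate than the rest of the model.
embedding.weight.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": embedding.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-3},
])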

Why This Matters: Embeddings in Semantic Communication

In semantic communication, the transmitter encodes the meaning of a message (via embeddings) rather than the raw bits. The receiver reconstructs the meaning from the received signal. Word and sentence embeddings serve as the semantic space where distortion is measured in terms of meaning preservation rather than bit error rate.

See full treatment in Chapter 38

Historical Note: The Word2Vec Revolution

2013

Mikolov et al. published Word2Vec in 2013 at Google, demonstrating that simple neural networks trained on large text corpora learn embeddings with remarkable algebraic properties. The "king - man + woman = queen" example captured the imagination of the field and launched the modern era of representation learning in NLP.

Word Embedding

A dense, low-dimensional vector representation of a word learned from text data, where semantic similarity corresponds to geometric proximity.

Related: Word2Vec, GloVe

Word2Vec

A family of models (Skip-gram and CBOW) that learn word embeddings by predicting co-occurrence patterns in text, using shallow neural networks trained with negative sampling.

Related: Word Embedding

GloVe

Global Vectors for Word Representation β€” an embedding method that factorizes the log co-occurrence matrix to learn word vectors capturing both local and global corpus statistics.

Related: Word Embedding, Word2Vec

Word Embedding Methods Comparison

| Method | Training Signal | Strengths | Weaknesses |
|---|---|---|---|
| Word2Vec (SG) | Local context window | Fast training, good analogies | No global statistics |
| Word2Vec (CBOW) | Predict center from context | Faster than skip-gram | Weaker on rare words |
| GloVe | Global co-occurrence matrix | Captures global statistics | Large memory for co-occurrence matrix |
| FastText | Character n-grams | Handles OOV words | Larger model size |
| Contextual (BERT) | Full sentence context | Polysemy handling | Much slower, GPU required |

Key Takeaway

Word embeddings transform discrete tokens into continuous vectors where semantic similarity maps to geometric proximity. Word2Vec and GloVe produce static embeddings (one vector per word), while contextual models (BERT, GPT) produce dynamic embeddings that depend on the surrounding sentence.