Learn AI Series (#31) - Word Embeddings - Meaning in Numbers

What will I learn

  • You will learn why one-hot encoding and bag-of-words fail to capture meaning -- and how word embeddings fix that fundamental problem;
  • Word2Vec: skip-gram and CBOW -- training words to predict their neighbors with a simple neural network;
  • embedding arithmetic: "king - man + woman = queen" and why this emerges naturally from training;
  • GloVe and FastText: alternative embedding approaches with different tradeoffs;
  • visualizing and exploring embedding spaces to build geometric intuition for meaning;
  • using pre-trained embeddings in practice: loading, averaging, and classifying text;
  • the conceptual bridge from static embeddings to the contextual representations used by modern language models.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#31) - Word Embeddings - Meaning in Numbers

At the end of episode #30, we built bag-of-words and TF-IDF representations for text -- and they worked remarkably well for classification tasks. Spam detection, topic classification, sentiment analysis -- counting words and weighting them by importance got us a long way. But I also pointed out a fundamental limitation that should have been nagging at you: in those representations, every word is equally different from every other word. "Happy" and "joyful" sit in completely independent dimensions of your vocabulary space, as unrelated as "happy" and "refrigerator". A 50,000-word vocabulary creates 50,000-dimensional vectors where most dimensions are zero (sparse), and there's zero notion of semantic similarity between any pair of words.

That's a problem. A big one.

Today we fix it. Word embeddings map each word to a dense vector of much lower dimensionality -- typically 100 to 300 dimensions instead of 50,000. In this compressed space, words with similar meanings end up close together. "Happy" and "joyful" are neighbors. "Dog" and "cat" are close. "Paris" and "France" have a relationship similar to "Berlin" and "Germany". The embedding space captures meaning as geometry -- and this turns out to be one of the most important ideas in all of modern NLP.

If PCA (episode #24) gave us intuition for compressing numerical data into lower-dimensional spaces while preserving structure, word embeddings do the same thing for language. The difference is that we're not just compressing -- we're discovering that meaning itself has a geometric structure. And that geometric structure is learned entirely from observing which words appear near each other in massive amounts of text. Nobody hand-codes the rules. The model discovers them.

Here we go!

The distributional hypothesis

The theoretical backbone of everything in this episode is deceptively simple: a word is characterized by the company it keeps (Firth, 1957). Words that consistently appear in similar contexts probably have similar meanings. "Cat" and "dog" show up in the same kinds of sentences: "The ___ sat on the mat", "I took my ___ to the vet", "My ___ is sleeping on the couch". If two words are interchangeable in many contexts, they must be semantically related.

This insight -- called the distributional hypothesis -- transforms the problem from "understand what a word means" (incredibly hard, arguably philosophical) to "observe which words appear near each other" (straightforward counting, given enough data). Meaning emerges from usage patterns, not from any inherent property of the word itself.

Think about how you actually learn new words. When you first encounter a word you've never seen before, you infer its meaning from context. "The child was elated after winning the prize" -- even if you'd never seen "elated" before, the context ("winning", "prize") tells you it's a positive emotion. Your brain is doing distributional analysis every time you read. Word embeddings formalize this exact process with linear algebra ;-)

import numpy as np

# The distributional hypothesis in action: words in similar
# contexts should have similar meaning
contexts = {
    'cat': ['the ___ sat on the mat', 'I fed the ___', 'the ___ chased a mouse',
            'my ___ is sleeping', 'took the ___ to the vet'],
    'dog': ['the ___ sat on the rug', 'I fed the ___', 'the ___ chased a ball',
            'my ___ is sleeping', 'took the ___ to the vet'],
    'car': ['the ___ drove down the road', 'I parked the ___',
            'the ___ needed repairs', 'my ___ is red', 'took the ___ to the shop'],
}

# Count shared context patterns
for w1 in contexts:
    for w2 in contexts:
        if w1 >= w2:
            continue
        shared = len(set(contexts[w1]) & set(contexts[w2]))
        total = len(set(contexts[w1]) | set(contexts[w2]))
        print(f"  {w1:>4s} vs {w2:>4s}: "
              f"{shared}/{total} shared contexts "
              f"({shared/total:.0%} overlap)")

Even with just five example contexts, "cat" and "dog" share more overlap than either shares with "car". Scale this up to billions of words of text and you get enough statistical signal to position every word in a continuous vector space where proximity reflects semantic relatedness. That's exactly what Word2Vec does.

Word2Vec: learning embeddings from context

Word2Vec (Mikolov et al., 2013) turns the distributional hypothesis into a training procedure. The idea: build a shallow neural network that either predicts context words from a center word (skip-gram) or predicts a center word from its context (CBOW). The hidden layer weights, after training, ARE the word embeddings.

Skip-gram: given a center word, predict the surrounding context words. For the sentence "the cat sat on the mat" with center word "sat" and a window of 2, the training pairs are: (sat, the), (sat, cat), (sat, on), (sat, the). The model learns to predict context words given the center.

CBOW (Continuous Bag of Words): the reverse -- given context words, predict the center word. From context ["the", "cat", "on", "the"], predict "sat".
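Generating CBOW training examples is just a matter of sliding a window and collecting each center word's neighbors. A minimal sketch (the helper name `build_cbow_pairs` is mine, not from any library):

```python
def build_cbow_pairs(corpus_ids, window=2):
    """Generate (context_words, center_word) pairs, CBOW style."""
    pairs = []
    for i in range(len(corpus_ids)):
        # All word IDs within the window around position i, excluding i itself
        context = [corpus_ids[j]
                   for j in range(max(0, i - window),
                                  min(len(corpus_ids), i + window + 1))
                   if j != i]
        pairs.append((context, corpus_ids[i]))
    return pairs

# Toy word-ID sequence standing in for a tokenized sentence
ids = [0, 1, 2, 3, 0, 4]
for context, center in build_cbow_pairs(ids)[:3]:
    print(f"  context {context} -> center {center}")
```

Note the asymmetry with skip-gram: here one training example bundles the whole context, which gets averaged before predicting the center word.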

The architecture is remarkably simple: an input layer (one-hot encoded word), a single hidden layer of dimension d (the embedding size), and an output layer (softmax over the vocabulary). The hidden layer weights form a matrix of shape (vocab_size, d) -- and after training, row i of this matrix is the embedding vector for word i.

import numpy as np

def build_skipgram_pairs(corpus_ids, window=2):
    """Generate (center, context) training pairs from word ID sequence."""
    pairs = []
    for i, center in enumerate(corpus_ids):
        start = max(0, i - window)
        end = min(len(corpus_ids), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, corpus_ids[j]))
    return pairs

corpus = "the cat sat on the mat the dog sat on the rug"
words = corpus.split()
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}
corpus_ids = [word2idx[w] for w in words]

pairs = build_skipgram_pairs(corpus_ids, window=2)
print(f"Vocabulary: {vocab}")
print(f"Corpus length: {len(corpus_ids)} tokens")
print(f"Training pairs: {len(pairs)}")
print(f"\nFirst 10 pairs (center -> context):")
for center_id, ctx_id in pairs[:10]:
    print(f"  {idx2word[center_id]:>5s} -> {idx2word[ctx_id]}")

Now the actual training -- we'll implement a simplified skip-gram from scratch. This is essentially the training loop from episode #7, applied to a specific architecture. The hidden layer weights start random, and gradient descent adjusts them so that words appearing in similar contexts end up with similar embedding vectors:

def train_skipgram(corpus_ids, vocab_size, embed_dim=10,
                   window=2, lr=0.025, epochs=200):
    """Train skip-gram word embeddings from scratch."""
    # Initialize two weight matrices randomly
    # W_in: center word embeddings (what we want)
    # W_out: context word embeddings (needed for training)
    W_in = np.random.randn(vocab_size, embed_dim) * 0.1
    W_out = np.random.randn(embed_dim, vocab_size) * 0.1

    pairs = build_skipgram_pairs(corpus_ids, window)

    for epoch in range(epochs):
        total_loss = 0
        np.random.shuffle(pairs)

        for center_id, ctx_id in pairs:
            # Forward pass: get center word embedding
            h = W_in[center_id]  # shape: (embed_dim,)

            # Compute scores for all words
            scores = h @ W_out  # shape: (vocab_size,)

            # Softmax probabilities
            exp_scores = np.exp(scores - scores.max())
            probs = exp_scores / exp_scores.sum()

            # Loss: negative log probability of the actual context word
            total_loss -= np.log(probs[ctx_id] + 1e-10)

            # Backward pass: gradient of cross-entropy + softmax
            grad_scores = probs.copy()
            grad_scores[ctx_id] -= 1  # subtract 1 for the correct class

            # Compute the W_in gradient BEFORE updating W_out,
            # so both updates use the pre-step weights
            grad_h = W_out @ grad_scores

            # Update W_out
            W_out -= lr * np.outer(h, grad_scores)

            # Update W_in (the embedding we care about)
            W_in[center_id] -= lr * grad_h

        if epoch % 50 == 0:
            avg_loss = total_loss / len(pairs)
            print(f"  Epoch {epoch:>3d}: avg loss = {avg_loss:.3f}")

    return W_in

embeddings = train_skipgram(corpus_ids, len(vocab),
                            embed_dim=5, epochs=200)
print(f"\nEmbedding matrix shape: {embeddings.shape}")
print(f"  ({len(vocab)} words x {embeddings.shape[1]} dimensions)")

On a tiny corpus like this, the embeddings won't be particularly meaningful -- you need millions of words for the distributional patterns to emerge properly. But the mechanism is clear: each word starts as a random vector, and gradient descent pushes words that share context closer together in the embedding space. If "cat" and "dog" both appear near "sat", "on", and "the" (which they do in our tiny corpus), their embeddings will converge toward similar regions.

Cosine similarity between embeddings

Once we have embeddings, we use our old friend cosine similarity (episode #8) to measure word similarity. Same formula we used for document similarity in episode #30, but now applied to individual word vectors:

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = a @ b
    norms = np.linalg.norm(a) * np.linalg.norm(b)
    return dot / norms if norms > 0 else 0.0

print("Word similarities (from our tiny corpus):")
for w1 in vocab:
    sims = []
    for w2 in vocab:
        if w1 != w2:
            sim = cosine_sim(
                embeddings[word2idx[w1]],
                embeddings[word2idx[w2]]
            )
            sims.append((w2, sim))
    sims.sort(key=lambda x: -x[1])
    top = sims[:3]
    top_str = ", ".join(f"{w}({s:.2f})" for w, s in top)
    print(f"  {w1:>5s} -> {top_str}")

On real Word2Vec models trained on billions of words (Google News, Wikipedia, Common Crawl), the nearest neighbors make intuitive sense: "king" is close to "queen", "prince", "ruler". "Python" is close to "programming", "language", "code". "Berlin" is close to "Germany", "Munich", "Hamburg". The embedding space organizes words by meaning, entirely from co-occurrence statistics. Nobody hand-labeled anything. No ontology, no dictionary, no rules -- just massive amounts of text and gradient descent.

Embedding arithmetic: the magic trick

The most famous property of Word2Vec embeddings is that you can do algebra with meaning. Take the vector for "king", subtract "man", add "woman", and the nearest word to the resulting vector is "queen". This sounds like a party trick, but it reveals something deep about how the embedding space is structured.

def analogy(embeddings, word2idx, idx2word, a, b, c):
    """Solve: a is to b as c is to ?
    Computes: embedding(a) - embedding(b) + embedding(c)
    Returns the word closest to that vector."""
    if a not in word2idx or b not in word2idx or c not in word2idx:
        return "(word not in vocabulary)", 0.0

    vec = (embeddings[word2idx[a]]
           - embeddings[word2idx[b]]
           + embeddings[word2idx[c]])

    # Find closest word by cosine similarity
    norms = np.linalg.norm(embeddings, axis=1)
    vec_norm = np.linalg.norm(vec)
    if vec_norm == 0:
        return "(zero vector)", 0.0

    similarities = embeddings @ vec / (norms * vec_norm + 1e-10)

    # Exclude the input words
    exclude = {word2idx[a], word2idx[b], word2idx[c]}
    for idx in exclude:
        similarities[idx] = -1

    best_idx = np.argmax(similarities)
    return idx2word[best_idx], similarities[best_idx]

# This won't work well on our tiny corpus, but here's how
# you'd use it with real pre-trained embeddings:
print("Analogy test (tiny corpus -- don't expect miracles):")
for a, b, c in [("cat", "mat", "dog"), ("sat", "cat", "on")]:
    result, score = analogy(embeddings, word2idx, idx2word, a, b, c)
    print(f"  {a} - {b} + {c} = {result} (score: {score:.3f})")

Why does this work? Because the embedding space encodes relationships as vector directions. The direction from "man" to "woman" captures the concept of gender across the entire vocabulary. The direction from "Paris" to "France" captures "capital of". The direction from "walked" to "walk" captures past-to-present tense. These directions are consistent: the gender direction is roughly the same whether you start from "king", "uncle", "brother", or "waiter". That consistency is what makes arithmetic possible -- you can compose meaning by adding and subtracting these directional concepts.
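To make "relationships as directions" concrete, here's a toy sketch with hand-made 2-D vectors (not trained embeddings) where one axis loosely encodes royalty and the other gender:

```python
import numpy as np

# Made-up 2-D embeddings, purely illustrative: axis 0 ~ "royalty",
# axis 1 ~ "gender". Real embeddings have hundreds of dimensions
# and the directions are learned, not hand-placed.
emb = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, 0.1]),
    'man':   np.array([0.1, 0.8]),
    'woman': np.array([0.1, 0.1]),
}

# The "gender direction" is the same no matter where you measure it
gender_dir_1 = emb['woman'] - emb['man']
gender_dir_2 = emb['queen'] - emb['king']
print("direction woman - man:  ", gender_dir_1)
print("direction queen - king: ", gender_dir_2)

# Which is exactly why the arithmetic lands on "queen"
result = emb['king'] - emb['man'] + emb['woman']
print("king - man + woman =", result, " (queen =", emb['queen'], ")")
```

In these toy vectors the consistency is built in by construction; the remarkable empirical finding of Mikolov et al. is that trained embeddings exhibit (approximately) the same parallelogram structure without anyone designing it.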

And here's the part that still amazes me: nobody explicitly told Word2Vec about gender, geography, tense, or any other linguistic relationship. The model discovered these regularities entirely from observing which words appear near which other words in billions of sentences. The structure of human language is so regular that a single-layer neural network trained on context prediction can recover deep semantic relationships. That's the distributional hypothesis, proven with math.

Negative sampling: making training practical

The full softmax over the entire vocabulary (which we implemented above) is computationally expensive. For a vocabulary of 100,000 words, every training step requires computing 100,000 exponentials and a normalization. That's fine for our 7-word toy corpus but completely impractical for real training.

Negative sampling solves this by turning the problem from "predict the correct context word from all possible words" into "distinguish the actual context word from a few randomly sampled incorrect words." Instead of updating all vocabulary weights on every step, you only update the weights for the correct context word and a handful of "negative" samples (typically 5-15 randomly chosen words):

def train_skipgram_neg_sampling(corpus_ids, vocab_size,
                                embed_dim=50, window=2,
                                n_neg=5, lr=0.025, epochs=5):
    """Skip-gram with negative sampling -- the practical version."""
    W_in = np.random.randn(vocab_size, embed_dim) * 0.01
    W_out = np.random.randn(vocab_size, embed_dim) * 0.01

    # Word frequency distribution for negative sampling
    # (raised to 0.75 power per Mikolov et al.)
    word_counts = np.bincount(corpus_ids, minlength=vocab_size)
    freq = word_counts ** 0.75
    neg_probs = freq / freq.sum()

    pairs = build_skipgram_pairs(corpus_ids, window)

    for epoch in range(epochs):
        total_loss = 0
        np.random.shuffle(pairs)

        for center_id, ctx_id in pairs:
            center_vec = W_in[center_id]

            # Positive example: actual context word
            ctx_vec = W_out[ctx_id]
            score = center_vec @ ctx_vec
            sigmoid = 1 / (1 + np.exp(-np.clip(score, -10, 10)))
            grad = sigmoid - 1  # gradient for positive sample
            total_loss -= np.log(sigmoid + 1e-10)

            # Accumulate the center-word gradient BEFORE the in-place
            # W_out update (ctx_vec is a view into W_out)
            center_grad = grad * ctx_vec
            W_out[ctx_id] -= lr * grad * center_vec

            # Negative examples: random words (NOT the actual context)
            neg_ids = np.random.choice(
                vocab_size, size=n_neg, p=neg_probs
            )
            for neg_id in neg_ids:
                if neg_id == ctx_id:
                    continue
                neg_vec = W_out[neg_id]
                score = center_vec @ neg_vec
                sigmoid = 1 / (1 + np.exp(-np.clip(score, -10, 10)))
                total_loss -= np.log(1 - sigmoid + 1e-10)

                # Accumulate before the in-place update
                # (neg_vec is a view into W_out)
                center_grad += sigmoid * neg_vec
                W_out[neg_id] -= lr * sigmoid * center_vec

            W_in[center_id] -= lr * center_grad

        avg_loss = total_loss / len(pairs)
        print(f"  Epoch {epoch}: avg loss = {avg_loss:.3f}")

    return W_in

emb_neg = train_skipgram_neg_sampling(
    corpus_ids, len(vocab), embed_dim=10, epochs=5
)
print(f"Shape: {emb_neg.shape}")

The 0.75 exponent on word frequencies is a detail that matters: it smooths the sampling distribution so that rare words get sampled slightly more often than their raw frequency would suggest, and common words ("the", "a", "is") get sampled slightly less. This helps the model learn better representations for less frequent words -- a form of implicit regularization that Mikolov et al. discovered empirically works better than uniform or raw-frequency sampling.
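The effect of the 0.75 exponent is easy to see with made-up counts (the numbers below are arbitrary, chosen only to show the smoothing):

```python
import numpy as np

# Hypothetical word counts: one very common word, a few rare ones
counts = np.array([1000.0, 100.0, 10.0, 1.0])

raw = counts / counts.sum()                       # raw-frequency sampling
smoothed = counts ** 0.75 / (counts ** 0.75).sum()  # Mikolov et al. smoothing

for c, r, s in zip(counts, raw, smoothed):
    print(f"  count={c:6.0f}  raw={r:.4f}  smoothed={s:.4f}")
```

The most frequent word loses probability mass and the rarest gains, while the ranking of words stays unchanged -- a compromise between raw-frequency and uniform sampling.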

GloVe: global statistics meet local context

GloVe (Global Vectors for Word Representation, Pennington et al., 2014) takes a fundamentally different approach to reaching the same destination. Instead of predicting context words one at a time through a sliding window (like Word2Vec), GloVe first builds a global word co-occurrence matrix -- counting how often each pair of words appears together within a fixed window across the entire corpus -- and then factorizes that matrix to produce embeddings.

The key insight behind GloVe is that the ratio of co-occurrence probabilities carries more meaning than the raw probabilities themselves. Consider the words "ice" and "steam". Both co-occur frequently with "water" (so the raw co-occurrence counts are similar), but "ice" co-occurs much more with "solid" while "steam" co-occurs more with "gas". The ratio P(solid | ice) / P(solid | steam) is high, while P(gas | ice) / P(gas | steam) is low. GloVe trains embeddings that preserve these ratio relationships.
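The ice/steam argument is easiest to see with the co-occurrence probabilities reported in the GloVe paper (reproduced here from its Table 1 -- treat them as illustrative, not something we computed):

```python
# P(probe word | target word), from the GloVe paper's ice/steam table
P = {
    ('solid', 'ice'): 1.9e-4, ('solid', 'steam'): 2.2e-5,
    ('gas',   'ice'): 6.6e-5, ('gas',   'steam'): 7.8e-4,
    ('water', 'ice'): 3.0e-3, ('water', 'steam'): 2.2e-3,
}

for probe in ['solid', 'gas', 'water']:
    ratio = P[(probe, 'ice')] / P[(probe, 'steam')]
    print(f"  P({probe}|ice) / P({probe}|steam) = {ratio:7.3f}")
```

Ratios far above 1 flag properties of ice, ratios far below 1 flag properties of steam, and ratios near 1 (like "water") flag shared or irrelevant words -- exactly the signal GloVe's objective is designed to preserve.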

def build_cooccurrence(corpus_ids, vocab_size, window=2):
    """Build word co-occurrence matrix from corpus."""
    cooccurrence = np.zeros((vocab_size, vocab_size))

    for i, center in enumerate(corpus_ids):
        start = max(0, i - window)
        end = min(len(corpus_ids), i + window + 1)
        for j in range(start, end):
            if j != i:
                # Weight by distance: closer words count more
                distance = abs(i - j)
                cooccurrence[center][corpus_ids[j]] += 1.0 / distance

    return cooccurrence

cooc = build_cooccurrence(corpus_ids, len(vocab), window=2)
print("Co-occurrence matrix (rows/cols = words):")
print(f"  Words: {vocab}")
for i, word in enumerate(vocab):
    row = cooc[i]
    nonzero = [(vocab[j], f"{row[j]:.1f}")
               for j in range(len(vocab)) if row[j] > 0]
    print(f"  {word:>5s}: {dict(nonzero)}")

In practice, both Word2Vec and GloVe produce high-quality embeddings that perform similarly on most benchmarks. GloVe sometimes edges ahead on analogy tasks (because it explicitly captures global statistics), while Word2Vec can be faster to train incrementally on streaming data. The choice between them is largely a matter of preference and infrastructure. Most practitioners just download pre-trained vectors and use them as features -- which brings us to the practical part.

FastText: handling words you've never seen

One critical weakness of both Word2Vec and GloVe: if a word wasn't in the training corpus, it has no embedding. It's completely invisible -- an out-of-vocabulary (OOV) problem. This is a serious issue for languages with rich morphology (German compound words, Turkish agglutination), domain-specific vocabularies (medical terms, legal jargon), and real-world text (typos, slang, neologisms, brand names).

FastText (Bojanowski et al., 2017, from Facebook/Meta Research) solves this elegantly by representing each word as a bag of character n-grams. The word "learning" is decomposed into character subsequences: "<le", "lea", "ear", "arn", "rni", "nin", "ing", "ng>", where "<" and ">" mark word boundaries. Each n-gram has its own embedding vector, and the word's embedding is the sum of its n-gram embeddings.

def get_char_ngrams(word, min_n=3, max_n=6):
    """Extract character n-grams from a word (FastText style)."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i+n])
    return ngrams

# Demonstrate: similar words share n-grams
test_words = ["learning", "learned", "learner", "teaching"]
for word in test_words:
    ngrams = get_char_ngrams(word, min_n=3, max_n=4)
    print(f"  {word:>10s}: {ngrams[:8]}...")

# Show n-gram overlap between word pairs
print("\nN-gram overlap:")
for w1 in ["learning", "learned"]:
    for w2 in ["learner", "teaching"]:
        ng1 = set(get_char_ngrams(w1))
        ng2 = set(get_char_ngrams(w2))
        shared = ng1 & ng2
        print(f"  {w1} vs {w2}: "
              f"{len(shared)}/{len(ng1 | ng2)} shared "
              f"({len(shared)/len(ng1 | ng2):.0%})")

This is powerful because even if the model has never seen the word "blockchain", it has seen plenty of words containing "blo", "loc", "ock", "cha", "hai", "ain" -- and the combination of those n-gram embeddings gives "blockchain" a reasonable representation. The model can generalize to words it has literally never encountered during training. For real-world NLP applications dealing with messy, noisy text (social media, customer reviews, technical documentation), this is a massive practical advantage over standard Word2Vec or GloVe.
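Here's a sketch of how that OOV composition works mechanically. Real FastText hashes n-grams into a fixed number of buckets (it uses an FNV-style hash; I'm using `zlib.crc32` here, and the bucket count and random table are made up for the demo):

```python
import zlib
import numpy as np

def get_char_ngrams(word, min_n=3, max_n=6):
    """Same n-gram extraction as above."""
    padded = f"<{word}>"
    return [padded[i:i+n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

# A hashed embedding table: n-grams map to buckets, so the model
# never needs to store every possible n-gram explicitly
N_BUCKETS, DIM = 10_000, 50
table = np.random.RandomState(0).randn(N_BUCKETS, DIM) * 0.1

def oov_vector(word):
    """Sum the hashed n-gram embeddings -- works for ANY word."""
    rows = [zlib.crc32(ng.encode()) % N_BUCKETS
            for ng in get_char_ngrams(word)]
    return table[rows].sum(axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v_bc = oov_vector("blockchain")
v_bcs = oov_vector("blockchains")
v_other = oov_vector("refrigerator")
print(f"  blockchain vs blockchains:  {cos(v_bc, v_bcs):.2f}")
print(f"  blockchain vs refrigerator: {cos(v_bc, v_other):.2f}")
```

Even with a random table, words that share most of their n-grams end up with highly similar vectors, because their sums share most of their terms -- that structural overlap is what trained FastText embeddings exploit.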

Using pre-trained embeddings in practice

For real tasks, you almost never train your own word embeddings. Pre-trained embeddings -- trained on billions of words from Wikipedia, Common Crawl, or Google News -- are freely available and capture far richer semantic relationships than anything you could train on your own dataset. The most common approach: download pre-trained vectors, load them into a dictionary, and use them as feature representations for your downstream task.

def load_embeddings_demo(dim=50):
    """Simulate loading pre-trained embeddings.
    In practice you'd load GloVe/FastText from a file like:
      embeddings = {}
      with open('glove.6B.50d.txt') as f:
          for line in f:
              parts = line.split()
              word = parts[0]
              vec = np.array(parts[1:], dtype=float)
              embeddings[word] = vec
    Here we generate fake embeddings to demonstrate the workflow."""
    np.random.seed(42)
    # Simulate semantic clusters
    base_vecs = {
        'positive': np.random.randn(dim) * 0.3 + 1.0,
        'negative': np.random.randn(dim) * 0.3 - 1.0,
        'animal': np.random.randn(dim) * 0.3 + np.array(
            [0.5] * (dim // 2) + [-0.5] * (dim - dim // 2)),
        'tech': np.random.randn(dim) * 0.3 + np.array(
            [-0.5] * (dim // 2) + [0.5] * (dim - dim // 2)),
    }

    words = {
        'happy': 'positive', 'joyful': 'positive',
        'great': 'positive', 'amazing': 'positive',
        'excellent': 'positive', 'wonderful': 'positive',
        'terrible': 'negative', 'awful': 'negative',
        'horrible': 'negative', 'bad': 'negative',
        'poor': 'negative', 'worst': 'negative',
        'dog': 'animal', 'cat': 'animal',
        'python': 'tech', 'code': 'tech',
        'algorithm': 'tech', 'software': 'tech',
    }

    embeddings = {}
    for word, cluster in words.items():
        noise = np.random.randn(dim) * 0.2
        embeddings[word] = base_vecs[cluster] + noise

    return embeddings

pretrained = load_embeddings_demo(dim=50)
print(f"Loaded {len(pretrained)} word embeddings (dim=50)")

# Demonstrate: similar words have high cosine similarity
print("\nSimilarity examples:")
test_pairs = [
    ("happy", "joyful"), ("happy", "terrible"),
    ("dog", "cat"), ("dog", "algorithm"),
    ("python", "code"), ("python", "awful"),
]
for w1, w2 in test_pairs:
    sim = cosine_sim(pretrained[w1], pretrained[w2])
    print(f"  {w1:>10s} vs {w2:<10s}: {sim:>+.3f}")

Document embeddings by averaging

The simplest (and surprisingly effective) way to get a document-level representation from word embeddings: average all the word vectors. This produces a fixed-size vector regardless of document length, and it captures the overall topic and sentiment of the text. Episode #30 built document vectors from TF-IDF -- this is the embedding-based alternative:

def doc_embedding(text, embeddings, dim=50):
    """Average word embeddings to get a document vector."""
    words = text.lower().split()
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Classify documents by comparing to cluster centroids
docs = [
    "happy joyful amazing great wonderful",
    "terrible awful horrible bad worst",
    "dog cat dog cat dog",
    "python code algorithm software python",
]
labels = ["positive", "negative", "animals", "tech"]

print("Document embeddings and pairwise similarities:")
doc_vecs = [doc_embedding(d, pretrained) for d in docs]

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = cosine_sim(doc_vecs[i], doc_vecs[j])
        print(f"  {labels[i]:>10s} vs {labels[j]:<10s}: {sim:>+.3f}")

This averaging approach works because the semantic direction of the component words combines sensibly: a document full of positive words will have an average vector pointing in the "positive" direction of the embedding space. A document mixing positive and negative words will land somewhere in between. More sophisticated methods (TF-IDF weighted averaging, Doc2Vec, or paragraph vectors) exist but provide marginal improvements for many tasks. The simple average is a remarkably strong baseline -- and it has one huge advantage over TF-IDF: it handles words the model HAS seen but the training dataset hasn't. If your training reviews never contained "phenomenal" but it's in the pre-trained embeddings, the averaged document vector will still reflect its positive meaning.

Embedding-based text classification

Let's put it all together with a classification pipeline using embeddings as features. This directly parallels the TF-IDF pipeline from episode #30 -- same classifiers, same evaluation, different representation:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Build a labeled dataset
pos_docs = [
    "happy joyful great amazing excellent wonderful",
    "great wonderful excellent happy joyful amazing",
    "amazing excellent happy great wonderful joyful",
    "wonderful happy amazing excellent great joyful",
    "joyful great amazing wonderful excellent happy",
] * 20

neg_docs = [
    "terrible awful horrible bad poor worst",
    "awful horrible terrible worst bad poor",
    "bad poor terrible horrible awful worst",
    "worst terrible awful bad horrible poor",
    "horrible bad worst awful terrible poor",
] * 20

all_docs = pos_docs + neg_docs
all_labels = np.array([1] * len(pos_docs) + [0] * len(neg_docs))

# Shuffle
idx = np.random.RandomState(42).permutation(len(all_docs))
all_docs = [all_docs[i] for i in idx]
all_labels = all_labels[idx]

# Convert to embedding features
X_emb = np.array([doc_embedding(d, pretrained) for d in all_docs])

# Classify
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_emb, all_labels, cv=5)
print(f"Embedding-based classification (5-fold CV):")
print(f"  Accuracy: {scores.mean():.1%} +/- {scores.std():.1%}")
print(f"  Feature dimensionality: {X_emb.shape[1]} "
      f"(vs potentially thousands with TF-IDF)")

The embedding representation is dramatically more compact than TF-IDF (50 dimensions vs potentially thousands) while capturing semantic similarity that TF-IDF completely misses. A TF-IDF classifier trained only on "good" and "bad" won't generalize to "excellent" and "terrible" unless those exact words appeared in the training set. An embedding-based classifier generalizes because "excellent" and "good" have similar embedding vectors -- the pre-trained embeddings transfer knowledge from the massive training corpus to your specific task, even when your task has limited labeled data. This is a form of transfer learning that we'll see a LOT more of when we get to transformers and large language models.

Visualizing embedding spaces

One of the most satisfying things about word embeddings is that you can actually see the semantic structure. Using t-SNE or UMAP (episode #25), we can project the high-dimensional embedding space down to 2D and observe the clusters of meaning:

def visualize_embeddings_2d(embeddings_dict):
    """Project word embeddings to 2D for visualization.
    Uses a simple random projection as a stand-in for t-SNE
    (which needs sklearn.manifold.TSNE in practice)."""
    words = list(embeddings_dict.keys())
    vecs = np.array([embeddings_dict[w] for w in words])

    # Simple 2D projection (in practice, use t-SNE or UMAP)
    np.random.seed(42)
    proj = np.random.randn(vecs.shape[1], 2) * 0.1
    coords = vecs @ proj

    print(f"2D projection of {len(words)} word embeddings:")
    print(f"  {'Word':>12s}  {'x':>8s}  {'y':>8s}")
    print("-" * 32)
    for i, word in enumerate(words):
        print(f"  {word:>12s}  {coords[i,0]:>8.3f}  {coords[i,1]:>8.3f}")

    # Check: do semantic clusters group together in 2D?
    clusters = {
        'positive': ['happy', 'joyful', 'great', 'amazing'],
        'negative': ['terrible', 'awful', 'horrible', 'bad'],
    }
    print(f"\nCluster separation check:")
    for name, cluster_words in clusters.items():
        cw = [w for w in cluster_words if w in embeddings_dict]
        if len(cw) >= 2:
            idxs = [words.index(w) for w in cw]
            center = coords[idxs].mean(axis=0)
            spread = coords[idxs].std()
            print(f"  {name:>10s}: center=({center[0]:.3f}, "
                  f"{center[1]:.3f}), spread={spread:.3f}")

visualize_embeddings_2d(pretrained)

In a real t-SNE visualization of Word2Vec or GloVe embeddings, you'd see beautiful clusters: countries grouped together, numbers in a line, animals in a cluster, professions nearby, colors next to each other. The 2D projection reveals the hidden geometric structure that the training process discovered. It's genuinely one of the most visually compelling results in all of machine learning -- the idea that meaning has a shape, and that shape emerges from nothing but co-occurrence statistics.

From static to contextual: the limitation that changed everything

Everything we've covered today produces static embeddings -- each word gets exactly one vector regardless of context. The word "bank" has the same embedding whether it appears in "river bank" or "bank account". "Apple" is the same vector in "Apple released a new phone" and "I ate an apple for lunch". This is a fundamental limitation that no amount of training data can fix, because static embeddings by definition collapse all senses of a word into a single point.

The revolution that changed NLP -- and that we'll build toward over the next several episodes -- uses contextual embeddings: each word's representation depends on all the other words around it. "Bank" in "river bank" gets a different vector than "bank" in "bank account". The same word, in different contexts, maps to different points in embedding space. This is the core idea behind BERT, GPT, and the transformer models that power modern language models. The representations we learned today are the conceptual foundation: meaning as vectors, similarity as distance, relationships as directions. All of that carries directly over. The only thing that changes is that the vectors become context-dependent.

But static embeddings are not obsolete. They're fast (a dictionary lookup, no model inference), compact (a lookup table rather than a billion-parameter neural network), interpretable (you can inspect the nearest neighbors of any word and understand what the model "thinks"), and sufficient for many classification tasks. For many production systems that need to process millions of documents per second, static embeddings plus a linear classifier remains the pragmatic choice. Understanding them also builds the intuition you need for everything that comes after -- including the probability-based approaches to understanding language that we'll explore soon.
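That pragmatic setup -- average the word vectors, feed the result to a linear classifier -- fits in a few lines. Everything below is a sketch with invented toy vectors and an invented weight vector; a real system would load pre-trained GloVe or Word2Vec vectors and learn the classifier weights from labeled data.

```python
import numpy as np

# Toy "pre-trained" embeddings (invented 3-d vectors for illustration).
embeddings = {
    "great": np.array([0.90, 0.80, 0.10]),
    "movie": np.array([0.20, 0.30, 0.40]),
    "awful": np.array([-0.80, -0.90, 0.00]),
    "film":  np.array([0.25, 0.28, 0.38]),
}

def doc_vector(text):
    # Average the word vectors: a fast, surprisingly strong document
    # representation. Words missing from the table are simply skipped.
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

# A linear classifier reduced to its essence: a single weight vector.
# Here it's hand-picked to act as a sentiment direction for the sketch;
# in practice you'd train it with logistic regression on doc_vector output.
w = np.array([1.0, 1.0, 0.0])

for doc in ["great movie", "awful film"]:
    score = doc_vector(doc) @ w
    print(f"{doc!r}: {'positive' if score > 0 else 'negative'}")
```

The whole pipeline is one dictionary lookup per word plus one dot product per document -- no model inference -- which is why it scales to millions of documents.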

What have we learned?

We've gone from counting words (episode #30) to understanding meaning as geometry. Here's the full picture:

  • Word embeddings map words to dense vectors where semantic similarity corresponds to vector proximity. "Happy" and "joyful" are close, "happy" and "refrigerator" are far apart. This fixes the fundamental limitation of bag-of-words and TF-IDF, where every word is equally different from every other word;
  • The distributional hypothesis says that words appearing in similar contexts have similar meanings. This transforms "understand meaning" (hard) into "count co-occurrences" (straightforward), and it's the theoretical foundation for all embedding methods;
  • Word2Vec learns embeddings by training a shallow network to predict context words (skip-gram) or center words (CBOW). The hidden layer weights after training ARE the embeddings. We built this from scratch -- it's gradient descent on a cross-entropy objective, the same training loop from episode #7;
  • Negative sampling makes Word2Vec practical by replacing the full vocabulary softmax with a binary classification task: distinguish real context words from random negative samples. This reduces the per-step cost from O(vocab_size) to O(n_negative);
  • Embedding arithmetic captures semantic relationships as vector directions: "king - man + woman = queen" works because the gender direction is consistent across the vocabulary. Nobody programmed this -- it emerges from training on co-occurrence patterns;
  • GloVe factorizes a global co-occurrence matrix; FastText adds character n-grams so it can handle words it's never seen before. In practice, all three produce similar-quality embeddings and pre-trained versions are freely available;
  • Pre-trained embeddings transfer knowledge from massive corpora to your specific task. Averaging word vectors gives surprisingly effective document representations, and embedding-based classifiers generalize to words unseen in your training data;
  • Static embeddings give each word one vector regardless of context. This means "bank" is the same in "river bank" and "bank account" -- a limitation that contextual embeddings (BERT, GPT) will fix by making representations context-dependent. Those contextual models build directly on the geometric intuitions we developed today.
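The arithmetic bullet above can be sketched concretely. The vectors here are hand-constructed toys (one "royalty" dimension, one "gender" dimension) so the analogy holds exactly by construction; real Word2Vec embeddings learn such directions only approximately, which is why the nearest-neighbor step is needed at all.

```python
import numpy as np

# Toy 2-d embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender"
# (hand-made so the gender direction is perfectly consistent).
vocab = {
    "king":  np.array([1.0, -1.0]),
    "queen": np.array([1.0,  1.0]),
    "man":   np.array([0.0, -1.0]),
    "woman": np.array([0.0,  1.0]),
    "apple": np.array([-1.0, 0.2]),   # a distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman: remove maleness, add femaleness, keep royalty.
target = vocab["king"] - vocab["man"] + vocab["woman"]

# Standard convention: exclude the query words, then take the nearest
# neighbor of the result by cosine similarity.
candidates = {w: v for w, v in vocab.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

Excluding the query words matters in practice: with real embeddings, the raw nearest neighbor of `king - man + woman` is often "king" itself, since the offset vectors are small relative to the word vectors.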

We're now deep into the NLP portion of this series, and the pieces are connecting. Classical text representations (episode #30) gave us a working baseline. Word embeddings (today) gave us meaning as geometry. The next step is to explore how probability and statistical models can capture even richer patterns in language -- particularly when the ordering and structure of words matters, not just which words are present. The tools from our earlier probability discussion (episode #9) are about to become very relevant again.

Thanks for sticking around. Until next time!

@scipio


