Learn AI Series (#30) - Natural Language Processing - Text as Data

What will I learn
- You will learn the fundamental challenge of NLP: converting human language into numbers machines can process;
- tokenization -- breaking text into units (words, characters, subwords) and why the choice matters;
- bag of words -- the simplest text-to-numbers conversion and why it works despite throwing away word order;
- TF-IDF -- weighting words by importance so that "the" doesn't dominate your feature vectors;
- text preprocessing: lowercasing, stop words, stemming vs lemmatization;
- n-grams -- capturing short sequences of words for better context;
- building a complete text classification pipeline with scikit-learn;
- the limitations of classical NLP and what comes next.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3.11 (or newer) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data (this post)
Learn AI Series (#30) - Natural Language Processing - Text as Data
At the end of episode #29, I mentioned that all the data we've been working with throughout this entire series -- numerical measurements, category labels, pixel values, timestamps, user ratings -- shared one common property: it was structured and quantitative. Everything could be expressed as numbers in a matrix, or at least converted to numbers through straightforward encoding (one-hot vectors in episode #3, feature engineering in episode #15, lag features in episode #28). The conversion from "raw data" to "numbers a model can eat" was usually mechanical and deterministic.
Today that changes completely.
Natural Language Processing (NLP) deals with arguably the hardest data type in all of machine learning: human language. Free-form, ambiguous, context-dependent, culturally loaded, and deceptively complex. A sentence isn't just a sequence of characters -- it carries meaning, intent, tone, sarcasm, implicit knowledge, and cultural references that humans decode effortlessly but machines find profoundly difficult. "The bank was steep" and "The bank was closed" use the exact same word with completely different meanings. "Not bad at all" is a compliment despite containing a negative word. "Yeah right" can be enthusiastic agreement or dripping sarcasm depending on context that no feature vector can trivially capture.
And yet -- NLP works. It works remarkably well, in fact. The classical techniques we cover today powered spam filters, search engines, sentiment analysis systems, and document classifiers for decades, and many of them are STILL used in production alongside (or sometimes instead of) modern deep learning approaches. Understanding these foundations isn't just historical education -- it builds the intuition you'll need for everything that follows in the NLP portion of this series.
Here we go!
The representation problem
The core challenge of NLP is the same one we've been solving since episode #3: how do you turn data into numbers? For images, the answer was straightforward -- pixels are already numbers. For tabular data, we had columns of measurements and categories. For time series (episodes #28-29), we had ordered sequences of values.
But text? "The cat sat on the mat" and "A feline rested upon the rug" mean essentially the same thing but share zero words beyond articles. "The bank by the river" and "The bank approved my loan" share the word "bank" but describe completely different things. "I saw her duck" is genuinely ambiguous -- did I see her pet duck, or did I see her dodge?
Every NLP system must solve this representation problem: convert variable-length, ambiguous text into fixed-size numerical vectors that preserve enough meaning for the downstream task. Different representations throw away different things. The classical methods we cover today throw away quite a lot (most notably word order) and still work brilliantly for many tasks. That should tell you something about how much redundancy human language contains ;-)
import numpy as np
from collections import Counter
# Two sentences that mean the same thing
sent1 = "The cat sat on the mat"
sent2 = "A feline rested upon the rug"
# As raw character sequences, they share very little
chars1 = set(sent1.lower())
chars2 = set(sent2.lower())
overlap = chars1 & chars2
print(f"Sentence 1: '{sent1}'")
print(f"Sentence 2: '{sent2}'")
print(f"Shared characters: {sorted(overlap)}")
print(f"Character overlap: {len(overlap)}/{len(chars1 | chars2)}")
print(f"Shared words: {set(sent1.lower().split()) & set(sent2.lower().split())}")
print(" --> Same meaning, almost nothing in common at the surface level")
This is fundamentally different from the numerical data we've worked with before. Two temperature readings of 21.5 and 21.7 are clearly similar. Two sentences can be completely similar in meaning while being completely different in their surface form. The representation needs to capture semantic similarity, not just string similarity. That's the whole game.
Tokenization: deciding what the units are
Before we can count, weight, or embed anything, we need to decide what the basic units of text are. This process is called tokenization -- splitting a string into a sequence of tokens:
text = "Machine learning is amazing! Machine learning transforms industries."
# Word-level tokenization (simplest approach)
word_tokens = text.lower().replace("!", "").replace(".", "").split()
print(f"Word tokens: {word_tokens}")
print(f"Unique words: {len(set(word_tokens))}")
print(f"Vocab: {sorted(set(word_tokens))}")
print()
# Character-level tokenization
char_tokens = list(text.lower())
print(f"Char tokens (first 30): {char_tokens[:30]}")
print(f"Unique chars: {len(set(char_tokens))}")
print()
# Simple sentence tokenization
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(f"Sentences: {sentences}")
There are three main approaches, and they involve a classic tradeoff:
Word-level tokenization is the most natural: split on whitespace and punctuation. The vocabulary is every unique word in the corpus. Simple and interpretable, but the vocabulary grows large fast (English has ~170,000 words in common use, and every misspelling or rare technical term creates a new vocabulary entry). Words the model hasn't seen during training become out-of-vocabulary (OOV) tokens -- effectively invisible.
Character-level tokenization uses individual characters as tokens. Tiny vocabulary (26 letters + digits + punctuation, roughly 100 entries), no OOV problem at all, but loses all word-level meaning. The model has to learn from scratch that "c-a-t" represents a furry animal. That's asking a lot, especially for classical methods.
Subword tokenization (BPE, WordPiece, SentencePiece) is the modern compromise used by GPT, BERT, and essentially all large language models. Common words stay as single tokens ("the", "learning"), while rare words get split into pieces ("un##expect##ed", "bio##chem##istry"). This balances vocabulary size against representation power. We'll dig into subword tokenization properly when we get to transformers later in the series.
For today's classical methods, we'll stick with word-level tokenization. It's been the standard for decades and works well with the counting-based representations we're about to build.
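To make the OOV problem concrete before we move on, here's a small sketch using scikit-learn's CountVectorizer (the sentences are made up for illustration): a word the vectorizer never saw during fitting has no column in the vocabulary, so it simply vanishes from the vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit a vocabulary on a tiny "training" corpus
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
cv = CountVectorizer()
cv.fit(train_docs)
print(f"Vocabulary: {sorted(cv.vocabulary_)}")

# Transform a new document containing a word never seen during fitting
new_doc = ["the ferret sat on the mat"]
vec = cv.transform(new_doc).toarray()[0]
print(f"Vector for '{new_doc[0]}': {vec}")
# 'ferret' has no column -- only the 5 known tokens are counted
```

Run this and you'll see the vector sums to 5 even though the sentence has 6 words: "ferret" is invisible to the model.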
Bag of words: ignoring order entirely
The simplest way to turn text into numbers: count how many times each word appears. Each document becomes a vector where each dimension corresponds to a unique word in the vocabulary, and the value is the count of that word in this document:
def bag_of_words(documents):
    """Convert documents to bag-of-words vectors."""
    # Build vocabulary from all documents
    vocab = sorted(set(
        w for doc in documents for w in doc.lower().split()
    ))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    # Count word occurrences per document
    X = np.zeros((len(documents), len(vocab)))
    for i, doc in enumerate(documents):
        for w in doc.lower().split():
            X[i, word_to_idx[w]] += 1
    return X, vocab
docs = [
"the cat sat on the mat",
"the dog sat on the log",
"the cat chased the dog",
]
X, vocab = bag_of_words(docs)
print(f"Vocabulary ({len(vocab)} words): {vocab}")
print(f"Vectors shape: {X.shape}")
print(f"\nBag of Words matrix:")
for i, doc in enumerate(docs):
    print(f" Doc {i}: {X[i].astype(int)} -> '{doc}'")
The method is called "bag" deliberately -- it's like dumping all the words into a bag, shaking them up, and counting what's there. Word order is completely gone. "Dog bites man" and "Man bites dog" produce identical bag-of-words vectors. That seems like a catastrophic loss of information, and in some ways it is. But for many classification tasks -- spam detection, topic classification, sentiment analysis -- the presence of certain words carries more signal than their order.
Think about it: if an email contains "free", "winner", "congratulations", "click", and "prize", it's probably spam regardless of how those words are arranged. The bag of words captures that pattern perfectly. A sentiment classifier mostly needs to know that "terrible", "waste", and "broken" are present, not their exact sequence. Order matters for understanding nuance, but for coarse classification tasks, word presence is often sufficient.
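This order-blindness is easy to demonstrate with a toy sketch, using Python's Counter as the "bag":

```python
from collections import Counter

def bow(text):
    """Bag-of-words representation as a word -> count mapping."""
    return Counter(text.lower().split())

a = bow("dog bites man")
b = bow("man bites dog")
print(a)
print(b)
print(f"Identical bags: {a == b}")  # True -- word order is gone
```

Two sentences with opposite meanings, one identical representation. Whether that matters depends entirely on your task.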
Comparing documents with bag of words
Once documents are vectors, we can use all our old vector tools. Cosine similarity from episode #8 measures how aligned two document vectors are:
def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = a @ b
    norms = np.linalg.norm(a) * np.linalg.norm(b)
    return dot / norms if norms > 0 else 0.0

print("Document similarity (cosine):")
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = cosine_sim(X[i], X[j])
        print(f" Doc {i} vs Doc {j}: {sim:.3f}")
        print(f"   '{docs[i]}'")
        print(f"   '{docs[j]}'")
Documents 0 and 1 ("cat sat on the mat" vs "dog sat on the log") should be fairly similar -- they share the same sentence structure and several common words. Document 2 ("cat chased the dog") shares some words with both but has a different verb. The cosine similarity captures these relationships purely from word overlap.
TF-IDF: words that actually matter
Raw bag-of-words has a fatal flaw: common words like "the", "is", "a", "and" dominate the vectors. They appear in every document, carry zero discriminative information, and yet they get the highest counts. This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes in -- one of the most important ideas in information retrieval, and still used extensively in search engines and text classification today.
The idea is elegant: weight each word by how important it is to a specific document relative to the entire corpus.
- TF (Term Frequency): how often the word appears in THIS document. More occurrences = more relevant to this particular document.
- IDF (Inverse Document Frequency): log(total documents / documents containing this word). Words that appear in EVERY document (like "the") get a low IDF. Words that appear in only a few documents (like "cryptocurrency" in a general news corpus) get a high IDF.
The product TF x IDF gives high weight to words that are both frequent in a specific document AND rare across the corpus -- exactly the words that characterize and distinguish that document. Common words get suppressed automatically. It's a brilliantly simple form of feature weighting that connects directly to the feature engineering principles from episode #15.
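Before handing this to scikit-learn, it's worth computing the weights by hand once. A minimal sketch using the textbook formulas (note: scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ slightly):

```python
import numpy as np

# Toy corpus for a by-hand TF-IDF calculation
docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
    "the bird flew",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for d in tokenized for w in d})

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
df = {w: sum(w in d for d in tokenized) for w in vocab}
# Textbook IDF: log(N / df)
idf = {w: np.log(n_docs / df[w]) for w in vocab}

for w in vocab:
    print(f"{w:>5s}: df={df[w]} idf={idf[w]:.3f}")
# 'the' appears in every document -> idf = log(4/4) = 0, fully suppressed
# 'bird' appears in one document  -> idf = log(4/1), the maximum weight
```

The word "the" gets an IDF of exactly zero: no matter how often it appears in a document, its TF-IDF weight is zero. That's the entire trick.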
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"Machine learning is transforming finance and healthcare",
"Deep learning models require large datasets and GPU compute",
"Natural language processing handles text and speech data",
"Computer vision processes images using convolutional networks",
"Reinforcement learning trains agents through trial and error",
"Random forests and gradient boosting dominate tabular data",
"Support vector machines draw boundaries in high dimensions",
]
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(f"Shape: {X_tfidf.shape}")
print(f" ({X_tfidf.shape[0]} documents, {X_tfidf.shape[1]} features)")
print(f"\nTop 3 TF-IDF features per document:")
feature_names = tfidf.get_feature_names_out()
for i, doc in enumerate(corpus):
    scores = X_tfidf[i].toarray().flatten()
    top_idx = np.argsort(scores)[::-1][:3]
    top = [(feature_names[j], round(scores[j], 3)) for j in top_idx]
    print(f" Doc {i}: {top}")
    print(f"   -> '{doc[:50]}...'")
Look at what TF-IDF does. "And" appears in almost every document, so its IDF is near zero and it gets suppressed. But "convolutional" appears in only one document, so its IDF is very high and it dominates that document's vector. "Finance" and "healthcare" are distinctive to document 0, "GPU" and "compute" are distinctive to document 1. TF-IDF automatically identifies the words that make each document unique -- without any manual feature engineering or domain knowledge. That's why it's been the backbone of text classification and information retrieval for decades.
Text preprocessing: cleaning up messy language
Raw text is messy. People capitalize inconsistently, use different word forms ("running", "runs", "ran"), and pad their writing with filler words that carry no meaning. Preprocessing normalizes text to reduce noise and shrink the vocabulary:
from sklearn.feature_extraction.text import TfidfVectorizer
# Demonstrate different preprocessing options
text_samples = [
"The RUNNING dogs are running quickly",
"These dogs run faster than those dogs ran",
"A dog's speed is amazing when it runs",
]
# No preprocessing
tfidf_raw = TfidfVectorizer(lowercase=False)
X_raw = tfidf_raw.fit_transform(text_samples)
print(f"No preprocessing: {X_raw.shape[1]} features")
print(f" Vocab sample: {list(tfidf_raw.vocabulary_.keys())[:10]}")
# With lowercasing (default)
tfidf_lower = TfidfVectorizer()
X_lower = tfidf_lower.fit_transform(text_samples)
print(f"Lowercased: {X_lower.shape[1]} features")
# With stop word removal
tfidf_stop = TfidfVectorizer(stop_words='english')
X_stop = tfidf_stop.fit_transform(text_samples)
print(f"+ Stop words: {X_stop.shape[1]} features")
print(f" Removed: {sorted(set(tfidf_lower.vocabulary_) - set(tfidf_stop.vocabulary_))}")
Let me walk through the preprocessing steps:
Lowercasing: "Machine" and "machine" become the same token. Almost always a good idea -- you rarely want capitalization to create separate features. The exception is Named Entity Recognition where "Apple" (company) differs from "apple" (fruit), but that's a specialized task.
Stop word removal: Remove common words ("the", "is", "a", "and", "of", "to") that carry no discriminative information. Sklearn's TfidfVectorizer(stop_words='english') handles this with a built-in English stop word list. This reduces dimensionality AND removes noise -- stop words just add uniform noise across all document vectors.
Stemming: Reduce words to their root form by chopping off suffixes. "running" -> "run", "studies" -> "studi", "faster" -> "faster". The Porter Stemmer is the classic algorithm -- fast but crude. "university" becomes "univers", "organization" becomes "organ" (oops).
Lemmatization: Like stemming but uses actual linguistic rules and a dictionary. "running" -> "run", "studies" -> "study", "better" -> "good". More accurate than stemming but slower, because it needs to look up each word's base form.
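NLTK's PorterStemmer and WordNetLemmatizer are the standard tools for these steps. To see the contrast without extra dependencies, here's a deliberately crude sketch -- the suffix list and the tiny lemma table are made up for illustration, and this is NOT the real Porter algorithm:

```python
def crude_stem(word):
    """Naive suffix-stripping: fast, rule-based, and blunt."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization sketch: a dictionary lookup of base forms
# (a real lemmatizer uses a full dictionary plus part-of-speech rules)
LEMMA_TABLE = {"running": "run", "studies": "study", "better": "good", "ran": "run"}

def crude_lemma(word):
    return LEMMA_TABLE.get(word, word)

for w in ["running", "studies", "better", "university"]:
    print(f"{w:>12s} -> stem: {crude_stem(w):>10s}   lemma: {crude_lemma(w):>10s}")
```

Notice how the stemmer mangles "running" into "runn" while the lemma lookup returns the clean base form "run" -- exactly the fast-but-crude versus accurate-but-slower tradeoff described above.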
In practice, for TF-IDF with classical ML, the combination of lowercasing + stop word removal works well enough for most tasks. Stemming and lemmatization provide a marginal improvement on some datasets but can hurt on others by conflating words that SHOULD stay distinct (the Porter stemmer maps both "universe" and "university" to "univers", for example).
N-grams: capturing a little bit of order
Bag of words completely ignores word order, but we can recover SOME order information by counting sequences of consecutive words instead of individual words. These sequences are called n-grams:
- Unigram (n=1): individual words -- "machine", "learning", "is", "amazing"
- Bigram (n=2): pairs of consecutive words -- "machine learning", "learning is", "is amazing"
- Trigram (n=3): triples -- "machine learning is", "learning is amazing"
Bigrams capture local context that unigrams miss. "Not good" as a bigram is a negative feature, while "not" and "good" as separate unigrams send conflicting signals. "New York" as a bigram is meaningful; "new" and "york" separately are much less so.
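Generating n-grams by hand takes a few lines of Python -- a quick sketch before we let scikit-learn do it for us:

```python
def ngrams(tokens, n):
    """All consecutive n-token sequences from a token list."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = "machine learning is not just statistics".split()
print("Unigrams:", ngrams(tokens, 1))
print("Bigrams: ", ngrams(tokens, 2))
print("Trigrams:", ngrams(tokens, 3))
# A sentence with T tokens yields T - n + 1 n-grams
```

A 6-token sentence yields 6 unigrams, 5 bigrams, and 4 trigrams; the explosion comes not from any single sentence but from how many DISTINCT n-grams a whole corpus contains.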
# Compare unigrams, bigrams, and both
configs = [
("Unigrams only", (1, 1)),
("Bigrams only", (2, 2)),
("Uni + bigrams", (1, 2)),
("Uni + bi + trigrams", (1, 3)),
]
texts = [
"machine learning is not just statistics",
"deep learning requires more data than traditional methods",
"natural language processing is a subfield of machine learning",
"not all machine learning is deep learning",
]
for name, ngram_range in configs:
    v = TfidfVectorizer(ngram_range=ngram_range, stop_words='english')
    X = v.fit_transform(texts)
    print(f"{name:>22s}: {X.shape[1]:>4d} features")
    if ngram_range[1] >= 2:
        # Show some interesting bigrams
        bigrams = [f for f in v.get_feature_names_out() if " " in f]
        print(f"   Bigram samples: {bigrams[:5]}")
Notice the feature explosion: adding bigrams roughly quadruples the vocabulary size. Adding trigrams makes it even larger. This is the classic tradeoff of n-grams: you capture more context, but the number of possible n-grams grows exponentially with n, which means you need more data to avoid overfitting (episode #13, the gift that keeps on giving). In practice, unigrams + bigrams is the sweet spot for most text classification tasks. Trigrams help occasionally but the added dimensionality often does more harm than good unless you have a large corpus.
Building a sentiment classifier: putting it all together
Let's build a complete text classification system using everything we've covered. We'll use scikit-learn's Pipeline (from episode #16) to chain together TF-IDF vectorization and classification into a single object:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
# Sentiment dataset (positive = 1, negative = 0)
positive = [
"This product is amazing and I love it",
"Great experience, highly recommend to everyone",
"Excellent value for the price, very satisfied",
"Best purchase I have made this year by far",
"Fantastic quality and fast shipping, impressed",
"Wonderful service, exceeded all my expectations",
"Really happy with this, works perfectly fine",
"Outstanding quality, will definitely buy again",
"Superb craftsmanship and beautiful design overall",
"Delighted with the results, money well spent",
"Incredible performance and very easy to use",
"Top notch product, five stars all the way",
] * 10
negative = [
"Terrible quality, complete waste of money here",
"Awful customer service, never buying again ever",
"Broken on arrival, very disappointed with this",
"Horrible product, do not buy this at all",
"Worst product ever, want my money back now",
"Cheaply made junk that broke after one week",
"Completely useless, does not work as described",
"Very poor quality, not worth a single penny",
"Disgusting experience from start to finish here",
"Absolute garbage, returning it for a refund",
"Terrible fit and finish, looks nothing like photos",
"Waste of time and money, total disappointment",
] * 10
texts = positive + negative
labels = np.array([1] * len(positive) + [0] * len(negative))
# Shuffle
idx = np.random.RandomState(42).permutation(len(texts))
texts = [texts[i] for i in idx]
labels = labels[idx]
# Test three different classifiers with TF-IDF features
classifiers = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Naive Bayes': MultinomialNB(),
'Linear SVM': LinearSVC(max_iter=2000),
}
print("Sentiment classification (5-fold CV):")
print(f" {'Classifier':>22s} {'Accuracy':>10s}")
print("-" * 38)
for name, clf in classifiers.items():
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(
            max_features=1000,
            ngram_range=(1, 2),
            stop_words='english'
        )),
        ('clf', clf),
    ])
    scores = cross_val_score(pipe, texts, labels, cv=5)
    print(f" {name:>22s} {scores.mean():.1%} +/- {scores.std():.1%}")
All three classifiers should perform well because sentiment analysis with clear positive/negative language is one of the easiest NLP tasks -- the signal is strong and word presence is highly predictive. Logistic Regression with TF-IDF features is a remarkably powerful baseline for text classification. It's fast, interpretable (you can inspect which words have the highest positive and negative coefficients), and competitive with much more complex approaches on datasets up to tens of thousands of documents.
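To use such a pipeline for actual predictions rather than cross-validation, you fit it once and call predict on raw strings -- the pipeline handles vectorization internally. A minimal self-contained sketch with a made-up mini dataset:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative training set (invented for this sketch)
train_texts = [
    "amazing product, I love it", "great quality, highly recommend",
    "fantastic value, very satisfied", "wonderful, exceeded expectations",
    "terrible quality, waste of money", "awful service, never again",
    "broken on arrival, disappointed", "worst purchase, total garbage",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_labels)

# Score unseen reviews: raw strings in, class labels out
new_reviews = ["amazing quality, highly satisfied", "awful, total waste"]
for review, pred in zip(new_reviews, pipe.predict(new_reviews)):
    print(f"{'POSITIVE' if pred == 1 else 'NEGATIVE'}: '{review}'")
```

The same fitted pipeline object can be pickled and deployed as a single artifact -- vectorizer and classifier travel together, so training-time and serving-time preprocessing can never drift apart.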
Inspecting what the model learned
One of the huge advantages of TF-IDF + linear models over deep learning: you can look inside and see exactly which words drive the predictions.
# Train on all data and inspect coefficients
pipe_inspect = Pipeline([
('tfidf', TfidfVectorizer(
max_features=500,
ngram_range=(1, 2),
stop_words='english'
)),
('clf', LogisticRegression(max_iter=1000)),
])
pipe_inspect.fit(texts, labels)
# Get feature names and coefficients
features = pipe_inspect['tfidf'].get_feature_names_out()
coefs = pipe_inspect['clf'].coef_[0]
# Top positive indicators
top_pos_idx = np.argsort(coefs)[::-1][:10]
top_neg_idx = np.argsort(coefs)[:10]
print("Most POSITIVE words:")
for i in top_pos_idx:
    print(f" {features[i]:>25s} coef = {coefs[i]:>+.3f}")
print("\nMost NEGATIVE words:")
for i in top_neg_idx:
    print(f" {features[i]:>25s} coef = {coefs[i]:>+.3f}")
You should see words like "amazing", "excellent", "fantastic", "love" at the top of the positive list, and "terrible", "waste", "horrible", "worst" at the top of the negative list. This is deeply satisfying -- the model discovered what positive and negative language looks like, entirely from the data, no manual rules required.
This interpretability matters enormously in production. When a stakeholder asks "why did the model flag this customer review?", you can point to specific words with their coefficients. Try doing that with a 175-billion parameter language model ;-)
The effect of preprocessing choices
Different preprocessing choices can have a surprisingly large impact on classification performance. Let's compare systematically:
preprocessing_configs = {
'Raw (no preprocessing)': TfidfVectorizer(),
'Lowercase + stop words': TfidfVectorizer(stop_words='english'),
'Bigrams + stop words': TfidfVectorizer(
ngram_range=(1, 2), stop_words='english'),
'Max 500 features': TfidfVectorizer(
max_features=500, stop_words='english'),
'Max 2000 + bigrams': TfidfVectorizer(
max_features=2000, ngram_range=(1, 2),
stop_words='english'),
'Min df=2 (ignore rare)': TfidfVectorizer(
min_df=2, stop_words='english'),
}
print("Preprocessing comparison (LogisticRegression, 5-fold CV):")
print(f" {'Config':>28s} {'Features':>9s} {'Accuracy':>10s}")
print("-" * 52)
for name, vectorizer in preprocessing_configs.items():
    pipe = Pipeline([
        ('tfidf', vectorizer),
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, texts, labels, cv=5)
    # Count features on a fresh copy of the vectorizer
    vectorizer_copy = vectorizer.__class__(**vectorizer.get_params())
    X_temp = vectorizer_copy.fit_transform(texts)
    n_features = X_temp.shape[1]
    print(f" {name:>28s} {n_features:>9d} "
          f"{scores.mean():.1%} +/- {scores.std():.1%}")
The results usually show that preprocessing helps, but the magnitude of improvement depends heavily on the dataset. For this simple sentiment dataset, even raw TF-IDF works well because the vocabulary is small and the signal is strong. On messier, larger datasets (think: thousands of customer reviews with typos, slang, and sarcasm), stop word removal and n-grams make a much bigger difference. The lesson from episode #15 applies here too: feature engineering (in this case, text preprocessing) often matters more than model selection.
A more realistic example: topic classification
Sentiment analysis is almost too easy -- let's try something harder. Multi-class topic classification from short text descriptions:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Simulated topic classification dataset
topic_data = {
'sports': [
"The team scored three goals in the second half",
"She won the championship with a record time",
"The coach made substitutions in the final minutes",
"Players trained intensively before the tournament",
"The match ended in a controversial penalty decision",
"He broke the world record at the athletics event",
"The league standings changed after the weekend games",
"Fans celebrated the victory outside the stadium",
] * 8,
'technology': [
"The new processor delivers twice the performance",
"Software updates fixed several critical security bugs",
"Cloud computing costs decreased significantly this year",
"The startup raised funding for their database product",
"Open source contributions grew by thirty percent",
"The browser update improved page loading speed",
"Server infrastructure was migrated to the new platform",
"Battery technology breakthrough extends device lifespan",
] * 8,
'cooking': [
"Preheat the oven to 180 degrees before starting",
"Whisk the eggs and sugar until light and fluffy",
"The recipe calls for fresh herbs and olive oil",
"Let the dough rest for at least thirty minutes",
"Season the meat with salt pepper and garlic powder",
"Simmer the sauce on low heat for two hours",
"The cake needs to cool completely before frosting",
"Chop the vegetables into small uniform pieces first",
] * 8,
}
X_topics = []
y_topics = []
for topic, samples in topic_data.items():
    X_topics.extend(samples)
    y_topics.extend([topic] * len(samples))
X_train, X_test, y_train, y_test = train_test_split(
X_topics, y_topics, test_size=0.25, random_state=42,
stratify=y_topics
)
pipe_topic = Pipeline([
('tfidf', TfidfVectorizer(
ngram_range=(1, 2),
stop_words='english',
max_features=1000
)),
('clf', LogisticRegression(max_iter=1000)),
])
pipe_topic.fit(X_train, y_train)
y_pred = pipe_topic.predict(X_test)
print("Topic Classification Results:")
print(classification_report(y_test, y_pred))
# Show the most distinctive words per topic
features = pipe_topic['tfidf'].get_feature_names_out()
for i, topic in enumerate(pipe_topic['clf'].classes_):
    coefs = pipe_topic['clf'].coef_[i]
    top_idx = np.argsort(coefs)[::-1][:5]
    top_words = [features[j] for j in top_idx]
    print(f" {topic:>12s} keywords: {', '.join(top_words)}")
This is where TF-IDF really shines. The topic-specific vocabulary creates natural clusters in TF-IDF space -- sports documents cluster together because they share words like "goal", "team", "match", while cooking documents cluster around "oven", "recipe", "heat". The classifier doesn't need to understand WHAT these words mean -- it just needs to detect that certain words co-occur with certain labels. It's pattern matching on word distributions, and it works remarkably well.
The limitations we're going to fix
Everything we've covered today treats text as a collection of independent tokens (or short n-gram sequences). This works for many classification tasks, but it misses critical aspects of language that become important for more demanding applications:
Word order carries meaning. "Not good" means the opposite of "good", but bag-of-words just sees one count each of "not" and "good". We partially fix this with bigrams ("not good" as a single feature), but that only captures very local context. "I would definitely not say this product is any good" has "not" and "good" separated by 8 words -- no reasonable n-gram captures that.
Words have meaning, not just identity. In our TF-IDF vectors, "happy" and "joyful" are completely unrelated features -- orthogonal dimensions in the vocabulary space. There's no sense that they're synonyms. A model trained on "happy" reviews won't generalize to "joyful" reviews unless it's seen both. This is a fundamental limitation of any counting-based representation: it can't capture meaning, only identity.
Context determines meaning. "I put money in the bank" and "I sat on the river bank" use "bank" with totally different meanings. TF-IDF treats them as the same feature. "Apple released a new product" and "I ate an apple" -- same problem. Classical methods have no mechanism for disambiguation beyond what the surrounding counted words might suggest.
Rare and unseen words are invisible. If the training data never contained the word "phenomenal", the model has no way to understand that it's a positive sentiment indicator. It's not in the vocabulary, so it doesn't exist. A human immediately understands that "phenomenal" is positive, but the bag-of-words model is completely blind to it.
These aren't just theoretical limitations -- they're why modern NLP has moved toward representations that capture meaning rather than just word identity. The key insight that bridges classical NLP to modern approaches is this: what if, instead of representing words as sparse one-hot vectors (where every word is equally different from every other word), we represented them as dense vectors in a continuous space where similar words are close together? That idea -- giving words a geometry of meaning -- turns out to be one of the most powerful ideas in all of machine learning, and it connects directly to the dimensionality reduction concepts we explored in episodes #24-25.
Soooooo, what have we learned?
We've entered a completely new domain of ML today -- one where the raw data is messy, ambiguous human language rather than clean numerical matrices. Here's what we covered:
- The fundamental NLP challenge: converting variable-length, ambiguous text into fixed-size numerical vectors a model can process. Same problem as episode #3, but dramatically harder because language carries meaning that surface-level encoding can't trivially capture;
- Tokenization splits text into processing units -- words, characters, or subwords -- each with different vocabulary size vs information tradeoffs. Word-level tokenization is the classical standard;
- Bag of words counts word occurrences per document, completely discarding order. Sounds terrible, works brilliantly for many classification tasks because word presence is often more informative than word arrangement;
- TF-IDF weights words by their importance: frequent in this document but rare across the corpus means high weight. Common words like "the" get automatically suppressed. This is the single most important classical NLP technique and it's still everywhere in production;
- Preprocessing (lowercasing, stop words, stemming, lemmatization) normalizes text to reduce vocabulary size and noise. The combination of lowercasing + stop word removal + TF-IDF is the strong default;
- N-grams capture short word sequences ("not good", "machine learning") to recover some of the order information that bag-of-words discards. Unigrams + bigrams is the practical sweet spot;
- TF-IDF + logistic regression is a remarkably strong baseline for text classification -- fast, interpretable, and competitive with complex models on many tasks. When you can inspect exactly which words drive each prediction, that's powerful;
- Classical methods miss word order, synonymy, context, and unseen words. These limitations point directly toward the next phase: representing words as dense vectors where meaning is encoded as geometry.
This episode marks a transition point in our series. We've completed the classical ML toolkit -- supervised learning (episodes #10-21), unsupervised learning (#22-27), time series (#28-29), and now classical NLP (#30). Everything from here builds toward the deep learning revolution: representations that capture meaning, models that understand sequence, and architectures that scale to billions of parameters. The classical foundations we've built aren't going away -- they're the vocabulary you need to understand WHY the modern approaches work and WHEN the simpler methods are still the better choice.