Learn AI Series (#15) - Feature Engineering and Selection

What will I learn

  • You will learn how to create new features from raw data that help your model find patterns it couldn't see before;
  • interaction features and polynomial features -- capturing relationships between variables;
  • cyclical encoding for time-based features (hours, days, months) using sine and cosine;
  • basic text features: bag-of-words and TF-IDF, built from scratch;
  • feature selection techniques: correlation analysis, model-based importance ranking, and mutual information;
  • the curse of dimensionality -- why more features isn't always better;
  • a practical feature engineering pipeline that ties everything together.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Learn AI Series (#15) - Feature Engineering and Selection

At the end of episode #14 I promised we'd learn how to get more out of your raw data by constructing better inputs for your models. That promise is what today is about. We've now got models (linear regression in #10, logistic regression in #12), evaluation tools (#13), and a complete data preparation pipeline (#14 -- split, impute, outliers, encode, scale). Solid infrastructure.

And after years of working with data, I keep coming back to the same conclusion: the single biggest lever you have for improving a model is not the algorithm -- it's the features. Two people can take the exact same dataset, use the exact same model, and get wildly different results. The difference? How they represent the data. What they feed into the model. Which columns they construct, which they discard, and how they transform what's there.

Feature engineering is the art of constructing informative inputs from raw data. Feature selection is the discipline of keeping what helps and throwing away what hurts. Together, they're often the highest-leverage activity in any ML project. I'd go as far as saying that a mediocre algorithm with great features will beat a great algorithm with mediocre features, nine times out of ten. And the beautiful part? None of this requires new math or new algorithms. It's about thinking about your data differently ;-)

Let's go.

Why raw features aren't enough

Think about predicting apartment prices (our running example since episode #10). You have square meters, number of rooms, building age, and floor number. A linear model can learn that more square meters means higher price, and older buildings cost less. Straightforward.

But what about this: a 5th-floor apartment in a building WITHOUT an elevator is a very different proposition than a 5th-floor apartment WITH one. The floor feature and the elevator feature interact -- the effect of one depends on the value of the other. A linear model using raw features can't capture that. It learns a weight for floor and a separate weight for elevator, but it has no way to express "high floor is bad ONLY when there's no elevator."

The fix is to create a new feature that explicitly encodes that interaction. And this is the core idea of feature engineering: you're not changing the model, you're changing the representation. You're giving the model information that was implicit in the raw data but invisible in its original form.

Interaction features

Sometimes the relationship isn't in individual features but in combinations of them. Let me show you concretely:

import numpy as np

np.random.seed(42)

# Apartment data: sqm, floor, has_elevator
n = 200
sqm = np.random.uniform(30, 150, n)
floor = np.random.randint(0, 10, n).astype(float)
has_elevator = np.random.randint(0, 2, n).astype(float)

# The REAL relationship: high floor WITHOUT elevator is bad for price
# (who wants to carry groceries up 8 flights of stairs?)
price = (200000 + 2500 * sqm
         - 5000 * floor * (1 - has_elevator)
         + np.random.randn(n) * 15000)

# Version 1: raw features only
X_basic = np.column_stack([sqm, floor, has_elevator])

# Version 2: add the interaction feature
floor_no_elevator = floor * (1 - has_elevator)
X_enhanced = np.column_stack([sqm, floor, has_elevator, floor_no_elevator])

# Compare using the normal equation (quick fit, from episode #11)
print("Feature engineering impact:\n")
for name, X in [("Basic (3 features)", X_basic),
                ("With interaction (4 features)", X_enhanced)]:
    X_b = np.column_stack([X, np.ones(len(X))])
    w = np.linalg.lstsq(X_b, price, rcond=None)[0]
    pred = X_b @ w
    rmse = np.sqrt(np.mean((price - pred) ** 2))
    r2 = 1 - np.sum((price - pred)**2) / np.sum((price - price.mean())**2)
    print(f"  {name:>30s}: RMSE = EUR {rmse:>8,.0f}  R^2 = {r2:.4f}")

Run it. The interaction feature captures a relationship that neither floor nor has_elevator can express alone. The RMSE drops and R-squared jumps -- often dramatically -- because you've handed the model information about the true data-generating process that it simply couldn't derive from the raw inputs.

This pattern appears everywhere in real-world data. The effect of education on income depends on the industry. The effect of a drug depends on dosage AND patient weight. Temperature satisfaction depends on humidity AND wind speed. Whenever the effect of one variable changes depending on the value of another, you need an interaction feature to capture it.

A quick rule of thumb: if you suspect two features have a joint effect, multiply them together and add the product as a new column. It costs you nothing but one extra feature, and if the interaction exists in the data, the model's weights will pick it up. If it doesn't exist, the weight will be near zero and it causes no harm.
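If you want to generate all pairwise products at once rather than hand-picking them, a small helper does it. This is a sketch, and add_pairwise_interactions is a hypothetical name, not part of the episode's code:

```python
import numpy as np

def add_pairwise_interactions(X, names):
    """Append the product of every feature pair as new columns."""
    cols = [X]
    new_names = list(names)
    n_feat = X.shape[1]
    for i in range(n_feat):
        for j in range(i + 1, n_feat):
            cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
            new_names.append(f"{names[i]}*{names[j]}")
    return np.column_stack(cols), new_names

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
X_int, names_int = add_pairwise_interactions(X, ["a", "b", "c"])
print(X_int.shape)   # (2, 6): 3 originals + 3 pairwise products
print(names_int)
```

Let the model's weights (or the selection methods later in this episode) decide which of the products are worth keeping.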

Polynomial features

We already covered this in episode #11 when we discussed polynomial regression. But it's worth framing explicitly as a feature engineering technique, because that's exactly what it is.

A linear model computes w0*x + bias -- a straight line. By adding x^2, x^3, and so on as new features, you give the model the ability to fit curves without changing the model itself. The model is still linear in the weights -- it's still computing w0*x + w1*x^2 + w2*x^3 + bias -- but the features are nonlinear.

def polynomial_features(X, degree=2):
    """Create polynomial features up to given degree."""
    features = [X]
    for d in range(2, degree + 1):
        features.append(X ** d)
    return np.column_stack(features)

# Demo: salary vs experience (recall episode #11)
experience = np.linspace(0, 25, 100).reshape(-1, 1)
X_poly = polynomial_features(experience, degree=3)

print(f"Original: 1 feature")
print(f"Polynomial degree 3: {X_poly.shape[1]} features")
print(f"\nSample: experience=5.0 -> "
      f"[{5.0:.1f}, {5.0**2:.1f}, {5.0**3:.1f}]")
print(f"Sample: experience=10.0 -> "
      f"[{10.0:.1f}, {10.0**2:.1f}, {10.0**3:.1f}]")

Remember the bias-variance tradeoff from episode #11 -- too many polynomial terms and you start fitting noise. Degree 2-3 is usually enough for most features. Going to degree 10 on 12 data points is how you get the catastrophic overfitting we demonstrated last time. Feature engineering gives you power, but regularization (Ridge, Lasso from episode #11) keeps that power in check.

Having said that, you can also combine polynomial features with interaction features. If you have features x1 and x2, you could create x1^2, x2^2, x1*x2, x1^2*x2, etc. The feature space explodes quickly -- which is why feature selection (later in this episode) is equally important.
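A sketch of that combined expansion, using itertools to enumerate every monomial up to a given degree (full_poly_features is a hypothetical helper; for two features and degree 2 it produces x1, x2, x1^2, x1*x2, x2^2):

```python
import numpy as np
from itertools import combinations_with_replacement

def full_poly_features(X, degree=2):
    """All monomials of the input columns up to `degree`,
    including cross terms like x1*x2."""
    n_samples, n_feat = X.shape
    cols = []
    for d in range(1, degree + 1):
        # each combination of column indices is one monomial
        for combo in combinations_with_replacement(range(n_feat), d):
            col = np.ones(n_samples)
            for idx in combo:
                col = col * X[:, idx]
            cols.append(col)
    return np.column_stack(cols)

X = np.array([[2.0, 3.0]])
print(full_poly_features(X, degree=2))  # [[2. 3. 4. 6. 9.]]
```

Notice how fast this grows: with 10 features and degree 3 you already get 285 columns. That explosion is exactly why the selection half of this episode matters.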

Time features: cyclical encoding

Time features are tricky in a way that catches a lot of beginners off guard. Consider the hour of the day as a feature. If you encode it as a plain number 0 through 23, the model sees hour 23 and hour 0 as maximally distant -- 23 units apart. But in reality, 11 PM and midnight are ONE hour apart. They're adjacent. The numeric representation breaks the circular nature of time.

The fix: encode cyclical features using sine and cosine. This maps hours (or days, months, etc.) onto a circle, where adjacent time points are always close together regardless of where the "wrap-around" occurs.

def cyclical_encode(values, period):
    """Encode cyclical features as sin/cos pair."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

# Hours of the day: period = 24
hours = np.array([0, 3, 6, 12, 18, 21, 23])
hour_sin, hour_cos = cyclical_encode(hours, 24)

print("Cyclical hour encoding:")
print(f"{'Hour':>6s}  {'sin':>8s}  {'cos':>8s}")
for h, s, c in zip(hours, hour_sin, hour_cos):
    print(f"  {h:>4d}  {s:>+8.3f}  {c:>+8.3f}")

# Key test: distance between hour 23 and hour 0 should be SMALL
dist_23_0 = np.sqrt((hour_sin[-1] - hour_sin[0])**2 +
                     (hour_cos[-1] - hour_cos[0])**2)
dist_0_12 = np.sqrt((hour_sin[0] - hour_sin[3])**2 +
                     (hour_cos[0] - hour_cos[3])**2)
print(f"\nDistance between hour 23 and hour 0:  {dist_23_0:.3f}  (should be small)")
print(f"Distance between hour 0 and hour 12:  {dist_0_12:.3f}  (should be large)")

The sin/cos pair preserves the circular relationship perfectly. Points that are close on the clock are close in the encoded space. Points that are far apart on the clock are far apart in the encoded space. And the wrap-around is handled automatically because sin/cos are periodic functions -- they complete one full cycle over the period.

Use this for:

  • Hours: period=24
  • Day of week: period=7
  • Month: period=12
  • Day of year: period=365 (or 366 for leap year tolerance, though that level of precision rarely matters)

One feature becomes TWO features (sin and cos). You need both -- sin alone can't distinguish 3 AM from 9 AM (sin(2*pi*3/24) and sin(2*pi*9/24) are both about 0.707). Together they uniquely identify every point on the circle.
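Because the pair pins down a unique angle, the encoding is lossless: you can invert it with arctan2. A quick check (decode_hour is a hypothetical helper, not part of the episode's code):

```python
import numpy as np

def decode_hour(s, c, period=24):
    """Invert a sin/cos pair back to the original value via arctan2."""
    angle = np.arctan2(s, c) % (2 * np.pi)
    return angle * period / (2 * np.pi)

for h in [0, 3, 9, 21, 23]:
    a = 2 * np.pi * h / 24
    print(f"hour {h:>2d} -> decoded {decode_hour(np.sin(a), np.cos(a)):.1f}")
```

Every hour comes back exactly, which confirms that no information is lost by the sin/cos representation.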

Text features: bag-of-words

So far all our data has been numeric. But real datasets are full of text -- product descriptions, customer reviews, medical notes, email subjects. Back in episode #3 we talked about how machines see everything as numbers. Text is no exception. You need to convert it to a numeric representation before any model can use it.

The simplest approach is bag-of-words: count how often each word appears in a document.

# Four simple documents about ML
documents = [
    "machine learning is great",
    "deep learning is a subset of machine learning",
    "neural networks power deep learning",
    "machine learning and statistics overlap",
]

# Step 1: build a vocabulary (unique words -> indices)
vocab = {}
for doc in documents:
    for word in doc.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)

print(f"Vocabulary: {len(vocab)} unique words")
for word, idx in sorted(vocab.items(), key=lambda x: x[1]):
    print(f"  {idx:>2d}: '{word}'")

# Step 2: create the bag-of-words matrix
bow = np.zeros((len(documents), len(vocab)))
for i, doc in enumerate(documents):
    for word in doc.lower().split():
        bow[i, vocab[word]] += 1

print(f"\nBoW matrix shape: {bow.shape}")
print(f"  ({len(documents)} documents x {len(vocab)} words)")

# Show doc 1's word counts
print(f"\nDoc 1: '{documents[1]}'")
for word, idx in sorted(vocab.items(), key=lambda x: x[1]):
    if bow[1, idx] > 0:
        print(f"  '{word}': {bow[1, idx]:.0f}")

Bag-of-words ignores word order entirely -- "dog bites man" and "man bites dog" produce the same vector. That's a real limitation, but for many tasks (topic classification, spam detection, sentiment analysis at a coarse level) the words that appear matter more than their order. A document containing "neural", "network", "gradient", and "backpropagation" is almost certainly about deep learning regardless of how those words are arranged.
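A common partial fix for the word-order problem is to count adjacent word pairs (bigrams) instead of single words. A minimal sketch (bigram_counts is a hypothetical helper):

```python
import numpy as np

def bigram_counts(documents):
    """Count adjacent word pairs ('bigrams') per document --
    a cheap way to retain some of the order that bag-of-words discards."""
    vocab = {}
    per_doc = []
    for doc in documents:
        words = doc.lower().split()
        counts = {}
        for a, b in zip(words, words[1:]):
            bigram = f"{a} {b}"
            vocab.setdefault(bigram, len(vocab))
            counts[bigram] = counts.get(bigram, 0) + 1
        per_doc.append(counts)
    mat = np.zeros((len(documents), len(vocab)))
    for i, counts in enumerate(per_doc):
        for bigram, c in counts.items():
            mat[i, vocab[bigram]] = c
    return mat, vocab

mat, vocab = bigram_counts(["dog bites man", "man bites dog"])
print(mat)  # the two documents now get DIFFERENT vectors
```

Under plain bag-of-words these two documents are identical; with bigrams they share no features at all. The trade-off is a much larger vocabulary, which feeds straight into the dimensionality concerns later in this episode.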

TF-IDF: weighting by importance

Raw word counts have a problem: common words like "is", "a", and "the" dominate the vectors. These words appear in every document and carry almost zero information about the topic. TF-IDF (Term Frequency - Inverse Document Frequency) fixes this by downweighting words that appear in many documents:

def compute_tfidf(bow_matrix):
    """TF-IDF: downweight words that appear everywhere."""
    # TF: we already have this (the raw counts)
    tf = bow_matrix.copy()

    # IDF: words in many docs get lower weight
    n_docs = bow_matrix.shape[0]
    doc_freq = (bow_matrix > 0).sum(axis=0)
    idf = np.log(n_docs / (doc_freq + 1)) + 1  # +1 in denominator smooths; trailing +1 keeps weights positive

    return tf * idf

tfidf = compute_tfidf(bow)

# Compare raw counts vs TF-IDF for document 1
doc_idx = 1
print(f"Doc: '{documents[doc_idx]}'")
print(f"\n{'Word':>12s}  {'Raw count':>10s}  {'TF-IDF':>10s}")
print("-" * 36)

word_data = []
for word, idx in vocab.items():
    if bow[doc_idx, idx] > 0:
        word_data.append((word, bow[doc_idx, idx], tfidf[doc_idx, idx]))

for word, raw, weighted in sorted(word_data, key=lambda x: -x[2]):
    print(f"  {word:>10s}  {raw:>10.0f}  {weighted:>10.3f}")

Look at the results. Common words like "is" and "a" get low TF-IDF scores because they appear in many documents (high document frequency -> low IDF weight). Distinctive words like "subset" or "deep" get higher scores because they appear in fewer documents. This is far more informative for text classification than raw counts.

TF-IDF is one of the most widely used text representations in practice. It's the default for many text classification pipelines, email filtering systems, and document retrieval engines. We'll eventually encounter more sophisticated text representations (word embeddings, transformers) much later in this series, but TF-IDF remains the baseline that everything else gets compared against.
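In practice, TF-IDF vectors are usually compared with cosine similarity rather than raw distance, so document length stops mattering. A minimal sketch with made-up vectors (the numbers are illustrative, not computed from the documents above):

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity between two document vectors: 1 = same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical TF-IDF vectors over a 4-word vocabulary
doc_a = np.array([0.0, 1.2, 0.8, 0.0])
doc_b = np.array([0.0, 1.0, 0.9, 0.1])  # similar topic to doc_a
doc_c = np.array([1.5, 0.0, 0.0, 1.1])  # different topic

print(f"a vs b: {cosine_similarity(doc_a, doc_b):.3f}")  # close to 1
print(f"a vs c: {cosine_similarity(doc_a, doc_c):.3f}")  # close to 0
```

This pairing -- TF-IDF plus cosine similarity -- is the backbone of classic document retrieval.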

Feature selection: keeping what matters

So far we've been ADDING features -- interactions, polynomials, cyclical encodings, text representations. Each addition gives the model more information to work with. But there's a cost. Irrelevant features add noise, slow down training, and -- crucially -- can cause overfitting. Remember from episode #11: a degree-11 polynomial with 12 data points memorized the noise. More features means more parameters, which means more capacity to memorize instead of learn.

Feature selection is the opposite of feature engineering: you remove what doesn't help. The goal is a compact set of features that captures the signal and leaves out the noise.

Method 1: correlation-based filtering

The simplest approach: check how strongly each feature correlates with the target. Features with near-zero correlation contribute nothing. Features that are highly correlated with each other are redundant -- keep one, drop the rest.

# Generate data with useful, redundant, and useless features
np.random.seed(42)
n = 300

useful_1 = np.random.randn(n)
useful_2 = np.random.randn(n)
redundant = useful_1 * 0.95 + np.random.randn(n) * 0.1   # ~copy of useful_1
noise_1 = np.random.randn(n)    # pure noise
noise_2 = np.random.randn(n)    # more pure noise

y = 3 * useful_1 + 2 * useful_2 + np.random.randn(n) * 0.5

X = np.column_stack([useful_1, useful_2, redundant, noise_1, noise_2])
names = ["useful_1", "useful_2", "redundant", "noise_1", "noise_2"]

# Feature-target correlations
print("Feature-target correlations:")
for i, name in enumerate(names):
    corr = np.corrcoef(X[:, i], y)[0, 1]
    verdict = ""
    if abs(corr) < 0.05:
        verdict = " <- DROP (no signal)"
    elif abs(corr) > 0.5:
        verdict = " <- KEEP"
    print(f"  {name:>12s}: {corr:+.3f}{verdict}")

# Feature-feature correlations (detect redundancy)
print("\nFeature-feature correlations (|r| > 0.5 only):")
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        corr = np.corrcoef(X[:, i], X[:, j])[0, 1]
        if abs(corr) > 0.5:
            print(f"  {names[i]:>12s} <-> {names[j]:<12s}: "
                  f"{corr:+.3f} <- REDUNDANT")

The noise features have near-zero correlation with the target -- they carry no predictive information. The redundant feature correlates strongly with useful_1 because it's basically a noisy copy. Keeping both wastes model capacity and can cause problems with multicollinearity (correlated features make weight estimates unstable, as we discussed in episode #10 with the normal equation).

A practical recipe: compute correlations, drop features with |correlation| < some threshold (0.05 is conservative), and when two features correlate strongly with each other (|r| > 0.9), drop the one that has weaker correlation with the target.
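That recipe can be wrapped into a small helper. A sketch using the thresholds just mentioned (select_by_correlation is a hypothetical name, not part of the episode's code):

```python
import numpy as np

def select_by_correlation(X, y, names, target_thresh=0.05, redundancy_thresh=0.9):
    """Drop features with |corr(feature, target)| below target_thresh;
    among strongly correlated pairs, keep the one closer to the target."""
    p = X.shape[1]
    target_corr = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(p)])
    keep = target_corr >= target_thresh
    for i in range(p):
        for j in range(i + 1, p):
            if keep[i] and keep[j]:
                r = abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                if r > redundancy_thresh:
                    # drop whichever of the pair is weaker w.r.t. the target
                    keep[i if target_corr[i] < target_corr[j] else j] = False
    return [n for n, k in zip(names, keep) if k]

rng = np.random.default_rng(0)
n = 1000
useful = rng.normal(size=n)
redundant = 0.95 * useful + 0.1 * rng.normal(size=n)
noise = rng.normal(size=n)
y = 3 * useful + 0.5 * rng.normal(size=n)
X = np.column_stack([useful, redundant, noise])

print(select_by_correlation(X, y, ["useful", "redundant", "noise"]))
```

Note the O(p^2) pairwise loop: fine for dozens of features, but for thousands you'd vectorize with a full correlation matrix instead.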

Method 2: model-based importance

A more sophisticated approach: train a model, then look at which features it actually uses. Features with near-zero weights contribute nothing:

# Scale features first (critical for weight comparison -- episode #11)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit linear regression using the normal equation
X_b = np.column_stack([X_scaled, np.ones(n)])
w = np.linalg.lstsq(X_b, y, rcond=None)[0]

print("Feature importance (|weight| after scaling):")
for i, name in enumerate(names):
    importance = abs(w[i])
    verdict = ""
    if importance < 0.1:
        verdict = " <- DROP"
    elif importance > 0.5:
        verdict = " <- IMPORTANT"
    print(f"  {name:>12s}: |w| = {importance:.3f}{verdict}")

# Compare: model with ALL features vs selected features only
keep_mask = np.array([abs(w[i]) > 0.1 for i in range(len(names))])
X_selected = X_scaled[:, keep_mask]
selected_names = [n for n, k in zip(names, keep_mask) if k]

X_sel_b = np.column_stack([X_selected, np.ones(n)])
w_sel = np.linalg.lstsq(X_sel_b, y, rcond=None)[0]

# RMSE comparison
pred_all = X_b @ w
pred_sel = X_sel_b @ w_sel
rmse_all = np.sqrt(np.mean((y - pred_all) ** 2))
rmse_sel = np.sqrt(np.mean((y - pred_sel) ** 2))

print(f"\nAll {len(names)} features:     RMSE = {rmse_all:.4f}")
print(f"Selected {len(selected_names)} features: RMSE = {rmse_sel:.4f}")
print(f"Kept: {selected_names}")

The noise features should have near-zero weights -- the model learned to ignore them. The redundant feature may still get a moderate weight, because with correlated features least squares can split the weight between them (lstsq's minimum-norm solution behaves a bit like Ridge here). In practice, you'd use Lasso regularization (from episode #11) instead -- remember, L1 drives irrelevant weights to exactly zero, which IS automatic feature selection.

This connection is important. When we built Lasso in episode #11 I said it performs "automatic feature selection." Now you can see what that means concretely: Lasso selects features by eliminating the ones with zero importance. Feature engineering creates candidates; feature selection (or Lasso) prunes the losers.
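To make that concrete, here's a minimal coordinate-descent Lasso sketch -- not the episode #11 implementation, but a simplified version assuming standardized feature columns. The soft-thresholding step is what drives small weights to exactly zero:

```python
import numpy as np

def lasso_cd(X, y, alpha=0.5, n_iter=200):
    """Minimal coordinate-descent Lasso; assumes standardized columns.
    Minimizes 0.5*||y - Xw||^2 + alpha*n*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every feature's contribution except j's
            resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ resid
            # soft-thresholding: small correlations snap to exactly zero
            w[j] = np.sign(rho) * max(abs(rho) - alpha * n, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3 * X[:, 0] + 0.5 * rng.normal(size=n)

w = lasso_cd(X, y, alpha=0.5)
print(w)  # only the first weight survives; the other two are exactly 0
```

The surviving weight is shrunk below the true value of 3 -- that's the bias Lasso trades for its selection behavior, as discussed in episode #11.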

Method 3: permutation importance

Here's a clever trick that works with any model, not just linear ones. The idea: if a feature is important, shuffling its values randomly should destroy the model's performance. If shuffling doesn't change anything, the feature wasn't being used:

# Split into train/test
idx = np.random.permutation(n)
split = int(0.8 * n)
X_tr, X_te = X_scaled[idx[:split]], X_scaled[idx[split:]]
y_tr, y_te = y[idx[:split]], y[idx[split:]]

# Train
X_tr_b = np.column_stack([X_tr, np.ones(len(X_tr))])
X_te_b = np.column_stack([X_te, np.ones(len(X_te))])
w_perm = np.linalg.lstsq(X_tr_b, y_tr, rcond=None)[0]

# Baseline performance
baseline_rmse = np.sqrt(np.mean((y_te - X_te_b @ w_perm) ** 2))

print(f"Baseline test RMSE: {baseline_rmse:.4f}")
print(f"\nPermutation importance:")

for i, name in enumerate(names):
    # Shuffle this feature's values in the test set
    X_shuffled = X_te.copy()
    X_shuffled[:, i] = np.random.permutation(X_shuffled[:, i])
    X_shuffled_b = np.column_stack([X_shuffled, np.ones(len(X_shuffled))])

    shuffled_rmse = np.sqrt(np.mean((y_te - X_shuffled_b @ w_perm) ** 2))
    importance = shuffled_rmse - baseline_rmse

    bar = "+" * int(importance * 20)
    print(f"  {name:>12s}: delta RMSE = {importance:>+.4f}  {bar}")

This is model-agnostic -- you can use it with linear regression, decision trees, neural networks, anything. Shuffle one feature at a time, measure how much the performance degrades. Big degradation = important feature. No degradation = useless feature. We'll revisit this technique when we get to more complex models that don't have simple weight interpretations.
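One caveat: a single shuffle gives a noisy estimate. Common practice is to repeat the shuffle several times and average. A model-agnostic sketch (permutation_importance is a hypothetical helper; `predict` stands for any model wrapped as a prediction function):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average RMSE increase over several shuffles of each feature column."""
    rng = np.random.default_rng(seed)
    baseline = np.sqrt(np.mean((y - predict(X)) ** 2))
    importances = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            X_shuf = X.copy()
            X_shuf[:, i] = rng.permutation(X_shuf[:, i])
            rmse = np.sqrt(np.mean((y - predict(X_shuf)) ** 2))
            deltas.append(rmse - baseline)
        importances[i] = np.mean(deltas)
    return importances

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + 0.1 * rng.normal(size=500)
predict = lambda X: 2 * X[:, 0]   # toy "model" that only uses feature 0

print(permutation_importance(predict, X, y))
```

Because `predict` is just a callable, the exact same function works unchanged for trees or neural networks later in the series.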

The curse of dimensionality

Now I want to show you why indiscriminately adding features is dangerous, even if each feature seems like it could help. This concept has a dramatic-sounding name -- the curse of dimensionality -- and it's one of the most important ideas in all of machine learning.

As the number of dimensions (features) increases, something counterintuitive happens: data becomes increasingly sparse. Points that seem "close" in low dimensions become isolated in high dimensions. The space grows exponentially, but your data doesn't.

# Demonstrate: how distance behaves as dimensions increase
for n_dims in [2, 5, 10, 50, 100, 500]:
    points = np.random.randn(200, n_dims)

    # Compute distances between pairs of points
    dists = []
    for i in range(50):
        for j in range(i + 1, 50):
            d = np.sqrt(np.sum((points[i] - points[j]) ** 2))
            dists.append(d)

    avg_dist = np.mean(dists)
    min_dist = np.min(dists)
    max_dist = np.max(dists)
    ratio = min_dist / avg_dist

    print(f"  {n_dims:>4d} dims: avg={avg_dist:>7.1f}  "
          f"min={min_dist:>6.1f}  max={max_dist:>7.1f}  "
          f"min/avg={ratio:.3f}")

Watch the min/avg ratio as dimensions increase. In 2D, the nearest neighbor is MUCH closer than the average distance -- there's a clear distinction between "close" and "far." In 500 dimensions, the nearest neighbor is almost as far as the average -- everything is roughly equidistant. All points are equally lonely. This is the curse.

Why does this matter for ML? Many algorithms rely on the concept of "closeness" -- nearest-neighbor methods, kernel methods, clustering. When all points are equidistant, "find the nearest neighbor" becomes meaningless. Even models that don't explicitly use distances (like linear regression) suffer because high-dimensional spaces need exponentially more data to fill. With 500 features and 200 samples, your data points are scattered like 200 grains of sand in a warehouse. Good luck finding patterns.

The practical takeaway: every feature must earn its place. Adding a feature that carries genuine signal about the target is almost always worth it. Adding a feature that's noise? You've just made the warehouse bigger without adding any sand. Your model is now searching a larger space for the same signal, and the chance of finding spurious correlations (noise that happens to look like patterns) goes up.
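Another way to see the curse: in high dimensions, almost none of a cube's volume lies near its center. A Monte Carlo sketch estimating the fraction of random cube points that fall inside the inscribed unit sphere:

```python
import numpy as np

rng = np.random.default_rng(42)
fractions = {}
for d in [2, 5, 10, 20]:
    # Uniform points in the cube [-1, 1]^d; how many land in the unit sphere?
    pts = rng.uniform(-1, 1, size=(100_000, d))
    fractions[d] = np.mean(np.sum(pts ** 2, axis=1) <= 1.0)
    print(f"  {d:>3d} dims: {fractions[d]:.5f} of the cube lies in the sphere")
```

In 2D the fraction is about 0.785 (pi/4); by 10 dimensions it's a fraction of a percent, and by 20 it's effectively zero. Nearly all of the volume -- and therefore nearly all of your data -- ends up in the corners, far from everything else.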

This is why feature selection isn't optional -- it's a necessary counterbalance to feature engineering. Create candidate features aggressively, then prune ruthlessly.

A complete feature engineering pipeline

Let me bring everything together into a realistic workflow. We'll start with raw apartment data and systematically engineer, select, and evaluate features:

# Realistic apartment dataset
np.random.seed(42)
n = 500

# Raw features
sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
age = np.random.uniform(0, 50, n)
floor = np.random.randint(0, 10, n).astype(float)
has_elevator = np.random.randint(0, 2, n).astype(float)
hour_listed = np.random.randint(0, 24, n).astype(float)

# True relationship (the model needs to discover this)
price = (2500 * sqm
         + 800 * rooms
         - 300 * age
         - 4000 * floor * (1 - has_elevator)  # interaction!
         + 20 * sqm * (50 - age) / 50         # subtle: newer large apts worth more
         + np.random.randn(n) * 15000)

# Step 1: raw features only
X_raw = np.column_stack([sqm, rooms, age, floor, has_elevator, hour_listed])
raw_names = ["sqm", "rooms", "age", "floor", "elevator", "hour"]

# Step 2: engineer new features
hour_sin, hour_cos = cyclical_encode(hour_listed, 24)
floor_no_elev = floor * (1 - has_elevator)
sqm_age = sqm * age
sqm_squared = sqm ** 2

X_engineered = np.column_stack([
    sqm, rooms, age, floor, has_elevator,    # originals (no hour_listed!)
    hour_sin, hour_cos,                       # cyclical hour
    floor_no_elev,                            # interaction
    sqm_age,                                  # interaction
    sqm_squared,                              # polynomial
])
eng_names = ["sqm", "rooms", "age", "floor", "elevator",
             "hour_sin", "hour_cos",
             "floor*no_elev", "sqm*age", "sqm^2"]

# Step 3: split FIRST (episode #14's golden rule)
idx = np.random.permutation(n)
split = int(0.8 * n)

# Step 4: evaluate raw vs engineered
print("Feature engineering comparison:\n")
for label, X_data, f_names in [("Raw features", X_raw, raw_names),
                                ("Engineered features", X_engineered, eng_names)]:
    X_tr = X_data[idx[:split]]
    X_te = X_data[idx[split:]]
    y_tr = price[idx[:split]]
    y_te = price[idx[split:]]

    # Scale (fit on train, transform both)
    tr_mean = X_tr.mean(axis=0)
    tr_std = X_tr.std(axis=0)
    tr_std[tr_std == 0] = 1
    X_tr_s = (X_tr - tr_mean) / tr_std
    X_te_s = (X_te - tr_mean) / tr_std

    # Fit
    X_tr_b = np.column_stack([X_tr_s, np.ones(len(X_tr_s))])
    X_te_b = np.column_stack([X_te_s, np.ones(len(X_te_s))])
    w = np.linalg.lstsq(X_tr_b, y_tr, rcond=None)[0]

    # Evaluate
    te_pred = X_te_b @ w
    rmse = np.sqrt(np.mean((y_te - te_pred) ** 2))
    r2 = 1 - np.sum((y_te - te_pred)**2) / np.sum((y_te - y_te.mean())**2)

    print(f"  {label} ({len(f_names)} features):")
    print(f"    Test RMSE: EUR {rmse:>8,.0f}")
    print(f"    Test R^2:  {r2:.4f}")
    print()

The engineered features should handily beat the raw ones. The interaction feature floor*no_elev directly encodes a relationship that exists in the data. The sqm*age interaction captures the subtle effect that newer large apartments command a premium. And we replaced the raw hour_listed (which has the wrap-around problem) with sin/cos encodings.

Notice what we did NOT do: we didn't blindly create every possible interaction and polynomial. We thought about which combinations make domain sense. sqm * floor might capture something (higher floors in larger buildings?), but rooms * hour_listed is almost certainly noise. Domain knowledge guides feature engineering. Blind combinatorial explosion creates the curse of dimensionality.

When to engineer vs when to use better algorithms

I want to be honest with you about something. Everything we've done in this episode -- interaction features, polynomial features, cyclical encoding, manually constructing new columns -- is the traditional approach to feature engineering. It works, and it works well. Many production ML systems in finance, healthcare, and ad-tech still run on hand-engineered features fed into linear models or gradient-boosted trees.

But there's a whole family of models coming later in this series -- specifically, decision trees and their more powerful variants -- that can discover interaction effects and nonlinear patterns on their own, without you manually creating floor * (1 - has_elevator). And neural networks go even further: they learn feature representations automatically from raw data. The entire deep learning revolution is basically about models that do their own feature engineering.

Does that make what we learned today useless? Not at all. Three reasons:

  1. You need to understand what features ARE before you can evaluate whether a model is learning good ones. If you've never manually engineered features, you won't recognize when a neural network's internal representations are capturing the right structure.

  2. For small datasets and simple models, hand-crafted features still win. A linear model with 5 thoughtfully engineered features and 200 samples will often outperform a neural network on the same data, because the neural network doesn't have enough examples to learn the representations from scratch.

  3. Feature engineering is always the first thing to try. Training a neural network takes hours. Adding an interaction feature takes seconds. If you can get 90% of the performance improvement from a simple feature transformation, why burn compute?

Let's recap

We stepped up from data preparation (#14) to data transformation today. Feature engineering and selection are the bridge between "clean data" and "good model performance." Here's what we covered:

  • Interaction features capture joint effects between variables -- multiply two columns to create a feature that neither can express alone. If high floors hurt price ONLY without an elevator, floor * (1 - elevator) is the feature that captures that;
  • Polynomial features let linear models fit curves by adding x^2, x^3 etc. as new columns. Same technique from episode #11, now understood as feature engineering;
  • Cyclical encoding (sin/cos pairs) handles periodic features like hours and months. Without it, the model thinks midnight and 11 PM are 23 units apart when they're actually 1;
  • Bag-of-words converts text to numbers by counting word occurrences per document. TF-IDF improves on this by downweighting words that appear in every document (common words = low information);
  • Feature selection removes noise: correlation analysis for quick filtering, model weights for importance ranking, permutation importance for model-agnostic evaluation. Lasso (episode #11) performs automatic feature selection by driving useless weights to zero;
  • The curse of dimensionality means more features requires exponentially more data. In high dimensions, all points become equidistant and distance-based reasoning breaks down. Every feature must earn its place;
  • Domain knowledge guides feature engineering. Don't blindly create every possible combination -- think about which relationships make sense in your data.

We've now got the complete "classical ML" toolkit: models (#10-12), evaluation (#13), data preparation (#14), and feature engineering (this episode). That's everything you need to tackle real problems with linear and logistic regression. But we've been doing everything by hand -- writing our own gradient descent, our own StandardScaler, our own cross-validation, our own feature engineering functions. There's a library that packages ALL of this into a clean, tested, professional API. It's the standard tool that every ML practitioner reaches for first. And it follows the exact same fit()/transform()/predict() pattern we've been building from scratch. We'll get our hands on it soon ;-)

Thanks for reading! Questions or suggestions? Drop them in the comments ;-)

@scipio


