Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About


What will I learn

  • You will learn the three-way split (train/validation/test) and why two sets are not enough for serious ML work;
  • the cardinal sin of data leakage -- how test set information silently poisons your training process;
  • three strategies for handling missing values: dropping, imputation, and the "missing indicator" trick;
  • categorical encoding methods: one-hot, label encoding, and target encoding (with leakage warnings);
  • the pipeline pattern -- preprocessing that never leaks, using the same fit()/transform() design from episode #11;
  • outlier detection with Z-scores and IQR -- identifying extreme values before they wreck your weights;
  • a complete data preparation workflow that chains all of the above into one reusable function.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About

At the end of episode #13 I hinted at a gap in everything we've done so far. We built linear regression from scratch in episode #10. We extended it with polynomial features, regularization, and the normal equation in episode #11. We crossed over to classification with logistic regression in #12. And we built a full evaluation toolkit in #13 -- confusion matrices, precision, recall, F1, AUC-ROC, cross-validation. Solid foundations all around.

But here's the uncomfortable truth I've been quietly dodging: all of our data has been synthetic. We generated it ourselves with np.random. Clean. Complete. No missing values. No weird formats. No text columns that need converting to numbers. No outliers from a broken sensor dumping 999999 into your temperature column. Every dataset we've used has been perfectly behaved, because I designed it that way.

Real data is not like that. Not even close.

Ask any working data scientist what takes up most of their time. The answer is never "building models." It's always some variation of: cleaning data, fixing missing values, transforming features, figuring out why column 47 has a different format than column 46, debugging why the pipeline that worked yesterday suddenly produces garbage today. The glamorous part -- building and training the model -- is maybe 20% of the work. The other 80% is data preparation. And nobody writes blog posts about it, because it's not exciting. It's essential. I'd argue it's the single biggest factor that separates ML projects that succeed from ones that fail ;-)

Today we tackle that 80%. By the end of this episode you'll have a complete data preparation toolkit -- and more importantly, you'll understand why each step matters and what happens when you skip it.

Let's go.

The three-way split: train, validation, test

Back in episode #10, we split our apartment price data into two sets: training and test. The model learned from the training set, we evaluated it on the test set. Simple, clean, and for a first introduction it was exactly right.

But two sets aren't enough for real work. Here's why.

Remember in episode #11, we swept through different regularization strengths (alpha values) and picked the one with the best test performance? And in episode #13, we tuned the classification threshold by looking at which threshold gave the best F1 score on test data? Each time we used the test set to make a decision, we implicitly leaked information from the test set into our development process. The test set is supposed to represent data the model has NEVER seen and that influenced NONE of our decisions. If we use it to choose hyperparameters, it stops being "unseen" -- it becomes just another input to our optimization process.

The fix is simple: use three sets instead of two.

  • Training set (~60-70%): the model learns its weights from this
  • Validation set (~15-20%): used during development to tune hyperparameters, compare models, select thresholds -- all the decisions that happen BEFORE your final evaluation
  • Test set (~15-20%): touched ONCE at the very end, as the final honest check. This set must influence zero decisions.

import numpy as np

np.random.seed(42)

def three_way_split(X, y, train_frac=0.7, val_frac=0.15):
    """Split data into train, validation, and test sets."""
    n = len(y)
    idx = np.random.permutation(n)

    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))

    return {
        'X_train': X[idx[:train_end]],      'y_train': y[idx[:train_end]],
        'X_val':   X[idx[train_end:val_end]], 'y_val': y[idx[train_end:val_end]],
        'X_test':  X[idx[val_end:]],         'y_test': y[idx[val_end:]],
    }

# Example: a 1000-sample classification dataset
n = 1000
X = np.random.randn(n, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

splits = three_way_split(X, y)
for name in ['train', 'val', 'test']:
    count = len(splits[f'y_{name}'])
    print(f"  {name:>5s}: {count:>4d} samples ({count/n:.0%})")

The workflow is: train on the training set. Tune hyperparameters using the validation set. When you're happy with everything, do ONE final evaluation on the test set and report that number. If you go back and change something after seeing the test result, you've contaminated the test set and your reported performance is optimistic.

Having said that, in practice many people still do a simple train/test split and use cross-validation (from episode #13) instead of a dedicated validation set. Cross-validation gives you K different train/validation splits from the same data, which is more data-efficient. Both approaches work. The three-way split is conceptually cleaner; cross-validation is more practical when data is scarce. The point is the same either way: never use the test set to make decisions.
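As a refresher, the K-fold idea can be sketched in a few lines of index arithmetic (a minimal version; episode #13 has the full treatment):

```python
import numpy as np

def kfold_indices(n, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Sanity check: with n=100 and k=5, every sample lands in exactly one validation fold
all_val = np.concatenate([val for _, val in kfold_indices(100)])
print(len(all_val), len(np.unique(all_val)))  # 100 100
```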

Data leakage: the cardinal sin

Data leakage happens when information from the test set bleeds into the training process. It makes your model look dramatically better than it actually is. You get excited. You deploy it. It fails. And then you spend three days figuring out why your "95% accurate" model performs at 60% in production.

I hammered on this in episode #11 with the StandardScaler -- fit on train only, transform both. And again in episode #13 when we discussed how cross-validation must scale features independently inside each fold. But leakage is so common and so destructive that it deserves its own section here, with the most common traps laid out explicitly.

Trap 1: scaling before splitting

# WRONG: scaling before splitting
# The scaler sees test data statistics (mean, std include test samples)
X_all_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # uses ALL data
X_train_leaked = X_all_scaled[:800]
X_test_leaked = X_all_scaled[800:]

# RIGHT: scale after splitting, fit on train only
X_train_raw, X_test_raw = X[:800], X[800:]
mean = X_train_raw.mean(axis=0)
std = X_train_raw.std(axis=0)
std[std == 0] = 1
X_train_clean = (X_train_raw - mean) / std
X_test_clean = (X_test_raw - mean) / std  # same mean/std from train!

print("Leakage check:")
print(f"  Leaked train mean: {X_train_leaked.mean(axis=0)[:3].round(4)}")
print(f"  Clean train mean:  {X_train_clean.mean(axis=0)[:3].round(4)}")
print(f"  Clean TEST mean:   {X_test_clean.mean(axis=0)[:3].round(4)}")
print(f"  (Test mean should NOT be zero -- and it isn't!)")

Notice the difference. In the "leaked" version, the training data has near-zero mean because the scaler was fit on ALL data including the test portion. In the "clean" version, the training data has exactly zero mean (because we fit on it), and the test data has a non-zero mean (because we applied the training statistics). That non-zero test mean is correct -- it reflects that the test data might have a slightly different distribution, which is exactly what happens in the real world.

Trap 2: imputing missing values before splitting

Same problem, different disguise. If you fill in missing values using statistics computed from the entire dataset, the imputed values in the training set contain information from the test set. We'll build proper imputation in the next section -- but the rule is the same as scaling: fit on train, apply to both.
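The same wrong/right contrast as the scaling trap, sketched for mean imputation (synthetic data where the test portion deliberately has a shifted distribution):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 1) * 10 + 50
X[::7] = np.nan          # ~14% of rows missing
X[800:] += 20            # the test portion comes from a shifted distribution

# WRONG: the fill value is influenced by test-set statistics
fill_leaky = np.nanmean(X)          # computed over ALL rows, including test

# RIGHT: split first, fit the fill value on train only
X_train, X_test = X[:800], X[800:]
fill_clean = np.nanmean(X_train)

print(f"leaky fill: {fill_leaky:.1f}  clean fill: {fill_clean:.1f}")
# The leaky fill is dragged toward the shifted test distribution.
```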

Trap 3: time-series data split randomly

This one is sneaky and I see it constantly in beginner projects. If your data has a time component -- stock prices, weather measurements, sensor readings, user activity over months -- you can NOT split it randomly. A random split puts future observations in the training set and past observations in the test set. The model literally trains on the future to predict the past. Of course it performs well -- it's cheating.

For time-series data, always split chronologically: the training set is all data before time T, the test set is everything after. No exceptions.
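A chronological split is just sorting by timestamp and slicing; a minimal sketch with a synthetic time column:

```python
import numpy as np

np.random.seed(42)
n = 100
timestamps = np.random.permutation(n)   # rows arrive in shuffled order

# Sort by time, then slice: train = first 80% chronologically, test = the rest
order = np.argsort(timestamps)
split = int(0.8 * n)
train_idx, test_idx = order[:split], order[split:]

print(f"latest train time:  {timestamps[train_idx].max()}")   # 79
print(f"earliest test time: {timestamps[test_idx].min()}")    # 80
```

Every test observation is strictly later than every training observation -- no training on the future.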

Trap 4: duplicate rows in train and test

If your dataset has duplicate rows (more common than you'd think -- scraped data, merged tables, repeated measurements), a random split might put the same data point in both train and test. The model "remembers" exact copies from training and gets them right on the test set. Easy to detect: deduplicate before splitting.
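Deduplicating exact row copies is a one-liner in NumPy:

```python
import numpy as np

data = np.array([
    [65, 2, 185000],
    [82, 3, 210000],
    [65, 2, 185000],   # exact duplicate of row 0
    [90, 3, 235000],
])

# axis=0 treats each row as one unit; duplicates collapse to a single copy
unique_rows = np.unique(data, axis=0)
print(f"{len(data)} rows -> {len(unique_rows)} after deduplication")  # 4 rows -> 3
```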

The rule that covers all of these: do everything as if the test set doesn't exist until the final evaluation. If any step in your pipeline would change its behavior based on whether the test data was included or not, you have leakage.

Handling missing values

Real data has gaps. Sensors fail. Users skip form fields. Databases have nulls from migrations gone wrong. If you've ever worked with any real-world dataset -- medical records, survey data, financial records, web scrapes -- you know this. Missing values are the norm, not the exception.

NumPy uses np.nan (Not a Number) to represent missing values. Let me show you the three main strategies for dealing with them.

# Simulated apartment dataset with missing values
data = np.array([
    [ 65,  2, np.nan, 185000],
    [ 82,  3,  5,     210000],
    [np.nan, 1, 30,   145000],
    [120,  4,  2,     320000],
    [ 55,  np.nan, 10, 165000],
    [ 90,  3,  8,     235000],
    [ 70,  2, 20,     190000],
    [110,  4,  1,     295000],
    [ 48,  1, 35,     135000],
    [ 95,  3, 12,     250000],
])

feature_names = ["sqm", "rooms", "age", "price"]

# Show which values are missing
print("Missing value report:")
for i, name in enumerate(feature_names):
    n_missing = np.isnan(data[:, i]).sum()
    pct = n_missing / len(data)
    if n_missing > 0:
        print(f"  {name}: {n_missing} missing ({pct:.0%})")
    else:
        print(f"  {name}: complete")

Strategy 1: drop rows with missing values

The simplest approach. Any row with a NaN gets deleted. Quick, no assumptions made.

# Drop rows with any NaN
mask = ~np.isnan(data).any(axis=1)
data_dropped = data[mask]
lost = len(data) - len(data_dropped)
print(f"Before: {len(data)} rows  After: {len(data_dropped)} rows  "
      f"Lost: {lost} rows ({lost/len(data):.0%})")

The downside is obvious: you're throwing away entire rows because of a single missing value. In our small example that's 3 out of 10 rows gone -- 30% of our data. If you have 10,000 samples and 5% have missing values, losing 500 rows might be acceptable. If 40% have missing values, you'd lose almost half your dataset. Not great.

Dropping is reasonable when: (a) very few rows are affected, (b) you have plenty of data to spare, and (c) the missingness is truly random (not correlated with the target -- if all high-risk patients have missing values, dropping them biases your dataset). Otherwise, you need imputation.

Strategy 2: impute (fill in) missing values

Replace each missing value with a reasonable estimate. The most common approaches:

def impute_column(data, col, strategy='mean'):
    """Impute missing values in a single column."""
    result = data.copy()
    mask = np.isnan(result[:, col])
    if not mask.any():
        return result

    known_values = result[~mask, col]
    if strategy == 'mean':
        fill = np.mean(known_values)
    elif strategy == 'median':
        fill = np.median(known_values)
    elif strategy == 'zero':
        fill = 0.0
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    result[mask, col] = fill
    return result


def impute_all(data, strategy='mean'):
    """Impute all columns using the given strategy."""
    result = data.copy()
    for col in range(data.shape[1]):
        mask = np.isnan(result[:, col])
        if mask.any():
            known = result[~mask, col]
            if strategy == 'mean':
                fill_val = np.nanmean(result[:, col])
            elif strategy == 'median':
                fill_val = np.nanmedian(result[:, col])
            else:
                fill_val = 0.0
            result[mask, col] = fill_val
            print(f"  {feature_names[col]}: filled {mask.sum()} NaN(s) "
                  f"with {strategy}={fill_val:.1f}")
    return result


print("Mean imputation:")
data_mean = impute_all(data, 'mean')
print("\nMedian imputation:")
data_median = impute_all(data, 'median')

Mean is the default choice. It doesn't shift the overall distribution much. But it's sensitive to outliers -- if one apartment is 400 sqm and the rest are around 80, the mean gets dragged up to maybe 110, which is too high for a typical apartment.

Median is more robust. It's the middle value, so outliers don't affect it. For skewed data (incomes, house prices, anything with a long tail), median imputation is almost always better than mean.
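A quick numeric illustration of that robustness, using hypothetical sqm values with one extreme listing:

```python
import numpy as np

sqm = np.array([65.0, 70.0, 75.0, 80.0, 85.0, 400.0])  # one extreme listing

print(f"mean:   {np.mean(sqm):.1f}")    # 129.2 -- dragged up by the 400 sqm outlier
print(f"median: {np.median(sqm):.1f}")  # 77.5  -- stays near the typical apartment
```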

For categorical data (which we'll handle in the encoding section), you'd use the mode -- the most frequent value.
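Mode imputation can be sketched with np.unique; here an empty string marks a missing category (a convention chosen for this example):

```python
import numpy as np

colors = np.array(["red", "blue", "red", "", "green", "red"])  # "" marks missing

# Mode = most frequent known value
known = colors[colors != ""]
values, counts = np.unique(known, return_counts=True)
mode = values[np.argmax(counts)]

filled = np.where(colors == "", mode, colors)
print(f"mode: {mode}")
print(filled)
```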

The critical rule (you saw this coming): compute the imputation values from the training set only, then apply them to both train and test. If you compute the mean from the full dataset, you've leaked test information into your imputation. Same sin as scaling before splitting.

Strategy 3: add a "missing" indicator

Sometimes the fact that a value is missing is itself informative. Think about it: a missing income field in a loan application might correlate with unemployment. A missing blood pressure reading might mean the patient was too sick to sit still for the measurement. The absence of data carries information that pure imputation destroys.

The strategy: impute the value (so you have numbers for the model to work with) AND add a binary column that says "this value was originally missing."

# Create binary indicators for missing values
X_features = data[:, :3]  # sqm, rooms, age
missing_flags = np.isnan(X_features).astype(float)
X_imputed = impute_all(data, 'mean')[:, :3]

# Combine: imputed features + missing flags
X_augmented = np.column_stack([X_imputed, missing_flags])

print(f"\nOriginal features: {X_features.shape[1]} columns")
print(f"With missing flags: {X_augmented.shape[1]} columns")
print(f"  (3 original + 3 indicator flags)")

# Show one row that had a missing value (row 0: age was NaN)
print(f"\nRow 0 (age was missing):")
print(f"  sqm={X_augmented[0, 0]:.0f}  rooms={X_augmented[0, 1]:.0f}  "
      f"age={X_augmented[0, 2]:.1f} (imputed)")
print(f"  sqm_missing={X_augmented[0, 3]:.0f}  rooms_missing={X_augmented[0, 4]:.0f}  "
      f"age_missing={X_augmented[0, 5]:.0f}")

The model can now learn patterns like "when the age field is missing AND the price is low, it probably means X." You've given it strictly more information than imputation alone. The downside is doubling your feature count (or close to it, if many columns have missing values), but that's usually a worthwhile tradeoff.

Categorical encoding: turning text into numbers

Back in episode #3, we talked about how machines see everything as numbers. Text, images, categories -- it all needs to become a numeric array before any model can touch it. We briefly mentioned one-hot encoding there. Now let's formalize the three main approaches and -- crucially -- understand when each one is appropriate and when it's dangerous.

# Categorical data: Amsterdam neighborhood names
neighborhoods = ["centrum", "zuid", "west", "oost", "noord",
                 "centrum", "west", "zuid", "oost", "centrum"]
prices = np.array([350, 280, 220, 240, 180, 340, 210, 290, 250, 360]) * 1000

Label encoding: each category becomes a number

# Sort categories for deterministic ordering
unique_cats = sorted(set(neighborhoods))
label_map = {cat: i for i, cat in enumerate(unique_cats)}
labels = np.array([label_map[n] for n in neighborhoods])

print("Label encoding:")
for cat in unique_cats:
    print(f"  {cat} -> {label_map[cat]}")

print(f"\nEncoded: {labels}")

Label encoding assigns each category an integer. Simple, memory-efficient, one column. But there's a trap: it implies an ordering that may not exist. The model sees "centrum=0, noord=1, oost=2, west=3, zuid=4" and might interpret that as noord being "between" centrum and oost, or that zuid is "more" than west. For a linear model, this creates nonsensical relationships.

Label encoding is only safe for ordinal categories where the order genuinely matters: shirt sizes (S < M < L < XL), education level (high school < bachelor < master < PhD), satisfaction ratings (1-5). For anything without a natural order -- colors, cities, product categories, neighborhoods -- you need one-hot encoding.
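For a genuinely ordinal feature, the mapping is an explicit dictionary whose values respect the order (rather than sorting alphabetically, which would scramble it):

```python
import numpy as np

# Order matters, so we define it by hand: S < M < L < XL
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
shirts = ["M", "S", "XL", "L", "M"]

encoded = np.array([size_order[s] for s in shirts])
print(encoded)  # [1 0 3 2 1]
```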

One-hot encoding: each category gets its own column

# One-hot encoding: binary column per category
n_cats = len(unique_cats)
onehot = np.zeros((len(neighborhoods), n_cats))
for i, n_hood in enumerate(neighborhoods):
    onehot[i, label_map[n_hood]] = 1

print(f"One-hot shape: {len(neighborhoods)} samples x {n_cats} categories")
print(f"Categories: {unique_cats}")
print(f"\nFirst 5 samples:")
for i in range(5):
    print(f"  {neighborhoods[i]:>8s} -> {onehot[i].astype(int)}")

Each category becomes a binary column: 1 if the sample belongs to that category, 0 otherwise. No implied ordering. The model learns an independent weight for each category. This is what we introduced in episode #3 and it's the safest default for nominal categories.

The downside: if you have a category with 10,000 unique values (like ZIP codes or product IDs), one-hot encoding creates 10,000 new columns. That's called the curse of dimensionality in practice, and it can make your model slow and prone to overfitting. For high-cardinality categories, you need something smarter.

Target encoding: powerful but dangerous

Target encoding replaces each category with the mean of the target variable for that category. "Centrum apartments average EUR 350,000? Replace 'centrum' with 350000." It's powerful because it creates a single column that captures the relationship between the category and the target.

# Target encoding -- demo on full data first (this is the WRONG way!)
print("Target encoding (computed on ALL data -- LEAKY!):\n")
for cat in unique_cats:
    mask = np.array([n == cat for n in neighborhoods])
    mean_price = prices[mask].mean()
    count = mask.sum()
    print(f"  {cat}: avg price = EUR {mean_price:,.0f} ({count} samples)")

The danger is obvious if you think about it for a second: if you compute the target encoding from the full dataset, the encoding for each training sample includes information from the test set's target values. That's leakage. And it's worse than leakage from scaling, because you're leaking the TARGET -- the thing you're trying to predict. The model effectively gets to see the answers.

The correct approach: compute target encoding statistics on training data only, then apply to both train and test. For categories that appear in the test set but not in the training set, use the global mean as a fallback. And even then, target encoding with small category counts (few samples per category) is noisy and can still overfit. Using smoothing (blending the category mean with the global mean, weighted by sample count) helps, but we'll save that for when we have proper library support to handle it cleanly.
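A minimal sketch of that train-only approach, including the smoothing idea (the function names and the blending strength m=5.0 are illustrative choices, not a standard):

```python
import numpy as np

def fit_target_encoding(categories, targets, m=5.0):
    """Learn smoothed per-category target means from TRAINING data only."""
    global_mean = targets.mean()
    encoding = {}
    for cat in set(categories):
        mask = np.array([c == cat for c in categories])
        n_cat = mask.sum()
        cat_mean = targets[mask].mean()
        # Smoothing: small categories are pulled toward the global mean
        encoding[cat] = (n_cat * cat_mean + m * global_mean) / (n_cat + m)
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Categories unseen during fitting fall back to the training global mean."""
    return np.array([encoding.get(c, global_mean) for c in categories])

# Fit on train, apply to both -- 'west' is unseen at fit time
train_cats = ["centrum", "zuid", "centrum", "zuid", "centrum"]
train_prices = np.array([350, 280, 340, 290, 360], dtype=float) * 1000
enc, gmean = fit_target_encoding(train_cats, train_prices)

test_cats = ["centrum", "west"]
print(apply_target_encoding(test_cats, enc, gmean))
```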

The pipeline pattern

If there's one design pattern from this episode that I want you to remember above all others, it's this: fit on train, transform on both. We saw it with the StandardScaler in episode #11. We saw it with imputation and encoding in this episode. The pattern is universal, and it's worth formalizing into a proper class.

class PrepPipeline:
    """Preprocessing pipeline: impute + scale, fit on train only."""
    def __init__(self):
        self.impute_values = None
        self.mean = None
        self.std = None
        self.fitted = False

    def fit(self, X):
        """Learn all statistics from training data."""
        # Step 1: learn imputation values (column means)
        self.impute_values = np.nanmean(X, axis=0)

        # Step 2: impute, then learn scaling statistics
        X_filled = X.copy()
        for col in range(X.shape[1]):
            mask = np.isnan(X_filled[:, col])
            X_filled[mask, col] = self.impute_values[col]

        self.mean = X_filled.mean(axis=0)
        self.std = X_filled.std(axis=0)
        self.std[self.std == 0] = 1
        self.fitted = True
        return self

    def transform(self, X):
        """Apply learned statistics to any data."""
        if not self.fitted:
            raise RuntimeError("Call fit() before transform()")

        X_filled = X.copy()
        for col in range(X.shape[1]):
            mask = np.isnan(X_filled[:, col])
            X_filled[mask, col] = self.impute_values[col]

        return (X_filled - self.mean) / self.std

    def fit_transform(self, X):
        """Convenience: fit and transform in one call."""
        return self.fit(X).transform(X)


# Demo: same apartment data, properly pipelined
X_raw = data[:, :3]  # sqm, rooms, age
y_raw = data[:, 3]   # price

# Split first (70/30 for this small dataset)
n = len(y_raw)
idx = np.random.permutation(n)
split = int(0.7 * n)
X_train = X_raw[idx[:split]]
X_test = X_raw[idx[split:]]

# Pipeline: fit on train, transform both
pipe = PrepPipeline()
X_train_processed = pipe.fit_transform(X_train)
X_test_processed = pipe.transform(X_test)

print("Pipeline results:")
print(f"  Train shape: {X_train_processed.shape}")
print(f"  Test shape:  {X_test_processed.shape}")
print(f"  Train mean:  {X_train_processed.mean(axis=0).round(4)}")
print(f"  Test mean:   {X_test_processed.mean(axis=0).round(4)}")
print(f"  (Train mean is ~0, test mean is NOT -- this is correct)")

This fit()/transform() pattern is exactly how scikit-learn organizes its preprocessors. The StandardScaler we built in episode #11, the PrepPipeline class here -- they follow the same API design. When we start using scikit-learn (and we will, soon), you'll see the exact same .fit(), .transform(), .fit_transform() methods on every preprocessor, every model, every transformer. The reason we're building them ourselves first is so you understand what's happening under the hood. There's no magic -- it's just "learn from train, apply to everything" ;-)

Outlier detection and handling

Outliers can silently wreck your model. Remember from episode #10 how linear regression minimizes the sum of squared errors? Squaring means large errors dominate. A single apartment listed at EUR 50 million (data entry mistake? money laundering front? who knows) pulls every weight toward accommodating that one absurd data point. The mean shifts. The standard deviation inflates. Gradient descent chases outliers instead of fitting the bulk of the data.

There are two standard methods for detecting outliers.

Z-score method

If a value is more than K standard deviations from the mean, flag it. The typical threshold is 3.0 (about 0.3% of normally distributed data falls beyond 3 sigma).

def detect_outliers_zscore(X, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1
    z_scores = np.abs((X - mean) / std)
    return z_scores > threshold


# Example: 200 normal samples + 2 injected outliers
np.random.seed(42)
clean_data = np.random.randn(200, 3) * np.array([30, 1, 10]) + np.array([80, 3, 15])
clean_data[0, 0] = 500   # sqm outlier -- 500 sqm apartment??
clean_data[1, 2] = -20   # negative age -- data entry error

outliers_z = detect_outliers_zscore(clean_data)
print("Z-score outlier detection:")
print(f"  Flagged {outliers_z.sum()} values "
      f"in {outliers_z.any(axis=1).sum()} rows")

# Show which specific values were flagged
for row in range(len(clean_data)):
    if outliers_z[row].any():
        cols = np.where(outliers_z[row])[0]
        vals = [f"col {c}: {clean_data[row, c]:.1f}" for c in cols]
        print(f"  Row {row}: {', '.join(vals)}")

IQR method (more robust)

The Z-score method uses mean and standard deviation, which are themselves affected by outliers (catch-22, right?). The IQR (Interquartile Range) method uses percentiles instead, which are robust to extreme values.

def detect_outliers_iqr(X, factor=1.5):
    """Flag values beyond factor * IQR from Q1/Q3."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return (X < lower) | (X > upper)


outliers_iqr = detect_outliers_iqr(clean_data)
print("\nIQR outlier detection:")
print(f"  Flagged {outliers_iqr.sum()} values "
      f"in {outliers_iqr.any(axis=1).sum()} rows")
print(f"\nComparison:")
print(f"  Z-score: {outliers_z.any(axis=1).sum()} outlier rows")
print(f"  IQR:     {outliers_iqr.any(axis=1).sum()} outlier rows")

The IQR method typically flags more values because its "normal" range is tighter. With factor=1.5, values beyond 1.5 times the interquartile range from Q1 or Q3 are flagged. Bump the factor to 3.0 for a more conservative threshold.

What to do with outliers

Once you've found them, you have three options:

  1. Remove them. Delete the rows entirely. Simplest, but you lose data. Only remove from training -- NEVER remove outliers from the test set, because in production you'll encounter outliers and your model needs to handle them gracefully.

  2. Clip (cap) them. Replace extreme values with the nearest "acceptable" boundary. A 500 sqm apartment becomes 200 sqm (or whatever your cap is). This preserves the row while limiting its influence.

  3. Use robust methods. Some algorithms are naturally outlier-resistant. Median-based imputation instead of mean. Tree-based models (which split on rank order, not magnitude) instead of linear models. We'll see these alternatives in upcoming episodes.

# Clipping example
def clip_outliers(X, lower_pct=1, upper_pct=99):
    """Clip values to the given percentile range."""
    result = X.copy()
    lower = np.percentile(X, lower_pct, axis=0)
    upper = np.percentile(X, upper_pct, axis=0)
    result = np.clip(result, lower, upper)
    return result

clipped = clip_outliers(clean_data)
print(f"\nBefore clipping: max sqm = {clean_data[:, 0].max():.0f}")
print(f"After clipping:  max sqm = {clipped[:, 0].max():.0f}")
print(f"Before clipping: min age = {clean_data[:, 2].min():.0f}")
print(f"After clipping:  min age = {clipped[:, 2].min():.1f}")

Feature type detection: knowing what you're working with

Before you can encode or scale anything, you need to know what type of data you're dealing with. In real datasets (CSV files, database exports, API responses), you'll encounter a mix:

  • Numeric continuous: square meters, temperature, price -- scale these
  • Numeric discrete: number of rooms, year built -- scale or leave as-is
  • Categorical nominal: neighborhood, color, country -- one-hot encode
  • Categorical ordinal: quality rating, education level -- label encode with meaningful order
  • Date/time: transaction date, birth year -- extract features (day of week, month, year)
  • Free text: descriptions, comments -- a whole different problem (we'll get to NLP much later in this series)
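The date/time bullet deserves a small illustration. Extracting numeric parts from a timestamp can be sketched with NumPy's datetime64 unit conversions (the weekday arithmetic relies on 1970-01-01 being a Thursday):

```python
import numpy as np

dates = np.array(["2024-01-15", "2024-06-03", "2023-12-24"], dtype="datetime64[D]")

# Unit casts count whole years/months since the 1970 epoch
years = dates.astype("datetime64[Y]").astype("int64") + 1970
months = dates.astype("datetime64[M]").astype("int64") % 12 + 1
# Day of week with Monday=0: epoch day 0 (1970-01-01) was a Thursday (=3)
weekdays = (dates.astype("int64") + 3) % 7

print(years)     # [2024 2024 2023]
print(months)    # [ 1  6 12]
print(weekdays)
```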

In pure NumPy we don't have mixed-type columns (NumPy arrays are homogeneous). Real-world data preparation usually starts in pandas, which we'll use extensively once we get to scikit-learn. For now, the important thing is understanding the concept: you need different preprocessing for different column types, and applying the wrong transformation (like scaling a categorical column, or one-hot encoding a continuous one) produces garbage.

# Simulating a mixed-type dataset as separate arrays
# (in practice you'd use pandas DataFrames for this)
sqm = np.array([65, 82, 120, 55, 90, 70, 110, 48, 95, 75], dtype=float)
rooms = np.array([2, 3, 4, 1, 3, 2, 4, 1, 3, 2], dtype=float)
neighborhoods_arr = np.array([0, 1, 2, 3, 4, 0, 2, 1, 3, 0])  # label encoded
condition = np.array([2, 3, 4, 1, 3, 2, 4, 1, 3, 2])  # ordinal: 1=poor, 4=excellent
price = np.array([185, 210, 320, 165, 235, 190, 295, 135, 250, 195]) * 1000

print("Feature types in our apartment dataset:")
print(f"  sqm:           continuous -- scale")
print(f"  rooms:         discrete   -- scale (or leave)")
print(f"  neighborhood:  nominal    -- one-hot encode")
print(f"  condition:     ordinal    -- label encode (1-4)")
print(f"  price:         target     -- predict this")

The 80% in practice: a complete pipeline

Let me bring everything together into one realistic preprocessing workflow. This is the function you'd write at the start of a real ML project -- before you even think about which model to train.

def prepare_data(raw_features, raw_target, train_frac=0.8):
    """
    Complete data preparation pipeline:
    1. Split (before anything else!)
    2. Impute missing values (fit on train)
    3. Detect and remove outliers (train only)
    4. Scale features (fit on train)

    Returns processed X_train, y_train, X_test, y_test.
    """
    # 1. Split FIRST, before any transformation
    n = len(raw_target)
    idx = np.random.permutation(n)
    split = int(n * train_frac)
    X_train = raw_features[idx[:split]].copy()
    X_test = raw_features[idx[split:]].copy()
    y_train = raw_target[idx[:split]].copy()
    y_test = raw_target[idx[split:]].copy()

    print(f"Step 1 - Split: {len(y_train)} train, {len(y_test)} test")

    # 2. Handle missing values (fit on train)
    impute_vals = np.nanmean(X_train, axis=0)
    for col in range(X_train.shape[1]):
        train_nans = np.isnan(X_train[:, col]).sum()
        test_nans = np.isnan(X_test[:, col]).sum()
        if train_nans > 0 or test_nans > 0:
            X_train[np.isnan(X_train[:, col]), col] = impute_vals[col]
            X_test[np.isnan(X_test[:, col]), col] = impute_vals[col]
            print(f"Step 2 - Impute col {col}: {train_nans} train, "
                  f"{test_nans} test NaNs filled with {impute_vals[col]:.1f}")

    # 3. Remove outliers from training set ONLY
    z_scores = np.abs((X_train - X_train.mean(axis=0)) / X_train.std(axis=0))
    outlier_rows = (z_scores > 4).any(axis=1)
    n_outliers = outlier_rows.sum()
    X_train = X_train[~outlier_rows]
    y_train = y_train[~outlier_rows]
    print(f"Step 3 - Outliers: removed {n_outliers} rows from training set")
    # NOTE: we do NOT remove outliers from test set!

    # 4. Scale (fit on train)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std
    print(f"Step 4 - Scale: fit on train, applied to both")

    print(f"\nFinal: {len(y_train)} train samples, {len(y_test)} test samples")
    return X_train, y_train, X_test, y_test


# Generate a messy dataset to test our pipeline
np.random.seed(42)
n = 200
X_messy = np.random.randn(n, 4) * np.array([30, 1, 10, 5]) + np.array([80, 3, 15, 50])
y_messy = 2500 * X_messy[:, 0] + 800 * X_messy[:, 1] + np.random.randn(n) * 10000

# Inject some missing values
missing_mask = np.random.random((n, 4)) < 0.05
X_messy[missing_mask] = np.nan

# Inject two outliers
X_messy[10, 0] = 800  # ridiculous sqm
X_messy[50, 2] = -50   # negative age

print("--- Full Data Preparation Pipeline ---\n")
X_tr, y_tr, X_te, y_te = prepare_data(X_messy, y_messy)

Look at the order: split, impute, outliers, scale. Every step that learns statistics (imputation values, mean/std for scaling) learns from the training data only. Outliers are removed only from the training set -- the test set stays as-is because in production you can't cherry-pick which data points your model sees.

This is the workflow. Memorize the order. Get it wrong and you introduce leakage. Get it right and your evaluation numbers will actually reflect real-world performance.
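As an aside for when you eventually reach for scikit-learn: its Pipeline bakes this exact discipline into the API. Here's a minimal sketch of the same split -> impute -> scale order (assuming scikit-learn is installed; the class names below are scikit-learn's, not part of our from-scratch toolkit):

```python
# Sketch: split -> impute -> scale with scikit-learn's Pipeline.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 4))
X[rng.random(X.shape) < 0.05] = np.nan   # inject some missing values
y = rng.normal(size=200)

# Step 1: split FIRST, before anything learns statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Steps 2 and 4: fit() learns imputation values and mean/std from the
# training set only; transform() applies them to both sets
prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_train_p = prep.fit_transform(X_train)  # learns from train
X_test_p = prep.transform(X_test)        # applies to test, no refit
```

The pipeline object is the same fit()/transform() pattern we built in episode #11, with leakage-safety guaranteed by construction: `transform()` on the test set can only ever use statistics learned during `fit()`.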

How bad data preparation destroys good models

Let me end with a concrete demonstration. Same data, same model, same evaluation -- but one version does preparation correctly and the other doesn't. Watch the numbers:

# Generate a clean classification dataset
np.random.seed(42)
n = 500
X_demo = np.random.randn(n, 5)
y_demo = ((X_demo[:, 0] + 0.5 * X_demo[:, 1] - X_demo[:, 2]
           + np.random.randn(n) * 0.5) > 0).astype(float)

# Inject missing values and outliers
X_dirty = X_demo.copy()
missing = np.random.random((n, 5)) < 0.08
X_dirty[missing] = np.nan
X_dirty[0, 0] = 50
X_dirty[1, 1] = -40

# --- Method A: WRONG (leaky) ---
# Scale + impute on ALL data, then split
X_wrong = X_dirty.copy()
for col in range(5):
    mask = np.isnan(X_wrong[:, col])
    X_wrong[mask, col] = np.nanmean(X_wrong[:, col])  # impute from ALL data
mean_all = X_wrong.mean(axis=0)
std_all = X_wrong.std(axis=0)
std_all[std_all == 0] = 1
X_wrong = (X_wrong - mean_all) / std_all  # scale from ALL data

idx = np.random.permutation(n)
split = int(0.8 * n)
Xtr_wrong, Xte_wrong = X_wrong[idx[:split]], X_wrong[idx[split:]]
ytr_wrong, yte_wrong = y_demo[idx[:split]], y_demo[idx[split:]]

# Train logistic regression (from episode #12)
X_b = np.column_stack([Xtr_wrong, np.ones(len(Xtr_wrong))])
w = np.zeros(X_b.shape[1])
for _ in range(500):
    z = X_b @ w
    y_hat = 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    w -= 0.1 * (1/len(ytr_wrong)) * X_b.T @ (y_hat - ytr_wrong)

X_b_test = np.column_stack([Xte_wrong, np.ones(len(Xte_wrong))])
proba_wrong = 1 / (1 + np.exp(-np.clip(X_b_test @ w, -500, 500)))
preds_wrong = (proba_wrong >= 0.5).astype(int)
acc_wrong = np.mean(preds_wrong == yte_wrong)

# --- Method B: RIGHT (clean) ---
Xtr_right = X_dirty[idx[:split]].copy()
Xte_right = X_dirty[idx[split:]].copy()
ytr_right = y_demo[idx[:split]]
yte_right = y_demo[idx[split:]]

# Impute from train only
imp_vals = np.nanmean(Xtr_right, axis=0)
for col in range(5):
    Xtr_right[np.isnan(Xtr_right[:, col]), col] = imp_vals[col]
    Xte_right[np.isnan(Xte_right[:, col]), col] = imp_vals[col]

# Remove outliers from train
zs = np.abs((Xtr_right - Xtr_right.mean(axis=0)) / Xtr_right.std(axis=0))
keep = ~(zs > 4).any(axis=1)
Xtr_right = Xtr_right[keep]
ytr_right = ytr_right[keep]

# Scale from train only
m_tr = Xtr_right.mean(axis=0)
s_tr = Xtr_right.std(axis=0)
s_tr[s_tr == 0] = 1
Xtr_right = (Xtr_right - m_tr) / s_tr
Xte_right = (Xte_right - m_tr) / s_tr

# Train same model
X_b2 = np.column_stack([Xtr_right, np.ones(len(Xtr_right))])
w2 = np.zeros(X_b2.shape[1])
for _ in range(500):
    z2 = X_b2 @ w2
    yh2 = 1 / (1 + np.exp(-np.clip(z2, -500, 500)))
    w2 -= 0.1 * (1/len(ytr_right)) * X_b2.T @ (yh2 - ytr_right)

X_b2_test = np.column_stack([Xte_right, np.ones(len(Xte_right))])
proba_right = 1 / (1 + np.exp(-np.clip(X_b2_test @ w2, -500, 500)))
preds_right = (proba_right >= 0.5).astype(int)
acc_right = np.mean(preds_right == yte_right)

print("Data preparation comparison:\n")
print(f"  Method A (LEAKY):  test accuracy = {acc_wrong:.1%}")
print(f"  Method B (CLEAN):  test accuracy = {acc_right:.1%}")
print("\n  The leaky version may look better, but it's lying.")
print("  Deploy Method A in production and watch it underperform.")

The leaky version might actually show a higher accuracy number. That's the insidious part -- leakage makes your model look BETTER, not worse. It's a liar that flatters you. The clean version gives you an honest number that you can trust. When you deploy the clean model, its performance in production should roughly match what you measured. Deploy the leaky model and you're in for a rude awakening.
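If you want to convince yourself that Method A really does touch test information, you don't even need a model: just compare the imputation value each method computes. A tiny standalone sketch, with a made-up column:

```python
# Tiny check: the leaky imputation value is computed from ALL rows,
# so it shifts whenever the test rows shift -- that IS the leak.
import numpy as np

rng = np.random.default_rng(0)
col = rng.normal(loc=10.0, scale=2.0, size=100)
col[rng.random(100) < 0.1] = np.nan   # sprinkle in missing values

train, test = col[:80], col[80:]
fill_leaky = np.nanmean(col)    # Method A: uses test rows too
fill_clean = np.nanmean(train)  # Method B: training rows only

print(f"leaky fill: {fill_leaky:.4f}, clean fill: {fill_clean:.4f}")
```

The two numbers differ, and every bit of that difference is information flowing from the test set into training. On one column the effect is tiny; across many columns, plus leaky scaling, it adds up to the flattering numbers Method A produces.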

Let's recap

We stepped back from algorithms and math today and focused on the unsexy but absolutely critical work that happens before any model gets trained. Every ML practitioner I've worked with says the same thing: get your data preparation right and a simple model on clean data will outperform a complex model on sloppy data. Here's what we covered:

  • The three-way split (train/validation/test) prevents the test set from influencing development decisions. The validation set is your playground; the test set is your final exam;
  • Data leakage occurs when test information contaminates training. The four biggest traps: scaling before splitting, imputing before splitting, random splitting of time-series data, and duplicate rows across sets;
  • Missing values can be handled by dropping rows (simple but wasteful), imputation (mean/median, fit on train), or adding missing-indicator flags (preserves the information that something was missing);
  • Categorical encoding: label encoding for ordinal data, one-hot for nominal data, target encoding for high-cardinality (but watch for leakage!);
  • The pipeline pattern (fit on train, transform on both) prevents leakage by construction. This is the pattern that every professional ML library follows;
  • Outliers can be detected with Z-scores or IQR, then removed, clipped, or handled with robust methods. Remove from train only -- never from test;
  • The complete preparation order is: split -> impute -> outliers -> encode -> scale. Every step that learns statistics does so from training data only.
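Since the pipeline above used Z-scores for its outlier step, here's the IQR alternative from that list as a minimal NumPy sketch (made-up numbers, just to show the rule):

```python
# Minimal IQR outlier rule: flag anything outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
import numpy as np

x = np.array([12.0, 14.0, 13.0, 15.0, 14.0, 13.0, 99.0])  # 99 is junk
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (x < lo) | (x > hi)
print(x[outliers])  # -> [99.]
```

Unlike the Z-score rule, quartiles barely move when one extreme value is added, so the IQR rule stays reliable even when the outliers themselves distort the mean and standard deviation it would otherwise depend on.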

We now have models (episodes #10-12), evaluation tools (#13), and data preparation (#14). The next step is learning how to get more out of your raw data by constructing better inputs for your models -- extracting, combining, and transforming features in ways that help the model find patterns it couldn't find in the raw data. That's a skill that separates good ML practitioners from great ones, and it doesn't require any new algorithms at all.

Thanks and see you next time!

@scipio


