Learn AI Series (#11) - Making Linear Regression Real

What will I learn

  • multivariate regression -- handling many features at once with the matrix form from episode #10;
  • polynomial regression -- fitting curves when straight lines simply don't cut it;
  • the bias-variance tradeoff with concrete, visual examples you can run yourself;
  • regularization with L2 (Ridge) and L1 (Lasso) -- your weapons against overfitting;
  • why and how to scale features properly (building a StandardScaler from scratch);
  • the normal equation -- solving linear regression without gradient descent entirely;
  • how to chain all of these into a complete pipeline: split, features, scale, regularize, evaluate.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Last episode was a milestone. We built linear regression from absolute scratch -- the cost function, the gradient derivation, gradient descent, feature scaling, train/test splits, weight interpretation, the whole thing packaged into a clean class with .fit() and .predict(). If you ran all the code in episode #10, you have a working model that can predict apartment prices from square meters, rooms, and age. And you understand every single line of it.

So... we're done with linear regression, right?

Not even close ;-)

What I glossed over in episode #10 is this: we used synthetic data where the true relationship was actually linear. Price was a weighted sum of features plus noise, and our linear model captured that perfectly. Real data doesn't work like that. Real relationships curve. Some features matter a lot, others are useless noise. And a model that fits your training data too perfectly might be completely worthless on new data it hasn't seen. That is the gap between "my model works on clean data" and "my model works in the real world."

This episode bridges that gap. We're going to take the linear regression engine from episode #10 and learn the tools that make it practical: polynomial features for curves, regularization for overfitting, a proper StandardScaler class, the normal equation as an alternative to gradient descent, and a complete pipeline that chains everything together. Same engine, better driving skills. Let's go.

When straight lines aren't enough

The model from episode #10 computes y_hat = Xw -- a weighted sum of features. That's a hyperplane. In 2D it's a straight line, in 3D it's a flat plane, and in higher dimensions it's still flat. Always flat. But look at real data and you'll find plenty of relationships that curve.

Think about it: does an employee's salary increase linearly with experience forever? Does each extra year of experience always add the same amount? Of course not. The jump from 0 to 5 years is massive (learning the basics, becoming productive). The jump from 15 to 20 years? Smaller. The relationship curves -- steep at first, flattening out over time.

If we try fitting a straight line to curved data, we'll systematically get it wrong. The line will overestimate at some points and underestimate at others in a pattern that's clearly NOT random. That systematic error is a sign our model is too simple -- it has high bias.

The trick: create new features that are powers of the existing ones. If you have a feature x, you add x^2, x^3, and so on. The model is still linear in the weights (it's still computing w0*x + w1*x^2 + w2*x^3 + bias), but it can now fit curves because the features themselves are nonlinear. Neat, right?

import numpy as np

np.random.seed(42)

# Curved relationship: salary grows with sqrt of experience
# (each year adds LESS than the previous one)
experience = np.linspace(0, 25, 50)
salary = 30000 + 5000 * np.sqrt(experience) + np.random.randn(50) * 2000

# Attempt 1: straight line (episode #10 approach)
X_linear = np.column_stack([experience, np.ones(50)])
w_linear = np.linalg.lstsq(X_linear, salary, rcond=None)[0]
pred_linear = X_linear @ w_linear

# Attempt 2: add experience^2 as a feature
X_poly = np.column_stack([experience, experience**2, np.ones(50)])
w_poly = np.linalg.lstsq(X_poly, salary, rcond=None)[0]
pred_poly = X_poly @ w_poly

linear_mse = np.mean((salary - pred_linear) ** 2)
poly_mse = np.mean((salary - pred_poly) ** 2)

print(f"Linear MSE:     {linear_mse:>12,.0f}")
print(f"Polynomial MSE: {poly_mse:>12,.0f}")
print(f"Improvement:    {(1 - poly_mse/linear_mse)*100:.1f}%")

# Sample predictions at different experience levels
for yr in [0, 5, 10, 15, 20, 25]:
    idx = np.argmin(np.abs(experience - yr))
    print(f"  {yr:>2d} yrs: actual={salary[idx]:>8,.0f}  "
          f"linear={pred_linear[idx]:>8,.0f}  poly={pred_poly[idx]:>8,.0f}")

Run it. Look at the predictions at 0 years and at 25 years specifically. The straight line overshoots at the extremes and undershoots in the middle (or vice versa, depending on the noise). The polynomial model bends to follow the curve. The MSE improvement should be dramatic.

But -- and this is critical -- adding polynomial features gives the model more flexibility. More flexibility means more risk of fitting noise instead of signal. Add too many polynomial terms and you'll discover that firsthand.

The bias-variance tradeoff: overfitting in action

Let me show you overfitting concretely. Not as a concept, not as a warning -- as actual numbers that you can run and see yourself. We'll fit polynomials of increasing degree to a small dataset and watch the training error go down while the test error goes UP.

# Small dataset to make overfitting visible
np.random.seed(42)
n = 12
X_small = np.random.uniform(0, 10, n)
y_small = 2 * np.sin(X_small) + np.random.randn(n) * 0.5

# Separate test data from the same distribution
X_test_ov = np.random.uniform(0, 10, 20)
y_test_ov = 2 * np.sin(X_test_ov) + np.random.randn(20) * 0.5

# Fit polynomials of degree 1, 3, 7, and 11
for degree in [1, 3, 7, 11]:
    # Create polynomial features: 1, x, x^2, ..., x^degree
    X_poly = np.column_stack([X_small**d for d in range(degree + 1)])

    w = np.linalg.lstsq(X_poly, y_small, rcond=None)[0]
    train_pred = X_poly @ w
    train_mse = np.mean((y_small - train_pred) ** 2)

    # Evaluate on test data
    X_test_poly = np.column_stack([X_test_ov**d for d in range(degree + 1)])
    test_pred = X_test_poly @ w
    test_mse = np.mean((y_test_ov - test_pred) ** 2)

    flag = ""
    if test_mse > 5 * train_mse:
        flag = " <-- OVERFITTING"
    print(f"Degree {degree:>2d}: train MSE={train_mse:>8.3f}  "
          f"test MSE={test_mse:>10.3f}{flag}")

The degree-11 polynomial has 12 parameters for 12 data points. It can pass through every single training point exactly -- training MSE drops to essentially zero. But between the data points it goes wild. Huge oscillations, absurd predictions on new data. That's overfitting: the model memorized the training examples (including the noise) instead of learning the underlying pattern.

Remember back in episode #10 when I talked about the test/train MAE ratio and planted a seed about overfitting? This is that seed growing into a full-sized problem. With a simple linear model and a linear ground truth, overfitting was rare. The moment you give the model enough capacity to memorize -- too many polynomial terms, too many features, too many parameters -- it WILL memorize if you let it.

This tension has a name: the bias-variance tradeoff.

  • High bias (underfitting): model too simple, can't capture the real pattern. Bad on training data AND test data. The degree-1 line trying to fit a sine wave.
  • High variance (overfitting): model too complex, captures noise. Excellent on training data, terrible on test data. The degree-11 polynomial passing through every point but hallucinating between them.
  • The sweet spot: complex enough to capture the real pattern, simple enough to ignore noise. The degree-3 polynomial in our example.

Every model you'll ever build navigates this tradeoff. Every single one. From linear regression to neural networks with billions of parameters. The tools change, the strategies evolve, but the fundamental tension remains. You'll see it again when we get to decision trees, random forests, and eventually deep learning. It never goes away -- you just get better at managing it ;-)

Regularization: penalizing complexity

So here's the practical question: how do you stop a model from overfitting without manually picking the "right" polynomial degree? One answer: let the model be as complex as it wants, but penalize it for using that complexity aggressively.

This is regularization -- adding a penalty term to the cost function that discourages large weights. Remember from episode #10, our cost function was J(w) = (1/m) * sum((yi - pred_i)^2). Regularization adds a term that depends on the weights themselves:

J_regularized(w) = MSE + lambda * penalty(w)

The lambda parameter (also written as alpha in code to avoid clashing with Python's keyword) controls the strength of the penalty. Large lambda means strong penalty means simpler model. Small lambda means weak penalty means closer to the unregularized version.

There are two main flavors.

L2 Regularization (Ridge)

Add the sum of squared weights to the cost:

J_ridge(w) = MSE + alpha * sum(wi^2)

Ridge doesn't eliminate weights -- it shrinks them toward zero. Every weight gets pulled toward zero proportionally to its current magnitude. The gradient gets an extra term: instead of just -2/m * X.T @ errors, you now have -2/m * X.T @ errors + 2*alpha*w. That extra 2*alpha*w is a drag on every weight, proportional to its size -- the bigger the weight, the harder it gets pulled back.

def ridge_regression(X, y, alpha=1.0, lr=0.001, epochs=1000):
    # Note: this simple version penalizes every weight, including the bias
    # column if X contains one; production implementations usually exempt it.
    m, n = X.shape
    w = np.zeros(n)

    for _ in range(epochs):
        pred = X @ w
        gradient = -(2/m) * X.T @ (y - pred) + 2 * alpha * w
        w -= lr * gradient

    return w

# Use our degree-11 polynomial that was overfitting badly
degree = 11
X_poly = np.column_stack([X_small**d for d in range(degree + 1)])

# Scale features first -- CRITICAL for regularization to work fairly
# (without scaling, large-magnitude features dominate the penalty)
means = X_poly[:, 1:].mean(axis=0)
stds = X_poly[:, 1:].std(axis=0)
stds[stds == 0] = 1
X_norm = X_poly.copy()
X_norm[:, 1:] = (X_poly[:, 1:] - means) / stds

X_test_poly = np.column_stack([X_test_ov**d for d in range(degree + 1)])
X_test_norm = X_test_poly.copy()
X_test_norm[:, 1:] = (X_test_poly[:, 1:] - means) / stds

print("Degree-11 polynomial with different regularization strengths:\n")
for alpha in [0, 0.001, 0.01, 0.1, 1.0, 10.0]:
    w = ridge_regression(X_norm, y_small, alpha=alpha, lr=0.01, epochs=5000)

    train_pred = X_norm @ w
    train_mse = np.mean((y_small - train_pred) ** 2)

    test_pred = X_test_norm @ w
    test_mse = np.mean((y_test_ov - test_pred) ** 2)

    nonzero = np.sum(np.abs(w) > 0.01)
    print(f"  alpha={alpha:>6.3f}: train={train_mse:>7.3f}  "
          f"test={test_mse:>8.3f}  active weights={nonzero}")

Watch what happens as alpha increases. At alpha=0 (no regularization), the test MSE is catastrophic -- that's our overfit from before. As alpha grows, the test MSE drops because the model is being prevented from going wild. But push alpha too high and the test MSE climbs again -- too much regularization forces the model into underfitting. There's a sweet spot in the middle, and finding it is one of the core practical skills in ML.

L1 Regularization (Lasso)

L1 uses the absolute value of weights instead of the square:

J_lasso(w) = MSE + alpha * sum(|wi|)

The crucial difference: L1 can drive weights to exactly zero. Not just shrink them, but eliminate them entirely. This means Lasso performs automatic feature selection -- it discovers which features are relevant and kills the rest.

def lasso_regression(X, y, alpha=1.0, lr=0.001, epochs=1000):
    m, n = X.shape
    w = np.zeros(n)

    for _ in range(epochs):
        pred = X @ w
        gradient = -(2/m) * X.T @ (y - pred) + alpha * np.sign(w)
        w -= lr * gradient

    return w

w_lasso = lasso_regression(X_norm, y_small, alpha=0.1, lr=0.01, epochs=5000)

print(f"Lasso weights (alpha=0.1):")
for i, wi in enumerate(w_lasso):
    marker = "" if abs(wi) > 0.01 else " (eliminated)"
    print(f"  w[{i:>2d}] = {wi:>+8.4f}{marker}")

# Count how many survived
active = np.sum(np.abs(w_lasso) > 0.01)
print(f"\n{active} out of {len(w_lasso)} weights survived")
print(f"Lasso eliminated {len(w_lasso) - active} irrelevant polynomial terms")

Lasso is particularly powerful when you suspect that many of your features are noise. In real-world datasets you might have 200 features and only 15 actually matter. Lasso can find those 15 automatically. Ridge can't do that -- it shrinks everything but keeps everything. The choice between them (or using both, which is called Elastic Net) depends on your specific problem.

Having said that, the gradient of L1 has a discontinuity at zero (the np.sign(w) flips abruptly). This makes pure gradient descent a bit rough for Lasso -- specialized algorithms like coordinate descent handle it better. For learning purposes, our implementation works fine, but keep that in mind when you see production Lasso implementations doing something more sophisticated.
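Since Elastic Net came up above: it is nothing more than both penalties added together, weighted by a mixing parameter. Here is a minimal gradient-descent sketch under the same setup as the Ridge and Lasso functions above -- the `l1_ratio` name mirrors scikit-learn's convention, but this implementation is purely illustrative, and the demo data is made up:

```python
import numpy as np

def elastic_net_regression(X, y, alpha=1.0, l1_ratio=0.5, lr=0.001, epochs=1000):
    """Sketch of Elastic Net: a weighted mix of L1 and L2 penalties.
    l1_ratio=1.0 recovers Lasso, l1_ratio=0.0 recovers Ridge."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        pred = X @ w
        grad_mse = -(2 / m) * X.T @ (y - pred)
        grad_l1 = alpha * l1_ratio * np.sign(w)        # Lasso part
        grad_l2 = 2 * alpha * (1 - l1_ratio) * w       # Ridge part
        w -= lr * (grad_mse + grad_l1 + grad_l2)
    return w

# Tiny demo: 5 features, but only the first one actually matters
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

w = elastic_net_regression(X, y, alpha=0.05, l1_ratio=0.5, lr=0.05, epochs=2000)
print(w.round(3))  # first weight near 3, the rest near zero
```

The first weight lands near 3 (slightly shrunk by the penalty) while the four noise features stay near zero -- the L1 part pushes them toward exact zero, the L2 part keeps the optimization smooth.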

Feature scaling: building StandardScaler from scratch

We've been scaling features throughout this series -- in episode #10 we standardized square meters, rooms, and age so gradient descent could use a reasonable learning rate. Let me formalize this into a proper reusable class.

StandardScaler transforms each feature to have zero mean and unit variance:

x_scaled = (x - mean(x)) / std(x)

After scaling, every feature lives on roughly the same range (centered around 0, most values between -2 and +2). This is essential for two reasons:

  1. Gradient descent: without scaling, features with large magnitudes dominate the gradient and require tiny learning rates (we saw this in episode #10 with the learning rate difference between scaled and unscaled features)
  2. Regularization: without scaling, the penalty treats all weights equally but they're not -- a weight of 1000 on a feature measured in square meters is NOT the same as a weight of 1000 on a feature measured in number of rooms.

class StandardScaler:
    def fit(self, X):
        """Compute mean and std from training data."""
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0)
        self.std[self.std == 0] = 1  # avoid division by zero
        return self

    def transform(self, X):
        """Apply the stored transformation."""
        return (X - self.mean) / self.std

    def fit_transform(self, X):
        """Convenience: fit and transform in one call."""
        return self.fit(X).transform(X)

    def inverse_transform(self, X_scaled):
        """Reverse the transformation (back to original units)."""
        return X_scaled * self.std + self.mean

# Demo
np.random.seed(42)
experience = np.random.uniform(0, 25, 50)
salary_raw = 30000 + 5000 * np.sqrt(experience) + np.random.randn(50) * 2000

X_raw = np.column_stack([experience, experience**2])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

print(f"Before scaling:")
print(f"  experience  -- mean={X_raw[:, 0].mean():.1f}  std={X_raw[:, 0].std():.1f}")
print(f"  experience^2 -- mean={X_raw[:, 1].mean():.1f}  std={X_raw[:, 1].std():.1f}")
print(f"\nAfter scaling:")
print(f"  experience  -- mean={X_scaled[:, 0].mean():.4f}  std={X_scaled[:, 0].std():.4f}")
print(f"  experience^2 -- mean={X_scaled[:, 1].mean():.4f}  std={X_scaled[:, 1].std():.4f}")

Critical rule -- and I cannot emphasize this enough: fit the scaler on training data ONLY, then use .transform() (not .fit_transform()) on test data. If you fit on the full dataset including test data, you're leaking information from the future into your model. The test set must be treated as completely unseen. This is a subtle but devastating form of data leakage that produces overoptimistic evaluation results. Your model looks great on paper and fails in production. I've seen this happen in professional codebases. Don't be that person ;-)

# CORRECT approach: fit on train, transform both
X_all = np.column_stack([experience, experience**2])
train_idx = np.arange(40)
test_idx = np.arange(40, 50)

scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_all[train_idx])  # fit on train
X_test_scaled = scaler_correct.transform(X_all[test_idx])         # transform test

print(f"Train mean after scaling: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Test mean after scaling:  {X_test_scaled.mean(axis=0).round(4)}")
print(f"(Test mean is NOT zero -- and that's correct!)")
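One quick sanity check you can run on the scaler: transform followed by inverse_transform must reproduce the original data up to floating-point error. The class is repeated here only so the snippet runs standalone:

```python
import numpy as np

class StandardScaler:
    """Same class as above, repeated so this snippet is self-contained."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0)
        self.std[self.std == 0] = 1  # avoid division by zero
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X_scaled):
        return X_scaled * self.std + self.mean

rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(20, 3))

scaler = StandardScaler()
X_back = scaler.inverse_transform(scaler.fit_transform(X))

# The roundtrip should match the original up to floating-point error
print(np.allclose(X, X_back))  # True
```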

The normal equation: no gradient descent needed

Everything we've built so far -- from episode #7 through today -- uses gradient descent. Iterative steps downhill, learning rates, convergence questions. For linear regression specifically, there's a shortcut: a closed-form solution that gives you the optimal weights in one shot. No iterations, no learning rate to tune, no convergence worries.

It's called the normal equation:

w* = (X.T @ X)^(-1) @ X.T @ y

In NumPy that's a single line. Let me show you and compare it against our gradient descent approach:

def normal_equation(X, y):
    """Closed-form solution for linear regression."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Our apartment data from episode #10
np.random.seed(42)
m = 100
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
noise = np.random.randn(m) * 15000
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + noise

X_full = np.column_stack([sqm, rooms, age, np.ones(m)])

# Method 1: normal equation (instant)
w_normal = normal_equation(X_full, price)

# Method 2: NumPy's lstsq (also closed-form, numerically stabler)
w_lstsq = np.linalg.lstsq(X_full, price, rcond=None)[0]

print("Normal equation vs lstsq:\n")
for i, name in enumerate(["sqm", "rooms", "age", "bias"]):
    print(f"  {name:>5s}: normal={w_normal[i]:>10.2f}  "
          f"lstsq={w_lstsq[i]:>10.2f}  true={[2500, 800, -500, 15000][i]}")

The results are identical (within floating-point precision). Both find the exact same optimal weights. So why did we spend all of episode #10 building gradient descent if there's a one-line solution?

Two reasons, and they're both important:

1. Scalability. The normal equation requires computing (X.T @ X)^(-1) -- inverting a matrix. Matrix inversion is O(n^3) where n is the number of features. For 10 features that's 1000 operations -- instant. For 10,000 features that's 10^12 operations -- slow. For 1,000,000 features (common in deep learning)? Computationally impossible. Gradient descent scales linearly with the number of features, which is why it's the universal optimization method.

2. Generality. The normal equation only works for linear regression with MSE loss. Gradient descent works for any differentiable loss function, any model architecture. Logistic regression? Gradient descent. Neural networks? Gradient descent. Transformers with 175 billion parameters? Gradient descent. The technique we learned is the one that scales to everything.

# Quick scalability comparison
import time

for n_features in [10, 100, 1000, 5000]:
    X_big = np.random.randn(1000, n_features)
    y_big = np.random.randn(1000)
    X_big_b = np.column_stack([X_big, np.ones(1000)])

    start = time.time()
    try:
        w = np.linalg.inv(X_big_b.T @ X_big_b) @ X_big_b.T @ y_big
        elapsed = time.time() - start
        print(f"  {n_features:>5d} features: normal equation took {elapsed:.4f}s")
    except np.linalg.LinAlgError:
        print(f"  {n_features:>5d} features: matrix inversion FAILED")

In practice, use the normal equation (or np.linalg.lstsq, which is numerically stabler) when n_features is small (under 10,000 or so). Use gradient descent when n is large or when you need flexibility -- different loss functions, regularization, online learning, mini-batch updates. Knowing both gives you the right tool for the job.
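To make the stability point concrete, here are three equivalent routes to the same weights -- explicit inversion, np.linalg.solve on the normal equations, and np.linalg.lstsq -- sketched on a small synthetic problem (the data here is made up purely for the comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(500, 4)), np.ones(500)])  # 4 features + bias
true_w = np.array([1.5, -2.0, 0.5, 3.0, 10.0])
y = X @ true_w + rng.normal(scale=0.1, size=500)

# Route 1: explicit inverse (what normal_equation above does)
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Route 2: solve the normal equations directly, no explicit inverse formed
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Route 3: lstsq, which never forms X.T @ X at all (most numerically stable)
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))  # True True
```

On a well-conditioned problem like this all three agree. On badly conditioned data (think unscaled high-degree polynomial features) the explicit inverse degrades first, which is why lstsq is the safer default.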

R-squared: a more intuitive metric

In episode #10 we measured model quality with MSE and MAE. Those are fine for comparing models, but they're hard to interpret on their own. Is an MAE of EUR 12,000 good or bad? Depends on the price range. An MAE of EUR 12,000 on apartments worth EUR 500,000 is great. On apartments worth EUR 20,000 it's terrible.

R-squared (R^2) solves this by expressing model quality as a proportion. It answers: "how much of the variance in y does my model explain?"

def r_squared(y_true, y_pred):
    """R^2 -- proportion of variance explained. 1.0 = perfect, 0.0 = useless."""
    ss_res = np.sum((y_true - y_pred) ** 2)    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# Our normal equation model
preds_all = X_full @ w_normal

r2 = r_squared(price, preds_all)
mse = np.mean((price - preds_all) ** 2)
mae = np.mean(np.abs(price - preds_all))

print(f"Model performance:")
print(f"  R^2:  {r2:.4f}")
print(f"  MSE:  {mse:>14,.0f}")
print(f"  MAE:  EUR {mae:>10,.0f}")
print(f"\nInterpretation: the model explains {r2*100:.1f}% of price variance")

R^2 of 0.85 means the model explains 85% of the variance in prices. The remaining 15% is unexplained -- noise, missing features, nonlinear effects. R^2 of 1.0 is a perfect fit (suspicious -- probably overfitting). R^2 of 0.0 means the model is no better than just predicting the mean price for everything. R^2 can even go negative, which means the model is actively WORSE than the mean -- a sign something went very wrong.
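To see all three regimes with actual numbers, here is a tiny sketch: a decent model, the mean-only baseline, and a model worse than the mean. The r_squared function is repeated so the snippet runs standalone, and the "models" are just hand-made prediction arrays:

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
y = rng.normal(loc=100, scale=10, size=50)

good_pred = y + rng.normal(scale=2, size=50)  # close to the truth
mean_pred = np.full(50, y.mean())             # the "useless" baseline
bad_pred = -y                                 # systematically wrong

print(f"good: {r_squared(y, good_pred):.3f}")  # close to 1
print(f"mean: {r_squared(y, mean_pred):.3f}")  # exactly 0
print(f"bad:  {r_squared(y, bad_pred):.3f}")   # negative
```

Predicting the mean gives R^2 of exactly 0 by construction, which is why 0 is the "my model adds nothing" reference point.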

This is a metric you'll use constantly. When someone asks "how good is your model?" in a regression context, R^2 is usually the first number they want to hear.

Putting it all together: the complete pipeline

Let me chain everything from this episode into one coherent pipeline. This is the workflow you'll use in real projects -- the order matters:

# COMPLETE PIPELINE: split -> polynomial -> scale -> regularize -> evaluate

# 1. Generate realistic data with a known curved + noisy relationship
np.random.seed(42)
n = 100
X_raw = np.random.uniform(0, 20, n)
y = 3 * X_raw + 0.5 * X_raw**2 - 0.02 * X_raw**3 + np.random.randn(n) * 8

# 2. Train/test split FIRST (before any transformations!)
split = int(0.8 * n)
idx = np.random.permutation(n)
X_train_raw, X_test_raw = X_raw[idx[:split]], X_raw[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]

# 3. Create polynomial features
degree = 5
X_train_poly = np.column_stack([X_train_raw**d for d in range(1, degree + 1)])
X_test_poly = np.column_stack([X_test_raw**d for d in range(1, degree + 1)])

# 4. Scale (fit on train ONLY, transform both)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)

# 5. Add bias column
X_train_b = np.column_stack([X_train_scaled, np.ones(len(X_train_scaled))])
X_test_b = np.column_stack([X_test_scaled, np.ones(len(X_test_scaled))])

# 6. Train with Ridge regularization
w = ridge_regression(X_train_b, y_train, alpha=0.1, lr=0.01, epochs=3000)

# 7. Evaluate on BOTH sets
train_pred = X_train_b @ w
test_pred = X_test_b @ w

train_rmse = np.sqrt(np.mean((y_train - train_pred)**2))
test_rmse = np.sqrt(np.mean((y_test - test_pred)**2))
train_r2 = r_squared(y_train, train_pred)
test_r2 = r_squared(y_test, test_pred)

print("Pipeline results:")
print(f"  Train RMSE: {train_rmse:.2f}   R^2: {train_r2:.4f}")
print(f"  Test RMSE:  {test_rmse:.2f}   R^2: {test_r2:.4f}")
print(f"  RMSE ratio: {test_rmse/train_rmse:.3f} (close to 1.0 = good)")

Notice the order. Split first, then transform. If you create polynomial features or scale before splitting, information from the test set contaminates the training process. This is the same data leakage warning from the StandardScaler section -- it applies to EVERY transformation. Always split first, then apply transformations separately.

Comparing different regularization strengths

One more practical exercise. Let's sweep through different alpha values and see which one gives the best test performance:

print(f"\nAlpha sweep (degree-{degree} polynomial, Ridge):\n")
print(f"{'alpha':>8s}  {'Train RMSE':>10s}  {'Test RMSE':>10s}  {'Test R^2':>8s}")
print("-" * 42)

best_alpha = 0
best_test_rmse = float('inf')

for alpha in [0, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0]:
    w_sweep = ridge_regression(X_train_b, y_train, alpha=alpha, lr=0.01, epochs=3000)

    tr_pred = X_train_b @ w_sweep
    te_pred = X_test_b @ w_sweep
    tr_rmse = np.sqrt(np.mean((y_train - tr_pred)**2))
    te_rmse = np.sqrt(np.mean((y_test - te_pred)**2))
    te_r2 = r_squared(y_test, te_pred)

    marker = ""
    if te_rmse < best_test_rmse:
        best_test_rmse = te_rmse
        best_alpha = alpha
        marker = " <-- best so far"

    print(f"{alpha:>8.3f}  {tr_rmse:>10.2f}  {te_rmse:>10.2f}  {te_r2:>8.4f}{marker}")

print(f"\nBest alpha: {best_alpha}")

You'll see the classic U-shaped curve. Too little regularization (alpha near 0) gives overfitting -- low train RMSE but high test RMSE. Too much regularization (alpha = 10) gives underfitting -- high RMSE on both. The best alpha is somewhere in the middle. Finding that sweet spot is called hyperparameter tuning, and there are systematic ways to do it (grid search, cross-validation) that we'll encounter in upcoming episodes.
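As a preview of where that is heading, the core of k-fold cross-validation fits in a few lines: rotate which fold plays validation set, train on the rest, and average the scores. The sketch below uses a hypothetical helper (kfold_rmse is my name, not a library function) with the same Ridge gradient-descent update from earlier, on made-up demo data:

```python
import numpy as np

def kfold_rmse(X, y, alpha, k=5, lr=0.01, epochs=2000):
    """Average validation RMSE of Ridge gradient descent across k folds."""
    m, n = X.shape
    idx = np.random.permutation(m)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Same Ridge update rule as ridge_regression above
        w = np.zeros(n)
        for _ in range(epochs):
            pred = X[train_idx] @ w
            grad = (-(2 / len(train_idx)) * X[train_idx].T @ (y[train_idx] - pred)
                    + 2 * alpha * w)
            w -= lr * grad
        val_pred = X[val_idx] @ w
        scores.append(np.sqrt(np.mean((y[val_idx] - val_pred) ** 2)))
    return np.mean(scores)

np.random.seed(42)
X_demo = np.column_stack([np.random.randn(100, 3), np.ones(100)])
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + np.random.randn(100) * 0.5

for alpha in [0.0, 0.1, 1.0]:
    print(f"alpha={alpha}: mean CV RMSE = {kfold_rmse(X_demo, y_demo, alpha):.3f}")
```

The advantage over a single split: every point gets used for validation exactly once, so the score is less sensitive to one lucky (or unlucky) split.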

Multivariate regression: putting features together

Before we wrap up, let me make one more thing explicit. Everything in this episode -- polynomial features, regularization, scaling -- works just as well with multiple input features. You're not limited to polynomials of one variable. You can combine square meters, rooms, age, AND their squares, AND interactions between features (like sqm * rooms), all in one model.

# Multiple features + polynomial terms + interactions
np.random.seed(42)
n = 150
sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
age = np.random.uniform(0, 40, n)
noise = np.random.randn(n) * 10000

# True relationship has a nonlinear component
price = (2500 * sqm + 800 * rooms - 500 * age
         - 5 * sqm * age  # interaction: old large apartments depreciate faster
         + 15000 + noise)

# Build feature matrix with interactions and polynomials
X_multi = np.column_stack([
    sqm, rooms, age,          # original features
    sqm**2, age**2,            # polynomial
    sqm * rooms,               # interaction
    sqm * age,                 # interaction (the one that matters!)
])

# Split
idx = np.random.permutation(n)
split = int(0.8 * n)
X_tr, X_te = X_multi[idx[:split]], X_multi[idx[split:]]
y_tr, y_te = price[idx[:split]], price[idx[split:]]

# Scale -> add bias -> train
scaler_multi = StandardScaler()
X_tr_s = scaler_multi.fit_transform(X_tr)
X_te_s = scaler_multi.transform(X_te)
X_tr_b = np.column_stack([X_tr_s, np.ones(len(X_tr_s))])
X_te_b = np.column_stack([X_te_s, np.ones(len(X_te_s))])

w_multi = ridge_regression(X_tr_b, y_tr, alpha=0.1, lr=0.01, epochs=3000)

tr_r2 = r_squared(y_tr, X_tr_b @ w_multi)
te_r2 = r_squared(y_te, X_te_b @ w_multi)
te_mae = np.mean(np.abs(y_te - X_te_b @ w_multi))

feature_names = ["sqm", "rooms", "age", "sqm^2", "age^2",
                 "sqm*rooms", "sqm*age", "bias"]
print("Multivariate regression with interactions:\n")
print(f"  Train R^2: {tr_r2:.4f}")
print(f"  Test R^2:  {te_r2:.4f}")
print(f"  Test MAE:  EUR {te_mae:,.0f}")
print(f"\nLearned weights (scaled space):")
for name, wi in zip(feature_names, w_multi):
    print(f"  {name:>10s}: {wi:>+10.1f}")

Look at the weight for sqm*age -- if the model is working correctly, it should be one of the larger weights because that interaction term is part of the true relationship. The polynomial terms (sqm^2, age^2) should have smaller weights since the true relationship doesn't actually include those. Ridge regularization keeps them small without eliminating them entirely. If we used Lasso instead, it might drive those irrelevant polynomial terms to zero completely.

This is the power of combining everything: polynomial features give flexibility, regularization controls complexity, scaling ensures fair treatment, and the train/test split keeps us honest about generalization.

Let's recap

We took the linear regression engine from episode #10 and made it production-ready. Here's what we added:

  • Polynomial regression creates features like x^2, x^3 to fit curves -- the model stays linear in the weights but can capture nonlinear patterns in the data;
  • The bias-variance tradeoff is the central tension in ALL of ML: too simple = underfitting (high bias), too complex = overfitting (high variance). The sweet spot is in the middle, and managing it is a core skill;
  • L2 regularization (Ridge) adds alpha * sum(wi^2) to the cost, shrinking weights toward zero and reducing overfitting without eliminating features;
  • L1 regularization (Lasso) adds alpha * sum(|wi|) to the cost, driving some weights to exactly zero -- automatic feature selection;
  • StandardScaler transforms features to zero mean and unit variance. Fit on training data ONLY. Always. No exceptions;
  • The normal equation w = (X.T @ X)^(-1) @ X.T @ y solves linear regression in one step, but scales poorly beyond ~10,000 features -- gradient descent wins for large problems;
  • R-squared expresses model quality as a proportion of variance explained -- more intuitive than raw MSE;
  • The complete pipeline is: split -> features -> scale -> regularize -> evaluate on held-out test data. Order matters. Split first, transform after.

Next episode we leave the world of predicting continuous numbers entirely. What if the thing you're trying to predict isn't a number at all -- but a category? Is this email spam or not? Is this tumor malignant or benign? Will this customer churn or stay? That's classification, and the transition from regression to classification is one of the most important conceptual leaps in ML. We'll build the model from scratch, same as always ;-)

Greetings! Questions or remarks? Drop them in the comments!

@scipio


