Learn AI Series (#7) - The Training Loop - See It Work Step by Step

What will I learn
- You will learn how to assemble a complete training loop from the pieces we built in episodes #4 through #6;
- how to watch a model actually learn -- with loss decreasing step by step;
- what convergence looks like and how to recognize it in practice;
- what happens when the learning rate is wrong (too big, too small, just right);
- when to stop training -- patience and early stopping;
- how to test whether your model learned something real or just memorized noise;
- that gradient descent is the foundation that every ML algorithm builds on.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step (this post)
Six episodes. Six episodes of building up concepts piece by piece -- what ML is (#1), how to set up your workbench (#2), how data becomes numbers (#3), making predictions from intuition (#4), spotting patterns (#5), and formalizing everything with loss functions and gradient descent (#6). And now we put it ALL together into one running program.
This is the episode I've been looking forward to writing, because this is where it clicks. Not "here's a concept, think about it." No -- here's a loop, run it, and watch the numbers change. Watch the loss go down. Watch the parameters crawl toward the right values. Watch the model learn. Right there in your terminal. It's one of those moments where you go from "I understand the theory" to "holy macaroni, it actually works" ;-)
Let's go.
The setup: data with a known answer
We're going to create synthetic data where WE know the true relationship. That way, when our model finishes training, we can check whether it discovered the correct answer. This is a standard trick in ML development -- you test your code on problems where you already know the solution before unleashing it on real data where you don't.
Remember our apartment pricing from episodes #4 and #5? Let's use that same domain but this time with a known formula:
import numpy as np
np.random.seed(42)
# True relationship: price = 2500 * sqm + 15000 + noise
# We KNOW slope=2500 and intercept=15000
# Our model will have to discover these values on its own
n_samples = 20
sqm = np.random.uniform(30, 150, n_samples)
noise = np.random.randn(n_samples) * 12000
price = 2500 * sqm + 15000 + noise
print(f"Generated {n_samples} apartments")
print(f"Sqm range: {sqm.min():.0f} - {sqm.max():.0f}")
print(f"Price range: EUR {price.min():,.0f} - EUR {price.max():,.0f}")
print(f"\nFirst 5 apartments:")
for i in range(5):
    print(f" {sqm[i]:.0f} sqm -> EUR {price[i]:,.0f}")
The true relationship is price = 2500 * sqm + 15000. Our model doesn't know this. It starts with random (terrible) parameter values and has to discover the right slope and intercept by minimizing the loss. The noise (np.random.randn * 12000) simulates all the real-world factors we aren't modeling -- location, condition, floor level, that sort of thing. We talked about noise back in episode #5 and why it exists. The model has to learn the signal through the noise.
The model: two knobs to turn
Our model is the simplest possible prediction machine: a straight line. prediction = slope * sqm + intercept. Two parameters to learn -- the slope (how much each extra square meter adds to the price) and the intercept (the base price when square meters is zero, which is a mathematical convenience more than a realistic quantity).
Remember from episode #6 -- parameters are the adjustable knobs. The model is its parameters. Different values = different predictions = different loss. Training means finding the values that produce the lowest loss.
# Initialize parameters -- start from zero (bad, but that's the point)
slope = 0.0
intercept = 0.0
print(f"Starting parameters:")
print(f" slope = {slope:.2f} (true value: 2500)")
print(f" intercept = {intercept:.2f} (true value: 15000)")
# What does the model predict right now?
predictions = slope * sqm + intercept
loss = ((price - predictions) ** 2).mean()
print(f"\n Starting loss: {loss:,.0f}")
print(f" (This is going to be terrible -- the parameters are zero!)")
With both parameters at zero, every prediction is EUR 0. Every single apartment priced at zero euros. The loss is astronomical. But that's fine -- the whole point of training is to start bad and get better.
The training loop: all four pieces working together
Here it is. The complete training loop. Every line connects to something we've discussed in previous episodes. I want you to read through this carefully, then run it, and watch what happens to the loss column:
# Hyperparameters (settings WE choose before training -- episode #6)
learning_rate = 0.0000001 # Step size for gradient descent
n_epochs = 200 # How many times we loop through the data
# Initialize parameters
slope = 0.0
intercept = 0.0
# Store history for analysis later
loss_history = []
slope_history = []
intercept_history = []
print("Training begins...\n")
print(f"{'Epoch':>6s} {'Loss':>14s} {'Slope':>10s} {'Intercept':>10s}")
print("-" * 48)
for epoch in range(n_epochs):
    # STEP 1: Forward pass -- make predictions with current parameters
    predictions = slope * sqm + intercept
    # STEP 2: Compute loss -- how wrong are we? (MSE from episode #6)
    errors = price - predictions
    loss = (errors ** 2).mean()
    loss_history.append(loss)
    slope_history.append(slope)
    intercept_history.append(intercept)
    # STEP 3: Compute gradients -- which direction improves things?
    # These formulas come from calculus (we'll derive them properly soon)
    # For now: they tell us which way to nudge each parameter
    d_slope = -2 * (sqm * errors).mean()
    d_intercept = -2 * errors.mean()
    # STEP 4: Update parameters -- take a step downhill
    slope = slope - learning_rate * d_slope
    intercept = intercept - learning_rate * d_intercept
    # Print progress every 20 epochs
    if epoch % 20 == 0 or epoch == n_epochs - 1:
        print(f"{epoch:>6d} {loss:>14,.0f} {slope:>10.1f} {intercept:>10.1f}")
print(f"\nFinal parameters:")
print(f" slope = {slope:.1f} (true: 2500)")
print(f" intercept = {intercept:.1f} (true: 15000)")
Run it. Watch the loss column. It starts enormous (random parameters produce terrible predictions) and drops rapidly in the first few epochs, then slows down as it approaches the optimum. That shape -- steep drop followed by a long gradual tail -- is universal. Every model you will ever train in your life shows this pattern. Neural networks with billions of parameters, little toy linear models with two -- the shape is the same. The numbers are different, the curve is the same.
And watch the slope crawling toward 2500. The intercept moves in the right direction too, but far more slowly -- with feature values this large, the slope receives a much stronger gradient signal than the intercept, so the intercept lags behind. Nobody told the model "the answer is 2500." It found it by repeatedly asking "which direction reduces the error?" and taking small steps.
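If you don't have a plotting library handy, you can still eyeball the loss curve right in the terminal. Here's a throwaway sketch -- it regenerates the same data and reruns the loop so it works standalone -- that prints one ASCII bar per sampled epoch:

```python
import numpy as np

# Rebuild the same dataset and rerun the loop so this snippet stands alone
np.random.seed(42)
sqm = np.random.uniform(30, 150, 20)
price = 2500 * sqm + 15000 + np.random.randn(20) * 12000

slope, intercept, lr = 0.0, 0.0, 1e-7
loss_history = []
for epoch in range(200):
    errors = price - (slope * sqm + intercept)
    loss_history.append((errors ** 2).mean())
    slope += lr * 2 * (sqm * errors).mean()
    intercept += lr * 2 * errors.mean()

# One bar per sampled epoch, scaled so the worst loss fills the row
max_loss = max(loss_history)
for ep in range(0, 200, 20):
    bar = "#" * int(40 * loss_history[ep] / max_loss)
    print(f"epoch {ep:>3d} | {bar}")
```

The bars shrink from left to right -- the same steep-then-gradual shape you'd see in a proper plot.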
Breaking down one epoch
Let me be really explicit about what happens in a single iteration of that loop. Four steps, and they map directly to the concepts from the last three episodes:
Step 1 -- Forward pass. We feed our features (square meters) through the model formula and get predictions. With bad parameters, these predictions are way off. With good parameters, they're close. This is just multiplication and addition -- nothing mysterious.
Step 2 -- Compute loss. We compare predictions to actual prices using MSE (Mean Squared Error -- episode #6). This gives us a single number that says "how wrong is the model, overall?" Lower is better. That's it.
Step 3 -- Compute gradients. This is the "feel which way is downhill" part from the blindfolded mountain analogy in episode #6. The gradient for the slope tells us: "if you increase the slope slightly, the loss will change by this much in this direction." If the gradient is positive, increasing the slope makes things worse -- so we decrease it. If negative, the opposite.
Step 4 -- Update parameters. We nudge each parameter in the direction that reduces the loss, scaled by the learning rate. Small steps. Every epoch. Relentlessly downhill.
Let's trace one epoch in detail so you can see the actual numbers:
# Trace what happens at the CURRENT state (after 200 epochs of training)
predictions = slope * sqm + intercept
errors = price - predictions
print("Detailed epoch trace (first 5 data points):\n")
print(f"{'Sqm':>6s} {'Actual':>10s} {'Predicted':>10s} {'Error':>10s} {'Sqr Error':>14s}")
for i in range(5):
    se = errors[i] ** 2
    print(f"{sqm[i]:>6.0f} {price[i]:>10,.0f} {predictions[i]:>10,.0f} "
          f"{errors[i]:>+10,.0f} {se:>14,.0f}")
print(f"\nMSE loss: {(errors**2).mean():,.0f}")
print(f"Gradient for slope: {-2 * (sqm * errors).mean():,.0f}")
print(f"Gradient for intercept: {-2 * errors.mean():,.0f}")
Look at those error values. After 200 epochs of training, the predictions are in the right ballpark -- the errors are in the thousands or tens of thousands, not in the hundreds of thousands like they would have been at epoch 0. The model has learned something. Not perfectly (and I'll explain why in a moment), but significantly better than its starting point of predicting zero for everything.
The learning rate: one number to rule them all
In episode #6 we talked about how the learning rate controls step size -- too big and you overshoot, too small and you crawl. Let's actually see that now with our complete training loop. Same data, same model, three different learning rates:
def train_and_report(lr, n_epochs=200, label=""):
    """Train a model and report the result."""
    s, i = 0.0, 0.0
    losses = []
    for epoch in range(n_epochs):
        preds = s * sqm + i
        errs = price - preds
        loss = (errs ** 2).mean()
        losses.append(loss)
        # Check for divergence (loss exploding to infinity)
        if loss > 1e20 or np.isnan(loss):
            print(f" {label}: DIVERGED at epoch {epoch}! "
                  f"Loss exploded to {loss:.0e}")
            return losses
        s -= lr * (-2 * (sqm * errs).mean())
        i -= lr * (-2 * errs.mean())
    print(f" {label}:")
    print(f" final loss = {losses[-1]:>14,.0f}")
    print(f" slope = {s:>8.1f} intercept = {i:>8.1f}")
    improvement = (losses[0] - losses[-1]) / losses[0] * 100
    print(f" improvement = {improvement:.1f}%")
    return losses
print("Learning rate experiments:\n")
losses_small = train_and_report(1e-10, label="lr=1e-10 (too small)")
print()
losses_good = train_and_report(1e-7, label="lr=1e-7 (good)")
print()
losses_large = train_and_report(1e-4, label="lr=1e-4 (too large)")
Run it and compare. With lr=1e-10 (too small), the model barely moves after 200 epochs -- the slope might be at 2.5 instead of 2500. It would eventually get there, but you'd need millions of epochs. In practice, nobody waits that long.
With lr=1e-4 (too large), the parameters bounce wildly, the loss explodes to infinity, and the model diverges. Game over. Total failure. A number that was supposed to help the model learn instead caused it to self-destruct.
With lr=1e-7 (just right for this problem), the model converges nicely. Loss drops steadily, parameters approach the true values, everything works as intended.
This is why learning rate is the first thing practitioners tune. Before you touch anything else -- model architecture, features, data augmentation, regularization -- you get the learning rate right. Getting it right makes the difference between a model that learns and one that doesn't. I've been doing this for a while now, and I still occasionally have that moment where a model isn't learning and I spend 20 minutes debugging before realizing the learning rate was off by a factor of 10. It happens ;-)
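A side note on why the learning rate has to be so tiny here: the raw feature values (sqm up to 150) make the slope's gradient enormous compared to the intercept's. A common remedy -- which we'll cover properly later in the series, so treat this as a hedged preview -- is to standardize the feature first. Then an "ordinary" learning rate works and both parameters learn at comparable speed:

```python
import numpy as np

np.random.seed(42)
sqm = np.random.uniform(30, 150, 20)
price = 2500 * sqm + 15000 + np.random.randn(20) * 12000

# Standardize the feature: zero mean, unit variance
sqm_mean, sqm_std = sqm.mean(), sqm.std()
x = (sqm - sqm_mean) / sqm_std

# With a standardized input, lr = 0.1 converges in a few dozen epochs
slope, intercept, lr = 0.0, 0.0, 0.1
for epoch in range(200):
    errors = price - (slope * x + intercept)
    slope += lr * 2 * (x * errors).mean()
    intercept += lr * 2 * errors.mean()

# Undo the scaling to express the line in the original units
orig_slope = slope / sqm_std
orig_intercept = intercept - slope * sqm_mean / sqm_std
print(f"Recovered: price ~= {orig_slope:.0f} * sqm + {orig_intercept:.0f}")
```

The recovered slope and intercept land much closer to the true 2500 and 15000 than the unscaled run manages, and in far fewer epochs.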
Convergence: knowing when to stop
How do you know when the model has learned enough? How do you know when continuing to train is just wasting compute? Simple: when the loss stops decreasing meaningfully.
# Train for much longer and track the improvement at different stages
s, i = 0.0, 0.0
lr = 1e-7
losses = []
for epoch in range(1000):
    preds = s * sqm + i
    errs = price - preds
    loss = (errs ** 2).mean()
    losses.append(loss)
    s -= lr * (-2 * (sqm * errs).mean())
    i -= lr * (-2 * errs.mean())
losses = np.array(losses)
# Check improvement at different stages
print("Loss improvement by training stage:\n")
stages = [(0, 50), (50, 100), (100, 200), (200, 500), (500, 1000)]
for start, end in stages:
    improvement = losses[start] - losses[end - 1]
    pct = improvement / losses[start] * 100
    print(f" Epochs {start:>4d}-{end:>4d}: "
          f"loss {losses[start]:>14,.0f} -> {losses[end-1]:>14,.0f} "
          f"({pct:>6.2f}% reduction)")
See the pattern? Early epochs produce massive improvements -- the loss might drop 90% in the first 50 epochs. Middle epochs produce moderate improvements. Late epochs produce tiny, diminishing improvements. Going from epoch 500 to 1000 might improve the loss by a fraction of a percent.
This is convergence. The model has reached a region of the loss landscape where the gradient is almost flat -- it's near the bottom of the valley, and there's very little downhill left to walk. Continuing to train isn't harmful, but it's wasteful. Those extra epochs cost computation and produce negligible benefit.
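There's a second convergence signal besides the flattening loss: the gradients themselves shrink, because near the bottom of the valley there's barely any slope left to measure. A quick sketch -- same data and update rule as above, with a hypothetical grad_sizes dict for bookkeeping:

```python
import numpy as np

np.random.seed(42)
sqm = np.random.uniform(30, 150, 20)
price = 2500 * sqm + 15000 + np.random.randn(20) * 12000

slope, intercept, lr = 0.0, 0.0, 1e-7
grad_sizes = {}
for epoch in range(1000):
    errors = price - (slope * sqm + intercept)
    d_slope = -2 * (sqm * errors).mean()
    if epoch in (0, 100, 500, 999):
        grad_sizes[epoch] = abs(d_slope)  # record the gradient's magnitude
    slope -= lr * d_slope
    intercept -= lr * (-2 * errors.mean())

for ep, g in grad_sizes.items():
    print(f"epoch {ep:>4d}: |d_slope| = {g:,.0f}")
```

The magnitude drops steadily: flat loss and vanishing gradients are two views of the same event.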
In practice, most practitioners define a patience parameter: "stop if the loss hasn't improved by at least X for N epochs in a row." This is called early stopping and it's one of the most practically useful techniques in all of ML. Let's build it:
def train_with_patience(lr, patience=20, min_improvement=100):
    """Train until the model stops improving."""
    s, i = 0.0, 0.0
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch in range(5000):  # Upper limit -- we'll stop early
        preds = s * sqm + i
        errs = price - preds
        loss = (errs ** 2).mean()
        # Did we improve enough?
        if best_loss - loss > min_improvement:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Patience exhausted?
        if epochs_without_improvement >= patience:
            print(f"Converged at epoch {epoch}!")
            print(f" Final loss: {loss:,.0f}")
            print(f" slope = {s:.1f}, intercept = {i:.1f}")
            print(f" (Stopped because {patience} consecutive epochs "
                  f"without >{min_improvement} improvement)")
            return epoch, s, i
        s -= lr * (-2 * (sqm * errs).mean())
        i -= lr * (-2 * errs.mean())
    print("Hit maximum epochs without converging")
    return 5000, s, i
print("Training with patience-based early stopping:\n")
epochs_needed, final_slope, final_intercept = train_with_patience(lr=1e-7)
print(f"\nTrue values: slope=2500, intercept=15000")
The model stops itself when it realizes further training isn't producing meaningful gains. This is a fundamental pattern you'll see over and over -- every serious training pipeline has some form of early stopping. When we get to neural networks, early stopping serves double duty: it also prevents overfitting (the model memorizing noise instead of learning signal). We touched on overfitting vs underfitting back in episode #1 and saw it with the K parameter in KNN in episode #4. Same concept, different context. Everything connects.
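One refinement worth flagging: train_with_patience returns the parameters from the epoch where it stopped, which in noisier settings may not be the best ones it ever had. A common variant -- sketched here with a hypothetical train_keep_best helper -- snapshots the best parameters seen and returns those instead:

```python
import numpy as np

np.random.seed(42)
sqm = np.random.uniform(30, 150, 20)
price = 2500 * sqm + 15000 + np.random.randn(20) * 12000

def train_keep_best(lr, patience=20, min_improvement=100, max_epochs=5000):
    """Early stopping that also remembers the best parameters seen."""
    s, i = 0.0, 0.0
    best = (float("inf"), s, i)  # (loss, slope, intercept)
    stale = 0
    for epoch in range(max_epochs):
        errs = price - (s * sqm + i)
        loss = (errs ** 2).mean()
        if best[0] - loss > min_improvement:
            best = (loss, s, i)  # snapshot the best state so far
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            break
        s += lr * 2 * (sqm * errs).mean()
        i += lr * 2 * errs.mean()
    return best  # return the best snapshot, not the last state

best_loss, best_slope, best_intercept = train_keep_best(lr=1e-7)
print(f"Best loss {best_loss:,.0f} with slope={best_slope:.1f}")
```

On our smooth loss curve the two versions return nearly identical parameters; with noisy curves the snapshot version is the safer default.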
Testing on new data: the moment of truth
We trained our model on 20 apartments. Cool. The loss went down. The parameters look reasonable. But here's the real question -- the question that separates a useful model from a useless one: does it work on apartments it has never seen before?
This is the generalization test. Remember from episode #4 when we did leave-one-out evaluation? Same principle. A model that only works on its training data has memorized, not learned. It's like a student who can repeat the textbook word for word but can't answer a question that wasn't in the exam prep materials. Impressive recall, zero understanding.
# Generate 5 NEW apartments -- same underlying relationship, fresh noise
np.random.seed(99) # Different seed = different random numbers
new_sqm = np.array([50, 75, 95, 110, 140], dtype=np.float64)
new_price = 2500 * new_sqm + 15000 + np.random.randn(5) * 12000
# Our trained model's predictions
new_predictions = final_slope * new_sqm + final_intercept
print("Predictions on NEW (unseen) data:\n")
print(f"{'Sqm':>5s} {'Actual':>10s} {'Predicted':>10s} {'Error':>10s}")
print("-" * 42)
for sq, act, pred in zip(new_sqm, new_price, new_predictions):
    err = act - pred
    print(f"{sq:>5.0f} EUR {act:>9,.0f} EUR {pred:>9,.0f} {err:>+10,.0f}")
# Compare training error vs test error
train_preds = final_slope * sqm + final_intercept
train_mae = np.abs(price - train_preds).mean()
test_mae = np.abs(new_price - new_predictions).mean()
print(f"\nTraining MAE: EUR {train_mae:,.0f}")
print(f"Test MAE: EUR {test_mae:,.0f}")
If the model truly learned the underlying pattern (price increases by roughly EUR 2500 per extra square meter), the test MAE should be in a similar range as the training MAE. Not identical -- the test set has its own noise -- but in the same ballpark. If the test error is massively worse than the training error, that's a red flag: the model might be overfitting.
With our simple linear model on this clean data, you'll likely see test performance that's comparable to training performance. That's the reward for using a model that matches the actual data-generating process. A straight line fits data that was generated by a straight line. When the model architecture matches the true relationship, generalization tends to be good. (Later in the series, when we have more complex data and more flexible models, this gets trickier -- much trickier. But we'll deal with that when we get there.)
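With synthetic data we can conjure fresh apartments at will; with real data we can't. The standard move is to hold back a slice of the data you already have before training ever starts. A minimal sketch using a random permutation (the 80/20 split is just a common convention, not a law):

```python
import numpy as np

np.random.seed(42)
sqm = np.random.uniform(30, 150, 20)
price = 2500 * sqm + 15000 + np.random.randn(20) * 12000

# Shuffle the indices, then carve off 80% for training, 20% for testing
idx = np.random.permutation(len(sqm))
train_idx, test_idx = idx[:16], idx[16:]
sqm_train, price_train = sqm[train_idx], price[train_idx]
sqm_test, price_test = sqm[test_idx], price[test_idx]

print(f"Training on {len(sqm_train)} apartments, testing on {len(sqm_test)}")
```

Train only on the training slice; the held-out apartments then play the role of "new data" at evaluation time.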
Tracking the learning journey
One more thing I want to show you. We stored the loss, slope, and intercept at every epoch. Let's use that history to see the full learning trajectory:
loss_history = np.array(loss_history)
slope_history = np.array(slope_history)
intercept_history = np.array(intercept_history)
# Show how the model evolved over training
print("The model's learning journey:\n")
print(f"{'Epoch':>6s} {'Loss':>14s} {'Slope':>8s} {'Intercept':>10s} {'Slope Err':>10s}")
print("-" * 56)
milestones = [0, 1, 5, 10, 25, 50, 100, 150, 199]
for ep in milestones:
    slope_err = abs(slope_history[ep] - 2500)
    print(f"{ep:>6d} {loss_history[ep]:>14,.0f} {slope_history[ep]:>8.1f} "
          f"{intercept_history[ep]:>10.1f} {slope_err:>10.1f}")
# How fast did the model learn most of what it knows?
loss_at_start = loss_history[0]
loss_at_50 = loss_history[49]
loss_at_end = loss_history[-1]
total_reduction = loss_at_start - loss_at_end
first_50_reduction = loss_at_start - loss_at_50
print(f"\nTotal loss reduction: {total_reduction:,.0f}")
print(f"First 50 epochs: {first_50_reduction:,.0f} "
f"({first_50_reduction/total_reduction*100:.1f}% of total)")
print(f"Remaining 150: {total_reduction - first_50_reduction:,.0f} "
f"({(total_reduction - first_50_reduction)/total_reduction*100:.1f}% of total)")
Most of the learning happens early. The first 50 epochs might account for 90%+ of the total loss reduction. The remaining 150 epochs polish the parameters but don't change the predictions dramatically. This has practical implications: if you're experimenting with different model architectures or hyperparameters, you can often get a good signal from short training runs. You don't need to train to full convergence every time just to see whether an approach is promising. Run it for 50 epochs, check the curve, and decide whether to invest more computation.
Having said that, there's a subtlety here. In our simple example the loss curve is smooth and predictable. With more complex models (neural networks, for example), the loss curve can be noisy -- bouncing up and down between epochs, sometimes getting stuck on a plateau before dropping again. Early stopping and patience parameters become crucial in those settings because you need to distinguish "the model is done learning" from "the model is temporarily stuck and will improve again if you keep going."
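When the curve is noisy, a moving average makes the underlying trend visible before you apply a patience rule. The loss curve below is simulated -- a decaying exponential plus jitter -- purely to demonstrate the smoothing:

```python
import numpy as np

np.random.seed(0)
# Simulate a noisy loss curve: decaying trend plus epoch-to-epoch jitter
epochs = np.arange(500)
noisy_loss = 1000 * np.exp(-epochs / 100) + np.random.randn(500) * 20

# Smooth with a simple moving average before checking for plateaus
window = 25
smoothed = np.convolve(noisy_loss, np.ones(window) / window, mode="valid")

print(f"Avg jump between raw neighbors:      {np.abs(np.diff(noisy_loss)).mean():.1f}")
print(f"Avg jump between smoothed neighbors: {np.abs(np.diff(smoothed)).mean():.1f}")
```

Applying patience to the smoothed curve instead of the raw one makes early stopping far less likely to quit on a temporary blip.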
The complete picture: everything in 25 lines
Let's take everything we've built across six episodes and compress it into the most minimal version possible. This is the skeleton of ALL machine learning:
import numpy as np
# 1. DATA: generate (or load) features and targets
np.random.seed(42)
X = np.random.uniform(30, 150, 20)
y = 2500 * X + 15000 + np.random.randn(20) * 12000
# 2. MODEL: define learnable parameters
slope, intercept = 0.0, 0.0
# 3. TRAIN: gradient descent loop
lr = 1e-7
for epoch in range(500):
    pred = slope * X + intercept                    # forward pass
    loss = ((y - pred) ** 2).mean()                 # compute loss (MSE)
    slope -= lr * (-2 * (X * (y - pred)).mean())    # update slope
    intercept -= lr * (-2 * (y - pred).mean())      # update intercept
# 4. EVALUATE: test on new data
X_test = np.array([50, 100, 130])
y_test = 2500 * X_test + 15000 + np.random.randn(3) * 12000
pred_test = slope * X_test + intercept
print(f"Learned: price = {slope:.0f} * sqm + {intercept:.0f}")
print(f"True: price = 2500 * sqm + 15000")
print(f"Test MAE: EUR {np.abs(y_test - pred_test).mean():,.0f}")
That's it. That's machine learning. Twenty-something lines of Python. Data, model, train, evaluate. Everything else we'll learn in this series -- neural networks, convolutional networks, transformers, reinforcement learning -- is a variation on this theme. The model gets more complex (instead of slope * X + intercept, you might have millions of matrix multiplications and nonlinear activations). The loss function changes (classification uses a different loss than regression). The optimizer gets fancier (instead of plain gradient descent, you might use Adam or SGD with momentum). The evaluation gets more rigorous. But the skeleton? Predict, compute loss, compute gradients, update parameters, repeat. Identical.
I want you to let that sink in for a moment, because it's genuinely one of the most powerful ideas in computer science. The four-step loop that trained our two-parameter apartment pricer is the same four-step loop that trained language models with hundreds of billions of parameters. The scale is different. The math inside each step is different. But the loop is the loop is the loop.
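One parting sanity check worth knowing: for a straight-line model, NumPy can compute the optimal slope and intercept directly via the closed-form least-squares solution, so you can verify what gradient descent should be converging toward:

```python
import numpy as np

np.random.seed(42)
X = np.random.uniform(30, 150, 20)
y = 2500 * X + 15000 + np.random.randn(20) * 12000

# Degree-1 polynomial fit = the best straight line through this data
best_slope, best_intercept = np.polyfit(X, y, 1)
print(f"Closed-form optimum: price = {best_slope:.1f} * sqm + {best_intercept:.1f}")
```

If your training loop converges somewhere far from these values, the loop (or the learning rate) is the suspect, not the data.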
Vocabulary check
We've been throwing around quite a few terms. Let me be precise about what each one means, because these words will come up in every single episode from here on:
- Epoch: one complete pass through all the training data. If you have 20 data points and you do one forward pass + gradient update using all 20, that's one epoch.
- Forward pass: running your input through the model to get predictions. In our case: prediction = slope * sqm + intercept.
- Loss: the single number that measures how wrong the model is overall. We used MSE.
- Gradient: the direction and magnitude of steepest ascent. We step in the opposite direction (descent) to reduce loss.
- Learning rate: how big each step is. A hyperparameter -- you set it before training, the model doesn't learn it.
- Convergence: when the loss stops improving meaningfully. The model has found a good spot in the loss landscape.
- Early stopping: halting training when improvement drops below a threshold for N consecutive epochs.
- Generalization: a model's ability to perform well on data it wasn't trained on. The true measure of learning.
These aren't abstract definitions. You've seen every single one of them in action today, in working code that you can run yourself. That's the best way to learn -- definition AND demonstration, not one or the other.
What to remember from this one
- The training loop is four steps that repeat: predict -> compute loss -> compute gradients -> update parameters;
- Loss decreases rapidly at first, then slowly -- this universal curve indicates convergence and tells you when to stop;
- The learning rate is the single most impactful hyperparameter: too small = painfully slow, too large = diverge and die;
- Convergence means the loss has stopped improving -- patience-based early stopping detects this automatically;
- A model that works on training data but fails on new data has memorized instead of learned -- generalization is the true test;
- Everything in ML is this loop, at different scales and with different models inside it. The foundation is now in place.
Next up, we'll cover the linear algebra that makes gradient descent rigorous when you have not just two parameters but dozens or hundreds. Vectors, matrices, dot products -- the math that turns our hand-written gradient formulas into something that scales to real problems. After that, calculus and probability. And then we build linear regression from scratch properly, with all the math we've gathered. The climb gets steeper from here, but the view gets better too ;-)