Learn AI Series (#1) - What Machine Learning Actually Is

What will I learn

  • You will learn what machine learning actually is -- stripped of all the hype;
  • the mental model that separates ML from traditional programming;
  • the three major categories of ML and when each one applies;
  • what the training loop looks like conceptually AND in code;
  • why overfitting and underfitting are the fundamental tension in all of ML;
  • why "more data" frequently beats "better algorithms."

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#1) - What Machine Learning Actually Is

Alright, here we go. After 57 episodes of teaching Python -- from strings and lists all the way through metaclasses, FastAPI, and advanced NumPy -- it's time for what I've been building toward. The Learn AI Series.

If you've been following the Learn Python Series, you already have the tools. NumPy, Pandas, vectorized operations, broadcasting, memory layout -- we covered all of that for a reason. Those aren't just "nice to know Python skills". They're the engine that powers every single machine learning algorithm we're about to build.

But let's not get ahead of ourselves. Episode one. Before we write a single training loop, before we touch PyTorch or scikit-learn, we need to understand what machine learning actually is. Not the marketing version. Not the "AI will replace everyone" version. The real thing ;-)

Traditional programming vs machine learning

The simplest way I can explain the difference (and believe me, I've tried many explanations over the years with varying degrees of success) is this:

In traditional programming, you write rules:

Given some input, apply these specific rules, produce an output.

You know the rules. You encode them. The computer follows them. When someone asks "why did the program do that?", you can point at the exact line of code responsible.

In machine learning, you flip this on its head:

Given some inputs AND the correct outputs, figure out the rules yourself.

The program doesn't follow rules you wrote -- it discovers rules from examples. You show it thousands (or millions) of input-output pairs, and it finds the patterns that connect them.

Let me show you what I mean. Here's traditional programming -- explicit rules, no learning:

def classify_email_traditional(email_text):
    spam_words = ["viagra", "lottery", "winner", "click here", "free money"]
    spam_score = 0
    for word in spam_words:
        if word in email_text.lower():
            spam_score += 1
    return "spam" if spam_score >= 2 else "not spam"

# You wrote every rule. You decided what words matter.
print(classify_email_traditional("Congratulations! You are the lottery winner!"))
print(classify_email_traditional("Meeting at 3pm tomorrow about the project"))

Output:

spam
not spam

Works, right? But here's the problem -- you had to think of every rule yourself. What about new spam patterns you haven't seen? What about misspellings ("vi4gra", "fr33")? What about spam that doesn't use any of your keywords? Your explicit rules will always be incomplete.

Now here's the ML approach (conceptually -- we'll build the real thing soon enough):

import numpy as np

# Instead of writing rules, you provide EXAMPLES
# Each email is converted to numbers (we'll learn how later)
# and paired with the correct answer
training_data = [
    ([1, 0, 1, 0, 0.8], 1),   # spam
    ([0, 1, 0, 1, 0.2], 0),   # not spam
    ([1, 1, 0, 0, 0.9], 1),   # spam
    ([0, 0, 1, 1, 0.1], 0),   # not spam
    # ... thousands more examples
]

# The MODEL discovers the rules from these examples
# We'll build this from scratch in upcoming episodes

See the difference? You don't tell the model what makes an email spam. You show it examples and it figures out the patterns. When new spam patterns emerge, you don't rewrite rules -- you just show the model more examples.
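To make "the model discovers the rules" slightly less abstract, here is a minimal sketch of one (deliberately simple, hypothetical) way a model could learn from those toy examples: a nearest-centroid classifier. The only "rule" it ends up with is the average feature vector per class -- nobody typed that rule in:

```python
import numpy as np

# The labeled examples from above: feature vectors paired with 1=spam, 0=not spam
features = np.array([
    [1, 0, 1, 0, 0.8],
    [0, 1, 0, 1, 0.2],
    [1, 1, 0, 0, 0.9],
    [0, 0, 1, 1, 0.1],
], dtype=float)
labels = np.array([1, 0, 1, 0])

# "Training": compute the average feature vector of each class
spam_centroid = features[labels == 1].mean(axis=0)
ham_centroid = features[labels == 0].mean(axis=0)

def classify(email_features):
    # Predict whichever class average is closer (Euclidean distance)
    email_features = np.asarray(email_features, dtype=float)
    d_spam = np.linalg.norm(email_features - spam_centroid)
    d_ham = np.linalg.norm(email_features - ham_centroid)
    return "spam" if d_spam < d_ham else "not spam"

print(classify([1, 0, 0, 0, 0.85]))  # spam (resembles the spam examples)
print(classify([0, 1, 1, 1, 0.15]))  # not spam
```

Real spam filters use far better models than this, of course -- but the shape of the idea is the same: parameters derived from examples, not rules written by a programmer.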

This is a genuinely different paradigm. And it matters because some problems are easy to solve with explicit rules ("sort these numbers ascending") and others are nearly impossible ("is there a cat in this photo?").

Think about that for a second. Try writing explicit rules for cat detection. You'd need to handle every possible angle, every lighting condition, every cat breed, partial occlusion behind furniture, artistic cartoon renderings, photos taken at night... you'd never finish. And the moment someone shows your system a Sphynx cat (you know, the hairless ones), your carefully crafted "cats have fur" rule explodes.

But show a model a million photos labeled "cat" and "not cat"? It figures it out. Including the Sphynx cats. Including cats in boxes (which is, apparently, where cats belong according to the internet).

The three flavors of machine learning

Not all ML works the same way. There are three major categories, and I'll be completely honest with you: understanding when each applies is significantly more valuable than memorizing 50 algorithms. The engineers who are effective in practice are the ones who can look at a problem and immediately know which category of approach makes sense (trust me on this one -- I've seen too many people reach for a neural network when a simple linear model would have done the job in ten minutes).

Supervised learning

This is the "show me the answers" approach. You provide labeled data -- inputs paired with correct outputs -- and the model learns to map from one to the other.

Two main sub-types:

  • Regression: predicting a continuous number. "Given this house's features, what's the price?" The output is 347,000 or 521,500 -- a number on a scale.
  • Classification: predicting a category. "Given this email, is it spam or not?" The output is one of a fixed set of labels.

Let's make this concrete with a tiny regression example:

import numpy as np

# House sizes (square meters) and their prices (euros)
sizes = np.array([40, 55, 70, 85, 100, 120, 150])
prices = np.array([95000, 132000, 175000, 210000, 255000, 298000, 370000])

# A "model" is just: price = weight * size + bias
# Let's guess: weight=2500, bias=0
weight = 2500
bias = 0

predictions = weight * sizes + bias
print("Predictions vs actual:")
for s, pred, actual in zip(sizes, predictions, prices):
    print(f"  {s} sqm: predicted EUR{pred:>8,} | actual EUR{actual:>8,}")

Output:

Predictions vs actual:
  40 sqm: predicted EUR 100,000 | actual EUR  95,000
  55 sqm: predicted EUR 137,500 | actual EUR 132,000
  70 sqm: predicted EUR 175,000 | actual EUR 175,000
  85 sqm: predicted EUR 212,500 | actual EUR 210,000
  100 sqm: predicted EUR 250,000 | actual EUR 255,000
  120 sqm: predicted EUR 300,000 | actual EUR 298,000
  150 sqm: predicted EUR 375,000 | actual EUR 370,000

Not bad for a guess! But notice it's not perfect. The job of the training loop (which we'll build properly in upcoming episodes) is to find the best weight and bias automatically. That's literally what "training a model" means.

Supervised learning is the workhorse. Most ML you encounter in production -- spam filters, recommendation engines, price predictors, medical diagnosis -- is supervised.

Unsupervised learning

No labels. No "correct answers." You give the model data and basically say: "find structure."

Common tasks:

  • Clustering: "group these customers by behavior." You don't tell it what the groups should be -- it discovers them.
  • Dimensionality reduction: "this data has 500 features -- show me the 3 most important directions." Useful for visualization and noise removal.
  • Anomaly detection: "most of this data is normal -- find the weird ones."

Unsupervised learning is underrated. In the real world, labeled data is expensive (someone has to manually label it!). Unlabeled data is everywhere.
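To give you a feel for clustering, here's a tiny from-scratch sketch of k-means, the classic clustering algorithm, with k=2 and synthetic data (all the specifics here are invented for illustration). Notice that labels never appear anywhere:

```python
import numpy as np

# Two blobs of 2-D points; the algorithm does NOT know there are two groups
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# Initialize the two cluster centers from two of the data points
centers = points[[0, 20]].copy()

for _ in range(10):
    # Assign every point to its nearest center...
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    # ...then move each center to the mean of its assigned points
    for k in range(2):
        if np.any(assignment == k):
            centers[k] = points[assignment == k].mean(axis=0)

print(centers.round(1))  # one center ends up near (0, 0), the other near (5, 5)
```

The algorithm recovers the two groups purely from the geometry of the data. That's the essence of unsupervised learning: structure out, no answers in.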

Reinforcement learning

The "learn by doing" approach. An agent takes actions in an environment, receives rewards or penalties, and learns to maximize cumulative reward over time.

Think of it like training a dog (a bit simplified, sure, but the analogy holds): you don't show the dog exactly what to do step by step. You reward good behavior, discourage bad behavior, and the dog figures out the rest.

This is how AlphaGo learned to play Go (and beat the world champion, which was considered impossible just a few years before it happened), how robots learn to walk, and how certain trading systems learn to optimize order placement. Powerful but notoriously tricky -- the reward signal needs careful design and training can be spectacularly unstable. We'll cover RL much later in this series, once we have the foundational tools in place.
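We won't build real RL for a long while, but the core act-reward-adjust rhythm fits in a few lines. Here's a sketch of the simplest RL setting, a multi-armed bandit: an agent repeatedly picks one of three slot machines, sees only the rewards, and gradually learns which machine pays best (the payout numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_payouts = [0.2, 0.5, 0.8]  # hidden from the agent!

estimates = np.zeros(3)  # the agent's current value estimate per machine
counts = np.zeros(3)

for step in range(2000):
    # Epsilon-greedy: mostly exploit the best-known machine, sometimes explore
    if rng.random() < 0.1:
        arm = int(rng.integers(3))
    else:
        arm = int(estimates.argmax())
    reward = float(rng.random() < true_payouts[arm])  # payout: 1 or 0
    counts[arm] += 1
    # Incremental average: nudge the estimate toward the observed reward
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates.round(2))       # close to the hidden payout probabilities
print(int(estimates.argmax()))  # the agent settles on machine 2, the best one
```

No one told the agent which machine was best; it discovered that through trial, error, and reward. Scale the same idea up -- huge state spaces, delayed rewards, neural networks as function approximators -- and you're in AlphaGo territory.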

The training loop -- the heart of all ML

Regardless of which flavor you're using, the core loop is always the same. This is the single most important concept to internalize (and I'm going to say that about a few concepts in this series, but this one really is foundational ;-)):

  1. Start with a model -- a mathematical function with adjustable parameters. At first, these parameters are random. The model knows nothing.

  2. Feed it data -- show it inputs.

  3. Compare predictions to reality -- measure how wrong it is. This measurement is called the loss (or cost function). Large loss = very wrong. Small loss = getting close.

  4. Adjust the parameters -- nudge them in a direction that makes the loss smaller. This is where the "learning" actually happens.

  5. Repeat -- thousands or millions of times, each iteration getting slightly less wrong.

Let me show you this in code. We'll use our house price example and manually walk through a few training steps:

import numpy as np

# Our data
sizes = np.array([40, 55, 70, 85, 100, 120, 150], dtype=float)
prices = np.array([95000, 132000, 175000, 210000, 255000, 298000, 370000], dtype=float)

# Start with random-ish parameters
weight = 1000.0
bias = 50000.0
# With unscaled inputs (sizes in the hundreds, prices in the hundreds of
# thousands) the learning rate must be tiny -- anything much larger makes the
# updates overshoot and the loss blow up. Feature scaling (a later topic)
# is the proper fix.
learning_rate = 0.00001

# The training loop
for epoch in range(5):
    # Step 2: make predictions
    predictions = weight * sizes + bias

    # Step 3: calculate loss (Mean Squared Error)
    errors = predictions - prices
    loss = np.mean(errors ** 2)

    # Step 4: calculate gradients (how to adjust)
    grad_weight = 2 * np.mean(errors * sizes)
    grad_bias = 2 * np.mean(errors)

    # Step 4: adjust parameters
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

    print(f"Epoch {epoch+1}: loss = {loss/1e6:>8,.1f}M | weight = {weight:>8,.1f} | bias = {bias:>10,.1f}")

print(f"\nFinal model: price = {weight:.0f} * sqm + {bias:.0f}")

Output:

Epoch 1: loss =  9,387.6M | weight =  1,180.8 | bias =  50,001.6
Epoch 2: loss =  6,416.7M | weight =  1,328.7 | bias =  50,002.9
Epoch 3: loss =  4,428.2M | weight =  1,449.7 | bias =  50,003.9
Epoch 4: loss =  3,097.2M | weight =  1,548.7 | bias =  50,004.8
Epoch 5: loss =  2,206.3M | weight =  1,629.6 | bias =  50,005.4

Final model: price = 1630 * sqm + 50005

Watch what happens: the loss goes down every epoch. The weight starts at 1000 and climbs toward roughly 2500 (which is close to the real relationship we built into the data). The model is learning. Not because someone told it what the weight should be -- but because the training loop systematically reduces the error.
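In case you're wondering where those grad_weight and grad_bias lines come from: they're the partial derivatives of the MSE loss. A quick calculus sketch (with s_i the sizes and p_i the prices):

```latex
L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\left(w s_i + b - p_i\right)^2

\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\left(w s_i + b - p_i\right) s_i
\qquad
\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\left(w s_i + b - p_i\right)
```

The term (w s_i + b - p_i) is exactly the errors array in the code, so the two gradients reduce to 2 * mean(errors * sizes) and 2 * mean(errors) -- precisely the two lines in the loop.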

That's it. That's the whole thing. Every ML algorithm -- from this simple linear regression all the way to GPT-4 with its hundreds of billions of parameters -- follows this exact same loop. The differences are in:

  • What the model looks like (a single equation? a deep neural network?)
  • How the loss is calculated
  • How the parameters get adjusted (the optimizer)
  • How much data gets used and how it's batched

When someone says "we trained a model," they mean they ran this loop until the loss stopped improving significantly.

The fundamental tension: overfitting vs underfitting

This is (and I mean this sincerely) the single most important concept in all of machine learning. If you only take one thing from this episode, make it this. Everything we build in the rest of the series -- every regularization technique, every validation strategy, every architecture decision -- comes back to this tension.

Underfitting

Your model is too simple to capture the patterns in the data. Imagine trying to fit a straight line through data that follows a curve. The line can't bend -- it misses the pattern entirely. The model performs poorly on training data AND on new data.

import numpy as np

# Data that follows a curve (quadratic)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1, 4, 9, 16, 25, 36, 49, 64])  # y = x^2

# Trying to fit with a straight line (too simple!)
# Best-fit straight line (least squares): y = 9*x - 15
line_predictions = 9 * x - 15
errors = y - line_predictions

print("Underfitting -- straight line on curved data:")
for xi, yi, pred, err in zip(x, y, line_predictions, errors):
    print(f"  x={xi}: actual={yi:>3}, predicted={pred:>5.1f}, error={err:>+6.1f}")

The straight line misses the curve. It can't capture the real pattern because it's not expressive enough. This is usually easy to diagnose and fix: use a more complex model, add more features, train longer.
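To see the "more complex model" fix in action, here's a quick check (using NumPy's polyfit as a convenience -- we'll build fitting ourselves later): give the model one more degree of freedom and it captures the curve almost perfectly:

```python
import numpy as np

x = np.arange(1, 9, dtype=float)
y = x ** 2  # the curved data from above

# A degree-2 model CAN bend: least-squares fit of a*x^2 + b*x + c
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)  # approximately [1, 0, 0], i.e. y = 1*x^2 + 0*x + 0
```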

Overfitting

This is the sneaky one. Your model is too complex and memorizes the training data instead of learning the underlying pattern. It picks up noise, coincidences, and quirks that are specific to your training set but don't generalize.

import numpy as np

# True relationship: y = 2*x + some noise
np.random.seed(42)
x_train = np.array([1, 2, 3, 4, 5], dtype=float)
y_train = 2 * x_train + np.random.randn(5) * 0.5
# y_train is approximately [2.25, 3.93, 6.32, 8.76, 9.88]

# An overfitting model: memorize every training point exactly
# (using polynomial interpolation with degree = n-1)
coeffs = np.polyfit(x_train, y_train, deg=4)  # 4th degree for 5 points = perfect fit
poly_model = np.poly1d(coeffs)

# Perfect on training data!
print("Training data (memorized):")
for xi, yi in zip(x_train, y_train):
    print(f"  x={xi:.0f}: actual={yi:.2f}, predicted={poly_model(xi):.2f}")

# But terrible on NEW data
x_new = np.array([0, 6, 7])
y_true_new = 2 * x_new  # true relationship
print("\nNew data (generalization):")
for xi, y_true in zip(x_new, y_true_new):
    print(f"  x={xi:.0f}: true={y_true:.1f}, predicted={poly_model(xi):.1f}  <-- yikes")

An overfitting model aces the exam it studied for but fails every other exam. It's like a student who memorized every practice question answer (including the specific numbers used in the examples) but can't solve a slightly different version of the same problem. I've seen this happen in real projects more times than I can count -- model looks amazing in development, falls flat in production because the real world has data the model never saw during training.

The bias-variance tradeoff

These two problems represent a fundamental tradeoff:

  • Bias (underfitting): the model makes simplifying assumptions that miss real patterns
  • Variance (overfitting): the model is so flexible it fits noise along with signal

You want to be in the sweet spot: complex enough to capture real patterns, simple enough to ignore noise. Every technique we'll learn in this series -- regularization, cross-validation, dropout, ensemble methods -- is ultimately about finding this sweet spot.

This tension never goes away. Not with simple models, not with billion-parameter transformers. It's the forever challenge of ML.
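Here's a small sketch of how you detect which side of the tradeoff you're on (synthetic data, with np.polyfit standing in for "a model with adjustable capacity"): hold some data out of training and compare the errors. Training error can only go down as capacity grows; the held-out error is what tells the truth:

```python
import numpy as np

# Noisy linear data: the true pattern is y = 2x
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2 * x + rng.normal(scale=1.0, size=x.size)

# Simple holdout split: every other point is kept out of training
train_x, test_x = x[::2], x[1::2]
train_y, test_y = y[::2], y[1::2]

results = {}
for degree in (0, 1, 8):  # too simple, about right, very flexible
    model = np.poly1d(np.polyfit(train_x, train_y, deg=degree))
    train_mse = float(np.mean((model(train_x) - train_y) ** 2))
    test_mse = float(np.mean((model(test_x) - test_y) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE = {train_mse:6.2f} | test MSE = {test_mse:6.2f}")
```

The degree-0 model (a flat line) scores badly on both sets: underfitting. The degree-8 model always beats degree 1 on the training set, but that extra improvement is largely memorized noise -- the held-out score is the one to trust. This train-vs-validation comparison is the workhorse diagnostic we'll use throughout the series.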

Why data often matters more than algorithms

Here's something that surprises a lot of people: a simple model with lots of good data frequently outperforms a complex model with little data. It's been confirmed over and over in competitions, in production systems, in research.

This doesn't mean algorithms don't matter. But in practice, the limiting factor is almost always data quality and quantity, not algorithmic sophistication. The teams that win Kaggle competitions spend far more time on feature engineering and data cleaning than on trying exotic model architectures.

The implications for real-world work:

  • Before reaching for a fancier model, try getting more or better data
  • Data cleaning is unglamorous but often the highest-leverage activity
  • Understanding your data deeply is more valuable than knowing 50 algorithms shallowly

We'll revisit this theme throughout the series. It's a lesson that takes most practitioners years to internalize (I certainly took my time with it), and I want to plant the seed right here in episode one. The fancy stuff is fun to learn, but it's the boring data work that makes or breaks your project in the real world.

Where this series is heading

We're starting from these foundations and building all the way up to understanding -- and constructing -- large language models, computer vision systems, reinforcement learning agents, and production AI deployments.

But we're doing it the hard way: build from scratch first, use libraries second. When we eventually reach PyTorch or Hugging Face, you'll know exactly what's happening underneath. No black boxes.

Next episode, we'll set up our workbench -- Python and NumPy -- and make sure everything runs smoothly. If you've followed the Learn Python Series, you already have most of this in place (remember episode #11 where we first met NumPy? And #56 where we went deep on memory layout and broadcasting?). If not, no worries -- we'll set everything up from scratch. Then we start getting our hands dirty with actual data.

Sooooo, what have we learned?

  • Machine learning discovers rules from data, rather than following rules you wrote;
  • Supervised learning uses labeled examples, unsupervised learning finds structure without labels, and reinforcement learning learns through trial and reward;
  • The training loop (predict -> measure error -> adjust -> repeat) is universal across all ML -- from simple regression to GPT;
  • Overfitting (memorizing) and underfitting (oversimplifying) are the fundamental tension -- every technique exists to navigate this tradeoff;
  • Data quality and quantity often matter more than algorithm choice;
  • This series builds from scratch first, libraries second -- so you understand what's actually happening.

Thanks, and see you next time!

@scipio


