Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability


What will I learn

  • what a derivative actually measures and why ML cares about it;
  • partial derivatives -- how to handle functions with multiple inputs;
  • the chain rule -- the mathematical trick that makes neural networks learnable;
  • gradient descent visualized -- walking downhill to find optimal parameters;
  • probability distributions, Bayes' theorem, and why they matter for ML;
  • maximum likelihood estimation -- the bridge from probability to loss functions;
  • how all of this connects to the training loop from episode #7.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability

Eight episodes deep. We've come a long way from "what is ML?" in episode #1. We built predictions from pure intuition in #4, formalized errors into loss functions in #6, watched a training loop converge step by step in #7, and last time in #8 we covered linear algebra -- the language that describes data as vectors and matrices, the dot product that powers every model prediction, and the matrix form that lets us scale from 2 parameters to 2,000 without changing a single line of training code. That was the what of the computation. The structure. The representation.

Today we tackle the other two mathematical pillars: calculus and probability. If linear algebra is what the model computes, calculus is how it learns. And probability is how it handles uncertainty -- which, in the real world, is absolutely everywhere.

I won't lie to you -- this is probably the most concept-heavy episode so far. But I promise you: once you see how derivatives connect to the gradients we've been computing since episode #6, something will click. The abstract formulas become concrete tools. And the chain rule -- a formula you may have memorized and forgotten in high school -- turns out to be the single reason deep learning works at all. Without it, training a neural network would be computationally impossible. Not "slow." Impossible.

Let's go.

Derivatives: measuring the rate of change

A derivative measures how much a function's output changes when you nudge its input by a tiny amount. That's the entire definition. No more, no less. If f(x) = x**2, the derivative at a specific point tells you: "right here, how steep is the curve?"

Remember the blindfolded hiker from episode #6? You're standing on a hill, you can't see the terrain, but you can feel the slope under your feet. The derivative IS that slope. It tells you the steepness and direction at your current position. Positive derivative means the ground goes uphill if you step forward. Negative means it goes downhill.

We can compute derivatives numerically -- just nudge the input slightly in both directions and measure how much the output changes:

import numpy as np

def f(x):
    return x ** 2

# Numerical derivative: nudge x slightly, measure the change
def numerical_derivative(func, x, h=1e-7):
    return (func(x + h) - func(x - h)) / (2 * h)

# The derivative of x^2 is 2*x (from calculus class)
# Let's verify at a few points
for x_val in [3.0, -1.0, 0.0, 5.0]:
    approx = numerical_derivative(f, x_val)
    exact = 2 * x_val
    print(f"  x={x_val:>5.1f}  numerical={approx:>6.2f}  exact(2x)={exact:>6.2f}")

The numerical approximation matches the exact derivative (2x) to many decimal places. That formula (f(x+h) - f(x-h)) / (2*h) is called the central difference method, and it's how you compute derivatives when you don't have a neat analytical formula. In practice, frameworks like PyTorch use a more sophisticated technique called automatic differentiation (which we'll encounter in Arc 3 of this series), but the concept is identical: measure the change in output relative to a tiny change in input.
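One practical note before moving on: the nudge size h is a trade-off. Too large and the straight-line approximation is poor; too small and floating-point rounding swamps the signal. A quick illustrative experiment (using x^3 here, because for x^2 the central difference happens to be exact and would hide the effect):

```python
def f_cubic(x):
    return x ** 3

def central_diff(func, x, h):
    # Same central difference formula as above
    return (func(x + h) - func(x - h)) / (2 * h)

x = 2.0
exact = 3 * x ** 2  # derivative of x^3 is 3x^2, so 12.0 at x=2
for h in [1e-1, 1e-4, 1e-7, 1e-13]:
    approx = central_diff(f_cubic, x, h)
    print(f"  h = {h:.0e}  approx = {approx:.10f}  error = {abs(approx - exact):.2e}")
```

Somewhere around h = 1e-4 to 1e-7 the two error sources balance out, which is why the snippets in this episode default to h=1e-7.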

Why does ML care about this? Because of the training loop from episode #7. We have a loss function that measures how wrong the model is. We want to minimize that loss. The derivative tells us which direction to move our parameters to reduce the loss. If the derivative is positive at the current point, the function is increasing -- so we move left (decrease the parameter). If the derivative is negative, the function is decreasing -- so we move right (increase the parameter). Always walk "downhill." This is exactly what we did in episodes #6 and #7, and now you know the mathematical name for the "which way is downhill" measurement ;-)

Different functions, different derivatives

Let's build some intuition for how derivatives behave. The steeper the function, the larger the derivative. Flat regions have derivatives near zero:

def square(x):
    return x ** 2

def cube(x):
    return x ** 3

def flat_region(x):
    return 0.5 * x  # Linear -- constant derivative

# Compare derivatives at the same point
x = 2.0
print(f"At x = {x}:")
print(f"  f(x) = x^2:   derivative = {numerical_derivative(square, x):.4f}  (exact: {2*x})")
print(f"  f(x) = x^3:   derivative = {numerical_derivative(cube, x):.4f}  (exact: {3*x**2})")
print(f"  f(x) = 0.5*x: derivative = {numerical_derivative(flat_region, x):.4f}  (exact: 0.5)")

# Near x=0, x^2 has a derivative near zero (bottom of the parabola)
x_near_zero = 0.001
print(f"\nAt x = {x_near_zero} (near the minimum of x^2):")
print(f"  derivative of x^2 = {numerical_derivative(square, x_near_zero):.6f}")
print(f"  (Almost zero -- we're at the bottom of the valley!)")

That last observation is critical. When the derivative is zero, you're at a critical point -- either a minimum, a maximum, or a saddle point. For loss functions, a zero derivative means the landscape is locally flat and gradient descent stops moving. In the best case, you've found the minimum of the loss. In the worst case, you're stuck on a plateau. But that's a problem for later episodes when we build more complex models ;-)
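Here's a tiny check to tell those three cases apart (toy functions chosen purely for illustration): all of them have a derivative of essentially zero at x = 0, but looking slightly left and right reveals whether it's a minimum, a maximum, or a saddle point.

```python
def numerical_derivative(func, x, h=1e-7):
    return (func(x + h) - func(x - h)) / (2 * h)

# Three toy functions, each with a critical point at x = 0
cases = [
    ("x^2  (minimum)", lambda x: x ** 2),
    ("-x^2 (maximum)", lambda x: -(x ** 2)),
    ("x^3  (saddle) ", lambda x: x ** 3),
]

for name, func in cases:
    d0 = numerical_derivative(func, 0.0)
    # Peek to either side of the critical point to classify it
    left, right = func(-0.5), func(0.5)
    print(f"  {name}  f'(0) = {d0:+.8f}   f(-0.5) = {left:+.3f}   f(0.5) = {right:+.3f}")
```

The derivative alone can't distinguish the three -- that takes curvature (second derivatives), which is beyond what we need here.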

Partial derivatives: multiple knobs at once

Real models don't have just one parameter. In episode #7, we had two (slope and intercept). In episode #8, we scaled up to 4 features using the matrix form X @ w. A large language model has billions. We need derivatives with respect to each parameter individually -- holding all the others constant. These are called partial derivatives.

The idea is simple: if you have a function of two variables like f(x, y) = x**2 + 3*x*y + y**2, the partial derivative with respect to x means "pretend y is a fixed constant and differentiate only with respect to x." Then do the same thing for y separately.

# f(x, y) = x^2 + 3*x*y + y^2
# Partial w.r.t. x (treat y as constant): df/dx = 2x + 3y
# Partial w.r.t. y (treat x as constant): df/dy = 3x + 2y

def f(x, y):
    return x**2 + 3*x*y + y**2

def partial_x(x, y, h=1e-7):
    """Partial derivative w.r.t. x: nudge only x."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-7):
    """Partial derivative w.r.t. y: nudge only y."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Test at (x=1, y=2)
x_val, y_val = 1.0, 2.0
print(f"At (x={x_val}, y={y_val}):")
print(f"  df/dx numerical: {partial_x(x_val, y_val):.4f}  exact(2x+3y): {2*x_val + 3*y_val}")
print(f"  df/dy numerical: {partial_y(x_val, y_val):.4f}  exact(3x+2y): {3*x_val + 2*y_val}")

# At a different point
x_val, y_val = -3.0, 1.0
print(f"\nAt (x={x_val}, y={y_val}):")
print(f"  df/dx numerical: {partial_x(x_val, y_val):.4f}  exact: {2*x_val + 3*y_val}")
print(f"  df/dy numerical: {partial_y(x_val, y_val):.4f}  exact: {3*x_val + 2*y_val}")

The collection of all partial derivatives is called the gradient -- a vector that points in the direction of steepest ascent. To minimize the loss, we go in the opposite direction: the negative gradient. This is exactly what we did in episodes #6 through #8, just now with a proper name for it.

def gradient(x, y):
    """The gradient vector: [df/dx, df/dy]"""
    return np.array([partial_x(x, y), partial_y(x, y)])

g = gradient(1.0, 2.0)
print(f"Gradient at (1, 2): {g}")         # [8.0, 7.0]
print(f"Negative gradient:  {-g}")        # [-8.0, -7.0] -- step this way to decrease f
print(f"Gradient magnitude: {np.linalg.norm(g):.4f}")  # How steep is it here?

The gradient magnitude tells you how steep the landscape is at the current point. Large gradient = steep slope = big changes possible. Small gradient = flat terrain = you're near a minimum (or stuck on a plateau). This is why training slows down as you approach convergence -- the gradients get smaller and smaller, and each step produces less improvement. We saw this exact pattern in episode #7 when the loss curve had that characteristic "steep drop followed by a long flat tail" shape. Now you know why ;-)

Gradient descent: walking downhill, step by step

Let's put derivatives and gradients together and build gradient descent from scratch. We'll minimize f(x, y) = x**2 + y**2 (which has its minimum at x=0, y=0 -- but let's pretend we don't know that).

def loss(params):
    x, y = params
    return x**2 + y**2

def gradient_of_loss(params):
    x, y = params
    return np.array([2*x, 2*y])  # Analytical gradient

# Start at a random point
params = np.array([5.0, 3.0])
learning_rate = 0.1

print(f"{'Step':>4s}  {'x':>8s}  {'y':>8s}  {'Loss':>10s}  {'|Gradient|':>10s}")
print("-" * 48)

for step in range(20):
    grad = gradient_of_loss(params)
    grad_magnitude = np.linalg.norm(grad)
    current_loss = loss(params)

    if step % 4 == 0 or step == 19:
        print(f"{step:>4d}  {params[0]:>8.4f}  {params[1]:>8.4f}  "
              f"{current_loss:>10.6f}  {grad_magnitude:>10.6f}")

    params = params - learning_rate * grad

print(f"\nFinal position: ({params[0]:.6f}, {params[1]:.6f})")
print(f"Final loss: {loss(params):.10f}")
print(f"(The true minimum is at (0, 0) with loss 0)")

Watch the parameters crawl toward zero. Watch the loss shrink. Watch the gradient magnitude decrease as we approach the bottom of the valley. Every concept from the training loop in episode #7 is right here -- predict (evaluate the function), compute the gradient (which way is downhill), update parameters (take a step), repeat.

The learning rate controls step size. Remember from episode #7 when we compared different learning rates -- too large and the model diverged, too small and it crawled? Same thing here. Let's verify:

# Same problem, three learning rates
for lr_label, lr in [("too small", 0.001), ("just right", 0.1), ("too large", 1.05)]:
    p = np.array([5.0, 3.0])
    print(f"\nLearning rate: {lr} ({lr_label})")

    for step in range(10):
        g = gradient_of_loss(p)
        p = p - lr * g
        l = loss(p)

        if l > 1e10:
            print(f"  Step {step}: DIVERGED! loss = {l:.0e}")
            break
        if step in [0, 4, 9]:
            print(f"  Step {step}: loss = {l:.6f}")

With lr=0.001, the loss barely budges after 10 steps. With lr=0.1, it converges nicely. With lr=1.05, the parameters overshoot and the loss grows -- the model bounces across the valley and flies off instead of settling down. Same lesson as episode #7, now in two dimensions.

The key insight here -- and I really want this to land: gradient descent doesn't require us to solve equations algebraically. We just need to compute gradients and take steps. This is what makes it work for billion-parameter models where solving equations analytically is utterly impossible. You could never write down the formula for the minimum of a loss function with 175 billion parameters. But you can compute the gradient at the current point, take a step, and repeat. And that's enough.

The chain rule: why deep learning works

Alright. This is the most important piece of math in this entire series. I'm not exaggerating. If you remember one thing from this episode, make it the chain rule.

The chain rule says: if you have a composition of functions -- f(g(x)) -- the derivative of the whole thing is the product of the individual derivatives:

d/dx f(g(x)) = f'(g(x)) * g'(x)

In words: the derivative of the outer function (evaluated at the inner function's output), multiplied by the derivative of the inner function. You chain the derivatives together. Hence the name.

Let's verify numerically:

# f(g(x)) where g(x) = 3x + 1 and f(z) = z^2
# So f(g(x)) = (3x + 1)^2
# Chain rule: f'(g(x)) * g'(x) = 2*(3x+1) * 3 = 6*(3x+1)

def g(x):
    return 3 * x + 1

def f(z):
    return z ** 2

def composed(x):
    return f(g(x))   # (3x + 1)^2

x = 2.0
# Numerical derivative of the composition
numerical = numerical_derivative(composed, x)

# Chain rule: f'(g(x)) * g'(x) = 2*g(x) * 3 = 2*(3*2+1) * 3 = 42
chain_rule = 2 * g(x) * 3

print(f"f(g(x)) = (3x + 1)^2")
print(f"At x = {x}:")
print(f"  Numerical derivative: {numerical:.4f}")
print(f"  Chain rule:           {chain_rule:.4f}")

Beautiful match. But why does this matter so much for ML?

Because a neural network IS a composition of functions. Layer 1 takes the input and produces an intermediate result. Layer 2 takes that intermediate result and produces another. Layer 3 takes that and produces the final output. When we need to compute how the loss depends on a weight in Layer 1, we need to trace the chain of influence through every subsequent layer. The chain rule lets us do exactly that.

# A tiny neural network: input -> layer1 -> layer2 -> loss
# z1 = w1 * x          (layer 1)
# z2 = w2 * z1         (layer 2)
# L  = (z2 - target)^2 (loss)

# How does the loss change when we adjust w1?
# By the chain rule:
# dL/dw1 = dL/dz2 * dz2/dz1 * dz1/dw1

x = 2.0
target = 1.0
w1, w2 = 0.5, 0.3

# Forward pass -- compute each layer's output
z1 = w1 * x        # 1.0
z2 = w2 * z1       # 0.3
L = (z2 - target)**2  # 0.49

print("Forward pass:")
print(f"  z1 = w1 * x = {w1} * {x} = {z1}")
print(f"  z2 = w2 * z1 = {w2} * {z1} = {z2}")
print(f"  L  = (z2 - target)^2 = ({z2} - {target})^2 = {L}")

# Backward pass -- chain rule, layer by layer
dL_dz2 = 2 * (z2 - target)    # How loss changes with z2
dz2_dz1 = w2                   # How z2 changes with z1
dz1_dw1 = x                    # How z1 changes with w1

dL_dw1 = dL_dz2 * dz2_dz1 * dz1_dw1

print(f"\nBackward pass (chain rule):")
print(f"  dL/dz2  = 2*(z2 - target) = {dL_dz2}")
print(f"  dz2/dz1 = w2 = {dz2_dz1}")
print(f"  dz1/dw1 = x = {dz1_dw1}")
print(f"  dL/dw1  = {dL_dz2} * {dz2_dz1} * {dz1_dw1} = {dL_dw1}")

# Verify numerically
def compute_loss(w1_val):
    z1_val = w1_val * x
    z2_val = w2 * z1_val
    return (z2_val - target)**2

numerical_grad = numerical_derivative(compute_loss, w1)
print(f"\nNumerical verification: {numerical_grad:.4f}")
print(f"Chain rule result:      {dL_dw1:.4f}")

That backward pass -- going from the loss back through each layer to compute how every weight affects the final error -- is backpropagation. The algorithm that trains neural networks. It's "just" the chain rule applied systematically, layer by layer. We'll implement it from scratch in Arc 3 when we build neural networks. For now, the principle is what matters: the chain rule lets us compute how any parameter, no matter how deep in the network, affects the final loss.

Why the chain rule enables depth

Having said that, let me make sure the significance really lands. Without the chain rule, we'd have to compute each parameter's effect on the loss independently -- essentially nudging every parameter one at a time and measuring the change. For a model with a million parameters, that means a million separate forward passes per training step. For a billion-parameter model? Forget it. Computationally dead on arrival.

The chain rule gives us a shortcut: compute local derivatives at each layer and multiply them together. One forward pass, one backward pass, and you have gradients for ALL parameters. A backward pass costs roughly as much as a forward pass, so the total is about two passes -- instead of one extra forward pass per parameter. This is the mathematical reason we can train deep networks with hundreds of layers and billions of weights.

# Demonstrating: chain rule scales linearly with depth
# A network with N layers -- forward pass then backward pass

def deep_network_gradients(n_layers):
    """Simulate forward + backward pass through N layers."""
    np.random.seed(42)
    x = 2.0
    target = 1.0
    weights = np.random.randn(n_layers) * 0.5

    # Forward pass: z[i] = w[i] * z[i-1]
    z = [x]
    for i in range(n_layers):
        z.append(weights[i] * z[-1])

    loss = (z[-1] - target) ** 2

    # Backward pass: chain rule
    dL_dz = 2 * (z[-1] - target)  # gradient at output

    gradients = []
    for i in range(n_layers - 1, -1, -1):
        # dL/dw[i] = dL/dz[i+1] * dz[i+1]/dw[i]
        # dz[i+1]/dw[i] = z[i]  (the input to this layer)
        dL_dw = dL_dz * z[i]
        gradients.append(dL_dw)

        # Propagate gradient backward: dL/dz[i] = dL/dz[i+1] * w[i]
        dL_dz = dL_dz * weights[i]

    return loss, len(gradients)

# Chain rule works regardless of depth
for depth in [2, 5, 10, 50, 100]:
    loss, n_grads = deep_network_gradients(depth)
    print(f"  {depth:>3d} layers: loss = {loss:>12.4f}, "
          f"computed {n_grads} gradients in ONE backward pass")

One backward pass computes gradients for every layer, regardless of depth. 2 layers or 100 layers -- same algorithm, same structure. When people talk about "vanishing gradients" or "exploding gradients" -- problems we'll cover in later episodes -- they're talking about what happens when you multiply many small (or large) numbers together in this chain. The chain rule giveth and the chain rule taketh away. But the computational feasibility? That's a gift that never stops giving.
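A quick taste of that failure mode -- a deliberately oversimplified illustration (real per-layer derivatives vary, but the multiplicative effect is the same): chain 50 factors that sit just below or just above 1.

```python
import numpy as np

n_layers = 50
for factor, label in [(0.9, "factors ~0.9 -> vanishing"),
                      (1.1, "factors ~1.1 -> exploding")]:
    # The chain rule multiplies one local derivative per layer
    chained = np.prod(np.full(n_layers, factor))
    print(f"  {label}: product over {n_layers} layers = {chained:.3e}")
```

0.9^50 is about 0.005 and 1.1^50 is over 100 -- tiny per-layer differences compound dramatically with depth.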

Connecting calculus to everything we've built

Let me pause and connect all of this back to episodes #6 through #8, because the pieces fit together like a puzzle.

In episode #6, we had the gradient formulas for slope and intercept. I told you "don't worry about where these come from yet." Now you know: they come from taking the partial derivative of the MSE loss function with respect to each parameter.

In episode #7, we watched those gradients drive the training loop -- parameters moving downhill, loss decreasing, convergence happening. That was gradient descent, powered by derivatives.

In episode #8, we wrote gradient = -2 / len(y) * (X.T @ errors) -- that matrix expression computes the partial derivatives for ALL weights simultaneously using linear algebra. Calculus gave us the formula. Linear algebra gave us the implementation.

# The MSE gradient formula, derived from calculus:
# MSE = (1/n) * sum((y - X@w)^2)
# d(MSE)/dw = -2/n * X.T @ (y - X@w)

# Let's verify this matches numerical differentiation

X = np.array([
    [65.0,  2.0, 3.0, 1.0],
    [82.0,  3.0, 1.0, 1.0],
    [45.0,  1.0, 5.0, 1.0],
    [120.0, 4.0, 2.0, 1.0],
    [55.0,  2.0, 4.0, 1.0],
], dtype=np.float64)

y = np.array([185000, 210000, 145000, 320000, 165000], dtype=np.float64)
w = np.array([1500.0, 10000.0, -5000.0, 20000.0])

# Analytical gradient (from calculus)
errors = y - X @ w
analytical_grad = -2 / len(y) * (X.T @ errors)

# Numerical gradient (nudge each weight)
numerical_grad = np.zeros_like(w)
h = 1e-4
for i in range(len(w)):
    w_plus = w.copy()
    w_plus[i] += h
    w_minus = w.copy()
    w_minus[i] -= h
    loss_plus = ((y - X @ w_plus) ** 2).mean()
    loss_minus = ((y - X @ w_minus) ** 2).mean()
    numerical_grad[i] = (loss_plus - loss_minus) / (2 * h)

print("Gradient verification (analytical vs numerical):\n")
for i, name in enumerate(["sqm", "rooms", "floor", "bias"]):
    print(f"  w_{name:>5s}:  analytical = {analytical_grad[i]:>12.2f}  "
          f"numerical = {numerical_grad[i]:>12.2f}")

They match. The analytical formula and the numerical nudge-and-measure approach give the same answer. But the analytical version is astronomically faster -- one matrix multiplication vs. one forward pass per weight. This is why we derive gradient formulas instead of just nudging: the math makes it fast. And for models with millions of weights, "fast" is the difference between "trains in hours" and "trains in centuries."

Probability: handling uncertainty

Alright, let's shift gears. Calculus tells us how to learn (gradient descent). Now we need a framework for dealing with the fact that the real world is noisy and uncertain. That framework is probability.

Remember the noise we talked about in episode #5? The apartment prices that didn't fall perfectly on a line because of location, condition, seller urgency, and a hundred other unmeasured factors? Probability gives us a rigorous way to model that noise and reason about uncertainty.

Probability distributions

A probability distribution describes how likely each outcome is. Two distributions that you'll encounter constantly in ML:

Normal (Gaussian) distribution -- the famous bell curve. Data clusters around a mean with a specific spread called the standard deviation. The more standard deviations away from the mean, the less likely an observation is.

# Sample from a normal distribution
np.random.seed(42)
samples = np.random.normal(loc=0.0, scale=1.0, size=10000)

print("Normal distribution (mean=0, std=1):")
print(f"  Sample mean: {samples.mean():.4f}  (expected: 0.0)")
print(f"  Sample std:  {samples.std():.4f}  (expected: 1.0)")

# How many samples fall within 1, 2, 3 standard deviations?
for n_std in [1, 2, 3]:
    within = np.abs(samples) < n_std
    pct = within.mean() * 100
    print(f"  Within {n_std} std: {pct:.1f}%")

# In ML: weight initialization often uses normal distribution
# Small random weights, centered at zero
weights_init = np.random.normal(0, 0.01, size=(4, 3))
print(f"\nRandom weight initialization (4x3 matrix):")
print(f"  Mean:  {weights_init.mean():.6f}")
print(f"  Std:   {weights_init.std():.6f}")
print(f"  Range: [{weights_init.min():.6f}, {weights_init.max():.6f}]")

You'll see ~68% of samples within 1 standard deviation, ~95% within 2, and ~99.7% within 3. This is the "68-95-99.7 rule" and it shows up everywhere in statistics. In ML, we use normal distributions for weight initialization (small random values centered at zero), for modeling noise in data, and as assumptions behind many loss functions.

Uniform distribution -- all outcomes equally likely within a range. Used for random initialization, random sampling, and various randomization techniques:

# Uniform distribution: all values equally likely in [0, 1]
uniform_samples = np.random.uniform(low=0.0, high=1.0, size=10000)
print("Uniform distribution [0, 1]:")
print(f"  Mean:  {uniform_samples.mean():.4f}  (expected: 0.5)")
print(f"  Min:   {uniform_samples.min():.4f}")
print(f"  Max:   {uniform_samples.max():.4f}")

# Uniform is also used for train/test splitting -- random selection
# "Shuffle the data randomly" = uniform random permutation
indices = np.random.permutation(100)
train_idx = indices[:80]
test_idx = indices[80:]
print(f"\nRandom 80/20 split: {len(train_idx)} train, {len(test_idx)} test")

Conditional probability and Bayes' theorem

Conditional probability asks: "what's the probability of A, given that B already happened?" Written as P(A|B). This is different from just P(A), because knowing that B happened changes the picture.

Bayes' theorem lets you flip the conditional:

P(A|B) = P(B|A) * P(A) / P(B)

In ML terms: you observe some data (B), and you want to know the probability of a hypothesis (A). Bayes' theorem lets you update your belief based on evidence. This is the foundation of an entire branch of ML called Bayesian methods.

Let's make it concrete with a classic example:

# Spam filter using Bayes' theorem
#
# P(spam) = 0.3                        (30% of emails are spam)
# P("free money" | spam) = 0.8         (80% of spam has "free money")
# P("free money" | not spam) = 0.01    (1% of real email has "free money")
#
# Question: if an email contains "free money", what's P(spam | "free money")?

p_spam = 0.3
p_words_given_spam = 0.8
p_words_given_not_spam = 0.01

# P("free money") = P(words|spam)*P(spam) + P(words|not spam)*P(not spam)
p_words = p_words_given_spam * p_spam + p_words_given_not_spam * (1 - p_spam)

# Bayes' theorem
p_spam_given_words = (p_words_given_spam * p_spam) / p_words

print("Spam filter (Bayes' theorem):")
print(f"  Prior P(spam):                    {p_spam:.2f}")
print(f"  P('free money' | spam):           {p_words_given_spam:.2f}")
print(f"  P('free money' | not spam):       {p_words_given_not_spam:.2f}")
print(f"  P('free money'):                  {p_words:.4f}")
print(f"  P(spam | 'free money'):           {p_spam_given_words:.4f}")
print(f"\n  -> {p_spam_given_words*100:.1f}% chance it's spam!")

A 97.2% chance it's spam, given the words "free money." This is basically how Naive Bayes classifiers work -- a technique we'll cover properly in the ML algorithms arc. The key principle extends throughout ML: updating beliefs with evidence. You start with a prior (30% of email is spam), observe evidence ("free money" in the text), and compute a posterior (97.2% chance it's spam given that evidence). The evidence dramatically shifted our belief.

What if we see more evidence?

The beauty of Bayes' theorem is that you can chain it. The posterior from one observation becomes the prior for the next:

# Now the email also contains "congratulations winner"
# P("congratulations winner" | spam) = 0.6
# P("congratulations winner" | not spam) = 0.005

p_new_words_given_spam = 0.6
p_new_words_given_not_spam = 0.005

# Use the posterior from before as the new prior
new_prior = p_spam_given_words  # 0.9717

p_new_words = (p_new_words_given_spam * new_prior +
               p_new_words_given_not_spam * (1 - new_prior))

updated_posterior = (p_new_words_given_spam * new_prior) / p_new_words

print(f"After seeing 'congratulations winner' too:")
print(f"  Updated P(spam) = {updated_posterior:.6f}")
print(f"  -> {updated_posterior*100:.4f}% chance it's spam")

Each new piece of evidence makes us more confident. This sequential updating is powerful and shows up all over the place -- not just in spam filters, but in medical diagnosis, fault detection, recommendation systems, and anywhere else you need to combine multiple sources of uncertain information.

Maximum likelihood: the bridge from probability to loss functions

Here's where probability connects directly to everything we've been doing since episode #6. Most ML training can be framed as maximum likelihood estimation (MLE): find the parameters that make the observed data most probable under your model.

That sounds abstract. Let me make it concrete. Suppose you measure 5 apartment prices and assume they come from a normal distribution. What mean and standard deviation best describe your data?

# Five observed prices
observed = np.array([185000, 210000, 145000, 320000, 165000], dtype=np.float64)

# Maximum likelihood estimates for a normal distribution
# (these are the values that make the data most probable)
ml_mean = observed.mean()
ml_std = observed.std()  # ML estimate (not Bessel-corrected)

print(f"Observed prices: {observed}")
print(f"ML estimate of mean: EUR {ml_mean:,.0f}")
print(f"ML estimate of std:  EUR {ml_std:,.0f}")

# The log-likelihood: how "probable" is the data given these parameters?
def log_likelihood(data, mu, sigma):
    """Log probability of data under normal(mu, sigma)."""
    n = len(data)
    return -n/2 * np.log(2 * np.pi * sigma**2) - np.sum((data - mu)**2) / (2 * sigma**2)

ll_best = log_likelihood(observed, ml_mean, ml_std)
ll_wrong_mean = log_likelihood(observed, 250000, ml_std)

print(f"\nLog-likelihood at ML estimate:   {ll_best:.2f}")
print(f"Log-likelihood at wrong mean:    {ll_wrong_mean:.2f}")
print(f"(Higher is better -- ML estimate wins)")

Now here's the punchline. The loss functions we've been minimizing -- MSE, in particular -- are derived from maximum likelihood. When you minimize MSE, you're maximizing the likelihood that your model generated the observed data, assuming Gaussian noise. Let me show you why:

# MSE and negative log-likelihood are proportional!
#
# If we assume: y = model(x) + noise, where noise ~ Normal(0, sigma)
# Then: P(y | x, w) = Normal(model(x), sigma)
# Log-likelihood: -n/2 * log(2*pi*sigma^2) - sum((y - pred)^2) / (2*sigma^2)
#
# Maximizing this = minimizing sum((y - pred)^2) = minimizing MSE!
# The constants and sigma don't depend on the model weights,
# so they don't affect which weights are optimal.

# Demonstration with our apartment model
X_simple = np.array([65, 82, 45, 120, 55], dtype=np.float64)
y_actual = observed

# Two candidate models
w_good = np.array([2500.0, 15000.0])   # price = 2500*sqm + 15000
w_bad = np.array([1000.0, 100000.0])   # price = 1000*sqm + 100000

pred_good = w_good[0] * X_simple + w_good[1]
pred_bad = w_bad[0] * X_simple + w_bad[1]

mse_good = ((y_actual - pred_good) ** 2).mean()
mse_bad = ((y_actual - pred_bad) ** 2).mean()

# Log-likelihoods (assuming sigma = 30000; each prediction is its own mean)
sigma = 30000
ll_good = log_likelihood(y_actual, pred_good, sigma)
ll_bad = log_likelihood(y_actual, pred_bad, sigma)

print("Model comparison:")
print(f"  Good model -- MSE: {mse_good:>14,.0f}  log-lik: {ll_good:.2f}")
print(f"  Bad model  -- MSE: {mse_bad:>14,.0f}  log-lik: {ll_bad:.2f}")
print(f"\n  Lower MSE = higher likelihood. Same optimum!")

This connection between probability and optimization is deep. When you minimize MSE, you're implicitly assuming the noise in your data is normally distributed. When you minimize cross-entropy loss (which we'll use for classification problems), you're assuming a different probability model. The loss function isn't arbitrary -- it encodes probabilistic assumptions about your data. You don't need to think about this every day, but knowing it exists helps you understand why certain loss functions are used for certain problems, and why using the wrong one can give you poor results.
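To give that last point some substance ahead of the classification episodes, here's a sketch with made-up labels and probabilities: if the targets are 0/1 and the model outputs a probability p, the negative log-likelihood under a Bernoulli model is exactly the binary cross-entropy loss.

```python
import numpy as np

# Made-up binary labels and two candidate sets of predicted probabilities
y = np.array([1, 0, 1, 1, 0], dtype=np.float64)
p_good = np.array([0.9, 0.1, 0.8, 0.7, 0.2])   # confident and mostly right
p_bad = np.array([0.4, 0.6, 0.5, 0.3, 0.7])    # hedging or wrong

def binary_cross_entropy(y, p):
    # Bernoulli likelihood: P(y|p) = p^y * (1-p)^(1-y)
    # Negative log of that, averaged over the samples: binary cross-entropy
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

print(f"Cross-entropy, good model: {binary_cross_entropy(y, p_good):.4f}")
print(f"Cross-entropy, bad model:  {binary_cross_entropy(y, p_bad):.4f}")
```

Lower cross-entropy = higher likelihood under the Bernoulli model -- the same MLE story as MSE and Gaussian noise, just with a different distribution.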

How all three pillars connect

Let me tie the two math episodes together with everything that came before. This is the complete picture:

  1. Episodes #1-7: We built intuition -- what ML is, what data looks like, how predictions work, loss functions, gradient descent, the complete training loop running from start to finish
  2. Episode #8: Linear algebra -- data is represented as vectors and matrices, transformed by matrix multiplication, and the training loop generalizes to any number of features via X @ w
  3. Episode #9 (this one): Calculus tells us HOW to minimize the loss (gradient descent via derivatives), and probability tells us WHY certain loss functions work (maximum likelihood)

The entire training loop is now fully demystified:

# The complete training loop -- every piece now has a name

# Setup (episodes #2, #3)
np.random.seed(42)
n = 20
X_raw = np.random.uniform(30, 150, n)
noise = np.random.randn(n) * 12000            # probability: Gaussian noise assumption
y = 2500 * X_raw + 15000 + noise

# Bias trick (episode #8: linear algebra)
X = np.column_stack([X_raw, np.ones(n)])       # (n, 2) matrix
w = np.zeros(2)                                 # weight vector

lr = 1e-7
print(f"{'Epoch':>5s}  {'Loss (MSE)':>14s}  {'Slope':>10s}  {'Intercept':>10s}")
print("-" * 48)

for epoch in range(300):
    # FORWARD PASS: matrix multiplication (linear algebra)
    pred = X @ w

    # LOSS: MSE (derived from probability -- maximum likelihood)
    errors = y - pred
    loss = (errors ** 2).mean()

    # GRADIENTS: partial derivatives via chain rule (calculus)
    grad = -2 / n * (X.T @ errors)

    # UPDATE: gradient descent step (calculus + learning rate)
    w = w - lr * grad

    if epoch % 60 == 0 or epoch == 299:
        print(f"{epoch:>5d}  {loss:>14,.0f}  {w[0]:>10.1f}  {w[1]:>10.1f}")

print(f"\nLearned:  price = {w[0]:.0f} * sqm + {w[1]:.0f}")
print(f"True:     price = 2500 * sqm + 15000")

Four lines in the loop. Four concepts from four episodes. Matrix multiply (episode #8). Loss function (episode #6). Gradient via derivatives (episode #9). Parameter update (episode #7). That's all of machine learning in a nutshell.

Next episode, we put ALL of this into practice and build our first real model: linear regression, from scratch, in NumPy. No libraries hiding the math. You'll see every gradient computation, every parameter update, every prediction converging on the data. The math episodes are over -- now we build for real ;-)

What to carry forward

  • Derivatives measure how a function's output changes with its input -- the gradient points toward steepest ascent, so we go the other way to minimize loss;
  • Partial derivatives handle multiple parameters independently, and the gradient vector collects them all into one direction;
  • Gradient descent iteratively walks downhill by stepping in the negative gradient direction, controlled by a learning rate -- too large and you diverge, too small and you crawl;
  • The chain rule lets us compute gradients through compositions of functions -- this IS backpropagation, and it's the mathematical reason deep learning is computationally feasible (one backward pass instead of one forward pass per parameter);
  • Probability distributions model uncertainty; the normal distribution shows up everywhere in ML (weight initialization, noise modeling, loss derivation);
  • Bayes' theorem updates beliefs with evidence -- the foundation of Bayesian ML methods and probabilistic reasoning;
  • Maximum likelihood connects probability to loss functions: minimizing MSE = maximizing likelihood under Gaussian noise assumptions;
  • Linear algebra, calculus, and probability are the three mathematical pillars of ML -- and now you have the intuition for all three. Everything from here builds on this foundation.

We've climbed the math mountain. The view is good, but the real work starts now ;-)

@scipio


