Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation

What will I learn

  • You will learn the chain rule -- the calculus trick that makes neural networks trainable;
  • how backpropagation computes gradients layer by layer, from output back to input;
  • implementing backprop from scratch in NumPy for a multi-layer network;
  • the binary cross-entropy loss function and its derivative;
  • activation function derivatives -- sigmoid and ReLU, and why ReLU's simplicity matters for training;
  • numerical gradient checking -- how to verify your math is actually correct;
  • why backpropagation's efficiency makes modern deep learning possible;
  • training a network to solve a non-linear classification problem end to end.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation

Last episode we built a complete forward pass -- pushing data through layers of matrix multiplications and activation functions to produce a prediction. We even solved XOR with hand-crafted weights. But here's the problem: hand-crafting weights only works when you already know the solution. For any real problem -- image classification, language understanding, medical diagnosis -- nobody can hand-design the right weight values for millions of connections. The network needs to figure out its own weights by learning from data.

And to learn, the network needs to answer one critical question for every single weight: "if I nudge this weight a tiny bit, does the overall error go up or go down?" That answer is the gradient, and backpropagation is the algorithm that computes it efficiently for every weight in the network simultaneously. Gradient descent (which we've been using since episode #7) then takes those gradients and updates the weights in the direction that reduces the error.

Backpropagation is the single most important algorithm in modern AI. Not the architectures, not the data -- the training algorithm. Every neural network you've ever heard of -- GPT, DALL-E, AlphaFold, self-driving car models -- they all learn through backpropagation. The architectures vary enormously, but the learning mechanism is always the same: forward pass, compute loss, backpropagate gradients, update weights.

We'll build it from scratch in NumPy, trace every matrix operation, and verify our gradients numerically. Here we go!

The loss function: what are we minimizing?

Before we can compute gradients, we need a single number that tells us "how wrong is the network?" This is the loss function (or cost function -- same thing). For binary classification, the standard choice is binary cross-entropy, which we first saw in episode #12 with logistic regression:

L = -(1/N) * sum[y * log(y_hat) + (1 - y) * log(1 - y_hat)]

Where y is the true label (0 or 1) and y_hat is the predicted probability. When the prediction matches the label perfectly, the loss is 0. When the prediction diverges from the label, the loss grows -- and it grows fast. Predicting 0.01 when the true label is 1 gives a much larger penalty than predicting 0.3, because the log function penalizes confident wrong answers heavily. This is exactly the behavior we want: the network should be strongly pushed away from confidently wrong predictions.

import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy loss. Same as episode #12."""
    eps = 1e-15  # prevent log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# Quick sanity check
print("Loss when prediction matches label:")
print(f"  y=1, pred=0.99: {binary_cross_entropy(np.array([1]), np.array([0.99])):.4f}")
print(f"  y=0, pred=0.01: {binary_cross_entropy(np.array([0]), np.array([0.01])):.4f}")
print(f"\nLoss when prediction is WRONG:")
print(f"  y=1, pred=0.01: {binary_cross_entropy(np.array([1]), np.array([0.01])):.4f}")
print(f"  y=0, pred=0.99: {binary_cross_entropy(np.array([0]), np.array([0.99])):.4f}")
print(f"\nLoss when prediction is uncertain:")
print(f"  y=1, pred=0.50: {binary_cross_entropy(np.array([1]), np.array([0.50])):.4f}")

Now we need the derivative of the loss with respect to the prediction. This tells us: "if the prediction increases a tiny bit, does the loss go up or down?" If the true label is 1 and the prediction is 0.2 (too low), the derivative should be negative -- pointing us toward a higher prediction. If the prediction is already 0.9 (close to correct), the derivative should be small -- not much correction needed.

def bce_derivative(y_true, y_pred):
    """Derivative of BCE with respect to predictions.
    Returns dL/d(y_pred) for each sample."""
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_true)

# Verify: derivative should point toward the correct answer
print("Derivatives (negative = push prediction up, positive = push down):")
print(f"  y=1, pred=0.2: {bce_derivative(np.array([[1]]), np.array([[0.2]]))[0,0]:.4f}  (should be negative)")
print(f"  y=1, pred=0.9: {bce_derivative(np.array([[1]]), np.array([[0.9]]))[0,0]:.4f}  (small negative)")
print(f"  y=0, pred=0.8: {bce_derivative(np.array([[0]]), np.array([[0.8]]))[0,0]:.4f}  (should be positive)")
print(f"  y=0, pred=0.1: {bce_derivative(np.array([[0]]), np.array([[0.1]]))[0,0]:.4f}  (small positive)")

The derivative of the loss is the starting point of backpropagation. It's the initial error signal that gets propagated backward through the network. Everything else is just applying the chain rule to push this signal through each layer.

The chain rule: the engine behind backpropagation

In episode #9, we covered the chain rule from calculus: the derivative of a composed function is the product of derivatives at each step. If f(g(h(x))) and you want df/dx, you compute: df/dg * dg/dh * dh/dx. Each piece is a local derivative -- how the output of one function changes with respect to its input -- and you multiply them along the chain.
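Before wiring this into a network, we can sanity-check the chain rule on an arbitrary three-function composition. A small standalone sketch (the functions h, g, f here are just illustrative choices, not part of the network):

```python
import numpy as np

# Compose three functions: f(g(h(x))) with h(x) = x^2, g(u) = sin(u), f(v) = exp(v)
x = 0.7
h = x ** 2                 # h(x)
g = np.sin(h)              # g(h(x))
f = np.exp(g)              # f(g(h(x)))

# Chain rule: df/dx = df/dg * dg/dh * dh/dx -- a product of local derivatives
df_dg = np.exp(g)          # derivative of exp, evaluated at g
dg_dh = np.cos(h)          # derivative of sin, evaluated at h
dh_dx = 2 * x              # derivative of x^2, evaluated at x
analytic = df_dg * dg_dh * dh_dx

# Brute-force check via a central difference
eps = 1e-6
numerical = (np.exp(np.sin((x + eps) ** 2))
             - np.exp(np.sin((x - eps) ** 2))) / (2 * eps)

print(f"analytic:  {analytic:.8f}")
print(f"numerical: {numerical:.8f}")  # agrees to many decimal places
```

Multiplying the three local derivatives gives the same answer as nudging x directly -- exactly the property backpropagation exploits at network scale.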

A neural network is exactly this: a deeply nested function composition. The loss depends on the output layer's activation, which depends on the output layer's pre-activation (the linear part z = Wa + b), which depends on the previous layer's activation, which depends on that layer's pre-activation, and so on all the way back to the input. The chain rule lets us compute the gradient of the loss with respect to any weight, anywhere in the network, by multiplying local derivatives along the path from the loss back to that weight.

This is why it's called backpropagation -- we start at the output (where we know the loss and its derivative) and propagate the error signal backward through the network, one layer at a time. Each layer receives the error signal from the layer above, computes its local gradients, passes the error signal further back, and stores the weight gradients for the update step.

Let me make this concrete with a small example before we jump into the full implementation:

# A tiny network: 1 input -> 1 hidden -> 1 output
# Forward pass:
#   z1 = x * w1 + b1       (hidden pre-activation)
#   a1 = relu(z1)           (hidden activation)
#   z2 = a1 * w2 + b2       (output pre-activation)
#   a2 = sigmoid(z2)        (output = prediction)
#   L  = bce(y, a2)         (loss)
#
# Chain rule for dL/dw1:
#   dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
#
#   dL/da2   = bce_derivative(y, a2)     -- loss derivative
#   da2/dz2  = sigmoid_derivative(z2)    -- output activation derivative
#   dz2/da1  = w2                        -- linear layer: z2 = a1*w2 + b2
#   da1/dz1  = relu_derivative(z1)       -- hidden activation derivative
#   dz1/dw1  = x                         -- linear layer: z1 = x*w1 + b1

print("Chain rule decomposition for a 2-layer network:")
print("  dL/dw1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dw1")
print("         = bce_deriv * sigmoid_deriv * w2 * relu_deriv * x")
print("  Each factor is a LOCAL derivative -- one step in the chain")
print("  Multiply them all together and you have the gradient for w1")

The beauty of backpropagation is that many of these local derivatives are shared between different weights. The error signal dL/da2 * da2/dz2 is computed once and reused for all weights in the output layer. Then dz2/da1 propagates that signal to the hidden layer, where it's reused for all weights in the hidden layer. No redundant computation -- each local derivative is computed exactly once.

Activation function derivatives

Each activation function has a derivative that backpropagation needs. We already have the functions from episode #38 -- now we add their derivatives:

def sigmoid(z):
    """Sigmoid activation: squishes to (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid: s * (1 - s).
    Maximum value is 0.25 (at z=0), approaches 0 at extremes."""
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    """ReLU: max(0, z). The modern default for hidden layers."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 if z > 0, else 0.
    Gradient flows through unchanged for positive inputs."""
    return (z > 0).astype(float)

# Compare the derivatives
z_vals = np.array([-5, -2, -1, 0, 1, 2, 5])
print(f"{'z':>5s}  {'sig_deriv':>10s}  {'relu_deriv':>11s}")
print("-" * 30)
for z in z_vals:
    sd = sigmoid_derivative(np.array([z]))[0]
    rd = relu_derivative(np.array([z]))[0]
    print(f"{z:>5.0f}  {sd:>10.6f}  {rd:>11.1f}")

print(f"\nMax sigmoid derivative: {sigmoid_derivative(np.array([0]))[0]:.4f}")
print(f"ReLU derivative for any z > 0: 1.0 (constant, never shrinks)")

This is why ReLU dominates hidden layers, as we discussed in episode #38. Sigmoid's derivative peaks at 0.25 and decays rapidly toward zero for large or small inputs. In a 10-layer network, the gradient gets multiplied by the sigmoid derivative at each layer: 0.25^10 = 0.00000095. That's the vanishing gradient problem -- by the time the error signal reaches the early layers, it's been multiplied away to practically nothing. ReLU's derivative is exactly 1 for positive inputs, so the gradient passes through without any shrinkage. This simple mathematical property is what made deep networks trainable.
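We can watch the vanishing happen directly. A toy sketch that just multiplies the per-layer derivative factors for a 10-layer stack (ignoring weights, best case for sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

depth = 10
z = 0.0  # the sigmoid's sweet spot -- its derivative is maximal here

# Sigmoid: each layer multiplies the gradient by at most 0.25
sig_deriv = sigmoid(z) * (1 - sigmoid(z))   # = 0.25 at z = 0
sig_gradient = sig_deriv ** depth

# ReLU: each layer multiplies by exactly 1 for positive inputs
relu_gradient = 1.0 ** depth

print(f"Gradient after {depth} sigmoid layers (best case): {sig_gradient:.2e}")  # 9.54e-07
print(f"Gradient after {depth} ReLU layers (z > 0):        {relu_gradient:.1f}")
```

And that's the best case -- away from z = 0 the sigmoid factors are smaller than 0.25, so the real signal dies even faster.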

Implementing the backward pass

Now we build the complete neural network with both forward and backward passes. We extend the NeuralNetwork class from episode #38:

class NeuralNetwork:
    def __init__(self, layer_sizes):
        """Initialize network with He initialization.
        layer_sizes: e.g. [2, 16, 8, 1] for 2 inputs, two hidden, 1 output."""
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            scale = np.sqrt(2.0 / layer_sizes[i])  # He init for ReLU
            self.weights.append(
                np.random.randn(layer_sizes[i], layer_sizes[i+1]) * scale
            )
            self.biases.append(np.zeros(layer_sizes[i+1]))

    def forward(self, X):
        """Forward pass: store activations and pre-activations
        for backpropagation (same as episode #38)."""
        self.activations = [X]
        self.pre_activations = []
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            z = self.activations[-1] @ W + b
            self.pre_activations.append(z)
            if i < len(self.weights) - 1:
                a = relu(z)         # hidden layers: ReLU
            else:
                a = sigmoid(z)      # output layer: sigmoid
            self.activations.append(a)
        return self.activations[-1]

    def backward(self, y_true, lr=0.01):
        """Backward pass: compute gradients and update weights."""
        # Step 1: output layer error signal
        # delta = dL/dz for the output layer
        # This combines the loss derivative and the sigmoid derivative
        delta = (bce_derivative(y_true, self.activations[-1])
                 * sigmoid_derivative(self.pre_activations[-1]))

        # Step 2: propagate backward through each layer
        for i in range(len(self.weights) - 1, -1, -1):
            # Weight gradient: how each weight contributed to the error
            dW = self.activations[i].T @ delta

            # Bias gradient: error signal summed over the batch
            # (delta already carries the 1/N factor from bce_derivative)
            db = delta.sum(axis=0)

            # Propagate error to previous layer (if not at input)
            if i > 0:
                delta = ((delta @ self.weights[i].T)
                         * relu_derivative(self.pre_activations[i-1]))

            # Update weights and biases
            self.weights[i] -= lr * dW
            self.biases[i] -= lr * db

Let me walk through each step in detail, because this is where the conceptual magic happens:

Step 1 -- Output layer delta. We combine the loss derivative (dL/d(prediction)) with the sigmoid derivative (d(prediction)/d(z_output)) to get dL/dz for the output layer. This is the initial error signal -- it tells us how the output layer's pre-activation values need to change to reduce the loss. The chain rule at work: dL/dz = dL/da * da/dz.

Step 2 -- Weight gradients. For each layer, we compute dW = activations[i].T @ delta. This is the outer product of the layer's inputs and the error signal. Think about what this means: if a particular input value was large AND the error signal is large, then the weight connecting them gets a large gradient -- it contributed significantly to the error. If the input was zero (because ReLU killed it), the weight gradient is zero regardless of the error signal -- the weight didn't participate in the error, so it shouldn't be updated.

Step 3 -- Bias gradients. db = delta.sum(axis=0). Biases shift the pre-activation uniformly, so their gradient is just the error signal summed across the batch (the averaging over samples is already baked into delta by the loss derivative). Simple.

Step 4 -- Error propagation. delta = (delta @ W.T) * relu_derivative(z). This is the key step that makes it backpropagation. We project the error signal backward through the weight matrix (delta @ W.T reverses the direction of the forward pass a @ W), then multiply by the activation derivative to account for the nonlinearity. The result is the error signal for the previous layer, and the process repeats.

Step 5 -- Weight update. Standard gradient descent: W -= lr * dW. Move each weight in the direction that reduces the loss, scaled by the learning rate.
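A useful aside on Step 1: for the sigmoid + BCE pairing, the product of the loss derivative and the sigmoid derivative simplifies algebraically to (y_hat - y) / N. A standalone sketch (functions redefined locally so it runs on its own) that confirms the identity numerically:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def bce_derivative(y_true, y_pred):
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_true)

np.random.seed(0)
z = np.random.randn(5, 1)                          # random output pre-activations
y = np.array([[1], [0], [1], [0], [1]], dtype=float)
a = sigmoid(z)                                     # predictions

# The two-factor form used in backward()...
delta_full = bce_derivative(y, a) * sigmoid_derivative(z)
# ...collapses to the simplified form
delta_simple = (a - y) / len(y)

print(np.allclose(delta_full, delta_simple))  # True
```

This is why many implementations start the backward pass directly from (predictions - y_true) / N: same math, fewer operations, and it avoids the near-division-by-zero that the two-factor form flirts with for confident predictions.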

Training on a non-linear problem

The whole point of neural networks is solving problems that linear models can't. Let's train our network on a classification task where the decision boundary is a circle -- no straight line can separate the classes:

# Generate circular decision boundary data
np.random.seed(42)
N = 500
X = np.random.randn(N, 2)
y = ((X[:, 0]**2 + X[:, 1]**2) < 1.5).astype(float).reshape(-1, 1)

print(f"Dataset: {N} points, {int(y.sum())} inside circle, "
      f"{int(N - y.sum())} outside")
print(f"This is NON-linearly separable -- no straight line can do this.")
print(f"Logistic regression from episode #12 would fail here.\n")

# Build a 3-layer network: 2 -> 16 -> 8 -> 1
nn = NeuralNetwork([2, 16, 8, 1])

# Training loop
for epoch in range(1000):
    # Forward pass
    pred = nn.forward(X)
    loss = binary_cross_entropy(y, pred)

    # Backward pass + weight update
    nn.backward(y, lr=0.1)

    if epoch % 200 == 0:
        acc = ((pred > 0.5) == y).mean()
        print(f"Epoch {epoch:>4d}: loss = {loss:.4f}, "
              f"accuracy = {acc:.1%}")

# Final evaluation
pred_final = nn.forward(X)
acc_final = ((pred_final > 0.5) == y).mean()
loss_final = binary_cross_entropy(y, pred_final)
print(f"\nFinal: loss = {loss_final:.4f}, accuracy = {acc_final:.1%}")

The network learns to approximate the circular boundary by combining multiple ReLU neurons. Each hidden neuron contributes a linear piece (a half-plane boundary), and the combination of 16 linear pieces can approximate a circle quite well. This is the universal approximation theorem in action -- we discussed it theoretically in episode #38, and now we're seeing it work in practice with learned (not hand-crafted!) weights.
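To build intuition for how ReLU pieces combine, here is a tiny standalone sketch with hand-picked (not learned) weights: three ReLUs summing into a triangular bump, the 1D analogue of how our 16 hidden neurons bend a boundary around the circle. The `bump` helper is purely illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Three ReLU pieces with hand-picked weights form a triangular bump:
# rises on [0, 1], falls on [1, 2], exactly zero everywhere else.
def bump(x):
    return relu(x) - 2 * relu(x - 1) + relu(x - 2)

for x in [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]:
    print(f"x = {x:>4.1f}: bump(x) = {bump(x):.2f}")  # peaks at x = 1
```

Each ReLU contributes a kink; add enough kinks and you can trace out any curve -- a circle, a spiral, whatever the data demands.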

Notice how the training loop is exactly the pattern from episode #7: predict, compute loss, compute gradients, update. The only difference is that now "compute gradients" means running the backward pass through multiple layers instead of computing a single derivative. Same loop, same concept, bigger network.

Watching the network learn

Let's get more visibility into what's happening during training. How fast does the network converge? How do the predictions change?

np.random.seed(42)
nn2 = NeuralNetwork([2, 16, 8, 1])

losses = []
accuracies = []
for epoch in range(2000):
    pred = nn2.forward(X)
    loss = binary_cross_entropy(y, pred)
    nn2.backward(y, lr=0.1)
    losses.append(loss)
    accuracies.append(((pred > 0.5) == y).mean())

# Show the learning curve at key points
checkpoints = [0, 10, 50, 100, 200, 500, 1000, 1999]
print(f"{'Epoch':>6s}  {'Loss':>8s}  {'Accuracy':>9s}  {'Change':>8s}")
print("-" * 38)
prev_loss = None
for cp in checkpoints:
    change = f"{losses[cp] - prev_loss:>+8.4f}" if prev_loss is not None else "       -"
    print(f"{cp:>6d}  {losses[cp]:>8.4f}  {accuracies[cp]:>8.1%}  {change}")
    prev_loss = losses[cp]

print(f"\nLoss dropped from {losses[0]:.4f} to {losses[-1]:.4f}")
print(f"Accuracy went from {accuracies[0]:.1%} to {accuracies[-1]:.1%}")

You'll see the typical training pattern: rapid improvement in the first few hundred epochs as the network finds the general structure of the decision boundary, then gradual refinement as it fine-tunes the curve. The loss keeps decreasing but the rate of decrease slows down -- this is normal, and it's one of the reasons training neural networks requires patience and good learning rate selection (which we'll cover in depth when we get to optimization algorithms).

Numerical gradient checking: trust but verify

Here's a scenario that's plagued every neural network practitioner at some point: you implement backpropagation, the loss goes down during training, the accuracy looks reasonable, and you ship it. But the gradients have a subtle bug -- maybe a sign error, maybe a wrong index -- and the network trains 10x slower than it should, or converges to a bad solution, or works on simple data but fails on complex data. Backprop bugs don't crash your program. They silently degrade performance.

Numerical gradient checking is the antidote. The idea is simple: compute the gradient two ways and compare. Analytically (via backpropagation) and numerically (via the definition of a derivative). If they match, your backprop is correct. If they diverge, you have a bug.

The numerical gradient for any weight w is: (loss(w + eps) - loss(w - eps)) / (2 * eps). Nudge the weight up, measure the loss. Nudge it down, measure the loss. The difference, divided by 2 * eps, approximates the derivative. This is slow (you have to do a full forward pass for every single weight), but it requires zero calculus and is almost impossible to get wrong.
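Before applying the formula to the network, it's worth trying it on a function whose derivative we know exactly. A tiny standalone sketch (f here is just an illustrative cubic, not a network loss):

```python
# Central difference on f(w) = w**3, whose true derivative is 3*w**2
def f(w):
    return w ** 3

w = 2.0
true_deriv = 3 * w ** 2  # = 12.0

for eps in [1e-1, 1e-3, 1e-5]:
    numerical = (f(w + eps) - f(w - eps)) / (2 * eps)
    print(f"eps = {eps:.0e}: numerical = {numerical:.10f}, "
          f"error = {abs(numerical - true_deriv):.2e}")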

def gradient_check(nn, X, y, eps=1e-5):
    """Compare analytical gradients (backprop) to numerical gradients."""
    # Run forward + backward to get analytical gradients
    nn.forward(X)

    # Recompute the backward-pass deltas by hand, without calling
    # backward(), so the weights are left untouched
    delta = (bce_derivative(y, nn.activations[-1])
             * sigmoid_derivative(nn.pre_activations[-1]))

    # Analytical gradient for the first layer's weights
    # (this pre-loop value covers single-layer nets; for deeper nets it
    # is overwritten once delta has been propagated back to layer 0)
    analytical_dW0 = nn.activations[0].T @ delta
    for i in range(len(nn.weights) - 1, 0, -1):
        delta = (delta @ nn.weights[i].T) * relu_derivative(
            nn.pre_activations[i-1]
        )
        if i == 1:
            analytical_dW0 = nn.activations[0].T @ delta

    # Numerical gradient for first layer, a few elements
    W = nn.weights[0]
    print("Gradient check (first layer weights):")
    print(f"  {'Index':>8s}  {'Analytical':>12s}  {'Numerical':>12s}  {'Diff':>10s}")
    print("  " + "-" * 48)

    checks = [(0,0), (0,1), (1,0), (1,1), (0,2)]
    for idx in checks:
        if idx[0] < W.shape[0] and idx[1] < W.shape[1]:
            original = W[idx]

            W[idx] = original + eps
            loss_plus = binary_cross_entropy(y, nn.forward(X))

            W[idx] = original - eps
            loss_minus = binary_cross_entropy(y, nn.forward(X))

            W[idx] = original
            numerical = (loss_plus - loss_minus) / (2 * eps)

            ana = analytical_dW0[idx]
            diff = abs(ana - numerical)
            ok = "OK" if diff < 1e-4 else "MISMATCH!"
            print(f"  {str(idx):>8s}  {ana:>12.8f}  "
                  f"{numerical:>12.8f}  {diff:>10.2e} {ok}")

# Test it
np.random.seed(42)
nn_check = NeuralNetwork([2, 4, 1])
X_check = np.random.randn(10, 2)
y_check = np.array([[1],[0],[1],[1],[0],[0],[1],[0],[1],[0]])
gradient_check(nn_check, X_check, y_check)

If the analytical and numerical gradients match to within ~1e-5, your backpropagation is correct. If the difference is larger than ~1e-3, you almost certainly have a bug. This check should be part of your workflow whenever you implement backprop from scratch -- run it on a small network with a small dataset, verify the gradients match, and only then scale up to real training. The few minutes it takes to write and run this check can save you hours of debugging mysterious training failures.

(Having said that, once you move to frameworks like PyTorch, the framework handles gradient computation via automatic differentiation and you don't need to check manually. But understanding backprop well enough to implement it from scratch gives you the intuition to debug training issues even when the framework is doing the math for you.)

The complete training pipeline

Let's put everything together into a clean, complete training function and run it on a more interesting dataset:

def train_network(nn, X, y, epochs=1000, lr=0.01, print_every=100):
    """Full training loop with loss and accuracy tracking."""
    history = {'loss': [], 'accuracy': []}

    for epoch in range(epochs):
        # Forward
        predictions = nn.forward(X)
        loss = binary_cross_entropy(y, predictions)

        # Backward
        nn.backward(y, lr=lr)

        # Track metrics
        acc = ((predictions > 0.5) == y).mean()
        history['loss'].append(loss)
        history['accuracy'].append(acc)

        if epoch % print_every == 0:
            print(f"  Epoch {epoch:>5d}: loss={loss:.4f}, acc={acc:.1%}")

    return history

# A harder problem: two interleaved spirals
np.random.seed(42)
n_points = 300
noise = 0.15

# Spiral 1 (class 0)
t1 = np.linspace(0, 3 * np.pi, n_points)
x1 = t1 * np.cos(t1) + np.random.randn(n_points) * noise
y1 = t1 * np.sin(t1) + np.random.randn(n_points) * noise

# Spiral 2 (class 1)
t2 = np.linspace(0, 3 * np.pi, n_points)
x2 = t2 * np.cos(t2 + np.pi) + np.random.randn(n_points) * noise
y2 = t2 * np.sin(t2 + np.pi) + np.random.randn(n_points) * noise

X_spiral = np.vstack([
    np.column_stack([x1, y1]),
    np.column_stack([x2, y2])
])
y_spiral = np.vstack([
    np.zeros((n_points, 1)),
    np.ones((n_points, 1))
])

# Normalize inputs (helps training stability)
X_spiral = (X_spiral - X_spiral.mean(axis=0)) / X_spiral.std(axis=0)

print(f"Spiral dataset: {len(X_spiral)} points, 2 classes")
print(f"This is EXTREMELY non-linear -- two spirals wound around each other.\n")

nn_spiral = NeuralNetwork([2, 64, 32, 16, 1])
print(f"Network: 2 -> 64 -> 32 -> 16 -> 1")
print(f"Total parameters: {sum(W.size + b.size for W, b in zip(nn_spiral.weights, nn_spiral.biases))}\n")

history = train_network(nn_spiral, X_spiral, y_spiral,
                         epochs=3000, lr=0.05, print_every=500)

print(f"\nFinal accuracy: {history['accuracy'][-1]:.1%}")
print(f"Final loss: {history['loss'][-1]:.4f}")

The spiral problem is a classic benchmark for neural networks. Two interleaved spirals are impossible for any linear classifier, and even non-linear methods like SVMs (episode #20) struggle without careful kernel selection. A deep enough network with enough neurons can learn the winding boundary, though it requires more capacity (neurons) and more training time than the simple circle problem.

Why backpropagation is efficient (and why that matters)

I want to emphasize something that might seem like a detail but is actually the entire reason deep learning works at scale. Consider a network with 1 million parameters (a small network by today's standards). How would you compute the gradient without backpropagation?

The naive approach: for each of the 1 million weights, nudge it by epsilon, run a full forward pass, measure the loss change. That's 1 million forward passes per gradient step. At 1000 gradient steps per epoch and 100 epochs, that's 100 billion forward passes. Completely impractical.

With backpropagation: one forward pass + one backward pass = all 1 million gradients computed simultaneously. The backward pass costs roughly the same as the forward pass (about 2x in practice because of the extra multiplications). So two forward-pass equivalents give you every gradient you need. That's a speedup factor of 500,000x compared to the naive approach.

# Demonstrate the efficiency argument
import time

nn_perf = NeuralNetwork([20, 128, 64, 1])
X_perf = np.random.randn(100, 20)
y_perf = np.random.randint(0, 2, (100, 1)).astype(float)

# Time a forward pass
start = time.time()
for _ in range(1000):
    nn_perf.forward(X_perf)
fwd_time = (time.time() - start) / 1000

# Time forward + backward
start = time.time()
for _ in range(1000):
    nn_perf.forward(X_perf)
    nn_perf.backward(y_perf, lr=0.0)  # lr=0 so weights don't change
bwd_time = (time.time() - start) / 1000

n_params = sum(W.size + b.size
               for W, b in zip(nn_perf.weights, nn_perf.biases))
print(f"Network has {n_params:,} parameters")
print(f"Forward pass:           {fwd_time*1000:.3f} ms")
print(f"Forward + backward:     {bwd_time*1000:.3f} ms")
print(f"Backward overhead:      {(bwd_time - fwd_time)*1000:.3f} ms "
      f"({(bwd_time/fwd_time - 1)*100:.0f}% of forward)")
print(f"\nNaive gradient (nudge each param): "
      f"{n_params} forward passes = {n_params * fwd_time:.1f} seconds")
print(f"Backprop gradient: 1 forward + 1 backward "
      f"= {bwd_time*1000:.1f} ms")
print(f"Speedup: ~{n_params * fwd_time / bwd_time:,.0f}x")

This efficiency is what makes training today's frontier models feasible -- GPT-4 is reported to have on the order of a trillion parameters. Without backpropagation, you'd need one forward pass per parameter per gradient step -- even on the fastest hardware, that would take centuries. With backpropagation, it takes two passes. The algorithm itself isn't complicated (we just implemented it in ~30 lines of Python), but its impact on the field is absolutely enormous. Backpropagation is to neural networks what the printing press was to literacy -- it didn't invent the concept, but it made it practical at scale.

A note on the history

Backpropagation has an interesting history. The algorithm was independently discovered multiple times -- by Paul Werbos in his 1974 PhD thesis, by David Parker in 1985, and by Rumelhart, Hinton, and Williams in their famous 1986 paper "Learning representations by back-propagating errors" in Nature. It was the 1986 paper that broke through to the broader community and demonstrated that backpropagation could train multi-layer networks to learn useful internal representations.

This was the algorithm that the field had been missing since Minsky and Papert's 1969 critique (which we covered in episode #37). The architecture for solving XOR was known since the 1960s -- multi-layer networks with hidden layers. What was missing was a practical way to train those networks. Backpropagation was the answer. It didn't immediately lead to the deep learning revolution (that had to wait for GPUs, large datasets, and algorithmic improvements like ReLU and batch normalization), but it laid the mathematical foundation that everything since has been built on.

Before you close this tab

Here's what to take away:

  • Binary cross-entropy measures how wrong probability predictions are -- perfect predictions give zero loss, confident wrong predictions get hit hardest;
  • The chain rule decomposes the gradient of a composed function into a product of local derivatives at each step -- this is the mathematical backbone of backpropagation;
  • Backpropagation applies the chain rule layer by layer, starting at the output and propagating error signals backward. For each layer: compute weight gradients from the error signal and the layer's inputs, propagate the error to the previous layer, then update the weights;
  • ReLU's derivative is 0 or 1 -- gradients flow through without shrinking. Sigmoid's derivative peaks at 0.25 and decays fast, causing gradients to vanish in deep networks. This is the reason ReLU replaced sigmoid in hidden layers;
  • Numerical gradient checking compares analytical gradients against brute-force approximation. Always run it when implementing backprop from scratch. The few minutes it costs will save hours of debugging silent gradient bugs;
  • Backpropagation's efficiency is its superpower: computing gradients for N parameters costs roughly 2 forward passes, not N forward passes. This ~500,000x speedup (for a million-parameter network) is what makes modern deep learning computationally feasible;
  • Every modern AI system learns through backpropagation. The architectures are wildly different, but the core loop -- forward pass, compute loss, backpropagate gradients, update weights -- is universal.

The network we built today learns, but it learns slowly and is sensitive to the learning rate. Pick too large a learning rate and the loss oscillates wildly; too small and training takes forever. Vanilla gradient descent isn't the whole story -- there are smarter ways to use the gradients that backpropagation computes. We'll also run into a whole zoo of practical challenges when training deeper networks: exploding gradients, overfitting, dead neurons, and more. That's where we're headed next ;-)

Bedankt en tot de volgende keer!

@scipio


