Learn AI Series (#40) - Training Neural Networks - Practical Challenges

ai-banner.png

What will I learn

  • You will learn about vanishing and exploding gradients -- why deep networks are hard to train;
  • weight initialization strategies (Xavier and He) -- keeping signals stable across layers;
  • batch normalization -- stabilizing training by normalizing layer inputs within each mini-batch;
  • dropout -- regularization by randomly disabling neurons during training;
  • learning rate scheduling -- adapting the step size during training for faster convergence;
  • early stopping -- knowing when to quit before your network memorizes the training data;
  • how all these techniques combine into the standard modern training recipe.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#40) - Training Neural Networks - Practical Challenges

You know how to build a neural network from scratch (episode #38) and train it with backpropagation (episode #39). On a toy dataset with 2 features and 500 samples, training works fine. Loss goes down, accuracy goes up, everyone's happy. But scale that network deeper -- 10 layers, 50 layers, 100 layers -- and things start breaking in ways that are absolutely maddening because nothing crashes. The loss just... doesn't go anywhere. Or it goes to NaN. Or the network trains for hours and produces predictions that are barely better than random.

This episode covers the practical techniques that make deep network training actually work. These are NOT optional nice-to-haves. They're the difference between a network that learns and one that produces garbage after 10 epochs. Every one of these techniques was developed because researchers slammed into a specific failure mode, understood the math behind why it happened, and engineered a fix. Here we go!

Vanishing and exploding gradients

The fundamental problem of deep networks comes straight from the chain rule we covered in episode #39. During backpropagation, gradients multiply at each layer. That's how the chain rule works -- dL/dw1 = dL/da_last * da_last/dz_last * ... * da1/dz1 * dz1/dw1. Each of those intermediate derivatives is a multiplication factor. If those factors are consistently less than 1, gradients shrink exponentially. After 50 layers, a gradient of 0.9^50 is about 0.005. The early layers receive a learning signal that's essentially noise. This is the vanishing gradient problem.

And the flip side: if those multiplication factors are consistently greater than 1, gradients explode exponentially. 1.1^50 is about 117. Weights shoot to infinity, loss becomes NaN, and your training run is dead. This is the exploding gradient problem.
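
To see the exploding side in action, here's a small sketch that backpropagates a gradient vector through 50 linear layers initialized slightly too large (the layer width and weight scale are just illustrative picks, not values from our earlier networks):

import numpy as np

# Backpropagate a gradient through 50 linear layers whose weights
# are initialized a bit too large (illustrative width and scale)
np.random.seed(42)
n, scale = 128, 0.15
grad = np.random.randn(n)
for layer in range(1, 51):
    W = np.random.randn(n, n) * scale
    grad = W.T @ grad  # chain rule: one multiplication per layer
    if layer % 10 == 0:
        print(f"  Layer {layer}: gradient norm = {np.linalg.norm(grad):.3e}")

The per-layer growth factor here is roughly scale * sqrt(n) ≈ 1.7, so over 50 layers the gradient norm blows up by many orders of magnitude -- the mirror image of the vanishing case.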

We touched on this in episodes #38 and #39 when we compared sigmoid vs ReLU. Let me demonstrate just how devastating the vanishing problem is with sigmoid activations:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Demonstrate gradient vanishing through sigmoid layers
gradient = 1.0
print("Gradient magnitude through 10 sigmoid layers:")
for layer in range(10):
    z = np.random.randn()
    sig_deriv = sigmoid(z) * (1 - sigmoid(z))  # max is 0.25
    gradient *= sig_deriv
    print(f"  Layer {layer}: gradient = {abs(gradient):.10f}")

print(f"\nAfter 10 layers: gradient is {abs(gradient):.2e}")
print(f"After 50 layers (worst case): 0.25^50 = {0.25**50:.2e}")
print(f"That's {1.0 / 0.25**50:.2e}x weaker than the output layer")

With sigmoid activations, the maximum derivative at any single layer is 0.25 (at z=0). In practice it's usually less, because most pre-activation values aren't exactly zero. After 10 layers, the gradient reaching the first layer is on the order of one millionth the strength of the last layer's gradient. The first layer learns a million times slower than the last layer. That's not a rounding error -- that's a fundamental training failure.

This is the reason sigmoid-based deep networks were so difficult to train before ReLU. And it's why the switch to ReLU was transformative. ReLU's derivative is either 0 (for negative inputs) or 1 (for positive inputs). For active neurons, the gradient passes through completely unchanged. No shrinking, no exploding. The gradient at layer 1 has the same magnitude as the gradient at layer 50, as long as the neurons are active. That single mathematical property -- derivative of 1 for positive inputs -- is what made deep networks practical.
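
Here's the same chain-rule product side by side for sigmoid and ReLU derivatives, reusing the sigmoid function from above. To isolate the multiplication effect I'm assuming every neuron stays active (positive pre-activation), which is the best case for both:

np.random.seed(42)
grad_sigmoid, grad_relu = 1.0, 1.0
for layer in range(50):
    z = abs(np.random.randn())                     # assume the neuron stays active
    grad_sigmoid *= sigmoid(z) * (1 - sigmoid(z))  # always <= 0.25
    grad_relu *= 1.0                               # ReLU derivative is exactly 1 for z > 0

print(f"Sigmoid chain after 50 layers: {grad_sigmoid:.2e}")
print(f"ReLU chain after 50 layers:    {grad_relu:.2e}")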

(Having said that, ReLU introduces its own problem: dying neurons. If a neuron's pre-activation is always negative, its gradient is always zero, and the neuron never recovers. It's permanently dead. In practice this rarely kills an entire network, especially with proper initialization, but it's something to watch for in very deep networks.)
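
Here's a tiny sketch of what a dead neuron looks like in practice (the -10 bias is deliberately exaggerated to make the point):

# One ReLU neuron whose bias has been pushed far negative: its
# pre-activation is below zero for every input, so its output and
# its gradient are zero everywhere -- gradient descent can't revive it.
np.random.seed(42)
X = np.random.randn(1000, 8)        # a batch of inputs
w = np.random.randn(8) * 0.1
b = -10.0                            # exaggerated "bad update"

z = X @ w + b
a = np.maximum(0, z)                 # ReLU output
relu_grad = (z > 0).astype(float)    # dReLU/dz

print(f"Inputs with positive pre-activation: {(z > 0).sum()} / {len(z)}")
print(f"All outputs zero? {np.allclose(a, 0)}")
print(f"Samples with nonzero gradient: {int(relu_grad.sum())} / {len(z)}")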

Weight initialization: starting on the right foot

We briefly mentioned He initialization in episode #38 when we built the NeuralNetwork class. Let me now show you exactly why it matters by comparing three initialization strategies through a 20-layer network:

def forward_activations(X, n_layers, init_strategy):
    """Push data through n_layers and watch what happens."""
    np.random.seed(42)
    a = X
    for i in range(n_layers):
        n_in = a.shape[1]
        if init_strategy == 'small':
            W = np.random.randn(n_in, 64) * 0.01
        elif init_strategy == 'large':
            W = np.random.randn(n_in, 64) * 1.0
        elif init_strategy == 'he':
            W = np.random.randn(n_in, 64) * np.sqrt(2.0 / n_in)
        z = a @ W
        a = np.maximum(0, z)  # ReLU
    return a

X = np.random.randn(100, 64)
print("Activation statistics after 20 ReLU layers:\n")
for strategy in ['small', 'large', 'he']:
    output = forward_activations(X, 20, strategy)
    mean_val = output.mean()
    std_val = output.std()
    dead = (output == 0).all(axis=0).sum()
    print(f"  {strategy:6s} init: mean={mean_val:.4e}, "
          f"std={std_val:.4e}, dead_neurons={dead}/64")

With small initialization (0.01 scale), the activations collapse to near-zero within a few layers. The signal vanishes. Every neuron outputs essentially the same thing (nothing), so the network can't distinguish between different inputs. With large initialization (1.0 scale), activations explode to enormous values -- and in a real training run, gradients would follow suit, producing NaN loss within a few epochs.

With He initialization (scale = sqrt(2/fan_in)), activations maintain stable variance across all 20 layers. The network starts in a regime where learning is actually possible. The factor of 2 in the numerator compensates for ReLU zeroing out roughly half the values -- if half the activations are zero, the surviving half needs twice the variance to maintain the overall signal strength.
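
You can check that factor of 2 numerically -- for zero-mean inputs, ReLU throws away about half of the signal power:

np.random.seed(42)
x = np.random.randn(1_000_000)
print(f"E[x^2]          = {np.mean(x**2):.3f}")                 # ~1.0
print(f"E[relu(x)^2]    = {np.mean(np.maximum(0, x)**2):.3f}")  # ~0.5
print(f"Fraction zeroed = {np.mean(x <= 0):.3f}")               # ~0.5

That halving of the second moment is exactly what the extra factor of 2 in He initialization compensates for.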

Xavier initialization (scale = sqrt(1/fan_in)) is the equivalent for sigmoid and tanh networks. It was proposed by Xavier Glorot and Yoshua Bengio in 2010. He initialization was proposed by Kaiming He in 2015, adding that factor of 2 specifically for ReLU. Use He for ReLU, Xavier for sigmoid/tanh. Getting this wrong doesn't produce an error message -- it silently makes your network untrainable. One of those things you learn to check first when training is mysteriously not working.

# Quick reference
fan_in = 256  # number of inputs to a layer
print("Initialization cheat sheet:")
print(f"  ReLU:    He    -> std = sqrt(2/{fan_in}) = {np.sqrt(2/fan_in):.4f}")
print(f"  Sigmoid: Xavier -> std = sqrt(1/{fan_in}) = {np.sqrt(1/fan_in):.4f}")
print(f"  Tanh:    Xavier -> std = sqrt(1/{fan_in}) = {np.sqrt(1/fan_in):.4f}")

Batch normalization

Even with proper He initialization, a problem creeps in during training. As weights change through gradient updates, the distribution of inputs to each layer shifts. A layer that started with nice zero-mean, unit-variance inputs might, after a few hundred gradient steps, receive inputs with mean 3.7 and variance 0.01. Each layer is constantly chasing a moving target. This phenomenon is called internal covariate shift (Ioffe & Szegedy, 2015), and it slows training significantly because each layer needs to continuously re-adapt to its changing input distribution.

Batch normalization fixes this by normalizing each layer's inputs within each mini-batch. It subtracts the batch mean and divides by the batch standard deviation, then applies learnable scale (gamma) and shift (beta) parameters. The gamma and beta let the network undo the normalization if that turns out to be optimal -- so batch norm doesn't constrain what the layer can represent, it just provides a better starting point.

def batch_norm(z, gamma, beta, eps=1e-5):
    """Batch normalization: normalize within the mini-batch,
    then scale and shift with learnable parameters."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mean) / np.sqrt(var + eps)
    return gamma * z_norm + beta, mean, var

# Before batch norm: shifted and scaled activations
np.random.seed(42)
z = np.random.randn(32, 64) * 5 + 3  # mean~3, std~5
print(f"Before BN: mean={z.mean():.2f}, std={z.std():.2f}")

# After: normalized
gamma = np.ones(64)    # learnable scale (initialized to 1)
beta = np.zeros(64)    # learnable shift (initialized to 0)
z_bn, _, _ = batch_norm(z, gamma, beta)
print(f"After BN:  mean={z_bn.mean():.4f}, std={z_bn.std():.4f}")

Batch norm sits between the linear transformation and the activation function: z = x @ W + b -> z_norm = BN(z) -> a = relu(z_norm). The bias b becomes redundant when using batch norm because beta absorbs its role, so you'll often see b omitted in layers that have BN.
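
Concretely, one hidden layer with batch norm looks like this (a minimal sketch reusing the batch_norm function from above; the layer sizes are just illustrative):

np.random.seed(0)
x = np.random.randn(32, 64)                       # batch of 32
W = np.random.randn(64, 128) * np.sqrt(2.0 / 64)  # He init, no bias term
gamma, beta = np.ones(128), np.zeros(128)

z = x @ W                                   # 1. linear transformation (b omitted)
z_bn, _, _ = batch_norm(z, gamma, beta)     # 2. batch norm
a = np.maximum(0, z_bn)                     # 3. ReLU activation

print(f"Pre-BN:  mean={z.mean():.3f}, std={z.std():.3f}")
print(f"Post-BN: mean={z_bn.mean():.3f}, std={z_bn.std():.3f}")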

The practical benefits are dramatic:

# Demo: compare training stability with and without batch norm
np.random.seed(42)

def simulate_layer_stats(n_layers, use_bn=False):
    """Simulate forward pass statistics with/without batch norm."""
    x = np.random.randn(64, 128)  # batch of 64
    gamma = np.ones(128)
    beta = np.zeros(128)
    stats = []
    for i in range(n_layers):
        W = np.random.randn(128, 128) * np.sqrt(2.0 / 128)
        z = x @ W
        if use_bn:
            z, _, _ = batch_norm(z, gamma, beta)
        x = np.maximum(0, z)  # ReLU
        stats.append((x.mean(), x.std()))
    return stats

stats_no_bn = simulate_layer_stats(15, use_bn=False)
stats_bn = simulate_layer_stats(15, use_bn=True)

print(f"{'Layer':>6s}  {'No BN mean':>11s}  {'No BN std':>10s}  "
      f"{'BN mean':>8s}  {'BN std':>7s}")
print("-" * 50)
for i in range(15):
    m1, s1 = stats_no_bn[i]
    m2, s2 = stats_bn[i]
    print(f"{i:>6d}  {m1:>11.4f}  {s1:>10.4f}  "
          f"{m2:>8.4f}  {s2:>7.4f}")

With batch norm, statistics stay controlled across layers. Without it, the activations can drift and either explode or collapse. Training converges faster, is less sensitive to learning rate choice, and allows deeper networks. The downside: BN introduces a dependency on batch size. Small batches produce noisy statistics. For very small batches or sequence models, layer normalization (normalizing across features instead of across the batch) is preferred -- but that's a detail for a later episode.
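
Just as a quick peek ahead (we'll cover layer norm properly in a later episode), the only structural difference is the axis you normalize over:

np.random.seed(1)
z = np.random.randn(32, 64) * 5 + 3
# Batch norm: normalize each feature across the batch (axis=0)
bn = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)
# Layer norm: normalize each sample across its features (axis=1)
ln = (z - z.mean(axis=1, keepdims=True)) / (z.std(axis=1, keepdims=True) + 1e-5)
print(f"BN: per-feature means ~ 0? {np.allclose(bn.mean(axis=0), 0, atol=1e-6)}")
print(f"LN: per-sample means ~ 0?  {np.allclose(ln.mean(axis=1), 0, atol=1e-6)}")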

Dropout: regularization through noise

Neural networks are powerful function approximators -- so powerful that they can memorize training data instead of learning generalizable patterns. If you remember episode #13 on evaluation, this is the classic overfitting problem: training accuracy is 99% but test accuracy is 75%. The network learned the noise in the training set, not the underlying pattern.

Dropout (Srivastava et al., 2014) is a beautifully simple regularization technique. During training, it randomly sets a fraction of neuron activations to zero. With a dropout rate of 0.5, each forward pass uses a random half of the neurons. The network can't rely on any single neuron being present -- it must distribute knowledge across many neurons, learning redundant representations that are robust to individual neurons being absent.

Conceptually, dropout is training an ensemble of smaller networks that share weights. Each forward pass samples a different sub-network. At inference time, you use the full network (all neurons active), which approximates the average prediction of all those sub-networks. This is similar in spirit to random forests (episode #18) where many decorrelated trees produce a better combined prediction than any individual tree.

def dropout(a, rate=0.5, training=True):
    """Inverted dropout: scale up during training
    so no adjustment needed at inference time."""
    if not training or rate == 0:
        return a
    mask = np.random.binomial(1, 1 - rate, a.shape) / (1 - rate)
    return a * mask

np.random.seed(42)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(f"Original activations: {a}")
print(f"Mean: {a.mean():.2f}")
print()

# Multiple dropout samples -- each is different
for i in range(4):
    dropped = dropout(a, rate=0.5)
    print(f"Dropout sample {i+1}:    {dropped.round(2)}")
    print(f"  Mean: {dropped.mean():.2f}  "
          f"Active: {(dropped != 0).sum()}/{len(a)}")

The division by (1 - rate) is the inverted dropout trick. During training, surviving activations are scaled up by 1 / (1 - rate) so the expected value stays the same. At inference time, you simply skip dropout (set training=False) without needing to scale anything. Every modern framework does it this way.
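
A quick sanity check of that claim -- averaging many dropout samples should give back (roughly) the original activations, which is exactly why inference needs no rescaling:

np.random.seed(0)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
avg = np.mean([dropout(a, rate=0.5) for _ in range(10_000)], axis=0)
print(f"Original activations:              {a}")
print(f"Average of 10,000 dropout samples: {avg.round(2)}")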

Typical dropout rates:

print("Typical dropout rates by layer type:")
print(f"  Input layer:              0.0 - 0.1 (rarely drop inputs)")
print(f"  Convolutional layers:     0.1 - 0.3 (spatial structure matters)")
print(f"  Fully connected hidden:   0.3 - 0.5 (the classic)")
print(f"  Just before output:       0.0 (don't randomize predictions)")
print()
print("Higher dropout = stronger regularization but slower convergence")
print("Lower dropout  = faster training but more overfitting risk")

Learning rate scheduling

The learning rate is arguably the single most important hyperparameter in neural network training. Too high and the network oscillates wildly around the optimum or diverges entirely. Too low and training crawls along for days without making meaningful progress. And here's the twist: the optimal learning rate changes during training. Early on, you want large steps to move quickly through the loss landscape. Later, when you're near a good minimum, you want small steps to fine-tune without overshooting ;-)

Learning rate schedulers automate this transition:

Step decay reduces the learning rate by a fixed factor every N epochs. Simple and effective: lr = initial_lr * decay^(epoch // step_size). The sharp drops can sometimes bump the network out of a local minimum, which can be either helpful or harmful depending on timing.

Cosine annealing smoothly decreases the learning rate following a cosine curve from the initial value to near zero. No sharp drops, fewer hyperparameters -- just the total number of epochs. This has become very popular because it works well across many architectures without much tuning.

Warmup starts with a very low learning rate and linearly increases it for the first few epochs before following another schedule. This prevents the randomly initialized network from taking destructively large steps before it's found a reasonable region of the loss landscape. Warmup has become standard practice for training large networks, especially transformers (which we'll build later in this series).

def step_decay(epoch, initial_lr, decay=0.5, step_size=30):
    """Drop LR by factor every step_size epochs."""
    return initial_lr * (decay ** (epoch // step_size))

def cosine_schedule(epoch, total_epochs, initial_lr, min_lr=1e-6):
    """Smooth cosine decay from initial_lr to min_lr."""
    return min_lr + 0.5 * (initial_lr - min_lr) * (
        1 + np.cos(np.pi * epoch / total_epochs)
    )

def warmup_cosine(epoch, warmup_epochs, total_epochs, initial_lr):
    """Linear warmup then cosine decay."""
    if epoch < warmup_epochs:
        return initial_lr * epoch / warmup_epochs
    return cosine_schedule(
        epoch - warmup_epochs,
        total_epochs - warmup_epochs,
        initial_lr
    )

# Compare schedules
epochs = np.arange(100)
print(f"{'Epoch':>6s}  {'Step':>8s}  {'Cosine':>8s}  {'Warmup+Cos':>11s}")
print("-" * 38)
for e in [0, 5, 10, 20, 30, 50, 70, 90, 99]:
    s = step_decay(e, 0.001)
    c = cosine_schedule(e, 100, 0.001)
    w = warmup_cosine(e, 10, 100, 0.001)
    print(f"{e:>6d}  {s:>8.6f}  {c:>8.6f}  {w:>11.6f}")

In practice, warmup + cosine annealing is the current standard for most deep learning training. It's robust, works well across architectures, and has few hyperparameters to tune. The warmup phase protects early training (typically 5-10% of total epochs), and the cosine decay provides a smooth, gradual transition to fine-grained optimization.

Early stopping: knowing when to quit

Training loss always decreases with more epochs (assuming the learning rate isn't catastrophically large). But validation loss -- the loss on data the network hasn't seen during training -- follows a different pattern. It decreases initially as the network learns generalizable patterns, then at some point starts increasing as the network transitions from learning patterns to memorizing training noise.

That inflection point is exactly where you want to stop. This is early stopping, and it's the simplest and most effective form of regularization for neural networks.

class EarlyStopping:
    def __init__(self, patience=10):
        """Stop training when validation loss hasn't improved
        for 'patience' consecutive epochs."""
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def check(self, val_loss, model_weights):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = [w.copy() for w in model_weights]
            return False  # keep training
        self.counter += 1
        return self.counter >= self.patience  # True = stop

# Simulated training curve showing overfitting
np.random.seed(42)
train_losses = []
val_losses = []
for epoch in range(200):
    # Training loss: always decreasing
    train_loss = 0.8 * np.exp(-epoch / 30) + 0.05
    # Validation loss: decreases then increases
    val_loss = (0.8 * np.exp(-epoch / 30) + 0.05
                + 0.001 * max(0, epoch - 80)**1.3)
    train_losses.append(train_loss)
    val_losses.append(val_loss)

best_epoch = np.argmin(val_losses)
print(f"Training loss at epoch 200: {train_losses[-1]:.4f} (still decreasing)")
print(f"Validation loss at epoch 200: {val_losses[-1]:.4f} (INCREASING)")
print(f"\nBest validation loss at epoch {best_epoch}: {val_losses[best_epoch]:.4f}")
print(f"Training loss at epoch {best_epoch}: {train_losses[best_epoch]:.4f}")
print(f"\nWith patience=10, early stopping would trigger at epoch ~{best_epoch + 10}")
print(f"Saving you {200 - best_epoch - 10} epochs of wasted compute")
print(f"AND giving you better generalization (lower val loss)")

The critical detail: when early stopping triggers, you restore the best weights -- not the final weights. The final weights are overfit. The best weights (from the epoch with lowest validation loss) generalize better. This is such a common pattern that every deep learning framework has built-in early stopping callbacks.
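
Here's a minimal sketch of how the EarlyStopping class above would be wired into a training loop and how you'd restore the best weights afterwards. The "model" is just a stand-in list of arrays and the validation curve is faked -- a real run would plug in the NeuralNetwork class from episodes #38-39:

np.random.seed(0)
model_weights = [np.random.randn(4, 4), np.random.randn(4)]  # stand-in "model"
stopper = EarlyStopping(patience=5)

for epoch in range(100):
    # ... forward pass, backprop, and weight update would happen here ...
    val_loss = 0.5 * np.exp(-epoch / 10) + 0.002 * max(0, epoch - 30)  # fake curve
    if stopper.check(val_loss, model_weights):
        model_weights = stopper.best_weights  # restore the BEST weights, not the last ones
        print(f"Stopped at epoch {epoch}, best val loss {stopper.best_loss:.4f}")
        break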

Early stopping costs nothing to implement, requires no architectural changes, and implicitly selects the right amount of training for the given dataset and model. Combined with dropout and batch normalization, it forms the practical regularization toolkit for modern deep learning. It's the first thing I reach for when a model is overfitting -- if you recall episode #13, we talked about the train-test gap being the key diagnostic ;-)

Putting it all together

A modern training loop combines all five techniques into a coherent recipe:

def modern_training_recipe():
    """The standard recipe for training deep networks."""
    print("Modern deep learning training recipe:")
    print("=" * 50)
    print()
    print("1. ARCHITECTURE")
    print("   - ReLU activations for hidden layers")
    print("   - He initialization (sqrt(2/fan_in))")
    print("   - Batch norm after linear, before activation")
    print()
    print("2. REGULARIZATION")
    print("   - Dropout (0.3-0.5) after activation in FC layers")
    print("   - Early stopping (patience 5-20 epochs)")
    print("   - Optional: L2 weight decay")
    print()
    print("3. OPTIMIZATION")
    print("   - Warmup + cosine LR schedule")
    print("   - Adam optimizer (we'll cover this next episode)")
    print("   - Mini-batch gradient descent (batch size 32-256)")
    print()
    print("4. MONITORING")
    print("   - Track train AND validation loss every epoch")
    print("   - Watch for divergence (exploding gradients)")
    print("   - Watch for the train-val gap (overfitting)")

modern_training_recipe()

Each technique addresses a different failure mode:

  • He initialization prevents signal collapse at the start of training
  • Batch normalization prevents activation distribution drift during training
  • Dropout prevents overfitting by forcing redundant representations
  • Learning rate scheduling optimizes the learning trajectory over time
  • Early stopping picks the optimal stopping point before overfitting

None of them are silver bullets on their own. But together, they make training deep networks reliable. A 50-layer network that would have been practically untrainable in 2005 is now almost boring to train with this recipe. That's not because the fundamental math changed -- it's because the engineering around the math got dramatically better.

# A concrete example: combining everything
np.random.seed(42)

# Simulate a "full recipe" training run
# (Real implementation would use our NeuralNetwork class
#  from episodes #38-39 with these techniques added)
lr_initial = 0.001
warmup_epochs = 10
total_epochs = 200
patience = 15
dropout_rate = 0.3

print(f"Training config:")
print(f"  Initial LR:    {lr_initial}")
print(f"  Warmup:        {warmup_epochs} epochs")
print(f"  Total budget:  {total_epochs} epochs")
print(f"  Dropout:       {dropout_rate}")
print(f"  Early stop:    patience={patience}")
print(f"  Init:          He (ReLU network)")
print(f"  Batch norm:    Yes (every hidden layer)")
print()

# Simulate realistic training curve with all techniques
stopper = EarlyStopping(patience=patience)
best_epoch = 0

for epoch in range(total_epochs):
    lr = warmup_cosine(epoch, warmup_epochs, total_epochs, lr_initial)
    # Simulated loss (better convergence with full recipe)
    train_loss = 0.6 * np.exp(-epoch / 25) + 0.02
    val_loss = (0.65 * np.exp(-epoch / 25) + 0.03
                + 0.0005 * max(0, epoch - 100)**1.2)

    if epoch % 40 == 0:
        print(f"  Epoch {epoch:>3d}: lr={lr:.6f}, "
              f"train={train_loss:.4f}, val={val_loss:.4f}")

    weights_dummy = [np.array([train_loss])]
    if stopper.check(val_loss, weights_dummy):
        print(f"\n  Early stopping triggered at epoch {epoch}")
        print(f"  Best validation loss: {stopper.best_loss:.4f}")
        best_epoch = epoch - patience
        break

print(f"\n  Would have trained {total_epochs} epochs without early stopping")
print(f"  Saved {total_epochs - epoch} epochs of wasted compute")

What you should take away

Here's the summary -- no filler, just the essentials:

  • Vanishing gradients shrink the learning signal exponentially through layers. Sigmoid compounds this (max derivative 0.25 per layer), ReLU solves it (derivative is exactly 1 for positive inputs). This is the reason ReLU replaced sigmoid in hidden layers;
  • He initialization (scale = sqrt(2/fan_in)) keeps activation variance stable across layers in ReLU networks. Wrong initialization silently kills training -- no error message, just a network that doesn't learn;
  • Batch normalization normalizes layer inputs within each mini-batch, preventing internal covariate shift. It stabilizes training and enables deeper networks. Applied between linear transformation and activation;
  • Dropout randomly zeros neuron activations during training, forcing the network to learn distributed, redundant representations. Use 0.3-0.5 for fully connected layers. Inverted dropout (scale up during training) means no adjustment at inference;
  • Learning rate scheduling (warmup + cosine annealing) adapts step size during training: large steps early for broad exploration, small steps late for fine-tuned convergence. The current standard for most architectures;
  • Early stopping monitors validation loss and restores the best weights when improvement stops. The simplest, cheapest, most effective regularizer. Always use it;
  • These five techniques are NOT optional for deep networks. They're what makes deep network training possible at all. The math of backpropagation is the same as it was in 1986 -- what changed is the engineering that makes it work in practice.

The network we built in episodes #38-39 uses vanilla gradient descent: compute the gradient, take a step in the opposite direction, repeat. That's the simplest possible optimizer. But there's a whole family of smarter optimization algorithms that use gradient information more intelligently -- maintaining momentum, adapting per-parameter learning rates, and handling noisy gradients from mini-batches. That's the next piece of the puzzle, and it's where we're headed ;-)

Thanks for reading! Until next time!

@scipio


