Learn AI Series (#40) - Training Neural Networks - Practical Challenges

What will I learn
- You will learn about vanishing and exploding gradients -- why deep networks are hard to train;
- weight initialization strategies (Xavier and He) -- keeping signals stable across layers;
- batch normalization -- stabilizing training by normalizing layer inputs within each mini-batch;
- dropout -- regularization by randomly disabling neurons during training;
- learning rate scheduling -- adapting the step size during training for faster convergence;
- early stopping -- knowing when to quit before your network memorizes the training data;
- how all these techniques combine into the standard modern training recipe.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges (this post)
Learn AI Series (#40) - Training Neural Networks - Practical Challenges
You know how to build a neural network from scratch (episode #38) and train it with backpropagation (episode #39). On a toy dataset with 2 features and 500 samples, training works fine. Loss goes down, accuracy goes up, everyone's happy. But scale that network deeper -- 10 layers, 50 layers, 100 layers -- and things start breaking in ways that are absolutely maddening because nothing crashes. The loss just... doesn't go anywhere. Or it goes to NaN. Or the network trains for hours and produces predictions that are barely better than random.
This episode covers the practical techniques that make deep network training actually work. These are NOT optional nice-to-haves. They're the difference between a network that learns and one that produces garbage after 10 epochs. Every one of these techniques was developed because researchers slammed into a specific failure mode, understood the math behind why it happened, and engineered a fix. Here we go!
Vanishing and exploding gradients
The fundamental problem of deep networks comes straight from the chain rule we covered in episode #39. During backpropagation, gradients multiply at each layer. That's how the chain rule works -- dL/dw1 = dL/da_last * da_last/dz_last * ... * da1/dz1 * dz1/dw1. Each of those intermediate derivatives is a multiplication factor. If those factors are consistently less than 1, gradients shrink exponentially. After 50 layers, a gradient of 0.9^50 is about 0.005. The early layers receive a learning signal that's essentially noise. This is the vanishing gradient problem.
And the flip side: if those multiplication factors are consistently greater than 1, gradients explode exponentially. 1.1^50 is about 117. Weights shoot to infinity, loss becomes NaN, and your training run is dead. This is the exploding gradient problem.
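A standard defence against exploding gradients -- not part of this episode's recipe, but worth knowing -- is gradient clipping: if the gradient's norm exceeds a threshold, rescale it back down. Here's a minimal sketch (the `clip_by_norm` helper and the threshold of 1.0 are illustrative choices, not from our NeuralNetwork class):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale a gradient whose L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# A gradient that "exploded" through 50 layers of x1.1 factors
exploded = np.full(4, 1.1**50 / 2)  # norm ~117, like the example above
print(f"Norm before clipping: {np.linalg.norm(exploded):.1f}")
clipped = clip_by_norm(exploded)
print(f"Norm after clipping:  {np.linalg.norm(clipped):.1f}")
```

The direction of the gradient is preserved; only its magnitude is capped, so the update still points downhill.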
We touched on this in episodes #38 and #39 when we compared sigmoid vs ReLU. Let me demonstrate just how devastating the vanishing problem is with sigmoid activations:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
# Demonstrate gradient vanishing through sigmoid layers
gradient = 1.0
print("Gradient magnitude through 10 sigmoid layers:")
for layer in range(10):
    z = np.random.randn()
    sig_deriv = sigmoid(z) * (1 - sigmoid(z))  # max is 0.25
    gradient *= sig_deriv
    print(f" Layer {layer}: gradient = {abs(gradient):.10f}")
print(f"\nAfter 10 layers: gradient is {abs(gradient):.2e}")
print(f"After 50 layers (worst case): 0.25^50 = {0.25**50:.2e}")
print(f"That's {1.0 / 0.25**50:.2e}x weaker than the output layer")
With sigmoid activations, the maximum derivative at any single layer is 0.25 (at z=0). In practice it's usually less, because most pre-activation values aren't exactly zero. After 10 layers, the gradient reaching the first layer is on the order of one millionth the strength of the last layer's gradient. The first layer learns a million times slower than the last layer. That's not a rounding error -- that's a fundamental training failure.
This is the reason sigmoid-based deep networks were so difficult to train before ReLU. And it's why the switch to ReLU was transformative. ReLU's derivative is either 0 (for negative inputs) or 1 (for positive inputs). For active neurons, the gradient passes through completely unchanged. No shrinking, no exploding. The gradient at layer 1 has the same magnitude as the gradient at layer 50, as long as the neurons are active. That single mathematical property -- derivative of 1 for positive inputs -- is what made deep networks practical.
(Having said that, ReLU introduces its own problem: dying neurons. If a neuron's pre-activation is always negative, its gradient is always zero, and the neuron never recovers. It's permanently dead. In practice this rarely kills an entire network, especially with proper initialization, but it's something to watch for in very deep networks.)
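Here's a minimal sketch of what a dead neuron looks like in practice (the large negative bias is an artificial way to force the failure): once the pre-activation is negative for every input, the ReLU derivative is zero on every example, so no gradient update can ever revive it.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 8)        # 1000 samples, 8 features
w = np.random.randn(8)
b = -20.0                           # large negative bias "kills" the neuron

z = X @ w + b                       # pre-activation: negative for all inputs
a = np.maximum(0, z)                # ReLU output: all zeros
relu_grad = (z > 0).astype(float)   # ReLU derivative: 0 everywhere

print(f"Active on {int(relu_grad.sum())}/{len(X)} samples")
print(f"Max gradient through this neuron: {relu_grad.max():.0f}")
```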
Weight initialization: starting on the right foot
We briefly mentioned He initialization in episode #38 when we built the NeuralNetwork class. Let me now show you exactly why it matters by comparing three initialization strategies through a 20-layer network:
def forward_activations(X, n_layers, init_strategy):
    """Push data through n_layers and watch what happens."""
    np.random.seed(42)
    a = X
    for i in range(n_layers):
        n_in = a.shape[1]
        if init_strategy == 'small':
            W = np.random.randn(n_in, 64) * 0.01
        elif init_strategy == 'large':
            W = np.random.randn(n_in, 64) * 1.0
        elif init_strategy == 'he':
            W = np.random.randn(n_in, 64) * np.sqrt(2.0 / n_in)
        z = a @ W
        a = np.maximum(0, z)  # ReLU
    return a
X = np.random.randn(100, 64)
print("Activation statistics after 20 ReLU layers:\n")
for strategy in ['small', 'large', 'he']:
    output = forward_activations(X, 20, strategy)
    mean_val = output.mean()
    std_val = output.std()
    dead = (output == 0).all(axis=0).sum()
    print(f" {strategy:6s} init: mean={mean_val:.4e}, "
          f"std={std_val:.4e}, dead_neurons={dead}/64")
With small initialization (0.01 scale), the activations collapse to near-zero within a few layers. The signal vanishes. Every neuron outputs essentially the same thing (nothing), so the network can't distinguish between different inputs. With large initialization (1.0 scale), activations explode to enormous values -- and in a real training run, gradients would follow suit, producing NaN loss within a few epochs.
With He initialization (scale = sqrt(2/fan_in)), activations maintain stable variance across all 20 layers. The network starts in a regime where learning is actually possible. The factor of 2 in the numerator compensates for ReLU zeroing out roughly half the values -- if half the activations are zero, the surviving half needs twice the variance to maintain the overall signal strength.
Xavier initialization (scale = sqrt(1/fan_in)) is the equivalent for sigmoid and tanh networks. It was proposed by Xavier Glorot and Yoshua Bengio in 2010. He initialization was proposed by Kaiming He and colleagues in 2015, adding that factor of 2 specifically for ReLU. Use He for ReLU, Xavier for sigmoid/tanh. Getting this wrong doesn't produce an error message -- it silently makes your network untrainable. One of those things you learn to check first when training is mysteriously not working.
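If you want to verify the factor-of-2 argument numerically, here's a quick check: for zero-mean, unit-variance pre-activations, ReLU passes only about half the second moment -- which is exactly what the 2 in sqrt(2/fan_in) compensates for.

```python
import numpy as np

np.random.seed(0)
z = np.random.randn(1_000_000)            # zero-mean, unit-variance pre-activations
power_in = np.mean(z**2)                  # ~1.0
power_out = np.mean(np.maximum(0, z)**2)  # ~0.5: ReLU zeroes half the signal power
print(f"E[z^2] = {power_in:.3f}, E[relu(z)^2] = {power_out:.3f}")
```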
# Quick reference
fan_in = 256 # number of inputs to a layer
print("Initialization cheat sheet:")
print(f" ReLU: He -> std = sqrt(2/{fan_in}) = {np.sqrt(2/fan_in):.4f}")
print(f" Sigmoid: Xavier -> std = sqrt(1/{fan_in}) = {np.sqrt(1/fan_in):.4f}")
print(f" Tanh: Xavier -> std = sqrt(1/{fan_in}) = {np.sqrt(1/fan_in):.4f}")
Batch normalization
Even with proper He initialization, a problem creeps in during training. As weights change through gradient updates, the distribution of inputs to each layer shifts. A layer that started with nice zero-mean, unit-variance inputs might, after a few hundred gradient steps, receive inputs with mean 3.7 and variance 0.01. Each layer is constantly chasing a moving target. This phenomenon is called internal covariate shift (Ioffe & Szegedy, 2015), and it slows training significantly because each layer needs to continuously re-adapt to its changing input distribution.
Batch normalization fixes this by normalizing each layer's inputs within each mini-batch. It subtracts the batch mean and divides by the batch standard deviation, then applies learnable scale (gamma) and shift (beta) parameters. The gamma and beta let the network undo the normalization if that turns out to be optimal -- so batch norm doesn't constrain what the layer can represent, it just provides a better starting point.
def batch_norm(z, gamma, beta, eps=1e-5):
    """Batch normalization: normalize within the mini-batch,
    then scale and shift with learnable parameters."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mean) / np.sqrt(var + eps)
    return gamma * z_norm + beta, mean, var
# Before batch norm: shifted and scaled activations
np.random.seed(42)
z = np.random.randn(32, 64) * 5 + 3 # mean~3, std~5
print(f"Before BN: mean={z.mean():.2f}, std={z.std():.2f}")
# After: normalized
gamma = np.ones(64) # learnable scale (initialized to 1)
beta = np.zeros(64) # learnable shift (initialized to 0)
z_bn, _, _ = batch_norm(z, gamma, beta)
print(f"After BN: mean={z_bn.mean():.4f}, std={z_bn.std():.4f}")
Batch norm sits between the linear transformation and the activation function: z = x @ W + b -> z_norm = BN(z) -> a = relu(z_norm). The bias b becomes redundant when using batch norm because beta absorbs its role, so you'll often see b omitted in layers that have BN.
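Putting that placement into code, a single hidden layer with batch norm looks like this (a minimal sketch with illustrative shapes; the condensed `batch_norm` here mirrors the function defined above, returning just the normalized values):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Condensed batch norm: normalize per feature, then scale and shift."""
    z_norm = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_norm + beta

np.random.seed(42)
x = np.random.randn(32, 16)                     # mini-batch of 32
W = np.random.randn(16, 8) * np.sqrt(2.0 / 16)  # He init; note: no bias b
gamma, beta = np.ones(8), np.zeros(8)

z = x @ W                        # 1. linear transform (bias omitted -- beta absorbs it)
z = batch_norm(z, gamma, beta)   # 2. normalize
a = np.maximum(0, z)             # 3. ReLU activation
print(f"Post-BN pre-activations: mean={z.mean():.4f}, std={z.std():.4f}")
```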
The practical benefits are dramatic:
# Demo: compare training stability with and without batch norm
np.random.seed(42)
def simulate_layer_stats(n_layers, use_bn=False):
    """Simulate forward pass statistics with/without batch norm."""
    x = np.random.randn(64, 128)  # batch of 64
    gamma = np.ones(128)
    beta = np.zeros(128)
    stats = []
    for i in range(n_layers):
        W = np.random.randn(128, 128) * np.sqrt(2.0 / 128)
        z = x @ W
        if use_bn:
            z, _, _ = batch_norm(z, gamma, beta)
        x = np.maximum(0, z)  # ReLU
        stats.append((x.mean(), x.std()))
    return stats
stats_no_bn = simulate_layer_stats(15, use_bn=False)
stats_bn = simulate_layer_stats(15, use_bn=True)
print(f"{'Layer':>6s} {'No BN mean':>11s} {'No BN std':>10s} "
f"{'BN mean':>8s} {'BN std':>7s}")
print("-" * 50)
for i in range(15):
    m1, s1 = stats_no_bn[i]
    m2, s2 = stats_bn[i]
    print(f"{i:>6d} {m1:>11.4f} {s1:>10.4f} "
          f"{m2:>8.4f} {s2:>7.4f}")
With batch norm, statistics stay controlled across layers. Without it, the activations can drift and either explode or collapse. Training converges faster, is less sensitive to learning rate choice, and allows deeper networks. The downside: BN introduces a dependency on batch size. Small batches produce noisy statistics. For very small batches or sequence models, layer normalization (normalizing across features instead of across the batch) is preferred -- but that's a detail for a later episode.
Dropout: regularization through noise
Neural networks are powerful function approximators -- so powerful that they can memorize training data instead of learning generalizable patterns. If you remember episode #13 on evaluation, this is the classic overfitting problem: training accuracy is 99% but test accuracy is 75%. The network learned the noise in the training set, not the underlying pattern.
Dropout (Srivastava et al., 2014) is a beautifully simple regularization technique. During training, it randomly sets a fraction of neuron activations to zero. With a dropout rate of 0.5, each forward pass uses a random half of the neurons. The network can't rely on any single neuron being present -- it must distribute knowledge across many neurons, learning redundant representations that are robust to individual neurons being absent.
Conceptually, dropout is training an ensemble of smaller networks that share weights. Each forward pass samples a different sub-network. At inference time, you use the full network (all neurons active), which approximates the average prediction of all those sub-networks. This is similar in spirit to random forests (episode #18) where many decorrelated trees produce a better combined prediction than any individual tree.
def dropout(a, rate=0.5, training=True):
    """Inverted dropout: scale up during training
    so no adjustment needed at inference time."""
    if not training or rate == 0:
        return a
    mask = np.random.binomial(1, 1 - rate, a.shape) / (1 - rate)
    return a * mask
np.random.seed(42)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(f"Original activations: {a}")
print(f"Mean: {a.mean():.2f}")
print()
# Multiple dropout samples -- each is different
for i in range(4):
    dropped = dropout(a, rate=0.5)
    print(f"Dropout sample {i+1}: {dropped.round(2)}")
    print(f" Mean: {dropped.mean():.2f} "
          f"Active: {(dropped != 0).sum()}/{len(a)}")
The division by (1 - rate) is the inverted dropout trick. During training, surviving activations are scaled up by 1 / (1 - rate) so the expected value stays the same. At inference time, you simply skip dropout (set training=False) without needing to scale anything. Every modern framework does it this way.
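You can check the expected-value claim empirically: averaged over many random masks, inverted dropout returns (approximately) the original activations, which is exactly why inference needs no rescaling.

```python
import numpy as np

def dropout(a, rate=0.5):
    """Inverted dropout, training mode only (mirrors the function above)."""
    mask = np.random.binomial(1, 1 - rate, a.shape) / (1 - rate)
    return a * mask

np.random.seed(42)
a = np.array([1.0, 2.0, 3.0, 4.0])
avg = np.mean([dropout(a, rate=0.5) for _ in range(100_000)], axis=0)
print(f"Original activations:    {a}")
print(f"Average over 100k masks: {avg.round(2)}")  # close to the original
```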
Typical dropout rates:
print("Typical dropout rates by layer type:")
print(f" Input layer: 0.0 - 0.1 (rarely drop inputs)")
print(f" Convolutional layers: 0.1 - 0.3 (spatial structure matters)")
print(f" Fully connected hidden: 0.3 - 0.5 (the classic)")
print(f" Just before output: 0.0 (don't randomize predictions)")
print()
print("Higher dropout = stronger regularization but slower convergence")
print("Lower dropout = faster training but more overfitting risk")
Learning rate scheduling
The learning rate is arguably the single most important hyperparameter in neural network training. Too high and the network oscillates wildly around the optimum or diverges entirely. Too low and training crawls along for days without making meaningful progress. And here's the twist: the optimal learning rate changes during training. Early on, you want large steps to move quickly through the loss landscape. Later, when you're near a good minimum, you want small steps to fine-tune without overshooting ;-)
Learning rate schedulers automate this transition:
Step decay reduces the learning rate by a fixed factor every N epochs. Simple and effective: lr = initial_lr * decay^(epoch // step_size). The sharp drops can sometimes bump the network out of a local minimum, which can be either helpful or harmful depending on timing.
Cosine annealing smoothly decreases the learning rate following a cosine curve from the initial value to near zero. No sharp drops, fewer hyperparameters -- just the total number of epochs. This has become very popular because it works well across many architectures without much tuning.
Warmup starts with a very low learning rate and linearly increases it for the first few epochs before following another schedule. This prevents the randomly initialized network from taking destructively large steps before it's found a reasonable region of the loss landscape. Warmup has become standard practice for training large networks, especially transformers (which we'll build later in this series).
def step_decay(epoch, initial_lr, decay=0.5, step_size=30):
    """Drop LR by factor every step_size epochs."""
    return initial_lr * (decay ** (epoch // step_size))

def cosine_schedule(epoch, total_epochs, initial_lr, min_lr=1e-6):
    """Smooth cosine decay from initial_lr to min_lr."""
    return min_lr + 0.5 * (initial_lr - min_lr) * (
        1 + np.cos(np.pi * epoch / total_epochs)
    )

def warmup_cosine(epoch, warmup_epochs, total_epochs, initial_lr):
    """Linear warmup then cosine decay."""
    if epoch < warmup_epochs:
        return initial_lr * epoch / warmup_epochs
    return cosine_schedule(
        epoch - warmup_epochs,
        total_epochs - warmup_epochs,
        initial_lr
    )
# Compare schedules
epochs = np.arange(100)
print(f"{'Epoch':>6s} {'Step':>8s} {'Cosine':>8s} {'Warmup+Cos':>11s}")
print("-" * 38)
for e in [0, 5, 10, 20, 30, 50, 70, 90, 99]:
    s = step_decay(e, 0.001)
    c = cosine_schedule(e, 100, 0.001)
    w = warmup_cosine(e, 10, 100, 0.001)
    print(f"{e:>6d} {s:>8.6f} {c:>8.6f} {w:>11.6f}")
In practice, warmup + cosine annealing is the current standard for most deep learning training. It's robust, works well across architectures, and has few hyperparameters to tune. The warmup phase protects early training (typically 5-10% of total epochs), and the cosine decay provides a smooth, gradual transition to fine-grained optimization.
Early stopping: knowing when to quit
Training loss always decreases with more epochs (assuming the learning rate isn't catastrophically large). But validation loss -- the loss on data the network hasn't seen during training -- follows a different pattern. It decreases initially as the network learns generalizable patterns, then at some point starts increasing as the network transitions from learning patterns to memorizing training noise.
That inflection point is exactly where you want to stop. This is early stopping, and it's the simplest and most effective form of regularization for neural networks.
class EarlyStopping:
    def __init__(self, patience=10):
        """Stop training when validation loss hasn't improved
        for 'patience' consecutive epochs."""
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def check(self, val_loss, model_weights):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            self.best_weights = [w.copy() for w in model_weights]
            return False  # keep training
        self.counter += 1
        return self.counter >= self.patience  # True = stop
# Simulated training curve showing overfitting
np.random.seed(42)
train_losses = []
val_losses = []
for epoch in range(200):
    # Training loss: always decreasing
    train_loss = 0.8 * np.exp(-epoch / 30) + 0.05
    # Validation loss: decreases then increases
    val_loss = (0.8 * np.exp(-epoch / 30) + 0.05
                + 0.001 * max(0, epoch - 80)**1.3)
    train_losses.append(train_loss)
    val_losses.append(val_loss)
best_epoch = np.argmin(val_losses)
print(f"Training loss after 200 epochs: {train_losses[-1]:.4f} (still decreasing)")
print(f"Validation loss after 200 epochs: {val_losses[-1]:.4f} (INCREASING)")
print(f"\nBest validation loss at epoch {best_epoch}: {val_losses[best_epoch]:.4f}")
print(f"Training loss at epoch {best_epoch}: {train_losses[best_epoch]:.4f}")
print(f"\nWith patience=10, early stopping would trigger at epoch ~{best_epoch + 10}")
print(f"Saving you {200 - best_epoch - 10} epochs of wasted compute")
print(f"AND giving you better generalization (lower val loss)")
The critical detail: when early stopping triggers, you restore the best weights -- not the final weights. The final weights are overfit. The best weights (from the epoch with lowest validation loss) generalize better. This is such a common pattern that every deep learning framework has built-in early stopping callbacks.
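In use, the pattern looks like this -- a toy run with a hand-written validation curve and a one-array "model" (the condensed class mirrors the EarlyStopping defined above). The restore step at the end is the part people forget:

```python
import numpy as np

class EarlyStopping:
    """Condensed version of the class above."""
    def __init__(self, patience=10):
        self.patience, self.best_loss, self.counter = patience, float('inf'), 0
        self.best_weights = None
    def check(self, val_loss, model_weights):
        if val_loss < self.best_loss:
            self.best_loss, self.counter = val_loss, 0
            self.best_weights = [w.copy() for w in model_weights]
            return False
        self.counter += 1
        return self.counter >= self.patience

# Toy "model": one weight array; validation loss improves, then worsens
weights = [np.array([0.0])]
stopper = EarlyStopping(patience=3)
val_curve = [0.9, 0.7, 0.5, 0.6, 0.65, 0.7]  # best at epoch 2
for epoch, val_loss in enumerate(val_curve):
    weights[0] += 1.0                     # pretend a gradient step happened
    if stopper.check(val_loss, weights):
        weights = stopper.best_weights    # restore best, NOT final, weights
        print(f"Stopped at epoch {epoch}, restored weights from the best epoch")
        break
print(f"Final weight: {weights[0][0]:.0f} (from epoch 2, not epoch {epoch})")
```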
Early stopping costs nothing to implement, requires no architectural changes, and implicitly selects the right amount of training for the given dataset and model. Combined with dropout and batch normalization, it forms the practical regularization toolkit for modern deep learning. It's the first thing I reach for when a model is overfitting -- if you recall episode #13, we talked about the train-test gap being the key diagnostic ;-)
Putting it all together
A modern training loop combines all five techniques into a coherent recipe:
def modern_training_recipe():
    """The standard recipe for training deep networks."""
    print("Modern deep learning training recipe:")
    print("=" * 50)
    print()
    print("1. ARCHITECTURE")
    print(" - ReLU activations for hidden layers")
    print(" - He initialization (sqrt(2/fan_in))")
    print(" - Batch norm after linear, before activation")
    print()
    print("2. REGULARIZATION")
    print(" - Dropout (0.3-0.5) after activation in FC layers")
    print(" - Early stopping (patience 5-20 epochs)")
    print(" - Optional: L2 weight decay")
    print()
    print("3. OPTIMIZATION")
    print(" - Warmup + cosine LR schedule")
    print(" - Adam optimizer (we'll cover this next episode)")
    print(" - Mini-batch gradient descent (batch size 32-256)")
    print()
    print("4. MONITORING")
    print(" - Track train AND validation loss every epoch")
    print(" - Watch for divergence (exploding gradients)")
    print(" - Watch for the train-val gap (overfitting)")
modern_training_recipe()
Each technique addresses a different failure mode:
- He initialization prevents signal collapse at the start of training
- Batch normalization prevents activation distribution drift during training
- Dropout prevents overfitting by forcing redundant representations
- Learning rate scheduling optimizes the learning trajectory over time
- Early stopping picks the optimal stopping point before overfitting
None of them are silver bullets on their own. But together, they make training deep networks reliable. A 50-layer network that would have been practically untrainable in 2005 is now almost boring to train with this recipe. That's not because the fundamental math changed -- it's because the engineering around the math got dramatically better.
# A concrete example: combining everything
np.random.seed(42)
# Simulate a "full recipe" training run
# (Real implementation would use our NeuralNetwork class
# from episodes #38-39 with these techniques added)
lr_initial = 0.001
warmup_epochs = 10
total_epochs = 200
patience = 15
dropout_rate = 0.3
print(f"Training config:")
print(f" Initial LR: {lr_initial}")
print(f" Warmup: {warmup_epochs} epochs")
print(f" Total budget: {total_epochs} epochs")
print(f" Dropout: {dropout_rate}")
print(f" Early stop: patience={patience}")
print(f" Init: He (ReLU network)")
print(f" Batch norm: Yes (every hidden layer)")
print()
# Simulate realistic training curve with all techniques
stopper = EarlyStopping(patience=patience)
best_epoch = 0
for epoch in range(total_epochs):
    lr = warmup_cosine(epoch, warmup_epochs, total_epochs, lr_initial)
    # Simulated loss (better convergence with full recipe)
    train_loss = 0.6 * np.exp(-epoch / 25) + 0.02
    val_loss = (0.65 * np.exp(-epoch / 25) + 0.03
                + 0.0005 * max(0, epoch - 100)**1.2)
    if epoch % 40 == 0:
        print(f" Epoch {epoch:>3d}: lr={lr:.6f}, "
              f"train={train_loss:.4f}, val={val_loss:.4f}")
    weights_dummy = [np.array([train_loss])]
    if stopper.check(val_loss, weights_dummy):
        print(f"\n Early stopping triggered at epoch {epoch}")
        print(f" Best validation loss: {stopper.best_loss:.4f}")
        best_epoch = epoch - patience
        break
print(f"\n Would have trained {total_epochs} epochs without early stopping")
print(f" Saved {total_epochs - epoch} epochs of wasted compute")
What you should take away
Here's the summary -- no filler, just the essentials:
- Vanishing gradients shrink the learning signal exponentially through layers. Sigmoid compounds this (max derivative 0.25 per layer), ReLU solves it (derivative is exactly 1 for positive inputs). This is the reason ReLU replaced sigmoid in hidden layers;
- He initialization (scale = sqrt(2/fan_in)) keeps activation variance stable across layers in ReLU networks. Wrong initialization silently kills training -- no error message, just a network that doesn't learn;
- Batch normalization normalizes layer inputs within each mini-batch, preventing internal covariate shift. It stabilizes training and enables deeper networks. Applied between linear transformation and activation;
- Dropout randomly zeros neuron activations during training, forcing the network to learn distributed, redundant representations. Use 0.3-0.5 for fully connected layers. Inverted dropout (scale up during training) means no adjustment at inference;
- Learning rate scheduling (warmup + cosine annealing) adapts step size during training: large steps early for broad exploration, small steps late for fine-tuned convergence. The current standard for most architectures;
- Early stopping monitors validation loss and restores the best weights when improvement stops. The simplest, cheapest, most effective regularizer. Always use it;
- These five techniques are NOT optional for deep networks. They're what makes deep network training possible at all. The math of backpropagation is the same as it was in 1986 -- what changed is the engineering that makes it work in practice.
The network we built in episodes #38-39 uses vanilla gradient descent: compute the gradient, take a step in the opposite direction, repeat. That's the simplest possible optimizer. But there's a whole family of smarter optimization algorithms that use gradient information more intelligently -- maintaining momentum, adapting per-parameter learning rates, and handling noisy gradients from mini-batches. That's the next piece of the puzzle, and it's where we're headed ;-)