Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures

What will I learn
- You will learn the evolution of CNN architectures from LeNet to EfficientNet;
- AlexNet -- the architecture that started the deep learning revolution;
- VGG -- the power of depth with simple 3x3 filters;
- ResNet -- skip connections that enable 100+ layer networks;
- building and training a CNN on CIFAR-10 in PyTorch;
- transfer learning -- leveraging pretrained ImageNet models for your tasks;
- fine-tuning strategies -- frozen backbones, gradual unfreezing, differential learning rates.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures (this post)
Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
Solutions to Episode #45 Exercises
Exercise 1: Implement a multi-channel 2D convolution from scratch in NumPy.
import numpy as np
import torch
import torch.nn as nn
def conv2d_multichannel(x, filters):
"""Multi-channel 2D convolution from scratch.
x: (C_in, H, W)
filters: (C_out, C_in, kH, kW)
returns: (C_out, H_out, W_out)
"""
c_out, c_in, kh, kw = filters.shape
_, h, w = x.shape
out_h = h - kh + 1
out_w = w - kw + 1
output = np.zeros((c_out, out_h, out_w))
for f in range(c_out):
for i in range(out_h):
for j in range(out_w):
patch = x[:, i:i+kh, j:j+kw] # (C_in, kH, kW)
output[f, i, j] = np.sum(patch * filters[f])
return output
# Test: 3-channel input, 8 filters of 3x3
np.random.seed(42)
x = np.random.randn(3, 16, 16).astype(np.float32)
filters = np.random.randn(8, 3, 3, 3).astype(np.float32)
our_output = conv2d_multichannel(x, filters)
print(f"Input shape: (3, 16, 16)")
print(f"Filter shape: (8, 3, 3, 3)")
print(f"Output shape: {our_output.shape}") # (8, 14, 14)
# Compare with PyTorch
conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)
conv.weight.data = torch.from_numpy(filters)
x_torch = torch.from_numpy(x).unsqueeze(0) # add batch dim
pt_output = conv(x_torch).detach().numpy()[0]
print(f"PyTorch shape: {pt_output.shape}")
print(f"Max difference: {np.abs(our_output - pt_output).max():.8f}")
# Should be ~1e-6 (floating point precision)
The key insight: each output filter sums across ALL input channels. A single 3x3 filter applied to an RGB image actually has 3 x 3 x 3 = 27 weights (not 9), because it processes all three color channels simultaneously.
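You can confirm this directly in PyTorch -- a single-filter Conv2d over a 3-channel input carries exactly 27 weights (a quick check; the layer here exists only for illustration):
single_filter = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, bias=False)
print(f"Weight shape: {tuple(single_filter.weight.shape)}")   # (1, 3, 3, 3)
print(f"Weights per filter: {single_filter.weight.numel()}")  # 27, not 9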
Exercise 2: CNN with 3 conv blocks for 32x32 RGB, shape tracing, parameter comparison.
class TracingCNN(nn.Module):
def __init__(self):
super().__init__()
self.block1 = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32),
nn.ReLU(), nn.MaxPool2d(2))
self.block2 = nn.Sequential(
nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64),
nn.ReLU(), nn.MaxPool2d(2))
self.block3 = nn.Sequential(
nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128),
nn.ReLU(), nn.MaxPool2d(2))
self.classifier = nn.Sequential(
nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
nn.Dropout(0.5), nn.Linear(256, 10))
def forward(self, x):
x = self.block1(x)
print(f" After block 1: {x.shape}")
x = self.block2(x)
print(f" After block 2: {x.shape}")
x = self.block3(x)
print(f" After block 3: {x.shape}")
x = x.view(x.size(0), -1)
return self.classifier(x)
cnn = TracingCNN()
x = torch.randn(1, 3, 32, 32)
print("Shape trace:")
_ = cnn(x)
# After block 1: torch.Size([1, 32, 16, 16])
# After block 2: torch.Size([1, 64, 8, 8])
# After block 3: torch.Size([1, 128, 4, 4])
cnn_params = sum(p.numel() for p in cnn.parameters())
fc = nn.Sequential(
nn.Linear(3072, 1024), nn.ReLU(), nn.Dropout(0.5),
nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.5),
nn.Linear(512, 256), nn.ReLU(),
nn.Linear(256, 10))
fc_params = sum(p.numel() for p in fc.parameters())
print(f"\nCNN parameters: {cnn_params:,}")
print(f"FC parameters: {fc_params:,}")
Each MaxPool2d halves the spatial dimensions: 32 -> 16 -> 8 -> 4. The CNN has fewer parameters than the FC network while being vastly better at extracting spatial features -- that's the whole point of convolutional architectures.
Exercise 3: Receptive field and output size calculator for arbitrary conv/pool sequences.
def compute_rf_and_size(input_size, operations):
"""Compute receptive field and output size for a sequence of operations.
Each op is (kernel_size, stride, padding).
Returns list of (output_size, receptive_field) after each op.
"""
size = input_size
rf = 1
stride_product = 1
results = []
for k, s, p in operations:
size = (size - k + 2 * p) // s + 1
rf = rf + (k - 1) * stride_product
stride_product *= s
results.append((size, rf))
return results
# Architecture from Exercise 2:
# Block 1: Conv(3,p=1) -> MaxPool(2,s=2)
# Block 2: Conv(3,p=1) -> MaxPool(2,s=2)
# Block 3: Conv(3,p=1) -> MaxPool(2,s=2)
ops = [
(3, 1, 1), (2, 2, 0), # block 1
(3, 1, 1), (2, 2, 0), # block 2
(3, 1, 1), (2, 2, 0), # block 3
]
results = compute_rf_and_size(32, ops)
labels = ["Conv1", "Pool1", "Conv2", "Pool2", "Conv3", "Pool3"]
for label, (size, rf) in zip(labels, results):
print(f"{label:>6s}: output={size:>3d}x{size:<3d} receptive_field={rf}x{rf}")
# Verify against PyTorch
x = torch.randn(1, 3, 32, 32)
conv1 = nn.Conv2d(3, 32, 3, padding=1)
pool = nn.MaxPool2d(2)
print(f"\nPyTorch verify: {pool(conv1(x)).shape}") # [1, 32, 16, 16] -> size=16
The receptive field grows fastest when pooling layers are involved -- each pool effectively doubles the stride for all subsequent layers. By the end of 3 conv+pool blocks on a 32x32 input, each neuron "sees" a substantial portion of the original image.
On to today's episode
Last episode we built everything from the ground up -- hand-coded convolutions, stride, padding, pooling, receptive fields, the works. You now understand exactly what happens when a filter slides across an image. That knowledge puts you in a strong position, because most people using CNNs in practice have no idea what's actually going on under the hood.
Now it's time to see what decades of research produced when these ideas were scaled up. We'll trace the evolution from LeNet (1998) through ResNet (2015), build and train a real classifier on actual image data (CIFAR-10, not synthetic), and then learn the single most important practical technique in modern deep learning: transfer learning with pretrained models. Here we go!
The architecture timeline
The history of CNN architectures reads like a series of "aha" moments, each one making networks dramatically more powerful. Understanding this progression matters because the ideas in these architectures (depth, skip connections, efficient scaling) show up everywhere in modern AI -- including in architectures we'll encounter later in this series.
LeNet-5 (1998): Yann LeCun's handwritten digit recognizer. Two conv layers, two pooling layers, three fully connected layers. 60,000 parameters total. It worked beautifully for MNIST but was too small for real-world images. Still, this was the architecture that proved CNNs could learn visual features automatically from data -- a radical idea at a time when computer vision was dominated by hand-crafted feature engineering.
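For a sense of scale, here's a rough LeNet-5-style network in modern PyTorch. This is a sketch, not a faithful reproduction -- the 1998 original differed in activation and subsampling details -- but the parameter count lands right around that 60,000 figure:
import torch
import torch.nn as nn
# Rough LeNet-5-style sketch (approximation, not the exact 1998 network)
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 32x32 -> 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # 14x14 -> 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),
)
print(f"LeNet-style parameters: {sum(p.numel() for p in lenet.parameters()):,}")  # ~62K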
AlexNet (2012): Alex Krizhevsky's ImageNet winner, and the paper that kicked off the deep learning revolution. Eight layers, 60 million parameters, trained on two GTX 580 GPUs (splitting the network across both GPUs was an engineering hack born from memory constraints -- each GPU only had 3GB). AlexNet didn't invent anything fundamentally new -- it applied LeNet's ideas with ReLU activations (instead of sigmoid), dropout regularization (episode #40), and data augmentation at a scale enabled by GPUs. It won the ImageNet challenge by 10+ percentage points, absolutely shocking the computer vision community.
Here's what's wild: before AlexNet, the best ImageNet classifiers used carefully hand-engineered features like SIFT and HOG combined with SVMs (episode #20). AlexNet showed that a neural network learning its own features could obliterate years of manual feature engineering. Within two years, virtually every top ImageNet entry was a deep neural network.
VGG (2014): Oxford's Visual Geometry Group made a deceptively simple observation. Instead of using a mix of filter sizes, they used ONLY 3x3 filters and just stacked more layers (16-19 layers deep). Two 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters and an extra nonlinearity between them. VGG's lesson: depth with simple building blocks beats shallow complexity. 138 million parameters -- large but effective.
import torch
import torch.nn as nn
# VGG's key insight: two 3x3 convs = one 5x5 conv's receptive field
# but with fewer parameters and an extra ReLU
# Two 3x3 convs: 2 * (3*3*C*C) = 18*C^2 parameters
# One 5x5 conv: 5*5*C*C = 25*C^2 parameters
# Plus: the two 3x3 version has an extra nonlinearity!
C = 64 # channels
two_3x3 = 2 * (3 * 3 * C * C)
one_5x5 = 5 * 5 * C * C
print(f"Two 3x3 convs: {two_3x3:,} parameters")
print(f"One 5x5 conv: {one_5x5:,} parameters")
print(f"Savings: {(1 - two_3x3/one_5x5)*100:.0f}% fewer parameters")
print(f"Both see a 5x5 receptive field, but 3x3 stack has 2 ReLUs vs 1")
GoogLeNet/Inception (2014): introduced inception modules -- parallel convolutions with different kernel sizes (1x1, 3x3, 5x5) concatenated together. The network learns which scale of feature is most informative at each spatial position. The real trick was 1x1 convolutions acting as channel-wise dimensionality reduction, dramatically cutting parameter count. 6.8 million parameters -- 20x smaller than VGG with comparable accuracy. Clever engineering over brute force ;-)
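The 1x1 bottleneck effect is easy to quantify. Compare a 5x5 convolution applied directly to a wide channel stack with the same 5x5 applied after a 1x1 reduction (the channel counts below are illustrative, not GoogLeNet's actual configuration):
c_in, c_mid, c_out = 256, 64, 256  # illustrative channel counts
direct = 5 * 5 * c_in * c_out                               # 5x5 conv straight on 256 channels
bottleneck = 1 * 1 * c_in * c_mid + 5 * 5 * c_mid * c_out   # 1x1 reduce to 64, then 5x5
print(f"Direct 5x5:           {direct:,} weights")
print(f"1x1 reduce, then 5x5: {bottleneck:,} weights")
print(f"Reduction: {(1 - bottleneck / direct) * 100:.0f}%")  # ~74% fewer weights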
ResNet (2015): the skip connection revolution. Kaiming He and colleagues (the same He who gave us the weight initialization we discussed in episode #40) showed that adding the input of a block to its output allows training networks with 152 layers -- and beyond. Without skip connections, networks deeper than roughly 20 layers perform worse than shallower ones. Not because of overfitting, but because the optimization itself fails -- gradients either vanish or explode through all those multiplications (we diagnosed this exact problem in episode #40). The skip connection provides a gradient highway where the derivative always includes a term of 1, ensuring gradient signal reaches the earliest layers.
# The difference skip connections make
# Without: x -> f(x) -- gradient must flow through f
# With: x -> f(x) + x -- gradient always has a path through the +x
# This is why adding ONE line of code (+ x) to the forward pass
# changed deep learning forever
class ResidualBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.block = nn.Sequential(
nn.Conv2d(channels, channels, 3, padding=1),
nn.BatchNorm2d(channels),
nn.ReLU(),
nn.Conv2d(channels, channels, 3, padding=1),
nn.BatchNorm2d(channels))
def forward(self, x):
return torch.relu(self.block(x) + x) # THE magic line
# Stack them deep -- this works because of skip connections
blocks = nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
x = torch.randn(1, 64, 32, 32)
out = blocks(x)
print(f"20 residual blocks: {x.shape} -> {out.shape}")
print(f"Parameters: {sum(p.numel() for p in blocks.parameters()):,}")
ResNet remains the default backbone for many vision tasks even in 2026. When in doubt, start with a ResNet. It's the "nobody ever got fired for choosing IBM" of CNN architectures.
Training a CNN on CIFAR-10
Enough history. Let's build something real. CIFAR-10: 60,000 tiny (32x32) color images across 10 classes -- airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, trucks. It's a standard benchmark that's substantially harder than MNIST but still feasible to train on a laptop CPU in reasonable time:
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
transform_train = transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.RandomCrop(32, padding=4),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
])
train_data = datasets.CIFAR10('.', train=True, download=True, transform=transform_train)
test_data = datasets.CIFAR10('.', train=False, download=True, transform=transform_test)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256)
Those normalization values (0.4914, 0.4822, 0.4465) and (0.2470, 0.2435, 0.2616) are CIFAR-10's per-channel mean and standard deviation, precomputed from the training set. Normalizing inputs to zero mean and unit variance helps training converge faster (we discussed why in episode #40 -- it keeps the loss landscape well-conditioned). The augmentations (RandomHorizontalFlip and RandomCrop with padding) create training variety from the limited dataset. A flipped car is still a car; a slightly shifted bird is still a bird. This is essentially free data ;-)
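If you're ever working with a dataset that doesn't have published statistics, these values are straightforward to recompute from the raw training images (a quick sketch -- it stacks the full training set in memory, which is fine at CIFAR-10's size but not for larger datasets):
raw = datasets.CIFAR10('.', train=True, download=True, transform=transforms.ToTensor())
imgs = torch.stack([img for img, _ in raw])              # (50000, 3, 32, 32), values in [0, 1]
print(f"Per-channel mean: {imgs.mean(dim=(0, 2, 3))}")   # ~[0.4914, 0.4822, 0.4465]
print(f"Per-channel std:  {imgs.std(dim=(0, 2, 3))}")    # ~[0.2470, 0.2435, 0.2616]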
Now our model -- a compact CNN with batch normalization:
class CIFAR10Net(nn.Module):
def __init__(self):
super().__init__()
self.features = nn.Sequential(
# Block 1: 3 -> 32 channels
nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
nn.MaxPool2d(2),
# Block 2: 32 -> 64 channels
nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(2),
# Block 3: 64 -> 128 channels
nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
nn.AdaptiveAvgPool2d(1),
)
self.classifier = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(128, 10)
)
def forward(self, x):
x = self.features(x)
x = x.view(x.size(0), -1)
return self.classifier(x)
model = CIFAR10Net()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
Notice AdaptiveAvgPool2d(1) at the end of the feature extractor. This is a modern replacement for the traditional flatten-after-conv approach -- it averages each feature map down to a single value, regardless of the spatial size. This means the same architecture can handle different input resolutions without changing the classifier. The VGG/AlexNet approach of flattening to a huge vector and then connecting to a massive FC layer wastes parameters; global average pooling is what most modern architectures use instead.
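You can see that resolution independence directly -- the same untrained model happily accepts larger inputs and still produces 10 logits per image (a quick check using the model we just built):
# Same network, different input resolutions -- the output shape stays (N, 10)
for size in (32, 64, 96):
    x = torch.randn(2, 3, size, size)
    print(f"Input {size}x{size} -> output {tuple(model(x).shape)}")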
The training loop
Standard PyTorch training with the pieces we assembled in episodes #42-43, plus one important addition -- tracking accuracy alongside loss:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(30):
model.train()
correct, total = 0, 0
for X, y in train_loader:
pred = model(X)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
correct += (pred.argmax(1) == y).sum().item()
total += len(y)
scheduler.step()
if epoch % 10 == 0:
model.eval()
test_correct = 0
with torch.no_grad():
test_correct = sum(
(model(X).argmax(1) == y).sum().item()
for X, y in test_loader)
print(f"Epoch {epoch}: train_acc={correct/total:.1%}, "
f"test_acc={test_correct/len(test_data):.1%}")
A few things to note here. We're using nn.CrossEntropyLoss -- the standard loss for multi-class classification. It combines log-softmax and negative log-likelihood into one numerically stable operation (the model outputs raw logits, not probabilities -- cross-entropy handles the softmax internally). This is different from the nn.BCELoss we used for binary classification; cross-entropy generalizes to any number of classes.
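That combination is easy to verify: feeding raw logits to nn.CrossEntropyLoss gives the same number as applying log-softmax yourself and passing the result to nn.NLLLoss (a small sanity check with made-up logits):
logits = torch.randn(4, 10)                 # raw model outputs for 4 samples, 10 classes
targets = torch.tensor([3, 0, 7, 2])
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)
print(f"CrossEntropyLoss:      {ce.item():.6f}")
print(f"log_softmax + NLLLoss: {nll.item():.6f}")  # identical up to floating point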
The cosine annealing scheduler gradually reduces the learning rate following a cosine curve, from 0.001 down to near-zero over 30 epochs. Combined with AdamW (the weight decay variant of Adam we covered in episode #41), this is the modern default recipe.
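To see what the schedule does, you can step a throwaway copy of it and print the learning rate it would use at each epoch (a small sketch -- the dummy optimizer exists only so we don't touch the real one's state):
dummy_opt = torch.optim.AdamW([torch.zeros(1, requires_grad=True)], lr=0.001)
dummy_sched = torch.optim.lr_scheduler.CosineAnnealingLR(dummy_opt, T_max=30)
lrs = []
for _ in range(30):
    lrs.append(dummy_sched.get_last_lr()[0])
    dummy_opt.step()
    dummy_sched.step()
print("LR every 5 epochs:", [f"{lr:.5f}" for lr in lrs[::5]])  # 0.00100 ... down toward 0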
This compact architecture should reach roughly 88-90% test accuracy on CIFAR-10 in 30 epochs. State of the art is around 99% with much larger models, more augmentation, and weeks of GPU time. But 88-90% from a model you can train on a laptop in minutes? That's not bad at all.
Transfer learning: standing on giants' shoulders
Here's the most practically important technique in all of modern deep learning. Training a CNN from scratch requires large datasets and significant compute. Transfer learning sidesteps both: take a model pretrained on ImageNet (1.4 million images, 1000 classes), remove the classification head, bolt on a new one that fits your task, and fine-tune on your much smaller dataset.
The reason this works is the feature hierarchy we discussed last episode. A ResNet trained on ImageNet has learned to detect edges in layer 1, textures in layer 2, object parts in the middle layers, and high-level concepts in the deep layers. Those features -- edges, textures, shapes -- are universal. They transfer remarkably well between tasks. A model trained to classify dogs and cars produces features that are useful for classifying medical X-rays, satellite photos, or manufacturing defects.
from torchvision import models
# Load pretrained ResNet-18 (11.7M parameters, trained on ImageNet)
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Replace the final FC layer (was 512 -> 1000, now 512 -> 10)
resnet.fc = nn.Linear(512, 10)
# Freeze everything except the new classifier
for param in resnet.parameters():
param.requires_grad = False
for param in resnet.fc.parameters():
param.requires_grad = True
trainable = sum(p.numel() for p in resnet.parameters() if p.requires_grad)
total = sum(p.numel() for p in resnet.parameters())
print(f"Training {trainable:,} of {total:,} parameters ({trainable/total:.2%})")
With a frozen backbone, only 5,130 parameters need training (512 x 10 weights + 10 biases). That's 0.05% of the model. The other 99.95% -- all those carefully learned edge detectors, texture recognizers, and shape detectors -- stay exactly as ImageNet training left them. This is the model surgery technique from episode #44, now applied to a real pretrained model instead of our toy example.
Fine-tuning strategies
Freezing the backbone and training only the head is the simplest approach, but you can do better. The standard three-phase recipe:
Phase 1 -- Frozen backbone, train head only. A few epochs with just the classification head training. This gets the head roughly calibrated to the new task without disturbing the pretrained features.
Phase 2 -- Unfreeze and train everything with a small learning rate. Once the head has converged, unfreeze the backbone and train the entire model with a learning rate 10-100x smaller than Phase 1. This gently adapts the pretrained features to your specific task.
Phase 3 (optional) -- Differential learning rates. Use the parameter groups technique from episode #44 to set different learning rates for different depths. Early layers (edges, textures) rarely need changing; deeper layers (object parts, high-level concepts) might benefit from larger adjustments.
# Phase 1: head only
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=1e-3)
# Quick head calibration (5 epochs)
for epoch in range(5):
for X, y in train_loader:
loss = loss_fn(resnet(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Phase 2: unfreeze everything, small learning rate
for param in resnet.parameters():
param.requires_grad = True
optimizer = torch.optim.AdamW([
{'params': resnet.conv1.parameters(), 'lr': 1e-5},
{'params': resnet.layer1.parameters(), 'lr': 1e-5},
{'params': resnet.layer2.parameters(), 'lr': 5e-5},
{'params': resnet.layer3.parameters(), 'lr': 1e-4},
{'params': resnet.layer4.parameters(), 'lr': 1e-4},
{'params': resnet.fc.parameters(), 'lr': 1e-3},
], weight_decay=0.01)
# Fine-tune (15 epochs)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)
for epoch in range(15):
resnet.train()
for X, y in train_loader:
loss = loss_fn(resnet(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
if epoch % 5 == 0:
resnet.eval()
correct = sum(
(resnet(X).argmax(1) == y).sum().item()
for X, y in test_loader)
print(f"Epoch {epoch}: test_acc={correct/len(test_data):.1%}")
Having said that, for most practical applications you'll find that the simple freeze-then-unfreeze approach works surprisingly well. Differential learning rates are an optimization on top of something that already works. Start simple, add complexity only when the simpler version underperforms.
Batch normalization in convolutional networks
We've mentioned BatchNorm2d a few times now but haven't really unpacked how it differs from the BatchNorm1d we used in earlier episodes. For convolutional layers, batch norm normalizes per channel across both the batch dimension and the spatial dimensions:
# BatchNorm1d: normalizes over (batch,) for each feature
# Input shape: (N, C) -> normalize along N for each C
bn1d = nn.BatchNorm1d(64)
x_1d = torch.randn(32, 64)
print(f"BatchNorm1d input: {x_1d.shape}")
print(f"BatchNorm1d output: {bn1d(x_1d).shape}")
# BatchNorm2d: normalizes over (batch, height, width) for each channel
# Input shape: (N, C, H, W) -> normalize along N, H, W for each C
bn2d = nn.BatchNorm2d(64)
x_2d = torch.randn(32, 64, 16, 16)
print(f"BatchNorm2d input: {x_2d.shape}")
print(f"BatchNorm2d output: {bn2d(x_2d).shape}")
# Each channel gets ONE mean and ONE variance computed across
# the entire batch AND all spatial positions
print(f"\nBatchNorm2d has {len(list(bn2d.parameters()))} parameter groups:")
for name, p in bn2d.named_parameters():
print(f" {name}: shape={list(p.shape)}")
# gamma (scale) and beta (shift), each of size (64,) -- one per channel
The intuition: if a "cat ear detector" filter produces high values at some spatial positions and low values at others, batch norm ensures the distribution of those values is stable across training batches. This prevents the internal covariate shift problem (the distribution of inputs to each layer changing as earlier layers update) that makes deep networks hard to train. Every modern CNN architecture uses batch norm (or a close relative like layer norm or group norm) between convolutions.
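You can check that per-channel behavior numerically with the bn2d layer and input from above -- after normalization, every channel has roughly zero mean and unit variance across the batch and spatial positions:
# Each channel is normalized over (N, H, W): mean ~0, std ~1 per channel at initialization
with torch.no_grad():
    y = bn2d(x_2d)
print(f"Per-channel mean (first 4): {y.mean(dim=(0, 2, 3))[:4]}")  # ~0 everywhere
print(f"Per-channel std  (first 4): {y.std(dim=(0, 2, 3))[:4]}")   # ~1 everywhere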
Choosing an architecture in 2026
The landscape has settled considerably. Here's the practical decision tree:
ResNet-18/34/50: the default starting point. Well-understood, widely supported in every framework, good balance of accuracy and speed. Use ResNet-50 when accuracy matters most; ResNet-18 when you need speed or have limited compute. If you're not sure which architecture to use, use ResNet.
EfficientNet: scales width, depth, and resolution together using a compound scaling coefficient. More accurate than ResNet per FLOP, but more complex to understand and modify. Good when you need to squeeze maximum accuracy from limited compute.
Vision Transformer (ViT): applies the transformer architecture (originally from NLP) directly to image patches. Competitive with or better than CNNs on large datasets, but data-hungry -- needs massive pretraining or aggressive data augmentation to work well on smaller datasets. We'll cover transformers in depth later in this series, and then circle back to ViT specifically.
For small datasets with transfer learning, the architecture choice matters far less than you'd think. A pretrained ResNet-18 with proper fine-tuning often beats a pretrained EfficientNet-B7 that wasn't fine-tuned well. Execution beats architecture selection. Get the training pipeline right first -- data augmentation, learning rate schedule, proper validation -- and worry about architecture optimization later.
# Quick comparison of pretrained model sizes
from torchvision import models
architectures = {
'ResNet-18': models.resnet18(weights=models.ResNet18_Weights.DEFAULT),
'ResNet-50': models.resnet50(weights=models.ResNet50_Weights.DEFAULT),
}
for name, model in architectures.items():
params = sum(p.numel() for p in model.parameters())
print(f"{name:>15s}: {params/1e6:.1f}M parameters")
# ResNet-18: 11.7M
# ResNet-50: 25.6M
Data augmentation: the secret weapon
We touched on augmentations briefly in the CIFAR-10 example, but they deserve more attention because they're often the difference between a mediocre model and a good one. Data augmentation creates synthetic training variety from your existing images:
from torchvision import transforms
# Standard augmentation pipeline for image classification
augment = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
transforms.RandomHorizontalFlip(p=0.5),
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
transforms.RandomRotation(15),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
transforms.RandomErasing(p=0.1),
])
# The normalization values [0.485, 0.456, 0.406] are ImageNet's
# per-channel means -- use these when fine-tuning ImageNet models
print("Augmentation pipeline:")
for i, t in enumerate(augment.transforms):
print(f" {i+1}. {t.__class__.__name__}")
The augmentations you choose should reflect the invariances in your data. Horizontal flips make sense for natural images (a cat facing left is still a cat) but NOT for text recognition or medical images where orientation matters. Rotation helps for satellite imagery but not for document scanning where text is always upright. Color jitter helps for outdoor scenes but might hurt for color-critical tasks like quality inspection. Think about what transformations preserve the label before applying them blindly.
RandomErasing (also called Cutout) randomly masks out rectangular patches of the input. This forces the network to recognize objects from partial information -- it can't rely on any single region always being present. It's a remarkably effective regularizer that costs nothing.
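Here's a quick way to see the effect -- applying RandomErasing with p=1.0 (so the erase always fires) to an all-ones tensor and measuring how much got zeroed out:
import torch
erase = transforms.RandomErasing(p=1.0)   # p=1.0 so the erase always triggers
img = torch.ones(3, 32, 32)
erased = erase(img)
print(f"Fraction of pixels erased: {(erased == 0).float().mean():.1%}")  # default scale is 2-33% of the image area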
What you should remember
- CNN architecture evolution: LeNet (1998) -> AlexNet (2012, GPUs + ReLU) -> VGG (depth with 3x3 stacks) -> Inception (parallel scales, 1x1 bottlenecks) -> ResNet (skip connections);
- Two stacked 3x3 convolutions have the same receptive field as one 5x5, with fewer parameters and an extra nonlinearity -- this is why modern architectures rely almost exclusively on 3x3 filters;
- nn.CrossEntropyLoss combines softmax and negative log-likelihood -- use it for multi-class classification with raw logit outputs;
- Transfer learning with pretrained models is the single most important practical technique: freeze the backbone, replace the head, fine-tune with small learning rates;
- Fine-tuning recipe: Phase 1 (head only) -> Phase 2 (unfreeze, small LR) -> optional Phase 3 (differential LRs per depth);
- BatchNorm2d normalizes per channel across both batch and spatial dimensions, stabilizing training in deep convolutional networks;
- Data augmentation creates training variety for free -- horizontal flips, random crops, color jitter, random erasing. Match augmentations to your data's actual invariances;
- Architecture choice in 2026: ResNet for most tasks, EfficientNet for efficiency, ViT for large-scale vision. But execution (proper training, augmentation, fine-tuning) matters far more than which architecture you pick.
We've now covered CNNs from theory through practice. But CNNs are only one way to handle structured data. Images have spatial structure that CNNs exploit beautifully. What about data with sequential structure -- text, audio, time series -- where the order of elements matters? Fully connected and convolutional networks both treat their inputs as fixed-size, fixed-order tensors. Processing a sentence word by word, or a song sample by sample, requires a fundamentally different kind of architecture -- one with memory. That's where we're headed next ;-)
Exercises
Exercise 1: Build a mini-VGG network for CIFAR-10: 3 "VGG blocks" where each block has two Conv2d(3x3, padding=1) -> BatchNorm2d -> ReLU layers followed by MaxPool2d(2). Channels should go 3 -> 64 -> 128 -> 256. Add a classifier with global average pooling (AdaptiveAvgPool2d(1)) followed by a single Linear layer. Print total parameter count and train for 20 epochs on CIFAR-10 with AdamW + cosine annealing. Report the final test accuracy.
Exercise 2: Implement a transfer learning comparison. Load a pretrained ResNet-18, replace the final FC layer for 10 classes, and train on CIFAR-10 in two configurations: (a) frozen backbone (only FC trains), (b) fully unfrozen with differential learning rates (backbone at 1e-5, FC at 1e-3). Train each for 10 epochs and compare test accuracy. Which approach performs better and why?
Exercise 3: Write a function count_by_type(model) that takes any nn.Module and returns a dictionary mapping layer type names (like "Conv2d", "BatchNorm2d", "Linear") to both count and total parameter count for that type. Test it on your mini-VGG from Exercise 1 and on the pretrained ResNet-18. Which layer type dominates the parameter count in each?