Learn AI Series (#45) - Convolutional Neural Networks - Theory

What will I learn
- You will learn why fully connected networks fail on images -- and how convolutions fix it;
- the convolution operation -- sliding filters that detect local patterns;
- stride and padding -- controlling output dimensions;
- pooling -- reducing spatial dimensions while preserving important features;
- the feature hierarchy -- how CNNs learn edges, textures, parts, and objects;
- receptive field -- what each neuron in a deep CNN actually "sees";
- building and training a complete CNN on MNIST in PyTorch.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory (this post)
Learn AI Series (#45) - Convolutional Neural Networks - Theory
Solutions to Episode #44 Exercises
Exercise 1: Create a GatedMLP with 3 hidden layers where each layer's output is multiplied element-wise by a learned gate (separate nn.Linear + nn.Sigmoid).
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader

class GatedMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.input_layer = nn.Linear(input_dim, hidden_dim)
        self.input_gate = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.hidden1 = nn.Linear(hidden_dim, hidden_dim)
        self.gate1 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())
        self.hidden2 = nn.Linear(hidden_dim, hidden_dim)
        self.gate2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())
        self.output = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        h = torch.relu(self.input_layer(x)) * self.input_gate(x)
        h = torch.relu(self.hidden1(h)) * self.gate1(h)
        h = torch.relu(self.hidden2(h)) * self.gate2(h)
        return self.output(h)

class CircleDataset(Dataset):
    def __init__(self, n_samples=1000):
        np.random.seed(42)
        self.X = np.random.randn(n_samples, 2).astype(np.float32)
        self.y = ((self.X[:, 0]**2 + self.X[:, 1]**2) < 1.5).astype(np.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return torch.tensor(self.X[idx]), torch.tensor(self.y[idx])

train_data = CircleDataset(800)
loader = DataLoader(train_data, batch_size=32, shuffle=True)
model = GatedMLP(2, 32, 1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(100):
    for X_b, y_b in loader:
        pred = model(X_b).squeeze()
        loss = loss_fn(pred, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
model.eval()
with torch.no_grad():
    X_all = torch.tensor(train_data.X)
    preds = torch.sigmoid(model(X_all).squeeze())
    acc = ((preds > 0.5) == torch.tensor(train_data.y)).float().mean()
print(f"Final accuracy: {acc.item():.1%}")
# Expect ~95%+ accuracy
Each gate layer produces values between 0 and 1 via sigmoid, then multiplies element-wise with the corresponding hidden layer output. This lets the network learn to "turn off" individual hidden units per input -- a primitive form of attention that predates the attention mechanism we'll cover in later episodes. The extra parameters (double the linear layers) are worth it when you need fine-grained feature selection.
Exercise 2: Build a DeepResNet with forward hooks recording mean activation magnitude per block.
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim))

    def forward(self, x):
        return torch.relu(self.block(x) + x)

class DeepResNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, n_blocks):
        super().__init__()
        self.input_layer = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.ModuleList([ResidualBlock(hidden_dim) for _ in range(n_blocks)])
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        for block in self.blocks:
            x = block(x)
        return self.output_layer(x)

activation_means = {}
def make_hook(name):
    def hook(module, inp, out):
        activation_means[name] = out.detach().abs().mean().item()
    return hook

model = DeepResNet(20, 64, 10, n_blocks=8)
for i, block in enumerate(model.blocks):
    block.register_forward_hook(make_hook(f"block_{i}"))

X_train = torch.randn(2000, 20)
y_train = torch.randint(0, 10, (2000,))
# The default collate_fn already stacks the (X, y) tuples into batched tensors
loader = DataLoader(list(zip(X_train, y_train)), batch_size=64, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):
    for X_b, y_b in loader:
        pred = model(X_b)
        loss = loss_fn(pred, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

_ = model(X_train[:64])  # one final forward pass to populate the hooks
print("Mean activation magnitude per block:")
for name, val in sorted(activation_means.items()):
    print(f"  {name}: {val:.4f}")
# Activations should stay fairly stable thanks to skip connections + batch norm
With skip connections and batch normalization, the activation magnitudes remain stable across all 8 blocks. Without them (in a plain deep network), you'd typically see activations either decay toward zero or explode -- the exact problem we diagnosed in episode #40.
Exercise 3: Model surgery: pretrain a 10-class ResNet, swap to 5 classes, freeze backbone, compare fine-tuning speed vs training from scratch.
from torch.utils.data import TensorDataset

X = torch.randn(2000, 20)
y10 = torch.randint(0, 10, (2000,))
y5 = torch.randint(0, 5, (2000,))
loader10 = DataLoader(TensorDataset(X, y10), batch_size=64, shuffle=True)
loader5 = DataLoader(TensorDataset(X, y5), batch_size=64, shuffle=True)

# Pretrain on 10 classes
pretrained = DeepResNet(20, 64, 10, n_blocks=3)
opt = torch.optim.AdamW(pretrained.parameters(), lr=1e-3)
for epoch in range(30):
    for xb, yb in loader10:
        loss = nn.CrossEntropyLoss()(pretrained(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Surgery: replace output, freeze backbone
pretrained.output_layer = nn.Linear(64, 5)
for param in pretrained.input_layer.parameters():
    param.requires_grad = False
for param in pretrained.blocks.parameters():
    param.requires_grad = False
opt_ft = torch.optim.Adam(
    filter(lambda p: p.requires_grad, pretrained.parameters()), lr=1e-3)

print("Fine-tuning (frozen backbone, 15 epochs):")
for epoch in range(15):
    correct, total = 0, 0
    for xb, yb in loader5:
        loss = nn.CrossEntropyLoss()(pretrained(xb), yb)
        opt_ft.zero_grad()
        loss.backward()
        opt_ft.step()
        correct += (pretrained(xb).argmax(1) == yb).sum().item()
        total += len(yb)
    if epoch % 5 == 0:
        print(f"  Epoch {epoch}: acc={correct/total:.1%}")

# From scratch comparison
fresh = DeepResNet(20, 64, 5, n_blocks=3)
opt_fresh = torch.optim.AdamW(fresh.parameters(), lr=1e-3)
print("\nFrom scratch (30 epochs):")
for epoch in range(30):
    correct, total = 0, 0
    for xb, yb in loader5:
        loss = nn.CrossEntropyLoss()(fresh(xb), yb)
        opt_fresh.zero_grad()
        loss.backward()
        opt_fresh.step()
        correct += (fresh(xb).argmax(1) == yb).sum().item()
        total += len(yb)
    if epoch % 10 == 0:
        print(f"  Epoch {epoch}: acc={correct/total:.1%}")
On random data the fine-tuned model converges faster because the backbone already knows how to extract useful features. On real data with actual structure, the difference would be dramatic -- a pretrained backbone compresses weeks of training into minutes of fine-tuning.
On to today's episode
At the end of episode #44 I hinted that we'd start building architectures tailored to specific data types -- and here we are. We've spent the last eight episodes (episodes #37 through #44) building a solid foundation: from the humble perceptron through full neural networks, backpropagation, training challenges, optimization algorithms, and PyTorch's tooling. All of that machinery is general-purpose -- it works on any data you can represent as numbers.
But images are special. They have spatial structure. A pixel's neighbors matter. Patterns repeat across locations. And a cat is a cat whether it's in the top-left corner or dead center. Fully connected networks ignore all of this structure, and they pay a massive price for it. Convolutional Neural Networks exploit this structure, and they changed everything. Here we go!
Why fully connected networks fail on images
Take a 224x224 color image. That's 224 x 224 x 3 = 150,528 input values. Connect each one to a hidden layer of 1,000 neurons using a fully connected network (like the ones we built in episodes #38-39), and you're looking at 150 million weights in the first layer alone. The model would be massive, slow to train, and -- most importantly -- terrible at its job.
Why terrible? Because a fully connected layer treats every pixel as completely independent. The pixel at position (0, 0) and the pixel at position (223, 223) get their own separate weight, even though they're at opposite corners of the image and share almost no useful relationship. The network would have to learn independently that a vertical edge looks the same whether it appears on the left side or the right side of the image. It would effectively need to memorize exact pixel positions instead of learning visual patterns. That's a catastrophic waste of capacity.
import torch
import torch.nn as nn
import numpy as np
# The problem: fully connected on a small image
fc_layer = nn.Linear(28 * 28, 1000) # just MNIST (28x28 grayscale)
print(f"FC layer on 28x28 image: {fc_layer.weight.numel():,} parameters")
# Scale to a real image
fc_big = nn.Linear(224 * 224 * 3, 1000) # standard ImageNet size
print(f"FC layer on 224x224 RGB: {fc_big.weight.numel():,} parameters")
print(f"That's {fc_big.weight.numel() / 1_000_000:.0f} million weights in ONE layer")
Convolutional Neural Networks (CNNs) solve this by exploiting two properties of visual data. First, locality: nearby pixels are related; distant pixels usually aren't. A vertical edge is defined by the relationship between adjacent pixels, not by pixels on opposite sides of the image. Second, translation invariance: a cat is a cat whether it's in the top-left or bottom-right of the frame. The same visual pattern should be detected regardless of where it appears.
Instead of connecting every input to every neuron, CNNs use small sliding filters that detect local patterns -- and share those filters across the entire image. This idea was born from Hubel and Wiesel's 1960s discovery of how the mammalian visual cortex processes images, where they found that individual neurons in a cat's brain respond to specific edge orientations in specific small regions of the visual field. Yann LeCun and collaborators turned this biological insight into a practical neural network architecture in the late 1980s. Let's understand exactly how it works.
The convolution operation
A convolution slides a small filter (also called a kernel) across the input image, computing a dot product at each position. The filter is typically 3x3 or 5x5 -- much smaller than the image. At each position, the filter multiplies its values element-wise with the corresponding patch of the input and sums the results, producing a single output value.
Let's implement this from scratch so you can see exactly what's happening (same philosophy as building neural networks from scratch in episodes #38-39 before switching to PyTorch):
def conv2d_manual(image, kernel):
    """2D convolution from scratch -- no libraries."""
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = h - kh + 1
    out_w = w - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+kh, j:j+kw]
            output[i, j] = np.sum(patch * kernel)
    return output
# Horizontal edge detection filter
horizontal_edge = np.array([
[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]
], dtype=float)
# Vertical edge detection filter
vertical_edge = np.array([
[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]
], dtype=float)
# Test on a simple image with a clear edge
image = np.zeros((8, 8))
image[:4, :] = 1.0 # top half bright, bottom half dark
h_edges = conv2d_manual(image, horizontal_edge)
v_edges = conv2d_manual(image, vertical_edge)
print(f"Input: {image.shape} -> Output: {h_edges.shape}")
print(f"Horizontal edge response (row 3, the boundary):")
print(f" {h_edges[2, :]}")
print(f"Vertical edge response (should be weak -- no vertical edges):")
print(f" max = {np.abs(v_edges).max():.1f}")
The filter slides across the image, producing an output feature map that is smaller than the input (8x8 input with 3x3 filter -> 6x6 output). Each value in the output represents how strongly the filter pattern matches at that spatial location. The horizontal edge filter produces high responses where horizontal edges exist and low responses in uniform regions.
Here's the critical innovation: the same filter is applied at every position. A filter that detects vertical edges detects them everywhere in the image, whether in the top-left corner or the center. This is weight sharing -- instead of 150 million independent weights, a 3x3 filter has just 9 weights shared across all positions. Parameter efficiency improves by orders of magnitude.
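Before moving on, it's worth a quick sanity check against PyTorch. One subtlety: what deep learning frameworks call "convolution" is technically cross-correlation (the kernel isn't flipped), which is exactly the sliding dot product we just implemented. A minimal sketch, reusing the image, horizontal_edge, and conv2d_manual definitions from above:
import torch
import torch.nn.functional as F

# PyTorch's conv2d computes cross-correlation -- the same operation as conv2d_manual
img_t = torch.tensor(image, dtype=torch.float32).reshape(1, 1, 8, 8)
ker_t = torch.tensor(horizontal_edge, dtype=torch.float32).reshape(1, 1, 3, 3)
torch_out = F.conv2d(img_t, ker_t)  # no padding, stride 1 -> 6x6 output
manual_out = conv2d_manual(image, horizontal_edge)
print(f"Max difference vs PyTorch: {np.abs(torch_out.squeeze().numpy() - manual_out).max():.6f}")
# Should print 0.000000 -- the two implementations agree exactly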
Multiple filters, multiple channels
One filter detects one pattern. A CNN layer applies multiple filters to detect multiple patterns simultaneously. Each filter produces one output feature map, and the collection of all feature maps is the layer's output:
# A conv layer with 3 input channels (RGB) and 16 output channels (16 filters)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(f"Filter weights shape: {conv.weight.shape}") # (16, 3, 3, 3)
print(f"Bias shape: {conv.bias.shape}") # (16,)
print(f"Total parameters: {conv.weight.numel() + conv.bias.numel()}")
x = torch.randn(1, 3, 32, 32) # batch=1, channels=3, height=32, width=32
out = conv(x)
print(f"\nInput: {x.shape}")
print(f"Output: {out.shape}") # (1, 16, 32, 32)
The weight shape (16, 3, 3, 3) means: 16 filters, each looking at all 3 input channels, each 3x3 in spatial size. Each filter produces one output channel. The output has 16 channels -- one per filter -- at the same spatial resolution (thanks to padding=1, which we'll explain next).
When layers stack, the second convolutional layer's filters operate on the first layer's feature maps. A filter in layer 2 that detects "corner" patterns is combining edge detections from layer 1. This compositional hierarchy -- simple features combining into complex ones -- is how CNNs build up rich representations from raw pixels.
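Here's a minimal sketch of that stacking (the conv1/conv2 names are just for illustration): the second layer's weight tensor spans all 16 of the first layer's feature maps, so each of its 32 filters is a learned combination of the earlier pattern detections:
conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # pixels -> 16 feature maps
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # feature maps -> 32 feature maps
imgs = torch.randn(1, 3, 32, 32)
h1 = torch.relu(conv1(imgs))
h2 = torch.relu(conv2(h1))
print(f"Layer 1 output: {h1.shape}")             # (1, 16, 32, 32)
print(f"Layer 2 output: {h2.shape}")             # (1, 32, 32, 32)
print(f"Layer 2 weights: {conv2.weight.shape}")  # (32, 16, 3, 3): each filter sees all 16 maps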
Now compare the parameter count to fully connected:
# Same spatial processing with FC would need:
fc_equivalent = nn.Linear(3 * 32 * 32, 16 * 32 * 32)
print(f"\nConv2d params: {conv.weight.numel() + conv.bias.numel():,}")
print(f"FC equivalent: {fc_equivalent.weight.numel() + fc_equivalent.bias.numel():,}")
print(f"Ratio: {(fc_equivalent.weight.numel()) / (conv.weight.numel()):.0f}x more parameters")
That's the core argument for convolutions: dramatically fewer parameters while actually producing better results, because the inductive bias (locality + weight sharing) matches the structure of visual data.
Stride and padding
Stride controls how far the filter moves at each step. Stride 1 (the default) moves one pixel at a time. Stride 2 moves two pixels, halving the spatial dimensions. Larger strides reduce computation and spatial size but discard fine-grained spatial information.
Padding adds zeros around the border of the input. Without padding, each convolution shrinks the spatial dimensions by kernel_size - 1. With padding=1 and a 3x3 kernel, the output has the same spatial dimensions as the input. This "same" padding is the most common choice because it lets you stack many layers without the feature maps shrinking to nothing.
The output size formula:
output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1
For a 32x32 input with a 3x3 kernel, padding=1, stride=1: (32 - 3 + 2) / 1 + 1 = 32. Same size. With stride=2: floor((32 - 3 + 2) / 2) + 1 = floor(15.5) + 1 = 16. Half the size (PyTorch floors the division when the stride doesn't divide evenly).
x = torch.randn(1, 3, 32, 32)
# Default: stride 1, padding 0 (spatial shrinkage)
conv_default = nn.Conv2d(3, 16, kernel_size=3)
print(f"No padding: {x.shape} -> {conv_default(x).shape}")
# Same padding: output size = input size
conv_same = nn.Conv2d(3, 16, kernel_size=3, padding=1)
print(f"Padding=1: {x.shape} -> {conv_same(x).shape}")
# Stride 2: spatial downsampling
conv_stride = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
print(f"Stride=2: {x.shape} -> {conv_stride(x).shape}")
# 5x5 kernel with padding=2 (same padding for 5x5)
conv_5x5 = nn.Conv2d(3, 16, kernel_size=5, padding=2)
print(f"5x5 pad=2: {x.shape} -> {conv_5x5(x).shape}")
Having said that, for most practical architectures you'll use 3x3 kernels with padding=1 almost exclusively. VGGNet (2014) demonstrated that stacking multiple 3x3 convolutions achieves the same receptive field as larger kernels with fewer parameters, and this has been the convention ever since.
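The arithmetic behind VGG's argument is quick to verify: two stacked 3x3 convolutions cover a 5x5 region of their input, yet cost fewer weights than a single 5x5 convolution. A sketch with an assumed channel count of 64, ignoring biases:
C = 64                         # example channel count (in = out = 64)
two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers, 5x5 receptive field
one_5x5 = 5 * 5 * C * C        # one 5x5 layer, same receptive field
print(f"Two 3x3 convs: {two_3x3:,} weights")  # 73,728
print(f"One 5x5 conv:  {one_5x5:,} weights")  # 102,400
print(f"Savings: {1 - two_3x3/one_5x5:.0%}, plus an extra nonlinearity in between")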
Pooling: deliberate information reduction
After convolution, pooling reduces the spatial dimensions further. Max pooling takes the maximum value in each spatial window; average pooling takes the mean. A 2x2 max pool with stride 2 halves both spatial dimensions:
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x_feature = torch.randn(1, 16, 32, 32)
pooled = pool(x_feature)
print(f"MaxPool2d: {x_feature.shape} -> {pooled.shape}")
# Average pooling for comparison
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
avg_pooled = avg_pool(x_feature)
print(f"AvgPool2d: {x_feature.shape} -> {avg_pooled.shape}")
# Adaptive pooling: output a fixed size regardless of input
adaptive = nn.AdaptiveAvgPool2d((1, 1)) # global average pooling
print(f"Adaptive: {x_feature.shape} -> {adaptive(x_feature).shape}")
Max pooling provides a small degree of translation invariance -- if a feature is detected anywhere within a 2x2 region, the max operation preserves it regardless of exact position. It also reduces the number of parameters in subsequent layers, since smaller spatial dimensions mean fewer connections.
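A tiny numeric sketch of that invariance: put a "feature" at (0, 0), then shift it to (1, 1) -- still inside the same 2x2 pooling window -- and the pooled output is identical:
import torch.nn.functional as F

a = torch.zeros(1, 1, 4, 4)
b = torch.zeros(1, 1, 4, 4)
a[0, 0, 0, 0] = 1.0  # feature detected at (0, 0)
b[0, 0, 1, 1] = 1.0  # same feature shifted to (1, 1), same 2x2 window
print(F.max_pool2d(a, 2).squeeze())  # 1.0 in the top-left output cell
print(F.max_pool2d(b, 2).squeeze())  # identical -- the max survives the shift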
Modern architectures increasingly replace pooling with strided convolutions (stride > 1), which learn what information to discard rather than using a fixed rule. The argument is that if you're going to downsample, you might as well let the network learn the best way to do it. But max pooling remains common and intuitive, especially in simpler architectures.
Global average pooling (AdaptiveAvgPool2d((1, 1))) collapses the entire spatial dimension into a single value per channel. This is used at the end of modern CNNs instead of flattening -- it produces a fixed-size output regardless of input spatial dimensions and acts as a strong regularizer.
The feature hierarchy
The most remarkable property of CNNs is that they automatically learn a hierarchy of visual features, from simple to complex. This has been extensively studied by visualizing what each layer's filters respond to (Zeiler & Fergus, 2013):
Layer 1: edges and color gradients. Horizontal edges, vertical edges, diagonal edges, color transitions. These are the same features that computer vision researchers hand-crafted for decades (Sobel filters, Gabor filters). The CNN discovers them automatically from data.
Layers 2-3: textures and simple shapes. Grid patterns, curves, circles, corners. Each filter in layer 2 is looking at a larger effective region of the original image (due to the composition of 3x3 convolutions).
Layers 4-6: parts of objects. Wheels, eyes, noses, handles, text-like regions. These are recognizable fragments of visual concepts.
Deep layers: entire objects or scenes. "Cat face", "wheel on a car", "text on a sign". The activations at these layers are almost entirely abstract -- far removed from individual pixels.
This hierarchy emerges from training on large datasets. Nobody programs "detect edges in layer 1, textures in layer 2". The architecture (local connectivity + weight sharing + depth) combined with gradient descent produces this organization automatically. It's one of the more beautiful results in deep learning -- the network independently rediscovers the same feature progression that neuroscientists found in biological visual systems ;-)
Receptive field
The receptive field of a neuron is the region of the original input that influences its value. In layer 1 with a 3x3 kernel, each output neuron has a 3x3 receptive field. In layer 2 (another 3x3 kernel applied to layer 1's output), each neuron has an effective 5x5 receptive field in the original input.
With each additional 3x3 convolution layer, the receptive field grows by 2 pixels in each direction. After 10 layers: 21x21. After 50 layers: 101x101 -- large enough to encompass most of a 224x224 image.
# Calculate receptive field for stacked 3x3 convolutions
def receptive_field(n_layers, kernel_size=3, stride=1):
    rf = 1
    for _ in range(n_layers):
        rf = rf + (kernel_size - 1) * stride
    return rf

for n in [1, 3, 5, 10, 20, 50]:
    rf = receptive_field(n)
    print(f"{n:>2d} layers of 3x3 conv: receptive field = {rf}x{rf}")

# With pooling (effectively doubles stride for subsequent layers)
print("\nWith max pool every 2 conv layers:")
rf = 1
stride_product = 1
for i in range(1, 11):
    rf += (3 - 1) * stride_product
    if i % 2 == 0:
        stride_product *= 2  # pooling doubles effective stride
    print(f"  Layer {i:>2d}: rf = {rf}x{rf}")
This is why deep networks work better than wide, shallow ones for vision: each additional layer expands the receptive field, allowing neurons to integrate information from larger image regions while still building on the local pattern detections from earlier layers. Pooling and strided convolutions expand the receptive field faster by reducing spatial dimensions -- a max pool layer effectively doubles the growth rate.
Putting it all together: a CNN in PyTorch
A typical CNN follows the pattern: [Conv -> ReLU -> Pool] x N -> Flatten -> [FC -> ReLU] x M -> Output. The convolutional blocks extract spatial features; the fully connected layers at the end classify based on those features. Let's build one and train it on MNIST-like data (the classic handwritten digit dataset we mentioned in episode #43):
from torch.utils.data import DataLoader, TensorDataset

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 1 -> 32 channels, 28x28 -> 14x14
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: 32 -> 64 channels, 14x14 -> 7x7
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # flatten
        return self.classifier(x)

model = SimpleCNN()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
For a 28x28 MNIST image: the first conv+pool block produces 32 channels at 14x14, the second conv+pool produces 64 channels at 7x7. Flattening gives 64 x 7 x 7 = 3,136 features, which feed into the fully connected classifier. Total parameters: around 420K -- dramatically fewer than the 150 million a fully connected network would need for a 224x224 image.
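If you're curious where those ~420K parameters live, here's a quick sketch that tallies them per layer (the first Linear layer dominates):
for name, module in model.named_modules():
    n_params = sum(p.numel() for p in module.parameters(recurse=False))
    if n_params > 0:
        print(f"{name:>16s} ({module.__class__.__name__}): {n_params:,}")
# classifier.0, the Linear(3136, 128), holds roughly 95% of all parameters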
Notice we're using BatchNorm2d here (the 2D version of the batch normalization we discussed in episode #40). For convolutional layers, batch norm normalizes per channel across the spatial dimensions and the batch. It stabilizes training in exactly the same way as BatchNorm1d does for fully connected layers.
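A short sketch of what "per channel" means in practice: feed BatchNorm2d a deliberately shifted, scaled batch, and each channel comes out with roughly zero mean and unit variance, with statistics taken over batch, height, and width:
bn = nn.BatchNorm2d(16)
feat = torch.randn(8, 16, 14, 14) * 3.0 + 5.0  # deliberately shifted and scaled
out = bn(feat)  # in training mode, normalizes with batch statistics
print(f"Channel means after BN: {out.mean(dim=(0, 2, 3))[:4]}")  # ~0
print(f"Channel stds after BN:  {out.std(dim=(0, 2, 3))[:4]}")   # ~1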
Now let's train it. We'll generate synthetic MNIST-like data to keep things self-contained (in the next episode we'll use real MNIST via torchvision):
# Generate synthetic "digit-like" data for demonstration
# (in the next episode we'll use actual MNIST)
torch.manual_seed(42)
n_train, n_val = 5000, 1000
X_train = torch.randn(n_train, 1, 28, 28)
y_train = torch.randint(0, 10, (n_train,))
X_val = torch.randn(n_val, 1, 28, 28)
y_val = torch.randint(0, 10, (n_val,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=128)

# Training setup
model = SimpleCNN()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)

for epoch in range(15):
    model.train()
    total_loss = 0
    for X_b, y_b in train_loader:
        pred = model(X_b)
        loss = loss_fn(pred, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()
    if epoch % 3 == 0:
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for X_v, y_v in val_loader:
                correct += (model(X_v).argmax(1) == y_v).sum().item()
                total += len(y_v)
        print(f"Epoch {epoch:>2d}: train_loss={total_loss/len(train_loader):.4f}, "
              f"val_acc={correct/total:.1%}")

print("\n(Random data -> ~10% baseline. Real MNIST -> 99%+ next episode)")
On random data, the model can't do much better than chance (~10% for 10 classes). That's expected -- there's no spatial structure to exploit. On real MNIST images (handwritten digits with actual edges, curves, and strokes), this same architecture routinely achieves 99%+ accuracy. The difference is entirely in the data, not the model -- which neatly demonstrates why CNNs exist: they're built to exploit spatial structure that fully connected networks can't see.
Visualizing what filters learn
Let's look at the actual filter weights after training. Even on random data, the learned filters reveal the convolution's behavior:
# Extract and display first-layer filter weights
first_conv = model.features[0]        # nn.Conv2d(1, 32, 3)
filters = first_conv.weight.detach()  # shape: (32, 1, 3, 3)
print("First layer: 32 filters of size 3x3")
print(f"Filter weight shape: {filters.shape}")
print(f"Weight statistics: mean={filters.mean():.4f}, std={filters.std():.4f}")

# Show a few filter patterns
for i in range(4):
    f = filters[i, 0].numpy()
    print(f"\nFilter {i}:")
    for row in f:
        print("  " + "  ".join(f"{v:+.2f}" for v in row))
On real images, these first-layer filters would converge to edge detectors (horizontal, vertical, diagonal), blob detectors, and color gradient filters -- the same hand-crafted features that dominated computer vision for decades before deep learning. The CNN discovers them automatically from gradient descent. In the next episode we'll train on actual images and see these patterns emerge clearly.
CNN vs fully connected: the parameter comparison
Let me put the parameter advantage in perspective with one final comparison:
# Parameter comparison: CNN vs FC on the same task
print("Task: classify 28x28 grayscale images into 10 classes\n")

# CNN approach
cnn = SimpleCNN(10)
cnn_params = sum(p.numel() for p in cnn.parameters())
print(f"CNN (2 conv blocks + FC head): {cnn_params:>10,} parameters")

# FC approach (same depth, no convolutions)
fc_model = nn.Sequential(
    nn.Linear(28*28, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 10)
)
fc_params = sum(p.numel() for p in fc_model.parameters())
print(f"FC (3 layers, similar depth): {fc_params:>10,} parameters")
print(f"\nCNN has {cnn_params/fc_params:.1f}x the parameters but VASTLY")
print("better performance on real images because it exploits spatial structure.")
print("\nScale to 224x224 RGB and the FC approach becomes 150M+ parameters")
print("while the CNN stays under 5M -- and still outperforms.")
The CNN might actually have more parameters than a simple FC network for tiny images like MNIST. The advantage isn't always fewer parameters -- it's that the parameters are organized to match the structure of the data. Weight sharing means each filter learns a reusable pattern. Locality means each layer focuses on relevant spatial neighborhoods. These inductive biases are what make CNNs so effective on visual data, even with comparable or fewer parameters.
What we covered
- Fully connected networks waste parameters on images -- they treat every pixel independently and can't generalize across spatial locations. Convolutions exploit locality (nearby pixels matter) and translation invariance (patterns are the same everywhere);
- A convolution slides a small filter across the input, computing dot products -- weight sharing makes it dramatically more parameter-efficient than fully connected layers;
- Multiple filters per layer detect different patterns; stacking layers builds a hierarchy from edges to textures to parts to objects;
- Stride controls spatial downsampling; padding preserves spatial dimensions through layers. The formula: output = floor((input - kernel + 2*padding) / stride) + 1;
- Pooling (max or average) reduces spatial size and adds translation invariance. Modern networks sometimes replace pooling with strided convolutions;
- The feature hierarchy (edges -> textures -> parts -> objects) emerges automatically from training -- nobody programs which features to detect;
- Receptive field grows with depth. Each 3x3 conv layer adds 2 pixels. Pooling and strides accelerate the growth;
- A typical CNN architecture: [Conv -> BN -> ReLU -> Pool] x N -> Flatten -> FC -> Output. We built and trained one in PyTorch.
We now understand the theory behind CNNs -- why they work, how the math flows, and what the key architectural choices mean. The next step is putting this into practice with real image data and exploring the classic architectures that shaped the field. We'll also train on actual MNIST and CIFAR-10, and you'll see for yourself the dramatic difference between CNN performance and fully connected baselines on real visual data ;-)
Exercises
Exercise 1: Implement a 2D convolution from scratch (no PyTorch, just NumPy) that handles multi-channel input. Your function should accept an input of shape (C_in, H, W) and a filter bank of shape (C_out, C_in, kH, kW), and return output of shape (C_out, H_out, W_out). Test it with 3-channel input (like RGB) and 8 filters of size 3x3, and compare the output shape to PyTorch's nn.Conv2d.
Exercise 2: Build a CNN that processes 32x32 RGB images (3 channels) with 3 convolutional blocks (each block: Conv2d -> BatchNorm2d -> ReLU -> MaxPool2d), followed by a fully connected classifier. Print the shape after each block to trace how spatial dimensions change. Count total parameters and compare to a fully connected network with 3 hidden layers processing the same flattened 32x32x3 = 3072 input.
Exercise 3: Create a function that computes the receptive field and output spatial size for any sequence of conv/pool operations. Each operation is described by a tuple (kernel_size, stride, padding). Test it with the architecture from Exercise 2 and verify the output sizes match what PyTorch computes.