Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd

What will I learn
- You will learn why PyTorch won the deep learning framework war;
- tensors -- NumPy arrays with GPU support and automatic differentiation;
- autograd -- PyTorch's automatic differentiation engine that replaces manual backpropagation;
- rebuilding our episode #39 network in PyTorch -- same architecture, far less code;
- CPU vs GPU: when to use what;
- the computational graph and how PyTorch tracks operations;
- detaching tensors and inference mode for evaluation;
- the standard PyTorch training loop and workflow.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd (this post)
Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
For five episodes now, you've been building neural networks by hand -- writing forward passes, implementing backpropagation, coding entire optimizers from scratch in NumPy. This was deliberate. You understand what happens inside a neural network because you built every component yourself. When you read loss.backward() in someone else's code, you know what that means at the matrix level, because you've manually computed those gradients.
Now we switch to the tool that makes all of this practical: PyTorch. Instead of implementing backpropagation manually, PyTorch computes gradients automatically. Instead of hand-coding Adam (like we did last episode), you call torch.optim.Adam. Instead of reshaping NumPy arrays for batches, you use DataLoaders. The concepts are identical -- PyTorch just handles the plumbing so you can focus on the interesting parts: the architecture, the data, the experiments.
Why PyTorch and not TensorFlow? Because by 2026, the framework war is over. PyTorch won. It dominates research, it powers most production deployments, and virtually every new model -- from GPT-4 to Stable Diffusion -- is built with it. Meta's continued investment and the community momentum make it the clear standard. TensorFlow still exists (and has its strengths in mobile/embedded deployment), but if you're starting fresh today, PyTorch is the one to learn. Here we go!
Tensors: NumPy arrays that do more
A PyTorch tensor is a multi-dimensional array, just like a NumPy array, with two extra capabilities: it can run on GPUs and it can track operations for automatic differentiation. If you've been following this series since episode #2 where we set up NumPy, tensors will feel immediately familiar -- the API is nearly identical.
import torch
import numpy as np
# Creating tensors (familiar if you know NumPy)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3, 4)
c = torch.randn(2, 3) # random normal distribution
d = torch.arange(10) # 0 through 9
e = torch.ones(5) # all ones
print(f"a: {a}")
print(f" shape: {a.shape}, dtype: {a.dtype}")
print(f"\nb:\n{b}")
print(f"\nc (random):\n{c}")
print(f"\nd: {d}")
The API mirrors NumPy almost exactly: torch.zeros maps to np.zeros, torch.randn maps to np.random.randn, tensor.shape maps to array.shape. If you know NumPy (and after 41 episodes of this series, you certainly do), you already know 90% of the tensor API. The main syntax difference: NumPy uses @ or np.dot for matrix multiplication; PyTorch uses @ or torch.matmul -- same operator, different library.
# Operations -- same as NumPy
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
print(f"x + y = {x + y}")
print(f"x * y = {x * y}") # element-wise
print(f"x @ y = {(x @ y).item()}") # dot product -> scalar
# Reshaping
m = torch.arange(12).reshape(3, 4)
print(f"\nReshaped:\n{m}")
print(f"Shape: {m.shape}")
print(f"Transpose shape: {m.T.shape}")
# Matrix multiplication
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = A @ B # (3, 4) @ (4, 5) -> (3, 5)
print(f"\nMatrix multiply: {A.shape} @ {B.shape} = {C.shape}")
NumPy conversion: shared memory
One critical detail about converting between NumPy and PyTorch: torch.from_numpy shares memory with the original NumPy array. Modifying one modifies the other. This is efficient (no copying), but it can bite you if you're not aware of it:
# NumPy <-> PyTorch conversion (shared memory -- no copy!)
np_array = np.array([1.0, 2.0, 3.0])
tensor_from_np = torch.from_numpy(np_array)
print(f"NumPy: {np_array}")
print(f"Tensor: {tensor_from_np}")
# Modify the NumPy array -- tensor changes too!
np_array[0] = 999.0
print(f"\nAfter modifying NumPy array:")
print(f"NumPy: {np_array}")
print(f"Tensor: {tensor_from_np}") # also 999!
# Going the other direction
back_to_np = tensor_from_np.numpy()
print(f"\nBack to NumPy: {back_to_np}")
# Use .clone() if you need an independent copy
independent = torch.from_numpy(np_array).clone()
np_array[0] = 1.0
print(f"\nAfter modifying NumPy (with clone):")
print(f"NumPy: {np_array}")
print(f"Clone: {independent}") # still 999!
This shared-memory behavior is by design -- it makes the NumPy-PyTorch bridge zero-cost, which matters when you're loading large datasets. But whenever you need the tensor to be independent of its source array, use .clone(). Having said that, in most practical workflows you'll either start with PyTorch tensors directly or convert once and never touch the NumPy original again, so this rarely causes issues.
Data types: being explicit
PyTorch is more explicit about data types than NumPy. Neural network weights are almost always torch.float32 (32-bit floating point), and you'll occasionally see errors when you accidentally mix float64 (NumPy's default) with float32. Being aware of this saves debugging time:
# PyTorch defaults
print("Default dtypes:")
print(f" torch.tensor([1.0]): {torch.tensor([1.0]).dtype}") # float32
print(f" torch.tensor([1]): {torch.tensor([1]).dtype}") # int64
print(f" torch.randn(3): {torch.randn(3).dtype}") # float32
print(f" np.array([1.0]).dtype: {np.array([1.0]).dtype}") # float64!
# Converting NumPy float64 to PyTorch float32
data_np = np.random.randn(100, 4) # float64 by default
data_pt = torch.from_numpy(data_np.astype(np.float32))
print(f"\nNumPy dtype: {data_np.dtype}")
print(f"Tensor dtype: {data_pt.dtype}")
# Or use torch.tensor() which copies and can set dtype
data_pt2 = torch.tensor(data_np, dtype=torch.float32)
print(f"Via torch.tensor: {data_pt2.dtype}")
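To see what the mismatch actually looks like in practice, here's a minimal sketch: feeding float64 data into a default float32 nn.Linear layer raises a RuntimeError. The variable names are illustrative:

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)                               # weights are float32 by default
bad_input = torch.from_numpy(np.random.randn(10, 4))  # float64!

try:
    layer(bad_input)
except RuntimeError as e:
    print("RuntimeError:", str(e)[:60])  # dtype mismatch between input and weights

# The fix: convert to float32 first
good_input = bad_input.float()
print(layer(good_input).shape)  # torch.Size([10, 2])
```

When you hit this error in a real project, the culprit is almost always a NumPy array that sneaked in with its default float64 dtype.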
Autograd: automatic differentiation
This is PyTorch's killer feature. Remember episode #39 where we manually computed every single derivative for backpropagation? For a 3-layer network with ReLU activations, that was roughly 50 lines of careful chain-rule math. Now imagine doing that for a 100-layer transformer with attention mechanisms, layer normalization, and residual connections. You'd be debugging derivative code for weeks.
PyTorch's autograd system eliminates all of that. Set requires_grad=True on a tensor, perform operations on it, and PyTorch records every operation in a computational graph. When you call .backward(), it automatically computes the gradient of the output with respect to every tensor that has requires_grad=True -- no manual chain rule, no hand-coded derivatives.
# Simple example: y = x0^2 + 3*x1
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x[0]**2 + 3 * x[1] # y = 4 + 9 = 13
y.backward() # compute dy/dx0 and dy/dx1
print(f"x = {x.data}")
print(f"y = {y.item():.1f}")
print(f"dy/dx = {x.grad}")
print()
print("Verification:")
print(f" dy/dx0 = 2*x0 = 2*2 = {2*2.0:.1f} (autograd: {x.grad[0].item():.1f})")
print(f" dy/dx1 = 3 = {3.0:.1f} (autograd: {x.grad[1].item():.1f})")
Compare this to episode #39 where we manually computed every derivative. For a 2-variable function, manual differentiation is manageable. For a network with millions of parameters, it's simply impossible to do by hand. Autograd scales to any computational graph, no matter how complex. The .backward() call traverses the computational graph in reverse (hence "backward"), applying the chain rule at each node. The result -- gradients stored in .grad -- is identical to what our hand-coded backpropagation computed. PyTorch just does it automatically ;-)
A slightly more complex example
Let's see autograd handle a function that would require a few lines of chain rule by hand:
# More complex: z = (a*b + c)^2, where a, b, c all have requires_grad
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(1.0, requires_grad=True)
# z = (a*b + c)^2 = (2*3 + 1)^2 = 7^2 = 49
intermediate = a * b + c
z = intermediate ** 2
z.backward()
print(f"a={a.item()}, b={b.item()}, c={c.item()}")
print(f"z = (a*b + c)^2 = {z.item():.1f}")
print()
print(f"dz/da = 2*(a*b+c)*b = 2*7*3 = {2*7*3:.1f} (autograd: {a.grad.item():.1f})")
print(f"dz/db = 2*(a*b+c)*a = 2*7*2 = {2*7*2:.1f} (autograd: {b.grad.item():.1f})")
print(f"dz/dc = 2*(a*b+c)*1 = 2*7*1 = {2*7*1:.1f} (autograd: {c.grad.item():.1f})")
Every gradient matches the manual calculus. Now imagine doing this for a function with 10 million intermediate operations -- autograd handles it identically. The algorithm is the same chain rule we implemented by hand, but automated.
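If you want to convince yourself that autograd really matches the calculus, you can sanity-check it numerically -- the same finite-difference idea behind the gradient checks from our from-scratch episodes. A minimal sketch (the helper f is ours, just the function from above):

```python
import torch

def f(a, b, c):
    # same function as above: z = (a*b + c)^2
    return (a * b + c) ** 2

# Autograd gradient
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(1.0, requires_grad=True)
f(a, b, c).backward()

# Finite-difference approximation for dz/da: (f(a+eps) - f(a-eps)) / (2*eps)
eps = 1e-4
numeric = (f(torch.tensor(2.0 + eps), torch.tensor(3.0), torch.tensor(1.0))
           - f(torch.tensor(2.0 - eps), torch.tensor(3.0), torch.tensor(1.0))) / (2 * eps)

print(f"autograd dz/da: {a.grad.item():.4f}")  # 42.0
print(f"numeric  dz/da: {numeric.item():.4f}")  # ~42.0
```

The two numbers agree to within floating-point noise -- exactly what a gradient check is supposed to show.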
Understanding the computational graph
Every operation on a tensor with requires_grad=True is recorded in a directed acyclic graph (DAG). PyTorch builds this graph dynamically as you run Python code:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b # MulBackward0
d = c + a # AddBackward0
e = d ** 2 # PowBackward0
print(f"e = {e.item()}")
print(f"e.grad_fn = {e.grad_fn}")
print(f"d.grad_fn = {d.grad_fn}")
print(f"c.grad_fn = {c.grad_fn}")
e.backward()
print(f"\nde/da = {a.grad.item()}")
print(f"de/db = {b.grad.item()}")
Each tensor knows which operation created it (you can see it via grad_fn) and which tensors were its inputs. When you call .backward(), PyTorch walks this graph in reverse, applying the chain rule at each node. After the backward pass, the graph is released (freed from memory) -- you get one backward pass per forward pass.
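You can see the "one backward pass per forward pass" rule directly: a second .backward() on the same graph raises a RuntimeError unless you ask PyTorch to retain the graph. A small sketch (and note the gradient accumulation -- the same behavior behind the zero_grad() mantra later in this post):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)  # keep the graph alive for a second pass
y.backward()                   # works, but gradients ACCUMULATE: 6 + 6
print(x.grad)                  # tensor(12.)

# Without retain_graph, a second backward() fails: the graph is freed
z = x ** 3
z.backward()                   # graph released after this call
try:
    z.backward()
except RuntimeError:
    print("second backward() failed: graph already freed")
```

In normal training you never need retain_graph=True -- each forward pass builds a fresh graph, and one backward() consumes it.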
This dynamic graph construction is what originally set PyTorch apart from TensorFlow's static graph approach. (TensorFlow 2.x later adopted eager mode, which is similar, but PyTorch had it from day one.) You build the graph by running normal Python code, using standard if-statements and for-loops. This means debugging a PyTorch model is just debugging Python -- print statements, breakpoints, standard tooling all work. No special graph-compilation step, no weird trace errors, no "Session.run()". Just Python.
# Dynamic graphs: use normal Python control flow
def dynamic_computation(x, power):
    """The graph changes based on the input -- this is normal in PyTorch."""
    result = x
    for i in range(power):
        result = result * x
    return result
x = torch.tensor(2.0, requires_grad=True)
# Different power = different computational graph each time
for p in [2, 3, 4]:
    y = dynamic_computation(x, p)
    y.backward()
    print(f"x^{p+1} at x=2: value={y.item():.0f}, "
          f"gradient={(p+1)*2**p:.0f}, "
          f"autograd={x.grad.item():.0f}")
    x.grad.zero_()  # reset gradients for next iteration
From NumPy to PyTorch: the same network, way less code
Let's rebuild our episode #39 network in PyTorch. Same architecture (2 -> 16 -> 8 -> 1), same problem (classifying points inside a circle), dramatically less code:
import torch.nn as nn
# Define the model -- a few declarative lines instead of an entire class
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
print("Model architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")
Compare this to the NeuralNetwork class from episode #38: roughly 40 lines just for the forward pass and He initialization, plus the entire backward pass from episode #39. nn.Sequential handles architecture definition, weight initialization, forward pass logic, AND gradient computation. All of it, automatically.
Now let's train it on the same circle-classification problem:
# Data (same as episode #39)
np.random.seed(42)
X_np = np.random.randn(500, 2).astype(np.float32)
y_np = ((X_np[:, 0]**2 + X_np[:, 1]**2) < 1.5).astype(np.float32).reshape(-1, 1)
X = torch.from_numpy(X_np)
y = torch.from_numpy(y_np)
print(f"Dataset: {len(X)} points")
print(f"Inside circle: {int(y.sum().item())}")
print(f"Outside circle: {int(len(y) - y.sum().item())}")
# Loss function and optimizer -- one line each
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Training loop
for epoch in range(500):
    # Forward pass
    pred = model(X)
    loss = loss_fn(pred, y)
    # Backward pass + weight update (the PyTorch mantra)
    optimizer.zero_grad()  # clear previous gradients
    loss.backward()        # compute gradients (autograd!)
    optimizer.step()       # update weights (Adam!)
    if epoch % 100 == 0:
        acc = ((pred > 0.5) == y).float().mean()
        print(f"Epoch {epoch:>3d}: loss={loss.item():.4f}, acc={acc.item():.1%}")
# Final evaluation
with torch.no_grad():
    pred_final = model(X)
    acc_final = ((pred_final > 0.5) == y).float().mean()
print(f"\nFinal accuracy: {acc_final.item():.1%}")
That's the entire training loop. Compare it to the roughly 80 lines of NumPy code from episodes #38-39 (forward-pass class, backward pass, plus the manual Adam implementation from episode #41). The forward pass is defined by nn.Sequential (no manual weight matrices), backpropagation is loss.backward() (no manual derivative computation), and weight updates are optimizer.step() (no hand-coded Adam).
The PyTorch training mantra
Three lines you must remember -- the PyTorch training mantra that appears in every training loop:
optimizer.zero_grad() # 1. Clear accumulated gradients from previous step
loss.backward() # 2. Compute new gradients via autograd
optimizer.step() # 3. Update weights using the optimizer
Forgetting zero_grad() is the single most common PyTorch bug. Gradients accumulate by default (which is actually useful for gradient accumulation across mini-batches -- a technique used when your batch doesn't fit in GPU memory), but if you forget to clear them, you're training on the sum of all past gradients rather than the current batch's gradient. Your loss will behave erratically and you'll spend an hour staring at your code wondering what's wrong, only to discover you forgot one line. Ask me how I know ;-)
(Having said that, once you've been burned by this exactly once, you never forget it again. It becomes muscle memory: zero, backward, step. Zero, backward, step.)
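Here's a minimal sketch of what the bug looks like in isolation -- the same loss backward-ed twice without clearing doubles the gradient:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# One backward pass: grad = d(w^2)/dw = 2*w = 2
(w ** 2).backward()
print(w.grad)  # tensor(2.)

# Forgetting zero_grad: the next backward ADDS to the stale gradient
(w ** 2).backward()
print(w.grad)  # tensor(4.)  -- 2 + 2, not the current gradient

# The fix: clear before each backward pass
w.grad.zero_()
(w ** 2).backward()
print(w.grad)  # tensor(2.)  -- correct again
```

In a real training loop the stale gradients come from previous batches, so the corruption is subtler than a clean doubling -- but the mechanism is exactly this.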
CPU vs GPU
Tensors can live on CPU or GPU. Moving computations to a GPU provides massive speedup for matrix operations -- the kind neural networks are made of. GPUs have thousands of simple cores designed for exactly this type of parallel arithmetic:
# Check what's available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# On Apple Silicon Macs
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
    print("Using Apple Silicon GPU (MPS)")
Moving data and models between devices is straightforward -- everything gets a .to(device) call:
# Move model and data to device
model_gpu = model.to(device)
X_device = X.to(device)
y_device = y.to(device)
# Training loop is identical -- just data and model are on GPU now
pred = model_gpu(X_device)
loss = loss_fn(pred, y_device)
print(f"Prediction device: {pred.device}")
print(f"Loss: {loss.item():.4f}")
When does GPU actually matter? For our 500-sample, 2-feature problem, CPU is actually faster. The overhead of transferring data to GPU exceeds the computation speedup. For real-world tasks -- thousands of samples, hundreds of features, deep networks with millions of parameters -- GPU provides 10-100x speedup. The rule of thumb: if your training loop takes more than a few seconds per epoch on CPU, try GPU. For the kind of models we'll build in the next dozen episodes (CNNs, RNNs, transformers), GPU acceleration goes from "nice to have" to "absolutely essential".
On Apple Silicon Macs, use torch.device('mps') instead of 'cuda' -- PyTorch supports Metal Performance Shaders for GPU acceleration. It's not as fast as a dedicated NVIDIA card, but for learning and prototyping it's perfectly adequate.
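If you want to find the crossover point on your own hardware, here's a rough timing sketch (the helper time_matmul is ours, not a PyTorch API, and the numbers depend entirely on your machine):

```python
import time
import torch

def time_matmul(device, n=2048, repeats=10):
    """Rough wall-clock timing of an n x n matrix multiply on `device`."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    _ = a @ b  # warm-up (the first GPU call includes initialization overhead)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device.type == 'cuda':
        torch.cuda.synchronize()  # GPU kernels run async -- wait before reading the clock
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul(torch.device('cpu')) * 1000:.1f} ms per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul(torch.device('cuda')) * 1000:.1f} ms per matmul")
```

Shrink n to something tiny and the CPU usually wins; grow it and the GPU pulls ahead by one to two orders of magnitude. The synchronize calls matter: without them you'd be timing kernel launches, not computation.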
Detaching and inference mode
Not every operation should be tracked. During evaluation (inference), you don't need gradients -- tracking them wastes memory and computation. PyTorch provides two mechanisms to skip gradient tracking:
# Method 1: torch.no_grad() context manager
model.eval() # set model to evaluation mode (disables dropout, etc.)
with torch.no_grad():
    test_pred = model(X)
    # No computational graph built -- faster, less memory
    test_acc = ((test_pred > 0.5) == y).float().mean()
print(f"Test accuracy (no_grad): {test_acc.item():.1%}")
print(f"  requires_grad: {test_pred.requires_grad}")  # False!
model.train()  # back to training mode
# Method 2: detach a tensor from the graph
hidden_output = model[:2](X) # first two layers (Linear + ReLU)
hidden_detached = hidden_output.detach() # no gradient connection
print(f"\nHidden output requires_grad: {hidden_output.requires_grad}")
print(f"Detached requires_grad: {hidden_detached.requires_grad}")
print(f"Shape: {hidden_detached.shape}")
Always wrap evaluation and inference in torch.no_grad(). For training, leave it off. The model.eval() / model.train() toggle is important too -- it doesn't affect gradient tracking, but it switches behaviors like dropout (episode #40: dropout is active during training, disabled during evaluation) and batch normalization (uses batch statistics during training, running statistics during evaluation). Forgetting model.eval() during inference is another common bug -- your model will produce slightly different outputs each time because dropout is randomly zeroing neurons.
# Common evaluation pattern
def evaluate(model, X, y):
    model.eval()
    with torch.no_grad():
        predictions = model(X)
        loss = loss_fn(predictions, y)
        accuracy = ((predictions > 0.5) == y).float().mean()
    model.train()  # always switch back!
    return loss.item(), accuracy.item()
eval_loss, eval_acc = evaluate(model, X, y)
print(f"Evaluation: loss={eval_loss:.4f}, accuracy={eval_acc:.1%}")
Inspecting model parameters
When debugging or understanding a model, you'll often want to look at the actual weight values, their shapes, and their gradient magnitudes. PyTorch makes this straightforward:
# List all parameters
print("Model parameters:")
for name, param in model.named_parameters():
    print(f"  {name:>10s}: shape={str(list(param.shape)):>10s}, "
          f"dtype={param.dtype}, "
          f"requires_grad={param.requires_grad}")
print(f"\nTotal trainable parameters: "
      f"{sum(p.numel() for p in model.parameters() if p.requires_grad)}")
# Peek at actual values
first_layer_weight = model[0].weight
print(f"\nFirst layer weight shape: {first_layer_weight.shape}")
print(f"First layer weight stats: "
      f"mean={first_layer_weight.data.mean():.4f}, "
      f"std={first_layer_weight.data.std():.4f}")
print(f"First layer bias: {model[0].bias.data}")
Note that nn.Linear(2, 16) creates a weight matrix of shape [16, 2], not [2, 16]. PyTorch stores weights as [out_features, in_features], so the forward pass computes output = input @ W.T + b. This is transposed from the convention we used in our from-scratch implementation in episode #38, but the math is identical -- just a notational choice.
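A quick sanity check of that convention: reproduce nn.Linear's forward pass manually and compare it to the built-in result.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 16)
x = torch.randn(5, 2)  # batch of 5 samples, 2 features each

# nn.Linear computes output = x @ W.T + b, with W of shape [16, 2]
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

print(layer.weight.shape)               # torch.Size([16, 2])
print(torch.allclose(manual, builtin))  # True
```

Same numbers either way -- the transposed storage is purely internal bookkeeping.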
The standard PyTorch workflow
Every PyTorch project follows the same five-step structure, regardless of whether you're building a tiny classifier or training a billion-parameter language model:
# Step 1: Define the model
model = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)
# Step 2: Define the loss function
loss_fn = nn.BCELoss() # binary cross-entropy (episode #39)
# Other common losses:
# nn.CrossEntropyLoss() -- multi-class classification
# nn.MSELoss() -- regression (episode #10)
# Step 3: Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Other common optimizers (episode #41):
# torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# Step 4: Training loop
n_epochs = 300
for epoch in range(n_epochs):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 60 == 0:
        acc = ((pred > 0.5) == y).float().mean()
        print(f"Epoch {epoch:>3d}: loss={loss.item():.4f}, acc={acc.item():.1%}")
# Step 5: Evaluation
model.eval()
with torch.no_grad():
    final_pred = model(X)
    final_acc = ((final_pred > 0.5) == y).float().mean()
print(f"\nFinal: {final_acc.item():.1%} accuracy")
model.train()
This structure scales from our 500-sample toy problem to training GPT on trillions of tokens. The architecture changes, the data pipeline grows complex, the training loop adds distributed computing and gradient accumulation and mixed precision -- but the five steps remain the same. Define model, pick loss, pick optimizer, train loop, evaluate.
Saving and loading models
Once you've trained a model, you probably want to keep it around. PyTorch gives you two ways to serialize models:
# Save just the learned parameters (recommended)
torch.save(model.state_dict(), '/tmp/model_weights.pth')
print("Saved model weights")
# Load them back
model_loaded = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid()
)
model_loaded.load_state_dict(torch.load('/tmp/model_weights.pth',
                                        weights_only=True))
model_loaded.eval()
# Verify it works
with torch.no_grad():
    loaded_pred = model_loaded(X)
    loaded_acc = ((loaded_pred > 0.5) == y).float().mean()
print(f"Loaded model accuracy: {loaded_acc.item():.1%}")
# Compare predictions -- should be identical
with torch.no_grad():
    original_pred = model(X)
diff = (original_pred - loaded_pred).abs().max()
print(f"Max prediction difference: {diff.item():.10f}")
The recommended approach is saving state_dict() (just the parameters) rather than the entire model object. This is more portable and less fragile -- saving the entire model uses Python's pickle, which ties your saved file to the exact class definitions and module structure at save time. Saving just the weights lets you reconstruct the architecture in code and then load the learned values.
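It helps to know what a state_dict actually is: just an ordered mapping from parameter names to tensors. Nothing about the Python class is stored, which is why it survives refactors that would break a pickled model object. A small sketch:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
state = model.state_dict()

# Plain name -> tensor mapping; keys come from the module structure
for name, tensor in state.items():
    print(f"{name}: {list(tensor.shape)}")
# 0.weight: [4, 2]
# 0.bias: [4]
# 2.weight: [1, 4]
# 2.bias: [1]
```

As long as your new model produces the same keys and shapes, load_state_dict() will accept the file -- regardless of how the surrounding code has changed.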
What to take away
Here's everything we covered, distilled:
- PyTorch tensors are NumPy arrays with GPU support and automatic differentiation -- the API is almost identical, and torch.from_numpy converts between them (shared memory -- use .clone() for independence);
- autograd records operations on tensors with requires_grad=True and computes gradients automatically via .backward() -- this replaces the entire manual backpropagation we built in episode #39;
- the training mantra: zero_grad() -> backward() -> step(). Forgetting zero_grad() accumulates stale gradients -- the single most common PyTorch bug;
- nn.Sequential defines a model, nn.BCELoss / nn.CrossEntropyLoss define the loss, torch.optim.Adam / torch.optim.AdamW handle optimization -- each replaces dozens of lines of our from-scratch code;
- GPU (cuda or mps) provides 10-100x speedup for large-scale training. Use CPU for small problems, GPU when training takes more than a few seconds per epoch;
- dynamic computational graphs mean you debug PyTorch models with standard Python tools -- print, breakpoints, inspect. No graph-compilation step required;
- wrap evaluation in torch.no_grad() and call model.eval() to skip gradient computation and disable training-time behaviors like dropout;
- save model weights with torch.save(model.state_dict(), path) and load with model.load_state_dict().
We've covered tensors, autograd, and the basic training workflow -- the foundation of everything we'll build going forward. But we've been feeding the entire dataset to the model in one go, which doesn't scale. Real training needs mini-batches (as we discussed in episode #41), proper train/validation/test splits, and data loading utilities that handle all the shuffling and batching automatically. PyTorch has built-in tools for all of this, and that's where things get practical fast ;-)
Exercises
Exercise 1: Create a PyTorch tensor x with values [1.0, 2.0, 3.0, 4.0] and requires_grad=True. Compute y = sum(x^3) (the sum of cubes). Call .backward() and print the gradients. Verify by hand that the gradient of sum(x^3) with respect to each x_i is 3 * x_i^2.
Exercise 2: Build an nn.Sequential model with architecture 4 -> 32 -> 16 -> 1 (using ReLU for hidden layers and Sigmoid for the output). Generate 1000 random 4-dimensional points, label them 1 if the sum of their squared components is less than 2.0, and 0 otherwise. Train the model for 500 epochs with Adam (lr=0.01) and print the final accuracy.
Exercise 3: Take the model from Exercise 2 and save its weights to a file. Create a new model with the same architecture, load the weights back, and verify that both models produce identical predictions on the same input data. Print the maximum absolute difference between their predictions (it should be zero or extremely close to zero).