Learn AI Series (#43) - PyTorch Data and Training

What will I learn
- You will learn the Dataset and DataLoader abstractions -- PyTorch's standard data pipeline;
- building custom datasets for your own data (CSV files, images, anything);
- data augmentation -- creating training variety from limited data;
- the complete training loop with train/validation split and proper evaluation;
- saving and loading model checkpoints to resume training or deploy models;
- mixed precision training -- faster training on modern hardware.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training (this post)
Solutions to Episode #42 Exercises
Exercise 1: Tensor with requires_grad, compute sum(x^3), verify gradients.
import torch
x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = (x ** 3).sum() # y = 1 + 8 + 27 + 64 = 100
y.backward()
print(f"x = {x.data}")
print(f"y = sum(x^3) = {y.item()}")
print(f"Autograd gradients: {x.grad}")
print(f"Manual check (3*x^2): {3 * x.data ** 2}")
# Autograd: [3, 12, 27, 48]
# Manual: 3*1=3, 3*4=12, 3*9=27, 3*16=48 -- match!
The gradient of sum(x_i^3) with respect to each x_i is 3 * x_i^2 -- straight application of the power rule from episode #9. Autograd computes exactly the same values.
Exercise 2: Build 4->32->16->1 model, train on 4D hypersphere data.
import torch
import torch.nn as nn
import numpy as np
np.random.seed(42)
X_np = np.random.randn(1000, 4).astype(np.float32)
y_np = ((X_np ** 2).sum(axis=1) < 2.0).astype(np.float32).reshape(-1, 1)
X = torch.from_numpy(X_np)
y = torch.from_numpy(y_np)
model = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid()
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(500):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
with torch.no_grad():
    final_pred = model(X)
    acc = ((final_pred > 0.5) == y).float().mean()
print(f"Final accuracy: {acc.item():.1%}")
# You should see around 95%+ accuracy
Nothing fancy here -- same training pattern from episode #42, just with 4 input features instead of 2. The hypersphere boundary (sum of squares < 2.0) in 4 dimensions is conceptually the same as the 2D circle boundary we've been using, but the model needs more capacity to carve out that decision surface in higher dimensions.
Exercise 3: Save and load model weights, verify identical predictions.
# Save weights from Exercise 2
torch.save(model.state_dict(), '/tmp/ex2_model.pth')
# Create a fresh model with same architecture
model2 = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid()
)
model2.load_state_dict(torch.load('/tmp/ex2_model.pth', weights_only=True))
model2.eval()
# Compare predictions
with torch.no_grad():
    pred1 = model(X)
    pred2 = model2(X)
max_diff = (pred1 - pred2).abs().max().item()
print(f"Max prediction difference: {max_diff}")
# Should print 0.0 (or extremely close)
The difference should be exactly zero -- state_dict() captures every parameter with full floating point precision. The save/load roundtrip is lossless.
On to today's episode
In episode #42, we trained a model by loading all data into memory as one big tensor. That works for 500 samples. It does NOT work for 50 million images, a 10 GB text corpus, or streaming sensor data. Real-world training requires efficient data loading, batching, shuffling, and optional augmentation -- without loading everything into memory at once.
PyTorch's data pipeline is built on two abstractions: Dataset (defines how to access one sample) and DataLoader (handles batching, shuffling, and parallel loading). Together they make training on any data source -- files on disk, databases, web APIs -- seamless. Having said that, the concepts are straightforward once you see them in action. Here we go!
Dataset: one sample at a time
A PyTorch Dataset is any object that implements __len__ (how many samples) and __getitem__ (get one sample by index). That's the entire interface. If you've been following the Learn Python Series (especially the episodes on classes and dunder methods), this will feel natural -- it's just Python's standard protocol for indexable containers.
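To see how little the interface demands: even a plain Python list of (features, label) tuples satisfies the protocol, so DataLoader accepts it directly. A throwaway illustration, not how you'd structure real code:

```python
import torch
from torch.utils.data import DataLoader

# A plain list already implements __len__ and __getitem__,
# so it counts as a map-style dataset
samples = [(torch.tensor([float(i), float(-i)]), torch.tensor(float(i % 2)))
           for i in range(6)]
loader = DataLoader(samples, batch_size=3)
for X, y in loader:
    print(X.shape, y.shape)  # torch.Size([3, 2]) torch.Size([3])
```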
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
class CircleDataset(Dataset):
    def __init__(self, n_samples=1000):
        np.random.seed(42)
        self.X = np.random.randn(n_samples, 2).astype(np.float32)
        self.y = ((self.X[:, 0]**2 + self.X[:, 1]**2) < 1.5).astype(np.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return torch.tensor(self.X[idx]), torch.tensor(self.y[idx])
dataset = CircleDataset(1000)
print(f"Dataset size: {len(dataset)}")
x, y = dataset[0]
print(f"Sample 0: features={x}, label={y}")
The beauty of this interface: __getitem__ loads one sample at a time. For an image dataset, this means reading one image from disk when requested -- not loading all images into memory upfront. For a text dataset, it means reading one line from a file. The Dataset handles the how; the DataLoader (which we'll see next) handles the when and how many.
This is a pretty powerful pattern. Your dataset could be terabytes on disk, and PyTorch will only load one sample at a time into memory. The code inside __getitem__ can do anything -- open a file, query a database, download from a URL -- as long as it returns a tensor. The interface doesn't care about the data source, only the shape of the output.
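As a sketch of that idea, here's a hypothetical LazyLineDataset that indexes a text file's line offsets once, then reads exactly one line from disk per __getitem__ -- assuming (my choice for this example) that each line holds comma-separated floats with the label in the last column:

```python
import torch
from torch.utils.data import Dataset

class LazyLineDataset(Dataset):
    """Lazy-loading sketch: only byte offsets live in memory;
    each sample is read from disk on demand."""
    def __init__(self, filepath):
        self.filepath = filepath
        self.offsets = []
        with open(filepath, 'rb') as f:  # binary mode: tell() is exact
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # seek straight to the requested line and read only that line
        with open(self.filepath, 'rb') as f:
            f.seek(self.offsets[idx])
            values = [float(v) for v in f.readline().decode().split(',')]
        return torch.tensor(values[:-1]), torch.tensor(values[-1])
```

The file could be gigabytes; memory usage stays proportional to the number of lines, not their content.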
DataLoader: batching and shuffling
The DataLoader wraps a Dataset and produces batches -- the mini-batches we need for SGD training (as we discussed in episode #41 when implementing mini-batch gradient descent from scratch):
train_data = CircleDataset(800)
val_data = CircleDataset(200)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False)
# Iterate over batches
for batch_X, batch_y in train_loader:
    print(f"Batch shape: {batch_X.shape}, labels: {batch_y.shape}")
    break  # just show the first batch
print(f"\nBatches per epoch: {len(train_loader)}")
print(f" = ceil(800 / 32) = 25 batches")
shuffle=True randomizes sample order each epoch -- critical for training because it prevents the model from learning the order of samples instead of the actual patterns. shuffle=False for validation gives you deterministic evaluation results. The batch size controls the noise-vs-accuracy tradeoff from episode #41: smaller batches add more noise (implicit regularization), larger batches give cleaner gradients.
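You can watch the shuffling happen with a toy dataset. Seeding the DataLoader's generator is optional and done here only so the (random) order is reproducible:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.arange(8).float())
loader = DataLoader(data, batch_size=4, shuffle=True,
                    generator=torch.Generator().manual_seed(0))
for epoch in range(2):
    # flatten the batches back into the order the samples were visited
    order = [int(x) for (batch,) in loader for x in batch]
    print(f"Epoch {epoch}: {order}")
# Each epoch visits all 8 samples, in a freshly shuffled order
```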
For large datasets, num_workers=N enables parallel data loading on multiple CPU cores. While the GPU processes one batch, the CPU prepares the next in the background. Add pin_memory=True for faster CPU-to-GPU transfers on CUDA devices. These two flags together can dramatically reduce the time your GPU spends idle waiting for data:
# Production-ready DataLoader configuration
fast_loader = DataLoader(
    train_data,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # 4 CPU workers loading data in parallel
    pin_memory=True,          # faster GPU transfer (if using CUDA)
    drop_last=True,           # drop incomplete final batch
    persistent_workers=True   # keep workers alive between epochs
)
print(f"Workers: {fast_loader.num_workers}")
print(f"Pin memory: {fast_loader.pin_memory}")
print(f"Batches: {len(fast_loader)} (dropped last partial batch)")
drop_last=True discards the final incomplete batch (if 800 samples / 64 = 12.5, you get 12 full batches instead of 12 full + 1 partial). This avoids batch normalization issues with tiny last batches, and the lost samples are seen in other epochs anyway since we shuffle.
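The batch arithmetic is easy to verify directly:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.zeros(800, 2))
keep = DataLoader(data, batch_size=64, drop_last=False)
drop = DataLoader(data, batch_size=64, drop_last=True)
print(len(keep))  # 13 -- ceil(800 / 64): 12 full batches + 1 partial of 32
print(len(drop))  # 12 -- floor(800 / 64): the partial batch is discarded
```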
Custom datasets for real data
Most real datasets aren't synthetic circles ;-) Let's look at a practical example -- building a Dataset from a CSV file, which is probably the most common data format you'll encounter:
import csv
class CSVDataset(Dataset):
    """Load features and labels from a CSV file."""
    def __init__(self, filepath, target_column=-1):
        with open(filepath) as f:
            reader = csv.reader(f)
            header = next(reader)  # skip header
            rows = list(reader)
        # normalize negative indices (e.g. -1 -> last column); without this,
        # the j != target comparison below would never match and the target
        # would leak into the features
        target = target_column % len(header)
        self.X = np.array(
            [[float(v) for j, v in enumerate(row) if j != target]
             for row in rows],
            dtype=np.float32
        )
        self.y = np.array(
            [float(row[target]) for row in rows],
            dtype=np.float32
        )

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return torch.tensor(self.X[idx]), torch.tensor(self.y[idx])
# Usage: dataset = CSVDataset('my_data.csv', target_column=-1)
# Then wrap in DataLoader as before
print("CSVDataset: loads CSV, separates features from target column")
print("Memory: entire CSV in memory (fine for <1GB)")
print("For larger CSVs: load lazily in __getitem__ with linecache or similar")
For images stored in directories (the standard layout where each subdirectory is a class label, like how ImageNet organizes its data), torchvision provides ImageFolder that reads this structure automatically. Combined with transforms.Compose for preprocessing (resize, convert to tensor, normalize), you get a complete image data pipeline in a few lines. We'll use this extensively when we build CNNs in upcoming episodes.
torchvision also provides prepackaged datasets -- MNIST, CIFAR-10, ImageNet -- with automatic downloading. These are invaluable for learning and benchmarking because you don't need to worry about data acquisition at all:
# Just to show the interface (don't actually run if you don't want to download)
from torchvision import datasets, transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
# This would download MNIST (~12MB) and create a Dataset
# mnist_train = datasets.MNIST(root='./data', train=True,
# download=True, transform=transform)
# mnist_loader = DataLoader(mnist_train, batch_size=64, shuffle=True)
print("torchvision.datasets: prepackaged datasets (MNIST, CIFAR-10, etc.)")
print("torchvision.transforms: image preprocessing pipeline")
print("We'll use these heavily in the CNN episodes")
Data augmentation
With limited training data, augmentation creates variety by applying random transformations to each sample. Each epoch, the same image is seen with a different random crop, flip, rotation, or color shift -- effectively multiplying the dataset size. The augmented samples aren't stored anywhere; they're generated on the fly inside the Dataset's __getitem__ method (or via the transform parameter).
For images, torchvision.transforms provides the standard augmentations:
from torchvision import transforms
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
# Validation: no augmentation, just resize and normalize
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
print("Training: random flips, crops, rotations, color jitter")
print("Validation: deterministic resize + center crop only")
print("Normalization values: ImageNet mean/std (the standard)")
Augmentation is applied ONLY during training, never during validation or testing. The validation set measures performance on the data as-is -- you want to know how the model handles real data, not artificially transformed data. This is the same principle as using shuffle=True for training and shuffle=False for validation -- training introduces controlled randomness, evaluation must be deterministic.
For tabular data (like our circle dataset), augmentation is less common but still useful. Adding Gaussian noise to numerical features or randomly masking features during training can serve a similar regularization purpose as dropout (which we covered in episode #40):
class AugmentedCircleDataset(Dataset):
    def __init__(self, n_samples=1000, noise_std=0.1, training=True):
        np.random.seed(42)
        self.X = np.random.randn(n_samples, 2).astype(np.float32)
        self.y = ((self.X[:, 0]**2 + self.X[:, 1]**2) < 1.5).astype(np.float32)
        self.noise_std = noise_std
        self.training = training

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx].copy()
        if self.training:
            x += np.random.randn(2).astype(np.float32) * self.noise_std
        return torch.tensor(x), torch.tensor(self.y[idx])
# Training set with noise, validation without
aug_train = AugmentedCircleDataset(800, noise_std=0.1, training=True)
aug_val = AugmentedCircleDataset(200, noise_std=0.1, training=False)
# Same sample, different noise each time (training)
x1, y1 = aug_train[0]
x2, y2 = aug_train[0]
print(f"Same index, training=True:")
print(f" Call 1: {x1}")
print(f" Call 2: {x2}")
print(f" Difference: {(x1 - x2).abs().sum():.4f} (random noise)")
The complete training loop
Now let's put everything together: Dataset, DataLoader, model, loss, optimizer, learning rate scheduler, and proper train/validation separation. This is the template you'll use for practically every PyTorch project going forward:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
# Data
train_data = CircleDataset(800)
val_data = CircleDataset(200)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False)
# Model
model = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 16), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(16, 1), nn.Sigmoid()
)
# Loss, optimizer, scheduler
loss_fn = nn.BCELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
for epoch in range(50):
    # Training phase
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        pred = model(X_batch).squeeze()
        loss = loss_fn(pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    scheduler.step()
    # Validation phase
    model.eval()
    val_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for X_val, y_val in val_loader:
            val_pred = model(X_val).squeeze()
            val_loss += loss_fn(val_pred, y_val).item()
            correct += ((val_pred > 0.5) == y_val).sum().item()
            total += len(y_val)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:>3d}: train_loss={train_loss/len(train_loader):.4f}, "
              f"val_loss={val_loss/len(val_loader):.4f}, "
              f"val_acc={correct/total:.1%}, "
              f"lr={scheduler.get_last_lr()[0]:.6f}")
A few critical details here. First, model.train() before the training loop and model.eval() before validation. This toggle controls dropout (active during training, disabled during eval) and batch normalization behavior (as we discussed in episode #40). Forgetting model.eval() during validation means dropout randomly zeros activations, giving inconsistent validation metrics. I've seen people spend hours debugging "noisy validation loss" when the entire issue was a missing model.eval() call.
Second, the .squeeze() call on the prediction. Our model outputs shape [batch_size, 1] (because the final layer is nn.Linear(16, 1)), but our labels have shape [batch_size]. The squeeze removes that extra dimension so the shapes match for the loss function. Missing this produces a cryptic broadcasting error or (worse) silently computes the wrong loss.
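A quick shape check makes the hazard concrete -- subtracting a [32] tensor from a [32, 1] tensor silently broadcasts to [32, 32]:

```python
import torch

pred = torch.rand(32, 1)   # model output: [batch_size, 1]
target = torch.rand(32)    # labels: [batch_size]
# Without squeeze, broadcasting expands both operands -- not what we want
print((pred - target).shape)            # torch.Size([32, 32])
print((pred.squeeze() - target).shape)  # torch.Size([32])
```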
Third, notice we're using AdamW here (episode #41's recommended optimizer) with a cosine annealing scheduler (episode #40's recommended schedule). This is the standard modern combo.
Saving and loading checkpoints
Models can take hours or days to train. Save checkpoints so you can resume training after a crash, compare different epochs, or deploy the model for inference:
# Save everything needed to resume training
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'train_loss': train_loss,
    'val_loss': val_loss,
}
torch.save(checkpoint, 'checkpoint.pt')
print(f"Saved checkpoint at epoch {epoch}")
# Load checkpoint and resume
checkpoint = torch.load('checkpoint.pt', weights_only=False)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1
print(f"Resumed from epoch {checkpoint['epoch']}")
print(f" Train loss was: {checkpoint['train_loss']:.4f}")
print(f" Val loss was: {checkpoint['val_loss']:.4f}")
state_dict() returns a dictionary of all learnable parameters. For the model, these are weights and biases. For the optimizer, these are momentum buffers and adaptive learning rates (for Adam/AdamW -- remember from episode #41 that Adam maintains running averages of first and second moments per parameter). For the scheduler, it's the current step count and internal state. Saving all three ensures you can resume training exactly where you left off -- not just the model weights but the optimizer's accumulated state and the scheduler's position in its learning rate curve.
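You can inspect these dictionaries directly. Here's what they contain for a small throwaway model after one AdamW step (the key names shown are what PyTorch produces for an nn.Sequential):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# One training step so AdamW allocates its per-parameter moment buffers
model(torch.randn(8, 2)).sum().backward()
optimizer.step()
print(list(model.state_dict().keys()))
# ['0.weight', '0.bias', '2.weight', '2.bias']
print(sorted(optimizer.state_dict()['state'][0].keys()))
# ['exp_avg', 'exp_avg_sq', 'step'] -- Adam's running moments, per parameter
```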
For deployment (inference only), saving just the model is sufficient:
# Inference-only save (smaller file, no optimizer state)
torch.save(model.state_dict(), 'model_inference.pt')
# Load for inference
model.load_state_dict(torch.load('model_inference.pt', weights_only=True))
model.eval()
with torch.no_grad():
    test_input = torch.randn(1, 2)
    prediction = model(test_input)
print(f"Input: {test_input.squeeze()}")
print(f"Prediction: {prediction.item():.4f}")
A pattern I see in production fairly often: save checkpoints every N epochs during training (so you can resume if the machine dies), and save a final inference-only model at the end. The training checkpoints are large (3x the model size because they include optimizer state) and temporary. The inference checkpoint is small and permanent.
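A minimal sketch of that pattern, built around a hypothetical maybe_checkpoint helper (the name, file layout, and every-N policy are my choices for illustration):

```python
import os
import torch
import torch.nn as nn

def maybe_checkpoint(epoch, model, optimizer, ckpt_dir, every=10):
    """Save a resumable checkpoint every `every` epochs.
    Returns the path written, or None when this epoch is skipped."""
    if epoch % every != 0:
        return None
    path = os.path.join(ckpt_dir, f"ckpt_epoch{epoch:04d}.pt")
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)
    return path
```

Call it once per epoch at the end of the training loop; at the end of training, save the inference-only state dict separately and delete the bulky training checkpoints.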
Mixed precision training
Modern GPUs (NVIDIA Ampere and newer, Apple M-series) have hardware support for float16 operations that run 2x faster and use half the memory compared to float32. Mixed precision training uses float16 for most computations and float32 only where numerical precision matters (loss computation, weight updates):
from torch.amp import autocast, GradScaler
# Create the scaler (handles gradient scaling to prevent underflow)
scaler = GradScaler('cpu')  # pass 'cuda' when training on GPU
model_mp = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(32, 1)  # raw logits: BCELoss is unsafe under autocast,
)                     # so we use BCEWithLogitsLoss instead
loss_fn_mp = nn.BCEWithLogitsLoss()  # autocast-safe, numerically stable
optimizer_mp = torch.optim.AdamW(model_mp.parameters(), lr=0.001)
# Training loop with mixed precision
for epoch in range(5):
    model_mp.train()
    for X_batch, y_batch in train_loader:
        optimizer_mp.zero_grad()
        # Forward pass in reduced precision (faster, less memory)
        with autocast(device_type='cpu'):  # use 'cuda' on GPU
            pred = model_mp(X_batch).squeeze()
            loss = loss_fn_mp(pred, y_batch)
        # Backward pass with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer_mp)
        scaler.update()
    if epoch % 2 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")
print("\nMixed precision: ~2x faster, half the GPU memory")
print("Effectively free performance -- use it whenever hardware supports it")
The autocast context manager tells PyTorch to run operations in float16 where possible. The GradScaler handles a subtle problem: float16 has a much smaller range than float32, so small gradients can underflow to zero. The scaler dynamically multiplies the loss before backward (making gradients larger so they don't underflow), then divides the gradients back before the optimizer step. All of this is transparent to your training logic -- you just wrap the forward pass in autocast and the backward pass in the scaler.
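You can see the underflow -- and the fix -- in a few lines. The factor 65536 (2^16) is GradScaler's default initial scale:

```python
import torch

tiny = torch.tensor(1e-8)          # a small gradient value, fine in float32
print(tiny.half().item())          # 0.0 -- float16 bottoms out near 6e-8
scaled = (tiny * 65536.0).half()   # scale up first, as GradScaler does
print(scaled.item() > 0)           # True -- the scaled value survives
print(scaled.float().item() / 65536.0)  # ~1e-8 recovered after unscaling
```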
(Having said that, on CPU the speedup from mixed precision is minimal -- it really shines on GPUs with dedicated float16 hardware units. But it's good to learn the pattern now because you'll need it the moment you train anything serious on a GPU.)
What we covered -- the quick reference
- Dataset defines how to access one sample via __len__ + __getitem__ -- keeps memory usage constant regardless of dataset size. Write the data loading logic once, DataLoader handles the rest;
- DataLoader handles batching, shuffling, and parallel loading. shuffle=True for training, shuffle=False for validation. num_workers and pin_memory for speed;
- Custom datasets let you load any data format: CSV, images, databases, APIs -- just implement the two dunder methods. For standard image layouts, use torchvision.datasets.ImageFolder;
- Data augmentation (random crops, flips, rotations, noise) creates training variety without storing extra data. Applied only during training, never during validation;
- The training loop: model.train() -> forward -> loss -> zero_grad() -> backward() -> step() -> scheduler.step(). Toggle model.train()/model.eval() for dropout and batch norm;
- Checkpoints save model + optimizer + scheduler state dicts to resume training exactly where you left off. Inference-only saves just the model state dict;
- Mixed precision (autocast + GradScaler) gives roughly 2x speedup and half the memory for free on modern GPUs.
We now have the complete PyTorch training infrastructure: tensors and autograd (episode #42), data pipelines and training loops (this episode). But we've been defining models with nn.Sequential -- which works fine for simple stacks of layers but falls apart the moment you need anything non-sequential in the architecture (skip connections, multiple inputs, branching paths). Building real network architectures requires understanding PyTorch's nn.Module class, and that's where things get properly interesting ;-)
Exercises
Exercise 1: Create a custom RangeDataset that generates n_samples random floats between 0 and 10 as input, and labels each sample 1 if the value is between 3 and 7 (inclusive), 0 otherwise. Wrap it in a DataLoader with batch_size=16 and shuffle=True. Print the first 3 batches with their shapes.
Exercise 2: Using the CircleDataset from this episode, write a complete training loop with: (a) an 80/20 train/val split via separate datasets, (b) a model of your choice, (c) AdamW optimizer with cosine annealing, (d) tracking both train and val loss each epoch. Print the final train and val accuracy after 30 epochs.
Exercise 3: Extend Exercise 2 with checkpointing: save a checkpoint every 10 epochs. After training completes, load the epoch-10 checkpoint into a fresh model and compare its validation accuracy to the final model's validation accuracy. Print both accuracies side by side.