Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd


What will I learn

  • why PyTorch won the deep learning framework war;
  • tensors -- NumPy arrays with GPU support and automatic differentiation;
  • autograd -- PyTorch's automatic differentiation engine that replaces manual backpropagation;
  • rebuilding our episode #39 network in PyTorch -- same architecture, far less code;
  • CPU vs GPU: when to use what;
  • the computational graph and how PyTorch tracks operations;
  • detaching tensors and inference mode for evaluation;
  • the standard PyTorch training loop and workflow.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd

For five episodes now, you've been building neural networks by hand -- writing forward passes, implementing backpropagation, coding entire optimizers from scratch in NumPy. This was deliberate. You understand what happens inside a neural network because you built every component yourself. When you read loss.backward() in someone else's code, you know what that means at the matrix level, because you've manually computed those gradients.

Now we switch to the tool that makes all of this practical: PyTorch. Instead of implementing backpropagation manually, PyTorch computes gradients automatically. Instead of hand-coding Adam (like we did last episode), you call torch.optim.Adam. Instead of reshaping NumPy arrays for batches, you use DataLoaders. The concepts are identical -- PyTorch just handles the plumbing so you can focus on the interesting parts: the architecture, the data, the experiments.

Why PyTorch and not TensorFlow? Because by 2026, the framework war is over. PyTorch won. It dominates research, it powers most production deployments, and virtually every new model -- from GPT-4 to Stable Diffusion -- is built with it. Meta's continued investment and the community momentum make it the clear standard. TensorFlow still exists (and has its strengths in mobile/embedded deployment), but if you're starting fresh today, PyTorch is the one to learn. Here we go!

Tensors: NumPy arrays that do more

A PyTorch tensor is a multi-dimensional array, just like a NumPy array, with two extra capabilities: it can run on GPUs and it can track operations for automatic differentiation. If you've been following this series since episode #2 where we set up NumPy, tensors will feel immediately familiar -- the API is nearly identical.

import torch
import numpy as np

# Creating tensors (familiar if you know NumPy)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3, 4)
c = torch.randn(2, 3)      # random normal distribution
d = torch.arange(10)        # 0 through 9
e = torch.ones(5)           # all ones

print(f"a: {a}")
print(f"  shape: {a.shape}, dtype: {a.dtype}")
print(f"\nb:\n{b}")
print(f"\nc (random):\n{c}")
print(f"\nd: {d}")

The API mirrors NumPy almost exactly: torch.zeros maps to np.zeros, torch.randn maps to np.random.randn, tensor.shape maps to array.shape. If you know NumPy (and after 41 episodes of this series, you certainly do), you already know 90% of the tensor API. Even matrix multiplication uses the same @ operator in both; only the named functions differ (np.dot in NumPy, torch.matmul in PyTorch).

# Operations -- same as NumPy
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

print(f"x + y = {x + y}")
print(f"x * y = {x * y}")           # element-wise
print(f"x @ y = {(x @ y).item()}")  # dot product -> scalar

# Reshaping
m = torch.arange(12).reshape(3, 4)
print(f"\nReshaped:\n{m}")
print(f"Shape: {m.shape}")
print(f"Transpose shape: {m.T.shape}")

# Matrix multiplication
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = A @ B    # (3, 4) @ (4, 5) -> (3, 5)
print(f"\nMatrix multiply: {A.shape} @ {B.shape} = {C.shape}")

NumPy conversion: shared memory

One critical detail about converting between NumPy and PyTorch: torch.from_numpy shares memory with the original NumPy array. Modifying one modifies the other. This is efficient (no copying), but it can bite you if you're not aware of it:

# NumPy <-> PyTorch conversion (shared memory -- no copy!)
np_array = np.array([1.0, 2.0, 3.0])
tensor_from_np = torch.from_numpy(np_array)

print(f"NumPy:   {np_array}")
print(f"Tensor:  {tensor_from_np}")

# Modify the NumPy array -- tensor changes too!
np_array[0] = 999.0
print(f"\nAfter modifying NumPy array:")
print(f"NumPy:   {np_array}")
print(f"Tensor:  {tensor_from_np}")     # also 999!

# Going the other direction
back_to_np = tensor_from_np.numpy()
print(f"\nBack to NumPy: {back_to_np}")

# Use .clone() if you need an independent copy
independent = torch.from_numpy(np_array).clone()
np_array[0] = 1.0
print(f"\nAfter modifying NumPy (with clone):")
print(f"NumPy:  {np_array}")
print(f"Clone:  {independent}")          # still 999!

This shared-memory behavior is by design -- it makes the NumPy-PyTorch bridge zero-cost, which matters when you're loading large datasets. But whenever you need the tensor to be independent of its source array, use .clone(). Having said that, in most practical workflows you'll either start with PyTorch tensors directly or convert once and never touch the NumPy original again, so this rarely causes issues.

Data types: being explicit

PyTorch is more explicit about data types than NumPy. Neural network weights are almost always torch.float32 (32-bit floating point), and you'll occasionally see errors when you accidentally mix float64 (NumPy's default) with float32. Being aware of this saves debugging time:

# PyTorch defaults
print("Default dtypes:")
print(f"  torch.tensor([1.0]):    {torch.tensor([1.0]).dtype}")    # float32
print(f"  torch.tensor([1]):      {torch.tensor([1]).dtype}")      # int64
print(f"  torch.randn(3):        {torch.randn(3).dtype}")         # float32
print(f"  np.array([1.0]).dtype:  {np.array([1.0]).dtype}")        # float64!

# Converting NumPy float64 to PyTorch float32
data_np = np.random.randn(100, 4)   # float64 by default
data_pt = torch.from_numpy(data_np.astype(np.float32))
print(f"\nNumPy dtype: {data_np.dtype}")
print(f"Tensor dtype: {data_pt.dtype}")

# Or use torch.tensor() which copies and can set dtype
data_pt2 = torch.tensor(data_np, dtype=torch.float32)
print(f"Via torch.tensor: {data_pt2.dtype}")

Autograd: automatic differentiation

This is PyTorch's killer feature. Remember episode #39 where we manually computed every single derivative for backpropagation? For a 3-layer network with ReLU activations, that was roughly 50 lines of careful chain-rule math. Now imagine doing that for a 100-layer transformer with attention mechanisms, layer normalization, and residual connections. You'd be debugging derivative code for weeks.

PyTorch's autograd system eliminates all of that. Set requires_grad=True on a tensor, perform operations on it, and PyTorch records every operation in a computational graph. When you call .backward(), it automatically computes the gradient of the output with respect to every tensor that has requires_grad=True -- no manual chain rule, no hand-coded derivatives.

# Simple example: y = x0^2 + 3*x1
x = torch.tensor([2.0, 3.0], requires_grad=True)

y = x[0]**2 + 3 * x[1]   # y = 4 + 9 = 13

y.backward()               # compute dy/dx0 and dy/dx1

print(f"x = {x.data}")
print(f"y = {y.item():.1f}")
print(f"dy/dx = {x.grad}")
print()
print("Verification:")
print(f"  dy/dx0 = 2*x0 = 2*2 = {2*2.0:.1f}  (autograd: {x.grad[0].item():.1f})")
print(f"  dy/dx1 = 3     = {3.0:.1f}    (autograd: {x.grad[1].item():.1f})")

Compare this to episode #39 where we manually computed every derivative. For a 2-variable function, manual differentiation is manageable. For a network with millions of parameters, it's simply impossible to do by hand. Autograd scales to any computational graph, no matter how complex. The .backward() call traverses the computational graph in reverse (hence "backward"), applying the chain rule at each node. The result -- gradients stored in .grad -- is identical to what our hand-coded backpropagation computed. PyTorch just does it automatically ;-)

A slightly more complex example

Let's see autograd handle a function that would require a few lines of chain rule by hand:

# More complex: z = (a*b + c)^2, where a, b, c all have requires_grad
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(1.0, requires_grad=True)

# z = (a*b + c)^2 = (2*3 + 1)^2 = 7^2 = 49
intermediate = a * b + c
z = intermediate ** 2

z.backward()

print(f"a={a.item()}, b={b.item()}, c={c.item()}")
print(f"z = (a*b + c)^2 = {z.item():.1f}")
print()
print(f"dz/da = 2*(a*b+c)*b = 2*7*3 = {2*7*3:.1f}  (autograd: {a.grad.item():.1f})")
print(f"dz/db = 2*(a*b+c)*a = 2*7*2 = {2*7*2:.1f}  (autograd: {b.grad.item():.1f})")
print(f"dz/dc = 2*(a*b+c)*1 = 2*7*1 = {2*7*1:.1f}  (autograd: {c.grad.item():.1f})")

Every gradient matches the manual calculus. Now imagine doing this for a function with 10 million intermediate operations -- autograd handles it identically. The algorithm is the same chain rule we implemented by hand, but automated.

Understanding the computational graph

Every operation on a tensor with requires_grad=True is recorded in a directed acyclic graph (DAG). PyTorch builds this graph dynamically as you run Python code:

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b           # MulBackward0
d = c + a            # AddBackward0
e = d ** 2           # PowBackward0

print(f"e = {e.item()}")
print(f"e.grad_fn = {e.grad_fn}")
print(f"d.grad_fn = {d.grad_fn}")
print(f"c.grad_fn = {c.grad_fn}")

e.backward()
print(f"\nde/da = {a.grad.item()}")
print(f"de/db = {b.grad.item()}")

Each tensor knows which operation created it (you can see it via grad_fn) and which tensors were its inputs. When you call .backward(), PyTorch walks this graph in reverse, applying the chain rule at each node. After the backward pass, the graph is released (freed from memory) -- you get one backward pass per forward pass.
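
You can see the graph being freed firsthand: a second .backward() on the same graph raises a RuntimeError, unless you pass retain_graph=True to keep it alive. A quick sketch -- retaining the graph is rarely needed in everyday training:

# One backward pass per forward pass: the graph is freed after .backward()
a = torch.tensor(2.0, requires_grad=True)
y = a ** 2
y.backward()

try:
    y.backward()    # graph already released
except RuntimeError as e:
    print(f"Second backward fails: {e}")

# retain_graph=True keeps the graph alive for another pass
b = torch.tensor(2.0, requires_grad=True)
z = b ** 2
z.backward(retain_graph=True)
z.backward()                      # gradients accumulate: 4 + 4
print(f"Accumulated gradient: {b.grad.item()}")   # 8.0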

This dynamic graph construction is what originally set PyTorch apart from TensorFlow's static graph approach. (TensorFlow 2.x later adopted eager mode, which is similar, but PyTorch had it from day one.) You build the graph by running normal Python code, using standard if-statements and for-loops. This means debugging a PyTorch model is just debugging Python -- print statements, breakpoints, standard tooling all work. No special graph-compilation step, no weird trace errors, no "Session.run()". Just Python.

# Dynamic graphs: use normal Python control flow
def dynamic_computation(x, power):
    """The graph changes based on the input -- this is normal in PyTorch."""
    result = x
    for i in range(power):
        result = result * x
    return result

x = torch.tensor(2.0, requires_grad=True)

# Different power = different computational graph each time
for p in [2, 3, 4]:
    y = dynamic_computation(x, p)
    y.backward()
    print(f"x^{p+1} at x=2: value={y.item():.0f}, "
          f"gradient={(p+1)*2**p:.0f}, "
          f"autograd={x.grad.item():.0f}")
    x.grad.zero_()  # reset gradients for next iteration

From NumPy to PyTorch: the same network, way less code

Let's rebuild our episode #39 network in PyTorch. Same architecture (2 -> 16 -> 8 -> 1), same problem (classifying points inside a circle), dramatically less code:

import torch.nn as nn

# Define the model -- a few lines instead of an entire class
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

print("Model architecture:")
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")

Compare this to the NeuralNetwork class from episode #38 that took ~40 lines just for the forward pass plus He initialization plus the backward pass from episode #39. nn.Sequential handles architecture definition, weight initialization, forward pass logic, AND gradient computation. All of it, automatically.
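
For comparison, here's a sketch of what the same model looks like as an explicit class -- the nn.Module subclass style you'll see in most real codebases once architectures need branching or custom forward logic (CircleNet is just an illustrative name):

class CircleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.fc3(x))

Both definitions are equivalent; nn.Sequential just saves the boilerplate when the network is a straight pipeline.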

Now let's train it on the same circle-classification problem:

# Data (same as episode #39)
np.random.seed(42)
X_np = np.random.randn(500, 2).astype(np.float32)
y_np = ((X_np[:, 0]**2 + X_np[:, 1]**2) < 1.5).astype(np.float32).reshape(-1, 1)

X = torch.from_numpy(X_np)
y = torch.from_numpy(y_np)

print(f"Dataset: {len(X)} points")
print(f"Inside circle: {int(y.sum().item())}")
print(f"Outside circle: {int(len(y) - y.sum().item())}")

# Loss function and optimizer -- one line each
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(500):
    # Forward pass
    pred = model(X)
    loss = loss_fn(pred, y)

    # Backward pass + weight update (the PyTorch mantra)
    optimizer.zero_grad()  # clear previous gradients
    loss.backward()        # compute gradients (autograd!)
    optimizer.step()       # update weights (Adam!)

    if epoch % 100 == 0:
        acc = ((pred > 0.5) == y).float().mean()
        print(f"Epoch {epoch:>3d}: loss={loss.item():.4f}, acc={acc.item():.1%}")

# Final evaluation
with torch.no_grad():
    pred_final = model(X)
    acc_final = ((pred_final > 0.5) == y).float().mean()
    print(f"\nFinal accuracy: {acc_final.item():.1%}")

That's the entire training loop. Compare it to the 80+ lines of NumPy code from episodes #38 through #41 (the forward pass class, the backward pass, the manual Adam implementation). The forward pass is defined by nn.Sequential (no manual weight matrices). Backpropagation is loss.backward() (no manual derivative computation). Weight updates are optimizer.step() (no hand-coded Adam).

The PyTorch training mantra

Three lines you must remember -- the PyTorch training mantra that appears in every training loop:

optimizer.zero_grad()  # 1. Clear accumulated gradients from previous step
loss.backward()        # 2. Compute new gradients via autograd
optimizer.step()       # 3. Update weights using the optimizer

Forgetting zero_grad() is the single most common PyTorch bug. Gradients accumulate by default (which is actually useful for gradient accumulation across mini-batches -- a technique used when your batch doesn't fit in GPU memory), but if you forget to clear them, you're training on the sum of all past gradients rather than the current batch's gradient. Your loss will behave erratically and you'll spend an hour staring at your code wondering what's wrong, only to discover you forgot one line. Ask me how I know ;-)

(Having said that, once you've been burned by this exactly once, you never forget it again. It becomes muscle memory: zero, backward, step. Zero, backward, step.)
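
For completeness, here's what deliberate gradient accumulation looks like -- the one case where not zeroing after every batch is intentional. A minimal sketch, assuming micro_batches is some iterable of (inputs, targets) pairs:

# Gradient accumulation: sum gradients over 4 small batches, then step once
accumulation_steps = 4
optimizer.zero_grad()
for i, (xb, yb) in enumerate(micro_batches):    # hypothetical iterable
    loss = loss_fn(model(xb), yb) / accumulation_steps  # scale so the sum matches one big batch
    loss.backward()               # gradients add up in .grad across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()          # one weight update per 4 micro-batches
        optimizer.zero_grad()     # only now do we clear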

CPU vs GPU

Tensors can live on CPU or GPU. Moving computations to a GPU provides massive speedup for matrix operations -- the kind neural networks are made of. GPUs have thousands of simple cores designed for exactly this type of parallel arithmetic:

# Check what's available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")

# On Apple Silicon Macs
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
    print(f"Using Apple Silicon GPU (MPS)")

Moving data and models between devices is straightforward -- everything gets a .to(device) call:

# Move model and data to device
model_gpu = model.to(device)
X_device = X.to(device)
y_device = y.to(device)

# Training loop is identical -- just data and model are on GPU now
pred = model_gpu(X_device)
loss = loss_fn(pred, y_device)

print(f"Prediction device: {pred.device}")
print(f"Loss: {loss.item():.4f}")

When does GPU actually matter? For our 500-sample, 2-feature problem, CPU is actually faster. The overhead of transferring data to GPU exceeds the computation speedup. For real-world tasks -- thousands of samples, hundreds of features, deep networks with millions of parameters -- GPU provides 10-100x speedup. The rule of thumb: if your training loop takes more than a few seconds per epoch on CPU, try GPU. For the kind of models we'll build in the next dozen episodes (CNNs, RNNs, transformers), GPU acceleration goes from "nice to have" to "absolutely essential".
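
Want to see the crossover on your own machine? A rough benchmark sketch (the matrix size and the timings will vary with your hardware):

import time

size = 4096
A_big = torch.randn(size, size)
B_big = torch.randn(size, size)

start = time.perf_counter()
_ = A_big @ B_big
print(f"CPU: {time.perf_counter() - start:.3f}s")

if torch.cuda.is_available():
    A_gpu, B_gpu = A_big.to('cuda'), B_big.to('cuda')
    torch.cuda.synchronize()      # wait for the transfer to complete
    start = time.perf_counter()
    _ = A_gpu @ B_gpu
    torch.cuda.synchronize()      # GPU ops are asynchronous -- sync before timing
    print(f"GPU: {time.perf_counter() - start:.3f}s")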

On Apple Silicon Macs, use torch.device('mps') instead of 'cuda' -- PyTorch supports Metal Performance Shaders for GPU acceleration. It's not as fast as a dedicated NVIDIA card, but for learning and prototyping it's perfectly adequate.

Detaching and inference mode

Not every operation should be tracked. During evaluation (inference), you don't need gradients -- tracking them wastes memory and computation. PyTorch provides two mechanisms to skip gradient tracking:

# Method 1: torch.no_grad() context manager
model.eval()  # set model to evaluation mode (disables dropout, etc.)

with torch.no_grad():
    test_pred = model(X)
    # No computational graph built -- faster, less memory
    test_acc = ((test_pred > 0.5) == y).float().mean()
    print(f"Test accuracy (no_grad): {test_acc.item():.1%}")
    print(f"  requires_grad: {test_pred.requires_grad}")  # False!

model.train()  # back to training mode

# Method 2: detach a tensor from the graph
hidden_output = model[:2](X)           # first two layers (Linear + ReLU)
hidden_detached = hidden_output.detach()  # no gradient connection

print(f"\nHidden output requires_grad: {hidden_output.requires_grad}")
print(f"Detached requires_grad:      {hidden_detached.requires_grad}")
print(f"Shape: {hidden_detached.shape}")

Always wrap evaluation and inference in torch.no_grad(). For training, leave it off. The model.eval() / model.train() toggle is important too -- it doesn't affect gradient tracking, but it switches behaviors like dropout (episode #40: dropout is active during training, disabled during evaluation) and batch normalization (uses batch statistics during training, running statistics during evaluation). Forgetting model.eval() during inference is another common bug -- your model will produce slightly different outputs each time because dropout is randomly zeroing neurons.

# Common evaluation pattern
def evaluate(model, X, y):
    model.eval()
    with torch.no_grad():
        predictions = model(X)
        loss = loss_fn(predictions, y)
        accuracy = ((predictions > 0.5) == y).float().mean()
    model.train()  # always switch back!
    return loss.item(), accuracy.item()

eval_loss, eval_acc = evaluate(model, X, y)
print(f"Evaluation: loss={eval_loss:.4f}, accuracy={eval_acc:.1%}")

Inspecting model parameters

When debugging or understanding a model, you'll often want to look at the actual weight values, their shapes, and their gradient magnitudes. PyTorch makes this straightforward:

# List all parameters
print("Model parameters:")
for name, param in model.named_parameters():
    print(f"  {name:>10s}: shape={str(list(param.shape)):>10s}, "
          f"dtype={param.dtype}, "
          f"requires_grad={param.requires_grad}")

print(f"\nTotal trainable parameters: "
      f"{sum(p.numel() for p in model.parameters() if p.requires_grad)}")

# Peek at actual values
first_layer_weight = model[0].weight
print(f"\nFirst layer weight shape: {first_layer_weight.shape}")
print(f"First layer weight stats: "
      f"mean={first_layer_weight.data.mean():.4f}, "
      f"std={first_layer_weight.data.std():.4f}")
print(f"First layer bias: {model[0].bias.data}")

NB: notice that nn.Linear(2, 16) creates a weight matrix of shape [16, 2] (not [2, 16]). PyTorch uses the convention where the weight matrix is [out_features, in_features], so the forward pass computes output = input @ W.T + b. This is transposed from the convention we used in our from-scratch implementation in episode #38, but the math is identical -- just a notational choice.
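
You can verify the convention with a quick sanity check (layer and x_sample are throwaway names for this sketch):

layer = nn.Linear(2, 16)
x_sample = torch.randn(5, 2)

# nn.Linear computes input @ W.T + b, with W of shape [out_features, in_features]
manual = x_sample @ layer.weight.T + layer.bias    # [5, 2] @ [2, 16] -> [5, 16]
print(torch.allclose(layer(x_sample), manual))     # True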

The standard PyTorch workflow

Every PyTorch project follows the same five-step structure, regardless of whether you're building a tiny classifier or training a billion-parameter language model:

# Step 1: Define the model
model = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)

# Step 2: Define the loss function
loss_fn = nn.BCELoss()        # binary cross-entropy (episode #39)
# Other common losses:
# nn.CrossEntropyLoss()       -- multi-class classification
# nn.MSELoss()                -- regression (episode #10)

# Step 3: Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Other common optimizers (episode #41):
# torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Step 4: Training loop
n_epochs = 300
for epoch in range(n_epochs):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 60 == 0:
        acc = ((pred > 0.5) == y).float().mean()
        print(f"Epoch {epoch:>3d}: loss={loss.item():.4f}, acc={acc.item():.1%}")

# Step 5: Evaluation
model.eval()
with torch.no_grad():
    final_pred = model(X)
    final_acc = ((final_pred > 0.5) == y).float().mean()
    print(f"\nFinal: {final_acc.item():.1%} accuracy")
model.train()

This structure scales from our 500-sample toy problem to training GPT on trillions of tokens. The architecture changes, the data pipeline grows complex, the training loop adds distributed computing and gradient accumulation and mixed precision -- but the five steps remain the same. Define model, pick loss, pick optimizer, train loop, evaluate.

Saving and loading models

Once you've trained a model, you probably want to keep it around. PyTorch gives you two ways to serialize models:

# Save just the learned parameters (recommended)
torch.save(model.state_dict(), '/tmp/model_weights.pth')
print("Saved model weights")

# Load them back
model_loaded = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid()
)
model_loaded.load_state_dict(torch.load('/tmp/model_weights.pth',
                                         weights_only=True))
model_loaded.eval()

# Verify it works
with torch.no_grad():
    loaded_pred = model_loaded(X)
    loaded_acc = ((loaded_pred > 0.5) == y).float().mean()
    print(f"Loaded model accuracy: {loaded_acc.item():.1%}")

# Compare predictions -- should be identical
with torch.no_grad():
    original_pred = model(X)
    diff = (original_pred - loaded_pred).abs().max()
    print(f"Max prediction difference: {diff.item():.10f}")

The recommended approach is saving state_dict() (just the parameters) rather than the entire model object. This is more portable and less fragile -- saving the entire model uses Python's pickle, which ties your saved file to the exact class definitions and module structure at save time. Saving just the weights lets you reconstruct the architecture in code and then load the learned values.
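
For contrast, the discouraged whole-object alternative looks like this -- a sketch, shown mainly so you recognize it in older code:

# Saving the entire model object (pickles the class -- fragile across refactors)
torch.save(model, '/tmp/model_full.pth')

# Loading it back requires weights_only=False, since arbitrary objects get unpickled
model_full = torch.load('/tmp/model_full.pth', weights_only=False)
model_full.eval()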

What to take away

Here's everything we covered, distilled:

  • PyTorch tensors are NumPy arrays with GPU support and automatic differentiation -- the API is almost identical, and torch.from_numpy converts between them (shared memory -- use .clone() for independence);
  • Autograd records operations on tensors with requires_grad=True and computes gradients automatically via .backward(). This replaces the entire manual backpropagation we built in episode #39;
  • The training mantra: zero_grad() -> backward() -> step(). Forgetting zero_grad() accumulates stale gradients -- the single most common PyTorch bug;
  • nn.Sequential defines a model, nn.BCELoss / nn.CrossEntropyLoss define the loss, torch.optim.Adam / torch.optim.AdamW handle optimization -- each replaces dozens of lines of our from-scratch code;
  • GPU (cuda or mps) provides 10-100x speedup for large-scale training. Use CPU for small problems, GPU when training takes more than a few seconds per epoch;
  • Dynamic computational graphs mean you debug PyTorch models with standard Python tools -- print, breakpoints, inspect. No graph compilation step required;
  • Wrap evaluation in torch.no_grad() and call model.eval() to skip gradient computation and disable training-time behaviors like dropout;
  • Save model weights with torch.save(model.state_dict(), path) and load with model.load_state_dict().

We've covered tensors, autograd, and the basic training workflow -- the foundation of everything we'll build going forward. But we've been feeding the entire dataset to the model in one go, which doesn't scale. Real training needs mini-batches (as we discussed in episode #41), proper train/validation/test splits, and data loading utilities that handle all the shuffling and batching automatically. PyTorch has built-in tools for all of this, and that's where things get practical fast ;-)

Exercises

Exercise 1: Create a PyTorch tensor x with values [1.0, 2.0, 3.0, 4.0] and requires_grad=True. Compute y = sum(x^3) (the sum of cubes). Call .backward() and print the gradients. Verify by hand that the gradient of sum(x^3) with respect to each x_i is 3 * x_i^2.

Exercise 2: Build an nn.Sequential model with architecture 4 -> 32 -> 16 -> 1 (using ReLU for hidden layers and Sigmoid for the output). Generate 1000 random 4-dimensional points, label them 1 if the sum of their squared components is less than 2.0, and 0 otherwise. Train the model for 500 epochs with Adam (lr=0.01) and print the final accuracy.

Exercise 3: Take the model from Exercise 2 and save its weights to a file. Create a new model with the same architecture, load the weights back, and verify that both models produce identical predictions on the same input data. Print the maximum absolute difference between their predictions (it should be zero or extremely close to zero).

Greetings!

@scipio


