Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra

What will I learn
- You will learn why linear algebra is the language that machine learning speaks;
- what vectors and matrices actually represent in the context of data;
- the dot product -- the single most important operation in all of ML;
- matrix multiplication and what it means geometrically;
- how NumPy makes all of this fast and practical;
- broadcasting -- NumPy's most powerful (and most confusing) feature;
- the transpose and why you'll use it constantly for shape wrangling.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra (this post)
Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
Seven episodes. Seven episodes of building intuition, writing code, and watching models learn. We've gone from "what is ML?" all the way through constructing a complete training loop from scratch in episode #7. You watched loss numbers drop in your terminal. You saw parameters crawl toward the correct values. You even built early stopping. All of that with just two parameters -- a slope and an intercept.
Now here's the thing I've been building toward: when you move from two parameters to two hundred, or two thousand, or two million, writing out individual gradient formulas by hand stops being practical. You need a language for talking about collections of numbers and operations on those collections all at once. That language is linear algebra. And I have good news -- you've already been using it without knowing it ;-)
Every time you created a NumPy array in episodes #2 through #7, you were creating a vector or a matrix. Every time you wrote (price - predictions) ** 2 across a whole dataset, you were doing vectorized linear algebra. The pieces are already in your head. Today we're giving them proper names and showing you the handful of operations that power ALL of machine learning. Not a textbook worth of linear algebra -- just the parts you'll actually use.
Let's go.
Vectors: where your data lives
A vector is an ordered list of numbers. That's the whole definition. If you followed episode #3 where we turned apartment data into arrays of features, you were making vectors the entire time.
import numpy as np
# A 3-dimensional vector
v = np.array([3.0, 1.0, 4.0])
print(v) # [3. 1. 4.]
print(v.shape) # (3,)
Three numbers, three dimensions. But here's what makes vectors powerful in ML: they represent something. A vector is not just a list -- it's a point in space, or a direction, or a description of an entity.
Remember our apartments from episode #3? Each apartment was a row of features:
# An apartment described by 4 features
house = np.array([120.0, 3.0, 2.0, 1985.0])
# [square_meters, bedrooms, bathrooms, year_built]
print(f"This apartment lives in {len(house)}-dimensional space")
Each number is a feature -- a measurable property. Together they form a point in 4-dimensional space. We can't visualize 4 dimensions (my brain certainly can't, and I've tried), but the math doesn't care. It works identically whether you have 4 dimensions or 4,000. The same distance calculations, the same dot products, the same everything.
Back in episode #4 when we computed Euclidean distances between apartments to find nearest neighbors -- that was a vector operation. We were computing distances between points in multi-dimensional space. The notation was just np.sqrt(((X - new_apartment) ** 2).sum(axis=1)). Underneath, that's a textbook linear algebra operation on vectors.
Vector arithmetic: element-wise operations
Vectors support arithmetic that operates element by element. This is the foundation of how NumPy makes code fast and readable:
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# Addition: each element adds
print(a + b) # [5. 7. 9.]
# Subtraction: each element subtracts
print(a - b) # [-3. -3. -3.]
# Multiplication: element-wise (NOT dot product!)
print(a * b) # [ 4. 10. 18.]
# Scalar multiplication: scale every element
print(3 * a) # [3. 6. 9.]
That element-wise multiplication is important to distinguish from the dot product (coming up next). a * b gives you a new vector. The dot product gives you a single number. They're fundamentally different operations, and mixing them up is a classic source of bugs. I've done it, you'll do it, and the error messages are not always helpful. Ask me how I know ;-)
Vector norms: measuring length
The norm of a vector is its length -- how far the tip is from the origin. The most common norm (called the L2 norm or Euclidean norm) is exactly the Pythagorean theorem extended to N dimensions:
v = np.array([3.0, 4.0])
# Manual: sqrt(3^2 + 4^2) = sqrt(9 + 16) = sqrt(25) = 5
manual_norm = np.sqrt(np.sum(v ** 2))
print(f"Manual norm: {manual_norm}") # 5.0
# NumPy's built-in
print(f"NumPy norm: {np.linalg.norm(v)}") # 5.0
# Works for any dimensionality
v3d = np.array([1.0, 2.0, 3.0])
print(f"3D norm: {np.linalg.norm(v3d):.4f}") # 3.7417
Why does this matter for ML? Because norms show up everywhere. The distance between two vectors is the norm of their difference. Regularization (which prevents models from learning weights that are too large) uses the norm of the weight vector. And when we normalize vectors to unit length (norm = 1), we get something extremely useful for measuring direction -- which leads us straight to the dot product.
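Here's a minimal sketch of that first point -- the distance between two data points is just the norm of their difference vector (the numbers below are made up for illustration):
# Distance between two apartments = norm of their difference vector
apt_a = np.array([120.0, 3.0, 2.0])
apt_b = np.array([85.0, 2.0, 1.0])
diff = apt_a - apt_b
print(f"Difference: {diff}") # [35. 1. 1.]
print(f"Distance: {np.linalg.norm(diff):.4f}") # 35.0286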
The dot product: the single most important operation in ML
I'm going to make a strong claim here. If I had to pick ONE mathematical operation that underlies all of machine learning, it's the dot product. Linear regression? Dot product. Every layer in a neural network? Dot products. Attention mechanisms in transformers? Dot products. Similarity search in embedding databases? Dot product. The gradient formulas from episodes #6 and #7? Full of dot products.
The operation is simple: multiply corresponding elements and add them up.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# Manual dot product: 1*4 + 2*5 + 3*6
result = a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
print(f"Manual: {result}") # 32.0
# NumPy dot product
print(f"np.dot: {np.dot(a, b)}") # 32.0
# The @ operator (modern Pythonic way)
print(f"a @ b: {a @ b}") # 32.0
The @ operator was introduced in Python 3.5 specifically for matrix and vector multiplication. You'll see it everywhere in ML code, and it's my preferred syntax because it's clean and reads naturally: X @ weights is easier on the eyes than np.dot(X, weights).
What does the dot product actually measure?
This is where the intuition gets really powerful. The dot product measures how much two vectors point in the same direction. Think of it geometrically:
- Two vectors pointing the same way -> large positive dot product
- Two vectors at right angles (perpendicular) -> dot product of zero
- Two vectors pointing opposite ways -> large negative dot product
Let's see this in action:
# Same direction
north = np.array([0.0, 1.0])
also_north = np.array([0.0, 2.0])
print(f"Same direction: {north @ also_north}") # 2.0 (positive)
# Perpendicular
east = np.array([1.0, 0.0])
print(f"Perpendicular: {north @ east}") # 0.0
# Opposite direction
south = np.array([0.0, -1.0])
print(f"Opposite direction: {north @ south}") # -1.0 (negative)
# Somewhere in between
northeast = np.array([1.0, 1.0])
print(f"45 degrees: {north @ northeast}") # 1.0
Now think about what this means for ML. In episode #7, our training loop computed slope * sqm + intercept for each data point. With multiple features, that becomes a dot product: weights @ features + bias. The prediction is literally "how well do the features align with the learned weights?" Features that the model considers important get large weights. Features it considers irrelevant get weights near zero. The dot product combines all of this into a single prediction. Boom, just like that.
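Here's that idea as a tiny sketch. The weights below are ones I made up purely for illustration -- a real model would learn them from data:
# Hypothetical weights for [square_meters, bedrooms, bathrooms, year_built]
weights = np.array([1500.0, 10000.0, 5000.0, 20.0])
bias = 15000.0
house = np.array([120.0, 3.0, 2.0, 1985.0])
prediction = weights @ house + bias
print(f"Predicted price: {prediction:,.0f}") # 274,700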
Cosine similarity: the normalized dot product
Raw dot products are affected by vector magnitude. A vector that's twice as long gives a dot product twice as big, regardless of direction. To isolate the directional similarity -- just how much two vectors point the same way, ignoring their lengths -- we normalize:
def cosine_similarity(a, b):
    """Cosine of the angle between two vectors. Range: -1 to +1."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([2.0, 0.0, 2.0]) # Same direction, double the magnitude
v3 = np.array([0.0, 1.0, 0.0]) # Completely different direction
print(f"v1 vs v2 (same direction): {cosine_similarity(v1, v2):.4f}") # 1.0
print(f"v1 vs v3 (perpendicular): {cosine_similarity(v1, v3):.4f}") # 0.0
# Partially similar vectors
v4 = np.array([1.0, 1.0, 1.0])
print(f"v1 vs v4 (partial match): {cosine_similarity(v1, v4):.4f}") # 0.8165
This is exactly how semantic search and recommendation systems work. When a search engine finds "relevant" documents, it's computing cosine similarity between query embeddings and document embeddings -- both are vectors in some high-dimensional space, and the cosine tells you how well they match. The math you just learned is behind billion-dollar products. Not bad for "multiply and add" ;-)
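To see that in miniature, here's a toy sketch using made-up 3-dimensional "embeddings" (real ones have hundreds or thousands of dimensions) and the cosine_similarity function we just wrote:
# Made-up "document embeddings" -- stand-ins for real high-dimensional vectors
query = np.array([0.9, 0.1, 0.3])
documents = {
    "doc_a": np.array([0.8, 0.2, 0.4]),
    "doc_b": np.array([0.1, 0.9, 0.2]),
    "doc_c": np.array([0.7, 0.0, 0.5]),
}
# Rank documents by how well they align with the query
ranked = sorted(documents, key=lambda name: cosine_similarity(query, documents[name]), reverse=True)
for name in ranked:
    print(f"{name}: {cosine_similarity(query, documents[name]):.4f}")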
Matrices: organizing many vectors
A matrix is a 2D grid of numbers. In ML, you'll encounter it most often as the data matrix: each row is one data point, each column is one feature. Remember this convention from episode #3?
# 5 apartments, 4 features each
apartments = np.array([
[120.0, 3.0, 2.0, 1985.0],
[85.0, 2.0, 1.0, 2001.0],
[200.0, 5.0, 3.0, 1960.0],
[65.0, 2.0, 1.0, 2015.0],
[150.0, 4.0, 2.0, 1978.0],
])
print(f"Shape: {apartments.shape}") # (5, 4) -- 5 samples, 4 features
print(f"Rows (samples): {apartments.shape[0]}")
print(f"Columns (features): {apartments.shape[1]}")
Shape is everything. I cannot stress this enough. When something goes wrong in ML code -- and it will, regularly -- it's usually a shape mismatch. Get in the habit of checking .shape obsessively. I mentioned this back in episode #3 and I'll keep saying it because it's saved me more debugging time than any other single habit.
The convention is universal: (n_samples, n_features). Rows are data points. Columns are features. Scikit-learn expects this. PyTorch expects this. TensorFlow expects this. If your data is transposed (features as rows, samples as columns), everything will either crash with a cryptic error or -- worse -- run silently and give you garbage results.
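One defensive habit that helps (a personal pattern, not something any library requires -- the helper and its name are just for this sketch): assert the expected layout as soon as the data arrives, so a transposed matrix fails loudly at the door instead of deep inside the model:
def check_data_matrix(X, n_features):
    """Fail fast if X is not laid out as (n_samples, n_features)."""
    assert X.ndim == 2, f"Expected a 2D data matrix, got {X.ndim}D"
    assert X.shape[1] == n_features, f"Expected {n_features} features per row, got shape {X.shape}"
check_data_matrix(apartments, n_features=4) # passes silently
# check_data_matrix(apartments.T, n_features=4) # would fail: shape is (4, 5)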
Special matrices you'll encounter
There are a few matrix shapes that come up over and over in ML code:
# Zeros: useful for initialization
zeros = np.zeros((3, 4))
print(f"Zeros:\n{zeros}\n")
# Ones: useful for bias tricks and broadcasting
ones = np.ones((2, 3))
print(f"Ones:\n{ones}\n")
# Identity: the "do nothing" matrix (diagonal of 1s)
identity = np.eye(4)
print(f"Identity:\n{identity}\n")
# Random normal: common for weight initialization
np.random.seed(42)
random_weights = np.random.randn(3, 5)
print(f"Random weights shape: {random_weights.shape}")
print(f"Random weights:\n{random_weights.round(3)}")
The identity matrix is particularly interesting. Multiplying any vector or matrix by the identity gives you back the original -- it's the matrix equivalent of multiplying by 1. It shows up in regularization, in solving linear systems, and as a starting point for certain optimization algorithms. You'll see np.eye() pop up in unexpected places.
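You can verify the "multiplying by 1" behaviour in a couple of lines:
v = np.array([3.0, 1.0, 4.0])
I = np.eye(3)
print(I @ v) # [3. 1. 4.] -- unchanged
print(np.allclose(I @ v, v)) # True
m = np.random.randn(3, 3)
print(np.allclose(np.eye(3) @ m, m)) # True -- same story for matrices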
Matrix multiplication: transforming data
This is the workhorse operation of neural networks. Every layer in a neural network is essentially: multiply the input matrix by a weight matrix, add a bias, apply a nonlinear activation. That first step -- the matrix multiply -- is where the heavy computation lives.
The rule: for A @ B to work, the inner dimensions must match. If A is shape (m, n) and B is shape (n, p), the result is shape (m, p). Here's a mnemonic that I've found invaluable:
(m, n) @ (n, p) -> (m, p)
^^^
these must match
The inner dimensions disappear -- they get "consumed" by the multiplication. The outer dimensions survive to form the result shape.
# 3 samples, 4 features each
X = np.array([
[1.0, 2.0, 3.0, 4.0],
[5.0, 6.0, 7.0, 8.0],
[9.0, 10.0, 11.0, 12.0],
])
print(f"X shape: {X.shape}") # (3, 4)
# Weight matrix: 4 inputs -> 2 outputs
W = np.array([
[0.1, 0.5],
[0.2, 0.6],
[0.3, 0.7],
[0.4, 0.8],
])
print(f"W shape: {W.shape}") # (4, 2)
# Matrix multiply: (3, 4) @ (4, 2) -> (3, 2)
output = X @ W
print(f"Output shape: {output.shape}") # (3, 2)
print(f"Output:\n{output}")
What just happened geometrically? We transformed our data from a 4-dimensional feature space to a 2-dimensional output space. Each column in the weight matrix defines a new "axis" in the output space. The matrix multiplication projects every data point onto these new axes.
This is what neural networks do at every layer. They project data into new representation spaces where the patterns become easier to detect. The "learning" part is finding the right weight values -- the projection that makes the patterns most visible. We'll build this up properly when we get to neural network architecture, but the foundation is right here: matrix multiply, add bias, activate. Everything else is details.
What matrix multiplication actually computes
Each element in the output is a dot product between a row of the left matrix and a column of the right matrix. Let me trace through one element so you can see:
# Output[0, 0] = dot product of X row 0 and W column 0
row_0 = X[0] # [1, 2, 3, 4]
col_0 = W[:, 0] # [0.1, 0.2, 0.3, 0.4]
manual = row_0 @ col_0 # 1*0.1 + 2*0.2 + 3*0.3 + 4*0.4 = 3.0
print(f"X[0]: {row_0}")
print(f"W[:, 0]: {col_0}")
print(f"Manual dot: {manual}")
print(f"From matmul: {output[0, 0]}")
print(f"Match: {manual == output[0, 0]}")
# Output[1, 1] = dot product of X row 1 and W column 1
row_1 = X[1] # [5, 6, 7, 8]
col_1 = W[:, 1] # [0.5, 0.6, 0.7, 0.8]
manual2 = row_1 @ col_1
print(f"\nX[1]: {row_1}")
print(f"W[:, 1]: {col_1}")
print(f"Manual dot: {manual2}")
print(f"From matmul: {output[1, 1]}")
So matrix multiplication is really just a batch of dot products, organized efficiently. Each sample gets dot-producted with each output column. The fact that we can express this as a single @ operation (instead of nested for-loops) is what makes both the notation clean and the computation fast. NumPy delegates this to highly optimized BLAS libraries written in C and Fortran, and the speed difference versus Python loops is not 2x or 5x -- it's often 100x or more.
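If you want to feel that speed difference yourself, here's a quick sketch comparing a pure-Python triple loop against @ (the slow_matmul helper exists only for this comparison, and the exact numbers depend on your machine, but the gap is dramatic):
import time
A = np.random.randn(200, 300)
B = np.random.randn(300, 100)
def slow_matmul(A, B):
    """Matrix multiply with plain Python loops -- for comparison only."""
    m, n = A.shape
    _, p = B.shape
    out = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            total = 0.0
            for k in range(n):
                total += A[i, k] * B[k, j]
            out[i, j] = total
    return out
start = time.perf_counter()
slow = slow_matmul(A, B)
loop_seconds = time.perf_counter() - start
start = time.perf_counter()
fast = A @ B
numpy_seconds = time.perf_counter() - start
print(f"Python loops: {loop_seconds:.3f}s")
print(f"NumPy @: {numpy_seconds:.5f}s")
print(f"Same result: {np.allclose(slow, fast)}")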
A neural network layer in 3 lines
Let me show you why all of this matters by connecting it directly to where we're heading. A single layer of a neural network is:
def neural_layer(X, W, b):
    """One neural network layer: linear transform + bias."""
    return X @ W + b
# Example: 4 samples, 3 features -> 2 outputs
X = np.array([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[0.5, 1.5, 2.5],
])
W = np.array([
[0.1, 0.4],
[0.2, 0.5],
[0.3, 0.6],
])
b = np.array([0.01, 0.02])
output = neural_layer(X, W, b)
print(f"Input shape: {X.shape}") # (4, 3)
print(f"Weight shape: {W.shape}") # (3, 2)
print(f"Bias shape: {b.shape}") # (2,)
print(f"Output shape: {output.shape}") # (4, 2)
print(f"Output:\n{output}")
Matrix multiply (X @ W) does the linear transformation. Adding b is the bias -- and notice how broadcasting (coming up in the next section) handles the fact that b has shape (2,) while X @ W has shape (4, 2). The bias gets added to every row automatically. If you stack two of these layers with a nonlinear function in between, you have a neural network. No kidding.
def simple_network(X, W1, b1, W2, b2):
    """A tiny 2-layer neural network."""
    # Layer 1: linear + ReLU activation
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1) # ReLU: replace negatives with zero
    # Layer 2: linear (output layer)
    z2 = a1 @ W2 + b2
    return z2
# Random weights for demonstration
np.random.seed(7)
W1 = np.random.randn(3, 4) * 0.5 # 3 inputs -> 4 hidden
b1 = np.zeros(4)
W2 = np.random.randn(4, 2) * 0.5 # 4 hidden -> 2 outputs
b2 = np.zeros(2)
result = simple_network(X, W1, b1, W2, b2)
print(f"Network input: {X.shape}") # (4, 3)
print(f"Network output: {result.shape}") # (4, 2)
print(f"Output:\n{result.round(4)}")
Eight lines for a complete two-layer neural network forward pass. X @ W1 + b1 -- matrix multiply, broadcast bias. np.maximum(0, z1) -- ReLU activation (the simplest nonlinearity: just clamp negatives to zero). Then another @ W2 + b2 for the output layer. Every concept in this episode -- vectors, matrices, dot products, matrix multiplication, broadcasting -- is right there in those eight lines.
We're NOT building neural networks yet (that's further ahead in the series), but I want you to see exactly where the math goes. The linear algebra isn't separate from the engineering. It IS the engineering.
Reshaping: the glue between operations
In ML, data rarely arrives in the exact shape your model expects. Reshaping is how you fix that -- rearranging the same numbers into a different grid layout without changing any values.
# A flat vector of 12 numbers
v = np.arange(1, 13)
print(f"Original: {v}")
print(f"Shape: {v.shape}") # (12,)
# Reshape to 3 rows, 4 columns
m1 = v.reshape(3, 4)
print(f"\nReshaped to (3, 4):\n{m1}")
# Reshape to 4 rows, 3 columns
m2 = v.reshape(4, 3)
print(f"\nReshaped to (4, 3):\n{m2}")
# The -1 trick: "figure out this dimension automatically"
m3 = v.reshape(2, -1) # 2 rows, NumPy figures out 6 columns
print(f"\nReshaped to (2, -1) = {m3.shape}:\n{m3}")
m4 = v.reshape(-1, 4) # NumPy figures out 3 rows, 4 columns
print(f"\nReshaped to (-1, 4) = {m4.shape}:\n{m4}")
The -1 trick is invaluable. You use it constantly in practice because you often know one dimension (like "I want 4 features per row") but the number of samples depends on your data. Let NumPy do the arithmetic.
A common pattern: converting a 1D prediction vector into a column vector for matrix operations:
# Predictions as a 1D vector
predictions = np.array([100.0, 200.0, 150.0, 250.0])
print(f"1D shape: {predictions.shape}") # (4,)
# Reshape to a column vector (4 rows, 1 column)
col = predictions.reshape(-1, 1)
print(f"Column shape: {col.shape}") # (4, 1)
print(f"Column:\n{col}")
This reshape between (n,) and (n, 1) is something you'll do regularly. A 1D array is ambiguous -- is it a row or a column? Reshaping to (n, 1) makes it explicitly a column, which matters for matrix multiplication shape matching.
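For completeness: NumPy also lets you do this with np.newaxis in an indexing expression. Both forms are common in the wild -- use whichever reads better to you:
col_a = predictions.reshape(-1, 1)
col_b = predictions[:, np.newaxis] # same result, indexing style
print(col_a.shape, col_b.shape) # (4, 1) (4, 1)
print(np.array_equal(col_a, col_b)) # True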
Transpose: flipping rows and columns
The transpose swaps rows and columns. In NumPy, it's just .T:
m = np.array([
[1, 2, 3],
[4, 5, 6],
])
print(f"Original shape: {m.shape}") # (2, 3)
print(f"Original:\n{m}\n")
print(f"Transposed shape: {m.T.shape}") # (3, 2)
print(f"Transposed:\n{m.T}")
You'll use transpose constantly to make shapes compatible for matrix multiplication. The most common scenario: you have two matrices and the inner dimensions don't match. Transposing one of them often fixes it.
# Problem: (3, 4) @ (3, 4) doesn't work -- inner dims 4 and 3 don't match
A = np.random.randn(3, 4)
B = np.random.randn(3, 4)
# Solution: transpose B -> (4, 3), then (3, 4) @ (4, 3) -> (3, 3)
result = A @ B.T
print(f"A shape: {A.shape}") # (3, 4)
print(f"B.T shape: {B.T.shape}") # (4, 3)
print(f"A @ B.T shape: {result.shape}") # (3, 3)
Having said that, not every shape problem should be solved with a transpose. Sometimes your data is genuinely in the wrong layout and needs a reshape instead. The distinction: transpose reorders existing dimensions (swaps axes), reshape rearranges how elements are packed into dimensions (changes the grid structure). They look similar but do very different things. When in doubt, print the shapes before and after, and verify the output makes sense.
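Here's a small demonstration of that difference -- both calls produce a (3, 2) result, but the elements end up in different places:
m = np.arange(1, 7).reshape(2, 3)
print(f"Original:\n{m}\n")
# [[1 2 3]
#  [4 5 6]]
print(f"Transpose (axes swapped):\n{m.T}\n")
# [[1 4]
#  [2 5]
#  [3 6]]
print(f"Reshape (same reading order, new grid):\n{m.reshape(3, 2)}")
# [[1 2]
#  [3 4]
#  [5 6]]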
Broadcasting: powerful but tricky
Broadcasting is NumPy's way of handling operations between arrays of different shapes. When shapes don't match exactly, NumPy tries to "stretch" the smaller array to fit the bigger one. This is incredibly useful -- it's how bias addition works in neural networks, how you normalize features, and how you subtract the mean from a dataset -- but it's also the source of subtle bugs if you don't understand the rules.
# Adding a scalar to a matrix -- scalar gets "broadcast" to every element
m = np.array([
[1, 2, 3],
[4, 5, 6],
])
print(f"m + 10:\n{m + 10}\n")
# [[11 12 13]
# [14 15 16]]
# Adding a row vector to each row -- vector gets broadcast across rows
row_bias = np.array([100, 200, 300])
print(f"m + row_bias:\n{m + row_bias}\n")
# [[101 202 303]
# [104 205 306]]
That second example is exactly how bias addition works in a neural network layer. After the matrix multiply X @ W, you add a bias vector b to every sample. Broadcasting makes this a one-liner instead of a loop. It's one of the reasons NumPy code is so concise compared to what you'd write in plain Python.
The broadcasting rule
NumPy compares shapes element-wise from the trailing (rightmost) dimensions. Two dimensions are compatible if they're equal or one of them is 1:
# Shape (2, 3) + shape (3,) works:
# 2, 3
# 3 <-- trailing dims match
# Shape (2, 3) + shape (2, 1) works:
# 2, 3
# 2, 1 <-- 2 matches 2, 1 broadcasts to 3
# Shape (2, 3) + shape (2,) FAILS:
# 2, 3
# 2 <-- 2 != 3, neither is 1 -> ValueError
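The same three cases as runnable code:
a = np.ones((2, 3))
print((a + np.array([10.0, 20.0, 30.0])).shape) # (2, 3) -- (3,) stretches across rows
print((a + np.ones((2, 1))).shape) # (2, 3) -- (2, 1) stretches across columns
try:
    a + np.ones(2) # (2,) vs trailing dim 3 -> incompatible
except ValueError as e:
    print(f"ValueError: {e}")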
Let me show you a practical example that's directly relevant to what we'll be doing when we build models from scratch. Suppose you want to subtract the column means from a data matrix (this is called centering, and it's the first step of standardization):
# 5 apartments, 3 features
data = np.array([
[120.0, 3.0, 2.0],
[85.0, 2.0, 1.0],
[200.0, 5.0, 3.0],
[65.0, 2.0, 1.0],
[150.0, 4.0, 2.0],
])
# Column means -- shape (3,)
means = data.mean(axis=0)
print(f"Data shape: {data.shape}") # (5, 3)
print(f"Means shape: {means.shape}") # (3,)
print(f"Means: {means}")
# Broadcasting: (5, 3) - (3,) -> (5, 3)
centered = data - means
print(f"\nCentered data:\n{centered}")
print(f"Column means after centering: {centered.mean(axis=0)}") # ~[0, 0, 0]
That data - means line is broadcasting in action. The means vector (3,) gets stretched across all 5 rows automatically. Without broadcasting, you'd need a for-loop or an explicit np.tile() call. With broadcasting, it's clean, fast, and readable.
When broadcasting goes wrong
The dangerous part: when shapes are compatible but wrong, broadcasting produces results silently instead of raising an error. You get numbers out, but they're garbage:
# Intended: subtract each column's mean from that column
# What if we accidentally compute row means instead?
row_means = data.mean(axis=1) # shape (5,) -- one mean per ROW
# col_means = data.mean(axis=0) # shape (3,) -- one mean per COLUMN
print(f"data shape: {data.shape}") # (5, 3)
print(f"row_means shape: {row_means.shape}") # (5,)
# (5, 3) - (5,) -> trailing dims: 3 vs 5 -> DOES raise an error!
try:
    bad = data - row_means
except ValueError as e:
    print(f"Caught error: {e}")
# But if we reshape row_means to (5, 1), it broadcasts "successfully"
# across columns -- giving us a DIFFERENT subtraction than intended
row_means_col = row_means.reshape(-1, 1) # shape (5, 1)
wrong_result = data - row_means_col
print(f"\nWrong centering (row means subtracted per row):\n{wrong_result}")
print(f"Column means: {wrong_result.mean(axis=0)}") # NOT zeros!
See? The code runs fine. The shapes are compatible. But the operation is semantically wrong -- we subtracted row averages when we meant to subtract column averages. The columns are NOT centered around zero. This kind of bug is insidious because there's no error message, no crash, nothing to alert you. The only defense is knowing what shape you expect and checking it. Print .shape compulsively. I've been writing NumPy code for years and I still do this ;-)
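One pattern I like for this (a personal habit, nothing more): assert both the shape and a cheap sanity property right after the operation, so a silent broadcasting mistake becomes a loud one:
centered = data - data.mean(axis=0)
assert centered.shape == data.shape, f"Unexpected shape: {centered.shape}"
assert np.allclose(centered.mean(axis=0), 0.0), "Columns are not centered!"
print("Centering checks passed")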
Putting it all together: linear regression in matrix form
Let me connect everything in this episode back to what we built in episodes #6 and #7. Remember our training loop? It computed predictions as slope * sqm + intercept and gradients as -2 * (sqm * errors).mean() and -2 * errors.mean(). With two parameters that's manageable. But what about 50 features? You'd need 51 separate gradient formulas. That's not going to scale.
With linear algebra, the entire model becomes one line:
# 6 apartments, 3 features: [sqm, rooms, floor]
X_raw = np.array([
[65, 2, 3],
[82, 3, 1],
[45, 1, 5],
[120, 4, 2],
[55, 2, 4],
[90, 3, 3],
], dtype=np.float64)
# Add a column of 1s for the intercept (bias trick)
ones_col = np.ones((X_raw.shape[0], 1))
X = np.hstack([X_raw, ones_col])
print(f"X with bias column:\n{X}")
print(f"X shape: {X.shape}") # (6, 4) -- 3 features + 1 bias
# Prices (target)
y = np.array([185000, 240000, 130000, 350000, 165000, 260000], dtype=np.float64)
# Weights: one per feature + one for bias
w = np.zeros(4)
# Prediction: just a matrix-vector multiply!
predictions = X @ w
print(f"\nPredictions (all zeros, because w is zero): {predictions}")
The "bias trick" -- adding a column of ones to X -- lets us absorb the intercept into the weight vector. In stead of writing slope * sqm + weight_rooms * rooms + weight_floor * floor + intercept, we write X @ w where the last element of w serves as the intercept (because it multiplies the column of ones). Cleaner. Fewer special cases. The same @ operation handles everything.
And the gradient? Also one line of linear algebra:
# The gradient of MSE with respect to weights
# (The derivation comes in the next episode -- for now, trust the formula)
errors = y - predictions
gradient = -2 / len(y) * (X.T @ errors)
print(f"Gradient shape: {gradient.shape}") # (4,) -- one per weight
print(f"Gradient: {gradient}")
X.T @ errors -- that's a transpose and a matrix-vector multiply. One expression computes the gradient for ALL weights simultaneously. Compare this to episode #7 where we had separate gradient computations for slope and intercept. With linear algebra, 2 parameters or 2,000 -- the code is identical.
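If you want to convince yourself the matrix form matches the episode-#7 formulas, check a single entry by hand:
# Episode #7 style: gradient for the sqm weight, one feature at a time
sqm = X[:, 0]
grad_sqm = -2 * (sqm * errors).mean()
print(f"By hand: {grad_sqm:.1f}")
print(f"From X.T @ errors: {gradient[0]:.1f}") # same number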
Let's run a quick training loop in matrix form to see it all working together:
# Mini training loop -- matrix-style
w = np.zeros(4)
lr = 1e-10 # Small because feature magnitudes vary widely
print(f"{'Epoch':>5s} {'Loss':>14s} {'Weights':>40s}")
print("-" * 65)
for epoch in range(200):
    pred = X @ w
    errors = y - pred
    loss = (errors ** 2).mean()
    grad = -2 / len(y) * (X.T @ errors)
    w = w - lr * grad
    if epoch % 40 == 0 or epoch == 199:
        w_str = np.array2string(w, precision=1, separator=', ')
        print(f"{epoch:>5d} {loss:>14,.0f} {w_str}")
print(f"\nLearned weights: {w.round(1)}")
print(f" sqm weight: {w[0]:.1f}")
print(f" rooms weight: {w[1]:.1f}")
print(f" floor weight: {w[2]:.1f}")
print(f" intercept: {w[3]:.1f}")
Same four-step training loop from episode #7 -- predict, compute loss, compute gradient, update -- but now generalized to handle any number of features. The structure is identical. Only the implementation changed from scalar arithmetic to linear algebra. And that's the whole point: linear algebra is not new math. It's the same operations you already know, expressed in a way that scales to any number of dimensions.
What to carry forward
- Vectors represent data points in multi-dimensional space -- the (n_samples, n_features) convention from episode #3 is a matrix of row vectors;
- The dot product measures directional similarity and is the fundamental operation behind linear models, neural networks, attention mechanisms, and embedding search;
- Matrix multiplication transforms data from one space to another -- this is literally what every neural network layer does (project, add bias, activate);
- The shape rule (m, n) @ (n, p) -> (m, p) saves countless headaches -- inner dimensions must match, outer dimensions form the result;
- Broadcasting handles operations between different-shaped arrays but can silently produce wrong results -- check .shape obsessively;
- The transpose .T swaps rows and columns and is your primary tool for making shapes compatible;
- Reshaping rearranges elements into new grid layouts -- the -1 trick lets NumPy figure out one dimension automatically;
- The training loop from episode #7 generalizes naturally to multiple features when you express it in matrix form: X @ w for predictions, X.T @ errors for gradients. Same loop, different notation, any number of dimensions.
Next up: the other half of the math foundation -- derivatives (how loss changes when you wiggle a parameter) and probability (how to reason about uncertainty). After those two pieces we'll have everything we need to build linear regression properly from scratch, with the math fully transparent and the code fully general. The climb continues, but we're building on solid ground ;-)