Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works

What will I learn
- You will learn why accuracy alone is a terrible metric for most real-world problems;
- the confusion matrix -- the foundation that every classification metric is built from;
- precision, recall, and the fundamental tradeoff between them;
- the F1 score and when it's the right (and wrong) metric to use;
- AUC-ROC curves -- how to evaluate a classifier across all thresholds at once;
- regression metrics beyond MSE: R-squared, MAE, and RMSE in context;
- cross-validation -- why a single train/test split is not enough and how K-fold fixes it;
- how to choose the right metric for your specific problem.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works (this post)
At the end of episode #12 I left you with a question. We had just built a logistic regression classifier from scratch -- sigmoid activation, cross-entropy loss, gradient descent, the whole thing. It made predictions. It classified students as pass or fail, patients as high-risk or low-risk. And we measured "accuracy" -- the fraction of correct predictions. Simple. Intuitive. Done, right?
Not even close.
Here's the dirty secret of ML that took me an embarrassingly long time to internalize: accuracy is one of the most misleading metrics in all of machine learning. It looks reasonable. It sounds reasonable. And in certain specific situations it IS reasonable. But for the majority of real-world classification problems, accuracy will lie to your face with a confident smile and you won't even know it. Today I'm going to show you exactly how and why it lies, and then we're going to build the proper evaluation toolkit -- the metrics that professionals actually use when they need to know if a model works.
This episode is less about building models and more about measuring them. That might sound less exciting than the from-scratch implementations in episodes #10 and #12. I'd argue it's MORE important. A bad model that you know is bad can be fixed. A bad model that you think is good? That's the one that causes real damage ;-)
Let's go.
When accuracy lies to your face
Picture this. You're building a fraud detection system for a bank. Out of every 10,000 transactions, roughly 50 are fraudulent. That's 0.5% fraud, 99.5% legitimate. Now someone hands you a "model" -- and I'm using that word very loosely here -- that predicts "not fraud" for every single transaction. Every. Single. One. It never flags anything. It does absolutely nothing. Let's see what accuracy says about this masterpiece:
import numpy as np
np.random.seed(42)
n = 10000
# Imbalanced dataset: 0.5% fraud
y_true = np.zeros(n)
y_true[:50] = 1 # 50 fraudulent transactions out of 10,000
np.random.shuffle(y_true)
# The world's laziest "model": always predict not fraud
y_pred_lazy = np.zeros(n)
accuracy = np.mean(y_pred_lazy == y_true)
print(f"Dataset: {n} transactions, {int(y_true.sum())} fraudulent ({y_true.mean():.1%})")
print(f"Model prediction: ALWAYS 'not fraud'")
print(f"Accuracy: {accuracy:.1%}")
print(f"Frauds caught: {int(np.sum((y_pred_lazy == 1) & (y_true == 1)))}/{int(y_true.sum())}")
99.5% accuracy. Catches zero fraud. The model does nothing, and accuracy rewards it with a near-perfect score. This is why accuracy is dangerous on imbalanced data.
And here's the thing -- most real-world problems have imbalanced classes. Fraud is rare. Diseases are (thankfully) uncommon. Manufacturing defects are infrequent. Spam is common now but it wasn't always. Churn prediction, anomaly detection, rare event forecasting -- imbalanced classes everywhere. If you only look at accuracy, you'll convince yourself your model is working when it's actually just predicting the majority class and ignoring the thing you care about.
So let's build something better. Something that can't be fooled by a model that does nothing.
The confusion matrix: where all metrics begin
Every binary classification prediction falls into exactly one of four categories. And I mean exactly one -- they're mutually exclusive and exhaustive. Understanding these four buckets is understanding evaluation, because every single classification metric you'll ever encounter is derived from them.
def confusion_matrix(y_true, y_pred):
    """Compute the four cells of the confusion matrix."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # True Positive
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # False Positive
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # False Negative
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # True Negative
    return tp, fp, fn, tn
# Let's create a model that actually TRIES to detect fraud
# (not great, but at least it does something)
scores = np.random.randn(n) + y_true * 3 # fraudulent transactions get higher scores
y_pred = (scores > 1.5).astype(int)
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:")
print(f" Predicted")
print(f" Fraud Legit")
print(f" Actual Fraud {tp:>5d} {fn:>5d}")
print(f" Actual Legit {fp:>5d} {tn:>5d}")
print(f"")
print(f" TP = {tp:>4d} (correctly flagged fraud)")
print(f" FP = {fp:>4d} (false alarm -- legit transaction flagged)")
print(f" FN = {fn:>4d} (MISSED fraud -- this is the expensive one)")
print(f" TN = {tn:>4d} (correctly cleared)")
print(f"")
print(f" Accuracy: {(tp + tn) / n:.1%}")
Let me be explicit about what each cell means, because getting these confused is a classic source of bugs:
- True Positive (TP): You predicted fraud, and it WAS fraud. The wins. This is what you're here for.
- False Positive (FP): You predicted fraud, but it was actually legitimate. The false alarms. The customer whose credit card got blocked while buying groceries because the model panicked.
- False Negative (FN): You predicted legitimate, but it was actually fraud. The misses. The stolen card that slipped through because the model didn't catch it. In fraud detection this is the expensive error -- the one that costs actual money.
- True Negative (TN): You predicted legitimate, and it was legitimate. Correctly ignored. The vast majority of transactions.
Notice how accuracy is just (TP + TN) / total. In our imbalanced fraud example, TN is enormous (9,950 legitimate transactions correctly classified as legitimate), which drowns out everything else. The tiny number of frauds that got missed? Barely moves the accuracy needle. But those missed frauds are the entire point of building the model.
The confusion matrix breaks through this illusion by giving you the full picture. Four numbers instead of one. And from these four numbers, we can compute every metric that matters.
Precision: "when I raise an alarm, should you believe me?"
Precision answers a specific question: of all the things I flagged as positive, what fraction actually were positive?
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
print(f"Precision: {precision:.3f}")
print(f" Interpretation: when the model flags a transaction as fraud,")
print(f" it's actually fraud {precision:.0%} of the time.")
print(f" ({tp} true alerts out of {tp + fp} total alerts)")
High precision means the model's alerts are trustworthy. When it says "fraud," it means fraud. Few false alarms. A spam filter with high precision rarely dumps legitimate email into the spam folder -- when it says "spam," you can trust it.
But precision alone doesn't tell you about the cases the model missed entirely. A model that only flags one transaction -- and gets it right -- has 100% precision. It also missed 49 out of 50 frauds. Precision doesn't penalize missed cases. That's what recall is for.
Recall: "did I catch everything that matters?"
Recall (also called sensitivity or true positive rate) answers the complementary question: of all the actual positives, what fraction did I find?
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
print(f"Recall: {recall:.3f}")
print(f" Interpretation: out of {tp + fn} actual fraud cases,")
print(f" the model caught {tp} ({recall:.0%}).")
print(f" It missed {fn} fraudulent transactions.")
High recall means few misses. A cancer screening test with high recall catches nearly every case of cancer. You don't want to send a patient home with undiagnosed cancer because the test missed it. That's a false negative, and in medical screening, false negatives can be fatal.
But high recall has a cost: to catch everything, the model might flag lots of things that aren't actually positive. The cancer test might send half of all healthy patients for unnecessary biopsies. High recall, low precision.
The precision-recall tradeoff
And here's the fundamental tension that I hinted at in episode #12 when we discussed threshold selection: improving precision usually hurts recall, and vice versa. They're opposing forces. Pushing one up tends to push the other down.
Remember from episode #12 -- our logistic regression model outputs probabilities, and we turn those into class labels by choosing a threshold. Everything above the threshold becomes class 1, everything below becomes class 0. The threshold is the knob that controls this tradeoff. Let me show you concretely:
print(f"{'Threshold':>10s} {'Precision':>10s} {'Recall':>10s} {'Flagged':>10s}")
print("-" * 48)
for threshold in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]:
    preds = (scores > threshold).astype(int)
    tp_t, fp_t, fn_t, tn_t = confusion_matrix(y_true, preds)
    prec = tp_t / (tp_t + fp_t) if (tp_t + fp_t) > 0 else 0
    rec = tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0
    print(f" {threshold:>8.1f} {prec:>10.3f} {rec:>10.3f} {int(preds.sum()):>10d}")
Look at what happens. Low threshold (0.0): the model flags almost everything. Recall is high -- it catches most of the fraud. But precision is terrible because it's also flagging thousands of legitimate transactions. High threshold (3.5): the model flags almost nothing. Precision is high -- the few things it does flag are probably real fraud. But recall drops because it's missing most of the actual cases.
Where you set the threshold depends entirely on the relative cost of each error type. And this is a business decision, not a statistical one:
- Fraud detection: missing fraud costs real money. A false alarm costs a phone call to the customer ("did you make this purchase?"). False negatives are expensive, false positives are cheap. Optimize for recall -- lower the threshold.
- Email spam filter: a false positive (real email goes to spam) could cost you a business deal. A false negative (spam gets through) is annoying but harmless. False positives are expensive, false negatives are cheap. Optimize for precision -- raise the threshold.
- Cancer screening: missing cancer (FN) is potentially fatal. An unnecessary follow-up test (FP) costs time and stress but isn't deadly. Optimize for recall -- set the threshold LOW.
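Since the threshold is a cost decision, you can make the tradeoff explicit by attaching a price to each error type and sweeping thresholds for the cheapest one. A minimal sketch: the EUR 500 per missed fraud and EUR 5 per false alarm below are made-up assumptions for illustration, and the fraud data is regenerated so the snippet runs on its own:

```python
import numpy as np

# Hypothetical costs (assumptions, not real bank figures): a missed fraud
# costs EUR 500 on average, a false alarm costs EUR 5 in support time.
COST_FN = 500.0
COST_FP = 5.0

np.random.seed(42)
n = 10_000
y_true = np.zeros(n)
y_true[:50] = 1                        # 0.5% fraud, as in the example above
np.random.shuffle(y_true)
scores = np.random.randn(n) + y_true * 3

best_t, best_cost = None, float("inf")
for t in np.arange(-1.0, 4.0, 0.1):
    preds = (scores > t).astype(int)
    fn = np.sum((preds == 0) & (y_true == 1))   # missed frauds
    fp = np.sum((preds == 1) & (y_true == 0))   # false alarms
    cost = fn * COST_FN + fp * COST_FP          # total cost at this threshold
    if cost < best_cost:
        best_t, best_cost = t, cost

print(f"Cheapest threshold: {best_t:.1f} (total cost EUR {best_cost:,.0f})")
```

Change the two cost constants and watch the optimal threshold move: expensive false negatives push it down, expensive false positives push it up.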
This is the context I was planting seeds for at the end of episode #12 when we swept different thresholds and watched the numbers change. Now you have the vocabulary: precision, recall, and the deliberate tradeoff between them.
F1 score: when you need a single number
Sometimes you need to collapse precision and recall into one number for comparison. "Model A has precision 0.85 and recall 0.72. Model B has precision 0.78 and recall 0.81. Which is better?" Depends on the problem, yes. But if precision and recall are roughly equally important, the F1 score gives you a principled way to combine them:
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Find the threshold that maximizes F1 on our fraud data
best_f1 = 0
best_threshold = 0
thresholds = np.arange(-2, 5, 0.05)
f1_values = []
for threshold in thresholds:
    preds = (scores > threshold).astype(int)
    tp_t, fp_t, fn_t, tn_t = confusion_matrix(y_true, preds)
    prec = tp_t / (tp_t + fp_t) if (tp_t + fp_t) > 0 else 0
    rec = tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0
    f1 = f1_score(prec, rec)
    f1_values.append(f1)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold
print(f"Best F1: {best_f1:.3f} at threshold {best_threshold:.2f}")
# Show the precision/recall at that optimal F1 threshold
preds_best = (scores > best_threshold).astype(int)
tp_b, fp_b, fn_b, tn_b = confusion_matrix(y_true, preds_best)
prec_b = tp_b / (tp_b + fp_b) if (tp_b + fp_b) > 0 else 0
rec_b = tp_b / (tp_b + fn_b) if (tp_b + fn_b) > 0 else 0
print(f"At best F1: precision={prec_b:.3f}, recall={rec_b:.3f}")
The F1 score is the harmonic mean of precision and recall. Why harmonic mean and not arithmetic mean? Because the harmonic mean penalizes imbalance more severely. A model with 99% precision and 1% recall gets an arithmetic mean of 50% -- looks acceptable. The harmonic mean gives F1 = 0.02 -- correctly revealing that the model is garbage. You can't game the F1 by doing well on one metric while completely ignoring the other.
The formula: F1 = 2 * P * R / (P + R). Properties worth remembering:
- If either precision or recall is zero, F1 is zero. You can't hide a complete failure.
- F1 reaches 1.0 only if BOTH precision and recall are perfect.
- F1 always lies between 0 and 1, same as its components.
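The harmonic-versus-arithmetic comparison is easy to check in a few lines; the 0.99/0.01 pair below is the same made-up example from above:

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):
    """Same formula as F1: 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# A degenerate model: near-perfect precision, almost no recall.
p, r = 0.99, 0.01
print(f"Arithmetic mean: {arithmetic_mean(p, r):.3f}")    # looks acceptable
print(f"Harmonic mean (F1): {harmonic_mean(p, r):.3f}")   # reveals the failure
```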
Having said that, F1 treats precision and recall as equally important. If they're NOT equally important in your problem (and they usually aren't -- go back to the fraud vs spam examples), F1 might not be the right choice. There's a generalized version called F-beta where you can weight recall higher (beta > 1) or precision higher (beta < 1), but for now, knowing F1 is a solid foundation.
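For completeness, here's a minimal sketch of that generalized F-beta score. The 0.60/0.90 precision/recall pair is an arbitrary example, not from this post's data:

```python
def f_beta(precision, recall, beta=1.0):
    """Generalized F-score: beta > 1 weights recall higher, beta < 1 precision.
    Reduces to the regular F1 when beta = 1."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

p, r = 0.60, 0.90   # example numbers: mediocre precision, high recall
print(f"F1   = {f_beta(p, r, beta=1.0):.3f}")
print(f"F2   = {f_beta(p, r, beta=2.0):.3f}  (rewards the high recall)")
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}  (punishes the low precision)")
```

With beta = 2 the score tracks recall more closely; with beta = 0.5 it tracks precision.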
AUC-ROC: the full picture
Every metric we've discussed so far depends on a specific threshold. Change the threshold, change the metric. That makes comparison tricky -- maybe Model A is better at threshold 0.5 but worse at threshold 0.3. Is A better than B, or did you just pick a threshold that favors it?
The ROC curve (Receiver Operating Characteristic -- terrible name, comes from radar signal processing in World War II, long story) sidesteps this by evaluating the model at ALL thresholds simultaneously. It plots the True Positive Rate (which is just recall) against the False Positive Rate at every possible threshold:
def compute_roc(y_true, scores, n_thresholds=200):
    """Compute ROC curve points."""
    thresholds = np.linspace(scores.max() + 0.1, scores.min() - 0.1, n_thresholds)
    tpr_list = []
    fpr_list = []
    for t in thresholds:
        preds = (scores >= t).astype(int)
        tp_r, fp_r, fn_r, tn_r = confusion_matrix(y_true, preds)
        tpr = tp_r / (tp_r + fn_r) if (tp_r + fn_r) > 0 else 0
        fpr = fp_r / (fp_r + tn_r) if (fp_r + tn_r) > 0 else 0
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return np.array(fpr_list), np.array(tpr_list)
fpr, tpr = compute_roc(y_true, scores)
# AUC: area under the ROC curve
# Using the trapezoidal rule -- same numerical integration idea from calculus
# (note: np.trapz was renamed np.trapezoid in NumPy 2.0)
auc = np.trapz(tpr, fpr)
print(f"AUC-ROC: {auc:.4f}")
print(f"")
print(f"Interpretation guide:")
print(f" AUC = 0.50 --> random guessing (coin flip)")
print(f" AUC = 0.70 --> poor but better than nothing")
print(f" AUC = 0.80 --> acceptable")
print(f" AUC = 0.90 --> good")
print(f" AUC = 0.95+ --> excellent")
print(f" AUC = 1.00 --> perfect separation (suspicious -- check for data leakage)")
The AUC (Area Under the Curve) compresses the entire ROC curve into a single number. It has a beautiful probabilistic interpretation: AUC is the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. An AUC of 0.90 means that if you picked a random fraud case and a random legitimate case, there's a 90% chance the model gives the fraud case a higher score.
This makes AUC ideal for comparing models -- the model with higher AUC is better at separating the classes, regardless of what threshold you later choose. You first pick the model with the best AUC, THEN choose a threshold based on your precision/recall requirements.
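You can check this ranking interpretation directly: estimate the fraction of (positive, negative) pairs where the positive example gets the higher score. A sketch on made-up data (the 10% positive rate and 1.5 score separation are arbitrary choices for illustration):

```python
import numpy as np

np.random.seed(0)
n = 2000
y = (np.random.rand(n) < 0.1).astype(int)     # ~10% positives (synthetic)
scores = np.random.randn(n) + y * 1.5         # positives score higher on average

# AUC as a ranking probability: over all (positive, negative) pairs,
# how often does the positive outrank the negative? Ties count as half.
pos = scores[y == 1]
neg = scores[y == 0]
diff = pos[:, None] - neg[None, :]            # all pairwise score differences
auc_rank = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(f"AUC via pairwise ranking: {auc_rank:.4f}")
```

On the same scores, this brute-force pairwise estimate matches the trapezoidal area under the ROC curve up to discretization error -- they are two computations of the same quantity.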
Let me show you the actual ROC curve data points so you can see the shape:
# Show sampled points along the ROC curve
print(f"\nROC curve (sampled points):")
print(f"{'FPR':>8s} {'TPR':>8s}")
step = max(1, len(fpr) // 12)
for i in range(0, len(fpr), step):
    print(f" {fpr[i]:>6.3f} {tpr[i]:>6.3f}")
# Compare: what would a random model look like?
random_scores = np.random.randn(n)
fpr_rand, tpr_rand = compute_roc(y_true, random_scores)
auc_rand = np.trapz(tpr_rand, fpr_rand)
print(f"\nRandom model AUC: {auc_rand:.4f} (should be close to 0.50)")
print(f"Our model AUC: {auc:.4f}")
A perfect model's ROC curve goes straight up to (0, 1) and then across to (1, 1) -- AUC = 1.0. A random model's ROC curve is the diagonal from (0, 0) to (1, 1) -- AUC = 0.5. Our model should be somewhere in between, with the curve bowing upward toward the top-left corner. The more it bows, the better.
One caveat: AUC-ROC can be misleading on extremely imbalanced datasets (like our 0.5% fraud example) because the False Positive Rate uses TN in the denominator, and TN is massive. A better alternative in heavily imbalanced settings is the precision-recall curve and its AUC. But for most practical purposes, AUC-ROC is the standard and it works well.
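As a sketch of that alternative, here's one common way to approximate the area under the precision-recall curve: average precision, computed by ranking all examples by score and averaging the precision at each positive's rank. The fraud data is regenerated here so the snippet is self-contained:

```python
import numpy as np

np.random.seed(42)
n = 10_000
y = np.zeros(n)
y[:50] = 1                                   # 0.5% positives, like the fraud example
np.random.shuffle(y)
scores = np.random.randn(n) + y * 3

# Rank every example by score, highest first.
order = np.argsort(-scores)
y_sorted = y[order]

# If we flag the top k items: tp_cum[k-1] true positives among them.
tp_cum = np.cumsum(y_sorted)
k = np.arange(1, n + 1)
precision = tp_cum / k                       # precision when flagging top k
recall = tp_cum / y.sum()                    # recall when flagging top k

# Average precision: mean of the precision values at each positive's rank.
ap = np.sum(precision * y_sorted) / y.sum()
print(f"Average precision (PR-AUC approximation): {ap:.3f}")
```

The baseline here is the positive rate itself (0.005 for a random ranker), so average precision is far more sensitive than ROC-AUC to how the model handles the rare class.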
Cross-validation: don't trust a single split
Here's something that's been quietly bothering me since episode #10 (and it should have been bothering you too). When we evaluate a model, we do a single random 80/20 train/test split and report the test accuracy. But what if we got lucky with that particular split? What if all the "easy" examples happened to land in the test set? Or all the "hard" examples? Our performance estimate could be wildly optimistic or pessimistic based on pure chance.
K-fold cross-validation fixes this by splitting the data into K equal parts (called folds), training on K-1 folds, testing on the remaining fold, and rotating which fold is held out. You end up with K performance estimates instead of one, and you report both the mean and the standard deviation.
def logistic_train_eval(X_train, y_train, X_test, y_test,
                        lr=0.1, epochs=500, threshold=0.5):
    """Train logistic regression and return test accuracy + F1."""
    # Scale (fit on train only -- episode #11's critical rule)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1
    X_tr_s = (X_train - mean) / std
    X_te_s = (X_test - mean) / std
    # Add bias column
    X_tr_b = np.column_stack([X_tr_s, np.ones(len(X_tr_s))])
    X_te_b = np.column_stack([X_te_s, np.ones(len(X_te_s))])
    # Train logistic regression (same as episode #12)
    w = np.zeros(X_tr_b.shape[1])
    for _ in range(epochs):
        z = X_tr_b @ w
        y_hat = 1 / (1 + np.exp(-z))
        gradient = (1 / len(y_train)) * X_tr_b.T @ (y_hat - y_train)
        w -= lr * gradient
    # Evaluate
    z_test = X_te_b @ w
    proba = 1 / (1 + np.exp(-z_test))
    preds = (proba >= threshold).astype(int)
    acc = np.mean(preds == y_test)
    tp = np.sum((preds == 1) & (y_test == 1))
    fp = np.sum((preds == 1) & (y_test == 0))
    fn = np.sum((preds == 0) & (y_test == 1))
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    return acc, f1

def cross_validate(X, y, k=5, lr=0.1, epochs=500):
    """K-fold cross-validation for logistic regression."""
    n = len(y)
    indices = np.random.permutation(n)
    fold_size = n // k
    acc_scores = []
    f1_scores = []
    for fold in range(k):
        # Carve out the test fold
        test_start = fold * fold_size
        test_end = test_start + fold_size
        test_idx = indices[test_start:test_end]
        train_idx = np.concatenate([indices[:test_start], indices[test_end:]])
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        acc, f1 = logistic_train_eval(X_train, y_train, X_test, y_test,
                                      lr=lr, epochs=epochs)
        acc_scores.append(acc)
        f1_scores.append(f1)
        print(f" Fold {fold + 1}: accuracy={acc:.3f} F1={f1:.3f}")
    print(f"\n Accuracy: {np.mean(acc_scores):.3f} +/- {np.std(acc_scores):.3f}")
    print(f" F1: {np.mean(f1_scores):.3f} +/- {np.std(f1_scores):.3f}")
    return acc_scores, f1_scores
# Generate a classification dataset with clear signal
np.random.seed(42)
n = 300
X_cv = np.random.randn(n, 4)
y_cv = ((X_cv[:, 0] + 0.5 * X_cv[:, 1] - X_cv[:, 2]
+ np.random.randn(n) * 0.5) > 0).astype(float)
print(f"Dataset: {n} samples, {int(y_cv.sum())} positive ({y_cv.mean():.1%})")
print(f"\n5-Fold Cross-Validation:\n")
cv_acc, cv_f1 = cross_validate(X_cv, y_cv, k=5)
That +/- number is the standard deviation across folds. A model with accuracy 0.82 +/- 0.02 is stable -- it performs consistently regardless of which data ends up in the test fold. A model with 0.82 +/- 0.12 is unreliable -- its performance swings wildly depending on the split. The standard deviation tells you how much you should trust the mean.
Common K values: 5 and 10 are standard. Higher K gives a less biased estimate -- each model trains on a larger share of the data, closer to what you'd actually deploy -- but it costs more computation (you train K models) and each test fold gets smaller and noisier. For small datasets, you can go all the way to leave-one-out (K = N) where each sample gets its own turn as the test set. For datasets with thousands of samples, K=5 is usually fine.
One important detail: notice how inside each fold, we fit the scaler on the training data only and then transform the test data. Same rule from episode #11 about data leakage -- each fold is a completely independent train/test experiment. The test fold must be treated as unseen data, even though technically you know what's in it. If you scale the entire dataset before splitting, you've leaked information from the test folds into the training process, and your estimates will be overoptimistic.
Stratified cross-validation: keeping class proportions
There's one subtlety worth mentioning. Regular K-fold just splits the data randomly. But if your classes are imbalanced (say, 10% positive and 90% negative), a random split might give you a fold with 2% positive and another with 18%. The model's performance would vary wildly just because of different class distributions in each fold -- not because the model is actually unstable.
Stratified K-fold fixes this by ensuring each fold maintains the same class ratio as the full dataset. If the full data is 10% positive, each fold will be approximately 10% positive too:
def stratified_cross_validate(X, y, k=5, lr=0.1, epochs=500):
    """Stratified K-fold: preserves class ratios in each fold."""
    # Separate indices by class
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    np.random.shuffle(pos_idx)
    np.random.shuffle(neg_idx)
    # Split each class into K parts
    pos_folds = np.array_split(pos_idx, k)
    neg_folds = np.array_split(neg_idx, k)
    acc_scores = []
    f1_scores = []
    for fold in range(k):
        test_idx = np.concatenate([pos_folds[fold], neg_folds[fold]])
        train_idx = np.concatenate([
            np.concatenate([pos_folds[j] for j in range(k) if j != fold]),
            np.concatenate([neg_folds[j] for j in range(k) if j != fold])
        ])
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        acc, f1 = logistic_train_eval(X_train, y_train, X_test, y_test,
                                      lr=lr, epochs=epochs)
        acc_scores.append(acc)
        f1_scores.append(f1)
        pos_ratio = y_test.mean()
        print(f" Fold {fold + 1}: acc={acc:.3f} F1={f1:.3f} "
              f"(test positive ratio: {pos_ratio:.1%})")
    print(f"\n Accuracy: {np.mean(acc_scores):.3f} +/- {np.std(acc_scores):.3f}")
    print(f" F1: {np.mean(f1_scores):.3f} +/- {np.std(f1_scores):.3f}")
    return acc_scores, f1_scores
# Create an imbalanced dataset to show the difference
np.random.seed(42)
n_imb = 400
X_imb = np.random.randn(n_imb, 3)
# Only 15% positive
y_imb = ((X_imb[:, 0] + X_imb[:, 1] + np.random.randn(n_imb) * 0.8) > 1.5).astype(float)
print(f"Imbalanced dataset: {n_imb} samples, {int(y_imb.sum())} positive "
f"({y_imb.mean():.1%})")
print(f"\nStratified 5-Fold Cross-Validation:\n")
s_acc, s_f1 = stratified_cross_validate(X_imb, y_imb, k=5)
See how each fold's positive ratio is consistent? That's stratification at work. The standard deviation across folds should be smaller with stratified splitting because you've removed one source of variability (unequal class distributions) from the experiment. In practice, always use stratified K-fold for classification problems. There's no reason not to.
Regression metrics: beyond MSE
We've focused heavily on classification metrics because that's where the pitfalls are deepest. But let's not forget regression -- we built linear regression in episodes #10 and #11, and we measured it with MSE. MSE is fine, but it's not the only metric, and it has some practical limitations.
Here's a quick tour of the regression metrics you should know:
# Example: predicting apartment prices (callback to episode #10)
np.random.seed(42)
y_true_reg = np.array([180000, 220000, 350000, 280000, 150000,
420000, 195000, 310000, 265000, 175000], dtype=float)
y_pred_reg = np.array([195000, 210000, 330000, 295000, 160000,
390000, 180000, 340000, 250000, 190000], dtype=float)
# MAE -- Mean Absolute Error
mae = np.mean(np.abs(y_true_reg - y_pred_reg))
# MSE -- Mean Squared Error
mse = np.mean((y_true_reg - y_pred_reg) ** 2)
# RMSE -- Root Mean Squared Error (MSE in original units)
rmse = np.sqrt(mse)
# R-squared -- proportion of variance explained (episode #11)
ss_res = np.sum((y_true_reg - y_pred_reg) ** 2)
ss_tot = np.sum((y_true_reg - y_true_reg.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"Regression metrics for apartment price predictions:\n")
print(f" MAE: EUR {mae:>10,.0f} (average absolute error)")
print(f" MSE: {mse:>14,.0f} (penalizes large errors more)")
print(f" RMSE: EUR {rmse:>10,.0f} (MSE back in original units)")
print(f" R^2: {r_squared:>10.4f} (fraction of variance explained)")
print(f"")
print(f"Interpretation:")
print(f" On average, predictions are off by EUR {mae:,.0f}")
print(f" The model explains {r_squared*100:.1f}% of price variation")
print(f" Remaining {(1-r_squared)*100:.1f}% is noise or missing features")
Quick comparison of when to use what:
- MAE (Mean Absolute Error): simple, interpretable, in the same units as the target. Treats all errors equally -- being off by EUR 20,000 is twice as bad as being off by EUR 10,000. Good when all errors matter equally.
- MSE (Mean Squared Error): penalizes large errors more heavily (because of the squaring). Being off by EUR 20,000 is FOUR times as bad as being off by EUR 10,000. Good when large errors are especially problematic.
- RMSE (Root MSE): just sqrt(MSE), which brings it back to the original units. Easier to interpret than MSE -- "average error is about EUR 18,000" makes more sense than "average squared error is 324,000,000."
- R-squared: the one I introduced in episode #11. Fraction of variance explained. Scale-free (always between negative infinity and 1.0, typically between 0 and 1 for any reasonable model). Intuitive: "the model explains 85% of the price variation." R^2 of 0 means you're no better than just predicting the mean for everything.
One important warning about R-squared: it can be negative. That happens when your model is worse than the trivial "always predict the mean" strategy. If you see negative R^2, something went terribly wrong -- wrong features, broken preprocessing, bug in the code, or the model simply doesn't fit. It's actually a useful smoke test: negative R^2 = stop and debug.
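A tiny made-up example makes the point: predictions that are wildly off produce a strongly negative R^2, because the residual sum of squares dwarfs the total sum of squares:

```python
import numpy as np

# Five made-up targets clustered around 11, and a "model" that predicts nonsense.
y_true = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
y_bad = np.array([20.0, 3.0, 25.0, 1.0, 22.0])   # wildly wrong predictions

ss_res = np.sum((y_true - y_bad) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # variance around the mean
r2 = 1 - ss_res / ss_tot
print(f"R^2 of the broken model: {r2:.1f}")       # deeply negative
```

Predicting the constant mean would give R^2 = 0; anything negative means the model actively makes things worse.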
Putting it all together: evaluate a full model
Let's bring everything together in one practical scenario. We'll generate a classification dataset, train our logistic regression from episode #12, and evaluate it with every metric we've built today:
# Full evaluation pipeline
np.random.seed(42)
n = 500
# Medical screening: predict disease risk from 5 biomarkers
biomarkers = np.random.randn(n, 5)
risk = (0.8 * biomarkers[:, 0] + 0.5 * biomarkers[:, 1]
- 0.3 * biomarkers[:, 2] + 0.2 * biomarkers[:, 3]
+ np.random.randn(n) * 0.8)
y = (risk > 0.5).astype(float)
print(f"Medical screening dataset: {n} patients, 5 biomarkers")
print(f" High risk: {int(y.sum())} ({y.mean():.1%})")
print(f" Low risk: {int(n - y.sum())} ({1 - y.mean():.1%})")
# 80/20 split
idx = np.random.permutation(n)
split = int(0.8 * n)
X_train, X_test = biomarkers[idx[:split]], biomarkers[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]
# Scale
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
std[std == 0] = 1
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std
# Train logistic regression
X_tr_b = np.column_stack([X_train_s, np.ones(len(X_train_s))])
X_te_b = np.column_stack([X_test_s, np.ones(len(X_test_s))])
w = np.zeros(X_tr_b.shape[1])
for epoch in range(1000):
    z = X_tr_b @ w
    y_hat = 1 / (1 + np.exp(-z))
    gradient = (1 / len(y_train)) * X_tr_b.T @ (y_hat - y_train)
    w -= 0.5 * gradient
# Get predictions and scores
test_scores = X_te_b @ w
test_proba = 1 / (1 + np.exp(-test_scores))
test_preds = (test_proba >= 0.5).astype(int)
# ALL the metrics
tp, fp, fn, tn = confusion_matrix(y_test, test_preds)
accuracy = (tp + tn) / len(y_test)
prec = tp / (tp + fp) if (tp + fp) > 0 else 0
rec = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
fpr_curve, tpr_curve = compute_roc(y_test, test_scores)
auc = np.trapz(tpr_curve, fpr_curve)
print(f"\n--- Full Evaluation Report ---")
print(f"")
print(f"Confusion Matrix:")
print(f" Predicted")
print(f" Pos Neg")
print(f" Actual Pos {tp:>4d} {fn:>4d}")
print(f" Actual Neg {fp:>4d} {tn:>4d}")
print(f"")
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall: {rec:.3f}")
print(f"F1 Score: {f1:.3f}")
print(f"AUC-ROC: {auc:.3f}")
print(f"")
print(f"Medical context:")
print(f" {tp} high-risk patients correctly identified")
print(f" {fn} high-risk patients MISSED (dangerous!)")
print(f" {fp} low-risk patients flagged for follow-up (inconvenient)")
print(f" {tn} low-risk patients correctly cleared")
That evaluation report is what you should produce for every classification model you build. Not just accuracy -- the full picture. The confusion matrix, precision, recall, F1, AUC. And the domain-specific interpretation: what do false positives and false negatives actually MEAN in your application?
Choosing the right metric: a decision framework
There is no universally "best" metric. The right one depends on your problem. Here's the framework I use:
For classification:
- Classes roughly balanced AND you care about overall correctness? --> Accuracy is fine, or F1 for a more robust measure
- Classes imbalanced? --> F1, precision/recall separately, or AUC-ROC. NEVER accuracy alone.
- False positives are the expensive error? --> Optimize precision (raise the threshold)
- False negatives are the expensive error? --> Optimize recall (lower the threshold)
- Need to compare models without committing to a threshold? --> AUC-ROC
- Heavily imbalanced (1% or less positive)? --> Consider precision-recall AUC instead of ROC-AUC
For regression:
- Want a simple, interpretable error in original units? --> MAE or RMSE
- Large errors are especially bad? --> MSE or RMSE (the squaring emphasizes outliers)
- Want a scale-free measure you can compare across problems? --> R-squared
- Sanity check: is the model doing anything at all? --> R-squared (negative = worse than the mean)
The key insight: pick your metric BEFORE you start building models. Not after. If you try a dozen models and then pick the metric that makes the best one look good, you're fooling yourself. Decide what matters for the problem, choose the metric that captures it, then optimize for that metric.
Let's recap
We stepped back from building models today and focused on measuring them. This is not optional -- evaluation is how you know whether your models work, where they fail, and how to improve them. Here's what we covered:
- Accuracy is misleading on imbalanced datasets. A model that always predicts the majority class can score 99%+ while doing absolutely nothing useful. Never rely on accuracy alone for classification;
- The confusion matrix (TP, FP, FN, TN) is the foundation. Every classification metric is derived from these four numbers. Understand them and you understand evaluation;
- Precision = "when I say yes, am I right?" -- important when false alarms are costly. Recall = "did I find everything?" -- important when misses are costly. They trade off against each other as you adjust the threshold;
- F1 score is the harmonic mean of precision and recall. Zero if either is zero, 1.0 only if both are perfect. Use it when precision and recall are roughly equally important;
- AUC-ROC evaluates across all thresholds simultaneously, making it ideal for model comparison. AUC of 0.5 = random, 1.0 = perfect. The model with higher AUC is better regardless of threshold;
- Cross-validation (K-fold) gives a more reliable performance estimate than a single train/test split. Always use stratified K-fold for classification. Report mean and standard deviation;
- For regression: MAE for typical error, RMSE for emphasizing large errors, R-squared for proportion of variance explained;
- Choose your metric before building models, based on the cost of different error types in your specific domain.
Now we have both the models (episodes #10-12) and the tools to measure them (this episode). But there's a gap we haven't addressed yet. All our examples so far used clean, synthetic data -- data we generated ourselves with np.random. Real-world data is messy. It has missing values. It has columns in the wrong format. It has outliers that don't belong. It has features encoded as text when the model needs numbers. Cleaning and preparing data for a model is not glamorous work, but every ML practitioner will tell you the same thing: it's where you spend 80% of your time. And doing it wrong silently destroys your model's performance.
Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!
Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).
Consider setting @stemsocial as a beneficiary of this post's rewards if you would like to support the community and contribute to its mission of promoting science and education on Hive.