Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong

What will I learn
- You will learn what anomaly detection is and why it's fundamentally different from classification -- you're modeling "normal" instead of learning class boundaries;
- statistical baselines with Z-scores and IQR that work surprisingly well when your data cooperates;
- Isolation Forest -- the algorithm that isolates anomalies with random trees, and why shorter paths mean weirder points;
- One-Class SVM -- learning the tightest boundary around normal data using the kernel trick from episode #20;
- Local Outlier Factor -- density-based anomaly scoring that catches outliers K-Means and global thresholds miss;
- how to combine anomaly detection with the dimensionality reduction and clustering tools from episodes #22-25;
- evaluation strategies when labeled anomalies are rare or completely nonexistent.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong (this post)
Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
Over the last four episodes we've been deep in unsupervised territory. We found groups with K-Means and DBSCAN (episodes #22-23), compressed high-dimensional data with PCA (#24), and visualized manifold structure with t-SNE and UMAP (#25). All of those techniques share a common goal: find the structure -- the patterns, the clusters, the geometry that defines what's "normal" in a dataset.
Today we flip the question around. Instead of asking "what does the data look like?" we ask "what DOESN'T look like the rest of the data?"
A fraudulent transaction among millions of legitimate ones. A failing sensor in a factory full of healthy machines. A network intrusion buried in gigabytes of normal traffic. A spam account hiding among real users. These are anomalies -- data points that don't follow the patterns everything else follows. And detecting them is one of the most practically valuable things ML can do, because in many real-world domains the interesting stuff IS the weird stuff. The normal transactions are boring. The one that looks nothing like the others is the one that might be stealing your money.
So why not just train a classifier? Good question. Remember episode #13 where we talked about evaluation metrics and class imbalance? Anomaly detection is class imbalance taken to the absolute extreme. You might have 10,000 normal data points and 5 anomalies. A supervised classifier trained on this data would learn to predict "normal" for everything and achieve 99.95% accuracy while catching exactly zero anomalies. And that's assuming you even HAVE labeled anomalies to train on -- in most real scenarios, you don't. The anomalies you know about are the ones your current system already caught. The dangerous ones are the ones you've never seen before.
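To see that trap in numbers, here's a minimal sketch (purely synthetic, with a majority-class predictor standing in for a naive supervised model) of what 10,000 normal points and 5 anomalies do to accuracy versus recall:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
# 10,000 normal points, 5 anomalies -- the features don't even matter here
rng = np.random.default_rng(42)
X_imb = rng.normal(size=(10005, 5))
y_imb = np.array([0] * 10000 + [1] * 5)
# A classifier that simply predicts the majority class for everything
clf = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
pred = clf.predict(X_imb)
print(f"Accuracy: {accuracy_score(y_imb, pred):.2%}")           # 99.95%
print(f"Recall on anomalies: {recall_score(y_imb, pred):.0%}")  # 0%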
Anomaly detection algorithms take a fundamentally different approach: they learn what "normal" looks like, and flag anything that doesn't match. No labels needed. The model builds a profile of the majority, and the minority -- the stuff that breaks the pattern -- reveals itself through its distance from that profile.
Let's dive right in.
Statistical baselines: don't overthink it
Before reaching for fancy ML algorithms, simple statistics deserve a serious look. If your data is roughly normally distributed (and you'd be surprised how often a single feature is, or can be made so with a log transform), the Z-score approach works remarkably well. It flags points that are far from the mean, measured in standard deviations:
import numpy as np
from sklearn.datasets import make_blobs
np.random.seed(42)
# Generate "normal" data: a single cluster of 500 points
X_normal, _ = make_blobs(
n_samples=500, centers=1, cluster_std=1.0, random_state=42
)
# Inject 15 anomalies scattered across the space
X_anomalies = np.random.uniform(-8, 8, size=(15, 2))
# Combine into one dataset
X = np.vstack([X_normal, X_anomalies])
y_true = np.array([0] * 500 + [1] * 15) # 0=normal, 1=anomaly
print(f"Dataset: {X.shape[0]} points, {X.shape[1]} features")
print(f"Normal: {np.sum(y_true == 0)}, Anomalies: {np.sum(y_true == 1)}")
print(f"Anomaly rate: {np.mean(y_true):.1%}")
# Z-score method: flag points where ANY feature is >3 std from the mean
means = X.mean(axis=0)
stds = X.std(axis=0)
z_scores = np.abs((X - means) / stds)
max_z = z_scores.max(axis=1) # worst z-score across features
z_predictions = (max_z > 3).astype(int)
detected = np.sum((z_predictions == 1) & (y_true == 1))
false_alarms = np.sum((z_predictions == 1) & (y_true == 0))
print(f"\nZ-score (threshold=3):")
print(f" Detected: {detected}/{15} anomalies")
print(f" False alarms: {false_alarms}")
The Z-score threshold of 3 is a classic choice -- under a normal distribution, about 99.7% of data falls within 3 standard deviations of the mean. Anything beyond that is flagged. You can tighten it (threshold=2, catches more but with more false alarms) or loosen it (threshold=4, very few false alarms but might miss subtle anomalies).
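If you want to see that tradeoff rather than take my word for it, a quick sweep over thresholds (reusing max_z and y_true from the snippet above; exact counts depend on the random seed) makes it visible:
# Sweep the Z-score threshold and watch detections vs. false alarms
for t in [2, 3, 4]:
    flagged = (max_z > t).astype(int)
    det = np.sum((flagged == 1) & (y_true == 1))
    fp = np.sum((flagged == 1) & (y_true == 0))
    print(f"threshold={t}: detected {det}/15, false alarms {fp}")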
For data that isn't Gaussian, the IQR (Interquartile Range) method is more robust. It flags anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. This is the same logic behind the "whiskers" on a box plot, and it handles skewed distributions much better than Z-scores:
# IQR method (per feature)
Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# A point is anomalous if ANY feature is outside the IQR bounds
iqr_flags = np.any((X < lower) | (X > upper), axis=1).astype(int)
detected_iqr = np.sum((iqr_flags == 1) & (y_true == 1))
false_iqr = np.sum((iqr_flags == 1) & (y_true == 0))
print(f"IQR method:")
print(f" Detected: {detected_iqr}/{15} anomalies")
print(f" False alarms: {false_iqr}")
Both methods share a critical limitation: they check each feature independently. A point might have perfectly normal values on every single feature and still be anomalous in the combination of features. Imagine someone who earns $30K/year (normal), lives in a $5M house (normal for SOME people), and is 22 years old (normal). Each feature independently is fine. The combination? That's suspicious. Capturing these multi-feature anomalies requires methods that understand the joint distribution -- which is exactly what the ML-based approaches do.
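Here's a small illustration of that effect (a synthetic pair of strongly correlated features, separate from the dataset above): a point that is unremarkable on each feature individually but clearly out of place in the joint distribution. A Mahalanobis distance, which accounts for the covariance between features, exposes it:
# Two strongly correlated features; the suspect point breaks the correlation
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X_joint = rng.multivariate_normal([0, 0], cov, size=1000)
suspect = np.array([2.0, -2.0])  # each value is ~2 std on its own: fine per feature
z_per_feature = np.abs((suspect - X_joint.mean(axis=0)) / X_joint.std(axis=0))
inv_cov = np.linalg.inv(np.cov(X_joint, rowvar=False))
diff = suspect - X_joint.mean(axis=0)
mahalanobis = np.sqrt(diff @ inv_cov @ diff)
print(f"Per-feature |z|: {np.round(z_per_feature, 2)}")  # both well under 3
print(f"Mahalanobis distance: {mahalanobis:.1f}")         # far out in the joint distribution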
Having said that, always start with statistics. If your anomalies are extreme values on individual features, Z-scores and IQR will catch them faster and more transparently than any ML algorithm. You can explain "this value is 5 standard deviations from the mean" to anyone. Try explaining an Isolation Forest anomaly score to a non-technical stakeholder ;-)
Isolation Forest: the inverse logic
Most anomaly detection algorithms try to model what normal data looks like and then find deviations. Isolation Forest (Liu et al., 2008) takes a brilliant conceptual shortcut: it doesn't try to model normalcy at all. Instead, it directly targets what makes anomalies different -- they're easier to isolate.
The idea comes from a simple observation about random partitioning. Pick a random feature and a random split value. This divides your data into two halves. Now repeat: pick another random feature and split within one of those halves. Keep going. How many splits does it take to completely isolate a single point from all others?
For a normal point -- one sitting in the middle of a dense cluster surrounded by similar points -- isolation takes many splits. You have to carve away all its neighbors one by one. For an anomaly -- sitting out in empty space, far from everything -- isolation is easy. Maybe one or two splits and it's already alone. The average number of splits needed to isolate a point -- its path length in a random tree -- becomes the anomaly score. Short path = easy to isolate = probably anomalous.
Isolation Forest builds an ensemble of these random trees (just like Random Forest from episode #18 builds an ensemble for classification) and averages the path lengths. The ensemble smooths out the randomness of individual trees and produces stable anomaly scores.
from sklearn.ensemble import IsolationForest
# contamination: expected fraction of anomalies in the data
# We know it's 15/515 ~ 0.03
iso = IsolationForest(
n_estimators=100,
contamination=0.03,
random_state=42
)
iso_predictions = iso.fit_predict(X)
# IsolationForest returns -1 for anomalies, 1 for normal
iso_labels = (iso_predictions == -1).astype(int)
detected = np.sum((iso_labels == 1) & (y_true == 1))
false_alarms = np.sum((iso_labels == 1) & (y_true == 0))
print(f"Isolation Forest:")
print(f" Detected: {detected}/{15} anomalies")
print(f" False alarms: {false_alarms}")
# The raw anomaly scores (lower = more anomalous)
scores = iso.decision_function(X)
print(f"\n Score range (normal): "
f"{scores[y_true == 0].min():.3f} to {scores[y_true == 0].max():.3f}")
print(f" Score range (anomaly): "
f"{scores[y_true == 1].min():.3f} to {scores[y_true == 1].max():.3f}")
That contamination parameter is the elephant in the room. It tells the model what fraction of the data you expect to be anomalous, and the algorithm uses it to set the decision threshold. Set it too low and you'll miss real anomalies. Too high and you'll drown in false alarms. When you have some labeled examples (even a small validation set), use them to calibrate this threshold properly. When you don't have labels at all, domain knowledge is your best guide. "In our payment system, about 0.1% of transactions are fraudulent" is the kind of prior information that turns a tuning knob into a principled choice.
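As a sketch of that calibration (here I simply reuse y_true as the "small labeled set"; in practice it would be whatever your analysts have hand-verified), you can pick the score threshold that maximizes F1 on the labeled sample instead of guessing contamination:
from sklearn.metrics import f1_score
# Candidate thresholds taken from low percentiles of the score distribution
val_scores = iso.decision_function(X)  # lower = more anomalous
candidates = np.percentile(val_scores, [1, 2, 3, 5, 10])
best = max(
    candidates,
    key=lambda t: f1_score(y_true, (val_scores < t).astype(int), zero_division=0)
)
print(f"Threshold with best F1 on the labeled sample: {best:.3f}")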
Why Isolation Forest is the practical default
I want to spend a moment on why Isolation Forest has become the go-to first choice for anomaly detection in production systems. It's fast -- O(n log n) per tree, and the trees are embarrassingly parallel. It handles high-dimensional data without breaking a sweat (each tree only uses a random subset of features per split, so irrelevant dimensions don't hurt it -- same insight as the random feature selection in Random Forest from episode #18). It makes no assumptions about the distribution of normal data. And it scales gracefully to datasets with millions of rows.
Compare that to the alternatives: One-Class SVM is O(n^2) to O(n^3) in the number of training points. LOF computes pairwise distances in a neighborhood, which gets expensive in high dimensions. Statistical methods assume specific distributions. Isolation Forest just builds trees, and trees are fast.
import time
# Speed comparison on increasing data sizes
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
for n in [1000, 5000, 10000]:
    X_speed, _ = make_blobs(
        n_samples=n, centers=3, n_features=20, random_state=42
    )
    # Isolation Forest
    start = time.time()
    IsolationForest(n_estimators=100, random_state=42).fit(X_speed)
    t_iso = time.time() - start
    # LOF
    start = time.time()
    LocalOutlierFactor(n_neighbors=20).fit_predict(X_speed)
    t_lof = time.time() - start
    print(f"n={n:>6d}: IsoForest={t_iso:.2f}s, LOF={t_lof:.2f}s")
On 10K points in 20 dimensions, Isolation Forest finishes in under a second while LOF takes noticeably longer. And the gap widens as the data grows. For one-off analysis with a few thousand points, any algorithm works. For a production system processing millions of events per day, speed matters.
One-Class SVM: learning the boundary of normal
Remember SVMs from episode #20? We used them to find the maximum-margin hyperplane between two classes. One-Class SVM (Schölkopf et al., 1999) takes that same idea but trains on only ONE class -- the normal data. Instead of separating class A from class B, it separates the normal data from "everything else" by finding the tightest boundary that encloses the normal points. Anything outside the boundary is declared anomalous.
The kernel trick (which we covered in episode #20 when we talked about the RBF kernel for nonlinear classification) is what makes this powerful. In the original feature space, the boundary around normal data might need to be a complex, wiggly shape. But after the kernel transforms the data into a high-dimensional space, a simpler boundary (the maximum-margin hyperplane) can do the job. It's the same mathematical machinery, just pointed at a different problem.
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
# Feature scaling is critical for SVMs -- same lesson as episode #20
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# nu: upper bound on fraction of training errors (similar to contamination)
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
oc_predictions = oc_svm.fit_predict(X_scaled)
oc_labels = (oc_predictions == -1).astype(int)
detected = np.sum((oc_labels == 1) & (y_true == 1))
false_alarms = np.sum((oc_labels == 1) & (y_true == 0))
print(f"One-Class SVM:")
print(f" Detected: {detected}/{15} anomalies")
print(f" False alarms: {false_alarms}")
# Show the scores
scores_svm = oc_svm.decision_function(X_scaled)
print(f" Score range (normal): "
f"{scores_svm[y_true == 0].min():.3f} to "
f"{scores_svm[y_true == 0].max():.3f}")
print(f" Score range (anomaly): "
f"{scores_svm[y_true == 1].min():.3f} to "
f"{scores_svm[y_true == 1].max():.3f}")
The nu parameter is an upper bound on the fraction of training points that can be treated as outliers AND a lower bound on the fraction of support vectors. It plays a similar role to contamination in Isolation Forest. If you set nu=0.05, the model allows up to 5% of the training data to fall outside the normal boundary. It's another way of expressing "how many anomalies do I expect?"
Notice the StandardScaler before fitting. If you remember from episode #20, SVMs with RBF kernels are sensitive to feature scales because the kernel computes distances in feature space. An unscaled feature with a range of 0-1000 would completely dominate a feature with range 0-1. Same lesson, same fix.
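To get a feel for nu in practice, here's a quick sweep (reusing X_scaled and y_true from above; the exact counts will vary with the data):
# More permissive nu flags more points -- more recall, more false alarms
for nu in [0.01, 0.05, 0.10]:
    preds = OneClassSVM(kernel='rbf', gamma='scale', nu=nu).fit_predict(X_scaled)
    flagged = (preds == -1).astype(int)
    det = np.sum((flagged == 1) & (y_true == 1))
    fp = np.sum((flagged == 1) & (y_true == 0))
    print(f"nu={nu:.2f}: flagged {flagged.sum():>3d}, "
          f"detected {det}/15, false alarms {fp}")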
When One-Class SVM shines (and when it doesn't)
One-Class SVM works best when the normal data forms a compact, well-defined region in feature space. Think sensor readings that stay within an operating envelope, or network traffic that follows predictable patterns. The SVM draws a tight boundary around that region, and anything outside is flagged.
It struggles with multimodal normal data -- if your normal data has multiple clusters (like, say, different customer segments that are all "normal" but look very different from each other), the RBF kernel tries to fit a single smooth boundary around all of them. That boundary can end up being very loose, letting real anomalies through, or very tight, flagging normal points in the gaps between clusters as anomalous. For multimodal data, Isolation Forest is more robust because its tree-based approach naturally adapts to multiple modes.
# Multimodal normal data: 3 clusters, all "normal"
np.random.seed(42)
X_multi = np.vstack([
np.random.randn(200, 2) * 0.5 + np.array([-3, 0]),
np.random.randn(200, 2) * 0.5 + np.array([3, 0]),
np.random.randn(200, 2) * 0.5 + np.array([0, 4]),
])
# Anomalies in the gaps between clusters
X_gap_anomalies = np.array([[0, 0], [0, 2], [-1, 2], [1, 2], [0, -2]])
X_multi_all = np.vstack([X_multi, X_gap_anomalies])
y_multi = np.array([0] * 600 + [1] * 5)
# One-Class SVM
X_multi_scaled = StandardScaler().fit_transform(X_multi_all)
oc_multi = OneClassSVM(kernel='rbf', gamma='scale', nu=0.02)
oc_pred = oc_multi.fit_predict(X_multi_scaled)
oc_det = np.sum((oc_pred == -1) & (y_multi == 1))
# Isolation Forest
iso_multi = IsolationForest(
n_estimators=100, contamination=0.01, random_state=42
)
iso_pred = iso_multi.fit_predict(X_multi_all)
iso_det = np.sum((iso_pred == -1) & (y_multi == 1))
print(f"Multimodal data (3 normal clusters, 5 gap anomalies):")
print(f" One-Class SVM detected: {oc_det}/5")
print(f" Isolation Forest detected: {iso_det}/5")
Isolation Forest handles the multimodal case more naturally because each tree independently partitions the space. The gaps between normal clusters are sparse regions -- easy to isolate into -- regardless of how many clusters there are. The SVM needs to draw one boundary around all three clusters, which invariably includes the gaps between them as "inside the boundary."
Local Outlier Factor: context is everything
Local Outlier Factor (LOF, Breunig et al., 2000) takes yet another approach to defining "anomalous." Instead of measuring distance from the data center (Z-scores) or ease of isolation (Isolation Forest) or distance from a boundary (One-Class SVM), LOF asks: is this point's local density consistent with its neighbors' density?
Think about it like this. A person standing alone in a field is obviously an outlier. But a person standing alone in a city park at 3 AM? They might not be far from anyone in absolute terms (buildings full of people all around), but relative to the expected density of a city, they stand out. LOF captures this by comparing each point's density to the density of its K nearest neighbors. If your density is much lower than your neighbors' density, you're an outlier -- even if your absolute distance from the mean is unremarkable.
This is related to the density concepts we saw with DBSCAN in episode #23. DBSCAN defines clusters as dense regions and labels sparse points as noise. LOF goes a step further: it quantifies HOW much less dense a point is compared to its surroundings, producing a continuous score rather than a binary cluster/noise label.
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
lof_predictions = lof.fit_predict(X)
lof_labels = (lof_predictions == -1).astype(int)
detected = np.sum((lof_labels == 1) & (y_true == 1))
false_alarms = np.sum((lof_labels == 1) & (y_true == 0))
print(f"Local Outlier Factor:")
print(f" Detected: {detected}/{15} anomalies")
print(f" False alarms: {false_alarms}")
# LOF scores (lower = more anomalous; normal points sit near -1, outliers well below it)
lof_scores = lof.negative_outlier_factor_
print(f" Score range (normal): "
f"{lof_scores[y_true == 0].min():.3f} to "
f"{lof_scores[y_true == 0].max():.3f}")
print(f" Score range (anomaly): "
f"{lof_scores[y_true == 1].min():.3f} to "
f"{lof_scores[y_true == 1].max():.3f}")
The n_neighbors parameter is the LOF equivalent of DBSCAN's eps or t-SNE's perplexity (episode #25) -- it controls the scale of "locality." Too few neighbors gives unstable density estimates (a single point can swing the score). Too many neighbors smooths out local variations and makes LOF behave more like a global method. A good default is 20, but for small datasets (under 500 points) something like 10 works better, and for large datasets you can go up to 50 or even 100.
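A quick sensitivity check like the one below (reusing X and y_true from above) is a cheap way to see whether the choice of n_neighbors actually matters on your data:
# How sensitive is LOF to the neighborhood size on this dataset?
for k in [5, 20, 50]:
    preds = LocalOutlierFactor(n_neighbors=k, contamination=0.03).fit_predict(X)
    flagged = (preds == -1).astype(int)
    det = np.sum((flagged == 1) & (y_true == 1))
    fp = np.sum((flagged == 1) & (y_true == 0))
    print(f"n_neighbors={k:>3d}: detected {det}/15, false alarms {fp}")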
LOF's big limitation: it's transductive
LOF has the same transductive limitation as HDBSCAN (episode #23) and t-SNE (episode #25): it can only score the points it was trained on. If a new data point arrives, you can't just ask "is this point anomalous?" without refitting the entire model. Scikit-learn does offer LocalOutlierFactor(novelty=True) which enables a predict() method, but it comes with an important catch -- the training data must be clean (only normal points, no anomalies). If your training set contains anomalies, they'll distort the density estimates and make the model unreliable for scoring new points.
# Novelty mode: train on CLEAN data, score NEW points
lof_novelty = LocalOutlierFactor(
n_neighbors=20, novelty=True, contamination=0.03
)
# Train only on the normal points
lof_novelty.fit(X_normal)
# Score ALL points (including anomalies)
novelty_scores = lof_novelty.decision_function(X)
novelty_preds = lof_novelty.predict(X)
novelty_labels = (novelty_preds == -1).astype(int)
detected_nov = np.sum((novelty_labels == 1) & (y_true == 1))
false_nov = np.sum((novelty_labels == 1) & (y_true == 0))
print(f"LOF (novelty mode, trained on clean data):")
print(f" Detected: {detected_nov}/{15} anomalies")
print(f" False alarms: {false_nov}")
In practice, the "clean training data" requirement is a chicken-and-egg problem. You need to remove anomalies from the training set, but identifying anomalies is the whole point. The pragmatic approach: fit LOF in normal mode first, remove the flagged anomalies, then retrain in novelty mode on the cleaned data. It's not perfect, but it usually works well enough.
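A minimal sketch of that two-stage approach (the new point at the end is a made-up observation, just to show the predict() call):
# Stage 1: flag suspicious points with LOF in its default (transductive) mode
lof_stage1 = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
stage1_flags = lof_stage1.fit_predict(X)      # -1 = flagged as anomalous
X_cleaned = X[stage1_flags == 1]              # keep only the points deemed normal
# Stage 2: retrain in novelty mode on the (approximately) clean data
lof_stage2 = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_stage2.fit(X_cleaned)
# Now we can score points the model has never seen
new_point = np.array([[7.5, -7.5]])           # hypothetical new observation
verdict = "anomaly" if lof_stage2.predict(new_point)[0] == -1 else "normal"
print(f"New point is classified as: {verdict}")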
The big comparison: all methods head-to-head
Let's put all three ML methods side by side on our test dataset, along with the statistical baselines. This is the anomaly detection version of the classifier bake-off we did in episode #21:
from sklearn.metrics import precision_score, recall_score, f1_score
# Collect all predictions
all_results = {
'Z-score': z_predictions,
'IQR': iqr_flags,
'IsoForest': iso_labels,
'OC-SVM': oc_labels,
'LOF': lof_labels,
}
print(f"{'Method':>10s} {'Precision':>10s} {'Recall':>8s} "
f"{'F1':>6s} {'Detected':>9s} {'False+':>7s}")
print("-" * 58)
for name, preds in all_results.items():
    p = precision_score(y_true, preds, zero_division=0)
    r = recall_score(y_true, preds, zero_division=0)
    f1 = f1_score(y_true, preds, zero_division=0)
    det = np.sum((preds == 1) & (y_true == 1))
    fp = np.sum((preds == 1) & (y_true == 0))
    print(f"{name:>10s} {p:>10.2f} {r:>8.2f} "
          f"{f1:>6.2f} {det:>5d}/{15:<3d} {fp:>7d}")
On this synthetic dataset where anomalies are scattered uniformly far from a single normal cluster, you'll find that most methods perform reasonably well. The real differences emerge on harder datasets -- multimodal normal data, anomalies that are close to normal points, high-dimensional data with irrelevant features. And that's where the algorithm choice actually matters.
Evaluation: the genuinely hard problem
Evaluating anomaly detection is uniquely difficult, and I want to spend real time on this because it trips people up constantly. In supervised learning (episode #13), we had clean ground truth labels and could compute accuracy, F1, recall, precision -- the whole toolkit. In anomaly detection, you typically face one of two scenarios:
Scenario 1: You have some labels. Maybe you have a handful of confirmed fraudulent transactions, or a security team manually reviewed a sample and flagged real intrusions. Even a small labeled set is gold. Use it to compute precision, recall, and F1 as above. But be aware that your labeled anomalies might not be representative of ALL anomaly types -- you're evaluating against known patterns, and the whole point of anomaly detection is catching UNKNOWN patterns.
Scenario 2: You have NO labels at all. This is the common case. Nobody has gone through your data and tagged the anomalies. You can't compute precision or recall because there's no ground truth.
What do you do in scenario 2? Two practical approaches:
# Approach 1: Score-based inspection
# Every algorithm provides continuous anomaly scores, not just binary labels
# Rank points by score and manually inspect the top-K
# Get scores from each algorithm
iso_scores = iso.decision_function(X) # lower = more anomalous
lof_scores_all = lof.negative_outlier_factor_ # lower = more anomalous
svm_scores = oc_svm.decision_function(X_scaled) # lower = more anomalous
# Rank by Isolation Forest score (most anomalous first)
ranking = np.argsort(iso_scores)
print("Top 20 most anomalous points (by Isolation Forest):")
print(f"{'Rank':>5s} {'IsoF score':>11s} {'LOF score':>10s} "
f"{'True label':>11s}")
print("-" * 42)
for i in range(20):
    idx = ranking[i]
    label = "ANOMALY" if y_true[idx] == 1 else "normal"
    print(f"{i+1:>5d} {iso_scores[idx]:>11.3f} "
          f"{lof_scores_all[idx]:>10.3f} {label:>11s}")
If the top 20 flagged points are genuinely unusual when you inspect them, your model is working. If they're boring normal points, your model is not helping. This "inspect the top K" approach is how anomaly detection is evaluated in practice far more often than with formal metrics.
# Approach 2: Ensemble agreement
# If multiple algorithms independently flag the same point,
# it's more likely to be a real anomaly
# Binary flags from each method
flags = np.column_stack([
z_predictions, iqr_flags, iso_labels, oc_labels, lof_labels
])
agreement = flags.sum(axis=1) # how many methods flagged this point
print("\nEnsemble agreement analysis:")
print(f"{'Votes':>6s} {'Count':>6s} {'True anomalies':>15s}")
print("-" * 32)
for v in range(6):
    mask = agreement == v
    n_points = mask.sum()
    n_true = np.sum(y_true[mask] == 1) if n_points > 0 else 0
    print(f"{v:>6d} {n_points:>6d} {n_true:>15d}")
Points flagged by all 5 methods are almost certainly anomalous. Points flagged by only 1 method are more likely false alarms. This ensemble approach is really valuable in practice because it gives you a natural confidence ranking. The "how many methods agree" count is essentially a consensus score that's more robust than any single method's output.
Combining with dimensionality reduction
Remember the curse of dimensionality from episode #24? In high-dimensional spaces, distances become meaningless -- everything is approximately equidistant from everything else. That same curse hits anomaly detection hard. If your features include 200 variables and only 5 of them are relevant to the anomaly pattern, the signal is drowned by noise in the other 195 dimensions.
PCA (episode #24) and the tools we learned in episode #25 can help enormously here. Reduce dimensions first to strip out noise, then run anomaly detection on the cleaned representation:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
# Synthetic high-dimensional data
np.random.seed(42)
n_features = 50
# Normal data: random in 50 dimensions
X_hd_normal = np.random.randn(500, n_features)
# Anomalies: normal EXCEPT features 0 and 1 are extreme
X_hd_anom = np.random.randn(15, n_features)
X_hd_anom[:, 0] = np.random.uniform(5, 8, 15)
X_hd_anom[:, 1] = np.random.uniform(-8, -5, 15)
X_hd = np.vstack([X_hd_normal, X_hd_anom])
y_hd = np.array([0] * 500 + [1] * 15)
# Direct Isolation Forest on 50D data
iso_hd = IsolationForest(
n_estimators=100, contamination=0.03, random_state=42
)
pred_hd = (iso_hd.fit_predict(X_hd) == -1).astype(int)
det_hd = np.sum((pred_hd == 1) & (y_hd == 1))
# PCA to 10 components, THEN Isolation Forest
pca = PCA(n_components=10)
X_hd_pca = pca.fit_transform(X_hd)
iso_pca = IsolationForest(
n_estimators=100, contamination=0.03, random_state=42
)
pred_pca = (iso_pca.fit_predict(X_hd_pca) == -1).astype(int)
det_pca = np.sum((pred_pca == 1) & (y_hd == 1))
print(f"50D direct: detected {det_hd}/{15}")
print(f"PCA(10) + Iso: detected {det_pca}/{15}")
print(f"PCA variance retained: "
f"{pca.explained_variance_ratio_.sum():.1%}")
The PCA preprocessing helps because it concentrates the signal (the anomaly pattern) into fewer components while distributing the noise across all components. When you keep only the top 10 components, you're keeping the dimensions where the anomalies stand out and discarding the 40 noise dimensions that were diluting the signal. The same principle we used to improve K-Means clustering in episode #24 applies here -- less really is more when the extra dimensions are noise.
You can also use UMAP (episode #25) for a nonlinear version of this preprocessing. Since UMAP preserves local density structure, it naturally maps anomalous points (which have low local density in the original space) to isolated positions in the embedding. Running anomaly detection on a UMAP embedding can catch anomalies that PCA's linear projection misses.
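A sketch of that UMAP variant on the same 50-dimensional data (this assumes the umap-learn package is installed; the parameters are illustrative and the detection count will vary):
import umap  # pip install umap-learn
# Nonlinear reduction first, then the same Isolation Forest as before
reducer = umap.UMAP(n_components=5, random_state=42)
X_hd_umap = reducer.fit_transform(X_hd)
iso_umap = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
pred_umap = (iso_umap.fit_predict(X_hd_umap) == -1).astype(int)
det_umap = np.sum((pred_umap == 1) & (y_hd == 1))
print(f"UMAP(5) + Iso: detected {det_umap}/15")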
Choosing the right algorithm
After covering three ML methods and two statistical baselines, here's the practical decision guide. Like the clustering guide from episode #23, this comes from running these on many different real datasets, not just textbook examples ;-)
Start with Z-scores or IQR if your data is low-dimensional (a few features) and roughly well-behaved (unimodal, not too skewed). These methods are fast, transparent, and require zero tuning. If simple statistics solve your problem, don't overcomplicate it. I've seen people deploy Isolation Forests for problems where a Z-score check on two features would have caught every anomaly with zero false alarms.
Use Isolation Forest as the default ML approach. It's fast, scalable, handles high-dimensional data, makes no distribution assumptions, and works well across a wide range of problems. If you're building an anomaly detection system and you're not sure which algorithm to use, start here. It's the "Random Forest of anomaly detection" -- rarely the absolute best on any specific problem, but almost never bad.
Use One-Class SVM when your normal data has a clear, compact boundary -- sensor readings with known operating envelopes, manufacturing processes with tight tolerances, anything where "normal" occupies a well-defined region. Remember to scale your features (same as with regular SVMs, episode #20). Skip it for multimodal data or when you have more than ~10,000 training points (the O(n^2) cost becomes painful).
Use LOF when anomalies are defined by local context -- when a point might look normal globally but is anomalous relative to its immediate neighborhood. This is common in datasets with multiple clusters of different densities (same scenario where DBSCAN struggled in episode #23). LOF's density comparison is the right tool for "this point doesn't belong to the crowd it's standing next to."
Combine methods when the stakes are high. No single algorithm catches every type of anomaly. An ensemble that flags points based on agreement across multiple methods (as we built above) is more robust than any individual method. In production fraud detection systems, it's common to run 3-5 different anomaly detection algorithms in parallel and escalate points that multiple methods flag.
A realistic multi-feature example
Let's build something closer to a real-world scenario. Instead of scattered random anomalies, let's simulate server monitoring data where the anomalies represent actual failure modes -- subtler and harder to catch:
np.random.seed(42)
# Simulate server metrics: CPU, memory, disk I/O, network, latency
n_normal = 1000
n_anomaly = 20
# Normal operation: correlated metrics with some noise
cpu = np.random.normal(45, 10, n_normal).clip(5, 95)
memory = cpu * 0.8 + np.random.normal(0, 5, n_normal)
disk_io = np.random.normal(100, 20, n_normal).clip(10, 300)
network = np.random.normal(50, 15, n_normal).clip(5, 200)
latency = 10 + cpu * 0.1 + np.random.normal(0, 2, n_normal)
X_server = np.column_stack([cpu, memory, disk_io, network, latency])
# Anomaly type 1: memory leak (memory way too high relative to CPU)
cpu_a1 = np.random.normal(40, 8, 10)
mem_a1 = cpu_a1 * 0.8 + 40 + np.random.normal(0, 3, 10) # memory spike
disk_a1 = np.random.normal(100, 20, 10)
net_a1 = np.random.normal(50, 15, 10)
lat_a1 = 10 + cpu_a1 * 0.1 + np.random.normal(0, 2, 10)
# Anomaly type 2: network saturation (high latency + high network)
cpu_a2 = np.random.normal(50, 10, 10)
mem_a2 = cpu_a2 * 0.8 + np.random.normal(0, 5, 10)
disk_a2 = np.random.normal(100, 20, 10)
net_a2 = np.random.normal(180, 10, 10) # network spike
lat_a2 = 40 + cpu_a2 * 0.1 + np.random.normal(0, 3, 10) # latency spike
X_anomalies = np.vstack([
np.column_stack([cpu_a1, mem_a1, disk_a1, net_a1, lat_a1]),
np.column_stack([cpu_a2, mem_a2, disk_a2, net_a2, lat_a2]),
])
X_all = np.vstack([X_server, X_anomalies])
y_all = np.array([0] * n_normal + [1] * n_anomaly)
# Scale features -- critical for SVM, helpful for LOF
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X_all)
# Run all methods
results = {}
# Isolation Forest
iso_srv = IsolationForest(
n_estimators=200, contamination=0.02, random_state=42
)
results['IsoForest'] = (iso_srv.fit_predict(X_all) == -1).astype(int)
# One-Class SVM
oc_srv = OneClassSVM(kernel='rbf', gamma='scale', nu=0.03)
results['OC-SVM'] = (oc_srv.fit_predict(X_all_scaled) == -1).astype(int)
# LOF
lof_srv = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
results['LOF'] = (lof_srv.fit_predict(X_all) == -1).astype(int)
print(f"Server monitoring: {n_normal} normal + {n_anomaly} anomalies")
print(f"Features: CPU, Memory, Disk I/O, Network, Latency\n")
print(f"{'Method':>10s} {'Precision':>10s} {'Recall':>8s} {'F1':>6s}")
print("-" * 38)
for name, preds in results.items():
    p = precision_score(y_all, preds, zero_division=0)
    r = recall_score(y_all, preds, zero_division=0)
    f1 = f1_score(y_all, preds, zero_division=0)
    print(f"{name:>10s} {p:>10.2f} {r:>8.2f} {f1:>6.2f}")
This example is more interesting than the earlier synthetic one because the anomalies aren't just "random points far from the center." They're specific failure modes where some features are abnormal while others are perfectly normal. The memory leak has normal CPU, normal disk, normal network -- only the CPU-memory relationship is broken. That's the kind of multi-feature anomaly that Z-scores miss and ML methods can catch.
The contamination problem: what if you don't know the rate?
I've been setting contamination in every example, but in the real world, you often genuinely don't know what fraction of your data is anomalous. Setting it wrong can cripple your detector.
The pragmatic approach is to treat contamination as a threshold tuning problem and separate the scoring from the thresholding:
# Step 1: Get anomaly SCORES (not binary labels)
iso_final = IsolationForest(n_estimators=200, random_state=42)
iso_final.fit(X_all)
scores_final = iso_final.decision_function(X_all)
# Step 2: Explore different thresholds
print(f"{'Threshold':>10s} {'Flagged':>8s} {'Detected':>9s} "
f"{'False+':>7s} {'Precision':>10s} {'Recall':>8s}")
print("-" * 60)
for threshold in [-0.3, -0.2, -0.15, -0.1, -0.05, 0.0]:
    flagged = (scores_final < threshold).astype(int)
    n_flag = flagged.sum()
    det = np.sum((flagged == 1) & (y_all == 1))
    fp = np.sum((flagged == 1) & (y_all == 0))
    p = det / n_flag if n_flag > 0 else 0
    r = det / n_anomaly
    print(f"{threshold:>10.2f} {n_flag:>8d} {det:>5d}/{n_anomaly:<3d} "
          f"{fp:>7d} {p:>10.2f} {r:>8.2f}")
This separation of "scoring" from "thresholding" is important. The algorithm produces a continuous anomaly score for every point. The threshold converts that score into a binary decision. By separating these two steps, you can tune the threshold independently based on your operational needs: a security system that CANNOT miss intrusions sets a permissive threshold (high recall, more false alarms). A fraud review team with limited capacity sets a strict threshold (high precision, fewer but more certain flags). Same model, different thresholds, different operating points -- exactly the precision-recall tradeoff we discussed in episode #13.
The unsupervised learning toolkit: where we stand
With today's episode, our unsupervised learning toolkit is complete. Let me show you how all the pieces from episodes #22 through #26 fit together:
# The full unsupervised analysis pipeline
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
# Step 1: Dimensionality reduction (ep #24-25)
# Strip noise, compress to essential structure
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_all_scaled)
print(f"Reduced: {X_all_scaled.shape[1]}D -> {X_reduced.shape[1]}D")
# Step 2: Anomaly detection (ep #26 -- today)
# Find points that don't belong to any normal pattern
iso_pipe = IsolationForest(
n_estimators=200, contamination=0.02, random_state=42
)
anomaly_flags = iso_pipe.fit_predict(X_reduced)
clean_mask = anomaly_flags == 1
print(f"Flagged {np.sum(anomaly_flags == -1)} anomalies, "
f"{np.sum(clean_mask)} clean points")
# Step 3: Clustering on clean data (ep #22-23)
# Find groups in the normal data
X_clean = X_reduced[clean_mask]
km = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = km.fit_predict(X_clean)
print(f"Found {len(set(cluster_labels))} clusters in clean data")
for c in sorted(set(cluster_labels)):
    print(f"  Cluster {c}: {np.sum(cluster_labels == c)} points")
This pipeline -- reduce, detect anomalies, cluster the rest -- is a standard unsupervised analysis workflow in production. You remove the noise (both feature noise via PCA and data point noise via anomaly detection), and then the clustering algorithms work on cleaner, more structured data. Each tool we've built across these five episodes has a specific job, and together they form a complete system for understanding unlabeled data.
So, what have we learned?
We covered a lot of ground today. Here's the full picture, tying everything back to the concepts we've been building across the unsupervised learning arc:
- Anomaly detection learns what "normal" looks like and flags deviations. It's fundamentally different from classification because labeled anomalies are rare or nonexistent, and because the class imbalance (episode #13) makes supervised approaches impractical;
- Statistical methods (Z-scores, IQR) are effective baselines for low-dimensional, roughly Gaussian data. Start here. If they work, you're done. They check features independently, so they miss multi-feature anomalies where individual values are normal but the combination is suspicious;
- Isolation Forest isolates anomalies with random trees -- shorter path length means more anomalous. It's the practical default: fast (O(n log n)), scalable, no distribution assumptions, handles high dimensions. Think of it as the Random Forest (episode #18) of anomaly detection;
- One-Class SVM draws the tightest boundary around normal data using the kernel trick (episode #20). Excels when normal data forms a compact region. Requires feature scaling. Struggles with multimodal normal data and large datasets;
- Local Outlier Factor compares each point's density to its neighbors' density -- catching anomalies that are local rather than global. Related to the density concepts from DBSCAN (episode #23). Transductive in default mode (same limitation as HDBSCAN and t-SNE);
- The contamination parameter (or nu, or threshold) controls sensitivity. Separate scoring from thresholding in practice. The right threshold depends on the precision-recall tradeoff for your specific use case (episode #13);
- Combine with PCA (episode #24) to handle high-dimensional data -- reduce dimensions to strip noise, then run anomaly detection on the cleaned representation. Same principle that improved K-Means in episode #24;
- Ensemble agreement across multiple methods gives more robust anomaly flagging than any single algorithm alone.
This wraps up the unsupervised learning block that started in episode #22. We can find groups (clustering), compress and visualize (dimensionality reduction), and identify outliers (anomaly detection). These three capabilities -- grouping, compressing, and flagging -- are the core of what unsupervised learning offers. Everything we build from here assumes you have these tools in your belt.
The natural next question after "what groups exist in this data?" is "which items should I show to which users?" Recommendations live at the intersection of everything we've learned -- distances, similarity, clustering, and user preferences as data. It's one of the most commercially important applications of ML, and you already have all the building blocks.
Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!
Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).
Consider setting @stemsocial as a beneficiary of this post's rewards if you would like to support the community and contribute to its mission of promoting science and education on Hive.