Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary


What will I learn

  • You will learn the maximum margin principle -- why SVMs seek the widest possible gap between classes, and how that leads to better generalization than just "getting everything right";
  • support vectors: the handful of data points that define the entire decision boundary (every other point is irrelevant);
  • the soft margin and the C parameter -- how SVMs handle noisy, overlapping data where no perfect boundary exists;
  • the kernel trick -- mapping data to higher dimensions without actually computing those dimensions, one of the most elegant ideas in all of machine learning;
  • RBF, polynomial, and linear kernels, with practical guidance on when each applies;
  • SVMs for regression (SVR) -- fitting a tube instead of drawing a line;
  • a full scikit-learn workflow with Pipeline, GridSearchCV, and feature scaling (which is NOT optional for SVMs);
  • when SVMs beat tree-based methods and when they don't -- honest positioning alongside everything we've built so far.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary

At the end of episode #19 I said we'd covered the entire classical supervised learning toolkit: linear models, decision trees, random forests, gradient boosting. And I mentioned that the next episodes would explore models that think about boundaries and distances instead of trees and splits. Today we get into the first of those -- and it's a model with a very different philosophy from everything we've built so far.

Every classifier we've trained draws a decision boundary somewhere. Logistic regression (episode #12) draws a line and tries to maximize the probability of getting each point on the correct side. Decision trees (episode #17) draw boundaries perpendicular to feature axes, slicing the space into rectangles. Random forests and gradient boosting (episodes #18-19) combine many such rectangular boundaries into something more flexible. But none of these algorithms ask what might seem like the most obvious question: "what is the BEST boundary?"

Support Vector Machines ask exactly that question. And their answer is elegant: the best boundary is the one with the maximum margin -- the widest possible gap between the classes. Not just any correct boundary. The best correct boundary. The one that leaves the most breathing room.

Let's build the intuition, write the code, and see where SVMs fit in the growing toolkit.

The maximum margin idea

Imagine two clusters of points on a flat surface. Blue dots on the left, red dots on the right, clearly separated by a gap. You need to draw a straight line between them. There are infinitely many lines that correctly classify all the points. Any line that passes through the gap works.

But intuitively, some lines feel better than others. A line that barely skims past a blue dot -- so close you could practically touch it -- feels fragile. Move that blue dot a tiny bit and it crosses the boundary. A line that sits smack in the middle of the gap, equidistant from both clusters, feels robust. New points would have to travel a long way to land on the wrong side.

SVMs formalize this intuition. They find the boundary that maximizes the margin -- the distance between the boundary and the nearest data points from each class. The wider that buffer zone, the more confident we can be that unseen data will fall on the correct side. This is directly related to the generalization concept we've been discussing since episode #11 -- a wider margin means less sensitivity to the specific noise in our training sample.

Let's see it in code:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Two well-separated clusters
n = 200
X = np.vstack([
    np.random.randn(n // 2, 2) + np.array([2, 2]),
    np.random.randn(n // 2, 2) + np.array([-2, -2])
])
y = np.array([0] * (n // 2) + [1] * (n // 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Linear SVM
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
print(f"Linear SVM accuracy: {svm.score(X_test, y_test):.1%}")
print(f"Number of support vectors per class: {svm.n_support_}")
print(f"Total support vectors: {sum(svm.n_support_)}")
print(f"Total training points: {len(X_train)}")

Two clean clusters, four units apart. The linear SVM finds the boundary right in the middle of that gap. Simple enough. But the really interesting part is which data points the SVM actually cares about.
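For a linear kernel you can read the margin width straight off the fitted model: the margin edges are where the decision function equals +1 and -1, so the gap between them is 2 / ||w||, with w taken from `svm.coef_` (an attribute that only exists for linear kernels). A self-contained sketch, re-creating the same clusters as above:

```python
import numpy as np
from sklearn.svm import SVC

np.random.seed(42)
n = 200
X = np.vstack([
    np.random.randn(n // 2, 2) + np.array([2, 2]),
    np.random.randn(n // 2, 2) + np.array([-2, -2]),
])
y = np.array([0] * (n // 2) + [1] * (n // 2))

svm = SVC(kernel='linear')
svm.fit(X, y)

# The decision function is w.x + b; the margin edges sit at
# w.x + b = +/-1, so the gap between them is 2 / ||w||.
w = svm.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print(f"||w|| = {np.linalg.norm(w):.3f}")
print(f"Margin width: {margin_width:.3f}")

# Sanity check: support vectors lie on or inside the margin,
# i.e. |decision value| <= 1 (up to numerical tolerance).
df_sv = svm.decision_function(svm.support_vectors_)
print(f"Max |decision value| on SVs: {np.abs(df_sv).max():.3f}")
```

Every correctly classified non-support-vector sits strictly outside that band, which is exactly why it can't influence the boundary.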

Support vectors: the VIPs of your dataset

Here's the remarkable thing about SVMs: the decision boundary is determined entirely by a small subset of the training data -- the points closest to the boundary. These are the support vectors, and they give the algorithm its name. Every other data point could move, disappear, get duplicated, or be replaced with something completely different, and as long as those support vectors stay where they are, the boundary doesn't change at all.

Think about that for a second. In logistic regression (episode #12), every single data point contributes to computing the weights. Change one point and the weights shift slightly. In decision trees (episode #17), every point participates in the information gain calculations. In random forests (episode #18), every point appears in roughly 63% of the bootstrap samples. But an SVM looks at the full dataset, identifies the handful of critical points near the boundary, and says: "these are the only ones that matter."

# The support vectors are stored in the model
print(f"\nSupport vector indices: {svm.support_[:10]}...")
print(f"Support vectors per class: {svm.n_support_}")
print(f"Total training points: {len(X_train)}")
print(f"Points that actually matter: {len(svm.support_)} "
      f"({len(svm.support_)/len(X_train):.0%} of data)")

# Move a non-support-vector point -- boundary unchanged
X_train_copy = X_train.copy()
non_sv_idx = [i for i in range(len(X_train)) if i not in svm.support_]
# Push the point further from the boundary, on its own side,
# so it stays correctly classified and outside the margin
X_train_copy[non_sv_idx[0]] *= 3

svm_copy = SVC(kernel='linear')
svm_copy.fit(X_train_copy, y_train)

# Compare: same boundary?
print(f"\nOriginal coefficients: {svm.coef_[0]}")
print(f"After moving non-SV:   {svm_copy.coef_[0]}")
print(f"Intercept original:    {svm.intercept_[0]:.4f}")
print(f"Intercept modified:    {svm_copy.intercept_[0]:.4f}")

In practice, you'll often see that only 5-20% of the training data are support vectors. The rest are "spectators" -- correctly classified and comfortably far from the boundary. This sparsity is also why SVMs can be memory-efficient for prediction: you only need to store the support vectors, not the entire training set. Compare that to random forests where you store 200 complete trees, each with potentially thousands of leaf nodes.
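To make the memory point concrete, here's a small sketch (same synthetic clusters as before) showing the three arrays sklearn actually keeps for prediction -- and that, for a linear kernel, the weight vector can be rebuilt from the support vectors and their dual coefficients alone:

```python
import numpy as np
from sklearn.svm import SVC

np.random.seed(42)
n = 200
X = np.vstack([
    np.random.randn(n // 2, 2) + np.array([2, 2]),
    np.random.randn(n // 2, 2) + np.array([-2, -2]),
])
y = np.array([0] * (n // 2) + [1] * (n // 2))

svm = SVC(kernel='linear').fit(X, y)

# Everything prediction needs lives in three small arrays:
print(f"support_vectors_: {svm.support_vectors_.shape}")  # (n_SV, n_features)
print(f"dual_coef_:       {svm.dual_coef_.shape}")        # (1, n_SV)
print(f"intercept_:       {svm.intercept_.shape}")

# For a linear kernel, the weight vector is just a weighted sum
# of the support vectors -- the rest of the training set is gone.
w_rebuilt = svm.dual_coef_ @ svm.support_vectors_
print(np.allclose(w_rebuilt, svm.coef_))  # True
```

The non-support-vectors all have a dual coefficient of exactly zero, so they literally drop out of the stored model.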

The soft margin: handling messy reality

Real data isn't perfectly separable. There will be points that fall on the wrong side of any boundary, or points sitting right in the middle of where the margin wants to be. A hard-margin SVM (which insists on zero misclassifications and all points outside the margin) would fail completely on such data -- or wouldn't find any valid solution at all.

Soft-margin SVMs, introduced by Cortes and Vapnik in 1995, handle this gracefully. They allow some points to violate the margin -- to sit inside the margin zone or even on the wrong side -- but they penalize those violations. The C parameter controls how harshly violations are penalized.

  • High C = "classify everything correctly, even if the margin has to be razor thin." Prioritizes correctness over margin width.
  • Low C = "keep the margin wide, even if some points end up misclassified." Prioritizes generalization over training accuracy.

Sound familiar? It should. This is the bias-variance tradeoff again -- the same fundamental tension we first met in episode #11 with polynomial regression and have seen in every model since. High C = low bias, high variance (overfitting risk). Low C = high bias, low variance (underfitting risk).

from sklearn.datasets import make_moons

# Non-linearly separable data with noise
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42
)

print(f"{'C':>8s}  {'Train':>8s}  {'Test':>8s}  {'SVs':>6s}")
print("-" * 36)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    svm = SVC(kernel='rbf', C=C)
    svm.fit(X_tr, y_tr)
    print(f"{C:>8.2f}  {svm.score(X_tr, y_tr):>8.1%}  "
          f"{svm.score(X_te, y_te):>8.1%}  "
          f"{sum(svm.n_support_):>6d}")

Watch the pattern. Very low C (0.01) gives a smooth, wide-margin boundary with many support vectors -- the model is underfitting, lumping too many points into the margin zone. Very high C (100.0) gives a wiggly, tight boundary with fewer support vectors -- the model is chasing every training point, including noise. The sweet spot is somewhere in between, and cross-validation (episode #13) will help you find it. Same story as every other hyperparameter we've tuned so far, just a different knob.

The kernel trick: the SVM superpower

So far we've been drawing straight lines (or hyperplanes in higher dimensions). Linear boundaries. But the moons dataset above has two interlocking crescents -- no straight line can separate them. So how did the RBF SVM in the previous code block manage?

This is where SVMs become genuinely fascinating. The kernel trick is one of the most elegant ideas in machine learning. I'd argue it's up there with the backpropagation insight (which we'll encounter later in this series). Here's the core concept:

If data isn't linearly separable in its current feature space, maybe it IS separable in a higher-dimensional space. If we could map every 2D point into, say, 100 dimensions by computing polynomial combinations and radial distances and all sorts of derived features, perhaps a hyperplane in that 100-dimensional space would cleanly separate the classes.

The naive approach would be to actually compute all those high-dimensional features -- create 100 new columns, fill them in, run a linear SVM in that expanded space. Expensive. Impractical for very high (or infinite) dimensional spaces.

The kernel trick avoids this entirely. It turns out that the SVM optimization algorithm only needs to compute dot products between pairs of data points, never the actual coordinates in the high-dimensional space. A kernel function computes what the dot product WOULD BE in the high-dimensional space, without ever actually going there. You get the answer without doing the work. Mathematical sleight of hand -- but it works, and it's mathematically rigorous.

# Compare kernels on the moons dataset
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

print(f"{'Kernel':>10s}  {'Train':>8s}  {'Test':>8s}")
print("-" * 30)

for kernel in kernels:
    svm = SVC(kernel=kernel, C=1.0)
    svm.fit(X_tr, y_tr)
    print(f"{kernel:>10s}  {svm.score(X_tr, y_tr):>8.1%}  "
          f"{svm.score(X_te, y_te):>8.1%}")

The linear kernel is just the regular dot product -- no transformation at all. Fast. Appropriate when data actually is linearly separable, or when you have very high-dimensional data (like text with thousands of vocabulary features) where linear boundaries often work surprisingly well.

The polynomial kernel computes dot products raised to a power. degree=2 creates quadratic boundaries, degree=3 creates cubic ones. Remember the polynomial features from episode #15? The polynomial kernel does the same thing implicitly -- it's as if you created all polynomial interaction features up to degree N, but without actually creating them.

The RBF (Radial Basis Function) kernel is the default and most commonly used. It implicitly maps data to an infinite-dimensional space (yes, really -- the mathematical expansion has infinite terms). It can model any smooth decision boundary given enough support vectors. The gamma parameter controls how far the influence of each training point reaches. Small gamma means broad influence (smooth boundary). Large gamma means narrow influence (wiggly boundary, overfitting risk).

The sigmoid kernel makes the SVM behave somewhat like a single-layer neural network. Rarely used in practice -- included here for completeness.

Tuning the RBF kernel: C and gamma together

The RBF kernel has two hyperparameters that interact strongly: C (regularization strength) and gamma (kernel bandwidth). Together they control the model's complexity. Tuning them independently doesn't work well -- you need to search the 2D space.

This is a perfect use case for the GridSearchCV we set up in episode #16:

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# CRITICAL: SVMs require feature scaling
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])

param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.1, 1]
}

search = GridSearchCV(
    svm_pipeline, param_grid,
    cv=5, scoring='accuracy', n_jobs=-1,
    return_train_score=True
)
search.fit(X_tr, y_tr)

print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
print(f"Test score: {search.score(X_te, y_te):.3f}")

Notice something important in that pipeline: the StandardScaler. This is NOT optional for SVMs. The kernel computes distances between data points (dot products are distance-related), and features with large ranges will dominate features with small ranges. If one feature ranges from 0 to 1,000,000 and another from 0 to 1, the SVM effectively ignores the small-range feature -- the distances are dominated by the large one. Always scale before SVM. Always.

Remember how I said in episodes #17 and #18 that tree-based methods don't need scaling? That was a genuine advantage of trees. SVMs bring us back to the StandardScaler world we lived in during the linear model episodes (#11, #14, #16). Pipelines make this painless -- the scaler is baked into the model object, so you never accidentally forget it.

Having said that, let me show you why the interaction between C and gamma matters:

# Show the C-gamma interaction
print(f"{'C':>8s}  {'gamma':>8s}  {'Train':>8s}  {'Test':>8s}  {'SVs':>6s}")
print("-" * 44)

for C in [0.1, 1, 10, 100]:
    for gamma in [0.01, 0.1, 1, 10]:
        pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('svm', SVC(kernel='rbf', C=C, gamma=gamma))
        ])
        pipe.fit(X_tr, y_tr)
        svm_model = pipe.named_steps['svm']
        print(f"{C:>8.1f}  {gamma:>8.2f}  "
              f"{pipe.score(X_tr, y_tr):>8.1%}  "
              f"{pipe.score(X_te, y_te):>8.1%}  "
              f"{sum(svm_model.n_support_):>6d}")

Low C + low gamma: smooth, wide-margin, underfitting. High C + high gamma: wiggly, tight, overfitting. The best results come from moderate values of both -- but "moderate" depends entirely on your data. That's why grid search exists ;-)

SVMs for regression: fitting a tube

SVMs aren't just for classification. Support Vector Regression (SVR) takes the same geometric thinking and applies it to continuous targets. Instead of finding a boundary between classes, SVR fits a "tube" around the data with width epsilon. Points inside the tube contribute zero loss -- the model considers them "close enough." Only points outside the tube (the support vectors, again) influence the fit.

from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Regression data: sine wave with noise
np.random.seed(42)
X_reg = np.sort(np.random.uniform(0, 6, 100)).reshape(-1, 1)
y_reg = np.sin(X_reg.ravel()) + np.random.randn(100) * 0.15

X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# SVR with RBF kernel
svr_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1))
])
svr_pipe.fit(X_tr_r, y_tr_r)

svr_model = svr_pipe.named_steps['svr']
pred_te = svr_pipe.predict(X_te_r)

print(f"Total training points: {len(X_tr_r)}")
print(f"Support vectors: {len(svr_model.support_)}")
print(f"Points inside tube (ignored): "
      f"{len(X_tr_r) - len(svr_model.support_)}")
print(f"\nTest R-squared: {r2_score(y_te_r, pred_te):.3f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_te_r, pred_te)):.3f}")

The epsilon parameter is unique to SVR and controls the width of the tolerance tube. A wider tube means fewer support vectors and a smoother fit (more tolerance for error). A narrower tube means more support vectors and a tighter fit. Think of it as SVR's version of regularization -- controlling how much error you're willing to tolerate before the model starts caring.

Let's see the effect:

print(f"{'epsilon':>10s}  {'SVs':>6s}  {'R2':>8s}  {'RMSE':>8s}")
print("-" * 36)

for eps in [0.01, 0.05, 0.1, 0.2, 0.5]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svr', SVR(kernel='rbf', C=100, gamma='scale', epsilon=eps))
    ])
    pipe.fit(X_tr_r, y_tr_r)
    svr_m = pipe.named_steps['svr']
    preds = pipe.predict(X_te_r)
    r2 = r2_score(y_te_r, preds)
    rmse = np.sqrt(mean_squared_error(y_te_r, preds))
    print(f"{eps:>10.2f}  {len(svr_m.support_):>6d}  "
          f"{r2:>8.3f}  {rmse:>8.3f}")

Small epsilon (0.01): almost every point is outside the tube, so almost every point is a support vector. The model fits very tightly. Large epsilon (0.5): most points fall inside the tube, very few support vectors, smooth but potentially underfitting. Same bias-variance tradeoff, different parameter. Always the same story.

SVMs vs everything: the honest comparison

Now we have the full picture of classical supervised learning. Let's put SVMs alongside everything we've built so far and see where they fit. I'll use the same comparison framework from episodes #18 and #19 -- same data, same cross-validation, same metrics:

from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Generate data with nonlinear patterns and interactions
np.random.seed(42)
n = 500
X_full = np.random.randn(n, 6)
y_full = ((X_full[:, 0]**2
           + 0.5 * X_full[:, 1]
           - X_full[:, 2]
           + X_full[:, 3] * X_full[:, 4]
           + 0.3 * X_full[:, 5]**2
           + np.random.randn(n) * 0.5) > 0.5).astype(int)

models = {
    "Logistic Regression": Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(max_iter=1000)),
    ]),
    "Decision Tree (d=5)": Pipeline([
        ('model', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ]),
    "Random Forest (200)": Pipeline([
        ('model', RandomForestClassifier(
            n_estimators=200, random_state=42)),
    ]),
    "Gradient Boosting": Pipeline([
        ('model', GradientBoostingClassifier(
            n_estimators=200, learning_rate=0.1,
            max_depth=3, random_state=42)),
    ]),
    "SVM (RBF)": Pipeline([
        ('scaler', StandardScaler()),
        ('model', SVC(kernel='rbf', C=10, gamma='scale')),
    ]),
}

print(f"{'Model':>25s}  {'CV Accuracy':>12s}  {'CV F1':>10s}")
print("-" * 51)

for name, model in models.items():
    acc = cross_val_score(model, X_full, y_full,
                          cv=5, scoring='accuracy')
    f1 = cross_val_score(model, X_full, y_full,
                          cv=5, scoring='f1')
    print(f"{name:>25s}  {acc.mean():>8.3f} +/- {acc.std():.3f}  "
          f"{f1.mean():>6.3f}")

On this data (which has nonlinear patterns -- squared terms, an interaction between features 3 and 4), the SVM should be competitive with the tree-based ensembles. The logistic regression will lag behind because it can't capture nonlinearity without manual feature engineering (remember episode #15?). The decision tree will do OK but be unstable. The random forest and gradient boosting will be strong. And the SVM -- well, it depends on whether the RBF kernel matches the data's geometry well.

When SVMs shine and when they don't

After building all these models across episodes #10-20, here's my honest positioning of SVMs relative to the rest:

SVMs excel at:

  • High-dimensional data with relatively few samples. Text classification where you have 10,000+ TF-IDF features but only hundreds of documents? SVMs were THE dominant method for this before deep learning. Genomics data where you have 20,000 genes measured on 200 patients? SVMs often outperform tree-based methods here, because trees struggle to find meaningful splits in that enormous feature space with so few samples.
  • Clean, well-structured data where the boundary between classes is geometrically well-defined. The maximum margin principle provides strong mathematical generalization guarantees.
  • Small to medium datasets. The mathematical guarantees about the margin mean SVMs generalize well even with limited data, which is a real advantage over methods that need more samples to converge.

SVMs struggle with:

  • Large datasets. Training complexity is O(n^2) to O(n^3) in the number of samples. 10,000 samples? Fine. 100,000? Slow. A million? Impractical. Gradient boosting handles large data much better because tree construction scales roughly linearly with the number of samples.
  • Noisy data where the boundary is inherently fuzzy. Random forests handle noise gracefully because averaging reduces the impact of individual noisy points. SVMs can chase noisy support vectors unless you tune C carefully.
  • Multi-class classification. SVMs are inherently binary classifiers -- they draw a boundary between two classes. For multiple classes, sklearn uses one-vs-one (train a separate SVM for every pair of classes) or one-vs-rest (train one SVM per class, each separating that class from all others). With 10 classes, one-vs-one trains 45 separate SVMs. It works but adds computation.
  • Probability estimation. SVMs produce distances from the boundary, not probabilities. Setting probability=True in sklearn fits a separate logistic calibration model (Platt scaling) on top of the SVM output. This adds computation and isn't always well-calibrated. If you need reliable probabilities (like "this email is 87% spam"), logistic regression from episode #12 is a better starting point.
  • Interpretability. You can't easily explain "why" an SVM made a specific prediction, unlike decision trees where you can print the if/else logic. The decision boundary is defined by the support vectors and the kernel -- good luck explaining a kernel function to a non-technical stakeholder.
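The multi-class and probability points are easy to see directly in sklearn. On the 10-class digits dataset, requesting the raw one-vs-one scores exposes the 45 pairwise classifiers, and `probability=True` bolts Platt scaling on top (the digits pixels all share the same 0-16 scale, so I skip the scaler in this quick sketch):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 10 digit classes, 64 pixel features

# One-vs-one under the hood: k classes -> k*(k-1)/2 binary SVMs
ovo = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)
k = len(ovo.classes_)
print(f"{k} classes -> {k * (k - 1) // 2} pairwise SVMs")
print(f"OvO decision_function shape: {ovo.decision_function(X[:1]).shape}")

# decision_function gives boundary scores, not probabilities.
# probability=True fits Platt scaling on top -- extra training cost.
proba_clf = SVC(kernel='rbf', probability=True, random_state=42).fit(X, y)
proba = proba_clf.predict_proba(X[:1])
print(f"predict_proba shape: {proba.shape}, row sums to {proba.sum():.1f}")
```

Note that sklearn's default `decision_function_shape='ovr'` aggregates the pairwise scores into one column per class; the `'ovo'` setting above is just to make the 45 underlying classifiers visible.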

A complete pipeline: the apartment dataset revisited

Let me tie things together with our running apartment dataset. We've used this data in episodes #15, #17, #18, and #19 -- and each time we've compared how different models handle the same patterns. The SVM adds a new perspective:

from sklearn.metrics import mean_absolute_error

# Same apartment data from episodes #15, #17, #18, #19
np.random.seed(42)
n = 500

sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
age = np.random.uniform(0, 50, n)
floor = np.random.randint(0, 10, n).astype(float)
has_elevator = np.random.randint(0, 2, n).astype(float)

price = (2500 * sqm
         + 800 * rooms
         - 300 * age
         - 4000 * floor * (1 - has_elevator)
         + 20 * sqm * (50 - age) / 50
         + np.random.randn(n) * 12000)

X_apt = np.column_stack([sqm, rooms, age, floor, has_elevator])
apt_features = ["sqm", "rooms", "age", "floor", "elevator"]

X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_apt, price, test_size=0.2, random_state=42
)

# SVR pipeline (scaling is mandatory!)
svr_apt = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=100, gamma='scale', epsilon=0.1))
])
svr_apt.fit(X_tr_a, y_tr_a)

pred_te = svr_apt.predict(X_te_a)
r2 = r2_score(y_te_a, pred_te)
rmse = np.sqrt(mean_squared_error(y_te_a, pred_te))
mae = mean_absolute_error(y_te_a, pred_te)

print("=== Apartment Price -- SVR (RBF) ===\n")
print(f"Test R-squared: {r2:.4f}")
print(f"Test RMSE:      EUR {rmse:,.0f}")
print(f"Test MAE:       EUR {mae:,.0f}")

Compare this to the random forest from episode #18 (R2 ~ 0.93, no scaling needed, no tuning needed) and the gradient boosting from episode #19 (R2 ~ 0.95 with tuning). The SVR will likely be competitive but probably won't beat the tree ensembles on this kind of tabular data. And that's fine -- because SVMs aren't designed to dominate tabular data. They're designed for the scenarios listed above.

The model zoo: where we stand after 20 episodes

Let me step back and look at the full landscape we've built. Starting from nothing in episode #1, we now have five distinct families of supervised learning algorithms:

  • Linear models (#10-12) -- key strength: simple, fast, interpretable, good probabilities; key weakness: can't capture nonlinearity without engineering;
  • Decision trees (#17) -- key strength: interpretable, handles interactions automatically; key weakness: unstable, overfits easily;
  • Random forests (#18) -- key strength: robust, minimal tuning, rarely overfits; key weakness: can't extrapolate, step-function predictions;
  • Gradient boosting (#19) -- key strength: highest accuracy on tabular data, rewards tuning; key weakness: sensitive to hyperparameters, sequential training;
  • SVMs (#20) -- key strength: strong in high dimensions, elegant math, sparse; key weakness: scales poorly to large data, no native probabilities.

There is no universal winner. The right model depends on your data, your constraints, and your goals. A small genomics dataset? Try SVMs. A large tabular dataset where you need the best accuracy? Gradient boosting. Need a quick, reliable baseline? Random forest. Need interpretability above all? A single pruned decision tree or a linear model. Need calibrated probabilities? Logistic regression.

The good news is that the sklearn API (episode #16) makes switching between all of these trivial -- change the class name, keep the pipeline, compare cross-validation scores. You invested in learning the framework once, and now every new algorithm slots right in.

Looking ahead

With 20 episodes behind us, we've completed the major families of supervised learning algorithms. We've built them from scratch (linear regression in #10, logistic regression in #12, decision trees in #17, AdaBoost in #19) and we've learned to use the professional implementations via scikit-learn. We know how to prepare data (#14), engineer features (#15), evaluate models (#13), and tune hyperparameters (#16).

All of the above has been supervised learning -- we give the model labeled data (features + correct answers) and it learns to predict labels for new data. But there's a whole other world of machine learning where you DON'T have labels. You have data, but no "correct answers." And you want to discover structure: groups, patterns, anomalies, reduced representations. That's unsupervised learning, and it thinks about data in a fundamentally different way.

But first -- we're going to put everything we've learned into practice with a proper project.

Let's recap

We explored a model with a very different philosophy from the tree-based methods of episodes #17-19. Here's what we covered:

  • SVMs find the maximum margin boundary -- not just any correct boundary, but the one with the widest possible gap between classes. This margin comes with generalization guarantees from statistical learning theory;
  • Support vectors are the few training points closest to the boundary. They alone determine the model -- every other point is irrelevant. This sparsity is both elegant and memory-efficient;
  • The C parameter controls the tradeoff between wide margins (underfitting) and correct classification (overfitting) -- the bias-variance tradeoff again, same concept from episode #11 but with a different knob;
  • The kernel trick implicitly maps data to higher-dimensional spaces where linear separation becomes possible, without ever computing those high-dimensional features. RBF (the default) maps to infinite dimensions and can model any smooth boundary;
  • Feature scaling is mandatory for SVMs because the kernel computes distances -- use StandardScaler in a Pipeline (episodes #14 and #16);
  • SVR fits a tolerance tube around regression data, where epsilon controls the tube width and only points outside the tube (support vectors) affect the fit;
  • SVMs excel at high-dimensional, small-to-medium datasets (text, genomics) but struggle to scale to millions of samples where gradient boosting dominates;
  • We now have five distinct algorithm families in our toolkit (linear, trees, forests, boosting, SVMs) -- and the sklearn API lets you switch between them by changing one line of code.

Greetings! See you soon ;-)

@scipio


