Learn AI Series (#24) - Dimensionality Reduction - PCA


What will I learn

  • You will learn the curse of dimensionality -- why more features isn't always better, and how high-dimensional spaces break your distance-based algorithms;
  • Principal Component Analysis (PCA) from scratch in NumPy -- finding the axes of maximum variance using eigenvectors and eigenvalues;
  • the connection between PCA and the linear algebra from episode #8 -- covariance matrices, eigendecomposition, and what those eigenvectors actually represent geometrically;
  • how many components to keep using explained variance ratios and the scree plot;
  • PCA for visualization -- compressing 64 dimensions down to 2 so you can actually SEE your data;
  • PCA as preprocessing for ML pipelines -- denoising, decorrelation, and improving distance-based algorithms like K-Means and SVMs;
  • when PCA fails and why linear assumptions matter.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Learn AI Series (#24) - Dimensionality Reduction - PCA

At the end of episode #23 I hinted at a problem that sits underneath everything we've been doing with clustering. We talked about five different clustering algorithms, compared their strengths and weaknesses on various dataset shapes, and built a complete unsupervised learning toolkit. And then I mentioned this: what if your data lives in 50 dimensions, and most of those dimensions are noise? Distances become meaningless. Clustering algorithms that depend on distance start to choke. The signal drowns in the noise of irrelevant features.

Today we tackle that problem head-on with Principal Component Analysis -- PCA. It's one of the oldest and most widely-used techniques in all of machine learning, and it does something remarkably powerful: it finds the directions in which your data actually varies, keeps those, and throws away the rest. You end up with fewer features that carry more information per feature. Less noise. Cleaner distance calculations. Better clustering, better classification, and -- as a bonus -- the ability to visualize high-dimensional data in 2D or 3D so you can actually SEE what's going on.

The math connects directly to the linear algebra we covered in episode #8. If you found eigenvectors and eigenvalues abstract back then, today they become concrete and practical. PCA IS eigendecomposition, applied to the covariance matrix of your data. That sentence will make perfect sense by the end of this episode.

Let's dive right in.

The curse of dimensionality

Imagine you have a dataset with 500 features and 1,000 samples. That's 500 columns, each one a dimension in feature space. Now try to picture 500-dimensional space. You can't -- nobody can. Human intuition maxes out at three dimensions (and honestly, we struggle with three). But the problem goes beyond visualization. High-dimensional spaces have genuinely strange geometric properties that break the assumptions of many algorithms.

Here's the key insight: as the number of dimensions grows, data becomes increasingly sparse. Think about it concretely. In 1D, 10 evenly-spaced points cover a line segment pretty densely. In 2D, you'd need 10 x 10 = 100 points to cover a square at the same density. In 3D, 1,000. In 500D? You'd need 10^500 points -- more than the number of atoms in the universe, many times over -- to achieve the same coverage. With only 1,000 samples in 500 dimensions, your data is ridiculously sparse. Every point is essentially alone, surrounded by vast empty space in every direction.
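A quick back-of-the-envelope calculation makes that growth concrete (the 10-points-per-axis density is just an illustration):

```python
# Points needed to keep 10 samples per axis as dimensions grow
for d in [1, 2, 3, 10, 50, 500]:
    n_points = 10 ** d  # exact integer arithmetic -- no float overflow
    print(f"{d:>4d}D: 10^{d} points needed ({len(str(n_points))} digits)")
```

Exponential in the number of dimensions: each extra dimension multiplies the required sample count by 10.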

This sparsity destroys distance calculations. When everything is far from everything else, the difference between "near" and "far" becomes negligible. Let me show you:

import numpy as np

np.random.seed(42)
n_samples = 500

# As dimensions increase, max and min pairwise distances converge
print(f"{'Dimensions':>12s}  {'Min dist':>10s}  {'Max dist':>10s}  {'Ratio':>8s}")
print("-" * 46)

for n_dims in [2, 5, 10, 50, 100, 500]:
    X = np.random.randn(n_samples, n_dims)
    # Pairwise distances (sample 1000 random pairs for speed)
    idx_a = np.random.choice(n_samples, 1000)
    idx_b = np.random.choice(n_samples, 1000)
    keep = idx_a != idx_b  # drop self-pairs, which would give distance 0
    dists = np.linalg.norm(X[idx_a[keep]] - X[idx_b[keep]], axis=1)
    min_d, max_d = dists.min(), dists.max()
    ratio = max_d / min_d
    print(f"{n_dims:>12d}  {min_d:>10.2f}  {max_d:>10.2f}  {ratio:>8.2f}")

Watch that ratio column. In 2D the max distance might be 10x the min distance -- a meaningful spread. By 500 dimensions the ratio drops close to 1. Everything is approximately the same distance from everything else. K-Means (episode #22) assigns points to the nearest centroid -- but if all centroids are roughly equidistant, the assignment is basically random. DBSCAN (episode #23) defines clusters by density neighborhoods -- but in high dimensions, no point has meaningful neighbors. SVMs (episode #20) find maximum margins -- but margins become meaningless when the distance structure collapses.

This is the curse of dimensionality, and it's not just a theoretical concern. Real-world datasets frequently have hundreds or thousands of features. Genomic data might have 20,000 gene expression levels per sample. Text data with bag-of-words encoding can have vocabularies of 50,000+ words. Image pixels give you width x height x channels features. If most of those features are noise or redundant, you NEED to reduce dimensions before doing anything useful with the data.

The intuition: finding the best camera angle

PCA's core idea is actually very visual. Imagine a cloud of 3D data points shaped like a thin pancake -- wide in two directions, thin in the third. If I asked you to take a photograph that captures as much of the pancake's shape as possible, you'd naturally look at it from above -- straight down the thin axis. From that angle, the 2D photo captures almost all the interesting variation. Looking from the side would show you a thin line -- you'd lose most of the structure.

PCA does exactly this, automatically, in any number of dimensions. It finds the directions (called principal components) in which the data varies the most, then projects the data onto those directions. The first principal component points in the direction of maximum variance -- the "widest" spread of the data. The second component is the direction of maximum variance perpendicular to the first. The third is perpendicular to both. And so on. Each successive component captures less and less variance, until you reach directions where the data barely varies at all -- the "thin" dimensions of the pancake.

By keeping only the top K components (the directions with the most variance) and throwing away the rest, you compress your data from D dimensions to K dimensions while preserving the most important structure. The discarded dimensions were mostly noise anyway -- they didn't carry much information.
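Here's a minimal numeric sketch of that pancake picture (the axis scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# A "pancake" cloud: wide in x and y, thin in z
pancake = rng.normal(size=(1000, 3)) * np.array([5.0, 3.0, 0.1])

variances = pancake.var(axis=0)
print("Variance per axis:", variances.round(3))

# The top-down "photo" (dropping z) keeps almost all of the spread
kept = variances[:2].sum() / variances.sum()
print(f"2D projection keeps {kept:.2%} of the total variance")
```

PCA generalizes this: instead of dropping a conveniently-aligned axis, it rotates the data so the "thin" directions line up with axes you can drop.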

PCA from scratch: eigenvectors of the covariance matrix

The math behind PCA connects directly to the linear algebra we built up in episode #8. If you recall, we talked about matrices as transformations, eigenvectors as special directions that a matrix doesn't rotate (only stretches), and eigenvalues as the stretching factors. Today those concepts become practical tools.

Here's the algorithm, step by step:

import numpy as np

def pca_from_scratch(X, n_components):
    """PCA via eigendecomposition of the covariance matrix."""
    # Step 1: Center the data (subtract column means)
    means = X.mean(axis=0)
    X_centered = X - means

    # Step 2: Compute the covariance matrix
    # Each entry C[i,j] = how features i and j co-vary
    cov_matrix = np.cov(X_centered, rowvar=False)

    # Step 3: Eigendecomposition
    # eigenvectors = directions of variance
    # eigenvalues = amount of variance in each direction
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Step 4: Sort by eigenvalue (largest first)
    sorted_idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[sorted_idx]
    eigenvectors = eigenvectors[:, sorted_idx]

    # Step 5: Project data onto the top n_components eigenvectors
    components = eigenvectors[:, :n_components]
    X_projected = X_centered @ components

    # Explained variance: what fraction of total variance each component captures
    explained_variance = eigenvalues[:n_components] / eigenvalues.sum()

    return X_projected, explained_variance, components

Let me walk through each step because the details matter.

Step 1 -- Centering. We subtract the mean of each feature so the data cloud is centered at the origin. This is essential because PCA finds directions of maximum variance FROM the mean. If the data isn't centered, the first component will partly capture the mean offset instead of the spread, which wastes a component on something trivial. This is conceptually the same centering step we've been applying since episode #14 when we talked about data preparation.

Step 2 -- The covariance matrix. This is the heart of PCA. The covariance matrix C is a D x D matrix (where D is the number of features) where each entry C[i, j] measures how much features i and j vary together. The diagonal entries C[i, i] are the variances of individual features. The off-diagonal entries are the covariances -- positive means they increase together, negative means one goes up when the other goes down, zero means they're independent. This matrix completely describes the "shape" of the data cloud (assuming the data is roughly Gaussian, which is a limitation we'll discuss later).
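To demystify np.cov, here's the same matrix computed by hand for two toy features (the features themselves are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = 2 * a + rng.normal(size=200) * 0.5  # b co-varies strongly with a

X_toy = np.column_stack([a, b])
Xc = X_toy - X_toy.mean(axis=0)  # center first

# The textbook formula: C = Xc^T Xc / (n - 1)
C_manual = Xc.T @ Xc / (len(X_toy) - 1)

print("Manual:\n", C_manual.round(3))
print("Matches np.cov:", np.allclose(C_manual, np.cov(X_toy, rowvar=False)))
```

The large positive off-diagonal entry reflects the fact that b was built from a; for independent features it would hover near zero.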

Step 3 -- Eigendecomposition. Here's where episode #8 pays off. The eigenvectors of the covariance matrix are the principal components -- the directions of maximum variance. The eigenvalues are the amounts of variance along each direction. A large eigenvalue means the data spreads a lot in that direction (important!). A small eigenvalue means the data barely varies there (probably noise). We use np.linalg.eigh because the covariance matrix is guaranteed to be symmetric, and eigh exploits that for faster, more numerically stable computation.

Step 4 -- Sort. eigh returns eigenvalues in ascending order, but we want the largest first (the most informative directions first). So we reverse-sort.

Step 5 -- Project. The matrix multiplication X_centered @ components projects each data point onto the new coordinate system defined by the top eigenvectors. Each original D-dimensional point becomes an n_components-dimensional point. The projection preserves as much variance as possible given the reduced dimensionality.
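As a sanity check, this eigendecomposition route should agree with scikit-learn's PCA up to the sign of each component (eigenvector signs are arbitrary). A compact self-contained comparison on arbitrary synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_check = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 0.5, 0.1])

# Eigendecomposition route (same steps as pca_from_scratch)
Xc = X_check - X_check.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
proj_scratch = Xc @ eigvecs[:, order][:, :2]

# scikit-learn route (uses SVD internally, same subspace)
proj_sklearn = PCA(n_components=2).fit_transform(X_check)

# Agreement up to a per-component sign flip
match = np.allclose(np.abs(proj_scratch), np.abs(proj_sklearn))
print(f"Projections match up to sign: {match}")
```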

Let's test it on synthetic data where we KNOW the answer:

np.random.seed(42)

# Create 5D data where most variance is in 2 directions
base = np.random.randn(200, 2)
X = np.column_stack([
    base[:, 0] * 3,                                   # high variance
    base[:, 1] * 2,                                   # medium variance
    base[:, 0] * 0.5 + np.random.randn(200) * 0.1,   # correlated + noise
    np.random.randn(200) * 0.1,                        # pure noise
    np.random.randn(200) * 0.1,                        # pure noise
])

print(f"Original shape: {X.shape}")
print(f"Feature variances: {X.var(axis=0).round(2)}")

X_pca, var_explained, components = pca_from_scratch(X, n_components=2)

print(f"\nReduced shape: {X_pca.shape}")
print(f"Explained variance per component: {var_explained.round(4)}")
print(f"Total explained: {var_explained.sum():.1%}")
print(f"\nFirst component weights:")
for i, w in enumerate(components[:, 0]):
    print(f"  Feature {i}: {w:>7.3f}")

Five dimensions compressed down to two, and those two components capture the vast majority of the variance. The low-variance dimensions -- with variances of roughly 0.26, 0.01, and 0.01, compared to 9.0 and 4.0 for the signal dimensions -- are essentially discarded. PCA found the "pancake shape" in 5D space and photographed it from the best angle.

Look at the component weights too. The first principal component loads heavily on the high-variance features and ignores the noise features. PCA didn't just pick the two highest-variance original features -- it found the linear combination that maximizes variance, which can be (and often is) a mix of multiple original features. That's the power of eigendecomposition: it finds structure that no single feature alone reveals.

How many components to keep?

The explained variance ratio tells you what fraction of the total information each component carries. Plotting the cumulative sum gives you a clear picture of how much information you retain as you add components:

from sklearn.decomposition import PCA

# Fit PCA with ALL components to see the full spectrum
pca_full = PCA()
pca_full.fit(X)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)

print("Component | Variance | Cumulative | Bar")
print("-" * 55)
for i in range(len(pca_full.explained_variance_ratio_)):
    var = pca_full.explained_variance_ratio_[i]
    cum = cumulative[i]
    bar = "#" * int(var * 50)
    print(f"    PC{i+1}   |  {var:>6.1%}  |   {cum:>6.1%}   | {bar}")

You'll see a scree plot pattern (named after the scree at the base of a cliff). The first couple of components carry most of the variance, then there's a sharp drop, and the remaining components contribute almost nothing. That sharp drop is your signal-to-noise boundary. Components before the drop are capturing real structure. Components after it are fitting noise.

Common decision rules:

  • 95% rule: keep enough components to explain 95% of the total variance. This is the safe default for most applications. Conservative, rarely loses important information.
  • Elbow / knee: look for the sharp drop in individual component variance (same idea as the elbow method for choosing K in K-Means from episode #22). Components before the drop are signal; after it, noise.
  • Task-based: if you're using PCA as preprocessing before a classifier, try different numbers of components and see which gives the best cross-validation score. The downstream task IS the evaluator. Remember from episode #13 how we said evaluation should always be tied to the actual objective? Same principle here.

For visualization, you always reduce to 2 or 3 components (because that's what a screen or paper can display). For compression and preprocessing, 95% is the standard starting point.
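scikit-learn implements the 95% rule directly: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A quick sketch on made-up data with three strong directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 10 features: 3 strong directions, 7 near-noise
X_demo = rng.normal(size=(500, 10)) * np.array([5, 3, 2] + [0.1] * 7)

pca95 = PCA(n_components=0.95)  # a fraction, not a count
X_red = pca95.fit_transform(X_demo)

print(f"Kept {pca95.n_components_} of 10 components")
print(f"Cumulative variance: {pca95.explained_variance_ratio_.sum():.1%}")
```

We'll use exactly this trick inside the pipelines below.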

PCA for visualization: see your data

This is where PCA becomes immediately, viscerally useful. You've got a dataset with 64 features and you want to understand its structure. You can't plot 64 dimensions. But you CAN project to 2D and look:

from sklearn.datasets import load_digits

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print(f"Digits dataset: {X_digits.shape[0]} samples, "
      f"{X_digits.shape[1]} features")
print(f"Each sample is an 8x8 pixel image flattened to 64 numbers")

# Project 64D -> 2D
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_digits)

print(f"After PCA: {X_2d.shape[0]} samples, {X_2d.shape[1]} features")
print(f"Variance captured: "
      f"{pca_2d.explained_variance_ratio_.sum():.1%}")

# Where do different digits land in 2D space?
print(f"\n{'Digit':>6s}  {'PC1 center':>11s}  {'PC2 center':>11s}  {'Count':>6s}")
print("-" * 40)
for digit in range(10):
    mask = y_digits == digit
    center = X_2d[mask].mean(axis=0)
    print(f"{digit:>6d}  {center[0]:>11.1f}  {center[1]:>11.1f}  "
          f"{mask.sum():>6d}")

Even with just two components out of 64, different digit classes form distinguishable clusters. Digits that look visually similar (like 3, 5, and 8 which share curved strokes) cluster closer together in PCA space than digits that look very different (like 0 and 1). PCA preserved the high-level structure of handwritten digit similarity while dropping 62 dimensions of detail. It's not perfect -- two components only capture about 28% of the total variance for this dataset -- but it's enough to reveal meaningful structure that was invisible in the original 64D space.

This pairing of PCA with clustering is extremely powerful. Remember how I said in episode #22 that K-Means struggles in high dimensions because distances become meaningless? Run PCA first, reduce to maybe 10-15 components that capture 90%+ of the variance, THEN cluster. The clustering runs faster and often finds cleaner groups because the noisy dimensions that were polluting the distance calculation are gone.

PCA as preprocessing for ML

Beyond visualization, PCA serves as a preprocessing step that can improve downstream ML models. It does three things simultaneously: reduces noise (by dropping low-variance components that are mostly noise), reduces computation (fewer features means faster training), and decorrelates features (by construction, principal components are orthogonal -- zero correlation between them).

That decorrelation property is particularly valuable. Many algorithms assume or benefit from uncorrelated inputs. Linear regression (episode #10) has problems with multicollinearity -- highly correlated features make coefficient estimates unstable. K-Means (episode #22) and SVMs (episode #20) use distance calculations that can be dominated by correlated feature groups that effectively "vote" multiple times. PCA eliminates these issues by rotating the data into a coordinate system where features are guaranteed to be uncorrelated.
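The decorrelation claim is easy to verify: the correlation matrix of PCA-transformed data is the identity up to numerical precision. A small check with made-up correlated features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
a = rng.normal(size=300)
X_corr = np.column_stack([
    a,
    0.8 * a + 0.3 * rng.normal(size=300),  # strongly correlated with a
    rng.normal(size=300),                   # independent
])

def max_off_diag_corr(data):
    """Largest absolute off-diagonal entry of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    return np.abs(corr - np.diag(np.diag(corr))).max()

print(f"Before PCA: {max_off_diag_corr(X_corr):.3f}")
print(f"After PCA:  {max_off_diag_corr(PCA().fit_transform(X_corr)):.2e}")
```

The "after" value sits at floating-point noise level -- the components are uncorrelated by construction.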

Let's measure the impact:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Without PCA: full 64 features
scores_full_rf = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_digits, y_digits, cv=5
)

# Without PCA: SVM on full features (scaled)
pipe_svm_full = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', random_state=42))
])
scores_full_svm = cross_val_score(pipe_svm_full, X_digits, y_digits, cv=5)

# With PCA retaining 95% variance: Random Forest
pipe_pca_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])
scores_pca_rf = cross_val_score(pipe_pca_rf, X_digits, y_digits, cv=5)

# With PCA: SVM
pipe_pca_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('svm', SVC(kernel='rbf', random_state=42))
])
scores_pca_svm = cross_val_score(pipe_pca_svm, X_digits, y_digits, cv=5)

n_comp = PCA(n_components=0.95).fit(
    StandardScaler().fit_transform(X_digits)
).n_components_

print(f"PCA retains {n_comp} components (of 64) for 95% variance\n")
header_pca = f"PCA ({n_comp})"
print(f"{'Model':<20s}  {'Full (64)':>12s}  {header_pca:>12s}")
print("-" * 48)
print(f"{'Random Forest':<20s}  {scores_full_rf.mean():.3f} +/- "
      f"{scores_full_rf.std():.3f}  {scores_pca_rf.mean():.3f} +/- "
      f"{scores_pca_rf.std():.3f}")
print(f"{'SVM (RBF)':<20s}  {scores_full_svm.mean():.3f} +/- "
      f"{scores_full_svm.std():.3f}  {scores_pca_svm.mean():.3f} +/- "
      f"{scores_pca_svm.std():.3f}")

The results are instructive. For Random Forest, PCA might actually hurt slightly -- trees handle irrelevant features naturally by simply not splitting on them (remember from episode #17 how decision trees pick the best feature at each split? If a feature is useless, it never gets picked). So removing features via PCA doesn't help trees much, and the information loss can slightly reduce accuracy.

For SVM though, PCA can help -- especially on high-dimensional data where the RBF kernel's distance computation is polluted by noisy features. Fewer, cleaner dimensions means the kernel measures more meaningful similarity. This is the same reasoning behind why we always scale features for SVMs (episode #20) -- distance-based methods are sensitive to what's in the feature space.

The general pattern: tree-based models (decision trees, random forests, gradient boosting) don't need PCA because they have built-in feature selection. Distance-based models (K-Means, SVMs, KNN) and linear models that suffer from multicollinearity (like logistic regression) benefit from PCA because it removes noise and correlation from the inputs. Use PCA selectively, not as a blanket preprocessing step.

One practical detail: note the StandardScaler before PCA in the pipeline. This is critical. PCA finds directions of maximum variance, so features with large ranges (like "annual income" in the thousands) will dominate features with small ranges (like "age" from 18 to 80). If you don't standardize first, PCA effectively just picks the features with the biggest numbers, which is almost never what you want. Always scale before PCA. Always.

PCA for denoising

Here's a subtle but powerful application that builds on everything we've just learned. If the signal in your data is concentrated in the top principal components and the noise is spread across ALL components, then you can denoise by projecting to fewer components and reconstructing back to the original space:

# Add noise to digit images
np.random.seed(42)
noise_level = 4
X_noisy = X_digits + np.random.randn(*X_digits.shape) * noise_level

# Denoise: project to K components, then reconstruct
for n_comp in [10, 20, 30, 50]:
    pca_denoise = PCA(n_components=n_comp)
    X_denoised = pca_denoise.inverse_transform(
        pca_denoise.fit_transform(X_noisy)
    )
    mse_noisy = np.mean((X_noisy - X_digits) ** 2)
    mse_denoised = np.mean((X_denoised - X_digits) ** 2)
    improvement = (1 - mse_denoised / mse_noisy) * 100

    print(f"Components: {n_comp:>3d}  |  "
          f"Noisy MSE: {mse_noisy:.1f}  |  "
          f"Denoised MSE: {mse_denoised:.1f}  |  "
          f"Improvement: {improvement:.0f}%")

The trick is inverse_transform -- it takes the reduced-dimension data and projects it back to the original 64-dimensional space. But because we threw away the low-variance components (which were mostly noise), the reconstruction is a cleaned-up version of the original. The signal that lived in the top components survives the round trip. The noise that was spread thinly across all 64 components gets wiped out in the components we dropped.
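Under the hood (with whitening off, scikit-learn's default), inverse_transform is just the transposed projection plus the mean. A quick sketch verifying that on arbitrary data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_rec = rng.normal(size=(200, 8)) * np.linspace(3.0, 0.1, 8)

pca3 = PCA(n_components=3).fit(X_rec)
X_low = pca3.transform(X_rec)

# Manual round trip: back-project onto the components, re-add the mean
X_back_manual = X_low @ pca3.components_ + pca3.mean_

print(np.allclose(X_back_manual, pca3.inverse_transform(X_low)))  # True
```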

There's a sweet spot for n_comp. Too few components and you lose signal along with the noise. Too many and you keep the noise. You're looking for the transition point in the scree plot -- the same signal-noise boundary we talked about earlier. This is essentially a form of regularization (remember how we discussed adding constraints to prevent overfitting in episodes #11 and #19?). By limiting the number of components, you're constraining the reconstruction to use only the most informative dimensions, which acts as a smoothing filter.

This denoising principle shows up again later in the series when we get to autoencoders -- neural networks that learn to compress data to a small representation and reconstruct it. An autoencoder with a linear activation function and mean squared error loss actually learns the SAME subspace as PCA. The neural network version is just PCA in disguise. The interesting part comes when you make the autoencoder nonlinear, which lets it capture structure that PCA's linear projections miss.

Scaling matters: a practical gotcha

I mentioned this above, but it's worth hammering on because I've seen it bite people repeatedly. PCA maximizes variance. Features measured in large units (dollars, meters, milliseconds) have larger variance just because of their SCALE, not because they carry more information. Without standardization, PCA will pick up on those scale differences and give you components that are really just "which feature has the biggest numbers."

from sklearn.preprocessing import StandardScaler

# Simulate features with wildly different scales
np.random.seed(42)
n = 300

# Feature A: income (large numbers, actually important)
# Feature B: age (small numbers, actually important)
# Feature C: noise (large numbers, NOT important)
income = np.concatenate([
    np.random.normal(30000, 5000, 150),
    np.random.normal(80000, 10000, 150),
])
age = np.concatenate([
    np.random.normal(25, 3, 150),
    np.random.normal(55, 5, 150),
])
noise = np.random.normal(50000, 20000, 300)  # big numbers, pure noise

X_mixed = np.column_stack([income, age, noise])
labels_true = np.array([0] * 150 + [1] * 150)

# PCA WITHOUT scaling
pca_raw = PCA(n_components=1)
X_raw_1d = pca_raw.fit_transform(X_mixed)

# PCA WITH scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_mixed)
pca_scaled = PCA(n_components=1)
X_scaled_1d = pca_scaled.fit_transform(X_scaled)

print("Without scaling:")
print(f"  Component weights: {pca_raw.components_[0].round(3)}")
print(f"  (the large-scale features dominate; age is ignored)")

print("\nWith scaling:")
print(f"  Component weights: {pca_scaled.components_[0].round(3)}")
print(f"  (loads on income AND age -- the features with real structure)")

Without scaling, the first component is driven entirely by the large-scale features -- income and the noise feature, whose variances are in the hundreds of millions -- while age (variance around 240) is ignored, even though it carries real cluster structure. With scaling, PCA correctly identifies that income and age carry the real structure. This is the same lesson we've learned with K-Means (episode #22) and SVMs (episode #20) -- any algorithm that depends on magnitude needs standardized inputs. PCA is no exception.

What PCA cannot do

PCA is powerful, but its limitations are real and worth understanding clearly.

PCA is linear. It finds linear combinations of features -- straight axes in the original space. If the meaningful structure in your data curves or twists through feature space (like the Swiss roll -- a 2D sheet rolled up into a 3D spiral), PCA will fail to unroll it. It'll project the spiral onto a flat plane, and the points that were far apart on the manifold surface will end up on top of each other in the projection. For this kind of nonlinear structure, you need methods like t-SNE and UMAP that can "unfold" curved manifolds. We'll get into those soon.
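You can quantify this with scikit-learn's Swiss roll generator: points that end up as nearest neighbors in PCA's 2D projection can be far apart in the original 3D space, because the projection stacked distant parts of the roll on top of each other. A rough sketch:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

X_roll, t = make_swiss_roll(n_samples=1000, random_state=42)
X_roll_2d = PCA(n_components=2).fit_transform(X_roll)

# Nearest neighbor of every point in the 2D projection
d2 = ((X_roll_2d[:, None, :] - X_roll_2d[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(d2, np.inf)
nn = d2.argmin(axis=1)

# True 3D distance to that "nearest" neighbor -- sometimes huge,
# because the projection collapsed a whole direction of the roll
d3 = np.linalg.norm(X_roll - X_roll[nn], axis=1)
print(f"Max 3D distance between 2D nearest neighbors: {d3.max():.1f}")
print(f"Typical 3D neighbor distance: {np.median(d3):.2f}")
```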

PCA maximizes variance, not class separation. The direction of maximum variance isn't necessarily the direction that best separates your classes. Imagine two classes that overlap along the high-variance direction but are completely separated along a low-variance direction. PCA will keep the high-variance direction (useless for classification) and throw away the low-variance one (the one that actually matters). Linear Discriminant Analysis (LDA) is the supervised alternative that maximizes between-class separation instead of total variance. But LDA requires labels, which PCA doesn't -- and in unsupervised settings, PCA is often all you have.
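Here's a contrived sketch of that failure mode (the features are invented for the demo):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n = 200
# Feature 0: huge variance, same distribution for both classes (useless)
# Feature 1: tiny variance, but cleanly separates the classes
useless = rng.normal(scale=10.0, size=2 * n)
separator = np.concatenate([rng.normal(-1, 0.1, n), rng.normal(1, 0.1, n)])
X_cls = np.column_stack([useless, separator])
y_cls = np.array([0] * n + [1] * n)

pc1 = PCA(n_components=1).fit_transform(X_cls).ravel()

def separation(vals, labels):
    """Class-mean gap relative to the overall spread along this axis."""
    gap = abs(vals[labels == 0].mean() - vals[labels == 1].mean())
    return gap / vals.std()

print(f"Separation along PC1:       {separation(pc1, y_cls):.2f}")
print(f"Separation along feature 1: {separation(X_cls[:, 1], y_cls):.2f}")
```

PC1 latches onto the high-variance but useless feature, so the classes barely separate along it -- while the discarded low-variance feature separates them cleanly.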

PCA components are hard to interpret. Each component is a weighted combination of ALL original features. The first component might be "0.4 x income - 0.3 x age + 0.5 x spending + 0.2 x visits - ..." That's mathematically precise but doesn't map to a simple concept. You can't say "component 1 IS income" -- it's a mix of everything. For applications where interpretability matters (like explaining to a business stakeholder why a customer is in segment A), the loss of feature meaning can be a real drawback. Sometimes keeping the original features and doing feature selection (episode #15) is better than transforming them into opaque components.

PCA assumes linear correlations. The covariance matrix captures linear relationships between features. If features are related nonlinearly (like: feature B = feature A squared), PCA won't detect that relationship, and it might keep both features as if they're independent. Kernel PCA exists to handle nonlinear relationships, but it's more expensive and harder to tune.

Putting it all together: PCA + clustering pipeline

Let's circle back to where we started -- the connection between dimensionality reduction and clustering. In episodes #22 and #23 we built clustering pipelines for customer segmentation, density-based grouping, and hierarchical analysis. Now let's add PCA to the mix and see how it improves things on higher-dimensional data:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Use the digits dataset -- 64 dimensions is enough to
# demonstrate the curse of dimensionality
X_scaled_digits = StandardScaler().fit_transform(X_digits)

# Clustering WITHOUT PCA
km_full = KMeans(n_clusters=10, n_init=10, random_state=42)
labels_full = km_full.fit_predict(X_scaled_digits)
sil_full = silhouette_score(X_scaled_digits, labels_full)

# Clustering WITH PCA (various component counts)
print(f"{'Components':>12s}  {'Silhouette':>12s}  {'Var explained':>14s}")
print("-" * 42)
print(f"{'64 (full)':>12s}  {sil_full:>12.3f}  {'100.0%':>14s}")

for n_comp in [5, 10, 15, 20, 30]:
    pca = PCA(n_components=n_comp)
    X_reduced = pca.fit_transform(X_scaled_digits)
    var_explained = pca.explained_variance_ratio_.sum()

    km = KMeans(n_clusters=10, n_init=10, random_state=42)
    labels_pca = km.fit_predict(X_reduced)
    sil_pca = silhouette_score(X_reduced, labels_pca)

    print(f"{n_comp:>12d}  {sil_pca:>12.3f}  {var_explained:>13.1%}")

You'll likely see that clustering quality (measured by silhouette score) actually IMPROVES when you reduce from 64 to maybe 15-20 components, even though you're throwing away information. That sounds paradoxical, but it's the curse of dimensionality at work. By removing the noisy dimensions that were diluting the distance signal, PCA lets K-Means compute more meaningful distances and find tighter, better-separated clusters.

This is the same phenomenon we observed with feature scaling. Adding StandardScaler before K-Means didn't add information -- it just made the existing information easier for the algorithm to use. PCA goes a step further: it actively removes the dimensions that were making the algorithm's job harder. Less truly can be more.

Soooooo, what have we learned?

We covered a LOT of ground today. Here's the full picture:

  • The curse of dimensionality makes high-dimensional data sparse and distances meaningless -- this directly undermines K-Means (episode #22), DBSCAN (episode #23), SVMs (episode #20), and any distance-based algorithm. PCA is the classic countermeasure;
  • PCA finds the directions of maximum variance in your data and projects onto them -- using eigenvectors of the covariance matrix. This is eigendecomposition (episode #8) applied to real data. The eigenvectors are the principal components; the eigenvalues tell you how much variance each direction captures;
  • The explained variance ratio tells you how much information each component carries. Keep enough for 95% coverage (safe default), or use the scree plot to find the signal-noise boundary -- same visual inspection idea as the elbow method from episode #22;
  • PCA is invaluable for visualization (reduce to 2D and actually see your clusters), denoising (project and reconstruct to strip noise), preprocessing (decorrelate features, remove noise before distance-based algorithms), and compression (fewer features = faster training);
  • Always standardize before PCA -- StandardScaler first, PCA second. Features with large scales dominate variance and bias the components. Same lesson as episode #20 (SVMs) and episode #22 (K-Means);
  • PCA is linear -- it can't unfold curved manifolds like the Swiss roll. For nonlinear dimensionality reduction, we need different tools that can capture the curved geometry of data;
  • PCA maximizes variance, not class separation. For supervised problems, consider checking cross-validation scores with different component counts rather than blindly trusting the 95% rule;
  • Distance-based models (SVMs, K-Means, KNN) benefit most from PCA preprocessing. Tree-based models (random forests, gradient boosting) already handle irrelevant features and rarely need PCA.

PCA is where unsupervised learning starts getting really interesting. We can find clusters in data, and now we can compress that data to make the clusters cleaner and visible. The natural next step is to handle the cases where PCA's linearity assumption breaks down -- data that lives on curved surfaces in high-dimensional space, where the relationships between points follow the surface, not straight lines through the ambient space. There are methods that can unfold those surfaces, and they produce visualizations that are genuinely stunning in how clearly they reveal structure PCA misses ;-)

Cheers!

@scipio


