Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending

What will I learn
- You will learn stacking -- training a meta-learner on top of base model predictions to squeeze out extra accuracy;
- blending -- a simpler stacking variant that uses a holdout set instead of cross-validation;
- voting classifiers -- hard vs soft voting and when each helps;
- the mixture of experts concept -- routing inputs to specialized models;
- why model diversity matters more than individual model accuracy;
- practical strategies for building competition-winning ensembles;
- when ensembles are worth the added complexity (and when they're absolutely not).
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending (this post)
In episodes #18 and #19, we explored two fundamental ensemble strategies: bagging (random forests) and boosting (gradient boosting). Both combine multiple models, but in fundamentally different ways -- bagging trains independent models in parallel to reduce variance, while boosting trains sequential models that correct each other's mistakes. These two ideas alone dominate most tabular ML competitions and production systems.
But there's a third ensemble strategy that often delivers the final push from "very good" to "winning": stacking. Instead of averaging or voting, stacking trains a new model on top of your base models' predictions. It learns which base models to trust in which situations, how to weight their contributions, and how to correct their systematic errors. This is the technique that appears in almost every top Kaggle solution -- the secret sauce that separates competitive entries from winning ones.
And at the very end of episode #32, I mentioned that combining multiple models is conceptually similar to maintaining uncertainty over which model is correct. Today we make that idea concrete. We'll build all three advanced ensemble approaches -- voting, blending, and stacking -- from scratch and with scikit-learn, and understand exactly when each is worth the added complexity.
Here we go!
Voting: the simplest ensemble
The most straightforward way to combine models: let them vote. Train several different classifiers on the same data and combine their predictions. Two flavors:
Hard voting: each model casts a vote for its predicted class; the majority wins. If three models predict [A, B, A], the ensemble predicts A. Simple majority rule, nothing fancy.
Soft voting: each model outputs class probabilities; you average the probabilities across models and pick the class with the highest average. This is almost always better than hard voting because it uses the models' confidence levels, not just their binary decisions. A model that says "92% class A" should count more than one that says "51% class A" -- soft voting captures that difference, hard voting doesn't.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (VotingClassifier,
                              RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Generate a moderately challenging dataset
X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=10, n_redundant=5,
    random_state=42
)

# Four fundamentally different classifiers
models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(
        n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))
]

# Individual performance
print("Individual model performance (5-fold CV):")
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"  {name:>4s}: {scores.mean():.4f} "
          f"+/- {scores.std():.4f}")

# Hard voting
hard_vote = VotingClassifier(
    estimators=models, voting='hard')
scores = cross_val_score(hard_vote, X, y, cv=5)
print(f"\n  Hard voting: {scores.mean():.4f} "
      f"+/- {scores.std():.4f}")

# Soft voting
soft_vote = VotingClassifier(
    estimators=models, voting='soft')
scores = cross_val_score(soft_vote, X, y, cv=5)
print(f"  Soft voting: {scores.mean():.4f} "
      f"+/- {scores.std():.4f}")
You should see the soft voting ensemble performing at least as well as the best individual model, and often slightly better. That "slightly better" might seem underwhelming, but in competitive settings where everyone has tuned their models to the max, that 0.3% improvement is the difference between 5th place and 1st.
Voting works because different models make different errors. A logistic regression might misclassify non-linear boundaries, while a random forest might overfit on noise in certain regions. When their errors don't overlap -- when they're uncorrelated -- the ensemble corrects individual mistakes. The key insight: diversity of error matters more than individual accuracy.
Why diversity matters: the math behind it
Let me make the diversity argument rigorous, because this is the single most important idea in all of ensemble learning. Consider an extreme case: five identical logistic regression models trained on the same data. Voting gives you exactly the same predictions as a single model -- zero benefit. The models agree on everything, including their mistakes.
Now consider five models that are each 70% accurate but make mistakes on completely different samples. The majority vote is correct whenever at least three of five models are right. If their errors are truly independent, the probability of three or more independent 70%-accurate models all being right is about 84%. That's a massive jump from 70%.
from scipy.stats import binom

# Probability that the majority of n models with accuracy p
# are correct (assuming independent errors)
def majority_vote_accuracy(n_models, individual_accuracy):
    """Compute ensemble accuracy under independent errors."""
    majority = n_models // 2 + 1
    # P(at least majority correct)
    return 1 - binom.cdf(majority - 1, n_models, individual_accuracy)

accuracies = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]
ensemble_sizes = [3, 5, 7, 11, 21]

print("Majority vote accuracy (independent errors):\n")
header = f"{'p':>6s}" + "".join(
    f"   n={n:>2d}" for n in ensemble_sizes)
print(header)
print("-" * len(header))
for p in accuracies:
    row = f"{p:>6.2f}"
    for n in ensemble_sizes:
        ens_acc = majority_vote_accuracy(n, p)
        row += f"  {ens_acc:.3f}"
    print(row)

print("\nNotice: if individual accuracy > 50%, "
      "adding models ALWAYS helps")
print("But the gains diminish -- going from 3 to 5 "
      "helps more than 11 to 21")
Three things jump out of this table. First, if individual models are better than a coin flip (>50% accuracy), ensemble accuracy improves with more models. Always. Second, the gains are largest when individual accuracy is moderate -- a 70% accurate model benefits enormously from ensembling, while a 95% accurate model barely improves. Third, there are diminishing returns: the jump from 1 to 3 models is bigger than from 3 to 5, which is bigger than from 5 to 7. In practice, 3-7 diverse base models capture most of the benefit.
(Having said that, these numbers assume perfectly independent errors, which is never true in practice. Real models trained on the same data share some error patterns. The practical gain is always less than the theoretical maximum -- but diversity still helps, which is why mixing fundamentally different algorithms is critical.)
This is why the best ensembles combine fundamentally different algorithms. Linear models, tree-based models, SVMs, and (later) neural networks all have different inductive biases -- they make different assumptions about data structure. A tree-based model sees axis-aligned decision boundaries. An SVM sees margin-maximizing hyperplanes. A linear model sees, well, linear relationships. Their errors are naturally uncorrelated because they approach the problem from completely different angles.
The practical corollary: adding a fifth gradient boosting variant to an ensemble of four gradient boosting variants helps far less than adding a logistic regression or SVM, even if that logistic regression performs worse individually. Diversity trumps individual performance. Every single time.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier

# Demonstrate: diverse models > strong homogeneous models
# Ensemble A: four tree-based models (low diversity)
trees_only = VotingClassifier(
    estimators=[
        ('rf1', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('rf2', RandomForestClassifier(
            n_estimators=200, max_depth=10, random_state=7)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
        ('et', ExtraTreesClassifier(
            n_estimators=100, random_state=42)),
    ],
    voting='soft'
)

# Ensemble B: four fundamentally different algorithms (high diversity)
diverse_mix = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
        ('knn', KNeighborsClassifier(n_neighbors=7)),
    ],
    voting='soft'
)

# Compare
for name, ens in [("Tree-only ensemble", trees_only),
                  ("Diverse ensemble", diverse_mix)]:
    scores = cross_val_score(ens, X, y, cv=5)
    print(f"{name:>22s}: {scores.mean():.4f} "
          f"+/- {scores.std():.4f}")
Blending: stacking's simpler cousin
Now we get to something more interesting than simple voting. What if, instead of giving each model an equal voice (or even a fixed weight), we trained a new model to learn the optimal combination? This is the central idea behind both blending and stacking.
Blending is the simpler version. The process:
- Split your training data into a training portion and a blending holdout
- Train base models on the training portion only
- Generate predictions on the blending holdout -- these predictions become features for the meta-model
- Train a meta-model (typically logistic regression) on those prediction features
- At inference time: base models predict -> meta-model combines their predictions
The meta-model learns which base models to trust and how much. If gradient boosting consistently outperforms logistic regression on this dataset, the meta-model will assign it a higher weight. But it can also learn more subtle patterns: "trust the SVM when the base predictions are confident, but trust the random forest when predictions are uncertain."
from sklearn.model_selection import train_test_split

# Split: 70% for training base models, 30% for blending
X_train, X_blend, y_train, y_blend = train_test_split(
    X, y, test_size=0.3, random_state=42
)

base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(
        n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
]

# Step 1: Train base models on the training portion
print(f"Training set: {len(X_train)} samples")
print(f"Blend set: {len(X_blend)} samples\n")
trained_base = []
for name, model in base_models:
    model.fit(X_train, y_train)
    trained_base.append((name, model))
    acc = model.score(X_blend, y_blend)
    print(f"  {name:>4s} accuracy on blend set: {acc:.4f}")

# Step 2: Generate blend features (probability predictions)
blend_features = np.zeros((len(X_blend), len(base_models)))
for i, (name, model) in enumerate(trained_base):
    blend_features[:, i] = model.predict_proba(X_blend)[:, 1]
print(f"\nBlend features shape: {blend_features.shape}")
print("  (each column = one base model's probability output)")

# Step 3: Train meta-model on blend features
meta_model = LogisticRegression()
meta_model.fit(blend_features, y_blend)
print("\nMeta-model learned weights:")
for i, (name, _) in enumerate(base_models):
    print(f"  {name:>4s}: coefficient = "
          f"{meta_model.coef_[0][i]:>+.3f}")
print(f"  Intercept: {meta_model.intercept_[0]:>+.3f}")
# Note: this score is in-sample for the meta-model -- use a
# separate test set for an honest estimate
print(f"\nBlend accuracy: "
      f"{meta_model.score(blend_features, y_blend):.4f}")
The meta-model's coefficients tell you which base models it trusts most. If the gradient boosting coefficient is 1.8 while logistic regression's is 0.3, the meta-model has learned that gradient boosting's predictions are more informative for this particular dataset. This is data-driven weighting, not guesswork -- and it automatically adapts to whatever dataset you throw at it.
The blending tradeoff
The downside of blending is real: you sacrifice part of your training data for the blending holdout. With 1,000 samples, losing 300 to blending might hurt base model performance. With 100 samples, losing 30 would be devastating. The base models see less data, so they learn less, and the meta-model is trained on a relatively small holdout. This is the classic bias-variance tradeoff from episode #13 showing up again in a new disguise.
On the other hand, blending is dead simple to implement, fast to run, and easy to debug. No nested cross-validation, no complex out-of-fold bookkeeping. For prototyping and quick experiments, it's often the pragmatic first choice.
Stacking: the full treatment
Stacking improves on blending by using cross-validation instead of a holdout set, so every training sample gets used both for training base models and for generating meta-features. No data wasted. The process:
- For each base model, run K-fold cross-validation on the training data
- In each fold, the model is trained on K-1 folds and predicts on the held-out fold
- After all K folds, every training sample has an out-of-fold prediction from each base model
- These out-of-fold predictions become features for the meta-model
- Retrain each base model on the full training data for final inference
This is more complex than blending, but it uses all your data efficiently. Every sample contributes to training AND to generating meta-features, just never at the same time (which prevents leakage).
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# Build out-of-fold predictions manually
# (to show exactly what StackingClassifier does internally)
n_folds = 5
kf = StratifiedKFold(n_splits=n_folds, shuffle=True,
                     random_state=42)

base_classifiers = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(
        n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
]

# Out-of-fold predictions for each base model
oof_predictions = np.zeros((len(X), len(base_classifiers)))
for model_idx, (name, clf) in enumerate(base_classifiers):
    print(f"Generating OOF predictions for {name}...")
    for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
        # Clone the classifier for each fold
        fold_clf = clone(clf)
        fold_clf.fit(X[train_idx], y[train_idx])
        # Predict probabilities on the held-out fold
        oof_predictions[val_idx, model_idx] = (
            fold_clf.predict_proba(X[val_idx])[:, 1]
        )
    print(f"  OOF accuracy: "
          f"{((oof_predictions[:, model_idx] > 0.5) == y).mean():.4f}")

# Train meta-model on out-of-fold predictions
meta = LogisticRegression()
meta.fit(oof_predictions, y)
meta_preds = (meta.predict_proba(oof_predictions)[:, 1] > 0.5
              ).astype(int)
print(f"\nStacking (manual) accuracy: "
      f"{(meta_preds == y).mean():.4f}")

# Compare: scikit-learn does this in one line
from sklearn.ensemble import StackingClassifier
stacking = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
        ('svm', SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5
)
scores = cross_val_score(stacking, X, y, cv=5)
print(f"Stacking (sklearn): {scores.mean():.4f} "
      f"+/- {scores.std():.4f}")
Scikit-learn's StackingClassifier handles the entire out-of-fold prediction process internally -- it runs cross-validation for each base model, collects the out-of-fold predictions as features, and trains the meta-model on them. The cv=5 parameter controls the number of folds used for generating meta-features. Under the hood, it's doing exactly what we just built by hand.
A common question: what should the meta-model be? Usually something simple -- logistic regression or a small ridge regression. The reasoning connects directly to the bias-variance tradeoff from episode #13: your base models already capture the complex patterns in the data. The meta-model's job is just to learn the optimal weighting and correction of those patterns. A complex meta-model risks overfitting to the out-of-fold predictions, especially if you have few base models (which means few meta-features). With 4 base models, you have 4 meta-features -- logistic regression handles that perfectly. Using a gradient boosting meta-model on 4 features would be like bringing a machine gun to a thumb wrestling match ;-)
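To make the "keep the meta-model simple" advice concrete, here's a quick side-by-side -- a self-contained sketch on a synthetic dataset like the one above (the names stack_simple and stack_complex are my own, not from the series code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (StackingClassifier,
                              RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=10, n_redundant=5,
    random_state=42
)

base = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(
        n_estimators=100, random_state=42)),
]

# Simple meta-model: logistic regression on 3 meta-features
stack_simple = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(),
    cv=5
)

# Complex meta-model: gradient boosting on the same 3 meta-features
stack_complex = StackingClassifier(
    estimators=base,
    final_estimator=GradientBoostingClassifier(
        n_estimators=100, random_state=42),
    cv=5
)

for name, model in [('LogReg meta', stack_simple),
                    ('GBM meta', stack_complex)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:>12s}: {scores.mean():.4f} "
          f"+/- {scores.std():.4f}")
```

On runs like this, the logistic regression meta-learner typically matches the gradient boosting one -- three meta-features simply don't give the bigger model anything extra to exploit, while its overfitting risk stays.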
Stacking best practices
Let me distill the patterns that actually matter in practice. I've seen quite a few people get stacking wrong, and the mistakes are usually the same:
Use probability outputs, not class predictions. When generating meta-features, use predict_proba instead of predict. Probabilities carry far more information -- a model that predicts 0.51 positive is far less confident than one that predicts 0.99 positive, but predict treats both as "positive" and throws away that nuance. The meta-model needs that confidence signal to do its job well.
# Demonstrate: probabilities vs class predictions as meta-features
from sklearn.ensemble import StackingClassifier

# Stacking with probability meta-features
stack_proba = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method='predict_proba'  # what the default 'auto' resolves to here
)

# Stacking with class prediction meta-features
stack_class = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
    stack_method='predict'  # class labels only
)

scores_proba = cross_val_score(stack_proba, X, y, cv=5)
scores_class = cross_val_score(stack_class, X, y, cv=5)
print(f"Meta-features from probabilities: "
      f"{scores_proba.mean():.4f}")
print(f"Meta-features from class labels:  "
      f"{scores_class.mean():.4f}")
print("  --> Probabilities almost always win "
      "because they preserve confidence info")
Include original features in the meta-model. Sometimes passing both the base model predictions AND the original features to the meta-model helps. This lets the meta-model learn context-dependent patterns like "trust the SVM when feature X is high, but trust the random forest when feature X is low." Scikit-learn supports this via passthrough=True:
# With passthrough: meta-model sees base predictions + original features
stack_passthrough = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    passthrough=True  # include original X in meta-features
)

scores_pass = cross_val_score(stack_passthrough, X, y, cv=5)
print(f"Without passthrough: {scores_proba.mean():.4f}")
print(f"With passthrough:    {scores_pass.mean():.4f}")
# For binary problems, StackingClassifier keeps one probability
# column per base model, so the meta-feature counts are:
print(f"Meta-feature count without: "
      f"{len(stack_proba.estimators)}")
print(f"Meta-feature count with: "
      f"{len(stack_passthrough.estimators) + X.shape[1]}")
Watch for leakage. The out-of-fold prediction scheme prevents the most obvious form of leakage (the meta-model seeing predictions on samples the base model was trained on). But if your base models share preprocessing that was fitted on the full dataset -- like scaling or PCA from episodes #14 and #24 -- you've already leaked information. Each fold's preprocessing must be fitted only on that fold's training portion. Using scikit-learn Pipeline objects as base estimators handles this correctly:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# CORRECT: scaler is fitted per fold inside the pipeline
safe_estimator = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(probability=True, random_state=42))
])

# WRONG: fitting scaler on all data then using in stacking
# scaler = StandardScaler().fit(X)   # <-- leakage!
# X_scaled = scaler.transform(X)     # <-- all models see all data
# DON'T do this ^

stack_safe = StackingClassifier(
    estimators=[
        ('lr', Pipeline([
            ('scaler', StandardScaler()),
            ('clf', LogisticRegression(max_iter=1000))
        ])),
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('svm', safe_estimator),
    ],
    final_estimator=LogisticRegression(),
    cv=5
)
scores_safe = cross_val_score(stack_safe, X, y, cv=5)
print(f"Stacking with leak-free pipelines: "
      f"{scores_safe.mean():.4f}")
Don't stack too deep. You can stack stacks -- train meta-models on meta-model outputs, creating a layer cake of models. In theory, each layer can correct the previous layer's remaining errors. In practice, one level of stacking captures most of the benefit. Two levels occasionally help on very large datasets. Three or more almost never justify the complexity and overfitting risk. The deeper you stack, the more opportunities for the meta-models to memorize quirks of the training data rather than learning genuine patterns.
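For illustration only, here's what a second level looks like in scikit-learn -- nesting one StackingClassifier as the final_estimator of another. This is a sketch on synthetic data; deep_stack and level2 are names I made up for this example, and in practice you would rarely go this deep:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (StackingClassifier,
                              RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=10, n_redundant=5,
    random_state=42
)

# Level 2: a small stack that combines the level-1 meta-features
level2 = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=7)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=3
)

# Level 1: base models whose OOF predictions feed level 2
deep_stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(
            n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(
            n_estimators=100, random_state=42)),
    ],
    final_estimator=level2,  # a stack as the meta-model
    cv=3
)

scores = cross_val_score(deep_stack, X, y, cv=3)
print(f"Two-level stack: {scores.mean():.4f} "
      f"+/- {scores.std():.4f}")
```

Note how quickly the training cost explodes: every level multiplies the cross-validation work of the level below it, usually for no measurable gain on a dataset this size.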
A complete stacking example: classification with real structure
Let's put everything together in a complete, realistic example. We'll create a dataset where different models genuinely struggle with different regions, making stacking worthwhile:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (VotingClassifier,
                              StackingClassifier,
                              RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Challenging dataset: many features, some informative, some noise
X_hard, y_hard = make_classification(
    n_samples=2000, n_features=30,
    n_informative=15, n_redundant=5,
    n_clusters_per_class=3,  # multi-modal classes
    flip_y=0.05,             # 5% label noise
    random_state=42
)

# Compare: individual models vs voting vs stacking
print("=== Individual Models ===")
individuals = [
    ('LogReg', LogisticRegression(max_iter=1000)),
    ('RF', RandomForestClassifier(
        n_estimators=200, random_state=42)),
    ('GBM', GradientBoostingClassifier(
        n_estimators=200, random_state=42)),
    ('SVM', SVC(probability=True, random_state=42)),
    ('KNN', KNeighborsClassifier(n_neighbors=7)),
    ('MLP', MLPClassifier(
        hidden_layer_sizes=(64, 32),
        max_iter=500, random_state=42)),
]
for name, clf in individuals:
    scores = cross_val_score(clf, X_hard, y_hard, cv=5)
    print(f"  {name:>7s}: {scores.mean():.4f} "
          f"+/- {scores.std():.4f}")

# Voting ensemble
print("\n=== Voting ===")
voting = VotingClassifier(
    estimators=individuals, voting='soft')
scores = cross_val_score(voting, X_hard, y_hard, cv=5)
print(f"  Soft vote: {scores.mean():.4f} "
      f"+/- {scores.std():.4f}")

# Stacking ensemble
print("\n=== Stacking ===")
stacker = StackingClassifier(
    estimators=individuals,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)
scores = cross_val_score(stacker, X_hard, y_hard, cv=5)
print(f"  Stacking: {scores.mean():.4f} "
      f"+/- {scores.std():.4f}")
On a well-constructed challenging dataset like this one, you should see the stacking ensemble at the top -- not by a huge margin (we're talking fractions of a percent on this synthetic data), but consistently. On real-world competition datasets with more complex structure, the improvement from stacking can be substantially larger. The Netflix Prize-winning solution (2009) was a stacked ensemble. Most Kaggle competition winners use stacking. The pattern is consistent: when you need every last bit of accuracy, stacking delivers.
Mixture of experts: conditional ensembles
Everything we've built so far treats every input the same way -- all base models predict on every input, and their predictions are combined uniformly. But what if different inputs need different experts? What if urban apartments follow different patterns than rural farmhouses, and you could route each input to the specialist that handles it best?
This is the mixture of experts idea. Instead of combining all models for every input, you train a gating network (or routing mechanism) that looks at each input and decides which expert model should handle it. The urban expert specializes in factors like floor level, building age, and proximity to transit. The rural expert specializes in land area, crop potential, and water access.
from sklearn.cluster import KMeans

# Create a dataset with natural subgroups
# (mixing linear and non-linear regions)
np.random.seed(42)
n = 1500

# Group 1: linearly separable (LR is best)
X1 = np.random.randn(n // 3, 20)
y1 = (X1[:, 0] + X1[:, 1] > 0).astype(int)
# Group 2: non-linear boundary (trees are best)
X2 = np.random.randn(n // 3, 20)
y2 = ((X2[:, 0] ** 2 + X2[:, 1] ** 2) > 1.5).astype(int)
# Group 3: complex interactions (KNN might win)
X3 = np.random.randn(n // 3, 20) * 2
y3 = ((np.sin(X3[:, 0]) + X3[:, 1]) > 0).astype(int)

# Shift the groups so they occupy different regions
X2 += 3
X3 -= 3
X_moe = np.vstack([X1, X2, X3])
y_moe = np.concatenate([y1, y2, y3])

# Shuffle
idx = np.random.permutation(len(X_moe))
X_moe, y_moe = X_moe[idx], y_moe[idx]

# Split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moe, y_moe, test_size=0.3, random_state=42)

# Simple MoE: cluster data, train a specialist per cluster
n_experts = 3
router = KMeans(n_clusters=n_experts,
                random_state=42, n_init=10)
router.fit(X_tr)

experts = {}
for cid in range(n_experts):
    mask = router.labels_ == cid
    n_samples = mask.sum()
    expert = GradientBoostingClassifier(
        n_estimators=50, random_state=42)
    expert.fit(X_tr[mask], y_tr[mask])
    train_acc = expert.score(X_tr[mask], y_tr[mask])
    experts[cid] = expert
    print(f"Expert {cid}: {n_samples} training samples, "
          f"train acc = {train_acc:.4f}")

# Route test samples to their cluster's expert
assignments = router.predict(X_te)
moe_preds = np.zeros(len(X_te), dtype=int)
for i, cid in enumerate(assignments):
    moe_preds[i] = experts[cid].predict(X_te[i:i+1])[0]
moe_acc = (moe_preds == y_te).mean()

# Compare with a single global model
global_gb = GradientBoostingClassifier(
    n_estimators=150, random_state=42)
global_gb.fit(X_tr, y_tr)
global_acc = global_gb.score(X_te, y_te)

print(f"\nMixture of Experts accuracy: {moe_acc:.4f}")
print(f"Single global model accuracy: {global_acc:.4f}")
In its simplest form (what we built above), this is just training separate models for different data segments -- a strategy you've probably already done informally by splitting data by category before modeling. The modern deep learning version of MoE uses learned routing with differentiable gating functions and is the architecture behind several cutting-edge large language models. We'll revisit MoE when we cover transformer architectures later in the series -- the concept stays the same, but the routing becomes much more sophisticated.
For classical ML, the practical takeaway: if your data has natural segments with fundamentally different underlying patterns, training separate specialized models often outperforms a single general model. The trick is knowing when your data has those natural segments -- and K-Means clustering (episode #22) gives you a simple way to discover them automatically.
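As a half-step toward learned routing, here's a sketch of soft routing -- my own illustrative variant, not part of the example above: instead of committing each sample to a single expert, weight every expert's probability by the sample's proximity to each cluster center (a softmax over negative distances plays the role of a simple gating function):

```python
import numpy as np
from scipy.special import softmax
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative dataset (simpler than the grouped one above)
X, y = make_classification(
    n_samples=1500, n_features=20,
    n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42)

n_experts = 3
router = KMeans(n_clusters=n_experts,
                random_state=42, n_init=10)
router.fit(X_tr)

# One expert per cluster, trained only on its segment
experts = []
for cid in range(n_experts):
    mask = router.labels_ == cid
    expert = GradientBoostingClassifier(
        n_estimators=50, random_state=42)
    expert.fit(X_tr[mask], y_tr[mask])
    experts.append(expert)

# Soft gate: closer cluster centers get exponentially more weight
distances = router.transform(X_te)    # shape (n_samples, n_experts)
gates = softmax(-distances, axis=1)   # each row sums to 1

# Weighted mixture of expert probabilities
probs = np.zeros(len(X_te))
for cid, expert in enumerate(experts):
    probs += gates[:, cid] * expert.predict_proba(X_te)[:, 1]

soft_preds = (probs > 0.5).astype(int)
print(f"Soft-routed MoE accuracy: {(soft_preds == y_te).mean():.4f}")
```

The gate here is fixed (distance-based) rather than learned, but the structure -- per-sample weights over experts, summing to one -- is exactly what the neural MoE versions train end to end.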
Comparing all ensemble strategies head to head
Let's run every strategy on the same dataset so you can see exactly how they compare:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import (VotingClassifier,
                              StackingClassifier,
                              BaggingClassifier,
                              AdaBoostClassifier)

# Same kind of challenging dataset
X_cmp, y_cmp = make_classification(
    n_samples=2000, n_features=25,
    n_informative=12, n_redundant=5,
    n_clusters_per_class=2,
    flip_y=0.03, random_state=42
)

base = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(
        n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
]

strategies = {
    'Best single (GBM)': GradientBoostingClassifier(
        n_estimators=200, random_state=42),
    'Bagging (ep #18)': BaggingClassifier(
        n_estimators=50, random_state=42),
    'AdaBoost (ep #19)': AdaBoostClassifier(
        n_estimators=100, random_state=42),
    'Hard Voting': VotingClassifier(
        estimators=base, voting='hard'),
    'Soft Voting': VotingClassifier(
        estimators=base, voting='soft'),
    'Stacking': StackingClassifier(
        estimators=base,
        final_estimator=LogisticRegression(),
        cv=5),
}

print("Ensemble strategy comparison (5-fold CV):\n")
print(f"  {'Strategy':>22s} {'Accuracy':>12s}")
print("-" * 38)
for name, model in strategies.items():
    scores = cross_val_score(model, X_cmp, y_cmp, cv=5)
    print(f"  {name:>22s}   {scores.mean():.4f} "
          f"+/- {scores.std():.4f}")
When ensembles are worth it (and when they're absolutely not)
This is the part where I have to be honest with you -- and it connects to a theme I've been hammering on since episode #13: understanding when NOT to use a technique is as important as knowing the technique itself.
Worth it:
- Competitions: the difference between 5th and 1st place is often a well-constructed ensemble. Nearly every top Kaggle solution uses stacking or blending. If you're optimizing for the leaderboard, ensemble everything.
- High-stakes production: when a 0.5% accuracy improvement translates to real business value -- fraud detection, medical diagnosis, ad ranking. At scale, tiny accuracy gains compound into millions of dollars.
- When you already have diverse models: if your pipeline already trains multiple model types for comparison (which it should, per episode #16), ensembling them costs almost nothing extra.
Not worth it:
- Prototyping: the added complexity slows iteration. Use a single good model until you actually need that last fraction of performance.
- When latency matters: ensembles multiply inference time by the number of base models. A real-time API serving 5 models takes roughly 5x the compute. For a 50ms SLA, that's brutal.
- When models aren't diverse: ensembling five XGBoost variants with slightly different hyperparameters gives minimal improvement. The models make correlated errors. Diversity is the prerequisite.
- Small datasets: stacking with cross-validation on 200 samples creates unstable meta-features because each fold has very few samples. Simpler approaches (single well-regularized model) are more robust.
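The "models aren't diverse" point above is easy to check before you invest in an ensemble. One quick diagnostic (a sketch of my own, not a standard API) is to correlate the per-sample error indicators of your candidate models on held-out data -- values near 1.0 mean the models fail on the same samples and ensembling will buy you little.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative dataset and model set
X, y = make_classification(n_samples=1500, n_features=20,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    'lr': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(n_estimators=100, random_state=42),
    'gb': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# 1 where the model is wrong on a test sample, 0 where it is right
errors = {name: (m.fit(X_tr, y_tr).predict(X_te) != y_te).astype(int)
          for name, m in models.items()}

# Pairwise correlation of the error indicators
corr = np.corrcoef(np.vstack(list(errors.values())))
names = list(models)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {corr[i, j]:.2f}")
```

If every off-diagonal value sits above roughly 0.8, the models are making the same mistakes and a single well-tuned model is probably the better investment.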
The honest truth about production ML: most production ML systems use a single well-tuned model (usually gradient boosting for tabular data, a neural network for unstructured data). Ensembles are a competition technique and a last-resort accuracy booster. The engineering cost of maintaining, monitoring, and serving multiple models in production is substantial -- you need to version, test, and deploy each base model independently, monitor for drift in each, and handle the increased latency and compute cost. Know how to build ensembles, absolutely. But also know when a single well-tuned gradient boosting model with good feature engineering (episode #15) gets you 95% of the way there with 20% of the complexity.
So, what have we learned?
We've gone from simple majority voting to sophisticated meta-learning, and connected it all back to the foundational concepts we've been building since the beginning of this series. Here's the full picture:
- Voting combines model predictions by majority (hard) or averaged probability (soft). Soft voting almost always wins because it preserves confidence information. The only requirement: your base models must support probability outputs;
- Model diversity matters more than individual accuracy. Combine fundamentally different algorithm types -- linear models, tree-based models, SVMs, distance-based models. Five variants of the same algorithm give you almost nothing;
- Blending trains a meta-model on a holdout set to learn optimal model weights. Simple to implement but wastes training data. Good for quick experiments and prototyping;
- Stacking uses cross-validated out-of-fold predictions as meta-features. Uses all data, no waste. This is the standard approach in competition settings and the one you should reach for when accuracy matters;
- The meta-model should be simple (logistic regression or ridge). Its job is to weight and correct the base models, not to learn new patterns. Complex meta-models overfit on the small number of meta-features;
- Use probability outputs as meta-features, not class predictions. Probabilities carry confidence information that class labels discard;
- Mixture of experts routes inputs to specialized models instead of combining all models for every input. Powerful when your data has natural subgroups with different underlying patterns. The modern deep learning version (MoE) powers several state-of-the-art language models;
- Ensembles shine in competitions and high-stakes production, but add engineering complexity that's often NOT worth it for most real-world applications. A single well-tuned model with good feature engineering is usually the right call.
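Two of the bullet points above -- a simple meta-model, and probabilities as meta-features -- map directly onto scikit-learn's `StackingClassifier` parameters. A minimal sketch (dataset and base models chosen for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20,
                           random_state=42)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100,
                                      random_state=42)),
        ('svm', SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # keep the meta-model simple
    stack_method='predict_proba',  # probabilities as meta-features
    cv=5,                          # out-of-fold predictions, no leakage
)
stack.fit(X, y)
print(f"training accuracy: {stack.score(X, y):.3f}")
```

With `stack_method='auto'` (the default) scikit-learn already prefers `predict_proba` when the base models support it; setting it explicitly just makes the intent visible.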
This episode wraps up the classical supervised ML toolkit. Over 33 episodes, we've gone from "what is ML?" to building and combining sophisticated models -- linear models, trees, SVMs, clustering, dimensionality reduction, time series, NLP, Bayesian methods, and now advanced ensembles. That's a LOT of ground covered, and every concept connects to the others in ways that keep revealing themselves as you gain more experience.
The path forward from here leads into territory that connects everything we've learned so far -- how real ML systems go from notebook experiments to production deployments, how to think about ethics and bias in the models we build, and how to architect complete ML pipelines that tie all these individual techniques together. The tools from this classical ML foundation aren't going away when we eventually enter the deep learning world -- they're the vocabulary you need to understand WHY the modern approaches work and WHEN the simpler methods are still the better choice.