Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next

What will I learn
- ARIMA -- the classical time series workhorse that has been the gold standard for decades;
- SARIMA -- extending ARIMA with seasonal components for periodic data;
- feature engineering for ML-based time series forecasting -- turning temporal data into tabular features;
- walk-forward validation done right -- the only honest way to evaluate time series models;
- when classical methods beat ML (and it happens more often than you'd think);
- multi-step forecasting strategies -- recursive vs direct approaches;
- practical forecasting wisdom from real-world experience.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next (this post)
Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
Last episode we built the foundation for working with temporal data: autocorrelation, stationarity, decomposition, differencing, and the golden rules of time series. We learned WHY time series data is special, HOW to diagnose its structure with ACF and PACF plots, and WHAT transformations (differencing, exponential smoothing) prepare it for modeling. If you haven't read episode #28, go back and read it first -- everything today builds directly on those concepts.
Today we put that foundation to work and actually forecast -- predicting future values from historical patterns. We'll cover two families of approaches: classical statistical models (ARIMA and its seasonal extension SARIMA) that have been the gold standard for decades, and ML-based approaches that reframe time series as a supervised learning problem using the tools from episodes #10-20. The results might genuinely surprise you: on many real-world forecasting tasks, ARIMA and exponential smoothing outperform XGBoost, random forests, and even neural networks. Understanding why is as important as knowing how.
Here we go!
ARIMA: the statistical workhorse
ARIMA stands for AutoRegressive Integrated Moving Average. It's three models in a trenchcoat, and each component handles a different aspect of temporal structure. Let's break it down:
AR (AutoRegressive, order p): the current value depends linearly on the previous p values. An AR(2) model says: y(t) = c + phi_1 * y(t-1) + phi_2 * y(t-2) + e(t). This is literally linear regression (episode #10!) where the features are past values of the series itself. If you squint at it, it's the same normal equation we derived from scratch -- just with lag values as X columns instead of external features.
I (Integrated, order d): how many times to difference the series to make it stationary. Remember the Augmented Dickey-Fuller test from episode #28? That tells you whether differencing is needed. d=1 means first differencing (subtract the previous value); d=2 means differencing twice. In practice, d is almost always 0 or 1.
MA (Moving Average, order q): the current value depends linearly on the previous q forecast errors. MA(1) means: y(t) = c + e(t) + theta_1 * e(t-1). This captures short-term shocks -- if the model underestimated yesterday, the MA component corrects for that pattern in today's prediction.
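To make the MA component concrete, here's a small simulation sketch (parameters c and theta are arbitrary illustration values, not from the series we use below). For an MA(1) process, theory says the lag-1 autocorrelation is theta / (1 + theta^2) and the autocorrelation at lag 2 and beyond is exactly zero -- a shock only lingers for one step:

```python
import numpy as np

np.random.seed(0)

# Simulate an MA(1) process: y(t) = c + e(t) + theta * e(t-1)
n = 20000
c, theta = 5.0, 0.6
e = np.random.randn(n + 1)
y = c + e[1:] + theta * e[:-1]

def autocorr(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return (x[:-lag] * x[lag:]).mean() / x.var()

print(f"lag-1: {autocorr(y, 1):.3f} "
      f"(theory: {theta / (1 + theta**2):.3f})")
print(f"lag-2: {autocorr(y, 2):.3f} (theory: 0)")
```

The sharp cutoff after lag 1 is exactly the ACF signature used below to pick the MA order q.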
Together, ARIMA(p, d, q) is a surprisingly flexible framework. The AR component captures momentum (past values predicting current values), the I component handles trend (by differencing it away), and the MA component captures error correction (learning from recent mistakes). Let's build AR from scratch first to see the connection to linear regression:
import numpy as np
import pandas as pd
np.random.seed(42)
# Same synthetic series from episode #28: trend + seasonality + noise
n = 365
trend = np.linspace(100, 150, n)
seasonality = 10 * np.sin(np.arange(n) * 2 * np.pi / 30)
noise = np.random.randn(n) * 3
ts = trend + seasonality + noise
dates = pd.date_range('2024-01-01', periods=n, freq='D')
series = pd.Series(ts, index=dates)
# Manual AR(1) model from scratch -- literally linear regression
# on the previous value
def ar1_forecast(series, train_size):
    train = series[:train_size].values
    # Fit: y(t) = c + phi * y(t-1)
    # This is the normal equation from episode #10!
    y = train[1:]
    X = train[:-1].reshape(-1, 1)
    X_with_const = np.column_stack([np.ones(len(X)), X])
    w = np.linalg.lstsq(X_with_const, y, rcond=None)[0]
    c, phi = w
    # Forecast one step at a time
    forecasts = []
    last_val = train[-1]
    for i in range(len(series) - train_size):
        pred = c + phi * last_val
        forecasts.append(pred)
        last_val = series.values[train_size + i]  # use actual value
    return np.array(forecasts), c, phi
train_size = 300
forecasts, c, phi = ar1_forecast(series, train_size)
actual = series.values[train_size:]
rmse = np.sqrt(np.mean((forecasts - actual) ** 2))
print(f"AR(1) from scratch:")
print(f" c = {c:.2f}, phi = {phi:.4f}")
print(f" RMSE = {rmse:.2f}")
print(f" phi close to 1.0 means strong persistence")
print(f" (today will be very similar to yesterday)")
The AR(1) coefficient phi tells you the strength of the day-to-day dependency. Values close to 1.0 mean the series has strong persistence -- today's value is nearly the same as yesterday's plus some small adjustment. Values close to 0 mean consecutive values are nearly independent (and you might as well predict the mean). For our trended+seasonal series, phi will be close to 1.0 because both the trend and the seasonal component change slowly from day to day.
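A quick standalone sketch of the two extremes (synthetic data, the helper fit_phi is just the same least-squares fit wrapped in a function): a random walk has phi near 1.0, white noise has phi near 0.0.

```python
import numpy as np

np.random.seed(1)
n = 5000

def fit_phi(x):
    """Least-squares AR(1) fit: y(t) = c + phi * y(t-1)."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    c, phi = np.linalg.lstsq(X, x[1:], rcond=None)[0]
    return phi

random_walk = np.cumsum(np.random.randn(n))   # strong persistence
white_noise = np.random.randn(n)              # no persistence

print(f"random walk: phi = {fit_phi(random_walk):.3f}")  # near 1.0
print(f"white noise: phi = {fit_phi(white_noise):.3f}")  # near 0.0
```

This one number is a surprisingly useful first diagnostic before fitting anything more complex.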
Extending to AR(p): multiple lags
AR(1) only looks one step back. But our PACF analysis from episode #28 might have shown significant partial autocorrelation at lags 2 or 3, meaning those lags carry additional information beyond what lag 1 provides. Extending to AR(p) is straightforward -- just add more lag columns:
def ar_p_forecast(series, p, train_size):
    """AR(p) model: use the previous p values as features."""
    train = series[:train_size].values
    # Build feature matrix: column i holds lag i+1
    X = np.column_stack([
        train[p-i-1:train_size-i-1] for i in range(p)
    ])
    y = train[p:]
    X_with_const = np.column_stack([np.ones(len(X)), X])
    w = np.linalg.lstsq(X_with_const, y, rcond=None)[0]
    # Forecast
    forecasts = []
    history = list(train[-p:])
    for i in range(len(series) - train_size):
        # Newest lag first, matching the column order used in fitting
        features = np.array([1.0] + history[-p:][::-1])
        pred = features @ w
        forecasts.append(pred)
        history.append(series.values[train_size + i])
    return np.array(forecasts), w
print(f"\nAR(p) comparison:")
print(f" {'p':>3s} {'RMSE':>8s} {'Parameters':>12s}")
print("-" * 28)
for p in [1, 2, 3, 5, 7, 14]:
    preds, weights = ar_p_forecast(series, p, train_size)
    rmse_p = np.sqrt(np.mean((preds - actual) ** 2))
    print(f" {p:>3d} {rmse_p:>8.2f} {len(weights):>12d}")
Notice something important: more lags doesn't always mean better RMSE. Beyond a certain point, the extra lag coefficients are fitting noise in the training data rather than genuine temporal patterns. This is the same overfitting dynamic we've seen throughout the series (remember the bias-variance tradeoff from episode #13?) -- more model complexity isn't free. In ARIMA, the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) provide a principled way to choose: they penalize model complexity, so a model with 14 parameters needs to fit substantially better than one with 3 parameters to be preferred.
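For the Gaussian least-squares case, AIC can be computed by hand as n * ln(SSE / n) + 2k, where k is the number of fitted parameters. Here's a minimal sketch on a synthetic AR(2) series (coefficients 0.6 and 0.2 are arbitrary illustration values); the minimum should typically land at or near p=2, since higher lags only buy noise-fitting:

```python
import numpy as np

np.random.seed(2)
# An AR(2) series: only the first two lags carry real signal
n = 500
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t-1] + 0.2 * y[t-2] + np.random.randn()

def aic_for_ar(y, p):
    """Gaussian AIC = m * ln(SSE / m) + 2k for an AR(p) least-squares
    fit, where k counts the intercept plus p lag coefficients."""
    X = np.column_stack(
        [np.ones(len(y) - p)] + [y[p-i-1:len(y)-i-1] for i in range(p)]
    )
    target = y[p:]
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    sse = np.sum((target - X @ w) ** 2)
    m = len(target)
    return m * np.log(sse / m) + 2 * (p + 1)

for p in [1, 2, 4, 8, 14]:
    print(f"AR({p:>2d}): AIC = {aic_for_ar(y, p):.1f}")
```

The penalty term 2k is what makes a 14-parameter model pay rent for every extra coefficient.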
Choosing p, d, q -- the traditional approach
The classical recipe for selecting ARIMA orders uses the ACF and PACF plots we built in episode #28:
- d (differencing order): difference until the series passes the ADF test (p-value < 0.05). Usually d=0 or d=1. For our trended series, d=1 removes the trend.
- p (AR order): look at the PACF of the (differenced) series. The lag where it cuts off to near-zero suggests p. Sharp cutoff after lag 2? Use p=2.
- q (MA order): look at the ACF of the (differenced) series. The lag where it cuts off suggests q.
In practice, many people (myself included, I'll be honest) skip the manual ACF/PACF analysis and use automated order selection. The pmdarima package provides auto_arima, which fits dozens of candidate (p, d, q) combinations and returns the one with the best AIC; for production work, that is far more robust than eyeballing plots. The same idea is easy to hand-roll with statsmodels:
# Grid search over ARIMA orders using AIC
from statsmodels.tsa.stattools import adfuller
# First: determine d by differencing until stationary
train_data = series[:train_size].values
d = 0
temp = train_data.copy()
for attempt in range(3):
    result = adfuller(temp, autolag='AIC')
    if result[1] < 0.05:
        print(f" Stationary after d={d} differencing "
              f"(ADF p-value: {result[1]:.4f})")
        break
    temp = np.diff(temp)
    d += 1
# Now grid search over p and q
from statsmodels.tsa.arima.model import ARIMA
import warnings
best_aic = np.inf
best_order = None
print(f"\nARIMA order selection (d={d}):")
print(f" {'Order':>12s} {'AIC':>10s} {'BIC':>10s}")
print("-" * 36)
for p in range(1, 5):
    for q in range(0, 3):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                model = ARIMA(train_data, order=(p, d, q))
                fit = model.fit()
            marker = ""
            if fit.aic < best_aic:
                best_aic = fit.aic
                best_order = (p, d, q)
                marker = " <-- best"
            print(f" ({p},{d},{q}){'':<7s} "
                  f"{fit.aic:>10.1f} {fit.bic:>10.1f}{marker}")
        except Exception:
            pass
print(f"\nBest order: ARIMA{best_order}")
SARIMA: adding seasonality
Our series has a 30-day seasonal cycle, but basic ARIMA doesn't explicitly model periodicity. SARIMA (Seasonal ARIMA) extends ARIMA with seasonal AR, differencing, and MA components at the seasonal period. The notation is SARIMA(p,d,q)(P,D,Q,s) where the uppercase parameters are the seasonal versions and s is the seasonal period.
For our data with a 30-day cycle: s=30, and the seasonal components capture the repeated pattern at that interval. A seasonal AR(1) at period 30 means today's value depends on the value from 30 days ago (same point in the cycle). Seasonal differencing (D=1) subtracts the value from one full period ago -- exactly the seasonal differencing we built from scratch in episode #28.
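Seasonal differencing is worth seeing in isolation. A quick numpy check (synthetic data, period and amplitude chosen for illustration): subtracting the value from one full period ago cancels a fixed cycle exactly, leaving only (roughly twice the) noise.

```python
import numpy as np

np.random.seed(3)
n, s = 360, 30
season = 10 * np.sin(np.arange(n) * 2 * np.pi / s)
y = season + np.random.randn(n) * 0.5

# Seasonal differencing at period s: y'(t) = y(t) - y(t-s).
# A deterministic cycle of period s cancels out exactly.
y_sdiff = y[s:] - y[:-s]
print(f"std before: {y.std():.2f}")        # dominated by the cycle
print(f"std after:  {y_sdiff.std():.2f}")  # roughly noise-level
```

This is exactly what D=1 does inside SARIMA before the seasonal AR and MA terms are fitted.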
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fit SARIMA with the best non-seasonal order
# and seasonal order (1,1,0,30)
print(f"Fitting SARIMA{best_order}(1,1,0,30)...")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sarima = SARIMAX(
        train_data,
        order=best_order,
        seasonal_order=(1, 1, 0, 30),
        enforce_stationarity=False,
        enforce_invertibility=False
    )
    sarima_fit = sarima.fit(disp=False)
# Forecast the test period
sarima_forecast = sarima_fit.forecast(steps=len(actual))
sarima_rmse = np.sqrt(np.mean((sarima_forecast - actual) ** 2))
# Compare with AR(1) from scratch
print(f"\nModel comparison:")
print(f" AR(1) from scratch: RMSE = {rmse:.2f}")
print(f" SARIMA{best_order}(1,1,0,30): RMSE = {sarima_rmse:.2f}")
improvement = (rmse - sarima_rmse) / rmse * 100
print(f" Improvement: {improvement:.1f}%")
SARIMA should outperform our simple AR(1) significantly on this data because it explicitly models both the trend (through differencing) and the seasonal cycle (through the seasonal component). The seasonal AR term captures "today's value relative to 30 days ago," which is exactly the pattern our synthetic data has.
ML-based forecasting: time series as supervised learning
The alternative to statistical models: convert the time series into a tabular dataset where each row uses past values as features and the future value as the target. Then apply any supervised learning algorithm -- random forests (episode #18), gradient boosting (episode #19), or anything else from our supervised learning toolkit.
This is the bridge between time series and everything we learned in episodes #10-20. The key insight is that feature engineering makes or breaks this approach:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
def create_features(series, n_lags=14):
    """Convert time series to supervised learning format.
    This is where feature engineering (episode #15) meets
    time series."""
    values = series.values
    X, y, idx = [], [], []
    for t in range(n_lags, len(values)):
        # Lag features (the raw material)
        features = list(values[t-n_lags:t])
        # Calendar features (domain knowledge!)
        dt = series.index[t]
        features.extend([dt.dayofweek, dt.month, dt.day])
        # Rolling statistics (smoothed summaries)
        window = values[max(0, t-30):t]
        features.extend([
            window.mean(),               # 30-day average
            window.std(),                # 30-day volatility
            window[-7:].mean(),          # 7-day average
            values[t-1] - values[t-2],   # recent momentum
        ])
        # Ratio features (relative position)
        if window.mean() > 0:
            features.append(values[t-1] / window.mean())
        else:
            features.append(1.0)
        X.append(features)
        y.append(values[t])
        idx.append(series.index[t])
    return np.array(X), np.array(y), idx
X, y, idx = create_features(series)
split = train_size - 14  # adjust for lag window
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
gbm = GradientBoostingRegressor(
    n_estimators=200, max_depth=4,
    learning_rate=0.1, random_state=42
)
gbm.fit(X_train, y_train)
preds_gbm = gbm.predict(X_test)
rmse_gbm = np.sqrt(mean_squared_error(y_test, preds_gbm))
print(f"GBM with engineered features: RMSE = {rmse_gbm:.2f}")
# Feature importance -- which features matter most?
n_lag_features = 14
n_calendar = 3
n_rolling = 5  # the four rolling stats plus the ratio feature
feature_groups = {
    'Lag values': slice(0, n_lag_features),
    'Calendar': slice(n_lag_features, n_lag_features + n_calendar),
    'Rolling stats': slice(n_lag_features + n_calendar,
                           n_lag_features + n_calendar + n_rolling),
}
importances = gbm.feature_importances_
print(f"\nFeature group importances:")
for name, slc in feature_groups.items():
    group_imp = importances[slc].sum()
    print(f" {name:>15s}: {group_imp:.3f} "
          f"({group_imp / importances.sum() * 100:.1f}%)")
The feature engineering here is critical -- and directly connects to episode #15 where we covered feature engineering in general. Raw lag values alone give the ML model essentially the same information as an AR model. But adding rolling statistics (mean, std of recent windows), calendar features (day of week, month), and derived features (momentum, price-to-average ratios) gives it access to patterns that ARIMA can't capture: nonlinear relationships, interaction effects between features, and temporal context that the linear model structure simply cannot represent.
Walk-forward validation: the only honest evaluation
We covered this principle in episode #28 (and previewed it in episode #21 with the crypto regime predictor), but now let's implement a proper expanding-window walk-forward evaluation that compares multiple models head-to-head:
from sklearn.model_selection import TimeSeriesSplit
def walk_forward_compare(X, y, series_values, n_splits=5):
    """Compare GBM vs a naive baseline using proper
    walk-forward validation."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = {'GBM': [], 'Naive': []}
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        # GBM
        gbm_fold = GradientBoostingRegressor(
            n_estimators=200, max_depth=4, random_state=42
        )
        gbm_fold.fit(X[train_idx], y[train_idx])
        preds = gbm_fold.predict(X[test_idx])
        rmse_fold = np.sqrt(mean_squared_error(y[test_idx], preds))
        results['GBM'].append(rmse_fold)
        # Naive baseline: predict tomorrow = today
        naive_preds = np.roll(y[test_idx], 1)
        naive_preds[0] = y[train_idx[-1]]
        rmse_naive = np.sqrt(
            mean_squared_error(y[test_idx], naive_preds)
        )
        results['Naive'].append(rmse_naive)
        print(f" Fold {fold+1}: GBM={rmse_fold:.2f}, "
              f"Naive={rmse_naive:.2f}")
    print(f"\nSummary:")
    for name, scores in results.items():
        mean = np.mean(scores)
        std = np.std(scores)
        print(f" {name:>6s}: {mean:.2f} +/- {std:.2f}")
print("Walk-forward validation (5 folds):")
walk_forward_compare(X, y, series.values)
Every fold trains on the past and tests on the future. The training set grows with each fold (expanding window). This is the ONLY correct way to evaluate time series models -- and remember, random k-fold cross-validation is invalid here because it lets the model peek at future data points during training. We established this rule in episode #28 and it's worth repeating: if your time series model achieves amazing results with random cross-validation, it's almost certainly cheating ;-)
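If you want to see what TimeSeriesSplit actually produces, a tiny standalone sketch makes the expanding window visible -- each fold's training block grows, and every test block lies strictly after its training block:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# What TimeSeriesSplit produces: expanding train windows,
# each test block strictly after its train block.
X_demo = np.arange(100).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4)
for fold, (tr, te) in enumerate(tscv.split(X_demo)):
    print(f"fold {fold+1}: train [{tr[0]}..{tr[-1]}], "
          f"test [{te[0]}..{te[-1]}]")
    # The invariant that makes this honest: no test index
    # ever precedes a train index.
    assert tr[-1] < te[0]
```

Compare that with random k-fold, where every fold's training set contains indices from the future -- the leak is structural, not accidental.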
When classical methods beat ML
Here's the uncomfortable truth that many ML practitioners don't want to hear: on many standard forecasting benchmarks (the M3, M4, and M5 competitions -- the "ImageNet of forecasting"), simple statistical methods regularly outperform complex ML models. The M4 competition (2018) tested 100,000 real time series across multiple domains, and statistical methods dominated the leaderboard. Why?
Temporal inductive bias. ARIMA and exponential smoothing are designed for time series. They encode assumptions about autocorrelation, trend, and seasonality directly into their mathematical structure. The model "knows" that temporal dependencies matter before seeing a single data point. ML models must learn these patterns entirely from data, which requires more samples and more careful feature engineering to achieve comparable performance.
Sample efficiency. Most individual time series are short -- hundreds or at most thousands of observations. Statistical models have very few parameters (3-6 for a typical ARIMA) and extract maximum information from limited data. A gradient boosting model with 200 trees of depth 4 has thousands of effective parameters. On a 365-day series, that's a recipe for overfitting (episode #13 all over again).
Stationarity assumptions help. ARIMA's requirement to difference to stationarity before modeling isn't a limitation -- it's a regularization. By removing trend through differencing, the model focuses on predictable short-term patterns rather than fitting long-term drift that might not continue. It's like building in a prior belief that the future won't just be an extrapolation of the past trend. (Having said that, if your trend IS going to continue -- population growth, Moore's Law-style exponential trends -- this same assumption can hurt you.)
# Demonstration: ARIMA vs GBM on a short series
from statsmodels.tsa.arima.model import ARIMA
# Simulate a short series (100 points) -- very common in practice
np.random.seed(123)
n_short = 100
short_trend = np.linspace(50, 70, n_short)
short_season = 5 * np.sin(np.arange(n_short) * 2 * np.pi / 7)
short_noise = np.random.randn(n_short) * 2
short_ts = short_trend + short_season + short_noise
short_series = pd.Series(
    short_ts,
    index=pd.date_range('2024-01-01', periods=n_short, freq='D')
)
train_short = 70
test_short = short_ts[train_short:]
# ARIMA
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    arima_short = ARIMA(short_ts[:train_short], order=(2, 1, 1))
    arima_fit_short = arima_short.fit()
arima_pred_short = arima_fit_short.forecast(steps=len(test_short))
arima_rmse_short = np.sqrt(
    np.mean((arima_pred_short - test_short) ** 2)
)
# GBM (needs feature engineering on the short series)
X_short, y_short, _ = create_features(short_series, n_lags=7)
split_short = train_short - 7
gbm_short = GradientBoostingRegressor(
    n_estimators=100, max_depth=3, random_state=42
)
gbm_short.fit(X_short[:split_short], y_short[:split_short])
gbm_pred_short = gbm_short.predict(X_short[split_short:])
gbm_rmse_short = np.sqrt(
    mean_squared_error(y_short[split_short:], gbm_pred_short)
)
print(f"Short series (n=100, train=70):")
print(f" ARIMA(2,1,1): RMSE = {arima_rmse_short:.2f}")
print(f" GBM: RMSE = {gbm_rmse_short:.2f}")
winner = "ARIMA" if arima_rmse_short < gbm_rmse_short else "GBM"
print(f" Winner: {winner}")
When ML wins: you have very long series (thousands or millions of observations), many related series to learn from simultaneously (cross-series patterns), rich external features (weather data, promotions, events, economic indicators), or strongly nonlinear dynamics that linear models fundamentally cannot capture. In these cases, gradient boosting and neural networks shine because they can model complex interactions that ARIMA's linear structure simply cannot represent. The key is having enough data to justify the model's complexity.
Multi-step forecasting: the horizon problem
So far we've been forecasting one step ahead. But real applications often need predictions 7, 30, or even 90 days into the future. Multi-step forecasting is dramatically harder than single-step because errors compound with each step.
There are two fundamental strategies:
Recursive forecasting: predict one step ahead, feed that prediction back as input, predict the next step, repeat. Elegant and uses a single model, but errors accumulate -- an overestimate at step 1 inflates step 2, which inflates step 3, and by step 30 you're forecasting fantasy land.
Direct forecasting: train a separate model for each forecast horizon. Model 1 predicts 1 step ahead, model 7 predicts 7 steps ahead, model 30 predicts 30 steps ahead. More robust (no error propagation) but requires training and maintaining multiple models.
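Recursive forecasting has a revealing closed form for a known AR(1): feeding each prediction back in gives y_hat(t+h) = mu + phi^h * (y_t - mu), with mu = c / (1 - phi), so the forecasts decay geometrically toward the unconditional mean. A minimal sketch (coefficients are arbitrary illustration values):

```python
import numpy as np

# Recursive forecasting with a known AR(1): each prediction is fed
# back in as the next input. Closed form:
#   y_hat(t+h) = mu + phi**h * (y_t - mu),  with mu = c / (1 - phi)
c, phi = 2.0, 0.8
mu = c / (1 - phi)   # unconditional mean = 10.0
y_t = 20.0

preds = []
last = y_t
for h in range(1, 31):
    last = c + phi * last     # recursive one-step rule
    preds.append(last)
    closed_form = mu + phi**h * (y_t - mu)
    assert abs(last - closed_form) < 1e-9  # both agree exactly

print(f"h=1:  {preds[0]:.2f}")   # 18.00
print(f"h=5:  {preds[4]:.2f}")   # 13.28
print(f"h=30: {preds[-1]:.2f} (mean is {mu:.1f})")
```

By h=30 the forecast has essentially converged to the mean -- the model's way of saying "beyond my memory, I don't know."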
# Compare recursive vs direct multi-step forecasting
def recursive_forecast(model, X_last, y_history, n_steps, n_lags):
    """Predict n_steps ahead by feeding predictions back."""
    predictions = []
    current_features = X_last.copy()
    history = list(y_history[-30:])
    for step in range(n_steps):
        pred = model.predict(current_features.reshape(1, -1))[0]
        predictions.append(pred)
        history.append(pred)
        # Rebuild features with the predicted value
        lag_feats = history[-n_lags:]
        window = history[-30:]
        new_features = list(lag_feats)
        new_features.extend([0, 0, 0])  # placeholder calendar
        new_features.extend([
            np.mean(window),
            np.std(window),
            np.mean(window[-7:]),
            history[-1] - history[-2],
        ])
        mean_w = np.mean(window)
        if mean_w > 0:
            new_features.append(history[-1] / mean_w)
        else:
            new_features.append(1.0)
        current_features = np.array(new_features)
    return np.array(predictions)
# Direct: train separate models for each horizon
horizons = [1, 3, 7, 14, 30]
print("Multi-step forecasting comparison:")
print(f" {'Horizon':>8s} {'Direct RMSE':>12s}")
print("-" * 24)
for h in horizons:
    # Direct features: predict y(t+h) from the features at time t
    X_direct = X[:-h]
    y_direct = y[h:]
    split_d = max(1, split - h)
    if split_d >= len(X_direct):
        continue
    gbm_direct = GradientBoostingRegressor(
        n_estimators=200, max_depth=4, random_state=42
    )
    gbm_direct.fit(X_direct[:split_d], y_direct[:split_d])
    preds_d = gbm_direct.predict(X_direct[split_d:])
    rmse_d = np.sqrt(
        mean_squared_error(y_direct[split_d:], preds_d)
    )
    print(f" {h:>5d}d {rmse_d:>12.2f}")
Watch the RMSE grow with the forecast horizon -- this is the fundamental challenge of multi-step forecasting. Predicting 1 day ahead might have an RMSE of 3. Predicting 30 days ahead might have an RMSE of 15 or more. Each step into the future adds uncertainty. The further out you go, the closer your forecast gets to just predicting the mean -- which is another way of saying "I don't know."
This is why honest forecasting systems always report prediction intervals, not just point forecasts. "Sales will be 1,000 units next month" is much less useful than "sales will be between 800 and 1,200 units with 80% confidence." The width of that interval grows with the forecast horizon, and it tells you exactly how much trust to place in the prediction.
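A stylized numpy sketch of why intervals widen (this assumes the series is a random walk, the worst case; real series with mean-reverting structure widen more slowly): the h-step forecast error is a sum of h independent shocks, so its std grows like sigma * sqrt(h), and an 80% interval is roughly point +/- 1.28 * sigma * sqrt(h).

```python
import numpy as np

# For a random walk y(t) = y(t-1) + e(t), e ~ N(0, sigma^2), the
# h-step-ahead forecast error is a sum of h independent shocks.
np.random.seed(4)
sigma = 3.0
n_paths, H = 20000, 30
shocks = np.random.randn(n_paths, H) * sigma
paths = shocks.cumsum(axis=1)   # forecast errors at each horizon

for h in [1, 7, 30]:
    empirical = paths[:, h-1].std()
    theory = sigma * np.sqrt(h)
    print(f"h={h:>2d}: error std = {empirical:.2f} "
          f"(theory {theory:.2f}), 80% interval ~ +/-{1.28*theory:.1f}")
```

The sqrt(h) growth is the quantitative version of "each step into the future adds uncertainty."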
Practical forecasting wisdom
After working with time series data across quite a few domains, here are the lessons that keep showing up:
Always start with a naive baseline. Before building any model, compute the RMSE of two trivially simple forecasts: "tomorrow = today" (persistence) and "tomorrow = same day last year" (seasonal persistence). If your fancy model can't beat these baselines, it's not capturing any useful signal. I've seen people spend weeks tuning XGBoost hyperparameters only to discover that a seasonal naive forecast was just as good. The baseline keeps you honest.
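Both baselines take a few lines. A standalone sketch (the helper naive_baselines and its toy weekly series are illustration only, not from the main example): note how the seasonal-naive forecast crushes plain persistence on data with a strong cycle.

```python
import numpy as np

def rmse(a, b):
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def naive_baselines(train, test, season=7):
    """Persistence and seasonal-persistence baselines.
    Both predict one step ahead using actual history."""
    full = np.concatenate([train, test])
    n_train = len(train)
    persist = full[n_train-1:-1]            # tomorrow = today
    seasonal = full[n_train-season:-season]  # tomorrow = last cycle
    return rmse(test, persist), rmse(test, seasonal)

# Toy daily series with a weekly cycle
np.random.seed(5)
t = np.arange(200)
y = 50 + 8 * np.sin(t * 2 * np.pi / 7) + np.random.randn(200)
r_persist, r_seasonal = naive_baselines(y[:150], y[150:], season=7)
print(f"persistence RMSE:    {r_persist:.2f}")
print(f"seasonal-naive RMSE: {r_seasonal:.2f}")
```

Any model you build for this series has to beat the seasonal-naive number, not the persistence one -- pick the strongest trivial baseline as your bar.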
Feature engineering matters more than model selection. The difference between a gradient boosting model with raw lag features and the same model with rolling statistics, calendar features, and domain-specific indicators is usually much larger than the difference between gradient boosting and random forests. Spend 80% of your time on features, 20% on model selection. This is the exact same lesson from episode #15, now applied to temporal data.
Don't forecast what can't be forecasted. Some series are fundamentally unpredictable beyond very short horizons -- minute-by-minute stock returns being the canonical example. The autocorrelation structure tells you how much predictable signal exists. If the ACF drops to near-zero after lag 1, you're dealing with a near-random walk, and no model (no matter how complex) will produce meaningful forecasts beyond one step ahead. Accepting this saves enormous time and prevents deploying forecasting systems that give a false sense of precision.
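The predictability check is cheap to run. A sketch with the canonical example (synthetic white-noise "returns" standing in for real market returns): the price level looks highly autocorrelated, but the returns -- the thing you actually need to forecast to profit -- have essentially no memory.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 -- a quick predictability check."""
    x = x - x.mean()
    return (x[:-1] * x[1:]).mean() / x.var()

np.random.seed(6)
n = 2000
returns = np.random.randn(n)    # stand-in for daily returns: no memory
levels = np.cumsum(returns)     # the price level: high memory

print(f"returns: lag-1 ACF = {lag1_autocorr(returns):+.3f}")  # ~ 0
print(f"levels:  lag-1 ACF = {lag1_autocorr(levels):+.3f}")   # ~ 1
```

This is also why "my model predicts prices with 99% accuracy" demos are usually just rediscovering persistence.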
Domain knowledge beats algorithms. Knowing that "sales spike before Christmas" or "website traffic drops on weekends" or "electricity demand peaks at 18:00" lets you build features that capture real patterns. No algorithm will discover these automatically without enough data covering enough seasonal cycles. One domain expert's calendar of important dates is worth more than a thousand estimators in your gradient boosting model.
The big comparison: all methods on one dataset
Let's bring everything together and compare our approaches on the original 365-day series:
# Final comparison: all methods
print("=" * 50)
print("FINAL COMPARISON -- 365 day series")
print("=" * 50)
print(f" Train: first {train_size} days")
print(f" Test: last {len(actual)} days")
print()
# 1. Naive baseline (persistence)
naive_preds = np.roll(actual, 1)
naive_preds[0] = series.values[train_size - 1]
rmse_naive = np.sqrt(np.mean((naive_preds - actual) ** 2))
# 2. AR(1) from scratch -- RMSE already computed above as `rmse`
# 3. Exponential smoothing (from episode #28)
def exp_smooth_forecast(train_data, test_len, alpha=0.3):
    smoothed = np.zeros(len(train_data))
    smoothed[0] = train_data[0]
    for t in range(1, len(train_data)):
        smoothed[t] = alpha * train_data[t] + (1 - alpha) * smoothed[t-1]
    # Simple forecast: repeat last smoothed value
    return np.full(test_len, smoothed[-1])
es_preds = exp_smooth_forecast(series.values[:train_size], len(actual))
rmse_es = np.sqrt(np.mean((es_preds - actual) ** 2))
# Collect all results
all_results = {
    'Naive (persistence)': rmse_naive,
    'Exp. Smoothing': rmse_es,
    'AR(1) from scratch': rmse,
    'GBM + features': rmse_gbm,
}
# Add SARIMA if it ran successfully
try:
    all_results['SARIMA'] = sarima_rmse
except NameError:
    pass
print(f" {'Method':>22s} {'RMSE':>8s} {'vs Naive':>10s}")
print("-" * 44)
for name, r in sorted(all_results.items(), key=lambda x: x[1]):
    improvement = (rmse_naive - r) / rmse_naive * 100
    marker = " <-- best" if r == min(all_results.values()) else ""
    print(f" {name:>22s} {r:>8.2f} "
          f"{improvement:>+9.1f}%{marker}")
The results tell a story we've seen throughout this series: model complexity should be justified by performance improvement. If SARIMA only marginally beats a simple exponential smoother on your specific data, the added complexity (more parameters, harder to debug, slower to fit) might not be worth it. Always compare against the simplest baselines first, and only step up in complexity when the simpler models genuinely fail to capture important patterns.
So, what have we learned?
Today we went from understanding time series structure (episode #28) to actually producing forecasts. Here's the full picture:
- ARIMA(p,d,q) combines autoregression (past values), differencing (stationarity), and moving average (error correction) into a single flexible statistical framework. It's linear regression on lag features plus error terms -- directly connected to what we built in episodes #10-11;
- SARIMA extends ARIMA with seasonal AR, differencing, and MA components at a fixed period. For data with clear seasonality (and most real-world data HAS seasonality of some kind), SARIMA captures both the trend and the cycle;
- Choose model orders using the ACF/PACF diagnostics from episode #28, or automate with grid search over AIC. Lower AIC means better balance between fit and complexity;
- ML-based forecasting reframes time series as tabular data with lag features, rolling statistics, and calendar features -- then applies any supervised model from our toolkit (episodes #17-19). Feature engineering (episode #15) is the bottleneck, not model selection;
- Walk-forward validation is mandatory -- always train on the past, test on the future, use expanding windows. Random k-fold is invalid for time series because it lets the model peek at future data;
- Classical methods often beat ML on short, single-series forecasting tasks due to strong temporal inductive biases and better sample efficiency. ML wins with long series, rich external features, and nonlinear dynamics;
- Multi-step forecasting is hard because errors compound. Use direct models (separate model per horizon) for robustness, or recursive models (feed predictions back) for elegance. Always report prediction intervals, not just point forecasts;
- Start with naive baselines, invest in feature engineering over model tuning, and accept that some series are fundamentally unpredictable beyond short horizons.
This wraps up our two-part deep dive into time series. The foundation from episode #28 (autocorrelation, stationarity, decomposition) combined with today's forecasting techniques (ARIMA, SARIMA, ML-based approaches) gives you a complete toolkit for temporal data. We've now covered both the supervised learning world (episodes #10-21) and the unsupervised learning world (episodes #22-27), and with time series (episodes #28-29) we've seen how temporal structure adds a whole new dimension to both.
The data we've been working with throughout this entire series -- numbers, categories, measurements, ratings, time stamps -- has one thing in common: it's structured and quantitative. But some of the most interesting data in the world isn't numbers at all. It's text. Sentences, paragraphs, documents, conversations -- unstructured, messy, and full of meaning that no feature vector can trivially capture. Working with text requires an entirely different set of tools, and those tools connect back to everything we've learned in ways that might surprise you.