Learn AI Series (#103) - Multi-Armed Bandits
Learn AI Series (#103) - Multi-Armed Bandits

What will I learn
- You will learn the multi-armed bandit problem: the simplest reinforcement learning formulation that still captures the exploration-exploitation tradeoff;
- epsilon-greedy exploration: the most common baseline strategy and why fixed epsilon wastes pulls;
- Upper Confidence Bound (UCB): optimism in the face of uncertainty with provably logarithmic regret;
- Thompson Sampling: Bayesian exploration with probability matching that often beats UCB in practice;
- contextual bandits and LinUCB: when user features determine which arm is best;
- how traditional A/B testing is secretly a bandit problem, and why bandit approaches reduce wasted exposure.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning?
- Learn AI Series (#103) - Multi-Armed Bandits (this post)
Learn AI Series (#103) - Multi-Armed Bandits
Solutions to Episode #102 Exercises
Exercise 1: Windy grid world -- create a WindyGridWorld class (7x10, start=(3,0), goal=(3,7)) where columns 3-8 have upward wind of 1 cell (columns 6-7 push by 2). Train Q-learning for 500 episodes.
import numpy as np
class WindyGridWorld:
def __init__(self):
self.rows, self.cols = 7, 10
self.start = (3, 0)
self.goal = (3, 7)
self.wind = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
self.state = self.start
def reset(self):
self.state = self.start
return self.state
def step(self, action):
r, c = self.state
if action == 0: r -= 1
elif action == 1: c += 1
elif action == 2: r += 1
elif action == 3: c -= 1
r -= self.wind[min(c, self.cols - 1)]
r = max(0, min(self.rows - 1, r))
c = max(0, min(self.cols - 1, c))
self.state = (r, c)
if self.state == self.goal:
return self.state, 0, True
return self.state, -1, False
env = WindyGridWorld()
Q = {}
eps, alpha, gamma = 0.1, 0.5, 1.0
rng = np.random.RandomState(42)
for ep in range(500):
s = env.reset()
for _ in range(200):
if s not in Q:
Q[s] = np.zeros(4)
if rng.random() < eps:
a = rng.randint(4)
else:
a = int(np.argmax(Q[s]))
ns, r, done = env.step(a)
if ns not in Q:
Q[ns] = np.zeros(4)
Q[s][a] += alpha * (
r + gamma * np.max(Q[ns]) - Q[s][a])
s = ns
if done:
break
arrows = ["^", ">", "v", "<"]
print("Learned policy:")
for row in range(7):
line = ""
for col in range(10):
if (row, col) == env.goal:
line += " G "
elif (row, col) in Q:
line += f" {arrows[int(np.argmax(Q[(row, col)]))]} "
else:
line += " . "
print(line)
print(f"States visited: {len(Q)}")
The learned policy compensates for the wind by biasing downward in windy columns. The agent discovers it needs to "aim low" in the middle section because the wind will push it upward anyway. Average steps-to-goal in the last 50 episodes is typically 15-20 (compared to 6 for the no-wind optimal path), because the agent must navigate around the strong wind in columns 6-7.
Exercise 2: Epsilon decay comparison -- train 4 agents with different epsilon schedules on the 4x4 grid world for 1000 episodes.
import numpy as np
class SimpleGridWorld:
def __init__(self, size=4):
self.size = size
self.state = (0, 0)
self.goal = (size - 1, size - 1)
def reset(self):
self.state = (0, 0)
return self.state
def step(self, action):
r, c = self.state
if action == 0: r = max(0, r - 1)
elif action == 1: c = min(self.size - 1, c + 1)
elif action == 2: r = min(self.size - 1, r + 1)
elif action == 3: c = max(0, c - 1)
self.state = (r, c)
if self.state == self.goal:
return self.state, 1.0, True
return self.state, -0.01, False
def train_agent(env, schedule, n_episodes=1000):
Q = {}
rng = np.random.RandomState(42)
rewards_hist = []
explore_count = 0
for ep in range(n_episodes):
eps = schedule(ep)
s = env.reset()
total = 0
for _ in range(100):
if s not in Q:
Q[s] = np.zeros(4)
if rng.random() < eps:
a = rng.randint(4)
explore_count += 1
else:
a = int(np.argmax(Q[s]))
ns, r, done = env.step(a)
if ns not in Q:
Q[ns] = np.zeros(4)
Q[s][a] += 0.1 * (
r + 0.99 * np.max(Q[ns]) - Q[s][a])
total += r
s = ns
if done:
break
rewards_hist.append(total)
# Greedy test
test_rewards = []
for _ in range(100):
s = env.reset()
total = 0
for _ in range(100):
if s not in Q:
break
a = int(np.argmax(Q[s]))
ns, r, done = env.step(a)
total += r
s = ns
if done:
break
test_rewards.append(total)
return rewards_hist, explore_count, np.mean(test_rewards)
schedules = {
"Fixed 0.1": lambda ep: 0.1,
"Linear decay": lambda ep: max(0.01, 1.0 - ep / 1000),
"Exponential": lambda ep: 0.99 ** ep,
"Fixed 0.5": lambda ep: 0.5,
}
for name, sched in schedules.items():
env = SimpleGridWorld()
hist, explores, test_score = train_agent(
env, sched)
blocks = [np.mean(hist[i:i+100])
for i in range(0, 1000, 100)]
print(f"{name:20s} | test={test_score:.3f} "
f"| explores={explores} "
f"| blocks={[f'{b:.2f}' for b in blocks[:3]]}...")
The linear and exponential decay agents explore heavily early (discovering the state space) then converge to greedy exploitation, producing the best final test scores. Fixed 0.1 works reasonably but keeps wasting 10% of actions on random exploration even after it knows the optimal path. Fixed 0.5 explores too much throughout and never fully commits.
Exercise 3: Stochastic environment comparison -- train on deterministic, p=0.8, and p=0.5 grid worlds for 2000 episodes.
import numpy as np
class StochasticGridWorld:
def __init__(self, size=4, p_success=1.0):
self.size = size
self.p_success = p_success
self.state = (0, 0)
self.goal = (size - 1, size - 1)
self.rng = np.random.RandomState(42)
def reset(self):
self.state = (0, 0)
return self.state
def step(self, action):
if self.rng.random() > self.p_success:
others = [a for a in range(4)
if a != action]
action = self.rng.choice(others)
r, c = self.state
if action == 0: r = max(0, r - 1)
elif action == 1:
c = min(self.size - 1, c + 1)
elif action == 2:
r = min(self.size - 1, r + 1)
elif action == 3: c = max(0, c - 1)
self.state = (r, c)
if self.state == self.goal:
return self.state, 1.0, True
return self.state, -0.01, False
for p in [1.0, 0.8, 0.5]:
env = StochasticGridWorld(p_success=p)
Q = {}
rng = np.random.RandomState(0)
rewards = []
for ep in range(2000):
s = env.reset()
total = 0
for _ in range(100):
if s not in Q:
Q[s] = np.zeros(4)
if rng.random() < 0.1:
a = rng.randint(4)
else:
a = int(np.argmax(Q[s]))
ns, r, done = env.step(a)
if ns not in Q:
Q[ns] = np.zeros(4)
Q[s][a] += 0.1 * (
r + 0.99 * np.max(Q[ns])
- Q[s][a])
total += r
s = ns
if done:
break
rewards.append(total)
blocks = [np.mean(rewards[i:i+200])
for i in range(0, 2000, 200)]
spreads = []
for s in Q:
spreads.append(np.max(Q[s]) - np.min(Q[s]))
avg_spread = np.mean(spreads)
print(f"p={p}: blocks={[f'{b:.2f}' for b in blocks]}")
print(f" Q-spread={avg_spread:.3f}")
Deterministic (p=1.0) converges fastest with the largest Q-value spread because the agent's actions directly control outcomes -- so the difference between a good action and a bad one is clear. At p=0.8, convergence is slower and Q-spreads compress because even the "correct" action sometimes fails. At p=0.5, the environment is nearly random and Q-values barely differentiate between actions, because the agent controls very little of its trajectory.
On to today's episode
Here we go! Episode 103, and we're staying squarely in reinforcement learning territory. Last time (episode #102) we set up the whole RL framework -- agent, environment, state, action, reward, policies, value functions, the complete picture. We built a Q-learning agent that taught itself to navigate a grid world purely from trial and error. Beautiful stuff.
But here's the thing -- that grid world had states and transitions and all sorts of sequential structure. What if we strip ALL of that away? What if there are no states, no transitions, no delayed rewards? Just: pick an option, get immediate feedback, repeat. What's the simplest possible RL problem?
That's the multi-armed bandit. And despite being dead simple, it captures the core challange of reinforcement learning -- the exploration-exploitation tradeoff -- in its purest, most distilled form. It's also surprisingly practical. Every time you see a website showing you "version A" or "version B" of a page, every time an ad platform picks which ad to show you, every time a clinical trial decides which treatment the next patient should receive -- there's a bandit algorithm behind it (or there should be).
The problem: a row of slot machines
Picture yourself in a casino facing ten slot machines, each with a different (and unknown) payout probability. You have 10,000 pulls. How do you maximize your total winnings?
The name "multi-armed bandit" comes from the old slang "one-armed bandit" for a slot machine -- one arm, pulls your money. Now imagine K of them. K arms, K unknown reward distributions, and the fundamental question: which one should I pull next?
import numpy as np
class BanditEnvironment:
"""K-armed bandit testbed. Each arm has
a fixed (unknown) mean reward drawn from
a standard normal distribution."""
def __init__(self, k=10, seed=42):
self.rng = np.random.RandomState(seed)
self.q_true = self.rng.normal(
0, 1, size=k)
self.k = k
self.best_arm = int(
np.argmax(self.q_true))
def pull(self, arm):
"""Pull arm, receive noisy reward.
Reward = true mean + Gaussian noise."""
return (self.q_true[arm]
+ np.random.normal(0, 1))
The agent doesn't know q_true. It only sees the noisy reward from each pull. Each arm's reward is drawn from a Gaussian centered on some unknown mean -- so even pulling the same arm twice gives different results. The challenge: figure out which arm is best while simultaneously trying to maximize total reward. If you explore too much, you waste pulls on bad arms. If you exploit too early, you might miss a better arm that was simply unlucky in its first few pulls ;-)
Measuring performance with regret
We measure how well a bandit algorithm does using regret -- the difference between what you earned and what you could have earned if you'd always pulled the best arm:
def compute_regret(env, actions, rewards):
"""Cumulative regret: total cost of
not always picking the best arm."""
optimal = env.q_true[env.best_arm]
per_step = optimal - rewards
return np.cumsum(per_step)
A perfect oracle agent has zero regret (it always pulls the best arm). A completely random agent has regret that grows linearly -- it never learns, so every pull wastes the same expected amount. A good bandit algorithm has logarithmic regret -- it makes fewer and fewer mistakes over time as it identifies the best arm. The difference between linear and logarithmic is HUGE when you have thousands or millions of pulls, which is exactly the regime where bandits shine in production.
Epsilon-greedy: the simplest strategy
You already saw this in episode #102 for Q-learning. With probability epsilon, pick a random arm (explore). With probability 1 - epsilon, pick the arm with the highest estimated value (exploit). Same concept, but now without states:
class EpsilonGreedy:
"""Epsilon-greedy bandit agent with
incremental mean updates."""
def __init__(self, k, epsilon=0.1):
self.k = k
self.epsilon = epsilon
self.q_estimates = np.zeros(k)
self.n_pulls = np.zeros(k)
def choose_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.k)
return int(np.argmax(self.q_estimates))
def update(self, arm, reward):
"""Incremental mean: O(1) memory,
exact running average."""
self.n_pulls[arm] += 1
n = self.n_pulls[arm]
self.q_estimates[arm] += (
(reward - self.q_estimates[arm]) / n)
The incremental mean update is a standard trick from statistics: in stead of storing all rewards and recomputing the mean every time, you keep a running average that adjusts with each new observation. new_estimate = old_estimate + (1/n)(reward - old_estimate). This uses O(1) memory per arm regardless of how many thousands of pulls you've done.
The problem with fixed epsilon? Early on, you know nothing about any arm, so random exploration is genuinely useful -- you need data. But after 5,000 pulls, you've almost certainly identified the best arm, and that 10% random exploration is just throwing away pulls on arms you already know are bad. Enter decaying epsilon:
class DecayingEpsilonGreedy:
"""Epsilon starts high (explore a lot)
and shrinks over time (exploit more)."""
def __init__(self, k, initial_eps=1.0,
decay=0.995):
self.k = k
self.epsilon = initial_eps
self.decay = decay
self.q_estimates = np.zeros(k)
self.n_pulls = np.zeros(k)
def choose_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.k)
return int(np.argmax(self.q_estimates))
def update(self, arm, reward):
self.n_pulls[arm] += 1
n = self.n_pulls[arm]
self.q_estimates[arm] += (
(reward - self.q_estimates[arm]) / n)
self.epsilon *= self.decay
Start at epsilon=1.0 (pure exploration), multiply by 0.995 after each step. After 1000 steps, epsilon is about 0.007 -- almost pure exploitation. This schedule works well in practice, though the optimal decay rate depends on how many arms you have and how different their means are.
Upper Confidence Bound: smarter exploration
Here's what bugs me about epsilon-greedy: when it explores, it picks a random arm. It doesn't consider which arms actually need more exploration. If you've pulled arm 3 a thousand times and arm 7 only twice, maybe you should focus your exploration budget on arm 7 -- you know very little about it, and it might be better than it currently looks.
UCB (Upper Confidence Bound) formalizes this intuition. For each arm, compute a score that combines its estimated value with a confidence bonus that's large for rarely-pulled arms and small for well-explored arms:
class UCB:
"""Upper Confidence Bound: optimism
in the face of uncertainty."""
def __init__(self, k, c=2.0):
self.k = k
self.c = c
self.q_estimates = np.zeros(k)
self.n_pulls = np.zeros(k)
self.total_pulls = 0
def choose_arm(self):
self.total_pulls += 1
# Pull each arm once first
for arm in range(self.k):
if self.n_pulls[arm] == 0:
return arm
# UCB score = estimate + bonus
bonus = self.c * np.sqrt(
np.log(self.total_pulls)
/ self.n_pulls)
ucb_scores = self.q_estimates + bonus
return int(np.argmax(ucb_scores))
def update(self, arm, reward):
self.n_pulls[arm] += 1
n = self.n_pulls[arm]
self.q_estimates[arm] += (
(reward - self.q_estimates[arm]) / n)
The confidence bonus c * sqrt(ln(t) / n_i) comes from concentration inequalities in probability theory (specifically Hoeffding's inequality). It represents a statistical upper bound on how much the true value could plausibly exceed our estimate given the number of samples we've taken. The c parameter controls how optimistic we are -- higher c means wider confidence intervals and more exploration.
The principle is called "optimism in the face of uncertainty": if you don't know much about an arm, assume it might be great and try it. If it turns out to be bad, the confidence bonus shrinks (because you've now pulled it more, so n_i grew) and you'll stop selecting it. If it turns out to be genuinely good, you keep pulling it (because the estimated value grew). Either way, you reduce uncertainty. UCB has provably logarithmic regret -- it's theoretically near-optimal for the stationary bandit problem.
Thompson Sampling: the Bayesian approach
If you remember episode #32 on Bayesian methods, you know the core idea: maintain a full probability distribution over unknown quantities, update it as evidence arrives. Thompson Sampling applies this directly to bandits. Instead of keeping a single point estimate for each arm's value, maintain a posterior distribution. To choose an arm, sample from each arm's posterior and pick the arm with the highest sample:
class ThompsonSampling:
"""Bayesian bandit: maintain Beta
distributions for binary rewards,
sample to decide."""
def __init__(self, k):
self.k = k
# Beta(alpha, beta) prior:
# start uniform Beta(1,1)
self.alpha = np.ones(k)
self.beta_param = np.ones(k)
def choose_arm(self):
samples = np.array([
np.random.beta(
self.alpha[i],
self.beta_param[i])
for i in range(self.k)])
return int(np.argmax(samples))
def update(self, arm, reward):
"""Binary reward: success or failure.
Beta posterior is conjugate."""
if reward > 0:
self.alpha[arm] += 1
else:
self.beta_param[arm] += 1
The beauty here -- and I genuinely think this is one of the most elegant ideas in all of machine learning -- is that Thompson Sampling naturally balances exploration and exploitation without any explicit parameter. No epsilon to tune, no c to set. Arms with high uncertainty produce widely varying samples: sometimes the sample is high (triggering exploration), sometimes low (so some other arm wins). As you collect more data, the posterior narrows and the algorithm increasingly commits to the best arm. The exploration schedule emerges from the Bayesian update rule itself.
For continuous (Gaussian) rewards rather than binary success/failure, you use a Normal-Inverse-Gamma posterior in stead of the Beta:
class GaussianThompsonSampling:
"""Thompson Sampling for Gaussian
(continuous) rewards."""
def __init__(self, k):
self.k = k
self.mu = np.zeros(k)
self.lambda_ = np.ones(k)
self.alpha = np.ones(k)
self.beta_ = np.ones(k)
def choose_arm(self):
samples = np.zeros(self.k)
for i in range(self.k):
tau = np.random.gamma(
self.alpha[i],
1.0 / self.beta_[i])
sigma = 1.0 / np.sqrt(
tau * self.lambda_[i])
samples[i] = np.random.normal(
self.mu[i], sigma)
return int(np.argmax(samples))
def update(self, arm, reward):
old_mu = self.mu[arm]
old_lam = self.lambda_[arm]
self.lambda_[arm] += 1
self.mu[arm] = (
(old_lam * old_mu + reward)
/ self.lambda_[arm])
self.alpha[arm] += 0.5
self.beta_[arm] += (
0.5 * old_lam
* (reward - old_mu) ** 2
/ self.lambda_[arm])
The math here is the conjugate update for the Normal-Inverse-Gamma distribution -- the posterior stays in the same family as the prior, so each update is a simple parameter adjustment rather than a full Bayesian integral. The key insight: we're sampling both the mean AND the variance of each arm's reward distribution, which gives us a richer uncertainty model than UCB's single confidence bonus.
Head-to-head comparison
Enough theory. Let's run all three strategies on the same 10-armed bandit problem and see what actually happens:
def run_experiment(env, agent, n_steps=10000):
"""Run one bandit experiment."""
actions = np.zeros(n_steps, dtype=int)
rewards = np.zeros(n_steps)
for t in range(n_steps):
arm = agent.choose_arm()
reward = env.pull(arm)
agent.update(arm, reward)
actions[t] = arm
rewards[t] = reward
return actions, rewards
env = BanditEnvironment(k=10, seed=42)
print(f"True arm values: "
f"{env.q_true.round(2)}")
print(f"Best arm: {env.best_arm} "
f"(value={env.q_true[env.best_arm]:.2f})")
print()
n_steps = 5000
configs = {
"Eps-Greedy (0.1)": EpsilonGreedy(
10, epsilon=0.1),
"UCB (c=2)": UCB(10, c=2.0),
"Thompson": GaussianThompsonSampling(10),
}
for name, agent in configs.items():
np.random.seed(0)
actions, rewards = run_experiment(
env, agent, n_steps)
regret = compute_regret(
env, actions, rewards)
opt_pct = (np.mean(
actions == env.best_arm) * 100)
print(f"{name:22s} | "
f"regret={regret[-1]:7.1f} | "
f"optimal={opt_pct:.1f}%")
You'll typically see Thompson Sampling and UCB outperforming epsilon-greedy by a clear margin. Epsilon-greedy keeps wasting that 10% of its budget on random exploration forever, while UCB and Thompson Sampling progressivly concentrate on the best arm as uncertainty shrinks. Between UCB and Thompson Sampling, empirical results often favor Thompson -- it tends to be less sensitive to its parameters and adapts more gracefully when arm values are close together.
Having said that, the relative performance depends on the specific problem. With well-separated arm means, all algorithms converge quickly and the differences are small. With arms that have very similar means, the smarter exploration strategies of UCB and Thompson really shine because they can distinguish between close alternatives more efficiently.
Contextual bandits: when context matters
So far, the best arm is always the same -- there's one globally optimal choice. But in many real problems, the best choice depends on who's asking. Showing a tech article to a software developer is great; showing the same article to someone who wants cooking recipes is a wasted impression.
Contextual bandits extend the basic framework by adding a context vector (features describing the current situation) that the agent observes before choosing an arm. The reward now depends on both the arm and the context:
class LinUCB:
"""Contextual bandit with linear reward
model and UCB-style exploration."""
def __init__(self, n_arms, d, alpha=1.0):
self.n_arms = n_arms
self.alpha = alpha
# Per-arm ridge regression params
self.A = [np.eye(d)
for _ in range(n_arms)]
self.b = [np.zeros(d)
for _ in range(n_arms)]
def choose_arm(self, context):
scores = np.zeros(self.n_arms)
for arm in range(self.n_arms):
A_inv = np.linalg.inv(
self.A[arm])
theta = A_inv @ self.b[arm]
pred = context @ theta
bonus = self.alpha * np.sqrt(
context @ A_inv @ context)
scores[arm] = pred + bonus
return int(np.argmax(scores))
def update(self, arm, context, reward):
self.A[arm] += np.outer(
context, context)
self.b[arm] += reward * context
# Demo: 3 articles, 4-dim user features
n_arms, d = 3, 4
true_weights = np.array([
[1.0, 0.0, -0.5, 0.2], # article 0
[0.0, 1.0, 0.3, -0.1], # article 1
[-0.3, 0.5, 1.0, 0.0], # article 2
])
agent = LinUCB(n_arms, d, alpha=0.5)
rng = np.random.RandomState(42)
correct = 0
for t in range(2000):
ctx = rng.randn(d)
# True best arm for this context
true_rewards = true_weights @ ctx
best = int(np.argmax(true_rewards))
# Agent's choice
chosen = agent.choose_arm(ctx)
reward = (true_weights[chosen] @ ctx
+ rng.normal(0, 0.1))
agent.update(chosen, ctx, reward)
if chosen == best:
correct += 1
print(f"Accuracy: {correct/2000:.1%}")
print(f"Last-500 accuracy: "
f"{correct/2000:.1%}")
This is LinUCB -- arguably the most important contextual bandit algorithm in industry. It models the expected reward of each arm as a linear function of the context vector: E[reward | arm, context] = theta_arm . context. The theta vectors are learned via ridge regression, and the confidence bonus comes from the regression uncertainty (specifically, the Mahalanobis distance of the context vector with respect to the inverse design matrix).
LinUCB was developed at Yahoo for personalized news recommendation (the famous 2010 paper by Li et al.) and variants of it power recommendation systems at Netflix, Spotify, and basically every ad platform you interact with. The context might include user demographics, browsing history, time of day, device type -- anything that helps predict which option will work best for this specific user in this specific moment.
A/B testing is a bandit problem
Traditional A/B testing works like this: show version A to 50% of users and version B to the other 50%, wait until you have enough data for statistical significance, then pick the winner. The problem? If version A is clearly better after 1,000 users, you're still forcing the remaining 49,000 users to see the inferior version B "for statistical rigor."
That's wasteful. And in domains where each bad experience has real cost -- medical trials, financial decisions, user churn -- it's not just wasteful, it's arguably unethical. Bandit algorithms fix this by shifting traffic toward the better variant as evidence accumulates:
class BanditABTest:
"""A/B testing with Thompson Sampling
instead of fixed 50/50 allocation."""
def __init__(self, n_variants=2):
self.ts = ThompsonSampling(n_variants)
self.shown = np.zeros(
n_variants, dtype=int)
self.conversions = np.zeros(
n_variants, dtype=int)
def choose_variant(self):
return self.ts.choose_arm()
def record(self, variant, converted):
self.shown[variant] += 1
if converted:
self.conversions[variant] += 1
self.ts.update(
variant, 1 if converted else 0)
def report(self):
for v in range(len(self.shown)):
rate = (self.conversions[v]
/ max(1, self.shown[v]))
print(
f" Variant {v}: "
f"{self.shown[v]} shown, "
f"{self.conversions[v]} conv "
f"({rate:.1%})")
# Simulate: variant 0 converts at 5%,
# variant 1 at 8%
rng = np.random.RandomState(42)
test = BanditABTest(2)
for _ in range(10000):
v = test.choose_variant()
true_rate = [0.05, 0.08][v]
converted = rng.random() < true_rate
test.record(v, converted)
print("Bandit A/B test results:")
test.report()
After 10,000 users, Thompson Sampling will have shown variant 1 (the better one) to roughly 70-80% of users instead of 50%. That's 2,000-3,000 fewer users exposed to the inferior option -- real value if each conversion matters.
The tradeoff is real though: traditional A/B tests give you formal statistical significance (p-values, confidence intervals, power analysis) that many organizations require for decision-making, regulatory compliance, or just internal trust. Bandit approaches optimize total outcomes but make statistical claims harder because the allocation is adaptive. In practice, many companies use a hybrid: start with a bandit to quickly identify the likely winner, then switch to a fixed allocation for final statistical validation. Best of both worlds.
Optimistic initialization: a simple trick
One more technique worth knowing. Instead of initializing all Q-estimates to zero, start them high -- say, at 5.0 when the true rewards are typically around 0-1. Now every arm looks amazing at first. The agent picks arm 0, gets a reward of 0.7, and the estimate drops. Next time it picks a different arm (whose estimate is still 5.0) and that drops too. This forces systematic exploration of every arm until the estimates settle to realistic values:
class OptimisticGreedy:
"""Greedy with optimistic initialization.
No explicit exploration needed."""
def __init__(self, k, initial_value=5.0):
self.k = k
self.q_estimates = np.full(
k, initial_value)
self.n_pulls = np.zeros(k)
def choose_arm(self):
return int(
np.argmax(self.q_estimates))
def update(self, arm, reward):
self.n_pulls[arm] += 1
n = self.n_pulls[arm]
self.q_estimates[arm] += (
(reward - self.q_estimates[arm]) / n)
# Quick test
env = BanditEnvironment(k=10, seed=42)
agent = OptimisticGreedy(10, initial_value=5.0)
np.random.seed(0)
actions, rewards = run_experiment(
env, agent, 5000)
regret = compute_regret(env, actions, rewards)
opt_pct = np.mean(
actions == env.best_arm) * 100
print(f"Optimistic greedy: "
f"regret={regret[-1]:.1f}, "
f"optimal={opt_pct:.1f}%")
Optimistic initialization is purely greedy -- no epsilon, no randomness after the initial phase. The initial values drive exploration automatically. It's a neat trick and works surprisingly well for stationary problems where the arm values don't change over time. The limitation: it only explores at the beginning. If the environment changes later (non-stationary bandits), optimistic initialization won't re-explore. For non-stationary problems, you need something like a sliding window average or explicit exploration that never fully stops.
When bandits break: non-stationary rewards
All the algorithms above assume the true arm values are fixed. But what if they change over time? The restaurant that was great last month got a new chef. The ad that converted well in January flops in March. This is the non-stationary bandit problem, and it requires a different approach:
class SlidingWindowBandit:
"""Bandit that only considers recent
observations, adapting to change."""
def __init__(self, k, window=100,
epsilon=0.1):
self.k = k
self.window = window
self.epsilon = epsilon
self.history = [[] for _ in range(k)]
def choose_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.k)
estimates = np.zeros(self.k)
for arm in range(self.k):
recent = self.history[arm][
-self.window:]
if len(recent) == 0:
estimates[arm] = float('inf')
else:
estimates[arm] = np.mean(recent)
return int(np.argmax(estimates))
def update(self, arm, reward):
self.history[arm].append(reward)
# Simulate: arm values shift at t=2500
rng = np.random.RandomState(42)
k = 5
q_phase1 = rng.normal(0, 1, size=k)
q_phase2 = rng.normal(0, 1, size=k)
agent = SlidingWindowBandit(
k, window=200, epsilon=0.1)
regrets = []
for t in range(5000):
q = q_phase1 if t < 2500 else q_phase2
arm = agent.choose_arm()
reward = q[arm] + rng.normal(0, 0.5)
agent.update(arm, reward)
best = np.max(q)
regrets.append(best - q[arm])
print(f"Phase 1 avg regret: "
f"{np.mean(regrets[:2500]):.3f}")
print(f"Phase 2 avg regret: "
f"{np.mean(regrets[2500:]):.3f}")
print(f"Total regret: {np.sum(regrets):.1f}")
The sliding window approach forgets old data, allowing the agent to adapt when arm values change. The window size controls the forgetting speed -- too small and you react to noise, too large and you're slow to adapt. In production systems, you often see exponential weighting (recent observations count more) rather than a hard window cutoff, because it decays more smoothly.
Samengevat
- The multi-armed bandit is the simplest RL problem -- choose between K options with unknown rewards, no states, no transitions -- yet it captures the exploration-exploitation tradeoff in its purest form;
- epsilon-greedy explores randomly with fixed probability -- simple but inefficient because it doesn't target uncertain arms, and a fixed epsilon wastes pulls forever;
- UCB adds a confidence bonus based on how rarely each arm has been pulled, preferring uncertain arms -- provably near-optimal with logarithmic regret;
- Thompson Sampling maintains Bayesian posteriors (Beta for binary, Normal-Inverse-Gamma for continuous) and samples to decide -- naturally balances exploration and exploitation without tunable parameters, often performs best in practice;
- contextual bandits add user features so the best arm varies per situation -- LinUCB uses per-arm linear models with ridge regression uncertainty for principled exploration;
- A/B testing is secretly a bandit problem: bandit algorithms shift traffic to better variants faster than fixed 50/50 splits, reducing exposure to inferior options;
- optimistic initialization drives early exploration by setting initial estimates high -- simple, effective, but only works for stationary problems;
- for non-stationary bandits (where arm values change over time), sliding window or exponentially-weighted estimates let the agent forget old data and re-explore.
Bandits are just the beginning of RL, but they're the foundation. The exploration strategies we built today -- epsilon-greedy, UCB, Thompson Sampling -- will show up again in more complex settings where we add states, transitions, and delayed rewards on top of them. The next episode will formalize some of the mathematical framework we've been using informally, and from there we'll build up to the full sequential decision-making algorithms that power modern RL systems.
Exercises
Exercise 1: Build a multi-armed bandit tournament. Create a class BanditTournament that runs 5 algorithms -- EpsilonGreedy(0.1), EpsilonGreedy(0.01), DecayingEpsilonGreedy, UCB(c=2), and GaussianThompsonSampling -- on 20 different randomly-generated 10-armed bandit environments (seeds 0-19). For each environment, run each algorithm for 10,000 steps. Track per-environment final cumulative regret and percentage of optimal arm pulls. Print: (a) a summary table showing average regret and optimal-pull% across all 20 environments for each algorithm, (b) how many of the 20 environments each algorithm "won" (lowest regret), (c) for the hardest environment (where the gap between best and second-best arm is smallest), show each algorithm's regret curve at steps 100, 1000, 5000, 10000. Thompson Sampling should win the most environments overall.
Exercise 2: Build a contextual bandit news recommender. Create a class NewsRecommender using LinUCB with 5 article categories (tech, sports, politics, cooking, finance) and 6-dimensional user feature vectors (age_bucket, gender, morning/evening, weekday/weekend, mobile/desktop, returning_user). Generate 10,000 synthetic user visits where the true click probability for each article-user combination is a sigmoid of a linear function (so each article has a 6-dim weight vector, initialized randomly with seed=42). Train the LinUCB agent online: observe user features, choose article, observe click/no-click, update. Print: (a) overall CTR (click-through rate) in the first 1000 visits vs the last 1000, (b) per-article selection counts (the agent should learn to show different articles to different user types), (c) comparison with a random baseline and a fixed "always show most popular" baseline. LinUCB should achieve at least 2x the CTR of random by the end.
Exercise 3: Build a non-stationary bandit detector. Create a class ChangeDetector that wraps a sliding-window epsilon-greedy agent and adds change-point detection. Track the per-arm reward mean in a sliding window of size 100. Every 50 steps, compare the current window mean against the previous window mean using a simple threshold (if the absolute difference exceeds 0.5, flag a change). When a change is detected for any arm, temporarily boost epsilon to 0.5 for 50 steps (re-explore), then decay back to 0.1. Test on a 5-armed environment where arm values shift every 1000 steps (resample from Normal(0,1) with different seeds). Run for 5000 steps and print: (a) which arms had detected changes and at what step, (b) the regret curve compared to a fixed-epsilon agent (no detection, no boosting), (c) total regret for both agents. The change-detecting agent should recover faster after each shift, producing lower total regret.