Learn AI Series (#103) - Multi-Armed Bandits

avatar

Learn AI Series (#103) - Multi-Armed Bandits

variant-c-11-teal.png

What will I learn

  • You will learn the multi-armed bandit problem: the simplest reinforcement learning formulation that still captures the exploration-exploitation tradeoff;
  • epsilon-greedy exploration: the most common baseline strategy and why fixed epsilon wastes pulls;
  • Upper Confidence Bound (UCB): optimism in the face of uncertainty with provably logarithmic regret;
  • Thompson Sampling: Bayesian exploration with probability matching that often beats UCB in practice;
  • contextual bandits and LinUCB: when user features determine which arm is best;
  • how traditional A/B testing is secretly a bandit problem, and why bandit approaches reduce wasted exposure.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#103) - Multi-Armed Bandits

Solutions to Episode #102 Exercises

Exercise 1: Windy grid world -- create a WindyGridWorld class (7x10, start=(3,0), goal=(3,7)) where columns 3-8 have upward wind of 1 cell (columns 6-7 push by 2). Train Q-learning for 500 episodes.

import numpy as np


class WindyGridWorld:
    def __init__(self):
        self.rows, self.cols = 7, 10
        self.start = (3, 0)
        self.goal = (3, 7)
        self.wind = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        r, c = self.state
        if action == 0: r -= 1
        elif action == 1: c += 1
        elif action == 2: r += 1
        elif action == 3: c -= 1
        r -= self.wind[min(c, self.cols - 1)]
        r = max(0, min(self.rows - 1, r))
        c = max(0, min(self.cols - 1, c))
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, 0, True
        return self.state, -1, False


env = WindyGridWorld()
Q = {}
eps, alpha, gamma = 0.1, 0.5, 1.0
rng = np.random.RandomState(42)

for ep in range(500):
    s = env.reset()
    for _ in range(200):
        if s not in Q:
            Q[s] = np.zeros(4)
        if rng.random() < eps:
            a = rng.randint(4)
        else:
            a = int(np.argmax(Q[s]))
        ns, r, done = env.step(a)
        if ns not in Q:
            Q[ns] = np.zeros(4)
        Q[s][a] += alpha * (
            r + gamma * np.max(Q[ns]) - Q[s][a])
        s = ns
        if done:
            break

arrows = ["^", ">", "v", "<"]
print("Learned policy:")
for row in range(7):
    line = ""
    for col in range(10):
        if (row, col) == env.goal:
            line += " G "
        elif (row, col) in Q:
            line += f" {arrows[int(np.argmax(Q[(row, col)]))]} "
        else:
            line += " . "
    print(line)
print(f"States visited: {len(Q)}")

The learned policy compensates for the wind by biasing downward in windy columns. The agent discovers it needs to "aim low" in the middle section because the wind will push it upward anyway. Average steps-to-goal in the last 50 episodes is typically 15-20 (compared to 6 for the no-wind optimal path), because the agent must navigate around the strong wind in columns 6-7.

Exercise 2: Epsilon decay comparison -- train 4 agents with different epsilon schedules on the 4x4 grid world for 1000 episodes.

import numpy as np


class SimpleGridWorld:
    def __init__(self, size=4):
        self.size = size
        self.state = (0, 0)
        self.goal = (size - 1, size - 1)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        r, c = self.state
        if action == 0: r = max(0, r - 1)
        elif action == 1: c = min(self.size - 1, c + 1)
        elif action == 2: r = min(self.size - 1, r + 1)
        elif action == 3: c = max(0, c - 1)
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, 1.0, True
        return self.state, -0.01, False


def train_agent(env, schedule, n_episodes=1000):
    Q = {}
    rng = np.random.RandomState(42)
    rewards_hist = []
    explore_count = 0

    for ep in range(n_episodes):
        eps = schedule(ep)
        s = env.reset()
        total = 0
        for _ in range(100):
            if s not in Q:
                Q[s] = np.zeros(4)
            if rng.random() < eps:
                a = rng.randint(4)
                explore_count += 1
            else:
                a = int(np.argmax(Q[s]))
            ns, r, done = env.step(a)
            if ns not in Q:
                Q[ns] = np.zeros(4)
            Q[s][a] += 0.1 * (
                r + 0.99 * np.max(Q[ns]) - Q[s][a])
            total += r
            s = ns
            if done:
                break
        rewards_hist.append(total)

    # Greedy test
    test_rewards = []
    for _ in range(100):
        s = env.reset()
        total = 0
        for _ in range(100):
            if s not in Q:
                break
            a = int(np.argmax(Q[s]))
            ns, r, done = env.step(a)
            total += r
            s = ns
            if done:
                break
        test_rewards.append(total)

    return rewards_hist, explore_count, np.mean(test_rewards)


schedules = {
    "Fixed 0.1": lambda ep: 0.1,
    "Linear decay": lambda ep: max(0.01, 1.0 - ep / 1000),
    "Exponential": lambda ep: 0.99 ** ep,
    "Fixed 0.5": lambda ep: 0.5,
}

for name, sched in schedules.items():
    env = SimpleGridWorld()
    hist, explores, test_score = train_agent(
        env, sched)
    blocks = [np.mean(hist[i:i+100])
              for i in range(0, 1000, 100)]
    print(f"{name:20s} | test={test_score:.3f} "
          f"| explores={explores} "
          f"| blocks={[f'{b:.2f}' for b in blocks[:3]]}...")

The linear and exponential decay agents explore heavily early (discovering the state space) then converge to greedy exploitation, producing the best final test scores. Fixed 0.1 works reasonably but keeps wasting 10% of actions on random exploration even after it knows the optimal path. Fixed 0.5 explores too much throughout and never fully commits.

Exercise 3: Stochastic environment comparison -- train on deterministic, p=0.8, and p=0.5 grid worlds for 2000 episodes.

import numpy as np


class StochasticGridWorld:
    def __init__(self, size=4, p_success=1.0):
        self.size = size
        self.p_success = p_success
        self.state = (0, 0)
        self.goal = (size - 1, size - 1)
        self.rng = np.random.RandomState(42)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        if self.rng.random() > self.p_success:
            others = [a for a in range(4)
                      if a != action]
            action = self.rng.choice(others)
        r, c = self.state
        if action == 0: r = max(0, r - 1)
        elif action == 1:
            c = min(self.size - 1, c + 1)
        elif action == 2:
            r = min(self.size - 1, r + 1)
        elif action == 3: c = max(0, c - 1)
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, 1.0, True
        return self.state, -0.01, False


for p in [1.0, 0.8, 0.5]:
    env = StochasticGridWorld(p_success=p)
    Q = {}
    rng = np.random.RandomState(0)
    rewards = []
    for ep in range(2000):
        s = env.reset()
        total = 0
        for _ in range(100):
            if s not in Q:
                Q[s] = np.zeros(4)
            if rng.random() < 0.1:
                a = rng.randint(4)
            else:
                a = int(np.argmax(Q[s]))
            ns, r, done = env.step(a)
            if ns not in Q:
                Q[ns] = np.zeros(4)
            Q[s][a] += 0.1 * (
                r + 0.99 * np.max(Q[ns])
                - Q[s][a])
            total += r
            s = ns
            if done:
                break
        rewards.append(total)

    blocks = [np.mean(rewards[i:i+200])
              for i in range(0, 2000, 200)]
    spreads = []
    for s in Q:
        spreads.append(np.max(Q[s]) - np.min(Q[s]))
    avg_spread = np.mean(spreads)
    print(f"p={p}: blocks={[f'{b:.2f}' for b in blocks]}")
    print(f"  Q-spread={avg_spread:.3f}")

Deterministic (p=1.0) converges fastest with the largest Q-value spread because the agent's actions directly control outcomes -- so the difference between a good action and a bad one is clear. At p=0.8, convergence is slower and Q-spreads compress because even the "correct" action sometimes fails. At p=0.5, the environment is nearly random and Q-values barely differentiate between actions, because the agent controls very little of its trajectory.

On to today's episode

Here we go! Episode 103, and we're staying squarely in reinforcement learning territory. Last time (episode #102) we set up the whole RL framework -- agent, environment, state, action, reward, policies, value functions, the complete picture. We built a Q-learning agent that taught itself to navigate a grid world purely from trial and error. Beautiful stuff.

But here's the thing -- that grid world had states and transitions and all sorts of sequential structure. What if we strip ALL of that away? What if there are no states, no transitions, no delayed rewards? Just: pick an option, get immediate feedback, repeat. What's the simplest possible RL problem?

That's the multi-armed bandit. And despite being dead simple, it captures the core challange of reinforcement learning -- the exploration-exploitation tradeoff -- in its purest, most distilled form. It's also surprisingly practical. Every time you see a website showing you "version A" or "version B" of a page, every time an ad platform picks which ad to show you, every time a clinical trial decides which treatment the next patient should receive -- there's a bandit algorithm behind it (or there should be).

The problem: a row of slot machines

Picture yourself in a casino facing ten slot machines, each with a different (and unknown) payout probability. You have 10,000 pulls. How do you maximize your total winnings?

The name "multi-armed bandit" comes from the old slang "one-armed bandit" for a slot machine -- one arm, pulls your money. Now imagine K of them. K arms, K unknown reward distributions, and the fundamental question: which one should I pull next?

import numpy as np


class BanditEnvironment:
    """K-armed bandit testbed. Each arm has
    a fixed (unknown) mean reward drawn from
    a standard normal distribution."""

    def __init__(self, k=10, seed=42):
        self.rng = np.random.RandomState(seed)
        self.q_true = self.rng.normal(
            0, 1, size=k)
        self.k = k
        self.best_arm = int(
            np.argmax(self.q_true))

    def pull(self, arm):
        """Pull arm, receive noisy reward.
        Reward = true mean + Gaussian noise."""
        return (self.q_true[arm]
                + np.random.normal(0, 1))

The agent doesn't know q_true. It only sees the noisy reward from each pull. Each arm's reward is drawn from a Gaussian centered on some unknown mean -- so even pulling the same arm twice gives different results. The challenge: figure out which arm is best while simultaneously trying to maximize total reward. If you explore too much, you waste pulls on bad arms. If you exploit too early, you might miss a better arm that was simply unlucky in its first few pulls ;-)

Measuring performance with regret

We measure how well a bandit algorithm does using regret -- the difference between what you earned and what you could have earned if you'd always pulled the best arm:

def compute_regret(env, actions, rewards):
    """Cumulative regret: total cost of
    not always picking the best arm."""
    optimal = env.q_true[env.best_arm]
    per_step = optimal - rewards
    return np.cumsum(per_step)

A perfect oracle agent has zero regret (it always pulls the best arm). A completely random agent has regret that grows linearly -- it never learns, so every pull wastes the same expected amount. A good bandit algorithm has logarithmic regret -- it makes fewer and fewer mistakes over time as it identifies the best arm. The difference between linear and logarithmic is HUGE when you have thousands or millions of pulls, which is exactly the regime where bandits shine in production.

Epsilon-greedy: the simplest strategy

You already saw this in episode #102 for Q-learning. With probability epsilon, pick a random arm (explore). With probability 1 - epsilon, pick the arm with the highest estimated value (exploit). Same concept, but now without states:

class EpsilonGreedy:
    """Epsilon-greedy bandit agent with
    incremental mean updates."""

    def __init__(self, k, epsilon=0.1):
        self.k = k
        self.epsilon = epsilon
        self.q_estimates = np.zeros(k)
        self.n_pulls = np.zeros(k)

    def choose_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)
        return int(np.argmax(self.q_estimates))

    def update(self, arm, reward):
        """Incremental mean: O(1) memory,
        exact running average."""
        self.n_pulls[arm] += 1
        n = self.n_pulls[arm]
        self.q_estimates[arm] += (
            (reward - self.q_estimates[arm]) / n)

The incremental mean update is a standard trick from statistics: in stead of storing all rewards and recomputing the mean every time, you keep a running average that adjusts with each new observation. new_estimate = old_estimate + (1/n)(reward - old_estimate). This uses O(1) memory per arm regardless of how many thousands of pulls you've done.

The problem with fixed epsilon? Early on, you know nothing about any arm, so random exploration is genuinely useful -- you need data. But after 5,000 pulls, you've almost certainly identified the best arm, and that 10% random exploration is just throwing away pulls on arms you already know are bad. Enter decaying epsilon:

class DecayingEpsilonGreedy:
    """Epsilon starts high (explore a lot)
    and shrinks over time (exploit more)."""

    def __init__(self, k, initial_eps=1.0,
                 decay=0.995):
        self.k = k
        self.epsilon = initial_eps
        self.decay = decay
        self.q_estimates = np.zeros(k)
        self.n_pulls = np.zeros(k)

    def choose_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)
        return int(np.argmax(self.q_estimates))

    def update(self, arm, reward):
        self.n_pulls[arm] += 1
        n = self.n_pulls[arm]
        self.q_estimates[arm] += (
            (reward - self.q_estimates[arm]) / n)
        self.epsilon *= self.decay

Start at epsilon=1.0 (pure exploration), multiply by 0.995 after each step. After 1000 steps, epsilon is about 0.007 -- almost pure exploitation. This schedule works well in practice, though the optimal decay rate depends on how many arms you have and how different their means are.

Upper Confidence Bound: smarter exploration

Here's what bugs me about epsilon-greedy: when it explores, it picks a random arm. It doesn't consider which arms actually need more exploration. If you've pulled arm 3 a thousand times and arm 7 only twice, maybe you should focus your exploration budget on arm 7 -- you know very little about it, and it might be better than it currently looks.

UCB (Upper Confidence Bound) formalizes this intuition. For each arm, compute a score that combines its estimated value with a confidence bonus that's large for rarely-pulled arms and small for well-explored arms:

class UCB:
    """Upper Confidence Bound: optimism
    in the face of uncertainty."""

    def __init__(self, k, c=2.0):
        self.k = k
        self.c = c
        self.q_estimates = np.zeros(k)
        self.n_pulls = np.zeros(k)
        self.total_pulls = 0

    def choose_arm(self):
        self.total_pulls += 1

        # Pull each arm once first
        for arm in range(self.k):
            if self.n_pulls[arm] == 0:
                return arm

        # UCB score = estimate + bonus
        bonus = self.c * np.sqrt(
            np.log(self.total_pulls)
            / self.n_pulls)
        ucb_scores = self.q_estimates + bonus
        return int(np.argmax(ucb_scores))

    def update(self, arm, reward):
        self.n_pulls[arm] += 1
        n = self.n_pulls[arm]
        self.q_estimates[arm] += (
            (reward - self.q_estimates[arm]) / n)

The confidence bonus c * sqrt(ln(t) / n_i) comes from concentration inequalities in probability theory (specifically Hoeffding's inequality). It represents a statistical upper bound on how much the true value could plausibly exceed our estimate given the number of samples we've taken. The c parameter controls how optimistic we are -- higher c means wider confidence intervals and more exploration.

The principle is called "optimism in the face of uncertainty": if you don't know much about an arm, assume it might be great and try it. If it turns out to be bad, the confidence bonus shrinks (because you've now pulled it more, so n_i grew) and you'll stop selecting it. If it turns out to be genuinely good, you keep pulling it (because the estimated value grew). Either way, you reduce uncertainty. UCB has provably logarithmic regret -- it's theoretically near-optimal for the stationary bandit problem.

Thompson Sampling: the Bayesian approach

If you remember episode #32 on Bayesian methods, you know the core idea: maintain a full probability distribution over unknown quantities, update it as evidence arrives. Thompson Sampling applies this directly to bandits. Instead of keeping a single point estimate for each arm's value, maintain a posterior distribution. To choose an arm, sample from each arm's posterior and pick the arm with the highest sample:

class ThompsonSampling:
    """Bayesian bandit: maintain Beta
    distributions for binary rewards,
    sample to decide."""

    def __init__(self, k):
        self.k = k
        # Beta(alpha, beta) prior:
        # start uniform Beta(1,1)
        self.alpha = np.ones(k)
        self.beta_param = np.ones(k)

    def choose_arm(self):
        samples = np.array([
            np.random.beta(
                self.alpha[i],
                self.beta_param[i])
            for i in range(self.k)])
        return int(np.argmax(samples))

    def update(self, arm, reward):
        """Binary reward: success or failure.
        Beta posterior is conjugate."""
        if reward > 0:
            self.alpha[arm] += 1
        else:
            self.beta_param[arm] += 1

The beauty here -- and I genuinely think this is one of the most elegant ideas in all of machine learning -- is that Thompson Sampling naturally balances exploration and exploitation without any explicit parameter. No epsilon to tune, no c to set. Arms with high uncertainty produce widely varying samples: sometimes the sample is high (triggering exploration), sometimes low (so some other arm wins). As you collect more data, the posterior narrows and the algorithm increasingly commits to the best arm. The exploration schedule emerges from the Bayesian update rule itself.

For continuous (Gaussian) rewards rather than binary success/failure, you use a Normal-Inverse-Gamma posterior in stead of the Beta:

class GaussianThompsonSampling:
    """Thompson Sampling for Gaussian
    (continuous) rewards."""

    def __init__(self, k):
        self.k = k
        self.mu = np.zeros(k)
        self.lambda_ = np.ones(k)
        self.alpha = np.ones(k)
        self.beta_ = np.ones(k)

    def choose_arm(self):
        samples = np.zeros(self.k)
        for i in range(self.k):
            tau = np.random.gamma(
                self.alpha[i],
                1.0 / self.beta_[i])
            sigma = 1.0 / np.sqrt(
                tau * self.lambda_[i])
            samples[i] = np.random.normal(
                self.mu[i], sigma)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        old_mu = self.mu[arm]
        old_lam = self.lambda_[arm]
        self.lambda_[arm] += 1
        self.mu[arm] = (
            (old_lam * old_mu + reward)
            / self.lambda_[arm])
        self.alpha[arm] += 0.5
        self.beta_[arm] += (
            0.5 * old_lam
            * (reward - old_mu) ** 2
            / self.lambda_[arm])

The math here is the conjugate update for the Normal-Inverse-Gamma distribution -- the posterior stays in the same family as the prior, so each update is a simple parameter adjustment rather than a full Bayesian integral. The key insight: we're sampling both the mean AND the variance of each arm's reward distribution, which gives us a richer uncertainty model than UCB's single confidence bonus.

Head-to-head comparison

Enough theory. Let's run all three strategies on the same 10-armed bandit problem and see what actually happens:

def run_experiment(env, agent, n_steps=10000):
    """Run one bandit experiment."""
    actions = np.zeros(n_steps, dtype=int)
    rewards = np.zeros(n_steps)

    for t in range(n_steps):
        arm = agent.choose_arm()
        reward = env.pull(arm)
        agent.update(arm, reward)
        actions[t] = arm
        rewards[t] = reward

    return actions, rewards


env = BanditEnvironment(k=10, seed=42)
print(f"True arm values: "
      f"{env.q_true.round(2)}")
print(f"Best arm: {env.best_arm} "
      f"(value={env.q_true[env.best_arm]:.2f})")
print()

n_steps = 5000
configs = {
    "Eps-Greedy (0.1)": EpsilonGreedy(
        10, epsilon=0.1),
    "UCB (c=2)": UCB(10, c=2.0),
    "Thompson": GaussianThompsonSampling(10),
}

for name, agent in configs.items():
    np.random.seed(0)
    actions, rewards = run_experiment(
        env, agent, n_steps)
    regret = compute_regret(
        env, actions, rewards)
    opt_pct = (np.mean(
        actions == env.best_arm) * 100)
    print(f"{name:22s} | "
          f"regret={regret[-1]:7.1f} | "
          f"optimal={opt_pct:.1f}%")

You'll typically see Thompson Sampling and UCB outperforming epsilon-greedy by a clear margin. Epsilon-greedy keeps wasting that 10% of its budget on random exploration forever, while UCB and Thompson Sampling progressivly concentrate on the best arm as uncertainty shrinks. Between UCB and Thompson Sampling, empirical results often favor Thompson -- it tends to be less sensitive to its parameters and adapts more gracefully when arm values are close together.

Having said that, the relative performance depends on the specific problem. With well-separated arm means, all algorithms converge quickly and the differences are small. With arms that have very similar means, the smarter exploration strategies of UCB and Thompson really shine because they can distinguish between close alternatives more efficiently.

Contextual bandits: when context matters

So far, the best arm is always the same -- there's one globally optimal choice. But in many real problems, the best choice depends on who's asking. Showing a tech article to a software developer is great; showing the same article to someone who wants cooking recipes is a wasted impression.

Contextual bandits extend the basic framework by adding a context vector (features describing the current situation) that the agent observes before choosing an arm. The reward now depends on both the arm and the context:

class LinUCB:
    """Contextual bandit with linear reward
    model and UCB-style exploration."""

    def __init__(self, n_arms, d, alpha=1.0):
        self.n_arms = n_arms
        self.alpha = alpha
        # Per-arm ridge regression params
        self.A = [np.eye(d)
                  for _ in range(n_arms)]
        self.b = [np.zeros(d)
                  for _ in range(n_arms)]

    def choose_arm(self, context):
        scores = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            A_inv = np.linalg.inv(
                self.A[arm])
            theta = A_inv @ self.b[arm]
            pred = context @ theta
            bonus = self.alpha * np.sqrt(
                context @ A_inv @ context)
            scores[arm] = pred + bonus
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(
            context, context)
        self.b[arm] += reward * context


# Demo: 3 articles, 4-dim user features
n_arms, d = 3, 4
true_weights = np.array([
    [1.0, 0.0, -0.5, 0.2],   # article 0
    [0.0, 1.0, 0.3, -0.1],   # article 1
    [-0.3, 0.5, 1.0, 0.0],   # article 2
])

agent = LinUCB(n_arms, d, alpha=0.5)
rng = np.random.RandomState(42)
correct = 0

for t in range(2000):
    ctx = rng.randn(d)
    # True best arm for this context
    true_rewards = true_weights @ ctx
    best = int(np.argmax(true_rewards))
    # Agent's choice
    chosen = agent.choose_arm(ctx)
    reward = (true_weights[chosen] @ ctx
              + rng.normal(0, 0.1))
    agent.update(chosen, ctx, reward)
    if chosen == best:
        correct += 1

print(f"Accuracy: {correct/2000:.1%}")
print(f"Last-500 accuracy: "
      f"{correct/2000:.1%}")

This is LinUCB -- arguably the most important contextual bandit algorithm in industry. It models the expected reward of each arm as a linear function of the context vector: E[reward | arm, context] = theta_arm . context. The theta vectors are learned via ridge regression, and the confidence bonus comes from the regression uncertainty (specifically, the Mahalanobis distance of the context vector with respect to the inverse design matrix).

LinUCB was developed at Yahoo for personalized news recommendation (the famous 2010 paper by Li et al.) and variants of it power recommendation systems at Netflix, Spotify, and basically every ad platform you interact with. The context might include user demographics, browsing history, time of day, device type -- anything that helps predict which option will work best for this specific user in this specific moment.

A/B testing is a bandit problem

Traditional A/B testing works like this: show version A to 50% of users and version B to the other 50%, wait until you have enough data for statistical significance, then pick the winner. The problem? If version A is clearly better after 1,000 users, you're still forcing the remaining 49,000 users to see the inferior version B "for statistical rigor."

That's wasteful. And in domains where each bad experience has real cost -- medical trials, financial decisions, user churn -- it's not just wasteful, it's arguably unethical. Bandit algorithms fix this by shifting traffic toward the better variant as evidence accumulates:

class BanditABTest:
    """A/B testing with Thompson Sampling
    instead of fixed 50/50 allocation."""

    def __init__(self, n_variants=2):
        self.ts = ThompsonSampling(n_variants)
        self.shown = np.zeros(
            n_variants, dtype=int)
        self.conversions = np.zeros(
            n_variants, dtype=int)

    def choose_variant(self):
        return self.ts.choose_arm()

    def record(self, variant, converted):
        self.shown[variant] += 1
        if converted:
            self.conversions[variant] += 1
        self.ts.update(
            variant, 1 if converted else 0)

    def report(self):
        for v in range(len(self.shown)):
            rate = (self.conversions[v]
                    / max(1, self.shown[v]))
            print(
                f"  Variant {v}: "
                f"{self.shown[v]} shown, "
                f"{self.conversions[v]} conv "
                f"({rate:.1%})")


# Simulate: variant 0 converts at 5%,
# variant 1 at 8%
rng = np.random.RandomState(42)
test = BanditABTest(2)

for _ in range(10000):
    v = test.choose_variant()
    true_rate = [0.05, 0.08][v]
    converted = rng.random() < true_rate
    test.record(v, converted)

print("Bandit A/B test results:")
test.report()

After 10,000 users, Thompson Sampling will have shown variant 1 (the better one) to roughly 70-80% of users instead of 50%. That's 2,000-3,000 fewer users exposed to the inferior option -- real value if each conversion matters.

The tradeoff is real though: traditional A/B tests give you formal statistical significance (p-values, confidence intervals, power analysis) that many organizations require for decision-making, regulatory compliance, or just internal trust. Bandit approaches optimize total outcomes but make statistical claims harder because the allocation is adaptive. In practice, many companies use a hybrid: start with a bandit to quickly identify the likely winner, then switch to a fixed allocation for final statistical validation. Best of both worlds.

Optimistic initialization: a simple trick

One more technique worth knowing. Instead of initializing all Q-estimates to zero, start them high -- say, at 5.0 when the true rewards are typically around 0-1. Now every arm looks amazing at first. The agent picks arm 0, gets a reward of 0.7, and the estimate drops. Next time it picks a different arm (whose estimate is still 5.0) and that drops too. This forces systematic exploration of every arm until the estimates settle to realistic values:

class OptimisticGreedy:
    """Greedy with optimistic initialization.
    No explicit exploration needed."""

    def __init__(self, k, initial_value=5.0):
        self.k = k
        self.q_estimates = np.full(
            k, initial_value)
        self.n_pulls = np.zeros(k)

    def choose_arm(self):
        return int(
            np.argmax(self.q_estimates))

    def update(self, arm, reward):
        self.n_pulls[arm] += 1
        n = self.n_pulls[arm]
        self.q_estimates[arm] += (
            (reward - self.q_estimates[arm]) / n)


# Quick test
env = BanditEnvironment(k=10, seed=42)
agent = OptimisticGreedy(10, initial_value=5.0)
np.random.seed(0)
actions, rewards = run_experiment(
    env, agent, 5000)
regret = compute_regret(env, actions, rewards)
opt_pct = np.mean(
    actions == env.best_arm) * 100
print(f"Optimistic greedy: "
      f"regret={regret[-1]:.1f}, "
      f"optimal={opt_pct:.1f}%")

Optimistic initialization is purely greedy -- no epsilon, no randomness after the initial phase. The initial values drive exploration automatically. It's a neat trick and works surprisingly well for stationary problems where the arm values don't change over time. The limitation: it only explores at the beginning. If the environment changes later (non-stationary bandits), optimistic initialization won't re-explore. For non-stationary problems, you need something like a sliding window average or explicit exploration that never fully stops.

When bandits break: non-stationary rewards

All the algorithms above assume the true arm values are fixed. But what if they change over time? The restaurant that was great last month got a new chef. The ad that converted well in January flops in March. This is the non-stationary bandit problem, and it requires a different approach:

class SlidingWindowBandit:
    """Bandit that only considers recent
    observations, adapting to change."""

    def __init__(self, k, window=100,
                 epsilon=0.1):
        self.k = k
        self.window = window
        self.epsilon = epsilon
        self.history = [[] for _ in range(k)]

    def choose_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)
        estimates = np.zeros(self.k)
        for arm in range(self.k):
            recent = self.history[arm][
                -self.window:]
            if len(recent) == 0:
                estimates[arm] = float('inf')
            else:
                estimates[arm] = np.mean(recent)
        return int(np.argmax(estimates))

    def update(self, arm, reward):
        self.history[arm].append(reward)


# Simulate: arm values shift at t=2500
rng = np.random.RandomState(42)
k = 5
q_phase1 = rng.normal(0, 1, size=k)
q_phase2 = rng.normal(0, 1, size=k)

agent = SlidingWindowBandit(
    k, window=200, epsilon=0.1)
regrets = []
for t in range(5000):
    q = q_phase1 if t < 2500 else q_phase2
    arm = agent.choose_arm()
    reward = q[arm] + rng.normal(0, 0.5)
    agent.update(arm, reward)
    best = np.max(q)
    regrets.append(best - q[arm])

print(f"Phase 1 avg regret: "
      f"{np.mean(regrets[:2500]):.3f}")
print(f"Phase 2 avg regret: "
      f"{np.mean(regrets[2500:]):.3f}")
print(f"Total regret: {np.sum(regrets):.1f}")

The sliding window approach forgets old data, allowing the agent to adapt when arm values change. The window size controls the forgetting speed -- too small and you react to noise, too large and you're slow to adapt. In production systems, you often see exponential weighting (recent observations count more) rather than a hard window cutoff, because it decays more smoothly.

Samengevat

  • The multi-armed bandit is the simplest RL problem -- choose between K options with unknown rewards, no states, no transitions -- yet it captures the exploration-exploitation tradeoff in its purest form;
  • epsilon-greedy explores randomly with fixed probability -- simple but inefficient because it doesn't target uncertain arms, and a fixed epsilon wastes pulls forever;
  • UCB adds a confidence bonus based on how rarely each arm has been pulled, preferring uncertain arms -- provably near-optimal with logarithmic regret;
  • Thompson Sampling maintains Bayesian posteriors (Beta for binary, Normal-Inverse-Gamma for continuous) and samples to decide -- naturally balances exploration and exploitation without tunable parameters, often performs best in practice;
  • contextual bandits add user features so the best arm varies per situation -- LinUCB uses per-arm linear models with ridge regression uncertainty for principled exploration;
  • A/B testing is secretly a bandit problem: bandit algorithms shift traffic to better variants faster than fixed 50/50 splits, reducing exposure to inferior options;
  • optimistic initialization drives early exploration by setting initial estimates high -- simple, effective, but only works for stationary problems;
  • for non-stationary bandits (where arm values change over time), sliding window or exponentially-weighted estimates let the agent forget old data and re-explore.

Bandits are just the beginning of RL, but they're the foundation. The exploration strategies we built today -- epsilon-greedy, UCB, Thompson Sampling -- will show up again in more complex settings where we add states, transitions, and delayed rewards on top of them. The next episode will formalize some of the mathematical framework we've been using informally, and from there we'll build up to the full sequential decision-making algorithms that power modern RL systems.

Exercises

Exercise 1: Build a multi-armed bandit tournament. Create a class BanditTournament that runs 5 algorithms -- EpsilonGreedy(0.1), EpsilonGreedy(0.01), DecayingEpsilonGreedy, UCB(c=2), and GaussianThompsonSampling -- on 20 different randomly-generated 10-armed bandit environments (seeds 0-19). For each environment, run each algorithm for 10,000 steps. Track per-environment final cumulative regret and percentage of optimal arm pulls. Print: (a) a summary table showing average regret and optimal-pull% across all 20 environments for each algorithm, (b) how many of the 20 environments each algorithm "won" (lowest regret), (c) for the hardest environment (where the gap between best and second-best arm is smallest), show each algorithm's regret curve at steps 100, 1000, 5000, 10000. Thompson Sampling should win the most environments overall.

Exercise 2: Build a contextual bandit news recommender. Create a class NewsRecommender using LinUCB with 5 article categories (tech, sports, politics, cooking, finance) and 6-dimensional user feature vectors (age_bucket, gender, morning/evening, weekday/weekend, mobile/desktop, returning_user). Generate 10,000 synthetic user visits where the true click probability for each article-user combination is a sigmoid of a linear function (so each article has a 6-dim weight vector, initialized randomly with seed=42). Train the LinUCB agent online: observe user features, choose article, observe click/no-click, update. Print: (a) overall CTR (click-through rate) in the first 1000 visits vs the last 1000, (b) per-article selection counts (the agent should learn to show different articles to different user types), (c) comparison with a random baseline and a fixed "always show most popular" baseline. LinUCB should achieve at least 2x the CTR of random by the end.

Exercise 3: Build a non-stationary bandit detector. Create a class ChangeDetector that wraps a sliding-window epsilon-greedy agent and adds change-point detection. Track the per-arm reward mean in a sliding window of size 100. Every 50 steps, compare the current window mean against the previous window mean using a simple threshold (if the absolute difference exceeds 0.5, flag a change). When a change is detected for any arm, temporarily boost epsilon to 0.5 for 50 steps (re-explore), then decay back to 0.1. Test on a 5-armed environment where arm values shift every 1000 steps (resample from Normal(0,1) with different seeds). Run for 5000 steps and print: (a) which arms had detected changes and at what step, (b) the regret curve compared to a fixed-epsilon agent (no detection, no boosting), (c) total regret for both agents. The change-detecting agent should recover faster after each shift, producing lower total regret.

Thanks for reading!

@scipio



0
0
0.000
0 comments