Learn AI Series (#105) - Monte Carlo Methods

avatar

Learn AI Series (#105) - Monte Carlo Methods

variant-c-09-blue.png

What will I learn

  • You will learn how to estimate value functions from complete episodes of experience -- no model of the environment required;
  • first-visit vs every-visit Monte Carlo evaluation, and why the difference barely matters in practice;
  • Monte Carlo control: improving a policy purely from sampled experience, with epsilon-greedy keeping exploration alive;
  • off-policy learning: how to learn about one policy while following a completely different one;
  • importance sampling: the trick that corrects for the mismatch between the behavior and target policies;
  • and we solve Blackjack optimally, with the agent discovering basic strategy on its own -- nobody hands it the rules.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution with NumPy;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#105) - Monte Carlo Methods

Solutions to Episode #104 Exercises

Exercise 1: Build a stochastic grid world solver -- the chosen action succeeds with probability p=0.8, and with probability 0.2 the agent slips to one of the two perpendicular directions (0.1 each). get_transitions now returns a list of (prob, next_state, reward, done) tuples, and policy evaluation (and value iteration) sum over those outcomes.

import numpy as np


class StochasticGridWorld:
    """4x4 grid world. Intended action succeeds with prob p,
    else the agent slips to a perpendicular direction (0.1 each)."""

    def __init__(self, size=4, p=0.8):
        self.size = size
        self.n_states = size * size
        self.n_actions = 4  # up, right, down, left
        self.terminal_states = {0, self.n_states - 1}
        self.action_deltas = {0: (-1, 0), 1: (0, 1),
                              2: (1, 0), 3: (0, -1)}
        self.perp = {0: [3, 1], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
        self.p = p

    def state_to_pos(self, s):
        return s // self.size, s % self.size

    def pos_to_state(self, row, col):
        return row * self.size + col

    def _move(self, state, action):
        row, col = self.state_to_pos(state)
        dr, dc = self.action_deltas[action]
        new_row = max(0, min(self.size - 1, row + dr))
        new_col = max(0, min(self.size - 1, col + dc))
        return self.pos_to_state(new_row, new_col)

    def get_transitions(self, state, action):
        """Return list of (prob, next_state, reward, done)."""
        if state in self.terminal_states:
            return [(1.0, state, 0.0, True)]
        outcomes = [(self.p, action)]
        for slip in self.perp[action]:
            outcomes.append((0.1, slip))
        result = []
        for prob, act in outcomes:
            ns = self._move(state, act)
            done = ns in self.terminal_states
            result.append((prob, ns, -1.0, done))
        return result


def value_iteration_stochastic(mdp, gamma=0.9, theta=1e-6):
    V = np.zeros(mdp.n_states)
    while True:
        delta = 0.0
        for s in range(mdp.n_states):
            if s in mdp.terminal_states:
                continue
            v_old = V[s]
            action_values = np.zeros(mdp.n_actions)
            for a in range(mdp.n_actions):
                for prob, ns, reward, done in mdp.get_transitions(s, a):
                    next_value = 0.0 if done else V[ns]
                    action_values[a] += prob * (reward + gamma * next_value)
            V[s] = np.max(action_values)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # read off the greedy policy
    policy = np.zeros((mdp.n_states, mdp.n_actions))
    for s in range(mdp.n_states):
        if s in mdp.terminal_states:
            continue
        av = np.zeros(mdp.n_actions)
        for a in range(mdp.n_actions):
            for prob, ns, reward, done in mdp.get_transitions(s, a):
                next_value = 0.0 if done else V[ns]
                av[a] += prob * (reward + gamma * next_value)
        policy[s, int(np.argmax(av))] = 1.0
    return policy, V


mdp = StochasticGridWorld(size=4, p=0.8)
policy, V = value_iteration_stochastic(mdp, gamma=0.9)

arrows = ["^", ">", "v", "<"]
print("Optimal policy (slippery world):")
for row in range(mdp.size):
    for col in range(mdp.size):
        s = mdp.pos_to_state(row, col)
        if s in mdp.terminal_states:
            print("  T  ", end=" ")
        else:
            print(f"  {arrows[int(np.argmax(policy[s]))]}  ", end=" ")
    print()

The key insight: the values come out more negative than the deterministic case, because every step now carries a 20% chance of slipping sideways and wasting a move. And in some cells the policy steers away from a wall it would happily hug in the deterministic world -- because slipping into a wall just leaves you stuck paying -1, so the agent prefers actions whose slip-directions are also productive. The model knows the slip probabilities, so value iteration prices that risk in automatically.

Exercise 2: Build a gamma sweep experiment -- run value iteration for gamma in [0.5, 0.7, 0.9, 0.99, 1.0] on the deterministic 4x4 grid, and for each gamma print the iterations to convergence, the optimal value of a start-adjacent cell, and whether the optimal policy changed. This reuses GridWorldMDP and policy_improvement from episode #104.

import numpy as np
# Assumes GridWorldMDP and policy_improvement from episode #104 are imported.


def value_iteration_counted(mdp, gamma=1.0, theta=1e-6):
    V = np.zeros(mdp.n_states)
    iters = 0
    while True:
        iters += 1
        delta = 0.0
        for s in range(mdp.n_states):
            if s in mdp.terminal_states:
                continue
            v_old = V[s]
            av = np.zeros(mdp.n_actions)
            for a in range(mdp.n_actions):
                next_s, reward, done = mdp.get_transition(s, a)
                next_value = 0.0 if done else V[next_s]
                av[a] = reward + gamma * next_value
            V[s] = np.max(av)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return policy_improvement(mdp, V, gamma), V, iters


def gamma_sweep(mdp, gammas):
    base_policy = None
    print(f"{'gamma':>6}{'iters':>7}{'V[adj]':>10}{'policy?':>10}")
    for gamma in gammas:
        policy, V, iters = value_iteration_counted(mdp, gamma)
        start_adj = V[1]  # the cell just right of the top-left exit
        greedy = tuple(int(np.argmax(policy[s])) for s in range(mdp.n_states))
        if base_policy is None:
            base_policy, changed = greedy, "base"
        else:
            changed = "same" if greedy == base_policy else "CHANGED"
        print(f"{gamma:6.2f}{iters:7d}{start_adj:10.3f}{changed:>10}")


gamma_sweep(GridWorldMDP(size=4), [0.5, 0.7, 0.9, 0.99, 1.0])

As predicted, the policy is identical for every gamma -- on a pure shortest-path problem the best route never depends on how patient you are, so the arrows don't budge. What does move are the values and the iteration counts: a low gamma like 0.5 converges in just a handful of sweeps (the contraction is fierce), while gamma=1.0 needs more sweeps because nothing is shrinking the value differences except termination. The values get less negative as gamma drops, since far-off costs are discounted into near-irrelevance.

Exercise 3: Build a policy iteration vs value iteration profiler -- tally the total number of Bellman backups (one backup = one reward + gamma*next_value evaluation for a single state-action pair) and run both on grids of size 4, 8, and 16.

import numpy as np
# Assumes GridWorldMDP from episode #104 is imported.


def policy_iteration_profiled(mdp, gamma=0.9, theta=1e-6):
    policy = np.ones((mdp.n_states, mdp.n_actions)) / mdp.n_actions
    backups = 0
    while True:
        V = np.zeros(mdp.n_states)
        while True:  # full evaluation to convergence
            delta = 0.0
            for s in range(mdp.n_states):
                if s in mdp.terminal_states:
                    continue
                v_old, v_new = V[s], 0.0
                for a in range(mdp.n_actions):
                    next_s, reward, done = mdp.get_transition(s, a)
                    next_value = 0.0 if done else V[next_s]
                    v_new += policy[s, a] * (reward + gamma * next_value)
                    backups += 1
                V[s] = v_new
                delta = max(delta, abs(v_old - v_new))
            if delta < theta:
                break
        new_policy = np.zeros((mdp.n_states, mdp.n_actions))
        for s in range(mdp.n_states):
            if s in mdp.terminal_states:
                continue
            av = np.zeros(mdp.n_actions)
            for a in range(mdp.n_actions):
                next_s, reward, done = mdp.get_transition(s, a)
                next_value = 0.0 if done else V[next_s]
                av[a] = reward + gamma * next_value
                backups += 1
            new_policy[s, int(np.argmax(av))] = 1.0
        if np.allclose(policy, new_policy):
            return backups
        policy = new_policy


def value_iteration_profiled(mdp, gamma=0.9, theta=1e-6):
    V = np.zeros(mdp.n_states)
    backups = 0
    while True:
        delta = 0.0
        for s in range(mdp.n_states):
            if s in mdp.terminal_states:
                continue
            v_old = V[s]
            av = np.zeros(mdp.n_actions)
            for a in range(mdp.n_actions):
                next_s, reward, done = mdp.get_transition(s, a)
                next_value = 0.0 if done else V[next_s]
                av[a] = reward + gamma * next_value
                backups += 1
            V[s] = np.max(av)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return backups


print(f"{'size':>5}{'policy-iter':>14}{'value-iter':>14}")
for size in [4, 8, 16]:
    mdp = GridWorldMDP(size=size)
    pi_b = policy_iteration_profiled(mdp, gamma=0.9)
    vi_b = value_iteration_profiled(mdp, gamma=0.9)
    print(f"{size:5d}{pi_b:14d}{vi_b:14d}")

Run it and you'll see policy iteration burning a lot of backups inside its full evaluation loop, even though it only needs a few outer rounds to lock in the policy. Value iteration spreads its work more evenly and -- on these shortest-path grids -- usually comes out ahead on total backups, because it never wastes effort polishing a value function it's about to overwrite. Which one wins flips around once evaluation gets cheap relative to improvement, but for this family of problems value iteration is the leaner one.

On to today's episode

Right, episode 105. Last time (episode #104) we solved grid worlds with dynamic programming -- and I kept hammering one uncomfortable assumption: DP needs the full model. Every transition probability, every reward, all of it known in advance. We had the answer key sitting on the desk and the only puzzle was how to read it.

But where does the answer key come from? For Blackjack, for a robot arm, for an agent learning to trade -- nobody hands you P(s'|s,a) on a plate. The universe is not so generous. So today we throw the model away and learn from something we can get our hands on: experience. Play the game, watch what happens, average the results. That's the whole idea behind Monte Carlo methods, and it's our first genuinely model-free algorithm in this series.

The name, by the way, is a little Cold War history. In the 1940s the physicists at Los Alamos needed to study neutron diffusion but the equations were a nightmare, so Stanislaw Ulam and John von Neumann did the unthinkable for mathematicians of the era -- they gambled. Simulate the random process thousands of times, average the outcomes, read off the answer. They named it after the casino in Monaco where Ulam's uncle liked to lose money. The principle in RL is identical: if you want to know the value of a state, visit it many times and average the returns you collect.

Learning from complete episodes

A Monte Carlo method needs complete episodes. The agent plays from start to terminal state, collecting a trajectory:

s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T (terminal)

Once the episode is over -- and only once it's over, because we need the actual outcome, not a guess -- we compute the return G_t for each visited state (episode #102: the discounted sum of future rewards) and use it as one noisy sample of that state's value. Do that across enough episodes and the averages converge to the truth.

import numpy as np
from collections import defaultdict


def generate_episode(env, policy):
    """Play one complete episode following the policy.
    Returns a list of (state, action, reward) tuples."""
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode


def compute_returns(episode, gamma=1.0):
    """Compute G_t for each step, walking the episode backwards."""
    returns = []
    G = 0.0
    for _, _, reward in reversed(episode):
        G = reward + gamma * G
        returns.insert(0, G)
    return returns

Notice we build the returns backwards. That's not a stylistic quirk -- it's the efficient way to do it. G_t = r_{t+1} + gamma * G_{t+1}, so once you've got the return for the next step, the current step is a single multiply-and-add. Walk forwards and you'd be re-summing the same tail over and over. (We did the exact same backwards trick in episode #104 inside policy evaluation -- the Bellman recursion keeps coming back, doesn't it ;-)

First-visit vs every-visit MC

Here's a wrinkle: a single episode might visit the same state more than once. Two reasonable answers to "which return do I use?" exist. First-visit MC only counts the return following the first time a state shows up in an episode. Every-visit MC counts every occurrence:

def first_visit_mc_evaluation(env, policy, n_episodes=10000, gamma=1.0):
    """Estimate V^pi using first-visit Monte Carlo."""
    V = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(env, policy)
        returns = compute_returns(episode, gamma)
        visited = set()
        for t, (state, action, reward) in enumerate(episode):
            if state not in visited:
                visited.add(state)
                returns_count[state] += 1
                # incremental mean: new_mean += (sample - old_mean) / count
                V[state] += (returns[t] - V[state]) / returns_count[state]
    return dict(V)


def every_visit_mc_evaluation(env, policy, n_episodes=10000, gamma=1.0):
    """Estimate V^pi using every-visit Monte Carlo."""
    V = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(env, policy)
        returns = compute_returns(episode, gamma)
        for t, (state, action, reward) in enumerate(episode):
            returns_count[state] += 1
            V[state] += (returns[t] - V[state]) / returns_count[state]
    return dict(V)

That incremental mean update is worth pausing on. Instead of storing every return and averaging at the end (which would eat memory), we nudge the running estimate toward each new sample by a shrinking step 1/count. After a million episodes the step is tiny and the estimate has settled. This little value += (sample - value) / count pattern is everywhere in RL -- keep an eye out for it, because it mutates into the learning rate of basically every algorithm that comes after this one.

Both flavours converge to the true V^pi as the episode count grows. First-visit has slightly cleaner theory (each first-visit return is an independent sample), every-visit squeezes a touch more data out of each episode. In practice the difference is academic -- pick one and move on.

Blackjack: the textbook Monte Carlo testbed

Blackjack is the classic MC example, and for good reason. It has clean episodes (one hand), genuinely stochastic dynamics (the card draws), and a small enough state space that we can study it completely. The state is just three things: the player's current sum, the dealer's showing card, and whether we hold a usable ace. Let's build it from scratch:

import random


class BlackjackEnv:
    """Simplified Blackjack environment (one player vs dealer)."""

    def reset(self):
        self.player_sum = self._draw_card() + self._draw_card()
        self.dealer_showing = self._draw_card()
        self.dealer_hidden = self._draw_card()
        self.usable_ace = False
        # Deal up until we are in the interesting 12-21 range
        while self.player_sum < 12:
            card = self._draw_card()
            if card == 1 and self.player_sum + 11 <= 21:
                self.player_sum += 11
                self.usable_ace = True
            else:
                self.player_sum += card
        return self._get_state()

    def _draw_card(self):
        return min(random.randint(1, 13), 10)  # face cards count as 10

    def _get_state(self):
        return (self.player_sum, self.dealer_showing, self.usable_ace)

    def step(self, action):
        """action: 0 = stick, 1 = hit."""
        if action == 1:  # hit
            card = self._draw_card()
            if card == 1 and self.player_sum + 11 <= 21:
                self.player_sum += 11
                self.usable_ace = True
            else:
                self.player_sum += card
            if self.player_sum > 21:
                if self.usable_ace:  # rescue with the soft ace
                    self.player_sum -= 10
                    self.usable_ace = False
                else:
                    return self._get_state(), -1.0, True  # bust
            return self._get_state(), 0.0, False
        else:  # stick -- dealer plays out their hand
            dealer_sum = self.dealer_showing + self.dealer_hidden
            while dealer_sum < 17:
                dealer_sum += self._draw_card()
            if dealer_sum > 21:
                return self._get_state(), 1.0, True   # dealer busts
            elif dealer_sum > self.player_sum:
                return self._get_state(), -1.0, True  # dealer wins
            elif dealer_sum < self.player_sum:
                return self._get_state(), 1.0, True   # player wins
            else:
                return self._get_state(), 0.0, True   # push

Now let's evaluate a deliberately naive policy -- stick only on 20 or 21, otherwise keep hitting -- and see what Monte Carlo thinks each state is worth:

def simple_policy(state):
    player_sum, dealer_showing, usable_ace = state
    return 0 if player_sum >= 20 else 1  # stick on 20+, else hit


env = BlackjackEnv()
V = first_visit_mc_evaluation(env, simple_policy, n_episodes=100000)

for sum_val in range(12, 22):
    for dealer in [2, 5, 10]:
        state = (sum_val, dealer, False)
        print(f"Player {sum_val}, dealer shows {dealer}: V = {V.get(state, 0):+.3f}")

The pattern that falls out is exactly what your gut says it should be: sums of 20 or 21 carry comfortably positive values (you win most of those), the 12-16 range is a swamp of negative values (you're stuck either busting on a hit or losing on a stick), and the dealer's showing card matters a lot -- a dealer showing a 10 is far scarier than one showing a 5. We learned all of that without writing a single line of Blackjack strategy. We just played a hundred thousand hands and averaged.

Monte Carlo control: finding the optimal policy

Evaluation tells you how good a policy is. Control finds the best one. The move is the same one we made in episode #104 with policy iteration -- estimate values, then act greedily with respect to them -- with two changes forced on us by the model-free setting.

First, we estimate action values Q(s,a) instead of state values V(s). Why? Because going greedy on V(s) requires a one-step look-ahead through the model ("which action lands me in the best next state?"), and we no longer have the model. Q(s,a) sidesteps that entirely: it already bakes in "take action a, then follow the policy", so greedy is just argmax_a Q(s,a).

Second, we have to keep exploring. If we always act greedily we might never try an action that's actually better, simply because an unlucky early sample made it look bad. The fix is epsilon-greedy (straight from the bandit toolkit in episode #103): act greedily most of the time, but with probability epsilon take a random action to keep the data honest.

def mc_control_epsilon_greedy(env, n_episodes=500000, gamma=1.0, epsilon=0.1):
    """Find a near-optimal policy with first-visit MC control."""
    Q = defaultdict(lambda: np.zeros(2))  # 2 actions: stick, hit
    returns_count = defaultdict(lambda: np.zeros(2))

    def policy(state):
        if random.random() < epsilon:
            return random.randint(0, 1)
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G = 0.0
        visited = set()
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            if (state, action) not in visited:
                visited.add((state, action))
                returns_count[state][action] += 1
                n = returns_count[state][action]
                Q[state][action] += (G - Q[state][action]) / n

    optimal_policy = {s: int(np.argmax(Q[s])) for s in Q}
    return Q, optimal_policy


Q, policy = mc_control_epsilon_greedy(BlackjackEnv(), n_episodes=500000)

print("Optimal play (S=stick, H=hit), no usable ace:")
print(f"{'':>8}", end="")
for dealer in range(2, 12):
    print(f"{dealer:>4}", end="")
print()
for player_sum in range(21, 11, -1):
    print(f"Sum {player_sum:>2}: ", end="")
    for dealer in range(2, 12):
        action = policy.get((player_sum, dealer, False), 1)
        print(f"{'S' if action == 0 else 'H':>4}", end="")
    print()

After half a million hands, the discovered policy lines up almost perfectly with the basic strategy card every Blackjack book sells you: stick early when the dealer shows a weak card (they're likely to bust), keep hitting when they show a strong one, and the stick/hit threshold slides with the dealer's card. The thing I find genuinely lovely here is that nobody told the agent the rules. It never read a strategy guide. It played, it lost, it won, it averaged -- and basic strategy fell out of the arithmetic. That's the Monte Carlo promise made concrete.

Off-policy learning: learning from someone else's mistakes

Everything so far has been on-policy -- the policy generating the episodes is the same one we're evaluating and improving. Tidy, but limiting. Sometimes you want to learn about one policy while following a different one.

That's off-policy learning, and it splits the world in two. The behavior policy generates the data (often deliberately exploratory, or just whatever you happen to have). The target policy is the one you actually care about -- usually the sharp, greedy one you're trying to optimize. Separating them is enormously useful: you can learn an optimal policy from data collected by a clumsy exploratory one, from old logs gathering dust, from human demonstrations, or from a completely different agent.

The snag is obvious once you say it out loud. Episodes drawn from the behavior policy visit states and pick actions with the wrong frequencies -- wrong, that is, relative to what the target policy would have done. If we just averaged those returns we'd get the value of the behavior policy, not the target. We need to correct for the mismatch.

Importance sampling

The correction is importance sampling: re-weight each return by how much more (or less) likely the trajectory was under the target policy than under the behavior policy.

def off_policy_mc_evaluation(env, target_prob, behavior_policy, behavior_prob,
                             n_episodes=100000, gamma=1.0):
    """Evaluate a target policy from data generated by a behavior policy,
    using weighted importance sampling."""
    Q = defaultdict(float)
    C = defaultdict(float)  # cumulative importance-sampling weights

    for _ in range(n_episodes):
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = behavior_policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G = 0.0
        W = 1.0  # the running importance-sampling ratio
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            # if the target would never take this action, the rest is irrelevant
            if target_prob(state, action) == 0:
                break
            W *= target_prob(state, action) / behavior_prob(state, action)

    return dict(Q)

The ratio W is a product of per-step terms pi_target(a|s) / pi_behavior(a|s). Read it like this: when the target policy strongly favours an action the behavior policy only took occasionally, that episode was rare evidence about the target, so we crank its weight up. When the behavior policy took an action the target would shun, the weight collapses toward zero -- that episode tells us little about the target, so we down-weight it (and the break cuts it off entirely once the target probability hits zero).

There are two ways to use those weights. Ordinary importance sampling divides by the episode count -- unbiased, but the variance can be savage (a single huge W can wreck your estimate). Weighted importance sampling -- what the code above does -- divides by the cumulative weights C instead. That introduces a little bias but slashes the variance, and in practice it wins almost every time. When in doubt, reach for weighted.

Monte Carlo vs Dynamic Programming

So how does our new model-free tool stack up against last episode's open-book DP? A side-by-side:

Dynamic ProgrammingMonte Carlo
Model required?Yes -- full MDPNo -- just episodes
Bootstraps?Yes -- V(s) uses V(s')No -- uses the actual return
Needs complete episodes?NoYes
BiasNone (given true model)None (unbiased estimates)
VarianceLow (deterministic sweeps)High (returns swing a lot)
State coverageEvery state, every sweepOnly states it actually visits

Monte Carlo's big win is right there in the first row -- no model. And because it learns from real returns rather than from its own value estimates (no bootstrapping), it carries no bias from a half-trained value function. Its weaknesses are the flip side: returns are noisy, so variance is high, and the whole thing only works for tasks that actually terminate. An agent that runs forever never produces a return to average, and Monte Carlo just sits there waiting for an episode that never ends.

That last limitation is a real itch. What if we could learn from a single step instead of waiting for the whole episode to finish? What if we could blend Monte Carlo's learn-from-reality honesty with the bootstrapping trick that made DP so efficient? That combination is the most important idea in all of reinforcement learning, and it's exactly where we're headed next. But that's a story for another episode ;-)

So, what do you know now?

  • Monte Carlo methods estimate value functions by averaging the actual returns from complete episodes -- no model of the environment needed, which makes them our first truly model-free algorithm;
  • first-visit MC uses only the first occurrence of each state per episode, every-visit MC uses all of them, and both converge to V^pi given enough episodes (the practical difference is negligible);
  • the incremental mean update value += (sample - value) / count is the workhorse pattern -- it quietly becomes the learning rate of nearly every algorithm that follows;
  • MC control swaps V for action values Q and leans on epsilon-greedy exploration to discover optimal policies -- we watched it rediscover Blackjack basic strategy from raw experience alone;
  • off-policy learning evaluates a target policy from data generated by a different behavior policy, and importance sampling (weighted, for low variance) corrects the probability mismatch;
  • MC has zero bias but high variance and demands complete episodes -- limitations that the next family of methods is built specifically to overcome, by learning from single steps in stead of whole episodes.

Exercises

Exercise 1: Add a usable-ace policy comparison to the Blackjack evaluator. Run first_visit_mc_evaluation on the simple_policy and this time print the value table separately for usable_ace = True and usable_ace = False (same sums and dealer cards). Compare the two tables: for which states does holding a usable ace make the position meaningfully better, and why? (Hint: a soft hand can't bust on the next hit, so the downside is smaller.)

Exercise 2: Implement and compare ordinary vs weighted importance sampling on the same off-policy Blackjack evaluation. Use a uniform-random behavior policy and a deterministic target policy (stick on 18+). Track both estimators for a single fixed state across increasing episode counts (say 1k, 10k, 100k, 1M) and plot or print how each estimate behaves. You should see ordinary IS swinging wildly with the occasional spike, while weighted IS settles down smoothly -- the bias/variance trade-off made visible.

Exercise 3: Build a first-visit vs every-visit race. Pick a small environment where states genuinely repeat within an episode (a tiny grid world with a random policy works -- the agent will revisit cells). Run both estimators for the same number of episodes and measure how fast each one's value estimate for a chosen state approaches the answer from a very-long reference run. Does every-visit's extra data actually buy faster convergence here, or does the correlation between repeated visits eat the advantage? Report what you find.

Bedankt en tot de volgende keer!

@scipio



0
0
0.000
0 comments