Learn AI Series (#105) - Monte Carlo Methods
Learn AI Series (#105) - Monte Carlo Methods

What will I learn
- You will learn how to estimate value functions from complete episodes of experience -- no model of the environment required;
- first-visit vs every-visit Monte Carlo evaluation, and why the difference barely matters in practice;
- Monte Carlo control: improving a policy purely from sampled experience, with epsilon-greedy keeping exploration alive;
- off-policy learning: how to learn about one policy while following a completely different one;
- importance sampling: the trick that corrects for the mismatch between the behavior and target policies;
- and we solve Blackjack optimally, with the agent discovering basic strategy on its own -- nobody hands it the rules.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution with NumPy;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning?
- Learn AI Series (#103) - Multi-Armed Bandits
- Learn AI Series (#104) - Dynamic Programming
- Learn AI Series (#105) - Monte Carlo Methods (this post)
Learn AI Series (#105) - Monte Carlo Methods
Solutions to Episode #104 Exercises
Exercise 1: Build a stochastic grid world solver -- the chosen action succeeds with probability p=0.8, and with probability 0.2 the agent slips to one of the two perpendicular directions (0.1 each). get_transitions now returns a list of (prob, next_state, reward, done) tuples, and policy evaluation (and value iteration) sum over those outcomes.
import numpy as np
class StochasticGridWorld:
"""4x4 grid world. Intended action succeeds with prob p,
else the agent slips to a perpendicular direction (0.1 each)."""
def __init__(self, size=4, p=0.8):
self.size = size
self.n_states = size * size
self.n_actions = 4 # up, right, down, left
self.terminal_states = {0, self.n_states - 1}
self.action_deltas = {0: (-1, 0), 1: (0, 1),
2: (1, 0), 3: (0, -1)}
self.perp = {0: [3, 1], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
self.p = p
def state_to_pos(self, s):
return s // self.size, s % self.size
def pos_to_state(self, row, col):
return row * self.size + col
def _move(self, state, action):
row, col = self.state_to_pos(state)
dr, dc = self.action_deltas[action]
new_row = max(0, min(self.size - 1, row + dr))
new_col = max(0, min(self.size - 1, col + dc))
return self.pos_to_state(new_row, new_col)
def get_transitions(self, state, action):
"""Return list of (prob, next_state, reward, done)."""
if state in self.terminal_states:
return [(1.0, state, 0.0, True)]
outcomes = [(self.p, action)]
for slip in self.perp[action]:
outcomes.append((0.1, slip))
result = []
for prob, act in outcomes:
ns = self._move(state, act)
done = ns in self.terminal_states
result.append((prob, ns, -1.0, done))
return result
def value_iteration_stochastic(mdp, gamma=0.9, theta=1e-6):
V = np.zeros(mdp.n_states)
while True:
delta = 0.0
for s in range(mdp.n_states):
if s in mdp.terminal_states:
continue
v_old = V[s]
action_values = np.zeros(mdp.n_actions)
for a in range(mdp.n_actions):
for prob, ns, reward, done in mdp.get_transitions(s, a):
next_value = 0.0 if done else V[ns]
action_values[a] += prob * (reward + gamma * next_value)
V[s] = np.max(action_values)
delta = max(delta, abs(v_old - V[s]))
if delta < theta:
break
# read off the greedy policy
policy = np.zeros((mdp.n_states, mdp.n_actions))
for s in range(mdp.n_states):
if s in mdp.terminal_states:
continue
av = np.zeros(mdp.n_actions)
for a in range(mdp.n_actions):
for prob, ns, reward, done in mdp.get_transitions(s, a):
next_value = 0.0 if done else V[ns]
av[a] += prob * (reward + gamma * next_value)
policy[s, int(np.argmax(av))] = 1.0
return policy, V
mdp = StochasticGridWorld(size=4, p=0.8)
policy, V = value_iteration_stochastic(mdp, gamma=0.9)
arrows = ["^", ">", "v", "<"]
print("Optimal policy (slippery world):")
for row in range(mdp.size):
for col in range(mdp.size):
s = mdp.pos_to_state(row, col)
if s in mdp.terminal_states:
print(" T ", end=" ")
else:
print(f" {arrows[int(np.argmax(policy[s]))]} ", end=" ")
print()
The key insight: the values come out more negative than the deterministic case, because every step now carries a 20% chance of slipping sideways and wasting a move. And in some cells the policy steers away from a wall it would happily hug in the deterministic world -- because slipping into a wall just leaves you stuck paying -1, so the agent prefers actions whose slip-directions are also productive. The model knows the slip probabilities, so value iteration prices that risk in automatically.
Exercise 2: Build a gamma sweep experiment -- run value iteration for gamma in [0.5, 0.7, 0.9, 0.99, 1.0] on the deterministic 4x4 grid, and for each gamma print the iterations to convergence, the optimal value of a start-adjacent cell, and whether the optimal policy changed. This reuses GridWorldMDP and policy_improvement from episode #104.
import numpy as np
# Assumes GridWorldMDP and policy_improvement from episode #104 are imported.
def value_iteration_counted(mdp, gamma=1.0, theta=1e-6):
V = np.zeros(mdp.n_states)
iters = 0
while True:
iters += 1
delta = 0.0
for s in range(mdp.n_states):
if s in mdp.terminal_states:
continue
v_old = V[s]
av = np.zeros(mdp.n_actions)
for a in range(mdp.n_actions):
next_s, reward, done = mdp.get_transition(s, a)
next_value = 0.0 if done else V[next_s]
av[a] = reward + gamma * next_value
V[s] = np.max(av)
delta = max(delta, abs(v_old - V[s]))
if delta < theta:
break
return policy_improvement(mdp, V, gamma), V, iters
def gamma_sweep(mdp, gammas):
base_policy = None
print(f"{'gamma':>6}{'iters':>7}{'V[adj]':>10}{'policy?':>10}")
for gamma in gammas:
policy, V, iters = value_iteration_counted(mdp, gamma)
start_adj = V[1] # the cell just right of the top-left exit
greedy = tuple(int(np.argmax(policy[s])) for s in range(mdp.n_states))
if base_policy is None:
base_policy, changed = greedy, "base"
else:
changed = "same" if greedy == base_policy else "CHANGED"
print(f"{gamma:6.2f}{iters:7d}{start_adj:10.3f}{changed:>10}")
gamma_sweep(GridWorldMDP(size=4), [0.5, 0.7, 0.9, 0.99, 1.0])
As predicted, the policy is identical for every gamma -- on a pure shortest-path problem the best route never depends on how patient you are, so the arrows don't budge. What does move are the values and the iteration counts: a low gamma like 0.5 converges in just a handful of sweeps (the contraction is fierce), while gamma=1.0 needs more sweeps because nothing is shrinking the value differences except termination. The values get less negative as gamma drops, since far-off costs are discounted into near-irrelevance.
Exercise 3: Build a policy iteration vs value iteration profiler -- tally the total number of Bellman backups (one backup = one reward + gamma*next_value evaluation for a single state-action pair) and run both on grids of size 4, 8, and 16.
import numpy as np
# Assumes GridWorldMDP from episode #104 is imported.
def policy_iteration_profiled(mdp, gamma=0.9, theta=1e-6):
policy = np.ones((mdp.n_states, mdp.n_actions)) / mdp.n_actions
backups = 0
while True:
V = np.zeros(mdp.n_states)
while True: # full evaluation to convergence
delta = 0.0
for s in range(mdp.n_states):
if s in mdp.terminal_states:
continue
v_old, v_new = V[s], 0.0
for a in range(mdp.n_actions):
next_s, reward, done = mdp.get_transition(s, a)
next_value = 0.0 if done else V[next_s]
v_new += policy[s, a] * (reward + gamma * next_value)
backups += 1
V[s] = v_new
delta = max(delta, abs(v_old - v_new))
if delta < theta:
break
new_policy = np.zeros((mdp.n_states, mdp.n_actions))
for s in range(mdp.n_states):
if s in mdp.terminal_states:
continue
av = np.zeros(mdp.n_actions)
for a in range(mdp.n_actions):
next_s, reward, done = mdp.get_transition(s, a)
next_value = 0.0 if done else V[next_s]
av[a] = reward + gamma * next_value
backups += 1
new_policy[s, int(np.argmax(av))] = 1.0
if np.allclose(policy, new_policy):
return backups
policy = new_policy
def value_iteration_profiled(mdp, gamma=0.9, theta=1e-6):
V = np.zeros(mdp.n_states)
backups = 0
while True:
delta = 0.0
for s in range(mdp.n_states):
if s in mdp.terminal_states:
continue
v_old = V[s]
av = np.zeros(mdp.n_actions)
for a in range(mdp.n_actions):
next_s, reward, done = mdp.get_transition(s, a)
next_value = 0.0 if done else V[next_s]
av[a] = reward + gamma * next_value
backups += 1
V[s] = np.max(av)
delta = max(delta, abs(v_old - V[s]))
if delta < theta:
return backups
print(f"{'size':>5}{'policy-iter':>14}{'value-iter':>14}")
for size in [4, 8, 16]:
mdp = GridWorldMDP(size=size)
pi_b = policy_iteration_profiled(mdp, gamma=0.9)
vi_b = value_iteration_profiled(mdp, gamma=0.9)
print(f"{size:5d}{pi_b:14d}{vi_b:14d}")
Run it and you'll see policy iteration burning a lot of backups inside its full evaluation loop, even though it only needs a few outer rounds to lock in the policy. Value iteration spreads its work more evenly and -- on these shortest-path grids -- usually comes out ahead on total backups, because it never wastes effort polishing a value function it's about to overwrite. Which one wins flips around once evaluation gets cheap relative to improvement, but for this family of problems value iteration is the leaner one.
On to today's episode
Right, episode 105. Last time (episode #104) we solved grid worlds with dynamic programming -- and I kept hammering one uncomfortable assumption: DP needs the full model. Every transition probability, every reward, all of it known in advance. We had the answer key sitting on the desk and the only puzzle was how to read it.
But where does the answer key come from? For Blackjack, for a robot arm, for an agent learning to trade -- nobody hands you P(s'|s,a) on a plate. The universe is not so generous. So today we throw the model away and learn from something we can get our hands on: experience. Play the game, watch what happens, average the results. That's the whole idea behind Monte Carlo methods, and it's our first genuinely model-free algorithm in this series.
The name, by the way, is a little Cold War history. In the 1940s the physicists at Los Alamos needed to study neutron diffusion but the equations were a nightmare, so Stanislaw Ulam and John von Neumann did the unthinkable for mathematicians of the era -- they gambled. Simulate the random process thousands of times, average the outcomes, read off the answer. They named it after the casino in Monaco where Ulam's uncle liked to lose money. The principle in RL is identical: if you want to know the value of a state, visit it many times and average the returns you collect.
Learning from complete episodes
A Monte Carlo method needs complete episodes. The agent plays from start to terminal state, collecting a trajectory:
s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T (terminal)
Once the episode is over -- and only once it's over, because we need the actual outcome, not a guess -- we compute the return G_t for each visited state (episode #102: the discounted sum of future rewards) and use it as one noisy sample of that state's value. Do that across enough episodes and the averages converge to the truth.
import numpy as np
from collections import defaultdict
def generate_episode(env, policy):
"""Play one complete episode following the policy.
Returns a list of (state, action, reward) tuples."""
episode = []
state = env.reset()
done = False
while not done:
action = policy(state)
next_state, reward, done = env.step(action)
episode.append((state, action, reward))
state = next_state
return episode
def compute_returns(episode, gamma=1.0):
"""Compute G_t for each step, walking the episode backwards."""
returns = []
G = 0.0
for _, _, reward in reversed(episode):
G = reward + gamma * G
returns.insert(0, G)
return returns
Notice we build the returns backwards. That's not a stylistic quirk -- it's the efficient way to do it. G_t = r_{t+1} + gamma * G_{t+1}, so once you've got the return for the next step, the current step is a single multiply-and-add. Walk forwards and you'd be re-summing the same tail over and over. (We did the exact same backwards trick in episode #104 inside policy evaluation -- the Bellman recursion keeps coming back, doesn't it ;-)
First-visit vs every-visit MC
Here's a wrinkle: a single episode might visit the same state more than once. Two reasonable answers to "which return do I use?" exist. First-visit MC only counts the return following the first time a state shows up in an episode. Every-visit MC counts every occurrence:
def first_visit_mc_evaluation(env, policy, n_episodes=10000, gamma=1.0):
"""Estimate V^pi using first-visit Monte Carlo."""
V = defaultdict(float)
returns_count = defaultdict(int)
for _ in range(n_episodes):
episode = generate_episode(env, policy)
returns = compute_returns(episode, gamma)
visited = set()
for t, (state, action, reward) in enumerate(episode):
if state not in visited:
visited.add(state)
returns_count[state] += 1
# incremental mean: new_mean += (sample - old_mean) / count
V[state] += (returns[t] - V[state]) / returns_count[state]
return dict(V)
def every_visit_mc_evaluation(env, policy, n_episodes=10000, gamma=1.0):
"""Estimate V^pi using every-visit Monte Carlo."""
V = defaultdict(float)
returns_count = defaultdict(int)
for _ in range(n_episodes):
episode = generate_episode(env, policy)
returns = compute_returns(episode, gamma)
for t, (state, action, reward) in enumerate(episode):
returns_count[state] += 1
V[state] += (returns[t] - V[state]) / returns_count[state]
return dict(V)
That incremental mean update is worth pausing on. Instead of storing every return and averaging at the end (which would eat memory), we nudge the running estimate toward each new sample by a shrinking step 1/count. After a million episodes the step is tiny and the estimate has settled. This little value += (sample - value) / count pattern is everywhere in RL -- keep an eye out for it, because it mutates into the learning rate of basically every algorithm that comes after this one.
Both flavours converge to the true V^pi as the episode count grows. First-visit has slightly cleaner theory (each first-visit return is an independent sample), every-visit squeezes a touch more data out of each episode. In practice the difference is academic -- pick one and move on.
Blackjack: the textbook Monte Carlo testbed
Blackjack is the classic MC example, and for good reason. It has clean episodes (one hand), genuinely stochastic dynamics (the card draws), and a small enough state space that we can study it completely. The state is just three things: the player's current sum, the dealer's showing card, and whether we hold a usable ace. Let's build it from scratch:
import random
class BlackjackEnv:
"""Simplified Blackjack environment (one player vs dealer)."""
def reset(self):
self.player_sum = self._draw_card() + self._draw_card()
self.dealer_showing = self._draw_card()
self.dealer_hidden = self._draw_card()
self.usable_ace = False
# Deal up until we are in the interesting 12-21 range
while self.player_sum < 12:
card = self._draw_card()
if card == 1 and self.player_sum + 11 <= 21:
self.player_sum += 11
self.usable_ace = True
else:
self.player_sum += card
return self._get_state()
def _draw_card(self):
return min(random.randint(1, 13), 10) # face cards count as 10
def _get_state(self):
return (self.player_sum, self.dealer_showing, self.usable_ace)
def step(self, action):
"""action: 0 = stick, 1 = hit."""
if action == 1: # hit
card = self._draw_card()
if card == 1 and self.player_sum + 11 <= 21:
self.player_sum += 11
self.usable_ace = True
else:
self.player_sum += card
if self.player_sum > 21:
if self.usable_ace: # rescue with the soft ace
self.player_sum -= 10
self.usable_ace = False
else:
return self._get_state(), -1.0, True # bust
return self._get_state(), 0.0, False
else: # stick -- dealer plays out their hand
dealer_sum = self.dealer_showing + self.dealer_hidden
while dealer_sum < 17:
dealer_sum += self._draw_card()
if dealer_sum > 21:
return self._get_state(), 1.0, True # dealer busts
elif dealer_sum > self.player_sum:
return self._get_state(), -1.0, True # dealer wins
elif dealer_sum < self.player_sum:
return self._get_state(), 1.0, True # player wins
else:
return self._get_state(), 0.0, True # push
Now let's evaluate a deliberately naive policy -- stick only on 20 or 21, otherwise keep hitting -- and see what Monte Carlo thinks each state is worth:
def simple_policy(state):
player_sum, dealer_showing, usable_ace = state
return 0 if player_sum >= 20 else 1 # stick on 20+, else hit
env = BlackjackEnv()
V = first_visit_mc_evaluation(env, simple_policy, n_episodes=100000)
for sum_val in range(12, 22):
for dealer in [2, 5, 10]:
state = (sum_val, dealer, False)
print(f"Player {sum_val}, dealer shows {dealer}: V = {V.get(state, 0):+.3f}")
The pattern that falls out is exactly what your gut says it should be: sums of 20 or 21 carry comfortably positive values (you win most of those), the 12-16 range is a swamp of negative values (you're stuck either busting on a hit or losing on a stick), and the dealer's showing card matters a lot -- a dealer showing a 10 is far scarier than one showing a 5. We learned all of that without writing a single line of Blackjack strategy. We just played a hundred thousand hands and averaged.
Monte Carlo control: finding the optimal policy
Evaluation tells you how good a policy is. Control finds the best one. The move is the same one we made in episode #104 with policy iteration -- estimate values, then act greedily with respect to them -- with two changes forced on us by the model-free setting.
First, we estimate action values Q(s,a) instead of state values V(s). Why? Because going greedy on V(s) requires a one-step look-ahead through the model ("which action lands me in the best next state?"), and we no longer have the model. Q(s,a) sidesteps that entirely: it already bakes in "take action a, then follow the policy", so greedy is just argmax_a Q(s,a).
Second, we have to keep exploring. If we always act greedily we might never try an action that's actually better, simply because an unlucky early sample made it look bad. The fix is epsilon-greedy (straight from the bandit toolkit in episode #103): act greedily most of the time, but with probability epsilon take a random action to keep the data honest.
def mc_control_epsilon_greedy(env, n_episodes=500000, gamma=1.0, epsilon=0.1):
"""Find a near-optimal policy with first-visit MC control."""
Q = defaultdict(lambda: np.zeros(2)) # 2 actions: stick, hit
returns_count = defaultdict(lambda: np.zeros(2))
def policy(state):
if random.random() < epsilon:
return random.randint(0, 1)
return int(np.argmax(Q[state]))
for _ in range(n_episodes):
episode = []
state = env.reset()
done = False
while not done:
action = policy(state)
next_state, reward, done = env.step(action)
episode.append((state, action, reward))
state = next_state
G = 0.0
visited = set()
for state, action, reward in reversed(episode):
G = reward + gamma * G
if (state, action) not in visited:
visited.add((state, action))
returns_count[state][action] += 1
n = returns_count[state][action]
Q[state][action] += (G - Q[state][action]) / n
optimal_policy = {s: int(np.argmax(Q[s])) for s in Q}
return Q, optimal_policy
Q, policy = mc_control_epsilon_greedy(BlackjackEnv(), n_episodes=500000)
print("Optimal play (S=stick, H=hit), no usable ace:")
print(f"{'':>8}", end="")
for dealer in range(2, 12):
print(f"{dealer:>4}", end="")
print()
for player_sum in range(21, 11, -1):
print(f"Sum {player_sum:>2}: ", end="")
for dealer in range(2, 12):
action = policy.get((player_sum, dealer, False), 1)
print(f"{'S' if action == 0 else 'H':>4}", end="")
print()
After half a million hands, the discovered policy lines up almost perfectly with the basic strategy card every Blackjack book sells you: stick early when the dealer shows a weak card (they're likely to bust), keep hitting when they show a strong one, and the stick/hit threshold slides with the dealer's card. The thing I find genuinely lovely here is that nobody told the agent the rules. It never read a strategy guide. It played, it lost, it won, it averaged -- and basic strategy fell out of the arithmetic. That's the Monte Carlo promise made concrete.
Off-policy learning: learning from someone else's mistakes
Everything so far has been on-policy -- the policy generating the episodes is the same one we're evaluating and improving. Tidy, but limiting. Sometimes you want to learn about one policy while following a different one.
That's off-policy learning, and it splits the world in two. The behavior policy generates the data (often deliberately exploratory, or just whatever you happen to have). The target policy is the one you actually care about -- usually the sharp, greedy one you're trying to optimize. Separating them is enormously useful: you can learn an optimal policy from data collected by a clumsy exploratory one, from old logs gathering dust, from human demonstrations, or from a completely different agent.
The snag is obvious once you say it out loud. Episodes drawn from the behavior policy visit states and pick actions with the wrong frequencies -- wrong, that is, relative to what the target policy would have done. If we just averaged those returns we'd get the value of the behavior policy, not the target. We need to correct for the mismatch.
Importance sampling
The correction is importance sampling: re-weight each return by how much more (or less) likely the trajectory was under the target policy than under the behavior policy.
def off_policy_mc_evaluation(env, target_prob, behavior_policy, behavior_prob,
n_episodes=100000, gamma=1.0):
"""Evaluate a target policy from data generated by a behavior policy,
using weighted importance sampling."""
Q = defaultdict(float)
C = defaultdict(float) # cumulative importance-sampling weights
for _ in range(n_episodes):
episode = []
state = env.reset()
done = False
while not done:
action = behavior_policy(state)
next_state, reward, done = env.step(action)
episode.append((state, action, reward))
state = next_state
G = 0.0
W = 1.0 # the running importance-sampling ratio
for state, action, reward in reversed(episode):
G = reward + gamma * G
C[(state, action)] += W
Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
# if the target would never take this action, the rest is irrelevant
if target_prob(state, action) == 0:
break
W *= target_prob(state, action) / behavior_prob(state, action)
return dict(Q)
The ratio W is a product of per-step terms pi_target(a|s) / pi_behavior(a|s). Read it like this: when the target policy strongly favours an action the behavior policy only took occasionally, that episode was rare evidence about the target, so we crank its weight up. When the behavior policy took an action the target would shun, the weight collapses toward zero -- that episode tells us little about the target, so we down-weight it (and the break cuts it off entirely once the target probability hits zero).
There are two ways to use those weights. Ordinary importance sampling divides by the episode count -- unbiased, but the variance can be savage (a single huge W can wreck your estimate). Weighted importance sampling -- what the code above does -- divides by the cumulative weights C instead. That introduces a little bias but slashes the variance, and in practice it wins almost every time. When in doubt, reach for weighted.
Monte Carlo vs Dynamic Programming
So how does our new model-free tool stack up against last episode's open-book DP? A side-by-side:
| Dynamic Programming | Monte Carlo | |
|---|---|---|
| Model required? | Yes -- full MDP | No -- just episodes |
| Bootstraps? | Yes -- V(s) uses V(s') | No -- uses the actual return |
| Needs complete episodes? | No | Yes |
| Bias | None (given true model) | None (unbiased estimates) |
| Variance | Low (deterministic sweeps) | High (returns swing a lot) |
| State coverage | Every state, every sweep | Only states it actually visits |
Monte Carlo's big win is right there in the first row -- no model. And because it learns from real returns rather than from its own value estimates (no bootstrapping), it carries no bias from a half-trained value function. Its weaknesses are the flip side: returns are noisy, so variance is high, and the whole thing only works for tasks that actually terminate. An agent that runs forever never produces a return to average, and Monte Carlo just sits there waiting for an episode that never ends.
That last limitation is a real itch. What if we could learn from a single step instead of waiting for the whole episode to finish? What if we could blend Monte Carlo's learn-from-reality honesty with the bootstrapping trick that made DP so efficient? That combination is the most important idea in all of reinforcement learning, and it's exactly where we're headed next. But that's a story for another episode ;-)
So, what do you know now?
- Monte Carlo methods estimate value functions by averaging the actual returns from complete episodes -- no model of the environment needed, which makes them our first truly model-free algorithm;
- first-visit MC uses only the first occurrence of each state per episode, every-visit MC uses all of them, and both converge to V^pi given enough episodes (the practical difference is negligible);
- the incremental mean update
value += (sample - value) / countis the workhorse pattern -- it quietly becomes the learning rate of nearly every algorithm that follows; - MC control swaps V for action values Q and leans on epsilon-greedy exploration to discover optimal policies -- we watched it rediscover Blackjack basic strategy from raw experience alone;
- off-policy learning evaluates a target policy from data generated by a different behavior policy, and importance sampling (weighted, for low variance) corrects the probability mismatch;
- MC has zero bias but high variance and demands complete episodes -- limitations that the next family of methods is built specifically to overcome, by learning from single steps in stead of whole episodes.
Exercises
Exercise 1: Add a usable-ace policy comparison to the Blackjack evaluator. Run first_visit_mc_evaluation on the simple_policy and this time print the value table separately for usable_ace = True and usable_ace = False (same sums and dealer cards). Compare the two tables: for which states does holding a usable ace make the position meaningfully better, and why? (Hint: a soft hand can't bust on the next hit, so the downside is smaller.)
Exercise 2: Implement and compare ordinary vs weighted importance sampling on the same off-policy Blackjack evaluation. Use a uniform-random behavior policy and a deterministic target policy (stick on 18+). Track both estimators for a single fixed state across increasing episode counts (say 1k, 10k, 100k, 1M) and plot or print how each estimate behaves. You should see ordinary IS swinging wildly with the occasional spike, while weighted IS settles down smoothly -- the bias/variance trade-off made visible.
Exercise 3: Build a first-visit vs every-visit race. Pick a small environment where states genuinely repeat within an episode (a tiny grid world with a random policy works -- the agent will revisit cells). Run both estimators for the same number of episodes and measure how fast each one's value estimate for a chosen state approaches the answer from a very-long reference run. Does every-visit's extra data actually buy faster convergence here, or does the correlation between repeated visits eat the advantage? Report what you find.