Learn AI Series (#112) - RL for Games
Learn AI Series (#112) - RL for Games

What will I learn
- You will learn AlphaZero -- the single algorithm that mastered Chess, Go and Shogi with nothing but the rules and a stubborn habit of playing itself;
- Monte Carlo Tree Search (MCTS), the planning machine that lets a network "think ahead" by simulating thousands of possible futures before committing to a move;
- why Atari -- and one infamous game in particular -- forced the field to invent curiosity, because some worlds hand out almost no reward at all;
- how StarCraft and Dota 2 dragged RL kicking and screaming into long horizons, hidden information and action spaces the size of a small country;
- reward shaping -- how to nudge a learner with extra signals without accidentally teaching it to cheat;
- and curriculum learning, the deceptively obvious idea of starting easy and turning the difficulty knob up only when the agent has earned it.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution with NumPy and PyTorch;
- You've followed episodes #107 (DQN), #109 (PPO), #110 (model-based RL) and #111 (multi-agent self-play) -- today is where all four finally shake hands.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning?
- Learn AI Series (#103) - Multi-Armed Bandits
- Learn AI Series (#104) - Dynamic Programming
- Learn AI Series (#105) - Monte Carlo Methods
- Learn AI Series (#106) - Temporal Difference Learning
- Learn AI Series (#107) - Deep Q-Networks (DQN)
- Learn AI Series (#108) - Policy Gradient Methods
- Learn AI Series (#109) - Advanced Policy Optimization
- Learn AI Series (#110) - Model-Based Reinforcement Learning
- Learn AI Series (#111) - Multi-Agent Reinforcement Learning
- Learn AI Series (#112) - RL for Games (this post)
Learn AI Series (#112) - RL for Games
Solutions to Episode #111 Exercises
Before we let an agent loose on a chessboard, let's clear last episode's three exercises. They all reuse the IndependentAgent, IteratedPrisonersDilemma, SelfPlayTrainer and QMIXMixer classes from episode #111, so I'm assuming those are imported and in scope.
Exercise 1: Two IndependentAgents play the IteratedPrisonersDilemma for 5000 rounds, each one fed the last joint action as a 4-dim one-hot observation. Track the cooperation rate, and run it under three seeds to feel how seed-sensitive MARL really is.
import numpy as np
import random
# Assumes IndependentAgent and IteratedPrisonersDilemma from episode #111.
def joint_onehot(a1, a2):
"""Encode the previous joint action as a 4-dim one-hot observation."""
v = np.zeros(4, dtype=np.float32)
v[a1 * 2 + a2] = 1.0 # (0,0)->0 (0,1)->1 (1,0)->2 (1,1)->3
return v
def run_ipd(seed, rounds=5000):
random.seed(seed)
np.random.seed(seed)
env = IteratedPrisonersDilemma()
a1 = IndependentAgent(obs_dim=4, n_actions=2, agent_id=0)
a2 = IndependentAgent(obs_dim=4, n_actions=2, agent_id=1)
obs = np.zeros(4, dtype=np.float32) # no history on round 0
coop_history = []
for t in range(rounds):
act1 = a1.choose_action(obs)
act2 = a2.choose_action(obs)
r1, r2 = env.step(act1, act2)
nxt = joint_onehot(act1, act2)
a1.store(obs, act1, r1, nxt, False)
a2.store(obs, act2, r2, nxt, False)
a1.learn(); a2.learn()
# decay exploration so they can actually settle on a policy
a1.epsilon = max(0.02, a1.epsilon * 0.999)
a2.epsilon = max(0.02, a2.epsilon * 0.999)
coop_history.append(int(act1 == 0 and act2 == 0))
obs = nxt
return np.mean(coop_history[-500:]) # cooperation rate at the end
for s in (0, 1, 2):
print(f"seed={s}: final mutual-cooperation rate = {run_ipd(s):.2f}")
The honest result is the lesson: across the three seeds you'll typically get three different stories. One run drifts into the cosy (cooperate, cooperate) corner and stays there, another collapses into mutual defection, and a third never settles at all -- it oscillates, because each agent keeps "discovering" that defection pays right up until the other punishes it. This is non-stationarity from episode #111 made visible: there is no fixed target to converge to when your opponent is a moving part. Run it ten times and you'd get a distribution, not an answer -- which is exactly why serious MARL papers report results over many seeds and never trust a single lucky run.
Exercise 2: A minimal SelfPlayTrainer for rock-paper-scissors (one stateless step, three actions). Train with the historical opponent pool and without it, and watch the difference.
import numpy as np
import random
class RPSPolicy:
"""A stateless policy: just three logits over rock/paper/scissors."""
def __init__(self):
self.logits = np.zeros(3, dtype=np.float64)
def probs(self):
z = self.logits - self.logits.max()
e = np.exp(z)
return e / e.sum()
def act(self):
return np.random.choice(3, p=self.probs())
def beats(a, b): # +1 if a beats b, -1 if loses, 0 tie
return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)
def train_rps(use_pool, steps=20000, lr=0.05):
cur = RPSPolicy()
pool = []
for t in range(steps):
if use_pool and pool and random.random() < 0.5:
opp = random.choice(pool)
else:
opp = cur
a, b = cur.act(), opp.act()
reward = beats(a, b)
# crude policy-gradient nudge toward actions that won
p = cur.probs()
grad = -p
grad[a] += 1.0
cur.logits += lr * reward * grad
if use_pool and t % 500 == 0:
snap = RPSPolicy(); snap.logits = cur.logits.copy()
pool.append(snap)
return cur.probs()
print("no pool :", np.round(train_rps(False), 3))
print("with pool:", np.round(train_rps(True), 3))
The "with pool" run converges toward the famous (1/3, 1/3, 1/3) Nash mixture -- the only strategy that nobody can exploit -- because it's forced to stay strong against every past version of itself. The "no pool" run, always chasing the latest self, cycles instead: rock beats scissors, so it leans rock, so the copy leans paper, so it leans scissors, round and round forever. Same game, same code, one extra list of checkpoints, completely different outcome. That little pool is doing real work.
Exercise 3: A numerical check that the QMIXMixer is genuinely monotonic -- and a demonstration that the lone torch.abs is what makes it so.
import torch
# Assumes QMIXMixer from episode #111.
def monotonicity_violations(use_abs, trials=5000, n_agents=4, state_dim=8):
mixer = QMIXMixer(n_agents, state_dim)
if not use_abs:
# Sabotage the monotonicity trick: patch torch.abs to a no-op.
import builtins
orig = torch.abs
torch.abs = lambda x: x
violations = 0
for _ in range(trials):
qs = torch.randn(1, n_agents)
state = torch.randn(1, state_dim)
q_before = mixer(qs, state)
bump = qs.clone()
bump[0, 0] += 0.1 # raise ONE agent's Q-value
q_after = mixer(bump, state)
if q_after.item() < q_before.item() - 1e-6:
violations += 1
if not use_abs:
torch.abs = orig
return violations
print("violations WITH abs :", monotonicity_violations(True))
print("violations WITHOUT abs:", monotonicity_violations(False))
With the torch.abs in place you should see a flat zero violations across all 5000 trials: raise any agent's Q-value, and Q_total never drops. Strip the abs out and the violations pour in, because now some mixing weights are negative and an agent's improvement can lower the team value. That's the whole game: monotonicity is what guarantees "each agent greedily picks its own best action" matches "the team picks its best joint action". One function call, load-bearing for the entire decentralized-execution story. Nota bene -- monkey-patching torch.abs like this is a teaching hack, not something you'd ever ship ;-)
Right -- on to the main event
So. Eleven episodes ago we couldn't get an agent to balance a stick. Today we're going to talk about agents that beat the best humans on the planet at the hardest games we ever invented. And here's the thing that still gives me a little thrill after all these years: games have been the proving ground for reinforcement learning since the very beginning, and almost every big idea in modern RL was born trying to win one.
A quick roll of honour. Tesauro's TD-Gammon taught itself Backgammon back in 1992 -- in stead of copying human play, it found moves that surprised grandmasters. DeepMind's DQN (our episode #107) chewed through dozens of Atari games straight from pixels in 2015. AlphaGo toppled the world Go champion in 2016, a decade earlier than anyone sane predicted. OpenAI Five dismantled professional Dota 2 teams in 2019, and AlphaStar hit Grandmaster in StarCraft II the same year.
Why games, though? Because they are very nearly perfect laboratories: the rules are crisp, the reward is unambiguous (you won, or you didn't), the simulator runs faster than real time, and difficulty comes pre-graded from "toddler" to "world champion". And -- this is the part that matters far beyond games -- the algorithms born here travel. AlphaZero's MCTS planning grew into MuZero's general model-based RL. PPO, invented to wrangle game agents, now trains language models. The toy was always a Trojan horse for something serious.
AlphaZero: one algorithm, three games, zero human knowledge
AlphaZero (Silver et al., 2018) is, to me, one of the most beautiful results in the whole field. The same code, with no game-specific tweaks, learned Chess, Go and Shogi from scratch -- three games with wildly different boards, pieces and tactics. No opening books, no endgame tablebases, no hand-crafted features. Just the rules, and an awful lot of self-play.
The recipe has three ingredients, and that's genuinely all:
- A neural network that reads the board and outputs two things at once -- a policy (a probability over legal moves) and a value (who is winning, from -1 to +1);
- Monte Carlo Tree Search that uses that network to look ahead intelligently;
- Self-play (the episode #111 trick) that generates the training data out of thin air.
The network is a dual-headed beast sitting on a tower of residual blocks (remember those from the CNN episodes? they show up everywhere):
import torch
import torch.nn as nn
import numpy as np
import math
class ResidualBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn1 = nn.BatchNorm2d(channels)
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn2 = nn.BatchNorm2d(channels)
def forward(self, x):
residual = x
x = torch.relu(self.bn1(self.conv1(x)))
x = self.bn2(self.conv2(x))
return torch.relu(x + residual) # the skip connection
class AlphaZeroNetwork(nn.Module):
"""Dual-head network: a policy AND a value from one board state."""
def __init__(self, board_size, n_actions, n_res_blocks=5, channels=128):
super().__init__()
# 17 input planes is the Go encoding (stone history + side to move).
self.conv_input = nn.Sequential(
nn.Conv2d(17, channels, 3, padding=1),
nn.BatchNorm2d(channels), nn.ReLU(),
)
self.res_blocks = nn.ModuleList(
[ResidualBlock(channels) for _ in range(n_res_blocks)])
self.policy_head = nn.Sequential( # probability over moves
nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
nn.Flatten(),
nn.Linear(2 * board_size * board_size, n_actions),
)
self.value_head = nn.Sequential( # who is winning? (-1..1)
nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
nn.Flatten(),
nn.Linear(board_size * board_size, 256), nn.ReLU(),
nn.Linear(256, 1), nn.Tanh(),
)
def forward(self, state):
x = self.conv_input(state)
for block in self.res_blocks:
x = block(x)
return self.policy_head(x), self.value_head(x)
Two heads, one body. The shared residual tower learns to "see" the board; the policy head turns that into a hunch about good moves, and the value head into a verdict about the outcome. Having said that, the network alone is a strong intuition but a weak player -- it reacts, it doesn't plan. The planning is where MCTS comes in.
Monte Carlo Tree Search: thinking by simulating
MCTS is how AlphaZero "thinks". Rather than glance at a position and blurt out a move, it imagines hundreds or thousands of continuations, growing a search tree, and uses the network to decide which branches are worth exploring. Four steps, repeated over and over: select a promising path down the tree, expand a new leaf, evaluate it with the network, and backpropagate the verdict up to the root.
class MCTSNode:
"""One node in the search tree."""
def __init__(self, prior, parent=None):
self.prior = prior # P(a) suggested by the network
self.visit_count = 0 # N(a)
self.total_value = 0.0 # W(a)
self.children = {} # action -> child node
self.parent = parent
@property
def mean_value(self):
return 0.0 if self.visit_count == 0 else self.total_value / self.visit_count
class MCTS:
"""Monte Carlo Tree Search guided by a policy/value network."""
def __init__(self, network, n_simulations=100, c_puct=1.0):
self.network = network
self.n_simulations = n_simulations
self.c_puct = c_puct
def search(self, game_state):
root = MCTSNode(prior=0)
for _ in range(self.n_simulations):
node, state = root, game_state.clone()
# 1. SELECT: walk down using the PUCT score
while node.children and not state.is_terminal():
action, node = self._select_child(node)
state.apply_action(action)
# 2. EXPAND + EVALUATE with the network
if not state.is_terminal():
policy, value = self._evaluate(state)
for action in state.get_legal_actions():
node.children.setdefault(
action, MCTSNode(prior=policy[action], parent=node))
else:
value = state.get_reward()
# 3. BACKPROPAGATE, flipping perspective each level up
while node is not None:
node.visit_count += 1
node.total_value += value
value = -value
node = node.parent
# The move played most often during search is the move we trust.
visits = {a: c.visit_count for a, c in root.children.items()}
total = sum(visits.values())
return {a: n / total for a, n in visits.items()}
def _select_child(self, node):
"""PUCT: exploit high value, explore high prior with few visits."""
sqrt_parent = math.sqrt(node.visit_count)
best_score, best = -float("inf"), (None, None)
for action, child in node.children.items():
exploit = child.mean_value
explore = self.c_puct * child.prior * sqrt_parent / (1 + child.visit_count)
score = exploit + explore
if score > best_score:
best_score, best = score, (action, child)
return best
def _evaluate(self, state):
with torch.no_grad():
logits, value = self.network(state.to_tensor().unsqueeze(0))
policy = torch.softmax(logits, dim=1).squeeze().numpy()
return policy, value.item()
The magic is in that PUCT formula inside _select_child. It balances two urges: exploit the moves that have looked good so far (high mean_value), and explore the moves the network thinks are promising but we've barely tried (high prior, low visit_count). The network prior is what makes this tractable -- Go has roughly 250 legal moves per turn and games go 150+ moves deep, so brute-force search is hopeless. The prior whispers "don't bother with these obviously terrible moves" and the tree spends its precious simulations where they count. And notice the final return value: not the highest-value move, but the most-visited one. The move the search kept coming back to is the move it trusts, because visit count is a vote weighted by everything the search learned.
The training loop: the dog that teaches itself new tricks
Here's where the whole thing becomes a perpetual-motion machine (well, almost). Self-play with the current network generates games. Each position, paired with the MCTS visit-distribution and the eventual game result, becomes a training example. The network trains on that, gets a little sharper, which makes MCTS a little sharper, which produces better games, which... you see where this goes.
def alphazero_training_loop(game, network, n_iterations=100,
games_per_iteration=100, mcts_sims=200):
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3, weight_decay=1e-4)
for iteration in range(n_iterations):
# Phase 1: self-play generates fresh training data
training_data = []
mcts = MCTS(network, n_simulations=mcts_sims)
for _ in range(games_per_iteration):
state = game.new_game()
history = []
while not state.is_terminal():
pi = mcts.search(state)
history.append((state.to_tensor(), pi))
actions = list(pi.keys())
action = np.random.choice(actions, p=[pi[a] for a in actions])
state.apply_action(action)
# Hand the final result back to every position, alternating sign
result = state.get_reward()
for board_tensor, pi in history:
training_data.append((board_tensor, pi, result))
result = -result
# Phase 2: train on the self-play data
for epoch in range(10):
np.random.shuffle(training_data)
for start in range(0, len(training_data), 64):
batch = training_data[start:start + 64]
boards, pis, results = zip(*batch)
boards_t = torch.stack(boards)
results_t = torch.FloatTensor(results).unsqueeze(1)
logits, values = network(boards_t)
# Policy loss: cross-entropy against the MCTS visit-distribution
targets = torch.zeros_like(logits)
for i, pi in enumerate(pis):
for action, prob in pi.items():
targets[i, action] = prob
policy_loss = -(targets * torch.log_softmax(logits, dim=1)).sum(1).mean()
# Value loss: MSE against who actually won
value_loss = nn.functional.mse_loss(values, results_t)
loss = policy_loss + value_loss
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(f"Iteration {iteration}: {len(training_data)} positions generated")
The numbers are worth sitting with. AlphaZero trained roughly 9 hours on Chess, 12 hours on Shogi and 13 days on Go -- on a single machine with 4 TPUs -- and the resulting players were stronger than the best specialised engines humanity had built over decades. Stockfish for Chess, the dedicated Go programs, all of it, beaten by one general recipe that started from random play. The first time I read that paper I had to put it down and go for a walk ;-)
Beyond the board: Atari and the curse of sparse reward
Board games are tidy -- perfect information, discrete moves, take-turns. Atari is messier: real-time, raw pixels, partial information. DQN (episode #107) cracked dozens of Atari games, but a handful stayed stubbornly unsolved, and the poster child for "stubborn" is Montezuma's Revenge.
The problem is sparse reward. In Montezuma's Revenge the agent must climb down ladders, jump gaps, grab a key, cross a room, unlock a door -- hundreds of correct steps -- before a single point arrives. Plain epsilon-greedy DQN just rattles around randomly, never stumbling on that long lucky sequence, and so it never sees a reward to learn from. It's like trying to learn the piano when the only feedback is a standing ovation at the end of a concert you can't yet play.
The fix that stuck is giving the agent its own internal motivation -- a curiosity bonus for visiting states it can't yet predict:
class ICMModule(nn.Module):
"""Intrinsic Curiosity Module: reward the agent for surprises."""
def __init__(self, state_dim, action_dim, feature_dim=64):
super().__init__()
self.feature_net = nn.Sequential( # compress states into features
nn.Linear(state_dim, 128), nn.ReLU(),
nn.Linear(128, feature_dim),
)
self.forward_model = nn.Sequential( # predict next feature from this + action
nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
nn.Linear(128, feature_dim),
)
def compute_intrinsic_reward(self, state, action, next_state):
phi_s = self.feature_net(state)
phi_ns = self.feature_net(next_state)
pred = self.forward_model(torch.cat([phi_s, action], dim=-1))
# Big prediction error = big surprise = big intrinsic reward
return 0.5 * (pred - phi_ns.detach()).pow(2).sum(dim=-1)
The trick is self-balancing. States the agent can't predict are novel, so they pay a curiosity bonus and the agent goes looking. But as it explores them the forward model learns them, the prediction error shrinks, and those once-exciting states become boring -- pushing the agent toward the next unexplored frontier. It's an artificial sense of wonder that automatically fades where it's been satisfied. Count-based exploration and Random Network Distillation are cousins of this idea, and together they finally got agents through that wretched first room.
Real-time strategy: where RL meets the meat grinder
If board games are tidy and Atari is messy, StarCraft II and Dota 2 are the meat grinder. They pile on every hard thing at once: games lasting 30-45 minutes (long time horizons), a fog of war hiding the opponent (imperfect information), continuous real-time decisions, and action spaces so vast you can barely write them down.
OpenAI Five tackled Dota 2 by throwing PPO (episode #109) at the problem at a frankly absurd scale: 256 GPUs and 128,000 CPU cores, playing the equivalent of 180 years of Dota every single day for ten months. A few of the tricks that made it work:
- Patience as a hyperparameter: a Dota game is ~20,000 decisions long, so they used a discount factor of
gamma = 0.9997-- an agent that genuinely cares about a reward arising thousands of steps in the future. - Taming the action space: the raw continuous control was discretised into roughly a thousand sensible options per step, so the policy had something finite to choose from.
- Team reward: all five heroes on a team shared one reward signal -- cooperative MARL straight out of episode #111, learned at industrial scale.
AlphaStar went at StarCraft II differently: a transformer-based policy (yes, the same attention machinery from episodes #51-53) reading the game as a set of entities, trained first on human replays and then refined with self-play inside a league of deliberately diverse agents -- some aggressive, some weird, some built only to expose the main agent's blind spots. It reached Grandmaster, top 0.2% of human players, in all three races. The league is the key idea: a single self-play line can get stuck in a rut, but a whole ecosystem of rivals keeps everyone honest.
Reward shaping: helpful nudges and how they backfire
Raw game rewards are often too sparse or too delayed to learn from quickly, so we're tempted to add intermediate signals -- a little reward for picking up the key, for moving toward the goal, for staying alive. This is reward shaping, and it is gloriously easy to get wrong. Shape it carelessly and the agent will gleefully maximise your shaping signal in stead of the actual objective -- circling the key forever because approaching it pays, never bothering to pick it up.
There is exactly one form of shaping proven safe, and it's worth knowing by name: potential-based shaping (Ng et al., 1999).
def shaped_reward(state, next_state, raw_reward, gamma=0.99):
"""Potential-based shaping: provably keeps the optimal policy intact.
F(s, s') = gamma * Phi(s') - Phi(s) (Ng, Harada & Russell, 1999)."""
return raw_reward + (gamma * potential(next_state) - potential(state))
def potential(state):
"""A heuristic 'goodness' of a state -- here, closeness to the goal."""
return -np.linalg.norm(state.position - state.goal_position)
The reason this particular shape is safe is subtle and lovely: because the bonus is the difference of a potential across a transition, it telescopes away over any complete path and cancels out of every loop. So it can speed learning -- pointing the agent roughly toward the goal -- without ever changing which policy is optimal. Any other shaping scheme risks quietly rewriting the objective. In practice plenty of successful projects use non-potential shaping anyway, carefully, accepting the theoretical risk for the practical speed-up -- but they do it with their eyes open, and so should you.
Curriculum learning: don't throw the toddler in the deep end
The last idea is the most human one. Instead of dropping an agent into the full, brutal game from step one, curriculum learning starts it on an easy version and turns the difficulty up only as it earns the right:
class CurriculumScheduler:
"""Raise the difficulty as the agent's win rate clears a threshold."""
def __init__(self, difficulties, threshold=0.7):
self.difficulties = difficulties # easy -> hard
self.current_level = 0
self.threshold = threshold # win rate needed to advance
self.recent = []
def get_difficulty(self):
return self.difficulties[self.current_level]
def report_result(self, won: bool):
self.recent.append(won)
if len(self.recent) > 100:
self.recent.pop(0)
win_rate = sum(self.recent) / len(self.recent)
if win_rate > self.threshold and self.current_level < len(self.difficulties) - 1:
self.current_level += 1
self.recent = []
print(f"Advancing to level {self.current_level}")
OpenAI Five leaned on this hard: weaker opponents first, fewer heroes, simplified mechanics, then gradually the full chaos. Without a curriculum the initial exploration problem was simply too hard -- a random agent couldn't blunder into a win often enough to ever learn what winning looks like. Start easy and the early wins come freely; each one teaches just enough to survive the next rung. It's the sparse-reward problem solved with patience in stead of curiosity -- and often the two are used together.
Pull all six of these threads together and you've got the modern game-playing toolkit: a network for intuition, MCTS for planning, self-play for opponents, curiosity for exploration, shaping for guidance and a curriculum for pacing. The same kit, pointed away from games, is what we'll start aiming at messier real-world problems before long -- but every one of these ideas was forged on a scoreboard first.
In case you skimmed...
- AlphaZero mastered Chess, Go and Shogi with one algorithm -- a dual-head policy/value network guided by MCTS, trained purely by self-play, knowing nothing but the rules;
- MCTS plans by simulating thousands of futures, using the PUCT formula to balance exploiting good moves against exploring promising-but-untried ones, and trusts the most-visited move rather than the highest-scoring guess;
- sparse-reward games like Montezuma's Revenge need intrinsic motivation (curiosity) because the external reward is too rare for ordinary exploration to ever find;
- real-time strategy games (Dota 2, StarCraft II) push RL to its limits -- long horizons, hidden information, vast action spaces -- and were cracked with enormous compute, PPO, self-play leagues and a lot of engineering;
- reward shaping adds helpful signals, but only potential-based shaping is proven not to corrupt the optimal policy -- everything else is a calculated risk;
- curriculum learning starts easy and ramps difficulty as the agent improves, dissolving exploration problems that are otherwise hopeless.
Exercises
Exercise 1: Implement Tic-Tac-Toe as a tiny game object exposing clone(), is_terminal(), get_legal_actions(), get_reward(), apply_action() and to_tensor(), then run the MCTS class above on it with a random evaluator (skip the network -- just return a uniform policy and a random value). Play 200 games of pure-MCTS against a random opponent and report the win rate. You should see MCTS alone, with no learning at all, already crush random play -- planning is powerful even before training enters the picture.
Exercise 2: Take your Tic-Tac-Toe game and wire up the full alphazero_training_loop with a small AlphaZeroNetwork (shrink the board planes to fit). Train for a few dozen iterations and plot the loss. Then play the trained agent against the pure-MCTS agent from Exercise 1 using the same number of simulations. Does the network-guided search win? By how much? This is the whole AlphaZero thesis in miniature -- intuition plus search beats search alone.
Exercise 3: Bolt the ICMModule onto a DQN agent (reuse your DQN from episode #107) on MountainCar-v0 -- a classic sparse-reward environment where the car only scores by reaching the flag. Train once with the intrinsic curiosity bonus added to the reward and once without. Compare how many episodes each takes to first reach the goal. You'll feel, first-hand, why curiosity exists: the plain agent may never get there at all, while the curious one goes looking.
Get the Tic-Tac-Toe pipeline working before the next episode -- a board you can beat your own agent on is the best possible intuition for everything we've built, and the techniques here stop being academic the moment you watch your own code learn to play. The next stop takes this game-playing machinery and starts pointing it at problems that don't come with a tidy scoreboard at all.