Learn AI Series (#111) - Multi-Agent Reinforcement Learning

What will I learn
- You will learn the three flavours of multi-agent RL: cooperative (shared reward), competitive (zero-sum), and mixed (a bit of both);
- why putting more than one learner in a room breaks the comfortable assumptions we leaned on since episode #102 -- the dreaded non-stationarity problem;
- independent learners, the naive baseline, and exactly where and why it falls apart;
- CTDE (centralized training, decentralized execution) and QMIX, the workhorse of cooperative MARL;
- self-play -- how you train a champion when the only opponent worth having is a better version of yourself;
- and the genuinely magical part: emergent behaviour -- complex strategy nobody programmed, falling out of simple rewards and a little competitive pressure.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution with NumPy and PyTorch;
- You've followed episodes #102 (the RL framework), #107 (DQN) and #109 (PPO) -- this one stands on all three.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning?
- Learn AI Series (#103) - Multi-Armed Bandits
- Learn AI Series (#104) - Dynamic Programming
- Learn AI Series (#105) - Monte Carlo Methods
- Learn AI Series (#106) - Temporal Difference Learning
- Learn AI Series (#107) - Deep Q-Networks (DQN)
- Learn AI Series (#108) - Policy Gradient Methods
- Learn AI Series (#109) - Advanced Policy Optimization
- Learn AI Series (#110) - Model-Based Reinforcement Learning
- Learn AI Series (#111) - Multi-Agent Reinforcement Learning (this post)
Solutions to Episode #110 Exercises
Before we let a second agent into the room, let's clear last episode's three exercises. All of them reuse the DynaQ, EnvironmentModel and ModelTrainer classes from episode #110, so I'm assuming those are imported and in scope. As usual I'm leaning on gymnasium throughout (pip install gymnasium if you skipped it).
Exercise 1: Run tabular DynaQ on FrozenLake-v1 with is_slippery=False, train it with planning_steps = 0, 5, 50 under the same seed, plot episodes-to-solve, and explain why the benefit eventually plateaus.
import gymnasium as gym
import numpy as np

# Assumes the DynaQ class from episode #110 is imported.

def episodes_to_solve(planning_steps, seed, target=0.9, window=20, max_ep=2000):
    env = gym.make("FrozenLake-v1", is_slippery=False)
    agent = DynaQ(env.observation_space.n, env.action_space.n,
                  planning_steps=planning_steps)
    np.random.seed(seed)
    recent = []
    for ep in range(max_ep):
        state, _ = env.reset(seed=seed + ep)
        done = False
        total = 0.0
        while not done:
            action = agent.choose_action(state)
            nxt, reward, term, trunc, _ = env.step(action)
            done = term or trunc
            agent.update(state, action, reward, nxt, float(done))
            state = nxt
            total += reward
        recent.append(total)
        # Solved once the last `window` episodes average at least `target`.
        if len(recent) >= window and np.mean(recent[-window:]) >= target:
            return ep + 1  # number of episodes played so far
    return max_ep

for p in (0, 5, 50):
    solved = episodes_to_solve(p, seed=0)
    print(f"planning_steps={p:>2}: solved in {solved} episodes")
Plot solved against planning_steps and the shape tells the story: zero planning (plain Q-Learning) crawls, five planning steps cuts the episode count dramatically, fifty cuts it further -- but not proportionally further. You don't get ten times the speed-up for ten times the planning. And the reason is worth stating out loud: each real transition carries a fixed amount of genuinely new information, and planning can only ever redistribute that information faster through the Q-table. It can't manufacture facts the agent hasn't witnessed yet. Once the existing experience has been fully propagated, extra planning steps are just re-deriving conclusions the table already holds. That is why the curve flattens -- planning is leverage on what you know, not a substitute for going out and learning more.
Exercise 2: Collect a few thousand random-policy transitions from CartPole-v1, train the EnvironmentModel, then measure compounding error: roll the model forward k steps and compare against the true environment for k = 1, 5, 10, 20.
import gymnasium as gym
import numpy as np
import torch

# Assumes EnvironmentModel and ModelTrainer from episode #110.

def onehot(a, n):
    v = np.zeros(n, dtype=np.float32)
    v[a] = 1.0
    return v

env = gym.make("CartPole-v1")
n_act = env.action_space.n
model = EnvironmentModel(env.observation_space.shape[0], n_act)
trainer = ModelTrainer(model)

# 1. Gather random experience and train the dynamics model.
s, _ = env.reset(seed=0)
for _ in range(5000):
    a = env.action_space.sample()
    ns, r, term, trunc, _ = env.step(a)
    trainer.add_transition(s, onehot(a, n_act), r, ns)
    s = ns if not (term or trunc) else env.reset()[0]
for _ in range(3000):
    trainer.train_step()

# 2. Roll the model forward k steps and compare to the real env.
for k in (1, 5, 10, 20):
    errors = []
    for trial in range(200):
        s, _ = env.reset(seed=1000 + trial)
        pred = torch.FloatTensor(s).unsqueeze(0)
        real = s
        steps_done = 0
        for _ in range(k):
            a = env.action_space.sample()
            a_t = torch.FloatTensor(onehot(a, n_act)).unsqueeze(0)
            with torch.no_grad():
                pred, _ = model(pred, a_t)          # imagined step
            real, _, term, trunc, _ = env.step(a)   # real step
            steps_done += 1
            if term or trunc:
                break
        # Only score rollouts that really lasted k steps, so episodes that
        # ended early don't drag the k-step error down.
        if steps_done == k:
            errors.append(np.linalg.norm(pred.squeeze().numpy() - real))
    print(f"k={k:>2}: mean state error = {np.mean(errors):.4f}")
Watch the printed errors climb as k grows, and not linearly -- they accelerate. This is the 0.95 ** k snowball from episode #110, except now you've measured it on a real environment instead of trusting my arithmetic. The model's one-step prediction is excellent; its twenty-step prediction is a fairy tale. Feeding the model its own output as the next input means every small mistake becomes the foundation for the next prediction, and the foundation rots. This is exactly why serious model-based agents keep their imagined rollouts short -- the dream is only trustworthy near the present.
Exercise 3: Build an ensemble of five EnvironmentModels trained on the same data with different seeds, compute the variance of their predictions per state, and argue how you'd use that variance as an uncertainty signal.
import torch
import numpy as np

# Assumes EnvironmentModel, ModelTrainer, and a filled `trainer.buffer` in scope.

def train_one(seed, state_dim, n_act, steps=3000):
    torch.manual_seed(seed)
    m = EnvironmentModel(state_dim, n_act)
    t = ModelTrainer(m)
    t.buffer = trainer.buffer  # share the same real data
    for _ in range(steps):
        t.train_step()
    return m

state_dim = env.observation_space.shape[0]
ensemble = [train_one(seed, state_dim, n_act) for seed in range(5)]

# For a batch of states + actions, measure how much the 5 models DISAGREE.
states = torch.FloatTensor(np.array([env.reset(seed=s)[0] for s in range(64)]))
actions = torch.stack([torch.FloatTensor(onehot(env.action_space.sample(), n_act))
                       for _ in range(64)])
with torch.no_grad():
    preds = torch.stack([m(states, actions)[0] for m in ensemble])  # (5, 64, dim)
disagreement = preds.var(dim=0).mean(dim=-1)  # per-state variance
print("most uncertain states:", disagreement.topk(5).indices.tolist())
The trick is that five models trained on the same data will agree closely where that data was dense (they've all seen plenty of evidence) and scatter wildly where it was sparse (each one is guessing, and they guess differently). High variance across the ensemble is therefore a cheap, self-supervised flag for "we are off the edge of the map here". You'd use it as a brake: while imagining a rollout, watch the ensemble disagreement, and the moment it spikes past some threshold, stop trusting the dream and either truncate the rollout or fall back to a model-free value estimate. No labels required -- the disagreement is the uncertainty. That little idea is the difference between a model-based agent that quietly hallucinates jackpots and one that knows when to keep its mouth shut ;-)
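To make the "brake" idea concrete, here is a minimal sketch in plain Python. It is not the episode #110 machinery: the five "models" are hypothetical one-line stand-ins (each just multiplies a 1-D state by a slightly different slope), and `guarded_rollout`, `SLOPES` and the threshold values are names and numbers I made up for illustration. The only point is the control flow: imagine forward, watch the disagreement, stop when it spikes.

```python
import statistics

# Hypothetical stand-ins for five learned dynamics models: each predicts the
# next 1-D state as slope * state. They agree closely for small states and
# fan out as the state grows, mimicking an ensemble leaving its training data.
SLOPES = [1.10, 1.12, 1.08, 1.15, 1.05]
ensemble = [lambda s, k=k: k * s for k in SLOPES]


def guarded_rollout(start_state, max_steps, threshold):
    """Imagine forward until the ensemble disagreement crosses `threshold`."""
    state = start_state
    trajectory = [state]
    for _ in range(max_steps):
        predictions = [model(state) for model in ensemble]
        disagreement = statistics.pvariance(predictions)
        if disagreement > threshold:
            break  # stop trusting the dream; fall back to a model-free estimate
        state = statistics.fmean(predictions)  # step with the ensemble mean
        trajectory.append(state)
    return trajectory


short = guarded_rollout(start_state=1.0, max_steps=50, threshold=0.01)
long_ = guarded_rollout(start_state=1.0, max_steps=50, threshold=1.0)
print(len(short) - 1, "imagined steps with a strict threshold")
print(len(long_) - 1, "imagined steps with a loose threshold")
```

A stricter threshold gives a shorter, more trustworthy dream. In a real agent you would tune that threshold against the disagreement you measure on states you know the models have seen.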
On to today's episode
Right -- so notice what every single thing we've built since episode #102 has in common. Q-Learning, SARSA, DQN, REINFORCE, Actor-Critic, PPO, the model-based machinery from last time -- all of it assumes one agent in one world. The agent acts, the world responds, the agent learns. The world is a patient, indifferent backdrop that doesn't have opinions about you.
Today we tear that assumption up. Because the most interesting problems on the planet are not one agent against a backdrop -- they're many agents against each other and alongside each other. Traffic is thousands of drivers, each optimising their own commute. Markets are millions of traders, each trying to outwit the rest. Every board game ever invented is a fight. Even a warehouse full of robots fetching boxes is a team that has to not crash into each other. Multi-agent reinforcement learning (MARL) is what happens when you put more than one learner in the room, and -- fair warning -- it is gloriously, exasperatingly harder than what came before.
The one problem that makes everything hard
There's a single core difficulty in MARL, and if you understand it you understand why the whole field looks the way it does. It's called non-stationarity, and here's the plain-English version.
In single-agent RL, the environment has fixed rules. Pull lever A, get outcome X -- maybe with some randomness, but the distribution of outcomes doesn't change underneath you. That stability is exactly what lets Q-Learning converge: you're estimating a fixed target, like surveying a mountain that stays put while you measure it.
Now drop a second learning agent into that environment. From your perspective, that other agent is part of the environment -- and it is changing its behaviour as it learns. The rules you're trying to estimate are shifting every time your opponent improves. Your best response depends on their strategy; their best response depends on yours; and both are moving. You're surveying a mountain that rearranges itself every time you take a reading. That is non-stationarity, and it quietly breaks the convergence guarantees we relied on. Having said that, it's also what makes MARL fascinating -- the agents are co-authoring the problem they're solving.
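If you'd like to see the mountain move, here is a tiny sketch using only the standard library. Assumptions are mine: the game is matching pennies, the "learning" opponent is faked by switching its heads-probability halfway through, and `play_round`, `q_heads` and the step size are illustrative names and values. We estimate the value of one fixed action with the same running-average update we've used since the bandit episode.

```python
import random

random.seed(0)


def play_round(my_action, opp_heads_prob):
    """Matching pennies for the matcher: +1 if the actions match, else -1."""
    opp_action = 0 if random.random() < opp_heads_prob else 1
    return 1.0 if my_action == opp_action else -1.0


q_heads = 0.0  # running estimate of the value of playing heads (action 0)
alpha = 0.05   # step size of the running average
history = []
for t in range(4000):
    # Fake a "learning" opponent: mostly heads early on, mostly tails later.
    opp_heads_prob = 0.9 if t < 2000 else 0.1
    reward = play_round(0, opp_heads_prob)
    q_heads += alpha * (reward - q_heads)
    history.append(q_heads)

print(f"estimate before the opponent changed: {history[1999]:+.2f}")
print(f"estimate after the opponent changed:  {history[3999]:+.2f}")
```

The action never changed, and neither did the rules of the game. Only the other player did, and that alone flipped the sign of the value we were trying to estimate.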
Three settings: friend, foe, and "it's complicated"
Multi-agent problems sort into three buckets, and the bucket decides almost everything about which algorithm you reach for.
Cooperative -- all agents share one reward. A team of robots carrying a heavy table: everybody wins or everybody fails, together. The challenge here isn't conflict, it's coordination and credit assignment (when the team scores, whose actions deserve the credit?).
Competitive -- zero-sum. One agent's gain is exactly another's loss. Chess, Go, poker, two-player anything. What's good for you is by definition bad for your opponent.
Mixed -- partially aligned, partially opposed. This is most of real life. On a motorway everyone shares the goal of arriving safely, but each driver would also quite like to merge in front of you. Cooperation and competition tangled together.
Keep these three in mind, because each of the techniques below is really an answer to "which bucket are we in?".
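For two-player matrix games you can even check the bucket mechanically. The helper below is a rough sketch of my own (the name `classify_game` and the three toy payoff tables are illustrative, not from any library): identical rewards mean cooperative, a constant reward sum means competitive, and anything else is mixed.

```python
def classify_game(payoffs):
    """payoffs maps (action_1, action_2) -> (reward_1, reward_2)."""
    rewards = list(payoffs.values())
    if all(r1 == r2 for r1, r2 in rewards):
        return "cooperative"          # everyone always gets the same reward
    total = rewards[0][0] + rewards[0][1]
    if all(r1 + r2 == total for r1, r2 in rewards):
        return "competitive"          # constant-sum (zero-sum when total == 0)
    return "mixed"


coordination = {(0, 0): (1, 1), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (1, 1)}
matching_pennies = {(0, 0): (1, -1), (0, 1): (-1, 1), (1, 0): (-1, 1), (1, 1): (1, -1)}
prisoners_dilemma = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

for name, game in [("coordination", coordination),
                   ("matching pennies", matching_pennies),
                   ("prisoner's dilemma", prisoners_dilemma)]:
    print(f"{name}: {classify_game(game)}")
```

Real environments rarely hand you a payoff table this tidy, but asking "do the rewards move together, against each other, or neither?" is still the first question worth answering.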
Independent learners: the naive baseline
The simplest thing you can possibly do is pretend the problem isn't multi-agent. Give each agent its own single-agent algorithm and let them all learn in parallel, blissfully unaware of one another. Agent 1 runs DQN, Agent 2 runs DQN, and neither one knows the other exists -- the other agents just look like a slightly weird, twitchy environment.
import torch
import torch.nn as nn
import numpy as np
from collections import deque
import random

class IndependentAgent:
    """Each agent learns on its own with a private Q-network (independent DQN)."""
    def __init__(self, obs_dim, n_actions, agent_id=0, lr=1e-3, gamma=0.99,
                 epsilon_min=0.05, epsilon_decay=0.999):
        self.agent_id = agent_id
        self.obs_dim = obs_dim
        self.n_actions = n_actions
        self.gamma = gamma
        self.epsilon = 1.0                  # start fully exploratory
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.q_net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.buffer = deque(maxlen=50000)

    def choose_action(self, obs):
        if random.random() < self.epsilon:
            return random.randint(0, self.n_actions - 1)
        with torch.no_grad():
            q = self.q_net(torch.FloatTensor(obs))
        return q.argmax().item()

    def store(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def learn(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return
        batch = random.sample(self.buffer, batch_size)
        obs, acts, rews, next_obs, dones = zip(*batch)
        obs_t = torch.FloatTensor(np.array(obs))
        acts_t = torch.LongTensor(acts)
        rews_t = torch.FloatTensor(rews)
        next_obs_t = torch.FloatTensor(np.array(next_obs))
        dones_t = torch.FloatTensor(dones)
        q_vals = self.q_net(obs_t).gather(1, acts_t.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # No separate target network here, to keep the baseline short.
            next_q = self.q_net(next_obs_t).max(dim=1)[0]
            targets = rews_t + self.gamma * next_q * (1 - dones_t)
        loss = nn.functional.mse_loss(q_vals, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Decay exploration; without this the greedy branch is never used.
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
And here's the honest truth: this sometimes works. For cooperative tasks where the agents barely interfere with each other, independent learning finds perfectly decent policies, and it's wonderfully simple to implement. But it sits on a cracked foundation -- the experience-replay buffer is the giveaway. A transition you stored ten thousand steps ago was generated against an old version of the other agents, who have since changed. You're learning from stale snapshots of a world that no longer exists. In competitive settings independent learners frequently cycle forever, each one chasing the other's last move like two cats spinning after their tails. It's the right place to start, and the wrong place to stop.
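The stale-snapshot problem is easy to reproduce without any neural network at all. In this sketch (my own toy setup: a two-action coordination game, a scripted teammate that changes its habit at step 4000, and hypothetical helper names) we fill a replay buffer and then compare what the buffer "believes" about an action with what is true right now.

```python
import random
from collections import deque

random.seed(1)
buffer = deque(maxlen=5000)


def teammate_action(step):
    # Hypothetical teammate: goes left (0) early in training, then learns
    # that right (1) is better and switches for good.
    return 0 if step < 4000 else 1


for step in range(5000):
    my_action = random.randint(0, 1)
    # Coordination game: reward 1 only when both agents pick the same side.
    reward = 1.0 if my_action == teammate_action(step) else 0.0
    buffer.append((my_action, reward))


def mean_reward(samples, action):
    rewards = [r for a, r in samples if a == action]
    return sum(rewards) / len(rewards)


replayed = mean_reward(buffer, action=0)              # what the buffer says
recent = mean_reward(list(buffer)[-500:], action=0)   # what is true today
print(f"buffer-wide value of 'left': {replayed:.2f}")
print(f"current value of 'left':     {recent:.2f}")
```

A learner sampling uniformly from that buffer keeps getting told that "left" pays well, long after the teammate stopped going left. Shrinking the buffer or down-weighting old transitions are common patches, at the cost of sample efficiency.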
Centralized Training, Decentralized Execution
The idea that rescued cooperative MARL is a mouthful but the intuition is lovely: centralized training, decentralized execution (CTDE). During training -- when you're offline, in a simulator, with a god's-eye view -- you let the learning algorithm see everything: every agent's observation, every agent's action, the full global state. But the policies you actually deploy each see only their own local observation. Train together with full information; act alone with partial information. You get the coordination benefits of shared knowledge without the impossible requirement that, at game time, every robot can read every other robot's mind.
The flagship CTDE algorithm for cooperative tasks is QMIX. Each agent keeps its own little Q-network over its local observation. Then a mixing network combines all those individual Q-values into one global Q-value -- and crucially, it's constrained to be monotonic in each agent's Q. Why monotonic? Because then "the action that's best globally" decomposes neatly into "each agent picks the action that's best locally", which is exactly what you need if each agent has to act on its own at execution time.
class QMIXAgent:
    """One agent's private utility network in QMIX."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        self.q_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def get_q_values(self, obs):
        return self.q_net(obs)

class QMIXMixer(nn.Module):
    """Mixing network: folds individual Q-values into one Q_total.
    Weights are forced non-negative -> the monotonicity constraint."""
    def __init__(self, n_agents, state_dim, mixing_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks generate the mixing weights FROM the global state.
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_dim, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, n_agents * mixing_dim),
        )
        self.hyper_b1 = nn.Linear(state_dim, mixing_dim)
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_dim, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, mixing_dim),
        )
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, 1),
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) - each agent's chosen Q-value
        # state:    (batch, state_dim) - the global state
        batch_size = agent_qs.size(0)
        # torch.abs keeps the weights non-negative -> monotonic mixing.
        w1 = torch.abs(self.hyper_w1(state)).view(batch_size, self.n_agents, -1)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch_size, -1, 1)
        b2 = self.hyper_b2(state).unsqueeze(1)
        q_total = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        # view(-1) keeps the batch dimension even when batch_size == 1.
        return q_total.view(-1)               # one scalar per batch element
Two things deserve a second look. First, the mixing weights aren't fixed parameters -- they're generated from the global state by little hypernetworks (networks that output the weights of another network). That means the way agents' contributions are combined can change depending on the situation, which is exactly what you want: in some states agent 1 matters more, in others agent 3 does. Second, the torch.abs on every weight is the entire monotonicity trick in one function call. Non-negative weights (passed through a non-decreasing activation like ReLU) mean "if any agent's Q goes up, Q_total never goes down", and that's the property that lets each agent's local greedy choice line up with the joint action that maximises Q_total. Elegant, no? The hard global coordination problem gets smuggled into training, and execution stays trivially simple.
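Here is a brute-force sanity check of that argmax claim, in plain Python. To keep it short I've swapped the two-layer QMIX mixer for a hypothetical one-layer linear mixer (`make_mixer` is my own helper, and the Q-values are random placeholders), so read it as an illustration of the principle rather than a test of the class above.

```python
import itertools
import random

random.seed(0)
n_agents, n_actions = 3, 4

# Placeholder per-agent utilities Q_i(a_i) for one fixed set of observations.
agent_qs = [[random.uniform(-1, 1) for _ in range(n_actions)]
            for _ in range(n_agents)]


def make_mixer(weights, bias):
    """One-layer stand-in for the mixing network: a weighted sum plus bias."""
    def mix(chosen_qs):
        return sum(w * q for w, q in zip(weights, chosen_qs)) + bias
    return mix


def joint_argmax(mix):
    """Centralized answer: enumerate every joint action, keep the best."""
    return max(
        itertools.product(range(n_actions), repeat=n_agents),
        key=lambda joint: mix([agent_qs[i][a] for i, a in enumerate(joint)]),
    )


# Decentralized answer: every agent greedily maximises its own Q.
local_greedy = tuple(
    max(range(n_actions), key=lambda a: agent_qs[i][a])
    for i in range(n_agents)
)

monotonic = make_mixer([0.5, 2.0, 1.3], bias=-0.7)  # all weights positive
broken = make_mixer([0.5, -2.0, 1.3], bias=-0.7)    # one negative weight

print("monotonic mixer agrees with local greedy:",
      joint_argmax(monotonic) == local_greedy)
print("broken mixer agrees with local greedy:   ",
      joint_argmax(broken) == local_greedy)
```

With positive weights the 64-way joint search and the three independent 4-way searches land on the same joint action. Flip one weight negative and the best joint action now wants agent 2 to pick its worst local action, which no agent acting alone would ever do.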
Self-play: become your own sparring partner
Cooperative settled, let's talk fighting. Competitive games have an awkward chicken-and-egg problem: to train a strong agent you need a strong opponent, but where do you get one before you've trained anything? Worse, if you train against a fixed opponent, your agent doesn't learn to play well -- it learns to beat that one specific opponent, exploiting its particular quirks. Show it a different style and it falls apart.
Self-play is the way out, and it's beautiful in its simplicity: the agent trains against copies of itself.
class SelfPlayTrainer:
    """Train an agent by pitting it against past versions of itself.

    Assumes agent_class(obs_dim, n_actions) builds an agent exposing q_net,
    epsilon, choose_action, store and learn, and a two-player env exposing
    reset(), get_obs(player=...) and step(action_1, action_2).
    """
    def __init__(self, agent_class, obs_dim, n_actions):
        self.agent_class = agent_class
        self.obs_dim = obs_dim
        self.n_actions = n_actions
        self.current_agent = agent_class(obs_dim, n_actions)
        self.opponent_pool = []    # frozen snapshots of past selves
        self.save_interval = 100   # snapshot every N episodes

    def _clone_current(self):
        """Frozen copy of the current agent (same weights, same exploration)."""
        clone = self.agent_class(self.obs_dim, self.n_actions)
        clone.q_net.load_state_dict(self.current_agent.q_net.state_dict())
        clone.epsilon = self.current_agent.epsilon
        return clone

    def get_opponent(self):
        """Sample an opponent: usually the latest self, sometimes an old one."""
        if not self.opponent_pool or random.random() < 0.8:
            return self._clone_current()
        return random.choice(self.opponent_pool)

    def train_episode(self, env):
        opponent = self.get_opponent()
        env.reset()
        done = False
        while not done:
            obs_1 = env.get_obs(player=1)
            action_1 = self.current_agent.choose_action(obs_1)
            obs_2 = env.get_obs(player=2)
            action_2 = opponent.choose_action(obs_2)
            _, rewards, done = env.step(action_1, action_2)
            # Only the current agent learns; the opponent is frozen.
            self.current_agent.store(obs_1, action_1, rewards[0],
                                     env.get_obs(player=1), done)
            self.current_agent.learn()

    def maybe_snapshot(self, episode):
        if episode % self.save_interval == 0:
            self.opponent_pool.append(self._clone_current())
This is the engine behind AlphaGo and its descendants, and AlphaGo Zero ran on it from a standing start. The agent starts out hopeless, plays itself, gets slightly less hopeless, plays the improved version, improves again -- a bootstrap that climbed all the way to superhuman Go. Nota bene the opponent_pool: keeping a stable of historical selves, not just the latest one, is what helps stop the training from cycling. If you only ever play the current version, the two of you can fall into a rock-paper-scissors loop where each "beats" the last without anyone actually getting better. Forcing the agent to stay strong against its past versions is what pushes the improvement forward instead of round in circles. It's a surprisingly deep little detail dressed up as a list of checkpoints.
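You can see the loop, and why a pool defuses it, without training anything. The sketch below is plain Python on the rock-paper-scissors payoff matrix; `best_response` and `pure` are helper names I made up, and "always best-respond to your previous self" is a deliberately crude caricature of latest-only self-play.

```python
# Row player's payoff in rock-paper-scissors: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock / paper / scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def best_response(opp_mix):
    """Best pure action, and its expected payoff, against a mixed opponent."""
    values = [sum(PAYOFF[a][b] * p for b, p in enumerate(opp_mix))
              for a in range(3)]
    best = max(range(3), key=values.__getitem__)
    return best, values[best]


def pure(action):
    return [1.0 if i == action else 0.0 for i in range(3)]


# Latest-only self-play: each new version just beats the previous one.
action = 0
chain = [action]
for _ in range(6):
    action, _ = best_response(pure(action))
    chain.append(action)
print("latest-only chain:", chain)

# A pool holding one rock, one paper and one scissors snapshot, sampled
# uniformly, leaves no single counter-move that comes out ahead.
_, edge_vs_pool = best_response([1 / 3, 1 / 3, 1 / 3])
print("best achievable edge against the pool:", edge_vs_pool)
```

Each version in the chain "wins" against its predecessor, yet after three generations you are back where you started. Against the spread-out pool there is nothing to exploit, so the only way to stop losing is to stop being exploitable yourself.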
Learning to talk
In cooperative settings agents often do better if they can share what they each see -- but here's the lovely question: what should they actually say? Rather than hand-designing a communication protocol like some 1970s networking committee, we can just let the agents learn their own language by making the messages differentiable and training them end-to-end with the rest of the policy.
class CommAgent(nn.Module):
    """An agent that learns to emit and consume messages."""
    def __init__(self, obs_dim, n_actions, msg_dim=16, hidden=64):
        super().__init__()
        self.msg_dim = msg_dim
        self.obs_encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
        )
        self.msg_generator = nn.Sequential(  # what to broadcast to teammates
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, msg_dim),
        )
        self.policy = nn.Sequential(         # act on obs + received messages
            nn.Linear(hidden + msg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def encode(self, obs):
        return self.obs_encoder(obs)

    def generate_message(self, encoded_obs):
        return self.msg_generator(encoded_obs)

    def act(self, encoded_obs, received_messages):
        if len(received_messages) > 0:
            msg_aggregate = torch.stack(received_messages).mean(dim=0)
        else:
            # Match any leading batch dimensions so the concat below works.
            msg_aggregate = torch.zeros(*encoded_obs.shape[:-1], self.msg_dim)
        combined = torch.cat([encoded_obs, msg_aggregate], dim=-1)
        return self.policy(combined)
Because the whole pipeline -- encode, message, act -- is differentiable, the gradient from "did we win?" flows all the way back into "what should I have said?". The agents discover, on their own, which signals are worth sending. CommNet and TarMAC are the well-known architectures here, and the eyebrow-raising finding from this line of work is that agents develop their own private languages -- communication patterns no human designed and that we sometimes can't even decode, yet which coordinate the team beautifully. They invent a vocabulary because the vocabulary helps them win. Wowzers.
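To see that gradient path with nothing hidden behind autograd, here is a deliberately tiny hand-rolled version: one scalar "sender" weight, one scalar "receiver" weight, and the chain rule written out by hand. Everything here (`w_send`, `w_recv`, the squared-error team loss, the learning rate) is my own toy setup, not CommNet or TarMAC. Only the sender sees `x`, only the receiver acts, and the team is scored on the receiver's action.

```python
import random

random.seed(0)

w_send = 0.1  # sender:   message = w_send * x   (x is seen ONLY by the sender)
w_recv = 0.1  # receiver: action  = w_recv * message
lr = 0.05


def team_loss(x):
    """Squared error between the receiver's action and the hidden value x."""
    return (w_recv * (w_send * x) - x) ** 2


loss_before = team_loss(0.5)

for _ in range(2000):
    x = random.uniform(-1, 1)
    message = w_send * x
    action = w_recv * message
    error = action - x
    # Chain rule by hand. The sender's gradient passes THROUGH the message:
    # d(loss)/d(w_send) = d(loss)/d(action) * d(action)/d(message) * d(message)/d(w_send)
    grad_recv = 2 * error * message
    grad_send = 2 * error * w_recv * x
    w_recv -= lr * grad_recv
    w_send -= lr * grad_send

loss_after = team_loss(0.5)
print(f"team loss before: {loss_before:.4f}, after: {loss_after:.6f}")
print(f"w_send * w_recv = {w_send * w_recv:.3f}")
```

The sender was never told what a "good message" looks like. It only ever received the team's error, routed backwards through the receiver, and that was enough for the pair to agree on an encoding.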
Emergent behaviour: the part that gives me goosebumps
Here's my favourite thing in all of MARL, and it's not an algorithm -- it's a phenomenon. When you put learning agents under competitive pressure, strategies emerge that nobody designed, wrote down, or anticipated.
The cleanest demonstration is OpenAI's hide-and-seek experiment from 2019. Two teams, hiders and seekers, a handful of moveable boxes and ramps, and dirt-simple rewards: hiders score for staying unseen, seekers score for spotting them. That's it. Over hundreds of millions of episodes, the agents climbed an entire ladder of strategy on their own:
- Seekers learn to chase; hiders learn to flee.
- Hiders learn to drag boxes into doorways and barricade themselves in.
- Seekers learn to shove ramps against the walls and climb over the barricades.
- Hiders learn to grab the ramps first and lock them away so seekers can't use them.
- Seekers discover a physics exploit and learn to "box-surf" over the walls anyway.
Nobody programmed a single one of those moves. Each was invented by the agents, and each invention was a response to the previous one -- an arms race, run in silico, producing tool-use and counter-strategy out of nothing but competitive pressure and a scoreboard. The same fundamental story produced superhuman Go, StarCraft and Dota agents. This, to me, is the whole promise of MARL in one experiment: you don't engineer the cleverness, you engineer the pressure, and the cleverness grows.
When self-interest fights the common good
Mixed settings give us social dilemmas -- situations where what's rational for the individual is ruinous for the group. The textbook case is the Prisoner's Dilemma, and it's worth coding up because MARL agents reproduce its famous lesson on their own.
class IteratedPrisonersDilemma:
    """Two agents play repeated Prisoner's Dilemma. 0 = cooperate, 1 = defect."""
    def __init__(self):
        self.payoffs = {
            (0, 0): (3, 3),  # both cooperate -> mutual reward
            (0, 1): (0, 5),  # I cooperate, you defect -> I'm the sucker
            (1, 0): (5, 0),  # I defect, you cooperate -> I exploit you
            (1, 1): (1, 1),  # both defect -> mutual punishment
        }

    def step(self, action_1, action_2):
        return self.payoffs[(action_1, action_2)]
Play this once and cold logic says defect: whatever the other does, you score higher by defecting. Two rational agents both reason this way, both defect, and walk away with (1, 1) -- when they could have had (3, 3) by cooperating. Individual rationality, collective stupidity. But play it repeatedly, and something hopeful can happen: MARL agents are able to learn to cooperate, because mutual cooperation (3+3 every round) crushes mutual defection (1+1 every round) over a long game. In the good runs they rediscover something very like tit-for-tat all by themselves -- open friendly, punish a defection, forgive once the other returns to cooperating. It isn't guaranteed (plenty of runs slide into mutual defection instead), but when it works, the shadow of the future teaches selfish agents to be decent. There's a whole essay about human society hiding in that payoff table, but I'll leave that one to you ;-)
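Before any learning enters the picture, it helps to see the arithmetic with two hand-written strategies. This is a small scripted sketch, not a MARL experiment: `tit_for_tat`, `always_defect` and `play` are my own helper names, and the payoffs are copied from the table above.

```python
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}  # 0 = cooperate, 1 = defect


def always_defect(opp_history):
    return 1


def tit_for_tat(opp_history):
    # Open friendly, then copy whatever the opponent did last round.
    return opp_history[-1] if opp_history else 0


def play(strat_a, strat_b, rounds=100):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strat_a(hist_b)  # each strategy sees the OTHER player's past moves
        b = strat_b(hist_a)
        reward_a, reward_b = PAYOFFS[(a, b)]
        score_a += reward_a
        score_b += reward_b
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b


print("TFT vs TFT:        ", play(tit_for_tat, tit_for_tat))      # (300, 300)
print("defect vs defect:  ", play(always_defect, always_defect))  # (100, 100)
print("TFT vs defect:     ", play(tit_for_tat, always_defect))    # (99, 104)
```

The defector does edge out tit-for-tat head to head, 104 to 99, but look at what it bought: both end up three times poorer than a pair of tit-for-tat players. That gap is the incentive a learning agent can latch onto in the repeated game.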
Why MARL is genuinely hard (the sobering bit)
I don't want to oversell this. MARL is difficult, and not just philosophically. The joint action space explodes: 2 agents with 5 actions each give you 25 joint actions; 10 agents give you 5 to the power of 10, which is just under 10 million (9,765,625). Credit assignment turns brutal -- the team won, lovely, but which agent's choices actually mattered? Non-stationarity, as we said up top, quietly voids the convergence guarantees. And training can be maddeningly unstable, with agents chasing each other round in circles.
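The explosion is worth printing once, just to feel it. A two-line sketch (`joint_actions` is my own helper name; the 5 actions per agent come from the example above):

```python
def joint_actions(n_agents, n_actions=5):
    """Number of distinct joint actions when every agent picks independently."""
    return n_actions ** n_agents


for n in (2, 5, 10, 20):
    print(f"{n:>2} agents -> {joint_actions(n):,} joint actions")
```

This is exactly why a single centralized Q-function over joint actions stops being an option very quickly, and why factorised approaches like the per-agent networks in QMIX exist.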
The state of the art handles up to a few hundred agents in simplified worlds. The genuinely enormous problems -- traffic with millions of vehicles, a real financial market -- remain out of reach for end-to-end MARL, though clever approximations chip away at the edges. So treat this episode as a map of the terrain, not a claim that the terrain is conquered. Apart from the headline successes (games, mostly), there's a lot of open ground here.
So, what do you know now?
- Multi-agent RL puts more than one learner in the same world, and its defining headache is non-stationarity -- each agent's environment keeps shifting because the other agents are learning too;
- problems come in three flavours -- cooperative (shared reward), competitive (zero-sum), mixed (both at once) -- and the flavour dictates the method;
- independent learners just run a single-agent algorithm per agent: dead simple, sometimes fine, but built on the cracked assumption of a stationary world;
- CTDE trains with global information but executes on local observations only -- QMIX does this for cooperative tasks with a monotonic mixing network so decentralized greedy actions still maximise the learned Q_total;
- self-play manufactures an ever-improving opponent out of your own past selves -- the AlphaGo recipe, with a pool of historical checkpoints to stop it cycling;
- agents can learn to communicate, inventing private "languages" that coordinate them, and -- the magical part -- sophisticated strategy emerges from simple rewards under competitive pressure, as hide-and-seek, Go and the Prisoner's Dilemma all show.
Exercises
Exercise 1: Take two IndependentAgents and have them play the IteratedPrisonersDilemma for, say, 5000 rounds, feeding each agent the last joint action as its observation (a 4-dim one-hot). Track the cooperation rate over time. Do your independent learners drift toward mutual cooperation, mutual defection, or something oscillating? Run it under three different seeds and report whether the outcome is stable -- this is your first taste of how seed-sensitive MARL can be.
Exercise 2: Implement a minimal SelfPlayTrainer for a trivial symmetric game (rock-paper-scissors is perfect -- one stateless step, three actions). Train with the historical opponent_pool and without it (always playing the latest self). Plot the action distribution over training for both. You should be able to see the no-pool version cycling through rock -> paper -> scissors -> rock, while the pooled version settles toward the uniform (1/3, 1/3, 1/3) mix that no opponent can exploit.
Exercise 3: Using the QMIXMixer, write a tiny numerical check of the monotonicity property. Generate random agent_qs and a random state, compute Q_total, then nudge one agent's Q-value upward by a small amount and recompute. Confirm Q_total never decreases, across a few thousand random trials. Then -- the interesting part -- temporarily delete the torch.abs calls and show the property breaks. You'll have demonstrated, with your own code, exactly why that one function call is load-bearing.
The communication and emergence ideas we just met don't stay theoretical for long -- they're precisely what powers the agents that have started beating the world's best human players at games we once thought were safely ours. That's where we're headed next: taking this multi-agent machinery and pointing it at genuine, hard games, where self-play and emergent strategy stop being curiosities and start winning tournaments. Get the independent-learner and self-play exercises under your fingers now, because the next stop assumes you've felt these dynamics yourself, not just read about them.