Learn AI Series (#109) - Advanced Policy Optimization

What will I learn
- You will learn what a trust region is, and why bounding how far a policy may move in one update is the single idea that turns flaky policy gradients into something you can actually trust;
- TRPO -- Trust Region Policy Optimization -- and the KL-divergence constraint that sits at its heart;
- PPO -- Proximal Policy Optimization -- the clipped-objective workhorse that quietly runs most of modern reinforcement learning;
- GAE (Generalized Advantage Estimation), the dial that trades bias against variance when you estimate advantages;
- how to build PPO from scratch in PyTorch -- network, rollout buffer, the clipped update, and a training loop you can point at CartPole;
- and why this one algorithm ended up underneath everything from game-playing bots to the alignment of the chat assistants you use every day.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution with NumPy and PyTorch;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning?
- Learn AI Series (#103) - Multi-Armed Bandits
- Learn AI Series (#104) - Dynamic Programming
- Learn AI Series (#105) - Monte Carlo Methods
- Learn AI Series (#106) - Temporal Difference Learning
- Learn AI Series (#107) - Deep Q-Networks (DQN)
- Learn AI Series (#108) - Policy Gradient Methods
- Learn AI Series (#109) - Advanced Policy Optimization (this post)
Solutions to Episode #108 Exercises
Before we put a leash on the policy gradient, let's clear last episode's three exercises. All of them build on the PolicyNetwork, REINFORCE, REINFORCEWithBaseline and ActorCritic classes from episode #108, so I'm assuming those are imported and sitting in scope. I'm also leaning on gymnasium throughout -- pip install gymnasium if you skipped it last time.
Exercise 1: Implement plain REINFORCE (no baseline) on CartPole-v1, train it for 1,000 episodes logging the 100-episode moving average, then run it three times under different seeds and plot all three curves together -- so you can see the variance problem with your own eyes.
```python
import gymnasium as gym
import numpy as np
import torch

# Assumes PolicyNetwork and REINFORCE from episode #108.

def train_reinforce(seed, n_episodes=1000):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    np.random.seed(seed)
    agent = REINFORCE(env.observation_space.shape[0], env.action_space.n)
    rewards = []
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, trunc, total = False, False, 0.0
        while not (done or trunc):
            action = agent.choose_action(state)
            state, reward, done, trunc, _ = env.step(action)
            agent.store_reward(reward)
            total += reward
        agent.learn()  # Monte Carlo update, once per episode
        rewards.append(total)
    return rewards

def moving_average(x, w=100):
    return np.convolve(x, np.ones(w) / w, mode="valid")

curves = [moving_average(train_reinforce(s)) for s in (0, 1, 2)]
for s, c in enumerate(curves):
    print(f"seed {s}: final avg-100 = {c[-1]:6.1f} | peak = {c.max():6.1f}")
```
Plot the three curves on one axis and the lesson is impossible to miss: they wander all over the place. One seed might claw its way to CartPole's ceiling of 500 by episode 400 and stay there; another sputters around 80 for the whole run; a third climbs nicely and then collapses back down for no visible reason. That spread between three runs of the exact same algorithm is the variance problem made visual. Contrast it with the three DQN runs from episode #107, which would sit almost on top of one another -- experience replay and a frozen target buy you a steadiness that raw Monte Carlo policy gradient simply does not have. Same task, wildly different reliability.
Exercise 2: Add the learned value baseline (turn REINFORCE into REINFORCEWithBaseline), train both under the same seeds, and quantify the improvement -- roughly how many episodes does each take to first reach a 100-episode average of 195?
```python
import gymnasium as gym
import numpy as np
import torch

# Assumes REINFORCE and REINFORCEWithBaseline from episode #108.

def episodes_to_solve(ctor, seed, target=195.0, n_episodes=1000):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    np.random.seed(seed)
    agent = ctor(env.observation_space.shape[0], env.action_space.n)
    rewards, hit = [], None
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, trunc, total = False, False, 0.0
        while not (done or trunc):
            action = agent.choose_action(state)
            state, reward, done, trunc, _ = env.step(action)
            agent.store_reward(reward)
            total += reward
        agent.learn()
        rewards.append(total)
        if hit is None and len(rewards) >= 100 and np.mean(rewards[-100:]) >= target:
            hit = ep + 1  # episodes played when the running avg first clears 195
    return hit  # None if the target was never reached

for name, ctor in [("REINFORCE", REINFORCE),
                   ("baseline", REINFORCEWithBaseline)]:
    hits = [episodes_to_solve(ctor, s) for s in (0, 1, 2)]
    print(f"{name:>10}: episodes-to-195 per seed = {hits}")
```
The baseline version gets there sooner and -- more tellingly -- gets there at all on seeds where plain REINFORCE never does (you'll see a None or two creep into the bare version's row). Why? Because subtracting V(s) swaps the raw return G_t for the advantage G_t - V(s_t), and the advantage has a far smaller spread around zero. We proved last time that a state-only baseline leaves the gradient unbiased -- the expected update is unchanged -- so all you lose is noise. Less wobble in the gradient means a straighter climb to 195. Cleaner signal, same destination, fewer wrong turns.
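To put numbers on "same expected update, less noise", here is a minimal sketch on a made-up one-state, two-action problem. The policy, the returns and the value estimate are all illustrative assumptions, not anything from the CartPole runs. With only two actions, the mean and variance of the gradient estimate can be enumerated exactly in stead of sampled:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_stats(logits, returns, baseline):
    """Exact mean and variance of the estimate
    d log pi(a) / d logit_0 * (G(a) - baseline), enumerated over actions."""
    probs = softmax(logits)
    # For a softmax policy: d log pi(a) / d logit_0 = 1[a == 0] - pi(0)
    samples = [((1.0 if a == 0 else 0.0) - probs[0]) * (returns[a] - baseline)
               for a in range(len(probs))]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

logits = [0.0, 0.0]       # hypothetical policy: 50/50 over two actions
returns = [105.0, 95.0]   # hypothetical return G for each action
value = 100.0             # hypothetical V(s): the expected return under this policy

mean_raw, var_raw = grad_stats(logits, returns, baseline=0.0)
mean_adv, var_adv = grad_stats(logits, returns, baseline=value)
print(f"no baseline  : mean={mean_raw:.3f}  var={var_raw:.3f}")
print(f"V(s) baseline: mean={mean_adv:.3f}  var={var_adv:.3f}")
```

Both rows report the same mean, which is the "unbiased" part; only the variance changes. In this toy case the baseline happens to remove the variance entirely, which is far tidier than anything you'd see with a learned critic on a real task.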
Exercise 3: Add an entropy bonus to the ActorCritic agent -- compute dist.entropy() each step and add -beta * entropy to the loss -- then compare beta = 0 against beta = 0.01.
```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

# Assumes ActorCritic from episode #108.

def train_with_entropy(beta, seed, n_episodes=800):
    env = gym.make("CartPole-v1")
    torch.manual_seed(seed)
    np.random.seed(seed)
    agent = ActorCritic(env.observation_space.shape[0], env.action_space.n)
    rewards = []
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, trunc, total = False, False, 0.0
        while not (done or trunc):
            state_t = torch.FloatTensor(state).unsqueeze(0)
            feats = agent.features(state_t)
            dist = Categorical(F.softmax(agent.actor(feats), dim=-1))
            action = dist.sample()
            value = agent.critic(feats).squeeze()
            next_state, reward, done, trunc, _ = env.step(action.item())
            with torch.no_grad():
                nfeats = agent.features(torch.FloatTensor(next_state).unsqueeze(0))
                next_value = agent.critic(nfeats).squeeze()
            target = reward + agent.gamma * next_value * (1 - float(done or trunc))
            advantage = target - value
            actor_loss = -dist.log_prob(action) * advantage.detach()
            critic_loss = advantage.pow(2)
            entropy = dist.entropy().mean()
            loss = actor_loss + 0.5 * critic_loss - beta * entropy  # the new term
            agent.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(agent._all_params, 0.5)
            agent.optimizer.step()
            state, total = next_state, total + reward
        rewards.append(total)
    return rewards

for beta in (0.0, 0.01):
    finals = [np.mean(train_with_entropy(beta, s)[-100:]) for s in (0, 1, 2)]
    print(f"beta={beta:<4}: final avg-100 = {np.mean(finals):6.1f} "
          f"+/- {np.std(finals):.1f}")
```
With beta = 0 the agent occasionally does something maddening: early on it stumbles into one action that looks great for a few episodes, the softmax piles probability onto it, and the policy collapses -- it stops exploring and gets stuck on a mediocre habit it can no longer escape. The -beta * entropy term fights exactly that. Because the optimiser is now mildly rewarded for keeping the action distribution spread out, it resists that premature certainty, and across the three seeds the final scores cluster tighter and higher. This is the same exploration-versus-exploitation tension we first wrestled with on the bandits back in episode #103 -- only here, instead of an epsilon knob bolted on the outside, exploration is encouraged from inside the loss itself. Tidy, that ;-)
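If you want to see what the entropy term is actually measuring, here is a small sketch with two hand-picked action distributions (both hypothetical; the beta matches the exercise). It computes the entropy by hand and the contribution -beta * entropy that would be added to the loss:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

beta = 0.01                 # same coefficient as in the exercise
spread_out = [0.5, 0.5]     # hypothetical undecided policy
collapsed = [0.99, 0.01]    # hypothetical near-deterministic policy

for name, probs in [("spread out", spread_out), ("collapsed", collapsed)]:
    h = entropy(probs)
    print(f"{name:>10}: entropy={h:.3f}  loss term={-beta * h:+.5f}")
```

The spread-out policy has the larger entropy and therefore the more negative loss term, so a minimiser is nudged (gently, at beta = 0.01) away from the collapsed one.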
On to today's episode
Right -- episode 109, and this is the one where we stop letting the policy gradient run wild and put a proper safety rail on it.
Cast your mind back over episode #108. REINFORCE and Actor-Critic both do the same brave, slightly reckless thing: they compute whatever gradient the data suggests and take a step in that direction. And most of the time that's fine. But there is no seatbelt. One unlucky batch, one step a touch too big, and a policy that took thousands of episodes to learn can be wrecked in a single update. Worse, the wreck feeds on itself: a broken policy generates broken data, which produces worse updates, which generate even worse data. A death spiral, and you sit there watching your reward curve nose-dive off a cliff with no idea why.
Trust region methods are the fix, and the core idea is almost suspiciously simple: make the biggest improvement you can, but only within a region where you still trust your gradient estimate. Step boldly inside that region, never outside it. Today we'll see the two algorithms that turned that idea into practice -- TRPO and PPO -- and we'll build the second one from scratch, because PPO is, no exaggeration, the most important single algorithm in modern reinforcement learning.
Why unconstrained updates fall off cliffs
Let me make the danger concrete. Picture an agent that has learned to walk a tightrope near a cliff edge -- efficient, but with no margin for error. The gradient looks at the current policy and says: "lean a hair to the right, it's very slightly faster." Sound advice... for a tiny step. But a policy gradient step is not guaranteed to be tiny. Multiply that gradient by a learning rate that happened to be a bit large, and the agent doesn't lean a hair to the right -- it lunges, and walks straight off the edge.
Here's the crux, and it's worth saying slowly. A policy gradient is only reliable for small policy changes. It tells you the direction of improvement at the current policy -- not the direction three big steps away, where the landscape may look completely different. This is the very same trouble that made us care about learning rates and schedules back in episode #40, but in reinforcement learning it bites much harder. Why? Because in supervised learning your dataset sits still while you train. In RL the policy generates its own data -- change the policy too much and you don't just mis-step, you start collecting experience from a part of the world your gradient knew nothing about. The ground shifts under your feet because you shifted it.
So the question becomes: how do we take the largest useful step while guaranteeing we stay in the region where the gradient still tells the truth? That guarantee is what TRPO set out to provide.
TRPO: Trust Region Policy Optimization
TRPO (Schulman et al., 2015) formalises the trust-region idea. Instead of an ordinary gradient step, it solves a small constrained optimization problem at every update:
maximize E[ (pi_new(a|s) / pi_old(a|s)) * A(s, a) ]
subject to KL(pi_old || pi_new) <= delta
Two pieces, and both repay a careful read.
The thing being maximised is the surrogate advantage: the advantage A(s, a) (how much better an action did than expected -- the very quantity we built last episode) weighted by a probability ratio. That ratio,
r(theta) = pi_new(a|s) / pi_old(a|s)
measures how much more, or less, likely the new policy is to take the same action the old one took. If r = 1 the two policies agree perfectly on that action; if r = 2 the new policy is twice as keen; if r = 0.5 it's half as keen. Maximising r * A does the obvious sensible thing -- crank up the probability of actions that had positive advantage, crank down the ones that had negative advantage.
The second piece is the leash. KL(pi_old || pi_new) is the Kullback-Leibler divergence -- a standard measure of how different two probability distributions are (we first met KL in the context of distributions long ago; here it's quantifying "how far has the policy moved?"). The constraint <= delta says: change the policy as much as you like to boost that surrogate, but not so much that the new policy diverges from the old one by more than a small budget delta. That, right there, is the trust region, written in math.
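Both pieces are easy to compute by hand for a single state, which makes them less abstract. The sketch below uses three made-up action distributions and a made-up budget delta = 0.01 (none of these numbers come from the TRPO paper) to show the per-action ratios and the KL divergence for a cautious and a reckless update:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

pi_old = [0.50, 0.30, 0.20]     # hypothetical old policy at one state
pi_small = [0.55, 0.27, 0.18]   # a cautious update
pi_big = [0.95, 0.03, 0.02]     # a reckless update
delta = 0.01                    # hypothetical KL budget

results = {}
for name, pi_new in [("small step", pi_small), ("big step", pi_big)]:
    ratios = [round(n / o, 2) for n, o in zip(pi_new, pi_old)]
    kl = kl_divergence(pi_old, pi_new)
    results[name] = kl
    verdict = "inside" if kl <= delta else "outside"
    print(f"{name}: ratios={ratios}  KL={kl:.4f}  -> {verdict} the trust region")
```

The cautious update keeps every ratio close to 1 and stays within the budget; the reckless one pushes some ratios far from 1 and overshoots it. TRPO's constraint would reject the second update no matter how good its surrogate looked.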
TRPO genuinely works -- it was a real milestone. But it is a beast to implement. Solving that constrained problem means estimating the Fisher information matrix and running conjugate-gradient steps with a line search, all by hand. It's the kind of code you write once, get subtly wrong twice, and never quite enjoy maintaining. Surely, people thought, there's a cheaper way to get the same well-behaved updates? There was. It's called PPO.
PPO: just clip the ratio
PPO (Schulman et al., 2017) reaches the same destination as TRPO -- updates that never lurch too far -- with a mechanism so simple it feels like cheating: it just clips the probability ratio. No constraint, no Fisher matrix, no second-order anything. Plain first-order gradient descent, the same kind we've used since episode #7.
The clipped surrogate objective is this:
L_CLIP = E[ min( r(theta) * A, clip(r(theta), 1 - eps, 1 + eps) * A ) ]
where eps is a small number, typically 0.2. Let me unpack the min and the clip, because the interplay of those two is the entire trick.
- When the advantage is positive (a good action we'd like to make more likely), the clip caps the ratio at 1 + eps. Past that ceiling the objective goes flat -- pushing the action's probability even higher buys no more reward in the surrogate, so the gradient vanishes. The policy is allowed to become more keen on a good action, but only up to a point.
- When the advantage is negative (a bad action we'd like to suppress), the clip floors the ratio at 1 - eps. Again the objective flattens out, so the policy can back away from a bad action but not slam away from it in one go.
The min is what makes the clipping bite only in the dangerous direction (it always takes the more pessimistic of the clipped and unclipped terms, so the update can never exploit the clip to take a bigger step than the raw objective would). The net effect: a flat region in the loss landscape just outside the trust region, where the gradient is zero. The policy is free to move within [1 - eps, 1 + eps] of the old one, and gets no encouragement whatsoever to move beyond it. Same trust-region behaviour as TRPO, conjured out of a one-line clamp. Holy Macaroni, it's elegant.
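The interplay of min and clip is easiest to believe once you've tabulated it. Here is a minimal pure-Python sketch of the per-sample objective (the helper names are mine, and the slope is a finite-difference approximation rather than autograd) evaluated at a few ratios for a positive and a negative advantage:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, eps=0.2, h=1e-6):
    """Finite-difference slope of the objective with respect to the ratio."""
    return (clipped_surrogate(ratio + h, advantage, eps)
            - clipped_surrogate(ratio - h, advantage, eps)) / (2 * h)

for adv in (+1.0, -1.0):
    for r in (0.5, 0.9, 1.0, 1.1, 1.5):
        print(f"A={adv:+.0f}  r={r:.1f}  "
              f"objective={clipped_surrogate(r, adv):+.2f}  "
              f"slope={slope(r, adv):+.2f}")
```

Look at the slope column: with A = +1 it drops to zero at r = 1.5 but not at r = 0.5, and with A = -1 it is the other way around. The objective only goes flat when the ratio has already moved too far in the direction the advantage favours; a ratio that has drifted the wrong way still gets a gradient pulling it back.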
Building PPO from scratch
Enough theory -- let's build the thing. We'll need a network with an actor head and a critic head (sharing a trunk, exactly the pattern from last episode), a buffer to collect a chunk of experience, the clipped update, and a training loop. We start with the network:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
import numpy as np

class PPONetwork(nn.Module):
    """Shared trunk with separate actor (policy) and critic (value) heads."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # logits over actions
        self.critic = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, state):
        features = self.shared(state)
        return self.actor(features), self.critic(features)

    def get_action(self, state):
        state_t = torch.FloatTensor(state).unsqueeze(0)
        logits, value = self.forward(state_t)
        dist = Categorical(logits=logits)
        action = dist.sample()
        return action.item(), dist.log_prob(action), value.squeeze()
```
A small detail worth noticing: PPO conventionally uses Tanh activations in the trunk instead of the ReLU we reached for in episode #108. It's not load-bearing, but Tanh tends to keep activations bounded, which pairs nicely with the careful, bounded updates PPO is all about. Little choices like that are a good part of what separates "works in the paper" from "works on your machine".
The rollout buffer and GAE
PPO doesn't update after every step (too noisy) nor only at episode's end (too slow). It collects a fixed-length rollout -- a couple of thousand steps, say -- and then learns from that whole batch. The buffer stores the experience and, crucially, computes the advantages using GAE:
```python
class RolloutBuffer:
    """Stores a fixed rollout and computes GAE advantages."""

    def __init__(self):
        self.states, self.actions, self.log_probs = [], [], []
        self.rewards, self.values, self.dones = [], [], []

    def store(self, state, action, log_prob, reward, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)

    def compute_gae(self, last_value, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation -- a lambda-weighted TD blend."""
        advantages, gae = [], 0.0
        values = self.values + [last_value]
        for t in reversed(range(len(self.rewards))):
            delta = (self.rewards[t]
                     + gamma * values[t + 1] * (1 - self.dones[t])
                     - values[t])
            gae = delta + gamma * lam * (1 - self.dones[t]) * gae
            advantages.insert(0, gae)
        advantages = torch.FloatTensor(advantages)
        returns = advantages + torch.FloatTensor(self.values)
        return returns, advantages

    def batches(self, returns, advantages, batch_size=64):
        """Yield shuffled minibatches; call once per epoch."""
        n = len(self.states)
        idx = np.arange(n)
        np.random.shuffle(idx)
        states = torch.FloatTensor(np.array(self.states))
        actions = torch.LongTensor(self.actions)
        # Each stored log-prob has shape (1,), so stacking gives (n, 1).
        # Flatten to (n,) so it lines up element-wise with the new
        # log-probs; otherwise the ratio silently broadcasts to (B, B).
        old_log_probs = torch.stack(self.log_probs).detach().view(-1)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            yield (states[b], actions[b], old_log_probs[b],
                   returns[b], advantages[b])

    def clear(self):
        self.__init__()
```
GAE (Schulman et al., 2015 -- the same group, busy year) deserves a paragraph of its own, because it's the quiet hero of practical policy gradients. Remember the delta = r + gamma * V(s') - V(s) TD error from episode #106? GAE blends those one-step TD errors across many steps with an exponential weight lam (lambda). The dial does exactly what the n-step dial did back then: lam = 0 gives you the pure one-step TD advantage (low variance, but biased by the critic's mistakes), lam = 1 gives you the full Monte Carlo advantage (unbiased, but high variance). A value like 0.95 lives in the sweet spot we found empirically last time -- most of the bias gone, most of the variance tamed. It is, quite literally, the policy-gradient cousin of the n-step idea, and the same U-shaped trade-off governs both.
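The two ends of the dial can be checked by hand. The sketch below is a stripped-down, pure-Python version of the recursion for one uninterrupted stretch of trajectory (no terminal steps, so no done-masking), fed with made-up rewards and critic values. With lam = 0 the first advantage should equal the one-step TD error; with lam = 1 it should equal the discounted return (bootstrapped from the value after the last step) minus V(s_0):

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory segment with no terminal steps."""
    values = list(values) + [last_value]
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]   # hypothetical rewards
values = [2.5, 2.0, 1.2]    # hypothetical critic estimates V(s_t)
last_value = 0.5            # hypothetical V of the state after the last step
gamma = 0.99

td_only = gae(rewards, values, last_value, gamma, lam=0.0)
monte_carlo = gae(rewards, values, last_value, gamma, lam=1.0)
blended = gae(rewards, values, last_value, gamma, lam=0.95)

# Hand-computed references for the first time step
td_error_0 = rewards[0] + gamma * values[1] - values[0]
mc_return_0 = (rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2]
               + gamma**3 * last_value)

print(f"lam=0.00: A_0 = {td_only[0]:.4f}   (TD error      = {td_error_0:.4f})")
print(f"lam=1.00: A_0 = {monte_carlo[0]:.4f}   (return - V(s) = {mc_return_0 - values[0]:.4f})")
print(f"lam=0.95: A_0 = {blended[0]:.4f}")
```

The first two rows should agree with their hand-computed references. Everything the lambda dial does happens between those two anchors.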
The PPO agent
Now the heart of it -- the clipped update, run for several epochs over the collected rollout:
```python
class PPOAgent:
    """PPO with the clipped surrogate objective."""

    def __init__(self, state_dim, n_actions, lr=3e-4, gamma=0.99, lam=0.95,
                 clip_eps=0.2, epochs=4, batch_size=64,
                 entropy_coef=0.01, value_coef=0.5):
        self.gamma, self.lam = gamma, lam
        self.clip_eps, self.epochs = clip_eps, epochs
        self.batch_size = batch_size
        self.entropy_coef, self.value_coef = entropy_coef, value_coef
        self.net = PPONetwork(state_dim, n_actions)
        self.opt = torch.optim.Adam(self.net.parameters(), lr=lr)
        self.buffer = RolloutBuffer()

    def choose_action(self, state):
        action, log_prob, value = self.net.get_action(state)
        return action, log_prob, value.item()

    def update(self, last_value):
        returns, advantages = self.buffer.compute_gae(
            last_value, self.gamma, self.lam)
        # normalise advantages -- a small trick that helps a lot
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        for _ in range(self.epochs):
            for (states, actions, old_log_probs,
                 b_returns, b_adv) in self.buffer.batches(
                     returns, advantages, self.batch_size):
                logits, values = self.net(states)
                dist = Categorical(logits=logits)
                new_log_probs = dist.log_prob(actions)
                entropy = dist.entropy().mean()
                ratio = torch.exp(new_log_probs - old_log_probs)  # r(theta)
                surr1 = ratio * b_adv
                surr2 = torch.clamp(ratio, 1 - self.clip_eps,
                                    1 + self.clip_eps) * b_adv
                actor_loss = -torch.min(surr1, surr2).mean()  # the clip
                # squeeze only the last dim so a batch of one keeps its shape
                critic_loss = F.mse_loss(values.squeeze(-1), b_returns)
                loss = (actor_loss
                        + self.value_coef * critic_loss
                        - self.entropy_coef * entropy)
                self.opt.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
                self.opt.step()
        self.buffer.clear()
```
Three things to flag. First, ratio = exp(new_log_probs - old_log_probs) -- we compute the probability ratio in log-space and exponentiate, which is numerically far kinder than dividing two probabilities directly (a habit worth keeping everywhere in ML). Second, the same entropy bonus you wired into Actor-Critic in exercise 3 reappears here as - entropy_coef * entropy, doing the same job: keeping the policy from collapsing too soon. Third -- and this is the bit that makes PPO sample-efficient -- we loop over the same rollout for several epochs. Each batch of hard-won experience gets squeezed for multiple gradient steps, and the clipping is precisely what makes that safe: without it, reusing data several times would march the policy miles away from the distribution that generated it.
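On that first point, a tiny sketch of why the log-space route is kinder. The two log-probabilities below are hypothetical and deliberately extreme -- the sort of magnitude you might get when log-probs are summed over many action dimensions or tokens, not what CartPole's two actions produce:

```python
import math

old_log_prob = -800.0   # hypothetical: a log-prob summed over many dimensions
new_log_prob = -799.9

# Naive route: turn each log-prob into a probability first, then divide.
p_old, p_new = math.exp(old_log_prob), math.exp(new_log_prob)
naive_usable = p_old > 0.0   # False: both probabilities underflow to 0.0

# Log-space route: subtract first, exponentiate once.
ratio = math.exp(new_log_prob - old_log_prob)

print(f"p_old={p_old}, p_new={p_new}, naive ratio usable: {naive_usable}")
print(f"log-space ratio = {ratio:.4f}")
```

Both probabilities underflow to zero in 64-bit floats, so the naive division has nothing to work with, while the subtraction keeps the perfectly ordinary ratio of roughly exp(0.1).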
The training loop
The loop just alternates collecting a rollout with updating on it:
```python
def train_ppo(env, agent, total_steps=200_000, rollout_len=2048):
    state, _ = env.reset()
    episode_reward, history, steps = 0.0, [], 0
    while steps < total_steps:
        for _ in range(rollout_len):  # collect a fixed rollout
            action, log_prob, value = agent.choose_action(state)
            next_state, reward, done, trunc, _ = env.step(action)
            agent.buffer.store(state, action, log_prob, reward, value,
                               float(done or trunc))
            episode_reward += reward
            steps += 1
            if done or trunc:
                history.append(episode_reward)
                episode_reward = 0.0
                state, _ = env.reset()
            else:
                state = next_state
        with torch.no_grad():  # bootstrap the cut-off tail
            _, last_value = agent.net(
                torch.FloatTensor(state).unsqueeze(0))
        agent.update(last_value.item())
        if history:
            print(f"steps {steps:>7} | avg reward (last 20) "
                  f"{np.mean(history[-20:]):6.1f}")
    return history
```
Point train_ppo(gym.make("CartPole-v1"), PPOAgent(4, 2)) at the pole and watch it climb to 500 and stay there -- none of the heart-stopping collapses you saw from bare REINFORCE in exercise 1. Same problem, same hardware, a vastly calmer ride. That calmness is the whole product.
Why PPO ended up everywhere
Let me be blunt about why PPO is the default you reach for unless you have a specific reason not to:
- It's simple. No constrained optimization, no Fisher matrix, no conjugate gradients -- just torch.clamp and .backward(). You can read the whole update in one sitting.
- It's stable. The clip prevents catastrophic steps, so training rarely diverges. You spend your time tuning, not resurrecting dead runs.
- It performs. In the original PPO paper it matched or beat TRPO on most of the benchmarks tested, while being far simpler to implement.
- It's versatile. Discrete actions, continuous actions, single-agent, multi-agent, robots, games, language models -- the same skeleton handles all of them.
That last point is not hypothetical. PPO is the algorithm OpenAI used for the RLHF step that aligned ChatGPT (the human-feedback machinery we met in episode #61). It's a staple in robotics labs. It's the thing that gets quietly swapped in when a fancier method proves too fiddly. When you don't have a strong reason to pick something exotic, you pick PPO -- and you're usually right to.
The hyperparameters that actually matter
PPO has a reputation for "just working", but it has a handful of knobs that genuinely move the needle. Here are the ones worth knowing:
| Parameter | Typical range | What it does |
|---|---|---|
| clip_eps | 0.1 - 0.3 | Width of the trust region. Smaller = more conservative updates |
| epochs | 3 - 10 | Gradient passes per rollout. More = more sample-efficient, but risks over-fitting the batch |
| rollout_len | 128 - 2048 | Steps gathered before each update. Longer = lower-variance advantages |
| lam (GAE) | 0.9 - 0.99 | Bias-variance of the advantage. Higher = less bias, more variance |
| entropy_coef | 0.0 - 0.05 | Exploration pressure. Higher = the policy stays more random for longer |
| lr | 1e-4 - 3e-4 | Learning rate, often linearly decayed toward zero over training |
| batch_size | 32 - 256 | Minibatch size for the epoch loop |
If you only remember two: clip_eps is your trust-region width, and entropy_coef is your insurance against premature collapse. Get those sane and PPO is forgiving about the rest.
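The table mentions decaying the learning rate linearly toward zero, which the agent in this episode does not do. A minimal sketch of such a schedule, as a plain function (the name linear_lr and the step counts are my own choices for illustration):

```python
def linear_lr(step, total_steps, base_lr=3e-4):
    """Learning rate decayed linearly from base_lr at step 0 to 0 at total_steps."""
    frac_left = max(0.0, 1.0 - step / total_steps)
    return base_lr * frac_left

total = 200_000
for step in (0, 50_000, 100_000, 200_000):
    print(f"step {step:>7}: lr = {linear_lr(step, total):.2e}")
```

One way to wire it in, assuming the PPOAgent above, would be to set group["lr"] = linear_lr(steps, total_steps) for each group in agent.opt.param_groups just before every agent.update(...) call. Treat that as a suggestion to experiment with rather than part of the episode's code.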
So, what do you know now?
- Unconstrained policy updates can destroy a learned policy, because the gradient is only trustworthy for small changes -- and since the policy generates its own data, one bad step can trigger a self-reinforcing death spiral;
- trust region methods fix this by bounding how far the policy may move in a single update -- the biggest improvement that still stays inside the region where the gradient is reliable;
- TRPO enforces that bound with a hard KL-divergence constraint between the old and new policies -- principled and effective, but heavy to implement (Fisher matrix, conjugate gradients);
- PPO gets the same well-behaved updates by simply clipping the probability ratio to [1 - eps, 1 + eps], so the objective goes flat (zero gradient) the moment the policy tries to move too far -- no second-order machinery at all;
- GAE gives a tunable bias-variance trade-off for the advantage via a lambda-weighted blend of TD errors -- the policy-gradient cousin of the n-step idea from episode #106;
- PPO collects fixed-length rollouts and reuses them for several epochs of minibatch updates, which the clip makes safe -- efficient and stable;
- PPO is the modern default: simple, stable, versatile, and running underneath everything from game-playing agents to the RLHF that aligns chat assistants (episode #61).
Exercises
Exercise 1: Get the PPOAgent from this episode training on CartPole-v1 and plot the per-rollout average reward. Then run an ablation on the clip: set clip_eps to something enormous like 100.0 so the clamp never triggers, and train again under the same seed. Describe how the two reward curves differ, and connect the unstable one back to the "no seatbelt" problem that opened this episode -- you are essentially turning PPO back into a multi-epoch REINFORCE-with-baseline, and it should show.
Exercise 2: Add an approximate-KL early stop to the update. After each epoch, estimate the mean KL between the old and new policies with the cheap formula mean(old_log_probs - new_log_probs), and if it exceeds a threshold (say 0.015), break out of the epoch loop before doing more updates. Log how often the early stop fires over a full training run, and explain how this re-introduces a piece of TRPO's explicit KL constraint on top of PPO's implicit clip.
Exercise 3: Adapt PPO to a continuous action space and run it on Pendulum-v1. Replace the Categorical distribution with a Normal: have the actor head output a mean (and learn a log_std parameter), sample actions from that Gaussian, and use dist.log_prob(action).sum(-1) for the ratio. Note carefully what had to change (the distribution, the action shape, the output layer) and -- more interestingly -- everything that stayed exactly the same (the clip, GAE, the epoch loop, the value loss). That invariance is the whole reason PPO travels so well across problem types.
That continuous-action version is your bridge to the harder environments coming up -- the ones where an agent doesn't just react to the world but starts trying to model it, or where several agents have to learn in each other's company. We've now got the core policy-optimization engine built and understood. Next we start pointing it at bigger, stranger worlds ;-)