Learn AI Series (#113) - RL for Real-World Applications

What will I learn
- You will learn sim-to-real transfer -- training a policy in a physics simulator and then deploying it on real hardware without it falling over the moment it meets reality;
- domain randomization, the wonderfully counter-intuitive trick of making your simulator deliberately wrong so the policy stops caring which world it's in;
- how RL is actually used in robotics (manipulation, locomotion) and in resource management (data-center cooling, routing, scheduling) -- the domains where it genuinely pays the rent;
- why financial RL is so seductive and so treacherous at the same time;
- and the big one that separates the toy from the tool: safety -- why an agent that can break equipment, crash a car or lose real money needs a completely different mindset than one that just loses a game.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution with NumPy and PyTorch;
- You've followed the RL arc so far -- especially episodes #107 (DQN), #109 (PPO) and #112 (RL for games) -- because today is where all of that finally leaves the sandbox.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning?
- Learn AI Series (#103) - Multi-Armed Bandits
- Learn AI Series (#104) - Dynamic Programming
- Learn AI Series (#105) - Monte Carlo Methods
- Learn AI Series (#106) - Temporal Difference Learning
- Learn AI Series (#107) - Deep Q-Networks (DQN)
- Learn AI Series (#108) - Policy Gradient Methods
- Learn AI Series (#109) - Advanced Policy Optimization
- Learn AI Series (#110) - Model-Based Reinforcement Learning
- Learn AI Series (#111) - Multi-Agent Reinforcement Learning
- Learn AI Series (#112) - RL for Games
- Learn AI Series (#113) - RL for Real-World Applications (this post)
Solutions to Episode #112 Exercises
Before we drag reinforcement learning out of the arcade and into the factory, let's clear last episode's three exercises. They lean on the MCTS, MCTSNode, AlphaZeroNetwork, alphazero_training_loop and ICMModule classes from episode #112, so I'm assuming those are imported and in scope.
Exercise 1: Implement Tic-Tac-Toe as a tiny game object exposing the six methods MCTS expects, then run the MCTS class with a random evaluator (uniform policy, random value) and measure pure-search win rate against a random opponent over 200 games.
import random

import numpy as np

# Assumes MCTS and MCTSNode from episode #112.

class TicTacToe:
    """Minimal 3x3 board. current_player is +1 or -1."""

    def __init__(self):
        self.board = np.zeros(9, dtype=np.int8)
        self.current_player = 1

    def clone(self):
        g = TicTacToe()
        g.board = self.board.copy()
        g.current_player = self.current_player
        return g

    def get_legal_actions(self):
        return [i for i in range(9) if self.board[i] == 0]

    def apply_action(self, a):
        self.board[a] = self.current_player
        self.current_player *= -1  # hand the turn over

    def _winner(self):
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7),
                 (2, 5, 8), (0, 4, 8), (2, 4, 6)]
        for x, y, z in lines:
            s = self.board[x] + self.board[y] + self.board[z]
            if s == 3:
                return 1
            if s == -3:
                return -1
        return 0

    def is_terminal(self):
        return self._winner() != 0 or not self.get_legal_actions()

    def get_reward(self):
        # Reward from the perspective of the player who JUST moved.
        w = self._winner()
        return 0.0 if w == 0 else float(w * -self.current_player)

    def to_tensor(self):
        # Two 9-cell planes: X stones, then O stones.
        return np.concatenate([(self.board == 1), (self.board == -1)]).astype(np.float32)


class RandomEvalMCTS(MCTS):
    """MCTS with no network: uniform priors and a coin-flip value."""

    def _evaluate(self, state):
        actions = state.get_legal_actions()
        policy = np.zeros(9, dtype=np.float32)
        for a in actions:
            policy[a] = 1.0 / len(actions)
        return policy, random.uniform(-1.0, 1.0)


def play_match(searcher, n_games=200):
    wins = 0
    for g in range(n_games):
        game = TicTacToe()
        mcts_is_x = (g % 2 == 0)  # alternate who starts
        while not game.is_terminal():
            mcts_turn = (game.current_player == 1) == mcts_is_x
            if mcts_turn:
                pi = searcher.search(game)
                action = max(pi, key=pi.get)
            else:
                action = random.choice(game.get_legal_actions())
            game.apply_action(action)
        result = game._winner()
        if result != 0 and (result == 1) == mcts_is_x:
            wins += 1
    return wins / n_games


print("pure-MCTS win rate vs random:",
      play_match(RandomEvalMCTS(network=None, n_simulations=100)))
The result is the point: even with a random value function and no learning whatsoever, MCTS crushes a random opponent -- you'll typically see a win rate north of 0.9. Sit with that for a second. The network in AlphaZero is the famous part, but this exercise shows that a big chunk of the strength was always the search. Simulating a hundred futures and trusting the move you kept returning to is already a formidable player. Planning is powerful long before intuition shows up.
Exercise 2: Wire up the full alphazero_training_loop with a small AlphaZeroNetwork on your Tic-Tac-Toe game, train for a few dozen iterations, then pit the trained agent against the pure-MCTS agent from Exercise 1 at the same simulation budget.
import torch
import torch.nn as nn

# Assumes AlphaZeroNetwork, MCTS, alphazero_training_loop from episode #112,
# and TicTacToe / RandomEvalMCTS from Exercise 1.

class TicTacToeGame:
    """Adapter so alphazero_training_loop can call new_game()."""

    def new_game(self):
        return TicTacToe()


class TinyAZNet(nn.Module):
    """A dense stand-in for the conv network -- 3x3 is too small for convs."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(18, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, 9)
        self.value_head = nn.Sequential(nn.Linear(128, 1), nn.Tanh())

    def forward(self, state):
        x = self.body(state)
        return self.policy_head(x), self.value_head(x)


net = TinyAZNet()
alphazero_training_loop(TicTacToeGame(), net, n_iterations=40,
                        games_per_iteration=50, mcts_sims=50)

# Head to head at an identical search budget.
trained = MCTS(net, n_simulations=50)
random_searcher = RandomEvalMCTS(network=None, n_simulations=50)
wins = draws = 0
for g in range(200):
    game = TicTacToe()
    trained_is_x = (g % 2 == 0)
    while not game.is_terminal():
        turn_is_trained = (game.current_player == 1) == trained_is_x
        searcher = trained if turn_is_trained else random_searcher
        pi = searcher.search(game)
        game.apply_action(max(pi, key=pi.get))
    w = game._winner()
    if w == 0:
        draws += 1
    elif (w == 1) == trained_is_x:
        wins += 1
print(f"trained vs pure-MCTS -> wins {wins}, draws {draws}, losses {200 - wins - draws}")
At an equal simulation budget the trained network pulls ahead, and the gap is the whole AlphaZero thesis in miniature: intuition plus search beats search alone. The random evaluator wastes its fifty simulations wandering down obviously losing lines; the trained network's prior steers those same fifty simulations toward moves that matter. Nota bene, Tic-Tac-Toe is a solved, drawish little game -- with enough search both sides should mostly draw -- so watch the losses column shrink toward zero rather than expecting a tidy landslide of wins.
Exercise 3: Bolt the ICMModule onto a DQN agent on MountainCar-v0 -- a notoriously sparse-reward environment -- and compare episodes-to-first-goal with and without the curiosity bonus.
import gymnasium as gym
import numpy as np
import torch

# Assumes a DQN agent from episode #107 and ICMModule from episode #112.

def onehot(a, n):
    v = np.zeros(n, dtype=np.float32)
    v[a] = 1.0
    return v


def episodes_to_first_goal(use_curiosity, seed, max_ep=400, eta=0.5):
    # Seed BEFORE building the networks, so their initial weights are reproducible too.
    torch.manual_seed(seed)
    np.random.seed(seed)
    env = gym.make("MountainCar-v0")
    n_act = env.action_space.n
    agent = DQNAgent(env.observation_space.shape[0], n_act)  # your ep#107 DQN
    icm = ICMModule(env.observation_space.shape[0], n_act) if use_curiosity else None
    for ep in range(max_ep):
        s, _ = env.reset(seed=seed + ep)
        done = False
        while not done:
            a = agent.choose_action(s)
            ns, r, term, trunc, _ = env.step(a)
            done = term or trunc
            bonus = 0.0
            if icm is not None:
                st = torch.FloatTensor(s).unsqueeze(0)
                nst = torch.FloatTensor(ns).unsqueeze(0)
                act = torch.FloatTensor(onehot(a, n_act)).unsqueeze(0)
                bonus = eta * icm.compute_intrinsic_reward(st, act, nst).item()
            agent.store(s, a, r + bonus, ns, done)
            agent.learn()
            s = ns
        if term:  # MountainCar terminates only at the flag
            env.close()
            return ep
    env.close()
    return max_ep


for use in (False, True):
    reached = [episodes_to_first_goal(use, seed=s) for s in range(3)]
    tag = "curiosity" if use else "plain    "
    print(f"{tag}: first-goal episode per seed = {reached}")
The plain agent frequently burns through all 400 episodes without once reaching the flag -- the reward of -1 per step tells it "everything is equally bad", so epsilon-greedy just rattles around the valley forever. Add the curiosity bonus and the agent gets paid for reaching states it can't yet predict, which quite literally means driving further up the hill than before. That intrinsic pull is what finally flings it over the top. You feel, first-hand, why sparse reward is the villain and why intrinsic motivation is one of the escapes -- exactly the lesson episode #112 promised but this time measured on your own machine ;-)
Right -- out of the arcade
So. For eleven episodes we've been playing games. Balancing sticks, winning at Go, box-surfing in hide-and-seek. And games are a wonderful laboratory -- but let's be honest about why they're wonderful: because in a game, failure is free. Lose, hit reset, try again a million times, no harm done. The whole edifice of RL as we've built it quietly assumes that you can afford to be terrible for a very long time before you get good.
The real world does not offer you that deal. A robot arm that "explores" by flailing wildly breaks the fifty-thousand-euro robot arm. A self-driving car that tries a random action to see what happens can kill someone. A trading agent that explores by buying and selling at random burns real money with every curious click. This is the gap that has kept RL a research darling far more than a production workhorse, and today is about the bridge across it.
Sim-to-real: train in the dream, act in the world
The obvious escape from "failure is expensive" is: don't fail in the real world -- fail in a simulation instead, where failure is free again. Train the policy in a physics simulator for millions of episodes, then deploy the finished policy on the real hardware. This is sim-to-real transfer, and it's the backbone of essentially all modern RL robotics.
import torch
import torch.nn as nn

class RobotPolicy(nn.Module):
    """A policy that can run unchanged in simulation AND on the real robot."""

    def __init__(self, obs_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),  # continuous actions bounded to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)
Notice there's nothing exotic there -- it's a bog-standard MLP policy of the sort we've built since episode #44. The hard part isn't the network, it's the honesty of the simulator. And here's the uncomfortable truth every roboticist learns the hard way: simulators are always wrong. Physics engines only approximate friction, contact forces, motor lag, sensor noise. A policy that scores a flawless 100% in simulation can faceplant the instant it touches real hardware, because the real world simply doesn't behave the way the simulator promised. That mismatch even has a name -- the reality gap -- and closing it is the whole game.
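One small practical wrinkle that the Tanh output layer hides: the policy speaks in [-1, 1], while a real actuator speaks in newton-metres or radians. Here's a minimal sketch of the rescaling step, assuming a plain linear map. The function name scale_action and the 2.5 N*m torque limit are made up for illustration and don't come from any particular robot.

```python
def scale_action(raw, low, high):
    """Map a policy output in [-1, 1] onto an actuator range [low, high]."""
    # Clip first: a real motor driver should never see a command
    # outside its limits, even if something upstream misbehaves.
    clipped = max(-1.0, min(1.0, raw))
    return low + 0.5 * (clipped + 1.0) * (high - low)


# Hypothetical joint limited to +/- 2.5 N*m of torque.
for raw in (-1.0, 0.0, 0.5, 1.0):
    print(raw, "->", scale_action(raw, -2.5, 2.5))
```

The same map has to be used in the simulator and on the hardware; if the two disagree about what "0.5" means, you've built yourself a reality gap before physics even gets involved.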
Domain randomization: make the dream deliberately unreliable
The fix is delightfully perverse. If the problem is that the simulator differs from reality in some unknown way, then don't train against one simulator -- train against thousands of randomly-perturbed simulators, so the policy has to work no matter what the physics happen to be. That's domain randomization: you jitter the friction, the masses, the gravity, the sensor noise, the actuator delay on every single episode.
import numpy as np

class RandomizedSimulator:
    """A simulator whose physics are re-rolled every episode."""

    def __init__(self):
        # Nominal values, kept for reference; randomize() samples around them.
        self.base_params = {
            "friction": 0.5, "mass": 1.0, "gravity": 9.81,
            "sensor_noise": 0.01, "actuator_delay": 0.0,
        }

    def randomize(self):
        return {
            "friction": np.random.uniform(0.2, 0.8),
            "mass": np.random.uniform(0.5, 2.0),
            "gravity": np.random.uniform(9.0, 10.5),
            "sensor_noise": np.random.uniform(0.0, 0.05),
            "actuator_delay": np.random.uniform(0.0, 0.02),
        }

    def reset(self):
        self.params = self.randomize()  # new physics for this episode
        return self._get_observation()

    def step(self, action):
        noisy_action = action + np.random.normal(0, 0.01, size=action.shape)
        next_state = self._simulate(noisy_action, self.params)
        obs = next_state + np.random.normal(0, self.params["sensor_noise"],
                                            size=next_state.shape)
        reward = self._compute_reward(next_state)
        done = self._check_termination(next_state)
        return obs, reward, done

    # The hooks below are placeholders. The real work is done by a physics
    # engine: MuJoCo, Isaac Gym, PyBullet...
    def _get_observation(self):
        raise NotImplementedError

    def _simulate(self, action, params):
        raise NotImplementedError

    def _compute_reward(self, state):
        raise NotImplementedError

    def _check_termination(self, state):
        raise NotImplementedError
The logic is lovely once it clicks. If your policy is forced to succeed across friction from 0.2 to 0.8 and masses from half to double, then it has learned a behaviour robust to any friction and any mass in those ranges -- and the real robot's true friction is just one more value inside the range it already handles. The reality gap dissolves because the policy stopped assuming there was a single "true" physics at all.
The famous demonstration is OpenAI's Rubik's Cube-solving robot hand from 2019. They randomized the simulated hand so aggressively -- physics, visuals, even simulated "gremlins" that yanked at the cube -- that the real world was, from the policy's point of view, just another mildly weird simulation. The hand didn't know it had crossed into reality. It didn't need to. Having said that, domain randomization isn't a free lunch: crank the randomization too hard and you get a policy so cautious and generic it's mediocre everywhere. Tuning that dial is half the craft.
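One way to avoid hand-tuning that dial is to let it turn itself: start with no randomization, widen a parameter's range whenever the policy is coping well, and pull it back in when it isn't. The sketch below is loosely in the spirit of the automatic domain randomization idea from that OpenAI work, but it is a simplified illustration and not their algorithm. The class name RandomizationRange and the 0.8 / 0.4 thresholds are my own assumptions.

```python
import random


class RandomizationRange:
    """One physics parameter whose sampling range grows with policy skill."""

    def __init__(self, nominal, max_half_width, step):
        self.nominal = nominal                # the simulator's default value
        self.max_half_width = max_half_width  # never randomize beyond this
        self.step = step                      # how much to widen/narrow at once
        self.half_width = 0.0                 # start with no randomization

    def sample(self, rng=random):
        return rng.uniform(self.nominal - self.half_width,
                           self.nominal + self.half_width)

    def update(self, success_rate, widen_above=0.8, narrow_below=0.4):
        """Widen when the policy copes, narrow when it is drowning."""
        if success_rate >= widen_above:
            self.half_width = min(self.max_half_width,
                                  self.half_width + self.step)
        elif success_rate <= narrow_below:
            self.half_width = max(0.0, self.half_width - self.step)


friction = RandomizationRange(nominal=0.5, max_half_width=0.3, step=0.1)
for success in (0.9, 0.9, 0.3, 0.9):  # made-up evaluation results
    friction.update(success)
    print(f"success {success:.1f} -> friction half-width {friction.half_width:.1f}")
```

In a training loop you'd call sample() at every reset and update() after each evaluation batch, one RandomizationRange per physics parameter.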
Robotics: reaching, grasping, walking
Robot manipulation -- picking things up, assembling, using tools -- is the poster child for real-world RL. The observation is joint angles, joint velocities, gripper state and object poses; the actions are joint torques or target positions. The tricky bit is the reward, because "pick up the object and put it there" is really several sub-tasks stacked on top of each other, and a naive single reward at the very end is far too sparse (we just felt that pain in Exercise 3).
import numpy as np

class ManipulationReward:
    """Shaped, multi-phase reward for a pick-and-place task."""

    def __init__(self, target_pos):
        self.target_pos = np.array(target_pos)

    def compute(self, gripper_pos, object_pos, gripper_closed, object_grasped):
        reward = 0.0
        # Phase 1: get the gripper to the object.
        reach_dist = np.linalg.norm(gripper_pos - object_pos)
        reward -= 0.1 * reach_dist
        # Phase 2: close on it once you're there.
        if reach_dist < 0.05 and gripper_closed:
            reward += 1.0
        # Phase 3: carry it to the target.
        if object_grasped:
            place_dist = np.linalg.norm(object_pos - self.target_pos)
            reward -= 0.1 * place_dist
            if place_dist < 0.05:
                reward += 10.0  # task complete
        return reward
That's reward shaping from episode #112 doing real work -- gently paying the agent for approaching, then grasping, then placing, instead of leaving it to stumble on the whole sequence by luck. Nota bene the danger we discussed: shape this carelessly and the agent will happily hover near the object collecting the approach reward forever, never bothering to actually grasp. The tiered structure, where each phase only unlocks after the previous one, is what keeps it honest.
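There's also a more principled guard against that hover-and-farm failure, known as potential-based shaping: instead of paying for being near the object, you pay for the change in a "potential" between consecutive states. Here's a minimal sketch, assuming negative gripper-to-object distance as the potential. The function names are mine, and this is a different technique from the ManipulationReward class above, not a rewrite of it.

```python
import math


def potential(gripper_pos, object_pos):
    """Higher (less negative) the closer the gripper is to the object."""
    return -math.dist(gripper_pos, object_pos)


def shaped_reward(env_reward, prev_potential, new_potential, gamma=0.99):
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return env_reward + gamma * new_potential - prev_potential


far = potential((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))   # 1.0 away
near = potential((0.5, 0.0, 0.0), (1.0, 0.0, 0.0))  # 0.5 away

# With gamma = 1 the bonus for approaching is exactly cancelled by
# the penalty for backing off again, so wandering in circles pays nothing.
approach = shaped_reward(0.0, far, near, gamma=1.0)
retreat = shaped_reward(0.0, near, far, gamma=1.0)
print(approach, retreat, approach + retreat)
```

Because the shaping terms telescope along any trajectory, the agent can't mine them by looping; the only way to keep the bonus is to actually end up somewhere better.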
Locomotion -- walking, running, climbing -- is the other big robotics arena, and here PPO (episode #109) and SAC (Soft Actor-Critic, its off-policy cousin) are the standard tools. The agent drives the joint motors and the reward is roughly "move forward, don't fall". One charming, slightly humbling result: RL-trained robots routinely discover gaits that look nothing like a human walk -- weird shuffles and lopes -- because the reward never said "walk like a person", only "go forward efficiently". The agent optimises what you actually asked for, not what you meant. Remember that; it comes back to bite people in the safety section.
Resource management: the domain where RL quietly wins
If robotics is where RL gets the headlines, resource management is where it quietly banks the money. Scheduling, routing, allocation -- any problem where a sequence of decisions shapes your future options -- is a natural fit, and crucially these problems are often simulable and forgiving in a way that autonomous driving is not.
The classic success story is data-center cooling. Around 2016 DeepMind pointed an RL agent at the cooling systems in Google's data centers: state was temperatures, power loads and weather; actions were cooling adjustments; reward was energy efficiency subject to staying within safe temperatures. The reported result was a ~40% cut in cooling energy -- an eye-watering sum at Google's scale. Here's a toy version of that control problem:
import numpy as np

class DataCenterEnv:
    """Simplified data-center cooling control."""

    def __init__(self):
        self.target_range = (18.0, 27.0)
        self.outside_temp = 25.0
        self.reset()

    def reset(self):
        self.temperature = 22.0  # server-room temp in degrees C
        self.power_usage = 0.0
        return np.array([self.temperature, self.outside_temp,
                         0.0, self.power_usage])

    def step(self, action):
        """action: cooling power level from 0.0 to 1.0."""
        action = float(np.clip(action, 0.0, 1.0))  # enforce the documented range
        server_heat = np.random.uniform(2.0, 5.0)  # varies with load
        cooling_effect = action * 8.0
        self.temperature += server_heat - cooling_effect \
            + 0.1 * (self.outside_temp - self.temperature)
        self.power_usage = action
        reward = -self.power_usage  # every watt costs
        # Steep penalty for drifting out of the safe band.
        if self.temperature > self.target_range[1]:
            reward -= 10.0 * (self.temperature - self.target_range[1])
        elif self.temperature < self.target_range[0]:
            reward -= 10.0 * (self.target_range[0] - self.temperature)
        obs = np.array([self.temperature, self.outside_temp,
                        server_heat, self.power_usage])
        return obs, reward, False  # continuing task: never terminates on its own
Look at the reward: minimise power, but pay a steep penalty for leaving the safe temperature band. That penalty gap between "wasting a little energy" and "cooking the servers" is exactly the kind of asymmetry that resource problems have and games don't. The other everyday example is vehicle routing: an agent deciding the order a delivery van visits its stops, re-planning live as traffic shifts and new orders drop in. Static optimisers solve yesterday's problem perfectly; RL adapts to the problem as it actually unfolds.
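Before you throw RL at a control problem like this, it's worth knowing what a dumb hand-written controller achieves, because that's the bar the agent has to clear. Below is a minimal sketch of a proportional thermostat run against the same toy dynamics as the environment above (re-implemented here in plain Python so the snippet stands alone). The setpoint, gain and bias are values I picked by hand for illustration, not tuned or recommended settings.

```python
import random


def thermostat(temperature, setpoint=24.0, gain=0.1, bias=0.45):
    """Proportional control: cool harder the further above setpoint we are."""
    return max(0.0, min(1.0, bias + gain * (temperature - setpoint)))


def full_blast(temperature):
    """The lazy alternative: maximum cooling, always."""
    return 1.0


def rollout(controller, steps=1000, seed=0):
    """Return (average power, fraction of steps outside 18-27 C)."""
    rng = random.Random(seed)
    temperature, outside = 22.0, 25.0
    total_power, out_of_band = 0.0, 0
    for _ in range(steps):
        action = controller(temperature)
        heat = rng.uniform(2.0, 5.0)  # same load model as the toy env
        temperature += heat - 8.0 * action + 0.1 * (outside - temperature)
        total_power += action
        out_of_band += not (18.0 <= temperature <= 27.0)
    return total_power / steps, out_of_band / steps


for name, ctrl in (("thermostat", thermostat), ("full blast", full_blast)):
    power, violations = rollout(ctrl)
    print(f"{name}: avg power {power:.2f}, out-of-band fraction {violations:.2f}")
```

In this toy setting the thermostat holds the band at well under half power, while full blast overcools the room and pays the penalty anyway. An RL agent only earns its keep here if it beats the thermostat, which is a much more honest comparison than beating a random policy.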
Finance: the siren song
Financial markets look tailor-made for RL. Observe market state, take an action (buy, sell, hold), collect a reward (profit or loss). Textbook MDP, right? Here's a stripped-down trading environment to make the shape concrete:
import numpy as np

class TradingEnvironment:
    """A deliberately simplified single-asset trading environment."""

    def __init__(self, price_data, initial_balance=10000.0):
        # Force an array so the slicing and division in _get_obs work for plain lists too.
        self.prices = np.asarray(price_data, dtype=np.float64)
        self.initial_balance = initial_balance
        self.reset()

    def reset(self):
        self.step_idx = 0
        self.balance = self.initial_balance
        self.position = 0.0  # units held
        self.portfolio_value = self.balance
        return self._get_obs()

    def step(self, action):
        """action: -1 sell everything, 0 hold, 1 buy with everything."""
        price = self.prices[self.step_idx]
        next_price = self.prices[self.step_idx + 1]
        if action == 1 and self.balance > 0:
            self.position += self.balance / price
            self.balance = 0.0
        elif action == -1 and self.position > 0:
            self.balance += self.position * price
            self.position = 0.0
        self.step_idx += 1
        self.portfolio_value = self.balance + self.position * next_price
        prev_value = self.balance + self.position * price
        reward = np.log(self.portfolio_value / prev_value + 1e-8)  # log return
        # Done once there is no "next price" left to step into.
        done = self.step_idx >= len(self.prices) - 1
        return self._get_obs(), reward, done

    def _get_obs(self):
        idx = self.step_idx
        window = self.prices[max(0, idx - 20):idx + 1]
        returns = np.diff(window) / window[:-1] if len(window) > 1 else np.zeros(1)
        returns = np.pad(returns, (20 - len(returns), 0))  # left-pad to 20 returns
        pos_ratio = (self.position * self.prices[idx]) / (self.portfolio_value + 1e-8)
        return np.append(returns, pos_ratio).astype(np.float32)
And now the honest part, because I've watched too many people set fire to their savings here. Financial RL is brutally hard, and the reasons are baked into the domain, not fixable with a bigger network. Markets are non-stationary -- the regime that trained your agent is gone by the time you deploy it (this is the non-stationarity from episode #111, except nobody warned you). They're partially observable -- you never see the other players' hands. They're adversarial -- other participants react to your trades. And backtests lie: transaction costs eat theoretical profits, and market impact (your own orders moving the price) never shows up in a simulation this simple. The dirty secret of the literature is that most impressive RL-trading papers show gorgeous backtests that quietly die in live trading. Where RL does earn its keep in finance is the humbler job of execution -- filling a large order at the best average price -- rather than the glamorous job of predicting direction. I'll say no more than that; this is a tutorial about learning agents, not a place for anyone to go hunting trade signals.
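To put a number on the "transaction costs eat theoretical profits" point, here's a minimal sketch of the adjustment the toy environment above leaves out. It assumes a simple proportional cost on whatever fraction of the portfolio gets traded in a step; the function name and the 10 basis points figure are illustrative assumptions, not real fee data.

```python
import math


def net_log_return(gross_log_return, traded_fraction, cost_bps=10.0):
    """Log return after a proportional cost on the traded part of the portfolio.

    traded_fraction: 0.0 for "hold", 1.0 for flipping the whole portfolio.
    cost_bps: assumed round-number cost in basis points (1 bp = 0.01%).
    """
    cost = traded_fraction * cost_bps / 10_000.0
    return gross_log_return + math.log(1.0 - cost)


gross = 0.0005  # a backtest that "earns" 0.05% per step before costs
print("holding:      ", net_log_return(gross, traded_fraction=0.0))
print("full turnover:", net_log_return(gross, traded_fraction=1.0))
```

An agent that flips its whole position every step turns that positive gross return negative under this assumption, and that's before market impact, which would make things worse still.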
The safety problem: the whole ballgame
Everything above circles one central fact. In a game, the worst outcome is a loss on a scoreboard. In the real world, an unsafe action can be irreversible -- broken equipment, an injured person, money that isn't coming back. So real-world RL needs a concept games never forced on us: safety as a hard constraint, not a soft preference. One common pattern is a safety layer that sits between the policy and the world and vetoes anything dangerous:
import numpy as np

class SafeRLAgent:
    """Wrap a policy in a hard safety filter."""

    def __init__(self, policy, safety_checker):
        self.policy = policy
        self.safety_checker = safety_checker

    def act(self, state):
        proposed = self.policy(state)
        if self.safety_checker.is_safe(state, proposed):
            return proposed
        # Otherwise fall back to the safe action nearest the policy's wish.
        safe_actions = self.safety_checker.get_safe_actions(state)
        return self._nearest_safe_action(proposed, safe_actions)

    def _nearest_safe_action(self, preferred, safe_set):
        distances = [np.linalg.norm(preferred - safe) for safe in safe_set]
        return safe_set[np.argmin(distances)]
The idea is simple and powerful: the learning agent can be as adventurous as it likes inside a fence that a separate, trusted checker draws. The policy proposes; the safety layer disposes. More formally, the field frames this with constrained MDPs -- maximise reward subject to keeping some cost (collision probability, energy use, patient risk) below a threshold. PPO-Lagrangian is the workhorse here: it bolts Lagrange multipliers onto PPO so that violating a constraint gets automatically more expensive until the agent stops doing it. This connects straight back to the locomotion lesson -- an agent optimises what you measured, so if safety isn't in the optimisation, it doesn't exist.
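The "automatically more expensive" part is easier to believe once you see how little code it takes. Here's a minimal sketch of just the multiplier update, assuming plain dual ascent on the average episode cost. Real PPO-Lagrangian implementations differ in the details (learning rates, normalisation, how the penalty enters the PPO loss), and the class and parameter names here are my own.

```python
class LagrangeMultiplier:
    """Price of constraint violation, raised while the agent overspends."""

    def __init__(self, cost_limit, lr=0.05, init=0.0):
        self.cost_limit = cost_limit  # allowed average cost per episode
        self.lr = lr
        self.value = init

    def update(self, mean_episode_cost):
        # Over the limit -> the price goes up; under it -> the price decays.
        # The multiplier is never allowed to go negative.
        self.value = max(0.0, self.value
                         + self.lr * (mean_episode_cost - self.cost_limit))
        return self.value

    def penalized_reward(self, reward, cost):
        """The signal the policy optimiser actually sees."""
        return reward - self.value * cost


lam = LagrangeMultiplier(cost_limit=1.0, lr=0.5)
for mean_cost in (3.0, 3.0, 0.0, 0.0):  # made-up per-iteration costs
    print(f"mean cost {mean_cost} -> lambda {lam.update(mean_cost)}")
```

You'd call update() once per training iteration with the measured cost, then feed penalized_reward() into the usual PPO advantage computation. The nice property is that nobody has to guess the penalty weight up front; it settles wherever the constraint is just about satisfied.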
So what's actually deployed?
Time for a reality check, because the hype badly outruns the shipping products. The genuinely-in-production list is shorter than you'd guess:
- Data-center cooling (Google/DeepMind) -- proven, deployed, saving real energy;
- Chip floorplanning (Google) -- RL placing components on silicon;
- Recommendation and ad systems (YouTube, TikTok, ad auctions) -- mostly bandits and short-horizon RL, learned from oceans of data;
- LLM alignment (RLHF/PPO) -- yes, the assistant-tuning pipeline is reinforcement learning, and it's arguably RL's biggest commercial hit;
- Robotics -- real, but mostly in controlled industrial cells, not your living room;
- Game AI -- well established and genuinely shipped.
Spot the common thread? Every winner is either cheaply simulable (games, chips, cooling, language models) or data-rich enough to learn safely offline (recommendations, ads). The domains where failure is catastrophic and simulation is hard -- full autonomous driving, clinical treatment -- are exactly the ones still stuck in the lab. That's not a coincidence; it's the sim-to-real gap and the safety problem, restated as a business fact.
Pull the threads together and the shape of real-world RL is clear: simulate whatever you can, randomize it so the policy stops trusting the simulator, shape rewards to guide multi-step tasks, and wrap the whole thing in hard safety constraints because reality doesn't offer a reset button. The machinery is the same PPO, DQN and self-play we've built all series -- what changes is the discipline you have to bring to deploying it. And there's a related question we've been dancing around this whole time: we keep hand-designing reward functions, but what if the reward is the very thing we don't know how to write down? That's a thread worth pulling on soon.
In case you skimmed...
- Sim-to-real transfer trains policies in a simulator and deploys them on real hardware, but the ever-present reality gap means a simulator-perfect policy can still fail in the flesh;
- domain randomization closes that gap by training across thousands of randomly-perturbed physics, so the real world becomes just one more variation the policy already handles;
- robotics uses PPO/SAC for manipulation and locomotion, leaning hard on multi-phase reward shaping -- and cheerfully finds weird, un-human solutions because it optimises what you measured, not what you meant;
- resource management (data-center cooling, routing, scheduling) is RL's quiet success story: simulable, forgiving, and with rewards you can actually write down;
- financial RL is seductive but treacherous -- non-stationary, partially observable, adversarial, and prone to backtests that die on contact with reality; execution beats prediction;
- safety is the dividing line between game RL and real RL -- constrained MDPs, safe-action filters and PPO-Lagrangian exist because reality has no reset button, and production RL only thrives where failure is cheap or simulation is easy.
Exercises
Exercise 1: Take the DataCenterEnv above and train a small agent on it -- a DQN over a handful of discretized cooling levels (say 0.0, 0.25, 0.5, 0.75, 1.0) is plenty, and you already have a DQN from episode #107. Run it for a few thousand episodes, then plot two things over training: average power used and the fraction of steps spent outside the safe temperature band. You should watch the agent discover the classic trade-off -- cool just enough to stay in band while spending as little power as possible. Then bump the out-of-band penalty from 10.0 up to 100.0 and re-run: how does the agent's behaviour change? This is your first hands-on feel for how a reward weight encodes a priority.
Exercise 2: Simulate the reality gap yourself. Write a SimHand environment with friction = 0.5 and a RealHand with friction = 0.35, both otherwise identical (a 1D "slide the block to the target" task is enough). Train a policy on SimHand alone and measure its performance on RealHand -- it should degrade. Now train a second policy on a SimHand whose friction is re-rolled uniformly in [0.2, 0.8] every episode (domain randomization), and test that one on RealHand. Report both transfer scores. You should see the randomized policy survive the jump far better, and you'll have reproduced the core sim-to-real result on your laptop.
Exercise 3: Build the SafeRLAgent's missing safety_checker for a 1D "cliff" world: the agent lives on a line, position must stay in [0, 10], and any action that would push it past either edge is unsafe. Implement is_safe(state, action) and get_safe_actions(state), wrap a deliberately reckless random policy in the SafeRLAgent, and run it for a few hundred steps. Confirm the agent never once leaves the safe interval, no matter how badly the underlying policy misbehaves. Then log every time the safety layer had to override the policy -- that override count is exactly the kind of number a real safety engineer watches like a hawk.
Get at least the reality-gap exercise working before the next episode -- feeling a policy trained in one world stumble in a slightly different one is the single best intuition for everything that makes real-world RL hard. The next stop turns a question we kept brushing past into the main event: what happens when you can't write the reward function down at all, and have to infer it instead?