Learn AI Series (#113) - RL for Real-World Applications


What will I learn

  • You will learn sim-to-real transfer -- training a policy in a physics simulator and then deploying it on real hardware without it falling over the moment it meets reality;
  • domain randomization, the wonderfully counter-intuitive trick of making your simulator deliberately wrong so the policy stops caring which world it's in;
  • how RL is actually used in robotics (manipulation, locomotion) and in resource management (data-center cooling, routing, scheduling) -- the domains where it genuinely pays the rent;
  • why financial RL is so seductive and so treacherous at the same time;
  • and the big one that separates the toy from the tool: safety -- why an agent that can break equipment, crash a car or lose real money needs a completely different mindset than one that just loses a game.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution with NumPy and PyTorch;
  • You've followed the RL arc so far -- especially episodes #107 (DQN), #109 (PPO) and #112 (RL for games) -- because today is where all of that finally leaves the sandbox.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#113) - RL for Real-World Applications

Solutions to Episode #112 Exercises

Before we drag reinforcement learning out of the arcade and into the factory, let's clear last episode's three exercises. They lean on the MCTS, MCTSNode, AlphaZeroNetwork, alphazero_training_loop and ICMModule classes from episode #112, so I'm assuming those are imported and in scope.

Exercise 1: Implement Tic-Tac-Toe as a tiny game object exposing the six methods MCTS expects, then run the MCTS class with a random evaluator (uniform policy, random value) and measure pure-search win rate against a random opponent over 200 games.

import numpy as np
import random
# Assumes MCTS and MCTSNode from episode #112.


class TicTacToe:
    """Minimal 3x3 board. current_player is +1 or -1."""
    def __init__(self):
        self.board = np.zeros(9, dtype=np.int8)
        self.current_player = 1

    def clone(self):
        g = TicTacToe()
        g.board = self.board.copy()
        g.current_player = self.current_player
        return g

    def get_legal_actions(self):
        return [i for i in range(9) if self.board[i] == 0]

    def apply_action(self, a):
        self.board[a] = self.current_player
        self.current_player *= -1        # hand the turn over

    def _winner(self):
        lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),
                 (2,5,8),(0,4,8),(2,4,6)]
        for x, y, z in lines:
            s = self.board[x] + self.board[y] + self.board[z]
            if s == 3:  return 1
            if s == -3: return -1
        return 0

    def is_terminal(self):
        return self._winner() != 0 or not self.get_legal_actions()

    def get_reward(self):
        # Reward from the perspective of the player who JUST moved.
        w = self._winner()
        return 0.0 if w == 0 else float(w * -self.current_player)

    def to_tensor(self):
        return np.concatenate([(self.board == 1), (self.board == -1)]).astype(np.float32)


class RandomEvalMCTS(MCTS):
    """MCTS with no network: uniform priors and a coin-flip value."""
    def _evaluate(self, state):
        actions = state.get_legal_actions()
        policy = np.zeros(9, dtype=np.float32)
        for a in actions:
            policy[a] = 1.0 / len(actions)
        return policy, random.uniform(-1.0, 1.0)


def play_match(searcher, n_games=200):
    wins = 0
    for g in range(n_games):
        game = TicTacToe()
        mcts_is_x = (g % 2 == 0)          # alternate who starts
        while not game.is_terminal():
            mcts_turn = (game.current_player == 1) == mcts_is_x
            if mcts_turn:
                pi = searcher.search(game)
                action = max(pi, key=pi.get)
            else:
                action = random.choice(game.get_legal_actions())
            game.apply_action(action)
        result = game._winner()
        if (result == 1) == mcts_is_x and result != 0:
            wins += 1
    return wins / n_games


print("pure-MCTS win rate vs random:",
      play_match(RandomEvalMCTS(network=None, n_simulations=100)))

The result is the point: even with a random value function and no learning whatsoever, MCTS crushes a random opponent -- you'll typically see a win rate north of 0.9. Sit with that for a second. The network in AlphaZero is the famous part, but this exercise shows that a big chunk of the strength was always the search. Simulating a hundred futures and trusting the move you kept returning to is already a formidable player. Planning is powerful long before intuition shows up.

Exercise 2: Wire up the full alphazero_training_loop with a small AlphaZeroNetwork on your Tic-Tac-Toe game, train for a few dozen iterations, then pit the trained agent against the pure-MCTS agent from Exercise 1 at the same simulation budget.

import torch
import torch.nn as nn
import numpy as np
# Assumes AlphaZeroNetwork, MCTS, alphazero_training_loop from episode #112,
# and TicTacToe / RandomEvalMCTS from Exercise 1.


class TicTacToeGame:
    """Adapter so alphazero_training_loop can call new_game()."""
    def new_game(self):
        return TicTacToe()


class TinyAZNet(nn.Module):
    """A dense stand-in for the conv network -- 3x3 is too small for convs."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(18, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, 9)
        self.value_head = nn.Sequential(nn.Linear(128, 1), nn.Tanh())

    def forward(self, state):
        x = self.body(state)
        return self.policy_head(x), self.value_head(x)


net = TinyAZNet()
alphazero_training_loop(TicTacToeGame(), net, n_iterations=40,
                        games_per_iteration=50, mcts_sims=50)

# Head to head at an identical search budget.
trained = MCTS(net, n_simulations=50)
random_searcher = RandomEvalMCTS(network=None, n_simulations=50)

wins = draws = 0
for g in range(200):
    game = TicTacToe()
    trained_is_x = (g % 2 == 0)
    while not game.is_terminal():
        turn_is_trained = (game.current_player == 1) == trained_is_x
        searcher = trained if turn_is_trained else random_searcher
        pi = searcher.search(game)
        game.apply_action(max(pi, key=pi.get))
    w = game._winner()
    if w == 0:
        draws += 1
    elif (w == 1) == trained_is_x:
        wins += 1
print(f"trained vs pure-MCTS -> wins {wins}, draws {draws}, losses {200 - wins - draws}")

At an equal simulation budget the trained network pulls ahead, and the gap is the whole AlphaZero thesis in miniature: intuition plus search beats search alone. The random evaluator wastes its fifty simulations wandering down obviously losing lines; the trained network's prior steers those same fifty simulations toward moves that matter. Nota bene, Tic-Tac-Toe is a solved, drawish little game -- with enough search both sides should mostly draw -- so watch the losses column shrink toward zero rather than expecting a tidy landslide of wins.

Exercise 3: Bolt the ICMModule onto a DQN agent on MountainCar-v0 -- a notoriously sparse-reward environment -- and compare episodes-to-first-goal with and without the curiosity bonus.

import gymnasium as gym
import numpy as np
import torch
# Assumes a DQN agent from episode #107 and ICMModule from episode #112.


def onehot(a, n):
    v = np.zeros(n, dtype=np.float32)
    v[a] = 1.0
    return v


def episodes_to_first_goal(use_curiosity, seed, max_ep=400, eta=0.5):
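    # Seed BEFORE building the networks; otherwise their initial weights
    # are drawn unseeded and the per-seed runs are not reproducible.
    torch.manual_seed(seed); np.random.seed(seed)
    env = gym.make("MountainCar-v0")
    n_act = env.action_space.n
    agent = DQNAgent(env.observation_space.shape[0], n_act)   # your ep#107 DQN
    icm = ICMModule(env.observation_space.shape[0], n_act) if use_curiosity else None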

    for ep in range(max_ep):
        s, _ = env.reset(seed=seed + ep)
        done = False
        while not done:
            a = agent.choose_action(s)
            ns, r, term, trunc, _ = env.step(a)
            done = term or trunc
            bonus = 0.0
            if icm is not None:
                st = torch.FloatTensor(s).unsqueeze(0)
                nst = torch.FloatTensor(ns).unsqueeze(0)
                act = torch.FloatTensor(onehot(a, n_act)).unsqueeze(0)
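                # Assumes the ICM's own models are trained along the way, as in
                # episode #112; an untrained ICM gives a bonus that never fades.
                bonus = eta * icm.compute_intrinsic_reward(st, act, nst).item()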
            agent.store(s, a, r + bonus, ns, done)
            agent.learn()
            s = ns
            if term:                       # MountainCar terminates only at the flag
                return ep
    return max_ep


for use in (False, True):
    reached = [episodes_to_first_goal(use, seed=s) for s in range(3)]
    tag = "curiosity" if use else "plain    "
    print(f"{tag}: first-goal episode per seed = {reached}")

The plain agent frequently burns through all 400 episodes without once reaching the flag -- the reward of -1 per step tells it "everything is equally bad", so epsilon-greedy just rattles around the valley forever. Add the curiosity bonus and the agent gets paid for reaching states it can't yet predict, which quite literally means driving further up the hill than before. That intrinsic pull is what finally flings it over the top. You feel, first-hand, why sparse reward is the villain and why intrinsic motivation is one of the escapes -- exactly the lesson episode #112 promised but this time measured on your own machine ;-)

Right -- out of the arcade

So. For eleven episodes we've been playing games. Balancing sticks, winning at Go, box-surfing in hide-and-seek. And games are a wonderful laboratory -- but let's be honest about why they're wonderful: because in a game, failure is free. Lose, hit reset, try again a million times, no harm done. The whole edifice of RL as we've built it quietly assumes that you can afford to be terrible for a very long time before you get good.

The real world does not offer you that deal. A robot arm that "explores" by flailing wildly breaks the fifty-thousand-euro robot arm. A self-driving car that tries a random action to see what happens can kill someone. A trading agent that explores by buying and selling at random burns real money with every curious click. This is the gap that has kept RL a research darling far more than a production workhorse, and today is about the bridge across it.

Sim-to-real: train in the dream, act in the world

The obvious escape from "failure is expensive" is: don't fail in the real world -- fail in a simulation instead, where failure is free again. Train the policy in a physics simulator for millions of episodes, then deploy the finished policy on the real hardware. This is sim-to-real transfer, and it's the backbone of essentially all modern RL robotics.

import torch
import torch.nn as nn


class RobotPolicy(nn.Module):
    """A policy that can run unchanged in simulation AND on the real robot."""
    def __init__(self, obs_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),                    # continuous actions bounded to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

Notice there's nothing exotic there -- it's a bog-standard MLP policy of the sort we've built since episode #44. The hard part isn't the network, it's the honesty of the simulator. And here's the uncomfortable truth every roboticist learns the hard way: simulators are always wrong. Physics engines only approximate friction, contact forces, motor lag, sensor noise. A policy that scores a flawless 100% in simulation can faceplant the instant it touches real hardware, because the real world simply doesn't behave the way the simulator promised. That mismatch even has a name -- the reality gap -- and closing it is the whole game.
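To put a number on the reality gap, here's a back-of-the-envelope sketch rather than a real simulator. A block gets one push and slides until friction stops it, so the stopping distance is v^2 / (2 * mu * g). The two friction values and the helper names slide_distance and push_speed_for are illustrative assumptions, not anything taken from a real robot.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def slide_distance(push_speed, friction):
    """Distance a block slides before kinetic friction stops it: v^2 / (2*mu*g)."""
    return push_speed ** 2 / (2.0 * friction * G)


def push_speed_for(target_distance, friction):
    """Invert slide_distance: the push speed that stops exactly at the target."""
    return math.sqrt(2.0 * friction * G * target_distance)


SIM_FRICTION = 0.5     # what the simulator believes (illustrative)
REAL_FRICTION = 0.35   # what the hardware actually has (illustrative)
TARGET = 1.0           # metres

# A "policy" tuned to perfection inside the simulator.
v = push_speed_for(TARGET, SIM_FRICTION)

sim_result = slide_distance(v, SIM_FRICTION)
real_result = slide_distance(v, REAL_FRICTION)

print(f"in simulation: {sim_result:.3f} m (error {abs(sim_result - TARGET):.3f} m)")
print(f"on hardware:   {real_result:.3f} m (error {abs(real_result - TARGET):.3f} m)")
```

The push that lands dead on target in simulation overshoots by roughly 43% on the slipperier surface, and nothing in the simulator's score would have warned you.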

Domain randomization: make the dream deliberately unreliable

The fix is delightfully perverse. If the problem is that the simulator differs from reality in some unknown way, then don't train against one simulator -- train against thousands of randomly-perturbed simulators, so the policy has to work no matter what the physics happen to be. That's domain randomization: you jitter the friction, the masses, the gravity, the sensor noise, the actuator delay on every single episode.

import numpy as np


class RandomizedSimulator:
    """A simulator whose physics are re-rolled every episode."""
    def __init__(self):
        self.base_params = {
            "friction": 0.5, "mass": 1.0, "gravity": 9.81,
            "sensor_noise": 0.01, "actuator_delay": 0.0,
        }

    def randomize(self):
        return {
            "friction": np.random.uniform(0.2, 0.8),
            "mass": np.random.uniform(0.5, 2.0),
            "gravity": np.random.uniform(9.0, 10.5),
            "sensor_noise": np.random.uniform(0.0, 0.05),
            "actuator_delay": np.random.uniform(0.0, 0.02),
        }

    def reset(self):
        self.params = self.randomize()
        return self._get_observation()

    def step(self, action):
        noisy_action = action + np.random.normal(0, 0.01, size=action.shape)
        next_state = self._simulate(noisy_action, self.params)
        obs = next_state + np.random.normal(0, self.params["sensor_noise"],
                                            size=next_state.shape)
        reward = self._compute_reward(next_state)
        done = self._check_termination(next_state)
        return obs, reward, done

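    # The hooks below are placeholders for a real physics backend
    # (MuJoCo, Isaac Gym, PyBullet...); they raise until you fill them in.
    def _get_observation(self):
        raise NotImplementedError
    def _simulate(self, action, params):
        raise NotImplementedError
    def _compute_reward(self, state):
        raise NotImplementedError
    def _check_termination(self, state):
        raise NotImplementedError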

The logic is lovely once it clicks. If your policy is forced to succeed across friction from 0.2 to 0.8 and masses from half to double, then it has learned a behaviour robust to any friction and any mass in those ranges -- and the real robot's true friction is just one more value inside the range it already handles. The reality gap dissolves because the policy stopped assuming there was a single "true" physics at all.

The famous demonstration is OpenAI's Rubik's Cube-solving robot hand from 2019. They randomized the simulated hand so aggressively -- physics, visuals, even simulated "gremlins" that yanked at the cube -- that the real world was, from the policy's point of view, just another mildly weird simulation. The hand didn't know it had crossed into reality. It didn't need to. Having said that, domain randomization isn't a free lunch: crank the randomization too hard and you get a policy so cautious and generic it's mediocre everywhere. Tuning that dial is half the craft.
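One way to make "the dial" literal is to express every range as a fraction of the nominal value, so a single number controls how wild the simulator gets. This is a minimal sketch under that assumption; sample_physics and its scale parameter are hypothetical names, and real projects usually tune each parameter's range separately.

```python
import numpy as np

BASE_PARAMS = {"friction": 0.5, "mass": 1.0, "gravity": 9.81}


def sample_physics(base, scale, rng):
    """Draw each parameter uniformly from base * [1 - scale, 1 + scale].

    scale = 0.0 reproduces the nominal simulator; larger values widen
    the family of worlds the policy has to cope with.
    """
    if not 0.0 <= scale < 1.0:
        raise ValueError("scale must be in [0, 1) so parameters stay positive")
    return {name: float(value * rng.uniform(1.0 - scale, 1.0 + scale))
            for name, value in base.items()}


rng = np.random.default_rng(0)
for scale in (0.0, 0.1, 0.5):
    draws = [sample_physics(BASE_PARAMS, scale, rng)["friction"]
             for _ in range(1000)]
    print(f"scale={scale:.1f}: friction spans {min(draws):.3f} .. {max(draws):.3f}")
```

A reasonable workflow is to start with a small scale, check transfer, and widen only as far as the real hardware demands.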

Robotics: reaching, grasping, walking

Robot manipulation -- picking things up, assembling, using tools -- is the poster child for real-world RL. The observation is joint angles, joint velocities, gripper state and object poses; the actions are joint torques or target positions. The tricky bit is the reward, because "pick up the object and put it there" is really several sub-tasks stacked on top of each other, and a naive single reward at the very end is far too sparse (we just felt that pain in Exercise 3).

class ManipulationReward:
    """Shaped, multi-phase reward for a pick-and-place task."""
    def __init__(self, target_pos):
        self.target_pos = np.array(target_pos)

    def compute(self, gripper_pos, object_pos, gripper_closed, object_grasped):
        reward = 0.0

        # Phase 1: get the gripper to the object.
        reach_dist = np.linalg.norm(gripper_pos - object_pos)
        reward -= 0.1 * reach_dist

        # Phase 2: close on it once you're there.
        if reach_dist < 0.05 and gripper_closed:
            reward += 1.0

        # Phase 3: carry it to the target.
        if object_grasped:
            place_dist = np.linalg.norm(object_pos - self.target_pos)
            reward -= 0.1 * place_dist
            if place_dist < 0.05:
                reward += 10.0            # task complete

        return reward

That's reward shaping from episode #112 doing real work -- gently paying the agent for approaching, then grasping, then placing, instead of leaving it to stumble on the whole sequence by luck. Nota bene the danger we discussed: shape this carelessly and the agent will happily hover near the object collecting the approach reward forever, never bothering to actually grasp. The tiered structure, where each phase only unlocks after the previous one, is what keeps it honest.
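It's worth checking by hand that the shaping really ranks the phases the way you intended. The sketch below repeats the arithmetic of ManipulationReward.compute as a standalone function and evaluates three snapshots; the positions and the target are made-up illustrative values.

```python
import numpy as np

TARGET = np.array([0.5, 0.0, 0.2])   # illustrative target position


def shaped_reward(gripper_pos, object_pos, gripper_closed, object_grasped):
    """Same three-phase arithmetic as ManipulationReward.compute above."""
    reward = 0.0
    reach_dist = np.linalg.norm(gripper_pos - object_pos)
    reward -= 0.1 * reach_dist                      # phase 1: approach
    if reach_dist < 0.05 and gripper_closed:
        reward += 1.0                               # phase 2: grasp
    if object_grasped:
        place_dist = np.linalg.norm(object_pos - TARGET)
        reward -= 0.1 * place_dist                  # phase 3: carry
        if place_dist < 0.05:
            reward += 10.0                          # task complete
    return float(reward)


obj_start = np.array([0.0, 0.0, 0.0])

# Hovering 3 cm above the object with an open gripper.
hovering = shaped_reward(obj_start + [0.0, 0.0, 0.03], obj_start, False, False)
# Gripper closed on the object, still at its starting position.
grasping = shaped_reward(obj_start, obj_start, True, True)
# Object held at the target.
placed = shaped_reward(TARGET, TARGET, True, True)

print(f"hovering {hovering:.3f} < grasping {grasping:.3f} < placed {placed:.3f}")
```

With these particular weights hovering earns at best about zero per step, so grasping strictly dominates it. The hover trap shows up when the approach term is written as a positive bonus for being close.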

Locomotion -- walking, running, climbing -- is the other big robotics arena, and here PPO (episode #109) and SAC (Soft Actor-Critic, its off-policy cousin) are the standard tools. The agent drives the joint motors and the reward is roughly "move forward, don't fall". One charming, slightly humbling result: RL-trained robots routinely discover gaits that look nothing like a human walk -- weird shuffles and lopes -- because the reward never said "walk like a person", only "go forward efficiently". The agent optimises what you actually asked for, not what you meant. Remember that; it comes back to bite people in the safety section.
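"Move forward, don't fall" usually ends up as a sum of three terms: forward velocity, a bonus for staying upright, and a penalty on effort. Here's a minimal sketch of that shape. The function name and both weights are assumptions chosen for illustration, not the reward of any specific benchmark.

```python
import numpy as np


def locomotion_reward(forward_velocity, torques, has_fallen,
                      alive_bonus=1.0, ctrl_cost_weight=0.05):
    """Forward progress + staying upright - effort. Weights are illustrative."""
    if has_fallen:
        return 0.0                     # the episode would normally end here
    effort = ctrl_cost_weight * float(np.sum(np.square(torques)))
    return forward_velocity + alive_bonus - effort


# Two gaits covering ground at the same speed; only the effort differs.
upright_stride = locomotion_reward(1.0, np.array([0.8, 0.8, 0.8, 0.8]), False)
odd_shuffle = locomotion_reward(1.0, np.array([0.3, 0.3, 0.3, 0.3]), False)

print(f"upright stride: {upright_stride:.3f}")
print(f"odd shuffle:    {odd_shuffle:.3f}")
```

Nothing in that function knows what a gait looks like, so the cheaper shuffle scores higher and that's the one the agent will learn.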

Resource management: the domain where RL quietly wins

If robotics is where RL gets the headlines, resource management is where it quietly banks the money. Scheduling, routing, allocation -- any problem where a sequence of decisions shapes your future options -- is a natural fit, and crucially these problems are often simulable and forgiving in a way that autonomous driving is not.

The classic success story is data-center cooling. Around 2016 DeepMind pointed an RL agent at the cooling systems in Google's data centers: state was temperatures, power loads and weather; actions were cooling adjustments; reward was energy efficiency subject to staying within safe temperatures. The reported result was a cut in cooling energy of up to 40% -- an eye-watering sum at Google's scale. Here's a toy version of that control problem:

class DataCenterEnv:
    """Simplified data-center cooling control."""
    def __init__(self):
        self.temperature = 22.0           # server-room temp in degrees C
        self.target_range = (18.0, 27.0)
        self.power_usage = 0.0
        self.outside_temp = 25.0

    def step(self, action):
        """action: cooling power level from 0.0 to 1.0."""
        server_heat = np.random.uniform(2.0, 5.0)     # varies with load
        cooling_effect = action * 8.0

        self.temperature += server_heat - cooling_effect \
                            + 0.1 * (self.outside_temp - self.temperature)
        self.power_usage = action

        reward = -self.power_usage         # every watt costs
        # Hard penalty for drifting out of the safe band.
        if self.temperature > self.target_range[1]:
            reward -= 10.0 * (self.temperature - self.target_range[1])
        elif self.temperature < self.target_range[0]:
            reward -= 10.0 * (self.target_range[0] - self.temperature)

        obs = np.array([self.temperature, self.outside_temp,
                        server_heat, self.power_usage])
        return obs, reward, False

Look at the reward: minimise power, but pay a steep penalty for leaving the safe temperature band. That penalty gap between "wasting a little energy" and "cooking the servers" is exactly the kind of asymmetry that resource problems have and games don't. The other everyday example is vehicle routing: an agent deciding the order a delivery van visits its stops, re-planning live as traffic shifts and new orders drop in. Static optimisers solve yesterday's problem perfectly; RL adapts to the problem as it actually unfolds.
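Before training anything on the cooling problem, it helps to know what a couple of dumb controllers score, so you have a bar to clear. The sketch below re-implements the same temperature update as DataCenterEnv.step in a standalone loop and compares "always full blast" with a hand-tuned proportional thermostat. The controller names, the setpoint and the gain are my own assumptions.

```python
import numpy as np

SAFE_LOW, SAFE_HIGH = 18.0, 27.0
OUTSIDE_TEMP = 25.0


def run_controller(controller, n_steps=1000, seed=0):
    """Roll a fixed controller through the same dynamics as DataCenterEnv.step."""
    rng = np.random.default_rng(seed)
    temperature = 22.0
    total_power, violations = 0.0, 0
    for _ in range(n_steps):
        action = float(np.clip(controller(temperature), 0.0, 1.0))
        server_heat = rng.uniform(2.0, 5.0)
        temperature += server_heat - 8.0 * action \
                       + 0.1 * (OUTSIDE_TEMP - temperature)
        total_power += action
        if not SAFE_LOW <= temperature <= SAFE_HIGH:
            violations += 1
    return total_power / n_steps, violations / n_steps


def full_blast(temperature):
    return 1.0


def proportional(temperature, setpoint=22.0, gain=0.1):
    # 0.475 balances the mean heat load (3.5) plus outside leakage at 22 C.
    return 0.475 + gain * (temperature - setpoint)


for name, ctrl in (("full blast", full_blast), ("proportional", proportional)):
    power, out_of_band = run_controller(ctrl)
    print(f"{name:>12}: mean power {power:.3f}, out-of-band fraction {out_of_band:.3f}")
```

Full blast overcools the room far below the safe band, while the thermostat stays inside it on roughly half the power. A learned agent only earns its keep if it beats the thermostat, not the strawman.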

Finance: the siren song

Financial markets look tailor-made for RL. Observe market state, take an action (buy, sell, hold), collect a reward (profit or loss). Textbook MDP, right? Here's a stripped-down trading environment to make the shape concrete:

class TradingEnvironment:
    """A deliberately simplified single-asset trading environment."""
    def __init__(self, price_data, initial_balance=10000.0):
        self.prices = price_data
        self.initial_balance = initial_balance
        self.reset()

    def reset(self):
        self.step_idx = 0
        self.balance = self.initial_balance
        self.position = 0.0               # units held
        self.portfolio_value = self.balance
        return self._get_obs()

    def step(self, action):
        """action: -1 sell everything, 0 hold, 1 buy with everything."""
        price = self.prices[self.step_idx]
        next_price = self.prices[self.step_idx + 1]

        if action == 1 and self.balance > 0:
            self.position += self.balance / price
            self.balance = 0.0
        elif action == -1 and self.position > 0:
            self.balance += self.position * price
            self.position = 0.0

        self.step_idx += 1
        self.portfolio_value = self.balance + self.position * next_price
        prev_value = self.balance + self.position * price
        reward = np.log(self.portfolio_value / prev_value + 1e-8)   # log return

        done = self.step_idx >= len(self.prices) - 2
        return self._get_obs(), reward, done

    def _get_obs(self):
        idx = self.step_idx
        window = self.prices[max(0, idx - 20):idx + 1]
        returns = np.diff(window) / window[:-1] if len(window) > 1 else [0.0]
        returns = np.pad(returns, (20 - len(returns), 0))
        pos_ratio = (self.position * self.prices[idx]) / (self.portfolio_value + 1e-8)
        return np.append(returns, pos_ratio).astype(np.float32)

And now the honest part, because I've watched too many people set fire to their savings here. Financial RL is brutally hard, and the reasons are baked into the domain, not fixable with a bigger network. Markets are non-stationary -- the regime that trained your agent is gone by the time you deploy it (this is the non-stationarity from episode #111, except nobody warned you). They're partially observable -- you never see the other players' hands. They're adversarial -- other participants react to your trades. And backtests lie: transaction costs eat theoretical profits, and market impact (your own orders moving the price) never shows up in a simulation this simple. The dirty secret of the literature is that most impressive RL-trading papers show gorgeous backtests that quietly die in live trading. Where RL does earn its keep in finance is the humbler job of execution -- filling a large order at the best average price -- rather than the glamorous job of predicting direction. I'll say no more than that; this is a tutorial about learning agents, not a place for anyone to go hunting trade signals.
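To see how quickly costs eat a backtest, here's a small sketch. It takes a strategy with perfect foresight on a price series that just oscillates and applies a flat proportional fee on every switch. The fee model is an assumption and a generous one: it still ignores slippage and market impact. The function name is hypothetical.

```python
import numpy as np


def backtest_log_return(prices, targets, cost_rate=0.0):
    """Total log return of an all-in / all-out strategy.

    targets[t] is 1 (fully invested) or 0 (all cash) for the interval
    from prices[t] to prices[t + 1]. Every switch between the two pays
    a proportional fee of cost_rate on the whole portfolio.
    """
    prices = np.asarray(prices, dtype=float)
    value, invested = 1.0, 0
    for t, target in enumerate(targets):
        if target != invested:
            value *= 1.0 - cost_rate       # fee for switching position
            invested = target
        if invested:
            value *= prices[t + 1] / prices[t]
    return float(np.log(value))


# Price ticks up 0.1%, then back down, fifty times over.
prices = [100.0, 100.1] * 50
# Perfect foresight: hold during every up-tick, sit out every down-tick.
targets = [1, 0] * 49 + [1]

free = backtest_log_return(prices, targets, cost_rate=0.0)
costly = backtest_log_return(prices, targets, cost_rate=0.001)   # 10 basis points

print(f"log return without costs: {free:+.4f}")
print(f"log return with costs:    {costly:+.4f}")
```

Even an oracle loses money here once each switch costs as much as the move it's chasing, which is the whole "backtests lie" point in two lines of output.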

The safety problem: the whole ballgame

Everything above circles one central fact. In a game, the worst outcome is a loss on a scoreboard. In the real world, an unsafe action can be irreversible -- broken equipment, an injured person, money that isn't coming back. So real-world RL needs a concept games never forced on us: safety as a hard constraint, not a soft preference. One common pattern is a safety layer that sits between the policy and the world and vetoes anything dangerous:

class SafeRLAgent:
    """Wrap a policy in a hard safety filter."""
    def __init__(self, policy, safety_checker):
        self.policy = policy
        self.safety_checker = safety_checker

    def act(self, state):
        proposed = self.policy(state)
        if self.safety_checker.is_safe(state, proposed):
            return proposed
        # Otherwise fall back to the safe action nearest the policy's wish.
        safe_actions = self.safety_checker.get_safe_actions(state)
        return self._nearest_safe_action(proposed, safe_actions)

    def _nearest_safe_action(self, preferred, safe_set):
        distances = [np.linalg.norm(preferred - safe) for safe in safe_set]
        return safe_set[np.argmin(distances)]

The idea is simple and powerful: the learning agent can be as adventurous as it likes inside a fence that a separate, trusted checker draws. The policy proposes; the safety layer disposes. More formally, the field frames this with constrained MDPs -- maximise reward subject to keeping some cost (collision probability, energy use, patient risk) below a threshold. PPO-Lagrangian is the workhorse here: it bolts Lagrange multipliers onto PPO so that violating a constraint gets automatically more expensive until the agent stops doing it. This connects straight back to the locomotion lesson -- an agent optimises what you measured, so if safety isn't in the optimisation, it doesn't exist.
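The Lagrangian part is less mysterious than it sounds. The policy optimises reward minus lambda times cost, and lambda itself is nudged up whenever the measured cost exceeds the limit. Below is only that multiplier bookkeeping, with a made-up cost sequence standing in for the policy; it is not a full PPO-Lagrangian implementation, and the function names and learning rate are assumptions.

```python
def penalized_reward(reward, cost, lam):
    """What the policy actually optimises: task reward minus priced-in cost."""
    return reward - lam * cost


def update_multiplier(lam, mean_episode_cost, cost_limit, lr=0.05):
    """Dual ascent: raise lam while the constraint is violated, relax it otherwise."""
    return max(0.0, lam + lr * (mean_episode_cost - cost_limit))


lam = 0.0
cost_limit = 1.0
# Pretend the policy starts out violating the limit, then gradually complies.
observed_costs = [5.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.5]
history = []
for cost in observed_costs:
    lam = update_multiplier(lam, cost, cost_limit)
    history.append(lam)
    print(f"cost {cost:.1f} -> lambda {lam:.3f}, "
          f"reward 1.0 becomes {penalized_reward(1.0, cost, lam):+.3f}")
```

In the real algorithm a PPO update on the penalized objective runs between multiplier updates, so the two chase each other until the cost settles at the limit.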

So what's actually deployed?

Time for a reality check, because the hype badly outruns the shipping products. The genuinely-in-production list is shorter than you'd guess:

  • Data-center cooling (Google/DeepMind) -- proven, deployed, saving real energy;
  • Chip floorplanning (Google) -- RL placing components on silicon;
  • Recommendation and ad systems (YouTube, TikTok, ad auctions) -- mostly bandits and short-horizon RL, learned from oceans of data;
  • LLM alignment (RLHF/PPO) -- yes, the assistant-tuning pipeline is reinforcement learning, and it's arguably RL's biggest commercial hit;
  • Robotics -- real, but mostly in controlled industrial cells, not your living room;
  • Game AI -- well established and genuinely shipped.

Spot the common thread? Every winner is either cheaply simulable (games, chips, cooling, language models) or data-rich enough to learn safely offline (recommendations, ads). The domains where failure is catastrophic and simulation is hard -- full autonomous driving, clinical treatment -- are exactly the ones still stuck in the lab. That's not a coincidence; it's the sim-to-real gap and the safety problem, restated as a business fact.

Pull the threads together and the shape of real-world RL is clear: simulate whatever you can, randomize it so the policy stops trusting the simulator, shape rewards to guide multi-step tasks, and wrap the whole thing in hard safety constraints because reality doesn't offer a reset button. The machinery is the same PPO, DQN and self-play we've built all series -- what changes is the discipline you have to bring to deploying it. And there's a related question we've been dancing around this whole time: we keep hand-designing reward functions, but what if the reward is the very thing we don't know how to write down? That's a thread worth pulling on soon.

In case you skimmed...

  • Sim-to-real transfer trains policies in a simulator and deploys them on real hardware, but the ever-present reality gap means a simulator-perfect policy can still fail in the flesh;
  • domain randomization closes that gap by training across thousands of randomly-perturbed physics, so the real world becomes just one more variation the policy already handles;
  • robotics uses PPO/SAC for manipulation and locomotion, leaning hard on multi-phase reward shaping -- and cheerfully finds weird, un-human solutions because it optimises what you measured, not what you meant;
  • resource management (data-center cooling, routing, scheduling) is RL's quiet success story: simulable, forgiving, and with rewards you can actually write down;
  • financial RL is seductive but treacherous -- non-stationary, partially observable, adversarial, and prone to backtests that die on contact with reality; execution beats prediction;
  • safety is the dividing line between game RL and real RL -- constrained MDPs, safe-action filters and PPO-Lagrangian exist because reality has no reset button, and production RL only thrives where failure is cheap or simulation is easy.

Exercises

Exercise 1: Take the DataCenterEnv above and train a small agent on it -- a DQN over a handful of discretized cooling levels (say 0.0, 0.25, 0.5, 0.75, 1.0) is plenty, and you already have a DQN from episode #107. Note that the environment as written has no reset() and never returns done=True, so add a reset and cap each episode at a fixed number of steps (say 200). Run it for a few thousand episodes, then plot two things over training: average power used and the fraction of steps spent outside the safe temperature band. You should watch the agent discover the classic trade-off -- cool just enough to stay in band while spending as little power as possible. Then bump the out-of-band penalty from 10.0 up to 100.0 and re-run: how does the agent's behaviour change? This is your first hands-on feel for how a reward weight encodes a priority.

Exercise 2: Simulate the reality gap yourself. Write a SimHand environment with friction = 0.5 and a RealHand with friction = 0.35, both otherwise identical (a 1D "slide the block to the target" task is enough). Train a policy on SimHand alone and measure its performance on RealHand -- it should degrade. Now train a second policy on a SimHand whose friction is re-rolled uniformly in [0.2, 0.8] every episode (domain randomization), and test that one on RealHand. Report both transfer scores. You should see the randomized policy survive the jump far better, and you'll have reproduced the core sim-to-real result on your laptop.

Exercise 3: Build the SafeRLAgent's missing safety_checker for a 1D "cliff" world: the agent lives on a line, position must stay in [0, 10], and any action that would push it past either edge is unsafe. Implement is_safe(state, action) and get_safe_actions(state), wrap a deliberately reckless random policy in the SafeRLAgent, and run it for a few hundred steps. Confirm the agent never once leaves the safe interval, no matter how badly the underlying policy misbehaves. Then log every time the safety layer had to override the policy -- that override count is exactly the kind of number a real safety engineer watches like a hawk.

Get at least the reality-gap exercise working before the next episode -- feeling a policy trained in one world stumble in a slightly different one is the single best intuition for everything that makes real-world RL hard. The next stop turns a question we kept brushing past into the main event: what happens when you can't write the reward function down at all, and have to infer it instead?

Thanks for your time -- de groeten!

@scipio


