Learn AI Series (#102) - What Is Reinforcement Learning?

avatar

Learn AI Series (#102) - What Is Reinforcement Learning?

variant-b-06-magenta.png

What will I learn

  • You will learn the RL framework: agent, environment, state, action, and reward - the five building blocks every RL system is made of;
  • how reinforcement learning differs fundamentally from supervised and unsupervised learning we've been doing for 100 episodes;
  • exploration vs exploitation: the central dilemma that makes RL uniquely difficult;
  • Markov Decision Processes (MDPs): the formal mathematical foundation for sequential decision-making;
  • cumulative reward and discounting: why a reward now is worth more than a reward later, and what gamma controls;
  • policies and value functions: the two key objects that RL algorithms learn;
  • the landscape of RL algorithms: model-free vs model-based, value vs policy methods, and where each fits;
  • building a complete grid world environment and Q-learning agent from scratch using only NumPy.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#102) - What Is Reinforcement Learning?

Solutions to Episode #101 Exercises

Exercise 1: Voice activity detection pipeline -- generate 5s of audio (sr=16000) with speech segments at 0.5-1.5s and 2.5-4.0s (multi-harmonic tones with amplitude envelope) and silence+noise elsewhere. Compute per-frame RMS energy in 25ms windows with 10ms hop. Set an adaptive threshold at 3x the median frame energy. Frames above threshold are classified as speech, below as silence. Applying a minimum-duration filter (merge gaps shorter than 100ms, discard speech segments shorter than 50ms) cleans up the result. Detection accuracy: the VAD correctly identifies both speech regions with onset/offset errors under 30ms. The energy ratio between speech frames and silence frames is typically 15-25 dB, making the threshold relatively easy to set -- the harder problem (in production) is distinguishing speech from non-speech noise like music or traffic, which this simple energy-based approach can't do.

Exercise 2: Multi-wake-word detector -- build a system that recognizes 3 wake words ("hey computer", "okay device", "listen up") using template matching against pre-computed MFCC templates. Generate 13-coefficient MFCCs for each wake word (synthesized as distinct frequency patterns: wake1 = [200, 400, 800] Hz, wake2 = [300, 500, 700] Hz, wake3 = [150, 350, 900] Hz). For each test utterance, compute DTW distance against all 3 templates (using Euclidean frame distance and cumulative cost matrix). The wake word with minimum DTW distance below threshold wins. Testing with 5 samples per wake word at 3 noise levels (clean, SNR=10dB, SNR=5dB): clean accuracy is 100%, SNR=10dB drops to 80-90%, SNR=5dB drops to 50-70%. The DTW distances for correct matches average 2-3x lower than cross-template distances, providing clear separation at low noise. At high noise the distance distributions overlap, causing misclassifications.

Exercise 3: End-to-end voice command executor -- combine VAD, wake word detection, intent classification, and slot filling into a single pipeline. Generate a 10s audio stream with 4 command utterances embedded at different times. Pipeline stages: (1) VAD segments the audio into speech chunks using energy thresholding, (2) the first chunk is checked for wake word match via DTW, (3) if wake word detected, the next chunk is classified into one of 5 intents (weather, timer, music, lights, search) using keyword overlap scoring from #98, (4) slot filling extracts parameters (city name, duration, song name) via pattern matching. Processing the 4-command stream: VAD finds all 4 segments (2 true commands after wake words, 2 background speech to reject). Wake word detection correctly gates commands 1 and 3 (which start with "hey computer") and correctly rejects commands 2 and 4 (no wake word). Intent classification on the accepted commands achieves 100% on the clean versions. The full pipeline latency (per command) is dominated by DTW wake word matching, which scales as O(T*R) where T is the template length and R is the speech segment length.

On to today's episode

Alright. One hundred and two. We're entering completely new territory with this one, and I'm genuinely excited about it because reinforcement learning is probably the area of AI that feels the most... alive.

Everything we've built over the last 101 episodes -- classifiers, neural networks, language models, vision systems, audio pipelines -- shares a common pattern. You provide training data. The model learns patterns from that data. Then it applies those patterns to new inputs. The data tells the model what's correct. Whether it's a labeled image dataset, a corpus of text, or paired audio samples, there's always a teacher providing the right answers.

Reinforcement learning flips this completely on its head. There's no dataset. There's no teacher showing the correct answer. Instead, an agent takes actions in an environment, receives rewards (or penalties), and learns through trial and error what works and what doesn't. Nobody tells it the right move. It discovers the right move by playing the game thousands or millions of times.

This is how dogs learn tricks -- try something, get a treat (or don't), adjust behavior accordingly. It's how children learn to walk -- fall down, get back up, fall less next time. And it's how AlphaGo learned to beat the world champion at Go -- by playing against itself literally billions of times until it discovered strategies that no human had ever considered. The supervised learning approach we used in episodes #1-101 would require someone to label every possible board position with the "correct" move. For Go, with more possible board positions than atoms in the universe, that's obviously impossible. RL sidesteps the problem entirely ;-)

The RL framework

Every reinforcement learning problem has the same five elements:

  1. Agent: the learner and decision-maker (your AI)
  2. Environment: everything the agent interacts with (the game, the robot's world, the market)
  3. State (s): what the agent observes right now (board position, sensor readings, current portfolio)
  4. Action (a): what the agent can do (move left, buy, fire thruster)
  5. Reward (r): the feedback signal (points scored, profit made, distance traveled)

The interaction loop is deceptively simple:

Agent observes state s_t
Agent chooses action a_t
Environment transitions to state s_{t+1}
Environment gives reward r_t
Repeat

That's it. That's the entire framework. Every RL problem -- from Atari games to self-driving cars to protein folding -- follows this exact loop. The complexity is in the details: how large is the state space? How does the agent choose actions? How does it learn from the rewards it receives?

Let's build this concretly with a tiny grid world. The agent starts in the top-left corner and needs to reach the bottom-right corner:

import numpy as np


class SimpleGridWorld:
    """4x4 grid: agent starts top-left,
    goal is bottom-right."""

    def __init__(self, size=4):
        self.size = size
        self.state = (0, 0)
        self.goal = (size - 1, size - 1)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """Take action: 0=up, 1=right,
        2=down, 3=left.
        Returns: (next_state, reward, done)"""
        row, col = self.state
        if action == 0:
            row = max(0, row - 1)
        elif action == 1:
            col = min(self.size - 1, col + 1)
        elif action == 2:
            row = min(self.size - 1, row + 1)
        elif action == 3:
            col = max(0, col - 1)

        self.state = (row, col)

        if self.state == self.goal:
            return self.state, 1.0, True
        else:
            return self.state, -0.01, False

    @property
    def n_states(self):
        return self.size * self.size

    @property
    def n_actions(self):
        return 4


# Quick test
env = SimpleGridWorld()
state = env.reset()
print(f"Start: {state}")
state, reward, done = env.step(1)  # right
print(f"After right: {state}, "
      f"reward={reward}")
state, reward, done = env.step(2)  # down
print(f"After down: {state}, "
      f"reward={reward}")

The agent gets +1 for reaching the goal and -0.01 for every step taken. This encourages finding the shortest path -- taking 100 steps to reach the goal earns 1.0 - 0.99 = 0.01, while taking 6 steps earns 1.0 - 0.05 = 0.95. The agent isn't told which direction to go. It discovers through experience that going right and down is better than going left and up. Nobody programs the solution. The agent finds it.

How RL differs from supervised learning

In supervised learning (episodes #10-16), you have a dataset of input-output pairs: (image, label), (text, translation), (features, price). The model minimizes the difference between its predictions and the known correct answers. The data is static and pre-collected. The correct answer exists before the model runs.

In RL, there are no correct answers upfront. The agent generates its own training data by interacting with the environment. And crucially -- its actions affect what data it sees next. Choose to go left, and you'll see different states than if you chose right. This creates a fundamentally different learning problem:

class LearningComparison:
    """Show the key difference between
    supervised and RL learning."""

    def __init__(self):
        self.rng = np.random.RandomState(42)

    def supervised_demo(self):
        """Supervised: data is fixed,
        answers are known."""
        X = self.rng.randn(100, 2)
        y = (X[:, 0] + X[:, 1] > 0).astype(int)

        # Model sees ALL data upfront
        print("=== Supervised Learning ===")
        print(f"Dataset: {len(X)} samples")
        print(f"Labels known: {len(y)}")
        print(f"Data changes during training: No")
        print(f"Feedback: immediate, per-sample")

    def rl_demo(self):
        """RL: data depends on actions,
        feedback is delayed."""
        env = SimpleGridWorld()
        total_steps = 0
        rewards_seen = []

        print("\n=== Reinforcement Learning ===")
        for ep in range(3):
            state = env.reset()
            ep_reward = 0
            steps = 0
            done = False
            while not done and steps < 20:
                action = self.rng.randint(4)
                state, reward, done = env.step(
                    action)
                ep_reward += reward
                steps += 1
                total_steps += 1
            rewards_seen.append(ep_reward)

        print(f"Episodes run: 3")
        print(f"Total steps: {total_steps}")
        print(f"Data changes during training: "
              f"Yes (agent controls it)")
        print(f"Feedback: delayed, per-episode")
        print(f"Rewards: "
              f"{[f'{r:.2f}' for r in rewards_seen]}")

    def run(self):
        self.supervised_demo()
        self.rl_demo()


comp = LearningComparison()
comp.run()

The credit assignment problem is particularly tricky and something supervised learning never has to deal with. In chess, you might make a brilliant move on turn 5 that only pays off on turn 40. The reward arrives 35 turns later -- which of the 35 intervening actions deserves credit for the final outcome? In supervised learning, every input has an immediate label. In RL, you often don't know which specific action was the good one until much, much later. This is one of the things that makes RL fundamentally harder than supervised learning (and also more interesting, if you ask me).

Exploration vs exploitation

You're in a new city. Hungry. You found a decent restaurant yesterday -- solid 7 out of 10. Do you go back to the safe choice (exploit what you already know works), or try somewhere new that might be a 9/10 or a 3/10 (explore the unknown)?

This is the exploration-exploitation dilemma -- the central tension in all of reinforcement learning. Too much exploitation and you get stuck with suboptimal strategies because you never discover better ones. Too much exploration and you waste time on random actions when you already know what works. The balance between the two is everything in RL.

The simplest solution is epsilon-greedy: with probability epsilon, take a random action (explore); otherwise, take the best known action (exploit):

class EpsilonGreedyAgent:
    """Explore with probability epsilon,
    exploit otherwise."""

    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.q_values = {}
        self.rng = np.random.RandomState(42)

    def choose_action(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.randint(
                self.n_actions)
        else:
            values = self.q_values.get(
                state,
                np.zeros(self.n_actions))
            return int(np.argmax(values))

    def update(self, state, action, reward,
               next_state, alpha=0.1,
               gamma=0.99):
        """Q-learning update."""
        if state not in self.q_values:
            self.q_values[state] = np.zeros(
                self.n_actions)
        if next_state not in self.q_values:
            self.q_values[next_state] = (
                np.zeros(self.n_actions))

        current_q = self.q_values[state][action]
        max_next_q = np.max(
            self.q_values[next_state])
        target = reward + gamma * max_next_q
        self.q_values[state][action] += (
            alpha * (target - current_q))


# Compare different epsilon values
env = SimpleGridWorld()
print("=== Exploration vs Exploitation ===")

for eps in [0.0, 0.1, 0.3, 1.0]:
    agent = EpsilonGreedyAgent(4, epsilon=eps)
    total_rewards = []

    for episode in range(500):
        state = env.reset()
        ep_reward = 0
        for step in range(50):
            action = agent.choose_action(state)
            next_s, reward, done = env.step(
                action)
            agent.update(state, action,
                         reward, next_s)
            ep_reward += reward
            state = next_s
            if done:
                break
        total_rewards.append(ep_reward)

    last_100 = np.mean(total_rewards[-100:])
    print(f"  eps={eps:.1f}: "
          f"avg reward (last 100)="
          f"{last_100:.3f}, "
          f"states visited="
          f"{len(agent.q_values)}")

With epsilon=0.0, the agent never explores -- it picks the first action that worked and sticks with it, missing potentially better strategies. With epsilon=1.0, the agent always explores randomly and never exploits what it's learned. The sweet spot (usually around 0.1) lets the agent mostly exploit its best known strategy while occasionally trying something new.

A common improvement is epsilon decay: start with high exploration (epsilon=1.0) and gradually reduce it over time. Explore a lot early when you know nothing, exploit more later when you've learned the environment. We'll see more sophisticated exploration strategies in the next few episodes.

Markov Decision Processes (MDPs)

The formal mathematical framework for RL is the Markov Decision Process. An MDP is defined by five elements:

  • S: the set of all possible states
  • A: the set of all possible actions
  • P(s'|s,a): transition function -- the probability of reaching state s' when taking action a in state s
  • R(s,a): reward function -- expected reward for taking action a in state s
  • gamma: discount factor (0 to 1) -- how much future rewards are worth

The Markov property is the key assumption that makes everything tractable: the future depends only on the current state, not on how you got there. The state must contain all relevant information for making decisions. In chess, the current board position is all you need -- the sequence of moves that led to it is irrelevant for deciding the next move.

class ExplicitMDP:
    """MDP with explicit transition
    probabilities for a 3-state example."""

    def __init__(self):
        self.n_states = 3
        self.n_actions = 2
        self.gamma = 0.9

        # State 0: start
        # State 1: intermediate
        # State 2: goal (terminal)
        # Action 0: "safe", Action 1: "risky"

        # transitions[s][a] = [(prob, next_s)]
        self.transitions = {
            0: {
                0: [(1.0, 1)],
                1: [(0.7, 2), (0.3, 0)],
            },
            1: {
                0: [(1.0, 2)],
                1: [(0.5, 2), (0.5, 0)],
            },
            2: {
                0: [(1.0, 2)],
                1: [(1.0, 2)],
            },
        }

        # rewards[s][a]
        self.rewards = {
            0: {0: 1.0, 1: 5.0},
            1: {0: 2.0, 1: 10.0},
            2: {0: 0.0, 1: 0.0},
        }

    def simulate(self, policy, n_episodes=1000):
        """Run policy and measure
        average return."""
        rng = np.random.RandomState(42)
        returns = []

        for _ in range(n_episodes):
            state = 0
            total = 0.0
            discount = 1.0

            for _ in range(20):
                if state == 2:
                    break
                action = policy[state]
                reward = self.rewards[state][
                    action]
                total += discount * reward
                discount *= self.gamma

                # Sample next state
                trans = self.transitions[
                    state][action]
                probs = [t[0] for t in trans]
                states = [t[1] for t in trans]
                idx = rng.choice(
                    len(trans), p=probs)
                state = states[idx]

            returns.append(total)

        return np.mean(returns)

    def run(self):
        policies = {
            "Always safe": {0: 0, 1: 0, 2: 0},
            "Always risky": {0: 1, 1: 1, 2: 0},
            "Safe then risky": {
                0: 0, 1: 1, 2: 0},
            "Risky then safe": {
                0: 1, 1: 0, 2: 0},
        }

        print("=== MDP Policy Comparison ===")
        for name, policy in policies.items():
            avg_return = self.simulate(policy)
            print(f"  {name}: "
                  f"avg return = "
                  f"{avg_return:.2f}")


mdp = ExplicitMDP()
mdp.run()

This is a small enough MDP that we can enumerate all possible policies and compare them directly. The "risky" action gives higher rewards but has a chance of sending the agent back to the start. The optimal policy depends on the discount factor -- with high gamma (patient agent), the risky approach is worth it because the expected value is higher even accounting for failures. With low gamma (impatient agent), the safe path wins because the agent strongly prefers guaranteed sooner rewards.

Not all environments are fully Markov, by the way. In many real problems, the agent sees only a partial observation (this is called a POMDP -- Partially Observable MDP). In a card game, you see your hand but not your opponent's cards. In robotics, sensors have limited range. The Markov assumption is often good enough as an approximation, and you can work around partial observability by stacking recent observations to create a more complete state representation -- something we'll see when we get to deep RL in a few episodes.

Cumulative reward and discounting

The agent's objective is to maximize cumulative reward -- the total reward collected over the entire episode, not just the immediate reward at any single step. This is what separates RL from greedy optimization:

G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + gamma^3 * r_{t+3} + ...

The discount factor gamma determines how much the agent values future rewards versus immediate ones:

class DiscountExplorer:
    """Explore how gamma affects
    decision-making."""

    def __init__(self):
        pass

    def compute_return(self, rewards,
                       gamma=0.99):
        """Compute discounted return from
        a sequence of rewards."""
        G = 0.0
        returns = []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        return returns

    def run(self):
        # Scenario: reward of 1.0 arrives
        # after N steps
        print("=== Discount Factor Effects ===")
        print(f"{'gamma':<8}"
              f"{'1 step':>10}"
              f"{'5 steps':>10}"
              f"{'10 steps':>10}"
              f"{'50 steps':>10}"
              f"{'100 steps':>10}")

        for gamma in [0.5, 0.9, 0.95, 0.99, 1.0]:
            values = []
            for delay in [1, 5, 10, 50, 100]:
                rewards = [0.0] * delay + [1.0]
                returns = self.compute_return(
                    rewards, gamma)
                values.append(returns[0])
            print(f"{gamma:<8}"
                  + "".join(
                      f"{v:>10.4f}"
                      for v in values))

        # Two scenarios
        print("\n=== Patience Test ===")
        # Option A: small reward soon
        option_a = [0, 0, 1, 0, 0]
        # Option B: big reward later
        option_b = [0, 0, 0, 0, 5]

        for gamma in [0.5, 0.9, 0.99]:
            ga = self.compute_return(
                option_a, gamma)[0]
            gb = self.compute_return(
                option_b, gamma)[0]
            winner = ("A (soon)" if ga > gb
                      else "B (later)")
            print(f"  gamma={gamma}: "
                  f"A={ga:.3f}, B={gb:.3f}"
                  f" -> {winner}")


disc = DiscountExplorer()
disc.run()

The discount factor serves two purposes. Mathematically, it ensures the infinite sum converges (any geometric series with gamma < 1 converges). Practically, it encodes a preference for sooner rewards -- a dollar today really is worth more than a dollar next year, and this is true in most real-world settings. Most RL applications use gamma between 0.95 and 0.99. The "patience test" in the code above shows exactly where the crossover happens: impatient agents (low gamma) prefer the small-soon reward, patient agents (high gamma) prefer the large-later reward. This is not just a mathematical curiosity -- it fundamentally changes what strategies the agent learns ;-)

Policies and value functions

A policy (pi) is the agent's strategy -- it maps states to actions. It can be deterministic ("in state X, always do action Y") or stochastic ("in state X, do action Y with 70% probability and action Z with 30%"). The entire point of RL is to find a good policy.

A value function V(s) tells you how good a state is -- the expected cumulative reward from that state onward, following a given policy. An action-value function Q(s,a) tells you how good a specific action is in a specific state. If you know Q(s,a) for all states and actions, the optimal policy is trivial: in every state, just pick the action with the highest Q value.

The entire challenge of RL is learning good estimates of these value functions. Let's see this in action:

class ValueFunctionDemo:
    """Compute value functions for our
    grid world via Monte Carlo."""

    def __init__(self, size=4, gamma=0.99):
        self.size = size
        self.gamma = gamma
        self.env = SimpleGridWorld(size)

    def random_policy_action(self):
        return np.random.randint(4)

    def estimate_values(self, n_episodes=5000):
        """Estimate V(s) for random policy
        using Monte Carlo averaging."""
        state_returns = {}

        for _ in range(n_episodes):
            state = self.env.reset()
            episode = []

            for _ in range(100):
                action = (
                    self.random_policy_action())
                next_s, reward, done = (
                    self.env.step(action))
                episode.append(
                    (state, reward))
                state = next_s
                if done:
                    episode.append(
                        (state, 0.0))
                    break

            # Compute returns backwards
            G = 0.0
            for i in range(
                    len(episode) - 1, -1, -1):
                s, r = episode[i]
                G = r + self.gamma * G
                if s not in state_returns:
                    state_returns[s] = []
                state_returns[s].append(G)

        # Average returns per state
        V = {}
        for s, returns in (
                state_returns.items()):
            V[s] = np.mean(returns)
        return V

    def run(self):
        V = self.estimate_values()

        print("=== Value Function "
              "(Random Policy) ===")
        print("V(s) for each grid cell:\n")
        for row in range(self.size):
            for col in range(self.size):
                state = (row, col)
                v = V.get(state, 0.0)
                print(f"{v:>7.3f}", end=" ")
            print()

        print(f"\nGoal {self.env.goal}: "
              f"V={V.get(self.env.goal, 0):.3f}")
        print(f"Start (0,0): "
              f"V={V.get((0,0), 0):.3f}")
        print(f"States with estimates: "
              f"{len(V)}")


vf = ValueFunctionDemo()
vf.run()

Even under a completely random policy, the value function reveals structure: states closer to the goal have higher values because the expected reward is higher. States in the corner opposite the goal have the lowest values. This is the foundation of all value-based RL -- compute (or estimate) these values, then use them to make better decisions. The Monte Carlo method we used here is one of the simplest approaches: just average the actual returns observed from each state across many episodes. We'll cover more sophisticated methods (temporal difference learning, function approximation) in the comming episodes.

The RL algorithm landscape

There are many RL algorithms, but they all fit into a handful of categories. Understanding the landscape helps you pick the right approach for a given problem:

class RLLandscape:
    """Map the RL algorithm families with
    simple demonstrations."""

    def __init__(self):
        self.rng = np.random.RandomState(42)

    def value_based_demo(self):
        """Value-based: learn Q(s,a),
        derive policy from it."""
        print("=== Value-Based Methods ===")
        print("Learn Q(s,a) for all "
              "state-action pairs")
        print("Policy: argmax_a Q(s,a)")
        print("Examples: Q-Learning, DQN, "
              "Double DQN")
        print("Best for: discrete actions, "
              "clear state values")

        # Mini Q-table
        Q = np.zeros((3, 2))
        Q[0] = [0.5, 0.8]
        Q[1] = [0.9, 0.3]
        Q[2] = [0.1, 0.7]

        print("\nSample Q-table:")
        print(f"  {'State':<8}{'Act 0':>8}"
              f"{'Act 1':>8}{'Best':>6}")
        for s in range(3):
            best = int(np.argmax(Q[s]))
            print(f"  s={s:<6}{Q[s,0]:>8.1f}"
                  f"{Q[s,1]:>8.1f}"
                  f"{best:>6}")

    def policy_based_demo(self):
        """Policy-based: directly learn
        pi(a|s)."""
        print("\n=== Policy-Based Methods ===")
        print("Directly parameterize the "
              "policy pi(a|s)")
        print("Learn via policy gradient: "
              "increase prob of good actions")
        print("Examples: REINFORCE, PPO, "
              "TRPO")
        print("Best for: continuous actions, "
              "stochastic policies")

        # Stochastic policy example
        probs = np.array([
            [0.7, 0.3],
            [0.2, 0.8],
            [0.5, 0.5],
        ])
        print("\nSample policy pi(a|s):")
        print(f"  {'State':<8}{'P(a=0)':>8}"
              f"{'P(a=1)':>8}")
        for s in range(3):
            print(f"  s={s:<6}"
                  f"{probs[s,0]:>8.1f}"
                  f"{probs[s,1]:>8.1f}")

    def actor_critic_demo(self):
        """Actor-Critic: learn both."""
        print("\n=== Actor-Critic Methods ===")
        print("Actor: learns the policy "
              "(which actions to take)")
        print("Critic: learns the value "
              "function (how good states are)")
        print("Critic helps reduce variance "
              "of actor's gradient estimates")
        print("Examples: A2C, A3C, SAC, TD3")
        print("Best for: combining strengths "
              "of both approaches")

    def model_based_demo(self):
        """Model-based: learn environment
        dynamics, then plan."""
        print("\n=== Model-Based Methods ===")
        print("Learn a model of the "
              "environment: P(s'|s,a)")
        print("Use the model to plan "
              "without real interaction")
        print("Examples: Dyna-Q, MuZero, "
              "World Models")
        print("Best for: sample efficiency, "
              "simulation")

        # Simple learned model
        model = {
            ((0, 0), 1): ((0, 1), -0.01),
            ((0, 0), 2): ((1, 0), -0.01),
            ((0, 1), 1): ((0, 2), -0.01),
        }
        print("\nLearned model predictions:")
        for (s, a), (ns, r) in model.items():
            dirs = ["up", "right",
                    "down", "left"]
            print(f"  {s} + {dirs[a]} "
                  f"-> {ns}, r={r}")

    def run(self):
        self.value_based_demo()
        self.policy_based_demo()
        self.actor_critic_demo()
        self.model_based_demo()


landscape = RLLandscape()
landscape.run()

The key tradeoff between these families: model-free methods (value-based and policy-based) are simpler and work well when you can afford millions of environment interactions, but they're sample-inefficient -- they need a LOT of experience. Model-based methods are much more sample-efficient (because they can plan using the learned model instead of interacting with the real environment), but they add the complexity of learning an accurate model and the risk of planning with a wrong model. We'll dig deep into each family over the next several episodes.

Putting it all together: Q-learning on grid world

Now let's train our agent properly and watch it learn. Q-learning (which we'll derive formally in a later episode) is the classic value-based algorithm: it learns Q(s,a) values and derives the policy from them:

class QLearningDemo:
    """Complete Q-learning on grid world
    with training analysis."""

    def __init__(self, size=4, gamma=0.99,
                 alpha=0.1, epsilon=0.2):
        self.env = SimpleGridWorld(size)
        self.agent = EpsilonGreedyAgent(
            4, epsilon=epsilon)
        self.gamma = gamma
        self.alpha = alpha

    def train(self, n_episodes=1000):
        rewards_history = []
        steps_history = []

        for episode in range(n_episodes):
            state = self.env.reset()
            total_reward = 0
            steps = 0

            for step in range(100):
                action = (
                    self.agent.choose_action(
                        state))
                next_s, reward, done = (
                    self.env.step(action))
                self.agent.update(
                    state, action, reward,
                    next_s, self.alpha,
                    self.gamma)
                total_reward += reward
                state = next_s
                steps += 1
                if done:
                    break

            rewards_history.append(
                total_reward)
            steps_history.append(steps)

        return rewards_history, steps_history

    def show_policy(self):
        arrows = ["^", ">", "v", "<"]
        print("\nLearned policy:")
        for row in range(self.env.size):
            for col in range(self.env.size):
                state = (row, col)
                if state == self.env.goal:
                    print(" G", end=" ")
                    continue
                values = (
                    self.agent.q_values.get(
                        state, np.zeros(4)))
                best = int(np.argmax(values))
                print(f" {arrows[best]}",
                      end=" ")
            print()

    def show_q_values(self):
        print("\nQ-values (max per state):")
        for row in range(self.env.size):
            for col in range(self.env.size):
                state = (row, col)
                values = (
                    self.agent.q_values.get(
                        state, np.zeros(4)))
                print(f"{np.max(values):>7.3f}",
                      end=" ")
            print()

    def run(self):
        rewards, steps = self.train(1000)

        # Report learning progress
        print("=== Q-Learning Training ===")
        for i in [0, 100, 300, 500, 900]:
            window = rewards[i:i+100]
            print(
                f"  Episodes {i}-{i+99}: "
                f"avg reward="
                f"{np.mean(window):.3f}, "
                f"avg steps="
                f"{np.mean(steps[i:i+100]):.1f}")

        self.show_policy()
        self.show_q_values()


demo = QLearningDemo()
demo.run()

After 1000 episodes, the agent has learned to navigate efficiently -- arrows should point right and down along the optimal path, and the Q-values should be highest near the goal and decrease smoothly toward the start. The agent discovered this strategy purely from reward signals, without anyone programming the solution. That's the magic of RL.

Real-world RL: beyond grid worlds

Our grid world is a toy example, obviously, but the same framework scales to problems that are genuinely hard. AlphaGo and AlphaZero (DeepMind) used RL combined with Monte Carlo tree search to master Go, chess, and shogi -- all from self-play, no human game databases needed. OpenAI Five played Dota 2 at the professional level using a scaled-up version of PPO (Proximal Policy Optimization). Robotics labs use RL to teach robots to walk, grasp objects, and navigate without explicit programming of every motion. RLHF (Reinforcement Learning from Human Feedback) is how language models we discussed in episode #61 are aligned with human preferences -- the reward signal comes from human evaluators rating model outputs.

The gap between our grid world and these applications is mostly about scale: larger state spaces, continuous actions, deeper neural networks as function approximators, and vastly more compute. The fundamental framework -- agent, environment, state, action, reward, policy, value function -- stays exactly the same.

Samengevat

  • Reinforcement learning is learning by interaction: an agent takes actions in an environment, receives rewards, and adjusts its strategy -- no labeled dataset required, no teacher showing the correct answer;
  • the RL framework has five elements: agent, environment, state, action, and reward -- every RL problem from Atari to robotics fits this same loop;
  • RL differs from supervised learning in three fundamental ways: the agent generates its own data, feedback is delayed (credit assignment problem), and the data distribution changes as the agent's policy improves;
  • the exploration-exploitation dilemma is the central challenge: explore to discover new strategies vs exploit what already works -- epsilon-greedy is the simplest balance, and epsilon decay improves it;
  • Markov Decision Processes formalize RL with states, actions, transition probabilities, rewards, and a discount factor -- the Markov property assumes the future depends only on the current state;
  • the discount factor gamma controls how much the agent values future rewards -- typical values are 0.95-0.99, and it fundamentally changes what strategies the agent discovers;
  • policies map states to actions; value functions estimate expected cumulative reward; learning good value functions (V or Q) is the core problem in most RL;
  • the RL landscape spans value-based (Q-learning, DQN), policy-based (REINFORCE, PPO), actor-critic (A2C, SAC), and model-based (Dyna-Q, MuZero) methods -- each with different tradeoffs between sample efficiency and simplicity.

We've just scratched the surface. The next few episodes will go deep into each of these algorithm families, starting with the mathematical foundations and then building up to the core algorithms that modern RL is built on. There's a lot of interesting ground to cover and it only gets better from here.

Exercises

Exercise 1: Build a windy grid world environment. Create a class WindyGridWorld (7x10 grid, start=(3,0), goal=(3,7)) where columns 3-8 have an upward "wind" that pushes the agent up by 1 cell after each action (columns 6-7 push up by 2 cells). Actions: 4 cardinal directions. Implement reset() and step(action) with -1 reward per step, 0 at goal. Train a Q-learning agent (epsilon=0.1, alpha=0.5, gamma=1.0) for 500 episodes. Print: (a) the learned policy as arrows, (b) the average steps-to-goal for the last 50 episodes, (c) the number of states visited during training. The optimal path should curve to compensate for the wind.

Exercise 2: Build an epsilon decay comparison. Create a class EpsilonDecayExperiment that trains 4 agents on our 4x4 grid world, each with a different epsilon schedule: (a) fixed epsilon=0.1, (b) linear decay from 1.0 to 0.01 over 1000 episodes, (c) exponential decay epsilon = 0.99^episode, (d) fixed epsilon=0.5. Train each for 1000 episodes (alpha=0.1, gamma=0.99). For each agent, track and print: average reward per 100-episode block, total exploration actions taken, final policy quality (measured as average reward over 100 greedy-only test episodes at the end). The linear and exponential decay agents should outperform both fixed schedules.

Exercise 3: Build a stochastic environment comparison. Create a class StochasticGridWorld identical to SimpleGridWorld but where actions succeed with probability p_success and with probability (1-p_success) a random different action is taken instead (simulating noisy actuators). Train Q-learning agents (epsilon=0.1, alpha=0.1, gamma=0.99) on three versions: p_success=1.0 (deterministic), p_success=0.8, p_success=0.5. For each, run 2000 episodes and report: (a) learning curve (average reward per 200-episode block), (b) final policy (as arrows), (c) the Q-value spread -- max(Q) minus min(Q) averaged across states. The deterministic environment should converge fastest with the highest Q-value spread; stochastic environments should converge slower with compressed Q-values because the agent can't fully control its trajectory.

Bedankt en tot de volgende keer!

@scipio



0
0
0.000
0 comments