Learn AI Series (#102) - What Is Reinforcement Learning?
Learn AI Series (#102) - What Is Reinforcement Learning?

What will I learn
- You will learn the RL framework: agent, environment, state, action, and reward - the five building blocks every RL system is made of;
- how reinforcement learning differs fundamentally from supervised and unsupervised learning we've been doing for 100 episodes;
- exploration vs exploitation: the central dilemma that makes RL uniquely difficult;
- Markov Decision Processes (MDPs): the formal mathematical foundation for sequential decision-making;
- cumulative reward and discounting: why a reward now is worth more than a reward later, and what gamma controls;
- policies and value functions: the two key objects that RL algorithms learn;
- the landscape of RL algorithms: model-free vs model-based, value vs policy methods, and where each fits;
- building a complete grid world environment and Q-learning agent from scratch using only NumPy.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing
- Learn AI Series (#87) - 3D Vision
- Learn AI Series (#88) - Face Analysis
- Learn AI Series (#89) - Medical and Scientific Imaging
- Learn AI Series (#90) - Self-Supervised Learning for Vision
- Learn AI Series (#91) - Mini Project - Building a Visual AI System
- Learn AI Series (#92) - Audio Fundamentals for AI
- Learn AI Series (#93) - Speech Recognition
- Learn AI Series (#94) - Text-to-Speech (TTS)
- Learn AI Series (#95) - Audio Classification
- Learn AI Series (#96) - Music Generation
- Learn AI Series (#97) - Speaker Recognition and Diarization
- Learn AI Series (#98) - Natural Language Understanding for Voice
- Learn AI Series (#99) - Audio Enhancement
- Learn AI Series (#100) - Multimodal Audio-Visual Models
- Learn AI Series (#101) - Mini Project: Voice-Controlled AI Assistant
- Learn AI Series (#102) - What Is Reinforcement Learning? (this post)
Learn AI Series (#102) - What Is Reinforcement Learning?
Solutions to Episode #101 Exercises
Exercise 1: Voice activity detection pipeline -- generate 5s of audio (sr=16000) with speech segments at 0.5-1.5s and 2.5-4.0s (multi-harmonic tones with amplitude envelope) and silence+noise elsewhere. Compute per-frame RMS energy in 25ms windows with 10ms hop. Set an adaptive threshold at 3x the median frame energy. Frames above threshold are classified as speech, below as silence. Applying a minimum-duration filter (merge gaps shorter than 100ms, discard speech segments shorter than 50ms) cleans up the result. Detection accuracy: the VAD correctly identifies both speech regions with onset/offset errors under 30ms. The energy ratio between speech frames and silence frames is typically 15-25 dB, making the threshold relatively easy to set -- the harder problem (in production) is distinguishing speech from non-speech noise like music or traffic, which this simple energy-based approach can't do.
Exercise 2: Multi-wake-word detector -- build a system that recognizes 3 wake words ("hey computer", "okay device", "listen up") using template matching against pre-computed MFCC templates. Generate 13-coefficient MFCCs for each wake word (synthesized as distinct frequency patterns: wake1 = [200, 400, 800] Hz, wake2 = [300, 500, 700] Hz, wake3 = [150, 350, 900] Hz). For each test utterance, compute DTW distance against all 3 templates (using Euclidean frame distance and cumulative cost matrix). The wake word with minimum DTW distance below threshold wins. Testing with 5 samples per wake word at 3 noise levels (clean, SNR=10dB, SNR=5dB): clean accuracy is 100%, SNR=10dB drops to 80-90%, SNR=5dB drops to 50-70%. The DTW distances for correct matches average 2-3x lower than cross-template distances, providing clear separation at low noise. At high noise the distance distributions overlap, causing misclassifications.
Exercise 3: End-to-end voice command executor -- combine VAD, wake word detection, intent classification, and slot filling into a single pipeline. Generate a 10s audio stream with 4 command utterances embedded at different times. Pipeline stages: (1) VAD segments the audio into speech chunks using energy thresholding, (2) the first chunk is checked for wake word match via DTW, (3) if wake word detected, the next chunk is classified into one of 5 intents (weather, timer, music, lights, search) using keyword overlap scoring from #98, (4) slot filling extracts parameters (city name, duration, song name) via pattern matching. Processing the 4-command stream: VAD finds all 4 segments (2 true commands after wake words, 2 background speech to reject). Wake word detection correctly gates commands 1 and 3 (which start with "hey computer") and correctly rejects commands 2 and 4 (no wake word). Intent classification on the accepted commands achieves 100% on the clean versions. The full pipeline latency (per command) is dominated by DTW wake word matching, which scales as O(T*R) where T is the template length and R is the speech segment length.
On to today's episode
Alright. One hundred and two. We're entering completely new territory with this one, and I'm genuinely excited about it because reinforcement learning is probably the area of AI that feels the most... alive.
Everything we've built over the last 101 episodes -- classifiers, neural networks, language models, vision systems, audio pipelines -- shares a common pattern. You provide training data. The model learns patterns from that data. Then it applies those patterns to new inputs. The data tells the model what's correct. Whether it's a labeled image dataset, a corpus of text, or paired audio samples, there's always a teacher providing the right answers.
Reinforcement learning flips this completely on its head. There's no dataset. There's no teacher showing the correct answer. Instead, an agent takes actions in an environment, receives rewards (or penalties), and learns through trial and error what works and what doesn't. Nobody tells it the right move. It discovers the right move by playing the game thousands or millions of times.
This is how dogs learn tricks -- try something, get a treat (or don't), adjust behavior accordingly. It's how children learn to walk -- fall down, get back up, fall less next time. And it's how AlphaGo learned to beat the world champion at Go -- by playing against itself literally billions of times until it discovered strategies that no human had ever considered. The supervised learning approach we used in episodes #1-101 would require someone to label every possible board position with the "correct" move. For Go, with more possible board positions than atoms in the universe, that's obviously impossible. RL sidesteps the problem entirely ;-)
The RL framework
Every reinforcement learning problem has the same five elements:
- Agent: the learner and decision-maker (your AI)
- Environment: everything the agent interacts with (the game, the robot's world, the market)
- State (s): what the agent observes right now (board position, sensor readings, current portfolio)
- Action (a): what the agent can do (move left, buy, fire thruster)
- Reward (r): the feedback signal (points scored, profit made, distance traveled)
The interaction loop is deceptively simple:
Agent observes state s_t
Agent chooses action a_t
Environment transitions to state s_{t+1}
Environment gives reward r_t
Repeat
That's it. That's the entire framework. Every RL problem -- from Atari games to self-driving cars to protein folding -- follows this exact loop. The complexity is in the details: how large is the state space? How does the agent choose actions? How does it learn from the rewards it receives?
Let's build this concretly with a tiny grid world. The agent starts in the top-left corner and needs to reach the bottom-right corner:
import numpy as np
class SimpleGridWorld:
"""4x4 grid: agent starts top-left,
goal is bottom-right."""
def __init__(self, size=4):
self.size = size
self.state = (0, 0)
self.goal = (size - 1, size - 1)
def reset(self):
self.state = (0, 0)
return self.state
def step(self, action):
"""Take action: 0=up, 1=right,
2=down, 3=left.
Returns: (next_state, reward, done)"""
row, col = self.state
if action == 0:
row = max(0, row - 1)
elif action == 1:
col = min(self.size - 1, col + 1)
elif action == 2:
row = min(self.size - 1, row + 1)
elif action == 3:
col = max(0, col - 1)
self.state = (row, col)
if self.state == self.goal:
return self.state, 1.0, True
else:
return self.state, -0.01, False
@property
def n_states(self):
return self.size * self.size
@property
def n_actions(self):
return 4
# Quick test
env = SimpleGridWorld()
state = env.reset()
print(f"Start: {state}")
state, reward, done = env.step(1) # right
print(f"After right: {state}, "
f"reward={reward}")
state, reward, done = env.step(2) # down
print(f"After down: {state}, "
f"reward={reward}")
The agent gets +1 for reaching the goal and -0.01 for every step taken. This encourages finding the shortest path -- taking 100 steps to reach the goal earns 1.0 - 0.99 = 0.01, while taking 6 steps earns 1.0 - 0.05 = 0.95. The agent isn't told which direction to go. It discovers through experience that going right and down is better than going left and up. Nobody programs the solution. The agent finds it.
How RL differs from supervised learning
In supervised learning (episodes #10-16), you have a dataset of input-output pairs: (image, label), (text, translation), (features, price). The model minimizes the difference between its predictions and the known correct answers. The data is static and pre-collected. The correct answer exists before the model runs.
In RL, there are no correct answers upfront. The agent generates its own training data by interacting with the environment. And crucially -- its actions affect what data it sees next. Choose to go left, and you'll see different states than if you chose right. This creates a fundamentally different learning problem:
class LearningComparison:
"""Show the key difference between
supervised and RL learning."""
def __init__(self):
self.rng = np.random.RandomState(42)
def supervised_demo(self):
"""Supervised: data is fixed,
answers are known."""
X = self.rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Model sees ALL data upfront
print("=== Supervised Learning ===")
print(f"Dataset: {len(X)} samples")
print(f"Labels known: {len(y)}")
print(f"Data changes during training: No")
print(f"Feedback: immediate, per-sample")
def rl_demo(self):
"""RL: data depends on actions,
feedback is delayed."""
env = SimpleGridWorld()
total_steps = 0
rewards_seen = []
print("\n=== Reinforcement Learning ===")
for ep in range(3):
state = env.reset()
ep_reward = 0
steps = 0
done = False
while not done and steps < 20:
action = self.rng.randint(4)
state, reward, done = env.step(
action)
ep_reward += reward
steps += 1
total_steps += 1
rewards_seen.append(ep_reward)
print(f"Episodes run: 3")
print(f"Total steps: {total_steps}")
print(f"Data changes during training: "
f"Yes (agent controls it)")
print(f"Feedback: delayed, per-episode")
print(f"Rewards: "
f"{[f'{r:.2f}' for r in rewards_seen]}")
def run(self):
self.supervised_demo()
self.rl_demo()
comp = LearningComparison()
comp.run()
The credit assignment problem is particularly tricky and something supervised learning never has to deal with. In chess, you might make a brilliant move on turn 5 that only pays off on turn 40. The reward arrives 35 turns later -- which of the 35 intervening actions deserves credit for the final outcome? In supervised learning, every input has an immediate label. In RL, you often don't know which specific action was the good one until much, much later. This is one of the things that makes RL fundamentally harder than supervised learning (and also more interesting, if you ask me).
Exploration vs exploitation
You're in a new city. Hungry. You found a decent restaurant yesterday -- solid 7 out of 10. Do you go back to the safe choice (exploit what you already know works), or try somewhere new that might be a 9/10 or a 3/10 (explore the unknown)?
This is the exploration-exploitation dilemma -- the central tension in all of reinforcement learning. Too much exploitation and you get stuck with suboptimal strategies because you never discover better ones. Too much exploration and you waste time on random actions when you already know what works. The balance between the two is everything in RL.
The simplest solution is epsilon-greedy: with probability epsilon, take a random action (explore); otherwise, take the best known action (exploit):
class EpsilonGreedyAgent:
"""Explore with probability epsilon,
exploit otherwise."""
def __init__(self, n_actions, epsilon=0.1):
self.n_actions = n_actions
self.epsilon = epsilon
self.q_values = {}
self.rng = np.random.RandomState(42)
def choose_action(self, state):
if self.rng.random() < self.epsilon:
return self.rng.randint(
self.n_actions)
else:
values = self.q_values.get(
state,
np.zeros(self.n_actions))
return int(np.argmax(values))
def update(self, state, action, reward,
next_state, alpha=0.1,
gamma=0.99):
"""Q-learning update."""
if state not in self.q_values:
self.q_values[state] = np.zeros(
self.n_actions)
if next_state not in self.q_values:
self.q_values[next_state] = (
np.zeros(self.n_actions))
current_q = self.q_values[state][action]
max_next_q = np.max(
self.q_values[next_state])
target = reward + gamma * max_next_q
self.q_values[state][action] += (
alpha * (target - current_q))
# Compare different epsilon values
env = SimpleGridWorld()
print("=== Exploration vs Exploitation ===")
for eps in [0.0, 0.1, 0.3, 1.0]:
agent = EpsilonGreedyAgent(4, epsilon=eps)
total_rewards = []
for episode in range(500):
state = env.reset()
ep_reward = 0
for step in range(50):
action = agent.choose_action(state)
next_s, reward, done = env.step(
action)
agent.update(state, action,
reward, next_s)
ep_reward += reward
state = next_s
if done:
break
total_rewards.append(ep_reward)
last_100 = np.mean(total_rewards[-100:])
print(f" eps={eps:.1f}: "
f"avg reward (last 100)="
f"{last_100:.3f}, "
f"states visited="
f"{len(agent.q_values)}")
With epsilon=0.0, the agent never explores -- it picks the first action that worked and sticks with it, missing potentially better strategies. With epsilon=1.0, the agent always explores randomly and never exploits what it's learned. The sweet spot (usually around 0.1) lets the agent mostly exploit its best known strategy while occasionally trying something new.
A common improvement is epsilon decay: start with high exploration (epsilon=1.0) and gradually reduce it over time. Explore a lot early when you know nothing, exploit more later when you've learned the environment. We'll see more sophisticated exploration strategies in the next few episodes.
Markov Decision Processes (MDPs)
The formal mathematical framework for RL is the Markov Decision Process. An MDP is defined by five elements:
- S: the set of all possible states
- A: the set of all possible actions
- P(s'|s,a): transition function -- the probability of reaching state s' when taking action a in state s
- R(s,a): reward function -- expected reward for taking action a in state s
- gamma: discount factor (0 to 1) -- how much future rewards are worth
The Markov property is the key assumption that makes everything tractable: the future depends only on the current state, not on how you got there. The state must contain all relevant information for making decisions. In chess, the current board position is all you need -- the sequence of moves that led to it is irrelevant for deciding the next move.
class ExplicitMDP:
"""MDP with explicit transition
probabilities for a 3-state example."""
def __init__(self):
self.n_states = 3
self.n_actions = 2
self.gamma = 0.9
# State 0: start
# State 1: intermediate
# State 2: goal (terminal)
# Action 0: "safe", Action 1: "risky"
# transitions[s][a] = [(prob, next_s)]
self.transitions = {
0: {
0: [(1.0, 1)],
1: [(0.7, 2), (0.3, 0)],
},
1: {
0: [(1.0, 2)],
1: [(0.5, 2), (0.5, 0)],
},
2: {
0: [(1.0, 2)],
1: [(1.0, 2)],
},
}
# rewards[s][a]
self.rewards = {
0: {0: 1.0, 1: 5.0},
1: {0: 2.0, 1: 10.0},
2: {0: 0.0, 1: 0.0},
}
def simulate(self, policy, n_episodes=1000):
"""Run policy and measure
average return."""
rng = np.random.RandomState(42)
returns = []
for _ in range(n_episodes):
state = 0
total = 0.0
discount = 1.0
for _ in range(20):
if state == 2:
break
action = policy[state]
reward = self.rewards[state][
action]
total += discount * reward
discount *= self.gamma
# Sample next state
trans = self.transitions[
state][action]
probs = [t[0] for t in trans]
states = [t[1] for t in trans]
idx = rng.choice(
len(trans), p=probs)
state = states[idx]
returns.append(total)
return np.mean(returns)
def run(self):
policies = {
"Always safe": {0: 0, 1: 0, 2: 0},
"Always risky": {0: 1, 1: 1, 2: 0},
"Safe then risky": {
0: 0, 1: 1, 2: 0},
"Risky then safe": {
0: 1, 1: 0, 2: 0},
}
print("=== MDP Policy Comparison ===")
for name, policy in policies.items():
avg_return = self.simulate(policy)
print(f" {name}: "
f"avg return = "
f"{avg_return:.2f}")
mdp = ExplicitMDP()
mdp.run()
This is a small enough MDP that we can enumerate all possible policies and compare them directly. The "risky" action gives higher rewards but has a chance of sending the agent back to the start. The optimal policy depends on the discount factor -- with high gamma (patient agent), the risky approach is worth it because the expected value is higher even accounting for failures. With low gamma (impatient agent), the safe path wins because the agent strongly prefers guaranteed sooner rewards.
Not all environments are fully Markov, by the way. In many real problems, the agent sees only a partial observation (this is called a POMDP -- Partially Observable MDP). In a card game, you see your hand but not your opponent's cards. In robotics, sensors have limited range. The Markov assumption is often good enough as an approximation, and you can work around partial observability by stacking recent observations to create a more complete state representation -- something we'll see when we get to deep RL in a few episodes.
Cumulative reward and discounting
The agent's objective is to maximize cumulative reward -- the total reward collected over the entire episode, not just the immediate reward at any single step. This is what separates RL from greedy optimization:
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + gamma^3 * r_{t+3} + ...
The discount factor gamma determines how much the agent values future rewards versus immediate ones:
class DiscountExplorer:
"""Explore how gamma affects
decision-making."""
def __init__(self):
pass
def compute_return(self, rewards,
gamma=0.99):
"""Compute discounted return from
a sequence of rewards."""
G = 0.0
returns = []
for r in reversed(rewards):
G = r + gamma * G
returns.insert(0, G)
return returns
def run(self):
# Scenario: reward of 1.0 arrives
# after N steps
print("=== Discount Factor Effects ===")
print(f"{'gamma':<8}"
f"{'1 step':>10}"
f"{'5 steps':>10}"
f"{'10 steps':>10}"
f"{'50 steps':>10}"
f"{'100 steps':>10}")
for gamma in [0.5, 0.9, 0.95, 0.99, 1.0]:
values = []
for delay in [1, 5, 10, 50, 100]:
rewards = [0.0] * delay + [1.0]
returns = self.compute_return(
rewards, gamma)
values.append(returns[0])
print(f"{gamma:<8}"
+ "".join(
f"{v:>10.4f}"
for v in values))
# Two scenarios
print("\n=== Patience Test ===")
# Option A: small reward soon
option_a = [0, 0, 1, 0, 0]
# Option B: big reward later
option_b = [0, 0, 0, 0, 5]
for gamma in [0.5, 0.9, 0.99]:
ga = self.compute_return(
option_a, gamma)[0]
gb = self.compute_return(
option_b, gamma)[0]
winner = ("A (soon)" if ga > gb
else "B (later)")
print(f" gamma={gamma}: "
f"A={ga:.3f}, B={gb:.3f}"
f" -> {winner}")
disc = DiscountExplorer()
disc.run()
The discount factor serves two purposes. Mathematically, it ensures the infinite sum converges (any geometric series with gamma < 1 converges). Practically, it encodes a preference for sooner rewards -- a dollar today really is worth more than a dollar next year, and this is true in most real-world settings. Most RL applications use gamma between 0.95 and 0.99. The "patience test" in the code above shows exactly where the crossover happens: impatient agents (low gamma) prefer the small-soon reward, patient agents (high gamma) prefer the large-later reward. This is not just a mathematical curiosity -- it fundamentally changes what strategies the agent learns ;-)
Policies and value functions
A policy (pi) is the agent's strategy -- it maps states to actions. It can be deterministic ("in state X, always do action Y") or stochastic ("in state X, do action Y with 70% probability and action Z with 30%"). The entire point of RL is to find a good policy.
A value function V(s) tells you how good a state is -- the expected cumulative reward from that state onward, following a given policy. An action-value function Q(s,a) tells you how good a specific action is in a specific state. If you know Q(s,a) for all states and actions, the optimal policy is trivial: in every state, just pick the action with the highest Q value.
The entire challenge of RL is learning good estimates of these value functions. Let's see this in action:
class ValueFunctionDemo:
"""Compute value functions for our
grid world via Monte Carlo."""
def __init__(self, size=4, gamma=0.99):
self.size = size
self.gamma = gamma
self.env = SimpleGridWorld(size)
def random_policy_action(self):
return np.random.randint(4)
def estimate_values(self, n_episodes=5000):
"""Estimate V(s) for random policy
using Monte Carlo averaging."""
state_returns = {}
for _ in range(n_episodes):
state = self.env.reset()
episode = []
for _ in range(100):
action = (
self.random_policy_action())
next_s, reward, done = (
self.env.step(action))
episode.append(
(state, reward))
state = next_s
if done:
episode.append(
(state, 0.0))
break
# Compute returns backwards
G = 0.0
for i in range(
len(episode) - 1, -1, -1):
s, r = episode[i]
G = r + self.gamma * G
if s not in state_returns:
state_returns[s] = []
state_returns[s].append(G)
# Average returns per state
V = {}
for s, returns in (
state_returns.items()):
V[s] = np.mean(returns)
return V
def run(self):
V = self.estimate_values()
print("=== Value Function "
"(Random Policy) ===")
print("V(s) for each grid cell:\n")
for row in range(self.size):
for col in range(self.size):
state = (row, col)
v = V.get(state, 0.0)
print(f"{v:>7.3f}", end=" ")
print()
print(f"\nGoal {self.env.goal}: "
f"V={V.get(self.env.goal, 0):.3f}")
print(f"Start (0,0): "
f"V={V.get((0,0), 0):.3f}")
print(f"States with estimates: "
f"{len(V)}")
vf = ValueFunctionDemo()
vf.run()
Even under a completely random policy, the value function reveals structure: states closer to the goal have higher values because the expected reward is higher. States in the corner opposite the goal have the lowest values. This is the foundation of all value-based RL -- compute (or estimate) these values, then use them to make better decisions. The Monte Carlo method we used here is one of the simplest approaches: just average the actual returns observed from each state across many episodes. We'll cover more sophisticated methods (temporal difference learning, function approximation) in the comming episodes.
The RL algorithm landscape
There are many RL algorithms, but they all fit into a handful of categories. Understanding the landscape helps you pick the right approach for a given problem:
class RLLandscape:
"""Map the RL algorithm families with
simple demonstrations."""
def __init__(self):
self.rng = np.random.RandomState(42)
def value_based_demo(self):
"""Value-based: learn Q(s,a),
derive policy from it."""
print("=== Value-Based Methods ===")
print("Learn Q(s,a) for all "
"state-action pairs")
print("Policy: argmax_a Q(s,a)")
print("Examples: Q-Learning, DQN, "
"Double DQN")
print("Best for: discrete actions, "
"clear state values")
# Mini Q-table
Q = np.zeros((3, 2))
Q[0] = [0.5, 0.8]
Q[1] = [0.9, 0.3]
Q[2] = [0.1, 0.7]
print("\nSample Q-table:")
print(f" {'State':<8}{'Act 0':>8}"
f"{'Act 1':>8}{'Best':>6}")
for s in range(3):
best = int(np.argmax(Q[s]))
print(f" s={s:<6}{Q[s,0]:>8.1f}"
f"{Q[s,1]:>8.1f}"
f"{best:>6}")
def policy_based_demo(self):
"""Policy-based: directly learn
pi(a|s)."""
print("\n=== Policy-Based Methods ===")
print("Directly parameterize the "
"policy pi(a|s)")
print("Learn via policy gradient: "
"increase prob of good actions")
print("Examples: REINFORCE, PPO, "
"TRPO")
print("Best for: continuous actions, "
"stochastic policies")
# Stochastic policy example
probs = np.array([
[0.7, 0.3],
[0.2, 0.8],
[0.5, 0.5],
])
print("\nSample policy pi(a|s):")
print(f" {'State':<8}{'P(a=0)':>8}"
f"{'P(a=1)':>8}")
for s in range(3):
print(f" s={s:<6}"
f"{probs[s,0]:>8.1f}"
f"{probs[s,1]:>8.1f}")
def actor_critic_demo(self):
"""Actor-Critic: learn both."""
print("\n=== Actor-Critic Methods ===")
print("Actor: learns the policy "
"(which actions to take)")
print("Critic: learns the value "
"function (how good states are)")
print("Critic helps reduce variance "
"of actor's gradient estimates")
print("Examples: A2C, A3C, SAC, TD3")
print("Best for: combining strengths "
"of both approaches")
def model_based_demo(self):
"""Model-based: learn environment
dynamics, then plan."""
print("\n=== Model-Based Methods ===")
print("Learn a model of the "
"environment: P(s'|s,a)")
print("Use the model to plan "
"without real interaction")
print("Examples: Dyna-Q, MuZero, "
"World Models")
print("Best for: sample efficiency, "
"simulation")
# Simple learned model
model = {
((0, 0), 1): ((0, 1), -0.01),
((0, 0), 2): ((1, 0), -0.01),
((0, 1), 1): ((0, 2), -0.01),
}
print("\nLearned model predictions:")
for (s, a), (ns, r) in model.items():
dirs = ["up", "right",
"down", "left"]
print(f" {s} + {dirs[a]} "
f"-> {ns}, r={r}")
def run(self):
self.value_based_demo()
self.policy_based_demo()
self.actor_critic_demo()
self.model_based_demo()
landscape = RLLandscape()
landscape.run()
The key tradeoff between these families: model-free methods (value-based and policy-based) are simpler and work well when you can afford millions of environment interactions, but they're sample-inefficient -- they need a LOT of experience. Model-based methods are much more sample-efficient (because they can plan using the learned model instead of interacting with the real environment), but they add the complexity of learning an accurate model and the risk of planning with a wrong model. We'll dig deep into each family over the next several episodes.
Putting it all together: Q-learning on grid world
Now let's train our agent properly and watch it learn. Q-learning (which we'll derive formally in a later episode) is the classic value-based algorithm: it learns Q(s,a) values and derives the policy from them:
class QLearningDemo:
"""Complete Q-learning on grid world
with training analysis."""
def __init__(self, size=4, gamma=0.99,
alpha=0.1, epsilon=0.2):
self.env = SimpleGridWorld(size)
self.agent = EpsilonGreedyAgent(
4, epsilon=epsilon)
self.gamma = gamma
self.alpha = alpha
def train(self, n_episodes=1000):
rewards_history = []
steps_history = []
for episode in range(n_episodes):
state = self.env.reset()
total_reward = 0
steps = 0
for step in range(100):
action = (
self.agent.choose_action(
state))
next_s, reward, done = (
self.env.step(action))
self.agent.update(
state, action, reward,
next_s, self.alpha,
self.gamma)
total_reward += reward
state = next_s
steps += 1
if done:
break
rewards_history.append(
total_reward)
steps_history.append(steps)
return rewards_history, steps_history
def show_policy(self):
arrows = ["^", ">", "v", "<"]
print("\nLearned policy:")
for row in range(self.env.size):
for col in range(self.env.size):
state = (row, col)
if state == self.env.goal:
print(" G", end=" ")
continue
values = (
self.agent.q_values.get(
state, np.zeros(4)))
best = int(np.argmax(values))
print(f" {arrows[best]}",
end=" ")
print()
def show_q_values(self):
print("\nQ-values (max per state):")
for row in range(self.env.size):
for col in range(self.env.size):
state = (row, col)
values = (
self.agent.q_values.get(
state, np.zeros(4)))
print(f"{np.max(values):>7.3f}",
end=" ")
print()
def run(self):
rewards, steps = self.train(1000)
# Report learning progress
print("=== Q-Learning Training ===")
for i in [0, 100, 300, 500, 900]:
window = rewards[i:i+100]
print(
f" Episodes {i}-{i+99}: "
f"avg reward="
f"{np.mean(window):.3f}, "
f"avg steps="
f"{np.mean(steps[i:i+100]):.1f}")
self.show_policy()
self.show_q_values()
demo = QLearningDemo()
demo.run()
After 1000 episodes, the agent has learned to navigate efficiently -- arrows should point right and down along the optimal path, and the Q-values should be highest near the goal and decrease smoothly toward the start. The agent discovered this strategy purely from reward signals, without anyone programming the solution. That's the magic of RL.
Real-world RL: beyond grid worlds
Our grid world is a toy example, obviously, but the same framework scales to problems that are genuinely hard. AlphaGo and AlphaZero (DeepMind) used RL combined with Monte Carlo tree search to master Go, chess, and shogi -- all from self-play, no human game databases needed. OpenAI Five played Dota 2 at the professional level using a scaled-up version of PPO (Proximal Policy Optimization). Robotics labs use RL to teach robots to walk, grasp objects, and navigate without explicit programming of every motion. RLHF (Reinforcement Learning from Human Feedback) is how language models we discussed in episode #61 are aligned with human preferences -- the reward signal comes from human evaluators rating model outputs.
The gap between our grid world and these applications is mostly about scale: larger state spaces, continuous actions, deeper neural networks as function approximators, and vastly more compute. The fundamental framework -- agent, environment, state, action, reward, policy, value function -- stays exactly the same.
Samengevat
- Reinforcement learning is learning by interaction: an agent takes actions in an environment, receives rewards, and adjusts its strategy -- no labeled dataset required, no teacher showing the correct answer;
- the RL framework has five elements: agent, environment, state, action, and reward -- every RL problem from Atari to robotics fits this same loop;
- RL differs from supervised learning in three fundamental ways: the agent generates its own data, feedback is delayed (credit assignment problem), and the data distribution changes as the agent's policy improves;
- the exploration-exploitation dilemma is the central challenge: explore to discover new strategies vs exploit what already works -- epsilon-greedy is the simplest balance, and epsilon decay improves it;
- Markov Decision Processes formalize RL with states, actions, transition probabilities, rewards, and a discount factor -- the Markov property assumes the future depends only on the current state;
- the discount factor gamma controls how much the agent values future rewards -- typical values are 0.95-0.99, and it fundamentally changes what strategies the agent discovers;
- policies map states to actions; value functions estimate expected cumulative reward; learning good value functions (V or Q) is the core problem in most RL;
- the RL landscape spans value-based (Q-learning, DQN), policy-based (REINFORCE, PPO), actor-critic (A2C, SAC), and model-based (Dyna-Q, MuZero) methods -- each with different tradeoffs between sample efficiency and simplicity.
We've just scratched the surface. The next few episodes will go deep into each of these algorithm families, starting with the mathematical foundations and then building up to the core algorithms that modern RL is built on. There's a lot of interesting ground to cover and it only gets better from here.
Exercises
Exercise 1: Build a windy grid world environment. Create a class WindyGridWorld (7x10 grid, start=(3,0), goal=(3,7)) where columns 3-8 have an upward "wind" that pushes the agent up by 1 cell after each action (columns 6-7 push up by 2 cells). Actions: 4 cardinal directions. Implement reset() and step(action) with -1 reward per step, 0 at goal. Train a Q-learning agent (epsilon=0.1, alpha=0.5, gamma=1.0) for 500 episodes. Print: (a) the learned policy as arrows, (b) the average steps-to-goal for the last 50 episodes, (c) the number of states visited during training. The optimal path should curve to compensate for the wind.
Exercise 2: Build an epsilon decay comparison. Create a class EpsilonDecayExperiment that trains 4 agents on our 4x4 grid world, each with a different epsilon schedule: (a) fixed epsilon=0.1, (b) linear decay from 1.0 to 0.01 over 1000 episodes, (c) exponential decay epsilon = 0.99^episode, (d) fixed epsilon=0.5. Train each for 1000 episodes (alpha=0.1, gamma=0.99). For each agent, track and print: average reward per 100-episode block, total exploration actions taken, final policy quality (measured as average reward over 100 greedy-only test episodes at the end). The linear and exponential decay agents should outperform both fixed schedules.
Exercise 3: Build a stochastic environment comparison. Create a class StochasticGridWorld identical to SimpleGridWorld but where actions succeed with probability p_success and with probability (1-p_success) a random different action is taken instead (simulating noisy actuators). Train Q-learning agents (epsilon=0.1, alpha=0.1, gamma=0.99) on three versions: p_success=1.0 (deterministic), p_success=0.8, p_success=0.5. For each, run 2000 episodes and report: (a) learning curve (average reward per 200-episode block), (b) final policy (as arrows), (c) the Q-value spread -- max(Q) minus min(Q) averaged across states. The deterministic environment should converge fastest with the highest Q-value spread; stochastic environments should converge slower with compressed Q-values because the agent can't fully control its trajectory.