

RL Practice (1): Framework + Q-Learning & Sarsa
A minimal, runnable walkthrough of Q-Learning and Sarsa: epsilon-greedy exploration, an episode-based training loop, and saving/loading.
Preface#
When I run reinforcement learning experiments, the most painful part is not deriving equations, but this: the same algorithm in a different environment often forces you to rewrite a lot of training scaffolding. Later I simply summarized the common patterns into a “reusable algorithm skeleton”, and then filled in the differences of each algorithm (which are almost always concentrated in sample and update).
More specifically, most of the pitfalls I’ve hit came from “engineering details being ignored”: sometimes the formula isn’t wrong, but your training loop didn’t handle terminated/truncated properly, epsilon decayed too fast and converged prematurely, or you didn’t cap the max steps per episode and your training statistics became incomparable. In the end, you can only stare at a messy reward curve, with no idea whether the problem is the environment or the code.
So in this post I will deliberately make the “program skeleton” more rigid: what functions you should have, how to arrange train/eval loops, what to print at each step, so that when you write the second and third algorithms you won’t have to build scaffolding from scratch.
This is the first article in the series, with a very clear goal:
- Provide a reusable RL project code structure (train / test / env / configs)
- Implement the classic Q-Learning (off-policy)
- And Sarsa (on-policy), which differs from it in one key place
In later posts, I’ll continue by splitting into “value-function family (DQN family)” and “policy gradient / Actor-Critic family”:
- Part 2: DQN / Double DQN / Dueling DQN / Noisy DQN / PER-DQN
- Part 3: Policy Gradient (REINFORCE) / PPO / A2C
- Part 4: DDPG / TD3 / SAC (the three classics for continuous control)
A General RL Code Skeleton#
Whether it’s tabular methods (Q-table) or deep RL (DQN and variants), I like to abstract an agent into four core actions:
- `sample(state)`: sample an action during training (with exploration)
- `predict(state)`: output an action during evaluation (no exploration, pure exploitation)
- `update(transition)`: update the policy using interaction data
- `save()` / `load()`: save and load (optional, but strongly recommended)
The most important reminder is: across algorithms, the implementations of sample and update vary a lot; other parts (training loop, logging, save/load) are usually similar.
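As a concrete reference, one way to pin this interface down is a small abstract base class. This is my own sketch, not a fixed standard; the class and method names are choices you can adapt:

```python
import abc


class Agent(abc.ABC):
    """Minimal agent interface: algorithms differ mainly in sample/update."""

    @abc.abstractmethod
    def sample(self, state):
        """Pick an action during training (with exploration)."""

    @abc.abstractmethod
    def predict(self, state):
        """Pick an action during evaluation (pure exploitation)."""

    @abc.abstractmethod
    def update(self, *transition):
        """Update the policy from one piece of interaction data."""

    def save(self, path: str):  # optional, but strongly recommended
        pass

    def load(self, path: str):
        pass
```

Every algorithm in this series can subclass this, so the train/eval loops below never need to know which algorithm they are driving.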
Training Loop (Define Training)#
An “episode-based” training loop is basically the following sequence:
- Episode start: `state, info = env.reset()`
- Set a per-episode max step limit `max_steps` (helps convergence and avoids never-ending episodes)
- Interact in a loop until termination:
  - `action = agent.sample(state)` (exploration policy)
  - `next_state, reward, terminated, truncated, info = env.step(action)`
  - Build a `transition` (optionally store it in memory)
  - `agent.update(transition)`
  - `state = next_state`
  - If `terminated` or `truncated`, end the episode
Here is the rigid template in code (I recommend copying it as a project template):
```python
def train(env, agent, num_episodes: int, max_steps: int):
    for ep in range(num_episodes):
        state, info = env.reset()
        ep_return = 0.0
        for t in range(max_steps):
            action = agent.sample(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            agent.update(state, action, reward, next_state, terminated)
            ep_return += reward
            state = next_state
            if terminated or truncated:
                break
        print(f"episode={ep} return={ep_return:.1f} eps={agent.epsilon:.3f}")
```

Evaluation Loop (Define Testing)#
Evaluation looks very similar to training, but two things must change:
- No updates: evaluation is only for measuring performance
- No sampling: use `predict` for pure exploitation
```python
def evaluate(env, agent, num_episodes: int, max_steps: int):
    returns = []
    for ep in range(num_episodes):
        state, info = env.reset()
        ep_return = 0.0
        for t in range(max_steps):
            action = agent.predict(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            ep_return += reward
            state = next_state
            if terminated or truncated:
                break
        returns.append(ep_return)
        print(f"[eval] episode={ep} return={ep_return:.1f}")
    return sum(returns) / len(returns)
```

Environment: Gym Is Enough (For Custom Envs, Only Care About reset/step)#
Most of the time I won’t build my own environment: Gym/Gymnasium is enough.
If you must write a custom environment, the key is aligning these two interfaces:
- `reset()`: start an episode and return the initial state
- `step(action)`: take an action and return `(next_state, reward, terminated, truncated, info)`
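For illustration, a toy custom environment following this interface might look like the sketch below. The task (a hypothetical 1-D "walk to the goal") and all numbers are made up; the only thing that matters is the reset/step shape:

```python
class LineWalkEnv:
    """Toy env: start at 0, walk left/right on a line; reach `goal` to finish."""

    def __init__(self, goal: int = 5, max_steps: int = 50):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.t = 0
        return self.pos, {}  # (initial state, info), mirroring Gymnasium

    def step(self, action):
        # action 0 = step left, 1 = step right
        self.pos += 1 if action == 1 else -1
        self.t += 1
        terminated = self.pos == self.goal    # task solved
        truncated = self.t >= self.max_steps  # time limit hit
        reward = 1.0 if terminated else -0.01  # small per-step penalty
        return self.pos, reward, terminated, truncated, {}
```

Because it matches the 5-tuple convention, the `train`/`evaluate` loops above work on it unchanged.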
Q-Learning: From “Skeleton” to a Runnable Algorithm#
The Three Core Functions of Q-Learning#
In tabular Q-Learning we maintain a table $Q(s, a)$. Plugging it into the skeleton, you’ll find:
- `predict(state)`: choose $\arg\max_{a} Q(s, a)$
- `sample(state)`: add exploration on top of `predict` (epsilon-greedy / UCB, etc.)
- `update(...)`: the Bellman (TD) update
epsilon-greedy (The Most Common Exploration Strategy)#
Training is usually: explore more early and gradually converge later, i.e., decay $\varepsilon$ from large to small.
Typically we set a triple:
- `epsilon_start`: initial exploration rate (often 0.95)
- `epsilon_end`: minimum exploration rate (often 0.01; keep a bit of exploration)
- `epsilon_decay`: decay speed (too fast causes premature convergence/overfitting; too slow delays convergence)
A simple multiplicative decay rule:

$$
\varepsilon \leftarrow \max(\varepsilon_{\text{end}},\; \varepsilon \cdot \varepsilon_{\text{decay}})
$$
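With the example values above (0.95 / 0.01 / 0.995) you can sanity-check how long exploration actually lasts; a quick back-of-envelope script:

```python
import math

eps_start, eps_end, eps_decay = 0.95, 0.01, 0.995

# Closed form: smallest n with eps_start * eps_decay**n <= eps_end
n = math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))

# Simulate the same schedule one decay step at a time
eps, steps = eps_start, 0
while eps > eps_end:
    eps = max(eps_end, eps * eps_decay)
    steps += 1

print(n, steps)  # both are 909: epsilon hits its floor after ~900 decay steps
```

If you decay per step rather than per episode, 909 decay steps can be just a handful of episodes, which is exactly how "epsilon decayed too fast" sneaks in.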
A “Directly Reusable” Q-Learning Class#
The code below is my minimal implementation (discrete state/action). It’s written as a class to keep an interface consistent with deep algorithms like DQN.
```python
import json
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QLearningConfig:
    gamma: float = 0.99
    lr: float = 0.1
    epsilon_start: float = 0.95
    epsilon_end: float = 0.01
    epsilon_decay: float = 0.995


class QLearningAgent:
    def __init__(self, n_actions: int, cfg: QLearningConfig):
        self.n_actions = n_actions
        self.cfg = cfg
        self.Q = defaultdict(lambda: [0.0 for _ in range(n_actions)])
        self.epsilon = cfg.epsilon_start

    def sample(self, state):
        # Exploration + exploitation
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return self.predict(state)

    def predict(self, state):
        q = self.Q[state]
        return int(max(range(self.n_actions), key=lambda a: q[a]))

    def update(self, state, action, reward, next_state, terminated: bool):
        q_sa = self.Q[state][action]
        next_q_max = 0.0 if terminated else max(self.Q[next_state])
        target = reward + self.cfg.gamma * next_q_max
        self.Q[state][action] = q_sa + self.cfg.lr * (target - q_sa)
        # decay epsilon once per step (or per episode, both ok; step is finer-grained)
        self.epsilon = max(self.cfg.epsilon_end, self.epsilon * self.cfg.epsilon_decay)

    def save(self, path: str):
        # JSON object keys must be strings, so states are stored as str(state)
        with open(path, 'w', encoding='utf-8') as f:
            json.dump({str(k): v for k, v in self.Q.items()}, f, ensure_ascii=False)

    def load(self, path: str):
        with open(path, 'r', encoding='utf-8') as f:
            obj = json.load(f)
        self.Q = defaultdict(lambda: [0.0 for _ in range(self.n_actions)])
        for k, v in obj.items():
            # note: keys come back as strings; convert if your states aren't str
            self.Q[k] = v
```

Sarsa: Only One update Different From Q-Learning#
The core difference between Sarsa and Q-Learning in one sentence:
- Sarsa: update using the actually executed next action (on-policy)
- Q-Learning: update using the assumed optimal next action (off-policy)
Written as update targets (only look at the difference):

- Q-Learning: $y = r + \gamma \max_{a'} Q(s', a')$
- Sarsa: $y = r + \gamma Q(s', a')$, where $a'$ is sampled by the current policy at $s'$
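To make the difference concrete, here is a tiny numeric check: given the same transition, the two algorithms bootstrap from different next-state values. The Q-values and gamma below are made up for illustration:

```python
gamma = 0.9
reward = 1.0

# Imaginary Q-values at the next state s'
q_next = [1.0, 5.0]

# Suppose the epsilon-greedy policy actually picked the exploratory action a' = 0
next_action = 0

q_learning_target = reward + gamma * max(q_next)     # bootstraps from the best action
sarsa_target = reward + gamma * q_next[next_action]  # bootstraps from the action taken

print(q_learning_target, sarsa_target)  # 5.5 vs 1.9
```

Whenever exploration picks a non-greedy action, Sarsa's target reflects that (which is why it tends to learn more "cautious" policies), while Q-Learning keeps assuming the greedy choice.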
Minimal Sarsa Implementation (Only Show update Difference)#
Sarsa’s code structure is almost the same as Q-Learning; the key difference is that update needs next_action:
```python
class SarsaAgent(QLearningAgent):
    def update(self, state, action, reward, next_state, next_action, terminated: bool):
        q_sa = self.Q[state][action]
        next_q = 0.0 if terminated else self.Q[next_state][next_action]
        target = reward + self.cfg.gamma * next_q
        self.Q[state][action] = q_sa + self.cfg.lr * (target - q_sa)
        self.epsilon = max(self.cfg.epsilon_end, self.epsilon * self.cfg.epsilon_decay)
```

Accordingly, the training loop needs one line changed: get `next_action` before updating.
```python
action = agent.sample(state)
next_state, reward, terminated, truncated, info = env.step(action)
next_action = agent.sample(next_state)
agent.update(state, action, reward, next_state, next_action, terminated)
state = next_state
action = next_action
```

Summary#
If you only remember one thing, it should be this:
- The training/evaluation loops can be fixed as templates; you mainly change `sample` / `predict` / `update`
In the next post, I’ll upgrade this template to a “deep RL” version: introduce a Replay Buffer and target networks, and then explain DQN and several common improvements (Double / Dueling / Noisy / PER).
My Usual Hyperparameter/Debugging Order (Very Practical)#
Finally, I’ll leave a short “troubleshooting workflow” that I use most often. The downside of reinforcement learning is: unlike supervised learning, you can’t easily see overfitting at a glance. It’s more like a stew—a problem at any step can show up as “reward doesn’t increase”. I usually locate issues in this order:
- Make sure the environment runs end-to-end: can a random policy finish an episode? Are `terminated`/`truncated` reasonable? What is the rough reward scale?
- Make logs specific enough: per-episode return, epsilon per step/episode, and whether the episode ends early. For tabular methods, I also print the scale of `max(Q[state])` to see if it blows up.
- Don’t decay epsilon too fast: my most common mistake is setting `epsilon_decay` too aggressively, causing “confidently doing nonsense” after only dozens of episodes. If reward fluctuates early and then quickly gets stuck, suspect insufficient exploration first.
- Match gamma and max_steps: when `gamma` is large, the effective return horizon is longer; if `max_steps` is short, many environments become “impossible to learn”. The opposite also holds: too long a `max_steps` makes training slower with higher variance.
- Verify on a minimal environment first: FrozenLake / Taxi help quickly validate whether your training loop is correct. Confirm the skeleton first; moving to a complex environment later saves a lot of time.