

RL Notes (1): What Does RL Learn?
Intro to RL: what it learns, I/O, exploration vs exploitation, and state vs observation.
Preface#
I’ve always wanted to connect reinforcement learning’s “concepts → algorithms → engineering intuition” into a route you can actually follow, instead of jumping around—reading DQN today and getting discouraged by PPO tomorrow.
So this time I organized the notes and thoughts I recorded during my own learning into a 12-post series, split by chapters. You can treat it as a minimal closed loop from beginner to advanced: first make the concepts clear, then get the classic algorithms running.
Also, note that this series is converted from my Mubu notes. If you prefer a more visual, illustrated reading experience, you can read it in my Mubu space. Due to length limits, the Chapter 12 notes are hosted there as well. If you find any logical mistakes, please let me know in the comments—thank you!
Table of contents for this series (one post per chapter)#
- Chapter 1: Reinforcement learning basics
- Chapter 2: MDP, MRP, and the Bellman equation
- Chapter 3: Tabular methods
- Chapter 4: Policy gradients and REINFORCE
- Chapter 5: PPO
- Chapter 6: DQN
- Chapter 7: Advanced DQN techniques
- Chapter 8: The dilemma of Q methods in continuous actions and ways forward
- Chapter 9: Actor-Critic / A2C
- Chapter 10: Sparse rewards (reward shaping / ICM)
- Chapter 11: Imitation learning (BC / DAgger / IRL)
- Chapter 12: Deep Deterministic Policy Gradient (DDPG)
How I suggest reading#
- Read Chapters 1-3 first: build the closed-loop picture and the on-policy/off-policy intuition.
- Then read Chapters 4-7: understand the differences between the two main lines (PG/PPO vs DQN and its refinements).
- Finally read Chapters 8-12: cover the more realistic difficulties like continuous control, sparse rewards, and imitation learning.
If you plan to turn these into code projects later, I also suggest following this order: get tabular methods working first → go deep on either DQN or PPO → then move on to actor-critic continuous control.
Start#
When I first seriously wrote reinforcement learning code, the thing that confused me most wasn’t the formulas—it was a very simple question: what is it actually learning?
In supervised learning, when you make a mistake, the teacher tells you “what the correct answer is.” In reinforcement learning, when you make a mistake, the environment at best gives you delayed feedback (and often doesn’t even say “you’re bad”—it just gives you a zero). So many assumptions that feel “obvious in supervised learning” break down here: samples are not i.i.d., rewards can be delayed, and training feels like exploring in a dark room.
In this chapter, I’ll connect the most basic concepts in the order that makes sense to me: why RL is different from supervised learning, how to distinguish state vs observation, what roles policy / value function / model play, and what kinds of tasks the three common agent families (value-based, policy-based, actor-critic) are suited for.
Why reinforcement learning is “harder than supervised learning”#
There’s a line in the docs I strongly agree with: many difficulties in RL come from the fact that two key assumptions in supervised learning don’t hold here.
The first major change is that data is no longer i.i.d.: the agent sees a continuous trajectory, and consecutive frames are highly correlated. If you treat it as shuffled supervised data and feed it to a network, your gradients can be wildly biased. The second change is that rewards can be delayed: an action you take now may only be judged tens of steps later, and the environment won’t tell you “what the correct action was.” Together, these amplify a lot of engineering details: the credit your loss assigns may not match the true causality, and training curves can go down while the policy doesn’t improve. You’ll therefore rely more on monitoring to judge whether learning is happening, such as reward curves, episode length, and the ranges and variances of value/Q estimates.
Standard reinforcement learning vs deep reinforcement learning#
Standard reinforcement learning is more like: first represent the state as features you can handle, then use tables or relatively simple function approximators to estimate values or policies. Deep reinforcement learning (DRL) is more “end-to-end”: neural networks map states/observations directly to actions or values.
This matters in practice because it tells you what to focus on during tuning:
- Tabular methods are more like “math problems,” where exploration strategy, learning rate, and discount factor often dominate;
- Deep methods are more like “deep learning + reinforcement learning,” combining RL instability with neural-network training instability (gradient explosion, normalization, network capacity, etc.).
State and Observation#
The distinction in the docs is important:
- State: a complete description of the world with no hidden information.
- Observation: a partial description of the state, possibly missing information.
This affects whether you need “memory.” If your observation is not Markov (e.g., you only see a local view), you may need one of the following (a minimal frame-stacking sketch comes after this list):
- frame stacking
- RNNs (LSTM/GRU)
- or explicitly constructing a belief state
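Since frame stacking is the cheapest of these fixes, here is a minimal sketch of it in plain Python/NumPy. The class name and shapes are illustrative choices of mine, not from any particular library:

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keep the last k observations and present them as one stacked input.

    Stacking recent frames is the simplest way to make a non-Markov
    observation (e.g., a single image without velocity) closer to a state.
    """

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        # At episode start, fill the buffer with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, obs: np.ndarray) -> np.ndarray:
        # Each new frame pushes out the oldest one (the deque handles maxlen).
        self.frames.append(obs)
        return np.stack(self.frames)
```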
The three major components of an RL agent: policy, value function, model#
1) Policy#
A policy answers: what action should we choose in state $s$?
- Stochastic policy: outputs an action distribution $\pi(a \mid s)$.
- Deterministic policy: outputs a single action $a = \mu(s)$.
In practice, I recommend starting with stochastic policies for exploration (especially with continuous actions), because exploration is then part of the policy itself—you don’t need to bolt on a separate $\epsilon$-greedy scheme.
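To make the two flavors concrete, here is a minimal PyTorch sketch. The layer sizes and class names are placeholders I chose, not canon:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class StochasticPolicy(nn.Module):
    """pi(a|s): returns a distribution, so sampling doubles as exploration."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned noise scale

    def forward(self, s: torch.Tensor) -> Normal:
        return Normal(self.mean(s), self.log_std.exp())


class DeterministicPolicy(nn.Module):
    """a = mu(s): one action per state; exploration must be added outside."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


# With the stochastic policy, exploration is just sampling:
# a = StochasticPolicy(obs_dim=4, act_dim=2)(torch.randn(4)).sample()
```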
2) Value Function#
A value function predicts future rewards and evaluates whether “things are good right now.” The discount factor $\gamma$ mentioned in the docs is one of the core hyperparameters (a small worked example follows the list below):
- smaller $\gamma$: more “short-sighted,” focusing on short-term rewards.
- larger $\gamma$: more “far-sighted,” focusing on long-term return.
Two common types of value functions:
- state value: $V^\pi(s)$, the expected return starting from state $s$ and following $\pi$.
- action value: $Q^\pi(s, a)$, the expected return after taking action $a$ in state $s$ and then following $\pi$.
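Here is the worked example promised above: computing the discounted return backwards makes the effect of $\gamma$ tangible. The reward sequence is made up for illustration:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for every step, backwards."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))


# A single delayed reward at the end of a 4-step episode:
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))   # [0.125, 0.25, 0.5, 1.0] -- short-sighted
print(discounted_return(rewards, gamma=0.99))  # [~0.970, ~0.980, 0.99, 1.0] -- far-sighted
```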
3) Model#
A model describes how the environment evolves: from $(s_t, a_t)$ to $s_{t+1}$.
- model-based: tries to learn $p(s_{t+1} \mid s_t, a_t)$ (and the reward) or uses known transitions and rewards.
- model-free: does not explicitly learn transitions; learns policy or values directly.
Three common agent families: value-based / policy-based / actor-critic#
If we classify by “what you explicitly learn,” I prefer to view them as three temperaments: value-based methods learn $Q(s, a)$ first, and the policy is often implicitly extracted (e.g., in discrete actions via $\arg\max_a Q(s, a)$; a one-line sketch of this follows). Classic examples are Q-learning and DQN. Policy-based methods directly learn $\pi_\theta(a \mid s)$ itself—examples include REINFORCE and PPO. Actor-critic combines the two: policy (actor) and value (critic) learn together and accelerate each other; A2C, DDPG, and SAC fall into this category.
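For the value-based case, “implicit extraction” really is just an argmax. A tiny sketch with a made-up Q-table:

```python
import numpy as np

# Hypothetical tabular Q: rows are states, columns are actions.
Q = np.array([[0.1, 0.9],
              [0.7, 0.2]])


def greedy_policy(s: int) -> int:
    # Value-based methods never store pi explicitly;
    # the policy is read off the value estimates.
    return int(np.argmax(Q[s]))


assert greedy_policy(0) == 1
assert greedy_policy(1) == 0
```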
Chapter summary: think through the “closed loop” first#
If you only remember one thing, I hope it’s this sentence:
Reinforcement learning always runs a loop: sample (interact) → estimate (value/advantage) → update (policy/value) → sample again.
In later chapters—MDPs, tabular methods, PPO/DQN—everything is essentially swapping different estimation and update methods within this loop.
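To preview how that loop looks in code, here is a minimal sketch using Gymnasium’s API, with a random policy standing in for the agent. The estimate/update steps are left as comments because later chapters fill them in differently:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

for step in range(1_000):
    # 1) sample: the policy (here: random) interacts with the environment
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)

    # 2) estimate: a real agent would turn (obs, action, reward, next_obs)
    #    into value/advantage estimates here
    # 3) update: ...and then improve its policy and/or value function

    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```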
In the next chapter, we’ll write this loop as mathematical objects: Markov processes, reward processes, MDPs, and the Bellman equation. You’ll find that many “abstract-looking” symbols are actually answering very concrete engineering questions, such as: how do I write future discounted rewards, $G_t = r_{t+1} + \gamma G_{t+1}$, as an iterative update rule?