
Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space. Due to length constraints, the Chapter 12 Mubu notes are there. If you find any logical mistakes, please let me know in the comments—thank you!

Start#

This MDP chapter is easy to write as a “pile of symbols,” but I prefer to treat it as an engineering language: when you complain “the reward is too delayed, I don’t know which action to blame,” MDP gives you a coordinate system to disentangle the problem.

After reading this chapter, you should be able to answer three practical questions.

More importantly, I hope you read with those three very down-to-earth questions in mind: what exactly the Markov property is, and why it is a prerequisite for correctness in many algorithms; what the Bellman equation is doing in engineering terms, and why people insist on iterating it over and over; and where to draw the boundary between “prediction” and “control”, i.e., in which scenarios policy iteration versus value iteration feels more natural.

The Markov property: compress “history” into the current state#

The docs’ description of the Markov property is the classic one: conditioned on the current state, the future is independent of all past states; the full history adds nothing beyond what the current state already tells you.

In more engineering terms:

If your state design is good enough, you don’t need to remember the entire history.

If the environment’s observation is not Markov (partially observable), you need to compensate outside the algorithm: frame stacking, RNNs, or constructing a belief state.
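
To make the “compensate outside the algorithm” point concrete, here is a minimal frame-stacking sketch. It is not tied to any particular environment library; `FrameStacker` and its methods are illustrative names, and the idea is simply to hand the agent the last k observations as one state.

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Stack the last k observations so the stacked vector is closer to being Markov."""

    def __init__(self, k: int):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        # At episode start there is no history yet, so repeat the first observation k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return np.concatenate(list(self.frames), axis=0)

    def push(self, obs: np.ndarray) -> np.ndarray:
        # Each new observation pushes the oldest one out of the window.
        self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=0)
```

RNNs and belief states follow the same logic: they fold the relevant part of the history into whatever the agent conditions on.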

From Markov Chain to MRP: learn “evaluate under a fixed policy” first#

Markov Chain#

Only state transitions; no rewards and no actions.

Markov Reward Process (MRP)#

Add a reward function on top of a Markov chain. The key product is the state value:

$$V(s) = \mathbb{E}[G_t \mid s_t = s]$$

where the return $G_t$ is the discounted sum of rewards:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$$

In practice, the discount factor $\gamma$ is basically the “how far-sighted are you” knob.
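
To see the knob in action, here is a tiny sketch that computes $G_t$ for one made-up reward sequence under two discount factors (the numbers are purely illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 0.0, 1.0]                 # a single delayed reward at the end
print(discounted_return(rewards, gamma=0.99))  # ~0.970: far-sighted, the late reward still counts
print(discounted_return(rewards, gamma=0.50))  # 0.125: short-sighted, the late reward nearly vanishes
```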

The Bellman equation: write “the future” as “recursion”#

The docs mention that the Bellman equation defines the relationship between the present and the future. Its significance is:

  • you don’t need to actually run to infinity to compute $G_t$;
  • you can update “current value” using “next-step value.”

Under MRP, you can understand it as:

$$V(s) = \mathbb{E}[r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s]$$

This one sentence is the ancestor of dynamic programming, TD learning, and even deep RL.
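
Under an MRP the Bellman equation is linear in $V$, so you can either solve it in closed form or iterate the backup until it stops changing. A minimal sketch with a made-up 3-state MRP (transition matrix, rewards, and $\gamma$ all invented for illustration):

```python
import numpy as np

# A made-up 3-state MRP: P[s, s'] is the transition matrix, R[s] the expected immediate reward.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 10.0])
gamma = 0.9

# Route 1: closed form, V = (I - gamma * P)^{-1} R.
V_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# Route 2: iterate the Bellman backup V <- R + gamma * P V to its fixed point.
V = np.zeros(3)
for _ in range(10_000):
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V_exact)
print(V)   # both routes land on the same values
```

Route 2 is exactly the “update current value using next-step value” idea; DP, TD, and deep RL all reuse this backup in some sampled or approximated form.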

Three value-estimation routes: DP / Monte Carlo / TD#

The docs lay out the three families nicely. I’ll add a bit of “how you choose when you write code”:

  • Dynamic Programming (DP): requires the environment to be fully known ($P$, $R$) and the ability to sweep over all states. Pros: stable, provable. Cons: rarely satisfied in reality.
  • Monte Carlo (MC): model-free; estimate using full-episode returns. Pros: unbiased. Cons: high variance; must wait until the episode ends.
  • Temporal Difference (TD): sits between the two; samples while bootstrapping. Pros: online updates. Cons: biased, but usually more efficient (a small MC-vs-TD sketch follows this list).
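
Here is that sketch: one recorded episode, two update rules. Everything (states, rewards, step size `alpha`) is made up for illustration; the point is only where the update target comes from in each case.

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V_mc = np.zeros(n_states)
V_td = np.zeros(n_states)

# One recorded episode: (state visited, reward received after leaving that state).
episode = [(0, 0.0), (1, 0.0), (2, 1.0), (3, 0.0)]
terminal_state = 4

# Monte Carlo: wait for the episode to end, then pull each visited state toward its full return G_t.
g = 0.0
for s, r in reversed(episode):
    g = r + gamma * g
    V_mc[s] += alpha * (g - V_mc[s])

# TD(0): update online, bootstrapping from the current estimate of the next state's value.
for i, (s, r) in enumerate(episode):
    s_next = episode[i + 1][0] if i + 1 < len(episode) else terminal_state
    td_target = r + gamma * V_td[s_next]      # the terminal state's value stays 0
    V_td[s] += alpha * (td_target - V_td[s])

print(V_mc)
print(V_td)
```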

From MRP to MDP: add actions, and the problem changes from “evaluation” to “decision”#

Compared with MRP, MDP adds actions: the future depends not only on the current state, but also on the action taken in the current state.

The core objects in MDP are:

  • policy $\pi(a|s)$
  • value function $V^\pi(s)$ or $Q^\pi(s,a)$
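
If it helps to see shapes, here is how these objects typically look as arrays in a tabular implementation. All numbers are random placeholders, and the last line uses the standard relation $Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\,V^\pi(s')$, which the text above does not spell out:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# A made-up MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.random.uniform(-1.0, 1.0, size=(n_states, n_actions))

# A policy pi[s, a] = pi(a | s); here simply uniform random.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Given some estimate of V^pi, the action values follow from the relation above.
V = np.zeros(n_states)
Q = R + gamma * P @ V      # shape (n_states, n_actions)
print(pi.shape, V.shape, Q.shape)
```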

Prediction vs Control#

This part of the docs is a frequent exam topic, but it is also key in engineering:

  • Prediction: given an MDP + a fixed policy $\pi$, compute $V^\pi$.
  • Control: given an MDP (policy unknown), compute the optimal policy $\pi^*$ and optimal value $V^*$.

In short:

  • prediction = you ask me whether this policy is good;
  • control = I find the best policy myself.

Solving control with DP: policy iteration vs value iteration#

Policy Iteration#

A two-step loop:

  1. Policy evaluation: fix $\pi$ and compute $V^\pi$.
  2. Policy improvement: derive a better policy using $V^\pi$ (greedy w.r.t. $Q$).
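
A minimal tabular sketch of this two-step loop, using the same `P[s, a, s']` / `R[s, a]` array layout as above (the names and layout are my own convention, not from the docs). Policy evaluation is done exactly via a linear solve here; in practice a few iterative evaluation sweeps also work:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P: (S, A, S) transition probs, R: (S, A) expected rewards."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)           # deterministic policy: state -> action
    while True:
        # 1. Policy evaluation: solve V = R_pi + gamma * P_pi @ V exactly.
        P_pi = P[np.arange(n_states), pi]        # (S, S)
        R_pi = R[np.arange(n_states), pi]        # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 2. Policy improvement: act greedily with respect to Q.
        Q = R + gamma * P @ V                    # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):           # policy stable => done
            return pi, V
        pi = new_pi
```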

Value Iteration#

Iterate the Bellman optimality equation directly to get $V^*$, then extract $\pi^*$.
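
And the matching value-iteration sketch, same array conventions and equally illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ] to convergence."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)     # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi_star = (R + gamma * P @ V).argmax(axis=1)    # extract a greedy (optimal) policy
    return pi_star, V
```

Roughly speaking, policy iteration does fewer but heavier outer iterations (each with a full evaluation), while value iteration does many cheap backups; which one feels more natural mostly depends on how expensive evaluation is for your problem.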

Chapter summary: MDP is a map#

I always feel MDP is like a “map”: it tells you what variables exist in RL and what dependency relationships are prerequisites for algorithms to work.

Later you’ll see:

  • tabular methods (Sarsa/Q-learning) provide “online-iterative” control algorithms on top of MDP;
  • policy gradient/PPO is another route that avoids an explicit $Q$;
  • DQN and Actor-Critic approximate these objects with neural networks.

Next chapter we’ll start from the most basic place: if the state space is small, what happens when you learn with a table directly?

RL Notes (2): MDP, MRP, and the Bellman Equation
https://xiaohei94.github.io/en/blog/rl-learning-2
Author 红鼻子小黑
Published at May 3, 2025