
Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the full Mubu notes for Chapter 12 live there. If you spot any logical mistakes, please let me know in the comments. Thank you!

Start#

When I first switched from Q-learning to policy gradients, the best part was: I finally didn’t need to get a policy “indirectly” by going through $\arg\max$.

In policy gradients, the idea is almost brutally simple:

If the return of a trajectory is positive, make the actions that appeared in that trajectory more likely to be sampled in their corresponding states; conversely, if the return is negative, push those action probabilities down. You can think of it as a kind of “after-the-fact praise/criticism” mechanism: after the episode ends, you look back—trajectories that performed well give “bonus points” to the decisions inside them.

In this chapter, we follow the docs’ path and turn this idea into something implementable: gradient ascent, baselines, assigning proper credit to each step, and the classic REINFORCE algorithm.

The core intuition of policy gradient#

This part in the docs is key: for a step $(s_t, a_t)$ in a trajectory $\tau$, if we later find the trajectory reward is positive, we increase the probability of taking $a_t$ in $s_t$; otherwise we decrease it.

To make it trainable, we typically start from a parameterized policy $\pi_\theta(a|s)$ (a neural network that outputs an action distribution) to represent “what I tend to do in a given state.” We then write the objective as maximizing the expected return $J(\theta)=\mathbb{E}[G]$, and update the parameters via gradient ascent:

$$\theta \leftarrow \theta + \eta \nabla_\theta J(\theta)$$

Here $\eta$ is the learning rate (you can use Adam / RMSProp as the optimizer).
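As a minimal sketch of what this looks like in code (the network shape, learning rate, and the `update` helper are all illustrative, not taken from the docs): gradient ascent on $J(\theta)$ is usually implemented as gradient descent on $-J(\theta)$ with a standard optimizer.

```python
import torch

# Illustrative policy network: maps a 4-dim state to logits over 2 actions.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)  # lr plays the role of eta

def update(log_probs: torch.Tensor, returns: torch.Tensor) -> None:
    """One gradient-ascent step on a sample estimate of J(theta)."""
    objective = (log_probs * returns).mean()  # estimate of E[log pi(a|s) * G]
    loss = -objective                         # ascent on J == descent on -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```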

Two common tricks: baseline and credit assignment#

Trick 1: add a baseline#

The docs mention a seemingly counterintuitive issue: an action not being sampled doesn’t mean it’s bad, but its probability may still be “pushed down.” More fundamentally:

The variance of the return $G$ is often large, and if you multiply $G$ directly with log_prob, the updates can be very noisy. A baseline provides a reference frame so you focus on “did I gain relative to the baseline,” which significantly reduces variance.

A baseline is something you subtract that doesn’t change the expectation but does reduce the variance. The classic choice is the state value $V(s)$, which yields the advantage function:

$$A(s,a) = Q(s,a) - V(s)$$
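A tiny sketch of the idea (assuming we already have a tensor of per-step returns; here the batch mean stands in for a learned $V(s)$, which is a common quick-and-dirty baseline):

```python
import torch

def advantages(returns: torch.Tensor) -> torch.Tensor:
    """Subtract a baseline (here: the mean return) and rescale.

    Subtracting a constant baseline leaves the expected gradient unchanged
    but can reduce its variance considerably.
    """
    baseline = returns.mean()          # stand-in for V(s)
    adv = returns - baseline
    return adv / (adv.std() + 1e-8)    # optional: also normalize the scale
```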

Trick 2: assign an appropriate score to each step#

In the same episode, early actions may determine the route and later actions may determine the finish. We want each step to be weighted differently to reflect “how much this step contributes to the outcome.”

The most common approach is to use each step’s future discounted return:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k\, r_{t+k+1}$$
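Computed from back to front, this is only a few lines. A sketch, assuming `rewards[t]` stores $r_{t+1}$, the reward received after the action at step $t$:

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * G_{t+1}, filled in from the last step backwards."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```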

This also leads directly to REINFORCE.

REINFORCE: Monte Carlo policy gradient#

The docs describe it well: REINFORCE updates once per episode. First collect the rewards of every step, then compute each step’s $G_t$, and finally use it to optimize that step’s action output.

In more “programmer” language:

You can think of the implementation as a straightforward sample-and-backprop routine: roll out a full episode and record $(s_t, a_t, r_{t+1}, \log\pi(a_t|s_t))$; then compute $G_t$ from back to front; finally write the loss in a gradient-descent form and backpropagate:

$$\mathcal{L} = -\sum_t \log\pi_\theta(a_t|s_t)\cdot G_t$$

Then backprop.
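Putting the pieces together, here is a compact REINFORCE sketch in PyTorch (the environment, network size, and hyperparameters are illustrative, not the docs’ setup; it assumes the gymnasium API and reuses the `discounted_returns` helper sketched above):

```python
import gymnasium as gym
import torch
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = torch.nn.Sequential(
    torch.nn.Linear(env.observation_space.shape[0], 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        # Sample a_t ~ pi_theta(.|s_t) and record log pi_theta(a_t|s_t).
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Returns-to-go G_t, computed back to front.
    returns = torch.as_tensor(discounted_returns(rewards), dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # baseline-style normalization

    # L = -sum_t log pi_theta(a_t|s_t) * G_t, then one gradient-descent step.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```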

Chapter summary: first make the policy “learnable”#

The charm of policy gradients is that they handle continuous actions naturally and can encode exploration into the distribution itself. But they can also be noisier and more dependent on engineering details.

In the next chapter we’ll move to a more practical version used widely in the real world: PPO. You can think of PPO as “putting a seat belt on policy gradients”: making sure each update step doesn’t jump too far, which is a qualitative improvement for stability.

RL Notes (4): Policy Gradient
https://xiaohei94.github.io/en/blog/rl-learning-4
Author: 红鼻子小黑
Published: May 5, 2025