

RL Practice (3): Policy Gradient + Actor-Critic
Policy distribution (Softmax / Gaussian) design, return accumulation, and parallel sampling.
Preface#
In the first two posts, we stayed in the “value function” perspective: learn $Q(s, a)$, then derive a policy via $\pi(s) = \arg\max_a Q(s, a)$.
Policy Gradient goes the other way: directly optimize the policy $\pi_\theta(a \mid s)$. This is very appealing in practice: on one hand it naturally fits continuous actions (no need to forcibly discretize actions and then fit with DQN); on the other hand, the policy itself is a probability distribution, so exploration is no longer an external epsilon patch, but part of the model output.
In this post, following the order I personally write code, I’ll connect the three most common policy methods into a clear route: first use REINFORCE to get the minimal closed loop of “policy distribution + log_prob + returns” running, then use PPO to make updates more restrained and stable, and finally use A2C to balance sampling efficiency and variance control in a more even way.
I’m personally willing to spend time learning this family mainly because it solves two awkward issues I repeatedly ran into with value-function methods: first, in continuous-action tasks, “discretizing actions” makes the policy clumsy; many times no matter how fine you discretize, it’s still worse than directly learning a continuous distribution. Second, in many environments, exploration is not satisfied by “being random once in a while”. After policy gradients write exploration into the distribution, it becomes easier to quantify “am I really exploring” via probability/entropy.
Of course, policy methods have their own pitfalls: the most common are incorrect distribution parameterization (e.g., std too small so there’s almost no exploration), incorrect return/advantage computation (a one-step mistake at the done boundary can make training completely distorted), and log_prob not matching the action (this kind of bug is the scariest: the loss can still go down, but what you learn is wrong). So in this post I’ll also write down these “landmines I stepped on”.
The Key to Policy Gradient: Get the Policy Distribution Right First#
Policy gradient formulas can be written very long, but when I actually start implementing, I only keep one sentence in mind: make the network output a distribution that is “sample-able, can compute log_prob, and can backprop”. As long as this is done right, whether it’s REINFORCE or PPO afterwards, the essence is multiplying log_prob by some weight (returns/advantages) and then backpropagating.
In practice, the two most common distributions are the following.
1) Softmax (Discrete Actions)#
The network outputs logits $z = f_\theta(s)$, and the policy distribution is:

$$\pi_\theta(a \mid s) = \frac{\exp(z_a)}{\sum_{a'} \exp(z_{a'})}$$

In PyTorch, this corresponds to `Categorical(logits=...)`.
2) Gaussian (Continuous Actions)#
A common approach is for the network to output the mean $\mu_\theta(s)$, along with a standard deviation $\sigma$ (or its log, log_std):

$$a \sim \mathcal{N}\big(\mu_\theta(s),\ \sigma^2\big)$$

In PyTorch, this corresponds to `Normal(loc=mu, scale=sigma)`.
If your action space has only two actions (0/1), I prefer to view it as Bernoulli: the network outputs a probability (via sigmoid), and then action ~ Bernoulli(p). This is more intuitive than softmax(2), and it’s also easier to debug which side the policy is leaning toward.
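To make “sample-able, can compute log_prob, and can backprop” concrete, here is a minimal sketch of the three parameterizations with `torch.distributions`. The dummy tensors stand in for network outputs, so the shapes and variable names are only illustrative assumptions:

```python
import torch
from torch.distributions import Categorical, Normal, Bernoulli

# Discrete actions: logits -> Categorical
logits = torch.randn(1, 2, requires_grad=True)   # stand-in for policy_net(state)
dist = Categorical(logits=logits)
a = dist.sample()                                 # non-differentiable sample
logp = dist.log_prob(a)                           # differentiable w.r.t. logits

# Continuous actions: mean + log_std -> Normal
mu = torch.zeros(1, 1, requires_grad=True)        # stand-in for mu_net(state)
log_std = torch.zeros(1, 1, requires_grad=True)
dist = Normal(mu, log_std.exp())
a = dist.sample()
logp = dist.log_prob(a).sum(dim=-1)               # sum over action dimensions

# Two actions (0/1): sigmoid probability -> Bernoulli
p = torch.sigmoid(torch.randn(1, requires_grad=True))
dist = Bernoulli(probs=p)
a = dist.sample()
logp = dist.log_prob(a)
```

Whichever distribution you pick, the downstream code is identical: `sample()` during rollout, store `log_prob`, and backprop through the distribution parameters during the update.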
REINFORCE (Monte-Carlo Policy Gradient): The Minimal Runnable Policy Gradient#
REINFORCE is the “first brick” I use to get policy gradients working. Its structure is very clean: sample an entire trajectory, compute the discounted return $G_t$ for each step, then use $G_t$ to weight log_prob for gradient descent. After you write it, you’ll build muscle memory for “why save log_prob” and “why wait for the episode to end”.
Return $G_t$: Backward Dynamic Programming#
After sampling a trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T)$, compute from the end backward:

$$G_t = r_t + \gamma\, G_{t+1}, \qquad G_T = r_T$$
In code it’s usually written like this:
```python
def compute_returns(rewards, gamma: float):
    # Discounted return G_t for each step, computed backward from the end of the episode
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns
```

Loss: Weight with log_prob#
A common REINFORCE form:

$$L(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, G_t$$
Engineering-wise:
- Save `log_prob` during sampling
- Compute `returns` after the episode ends
- Take a weighted sum and backprop (a minimal sketch follows this list)
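Putting the pieces together, a minimal REINFORCE episode-plus-update could look like the sketch below. It assumes a Gymnasium-style `env`, a `policy` module that maps a state tensor to a `torch.distributions` object, and the `compute_returns` helper from above; all of these names are assumptions rather than a fixed API:

```python
import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    """Sample one episode, then do one REINFORCE update (sketch)."""
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))     # saved at sampling time
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    returns = torch.as_tensor(compute_returns(rewards, gamma), dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # optional: normalize G_t

    loss = -(torch.stack(log_probs) * returns).sum()  # -sum_t log pi(a_t|s_t) * G_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

Note the ordering: `log_prob` is collected while sampling, `returns` are only computed once the episode ends, and the weighted sum is the loss.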
PPO: Actor-Critic + Clipping (clip) Makes Updates Less Aggressive#
When I start training policies in slightly more complex environments, REINFORCE’s “instability” quickly discourages me. PPO is my most commonly used alternative: it is still Actor-Critic, but it constrains updates with a simple idea—don’t change the policy too drastically in one go.
In engineering implementations, I usually split it into two networks: the Actor outputs the policy distribution (supports sampling and log_prob), and the Critic outputs $V(s)$ as a baseline for advantage estimation.
The Core of PPO: update#
There are many implementation variants of PPO; here I only focus on what I think are the most “useful” core differences:
- Sampling/prediction is the same as before, but you must output probabilities (or log_prob)
- In `update()`, use PPO’s objective to update the Actor, and use value loss to update the Critic
If you tend to get lost while implementing, you can self-check with one sentence: PPO’s update takes in a rollout (states/actions/rewards/dones + old_log_probs + values), and outputs updated actor/critic parameters. As long as the data shapes align, advantages are correct, and log_prob matches the old policy, PPO will run.
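To make that concrete, here is a hedged sketch of one PPO update step over a rollout batch. The field names (`states`, `actions`, `old_log_probs`, `advantages`, `returns`) are assumptions about how the rollout was stored, and `actor`/`critic` are the two networks described above:

```python
import torch
import torch.nn.functional as F

def ppo_update(actor, critic, optimizer, batch, clip_eps=0.2,
               value_coef=0.5, entropy_coef=0.01):
    """One PPO update over a rollout batch (sketch; field names assumed)."""
    states, actions = batch["states"], batch["actions"]
    old_log_probs = batch["old_log_probs"]
    advantages, returns = batch["advantages"], batch["returns"]

    dist = actor(states)                      # a torch.distributions object
    new_log_probs = dist.log_prob(actions)    # for multi-dim Gaussians, sum over action dims
    ratio = (new_log_probs - old_log_probs).exp()

    # Clipped surrogate: don't let the policy move too far in a single update
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    value_loss = F.mse_loss(critic(states).squeeze(-1), returns)
    entropy = dist.entropy().mean()

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this function is called for several epochs over the same rollout, which is exactly why `old_log_probs` must come from the policy that collected the data, not the current one.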
PPO-Specific “Experience Replay”#
PPO also has a “buffer”, but it is completely different from DQN’s replay: DQN is off-policy, so the buffer can be large and the data can be kept for a long time; PPO is closer to on-policy, and usually only keeps the most recent rollout, uses it to update for several epochs, and then discards it.
Therefore PPO rollout buffers typically store:
- states, actions, rewards
- old_log_probs
- values (critic outputs)
- dones
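A minimal version of that buffer can be nothing fancier than a handful of lists that get cleared after each round of updates; the sketch below just mirrors the fields listed above:

```python
class RolloutBuffer:
    """On-policy rollout storage: fill once, update for a few epochs, then clear (sketch)."""
    def __init__(self):
        self.clear()

    def add(self, state, action, reward, log_prob, value, done):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.old_log_probs.append(log_prob)
        self.values.append(value)
        self.dones.append(done)

    def clear(self):
        self.states, self.actions, self.rewards = [], [], []
        self.old_log_probs, self.values, self.dones = [], [], []
```

The `clear()` after every update cycle is the visible difference from a DQN replay buffer, which deliberately keeps old data around.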
A2C: Advantage Actor-Critic (Common Practice: Parallel Envs + n-step)#
In my mind, A2C (Advantage Actor-Critic) is like a “more practical REINFORCE”: it also updates the policy using sampled data, but it uses a Critic baseline to reduce variance, and it is often combined with parallel environments to collect more trajectories at once and increase throughput.
In implementation it is still two networks, Actor + Critic. When updating the Actor, it commonly uses the advantage function $A_t = G_t - V(s_t)$. When updating the Critic, it regresses $V(s_t)$ to the return or a bootstrap target. Many pitfalls concentrate in return/advantage computation: done boundaries, whether the last state should bootstrap, and how to organize the time dimension and env dimension in parallel environments.
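Since this is where most of the pitfalls live, here is a hedged sketch of n-step returns and advantages over parallel environments. It assumes tensors laid out as (T, N) for T timesteps and N envs, and that `dones[t]` marks episodes that ended at step t; that layout is an assumption, not the only way to organize the data:

```python
import torch

def nstep_returns_and_advantages(rewards, values, dones, last_values, gamma=0.99):
    """n-step returns/advantages for (T, N)-shaped rollouts (sketch)."""
    T, N = rewards.shape
    returns = torch.zeros(T, N)
    G = last_values                      # bootstrap from the critic's estimate of the state after step T-1
    for t in reversed(range(T)):
        # if the episode ended at step t, cut the bootstrap: nothing flows across the done boundary
        G = rewards[t] + gamma * G * (1.0 - dones[t])
        returns[t] = G
    advantages = returns - values        # A_t = G_t - V(s_t)
    return returns, advantages
```

The two lines that bite people are the `(1.0 - dones[t])` mask and the choice of `last_values`: forget either one and the advantage estimate silently shifts.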
You Can Understand the Training Function in Blocks#
I prefer splitting A2C training code by responsibility (otherwise it’s easy for it to become messier and messier):
- Hyperparameters and environment setup: action/state dimensions, initialize networks, optimizer
- Training-loop initialization: prepare various cache arrays
- Collect trajectories (fixed steps) + periodic evaluation/logging
- Compute returns and advantages
- Compute loss and update parameters, return reward curves
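The last two blocks are short once returns and advantages exist; here is a hedged sketch of the loss computation, assuming the function above and the same `actor`/`critic` pair as in the PPO section:

```python
def a2c_loss(actor, critic, states, actions, returns, advantages,
             value_coef=0.5, entropy_coef=0.01):
    """A2C loss over a flattened (T*N, ...) batch (sketch)."""
    dist = actor(states)
    log_probs = dist.log_prob(actions)
    # Detach advantages: the Critic is trained by the value loss, not through the Actor term
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = (critic(states).squeeze(-1) - returns).pow(2).mean()
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

Compared with PPO there is no ratio and no clipping: each rollout is typically used once, so the plain log_prob-times-advantage term is enough.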
Summary#
What I want to convey in this post is really just three things: first, the first step to writing policy gradients correctly is “distribution modeling”—it must be sample-able, can compute log_prob, and can backprop; second, REINFORCE uses a single trajectory to build the minimal closed loop, but it has high variance; third, PPO/A2C use engineering techniques like value/advantage to suppress training oscillations, so you can keep iterating on more complex tasks.
The 7 Signals I Always Check When Debugging Policy Gradients#
The most frustrating part of policy gradients is: everything “looks like” it’s moving (loss has values, gradients are backpropagating, parameters are updating), but the policy may just not get better. The following is my most-used self-check list:
- Is log_prob finite? Once you see `inf`/`NaN`, check distribution parameters first (especially whether std/log_std exploded). A small diagnostics sketch after this list covers this and the next few checks.
- Is entropy in a reasonable range? If entropy quickly drops near 0, it usually means the policy becomes deterministic too early (exploration is gone). If entropy stays very high, it means the policy is basically sampling randomly.
- The numeric values of std/log_std: in continuous actions, std too small ≈ not exploring; too large ≈ actions fly everywhere. Many implementations clamp log_std.
- Scale of returns/advantages: I print the mean/variance of returns, and often normalize (especially for REINFORCE and PPO advantages).
- Done boundaries and bootstrap: in A2C/PPO, advantage/return boundaries are very sensitive; an off-by-one mistake in done handling can shift the whole advantage estimate.
- Do action and log_prob match the same sample? Don’t clip/transform actions after sampling and forget to synchronize log_prob (especially common in SAC).
- Validate the closed loop on the simplest env first: tasks like CartPole quickly verify whether “distribution sampling + log_prob + update” is correct. Don’t start by throwing yourself at high-dimensional continuous control.
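Most of these signals are a few lines of logging; here is a minimal sketch of the numbers I would print first, assuming a distribution object, an advantage tensor, and (for Gaussians) the log_std parameter:

```python
import torch

def log_policy_diagnostics(dist, advantages, log_std=None):
    """Print the handful of numbers to check when a policy 'trains' but never improves (sketch)."""
    sample = dist.sample()
    assert torch.isfinite(dist.log_prob(sample)).all(), "log_prob has inf/NaN"
    print("entropy      :", dist.entropy().mean().item())
    if log_std is not None:
        print("log_std range:", log_std.min().item(), log_std.max().item())
    print("adv mean/std :", advantages.mean().item(), advantages.std().item())
```

None of this replaces reading the reward curve, but it usually tells you within one rollout whether the problem is the distribution, the advantages, or the update itself.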
The next post enters the continuous-control trilogy: DDPG / TD3 / SAC. Like PPO/A2C, they all use an Actor-Critic structure, but they lean off-policy, and SAC in particular emphasizes entropy regularization.