
Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the Mubu notes for Chapter 12 live there. If you find any logical mistakes, please let me know in the comments. Thank you!

Start#

We’ve already felt the pain points of policy gradients in Chapters 4 and 5: they can learn, but variance is large and the curve jitters like an ECG.

Actor-Critic starts from a very simple idea:

  • the actor is responsible for the policy $\pi_\theta(a|s)$
  • the critic is responsible for the value $V(s)$ or $Q(s,a)$
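
To make the division of labor concrete, here is a minimal sketch of the two networks in PyTorch. This is my own toy example, assuming a discrete action space; the class names, hidden sizes, and architecture are arbitrary, not anything prescribed by the docs:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a distribution over actions, i.e. pi_theta(a|s)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)

class Critic(nn.Module):
    """Value network: maps a state to a scalar estimate V(s)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)
```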

The docs give an engineering-friendly summary: with a value function, actor-critic can do one-step updates without waiting for the episode to end.

What the actor and critic each do#

The docs are very clear:

  • actor: learn a policy to achieve as high a return as possible
  • critic: estimate the value of the current policy and evaluate how good the actor is

It’s a division-of-labor structure: the actor decides what to do; the critic tells it whether it did well.

Why actor-critic is more stable: reduce variance with advantages#

The docs mention Advantage Actor-Critic (A2C). The core of advantage is:

  • don’t use the raw return $G_t$ (high variance)
  • use the advantage $A_t$ (improvement relative to a baseline)

A common advantage form:

$$A_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This is basically the TD residual.
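
As a quick sanity check on the formula, here is a small sketch of computing that one-step TD-residual advantage, assuming a PyTorch critic like the one above and batched tensors; the function name is mine:

```python
import torch

def td_advantage(critic, obs, reward, next_obs, done, gamma=0.99):
    """A_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t); no bootstrap on terminal states."""
    with torch.no_grad():
        td_target = reward + gamma * critic(next_obs) * (1.0 - done)
        advantage = td_target - critic(obs)
    return advantage, td_target
```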

The A2C implementation loop (in the order I write code)#

If you write A2C as a training loop you’ll actually maintain, it usually looks like this:

  • use the actor to sample actions from the current policy and interact with the environment
  • let the critic estimate $V(s)$
  • compute advantages $A_t$ via TD or GAE
  • use them as weights to update the actor (loss like $-\log\pi(a|s)\cdot A$)
  • meanwhile, the critic does regression, fitting $V(s)$ to the bootstrap target (loss like $\|V(s) - \text{target}\|^2$)
  • many implementations also add an entropy bonus, not to be “smarter” but to prevent the policy from collapsing too early into an almost-deterministic output
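
Here is a minimal single-update sketch of that loop, assuming the Actor/Critic modules from above, a discrete action space, and one shared optimizer; the coefficients are placeholder values, not tuned:

```python
import torch

def a2c_update(actor, critic, optimizer, batch,
               gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """One A2C gradient step on a batch of transitions (obs, action, reward, next_obs, done)."""
    obs, action, reward, next_obs, done = batch

    # Bootstrap TD target: no need to wait for the episode to end
    with torch.no_grad():
        td_target = reward + gamma * critic(next_obs) * (1.0 - done)

    value = critic(obs)
    advantage = (td_target - value).detach()   # the advantage only weights the actor's gradient

    dist = actor(obs)
    actor_loss = -(dist.log_prob(action) * advantage).mean()   # -log pi(a|s) * A
    critic_loss = (value - td_target).pow(2).mean()            # regression to the bootstrap target
    entropy = dist.entropy().mean()                            # bonus against premature collapse

    loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return actor_loss.item(), critic_loss.item(), entropy.item()
```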

The docs mention a trick: you can constrain the entropy of the policy distribution so entropy doesn’t become too small, ensuring exploration.

A risk mentioned in the docs: both networks can be inaccurate#

The docs say A2C’s downside is that you estimate two networks, so the risk doubles. That’s not an exaggeration:

  • if the critic is wrong, the advantage is wrong, and the actor gets dragged off course
  • if the actor drifts, the critic’s data distribution also drifts

Common engineering mitigations:

  • keep the value-loss weight (value_coef) moderate, so the critic’s loss doesn’t dominate training
  • advantage normalization
  • gradient clipping (clip grad norm)
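
For what it’s worth, here is a sketch of how these three mitigations typically show up in code, assuming the update step above; the helper names are mine:

```python
import torch
import torch.nn as nn

def normalize_advantage(advantage, eps=1e-8):
    """Normalize advantages within the batch to zero mean and unit variance."""
    return (advantage - advantage.mean()) / (advantage.std() + eps)

def clip_gradients(actor, critic, max_norm=0.5):
    """Clip the global gradient norm of both networks; call between backward() and step()."""
    params = list(actor.parameters()) + list(critic.parameters())
    return nn.utils.clip_grad_norm_(params, max_norm)

# Keeping value_coef modest (e.g. 0.5 or smaller) keeps the critic's regression loss
# from dominating the shared gradient:
#   loss = actor_loss + value_coef * critic_loss - entropy_coef * entropy
```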

Chapter summary: Actor-Critic is the “general backbone” of deep RL#

Many famous continuous-control algorithms that come later (DDPG/TD3/SAC) are really different implementations of the actor-critic idea:

  • the critic learns Q
  • the actor produces (approximately) optimal actions

In harder tasks like sparse rewards and imitation learning, actor-critic also often serves as the foundation.

Next chapter we’ll talk about sparse rewards: when the environment gives almost no feedback, how do you keep the agent learning?

RL Notes (9): Actor-Critic
https://xiaohei94.github.io/en/blog/rl-learning-9
Author 红鼻子小黑
Published on May 10, 2025