

## Preface
Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the Mubu notes for Chapter 12 live there as well. If you find any logical mistakes, please let me know in the comments. Thank you!
## Start
When I first tried to bring DQN ideas into continuous control, the most direct frustration came from one question: "You want me to take a max over a continuous action space. How?" With discrete actions, you can enumerate a handful of actions and pick the best; with continuous actions, the action set is uncountable, and the max becomes a real optimization problem. You can approximate it by sampling candidate actions, or run gradient ascent over the action at every step, but neither route is easy to make both accurate and fast.
In my view, DDPG is a very "engineering" compromise: it stops hard-computing the max and instead directly learns a network that outputs "which continuous action to take in state $s$," turning action selection into a single forward pass. With that, the Q network evaluates (critic), the policy network outputs actions (actor), and the two pull each other along, forming the classic actor-critic backbone for continuous control.
## Discrete vs continuous actions: why the "output layer" is different
The docs start from the action-space difference, and I think that's essential, because many implementation bugs come from "you thought the action was a probability, but it's actually a float." Discrete actions are countable: the network usually outputs a set of logits, then a softmax gives $\pi(a \mid s)$; each action has a probability and the probabilities sum to 1. Continuous actions are uncountable; a common approach is to output the action value directly, i.e., a deterministic policy $a = \mu_\theta(s)$.
To keep outputs within the environment's allowed range, engineering implementations often add a tanh layer at the output: first squash the network output to $[-1, 1]$, then linearly scale it to the environment's action bounds. The cart example in the docs is very practical: the raw network output might be 2.8, after tanh it becomes 0.99, and mapping from $[-1, 1]$ to $[-2, 2]$ yields the final action 1.98. This detail looks small, but it determines whether later noise injection and target policy smoothing perturb the action on the correct scale.
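As a concrete reference, here is a minimal sketch of the squash-and-scale step. The function name and the gym-style `low`/`high` bounds are my own illustration, not from the docs; the $[-2, 2]$ range matches the example above.

```python
import numpy as np

def squash_and_scale(raw, low, high):
    """Map an unbounded network output into [low, high]:
    tanh squashes to [-1, 1], then an affine map rescales."""
    squashed = np.tanh(raw)                           # 2.8 -> ~0.99
    return low + (squashed + 1.0) * 0.5 * (high - low)

print(squash_and_scale(2.8, low=-2.0, high=2.0))      # ~1.985 (the docs round to 1.98)
```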
## DDPG: Deep + Deterministic + Policy Gradient
The docs break down the name clearly: Deep means neural networks are used; Deterministic means it outputs deterministic actions; Policy Gradient means actor updates use policy-gradient ideas. The more crucial line is: DDPG can be seen as a continuous-action extension of DQN. Interpreted from an implementation perspective, this boils down to three things: experience replay, target networks, and the actor-critic dual-network structure.
DDPG typically has four networks:

- an actor $\mu_\theta$, which outputs an action directly given a state;
- a critic $Q_\phi$, which gives a differentiable score for "how good is taking this action in this state";
- a target actor and a target critic, delayed-update copies that provide a more stable bootstrap target so the TD target doesn't drift during training.
The critic update looks like DQN: regress $Q_\phi(s, a)$ toward a TD target. The actor update is "nudge the actor toward actions the critic scores higher," i.e., push the actor parameters $\theta$ along $\nabla_\theta \, Q_\phi(s, \mu_\theta(s))$, which by the chain rule is $\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s)$.
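To make the two updates concrete, here is a minimal PyTorch sketch of one DDPG step. All names here (the four networks, the optimizers, the batch layout) are my own illustration under stated assumptions, not a reference implementation from the docs.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update on a replay batch (state, action, reward,
    next_state, done); reward and done are column tensors."""
    state, action, reward, next_state, done = batch

    # Critic: regress Q(s, a) toward the TD target, exactly as in DQN.
    with torch.no_grad():
        next_action = target_actor(next_state)
        td_target = reward + gamma * (1 - done) * target_critic(next_state, next_action)
    critic_loss = F.mse_loss(critic(state, action), td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: minimize -Q(s, mu(s)), i.e. ascend the critic's score of
    # the actor's own action; gradients flow through the critic into mu.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) updates keep the target networks trailing slowly.
    with torch.no_grad():
        for net, tgt in ((actor, target_actor), (critic, target_critic)):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.mul_(1 - tau).add_(tau * p)
```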
## Why DDPG must add noise: exploration under a deterministic policy
The docs explain the reason plainly: DDPG trains a deterministic policy off-policy. Without noise, with fixed parameters, the same state always outputs the same action—early on it’s almost impossible to cover enough of the action space, and the learning signal becomes very sparse.
So during training we add noise to the action, making the behavior policy

$$a = \mu_\theta(s) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$
The docs mention “zero-mean Gaussian noise works well,” which is also the default in many implementations. Noise often decays over training: larger early for exploration, smaller later for higher-quality data. During evaluation, we remove noise to assess pure exploitation performance.
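A minimal sketch of this behavior policy with a decaying noise scale follows. The specific sigma values and schedule are illustrative assumptions, not values from the docs; in practice sigma should be set relative to the action range.

```python
import numpy as np

def noisy_action(action, sigma, low, high):
    """Behavior policy: actor output + zero-mean Gaussian noise, clipped
    back into the environment's action bounds."""
    noise = np.random.normal(0.0, sigma, size=np.shape(action))
    return np.clip(action + noise, low, high)

# Illustrative linear decay: explore more early, exploit more later.
sigma_start, sigma_end, decay_steps = 0.3, 0.05, 100_000

def sigma_at(step):
    frac = min(step / decay_steps, 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

# At evaluation time, skip the noise entirely: act = actor(state).
```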
## DDPG pain points: hyperparameter sensitivity and Q overestimation
The docs point out typical drawbacks of DDPG: it’s very sensitive to hyperparameters, and the learned Q function may start to significantly overestimate, breaking the policy. This is common in practice: you see Q values keep increasing, but returns suddenly collapse—like the model “self-hypnotizes” into believing some actions are very good, then drags the actor off course.
This instability is largely addressed later by TD3.
## TD3: three tricks to stabilize DDPG
The docs list TD3’s three techniques clearly; I’ll translate them into more implementation-friendly intuition.
The first is clipped double Q-learning. TD3 learns two critics, $Q_{\phi_1}$ and $Q_{\phi_2}$, and uses the smaller of the two when building the target. This is simple but immediately effective against overestimation, because you stop trusting the optimistic bias of any single critic.
The second is delayed policy updates. TD3 does not update the actor every time it updates the critic; instead, let the critic learn a few more steps to make Q estimates more reliable, then update the actor occasionally. The rule of thumb in the docs is “update the policy once every two critic updates,” which is also a common default.
The third is target policy smoothing. When computing the TD target, don’t use the target actor’s action directly; add a small clipped noise to it so the Q target is smoother in the action dimension, reducing the actor’s ability to exploit critic errors.
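Putting the three tricks together, here is a minimal PyTorch sketch of one TD3 update step. Every name (networks, optimizers, batch layout) is my own illustration; I'm assuming a single optimizer over both critics and a $[-1, 1]$ action range, not quoting a reference implementation.

```python
import torch
import torch.nn.functional as F

def td3_update(step, actor, critic1, critic2, target_actor,
               target_critic1, target_critic2, actor_opt, critic_opt,
               batch, gamma=0.99, policy_noise=0.2, noise_clip=0.5,
               policy_delay=2, act_low=-1.0, act_high=1.0):
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # (3) Target policy smoothing: perturb the target action with
        # small, clipped noise before querying the target critics.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (target_actor(next_state) + noise).clamp(act_low, act_high)

        # (1) Clipped double Q: trust the smaller of the two estimates.
        target_q = torch.min(target_critic1(next_state, next_action),
                             target_critic2(next_state, next_action))
        td_target = reward + gamma * (1 - done) * target_q

    # Both critics regress toward the same pessimistic target
    # (critic_opt is assumed to hold the parameters of both critics).
    critic_loss = (F.mse_loss(critic1(state, action), td_target)
                   + F.mse_loss(critic2(state, action), td_target))
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # (2) Delayed policy updates: touch the actor (and, in a full
    # implementation, the target networks) only every `policy_delay` steps.
    if step % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```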
## Chapter summary: the "shortest path" for continuous control
If I compress the key points of this chapter into one engineering summary: DDPG hands “continuous-action argmax” to an actor network as an approximation, stabilizes the critic with replay + target networks, and restores exploration for deterministic policies with action noise; TD3 then patches DDPG’s most common instability points one by one with min over two critics, delayed actor updates, and target-action smoothing.
If you plan to implement this chapter in code next, I suggest first ensuring these three things are correct: action scaling (tanh + scale), noise scale (matched to action range), and done-boundary handling (don’t bootstrap on terminal states). Many “algorithm problems” ultimately reduce to these three implementation details.
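For the third point, the load-bearing line is the terminal mask the sketches above already used. Isolated here for emphasis, assuming `done` is stored as 0.0/1.0 in the replay buffer:

```python
# Wrong at episode ends: bootstraps off a successor state that doesn't exist.
# td_target = reward + gamma * target_q
# Right: mask out the bootstrap term on terminal transitions.
td_target = reward + gamma * (1.0 - done) * target_q
```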