Xiaohei's Blog

Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space. Due to length constraints, the Chapter 12 Mubu notes are there. If you find any logical mistakes, please let me know in the comments—thank you!

Start#

If you’ve done any robotics or control tasks, you’ll quickly notice: actions are continuous (torque, angles, speeds), but DQN’s worldview is discrete actions.

The docs make the point clearly in this chapter: DQN fits discrete actions because it needs to compute $\max_a Q(s,a)$. With continuous actions, you can't solve that max by enumeration.

So there are four ideas. I’ll explain them in an order from “most naive” to “most engineering-friendly”: sampling, doing gradient ascent over actions, designing more complex network structures, and—not using DQN (i.e., moving to Actor-Critic).

Approach 1: sample actions#

The docs are straightforward:

Sample $N$ candidate actions $\{a_1,\dots,a_N\}$, plug them into $Q(s,a)$ to score them, and execute the highest-scoring action. It's as naive as brute-force search, so it's easy to implement and works well as a baseline; but it quickly becomes expensive and inaccurate in high-dimensional action spaces.

Pros: easy to implement. Cons: inaccurate and computationally expensive; it blows up as dimensionality increases.
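The sampling idea fits in a few lines. Here is a minimal sketch; `toy_q` is a made-up stand-in for a trained critic, just so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_q(state, actions):
    # Hypothetical critic: a toy quadratic that peaks at a
    # state-dependent target action. A real Q would be a network.
    target = np.tanh(state).sum()
    return -np.sum((actions - target) ** 2, axis=-1)

def sample_max_q(state, n_samples=1024, action_dim=2, low=-1.0, high=1.0):
    """Approximate argmax_a Q(s, a) by scoring random candidates."""
    candidates = rng.uniform(low, high, size=(n_samples, action_dim))
    scores = toy_q(state, candidates)
    return candidates[np.argmax(scores)]

state = np.array([0.3, -0.1])
best = sample_max_q(state)
```

Note how the cost scales: to keep the same coverage density, the number of samples has to grow exponentially with `action_dim`, which is exactly the "blows up with dimensionality" problem.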

Approach 2: gradient ascent over actions (treat a as an optimization variable)#

The docs describe it as an optimization problem: maximize the objective $Q(s,a)$ with respect to $a$.

Initialize an action $a$, then iterate gradient ascent on $a$ to find a local maximum.

The problems are obvious:

It suffers from the classic local-vs-global optimum issue, and it can be slow in engineering—because every decision requires multiple optimization iterations, turning “choose an action” into a small inner optimization loop.
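That inner loop looks something like the sketch below. Again `toy_q` is a hypothetical critic (a smooth toy function, not a trained network), and the gradient is estimated with finite differences so the snippet needs no autograd library:

```python
import numpy as np

def toy_q(state, action):
    # Hypothetical critic: smooth toy surrogate for Q(s, a).
    target = np.tanh(state).sum()
    return -np.sum((action - target) ** 2)

def grad_ascent_action(state, action_dim=2, lr=0.1, steps=50, eps=1e-5):
    """Treat a as the optimization variable: a <- a + lr * dQ/da.
    Gradient via central finite differences; a real implementation
    would backprop through the Q network instead."""
    a = np.zeros(action_dim)
    for _ in range(steps):
        grad = np.zeros(action_dim)
        for i in range(action_dim):
            e = np.zeros(action_dim)
            e[i] = eps
            grad[i] = (toy_q(state, a + e) - toy_q(state, a - e)) / (2 * eps)
        a = a + lr * grad  # ascent step on the action, not the weights
    return a

state = np.array([0.3, -0.1])
a_star = grad_ascent_action(state)
```

The `steps` loop is the point: every single action selection pays for 50 optimization iterations, and a non-convex Q can still trap `a` in a local maximum.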

Approach 3: design more complex network architectures#

The docs mention that "the math is complex, but the idea is nice": apply transformations so that continuous actions can be handled more like discrete ones.

This class appears more in academic settings or special structures (e.g., quantization, hierarchical actions). In engineering, the more common route I see is the one below: use an actor to learn actions directly.

Approach 4: don’t use DQN (switch to other methods)#

The docs’ final “haha” is very real: often the most hassle-free solution for continuous actions is to stop forcing DQN, and use Actor-Critic.

The reason is that the actor-critic division of labor is natural: the actor outputs continuous actions (or a continuous action distribution) directly, and the critic evaluates and provides gradient signals. This way you don’t need to compute a hard max in continuous space at every step; you approximate it with a single network forward pass.

This is much more natural than “maximizing over continuous actions.”
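To make "one forward pass instead of a max" concrete, here is a minimal structural sketch. The tiny networks and their shapes are assumptions for illustration (randomly initialized, no training loop); real methods like DDPG or SAC add the training machinery on top of exactly this division of labor:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyActorCritic:
    """Structural sketch only: the actor maps state -> continuous action
    in one forward pass; the critic scores (state, action) pairs."""

    def __init__(self, state_dim, action_dim, hidden=16):
        self.w1 = rng.normal(0, 0.1, (state_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, action_dim))
        self.v1 = rng.normal(0, 0.1, (state_dim + action_dim, hidden))
        self.v2 = rng.normal(0, 0.1, (hidden, 1))

    def actor(self, s):
        # Outputs a continuous action directly -- no max over actions.
        return np.tanh(np.tanh(s @ self.w1) @ self.w2)

    def critic(self, s, a):
        # Scores the chosen action; its gradient w.r.t. a is what
        # trains the actor in methods like DDPG.
        x = np.concatenate([s, a])
        return float(np.tanh(x @ self.v1) @ self.v2)

ac = TinyActorCritic(state_dim=3, action_dim=2)
s = np.array([0.5, -0.2, 0.1])
a = ac.actor(s)    # one forward pass gives the action
q = ac.critic(s, a)  # critic evaluates it
```

Compare with Approaches 1 and 2: there is no candidate set to enumerate and no inner optimization loop per decision.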

Chapter summary: continuous actions push you toward Actor-Critic#

If you feel “forcing Q methods in continuous actions is awkward,” that means you’ve understood it correctly.

Next chapter we’ll enter Chapter 9 of the docs: Actor-Critic algorithms. You’ll see how they combine policy gradients and TD learning, making continuous control and stable training more feasible.

RL Notes (8): Q Methods for Continuous Actions
https://xiaohei94.github.io/en/blog/rl-learning-8
Author 红鼻子小黑
Published on May 9, 2025