Xiaohei's Blog

Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the Mubu notes for Chapter 12 are hosted there. If you find any logical mistakes, please let me know in the comments. Thank you!

Start#

Tabular Q-learning is beautiful, but it has a fatal premise: $Q(s,a)$ must fit in a table.

Once the state is an image, a continuous vector, or the combinatorics explode, tabular methods go bankrupt. DQN’s idea is simple:

Use a neural network to approximate $Q(s,a)$.
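To make that concrete, here is a minimal PyTorch sketch of such a network (my illustration, not code from the original notes); the state dimension, hidden width, and number of actions are placeholder parameters:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action: Q(s, a)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```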

But when you actually implement it, you'll find that naively applying the supervised-learning recipe makes training very unstable. The docs mention two key engineering components of DQN:

  • target network
  • experience replay

These two are, to a large extent, the reason DQN runs at all.

What DQN is: deep Q-learning#

The definition in the docs is standard: DQN is Q-learning based on deep learning, combining value-function approximation with neural networks.

We’re still learning $Q(s,a)$, and we still choose actions greedily or via ε-greedy:

$$a_t = \arg\max_a Q_\theta(s_t, a)$$

The difference is that $Q$ is no longer a table but a network $Q_\theta$.
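A minimal sketch of ε-greedy action selection under that setup, assuming the QNetwork above and a discrete action space; in practice epsilon would be annealed over training:

```python
import random
import torch

def select_action(q_net, state: torch.Tensor, epsilon: float, num_actions: int) -> int:
    """ε-greedy: random action with probability ε, otherwise argmax_a Q_θ(s, a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))  # add a batch dimension
        return int(q_values.argmax(dim=1).item())
```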

Why it becomes unstable: bootstrapping + non-i.i.d. samples + moving targets#

DQN’s training target is usually the TD target:

$$y_t = r_{t+1} + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a')$$

If you use the same network to simultaneously:

  • predict $Q_\theta$
  • generate the target $y_t$

then the target moves as you train (a moving target), and the optimization can easily diverge.
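Written out as a loss, the problem is easy to see. With a single network, both the prediction and the regression target depend on the same $\theta$:

$$L(\theta) = \mathbb{E}\left[\left(r_{t+1} + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)\right)^2\right]$$

Every gradient step that changes $\theta$ also shifts the target you are regressing toward.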

Target Network#

The docs mention using a target network as the first stability patch:

  • online network: $Q_\theta$
  • target network: $Q_{\theta^-}$ (parameters copied with a delay)

Common practice:

  • every N steps, copy $\theta$ to $\theta^-$
  • or do soft updates (smoother)
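A minimal sketch of both update styles, assuming online_net and target_net are two copies of the QNetwork above; the sync period N and the soft-update rate tau are illustrative hyperparameters:

```python
import copy
import torch

# Create the target network as a delayed copy of the online network, e.g.:
# online_net = QNetwork(state_dim, num_actions)
# target_net = copy.deepcopy(online_net)

def hard_update(target_net, online_net):
    """Every N environment steps: copy θ into θ⁻ wholesale."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau: float = 0.005):
    """θ⁻ ← τ·θ + (1 - τ)·θ⁻, applied every step for a smoother-moving target."""
    with torch.no_grad():
        for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
            t_param.mul_(1.0 - tau).add_(o_param, alpha=tau)
```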

Experience Replay (Replay Buffer)#

The docs mention experience replay. It addresses the fact that consecutive samples are temporally correlated, so updating online directly can bias the gradients.

The replay buffer helps in two ways:

On one hand, randomly sampling mini-batches breaks the temporal correlation, so the updates look more like approximately i.i.d. supervised learning; on the other hand, the same experiences get reused multiple times, which improves sample efficiency, something especially critical when environment interaction is expensive.
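A minimal replay buffer sketch along those lines, using a fixed-size deque and uniform random sampling; the capacity is a placeholder:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped once it is full."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the temporal correlation of the stream.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```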

Engineering experience:

  • a buffer that’s too small overfits to recent experience
  • a buffer that’s too large becomes “too offline,” and learning slows down

DQN training loop (pseudo-code)#

You can view it as a loop of three things: sample, store, learn.

  1. choose actions from $Q_\theta$ using ε-greedy
  2. interact with the environment to get $(s, a, r, s', \text{done})$
  3. store into the replay buffer
  4. sample a batch from the buffer, construct TD targets, and regress $Q_\theta(s,a)$ with MSE
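Putting the pieces together, here is a minimal sketch of one learning step, assuming the QNetwork, ReplayBuffer, and target-network helpers from the sketches above; the batch size and gamma are placeholders, and tensor conversion is simplified:

```python
import numpy as np
import torch
import torch.nn.functional as F

def dqn_learn_step(online_net, target_net, buffer, optimizer,
                   batch_size: int = 64, gamma: float = 0.99):
    """One gradient step: sample a batch, build TD targets, regress Q_θ(s, a) with MSE."""
    if len(buffer) < batch_size:
        return None  # wait until enough experience has been stored

    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states      = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)

    # Prediction: Q_θ(s, a) for the actions that were actually taken.
    q_sa = online_net(states).gather(1, actions).squeeze(1)

    # TD target: uses the frozen target network θ⁻, so no gradient flows through it.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * max_next_q

    loss = F.mse_loss(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full loop you would call this after every environment step (or every few steps) and sync the target network with hard_update or soft_update on its own schedule.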

Chapter summary: DQN is the beginning of “stability engineering”#

From this chapter on, you’ll notice that many innovations in deep RL are essentially addressing the same problem: training instability.

In the next chapter we continue along the DQN line: Double / Dueling / PER / NoisyNet… Each “variant” targets a specific pain point.

RL Notes (6): DQN
https://xiaohei94.github.io/en/blog/rl-learning-6
Author: 红鼻子小黑
Published on May 7, 2025