
Preface#

Note: this series is converted from my Mubu notes. If you prefer a more illustrated reading experience, you can read it in my Mubu space; due to length constraints, the full Mubu notes for this chapter live there. If you find any logical mistakes, please let me know in the comments. Thank you!

Start#

Many people start learning RL from games: rewards are clear and the cost of failure is low.

But once you look at the real world (robots, autonomous driving, healthcare), you’ll find:

Rewards are often hard to specify clearly (for example, what exactly does “drive like a human” mean in computable metrics?), and exploration cost is extremely high (one crash can be total loss). These two constraints make pure trial-and-error RL much less viable.

The docs give a very practical answer in this chapter: imitation learning.

Its input is not a reward signal but expert demonstrations: the first thing you learn is how the expert acts.

Two main lines of imitation learning#

The docs mention two major approaches:

One line is Behavior Cloning (BC), the other is Inverse Reinforcement Learning (IRL). One looks like supervised learning; the other is more like “learn preferences first, then learn decisions.”

I usually understand it like this:

BC treats “imitation” as supervised learning and directly learns $\pi(a \mid s)$; IRL first infers a “reward function / preference” from expert behavior, then runs RL under that reward.

1) Behavior Cloning (BC): the most supervised-learning-like imitation#

The docs describe it clearly: whatever the expert does, the agent learns to do the same.

Implementation-wise, it’s a supervised learning problem:

The input is the observation $s$ and the label is the expert action $a$. If actions are discrete, you typically use cross-entropy; if actions are continuous, MSE or negative log-likelihood is more common.
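
As a concrete sketch (assuming PyTorch and a discrete action space; the expert data here is just random placeholder tensors standing in for real (observation, action) pairs):

```python
import torch
import torch.nn as nn

# Toy stand-ins for a real expert dataset: 1000 observations of dim 4,
# expert actions drawn from 2 discrete choices (placeholder data).
obs_dim, n_actions = 4, 2
expert_obs = torch.randn(1000, obs_dim)
expert_act = torch.randint(0, n_actions, (1000,))

# The policy is just a classifier over actions.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # discrete actions -> cross-entropy

for epoch in range(20):
    logits = policy(expert_obs)          # pi(a|s) as logits
    loss = loss_fn(logits, expert_act)   # match the expert's labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# For continuous actions you would instead output a mean (and perhaps a
# log-std) and train with MSE or negative log-likelihood.
```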

2) DAgger: aggregate datasets to fix distribution shift#

The docs mention the idea behind DAgger (Dataset Aggregation): record what the expert would do in the states the model actually encounters.

You can think of it as a loop of “make mistakes, then ask the expert”: run your current policy for a while so you actually reach the failure-prone states; then query the expert “what would you do here”; then add the new data to the dataset and retrain. Repeat this, and the data distribution becomes closer and closer to what your model will truly see at deployment.

This continually pulls the data distribution back to “where you really end up going.”
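
Here is a minimal sketch of that loop. Everything concrete in it is a toy placeholder, not from the docs: `ToyEnv` and the `expert_action` oracle are hypothetical, and the retraining step is just the BC setup from the previous section.

```python
import torch
import torch.nn as nn

class ToyEnv:
    """Hypothetical 1-D environment: action 1 nudges the state right,
    action 0 nudges it left; an episode ends when |x| > 1."""
    def reset(self):
        self.x = torch.zeros(1)
        return self.x.clone()

    def step(self, action):
        self.x += 0.1 if action == 1 else -0.1
        self.x += 0.05 * torch.randn(1)      # noise so the policy drifts
        return self.x.clone(), self.x.abs().item() > 1.0

def expert_action(obs):
    # Hypothetical queryable expert: always push back toward the origin.
    return 0 if obs.item() > 0 else 1

def retrain(policy, dataset, steps=100):
    # Ordinary BC on everything aggregated so far.
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    obs = torch.stack([o for o, _ in dataset])
    act = torch.tensor([a for _, a in dataset])
    for _ in range(steps):
        loss = nn.functional.cross_entropy(policy(obs), act)
        opt.zero_grad()
        loss.backward()
        opt.step()

policy = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))
env, dataset = ToyEnv(), []

for _ in range(10):                          # DAgger iterations
    obs = env.reset()
    for _ in range(200):
        # 1) Run the CURRENT policy so we visit the states it actually reaches.
        with torch.no_grad():
            action = policy(obs).argmax().item()
        # 2) Record what the expert WOULD do in each of those states.
        dataset.append((obs, expert_action(obs)))
        obs, done = env.step(action)
        if done:
            break
    # 3) Aggregate the new labels into the dataset and retrain.
    retrain(policy, dataset)
```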

3) Inverse Reinforcement Learning (IRL): infer rewards from demonstrations#

Finally, the docs motivate IRL: behavior cloning doesn’t solve everything, so IRL is introduced.

IRL’s intuition:

Expert behavior usually implies some hidden “preference” or “reward.” IRL tries to infer that preference from demonstrations, then uses ordinary RL to learn a policy under the inferred reward.

This is valuable in settings where “reward is hard to specify but demonstrations are easier to obtain.”
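
Concretely, many modern instantiations (GAIL/AIRL-style methods, not spelled out in the docs) alternate between two steps: fit a signal that scores expert behavior above the current policy's, then improve the policy under that signal with ordinary RL. Below is a minimal sketch of the reward-fitting half using a GAIL-style discriminator; all dimensions and batches are random placeholders standing in for real expert demos and policy rollouts.

```python
import torch
import torch.nn as nn

sa_dim = 6  # placeholder: a 4-dim state concatenated with a 2-dim action
disc = nn.Sequential(nn.Linear(sa_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Random tensors standing in for real expert demonstrations and for
# (state, action) pairs sampled from the current policy.
expert_sa = torch.randn(256, sa_dim)
policy_sa = torch.randn(256, sa_dim)

for _ in range(100):
    # Train the discriminator to score expert pairs (label 1) above
    # policy pairs (label 0); its output plays the role of the
    # inferred "preference".
    loss = (bce(disc(expert_sa), torch.ones(256, 1))
            + bce(disc(policy_sa), torch.zeros(256, 1)))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The RL half of the loop would then use something like
# r(s, a) = -log(1 - sigmoid(disc(s, a))) as the reward, collect fresh
# policy rollouts, and repeat: the policy is rewarded for being
# indistinguishable from the expert.
with torch.no_grad():
    rewards = -torch.log(1 - torch.sigmoid(disc(policy_sa)) + 1e-8)
```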

Chapter summary: if you don’t want to write rewards, find demonstrations#

I summarize the value of imitation learning in one sentence:

In the real world, demonstrations are often cheaper than rewards.

BC helps you quickly get a usable policy, DAgger makes it less likely to drift, and IRL (or GAIL, Generative Adversarial Imitation Learning) lets you learn “expert-like” behavior without explicit rewards.

At this point, Chapter 11 of EasyRL-base completes a closed loop from foundational concepts to common algorithms to advanced topics (sparse rewards, imitation learning). If the series continues, I'd prioritize the continuous-control trio (DDPG/TD3/SAC) and the basic paradigms of offline RL; they often connect with imitation learning in real engineering.

RL Notes (11): Imitation Learning
https://xiaohei94.github.io/en/blog/rl-learning-11
Author: 红鼻子小黑
Published: May 12, 2025