Imitation Learning vs. Reinforcement Learning
In this post, we’ll compare imitation learning (IL) and reinforcement learning (RL). The comparison draws on the excellent recent notes and course by Bohg, Pavone, and Sadigh [1].
| | RL | IL |
|---|---|---|
| Goal | Determine closed-loop control policies that maximize an accumulated reward | Determine closed-loop control policies that imitate an expert’s demonstrated behavior |
| Reward Function | Known | Unknown |
| Access to Expert Demonstrations? | No | Yes |
| Reward Frequency | Sparse | Dense |
| MDP w/ state space S and action space A | Yes | Yes |
There are some typical delineations we use to discuss these types of methods:
| Reinforcement Learning | Imitation Learning |
|---|---|
| 1. Model-Free: collect system data to directly update a learned value function or policy. 2. Model-Based: collect system data to update a learned model of the system. | 1. Directly imitate the expert’s policy (e.g. BC or DAgger). 2. Indirectly imitate the policy by learning the expert’s reward function (IRL). |
Behavior cloning (BC) has been around since the 1980s [2]. It is simple and often efficient: fit a policy to expert (state, action) pairs with a supervised loss; for stochastic policies, a KL divergence can be used:
\[\begin{aligned} &\arg\min_{\theta} \; \mathbb{E}_{(s,a^*) \sim P^*} \left[ L(a^*, \pi_{\theta}(s)) \right] \\ &\arg\min_{\theta} \; \mathbb{E}_{(s,a^*) \sim P^*} \left[ \mathrm{KL}(a^* \,\|\, \pi_{\theta}(s)) \right] \end{aligned}\]
Why is behavior cloning insufficient at times?
- First, it provides no way to understand the underlying reasons for the expert behavior (e.g. intentions).
- Second, the expert may be suboptimal.
- Third, it assumes IID (state, action) pairs. At training time \(s \sim P^*\), but at test time \(s \sim P(s \mid \pi_\theta)\), leading to a distribution mismatch between training and testing (unless \(P^* \approx P(s \mid \pi_\theta)\)).
- Fourth, it performs no long-term planning.
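To make the BC objective above concrete, here is a minimal supervised-learning sketch (PyTorch, with a deterministic policy and an MSE surrogate for \(L\); the dataset, network sizes, and hyperparameters are illustrative placeholders, not prescribed by the notes):

```python
# Minimal behavior-cloning sketch: fit pi_theta(s) to expert (state, action) pairs.
# The random "expert" dataset and all hyperparameters are made up for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, action_dim = 8, 2  # hypothetical dimensions

# Expert demonstrations (s, a*) ~ P*; random placeholders stand in for real logs.
states = torch.randn(10_000, state_dim)
expert_actions = torch.randn(10_000, action_dim)
loader = DataLoader(TensorDataset(states, expert_actions), batch_size=256, shuffle=True)

# Deterministic policy pi_theta(s); MSE plays the role of L(a*, pi_theta(s)).
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for s, a_star in loader:
        loss = nn.functional.mse_loss(policy(s), a_star)  # arg min_theta E[L(a*, pi_theta(s))]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

DAgger (mentioned in the table above) targets the third failure mode: it repeatedly rolls out the learned policy, asks the expert to label the states the policy actually visits, and retrains on the aggregated dataset, so the training distribution tracks \(P(s \mid \pi_\theta)\).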
RL
Model-Based: estimate the transition model \(T(x^\prime \mid x, u)\) from data, then use the model to plan (see the sketch after this list).
Model-Free:
- Value-based approach: estimate the optimal value (or Q) function from data.
- Policy-based approach: instead, use data to determine how to improve the policy directly (e.g. Policy Gradient).
- Actor-Critic approach: learn both a value function (the critic) and a policy (the actor).
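As a rough illustration of the model-based recipe at the top of this list, the sketch below fits a linear dynamics model by least squares and plans with random-shooting MPC; the toy system, reward, dimensions, and candidate count are all hypothetical.

```python
# Model-based sketch: fit an approximate T(x' | x, u) from logged transitions, then plan with it.
# The linear "true" system, reward, and sizes are toy assumptions for illustration only.
import numpy as np

x_dim, u_dim, horizon = 4, 2, 10

# (1) Fit a linear model x' ~= [x, u] @ A by least squares on logged transitions.
X = np.random.randn(5000, x_dim)
U = np.random.randn(5000, u_dim)
X_next = X + 0.1 * U @ np.random.randn(u_dim, x_dim)   # stand-in for the unknown real system
A, *_ = np.linalg.lstsq(np.hstack([X, U]), X_next, rcond=None)

def rollout_return(x0, controls, reward_fn):
    """Roll the learned model forward and accumulate reward."""
    x, total = x0, 0.0
    for u in controls:
        x = np.hstack([x, u]) @ A
        total += reward_fn(x, u)
    return total

def plan(x0, reward_fn, n_candidates=256):
    """(2) Random-shooting MPC: sample action sequences, return the first action of the best one."""
    best_u, best_ret = None, -np.inf
    for _ in range(n_candidates):
        controls = np.random.randn(horizon, u_dim)
        ret = rollout_return(x0, controls, reward_fn)
        if ret > best_ret:
            best_ret, best_u = ret, controls[0]
    return best_u

reward = lambda x, u: -np.sum(x**2) - 0.01 * np.sum(u**2)
u0 = plan(np.ones(x_dim), reward)
```

In practice the learned model is usually nonlinear (e.g. a neural network) and the planner more sophisticated, but the fit-the-model-then-plan structure is the same.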
Unlike IL, there is no fixed dataset in RL; the agent actively gathers the data that will be used for learning (through exploration).
| | Q-Learning | Policy Gradient |
|---|---|---|
| Supported Action Spaces | Discrete | Discrete or Continuous |
| Data Efficiency | “Off-policy” (can learn from all interaction data) | “On-policy” (can only use trajectories from the current policy) |
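To make the off-policy vs. on-policy row concrete, here is a toy tabular sketch of both update rules; the state/action counts, step sizes, and softmax parameterization are hypothetical, not taken from any particular lecture.

```python
# Toy contrast between Q-learning (off-policy) and REINFORCE (on-policy).
import numpy as np

n_states, n_actions, gamma, alpha = 16, 4, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """Off-policy: the transition can come from any behavior policy or a replay buffer."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_learning_update(s=3, a=1, r=1.0, s_next=4)  # e.g. a transition pulled from old logs

# REINFORCE is on-policy: the trajectory must be sampled from the CURRENT pi_theta,
# and becomes stale as soon as theta changes.
theta = np.zeros((n_states, n_actions))

def softmax_policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def reinforce_update(trajectory, lr=0.01):
    """trajectory: list of (state, action, return-to-go) from the current policy."""
    for s, a, G in trajectory:
        p = softmax_policy(s)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0     # d/dtheta[s] log pi(a|s) for a softmax over theta[s]
        theta[s] += lr * G * grad_log_pi

reinforce_update([(3, 1, 2.5), (4, 0, 1.5)])  # must be freshly sampled with pi_theta
```

The Q-learning update happily consumes any logged transition (hence replay buffers), whereas the REINFORCE update is only unbiased for trajectories sampled from the current \(\pi_\theta\).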
See also the CMU Deep RL course: https://cmudeeprl.github.io/403website_s22/lectures/
References
[1] Jeannette Bohg, Marco Pavone, Dorsa Sadigh. Principles of Robot Autonomy. Course notes (PDF), slides, and RL slides.
[2] Dean Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS, 1988. PDF.
[3] Stuart Russell. Learning Agents for Uncertain Environments. COLT, 1998. PDF.
[4] Andrew Y. Ng, Stuart Russell. Algorithms for Inverse Reinforcement Learning. ICML, 2000. PDF.
[5] Saurabh Arora, Prashant Doshi. A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress. PDF.
[6] Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. PDF.