Imitation Learning vs. Reinforcement Learning
In this post, we’ll compare imitation learning (IL) and reinforcement learning (RL). The comparison draws on the excellent recent notes and course by Bohg, Pavone, and Sadigh [1].
| | RL | IL |
|---|---|---|
| Goal | Determine closed-loop control policies that maximize an accumulated reward | Determine closed-loop control policies that imitate an expert’s demonstrated behavior |
| Reward Function | Known | Unknown |
| Access to Expert Demonstrations? | No | Yes |
| Reward Frequency | Sparse | Dense |
| MDP w/ state space S and action space A | Yes | Yes |
There are some typical delineations we use to discuss these types of methods:
| Reinforcement Learning | Imitation Learning |
|---|---|
| 1. Model-Free: collect system data to directly update a learned value function or policy. 2. Model-Based: collect system data to update a learned model of the system. | 1. Directly imitate the expert’s policy (e.g. BC or DAgger). 2. Indirectly imitate the policy by learning the expert’s reward function (IRL). |
Behavior cloning (BC) has been around since the 1980s [2]. It is simple and often efficient: fit a policy to expert (state, action) pairs with a supervised loss; for stochastic policies, a KL divergence can be used:
\[\begin{aligned} &\arg\min_{\theta} \; \mathbb{E}_{(s,a^*) \sim P^*} \left[ L(a^*, \pi_{\theta}(s)) \right] \\ &\arg\min_{\theta} \; \mathbb{E}_{(s,a^*) \sim P^*} \left[ \mathrm{KL}(a^* \,\|\, \pi_{\theta}(s)) \right] \end{aligned}\]
Why is behavior cloning insufficient at times?
- First, it provides no way to understand the underlying reasons for the expert behavior (e.g. intentions).
- Second, the expert may be suboptimal.
- Third, it assumes IID (state, action) pairs. At training time \(s \sim P^*\), but at test time \(s \sim P(s \mid \pi_\theta)\), leading to a distribution mismatch between training and testing (unless \(P^* \approx P(s \mid \pi_\theta)\)).
- Fourth, it performs no long-term planning.
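To make the BC objective above concrete, here is a minimal supervised-learning sketch (PyTorch, with a deterministic policy and an MSE surrogate for \(L\); the dataset, network sizes, and hyperparameters are illustrative placeholders, not prescribed by the notes):

```python
# Minimal behavior-cloning sketch: fit pi_theta(s) to expert (state, action) pairs.
# The random "expert" dataset and all hyperparameters are made up for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, action_dim = 8, 2  # hypothetical dimensions

# Expert demonstrations (s, a*) ~ P*; random placeholders stand in for real logs.
states = torch.randn(10_000, state_dim)
expert_actions = torch.randn(10_000, action_dim)
loader = DataLoader(TensorDataset(states, expert_actions), batch_size=256, shuffle=True)

# Deterministic policy pi_theta(s); MSE plays the role of L(a*, pi_theta(s)).
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for s, a_star in loader:
        loss = nn.functional.mse_loss(policy(s), a_star)  # arg min_theta E[L(a*, pi_theta(s))]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

DAgger (mentioned in the table above) targets the third failure mode: it repeatedly rolls out the learned policy, asks the expert to label the states the policy actually visits, and retrains on the aggregated dataset, so the training distribution tracks \(P(s \mid \pi_\theta)\).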
RL
Model-Based: estimate the transition model \(T(x^\prime \mid x, u)\) from data, then use the model to plan (see the sketch after this list).
Model-Free:
- Value-based approach: estimate the optimal value (or Q) function from data.
- Policy-based approach: instead, use data to determine how to improve the policy directly (e.g. Policy Gradient).
- Actor-Critic approach: learn both a value function (the critic) and a policy (the actor).
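As a rough illustration of the model-based recipe at the top of this list, the sketch below fits a linear dynamics model by least squares and plans with random-shooting MPC; the toy system, reward, dimensions, and candidate count are all hypothetical.

```python
# Model-based sketch: fit an approximate T(x' | x, u) from logged transitions, then plan with it.
# The linear "true" system, reward, and sizes are toy assumptions for illustration only.
import numpy as np

x_dim, u_dim, horizon = 4, 2, 10

# (1) Fit a linear model x' ~= [x, u] @ A by least squares on logged transitions.
X = np.random.randn(5000, x_dim)
U = np.random.randn(5000, u_dim)
X_next = X + 0.1 * U @ np.random.randn(u_dim, x_dim)   # stand-in for the unknown real system
A, *_ = np.linalg.lstsq(np.hstack([X, U]), X_next, rcond=None)

def rollout_return(x0, controls, reward_fn):
    """Roll the learned model forward and accumulate reward."""
    x, total = x0, 0.0
    for u in controls:
        x = np.hstack([x, u]) @ A
        total += reward_fn(x, u)
    return total

def plan(x0, reward_fn, n_candidates=256):
    """(2) Random-shooting MPC: sample action sequences, return the first action of the best one."""
    best_u, best_ret = None, -np.inf
    for _ in range(n_candidates):
        controls = np.random.randn(horizon, u_dim)
        ret = rollout_return(x0, controls, reward_fn)
        if ret > best_ret:
            best_ret, best_u = ret, controls[0]
    return best_u

reward = lambda x, u: -np.sum(x**2) - 0.01 * np.sum(u**2)
u0 = plan(np.ones(x_dim), reward)
```

In practice the learned model is usually nonlinear (e.g. a neural network) and the planner more sophisticated, but the fit-the-model-then-plan structure is the same.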
Unlike IL, there is no fixed dataset in RL; the agent actively gathers the data that will be used for learning (through exploration).
| | Q-Learning | Policy Gradient |
|---|---|---|
| Supported Action Spaces | Discrete | Discrete or Continuous |
| Data Efficiency | “Off-policy” (can learn from all interaction data) | “On-policy” (can only use trajectories from the current policy) |
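To make the off-policy vs. on-policy row concrete, here is a toy tabular sketch of both update rules; the state/action counts, step sizes, and softmax parameterization are hypothetical, not taken from any particular lecture.

```python
# Toy contrast between Q-learning (off-policy) and REINFORCE (on-policy).
import numpy as np

n_states, n_actions, gamma, alpha = 16, 4, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """Off-policy: the transition can come from any behavior policy or a replay buffer."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_learning_update(s=3, a=1, r=1.0, s_next=4)  # e.g. a transition pulled from old logs

# REINFORCE is on-policy: the trajectory must be sampled from the CURRENT pi_theta,
# and becomes stale as soon as theta changes.
theta = np.zeros((n_states, n_actions))

def softmax_policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def reinforce_update(trajectory, lr=0.01):
    """trajectory: list of (state, action, return-to-go) from the current policy."""
    for s, a, G in trajectory:
        p = softmax_policy(s)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0     # d/dtheta[s] log pi(a|s) for a softmax over theta[s]
        theta[s] += lr * G * grad_log_pi

reinforce_update([(3, 1, 2.5), (4, 0, 1.5)])  # must be freshly sampled with pi_theta
```

The Q-learning update happily consumes any logged transition (hence replay buffers), whereas the REINFORCE update is only unbiased for trajectories sampled from the current \(\pi_\theta\).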
See also the CMU Deep RL course: https://cmudeeprl.github.io/403website_s22/lectures/
References
[1] Jeannette Bohg, Marco Pavone, Dorsa Sadigh. Principles of Robot Autonomy. Course notes (PDF), slides, and RL slides.
[2] Dean Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS, 1988. PDF.
[3] Stuart Russell. Learning Agents for Uncertain Environments. COLT, 1998. PDF.
[4] Andrew Y. Ng, Stuart Russell. Algorithms for Inverse Reinforcement Learning. ICML, 2000. PDF.
[5] Saurabh Arora, Prashant Doshi. A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress. PDF.
[6] Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. PDF.