Reinforcement Learning Notes I - Introduction & Imitation Learning

Introductions, MDPs, Imitation Learning

Introduction

我们把 \(\left<s_t, a_t\right>\) 打包起来，其实他就构成了一个 Markov chain:

\[p_\theta (\tau) = p_\theta (s_1) \prod_{t=1}^T \pi_\theta (a_t \mid s_t) \cdot \mathbb{P}(s_{t+1} \mid s_t, a_t)\]

对于 learning objective：

\[\theta^{*} = \arg \max_{\theta} \mathcal{J}\left(\theta\right) = \arg \max_{\theta} \mathbb{E}_{\tau \sim {\pi}_{\theta} \left(\tau\right)}\left[r\left(\tau\right)\right]\]

我们定义我现在在 \(s_t\)，做了 action \(a_t\)，然后按照 \(\pi\) 所得到的期望 reward 为

\[\mathcal{Q}^{\pi} \left(s_t, a_t\right) = \sum_{t'=t}^T \mathbb{E}_{\left(s_t', a_t'\right) \sim \pi}\left[r\left(s_t', a_t'\right) \mid s_t, a_t\right]\]

然后我们定义 \(\mathcal{V}\) 表示现在我在 \(s_t\) 的时候遵循 \(\pi\) 所得到的期望 reward：

\[\mathcal{V}^{\pi} \left(s_t\right) = \sum_{t'=t}^T \mathbb{E}_{\left(s_t', a_t'\right) \sim \pi}\left[r\left(s_t', a_t'\right) \mid s_t\right] = \mathbb{E}_{a_t \sim \pi\left(a_t \mid s_t\right)}\left[\mathcal{Q}^{\pi} \left(s_t, a_t\right)\right]\]

Types of Algorithms

Policy Gradients: 跟往常一样，就是求导然后去做优化
Value-based: 去直接计算最优解的 \(\mathcal{Q}\) 和 \(\mathcal{V}\)，然后通过这个推出 \(\pi\)
Actor-critic: 通过计算当前 \(\hat{\theta}\) 的 \(\hat{\mathcal{Q}}\) 和 \(\hat{\mathcal{V}}\)，然后通过这个去进行优化，得到最终的 \(\pi\)
Model-based RL: 使用模型来代表各种参数，然后进行优化

Imitation Learning

数据 \(\mathcal{D} = \left\{(s_i, a_i)\right\}_{i=1}^N\)，然后我们要设计 \(\pi\) 去你和 \(a\)

首先最直觉的想法是直接去最优化

\[\theta^* = \arg\min_\theta \frac{1}{\left|\mathcal{D}\right|} \sum_{(s, a)\in \mathcal{D}} \left\|\hat{a} - a\right\|^2\]

但这样肯定是不对的。比如说我现在前面有一个障碍物，一半人要左转，一半人要右转。结果这样拟合出来，最优的 \(\hat{a}\) 就成直行撞墙了。

所以说我们考虑生成模型，也就是说去拟合 \(p_\theta (a \mid s)\)：

\[\theta^* = \arg\min_\theta \mathbb{E}_{(s, a)\sim \mathcal{D}}\left[-\log p_\theta (a \mid s)\right]\]

那怎么搞这个 \(p\) 呢？离散的好说，对于连续的，我们一般有三种方法：

Mixture of Gaussians
Discretize + Autoregressive
Diffusion

但是这些 imitation learning 有一个问题，就是 \(p_\pi(s)\neq p_\mathcal{D}(s)\)，这就导致了我们学习的东西，和我们现实中遇见的是稍有不同的。这边有个东西叫做“covariate shift”。详细解释了这一现象。

DAgger: 现根据 \(p_\theta\) 走到 \(s'\)，然后去询问 expert 应该怎么做，然后把这个数据加到 \(\mathcal{D}\) 里面去。

Reinforcement Learning Notes I - Introduction & Imitation Learning

Introduction

Types of Algorithms

Imitation Learning

Paper Reading

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Training Diffusion Models with Reinforcement Learning