Reinforcement Learning: Proximal Policy Optimization (PPO)

Tech sharing, 2023-08-27


Reference: lecture slides by Prof. Hung-yi Lee (李宏毅)

PPO: Default reinforcement learning algorithm at OpenAI

PPO = Policy Gradient taken from on-policy to off-policy, plus some constraints

Policy Gradient

Basic Concepts

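Actor: the agent that executes actions.
Env: the environment.
Reward Function: the function that assigns rewards.
Policy \(\pi\): a network with parameters \(\theta\).

Input: the current state of the Env.

Output: a distribution over the next action the actor will take.
Trajectory \(\tau\): the sequence of states and actions, \(\{s_1, a_1, s_2, a_2, \dots\}\).

Under parameters \(\theta\), the probability of a trajectory \(\tau\) is \(p_{\theta}(\tau)=p(s_1)p_{\theta}(a_1|s_1)p(s_2|s_1,a_1)p_{\theta}(a_2|s_2)\cdots\)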
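To make the factorization concrete, here is a minimal sketch that evaluates \(\log p_\theta(\tau)\) on a toy tabular problem. The state names, action names, and probability tables are made up for illustration; in practice the policy table would be the output of a network.

```python
import math

def trajectory_log_prob(trajectory, init_prob, policy, transition):
    """Return log p_theta(tau) for tau = [(s_1, a_1), (s_2, a_2), ...].

    init_prob[s]          : p(s_1 = s)                  (environment)
    policy[s][a]          : p_theta(a | s)              (depends on theta)
    transition[(s, a)][s2]: p(s_{t+1} = s2 | s_t, a_t)  (environment)
    """
    first_state, _ = trajectory[0]
    logp = math.log(init_prob[first_state])
    for t, (s, a) in enumerate(trajectory):
        logp += math.log(policy[s][a])
        if t + 1 < len(trajectory):
            next_state = trajectory[t + 1][0]
            logp += math.log(transition[(s, a)][next_state])
    return logp

# Hypothetical toy tables.
init_prob = {"s1": 1.0}
policy = {
    "s1": {"left": 0.25, "right": 0.75},
    "s2": {"left": 0.5, "right": 0.5},
}
transition = {("s1", "right"): {"s2": 1.0}}

tau = [("s1", "right"), ("s2", "left")]
prob = math.exp(trajectory_log_prob(tau, init_prob, policy, transition))
print(prob)  # 1.0 * 0.75 * 1.0 * 0.5
```

Working in log space avoids underflow when trajectories are long.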

Optimization

Objective

Given a trajectory \(\tau\), we can compute its total reward \(R(\tau)\).


Under a policy with parameters \(\theta\), the trajectory \(\tau\) is sampled, so what we actually need is the expected reward \(\overline{R_\theta}\). We want \(\overline{R_\theta}\) to be as large as possible.

Policy Gradient

Expected reward:

\[\begin{equation} \begin{aligned} \overline{R_\theta}=\sum_\tau R(\tau)p_\theta(\tau) \end{aligned} \end{equation} \]

Gradient with respect to \(\theta\):

\[\begin{equation} \begin{aligned}
\nabla \overline R_\theta &= \sum_\tau R(\tau)\nabla p_\theta(\tau) \\
&=\sum_\tau R(\tau) p_\theta(\tau) \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}\quad &&\text{multiply and divide by } p_\theta(\tau)\\
&=\sum_\tau R(\tau) p_\theta(\tau) \nabla \log p_\theta(\tau)\\
&=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)]\\
&\approx \frac 1 N \sum_{n=1}^{N} R(\tau^n)\nabla \log p_\theta(\tau^n)\\
&= \frac 1 N \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\nabla \log p_\theta(a^n_t|s^n_t)
\end{aligned} \end{equation} \]

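The third line follows from \(\nabla \log p_\theta(\tau)=\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}\).
This gives a generally useful identity:

\[\begin{equation} \nabla f(x) = f(x) \nabla \log f(x) \end{equation} \]

The fourth line follows from \(\sum_\tau p_\theta(\tau)f(\tau)=E_{\tau\sim p_\theta(\tau)}[f(\tau)]\).

The fifth line estimates the expectation by sampling \(N\) trajectories.

The sixth line expands \(p_\theta(\tau)\): the environment terms \(p(s_1)\) and \(p(s_{t+1}|s_t,a_t)\) do not depend on \(\theta\), so only the \(\log p_\theta(a_t|s_t)\) terms survive the gradient.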
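The derivation can be checked numerically. The sketch below assumes the simplest possible setting, a one-step problem with a softmax policy over three actions, where the logits play the role of \(\theta\); the reward values are invented. It compares \(\sum_a p_\theta(a) R(a) \nabla \log p_\theta(a)\) with a finite-difference gradient of the expected reward.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    # R_bar(theta) = sum_a p_theta(a) R(a)
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    # sum_a p_theta(a) R(a) grad log p_theta(a), gradient taken w.r.t. the logits
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # d log softmax_a / d logit_k = 1[a == k] - p_k
            dlogp = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * r_a * dlogp
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]    # hypothetical parameters
rewards = [1.0, 3.0, -2.0]   # hypothetical rewards R(a)
g_score = score_function_gradient(logits, rewards)
g_fd = finite_difference_gradient(logits, rewards)
print(g_score)
print(g_fd)
```

The two gradients should agree up to finite-difference error. In real training the sum over all actions is replaced by the sample average from the fifth line.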

Implementation

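To maximize the expected reward \(\overline{R_\theta}\), we can work backward from the gradient in Eq. (2) and define the objective used in the implementation as:

\[\begin{equation} \begin{aligned} J(\theta) = \frac 1 N \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \log p_\theta(a^n_t|s^n_t) \end{aligned} \end{equation} \]

Maximizing the objective is equivalent to minimizing the loss:

\[\begin{equation} \begin{aligned} loss = -\frac 1 N \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \log p_\theta(a^n_t|s^n_t) \end{aligned} \end{equation} \]

Here \(a^n_t, s^n_t\) are sampled under the policy with parameters \(\theta\).

Compared with the cross-entropy loss: this is the cross-entropy computed with the sampled \(a^n_t\) treated as the ground truth, except that each trajectory \(\tau^n\) is additionally weighted by \(R(\tau^n)\).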
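A minimal sketch of this loss on already-sampled data. The batch layout (a list of per-step log-probabilities plus one total reward per trajectory) is an assumption made for illustration; in an autodiff framework the log-probabilities would be tensors produced by the policy network so that the loss can be differentiated.

```python
import math

def policy_gradient_loss(batch):
    """batch: list of (log_probs, total_reward), one entry per trajectory.

    log_probs    : [log p_theta(a_t | s_t) for each step t of the trajectory]
    total_reward : R(tau) for that trajectory
    """
    n = len(batch)
    weighted = 0.0
    for log_probs, total_reward in batch:
        # Every step of the trajectory shares the same weight R(tau).
        weighted += total_reward * sum(log_probs)
    return -weighted / n

# Two hypothetical trajectories.
batch = [
    ([math.log(0.5), math.log(0.25)], 2.0),
    ([math.log(0.5)], -1.0),
]
loss = policy_gradient_loss(batch)
print(loss)
```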

Tips

Add a baseline

\(R(\tau^n)\) may always be positive. During training this tells the model to raise the probability of every sampled action, whatever it is.


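Ideally this is not a problem: even if the rewards are all positive, they still differ in size.

In practice, however, actions are sampled. An action that happens not to be sampled has its probability pushed down relative to the sampled ones, which says nothing about whether that action is actually bad in the current state.

Improvement: subtract a baseline \(b\).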
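With a baseline, the weight on each trajectory becomes \(R(\tau^n) - b\). The text does not say how to pick \(b\); the sketch below assumes one common choice, the mean reward of the sampled batch, so that below-average trajectories get negative weights.

```python
def subtract_baseline(total_rewards):
    """Return R(tau^n) - b for each trajectory, with b = batch mean (an assumption)."""
    b = sum(total_rewards) / len(total_rewards)
    return [r - b for r in total_rewards]

# All rewards are positive, but the centered weights are not.
weights = subtract_baseline([3.0, 5.0, 10.0])
print(weights)  # [-3.0, -1.0, 4.0]
```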

Assign Suitable Credit

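Look at the objective again:

\[\begin{equation} \begin{aligned} J(\theta) = \frac 1 N \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \log p_\theta(a^n_t|s^n_t) \end{aligned} \end{equation} \]

Within one trajectory \(\tau\), every action \(a\) taken in every state \(s\) gets the same reward weight. That is unreasonable.
For example, on the left of the figure, taking \(a_2\) in \(s_b\) is not a good choice: it leads to \(s_c\), where \(a_3\) is taken and scores -2.
This motivates improvement 1.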

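Improvement 1: replace the weight at each time step with the sum of rewards from that time step to the end of the trajectory.

The more time has passed since an action, the smaller its influence: a reward that arrives long after the action has little to do with it. This motivates improvement 2.

Improvement 2: add a discount factor.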
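Both improvements can be sketched together as a discounted reward-to-go, \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\). The symbol \(\gamma\) for the discount factor and the per-step rewards used below are assumptions for illustration; setting \(\gamma = 1\) recovers improvement 1 alone.

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    """weights[t] = sum over t' >= t of gamma**(t' - t) * rewards[t']."""
    weights = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the sum already computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running
    return weights

step_rewards = [5.0, 0.0, -2.0]  # hypothetical per-step rewards
print(discounted_rewards_to_go(step_rewards, gamma=1.0))  # [3.0, -2.0, -2.0]
print(discounted_rewards_to_go(step_rewards, gamma=0.5))  # [4.5, -1.0, -2.0]
```

With \(\gamma = 1\), the first action keeps a positive weight while the later actions, which led to the -2, get negative weights, even though the trajectory total is positive.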

Finally, the whole weight term is called the Advantage Function, \(A^\theta(s_t, a_t)\). It measures how much better \(a_t\) is than the other actions in state \(s_t\). (In practice \(A\) is often estimated by a network.)

The final gradient is:

\[\begin{equation} \nabla \overline R_\theta \approx \frac 1 N \sum_{n=1}^{N} \sum_{t=1}^{T_n} A^\theta(s^n_t, a^n_t) \nabla\log p_\theta(a^n_t|s^n_t) \end{equation} \]

On-policy \(\rightarrow\) Off-policy

On-policy

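The gradient formula:

\[\begin{equation} \nabla \overline R_\theta =E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)] \end{equation} \]

What we have done so far is an on-policy method:

Before every gradient update, \(\tau\) must be sampled from \(\pi_\theta\).
After the parameters are updated, \(\tau\) must be sampled again with the updated parameters.

The goal: sample data from another policy \(\pi_{\theta'}\) and use it to train \(\pi_\theta\), so that the sampled data can be reused.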

Importance Sampling

When \(x\) follows the distribution \(p\), the usual way to compute the expectation \(E_{x\sim p}[f(x)]\) is to sample some \(x\) from \(p\), plug them into \(f(x)\), and take the average as the estimate.

Now suppose we cannot sample \(x\) from \(p\) directly, but we can sample from another distribution \(q\). We can rewrite \(E_{x\sim p}[f(x)]\) as follows:

\[\begin{equation} \begin{aligned}
E_{x\sim p}[f(x)] &= \int f(x)p(x) \, dx\\
&=\int f(x)\frac{p(x)}{q(x)}q(x) \, dx\\
&= E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]
\end{aligned} \end{equation} \]

This lets us estimate \(E_{x\sim p}[f(x)]\) from data sampled from \(q\). This is Importance Sampling.
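A small sketch of the idea on a discrete example, where the integral becomes a sum. The support, the two distributions, and \(f\) are made up for illustration. Samples are drawn only from \(q\), and each one is reweighted by \(p(x)/q(x)\).

```python
import random

xs = [-1.0, 0.0, 2.0]      # support of x (hypothetical)
p = [0.5, 0.25, 0.25]      # target distribution p(x)
q = [0.125, 0.375, 0.5]    # sampling distribution q(x)

def f(x):
    return x * x + 1.0

def expectation(xs, probs, g):
    """Exact E_{x ~ probs}[g(x)] on a finite support."""
    return sum(pr * g(x) for x, pr in zip(xs, probs))

def importance_sampling_estimate(xs, p, q, g, n_samples, seed=0):
    """Estimate E_{x ~ p}[g(x)] using samples drawn from q only."""
    rng = random.Random(seed)
    indices = rng.choices(range(len(xs)), weights=q, k=n_samples)
    return sum(g(xs[i]) * p[i] / q[i] for i in indices) / n_samples

exact = expectation(xs, p, f)
estimate = importance_sampling_estimate(xs, p, q, f, n_samples=100_000)
print(exact, estimate)
```

The estimate should land close to the exact value because \(q\) puts nonzero mass everywhere \(p\) does and the number of samples is large.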

Issue of Importance Sampling
In theory, the two expectations are equal:

\[\begin{equation} E_{x\sim p}[f(x)] = E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]. \end{equation} \]

But are their variances equal, i.e. does \(Var_{x\sim p}[f(x)] = Var_{x\sim q}[f(x)\frac{p(x)}{q(x)}]\) hold?

From the identity

\[\begin{equation} Var[x] = E[x^2]-(E[x])^2 \end{equation} \]

we get:

\[\begin{equation} \begin{aligned}
Var_{x\sim p}[f(x)]&=E_{x\sim p}[f^2(x)]-(E_{x\sim p}[f(x)])^2\\
Var_{x\sim q}[f(x)\frac{p(x)}{q(x)}] &=E_{x\sim q}[(f(x)\frac{p(x)}{q(x)})^2]-(E_{x\sim q}[f(x)\frac{p(x)}{q(x)}])^2\\
&=\int (f(x)\frac{p(x)}{q(x)})^2q(x) \, dx - (E_{x\sim p}[f(x)])^2\\
&=\int f^2(x)\frac{p(x)}{q(x)}p(x) \, dx - (E_{x\sim p}[f(x)])^2\\
&=E_{x\sim p}[f^2(x)\frac{p(x)}{q(x)}]-(E_{x\sim p}[f(x)])^2
\end{aligned} \end{equation} \]

Comparing the two, the first term of the latter carries an extra factor \(\frac{p(x)}{q(x)}\). So when \(p\) and \(q\) differ a lot, the two variances can differ a lot as well.

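This raises a problem. In theory, whatever \(p\) and \(q\) are, with enough samples from each the two estimates agree, since \(E_{x\sim p}[f(x)] = E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]\).
But when \(p\) and \(q\) differ too much and we do not sample enough, the large gap in variance makes it likely that the two sample estimates end up far apart.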
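The variance formulas can be evaluated exactly on a small discrete example. The distributions below are invented: `q_far` puts very little mass where \(p\) puts most of its mass, which is the situation the text warns about.

```python
xs = [-1.0, 0.0, 2.0]      # support of x (hypothetical)
p = [0.5, 0.25, 0.25]      # target distribution
q_far = [0.01, 0.09, 0.9]  # sampling distribution very different from p

def f(x):
    return x

def variance_under_p(xs, p, g):
    # Var_{x~p}[g] = E_p[g^2] - (E_p[g])^2
    mean = sum(pa * g(x) for x, pa in zip(xs, p))
    second = sum(pa * g(x) ** 2 for x, pa in zip(xs, p))
    return second - mean ** 2

def variance_of_weighted_under_q(xs, p, q, g):
    # Var_{x~q}[g p/q] = E_p[g^2 p/q] - (E_p[g])^2
    mean = sum(pa * g(x) for x, pa in zip(xs, p))
    second = sum(pa * g(x) ** 2 * pa / qa for x, pa, qa in zip(xs, p, q))
    return second - mean ** 2

var_p = variance_under_p(xs, p, f)
var_same = variance_of_weighted_under_q(xs, p, p, f)    # q equal to p
var_far = variance_of_weighted_under_q(xs, p, q_far, f)
print(var_p, var_same, var_far)
```

When \(q = p\) the two variances coincide. With `q_far` the weighted estimator's variance is more than an order of magnitude larger, so far more samples are needed for the same accuracy.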

An intuitive example:

In the figure, the two distributions \(p\) and \(q\) differ greatly.


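If we do not sample enough and miss the leftmost sample, \(E_{x\sim p}[f(x)]\) should be negative, yet the value we compute from \(E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]\) comes out positive.

Once the leftmost sample is drawn, its weight \(\frac{p(x)}{q(x)}\) is very large, which pulls the estimate of \(E_{x\sim q}[f(x)\frac{p(x)}{q(x)}]\) back to a negative value.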

Off-policy

Applying Importance Sampling to the policy gradient gives:

\[\begin{equation} \begin{aligned}
\nabla \overline R_\theta &=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)]\\
&=E_{\tau\sim p_{\theta'}(\tau)}[\frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla \log p_\theta(\tau)]
\end{aligned} \end{equation} \]

This way we can sample data with \(\theta'\) and reuse that data many times to update \(\theta\).

Combining this with Eq. (7) gives

\[\begin{equation} \begin{aligned}
\nabla \overline R_\theta &=E_{\tau\sim p_{\theta'}(\tau)}[\frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla \log p_\theta(\tau)]\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(s_t, a_t)}{p_{\theta'}(s_t, a_t)}A^{\theta'}(s_t, a_t) \nabla\log p_\theta(a_t|s_t)]\quad &&\text{by Eq. (7)}\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)p_\theta(s_t)}{p_{\theta'}(a_t|s_t)p_{\theta'}(s_t)}A^{\theta'}(s_t, a_t) \nabla\log p_\theta(a_t|s_t)]\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t) \nabla\log p_\theta(a_t|s_t)]\quad &&\text{assuming } p_\theta(s_t)=p_{\theta'}(s_t)
\end{aligned} \end{equation} \]

Then, by Eq. (3), \(p_\theta(a_t|s_t)\nabla\log p_\theta(a_t|s_t)=\nabla p_\theta(a_t|s_t)\), so:

\[\begin{equation} \nabla \overline R_\theta=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{\nabla p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)] \end{equation} \]

Working backward to the objective:

\[\begin{equation} J^{\theta'}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t, a_t)] \end{equation} \]
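A minimal sketch of estimating \(J^{\theta'}(\theta)\) from samples collected with \(\pi_{\theta'}\). The inputs are assumed to be per-sample log-probabilities under the two policies and precomputed advantages; the numbers are hypothetical.

```python
import math

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Sample mean of [p_theta(a|s) / p_theta'(a|s)] * A^{theta'}(s, a).

    new_log_probs : log p_theta(a_t | s_t),  the policy being trained
    old_log_probs : log p_theta'(a_t | s_t), the policy that collected the data
    """
    terms = [
        math.exp(new_lp - old_lp) * adv
        for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages)
    ]
    return sum(terms) / len(terms)

old_lp = [math.log(0.5), math.log(0.25)]
new_lp = [math.log(0.6), math.log(0.2)]
adv = [1.0, -2.0]
objective = surrogate_objective(new_lp, old_lp, adv)
print(objective)  # ratios are 1.2 and 0.8, so (1.2 * 1.0 + 0.8 * -2.0) / 2
```

The ratio is computed as the exponential of a log-probability difference, which tends to be more stable than dividing small probabilities directly.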

Add constraint

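So far, Importance Sampling has taken Policy Gradient from on-policy to off-policy.

But Importance Sampling has a practical limitation we cannot ignore: we cannot guarantee enough samples, so when the two distributions \(p_\theta, p_{\theta'}\) differ too much, the two estimates are no longer reliably close.

What PPO does, simply put, is keep the two distributions \(p_\theta, p_{\theta'}\) from drifting too far apart.

\[\begin{equation} J_{PPO}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta KL(\theta, \theta') \end{equation} \]

Note: the KL divergence here is not a distance between the parameters of the two models. It is a distance between their behavior: given the same state, the two models should output action distributions that are as similar as possible.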
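A sketch of this behavioral distance for discrete actions: the KL divergence is computed between the two action distributions at each state and averaged over states. The policy tables are made up, and the argument order passed to the KL (old policy first) is an assumption, since the text only writes \(KL(\theta, \theta')\).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

def mean_action_kl(policy_a, policy_b, states):
    """Average over states of KL(policy_a(.|s) || policy_b(.|s))."""
    return sum(kl_divergence(policy_a[s], policy_b[s]) for s in states) / len(states)

def ppo_penalized_objective(surrogate, kl, beta):
    # J_PPO = J - beta * KL
    return surrogate - beta * kl

# Hypothetical action distributions for two states.
old_policy = {"s1": [0.5, 0.5], "s2": [0.9, 0.1]}
new_policy = {"s1": [0.25, 0.75], "s2": [0.9, 0.1]}

kl = mean_action_kl(old_policy, new_policy, ["s1", "s2"])
print(kl)
print(ppo_penalized_objective(surrogate=1.0, kl=kl, beta=2.0))
```

Only state `s1` contributes here, because the two policies behave identically in `s2` regardless of how their parameters differ.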

Conclusion

PPO algorithm

PPO2

PPO2: simplifies the computation of PPO.

First, let the horizontal axis \(x\) be \(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)}\). The functions \(y=x\) and \(y=clip(x, 1-\epsilon, 1+\epsilon)\) are then the green and blue lines in the figure.
Here \(clip(x, a, b)=\left\{\begin{aligned}a,\quad &x\le a\\ x, \quad &a<x<b\\ b, \quad &x \ge b\end{aligned}\right.\)
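The `clip` function translates directly into code. The second function below is a sketch of how the clipped ratio is typically combined with the advantage in the clipped PPO objective, the sample mean of \(\min(x A, clip(x, 1-\epsilon, 1+\epsilon) A)\). That formula comes from the original PPO paper rather than from the text above, and the default \(\epsilon = 0.2\) is an assumption.

```python
def clip(x, low, high):
    """Return low if x <= low, high if x >= high, otherwise x."""
    return max(low, min(x, high))

def ppo2_objective(ratios, advantages, epsilon=0.2):
    """Sample mean of min(x * A, clip(x, 1 - eps, 1 + eps) * A).

    ratios     : x = p_theta(a_t|s_t) / p_theta^k(a_t|s_t) for each sample
    advantages : advantage estimate for each sample
    """
    terms = [
        min(x * adv, clip(x, 1.0 - epsilon, 1.0 + epsilon) * adv)
        for x, adv in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)

print(clip(1.5, 0.8, 1.2))            # 1.2
print(ppo2_objective([1.5], [1.0]))   # gain is capped once the ratio exceeds 1 + eps
print(ppo2_objective([0.5], [-1.0]))  # penalty is capped once the ratio drops below 1 - eps
print(ppo2_objective([1.5], [-1.0]))  # no cap when the ratio moves the wrong way
```

Taking the minimum means the objective stops rewarding the policy for pushing the ratio further outside \([1-\epsilon, 1+\epsilon]\) in the direction the advantage favors, which plays the role of the KL penalty without computing a KL term.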
